Overview of the Google Cloud Platform (GCP) Cloud Bigtable Product

Below is an overview of the Google Cloud Platform (GCP) Cloud Bigtable product.

Features

- Data is stored in Tablets in the Colossus file system.
- Metadata about the Tablets is stored on the VMs in the Bigtable cluster (3 replicas with HDFS).
- Rebalancing: moving Tablets from one node to another is fast because only the pointers are updated, not the data.
- Hotspot detection: a Tablet is split if it receives too much activity.

Tips

- Use up to 100 Column Families without losing performance.
- Design consideration: wide/dense tables where every column value is populated vs. narrow tables for sparse data.

Row Key Design Tips

- Choose a Row Key that minimises sorting and searching - make your most common queries simple scans.
- When you commonly want to query the latest few records, add a reverse timestamp to the Row Key (e.g. Long.MAX_VALUE - timestamp); see the sketch after the Access Pattern list below.
- Avoid row key hotspots so that work is distributed across nodes and tablets - e.g. avoid internet domains (some are far more popular than others), sequential IDs (newer users may be more active), and static IDs that are repeatedly updated for statistics.
- Avoid starting the row key with a timestamp - sequential writes then pile onto a single node instead of distributing across nodes and tablets.
- A good Row Key distributes both reads and writes while keeping associated data sequentially together.

Performance

- For streaming performance it is better to add nodes to a single cluster in a single zone than to introduce the overhead of replication.
- Fewer cells means faster performance.
- See the GCP Bigtable performance documentation.
- Performance estimate: a 10-node SSD cluster with 1 KB rows and a write-only workload sustains about 10,000 rows per second at 6 ms latency (HDD: 50 ms latency).
- Bigtable learns, configures and optimises itself - let it warm up for a few hours, test on at least 300 GB of data, and run performance tests over a long enough period for Bigtable to learn usage patterns and perform internal optimisations; otherwise the results are not valid.
- Expected scan throughput: SSD 220 MB/s; HDD 180 MB/s.
- Performance increases linearly with the number of nodes - node count controls throughput.
- Change the schema to minimise data skew.
- Keep clients in the same zone as Bigtable.

Concepts

- Tablets: the data structures used to identify and manage data.
- The Bigtable cluster stores Tablet metadata with HDFS.
- Rebalancing: Tablets are moved between nodes to even out load.
- Row Key: the single index on a table - there are no other primary or secondary indexes. Keys are sorted lexicographically in ascending order.
- NoSQL data model.
- Column Families: group related columns so they can be read without pulling all of the row's data (see the column-family sketch below).
- Periodic table compaction: removes rows that were marked for deletion, or that were marked as updated and replaced by a newly appended row.
- Compression: works best when identical values are near each other, in the same row or adjoining rows - so arrange your row keys so that identical data is adjacent.

Access Pattern

- Summary: use for structured, append-only data (e.g. IoT data - NOT transactional), NoSQL, millisecond-latency, real-time, high-throughput applications, time-series data, and datasets larger than 300 GB.
- Capacity: petabytes.
- Access metaphor: key-value pairs via the HBase API.
- Read: scan rows.
- Write: put row.
- Update granularity: row.
- Usage: managed, high-throughput, scalable, flattened data.
- Cost: minimum of US$1,400 per month.
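To make the reverse-timestamp and hotspot tips concrete, here is a minimal Python sketch; the sensor_id field and the key format are illustrative assumptions, not a prescribed scheme:

    import time

    LONG_MAX = 2**63 - 1  # Java's Long.MAX_VALUE, per the reverse-timestamp tip

    def make_row_key(sensor_id, event_ms):
        """Build a row key of the form '<sensor_id>#<reverse_timestamp>'.

        Promoting a well-distributed field (sensor_id) to the front avoids
        hotspots, and the reversed timestamp makes the newest records for a
        sensor sort first, so "latest N" queries become simple scans.
        """
        reverse_ts = LONG_MAX - event_ms
        # Zero-pad so lexicographic order matches numeric order.
        return "{}#{:019d}".format(sensor_id, reverse_ts).encode("utf-8")

    now_ms = int(time.time() * 1000)
    newer = make_row_key("sensor-42", now_ms)
    older = make_row_key("sensor-42", now_ms - 60000)
    assert newer < older  # the newest record sorts first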
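The Column Families concept above is what lets a reader fetch one slice of every row. A sketch using the google-cloud-bigtable Python client; the project, instance, table and family names are hypothetical:

    from google.cloud import bigtable
    from google.cloud.bigtable import row_filters

    client = bigtable.Client(project="my-project")             # hypothetical project
    table = client.instance("my-instance").table("my-table")   # hypothetical IDs

    # Server-side filter: return only cells in the 'stats' column family,
    # leaving the rest of each row unread.
    rows = table.read_rows(filter_=row_filters.FamilyNameRegexFilter("stats"))
    for row in rows:
        for qualifier, cells in row.cells["stats"].items():
            print(row.row_key, qualifier, cells[0].value)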
How To

Create a cluster:

    gcloud bigtable instances create INSTANCE \
        --cluster=CLUSTER \
        --cluster-zone=... \
        --instance-type=... \
        --cluster-num-nodes=... \
        --description=... \
        --async \
        --cluster-storage-type=...

Create a table in Python:

    from google.cloud import bigtable

    # admin=True is required for table and column-family administration
    client = bigtable.Client(project, admin=True)
    instance = client.instance(instance_id)
    table = instance.table(table_id)
    table.create()

    # Columns live inside a column family, so create one after the table
    column_family = table.column_family(col_family_id)
    column_family.create()

Stream from Dataflow: authenticate; get or create the table; convert each object to write into one or more Mutations; write the mutations to Bigtable.
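For reference, the "put row" write and the Mutation step look roughly like this in the Python client; the row key, family and column names are hypothetical:

    from google.cloud import bigtable

    client = bigtable.Client(project="my-project")             # hypothetical project
    table = client.instance("my-instance").table("my-table")   # hypothetical IDs

    # A put is a DirectRow carrying one or more cell mutations; commit()
    # applies them in one call, and updates are atomic at row granularity.
    row = table.direct_row(b"sensor-42#9223370000000000000")   # hypothetical key
    row.set_cell("stats", b"temperature", b"21.5")
    row.commit()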
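And the Dataflow steps map onto the Beam Python SDK roughly as follows - a sketch assuming the bundled bigtableio connector and hypothetical IDs; a real streaming pipeline would read from a source such as Pub/Sub rather than beam.Create:

    import apache_beam as beam
    from apache_beam.io.gcp.bigtableio import WriteToBigTable
    from google.cloud.bigtable.row import DirectRow

    def to_mutation(event):
        # Convert a (row_key, value) pair into a DirectRow carrying one mutation.
        row = DirectRow(row_key=event[0])
        row.set_cell("stats", b"value", event[1])  # hypothetical family/column
        return row

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | beam.Create([(b"sensor-42#000", b"21.5")])  # stand-in for a real source
            | beam.Map(to_mutation)
            | WriteToBigTable(project_id="my-project",    # hypothetical IDs
                              instance_id="my-instance",
                              table_id="my-table")
        )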