Google Cloud Platform (GCP) Cloud Bigtable Product Overview
Features
Stores data in the Colossus file system
Data is organised into Tablets (blocks of contiguous rows), which are stored in the Colossus file system
Stores metadata about the Tablets on the nodes (VMs) in the Bigtable cluster; the data itself sits in Colossus (which replicates it, much like HDFS's 3 replicas)
Rebalancing
Moving Tablets from one node to another is fast because only the pointers are updated
Hot spot detection - splits Tablets if there is too much activity in one
Tips
Use up to 100 Column Families without losing performance
Design consideration - Wide/Dense tables with every column value populated vs Narrow tables for sparse data
Row Key Design Tips
Choose a Row Key that minimises sorting/searching - make your most common queries simple scans
When you commonly want to query the latest few records, add a Reverse Timestamp to the Row Key (e.g. Long.MAX_VALUE - timestamp); see the sketch after these tips
Avoid row key hotspots so that work is distributed across nodes and tablets - e.g. avoid internet domains (some are far more popular than others), sequential ids (newer users may be more active) and a single static id that is repeatedly updated for statistics
Avoid starting row key with a timestamp - doesn't distribute across nodes and tablets
Row Key - aim for reads and writes that are distributed across nodes and tablets, while keeping associated data lexicographically adjacent so it can be read together
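A minimal sketch of these row-key tips in Python, assuming a hypothetical IoT workload keyed by a device_id (the device id up front spreads the writes; the reverse timestamp puts the newest readings first):

    import sys
    import time

    def make_row_key(device_id: str, event_time_ms: int) -> bytes:
        # Reverse timestamp: the "Long.MAX_VALUE - timestamp" idea, so newest sorts first.
        reverse_ts = sys.maxsize - event_time_ms
        # Zero-pad so that lexicographic order matches numeric order.
        return f"{device_id}#{reverse_ts:020d}".encode()

    # Newest readings for one device are the first rows of a prefix scan on b"device-42#".
    key = make_row_key("device-42", int(time.time() * 1000))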
Performance
For streaming performance it is better to add nodes to a single cluster in a single zone than to introduce the overhead of replication
Fewer cells equals faster performance
GCP Bigtable Performance doco
Performance estimate: 10-node SSD cluster, 1 KB rows, write-only workload = 10,000 rows per second at 6 ms latency (HDD: 50 ms latency)
Bigtable learns, configures and optimises itself, so let it warm up for a few hours before testing. To get a valid result, run performance tests on at least 300 GB of data, and over a long enough period for Bigtable to learn usage patterns and perform its internal optimisations
Expected performance: SSD 220 MB/s scans; HDD 180 MB/s scans
Performance increases linearly with the number of nodes - node count controls throughput
Change the schema to minimise data skew
Have the clients in the same zone as the Bigtable cluster
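A rough illustration (not a substitute for the 300 GB, warmed-up test described above) of measuring bulk write throughput with the Python client; the project, instance, table and column family names are placeholders:

    import time
    from google.cloud import bigtable

    client = bigtable.Client(project="my-project")
    table = client.instance("my-instance").table("perf-test")

    # Build 10,000 rows of ~1 KB each, matching the estimate above.
    rows = []
    for i in range(10_000):
        row = table.direct_row(f"perf#{i:08d}".encode())
        row.set_cell("cf1", b"payload", b"x" * 1024)
        rows.append(row)

    start = time.time()
    statuses = table.mutate_rows(rows)  # bulk write
    elapsed = time.time() - start
    errors = sum(1 for s in statuses if s.code != 0)
    print(f"{len(rows)} rows in {elapsed:.2f}s ({len(rows)/elapsed:.0f} rows/s), {errors} errors")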
Concepts
Tablets - data structures to identify and manage data
Bigtable cluster nodes store the metadata about Tablets (the data itself is in Colossus, not HDFS)
Rebalancing
Row Key - the single index on a table; no other primary/secondary indexes; sorted lexicographically, ascending order
NoSQL
Column Families - group columns so they can be read without having to pull all the row data (see the filter sketch after this list)
Periodic table compaction - removes data that was marked for deletion, or that was superseded when an update appended a new version
Compression - works best if identical values are near each other, in the same row or in adjoining rows, so arrange your row keys so that identical data ends up adjacent
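A minimal sketch of reading one row restricted to a single column family with the Python client, assuming a hypothetical table with a 'stats' family (all names are placeholders):

    from google.cloud import bigtable
    from google.cloud.bigtable import row_filters

    client = bigtable.Client(project="my-project")
    table = client.instance("my-instance").table("my-table")

    # Only cells from the 'stats' column family come back; the rest of the row is never pulled.
    row = table.read_row(b"device-42#0000001234",
                         filter_=row_filters.FamilyNameRegexFilter("stats"))
    if row is not None:
        for qualifier, cells in row.cells["stats"].items():
            print(qualifier, cells[0].value)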
Access Pattern
Summary: Use for structured append-only data (e.g. IoT data - NOT transactional), NoSQL, millisecond latency, real-time, high-throughput applications, time-series data, >300 GB
Capacity: Petabytes
Access metaphor: Key-value pairs, HBase API
Read: Scan rows (see the scan/put sketch after this section)
Write: Put row
Update granularity: Row
Usage: Managed, high throughput, scalable, flattened data
Min of US$1400 per month
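A minimal sketch of the "scan rows" / "put row" access metaphor with the Python client (placeholder names; the table and its 'cf1' family are assumed to exist):

    from google.cloud import bigtable

    client = bigtable.Client(project="my-project")
    table = client.instance("my-instance").table("my-table")

    # Put: write (append) a single row.
    row = table.direct_row(b"device-42#0000001234")
    row.set_cell("cf1", b"temperature", b"21.5")
    row.commit()

    # Scan: read every row whose key starts with "device-42#" ('$' is the byte after '#').
    for r in table.read_rows(start_key=b"device-42#", end_key=b"device-42$"):
        print(r.row_key, r.cells["cf1"][b"temperature"][0].value)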
How To:
Create cluster: gcloud bigtable instances create INSTANCE --cluster=CLUSTER --cluster-zone=... --instance-type=... --cluster-num-nodes=... --description=... --cluster-storage-type=... --async
Create table in Python:
    from google.cloud import bigtable

    # admin=True is required for table / column-family management
    client = bigtable.Client(project, admin=True)
    instance = client.instance(instance_id)

    # Create the table, then a column family inside it
    table = instance.table(table_id)
    table.create()
    columnfamily = table.column_family(col_family_id)
    columnfamily.create()
Stream from Dataflow: Authenticate; Get/create table; Convert object to write into Mutation(s); Write mutations to Bigtable
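A minimal sketch of those Dataflow steps with the Apache Beam Python SDK, assuming a hypothetical stream of (device_id, timestamp_ms, value) tuples and an existing table with a 'cf1' column family (all ids are placeholders):

    import apache_beam as beam
    from apache_beam.io.gcp.bigtableio import WriteToBigTable
    from google.cloud.bigtable.row import DirectRow

    def to_mutation(element):
        # Convert one record into a Bigtable mutation (a DirectRow with one cell set)
        device_id, ts_ms, value = element
        row = DirectRow(row_key=f"{device_id}#{ts_ms}".encode())
        row.set_cell("cf1", b"value", str(value).encode())
        return row

    with beam.Pipeline() as pipeline:
        (pipeline
         | "Read" >> beam.Create([("device-42", 1700000000000, 21.5)])  # stand-in for the real source
         | "ToMutations" >> beam.Map(to_mutation)
         | "Write" >> WriteToBigTable(project_id="my-project",
                                      instance_id="my-instance",
                                      table_id="my-table"))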