Google Cloud Platform (GCP) Cloud Bigtable Product Overview
Features
Stores data in the Colossus file system
Data is organised into Tablets (blocks of contiguous rows), which are stored in the Colossus file system
Stores metadata about the Tablets on the nodes (VMs) in the Bigtable cluster; the data itself sits in Colossus (which replicates it, much like HDFS's 3 replicas)
Rebalancing
Moving Tablets from one node to another is fast because only the pointers are updated
Hot spot detection - splits Tablets if there is too much activity in one
Tips
Use up to 100 Column Families without losing performance
Design consideration - Wide/Dense tables with every column value populated vs Narrow tables for sparse data
Row Key Design Tips
Choose a Row Key that minimises sorting/searching - make your most common queries simple scans
When you commonly want to query the latest few records, add a Reverse Timestamp to the Row Key (e.g. Long.MAX_VALUE - timestamp); see the sketch after these tips
Avoid row key hotspots so that work is distributed across nodes and tablets - e.g. avoid internet domains (some are far more popular than others), sequential ids (newer users may be more active) and a single static id that is repeatedly updated for statistics
Avoid starting row key with a timestamp - doesn't distribute across nodes and tablets
Row Key - aim for reads and writes that are distributed across nodes and tablets, while keeping associated data lexicographically adjacent so it can be read together
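A minimal sketch of these row-key tips in Python, assuming a hypothetical IoT workload keyed by a device_id (the device id up front spreads the writes; the reverse timestamp puts the newest readings first):

    import sys
    import time

    def make_row_key(device_id: str, event_time_ms: int) -> bytes:
        # Reverse timestamp: the "Long.MAX_VALUE - timestamp" idea, so newest sorts first.
        reverse_ts = sys.maxsize - event_time_ms
        # Zero-pad so that lexicographic order matches numeric order.
        return f"{device_id}#{reverse_ts:020d}".encode()

    # Newest readings for one device are the first rows of a prefix scan on b"device-42#".
    key = make_row_key("device-42", int(time.time() * 1000))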
Performance
For streaming performance it is better to add nodes to a single cluster in a single zone than to introduce the overhead of replication
Fewer cells equals faster performance
GCP Bigtable Performance doco
Performance estimate: 10-node SSD cluster, 1 KB rows, write-only workload = 10,000 rows per second at 6 ms latency (HDD: 50 ms latency)
Bigtable learns, configures and optimises itself, so let it warm up for a few hours before testing. To get a valid result, run performance tests on at least 300 GB of data, and over a long enough period for Bigtable to learn usage patterns and perform its internal optimisations
Expected performance: SSD 220 MB/s scans; HDD 180 MB/s scans
Performance increases linearly with the number of nodes - node count controls throughput
Change the schema to minimise data skew
Have the clients in the same zone as the Bigtable cluster
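A rough illustration (not a substitute for the 300 GB, warmed-up test described above) of measuring bulk write throughput with the Python client; the project, instance, table and column family names are placeholders:

    import time
    from google.cloud import bigtable

    client = bigtable.Client(project="my-project")
    table = client.instance("my-instance").table("perf-test")

    # Build 10,000 rows of ~1 KB each, matching the estimate above.
    rows = []
    for i in range(10_000):
        row = table.direct_row(f"perf#{i:08d}".encode())
        row.set_cell("cf1", b"payload", b"x" * 1024)
        rows.append(row)

    start = time.time()
    statuses = table.mutate_rows(rows)  # bulk write
    elapsed = time.time() - start
    errors = sum(1 for s in statuses if s.code != 0)
    print(f"{len(rows)} rows in {elapsed:.2f}s ({len(rows)/elapsed:.0f} rows/s), {errors} errors")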
Concepts
Tablets - data structures to identify and manage data
Bigtable cluster nodes store the metadata about Tablets (the data itself is in Colossus, not HDFS)
Rebalancing
Row Key - the single index on a table; no other primary/secondary indexes; sorted lexicographically, ascending order
NoSQL
Column Families - group columns so they can be read without having to pull all the row data (see the filter sketch after this list)
Periodic table compaction - removes data that was marked for deletion, or that was superseded when an update appended a new version
Compression - works best if identical values are near each other, in the same row or in adjoining rows, so arrange your row keys so that identical data ends up adjacent
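A minimal sketch of reading one row restricted to a single column family with the Python client, assuming a hypothetical table with a 'stats' family (all names are placeholders):

    from google.cloud import bigtable
    from google.cloud.bigtable import row_filters

    client = bigtable.Client(project="my-project")
    table = client.instance("my-instance").table("my-table")

    # Only cells from the 'stats' column family come back; the rest of the row is never pulled.
    row = table.read_row(b"device-42#0000001234",
                         filter_=row_filters.FamilyNameRegexFilter("stats"))
    if row is not None:
        for qualifier, cells in row.cells["stats"].items():
            print(qualifier, cells[0].value)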
Access Pattern
Summary: Use for structured append-only data (e.g. IoT data - NOT transactional), NoSQL, millisecond latency, real-time, high-throughput applications, time-series data, >300 GB
Capacity: Petabytes
Access metaphor: Key-value pairs, HBase API
Read: Scan rows (see the scan/put sketch after this section)
Write: Put row
Update granularity: Row
Usage: Managed, high throughput, scalable, flattened data
Min of US$1400 per month
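A minimal sketch of the "scan rows" / "put row" access metaphor with the Python client (placeholder names; the table and its 'cf1' family are assumed to exist):

    from google.cloud import bigtable

    client = bigtable.Client(project="my-project")
    table = client.instance("my-instance").table("my-table")

    # Put: write (append) a single row.
    row = table.direct_row(b"device-42#0000001234")
    row.set_cell("cf1", b"temperature", b"21.5")
    row.commit()

    # Scan: read every row whose key starts with "device-42#" ('$' is the byte after '#').
    for r in table.read_rows(start_key=b"device-42#", end_key=b"device-42$"):
        print(r.row_key, r.cells["cf1"][b"temperature"][0].value)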
How To:
Create cluster: gcloud bigtable instances create INSTANCE --cluster=CLUSTER --cluster-zone=... --instance-type=... --cluster-num-nodes=... --description=... --cluster-storage-type=... --async
Create table in Python:
    from google.cloud import bigtable

    # admin=True is required for table / column-family management
    client = bigtable.Client(project, admin=True)
    instance = client.instance(instance_id)

    # Create the table, then a column family inside it
    table = instance.table(table_id)
    table.create()
    columnfamily = table.column_family(col_family_id)
    columnfamily.create()
Stream from Dataflow: Authenticate; Get/create table; Convert object to write into Mutation(s); Write mutations to Bigtable
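A minimal sketch of those Dataflow steps with the Apache Beam Python SDK, assuming a hypothetical stream of (device_id, timestamp_ms, value) tuples and an existing table with a 'cf1' column family (all ids are placeholders):

    import apache_beam as beam
    from apache_beam.io.gcp.bigtableio import WriteToBigTable
    from google.cloud.bigtable.row import DirectRow

    def to_mutation(element):
        # Convert one record into a Bigtable mutation (a DirectRow with one cell set)
        device_id, ts_ms, value = element
        row = DirectRow(row_key=f"{device_id}#{ts_ms}".encode())
        row.set_cell("cf1", b"value", str(value).encode())
        return row

    with beam.Pipeline() as pipeline:
        (pipeline
         | "Read" >> beam.Create([("device-42", 1700000000000, 21.5)])  # stand-in for the real source
         | "ToMutations" >> beam.Map(to_mutation)
         | "Write" >> WriteToBigTable(project_id="my-project",
                                      instance_id="my-instance",
                                      table_id="my-table"))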