Google Cloud Platform (GCP) BigQuery Product Overview
A petabyte-scale, fully-managed data warehouse and analytics engine
Documentation, Tutorials & Labs
GCP BigQuery doco
GCP BigQuery Query Reference Doco
GCP BigQuery Tutorials doco
GCP BigQuery Client Libraries doco
GCP doco - bq command-line tool reference
Codelab: Loading and Querying BigQuery Data with the "bq" Command-Line Tool
Tips
Control resource consumption and cost - filter with WHERE as early as possible; select only the subset of data you need up front rather than relying on LIMIT at the end - LIMIT does not reduce the amount of data processed or billed
Specify only the columns you need in queries - BigQuery storage is columnar, so fewer columns means less data processed and lower cost (see the sketch after this list)
Click Validate to see an estimate of how much data a query will process before running it - a good proxy for cost
Do biggest joins first - with pre-join filters if possible
Avoid self joins - they can square the number of rows processed
Consider the cardinality of each group key in a GROUP BY - more groups means more shuffling; key skew can lead to increased tail latency
Use HAVING to filter on aggregates after a GROUP BY
All built-in functions are faster than JavaScript UDFs
Use an approximate function if possible as it is cheaper - e.g. APPROX_COUNT_DISTINCT(x) instead of COUNT(DISTINCT x)
Filter first, then ORDER BY on the outermost part of the query
Use wildcards on tables if selecting from multiple tables
Use time-partitioned tables - similar to sharding
Monitor with Stackdriver - custom dashboards with up to 13 BigQuery metrics
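A minimal sketch combining several of the tips above (project, dataset, and column names are hypothetical): select only the columns you need, filter early with WHERE, and use an approximate aggregate:
SELECT
  user_id,
  APPROX_COUNT_DISTINCT(session_id) AS approx_sessions  -- cheaper than COUNT(DISTINCT session_id)
FROM `myproject.mydataset.events`  -- hypothetical table
WHERE event_date >= '2019-01-01'   -- filter early; a LIMIT would not reduce bytes processed
GROUP BY user_id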
Pricing - 3 categories
Storage - amount of data in table; ingest rate of streaming data; auto discount for old data
Processing - on demand or flat-rate plans; on-demand based on amount of data processed; 1TB/month free; opt-in to run high-compute queries
Free - loading; exporting; queries on metadata; per user cached queries; queries with errors
BigQuery Pricing doco
Features
2 services in one, joined by a petabit network
BigQuery Query Service - a fast SQL Query Engine
Interactive or batch queries/ingestion
Query through the console/Web UI, the CLI, the REST API, client SDKs, or connectors to other services
Query external/federated data sources
External Data Sources
Cloud Storage
Google Drive
Cloud Dataflow
Cloud Dataprep
Cloud Bigtable
CSV, JSON, Avro, and more formats
Useful for data that changes frequently or is managed by another team - e.g. a pricing Google Sheet
Disadvantages
Slower
Data consistency is NOT guaranteed if data is changed while being queried - consider loading via Dataflow into BigQuery if needed
BigQuery Storage Service - managed storage; Project has Datasets with tables of highly compressed columns stored in Colossus file system
Automatic sharding and shuffling
Geospatial data types and functions - e.g. distances from lat/long
Flexible pricing model - pay-as-you-go by default, with queries charged on compute/data processed and automatic caching of results so you don't pay for the same query twice; or flat-rate pricing; storage discounts accrue over time as with Data Storage
Serverless
Data encryption - encrypted at rest by default
Data security - row and column security - e.g. tag a column as PII data; Works with Cloud IAM to specify policies at an organisation level
Geographic locality - can specify storage location for data sovereignty
Foundation for BI and AI - elastic data warehouse nature. Can write ML models in SQL.
1 TB of query bytes processed free per month + 10 GB storage free
Schedule queries (using the @run_time parameter or the query scheduler in the UI)
SQL 2011 query language and functions
Nested and repeated fields; User-Defined Functions in JavaScript
Immutable audit logs
Near real-time analysis of massive datasets
Sharing of your datasets with your team/company - for data mashups to derive insights
Time-partitioned tables: create as partitioned at creation time with the --time_partitioning_type flag; set an expiration time for partitioned data with the --time_partitioning_expiration flag (see the DDL sketch below)
SELECT * FROM table WHERE _PARTITIONTIME BETWEEN TIMESTAMP("2019-01-01") AND TIMESTAMP("2019-01-31")
Not obvious to people who write queries that it is partitioned... unlike a separate table per timestamp.
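A Standard SQL DDL alternative to the bq flags above (table and column names are hypothetical; note this creates a column-partitioned table, whereas the _PARTITIONTIME pseudo-column applies to ingestion-time partitioned tables):
CREATE TABLE mydataset.events (event_ts TIMESTAMP, payload STRING)
PARTITION BY DATE(event_ts)
OPTIONS (partition_expiration_days = 30)  -- partitions expire after 30 days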
Concepts
BigQuery Storage Service
BigQuery Query Service
Project - top-level container that contains users and datasets; use for billing and limiting access to datasets and jobs
Datasets - contains tables and views, Access Control Lists (Reader/Writer/Owner) applied to all tables in dataset
Tables - compressed columns in Colossus file system columnar storage; encrypted; replicated 3+ times; no indexes, keys or partitions required; can be external (on Cloud Storage)
Job - query, import, export, copy; potentially a long running action; can be cancelled
Data Ingestion Options
Batch - e.g. uploading a CSV
Stream record inserts into BigQuery Managed Storage through the API
Max row size: 1 MB
Max throughput: 100k records per second, per Project (use Bigtable if need higher)
Data can be queried as it arrives - within seconds
For data consistency, enter an insertId for each inserted row - helps de-duplication
Load/Import Options into BigQuery
Import from Cloud Storage, Bigtable, upload CSV, NEWLINE DELIMITED JSON, AVRO, Parquet, ORC, or restore Cloud Datastore Backups
Tools
bq Tool CLI - from Cloud Storage
Web UI
BigQuery API - from Cloud Storage
gsutil Tool CLI - from CSV, NEWLINE DELIMITED JSON, AVRO, Parquet, ORC into Cloud Storage
Export Options
CSV, NEWLINE DELIMITED JSON, AVRO
Web UI Tips
If the dataset's project name contains hyphens, you need to enclose it in backticks
Hold down CTRL and click on the dataset name to have the schema appear
When looking at the dataset schema, look at the preview to inspect some data
Click on a column in the schema and it will add it to the query for you
More -> Format - auto-formats the SQL in the query editor
Save queries inside the project for you and/or your team
CREATE OR REPLACE VIEW dataset_name.view_name AS SELECT blah
Use BigQuery Geo Viz (free) to visualise geographic information
Tools that query BigQuery
Cloud Datalab - for interactive data exploration
Tableau - for BI
Qlik - for BI
Looker - for BI
Data Studio - for BI
Google Sheets - usage of pivot tables etc
Coworkers - can access the data for creating their own insights
BigQuery Geo Viz - free
Security
Features
Inherits Project Roles configured in IAM
Granular security control over rows and columns using the Data Catalog service or Authorised Views
IAM Project Roles
Owner Role
When a Dataset is created, the 'bigquery.dataOwner' predefined role is granted to all Project owner users
Can LIST all Datasets in the Project
Can DELETE any Dataset in the Project
Can LIST & GET all Jobs run on the Project, by any user
Viewer Role
Can START a Job in the Project. (Depending on the job type, additional Dataset roles are required)
Can LIST and GET all jobs, and UPDATE Jobs that they started in the Project
When a Dataset is created, the 'bigquery.dataViewer' predefined role is granted to all Project viewer users
Editor Role
Can CREATE a Dataset in the Project
When a Dataset is created, the 'bigquery.dataEditor' predefined role is granted to all Project editors
Functions
CAST(value AS STRING)
ST_X(point) - returns the longitude of a GEOGRAPHY point
SELECT ANY_VALUE(column) - returns the value of the expression for an arbitrary row in the group
SELECT ARRAY_AGG(STRUCT(col1, col2)) FROM ... GROUP BY xyz - creates an array containing everything in the group by
SELECT ARRAY(SELECT AS STRUCT col1, col2 FROM ...) - creates an array containing everything filtered by the where clause
UNNEST(line) - unpack an array - or use myArrayColumn[OFFSET(rowIndex)] instead of UNNEST
JOIN `table` ON STARTS_WITH(mystring, column)
SELECT 1 BETWEEN 0 AND 1 - BETWEEN is inclusive, so this returns a boolean of TRUE
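A sketch of the ARRAY_AGG / UNNEST round trip described above (table and column names are hypothetical):
WITH customer_orders AS (
  SELECT customer_id, ARRAY_AGG(STRUCT(order_id, amount)) AS orders
  FROM `myproject.sales.orders`
  GROUP BY customer_id
)
SELECT customer_id, o.order_id, o.amount
FROM customer_orders, UNNEST(orders) AS o  -- unpack the array back into rows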
Approximation Functions - within 1% of exact; cheaper to execute than the exact equivalent
APPROX_COUNT_DISTINCT()
APPROX_TOP_COUNT()
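A hypothetical usage - APPROX_TOP_COUNT returns an ARRAY of STRUCTs holding each value and its approximate count:
SELECT APPROX_TOP_COUNT(page_url, 5) AS top_pages  -- top 5 pages by approximate frequency
FROM `myproject.logs.page_views`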
String Functions
SPLIT(text, delimiter) - split the text into an array, using the given delimiter
SUBSTR(line, first position, number of characters)
CONCAT(text, text, text)
WHERE ( REGEXP_CONTAINS(text, r'expression') )
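A few of these string functions in one hypothetical query:
SELECT
  SPLIT('a,b,c', ',')[OFFSET(1)] AS second_element,  -- 'b'
  SUBSTR('BigQuery', 1, 3) AS prefix,                -- 'Big' (positions are 1-indexed)
  CONCAT('Big', 'Query') AS joined,                  -- 'BigQuery'
  REGEXP_CONTAINS('BigQuery', r'^Big') AS is_big     -- true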
Date and Time Functions - BigQuery uses Epoch time; values are based on the UTC time zone by default
FORMAT_UTC_USEC(blah_timestamp_in_usec) AS time - legacy SQL; formats microseconds since the epoch as a UTC string
DATE(year, month, day) - constructs a DATE from INT64 values
DATE(timestamp) - converts a timestamp expression to a DATE and supports time zone
DATETIME(date, time) - constructs a DATETIME object using DATE and TIME objects
TIMESTAMP_SECONDS(seconds) - construct a TIMESTAMP out of time in epoch seconds
FROM TABLE_DATE_RANGE(myproject:applogs.event_, TIMESTAMP('2019-09-01'), TIMESTAMP('2019-09-15')) - legacy SQL for querying all 'myproject:applogs.event_*' tables with a date suffix from 20190901 to 20190915 (see the Standard SQL equivalent below)
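The Standard SQL equivalent uses a wildcard table and the _TABLE_SUFFIX pseudo-column (the column name is hypothetical):
SELECT event_name
FROM `myproject.applogs.event_*`
WHERE _TABLE_SUFFIX BETWEEN '20190901' AND '20190915'  -- only scans the matching tables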
User-Defined Functions
Why? loops, complex conditions, non-trivial string parsing
SQL or JavaScript
Can only use UDFs for the current query or command-line session
Standard SQL UDFs are scalar; legacy SQL UDFs are tabular
CREATE [TEMPORARY|TEMP] FUNCTION funcName(x INT64, y INT64) AS ( x/y )
CREATE [TEMPORARY|TEMP] FUNCTION funcName(x FLOAT64, y FLOAT64) RETURNS FLOAT64 LANGUAGE js AS """ return x*y; """;
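A complete hypothetical example - a temporary SQL UDF defined and used in the same statement:
CREATE TEMP FUNCTION ratio(x INT64, y INT64) AS (x / y);
SELECT ratio(10, 4) AS r;  -- 2.5 ( / returns FLOAT64, not integer division)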
UDF Limits
Output per input row should be less than 5 MB
Each user can run 6 concurrent JavaScript UDF queries per project
Native code JavaScript functions are NOT supported
Bit-wise operations in JavaScript handle only the most significant 32 bits
Query job can have a max of 50 JavaScript UDF resources; each inline code blob with a max of 32 KB; each external code resource limited to a max of 1 MB
Analytic Window Functions - within a window or ordered group of records
Navigation Functions
LEAD() - returns the value of a row n rows ahead of the current row
LAG() - returns the value of a row n rows behind the current row
NTH_VALUE() - returns the nth value in the window
Ranking and Numbering Functions - function_blah() OVER ( PARTITION BY col ORDER BY col2)
CUME_DIST() - returns the cumulative distribution of a value in a group
DENSE_RANK() - returns the integer rank of a value in a group
ROW_NUMBER() - returns the current row number of the query result
RANK() - returns the integer rank of a value in a group of values
PERCENT_RANK() - returns the rank of the current row, relative to the other rows in the partition
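A hypothetical sketch combining a ranking function and a navigation function over the same window:
SELECT
  player, team, score,
  RANK() OVER (PARTITION BY team ORDER BY score DESC) AS team_rank,
  LAG(score) OVER (PARTITION BY team ORDER BY score DESC) AS score_above  -- previous row's score
FROM `myproject.games.results`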
Logic Functions
IF(condition, value if true, value if false)
Geographic Functions
ST_GeogPoint(longitude, latitude) - a geographic point
ST_DWithin(geography1, geography2, distance) - returns TRUE if the distance between the two geographies is no more than the given distance in metres
ST_GeogFromText('POLYGON((-179 26, -179 48, -10 48, -10 26, -100 -10.1, -179 26))')
ST_MakePolygon - constructs a polygon from a linestring
ST_MakeLine([geopoint 1, geopoint n])
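A hypothetical sketch - ST_GeogPoint takes longitude first, and distances are in metres:
SELECT
  ST_DISTANCE(ST_GeogPoint(151.21, -33.87),                  -- Sydney (lng, lat)
              ST_GeogPoint(144.96, -37.81)) AS metres_apart, -- roughly 713 km
  ST_DWITHIN(ST_GeogPoint(151.21, -33.87),
             ST_GeogPoint(144.96, -37.81), 1000000) AS within_1000km  -- true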
Data Types in BigQuery Managed Storage
ARRAY (Mode REPEATED) - ordered list of 0 or more non-ARRAY type elements; lets you have parent/child data in the same row without having to join to another table
STRUCT (a RECORD) - a container of ordered fields each with a required type and an optional field name; a sub-object, that can be REPEATED, again letting you have parent/child data in the same row without having to join to another table
STRING - Variable-length unicode
INT64 - 64-bit integer
FLOAT64 - Double-precision decimal values
BOOL - case insensitive true or false
TIMESTAMP - absolute point in time; precision up to microseconds; years 1 to 9999 inclusive
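A hypothetical row with a REPEATED STRUCT - parent and child records in one row, no join needed:
SELECT
  'order-1' AS order_id,
  [STRUCT('widget' AS item, 2 AS qty),
   STRUCT('gadget' AS item, 1 AS qty)] AS line_items  -- ARRAY<STRUCT<item STRING, qty INT64>>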
Access Pattern
Summary: Use as a SQL data warehouse to power all your analytics; petabytes of structured data, latency in seconds
Capacity: Petabytes
Access metaphor: Data warehouse
Read: SELECT rows
Write: Batch / stream
Update granularity: Field
Usage: Interactive SQL; querying a fully managed data warehouse
Use Cases
BigQuery Solutions and Use Cases
Execution Plans
In the UI graph, the first part is the average time, the next segment is the max time, and the last segment is the fraction of time
A significant difference between Average and Max? Possibly tail latency / data skew (check with APPROX_TOP_COUNT), or filter earlier
Most time spent reading from intermediate stages? Need to filter earlier
Most time spent on CPU tasks? Consider approx functions, inspect UDF usage, filter earlier
BigQuery Machine Learning
Overview
Write ML models with SQL
Experiment and iterate right where your data lives - in BigQuery data warehouse
Build classification and forecasting models
Advanced: inspect model weights and adjust hyper-parameters
BigQuery End-to-End ML Process
Step 1a: Create a Dataset - ETL into BigQuery - public data sources, Google Marketing Platform analytics/ads, YouTube, custom
Step 1b: Preprocess features - explore, join, create train/test tables
Step 2: Create a model - CREATE OR REPLACE MODEL xyzdataset.mymodel OPTIONS(model_type='linear_reg', input_label_cols=['mycolumn'], ls_init_learn_rate=0.15, l1_reg=1, max_iterations=5) AS WITH blah AS (SELECT... [data for training]
Need to alias your label column as 'label' or specify input_label_cols
Step 3: Evaluate model - look at metadata in UI, or SELECT roc_auc, accuracy, precision, recall FROM ML.EVALUATE(MODEL `xyzdataset.mymodel` ... [SQL query with evaluation data]
Step 4: Predict - write a SQL prediction query - SELECT predicted_value, col1, col2 FROM ML.PREDICT(MODEL `xyzdataset.mymodel`, (WITH blah AS (SELECT... [test data for prediction]
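A condensed hypothetical version of the whole flow (run each statement separately; dataset, model, and column names are made up). ML.PREDICT adds a predicted_<label> column to the output:
CREATE OR REPLACE MODEL `mydataset.fare_model`
OPTIONS (model_type='linear_reg', input_label_cols=['fare']) AS
SELECT trip_km, pickup_hour, fare FROM `mydataset.taxi_trips`;  -- Step 2: train

SELECT * FROM ML.EVALUATE(MODEL `mydataset.fare_model`);  -- Step 3: evaluate

SELECT predicted_fare, trip_km
FROM ML.PREDICT(MODEL `mydataset.fare_model`,
  (SELECT trip_km, pickup_hour FROM `mydataset.new_trips`));  -- Step 4: predict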
Features
Auto-tunes learning rate
Auto-splits data into training and test
Leverages BigQuery's processing power to build a model
Advanced Model Features
Set custom options in the OPTIONS of the model definition
Specify L1/L2 regularisation
3 strategies for train/test split: Random, Sequential, Custom
Set the learning rate
Use Standard SQL and UDFs within the ML queries to create training datasets
Supported Model Types
Linear Regression - for forecasting
Binary and Multi-Class Logistic Regression - for classification
More coming
Model evaluation functions - standard metrics: ROC, precision-recall curves - select from SQL or click on model in UI - SELECT * FROM ML.EVALUATE(MODEL `xyzdataset.mymodel`)
Model weight inspection
SELECT category, weight FROM UNNEST((SELECT category_weights FROM ML.WEIGHTS(MODEL `xyzdataset.mymodel`) WHERE processed_input = 'blah')) ORDER BY weight DESC
SELECT * FROM ML.WEIGHTS(MODEL `xyzdataset.mymodel`, (SELECT ...))
Feature distribution analysis - standard functions
Get data about features of a trained model - SELECT * FROM ML.FEATURE_INFO(MODEL `xyzdataset.mymodel`)
Model object - click in UI to see number of training runs, when last updated, ...
View training progress - SELECT * FROM ML.TRAINING_INFO(MODEL `xyzdataset.mymodel`)
Terminology
Instance/Observation - is a record or row of data
Label - the correct answer (historically, and what is trying to be predicted)
Feature - all the other columns or values
Feature Engineering
One-hot encoding of categorical variables (see the sketch below)
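A manual one-hot sketch in SQL (BigQuery ML one-hot encodes string features automatically; table and column names are hypothetical):
SELECT
  IF(payment_type = 'cash', 1, 0) AS is_cash,
  IF(payment_type = 'card', 1, 0) AS is_card,
  fare
FROM `mydataset.taxi_trips`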
Unsupervised
Clustering
Supervised learning - there is a historical right answer
Forecast a number
Linear Regression
Classify something
Binary Logistic Regression - for a binary choice
Multi-class Logistic Regression - more than 2 choices
Recommend something
Matrix Factorisation