Overview of the Google Cloud Platform (GCP) BigQuery Product

Below is an overview of the Google Cloud Platform (GCP) BigQuery Product.

BigQuery is Google Cloud Platform's (GCP) petabyte-scale, fully-managed data warehouse and analytics engine.

Documentation, Tutorials & Labs
- GCP BigQuery documentation
- GCP BigQuery query reference documentation
- GCP BigQuery tutorials
- GCP BigQuery client libraries documentation
- GCP documentation - bq command-line tool reference
- Codelab: Loading and Querying BigQuery Data with the "bq" Command-Line Tool

Tips
- Control resource consumption and cost - filter with WHERE as early as possible; choose subsets of data at the start of the SELECT rather than using LIMIT at the end of the job
- Specify only the columns you need, so the query processes the least data and incurs the least cost
- Click Validate to see an estimate of how much data will be processed, and therefore of the cost
- Do the biggest joins first, with pre-join filters where possible
- Avoid self joins - they square the number of rows processed
- Consider the cardinality of each GROUP BY key - more groups means more shuffling, and key skew can lead to increased tail latency
- Use HAVING with GROUP BY
- All built-in functions are faster than JavaScript UDFs
- Use an approximate function where possible, as it is cheaper - e.g.
  APPROX_COUNT_DISTINCT instead of COUNT(DISTINCT ...)
- Filter first, then ORDER BY in the outermost part of the query
- Use wildcards on table names when selecting from multiple tables
- Use time-partitioned tables - similar to sharding
- Monitor with Stackdriver - custom dashboards with up to 13 BigQuery metrics

Pricing - 3 categories
- Storage - amount of data in tables; ingest rate of streaming data; automatic discount for old data
- Processing - on-demand or flat-rate plans; on-demand is based on the amount of data processed; 1 TB/month free; opt in to run high-compute queries
- Free - loading; exporting; queries on metadata; per-user cached queries; queries with errors
- See the BigQuery pricing documentation

Features
- Two services in one, joined by a petabit network:
  - BigQuery Query Service - a fast SQL query engine
    - Interactive or batch queries/ingestion
    - Query through the console, Web UI, CLI, REST API, SDKs, and connectors to other services
    - Query external/federated data sources: Cloud Storage, Google Drive, Cloud Dataflow, Cloud Dataprep, Cloud Bigtable; CSV, JSON, Avro, and more formats
      - Useful for data that changes frequently or is managed by another team - e.g. a pricing Google Sheet
      - Disadvantages: slower; data consistency is NOT guaranteed if the data is changed while being queried - consider Dataflow into BigQuery if needed
  - BigQuery Storage Service - managed storage; a Project has Datasets with tables of highly compressed columns stored in the Colossus file system
    - Automatic sharding and shuffling
- Geospatial data types and functions - e.g. distances from lat/long
- Flexible pricing model - pay-as-you-go by default, with automatic caching of results so you don't pay twice; or flat-tier pricing; queries are charged on compute/data processed; storage discounts accrue over time, as with Data Storage
- Serverless
- Data encryption - encrypted at rest by default
- Data security - row- and column-level security - e.g.
  tag a column as PII data; works with Cloud IAM to specify policies at the organisation level
- Geographic locality - you can specify the storage location for data sovereignty
- Foundation for BI and AI - elastic data warehouse; ML models can be written in SQL
- 1 TB of query bytes processed free per month, plus 10 GB of free storage
- Scheduled queries (using the @run_time parameter or the query scheduler in the UI)
- SQL 2011 query language and functions
- Nested and repeated fields; user-defined functions in JavaScript
- Immutable audit logs
- Near real-time analysis of massive datasets
- Sharing of your datasets with your team/company - for data mashups to derive insights
- Time-partitioned tables:
  - create as partitioned at creation time - use the --time_partitioning_type flag
  - set an expiration time for partitioned data - use the --time_partitioning_expiration flag
  - SELECT * FROM table WHERE _PARTITIONTIME BETWEEN TIMESTAMP("2019-01-01") AND TIMESTAMP("2019-01-31")
  - Note: unlike a separate table per timestamp, it is not obvious to people writing queries that the table is partitioned

Concepts
- BigQuery Storage Service and BigQuery Query Service
- Project - top-level container that holds users and datasets; used for billing and for limiting access to datasets and jobs
- Dataset - contains tables and views; Access Control Lists (Reader/Writer/Owner) apply to all tables in the dataset
- Table - compressed columns in Colossus columnar storage; encrypted; replicated 3+ times; no indexes, keys or partitions required; can be external (on Cloud Storage)
- Job - query, import, export, or copy; potentially a long-running action; can be cancelled

Data Ingestion Options
- Batch - e.g.
  uploading a CSV
- Stream record inserts into BigQuery managed storage through the API
  - Max row size: 1 MB
  - Max throughput: 100k records per second, per Project (use Bigtable if you need higher)
  - Data can be queried as it arrives - within seconds
  - For data consistency, supply an insertId for each inserted row - this helps de-duplication

Load/Import Options into BigQuery
- Import from Cloud Storage or Bigtable; upload CSV, newline-delimited JSON, Avro, Parquet, or ORC; or restore Cloud Datastore backups
- Tools:
  - bq CLI tool - from Cloud Storage
  - Web UI
  - BigQuery API - from Cloud Storage
  - gsutil CLI tool - from CSV, newline-delimited JSON, Avro, Parquet, or ORC into Cloud Storage

Export Options
- CSV, newline-delimited JSON, Avro

Web UI Tips
- If the dataset's project name contains hyphens, enclose it in backticks
- Hold down CTRL and click on the dataset name to make the schema appear
- When looking at the dataset schema, use the preview to inspect some data
- Click on a column in the schema and it will be added to the query for you
- More -> Format
- Save queries inside the project for yourself and/or your team
- CREATE OR REPLACE VIEW dataset_name.view_name AS SELECT ...
- Use BigQuery Geo Viz (free) to visualise geographic information

Tools that query BigQuery
- Cloud Datalab - for interactive data exploration
- Tableau, Qlik, Looker, Data Studio - for BI
- Google Sheets - pivot tables etc.
- Coworkers - can access the data to create their own insights
- BigQuery Geo Viz - free

Security Features
- Inherits Project roles configured in IAM
- Granular security control over rows and columns using the Data Catalog service or Authorised Views
- IAM Project Roles:
  - Owner Role
    - When a Dataset is created, the 'bigquery.dataOwner' predefined role is granted to all Project owners
    - Can LIST all Datasets in the Project
    - Can DELETE any Dataset in the Project
    - Can LIST & GET all Jobs run on the Project, by any user
  - Viewer Role
    - Can START a Job in the Project.
      (Depending on the job type, additional Dataset roles are required)
    - Can LIST and GET all Jobs, and UPDATE Jobs that they started in the Project
    - When a Dataset is created, the 'bigquery.dataViewer' predefined role is granted to all Project viewers
  - Editor Role
    - Can CREATE a Dataset in the Project
    - When a Dataset is created, the 'bigquery.dataEditor' predefined role is granted to all Project editors

Functions
- CAST(value AS STRING)
- ST_X(object)
- SELECT ANY_VALUE(column)
- SELECT ARRAY_AGG(STRUCT(col1, col2)) FROM ... GROUP BY xyz - creates an array containing everything in each group
- SELECT ARRAY(SELECT AS STRUCT col1, col2 FROM ...) - creates an array containing everything matched by the subquery's WHERE clause
- UNNEST(arr) - unpack an array; or use myArrayColumn[OFFSET(rowIndex)] instead of UNNEST
- JOIN `table` ON STARTS_WITH(mystring, column)
- SELECT EXTRACT(DATE FROM col_timestamp) AS date
- SELECT EXTRACT(DAYOFWEEK FROM col_timestamp) AS daynumber - note: 1 is Sunday, 7 is Saturday
- SELECT 1 BETWEEN 0 AND 1 - BETWEEN is inclusive, so this returns TRUE
- Approximation functions - typically within 1% of the exact answer and cheaper to execute: APPROX_COUNT_DISTINCT(), APPROX_TOP_COUNT()
- String functions:
  - SPLIT(text, delimiter) - split the text into an array using the given delimiter
  - SUBSTR(text, first_position, number_of_characters)
  - CONCAT(text, text, text)
  - WHERE REGEXP_CONTAINS(text, r'expression')
  - REGEXP_EXTRACT(text, r'expression')
- Date and time functions - BigQuery uses epoch time; values are based on the UTC time zone by default:
  - FORMAT_UTC_USEC(timestamp_in_usec) AS time
  - DATE(year, month, day) - constructs a DATE from INT64 values
  - DATE(timestamp) - converts a timestamp expression to a DATE; supports a time zone argument
  - DATETIME(date, time) - constructs a DATETIME from DATE and TIME objects
  - TIMESTAMP_SECONDS(seconds) - constructs a TIMESTAMP from a time in epoch seconds
  - FROM TABLE_DATE_RANGE(myproject:applogs.event_, TIMESTAMP('2019-09-01'), TIMESTAMP('2019-09-15')) in order to
    query all 'myproject:applogs.event_*' tables with a date suffix from 20190901 to 20190915

User-Defined Functions
- Why? Loops, complex conditions, non-trivial string parsing
- SQL or JavaScript
- UDFs can only be used within the current query or command-line session
- Standard SQL UDFs are scalar; legacy SQL UDFs are tabular
- CREATE [TEMPORARY|TEMP] FUNCTION funcName(x INT64, y INT64) AS ( x/y )
- CREATE [TEMPORARY|TEMP] FUNCTION funcName(x FLOAT64, y FLOAT64) RETURNS FLOAT64 LANGUAGE js AS """ return x*y; """;
- UDF limits:
  - Output per input row should be less than 5 MB
  - Each user can run 6 concurrent JavaScript UDF queries per project
  - Native-code JavaScript functions are NOT supported
  - JavaScript handles only the most significant 32 bits of a 64-bit integer
  - A query job can have a max of 50 JavaScript UDF resources; each inline code blob has a max of 32 KB; each external code resource is limited to 1 MB

Analytic (Window) Functions - operate within a window, an ordered group of records
- Navigation functions:
  - LEAD() - returns the value of the row n rows ahead of the current row
  - LAG() - returns the value of the row n rows behind the current row
  - NTH_VALUE() - returns the nth value in the window
- Ranking and numbering functions - used as function() OVER (PARTITION BY col ORDER BY col2):
  - CUME_DIST() - returns the cumulative distribution of a value in a group
  - DENSE_RANK() - returns the integer rank of a value in a group
  - ROW_NUMBER() - returns the current row number in the query result
  - RANK() - returns the integer rank of a value in a group of values
  - PERCENT_RANK() - returns the rank of the current row relative to the other rows in the partition

Logic Functions
- IF(condition, value_if_true, value_if_false)

Geographic Functions
- ST_GeogPoint(longitude, latitude) - a geographic point
- ST_DWithin(geography1, geography2, distance) - TRUE if the two geographies are within the given distance of each other
- ST_GeogFromText('POLYGON((-179 26, -179 48, -10 48, -10 26, -100 -10.1, -179 26))')
- ST_MakePolygon
- ST_MakeLine([geopoint1, geopointN])

Data Types in BigQuery Managed Storage
- ARRAY (mode REPEATED) - an ordered list of 0 or more non-ARRAY elements; lets you keep parent/child data in the same row without joining to another table
- STRUCT (a RECORD) - a container of ordered fields, each with a required type and an optional field name; a sub-object that can itself be REPEATED, again keeping parent/child data in the same row without a join
- STRING - variable-length Unicode
- INT64 - 64-bit integer
- FLOAT64 - double-precision decimal values
- BOOL - case-insensitive true or false
- TIMESTAMP - an absolute point in time; precision up to microseconds; years 1 to 9999 inclusive

Access Pattern Summary
- Use as a SQL data warehouse to power all your analytics; petabytes of structured data, latency in seconds
- Capacity: petabytes
- Access metaphor: data warehouse
- Read: SELECT rows
- Write: batch / stream
- Update granularity: field
- Usage: interactive SQL; querying a fully managed data warehouse

Use Cases
- See the BigQuery solutions and use cases documentation

Execution Plans
- In the UI graph, the first segment of each stage is the average time, the next is the max time, and the last is the fraction of time
- A significant difference between average and max? Possibly tail latency / data skew (check with APPROX_TOP_COUNT), or filter earlier
- Most time spent reading from intermediate stages? Filter earlier
- Most time spent on CPU tasks?
  Consider approximate functions, inspect UDF usage, and filter earlier

BigQuery Machine Learning
- Write ML models with SQL
- Experiment and iterate right where your data lives - in the BigQuery data warehouse
- Build classification and forecasting models
- Advanced: inspect model weights and adjust hyper-parameters

BigQuery End-to-End ML Process
- Step 1a: Create a dataset - ETL into BigQuery - public data sources, Google Marketing Platform analytics/ads, YouTube, custom sources
- Step 1b: Preprocess features - explore, join, create train/test tables
- Step 2: Create a model - CREATE OR REPLACE MODEL xyzdataset.mymodel OPTIONS(model_type='linear_reg', input_label_cols=['mycolumn'], ls_init_learn_rate=0.15, l1_reg=1, max_iterations=5) AS WITH blah AS (SELECT ... [data for training]
  - You need to alias your label column as 'label' or specify input_label_cols
- Step 3: Evaluate the model - look at the metadata in the UI, or SELECT roc_auc, accuracy, precision, recall FROM ML.EVALUATE(MODEL `xyzdataset.mymodel` ... [SQL query with evaluation data]
- Step 4: Predict - write a SQL prediction query - SELECT predicted_value, col1, col2 FROM ml.PREDICT(MODEL `xyzdataset.mymodel`, (WITH blah AS (SELECT ...
  [test data for prediction]

BigQuery ML Features
- Auto-tunes the learning rate
- Auto-splits data into training and test sets
- Leverages BigQuery's processing power to build models
- Advanced model features:
  - Set custom options in the OPTIONS clause of the model definition
  - Specify L1/L2 regularisation
  - 3 strategies for the train/test split: random, sequential, custom
  - Set the learning rate
  - Use Standard SQL and UDFs within ML queries to create training datasets
- Supported model types:
  - Linear regression - for forecasting
  - Binary and multi-class logistic regression - for classification
  - More coming
- Model evaluation functions - standard metrics such as ROC and precision-recall curves - select from SQL or click on the model in the UI - SELECT * FROM ML.EVALUATE(MODEL `xyzdataset.mymodel`)
- Model weight inspection:
  - SELECT category, weight FROM UNNEST((SELECT category_weights FROM ML.WEIGHTS(MODEL `xyzdataset.mymodel`) WHERE processed_input = 'blah')) ORDER BY weight DESC
  - SELECT * FROM ML.WEIGHTS(MODEL `xyzdataset.mymodel`, (SELECT ...))
- Feature distribution analysis with standard functions - get data about the features of a trained model: SELECT * FROM ML.FEATURE_INFO(MODEL `xyzdataset.mymodel`)
- Model object - click on it in the UI to see the number of training runs, when it was last updated, ...
- View training progress - SELECT * FROM ML.TRAINING_INFO(MODEL `xyzdataset.mymodel`)

Terminology
- Instance/Observation - a record or row of data
- Label - the correct answer (historically, and what you are trying to predict)
- Feature - all the other columns or values
- Feature engineering - e.g. one-hot encoding of categorical variables
- Unsupervised learning - e.g. clustering
- Supervised learning - there is a historical right answer:
  - Forecast a number: linear regression
  - Classify something: binary logistic regression (a binary choice) or multi-class logistic regression (more than 2 choices)
  - Recommend something: matrix factorisation
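As a worked example, the cost-control tips above can be combined into one query sketch. The project, table, and column names here are hypothetical, for illustration only:

```sql
-- Select only needed columns, filter early, and use an approximate function.
SELECT
  country,
  APPROX_COUNT_DISTINCT(user_id) AS users  -- cheaper than COUNT(DISTINCT ...)
FROM `myproject.mydataset.events`
WHERE event_date >= '2019-01-01'           -- filter with WHERE as early as possible
GROUP BY country
HAVING users > 1000                        -- HAVING filters the grouped result
ORDER BY users DESC;                       -- ORDER BY at the outermost level
```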
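A minimal sketch of the ARRAY_AGG(STRUCT(...)) pattern from the Functions section, using an inline WITH clause instead of a real table so the shape is self-contained:

```sql
-- Pack each customer's orders into a repeated STRUCT: parent/child data
-- in one row, with no join to a separate orders table.
WITH orders AS (
  SELECT 'alice' AS customer, 101 AS order_id, 25.0 AS total UNION ALL
  SELECT 'alice', 102, 10.0 UNION ALL
  SELECT 'bob',   103, 99.0
)
SELECT
  customer,
  ARRAY_AGG(STRUCT(order_id, total)) AS orders
FROM orders
GROUP BY customer;

-- UNNEST flattens the array back to one row per order (nested_table is
-- a hypothetical table holding the result above):
-- SELECT customer, o.order_id, o.total
-- FROM nested_table, UNNEST(orders) AS o;
```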
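The analytic window functions above can be illustrated with a small self-contained example (hypothetical data) combining a ranking function and a navigation function over the same window:

```sql
-- Rank each product within its category by revenue, and fetch the
-- previous row's revenue with LAG() for comparison.
WITH sales AS (
  SELECT 'toys'  AS category, 'car'   AS product, 100 AS revenue UNION ALL
  SELECT 'toys',              'doll',             80            UNION ALL
  SELECT 'books',             'atlas',            60
)
SELECT
  category,
  product,
  revenue,
  RANK() OVER (PARTITION BY category ORDER BY revenue DESC) AS rank_in_category,
  LAG(revenue) OVER (PARTITION BY category ORDER BY revenue DESC) AS prev_revenue
FROM sales;
```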
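The JavaScript UDF form from the User-Defined Functions section, written out as a complete temporary function plus a call (function name is arbitrary):

```sql
-- A temporary JavaScript UDF, valid only for the current query/session.
-- Remember: built-in functions are faster, and JavaScript handles only
-- the most significant 32 bits of a 64-bit integer, hence FLOAT64 here.
CREATE TEMP FUNCTION multiply(x FLOAT64, y FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS """
  return x * y;
""";

SELECT multiply(6.0, 7.0) AS answer;  -- 42.0
```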
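A small sketch of the geographic functions above; the coordinates are illustrative (two points in Sydney roughly 1.4 km apart), and the distance argument is in metres:

```sql
-- ST_GeogPoint takes (longitude, latitude); ST_DWithin checks whether
-- two geographies lie within the given distance of each other.
SELECT
  ST_DWithin(
    ST_GeogPoint(151.2093, -33.8688),
    ST_GeogPoint(151.2153, -33.8568),
    2000) AS within_2km;  -- TRUE: the points are ~1.4 km apart
```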
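The end-to-end ML steps above (create, evaluate, predict) can be sketched as three statements. Dataset, table, and column names are hypothetical; the only fixed parts are the CREATE MODEL / ML.EVALUATE / ML.PREDICT shapes described in the notes:

```sql
-- Step 2: create a linear regression model from a training query.
CREATE OR REPLACE MODEL `xyzdataset.mymodel`
OPTIONS (model_type = 'linear_reg',
         input_label_cols = ['trip_duration']) AS
SELECT trip_duration, start_hour, distance_km
FROM `xyzdataset.training_data`;

-- Step 3: evaluate it against held-out data.
SELECT *
FROM ML.EVALUATE(MODEL `xyzdataset.mymodel`,
                 (SELECT trip_duration, start_hour, distance_km
                  FROM `xyzdataset.eval_data`));

-- Step 4: predict; the output adds a predicted_<label> column.
SELECT predicted_trip_duration, start_hour, distance_km
FROM ML.PREDICT(MODEL `xyzdataset.mymodel`,
                (SELECT start_hour, distance_km
                 FROM `xyzdataset.new_trips`));
```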