Google Cloud Platform (GCP) Overview
Google Cloud Customers
Showcase of Big Data Customers
Feature Engineering
Where to do it?
On the fly, as data is sent to the input function
As a separate pre-processing step over all the data, before the training step - done in Dataflow so it is scaled and distributed; also good for time-windowed aggregations in a real-time pipeline (with Dataflow then used for predictions too)
Alternatively, do the pre-processing in Dataflow in plain Python, so the same pre-processing steps can be reused at inference time on the serving inputs of the trained model
Do the pre-processing in Dataflow and create a set of pre-processed features using tf.transform (min, max, vocab, etc. stored in metadata.json) so they become part of the actual model graph and can be replayed in TensorFlow during serving/inference (see the sketch below)
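A minimal sketch of the tf.transform option, assuming TensorFlow Transform is installed; the feature names ('fare', 'pickup_zone') are hypothetical. The analyzers run over the full dataset (e.g. in Dataflow), and the resulting constants are baked into the serving graph:

```python
import tensorflow_transform as tft

# Hypothetical input features: 'fare' (numeric) and 'pickup_zone' (string).
def preprocessing_fn(inputs):
    """tf.transform preprocessing: analyzers (min/max/vocab) are computed
    over the whole dataset during the analysis phase, then embedded as
    constants so the same transforms replay at serving time."""
    return {
        # Scale to [0, 1] using the dataset-wide min and max
        'fare_scaled': tft.scale_to_0_1(inputs['fare']),
        # Build a vocabulary over the full dataset and map strings to integer ids
        'pickup_zone_id': tft.compute_and_apply_vocabulary(inputs['pickup_zone']),
    }
```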
Google's core infrastructure
Level 1: Security
Communications to Google Cloud are encrypted in transit, with multiple layers of security to protect against denial-of-service attacks, backed by Google security teams 24/7
Data is encrypted at rest and distributed for availability and reliability
Example - BigQuery
Table data is encrypted with envelope encryption - data is encrypted with data encryption keys, and those keys are in turn encrypted with key encryption keys
Customer-managed encryption keys can also be used
Can monitor and flag queries for anomalous behaviour
Can limit data access with authorised views, at the row and column level
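As a rough illustration of authorised views, a sketch using the google-cloud-bigquery Python client; the project, dataset, and table names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Create a view in a dataset that analysts are allowed to query.
view = bigquery.Table("my-project.shared_views.eu_orders")
view.view_query = """
    SELECT order_id, order_total
    FROM `my-project.private_data.orders`
    WHERE region = 'EU'
"""
view = client.create_table(view)

# Authorise the view against the source dataset, so users of the view
# never need direct access to the underlying table.
source = client.get_dataset("my-project.private_data")
entries = list(source.access_entries)
entries.append(bigquery.AccessEntry(None, "view", view.reference.to_api_repr()))
source.access_entries = entries
client.update_dataset(source, ["access_entries"])
```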
Google manages (when using GCP managed services)
Deployment
Web app security
Identity
Operations
Access and authorisation
Network security
OS, data and content
Audit logging
Network
Storage and encryption
Hardware
You manage (when using GCP managed services)
Content
Access policies
Usage
Level 2: Compute Power
Level 2: Storage
Level 2: Networking
Overview
Private network with petabit bisectional, full-duplex bandwidth, edge points of presence, and software-defined networking
Thousands of miles of fibre optic cable, crossing oceans, with repeaters to amplify optical signals
Carries ~40% of the world's internet traffic daily
Any machine in Google's Jupiter Network can communicate with any other machine at over 10 gigabits per second
This speed means co-location of compute and storage is not necessary
>100 edge points of presence globally
>7500 edge node locations globally
Level 3: Big Data, ML & Analytics Product Suite
Google Cloud Public Datasets
Big Data Product Life-cycle
2/5 - Ingestion Products
Cloud Pub/Sub
Cloud Dataflow
Cloud Composer
3/5 - Analytics Products
Cloud Dataprep
BigQuery
Cloud Dataproc
Cloud Datalab
Overview
Based on JupyterLab
CLI: datalab create dataengvm --zone us-central1-a
4/5 - Machine Learning Products
Cloud TPU
Cloud ML
Cloud AutoML
ML APIs
TensorFlow
5/5 - Products to Serve Data and Insights to Users
Dialogflow
Data Studio Dashboards/BI
1/5 - Storage Products
Cloud Bigtable
Cloud Storage
Cloud SQL
Cloud Spanner
Cloud Datastore
Compute Engine Disks
GCP documentation on Compute Engine disk options
Persistent Disks
Disks that are attached to a VM (running in Compute Engine)
Must be attached to a VM in the same zone
GCP documentation on snapshot creation
GCP documentation on snapshot best practices
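For illustration, a hedged sketch of creating a snapshot with the google-cloud-compute Python client; the project, zone, disk, and snapshot names are hypothetical:

```python
from google.cloud import compute_v1

client = compute_v1.DisksClient()

# Snapshot an existing persistent disk (names are hypothetical).
snapshot = compute_v1.Snapshot(name="dataengvm-backup-001")
operation = client.create_snapshot(
    project="my-project",
    zone="us-central1-a",   # the disk's zone
    disk="dataengvm",
    snapshot_resource=snapshot,
)
operation.result()  # block until the snapshot operation completes
```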
Compute Products
Compute Engine
Infrastructure as a Service (IaaS)
Quick lift and shift, or maximum flexibility to manage server instances yourself
Google Kubernetes Engine (GKE)
Clusters of machines (managed by Google), under your administrative control
Run and orchestrate multiple portable containers in an efficient way
App Engine
Fully managed Platform as a Service (PaaS) framework
Run long-lived code (e.g. web applications) that can autoscale without needing to worry about infrastructure provisioning or resource management
Cloud Functions
Pre-trained AI building blocks
Sight
Cloud Vision API
Cloud Video Intelligence API
AutoML Vision
Language
Cloud Translation API
Converts text from one language to another
GCP Cloud Translation API documentation
Features
Supports 100+ languages
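A minimal sketch using the google-cloud-translate Python client (basic/v2 edition); the sample sentence is arbitrary:

```python
from google.cloud import translate_v2 as translate

client = translate.Client()

# Translate to English; the source language is auto-detected.
result = client.translate("Il fait beau aujourd'hui", target_language="en")
print(result["translatedText"])          # e.g. "The weather is nice today"
print(result["detectedSourceLanguage"])  # e.g. "fr"
```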
Cloud Natural Language API
Extracts entities, analyses sentiment, and parses syntax (parts of speech)
Features
Sentiment score - from -1.0 to 1.0 (positive or negative)
Sentiment magnitude - from 0.0 upwards, unbounded (how intense the feeling is, regardless of polarity)
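A minimal sentiment-analysis sketch with the google-cloud-language Python client; the sample text is arbitrary:

```python
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()

document = language_v1.Document(
    content="The keynote was fantastic!",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)
sentiment = client.analyze_sentiment(request={"document": document}).document_sentiment
print(sentiment.score)      # -1.0 (negative) to 1.0 (positive)
print(sentiment.magnitude)  # 0.0 upwards: overall strength of emotion
```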
AutoML Translation
Translation using Automated Machine Learning
AutoML Natural Language
Natural Language Processing (NLP) using Automated Machine Learning
Conversation
Dialogflow Enterprise Edition
Cloud Text-to-Speech
Converts text into high-quality speech audio
Cloud Speech-to-Text
Converts audio to text for data processing
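A minimal transcription sketch with the google-cloud-speech Python client; the Cloud Storage URI is hypothetical:

```python
from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)
# Audio file staged in Cloud Storage (hypothetical bucket/object).
audio = speech.RecognitionAudio(uri="gs://my-bucket/meeting.wav")

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```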
Resource Hierarchy
Organisation
Overview
Not required, but allows policies to be set and applied throughout all the projects under that organisation
Root node of an entire GCP hierarchy
Folders
A logical grouping for a collection of projects or nested folders
Use a folder for logically grouping different teams and/or products
Projects
A base-level, logical organising entity for creating and using resources and services, and for managing billing, APIs, and permissions
Use a project for each environment - e.g. Dev, Test, Production
Actions that can be performed
Create
Manage
Delete
Undelete
Resources
Lowest level in the hierarchy
Examples: BigQuery dataset, Cloud Storage bucket, Compute Engine instance
Cloud Identity and Access Management (IAM)
IAM policies control user access to resources
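As an illustration, a hedged sketch of adding a role binding to a project's IAM policy with the google-cloud-resource-manager client; the project ID, role, and member are hypothetical:

```python
from google.cloud import resourcemanager_v3
from google.iam.v1 import policy_pb2

client = resourcemanager_v3.ProjectsClient()
resource = "projects/my-project-id"  # hypothetical project

# Read the current policy, append a binding, and write it back.
policy = client.get_iam_policy(request={"resource": resource})
policy.bindings.append(
    policy_pb2.Binding(
        role="roles/bigquery.dataViewer",
        members=["user:analyst@example.com"],
    )
)
client.set_iam_policy(request={"resource": resource, "policy": policy})
```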
Zones and Regions
Physically organise the GCP resources
Region - an independent geographic area, consisting of 2 or more zones (most have 3 zones, some have 4)
Zone - a deployment area within a region; a single failure domain within a region (e.g. independent power supply, network switches, etc.)
History of Google Inventions
2002 - Google File System (GFS)
For sharding and storing petabytes of data at scale
Foundation for Cloud Storage and BigQuery Managed Storage
Cloud Storage
2004 - MapReduce
Challenge was how to index the expanding content of the Web
MapReduce-based programs can automatically parallelise and execute on a large cluster of commodity machines (see the toy sketch below)
A year later, Apache Hadoop was created by Doug Cutting and Mike Cafarella
The issue was that developers had to write code to manage all of the commodity server infrastructure, rather than just focusing on application logic - so Google moved away from MapReduce to Dremel between 2008 and 2010
Cloud Dataproc
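A toy, single-process word count in the MapReduce style (not any Google API), just to show the mapper/reducer contract that the framework parallelises across machines:

```python
from collections import defaultdict

def mapper(line):
    # Emit (key, value) pairs; many mappers run in parallel over input splits.
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    # Combine all values for one key; reducers also run in parallel.
    yield word, sum(counts)

def run(lines):
    groups = defaultdict(list)
    for line in lines:                       # "map" phase
        for word, count in mapper(line):
            groups[word].append(count)       # "shuffle": group by key
    return dict(                             # "reduce" phase
        result
        for word, counts in groups.items()
        for result in reducer(word, counts)
    )

print(run(["the quick brown fox", "the lazy dog"]))  # {'the': 2, ...}
```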
2006 - Bigtable
Challenge was how to record and retrieve millions of streaming user actions with high throughput
Was the inspiration behind MongoDB and HBase
Cloud Bigtable
2008 to 2010 - Dremel
The issue with MapReduce was that developers had to write code to manage all of the commodity server infrastructure, rather than just focusing on application logic
Decomposes data into shards and compresses them into columnar format across distributed storage
Uses a query optimiser to farm out a task for each shard of data to be processed in parallel across commodity hardware
Automatically manages data imbalances, worker communication, and scaling
Became the query engine behind BigQuery
BigQuery
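A minimal sketch of running a Dremel-backed query through the google-cloud-bigquery Python client; bigquery-public-data.usa_names.usa_1910_2013 is a real public dataset:

```python
from google.cloud import bigquery

client = bigquery.Client()

# BigQuery's engine fans this aggregation out over columnar shards in parallel.
query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""
for row in client.query(query).result():
    print(row.name, row.total)
```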
2009 - Colossus
Next-generation distributed file system, the successor to GFS
Cloud Storage
2010 - Flume
Data pipelines
Cloud Dataflow
2011 - Megastore
Cloud Datastore
2012 - Spanner
Planet-scale relational database
Cloud Spanner (in 2016)
2013 - Pub/Sub
Messaging
Cloud Pub/Sub
2013 - MillWheel
Data pipelines
2014 - F1
2015 - TensorFlow
Machine Learning framework and library
TensorFlow
Cloud ML Engine
2017 - TPU
Hardware specialised for machine learning
AutoML
Hardware
CPU
GPU
TPU
An Application-Specific Integrated Circuit (ASIC), faster than GPUs for ML workloads
Case Studies
eBay - uses Cloud TPU Pods, giving a 10x speed-up for training image-recognition models (from months to a few days); the increased memory also lets them process many more images at once
TPU v1 - ~92 teraops (8-bit integer; inference only)
Cloud TPU v2 - 180 teraflops, 64-GB High Bandwidth Memory (HBM)
Cloud TPU v3 - 420 teraflops, 128-GB High Bandwidth Memory (HBM)
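A hedged sketch of targeting a Cloud TPU from TensorFlow 2; the TPU name is hypothetical and comes from your TPU node/VM setup:

```python
import tensorflow as tf

# Connect TensorFlow to the TPU cluster ("my-tpu" is hypothetical).
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    # Variables created here are placed and replicated across the TPU cores.
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(optimizer="adam", loss="mse")
```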