Big Data Overview
Big Data Challenges
Migrating existing data workloads (e.g. Hadoop, Spark jobs)
Analysing large datasets at scale
Building streaming data pipelines
Applying machine learning to the data
Team Roles
Data Engineer role
Build pipelines and clean the data
Purpose
Solve pipeline design - will the same code work for both batch and streaming? Does the transformation code belong in the SDK? Are there existing solutions that fit?
Solve implementation issues - maintenance overhead, reliability, scaling, monitoring, alerting, vendor lock-in
Decision Maker role
Cost/benefit analysis to decide how deep to invest in a data-driven opportunity
Analyst role
Explore the data for insights and potential relationships that may be useful as features in a Machine Learning model
Statistician role
Add rigour so that 'data-inspired' decisions become truly 'data-driven' decisions
Applied ML Engineer role
To build production Machine Learning models from the best and latest research
Data Scientist role
Have mastery over analysis, statistics and machine learning
Analytics Manager role
To lead the team
Social Scientist role
To ensure that the project's quantitative impact is real
Ethicist role
To ensure that the impact is the right thing to do
ML Researcher role
To discover better Machine Learning models and algorithms
Team role overlap
Size of organisation often determines how much role overlap there is
Data Analyst Job
Data Engineer Job
Applied ML Engineer Job
Lead Job
Open-source ecosystem for Big Data
Hadoop - canonical open-source MapReduce framework
Pig - convenient imperative scripting language that can be compiled into Hadoop MapReduce jobs
Features
Imperative - specifies exactly how to perform each processing step, which constrains the engine's optimisation choices but suits pipeline paradigms
Hive - a data warehousing system and query language
Features
Declarative - specifies what result is wanted and lets the underlying system determine how to process the data, giving the engine freedom to optimise (the contrast is illustrated below)
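To make the contrast concrete, a minimal Python sketch - illustrative only, using the standard-library sqlite3 engine as a stand-in for Hive's declarative SQL and a plain loop as a stand-in for a Pig-style step-by-step script:

    import sqlite3

    rows = [("apples", 3), ("pears", 2), ("apples", 5)]

    # Imperative: spell out exactly how to process the data, step by step.
    totals = {}
    for product, qty in rows:
        totals[product] = totals.get(product, 0) + qty
    print(totals)  # {'apples': 8, 'pears': 2}

    # Declarative: state the result wanted; the engine plans the execution.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (product TEXT, qty INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    for product, total in conn.execute(
            "SELECT product, SUM(qty) FROM sales GROUP BY product"):
        print(product, total)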
Apache Spark - fast, interactive, general-purpose framework for SQL, streaming, machine learning, etc
Features
Batch processing
Stream processing
Automates data partitioning, replication, recovery, pipelining of processing
Can construct parallel pipelines from your submitted program's graph, based on the resources available
Concepts
Resilient Distributed Datasets (RDDs) - abstraction over data in storage - hides complexity of the location of data in the cluster and replication
Resilient (fault-tolerant) to data loss due to node failures
RDD lineage graph is used to recompute damaged or missing partitions
Your program is written as a request, recorded as a Directed Acyclic Graph (DAG), lazily evaluated and executed only when an output is required
Transformations - lazy; take an RDD as input and produce an RDD as output (e.g. map, filter); the per-element logic is often written as anonymous functions in code
Actions - eager; produce an output such as a value or a text file (e.g. collect, count, saveAsTextFile), so they trigger immediate execution of the DAG - see the first sketch after this list
Spark DataFrame - immutable and, at scale, faster than a Pandas DataFrame (which is mutable); you can convert between the two, as in the second sketch below
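A minimal PySpark sketch of the lazy/eager split, assuming a local Spark install (names are illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "lazy-demo")

    rdd = sc.parallelize(range(10))

    # Transformations are lazy: nothing runs yet, Spark only extends the DAG.
    evens = rdd.filter(lambda x: x % 2 == 0)   # anonymous function
    squares = evens.map(lambda x: x * x)

    # An action triggers execution of the whole lineage.
    print(squares.collect())  # [0, 4, 16, 36, 64]
    print(squares.count())    # 5

    sc.stop()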
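And a sketch of converting between the two DataFrame types, assuming pyspark and pandas are installed:

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("df-demo").getOrCreate()

    pdf = pd.DataFrame({"name": ["a", "b"], "score": [1, 2]})

    # Pandas -> Spark: the Spark DataFrame is immutable and distributed.
    sdf = spark.createDataFrame(pdf)
    sdf2 = sdf.withColumn("score2", sdf["score"] * 2)  # returns a new DataFrame

    # Spark -> Pandas: collects to the driver, so only do this for small results.
    print(sdf2.toPandas())

    spark.stop()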
Data Security
Security is more than setting roles when a Dataset is created
Organisations should have a written & internally published Data Access Policy - how, when and why data should be shared, and with whom
Best Practices
Users should have the minimum permissions required for their role (see the sketch after this list)
Use separate Projects or Datasets for different environments (Dev, Test, Prod)
Audit security roles assigned to users periodically
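A hedged sketch of granting minimum, per-dataset permissions with the google-cloud-bigquery client library - the project, dataset and email here are hypothetical placeholders:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical dataset ID: project "my_project", dataset "analytics_prod".
    dataset = client.get_dataset("my_project.analytics_prod")

    # Grant only the role needed: READER for an analyst, on this dataset only.
    entries = list(dataset.access_entries)
    entries.append(
        bigquery.AccessEntry(
            role="READER",
            entity_type="userByEmail",
            entity_id="analyst@example.com",
        )
    )
    dataset.access_entries = entries
    client.update_dataset(dataset, ["access_entries"])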
Terminology
Unstructured Data - data that is not structured in a way that is useful or fit for your purpose, even if it has a schema
Streaming
Streaming - processing of unbounded datasets; scales to process massive data fast, for cases that need immediate insights in order to make timely decisions: IoT sensor data, telemetry, transactions (comparing new data with historical data for fraud detection!), current user activity in online games
Challenges
Variable volumes - ingestion needs to scale, be fault-tolerant, handle spikes and provide durable messaging - loose coupling via a Message Bus such as Cloud Pub/Sub (see the sketches after this list)
Latency is to be expected - data can arrive late or out of order during ingestion or processing - Beam/Dataflow provides exactly-once processing of events, plus windowing and watermarks to handle late data
Exactly-once processing in Cloud Dataflow - Part 1
Instant insights - Dataflow can write streaming data to BigQuery, which can query across both that streaming data and historical data - but use Bigtable if you need very high throughput, e.g. IoT data
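A minimal sketch of the ingestion side - publishing an event to Cloud Pub/Sub with the google-cloud-pubsub client library; the project, topic and payload are hypothetical placeholders:

    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    # Hypothetical project and topic names.
    topic_path = publisher.topic_path("my-project", "sensor-events")

    # Pub/Sub durably buffers the message, decoupling producers from consumers.
    future = publisher.publish(topic_path, data=b'{"sensor": 7, "temp": 21.5}')
    print(future.result())  # server-assigned message ID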
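And a sketch of the processing side - an Apache Beam (Python SDK) streaming pipeline reading from that topic, windowing the unbounded stream, and writing to BigQuery. Topic and table names are hypothetical, and a real deployment would also pass Dataflow runner options:

    import apache_beam as beam
    from apache_beam import window
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (
            p
            # Hypothetical Pub/Sub topic path.
            | beam.io.ReadFromPubSub(topic="projects/my-project/topics/sensor-events")
            | beam.Map(lambda msg: {"payload": msg.decode("utf-8")})
            # Fixed one-minute windows group the unbounded stream for aggregation;
            # Beam's watermarks/triggers decide when late events are handled.
            | beam.WindowInto(window.FixedWindows(60))
            # Hypothetical BigQuery table; streaming inserts by default.
            | beam.io.WriteToBigQuery(
                "my-project:iot.readings",
                schema="payload:STRING",
            )
        )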
Scenario Analysis Methodology
1. What data is sent?
2. What processing does the data need?
3. What analytics/queries need to be run?
4. What dashboard/reporting would be most useful? Today's query is often tomorrow's dashboard!