Big Data Overview
Big Data Challenges
Migrating existing data workloads (e.g. Hadoop, Spark jobs)
Analysing large datasets at scale
Building streaming data pipelines
Applying machine learning to the data
Team Roles
Data Engineer role
Build pipelines and clean the data
Purpose
Solve pipeline design - will the same code work for both batch and streaming? Does the transformation code belong in the SDK? Are there existing solutions that fit?
Solve implementation issues - maintenance overhead, reliability, scaling, monitoring, alerting, vendor lock-in
Decision Maker role
Cost/benefit analysis to decide how deep to invest in a data-driven opportunity
Analyst role
Explore the data for insights and potential relationships that may be useful as features in a Machine Learning model
Statistician role
Add rigour so that 'data-inspired' decisions become truly 'data-driven' decisions
Applied ML Engineer role
To build production Machine Learning models from the best and latest research
Data Scientist role
Have mastery over analysis, statistics and machine learning
Analytics Manager role
To lead the team
Social Scientist role
To ensure that the project's quantitative impact is real
Ethicist role
To ensure that the impact is the right thing to do
ML Researcher role
To discover better Machine Learning models and algorithms
Team role overlap
Size of organisation often determines how much role overlap there is
Data Analyst Job
Data Engineer Job
Applied ML Engineer Job
Lead Job
Open-source ecosystem for Big Data
Hadoop - canonical open-source MapReduce framework
Pig - convenient imperative scripting language that can be compiled into Hadoop MapReduce jobs
Features
Imperative - specifies exactly how to perform each processing step, which constrains the engine's optimisation choices but suits pipeline paradigms
Hive - a data warehousing system and query language
Features
Declarative - specifies what result is wanted and lets the underlying system determine how to process the data, giving the engine freedom to optimise (the contrast is illustrated below)
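To make the contrast concrete, a minimal Python sketch - illustrative only, using the standard-library sqlite3 engine as a stand-in for Hive's declarative SQL and a plain loop as a stand-in for a Pig-style step-by-step script:

    import sqlite3

    rows = [("apples", 3), ("pears", 2), ("apples", 5)]

    # Imperative: spell out exactly how to process the data, step by step.
    totals = {}
    for product, qty in rows:
        totals[product] = totals.get(product, 0) + qty
    print(totals)  # {'apples': 8, 'pears': 2}

    # Declarative: state the result wanted; the engine plans the execution.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (product TEXT, qty INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    for product, total in conn.execute(
            "SELECT product, SUM(qty) FROM sales GROUP BY product"):
        print(product, total)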
Apache Spark - fast, interactive, general-purpose framework for SQL, streaming, machine learning, etc
Features
Batch processing
Stream processing
Automates data partitioning, replication, recovery, pipelining of processing
Can construct parallel pipelines from your submitted program's graph, based on the resources available
Concepts
Resilient Distributed Datasets (RDDs) - abstraction over data in storage - hides complexity of the location of data in the cluster and replication
Resilient (fault-tolerant) to data loss due to node failures
RDD lineage graph is used to recompute damaged or missing partitions
Your program is written as a request, recorded as a Directed Acyclic Graph (DAG), lazily evaluated and executed only when an output is required
Transformations - lazy; take an RDD as input and produce an RDD as output (e.g. map, filter); the per-element logic is often written as anonymous functions in code
Actions - eager; produce an output such as a value or a text file (e.g. collect, count, saveAsTextFile), so they trigger immediate execution of the DAG - see the first sketch after this list
Spark DataFrame - immutable and, at scale, faster than a Pandas DataFrame (which is mutable); you can convert between the two, as in the second sketch below
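A minimal PySpark sketch of the lazy/eager split, assuming a local Spark install (names are illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "lazy-demo")

    rdd = sc.parallelize(range(10))

    # Transformations are lazy: nothing runs yet, Spark only extends the DAG.
    evens = rdd.filter(lambda x: x % 2 == 0)   # anonymous function
    squares = evens.map(lambda x: x * x)

    # An action triggers execution of the whole lineage.
    print(squares.collect())  # [0, 4, 16, 36, 64]
    print(squares.count())    # 5

    sc.stop()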
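And a sketch of converting between the two DataFrame types, assuming pyspark and pandas are installed:

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("df-demo").getOrCreate()

    pdf = pd.DataFrame({"name": ["a", "b"], "score": [1, 2]})

    # Pandas -> Spark: the Spark DataFrame is immutable and distributed.
    sdf = spark.createDataFrame(pdf)
    sdf2 = sdf.withColumn("score2", sdf["score"] * 2)  # returns a new DataFrame

    # Spark -> Pandas: collects to the driver, so only do this for small results.
    print(sdf2.toPandas())

    spark.stop()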
Data Security
Security is more than setting roles when a Dataset is created
Organisations should have a written & internally published Data Access Policy - how, when and why data should be shared, and with whom
Best Practices
Users should have the minimum permissions required for their role (see the sketch after this list)
Use separate Projects or Datasets for different environments (Dev, Test, Prod)
Audit security roles assigned to users periodically
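A hedged sketch of granting minimum, per-dataset permissions with the google-cloud-bigquery client library - the project, dataset and email here are hypothetical placeholders:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical dataset ID: project "my_project", dataset "analytics_prod".
    dataset = client.get_dataset("my_project.analytics_prod")

    # Grant only the role needed: READER for an analyst, on this dataset only.
    entries = list(dataset.access_entries)
    entries.append(
        bigquery.AccessEntry(
            role="READER",
            entity_type="userByEmail",
            entity_id="analyst@example.com",
        )
    )
    dataset.access_entries = entries
    client.update_dataset(dataset, ["access_entries"])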
Terminology
Unstructured Data - data that is not structured in a way that is useful or fit for your purpose, even if it has a schema
Streaming
Streaming - processing of unbounded datasets; scales to process massive data fast, for cases that need immediate insights in order to make timely decisions: IoT sensor data, telemetry, transactions (comparing new data with historical data for fraud detection!), current user activity in online games
Challenges
Variable volumes - ingestion needs to scale, be fault-tolerant, handle spikes and provide durable messaging - loose coupling via a Message Bus such as Cloud Pub/Sub (see the sketches after this list)
Latency is to be expected - data can arrive late or out of order during ingestion or processing - Beam/Dataflow provides exactly-once processing of events, plus windowing and watermarks to handle late data
Exactly-once processing in Cloud Dataflow - Part 1
Instant insights - Dataflow can write streaming data to BigQuery, which can query across both that streaming data and historical data - but use Bigtable if you need very high throughput, e.g. IoT data
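A minimal sketch of the ingestion side - publishing an event to Cloud Pub/Sub with the google-cloud-pubsub client library; the project, topic and payload are hypothetical placeholders:

    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    # Hypothetical project and topic names.
    topic_path = publisher.topic_path("my-project", "sensor-events")

    # Pub/Sub durably buffers the message, decoupling producers from consumers.
    future = publisher.publish(topic_path, data=b'{"sensor": 7, "temp": 21.5}')
    print(future.result())  # server-assigned message ID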
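And a sketch of the processing side - an Apache Beam (Python SDK) streaming pipeline reading from that topic, windowing the unbounded stream, and writing to BigQuery. Topic and table names are hypothetical, and a real deployment would also pass Dataflow runner options:

    import apache_beam as beam
    from apache_beam import window
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (
            p
            # Hypothetical Pub/Sub topic path.
            | beam.io.ReadFromPubSub(topic="projects/my-project/topics/sensor-events")
            | beam.Map(lambda msg: {"payload": msg.decode("utf-8")})
            # Fixed one-minute windows group the unbounded stream for aggregation;
            # Beam's watermarks/triggers decide when late events are handled.
            | beam.WindowInto(window.FixedWindows(60))
            # Hypothetical BigQuery table; streaming inserts by default.
            | beam.io.WriteToBigQuery(
                "my-project:iot.readings",
                schema="payload:STRING",
            )
        )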
Scenario Analysis Methodology
1. What data is sent?
2. What processing does the data need?
3. What analytics/queries need to be run?
4. What dashboard/reporting would be most useful? Today's query is often tomorrow's dashboard!