Overview of Big Data

Below is an overview of Big Data: the challenges it addresses, typical team roles, the open-source ecosystem, data security, and streaming.

Big Data Challenges

- Migrating existing data workloads (e.g. Hadoop, Spark jobs)
- Analysing large datasets at scale
- Building streaming data pipelines
- Applying machine learning to the data

Team Roles

- Data Engineer - builds pipelines and cleans the data. Solves pipeline design (does the code work for both batch and streaming? is the transformation code in the SDK? are there existing solutions?) and implementation issues (maintenance overhead, reliability, scaling, monitoring, alerting, vendor lock-in).
- Decision Maker - performs cost/benefit analysis to decide how deeply to invest in a data-driven opportunity.
- Analyst - explores the data for insights and potential relationships that may be useful as features in a machine learning model.
- Statistician - adds rigour, turning 'data-inspired' decisions into truly 'data-driven' decisions.
- Applied ML Engineer - builds production machine learning models from the best and latest research.
- Data Scientist - has mastery over analysis, statistics and machine learning.
- Analytics Manager - leads the team.
- Social Scientist - ensures the quantitative impact is there for your project.
- Ethicist - ensures that the impact is the right thing to do.
- ML Researcher - discovers better machine learning models and algorithms.
- Team role overlap - the size of the organisation often determines how much the roles overlap; in practice a single job (Data Analyst, Data Engineering, Applied ML Engineer, Lead) may span several of the roles above.

Open-source ecosystem for Big Data

- Hadoop - the canonical open-source MapReduce framework.
- Pig - a convenient imperative scripting language that compiles into Hadoop MapReduce jobs. Being imperative, a Pig script specifies exactly how each processing step is performed, which suits pipeline paradigms but limits the flexibility of the underlying system.
- Hive - a data warehousing system and query language. Being declarative, a Hive query specifies what result is wanted and lets the underlying system determine how to process the data.
- Apache Spark - a fast, interactive, general-purpose framework for SQL, streaming, machine learning, etc.
  - Features:
    - Batch processing and stream processing.
    - Automates data partitioning, replication, recovery and the pipelining of processing.
    - Can construct parallel pipelines from your submitted graph programs based on the resources available.
  - Concepts:
    - Resilient Distributed Datasets (RDDs) - an abstraction over data in storage that hides the complexity of where data lives in the cluster and how it is replicated. RDDs are resilient (fault-tolerant) to data loss from node failures: the RDD lineage graph is used to recompute damaged or missing partitions.
    - Your program is written as a request, stored as a Directed Acyclic Graph (DAG), lazily evaluated, and executed only when an output from it is required.
    - Transformations - lazy; both input and output are RDDs; often written as anonymous functions in code.
    - Actions - eager; they produce an output such as a text file or a value returned to the driver, so they trigger immediate execution of the DAG (see the PySpark sketch after this list).
    - Spark DataFrames are immutable and faster than Pandas DataFrames, which are mutable; you can convert between the two (also sketched below).
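To make the transformation/action distinction concrete, here is a minimal PySpark sketch of lazy evaluation. The input and output paths (`events.log`, `error_codes`) are hypothetical placeholders; `textFile`, `filter`, `map`, `count` and `saveAsTextFile` are standard Spark API calls.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()
sc = spark.sparkContext

# Transformations are lazy: these lines only record lineage in the DAG;
# no data is read or processed yet.
lines = sc.textFile("events.log")               # hypothetical input path
errors = lines.filter(lambda l: "ERROR" in l)   # transformation (lazy)
codes = errors.map(lambda l: l.split()[0])      # transformation (lazy)

# Actions are eager: each one forces execution of the DAG built above.
print(codes.count())                            # action - returns a value to the driver
codes.saveAsTextFile("error_codes")             # action - writes output files
```

And a sketch of moving between a mutable Pandas DataFrame and an immutable Spark DataFrame; the column names are invented for illustration:

```python
import pandas as pd
from pyspark.sql.functions import col

pdf = pd.DataFrame({"user": ["a", "b"], "clicks": [3, 7]})  # mutable, local
sdf = spark.createDataFrame(pdf)                            # immutable, distributed

# "Modifying" a Spark DataFrame returns a new DataFrame; the original is untouched.
doubled = sdf.withColumn("clicks", col("clicks") * 2)

pdf_again = doubled.toPandas()                              # back to a local Pandas DataFrame
```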
Data Security

Security is more than setting roles when a Dataset is created. Organisations should have a written and internally published Data Access Policy: how, when and why data should be shared, and with whom.

Best Practices

- Users should have the minimum permissions required for their role.
- Use separate Projects or Datasets for different environments (Dev, Test, Prod).
- Audit the security roles assigned to users periodically (a sketch of automating this follows at the end of this section).

Terminology

- Unstructured Data - data that is not structured in a way that is useful or fit for your purpose, even if it has a schema.

Streaming

Streaming means processing unbounded datasets, scaling to process massive data fast when immediate insights are needed to make timely decisions. Examples: IoT sensor data, telemetry, transactions (including comparing new data with historical data for fraud detection!), and current user activity in online games.

Challenges

- Variable volumes: ingestion must scale, tolerate faults, absorb spikes and offer durable messaging. This calls for loose coupling through a message bus such as Cloud Pub/Sub.
- Latency is to be expected - data can arrive late or out of order during ingestion or processing. Beam/Dataflow provides exactly-once processing of events (see the Beam sketch below, and "Exactly-once processing in Cloud Dataflow - Part 1").

Instant insights: Dataflow can write data to BigQuery, which can query that streaming data together with historic data - but use Bigtable instead if you have very high throughput, e.g. IoT data.

Scenario Analysis Methodology

1. What data is sent?
2. What processing does the data need?
3. What analytics/queries need to be run?
4. What dashboard/reporting would be most useful? Today's query is often tomorrow's dashboard!
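The periodic role audit above can be partly automated. A minimal sketch using the google-cloud-bigquery client library; the project and dataset names are hypothetical:

```python
from google.cloud import bigquery

# Hypothetical project/dataset names for illustration.
client = bigquery.Client(project="my-analytics-project")
dataset = client.get_dataset("prod_sales")

# List every access entry on the dataset so it can be reviewed
# against the written Data Access Policy.
for entry in dataset.access_entries:
    print(f"{entry.role or 'n/a':10} {entry.entity_type:15} {entry.entity_id}")
```

For the streaming pipeline itself, here is a minimal Apache Beam (Python SDK) sketch of the Pub/Sub -> Dataflow -> BigQuery pattern described above. The subscription, table and schema are hypothetical; `ReadFromPubSub`, `Map` and `WriteToBigQuery` are standard Beam transforms.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# streaming=True marks this as an unbounded pipeline; running it with the
# DataflowRunner gives Dataflow's exactly-once processing of events.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/sensor-events")  # hypothetical
        | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "Write" >> beam.io.WriteToBigQuery(
            "my-project:iot.sensor_readings",                                # hypothetical
            schema="device_id:STRING,temperature:FLOAT,event_ts:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```

The loose coupling comes from the subscription: producers publish to the topic at whatever rate spikes demand, Pub/Sub buffers durably, and the pipeline consumes at its own pace.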