Machine Learning
Artificial Intelligence
Statistical tools for learning from data in order to derive predictive insights. Pattern recognition from examples.
Terminology
Label - a correct output for an input - a fact or true answer
Input - a variable used to predict the label
Example - a set of inputs and a corresponding label
Model - a mathematical function that takes input variables and tries to approximate the label
Training - adjusting the model's weights to minimise the error between the predicted label and the actual label; includes gradient descent and periodic evaluation
Prediction - the output of the Model on unlabelled data
Hyper-parameter tuning - adjusting the settings chosen before training (e.g. learning rate, batch size, number of layers) rather than learned from the data
Back-propagation - computing the gradient of the error with respect to each weight by working backwards through the network, layer by layer
Epoch - one traversal through the entire training set
Gradient Descent - optimisation - iteratively adjusting the weights in the direction that reduces the error
Batch size - the number of examples the error is computed on before each weight update
Weights - the parameters of a function that are optimised
Evaluation - periodically determining whether the model is good enough, based on a set of metrics
Softmax - turns raw scores into probabilities over multiple classes - all values are normalised to sum to one (see the sketch after this list)
Over-fitting - the model fits the training data too closely and does not generalise well to unseen examples
Under-fitting - the model is too simple to capture the pattern, so it is inaccurate even on the training data
Feature Engineering - using insights to calculate or engineer extra features/inputs
Neuron - one unit that combines weighted inputs
Activation Function - a non-linear function (e.g. sigmoid, ReLU) applied to a neuron's combined inputs
Hidden Layer - a set of neurons that operate on the same set of inputs
Features - transformations of inputs, typically using an Activation Function
Ground Truth - the actual, observed value that the label records
Error = ground truth value - prediction value
Root Mean Squared Error (RMSE) - for Regression - square the errors (so they become positive), take the mean, then take the square root
Cross-Entropy - a differentiable error value for Classification - the log loss
Confusion Matrix - for evaluation of a model - True Positives (TP), False Positives (FP), False Negatives (FN), True Negatives (TN)
Accuracy - intuitive measure of skill for classifiers - the fraction of predictions that is correct (misleading if the dataset is unbalanced)
Precision - use when what you are trying to find is common; the accuracy of the classifier's positive predictions; positive predictive value = TP / (TP + FP) (good if the dataset is unbalanced)
Recall - use when what you are trying to find is rare; the accuracy on examples whose truth is positive; true positive rate = TP / (TP + FN) (good if the dataset is unbalanced)
Training Dataset - the examples used to optimise the weights
Validation Dataset - held-out examples used during training for evaluation and hyper-parameter tuning
Test Dataset - held-out examples used only for the final measure of model quality
Cross-validation - if data is too scarce to spare a separate Test Dataset, average results over different splits of the training and validation datasets
Dense features - continuous numbers; Neural Networks are good for these
Sparse/Wide features - independent, discrete, categorical values and feature-cross pairs; Linear models are good for these
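A minimal NumPy sketch of the metric definitions above (Softmax, Cross-Entropy, RMSE, Precision and Recall); the function names are illustrative, not from any particular library:

    import numpy as np

    def softmax(logits):
        # Subtract the max for numerical stability; output is non-negative and sums to 1.
        exps = np.exp(logits - np.max(logits))
        return exps / exps.sum()

    def cross_entropy(probs, true_class):
        # Log loss: the negative log-probability the model assigned to the true class.
        return -np.log(probs[true_class])

    def rmse(truth, predictions):
        # Square the errors (making them positive), take the mean, then the square root.
        return np.sqrt(np.mean((truth - predictions) ** 2))

    def precision_recall(tp, fp, fn):
        # Precision = TP / (TP + FP); Recall (true positive rate) = TP / (TP + FN).
        return tp / (tp + fp), tp / (tp + fn)

    probs = softmax(np.array([2.0, 1.0, 0.1]))                # ~[0.66, 0.24, 0.10]
    print(cross_entropy(probs, true_class=0))                 # ~0.42 - low loss, class 0 favoured
    print(rmse(np.array([3.0, 5.0]), np.array([2.5, 5.5])))   # 0.5
    print(precision_recall(tp=8, fp=2, fn=4))                 # (0.8, ~0.67)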
ML Steps
1. Explore the data
2. Split data into train/validation/test datasets (see the sketch after this list)
3. Establish a benchmark for the performance to beat
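A minimal sketch of step 2 in pure NumPy; the 70/15/15 ratios and the seed are arbitrary, illustrative choices:

    import numpy as np

    def split(X, y, train=0.7, val=0.15, seed=42):
        # Shuffle the indices once, then carve out three non-overlapping slices.
        idx = np.random.default_rng(seed).permutation(len(X))
        n_train, n_val = int(train * len(X)), int(val * len(X))
        i_train = idx[:n_train]
        i_val = idx[n_train:n_train + n_val]
        i_test = idx[n_train + n_val:]          # whatever remains is the test set
        return (X[i_train], y[i_train]), (X[i_val], y[i_val]), (X[i_test], y[i_test])

    X, y = np.arange(100).reshape(100, 1), np.arange(100)
    (train_X, train_y), (val_X, val_y), (test_X, test_y) = split(X, y)
    print(len(train_X), len(val_X), len(test_X))  # 70 15 15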
Classes of Machine Learning
Supervised Learning
Learning from past examples to predict future values
Model Types
Regression
Label has a continuous real value
Classification
Label has a discrete set of values or classes
Datasets
What makes a good Dataset?
Positive examples
Negative examples
Negative examples that are near misses
Exhaustive coverage of examples
Examples of outliers - so they can be learned and handled gracefully
Neural Networks
Single neuron linear function - fires when w1*x1 + w2*x2 > bias
Optimisation - Gradient Descent - iteratively reducing the error between the output and the label (see the sketch below)
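A minimal sketch of both ideas - a single sigmoid neuron trained by gradient descent on a toy AND problem; pure NumPy, with an arbitrary learning rate and epoch count:

    import numpy as np

    # Toy data: the label is 1 only when both inputs are 1 (linearly separable).
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([0, 0, 0, 1], dtype=float)

    w = np.random.default_rng(0).normal(size=2)  # weights, randomly initialised
    b = 0.0                                      # bias

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    for epoch in range(5000):             # one epoch = one pass over the training set
        z = X @ w + b                     # weighted sum: w1*x1 + w2*x2 + b
        p = sigmoid(z)                    # activation function squashes it into (0, 1)
        grad = p - y                      # gradient of the cross-entropy loss w.r.t. z
        w -= 0.5 * (X.T @ grad) / len(X)  # gradient descent step on the weights...
        b -= 0.5 * grad.mean()            # ...and on the bias

    print((sigmoid(X @ w + b) > 0.5).astype(int))  # [0 0 0 1] once trained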
Unsupervised Learning
Using unlabelled data to discover relationships within the data
Clustering - grouping similar examples together without labels (see the sketch below)
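A minimal k-means sketch in pure NumPy as one concrete clustering example; k and the iteration count are arbitrary choices:

    import numpy as np

    def kmeans(X, k=2, iters=10, seed=0):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]  # start from k random points
        for _ in range(iters):
            # Assign every point to its nearest centroid...
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # ...then move each centroid to the mean of its assigned points.
            centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        return labels, centroids

    # Two obvious blobs of unlabelled points:
    X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])
    labels, _ = kmeans(X)
    print(labels)  # e.g. [0 0 0 1 1 1] - the blobs are found without any labels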
Semi-supervised Learning
Applications of Machine Learning
Natural Language Processing (NLP)
The processing of any natural language in order to understand both its grammatical syntax and semantics
Computer Vision
Methods for acquiring, processing, analysing and reasoning about images or video sequences in order to extract meaningful information that can be interpreted and acted upon
Robotics
Deep Learning
Deep Neural Networks
Code Libraries
Python Libraries
PyTorch
FastAI
TensorFlow
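As a taster for these libraries, a minimal PyTorch sketch that fits a one-weight linear model with gradient descent; the data and the hyper-parameter values are illustrative:

    import torch

    # Toy regression data: y = 2x + 1 plus a little noise.
    X = torch.linspace(0, 1, 50).unsqueeze(1)
    y = 2 * X + 1 + 0.05 * torch.randn(X.shape)

    model = torch.nn.Linear(1, 1)                            # one weight and one bias
    optimiser = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(500):           # one epoch = one pass over this (tiny) training set
        optimiser.zero_grad()          # clear the gradients from the previous step
        loss = loss_fn(model(X), y)    # error between predictions and labels
        loss.backward()                # back-propagation computes the gradients
        optimiser.step()               # gradient descent updates the weights

    print(model.weight.item(), model.bias.item())  # approximately 2 and 1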
TODO
Classical Machine Learning
TODO
AI Adoption Strategy
Preference 1 - Use pre-built AI services/models
Preference 2 - Customise pre-built AI services/models
Preference 3 - Create new models => rule of thumb: only when you have > 100k high-quality examples
Workflow Options
Kubeflow Pipelines
TODO
Feature Engineering
Tips
Have a reasonable hypothesis for why a specific feature may be relevant for the problem, otherwise discard it
The feature value must be known at the time a prediction is needed - don't train on values that were only determined later (a data leak) - be careful when training on data from a Data Warehouse!
Ensure the feature data is legal and ethical to use
Feature values need to be numerical WITH a meaningful magnitude; or at least representable in a numeric form with a vector representation...
Must have enough examples of each feature input value - e.g. at least 5 examples or samples; for real values you may need to group/bin them together
Discard values that are too specific - like a transaction id
One-hot encode categorical values - a vector/list representing each input category; only one item in the list has a value of 1 and the others are zero (see the sketch after this list)
Build a vocabulary of keys during training pre-processing, mapping each categorical value to an index
Don't mix magic numbers representing missing data (e.g. null or -1) with real data - instead use 2 values: one for whether the value was provided and one for the actual value (or zero if it was not provided)
Use feature crosses - e.g. using intuition like a yellow car in New York is likely a taxi, cross the colour and city features so that a yellow car in another city is not misrepresented because of the training data from New York.
e.g. Bucketise Latitude/Longitude into 0.1 degrees and do a feature cross - essentially same as putting lat/long points onto grid cells
Use a wide and deep network if you have both dense and sparse features
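A minimal sketch of several of these tips in plain Python/NumPy - a vocabulary, one-hot encoding, a missing-value flag, and a bucketised lat/long feature cross; the colour vocabulary and the example values are illustrative:

    import math
    import numpy as np

    # Vocabulary built during training pre-processing: categorical key -> index.
    vocab = {key: i for i, key in enumerate(["yellow", "black", "white"])}

    def one_hot(key):
        # Exactly one item is 1 and the others are zero.
        vec = np.zeros(len(vocab))
        vec[vocab[key]] = 1.0
        return vec

    def with_missing_flag(value):
        # Two features instead of a magic number: (was it provided?, value or zero).
        return (0.0, 0.0) if value is None else (1.0, float(value))

    def lat_long_cross(lat, lon, bucket=0.1):
        # Bucketise each coordinate into 0.1-degree bins and cross them:
        # the pair behaves like a grid-cell id.
        return (math.floor(lat / bucket), math.floor(lon / bucket))

    print(one_hot("yellow"))              # [1. 0. 0.]
    print(with_missing_flag(None))        # (0.0, 0.0) - missing, without a magic -1
    print(with_missing_flag(37))          # (1.0, 37.0)
    print(lat_long_cross(40.75, -73.99))  # (407, -740) - one grid cell in Manhattan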