TensorFlow
TensorFlow Playground
High-Level Summary
Framework
Machine Learning
Library
Computational Graphs
Declarative Programming
Abstraction for describing computations as a directed graph
Edges
Tensors
Multidimensional arrays / matrices
Nodes
Operations
Terminology
Why?
Dependency-Driven Scheduling
Dependencies specify the order of execution
Parallel processing on distributed cores
Runtime
Executing graphs on a variety of hardware
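A minimal sketch of the graph model (TF 1.x style): building the graph only declares nodes and edges; a Session then schedules and runs them by dependency.
    import tensorflow as tf

    # Nodes are operations; the edges between them carry tensors
    a = tf.constant(3.0)   # source node emitting a constant tensor
    b = tf.constant(5.0)
    c = tf.add(a, b)       # runs only after its dependencies a and b

    # Declarative: nothing has executed yet
    with tf.Session() as sess:
        print(sess.run(c))  # the runtime executes the graph -> 8.0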
Coding Steps
1. Set up a collection of feature columns
2. Create a model, passing in the feature columns
Model Types
Linear Regression - tf.estimator.LinearRegressor(feature_columns)
Deep Neural Network - tf.estimator.DNNRegressor(hidden_units=[128, 64, 32], feature_columns=feature_columns)
Linear Classification - tf.estimator.LinearClassifier(feature_columns)
Deep Neural Network Classification - tf.estimator.DNNClassifier(hidden_units=hidden_units, feature_columns=feature_columns)
3. Write an input function / generator that returns a (features, labels) tuple, with features being a dict of {column name: tensor}
Input Functions
Pandas Input Function - tf.estimator.inputs.pandas_input_fn(x, y, batch_size, num_epochs, shuffle, queue_capacity, num_threads)
4. Train, passing in input function and number of steps
5. Use the trained model to predict - see the end-to-end sketch below
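A minimal end-to-end sketch of steps 1-5 (the CSV file and the column names sq_footage / price are hypothetical):
    import pandas as pd
    import tensorflow as tf

    df = pd.read_csv('train.csv')  # hypothetical training data

    # 1. Feature columns
    featcols = [tf.feature_column.numeric_column('sq_footage')]

    # 2. Model
    model = tf.estimator.LinearRegressor(featcols)

    # 3. Input function returning (features_dict, labels)
    train_input_fn = tf.estimator.inputs.pandas_input_fn(
        x=df[['sq_footage']], y=df['price'],
        batch_size=128, num_epochs=10, shuffle=True)

    # 4. Train for a fixed number of steps
    model.train(train_input_fn, steps=1000)

    # 5. Predict (returns a generator of predictions)
    predict_input_fn = tf.estimator.inputs.pandas_input_fn(
        x=df[['sq_footage']], shuffle=False)
    predictions = model.predict(predict_input_fn)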
Function Reference
tf
decode_csv(value_column, record_defaults=[[1], [2]]) - record_defaults sets each column's dtype and its default value for empty fields
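A hedged sketch of using it inside a CSV-decoding helper (the feature name is hypothetical):
    import tensorflow as tf

    def my_decode_csv(value_column):
        # two integer columns, defaulting to 1 and 2 when a field is empty
        col_a, col_b = tf.decode_csv(value_column, record_defaults=[[1], [2]])
        features = {'feature_a': col_a}  # hypothetical feature name
        label = col_b
        return features, label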
tf.feature_column
Can be thought of as doing some types of pre-processing
numeric_column('colname')
categorical_column_with_vocabulary_list('blahId', vocabulary_list=['123', '434']) - for one-hot encoding categorical values
categorical_column_with_identity('blahId', num_buckets=10) - when the values are already integers in [0, num_buckets); e.g. hour of day (0 to 23) with num_buckets=24
categorical_column_with_hash_bucket('blahId', hash_bucket_size=500) - don't need a full vocab known ahead of time; almost one-hot encoding based on the hash of the value
crossed_column([dayofweek, hourofday], 24*7) - to create a day_hour feature cross
bucketized_column(source_column, boundaries) - possibly use np.linspace() to create the boundaries
Legacy tf.contrib.layers equivalents
real_valued_column('colname')
sparse_column_with_keys('dayofweek', keys=['Sun',...'Sat'])
sparse_column_with_integerized_feature('hourofday', bucket_size=24)
embedding_column(mycrosspair, 10)
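A hedged sketch combining several of the column types above (feature names and boundary values are hypothetical):
    import numpy as np
    import tensorflow as tf

    hourofday = tf.feature_column.categorical_column_with_identity(
        'hourofday', num_buckets=24)
    dayofweek = tf.feature_column.categorical_column_with_vocabulary_list(
        'dayofweek',
        vocabulary_list=['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat'])

    # day_hour feature cross, hashed into 24*7 buckets
    day_hour = tf.feature_column.crossed_column(
        [dayofweek, hourofday], hash_bucket_size=24 * 7)
    day_hour_embed = tf.feature_column.embedding_column(day_hour, 10)

    # bucketize a continuous column, using np.linspace for the boundaries
    lat = tf.feature_column.numeric_column('pickup_lat')
    lat_buckets = tf.feature_column.bucketized_column(
        lat, boundaries=np.linspace(38.0, 42.0, 10).tolist())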
tf.transform
Lets users define pre-processing pipelines and run them with large-scale data processing frameworks (e.g. Apache Beam), while also exporting the pipeline so it can run inside the TensorFlow graph at prediction time
YouTube - Introduction to TensorFlow Transform
Blog - Pre-processing for Machine Learning with tf.Transform
Concepts
Analyzers - e.g. mean, stddev, quantiles - implemented as Apache Beam data pipelines; analyzers run once over the full dataset and inject their results as constants into the TensorFlow graph
Scaling functions
tft.scale_to_z_score; tft.scale_to_0_1 scales to between 0 and 1
Bucketisation
tft.quantiles
tft.apply_buckets
Bag of Words / N-Grams
tf.string_split (a core TF op, usable inside the preprocessing function)
tft.ngrams
tft.string_to_int
Feature Crosses
tf.string_join (a core TF op, usable inside the preprocessing function)
tft.string_to_int
tft.apply_saved_model - inline any other saved TensorFlow model
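A hedged preprocessing_fn sketch (the fare and dayofweek feature names are hypothetical):
    import tensorflow_transform as tft

    def preprocessing_fn(inputs):
        # the mean/stddev and vocabulary analyzers run as a Beam pipeline
        # over the full dataset; their results become constants in the graph
        return {
            'fare_scaled': tft.scale_to_z_score(inputs['fare']),
            'dayofweek_id': tft.string_to_int(inputs['dayofweek']),
        }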
tf.logging
set_verbosity(tf.logging.INFO) - levels: DEBUG, INFO, WARN (default), ERROR, FATAL
tf.gfile
file_list = Glob(filename_pattern)
tf.data
dataset = TextLineDataset(file_list).map(my_decode_csv)
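A hedged read_dataset sketch combining tf.gfile.Glob, TextLineDataset, and the my_decode_csv helper sketched above:
    import tensorflow as tf

    def read_dataset(filename_pattern, mode, batch_size=512):
        def _input_fn():
            file_list = tf.gfile.Glob(filename_pattern)  # expands sharded files
            dataset = tf.data.TextLineDataset(file_list).map(my_decode_csv)
            if mode == tf.estimator.ModeKeys.TRAIN:
                dataset = dataset.shuffle(10000).repeat()  # loop indefinitely
            dataset = dataset.batch(batch_size)
            # returns the features and labels nodes of the graph
            return dataset.make_one_shot_iterator().get_next()
        return _input_fn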
tf.estimator
train_and_evaluate(estimator, train_spec, eval_spec) - handles fault-tolerant, distributed training and evaluation
Features
Distributes the graph
Shares variables
Evaluates periodically
Handles machine failures
Creates checkpoint files
Recovers from failures - workers and the chief
Saves summaries for TensorBoard
tf.estimator.TrainSpec
Contains the input function
Contains max_steps - the total number of training steps (steps rather than epochs, so a job restarted from a checkpoint after a crash resumes the count instead of starting over)
tf.estimator.EvalSpec
Contains the input function
Contains steps = None - evaluate until the input function signals end of input (i.e. over the full eval set)
Contains start_delay_secs=60 for starting evaluation after N seconds
Contains throttle_secs=600 for evaluating every N seconds
Contains exporters - e.g. tf.estimator.LatestExporter, which exports SavedModels for serving (checkpointing is handled automatically)
tf.estimator.ModeKeys.TRAIN and ModeKeys.EVAL - indicate whether an input function / graph is being built for training or evaluation
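A hedged train_and_evaluate sketch (featcols, the input functions, and serving_input_fn are assumed to be defined elsewhere):
    import tensorflow as tf

    estimator = tf.estimator.LinearRegressor(featcols, model_dir='outdir')

    train_spec = tf.estimator.TrainSpec(
        input_fn=train_input_fn, max_steps=5000)

    exporter = tf.estimator.LatestExporter('exporter', serving_input_fn)
    eval_spec = tf.estimator.EvalSpec(
        input_fn=eval_input_fn,
        steps=None,             # evaluate on the full eval set
        start_delay_secs=60,    # wait a minute before the first evaluation
        throttle_secs=600,      # then evaluate at most every 10 minutes
        exporters=exporter)

    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)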
Wide-and-deep DNN - tf.estimator.DNNLinearCombinedClassifier(model_dir=..., linear_feature_columns=sparse_wide_columns, dnn_feature_columns=dense_deep_columns, dnn_hidden_units=[100, 50])
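A hedged sketch of the wide-and-deep split, reusing the hypothetical column variables from the feature-column sketch above (sparse/crossed columns feed the linear side, dense/embedded columns feed the DNN side):
    model = tf.estimator.DNNLinearCombinedClassifier(
        model_dir='outdir',
        linear_feature_columns=[day_hour],           # wide: sparse / crossed
        dnn_feature_columns=[day_hour_embed, lat],   # deep: dense / embedded
        dnn_hidden_units=[100, 50])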
Tips
Start each training run fresh by deleting the output directory - shutil.rmtree(OUTDIR, ignore_errors=True)
Read sharded CSV files - dataset = tf.data.TextLineDataset(filenames).map(decode_csv_to_features_label)
Return the features and labels nodes of the graph - call dataset.make_one_shot_iterator().get_next()
Use TensorBoard to monitor training