
Ready reference for MLS-C01: SageMaker Algorithms Compared

  • Study the following comparison of the built-in algorithms in SageMaker
  • Use it as a ready reckoner for the MLS-C01 AWS Certified Machine Learning - Specialty exam
For each algorithm below: learning type, input format, required data type (INT/FLOAT), processor and instance guidance, multi-GPU (single machine) and multi-machine support, use cases, comments, and key hyperparameters (HP).
Linear Learner (SUPERVISED)
  • Input format: RecordIO-wrapped protobuf or CSV; Float32 data only
  • Processor/instance: CPU or GPU, on any CPU or GPU instance; multiple CPUs on a single machine help, multi-GPU does not
  • Use cases: regression and classification (binary or multi-class)
  • Comments: data must be normalized, else the algorithm may not converge; multiple models are trained in parallel
  • HP: balance_multiclass_weights, learning_rate, mini_batch_size, L1, L2
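
To make these entries concrete, here is a minimal sketch of launching a built-in algorithm with the SageMaker Python SDK, using Linear Learner as the example; the role ARN, bucket paths, and hyperparameter values are hypothetical placeholders, and `l1`/`wd` are assumed to be the regularization parameter names:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()

# Resolve the built-in Linear Learner container image for the current region
container = image_uris.retrieve("linear-learner", session.boto_region_name)

linear = Estimator(
    image_uri=container,
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # hypothetical role ARN
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/linear-learner/output",   # hypothetical bucket
    sagemaker_session=session,
)
linear.set_hyperparameters(
    predictor_type="binary_classifier",  # or "regressor" / "multiclass_classifier"
    learning_rate=0.01,
    mini_batch_size=1000,
    l1=0.0,  # L1 regularization strength
    wd=0.0,  # weight decay, i.e., L2 regularization
)
linear.fit({"train": "s3://my-bucket/linear-learner/train"})  # hypothetical path
```

The same Estimator pattern applies to every algorithm below; only the container name, instance guidance, and hyperparameters change.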
XGBoost (SUPERVISED)
  • Input format: CSV or LibSVM (an open-source algorithm adapted by AWS, hence no RecordIO-protobuf)
  • Processor/instance: CPU only (M4); the algorithm is memory-bound rather than compute-bound; no multi-machine training
  • Use cases: regression and classification (binary or multi-class)
  • Comments: uses extreme gradient boosting of trees; outputs the trained model as a pickle file
  • HP:
    - subsample (lower values reduce overfitting)
    - eta (equivalent to learning rate)
    - alpha, gamma, lambda (higher values give more conservative trees)
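
A hedged sketch of the hyperparameters above on the SageMaker XGBoost container; the version string and values are illustrative, and the role ARN is again a placeholder. Note that `lambda` must be passed via dict unpacking because it is a reserved word in Python:

```python
from sagemaker import image_uris
from sagemaker.estimator import Estimator

container = image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")
xgb = Estimator(
    container,
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # hypothetical role ARN
    instance_count=1,
    instance_type="ml.m5.xlarge",
)
xgb.set_hyperparameters(
    objective="binary:logistic",
    num_round=100,
    eta=0.2,           # equivalent to learning rate
    subsample=0.8,     # row subsampling; lower values reduce overfitting
    alpha=0.0,         # L1 regularization
    gamma=0.0,         # minimum loss reduction to split; higher = more conservative
    **{"lambda": 1.0}, # L2 regularization
)
```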
Seq2Seq (SUPERVISED)
  • Input format: RecordIO-protobuf, with tokens as integers (INT)
  • Processor/instance: GPU only (P3); can use multiple GPUs on a single machine; no multi-machine training
  • Use cases: machine translation, text summarization, speech-to-text; any use case where the input is a sequence and the output is a sequence
  • Comments: start with tokenized text files, then convert to RecordIO-protobuf; along with the training and validation data files, you must provide vocabulary files (for text seq2seq); uses RNNs and CNNs internally
  • HP: batch_size, optimizer, learning_rate, num_layers_encoder, num_layers_decoder; can optimize on accuracy, BLEU score (machine translation), or perplexity
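
A sketch of the corresponding hyperparameter block, assuming a seq2seq Estimator (`seq2seq`) built with the generic pattern shown earlier; `optimized_metric` is assumed to be the hyperparameter that selects among the metrics in the last bullet:

```python
seq2seq.set_hyperparameters(
    batch_size=64,
    optimizer="adam",
    learning_rate=0.0003,
    num_layers_encoder=1,
    num_layers_decoder=1,
    optimized_metric="bleu",  # or "accuracy" / "perplexity"
)
```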
DeepAR (SUPERVISED)
  • Input format: JSON Lines (optionally GZIP-compressed) or Parquet; each record contains:
    - start: the starting timestamp
    - target: the time-series values to learn/predict
  • Processor/instance: CPU (C4) or GPU (P3); multi-GPU on a single machine and multi-machine training are both supported
  • Use cases: stock price prediction, sales and promotion effectiveness; any time-oriented, single-dimension forecasting
  • Comments: uses RNNs; can train on several related time series at once, and the more series, the better the results, as it learns relationships between the time series; start with CPU (C4.2xlarge or higher) and move to GPU only if necessary, since only large models need GPU
  • HP: context_length (number of time points back in time the model learns from), epochs, batch_size, learning_rate, num_cells
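
A minimal sketch of the JSON Lines training format DeepAR expects; `start` and `target` are the documented fields, `cat` is an optional categorical feature, and the file name and values are made up:

```python
import json

# Two related time series; DeepAR learns relationships between them.
series = [
    {"start": "2024-01-01 00:00:00", "target": [112.0, 118.5, 121.3, 119.8]},
    {"start": "2024-01-01 00:00:00", "target": [45.1, 44.0, 47.2, 49.9], "cat": [1]},
]
with open("train.jsonl", "w") as f:
    for record in series:
        f.write(json.dumps(record) + "\n")  # one JSON object per line
```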
Blazing Text - Text Classification (SUPERVISED)
  • Input format: one sentence per line, space-tokenized (punctuation tokenized too), with the label prefixed, e.g.:
    "__label__1 this is a sentence with , punctuation also tokenized . one sentence per line . label at the start"
    The augmented manifest text format is also supported
  • Processor/instance: CPU or GPU; data < 2 GB: C5; data > 2 GB: single GPU (P2, P3); no multi-machine training
  • Use cases: web search and information retrieval; predicts labels for a sentence
  • HP: epochs, learning_rate, word_ngrams, vector_dim
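
A small helper sketching how to produce the line format above; the regex tokenizer is a deliberately simple stand-in for a real one:

```python
import re

def to_blazingtext_line(label: int, sentence: str) -> str:
    # Space-tokenize words and punctuation, lowercase, and prefix the label.
    tokens = re.findall(r"\w+|[^\w\s]", sentence.lower())
    return f"__label__{label} " + " ".join(tokens)

print(to_blazingtext_line(1, "This is a sentence, with punctuation."))
# -> "__label__1 this is a sentence , with punctuation ."
```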
Blazing Text - Word2Vec (UNSUPERVISED)
  • Input format: a text file with one sentence per line
  • Processor/instance: CPU or GPU (P3); CBOW and skip-gram run on a single CPU or GPU instance; batch skip-gram can scale across multiple CPU instances
  • Use cases: preparing input for NLP use cases; vectorization of text for machine translation and sentiment analysis; semantic similarity of words
  • Comments: represents words as vectors; semantically similar words are represented by vectors close to each other (semantic: of or relating to meaning in language)
  • Multiple modes:
    - CBOW (Continuous Bag of Words): the order of words does not matter
    - Skip-gram (n-grams): the order of words matters
    - Batch skip-gram: the order of words matters; supports distributed training
  • HP: mode (mandatory), learning_rate, window_size, vector_dim, negative_samples
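
A sketch of choosing a mode via hyperparameters, assuming a BlazingText Estimator (`bt`) built with the generic pattern shown earlier; values are illustrative:

```python
bt.set_hyperparameters(
    mode="batch_skipgram",  # or "cbow" / "skipgram" for single-instance training
    vector_dim=100,
    window_size=5,
    negative_samples=5,
    learning_rate=0.05,
)
```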
Object2Vec (SUPERVISED)
  • Input format: any object tokenized into integers (INT); training data consists of pairs or sequences of tokens
  • Processor/instance: CPU or GPU (M5, P2); single machine only, no multi-machine training
  • Use cases: collaborative recommendation systems, multi-label document classification, sentence embeddings
  • Comments: learns relations or associations:
    - sentence to sentence
    - labels to sequence (genre to description)
    - product to product (recommendation)
    - user to item (recommendation)
  • Uses CNNs and RNNs; the input passes through two encoders in parallel, and a comparator learns the associations between the encoder outputs
  • Encoder types:
    - Hierarchical CNN (HCNN)
    - BiLSTM
    - pooled_embedding
  • HP: dropout, early_stopping, epochs, learning_rate, batch_size, layers, activation function, optimizer, weight_decay
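
A sketch of one JSON Lines training record for Object2Vec, where `in0` and `in1` hold the integer token sequences fed to the two parallel encoders; the token ids and label are made up:

```python
import json

# A positive pair: the comparator learns that in0 and in1 are associated.
pair = {"label": 1, "in0": [6, 17, 606], "in1": [16, 21, 13]}
with open("train.jsonl", "w") as f:
    f.write(json.dumps(pair) + "\n")
```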
Object Detection (SUPERVISED)
  • Input format: MXNet RecordIO (NOT protobuf), or images (JPEG/PNG) plus a JSON manifest (one JSON per image) containing the annotations
  • Processor/instance: GPU only for training (P2, P3); multi-GPU on a single machine and multi-machine training are both supported; CPUs can be used for inference, not for training
  • Use cases: detect objects in an image; object tracking
  • Comments: uses a CNN with the Single Shot multibox Detector (SSD); transfer learning/incremental learning supported; applies flip, rescale, and jitter internally to avoid overfitting
  • HP: standard CNN hyperparameters such as learning_rate, batch_size, optimizer, etc.
Image Classification (SUPERVISED)
  • Input format:
    - Pipe mode: Apache MXNet RecordIO (NOT protobuf), for interoperability with other DNN frameworks
    - File mode: raw JPEG/PNG plus *.lst files, which associate image index, class label, and path to the image
    - To use images directly in Pipe mode, use the JSON-based augmented manifest format
  • Processor/instance: GPU for training (P2, P3); multi-GPU on a single machine and multi-machine training are both supported; CPU can be used for inference, and if that is not sufficient, move to GPU
  • Use cases: classify images into multiple classes (dog/cat/rat/tiger, etc.)
  • Comments:
    - Full training: a ResNet CNN is used, with the network initialized with random weights
    - Transfer learning/pre-trained: a network pre-trained on ImageNet is used, initialized with pre-trained weights; only the top fully connected (FC) layer is initialized with random weights
  • HP: batch_size, learning_rate, optimizer (and optimizer-specific parameters: beta_1, beta_2, eps, gamma)
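
A sketch of generating the *.lst file used in File mode: tab-separated image index, class label, and relative image path (the file names here are hypothetical):

```python
# Each line: <image index> <TAB> <class label> <TAB> <relative path>
samples = [("cats/cat_001.jpg", 0), ("dogs/dog_001.jpg", 1)]  # hypothetical paths
with open("train.lst", "w") as f:
    for idx, (path, label) in enumerate(samples):
        f.write(f"{idx}\t{label}\t{path}\n")
```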
Semantic Segmentation (SUPERVISED)
  • Input format: raw JPEG/PNG plus annotations in File mode; add the augmented manifest format for Pipe mode
  • Processor/instance: GPU only for training (P2, P3); multi-GPU on a single machine supported; no multi-machine training; inference can use CPU or GPU
  • Use cases: self-driving cars, medical imaging and diagnostics, robot sensing; given a pixel, which object does it belong to?
  • Comments:
    - Under the hood: GluonCV on MXNet; the algorithm choices are FCN (Fully Convolutional Network), PSP (Pyramid Scene Parsing), and DeepLabV3
    - Architecture: ResNet50/ResNet101, chosen via the "backbone" hyperparameter; both are trained on ImageNet data
    - Incremental/transfer learning allowed
  • Each of the three algorithms has two distinct components:
    - The backbone (or encoder): a network that produces reliable activation maps of features
    - The decoder: a network that constructs the segmentation mask from the encoded activation maps
  • The segmentation output is represented as a grayscale image, called a segmentation mask, with the same shape as the input image
  • HP: epochs, learning_rate, batch_size, algorithm, backbone
Random Cut Forest (UNSUPERVISED)
  • Input format: RecordIO-protobuf or CSV
  • Processor/instance: CPU only (M4, C4, C5)
  • Use cases: anomaly detection; detecting unexpected spikes in time-series data; some have used it for fraud detection
  • Comments: assigns an anomaly score to each data point; uses a forest of trees, looking at the expected change in tree complexity that results from adding a point; uses random sampling; RCF is also used in Kinesis Analytics in real time
  • HP: num_trees; num_samples_per_tree (choose so that 1/num_samples_per_tree approximates the ratio of anomalous to normal records in the dataset)
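
A sketch of the RCF hyperparameter guidance above, reusing the generic Estimator pattern; `feature_dim` is required by the algorithm, and the comment on `num_samples_per_tree` encodes the rule of thumb from the HP bullet:

```python
from sagemaker import image_uris
from sagemaker.estimator import Estimator

container = image_uris.retrieve("randomcutforest", session.boto_region_name)
rcf = Estimator(
    container,
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # hypothetical role ARN
    instance_count=1,
    instance_type="ml.m4.xlarge",
)
rcf.set_hyperparameters(
    feature_dim=1,             # single-dimension time series
    num_trees=100,             # more trees smooth out noise in the anomaly score
    num_samples_per_tree=256,  # 1/256 ~ expected ratio of anomalous to normal points
)
```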

Neural Topic Model (NTM) (UNSUPERVISED)
  • Input format: RecordIO-protobuf or CSV; words must be tokenized to integers (INT), with an auxiliary channel for the vocabulary
  • Processor/instance: GPU for training (P2, P3); CPU or GPU for inference
  • Use cases: organize documents into topics; summarize documents based on topics
  • Comments: the algorithm is Neural Variational Inference; you define how many topics to group the documents into; used only on text
  • HP: num_topics, mini_batch_size, learning_rate, variation_loss (tuned at the expense of training time)
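
A sketch of tokenizing text into integer counts and writing the RecordIO-protobuf train channel; `CountVectorizer` is one convenient stand-in for the tokenization step, and `write_spmatrix_to_sparse_tensor` is the SDK helper for sparse protobuf:

```python
import io
from sklearn.feature_extraction.text import CountVectorizer
from sagemaker.amazon.common import write_spmatrix_to_sparse_tensor

docs = ["machine learning on aws", "topic models group documents"]  # toy corpus
vectorizer = CountVectorizer()                             # maps words to integer ids
counts = vectorizer.fit_transform(docs).astype("float32")  # doc x vocab count matrix

buf = io.BytesIO()
write_spmatrix_to_sparse_tensor(buf, counts)  # RecordIO-protobuf payload
buf.seek(0)
# Upload buf to S3 as the train channel; the vocabulary
# (vectorizer.get_feature_names_out()) goes to the auxiliary channel.
```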
LDA (Latent Dirichlet Allocation) (UNSUPERVISED)
  • Input format: RecordIO-protobuf (Pipe mode) or CSV; words must be tokenized to integers, with an auxiliary channel for the vocabulary
  • Processor/instance: single-instance CPU only (M4); no multi-GPU, no multi-machine training
  • Use cases: cluster customers based on purchases; harmonic analysis in music
  • Comments: classic LDA with open-source availability, not DNN-based; can process more than text, e.g., harmonic music analysis
  • HP: num_topics; alpha0 (small values give sparse topic mixtures, values > 1 give uniform topic mixtures)
kNN (k-Nearest Neighbors) (SUPERVISED)
  • Input format: RecordIO-protobuf or CSV (File or Pipe mode); the first column contains the label
  • Processor/instance: CPU or GPU
  • Use cases: classification and regression
  • Comments: SageMaker automates three steps:
    - Sample the data (the full dataset can't be used when it is huge)
    - Dimensionality reduction ("sign" or "fjlt" methods)
    - Build an index for looking up neighbors
  • HP: k, sample_size
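
A sketch of the kNN hyperparameters, including the dimensionality-reduction methods named above, assuming an Estimator (`knn`) built with the generic pattern; values are illustrative:

```python
knn.set_hyperparameters(
    k=10,
    sample_size=200000,               # number of points sampled to build the index
    predictor_type="classifier",      # or "regressor"
    feature_dim=50,
    dimension_reduction_type="fjlt",  # or "sign"
    dimension_reduction_target=25,    # reduced dimensionality
)
```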
K-Means (UNSUPERVISED)
  • Input format: RecordIO-protobuf or CSV (File or Pipe mode)
  • Processor/instance: CPU recommended (M4, M5, C4, C5); GPU possible
  • Use cases: cluster data (unsupervised); find groups of data points based on similarity
  • Comments: SageMaker implements web-scale K-Means; similarity is measured by Euclidean distance; works to optimize the centers of each of the k clusters
  • Algorithm:
    1) Determine the initial cluster centers in one of two ways: k-means++ (tries to make the initial centers far apart) or random
    2) Iterate over the data and recalculate the cluster centers
    3) Reduce from K to k clusters, using Lloyd's method or k-means++

    K is the number of extra cluster centers trained to improve accuracy, later reduced back to k: K = k * x (see the sketch below)
  • HP: k, mini_batch_size, extra_center_factor (x), init_method (k-means++ or random)
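
A sketch of the K = k * x relationship in hyperparameters, assuming an Estimator (`kmeans`) built with the generic pattern: with k=10 and extra_center_factor=4, training works with K=40 centers before reducing back to 10:

```python
kmeans.set_hyperparameters(
    feature_dim=20,
    k=10,                    # final number of clusters
    extra_center_factor=4,   # x: train with K = k * x = 40 centers, then reduce to k
    init_method="kmeans++",  # or "random"
    mini_batch_size=500,
)
```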
PCA (Principal Component Analysis) (UNSUPERVISED)
  • Input format: RecordIO-protobuf or CSV (File or Pipe mode)
  • Processor/instance: CPU or GPU
  • Use cases: dimensionality reduction; mitigates the curse of dimensionality
  • Comments: the reduced dimensions are called components; the first component captures the largest possible variability, the second the next largest, and so on; uses Singular Value Decomposition (SVD)
  • Two modes:
    - Regular: sparse data, moderate number of features and rows
    - Randomized: dense data, large number of rows and features; uses approximation algorithms
  • HP: algorithm_mode (regular or randomized); subtract_mean (unbiases the data)
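
A sketch of the two PCA modes expressed as hyperparameters, assuming an Estimator (`pca`) built with the generic pattern; values are illustrative:

```python
pca.set_hyperparameters(
    feature_dim=784,
    num_components=50,            # number of reduced dimensions to keep
    algorithm_mode="randomized",  # dense/large data; use "regular" for sparse data
    subtract_mean=True,           # unbias the data before decomposition
)
```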
Factorization Machines (SUPERVISED)
  • Input format: RecordIO-protobuf with Float32 data; CSV is not practical for sparse data, hence not supported
  • Processor/instance: CPU recommended; GPU not recommended, as the data is sparse and GPUs work better on dense data
  • Use cases: regression, classification, and recommendation, all in one general-purpose algorithm for sparse data; click prediction; item recommendation
  • Comments: limited to pairwise (2nd-order) interactions, e.g., user-to-item interactions
  • HP: initialization methods for the bias, factor, and linear terms:
    - methods: uniform, normal, or constant
    - the properties of each method can be tuned
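
Because CSV is not supported, sparse input has to be written as RecordIO-protobuf; a sketch using the SDK helper `write_spmatrix_to_sparse_tensor` on a toy interaction matrix (the data is made up):

```python
import io
import numpy as np
import scipy.sparse as sp
from sagemaker.amazon.common import write_spmatrix_to_sparse_tensor

# Toy sparse one-hot (user, item) feature rows with click labels
X = sp.csr_matrix(np.eye(4, dtype="float32"))  # sparse feature matrix
y = np.array([1, 0, 1, 0], dtype="float32")    # click / no-click labels

buf = io.BytesIO()
write_spmatrix_to_sparse_tensor(buf, X, labels=y)
buf.seek(0)
# Upload buf to S3 as the train channel (content type application/x-recordio-protobuf)
```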
IP Insights (UNSUPERVISED)
  • Input format: CSV only for training; JSON Lines, CSV, or JSON for inference; only IPv4 is supported
  • Processor/instance: GPU recommended (multi-GPU supported); CPU possible
  • Use cases: identify suspicious IP addresses in a security context: logins from anomalous IPs, accounts creating resources from anomalous IPs
  • Comments: uses a neural network to learn latent vector representations of entities and IP addresses; entities are hashed and embedded (use a large hash size); automatically generates negative samples by randomly pairing entities and IPs, since real data is highly imbalanced
  • HP:
    - num_entity_vectors (hash size; set to twice the number of unique entity identifiers)
    - vector_dim (size of the embedding vectors; scales the model size)
    - others: epochs, batch_size, learning_rate, etc.
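
A sketch of the headerless two-column CSV (entity, IP address) that IP Insights trains on; the entities and addresses are made up:

```python
import csv

rows = [
    ("user_alice", "192.0.2.10"),
    ("user_bob", "198.51.100.7"),
    ("user_alice", "192.0.2.11"),
]
with open("train.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)  # no header: entity_id,ipv4_address per line
```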
Reinforcement Learning (REINFORCEMENT LEARNING)
  • Input format: nothing specific to SageMaker
  • Processor/instance: GPU; multi-GPU on a single machine and multi-instance GPU training are recommended
  • Use cases: games, supply chain management, HVAC systems, industrial robotics, dialog systems, autonomous vehicles
  • Comments: supports Intel Coach and Ray RLlib on TensorFlow and MXNet; custom, commercial, and open-source environments are supported: MATLAB Simulink, EnergyPlus, Roboschool, PyBullet, Amazon Sumerian, AWS RoboMaker
  • HP: depend on the framework and algorithm used; nothing is tied to SageMaker

Algorithms categorized by feature

  • Here a feature is one of:
    • The required input data type
    • Whether the algorithm trains on CPU or GPU
    • Whether the algorithm can be trained incrementally
      • Incremental training lets you resume training, or retrain a DNN from a pre-trained model by replacing the final FC layer, i.e., transfer learning (a sketch follows this list)
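
A sketch of incremental training as described above: the artifacts of a previous training job are passed as an extra model channel (the S3 paths are hypothetical; the content type is the one AWS documents for model artifacts):

```python
from sagemaker.inputs import TrainingInput

model_channel = TrainingInput(
    "s3://my-bucket/prev-job/output/model.tar.gz",  # hypothetical prior artifacts
    content_type="application/x-sagemaker-model",
)
estimator.fit({
    "train": "s3://my-bucket/train",
    "validation": "s3://my-bucket/validation",
    "model": model_channel,  # resume / fine-tune from the previous model
})
```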
  • Mandatory FLOAT32: Linear Learner, Factorization Machines
  • Mandatory INT32: Seq2Seq, Object2Vec, NTM
  • CPU only: XGBoost, RCF, LDA
  • GPU only: Seq2Seq, Image Classification, Semantic Segmentation, Object Detection, NTM, Reinforcement Learning
  • Incremental training available: Image Classification, Semantic Segmentation, Object Detection

Algorithms that support distributed training

  • Mnemonic to remember: F-SKILBDR, as in SkillBuilder (if you can come up with something better, please share it in the comments!)
  • An entirely new blog post explaining distributed training and how it works is coming soon!
Distributed Training Support
Factorization Machines
Seq2Seq
K-Means
IP Insights
Linear Learner (Not LDA)
Blazing Text - Word2Vec
DeepAR
RCF - Random Cut Forests

Leave a comment to show your appreciation for the article or to request a change. Thanks!