Formulae Cheat Sheet to Prepare for Machine Learning Specialty
Remember how to write Confusion Matrix
- Know how to write the confusion matrix when `Actual` and `Predicted` are swapped
- Write down both versions of the confusion matrix on the rough sheet provided as soon as you start the exam
Basic Formulae for Classification
Precision, Recall and Specificity
$$ Precision = \text{Positive Predictive Value (PPV)} = \frac{TP}{TP+FP} $$ $$ Recall = Sensitivity = \text{True Positive Rate (TPR)} = \frac{TP}{TP+FN} $$ $$ Specificity = \text{True Negative Rate (TNR)} = \frac{TN}{TN+FP} $$
Sudo Exam Tip:
- How to remember the above formulae?
- Once you have understood what they mean, the only way to quickly reproduce them is some sort of recall :) from your memory. Here is the trick I use:
  - Precision: every component of the Precision formula ends with P, i.e. Precision = TP/(TP+FP)
  - Recall: once you know `Precision`, for `Recall` just replace FP -> FN, that's all!
  - Specificity: once you know `Precision`, for `Specificity` just replace TP -> TN, that's all!
F1-Score
$$ F1 = \underbrace{\frac{2 * TP}{2*TP + FP + FN}}_{\text{In terms of TP, FP and FN}} $$
$$ F1 = \underbrace{\frac{2 * Precision * Recall}{Precision + Recall}}_{\text{In terms of Precision and Recall}} $$
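The formulae above can be sketched as a small helper. The confusion-matrix counts below (TP=80, FP=20, FN=10, TN=90) are made-up example values, not from the exam:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the basic classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)        # every term ends with P
    recall = tp / (tp + fn)           # Precision with FP -> FN
    specificity = tn / (tn + fp)      # Precision with TP -> TN
    f1 = 2 * tp / (2 * tp + fp + fn)  # equal to 2*P*R/(P+R)
    return precision, recall, specificity, f1

# Hypothetical confusion matrix
p, r, s, f1 = classification_metrics(80, 20, 10, 90)
print(p, round(r, 3), round(s, 3), round(f1, 3))
```

Both F1 forms agree: with these counts, `2*tp/(2*tp+fp+fn)` and the harmonic mean of precision and recall give the same value.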
TF-IDF: Term Frequency - Inverse Document Frequency
- In information retrieval, `tf–idf` (also written `TF*IDF` or `TFIDF`), short for term frequency–inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus
- `tf–idf` is one of the most popular term-weighting schemes today
- Read more on [Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
- Term frequency (tf) of a `term` is calculated over all `n` documents. If `tf` is to be calculated for many terms, the process is repeated on all `n` documents
$$ \begin{aligned} tf(t, d_{n}) \cr &= {\text{Term frequency of term } t \text{ in } document_{n} } \cr &= \frac{\text{Number of times }t\text{ occurs in } d_{n}}{\text{Number of words in }d_{n}} \cr \end{aligned} $$
- Document frequency (df) of a `term` is calculated only once, over all `n` documents
$$ \begin{aligned} df(t) \cr &= {\text{Document frequency of term } t \text{ in all documents} } \cr &= \frac{\text{Number of documents with term }t}{\text{Total number of documents i.e. }n} \end{aligned} $$
- Inverse document frequency `idf` is simply the log of the inverse of `df`
$$ idf(t) = log(\frac{1}{df(t)}) $$
- Finally, `tf-idf` is calculated for every term using `tf` and `idf`
$$ \begin{aligned} &tf(t, d_{n}) * idf(t) \cr\cr &{\text{Which translates to every term as }} \cr &tf(t, d_{1}) * idf(t) \cr &tf(t, d_{2}) * idf(t) \cr &tf(t, d_{3}) * idf(t) \cr &tf(t, d_{4}) * idf(t) \cr &{\text{so on … till …}} \cr &tf(t, d_{n}) * idf(t) \cr \end{aligned} $$
TF-IDF Exercise
- TF-IDF tells us the significance of a term in a document
- Let's consider a few documents:
  - Document 1 - `d1`: Sudo Code blogs are a very good resource to prepare for machine learning specialty exam. The learning experience is very good. Machine learning specialty preparation is made easy
  - Document 2 - `d2`: Sudo Code blogs are a very good resource to prepare for machine learning specialty exam. The blogs are very informative and to the point. The blogs take a new approach
  - Document 3 - `d3`: Sudo Code blogs are very helpful for MLS-C01 exam
- The question we ask is: how significant is the term `learning` in each document?
- The answer is to calculate TF-IDF
Sample Question and Solution
$$ \begin{aligned} tf(learning,d_{1}) \cr &= \frac {\text{No. of times term learning occurs in } d_{1}} {\text{No. of words in } d_{1} } \cr &= \frac{3}{28} = 0.11 \cr \cr tf(learning,d_{2}) \cr &= \frac {\text{No. of times term learning occurs in } d_{2}} {\text{No. of words in } d_{2} } \cr &= \frac{1}{30} = 0.03 \cr \cr tf(learning,d_{3}) \cr &= \frac {\text{No. of times term learning occurs in } d_{3}} {\text{No. of words in } d_{3} } \cr &= \frac{0}{9} = 0 \cr \cr df(learning) \cr &= \frac {\text{No. of documents with term learning}} {\text{No. of documents }} \cr &= \frac{2}{3} \cr \cr idf(learning) \cr &= log(\frac{1}{df(learning)}) \cr &= log(\frac{3}{2}) \cr &= log(1.5) = 0.176 \cr \cr tfidf(learning, d_{1}) \cr &= tf(learning, d_{1}) * idf(learning) \cr &= 0.11 * 0.176 = 0.01936 \cr \cr tfidf(learning, d_{2}) \cr &= tf(learning, d_{2}) * idf(learning) \cr &= 0.03 * 0.176 = 0.00528 \cr \cr tfidf(learning, d_{3}) \cr &= tf(learning, d_{3}) * idf(learning) \cr &= 0 * 0.176 = 0 \cr \end{aligned} $$
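The worked example above can be reproduced with a short sketch (base-10 log, whitespace tokenization, punctuation dropped from the document strings). The final scores differ slightly from the worked solution because `tf` and `idf` are not rounded before multiplying:

```python
import math

# The three documents from the exercise, punctuation removed
docs = {
    "d1": "Sudo Code blogs are a very good resource to prepare for machine "
          "learning specialty exam The learning experience is very good "
          "Machine learning specialty preparation is made easy",
    "d2": "Sudo Code blogs are a very good resource to prepare for machine "
          "learning specialty exam The blogs are very informative and to "
          "the point The blogs take a new approach",
    "d3": "Sudo Code blogs are very helpful for MLS-C01 exam",
}

def tf(term, doc):
    # Fraction of the document's words that are `term`
    words = doc.lower().split()
    return words.count(term) / len(words)

def idf(term, docs):
    # log10 of the inverse document frequency
    df = sum(term in d.lower().split() for d in docs.values()) / len(docs)
    return math.log10(1 / df)

for name, doc in docs.items():
    score = tf("learning", doc) * idf("learning", docs)
    print(name, round(score, 5))  # d1 ≈ 0.0189, d2 ≈ 0.0059, d3 = 0.0
```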
Interpretation of TF-IDF Values
- tf-idf of the term `learning` is largest for `document 1`, hence the term is most significant in `document 1`, with weight 0.01936
- The next is `document 2`, with weight 0.00528
- `document 3` has no significance, with weight 0
Kinesis Shards Calculation
- The number of shards required for a `Kinesis` stream is a precise calculation based on:
  - Record size
  - Write bandwidth (into the Kinesis stream)
  - Read bandwidth (out of the Kinesis stream)
- The number of shards, `shards`, is calculated as
$$ \begin{aligned} shards \cr &= max(\frac{\text{Write bandwidth in KB}}{1000},\frac{\text{Read bandwidth in KB}}{2000}) \cr \cr Where,\cr &\text{Write Bandwidth in KB} = \text{Average Record Size in KB} * \text{Records Per Second} \cr &\text{Read Bandwidth in KB} = \text{Write Bandwidth in KB} * \text{Number of Consumers} \end{aligned} $$
- Reference: Read this FAQ on Kinesis Data Streams
  - Search for the question `How do I decide the throughput of my Kinesis stream?`

Kinesis shard calculation example
- You are designing a system where Kinesis Data Streams are to be used for realtime processing of data produced by IoT systems
- The average record size produced by the IoT devices is 500 KB
- The data records are written to the Kinesis stream by the IoT devices directly, using the `PutRecord` API, at a rate of 120 records per minute
- There are 7 Lambda instances that will read from the Kinesis stream, process the data, and finally store it in DynamoDB
- How many shards will the Kinesis stream need to support the system described above?
Solution
- Remember to convert `RPM` (Records Per Minute) to `RPS` (Records Per Second)
$$ 120 \text{ RPM} = \frac{120}{60} = 2 \text{ RPS} $$ $$ \begin{aligned} \text{Write Bandwidth in KB} \cr &= \text{Average Record Size in KB} * \text{Records Per Second} \cr &= 500 * 2 \cr &= 1000 KB \end{aligned} $$ $$ \begin{aligned} \text{Read Bandwidth in KB} \cr &= \text{Write Bandwidth in KB} * \text{Number of Consumers} \cr &= 1000 * 7 \cr &= 7000 KB \end{aligned} $$ $$ \begin{aligned} shards \cr &= max(\frac{\text{Write bandwidth in KB}}{1000},\frac{\text{Read bandwidth in KB}}{2000}) \cr &= max(\frac{1000}{1000},\frac{7000}{2000}) \cr &= max(1, 3.5) \cr &= 3.5 \approx 4 \end{aligned} $$
Answer
- As the number of `shards` cannot be a fraction, round up to the next integer, i.e. `4 shards` are needed to support the demands of the system
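The steps above can be wrapped in a small helper. This is a sketch of the FAQ formula (1000 KB/s write and 2000 KB/s read capacity per shard); the function name is my own:

```python
import math

def shards_needed(avg_record_kb, records_per_second, num_consumers):
    """Estimate Kinesis shards from write/read bandwidth requirements."""
    write_kb = avg_record_kb * records_per_second   # KB written per second
    read_kb = write_kb * num_consumers              # KB read per second
    # Round the max up, since a fractional shard is not possible
    return math.ceil(max(write_kb / 1000, read_kb / 2000))

# Exercise values: 500 KB records, 120 RPM = 2 RPS, 7 Lambda consumers
print(shards_needed(500, 2, 7))  # -> 4
```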
Autoscaling SageMaker
- The production variants of your model need to be autoscaled to handle fluctuations in traffic
- Perform load testing to find the peak `SageMakerVariantInvocationsPerInstance` that your model's production variant can handle
- The recommended `SAFETY_FACTOR` to start with is 0.5, as per AWS
- Refer here for a detailed AWS blog on fine-tuning SageMaker
If RPS is used:
$$ \begin{aligned} SageMakerVariantInvocationsPerInstance = MAX\_RPS * SAFETY\_FACTOR * 60 \end{aligned} $$
If RPM is used:
$$ \begin{aligned} SageMakerVariantInvocationsPerInstance = MAX\_RPM * SAFETY\_FACTOR \end{aligned} $$
- Where `MAX_RPS` is the maximum RPS that you determined from the load test, and `SAFETY_FACTOR` is the safety factor that you chose to ensure that your clients don't exceed the maximum RPS. The same holds for `MAX_RPM`
Sudo Exam Tip:
- `SageMakerVariantInvocationsPerInstance` is the average number of times `per minute` that each instance for a variant is invoked
- The Gist! The final configuration provided to SageMaker for `SageMakerVariantInvocationsPerInstance` should be in terms of RPM
Exercise Question: When load testing results are in RPS
A Machine Learning Specialist wants to determine the appropriate `SageMakerVariantInvocationsPerInstance` setting for an endpoint automatic scaling configuration. The Specialist has performed a load test on a single instance and determined that the peak requests per second (RPS) without service degradation is about 20 RPS. As this is the first deployment, the Specialist intends to set the invocation safety factor to 0.5. Based on the stated parameters, and given that the invocations-per-instance setting is measured on a per-minute basis, what should the Specialist set as the `SageMakerVariantInvocationsPerInstance` setting?
Solution
$$ \begin{aligned} SageMakerVariantInvocationsPerInstance \cr &= MAX\_RPS * SAFETY\_FACTOR * 60 \cr &= 20 * 0.5 * 60 \cr &= 10 * 60 = 600 \cr \end{aligned} $$
Exercise Question: When load testing results are in RPM
A Machine Learning Specialist has performed a load test on a single instance and determined that the peak requests per minute (RPM) without service degradation is about 1400 RPM. The Specialist intends to set the invocation safety factor to 0.7.
What should the Specialist set as the `SageMakerVariantInvocationsPerInstance` setting?
Solution
$$ \begin{aligned} SageMakerVariantInvocationsPerInstance \cr &= MAX\_RPM * SAFETY\_FACTOR \cr &= 1400 * 0.7 \cr &= 980 \cr \end{aligned} $$
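Both exercise solutions can be checked with a small helper that normalizes the load-test rate to requests per minute. The function name and `unit` parameter are my own, not an AWS API:

```python
def invocations_per_instance(max_rate, safety_factor, unit="rps"):
    """Target SageMakerVariantInvocationsPerInstance (always per minute)."""
    # Convert to requests per minute if the load test reported RPS
    per_minute = max_rate * 60 if unit == "rps" else max_rate
    return per_minute * safety_factor

print(invocations_per_instance(20, 0.5, unit="rps"))           # -> 600.0
print(round(invocations_per_instance(1400, 0.7, unit="rpm")))  # -> 980
```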