
Formulae Cheat Sheet to Prepare for Machine Learning Specialty

Remember how to write the Confusion Matrix

  • Know how to write the confusion matrix when Actual and Predicted are swapped
  • Write down both versions of the confusion matrix on the rough sheet provided, as soon as you start the exam

(Image: confusion matrix)

Basic Formulae for Classification

Precision, Recall and Specificity

$$ \text{Precision} = \text{Positive Predictive Value (PPV)} = \frac{TP}{TP+FP} $$

$$ \text{Recall} = \text{Sensitivity} = \text{True Positive Rate (TPR)} = \frac{TP}{TP+FN} $$

$$ \text{Specificity} = \text{True Negative Rate (TNR)} = \frac{TN}{TN+FP} $$

Sudo Exam Tip:

  • How to remember the above formulae?
    • Once you understand what the formulae mean, the only way to reproduce them quickly is recall :) Here is the trick I use:
      • Precision:
        • Every component of the Precision formula ends with P, i.e. Precision = TP/(TP+FP)
      • Recall:
        • Once you know Precision, for Recall just replace FP -> FN, that’s all!
      • Specificity:
        • Once you know Precision, for Specificity just replace TP -> TN, that’s all!

F1-Score

$$ F1 = \underbrace{\frac{2 * TP}{2*TP + FP + FN}}_{\text{In terms of TP, FP and FN}} $$

$$ F1 = \underbrace{\frac{2 * Precision * Recall}{Precision + Recall}}_{\text{In terms of Precision and Recall}} $$
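These formulae translate directly into code. Here is a minimal Python sketch (the function name and sample counts are my own, purely illustrative) that computes all four metrics from raw confusion-matrix counts and checks that the two F1 forms agree:

```python
def classification_metrics(tp, fp, fn, tn):
    """Basic classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)     # PPV: every component ends with P
    recall = tp / (tp + fn)        # TPR: Precision with FP -> FN
    specificity = tn / (tn + fp)   # TNR: Precision with TP -> TN
    f1_counts = 2 * tp / (2 * tp + fp + fn)                # F1 from counts
    f1_pr = 2 * precision * recall / (precision + recall)  # F1 from P and R
    assert abs(f1_counts - f1_pr) < 1e-12  # the two forms are algebraically equal
    return precision, recall, specificity, f1_counts

# Hypothetical counts: TP=8, FP=2, FN=4, TN=6
print(classification_metrics(8, 2, 4, 6))
# (0.8, 0.666..., 0.75, 0.727...)
```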

Sample Question and Solution

(Image: precision and recall exercise with solution)

TF-IDF: Term Frequency - Inverse Document Frequency

  • In information retrieval, tf–idf, TF*IDF, or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus
    • tf–idf is one of the most popular term-weighting schemes today
    • Read more on [Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
  • Term frequency (tf) of a term is calculated per document, i.e. once for each of the n documents. If tf is needed for many terms, the process is repeated for each term across all n documents

$$ \begin{aligned} tf(t, d_{n}) \cr &= {\text{Term frequency of term } t \text{ in document } d_{n} } \cr &= \frac{\text{Number of times }t\text{ occurs in } d_{n}}{\text{Number of words in }d_{n}} \cr \end{aligned} $$

  • Document frequency (df) of a term is calculated only once over all n documents

$$ \begin{aligned} df(t) \cr &= {\text{Document frequency of term } t \text{ in all documents} } \cr &= \frac{\text{Number of documents with term }t}{\text{Total number of documents i.e. }n} \end{aligned} $$

  • Inverse document frequency (idf) is the logarithm of the inverse of df

$$ idf(t) = log(\frac{1}{df(t)}) $$

  • Finally, tf-idf is calculated for every term in every document, using tf and idf

$$ \begin{aligned} &tf(t, d_{n}) * idf(t) \cr\cr &{\text{which is computed for every document as }} \cr &tf(t, d_{1}) * idf(t) \cr &tf(t, d_{2}) * idf(t) \cr &tf(t, d_{3}) * idf(t) \cr &tf(t, d_{4}) * idf(t) \cr &{\text{and so on, up to}} \cr &tf(t, d_{n}) * idf(t) \cr \end{aligned} $$
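The three definitions above map one-to-one onto a short Python sketch (helper names are mine; I use a base-10 log so the numbers match the worked example that follows):

```python
import math

def tf(term, doc_words):
    """Term frequency: occurrences of term / total words in the document."""
    return doc_words.count(term) / len(doc_words)

def idf(term, docs_words, base=10):
    """Inverse document frequency: log(1 / df).

    Assumes the term occurs in at least one document (df > 0).
    """
    df = sum(1 for words in docs_words if term in words) / len(docs_words)
    return math.log(1 / df, base)

def tf_idf(term, doc_words, docs_words):
    """tf-idf of a term in one document, given the whole corpus."""
    return tf(term, doc_words) * idf(term, docs_words)
```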

TF-IDF Exercise

  • TF-IDF tells us the significance of a term in a document
  • Let’s consider a few documents:
    • Document 1 - d1: Sudo Code blogs are a very good resource to prepare for machine learning specialty exam. The learning experience is very good. Machine learning specialty preparation is made easy
    • Document 2 - d2: Sudo Code blogs are a very good resource to prepare for machine learning specialty exam. The blogs are very informative and to the point. The blogs take a new approach
    • Document 3 - d3: Sudo Code blogs are very helpful for MLS-C01 exam
  • The question we ask is: how significant is the term learning in each document?
  • The answer is to calculate TF-IDF for the term in every document

Sample Question and Solution

$$ \begin{aligned} tf(learning,d_{1}) \cr &= \frac {\text{No. of times term learning occurs in } d_{1}} {\text{No. of words in } d_{1} } \cr &= \frac{3}{28} \approx 0.11 \cr \cr tf(learning,d_{2}) \cr &= \frac {\text{No. of times term learning occurs in } d_{2}} {\text{No. of words in } d_{2} } \cr &= \frac{1}{30} \approx 0.03 \cr \cr tf(learning,d_{3}) \cr &= \frac {\text{No. of times term learning occurs in } d_{3}} {\text{No. of words in } d_{3} } \cr &= \frac{0}{9} = 0 \cr \cr df(learning) \cr &= \frac {\text{No. of documents with term learning}} {\text{No. of documents }} \cr &= \frac{2}{3} \cr \cr idf(learning) \cr &= log(\frac{1}{df(learning)}) \cr &= log(\frac{3}{2}) \cr &= log(1.5) \approx 0.176 \cr \cr tfidf(learning, d_{1}) \cr &= tf(learning, d_{1}) * idf(learning) \cr &\approx 0.11 * 0.176 = 0.01936 \cr \cr tfidf(learning, d_{2}) \cr &= tf(learning, d_{2}) * idf(learning) \cr &\approx 0.03 * 0.176 = 0.00528 \cr \cr tfidf(learning, d_{3}) \cr &= tf(learning, d_{3}) * idf(learning) \cr &= 0 * 0.176 = 0 \cr \end{aligned} $$

Interpretation of TF-IDF Values

  • The tf-idf of the term learning is largest for document 1, hence the term is most significant in document 1, with weight ≈ 0.0194
  • The next is document 2, with weight ≈ 0.0053
  • Document 3 has no significance, with weight 0
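Running the sketch from the TF-IDF section above on the three documents reproduces the exercise; the exact values differ slightly from the hand calculation because tf was rounded to two decimals there:

```python
docs = [
    "Sudo Code blogs are a very good resource to prepare for machine "
    "learning specialty exam The learning experience is very good "
    "Machine learning specialty preparation is made easy",
    "Sudo Code blogs are a very good resource to prepare for machine "
    "learning specialty exam The blogs are very informative and to the "
    "point The blogs take a new approach",
    "Sudo Code blogs are very helpful for MLS-C01 exam",
]
docs_words = [d.lower().split() for d in docs]

for i, words in enumerate(docs_words, start=1):
    print(f"tfidf(learning, d{i}) = {tf_idf('learning', words, docs_words):.5f}")
# tfidf(learning, d1) = 0.01887
# tfidf(learning, d2) = 0.00587
# tfidf(learning, d3) = 0.00000
```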

Kinesis Shards Calculation

  • The number of shards required for a Kinesis stream is a precise calculation based on:
    • Record size
    • Write bandwidth (into the Kinesis stream)
    • Read bandwidth (out of the Kinesis stream)
  • The number of shards, shards, is calculated as:

$$ \begin{aligned} shards \cr &= max(\frac{\text{Write Bandwidth in KB/s}}{1000},\frac{\text{Read Bandwidth in KB/s}}{2000}) \cr \cr \text{where} \cr &\text{Write Bandwidth in KB/s} = \text{Average Record Size in KB} * \text{Records Per Second} \cr &\text{Read Bandwidth in KB/s} = \text{Write Bandwidth in KB/s} * \text{Number of Consumers} \end{aligned} $$

  • Reference: Read this FAQ on Kinesis Data Streams
    • Search for question How do I decide the throughput of my Kinesis stream?
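The formula fits in a few lines of Python; here is a sketch with names of my own choosing (the per-shard limits of 1000 KB/s in and 2000 KB/s out are from the AWS FAQ referenced above):

```python
import math

def kinesis_shards(avg_record_kb, records_per_second, num_consumers):
    """Estimate the shard count for a Kinesis data stream.

    Each shard supports 1000 KB/s of writes and 2000 KB/s of reads.
    """
    write_kb_per_s = avg_record_kb * records_per_second
    read_kb_per_s = write_kb_per_s * num_consumers
    return math.ceil(max(write_kb_per_s / 1000, read_kb_per_s / 2000))
```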

Kinesis Shard Calculation Example

  • You are designing a system where a Kinesis data stream is used for realtime processing of data produced by IoT systems
  • The average record size produced by the IoT devices is 500 KB
  • The data records are written to the Kinesis stream directly by the IoT devices using the PutRecord API, at a rate of 120 records per minute
  • There are 7 Lambda instances that read from the Kinesis stream, process the data, and finally store it in DynamoDB
  • How many shards does the Kinesis stream need to support the system described above?

Solution

  • Remember to convert RPM (records per minute) to RPS (records per second)

$$ 120 \text{ RPM} = \frac{120}{60} = 2 \text{ RPS} $$

$$ \begin{aligned} \text{Write Bandwidth in KB/s} \cr &= \text{Average Record Size in KB} * \text{Records Per Second} \cr &= 500 * 2 \cr &= 1000 \text{ KB/s} \end{aligned} $$

$$ \begin{aligned} \text{Read Bandwidth in KB/s} \cr &= \text{Write Bandwidth in KB/s} * \text{Number of Consumers} \cr &= 1000 * 7 \cr &= 7000 \text{ KB/s} \end{aligned} $$

$$ \begin{aligned} shards \cr &= max(\frac{\text{Write Bandwidth in KB/s}}{1000},\frac{\text{Read Bandwidth in KB/s}}{2000}) \cr &= max(\frac{1000}{1000},\frac{7000}{2000}) \cr &= max(1, 3.5) \cr &= 3.5 \approx 4 \end{aligned} $$

Answer

  • As the number of shards cannot be a fraction, round up to the next integer, i.e. 4 shards are needed to support the demands of the system
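Plugging the exercise numbers into the kinesis_shards sketch above confirms the answer:

```python
# 500 KB records, 120 RPM = 2 RPS, 7 Lambda consumers
print(kinesis_shards(avg_record_kb=500, records_per_second=120 / 60, num_consumers=7))
# 4  (max(1000/1000, 7000/2000) = 3.5, rounded up)
```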

Autoscaling SageMaker

  • The production variants of your model need to be autoscaled to handle fluctuations in traffic
  • Perform load testing to find the peak SageMakerVariantInvocationsPerInstance that your model’s production variant can handle
  • The recommended SAFETY_FACTOR to start with is 0.5, as per AWS
  • Refer here for a detailed AWS blog on fine-tuning SageMaker autoscaling

If RPS is used:

$$ \begin{aligned} SageMakerVariantInvocationsPerInstance = MAX\_RPS * SAFETY\_FACTOR * 60 \end{aligned} $$

If RPM is used:

$$ \begin{aligned} SageMakerVariantInvocationsPerInstance = MAX\_RPM * SAFETY\_FACTOR \end{aligned} $$

  • Where MAX_RPS is the maximum RPS you determined from the load test, and SAFETY_FACTOR is the safety factor you chose to ensure that your clients don’t exceed the maximum RPS. The same holds for MAX_RPM

Sudo Exam Tip: SageMakerVariantInvocationsPerInstance is the average number of times per minute that each instance for a variant is invoked. The Gist! The final configuration provided to SageMaker for SageMakerVariantInvocationsPerInstance should be in terms of RPM
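Both formulas collapse into one small helper; a sketch with my own names, where the factor of 60 converts per-second rates into the per-minute unit SageMaker expects:

```python
def invocations_per_instance(max_rate, safety_factor, unit="rps"):
    """Target value for SageMakerVariantInvocationsPerInstance (always RPM).

    max_rate: peak rate one instance handled in the load test.
    unit: "rps" (requests per second) or "rpm" (requests per minute).
    """
    if unit == "rps":
        return max_rate * safety_factor * 60  # seconds -> minutes
    return max_rate * safety_factor           # already per minute
```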

Exercise Question: When load testing results are in RPS

A Machine Learning Specialist wants to determine the appropriate SageMakerVariantInvocationsPerInstance setting for an endpoint automatic scaling configuration. The Specialist has performed a load test on a single instance and determined that the peak requests per second (RPS) without service degradation is about 20 RPS. As this is the first deployment, the Specialist intends to set the invocation safety factor to 0.5. Based on the stated parameters, and given that the invocations-per-instance setting is measured on a per-minute basis, what should the Specialist set as the SageMakerVariantInvocationsPerInstance setting?

Solution

$$ \begin{aligned} SageMakerVariantInvocationsPerInstance \cr &= MAX\_RPS * SAFETY\_FACTOR * 60 \cr &= 20 * 0.5 * 60 \cr &= 10 * 60 = 600 \cr \end{aligned} $$

Exercise Question: When load testing results are in RPM

A Machine Learning Specialist has performed a load test on a single instance and determined that the peak requests per minute (RPM) without service degradation is about 1400 RPM. The Specialist intends to set the invocation safety factor to 0.7. What should the Specialist set as the SageMakerVariantInvocationsPerInstance setting?

Solution

$$ \begin{aligned} SageMakerVariantInvocationsPerInstance \cr &= MAX\_RPM * SAFETY\_FACTOR \cr &= 1400 * 0.7 \cr &= 980 \cr \end{aligned} $$
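As a sanity check, the invocations_per_instance sketch from the autoscaling section reproduces both answers:

```python
print(invocations_per_instance(20, 0.5, unit="rps"))    # 600.0
print(invocations_per_instance(1400, 0.7, unit="rpm"))  # ~980 (float rounding)
```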