[Cover image]

We need to learn about bias and variance in order to understand how to fix underfitting and overfitting in models. Take the time to read and understand everything through to the end of the blog, where you will learn how to apply the concepts (helpful for the MLS-C01 exam too!)

What are Bias and Variance?

Welcome to the world of bias and variance. It may be confusing at first, but by the end of the blog you’ll have full clarity. When learning about bias and variance, you need to be very clear about the context in which the terms are used.

  • There is a statistical way of understanding bias and variance in data, used in descriptive statistics
  • There is bias and variance in the context of ML model training, where the terms underfit and overfit are used interchangeably with them
    • To be clear, in ML model training:
      • Bias and underfit are used interchangeably
      • Variance and overfit are used interchangeably

We’ll learn it in the context of Data/Statistics first, then in the context of Machine Learning.

Bias and Variance in Data

  • Bias and variance are characteristics of data.
  • Bias is an error introduced into data by the nature of the data collection process used, whereas variance is a statistical measure of spread
  • Remember, this is not what ML Engineers refer to when they talk about the bias-variance tradeoff; that is in the context of training a model on the data (more about this later)

Bias in Data

Data bias in statistics is a type of error in which certain elements of a dataset are more heavily weighted and/or represented than others. A biased dataset does not accurately represent the concept or reality under study, resulting in skewed outcomes, low accuracy levels, and errors when such data is used in machine learning.

Let’s consider an example:

  • Let’s say we want to predict the income of individuals in a country, but it’s really difficult to collect such information from the whole population
  • For our study, statistically, we can collect data in such a way that it nearly represents all strata of society, i.e. both rich and poor, all races, religions, etc.
  • However, some regions of a country may have far more affluent people than other regions
  • If we don’t randomly sample data across the country, we may end up collecting skewed data that we wrongly assume represents the country
  • This kind of skew, caused by incorrect sampling or collection, is called Sample bias
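The effect of sample bias is easy to demonstrate with a tiny simulation (all the income figures below are made up for illustration): sampling only from an affluent region badly overestimates the average income, while a random sample across the whole population stays close to the true value.

```python
import random

random.seed(0)

# Hypothetical population: 90% of people earn around 30k, 10% around 200k
population = [random.gauss(30_000, 5_000) for _ in range(9_000)] + \
             [random.gauss(200_000, 20_000) for _ in range(1_000)]

def mean(xs):
    return sum(xs) / len(xs)

# Biased sample: collected only in an affluent region (the high earners)
biased_sample = random.sample(population[9_000:], 500)

# Random sample: drawn uniformly across the whole population
random_sample = random.sample(population, 500)

print(f"True mean income:   {mean(population):,.0f}")
print(f"Biased sample mean: {mean(biased_sample):,.0f}")   # far too high
print(f"Random sample mean: {mean(random_sample):,.0f}")   # close to the truth
```

The point is not the exact numbers but the mechanism: any estimate computed from the biased sample inherits the skew of the collection process.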

There are many kinds of bias; let’s consider a few common ones here:

  • Exclusion bias: Most often it’s a case of deleting valuable data thought to be unimportant. However, it can also occur due to the systematic exclusion of certain information. For example, in sales data, 99% of sales may be from Product A; however, the 1% of Product B sales could be contributing 50% of revenue. Excluding Product B, thinking it doesn’t contribute to sales, could be a disaster.
  • Observer bias or Confirmation bias: the effect of seeing what you expect to see or want to see in data. This can happen when researchers go into a project with subjective preconceptions about their study, either conscious or unconscious.

Sudo Tip: Bias in data gets injected into the model learnt by ML algorithms and can result in bad predictions, lowering the accuracy of the model. Identifying and removing bias in data is an important step in data preparation for ML.

Variance of Data

  • Statistics is the root of traditional machine learning algorithms; there is no doing away with statistics when data is in focus
  • Variance in data is simply a measure of how spread out the data is
    • E.g. variance tries to describe how much a set of numbers vary
    • One way to do that is to describe a range, i.e. the min is 5 and the max is 500 for a given set of numbers
    • But there are issues with such a two-number measure (min, max), so a single-number measure is used instead
    • Refer this awesome guide, to refresh your memory
    • After reading the above guide, you’ll realize standard deviation is another, and better, measure of spread
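As a quick illustration (the numbers below are arbitrary), here is how the min/max range, the variance, and the standard deviation compare as summaries of spread:

```python
import numpy as np

# A made-up set of numbers whose spread we want to summarize
nums = np.array([5, 12, 40, 75, 120, 500], dtype=float)

print("min/max range:", nums.min(), "-", nums.max())   # two-number description

variance = nums.var()    # average squared distance from the mean
std_dev = nums.std()     # square root of variance, in the original units
print("variance:", variance)
print("standard deviation:", std_dev)
```

The standard deviation is often preferred over the variance because it is in the same units as the data itself.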

Sudo Tip: Variance in data is calculated at the feature level (a feature is a column of the data prepared for ML algorithm training), and high variance in a feature is generally considered good.

  • Features that have low variance contribute very little to learning and hence are removed from the training dataset
  • Features that have high variance help describe patterns in the data, thereby helping an ML model learn them
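A minimal sketch of this idea, using a toy feature matrix with made-up values: a zero-variance column carries no information and can be dropped before training (scikit-learn’s `VarianceThreshold` does the same job at scale).

```python
import numpy as np

# Toy feature matrix: 6 samples (rows) x 3 features (columns); values are made up
X = np.array([
    [1.0, 100.0, 5.0],
    [1.0, 220.0, 5.1],
    [1.0, 150.0, 4.9],
    [1.0, 310.0, 5.0],
    [1.0,  90.0, 5.2],
    [1.0, 260.0, 4.8],
])

variances = X.var(axis=0)    # per-feature (per-column) variance
print("feature variances:", variances)

# Drop features whose variance falls below a chosen threshold (0.5 is arbitrary)
keep = variances > 0.5
X_reduced = X[:, keep]
print("shape after filtering:", X_reduced.shape)
```

Here the first column is constant (zero variance) and the third barely varies, so only the second feature survives the filter.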

Bias and Variance in ML Model

Having understood bias and variance in data, we can now understand what they mean for machine learning models.

[Figure: underfit, right fit, and overfit regression examples]

  • Bias and variance in a model can be easily identified by comparing the data points with the predictions
  • The figure above shows an example for a regression case
  • The blue dots are training data points
  • The red line is the regression line learnt by the ML algorithm (also described as fitting a curve to the data)
  • Overfit/High Variance:
    • The line fit by the algorithm is so tight to the training data that it cannot generalize to new, unseen data
    • This case is also called high variance in the model because the model has picked up the variance in the data and learnt it perfectly. The high variance in the data could be due to noise, and when learnt by the model, it lowers the model’s accuracy
    • We should avoid overfit models so that they generalize better on new data (keep reading to learn how to reduce overfit in models)
  • Underfit/High Bias:
    • The line fit by the algorithm is flat, i.e. a constant value. No matter what the input is, the prediction is a constant. This is the worst form of bias in ML
    • The algorithm has learnt so little from the data that the line underfits (due to high bias)
    • We should avoid underfit models (keep reading to learn how to reduce underfit in models)
  • Good Balance/Right Fit:
    • A right-fit model learns a smooth curve through the data, representing both the direction of the data and the variation in it
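The three cases can be reproduced with a small polynomial-regression sketch (synthetic data; the signal, noise level, and degrees are all made up for illustration): a degree-0 fit is a flat line (underfit), a moderate degree follows the trend (right fit), and a very high degree chases the noise (overfit).

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic regression data: a smooth signal plus noise
x = np.linspace(-1, 1, 20)
y = np.sin(np.pi * x) + rng.normal(0, 0.3, size=x.shape)

# Independent "unseen" points from the same distribution
x_test = rng.uniform(-0.95, 0.95, 200)
y_test = np.sin(np.pi * x_test) + rng.normal(0, 0.3, size=x_test.shape)

def fit_and_errors(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x, y, degree)
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

for degree in (0, 3, 15):    # underfit, right fit, overfit
    train_mse, test_mse = fit_and_errors(degree)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Notice the pattern: training error always drops as capacity grows, but test error is worst at the two extremes, high for the underfit model and high again for the overfit one.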

Bias-Variance Trade-off

  • There is a fancy term called the bias-variance tradeoff, which simply means you cannot reduce both bias and variance in a model at the same time
  • You can only achieve a good balance between the two
  • A good analogy: an engine cannot deliver both high speed and high torque at the same time. The higher the torque, the lower the speed, and vice versa
  • Less geeky analogies:
    • When you put up with a half-hour commute in order to make more money
    • When you take a day off work to go to a movie, gaining leisure and entertainment while losing a day’s wages
  • The point of a trade-off is that you can’t have both

Closing notes: Important for the MLS-C01 exam and for your learning

How to identify high bias (underfit) and high variance (overfit) in a model?

Sudo Exam Tip: The graph below is important for recognizing bias and variance cases during training. Remember to identify the exact case by looking at the gap between the training and validation curves towards the end of the curves.

Bias and Variance from Learning Curves

  • First graph: a high training error (underfit) and a small gap between the train and validation error curves towards the end of the curves indicates high bias, which in turn implies low variance
  • Second graph: a low training error (overfit) and a large gap between the train and validation error curves towards the end of the curves indicates low bias, which in turn implies high variance
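These patterns can be checked numerically. A rough sketch (synthetic data; the quadratic signal and the two model capacities are made up): for each training-set size, fit the model and record train and validation error. The low-capacity model converges to a high error with a small gap (high bias), while the adequate model reaches a low training error, with the gap shrinking as data grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a quadratic signal plus noise
x = rng.uniform(-1, 1, 200)
y = 1.5 * x**2 + rng.normal(0, 0.1, size=x.shape)

x_train, y_train = x[:150], y[:150]    # pool of training examples
x_val, y_val = x[150:], y[150:]        # held-out validation set

def mse(coeffs, xs, ys):
    return np.mean((np.polyval(coeffs, xs) - ys) ** 2)

def learning_curve(degree):
    """Train on growing subsets; record (n, train error, validation error)."""
    points = []
    for n in range(10, 151, 10):
        coeffs = np.polyfit(x_train[:n], y_train[:n], degree)
        points.append((n, mse(coeffs, x_train[:n], y_train[:n]),
                       mse(coeffs, x_val, y_val)))
    return points

# Degree 0 = high-bias model: both curves converge to a high error, small gap.
# Degree 2 = adequate model: low training error, gap shrinks as n grows.
for degree in (0, 2):
    n, train_err, val_err = learning_curve(degree)[-1]
    print(f"degree {degree}: train {train_err:.3f}, val {val_err:.3f}, "
          f"gap {abs(val_err - train_err):.3f}")
```

Plotting the full list returned by `learning_curve` against `n` reproduces the shapes described above.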

How to fix high bias (underfit) and high variance (overfit) in models?

Sudo Exam Tip: Remember and memorize the following after you understand it thoroughly. Write it down on the rough sheet of paper given at the exam center as soon as you enter, to save time. Many questions require you to have the information below at your fingertips, meaning, committed to memory.

Fix Bias and Variance in Models