We need to learn about bias and variance in order to understand how to fix underfitting and overfitting of a model. Spend enough time reading and understanding everything till the end of the blog, where you'll learn how to apply the concepts (helpful for the MLS-C01 exam too!)
What are Bias and Variance?
Welcome to the world of bias and variance. It may be confusing at first, but by the end of the blog you'll have full clarity. When learning about bias and variance, you need to be very clear about the context in which the terms are used.
- There is a statistical way of understanding them in data, used in descriptive statistics
- There is bias and variance in the context of ML model training, where the terms underfit and overfit are used interchangeably with them, respectively
- To be clear, in ML model training:
- Bias and underfit are used interchangeably
- Variance and overfit are used interchangeably
We'll learn about them in the context of data/statistics first, then in the context of machine learning
Bias and Variance in Data
- Bias and variance are characteristics of data.
- Bias is error introduced into data by the nature of the collection process used, whereas variance is a statistical measure of spread
- Remember, this is not what ML engineers refer to when they talk about the bias-variance tradeoff; that is in the context of model training using the data (more about this later)
Bias in Data
Data bias in statistics is a type of error in which certain elements of a dataset are more heavily weighted and/or represented than others. A biased dataset does not accurately represent the concept or reality under study, resulting in skewed outcomes, low accuracy levels, and errors when such data is used in machine learning
Let’s consider an example:
- Let's say we want to predict the income of individuals in a country, but it's really difficult to collect such information from the whole population
- For our study, statistically, we can collect data in such a way that it nearly represents all the strata of society, i.e. both rich and poor, all races, religions, etc.
- However, some regions of a country may have more affluent people than other regions
- If we don't randomly sample data across the country, we may end up collecting skewed data that we wrongly assume represents the country
- This kind of skewed data, caused by incorrect sampling or collection, is called sample bias
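The effect of sample bias can be seen with a quick simulation. Below is a minimal sketch using a synthetic income population (all numbers are made up for illustration): a random sample recovers the true average income closely, while sampling only the affluent region wildly overestimates it.

```python
import random

random.seed(42)

# Synthetic population: 90% lower-income, 10% affluent (illustrative values)
population = [random.gauss(30_000, 5_000) for _ in range(9_000)] + \
             [random.gauss(200_000, 30_000) for _ in range(1_000)]

true_mean = sum(population) / len(population)

# Random sample across the whole population -> lands close to the true mean
random_sample = random.sample(population, 500)
random_mean = sum(random_sample) / len(random_sample)

# Biased sample: surveying only the affluent group -> badly skewed estimate
biased_sample = population[-500:]
biased_mean = sum(biased_sample) / len(biased_sample)

print(f"true mean:   {true_mean:,.0f}")
print(f"random mean: {random_mean:,.0f}")
print(f"biased mean: {biased_mean:,.0f}")
```

A model trained on the biased sample would "learn" that the average citizen is affluent, which is exactly the kind of error that propagates into predictions.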
There are many kinds of bias; let's consider a few common ones here:
Exclusion bias: Most often it's a case of deleting valuable data thought to be unimportant. However, it can also occur due to the systematic exclusion of certain information. For example, in sales data, 99% of the sales may be from Product A; however, the 1% of Product B sales could be contributing 50% of the revenue. Excluding Product B, thinking it's not contributing to sales, can be a disaster.
Observer bias or Confirmation bias: The effect of seeing what you expect to see or want to see in data. This can happen when researchers go into a project with subjective thoughts about their study, either conscious or unconscious.
Sudo Tip: Bias in data gets injected into the model learnt by ML algorithms and can result in bad predictions, lowering the accuracy of the model. Identifying and removing bias in data is an important step in data preparation for ML.
Variance of Data
- Well, statistics is the root of traditional machine learning algorithms; there is no doing away with statistics when data is in focus
- Variance in data is nothing but a measure describing how spread out the data is
- E.g. variance tries to describe how much a set of numbers vary
- One way to do that is to describe a number range, i.e. min is 5 and max is 500 for a given set of numbers
- But there are issues with using such a double-number measure (min, max), so a single-number measure is used instead. Refer to this awesome guide to refresh your memory
- After reading the above guide, you'll realize that standard deviation (the square root of variance) is another measure of spread, and a better one
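The points above can be sketched in a few lines of plain Python: the (min, max) range, the variance as the average squared deviation from the mean, and the standard deviation derived from it. The numbers are arbitrary examples.

```python
import statistics

values = [5, 12, 50, 120, 500]

# A "double number" measure of spread: the range (min, max)
value_range = (min(values), max(values))   # (5, 500)

# Variance: the average squared deviation from the mean (population variance)
mean = sum(values) / len(values)
variance = sum((x - mean) ** 2 for x in values) / len(values)

# Standard deviation is the square root of variance, back in the original units
std_dev = variance ** 0.5

print(value_range, round(variance, 2), round(std_dev, 2))
```

Note that variance is in squared units (e.g. dollars squared), which is why standard deviation, being in the original units, is often the easier single number to reason about.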
Sudo Tip: Variance in data is calculated at the feature level (a column of data prepared for ML algorithm training), and high variance is considered good.
- Features that have low variance contribute very little to learning and hence are removed from the training dataset
- Features that have high variance help describe patterns in the data, thereby helping an ML model learn them
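Dropping low-variance features can be done by hand in a few lines; here is a minimal sketch in pure Python (the helper name `drop_low_variance` and the sample feature columns are made up for illustration; scikit-learn offers the same idea as `VarianceThreshold`).

```python
def drop_low_variance(columns, threshold=0.0):
    """Keep only feature columns whose population variance exceeds `threshold`.

    `columns` maps feature name -> list of numeric values.
    """
    kept = {}
    for name, values in columns.items():
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        if variance > threshold:
            kept[name] = values
    return kept

features = {
    "age":        [23, 45, 31, 60, 18],   # varies across rows -> informative
    "country_id": [1, 1, 1, 1, 1],        # constant -> variance 0, dropped
}
print(list(drop_low_variance(features)))  # ['age']
```

A constant column like `country_id` above carries no information the model can use to separate examples, so removing it loses nothing and shrinks the training data.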
Bias and Variance in ML Model
Having understood bias and variance in data, we can now understand what they mean in machine learning models
- Bias and variance in a model can easily be identified by comparing the dataset points with the model's predictions
- The above figure shows an example for a regression case
- The blue dots are training data points
- The red line is the regression line learnt by the ML algorithm (or, as it's called, a curve fit to the data)
- Overfit/High Variance:
- The line fit by the algorithm is so tight to the training data that it cannot generalize to new, unseen data
- This case is also called high variance in the model because the model has picked up the variance in the data and learnt it perfectly. The high variance in the data could be due to noise, and when the model learns it, the model's accuracy drops
- We should avoid overfit models so that we generalize better on new data (keep reading to know how to reduce overfit in models)
- Underfit/High Bias:
- The line fit by the algorithm is flat, i.e. a constant value. No matter what the input is, the prediction is a constant. This is the worst form of bias in ML
- The algorithm has learnt so little from the data that the line is underfit (due to high bias)
- We should avoid underfit models (keep reading to know how to reduce underfit in models)
- Good Balance/Right Fit
- The right-fit models learn a smooth curve through the data, representing the direction of the data and the variation in it
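The three cases above can be demonstrated on synthetic regression data. Below is a minimal sketch (all data and model choices are illustrative assumptions): an underfit model predicts a constant, a right-fit model is a least-squares line, and an overfit model simply memorizes the training points via nearest-neighbour lookup. Comparing train and test error makes the pattern visible.

```python
import random

random.seed(0)

def noisy(x):
    # True relationship y = 3x + 7 plus Gaussian noise (illustrative)
    return 3 * x + 7 + random.gauss(0, 2)

train = [(x, noisy(x)) for x in range(20)]
test = [(x + 0.5, noisy(x + 0.5)) for x in range(20)]

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# Underfit / high bias: predict a constant (mean of the training labels)
mean_y = sum(y for _, y in train) / len(train)
def underfit(x):
    return mean_y

# Right fit: closed-form simple linear regression (least squares)
n = len(train)
sx = sum(x for x, _ in train)
sy = sum(y for _, y in train)
sxy = sum(x * y for x, y in train)
sxx = sum(x * x for x, _ in train)
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
def right_fit(x):
    return slope * x + intercept

# Overfit / high variance: memorize the training set (1-nearest-neighbour)
def overfit(x):
    return min(train, key=lambda point: abs(point[0] - x))[1]

for name, model in [("underfit", underfit), ("right fit", right_fit), ("overfit", overfit)]:
    print(f"{name:9s}  train MSE {mse(model, train):7.2f}  test MSE {mse(model, test):7.2f}")
```

The overfit model scores a perfect zero error on training data yet degrades on the test points, while the underfit model is bad on both; the least-squares line does reasonably on both.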
Bias-Variance Trade-off
- There is a fancy term called bias-variance tradeoff, which simply means you cannot reduce both bias and variance in a model; you can only achieve a good balance between the two
- A good analogy would be a gearbox: one cannot achieve both high speed and high torque at the same time. The higher the torque, the lower the speed, and vice versa
- Less geeky analogies:
- Putting up with a half-hour commute in order to make more money
- Taking a day off work to go to a movie, gaining leisure and entertainment while losing a day's wages
- The point of a trade-off is that you can't have both
The closure: Important for MLS-C01 exam and for your learning
How to identify high bias (underfit) and high variance (overfit) in a model?
Sudo Exam Tip: The below graph is important for recognizing bias and variance cases in training. Remember to identify the exact case by looking at the gap between the training and validation curves towards the end of the curves
- First graph: a high training error (underfit) and a small gap between the train and validation error curves towards the end of the curves indicate high bias, which in turn implies low variance
- Second graph: a low training error (overfit) and a large gap between the train and validation error curves towards the end of the curves indicate high variance, which in turn implies low bias
How to fix high bias (underfit) and high variance (overfit) in models?
Sudo Exam Tip: Remember and memorize the following after you understand it thoroughly. Write it down on the rough sheet of paper provided at the exam center as soon as you enter, to save time. Many questions require you to have the below information at your fingertips, meaning committed to memory.