Day 1 of #66daysofdata challenge.
Imagine we model the relationship between a person's height and weight. The purple dots here represent the training data, and we use the Linear Regression algorithm, which fits a straight line as seen above.
Now suppose the red line here is the true relationship. Note that the regression line above cannot capture this true relationship, because a straight line cannot bend to follow it. This inability of a model to capture the true relationship is called Bias.
Now suppose we have another method that fits the squiggly line shown in yellow. Because it is flexible enough to follow the training data closely, its Bias is very low.
To measure how well a fit performs, we take the distances from the fitted line (yellow) to the data (purple points), square them, and add them up. Note that the squiggly line fits the training data very well.
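The measure described above is usually called the sum of squared residuals. Here is a minimal sketch of it in Python with NumPy. The height and weight numbers are made up for illustration and do not come from the figure, and the helper name `sum_of_squared_residuals` is my own choice.

```python
import numpy as np

def sum_of_squared_residuals(y_true, y_pred):
    """Square each distance between observed and predicted values, then add them up."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sum((y_true - y_pred) ** 2))

# Hypothetical training data: heights in cm and weights in kg (illustrative only)
heights = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
weights = np.array([52.0, 58.0, 69.0, 77.0, 88.0])

# Fit a straight line: weight = slope * height + intercept
slope, intercept = np.polyfit(heights, weights, deg=1)
line_predictions = slope * heights + intercept

ssr_line = sum_of_squared_residuals(weights, line_predictions)
print(f"Sum of squared residuals for the straight line: {ssr_line:.2f}")
```

The same function works for any fitted curve: pass in that curve's predictions instead of `line_predictions`. A smaller value means the fit sits closer to the points it was measured against.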
Now the green dots here represent the test dataset. On this data the straight line performs better than the squiggly line. This difference in how well a model fits the training data versus the test data is called Variance.
The squiggly line is Overfit: it fits the training data very well but performs badly on the test dataset.
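To see the same effect in numbers, here is a rough sketch that compares a straight line with a "squiggly" fit on training and test data. Everything in it is an assumption for illustration: the toy data (true relationship y = x, with noisy training points), the use of a degree-5 polynomial as the squiggly line, and the helper name `ssr`. It is not the method used in the StatQuest video.

```python
import numpy as np

def ssr(y_true, y_pred):
    """Sum of squared residuals between observed and predicted values."""
    return float(np.sum((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

# Toy data (assumption): the true relationship is y = x
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
noise = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
y_train = x_train + noise            # noisy training points

x_test = np.array([0.5, 1.5, 2.5, 3.5, 4.5])
y_test = x_test.copy()               # test points lying on the true relationship

results = {}
for name, degree in [("straight line", 1), ("squiggly line", 5)]:
    # Least-squares polynomial fit on the training data only
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_ssr = ssr(y_train, np.polyval(coeffs, x_train))
    test_ssr = ssr(y_test, np.polyval(coeffs, x_test))
    results[name] = (train_ssr, test_ssr)
    print(f"{name}: train SSR = {train_ssr:.3f}, test SSR = {test_ssr:.3f}")
```

With this toy data, the degree-5 polynomial passes through every training point, so its training error is essentially zero, yet its test error is much larger than the straight line's. The big gap between its training and test fit is the high Variance described above.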
So ideally we need low bias and low variance.
References:
- https://youtu.be/EuBBz3bI-aA?si=xrcf1FzrxNpRiwRB (StatQuest)