Why Use Squared Errors?
January 01, 2020
Gist
The function that minimizes the squared error loss function is the conditional mean.
Good Problems, Bad Explanations
I came to statistics and machine learning from a background in proof-driven mathematics. These two are not in opposition at all, but most of my high school and undergrad exposure to stats certainly felt that way. One of the most significant stumbling blocks for me as a beginner was how quickly the concept of squared errors came up, with very little motivation! Many concepts in beginning statistics match intuition pretty closely, so sometimes the background is unnecessary - for me, this was not one of those concepts. While it isn't the only context this comes up in, I think the best way to explore it is in the context of the regression problem.
A Little Setup: Regression
Let’s consider a minimal version of the regression problem: given our data in the form of a dependent variable $y$ and an independent variable $x$ 1, we want to find a function $f$ so that we can model the unknown process that generated our data. Our model looks like this,

$$y = f(x) + \varepsilon$$
where $\varepsilon$ represents the prediction error. We don’t usually want just any function though - we want the best function! As mathematicians, that word “best” should raise a lot of questions, and this is where we need the concept of a loss function. Without going too deep, a loss function provides the framework for evaluating the performance of $f$ and defines what “best” means by answering the question: how far off is our guess $f(x)$ from $y$? We then have two tasks to solve in our simplified regression problem:
- Choose an appropriate loss function
- Find the function $f$ for which the value of the loss is lowest
In introductory stats, the loss function we use is almost always the squared error loss.
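To see what the squared error loss singles out, here is a minimal numerical sketch (the data and grid are hypothetical, not from the post): among all constant predictions $c$, the one minimizing the average squared error is the sample mean.

```python
import numpy as np

# Hypothetical sample data; any numbers would do.
rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=1.5, size=500)

# Candidate constant predictions c, and the average squared error of each.
candidates = np.linspace(y.min(), y.max(), 4001)
sq_loss = ((y[:, None] - candidates[None, :]) ** 2).mean(axis=0)

# The grid minimizer lands (up to grid resolution) on the sample mean,
# since the average of (y - c)^2 is a parabola in c with vertex at c = mean(y).
best = candidates[np.argmin(sq_loss)]
print(best, y.mean())
```

This is the unconditional version of the claim in the gist: with no $x$ to condition on, the squared-error-optimal prediction is just the mean of $y$.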
Squared Errors
Minimizing Squared Errors, Conditional Mean
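A sketch of the standard argument for the claim in the gist (assuming $E[Y^2] < \infty$, so everything below is finite): add and subtract $E[Y \mid X]$ inside the squared error and expand.

```latex
\begin{aligned}
E\big[(Y - f(X))^2\big]
  &= E\big[\big(Y - E[Y\mid X] + E[Y\mid X] - f(X)\big)^2\big] \\
  &= E\big[(Y - E[Y\mid X])^2\big]
   + E\big[(E[Y\mid X] - f(X))^2\big] \\
  &\qquad + 2\,E\big[(Y - E[Y\mid X])\,(E[Y\mid X] - f(X))\big].
\end{aligned}
```

The cross term vanishes by the tower property (condition on $X$; then $E[Y - E[Y\mid X] \mid X] = 0$). The first term does not depend on $f$, and the second term is nonnegative and equals zero exactly when $f(X) = E[Y\mid X]$, so the conditional mean minimizes the expected squared error.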
Why not another loss function?
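Swapping in a different loss changes which summary of the data is optimal. As a hedged numerical sketch (again with made-up data), minimizing the average absolute error over constant predictions recovers the sample median rather than the mean:

```python
import numpy as np

# Hypothetical right-skewed sample, so the mean and median differ noticeably.
rng = np.random.default_rng(1)
y = rng.exponential(scale=2.0, size=501)

# Candidate constant predictions c, and the average absolute error of each.
candidates = np.linspace(y.min(), y.max(), 4001)
abs_loss = np.abs(y[:, None] - candidates[None, :]).mean(axis=0)

# The absolute-error minimizer tracks the sample median, not the mean.
best = candidates[np.argmin(abs_loss)]
print(best, np.median(y), y.mean())
```

The absolute error loss leads to conditional-median regression in the same way squared error leads to the conditional mean; the choice of loss is a choice about which feature of the conditional distribution you want to estimate.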
- Most of this stuff is generalizable, but for simplicity assume