    Where Do Loss Functions Come From? | by Yoshimasa | Mar, 2025

By FinanceStarGate | March 6, 2025


    Photograph by Antoine Dautry on Unsplash

When you train a machine learning model, you minimize a loss function. But have you ever wondered why we use the ones we do? Why is Mean Squared Error (MSE) so common in regression? Why does Cross-Entropy Loss dominate classification? Are loss functions just arbitrary choices, or do they have deeper mathematical roots?

It turns out that many loss functions aren’t simply invented: they emerge naturally from probability theory. But not all of them. Some loss functions defy probabilistic intuition and are designed purely for optimization.

Let’s start with a simple example. Suppose we’re predicting house prices with a regression model. The most common way to measure error is the Mean Squared Error (MSE) loss:
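Writing yᵢ for the observed value and ŷᵢ for the model’s prediction on the i-th of n examples, it takes the standard form:

$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$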

At first glance, this just looks like a mathematical way of measuring how far our predictions are from reality. But why squared error? Why not absolute error? Why not cubed error?

Probability Density Function (PDF)

If we assume that the errors in our model follow a normal distribution:
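That is, writing xᵢ for an observation, μ for its mean under the model, and σ^2 for the error variance:

$$x_i \sim \mathcal{N}\!\left(\mu,\ \sigma^2\right)$$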

Then the probability density function (PDF) is:
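$$p(x_i \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$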

Likelihood Function

If we observe a number of independent data points x1, x2, …, xn, then their joint probability (the likelihood function) is:
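$$L(\mu, \sigma^2) = \prod_{i=1}^{n} p(x_i \mid \mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$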

Since we typically work with log-likelihoods for easier optimization:
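$$\log L(\mu, \sigma^2) = -\frac{n}{2}\log\!\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(x_i - \mu\right)^2$$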

Deriving the Loss Function

Now, to turn this into a loss function, we negate the log-likelihood (since optimizers minimize loss rather than maximize likelihood):
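$$\mathcal{L}(\mu, \sigma^2) = -\log L(\mu, \sigma^2) = \frac{n}{2}\log\!\left(2\pi\sigma^2\right) + \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(x_i - \mu\right)^2$$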

If we assume σ^2 is constant, the terms that do not depend on μ can be dropped, and (up to a constant factor) the loss function simplifies to:
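$$\mathcal{L}(\mu) = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mu\right)^2$$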

which is just Mean Squared Error (MSE), with the model’s prediction playing the role of μ.
MSE isn’t merely a convention: it is the result of assuming normally distributed errors. This means we implicitly assume a Gaussian noise model every time we minimize MSE.

If we don’t assume a fixed variance, we get a slightly different loss function:
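$$\mathcal{L}(\mu, \sigma^2) = \frac{n}{2}\log \sigma^2 + \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(x_i - \mu\right)^2 + \text{const.}$$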

This extra term, log σ^2, means that the optimal values of μ and σ^2 are learned jointly, rather than under an assumed fixed variance.

If we treat σ^2 as unknown, we move toward heteroscedastic models, which allow different levels of uncertainty for different predictions.
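To make this concrete, here is a minimal NumPy sketch (an illustration, not code from the original post) of the Gaussian negative log-likelihood when the model is allowed to predict a per-example variance; the function name and array shapes are assumptions made for this example.

```python
import numpy as np

def gaussian_nll(y, mu, sigma2):
    """Average negative log-likelihood of y under N(mu, sigma2).

    y, mu, sigma2 are 1-D arrays of the same length; sigma2 must be positive.
    With a single shared sigma2 this reduces (up to constants) to MSE;
    letting the model output sigma2 per example gives a heteroscedastic loss.
    """
    return np.mean(0.5 * np.log(2 * np.pi * sigma2) + (y - mu) ** 2 / (2 * sigma2))

# Toy usage: identical residuals, different uncertainty estimates.
y = np.array([3.0, -1.0, 2.5])
mu = np.array([2.5, -0.5, 2.0])
print(gaussian_nll(y, mu, np.full(3, 1.0)))            # homoscedastic (fixed variance)
print(gaussian_nll(y, mu, np.array([0.5, 2.0, 1.0])))  # heteroscedastic (per-example variance)
```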

    Cross-Entropy Loss

For classification problems, we often minimize Cross-Entropy Loss, which comes from the Bernoulli or Categorical likelihood function.

    For binary classification:
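With labels yᵢ ∈ {0, 1} and predicted probabilities ŷᵢ, the standard form over n examples is:

$$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{n}\sum_{i=1}^{n}\left[\,y_i \log \hat{y}_i + \left(1 - y_i\right)\log\!\left(1 - \hat{y}_i\right)\right]$$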

This arises naturally from the likelihood of data drawn from a Bernoulli distribution.

So far, we’ve seen that many loss functions arise naturally from likelihood functions. But not all of them. Some are designed for optimization efficiency, robustness, or task-specific needs.

    Hinge Loss (SVMs)

Most classification loss functions, like cross-entropy loss, come from a probabilistic framework. But Hinge Loss, the core loss function in Support Vector Machines (SVMs), is different.

Instead of modeling likelihood, it focuses on maximizing the margin between classes.

If we have labels y ∈ {−1, +1} and a model making predictions f(x), the hinge loss is:
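$$\mathcal{L}_{\mathrm{hinge}} = \max\!\left(0,\ 1 - y\,f(x)\right)$$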

If yf(x) ≥ 1 → no loss (correct classification with a margin).

If yf(x) < 1 → the loss increases linearly (misclassified, or correct but too close to the decision boundary).
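As a quick check of this piecewise behavior, here is a minimal NumPy sketch (again illustrative, not from the original post):

```python
import numpy as np

def hinge_loss(y, fx):
    """Hinge loss max(0, 1 - y * f(x)) for labels y in {-1, +1} and scores f(x)."""
    return np.maximum(0.0, 1.0 - y * fx)

y = np.array([+1, +1, -1, -1])
fx = np.array([2.0, 0.3, -0.5, 1.0])   # model scores f(x)
print(hinge_loss(y, fx))               # [0.  0.7 0.5 2. ] -- zero only when y*f(x) >= 1
```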


