
Mathematics of Data

Double Descent and the Failure of Classical Generalization Theory

March 2025 · deep-learning · generalization · bias-variance · rademacher

Every statistics course teaches the same lesson: models that are too complex overfit. Add enough parameters and you will memorize the training data, producing a model that generalizes poorly. The cure is regularization, cross-validation, early stopping — any technique that reins in complexity.

Modern deep learning ignores this advice completely. Neural networks routinely have more parameters than training points, are trained to zero training loss, and yet generalize remarkably well. This is not a minor violation of classical theory — it is a fundamental contradiction.

This post develops both sides: the classical theory that predicts catastrophe, and the modern empirical reality that contradicts it. The gap between them is still not fully explained, but understanding where the classical theory breaks down is the first step.


1. The Classical Bias-Variance Trade-Off

The Decomposition

Let the true relationship be $Y = f(x) + \varepsilon$, where $\varepsilon$ has mean zero and variance $\sigma^2$ (irreducible noise, independent of the training data). We learn an estimator $\hat{f}$ from training data $S$ drawn i.i.d. from the data distribution. The test error at a fixed point $x$ decomposes as:

Theorem (Bias-Variance Decomposition). For squared loss:

$$\mathbb{E}_S\left[(Y - \hat{f}(x))^2\right] = \underbrace{(f(x) - \mathbb{E}_S[\hat{f}(x)])^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_S\left[(\hat{f}(x) - \mathbb{E}_S[\hat{f}(x)])^2\right]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{irreducible}}$$

Proof. Let $\bar{f}(x) = \mathbb{E}_S[\hat{f}(x)]$ and write:

$$Y - \hat{f}(x) = (Y - f(x)) + (f(x) - \bar{f}(x)) + (\bar{f}(x) - \hat{f}(x))$$

$$= \varepsilon + \text{Bias}(x) + (\bar{f}(x) - \hat{f}(x))$$

Squaring and taking $\mathbb{E}_S[\cdot]$, the cross terms vanish: $\mathbb{E}[\varepsilon \cdot (\bar{f}-\hat{f})] = 0$ since $\varepsilon$ is independent of $S$ and $\mathbb{E}[\varepsilon] = 0$; similarly $\text{Bias}(x) \cdot \mathbb{E}_S[\bar{f}-\hat{f}] = 0$ since $\bar{f}(x)$ is deterministic and $\mathbb{E}_S[\hat{f}] = \bar{f}$. So:

$$\mathbb{E}_S[(Y-\hat{f})^2] = \sigma^2 + \text{Bias}(x)^2 + \mathbb{E}_S[(\bar{f}-\hat{f})^2] \qquad \square$$

The Classical Implication

Bias measures systematic error — how far the average prediction is from the truth. Simple models (e.g., linear) have high bias if the truth is nonlinear.

Variance measures sensitivity to the training set — how much $\hat{f}$ fluctuates across different datasets. Complex models (e.g., high-degree polynomials) fit the training data tightly, so they fluctuate wildly.

As model complexity increases:

  • Bias decreases (the model can approximate more complex functions)
  • Variance increases (the model fits noise in the training data)

The sum produces a U-shaped risk curve: test error is high for very simple models (high bias), decreases as complexity grows (bias reduction wins), then increases again (variance takes over). The optimal model sits at the bottom of the U.

This is Occam's Razor formalized: prefer the simplest model consistent with the data. In classical statistics, this principle has a rigorous theoretical foundation.
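The decomposition can be checked numerically. The sketch below is my own illustration, not from the course: `true_f`, the sample size, and the noise level are arbitrary choices. It estimates bias² and variance for polynomial fits of increasing degree by refitting on many resampled training sets:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)

def bias_variance(degree, n=30, trials=500, sigma=0.3):
    """Monte Carlo estimate of bias^2 and variance over fresh training sets."""
    x_test = np.linspace(0.05, 0.95, 50)     # fixed test grid
    preds = np.empty((trials, x_test.size))
    for t in range(trials):
        x = rng.uniform(0, 1, n)
        y = true_f(x) + sigma * rng.normal(size=n)
        coeffs = np.polyfit(x, y, degree)    # least-squares polynomial fit
        preds[t] = np.polyval(coeffs, x_test)
    mean_pred = preds.mean(axis=0)           # approximates E_S[f_hat(x)]
    bias2 = np.mean((true_f(x_test) - mean_pred) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

for deg in (1, 3, 9):
    b2, v = bias_variance(deg)
    print(f"degree {deg}: bias^2 = {b2:.3f}, variance = {v:.3f}")
```

Degree 1 shows high bias and low variance; degree 9 the reverse: the two ends of the U.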


2. What Actually Happens with Neural Networks

The Interpolation Threshold

Define the interpolation threshold as the model complexity at which the model can exactly fit all $n$ training points — zero training loss. In classical terms, this is the worst possible point: maximum variance, and the U-curve predicts maximum test error here.

In practice, something different happens.

As model size increases past the interpolation threshold, test error does peak — confirming the classical prediction. But for sufficiently overparameterized models, test error descends again, often reaching lower values than any model on the classical U-curve.

This is the double descent curve:

Test
Error
  |      classical          second
  |       regime            descent
  |
  |    ╲
  |     ╲        ╱╲
  |      ╲      ╱  ╲
  |       ╲____╱    ╲
  |                  ╲________
  |               ↑
  |         interpolation
  |           threshold
  |_________________________________
                Model size / complexity

The phenomenon was named and systematized by Belkin et al. (2019), though related observations go back to the 1990s in ensemble methods and kernel machines.

Empirical Reality

On standard benchmarks (CIFAR-10, ImageNet):

  • ResNet models with increasing width show classic double descent as width grows
  • Test error at the interpolation threshold can be 10–20% worse than both smaller and larger models
  • Sufficiently large models achieve test error comparable to or better than the classical optimum
  • Training longer (more epochs) also produces double descent as a function of training time

This is not noise or a quirk of specific architectures. Double descent appears across datasets, architectures, and training procedures. It is a structural feature of modern machine learning.


3. Benign Overfitting

Double descent suggests that interpolating the training data is not inherently harmful. Benign overfitting makes this precise.

Definition. An estimator $\hat{f}$ exhibits benign overfitting if:

  1. $\hat{f}$ interpolates the training data: $\hat{f}(x_i) = y_i$ for all $i$ (zero training error)
  2. $\hat{f}$ achieves near-optimal test error

The Linear Case: Minimum-Norm Interpolation

Consider linear regression with $d > n$: the system $Xw = y$ (where $X \in \mathbb{R}^{n \times d}$) is underdetermined — infinitely many $w$ achieve zero training loss.

Among all interpolating solutions, the minimum-norm interpolator is:

$$w_{MN} = \arg\min_{w\,:\,Xw=y} \|w\|_2 = X^T(XX^T)^{-1}y = X^\dagger y$$

where $X^\dagger$ is the Moore-Penrose pseudoinverse. This is what gradient descent on a zero-initialized linear model finds (we will prove this in Post 10).
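That last claim is easy to sanity-check numerically (a sketch I added; the problem sizes are arbitrary): run plain gradient descent on the squared loss from $w = 0$ and compare against the pseudoinverse solution.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 10, 50                          # underdetermined: d > n
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# gradient descent on (1/2)||Xw - y||^2, started at zero
w = np.zeros(d)
lr = 1.0 / np.linalg.norm(X, 2) ** 2   # step size 1/L for this quadratic
for _ in range(20000):
    w -= lr * X.T @ (X @ w - y)

w_mn = np.linalg.pinv(X) @ y           # minimum-norm interpolator

print(np.linalg.norm(X @ w - y))       # training residual: essentially zero
print(np.linalg.norm(w - w_mn))        # distance to min-norm solution: essentially zero
```

Starting from zero matters: gradient updates always lie in the row space of $X$, so the iterates can never pick up a null-space component.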

When does $w_{MN}$ generalize? Suppose the true model is $w^* \in \mathbb{R}^d$ and $y = Xw^* + \varepsilon$. The estimation error of $w_{MN}$ decomposes as:

$$\mathbb{E}[\|w_{MN} - w^*\|^2] = \mathbb{E}[\|X^\dagger \varepsilon\|^2] + \|(I - X^\dagger X)w^*\|^2$$

The first term is the variance (noise amplification by $X^\dagger$). The second is the bias (the component of $w^*$ in the null space of $X$, which $w_{MN}$ cannot recover).

Key insight for benign overfitting: when $d \gg n$ and the rows of $X$ are isotropic (e.g., i.i.d. Gaussian in dimension $d$), the smallest nonzero singular value of $X$ is of order $\sqrt{d} - \sqrt{n}$, so every $\sigma_i^{-2}$ is $O(1/d)$ and the variance term scales as:

$$\mathbb{E}[\|X^\dagger \varepsilon\|^2] = \sigma^2 \|X^\dagger\|_F^2 \approx \sigma^2 \cdot \frac{n}{d-n} \to 0 \text{ as } d \to \infty$$

When $d \gg n$, the noise is spread across many directions and the minimum-norm solution absorbs it harmlessly. The bias term is more delicate: for a fixed $w^*$ and an isotropic random $X$, the null-space component has expected squared norm $(1 - n/d)\|w^*\|^2$, which does not vanish on its own. Fully benign overfitting therefore requires additional structure, e.g. an anisotropic covariance whose high-variance directions carry the signal, as in the analysis of Bartlett et al. (2020).

The intuition: in very high dimensions, there are many "orthogonal directions" available to absorb the noise without corrupting the signal directions. The minimum-norm interpolator exploits this — it fits the noise in directions irrelevant to the true signal.


4. Rademacher Complexity and Its Failure

Definition

The Rademacher complexity of a function class $\mathcal{F}$ on a sample $S = \{x_1, \ldots, x_n\}$ is:

$$\mathcal{R}_n(\mathcal{F}) = \mathbb{E}_{\sigma}\left[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \sigma_i f(x_i)\right]$$

where $\sigma_i \sim \text{Uniform}(\{-1, +1\})$ i.i.d. are Rademacher variables (random signs).

Intuition. Rademacher complexity measures how well functions in $\mathcal{F}$ can correlate with random $\pm 1$ labels. If $\mathcal{F}$ is rich enough to match any sign pattern (with outputs in $[-1, 1]$), the supremum is $1$ and $\mathcal{R}_n = 1$. If $\mathcal{F}$ is restricted (e.g., linear functions with small norm), the supremum is controlled.

High Rademacher complexity = the class can fit noise = the model is likely to overfit.
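For the norm-bounded linear class, the supremum inside the expectation has a closed form, which makes a Monte Carlo estimate easy. This is a sketch I added (dimensions and sample sizes are arbitrary): for $\mathcal{F} = \{x \mapsto \langle w, x\rangle : \|w\|_2 \le B\}$, Cauchy-Schwarz gives $\sup_w \frac{1}{n}\sum_i \sigma_i \langle w, x_i\rangle = \frac{B}{n}\,\|\sum_i \sigma_i x_i\|_2$.

```python
import numpy as np

rng = np.random.default_rng(3)

def rademacher_linear(X, B=1.0, trials=2000):
    """Monte Carlo estimate of R_n for {x -> <w, x> : ||w||_2 <= B}.
    For each sign draw s, the sup over the class equals (B/n) * ||sum_i s_i x_i||_2,
    attained at w proportional to sum_i s_i x_i."""
    n = X.shape[0]
    vals = [B / n * np.linalg.norm(rng.choice([-1.0, 1.0], size=n) @ X)
            for _ in range(trials)]
    return float(np.mean(vals))

d = 20
for n in (50, 200, 800):
    X = rng.normal(size=(n, d)) / np.sqrt(d)   # rows with norm ~ 1
    print(f"n = {n:3d}   R_n estimate = {rademacher_linear(X):.4f}")
```

The estimate decays like $1/\sqrt{n}$, matching the classical bound $\mathcal{R}_n \le B \max_i \|x_i\| / \sqrt{n}$. A class rich enough to fit every sign pattern, by contrast, stays pinned near $1$ no matter how large $n$ gets.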

The Generalization Bound

Theorem. With probability $\geq 1-\delta$ over the draw of the training set $S$, for every $f \in \mathcal{F}$:

$$L(f) \leq \hat{L}(f) + 2\mathcal{R}_n(\mathcal{F}) + 3\sqrt{\frac{\log(2/\delta)}{2n}}$$

where $L(f)$ is test loss and $\hat{L}(f)$ is training loss.

Proof sketch. The proof applies McDiarmid's inequality to the supremum of the empirical process $\sup_f (L(f) - \hat{L}(f))$, symmetrized using Rademacher variables. The resulting bound on $\mathbb{E}[\sup_f (L(f) - \hat{L}(f))]$ is $2\mathcal{R}_n(\mathcal{F})$. The high-probability bound follows from concentration. $\square$

The Paradox

For overparameterized neural networks, $\mathcal{R}_n(\mathcal{F}) \approx 1$. Networks with enough parameters can fit any labeling of $n$ points — this has been verified empirically (Zhang et al., 2017: networks trained on random labels achieve zero training error). So the generalization bound becomes:

$$L(f) \leq \hat{L}(f) + 2 + O\!\left(\sqrt{\frac{\log(1/\delta)}{n}}\right)$$

Since losses are typically in $[0, 1]$ and an interpolating network has $\hat{L}(f) = 0$, this says $L(f) \lesssim 2$ — completely vacuous. It says nothing. Yet the networks generalize, with test error around $5\%$ on CIFAR-10.

Classical Rademacher complexity theory fails to explain deep learning generalization. The bound is tight in the worst case (there exist networks that memorize random labels), but real networks trained on real data find solutions with much better generalization than the worst case. The classical theory has no way to distinguish these cases.


5. What Actually Explains Generalization?

The classical framework asks: what is the complexity of the function class? Modern evidence suggests this is the wrong question. The right question is: which solution does the optimizer find?

Among all networks that interpolate the training data (and there are infinitely many), gradient descent finds a specific one. That specific solution tends to generalize well. The generalization is not a property of the architecture or the function class — it is a property of the optimization trajectory.

This realization motivates the next post. Gradient descent has an implicit bias — a hidden preference for particular solutions among all minimizers of the training loss. For linear models and matrix factorization, this bias can be characterized exactly. For neural networks, it remains an active area of research, but the outlines are becoming clear.

The modern view: deep learning works not despite overparameterization but because of it. Overparameterization creates a large manifold of zero-loss solutions, and gradient descent navigates this manifold in a structured way that happens to favor generalizing solutions. Regularization is implicit, not explicit.


Part of the Mathematics of Data series — mathematical notes on EE-556 at EPFL.