
Mathematics of Data

Implicit Bias: How the Optimizer Secretly Regularizes Your Model

March 2025 · deep-learning · implicit-bias · optimization · generalization

Post 9 ended with a question: among the infinitely many solutions that interpolate the training data, which one does gradient descent find?

The answer is not random. Gradient descent has a systematic preference — an implicit bias — toward specific solutions that depends on the architecture, the loss, and the optimizer. Remarkably, this bias often aligns with solutions that generalize well, and it requires no explicit regularization to produce.

This post proves two concrete results. First: gradient descent on logistic loss with linearly separable data converges in direction to the hard-margin SVM solution. Second: gradient descent on matrix factorization implicitly minimizes the nuclear norm. Both results say the same thing in different languages — the optimizer is secretly solving a regularized problem you never wrote down.


1. The Setup: Infinitely Many Solutions

Linear regression. When $d > n$, the system $Xw = y$ is underdetermined. The set of solutions is an affine subspace:

$$\mathcal{W} = \{w : Xw = y\} = w_{\text{particular}} + \ker(X)$$

Every point in this affine subspace achieves zero training loss. There is no unique answer.

Neural networks. Any network large enough to interpolate $n$ points has a $(p-n)$-dimensional manifold of zero-loss solutions, where $p$ is the number of parameters. For modern networks, $p \gg n$, so this manifold is vast.

The question is not whether a solution exists — it is which solution gradient descent finds, and why that solution tends to generalize.
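Before the theory, a quick numerical check of which solution gradient descent picks in the underdetermined linear case. This is a sketch with made-up data: GD on the squared loss, initialized at zero, stays in the row space of $X$ and therefore converges to the minimum-$\ell_2$-norm interpolant, i.e. the pseudoinverse solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 3, 10                       # n equations, d unknowns: underdetermined
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Gradient descent on ||Xw - y||^2, initialized at zero.
w = np.zeros(d)
lr = 0.01
for _ in range(50_000):
    w -= lr * 2 * X.T @ (X @ w - y)

# Every iterate is a combination of the rows of X, so GD converges to
# the minimum-l2-norm solution -- the pseudoinverse solution.
w_min_norm = np.linalg.pinv(X) @ y
print(np.allclose(w, w_min_norm, atol=1e-6))   # True
print(np.allclose(X @ w, y, atol=1e-6))        # True: zero training loss
```

Among the infinitely many interpolants, GD selects one very specific point of the affine subspace, without being told to.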


2. Linear Classification: GD Converges to the Max-Margin SVM

Setup

Binary classification on linearly separable data: inputs $x_i \in \mathbb{R}^d$, labels $y_i \in \{-1, +1\}$, and there exists $w^*$ with $y_i \langle w^*, x_i \rangle > 0$ for all $i$.

We train with logistic loss:

$$L(w) = \frac{1}{n}\sum_{i=1}^n \log(1 + e^{-y_i \langle w, x_i \rangle})$$

When the data is separable, $L(w) \to 0$ requires $\|w\| \to \infty$ — the weights must grow unboundedly to drive the loss to zero. There is no finite minimizer.

The hard-margin SVM finds the max-margin separator:

$$\hat{w} = \arg\min_w \|w\|_2 \quad \text{subject to} \quad y_i \langle w, x_i \rangle \geq 1 \;\;\forall i$$

This is a well-posed convex program with a unique solution. The margin $1/\|\hat{w}\|$ is the distance from the decision boundary to the nearest data point.

The Theorem

Theorem (Soudry et al., 2018). Let $w^k$ be the gradient descent iterates on $L(w)$ with any sufficiently small fixed step size. Then:

$$\frac{w^k}{\|w^k\|} \to \frac{\hat{w}}{\|\hat{w}\|} \qquad \text{as } k \to \infty$$

Gradient descent converges in direction to the max-margin SVM solution. Not in magnitude (which diverges), but the direction stabilizes to the unique max-margin separator. The directional convergence is slow, at rate $O(1/\log k)$, so the margin keeps improving long after the training error hits zero.

Proof Sketch

Step 1: The loss decays and weights grow. Since the data is separable, GD finds a direction of improvement and drives $L(w^k) \to 0$. Since the only way to drive the logistic loss to zero on separable data is to grow $\|w\|$, we have $\|w^k\| \to \infty$.

Step 2: The gradient is dominated by support vectors. The gradient of the logistic loss is:

$$\nabla L(w) = -\frac{1}{n}\sum_{i=1}^n \frac{y_i x_i}{1 + e^{y_i\langle w,x_i\rangle}}$$

As $\|w\| \to \infty$, for non-support-vector examples $i$ (those with large margin $y_i\langle w, x_i\rangle \gg 0$), the term $e^{y_i\langle w, x_i\rangle}$ is huge, so $\frac{1}{1+e^{y_i\langle w,x_i\rangle}} \approx 0$. These examples contribute negligibly to the gradient.

Only examples near the decision boundary — the support vectors — have non-negligible gradient contributions. The gradient concentrates on the support vectors.
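This concentration is easy to check numerically. A minimal sketch with made-up margins (a support vector at margin $m = 2$ and a far example at $m = 4$, both scaled by a growing factor $s$ standing in for $\|w\|$):

```python
import numpy as np

# Per-example gradient weight 1 / (1 + e^{y_i <w, x_i>}) for a support
# vector (margin 2) and a far example (margin 4), as the scale s grows.
for s in [1.0, 5.0, 10.0]:
    weight_sv = 1.0 / (1.0 + np.exp(2 * s))   # support vector's weight
    weight_far = 1.0 / (1.0 + np.exp(4 * s))  # far example's weight
    print(s, weight_far / weight_sv)          # ratio ~ e^{-2s} -> 0
```

By $s = 10$ the far example contributes about $10^{-9}$ times as much as the support vector: the gradient is, for all practical purposes, a combination of support vectors alone.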

Step 3: The KKT conditions emerge. The SVM KKT conditions require the optimal $\hat{w}$ to be a linear combination of the support vectors $\mathcal{S}$:

$$\hat{w} = \sum_{i \in \mathcal{S}} \alpha_i y_i x_i, \quad \alpha_i \geq 0$$

The GD update direction $-\nabla L(w^k)$ concentrates on the support vectors as training progresses. After normalization, the running direction converges to this support-vector combination — which is precisely $\hat{w}/\|\hat{w}\|$.

The punchline. Gradient descent on logistic loss, with zero explicit regularization, finds exactly the solution that $\ell_2$-regularized logistic regression would find in the limit of vanishing regularization. The implicit regularizer is $\|w\|_2$. The algorithm solves a non-smooth constrained optimization problem (the SVM) while appearing to solve a smooth unconstrained one (logistic regression).
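The directional convergence can be watched on a hand-built toy dataset (a sketch, nothing beyond NumPy assumed): the points $\pm(1,1)$ are the support vectors and $\pm(3,1)$ lie strictly outside the margin, so the max-margin direction is $(1,1)/\sqrt{2}$.

```python
import numpy as np

# Separable toy data: support vectors +-(1,1), far points +-(3,1).
# The hard-margin SVM direction is (1,1)/sqrt(2).
X = np.array([[1.0, 1.0], [3.0, 1.0], [-1.0, -1.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = np.zeros(2)
lr = 0.5
for _ in range(100_000):
    margins = y * (X @ w)
    # gradient of the mean logistic loss
    grad = -(y / (1.0 + np.exp(margins))) @ X / len(y)
    w -= lr * grad

direction = w / np.linalg.norm(w)
svm_direction = np.array([1.0, 1.0]) / np.sqrt(2)
print(direction @ svm_direction)   # cosine similarity -> 1
```

The norm $\|w\|$ keeps growing (logarithmically), but the normalized direction locks onto the max-margin separator; the far points $\pm(3,1)$ stop influencing the trajectory almost immediately.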

Why This Matters

This result is not just a curiosity. It says:

  • Learning rate does not change the implicit bias (only the speed at which it kicks in)
  • The same separating hyperplane emerges regardless of initialization (as long as the data is separable)
  • Maximum margin = maximum robustness to perturbations — the SVM solution is the most geometrically stable classifier, which is why it generalizes well

3. Matrix Factorization: GD Converges to Minimum Nuclear Norm

Setup

Matrix completion: recover a matrix $M \in \mathbb{R}^{m \times n}$ from the entries observed on a set $\Omega$. We parameterize $W = UV^\top$ with $U \in \mathbb{R}^{m \times r}$, $V \in \mathbb{R}^{n \times r}$, and minimize:

$$L(U, V) = \|P_\Omega(UV^\top - M)\|_F^2$$

There are infinitely many pairs $(U,V)$ achieving zero loss (for $r$ large enough). They correspond to infinitely many matrices $W = UV^\top$ that agree with $M$ on $\Omega$.

The Theorem

Theorem (Gunasekar et al., 2017). Gradient descent on $L(U,V)$ initialized at $U_0 = \epsilon I$, $V_0 = \epsilon I$ (small balanced initialization), in the limit $\epsilon \to 0$, converges to:

$$W^* = \arg\min_W \|W\|_* \quad \text{subject to} \quad P_\Omega(W) = P_\Omega(M)$$

The implicit regularizer is the nuclear norm $\|W\|_*$ — the matrix analogue of $\|w\|_2$ from the classification case. (Gunasekar et al. proved this when the measurement operators commute and conjectured the general statement, with strong empirical support.)

Proof Idea

Consider gradient flow on the factors, writing $R = P_\Omega(UV^\top - M)$ for the residual on the observed entries:

$$\dot{U} = -\nabla_U L = -2RV, \qquad \dot{V} = -\nabla_V L = -2R^\top U$$

Define $W(t) = U(t)V(t)^\top$. The induced flow on $W$ is:

$$\dot{W} = \dot{U}V^\top + U\dot{V}^\top = -2\left(RVV^\top + UU^\top R\right)$$

The key observation: the residual $R$ does not drive $W$ directly; it is preconditioned on the right by $VV^\top$ and on the left by $UU^\top$. For the balanced initialization $U_0 = V_0 = \epsilon I$, both $UU^\top$ and $VV^\top$ start at $\epsilon^2 I$.

As $\epsilon \to 0$, the dynamics in $W$-space therefore start infinitesimally slowly, and the singular values of $W$ grow sequentially: a singular direction can only move fast once it has accumulated mass in $UU^\top$ and $VV^\top$. The trajectory greedily fits the observations with as few singular directions as possible, and in the small-initialization limit this selects the interpolant of minimum nuclear norm. The implicit bias emerges from the geometry of the factorization, not from any explicit penalty.

The $\ell_2$ / nuclear norm parallel:

| Classification (vectors) | Matrix factorization |
|---|---|
| $\Vert w\Vert$ grows unboundedly | $W = UV^\top$, direction stabilizes |
| Implicit bias: $\min \Vert w\Vert_2$ | Implicit bias: $\min \Vert W\Vert_*$ |
| Equivalent to max-margin SVM | Equivalent to min-nuclear-norm completion |
| $\ell_2$ norm: $\big(\sum_i w_i^2\big)^{1/2}$ | Nuclear norm: $\sum_i \sigma_i(W)$ |

The factorization structure of the parameterization directly produces the nuclear norm as the implicit regularizer. If you instead parameterized $W$ directly (no factorization), GD would find the minimum Frobenius norm solution — a different implicit bias, and generally worse for low-rank recovery.
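The contrast is easy to reproduce on a toy completion problem (a sketch; the rank-1 matrix, hidden-entry positions, step sizes, and iteration counts are all ad hoc choices). Direct parameterization from zero leaves the hidden entries at the Frobenius-minimizing value $0$, while the factorized parameterization with small initialization fills them in with approximately the rank-1 values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Rank-1 ground truth M[i,j] = u_i v_j; hide two entries.
M = np.outer([1.0, 2.0, 3.0], [1.0, 1.0, 2.0])
mask = np.ones((3, 3))
mask[0, 2] = 0.0   # hidden entry, true value 2
mask[2, 0] = 0.0   # hidden entry, true value 3

# (a) Direct parameterization from W = 0: the gradient never touches
# unobserved entries, so they stay exactly 0 (min-Frobenius interpolant).
W = np.zeros((3, 3))
for _ in range(5_000):
    W -= 0.1 * 2 * mask * (W - M)
print(W[0, 2], W[2, 0])            # 0.0 0.0

# (b) Factorized W = U V^T from a small near-balanced initialization:
# the implicit low-rank bias fills in the hidden entries.
U = 0.01 * rng.standard_normal((3, 2))
V = 0.01 * rng.standard_normal((3, 2))
lr = 0.01
for _ in range(50_000):
    R = mask * (U @ V.T - M)       # residual on observed entries
    U, V = U - lr * 2 * R @ V, V - lr * 2 * R.T @ U
W_fac = U @ V.T
print(W_fac[0, 2], W_fac[2, 0])    # approximately 2 and 3
```

Both runs fit the observed entries perfectly; they disagree only on the entries the loss never sees, which is exactly where the implicit bias lives.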


4. Neural Networks: What is the Implicit Bias?

The two results above are exact. For nonlinear neural networks, the story is more complex.

Deep Linear Networks

For a depth-$L$ linear network $W = W_L W_{L-1} \cdots W_1$, gradient flow on the squared loss with small balanced initialization is again biased toward low-rank end-to-end maps $W$. Depth 2 recovers the min-nuclear-norm result above, and greater depth strengthens the low-rank tendency — though Arora et al. (2019) argue that for $L \geq 3$ the bias is toward low rank itself rather than any single norm.

Nonlinear Networks

For nonlinear networks (ReLU, etc.), the characterization is less clean but the spirit is the same. For homogeneous networks — those satisfying $f(x; \lambda\theta) = \lambda^L f(x; \theta)$ for some degree $L$, which includes ReLU networks without biases — gradient flow on logistic loss converges to a KKT point of a max-margin problem in parameter space (Lyu & Li, 2020):

$$\hat{\theta} = \arg\min_\theta \|\theta\|_2 \quad \text{subject to} \quad y_i f(x_i; \theta) \geq 1 \;\;\forall i$$

The implicit bias is a norm regularization on the parameters — but what this means in function space depends on the architecture.

What Changes the Implicit Bias?

  • Architecture: depth of factorization determines which norm is implicit
  • Loss function: cross-entropy → max-margin; squared loss → min-norm interpolation
  • Optimizer: Adam has a different implicit bias from GD. Adam normalizes gradients coordinate-wise, behaving more like sign descent, which biases toward solutions favored by the $\ell_\infty$ geometry rather than the $\ell_2$ geometry
  • Step size: larger step size → stronger implicit bias toward low-norm solutions (confirmed empirically for discrete GD)
  • Initialization: small initialization strengthens the implicit bias; large initialization weakens it

This last point has a practical implication: learning rate tuning is implicitly regularization tuning. When a practitioner adjusts the learning rate to improve generalization, they are adjusting the strength of the implicit regularizer, even if they do not think of it that way.
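The optimizer-dependence can be seen on a single pair of points (a toy sketch; sign gradient descent is used here as a rough stand-in for Adam's per-coordinate normalization). On the data $\pm(2,1)$, plain GD's gradient is always a multiple of $(2,1)$, so its direction is exactly the $\ell_2$ max-margin direction $(2,1)/\sqrt{5}$; signGD's update is always $\pm(1,1)$, so it converges in direction to $(1,1)/\sqrt{2}$.

```python
import numpy as np

X = np.array([[2.0, 1.0], [-2.0, -1.0]])
y = np.array([1.0, -1.0])

def logistic_grad(w):
    margins = y * (X @ w)
    return -(y / (1.0 + np.exp(margins))) @ X / len(y)

w_gd = np.zeros(2)
w_sign = np.zeros(2)
for _ in range(1_000):
    w_gd -= 0.1 * logistic_grad(w_gd)              # plain GD
    w_sign -= 0.1 * np.sign(logistic_grad(w_sign)) # sign descent

d_gd = w_gd / np.linalg.norm(w_gd)         # -> (2,1)/sqrt(5)
d_sign = w_sign / np.linalg.norm(w_sign)   # -> (1,1)/sqrt(2)
print(d_gd @ d_sign)                       # cos ~ 0.95: different separators
```

Same data, same loss, zero training error for both — yet the two optimizers commit to different separating hyperplanes. The choice of optimizer is part of the model specification.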


5. The Full Picture

Implicit bias explains (partially) the double descent phenomenon from Post 9:

  • Classical theory assumes the function class determines generalization — and for large classes, it predicts poor generalization
  • Modern reality: the optimizer selects a specific element of the function class — the one with minimum norm, maximum margin, or minimum nuclear norm
  • That selected element often has far lower complexity than the worst-case element of the class, which is what Rademacher complexity measures

The missing ingredient in classical theory is the optimization algorithm itself. The algorithm is not just a means to minimize the training loss — it is also, simultaneously, a regularization procedure. The regularization is implicit, architecture-dependent, and often well-matched to the structure of natural data.

What remains open: a complete theory of implicit bias for deep nonlinear networks, and a precise characterization of when the implicit bias aligns with good generalization. These are among the most active research questions in theoretical machine learning today.


Part of the Mathematics of Data series — mathematical notes on EE-556 at EPFL.