Mathematics of Data
Post 9 ended with a question: among the infinitely many solutions that interpolate the training data, which one does gradient descent find?
The answer is not random. Gradient descent has a systematic preference — an implicit bias — toward specific solutions that depends on the architecture, the loss, and the optimizer. Remarkably, this bias often aligns with solutions that generalize well, and it requires no explicit regularization to produce.
This post proves two concrete results. First: gradient descent on logistic loss with linearly separable data converges in direction to the hard-margin SVM solution. Second: gradient descent on matrix factorization implicitly minimizes the nuclear norm. Both results say the same thing in different languages — the optimizer is secretly solving a regularized problem you never wrote down.
Linear regression. When $n < d$, the system $Xw = y$ is underdetermined. The set of solutions is an affine subspace:

$$\{\, w \in \mathbb{R}^d : Xw = y \,\} = w_0 + \mathrm{null}(X),$$

where $w_0$ is any particular solution.
Every point in this affine subspace achieves zero training loss. There is no unique answer.
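A quick numerical illustration (invented toy data, numpy only): adding any null-space direction of $X$ to one interpolating solution gives another, while the pseudoinverse singles out the minimum-norm member of the family.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 3, 8                        # underdetermined: fewer equations than unknowns
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w_min = np.linalg.pinv(X) @ y      # minimum-l2-norm interpolating solution

# Any null-space direction of X can be added without changing the fit.
_, _, Vt = np.linalg.svd(X)
null_dir = Vt[-1]                  # orthogonal to every row of X
w_other = w_min + 2.0 * null_dir

fit_min = np.linalg.norm(X @ w_min - y)
fit_other = np.linalg.norm(X @ w_other - y)
print(fit_min, fit_other, np.linalg.norm(w_min), np.linalg.norm(w_other))
```

Both vectors achieve zero training loss; they differ only in which point of the affine subspace they occupy.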
Neural networks. Any network large enough to interpolate $n$ points has (generically) a $(p - n)$-dimensional manifold of zero-loss solutions, where $p$ is the number of parameters. For modern networks, $p \gg n$, so this manifold is vast.
The question is not whether a solution exists — it is which solution gradient descent finds, and why that solution tends to generalize.
Binary classification on linearly separable data: inputs $x_i \in \mathbb{R}^d$, labels $y_i \in \{-1, +1\}$, and there exists a $w^*$ with $y_i \langle w^*, x_i \rangle > 0$ for all $i$.
We train with logistic loss:

$$L(w) = \frac{1}{n} \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i \langle w, x_i \rangle}\right).$$

When the data is separable, $L(w) \to 0$ requires $\|w\| \to \infty$ — the weights must grow unboundedly to drive the loss to zero. There is no finite minimizer.
The hard-margin SVM finds the max-margin separator:

$$\hat{w}_{\mathrm{SVM}} = \arg\min_{w} \|w\|_2^2 \quad \text{s.t.} \quad y_i \langle w, x_i \rangle \geq 1 \ \text{ for all } i.$$

This is a well-posed convex program with a unique solution (in direction). The margin $1/\|\hat{w}_{\mathrm{SVM}}\|$ is the distance from the decision boundary to the nearest data point.
Theorem (Soudry et al., 2018). Let $w_t$ be the gradient descent iterates on $L(w)$ with any sufficiently small fixed step size. Then:

$$\lim_{t \to \infty} \frac{w_t}{\|w_t\|} = \frac{\hat{w}_{\mathrm{SVM}}}{\|\hat{w}_{\mathrm{SVM}}\|}.$$

Gradient descent converges in direction to the max-margin SVM solution. Not in magnitude — $\|w_t\|$ diverges — but the direction stabilizes to the unique max-margin separator.
Step 1: The loss decays and weights grow. Since the data is separable, GD finds a direction of improvement and drives $L(w_t) \to 0$. Since the only way to reduce the logistic loss to zero with finite-margin data is to grow $\|w\|$, we have $\|w_t\| \to \infty$.
Step 2: The gradient is dominated by support vectors. The gradient of the logistic loss is:

$$\nabla L(w) = -\frac{1}{n} \sum_{i=1}^{n} \frac{y_i x_i}{1 + e^{y_i \langle w, x_i \rangle}}.$$
As $\|w_t\| \to \infty$, for non-support-vector examples (those with large margin $y_i \langle w_t, x_i \rangle$), the term $e^{y_i \langle w_t, x_i \rangle}$ is huge, so the coefficient $1/(1 + e^{y_i \langle w_t, x_i \rangle}) \to 0$ exponentially fast. These examples contribute negligibly to the gradient.
Only examples near the decision boundary — the support vectors — have non-negligible gradient contributions. The gradient concentrates on the support vectors.
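This concentration is easy to see numerically: the per-example weight $1/(1 + e^{y_i \langle w, x_i \rangle})$ dies exponentially in the margin, so once $\|w\|$ is large, any example with above-minimal margin is invisible to the gradient. A sketch (the margin values are made-up numbers):

```python
import numpy as np

# Normalized margins y_i <x_i, w_hat>; the first entry is the support vector.
margins = np.array([1.0, 1.05, 2.0, 5.0])
for scale in [1.0, 10.0, 50.0]:              # stands in for the growing norm ||w_t||
    coef = 1.0 / (1.0 + np.exp(scale * margins))
    share = coef / coef.sum()                # fraction of gradient mass per example
    print(scale, np.round(share, 4))
```

Even a 5% margin gap is enough: at scale 50 the support vector carries essentially all of the gradient.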
Step 3: The KKT conditions emerge. The SVM KKT conditions require the optimal $\hat{w}_{\mathrm{SVM}}$ to be a linear combination of support vectors:

$$\hat{w}_{\mathrm{SVM}} = \sum_{i \in S} \alpha_i y_i x_i, \qquad \alpha_i \geq 0,$$

where $S$ is the set of support vectors. The GD update direction concentrates on support vectors as training progresses. After normalization, the running direction converges to this support-vector combination — which is precisely the direction of $\hat{w}_{\mathrm{SVM}}$.
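The argument can be checked end-to-end on a toy dataset where the max-margin direction is known in closed form. The data below is invented for the sketch: by symmetry the only support vectors are $(1, 2)$ and its mirror image, so the max-margin direction is $(1, 2)/\sqrt{5}$.

```python
import numpy as np

# Separable toy data; (2.5, 4.5) has strictly larger margin and is not a support vector.
X = np.array([[1.0, 2.0], [2.5, 4.5], [-1.0, -2.0], [-2.5, -4.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = np.zeros(2)
lr = 0.1
for _ in range(50_000):
    m = y * (X @ w)                       # per-example margins
    coef = -y / (1.0 + np.exp(m))         # d/d(margin) of log(1 + e^{-margin}), times y
    grad = (coef[:, None] * X).mean(axis=0)
    w -= lr * grad

direction = w / np.linalg.norm(w)
svm_dir = np.array([1.0, 2.0]) / np.sqrt(5.0)
print(direction, np.linalg.norm(w))
```

The norm keeps growing (logarithmically in $t$), but the normalized iterate locks onto the max-margin direction.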
The punchline. Gradient descent on logistic loss, with zero explicit regularization, finds exactly the solution that an $\ell_2$-regularized SVM would find in the limit of zero regularization. The implicit regularizer is $\|w\|_2$. The algorithm solves a non-smooth constrained optimization problem (the SVM) while appearing to solve a smooth unconstrained one (logistic regression).
This result is not just a curiosity: it means the regularization you would otherwise add by hand is already supplied by the optimizer.
Matrix completion: recover a matrix $M \in \mathbb{R}^{n \times m}$ from observed entries $\{M_{ij} : (i, j) \in \Omega\}$. We parameterize $X = UV^\top$ with $U \in \mathbb{R}^{n \times r}$, $V \in \mathbb{R}^{m \times r}$, and minimize:

$$L(U, V) = \sum_{(i,j) \in \Omega} \left( (UV^\top)_{ij} - M_{ij} \right)^2.$$
There are infinitely many pairs $(U, V)$ achieving zero loss (for $r$ large enough). They correspond to infinitely many matrices $X$ with the same values on $\Omega$.
Theorem (Gunasekar et al., 2017). Gradient descent on $L(U, V)$ initialized at $U_0 = \alpha \bar{U}$, $V_0 = \alpha \bar{V}$ (small balanced initialization), in the limit $\alpha \to 0$, converges to:

$$\hat{X} = \arg\min_{X} \|X\|_* \quad \text{s.t.} \quad X_{ij} = M_{ij} \ \text{ for } (i, j) \in \Omega.$$

The implicit regularizer is the nuclear norm $\|X\|_* = \sum_i \sigma_i(X)$ — the matrix analogue of $\|w\|_2$ from the classification case.
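Here is the bias in action on a tiny invented instance (a sketch, not the paper's experiment): a rank-one matrix with one entry hidden. GD on the factors, started small and balanced, fills in the missing entry consistently with the rank-one structure, and the result has smaller nuclear norm than the zero-filled completion.

```python
import numpy as np

rng = np.random.default_rng(0)
M = np.ones((3, 3))                      # rank-1 ground truth
mask = np.ones((3, 3)); mask[2, 2] = 0   # hide one entry

alpha = 1e-3                             # small (approximately balanced) initialization
U = alpha * rng.standard_normal((3, 3))
V = alpha * rng.standard_normal((3, 3))

lr = 0.05
for _ in range(20_000):
    E = mask * (U @ V.T - M)             # residual on observed entries only
    U, V = U - lr * 2 * E @ V, V - lr * 2 * E.T @ U

X = U @ V.T
zero_fill = mask * M                     # minimum-Frobenius-norm completion
nuc = lambda A: np.linalg.svd(A, compute_uv=False).sum()
print(X[2, 2], nuc(X), nuc(zero_fill))
```

The hidden entry comes out near 1 — the value that minimizes the nuclear norm — even though nothing in the loss mentions it.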
Consider the gradient flow equations:

$$\dot{U} = -\nabla_U L(U, V) = -G V, \qquad \dot{V} = -\nabla_V L(U, V) = -G^\top U,$$

where $G = \nabla_X L(X)\big|_{X = UV^\top}$ is the gradient of the loss with respect to the product. Define $X = UV^\top$. The induced flow on $X$ is:

$$\dot{X} = \dot{U} V^\top + U \dot{V}^\top = -G\, V V^\top - U U^\top G.$$
The key observation: the effective "gradient" in $X$-space is preconditioned by $VV^\top$ and $UU^\top$. For the balanced initialization $U_0^\top U_0 = V_0^\top V_0$, the matrices $UU^\top$ and $VV^\top$ are initialized at scale $\alpha^2$.
As $\alpha \to 0$, the initial step size in $X$-space goes to zero — the algorithm starts with infinitesimally small steps. This is equivalent to following the steepest descent path in $X$-space with respect to the Riemannian metric induced by the factorization, which is precisely the geodesic that minimizes path length in nuclear norm. The implicit bias toward minimum nuclear norm emerges from the geometry of the factorization, not from any explicit penalty.
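A sanity check on the balanced structure: under gradient flow, the difference $U^\top U - V^\top V$ is exactly conserved (differentiate it and substitute the flow equations — the two terms cancel), so a balanced start stays balanced. Discrete GD with a small step preserves it approximately. A sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
mask = (rng.random((4, 4)) < 0.6).astype(float)   # random observation pattern

U = 0.1 * rng.standard_normal((4, 2))
V = 0.1 * rng.standard_normal((4, 2))
balance0 = U.T @ U - V.T @ V          # conserved quantity of the flow

lr = 5e-4                             # small step, so GD approximates gradient flow
for _ in range(5_000):
    E = mask * (U @ V.T - M)
    U, V = U - lr * 2 * E @ V, V - lr * 2 * E.T @ U

drift = np.linalg.norm((U.T @ U - V.T @ V) - balance0)
print(drift)
```

The drift is $O(\mathrm{lr}^2)$ per step, so it vanishes in the gradient-flow limit — which is why the balancedness assumption in the theorem is consistent with the dynamics.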
The $\ell_2$ / nuclear norm parallel:

| Classification (vectors) | Matrix factorization |
|---|---|
| $\lVert w_t \rVert$ grows unboundedly; the direction stabilizes | $X_t = U_t V_t^\top$ converges |
| Implicit bias: $\lVert w \rVert_2$ | Implicit bias: $\lVert X \rVert_*$ |
| Equivalent to max-margin SVM | Equivalent to min-nuclear-norm completion |
| $\lVert w \rVert_2^2$ = sum of $w_i^2$ | Nuclear norm = sum of $\sigma_i(X)$ |
The factorization structure of the parameterization directly produces the nuclear norm as the implicit regularizer. If you instead parameterized $X$ directly (no factorization), GD would find the minimum Frobenius norm solution — a different implicit bias, and generally worse for low-rank recovery.
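The contrast with the unfactorized parameterization is immediate: running GD directly on $X$ from zero initialization, the gradient of the observed-entry loss is identically zero at every unobserved entry, so those entries never move — GD returns the zero-filled (minimum Frobenius norm) completion. A sketch on the same kind of toy instance:

```python
import numpy as np

M = np.ones((3, 3))                      # rank-1 ground truth
mask = np.ones((3, 3)); mask[2, 2] = 0   # hide one entry

X = np.zeros((3, 3))                     # direct parameterization, zero init
lr = 0.1
for _ in range(500):
    X -= lr * 2 * mask * (X - M)         # gradient is zero at unobserved entries

print(X)
```

Same loss, same optimizer, different parameterization — and a different implicit bias: the hidden entry stays exactly zero instead of being completed to the low-rank value.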
The two results above are exact. For nonlinear neural networks, the story is more complex.
For a depth-$L$ linear network $f(x) = W_L W_{L-1} \cdots W_1 x$, gradient flow on the squared loss converges to the minimum nuclear norm solution of the end-to-end map $W = W_L \cdots W_1$. This extends the matrix factorization result: the implicit bias is $\|W\|_*$, regardless of depth, as long as initialization is balanced and small.
For nonlinear networks (ReLU, etc.), the characterization is less clean but the spirit is the same. For homogeneous networks (networks where $f(\alpha \theta; x) = \alpha^L f(\theta; x)$ for some $L > 0$), gradient flow on logistic loss converges to a KKT point of the max-margin problem in parameter space:

$$\min_{\theta} \|\theta\|^2 \quad \text{s.t.} \quad y_i f(\theta; x_i) \geq 1 \ \text{ for all } i,$$

where $\|\cdot\|$ is a norm on the parameters. The implicit bias is a norm regularization on the parameters — but which norm depends on the architecture.
This has a practical implication: learning rate tuning is implicitly regularization tuning. When a practitioner adjusts the learning rate to improve generalization, they are adjusting the strength of the implicit regularizer, even if they do not think of it that way.
Implicit bias partially explains the double descent phenomenon from Post 9: past the interpolation threshold, adding parameters enlarges the set of interpolating solutions, and the implicit bias selects interpolants of ever smaller norm — so test error can fall again even as training error stays at zero.
The missing ingredient in classical theory is the optimization algorithm itself. The algorithm is not just a means to minimize the training loss — it is also, simultaneously, a regularization procedure. The regularization is implicit, architecture-dependent, and often well-matched to the structure of natural data.
What remains open: a complete theory of implicit bias for deep nonlinear networks, and a precise characterization of when the implicit bias aligns with good generalization. These are among the most active research questions in theoretical machine learning today.
Part of the Mathematics of Data series — mathematical notes on EE-556 at EPFL.