Mathematics of Data
Post 2 proved that gradient descent converges at rate $O(1/t)$ for $L$-smooth convex functions. A natural question: is this the best any gradient-based algorithm can do?
The answer is no — and the gap is not small. There is a provable lower bound of $\Omega(1/t^2)$ for first-order methods, meaning GD is off by a factor of $t$. Nesterov's Accelerated Gradient Descent (AGD), published in 1983, achieves the $O(1/t^2)$ rate with the same oracle cost per step as GD. This post explains why the gap exists, how AGD closes it, and what the acceleration buys in practice.
Before presenting the algorithm, we need to understand what we are optimizing against.
Theorem (Nesterov's Lower Bound): For any first-order method and any $t \le (d-1)/2$, there exists an $L$-smooth convex function $f$ on $\mathbb{R}^d$ such that after $t$ steps starting from $x^0$:

$$f(x^t) - f^\star \;\ge\; \frac{3L\,\|x^0 - x^\star\|^2}{32\,(t+1)^2}.$$
This is an $\Omega(1/t^2)$ lower bound. GD achieves only $O(1/t)$. There is a quadratic gap.
The proof constructs an explicit "hard" function — a tridiagonal quadratic — whose structure means any first-order method must "explore" one new dimension per step, and the global minimum is always in a direction not yet explored. The key insight is:
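To make the "one new dimension per step" behavior concrete, here is a small numerical sketch of one standard variant of the hard function: a tridiagonal quadratic (the exact scaling constants in Nesterov's construction differ). Starting from the origin, the gradient is supported on the first coordinate, and each step can light up at most one new coordinate:

```python
import numpy as np

# A tridiagonal "hard" quadratic: f(x) = (L/8) * (x^T A x - 2 x[0]),
# with A = tridiag(-1, 2, -1). Because A is tridiagonal, the gradient at a
# point supported on the first k coordinates is supported on the first k+1.
d, L = 20, 1.0
A = 2 * np.eye(d) - np.eye(d, k=1) - np.eye(d, k=-1)

def grad(x):
    return (L / 4) * (A @ x - np.eye(d)[0])

x = np.zeros(d)
support = []
for t in range(5):
    x = x - (1 / L) * grad(x)          # plain gradient descent step
    support.append(int(np.count_nonzero(x)))

print(support)  # → [1, 2, 3, 4, 5]: each step unlocks exactly one coordinate
```

Meanwhile the minimizer has all $d$ coordinates nonzero, so after $t$ steps the method has simply not yet seen most of the directions it needs.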
GD is memoryless. At step $t$, GD only uses $\nabla f(x^t)$. It discards all information from steps $0, 1, \dots, t-1$. A smarter algorithm would accumulate and use that history.
This is the intuition behind acceleration. If you remember where you came from, you can exploit momentum — move not just in the direction the gradient points, but also continue in the direction you were already traveling.
AGD maintains two sequences: the iterates $x^t$ and a momentum point $y^t$:

$$x^{t} = y^{t} - \frac{1}{L}\nabla f(y^{t}), \qquad y^{t+1} = x^{t} + \frac{\lambda_t - 1}{\lambda_{t+1}}\,\big(x^{t} - x^{t-1}\big),$$

with $y^1 = x^0$, $\lambda_1 = 1$, and $\lambda_{t+1} = \frac{1 + \sqrt{1 + 4\lambda_t^2}}{2}$.

Compared to GD — which just does $x^{t+1} = x^t - \frac{1}{L}\nabla f(x^t)$ — AGD adds one extra step: before taking the gradient, it moves to a momentum point that overshoots the current position in the direction of the previous step.
The momentum coefficient $\beta_t = \frac{\lambda_t - 1}{\lambda_{t+1}}$ starts at zero and grows toward 1 as $t \to \infty$. Early in training, the method barely differs from GD. As it progresses and the iterates become more directional, momentum plays an increasingly large role.
Cost per iteration: One gradient evaluation — identical to GD. No extra oracle calls.
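The updates above fit in a few lines. A minimal implementation sketch, assuming a gradient oracle `grad` and a known smoothness constant `L` (the $\lambda$-schedule is the standard one):

```python
import numpy as np

def agd(grad, x0, L, steps):
    """Nesterov's accelerated gradient descent for L-smooth convex f (sketch)."""
    x = np.asarray(x0, dtype=float)   # current iterate x^t
    y = x.copy()                      # momentum point, y^1 = x^0
    lam = 1.0                         # lambda_1 = 1
    for _ in range(steps):
        x_next = y - grad(y) / L                             # gradient step from y
        lam_next = (1 + np.sqrt(1 + 4 * lam**2)) / 2         # lambda recursion
        y = x_next + ((lam - 1) / lam_next) * (x_next - x)   # momentum overshoot
        x, lam = x_next, lam_next
    return x

# Sanity check on an ill-conditioned quadratic f(x) = 0.5 * x^T diag(1, 100) x
H = np.array([1.0, 100.0])
x_out = agd(lambda x: H * x, [1.0, 1.0], L=100.0, steps=500)
```

One gradient evaluation per loop iteration, exactly as in GD; the extra vector arithmetic is negligible.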
Convergence:

$$f(x^t) - f^\star \;\le\; \frac{2L\,\|x^0 - x^\star\|^2}{(t+1)^2}.$$
This matches the lower bound up to a constant factor. AGD is optimal among all first-order methods for smooth convex functions.
How do we prove the rate? The standard approach uses a Lyapunov function — an "energy" quantity that is non-increasing along the algorithm's trajectory and whose initial value controls the final error.
Define the energy function:

$$E_t \;=\; \frac{2}{L}\,\lambda_t^2\,\big(f(x^t) - f^\star\big) \;+\; \|u^t - x^\star\|^2,$$

where $u^t$ is an auxiliary sequence defined by $u^t = x^{t-1} + \lambda_t\,(x^t - x^{t-1})$.
Claim: $E_{t+1} \le E_t$ for all $t \ge 1$.
Proof sketch. We need to show:

$$\frac{2}{L}\,\lambda_{t+1}^2\,\big(f(x^{t+1}) - f^\star\big) - \frac{2}{L}\,\lambda_t^2\,\big(f(x^t) - f^\star\big) \;\le\; \|u^t - x^\star\|^2 - \|u^{t+1} - x^\star\|^2.$$
Term 1: Bound $f(x^{t+1})$. By the descent lemma applied at $y^{t+1}$:

$$f(x^{t+1}) \;\le\; f(y^{t+1}) - \frac{1}{2L}\,\|\nabla f(y^{t+1})\|^2.$$
Term 2: Expand $\|u^{t+1} - x^\star\|^2$. By the update rules, $u^{t+1} = u^t - \frac{\lambda_{t+1}}{L}\nabla f(y^{t+1})$, so:

$$\|u^{t+1} - x^\star\|^2 \;=\; \|u^t - x^\star\|^2 - \frac{2\lambda_{t+1}}{L}\,\big\langle \nabla f(y^{t+1}),\, u^t - x^\star\big\rangle + \frac{\lambda_{t+1}^2}{L^2}\,\|\nabla f(y^{t+1})\|^2.$$
Term 3: Use convexity. From the first-order convexity condition, for any $x$:

$$f(x) \;\ge\; f(y^{t+1}) + \big\langle \nabla f(y^{t+1}),\, x - y^{t+1}\big\rangle.$$

Rearranging: $f(y^{t+1}) - f(x) \le \langle \nabla f(y^{t+1}),\, y^{t+1} - x\rangle$, applied at both $x = x^t$ and $x = x^\star$.
Combining these three steps with the specific choice of momentum coefficient (the recursion for $\lambda_{t+1}$ gives $\lambda_{t+1}^2 - \lambda_{t+1} = \lambda_t^2$, which ensures $y^{t+1}$ lies on the right convex combination of $x^t$ and $u^t$, namely $y^{t+1} = (1 - \tfrac{1}{\lambda_{t+1}})\,x^t + \tfrac{1}{\lambda_{t+1}}\,u^t$) and carefully tracking the algebra, one shows that all cross-terms cancel and $E_{t+1} \le E_t$.
Once we have $E_{t+1} \le E_t$ for all $t$, the bound follows immediately:

$$\frac{2}{L}\,\lambda_t^2\,\big(f(x^t) - f^\star\big) \;\le\; E_t \;\le\; E_1 \;\le\; \|x^0 - x^\star\|^2,$$

where the last inequality is the standard one-step GD bound applied to $x^1 = x^0 - \frac{1}{L}\nabla f(x^0)$.
Dividing, and using $\lambda_t \ge \frac{t+1}{2}$ (by induction on the recursion): $f(x^t) - f^\star \le \frac{L\,\|x^0 - x^\star\|^2}{2\lambda_t^2} \le \frac{2L\,\|x^0 - x^\star\|^2}{(t+1)^2}$.
Why does this work? The Lyapunov function is not the function value — it mixes the function value with a distance term involving $u^t$. This coupling is what makes the proof go through. The proof technique of finding a Lyapunov function and showing it decreases is general: it appears in control theory, dynamical systems, and throughout optimization analysis. When you see a convergence proof, look for the Lyapunov function — it is the heart of the argument.
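The claim is also easy to check numerically. The sketch below runs AGD on a small quadratic (chosen so that $x^\star = 0$ and $f^\star = 0$) and records $E_t$ at every step; the sequence should never increase beyond floating-point noise:

```python
import numpy as np

# f(x) = 0.5 * x^T diag(1, 10) x, so x* = 0, f* = 0, and L = 10.
H = np.array([1.0, 10.0])
L = 10.0
f = lambda x: 0.5 * np.sum(H * x**2)
grad = lambda x: H * x

x_prev = np.array([1.0, 1.0])   # x^0
y = x_prev.copy()               # y^1 = x^0
lam = 1.0                       # lambda_1
energies = []
for _ in range(50):
    x = y - grad(y) / L                          # x^t: gradient step from y^t
    u = x_prev + lam * (x - x_prev)              # auxiliary point u^t
    energies.append((2 / L) * lam**2 * f(x) + np.sum(u**2))  # E_t
    lam_next = (1 + np.sqrt(1 + 4 * lam**2)) / 2
    y = x + ((lam - 1) / lam_next) * (x - x_prev)
    x_prev, lam = x, lam_next

print(max(np.diff(energies)))  # should be <= 0, up to rounding
```

Note that the function-value term alone would not be monotone (see the next section); it is the coupled energy that decreases.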
There is a feature of AGD that surprises people seeing it for the first time: it is not a descent method. The function value can increase between consecutive steps.
Consider starting from a point $x^0$ with $f(x^0) = 1$. GD with step size $1/L$ marches steadily downhill:
```
GD:  f(x⁰) = 1.000 → f(x¹) = 0.250 → f(x²) = 0.063 → f(x³) = 0.016 ...
AGD: f(x⁰) = 1.000 → f(x¹) = 0.250 → f(x²) = 0.028 → f(x³) = 0.031 → f(x⁴) = 0.004 ...
```
At step 3, AGD's value increases from 0.028 to 0.031 before dropping sharply to 0.004 at step 4. Yet AGD is already ahead at step 2 (0.028 versus GD's 0.063), and the gap widens with more steps as the $1/t^2$ rate pulls away from GD's $1/t$.
The overshoot is not a bug. When AGD takes the momentum step $y^{t+1} = x^{t} + \beta_t\,(x^{t} - x^{t-1})$, it moves past the gradient-step point. This can temporarily increase $f$, but the momentum is aimed at a region of lower $f$ further ahead. The global trajectory is more efficient precisely because it is willing to go uphill locally.
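The effect is easy to reproduce. On an ill-conditioned quadratic (a stand-in example, not the exact function behind the traces above), the AGD loss curve ripples while still converging:

```python
import numpy as np

# AGD on f(x) = 0.5 * x^T diag(1, 100) x: condition number 100, L = 100.
H = np.array([1.0, 100.0])
L = 100.0
f = lambda x: 0.5 * np.sum(H * x**2)
grad = lambda x: H * x

x = np.array([1.0, 1.0])
y = x.copy()
lam, values = 1.0, []
for _ in range(100):
    x_next = y - grad(y) / L
    lam_next = (1 + np.sqrt(1 + 4 * lam**2)) / 2
    y = x_next + ((lam - 1) / lam_next) * (x_next - x)  # momentum overshoot
    x, lam = x_next, lam_next
    values.append(f(x))

ups = sum(b > a for a, b in zip(values, values[1:]))
print(ups)  # number of steps where f increased: nonzero for AGD
```

Plotting `values` shows the characteristic damped ripples; GD on the same problem produces a strictly decreasing curve.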
This non-monotonicity has a practical consequence: you cannot use the decrease of $f(x^t)$ as an early-stopping criterion with AGD. A common error in implementations is to detect an increase in loss and reset or reduce the step — this breaks the acceleration and reverts to the $O(1/t)$ convergence of plain GD.
For $\mu$-strongly convex functions, AGD has a modified form that achieves a linear rate — but with a dramatically better contraction factor than GD.
| Method | Rate (strongly convex) | Steps to $\epsilon$-accuracy |
|---|---|---|
| GD | $O\big((1 - \mu/L)^t\big)$ | $O\big(\kappa \log(1/\epsilon)\big)$ |
| AGD | $O\big((1 - \sqrt{\mu/L})^t\big)$ | $O\big(\sqrt{\kappa}\,\log(1/\epsilon)\big)$ |
The improvement from $\kappa$ to $\sqrt{\kappa}$ in the iteration count is the headline result.
For ill-conditioned problems, this is transformative. Take $\kappa = 10^6$ — not extreme for a large-scale ML problem with poor data conditioning. Steps needed to reach $\epsilon = 10^{-6}$ relative accuracy: GD needs about $\kappa\ln(1/\epsilon) \approx 1.4 \times 10^7$ steps; AGD needs about $\sqrt{\kappa}\,\ln(1/\epsilon) \approx 1.4 \times 10^4$.
A factor of 1000. With identical cost per step. This is why acceleration matters for large-scale optimization.
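The arithmetic is a one-liner to check:

```python
import numpy as np

# Iteration counts scale as kappa*log(1/eps) for GD vs sqrt(kappa)*log(1/eps) for AGD.
kappa, eps = 1e6, 1e-6
gd = kappa * np.log(1 / eps)            # ~ 1.4e7 iterations
agd = np.sqrt(kappa) * np.log(1 / eps)  # ~ 1.4e4 iterations
print(gd / agd)  # the ratio is sqrt(kappa) = 1000
```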
The accelerated update for strongly convex $f$ uses a fixed momentum coefficient $\beta = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}$ instead of the increasing schedule $\beta_t = \frac{\lambda_t - 1}{\lambda_{t+1}}$:

$$x^{t} = y^{t} - \frac{1}{L}\nabla f(y^{t}), \qquad y^{t+1} = x^{t} + \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\,\big(x^{t} - x^{t-1}\big).$$
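A sketch of this variant (assuming both $\mu$ and $L$ are known, which the fixed coefficient requires):

```python
import numpy as np

# Fixed-momentum AGD for a mu-strongly convex, L-smooth quadratic.
mu, L = 1.0, 100.0
beta = (np.sqrt(L / mu) - 1) / (np.sqrt(L / mu) + 1)  # (sqrt(kappa)-1)/(sqrt(kappa)+1)

H = np.array([mu, L])            # f(x) = 0.5 * x^T diag(mu, L) x, minimum at 0
grad = lambda x: H * x

x = np.array([1.0, 1.0])
y = x.copy()
for _ in range(200):
    x_next = y - grad(y) / L             # gradient step from the momentum point
    y = x_next + beta * (x_next - x)     # constant momentum, no schedule
    x = x_next

print(np.linalg.norm(x))  # contracts roughly like (1 - 1/sqrt(kappa))^t
```

If $\mu$ is unknown or misestimated, the fixed coefficient loses its guarantee; the schedule-based convex variant is the safer default.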
AGD is optimal among first-order methods — but what if we allowed second-order oracle queries, i.e., access to the Hessian $\nabla^2 f(x)$?
Newton's method uses the Hessian to form a local quadratic model and jumps directly to its minimum:

$$x^{t+1} = x^t - \big[\nabla^2 f(x^t)\big]^{-1}\,\nabla f(x^t).$$
The convergence is quadratic near the minimum:

$$\|x^{t+1} - x^\star\| \;\le\; C\,\|x^t - x^\star\|^2$$

for a constant $C$ depending on the Lipschitz constant of the Hessian.
If you start within distance $\epsilon < 1/C$ of $x^\star$, after $t$ steps you are within distance on the order of $\epsilon^{2^t}$. The number of correct decimal places doubles at each step. After 10 steps near the minimum, you have roughly $2^{10} \approx 1000$ decimal places of accuracy.
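A one-dimensional illustration (a stand-in function, $f(x) = \cosh x$, with minimum at $x^\star = 0$):

```python
import numpy as np

# Newton's method on f(x) = cosh(x): f'(x) = sinh(x), f''(x) = cosh(x),
# so the update is x <- x - tanh(x). The minimizer is x* = 0.
x, errors = 1.0, []
for _ in range(4):
    x = x - np.sinh(x) / np.cosh(x)
    errors.append(abs(x))

print(errors)  # errors collapse: correct digits at least double each step
```

On this symmetric function the contraction is in fact cubic, which only makes the doubling-of-digits picture more dramatic.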
The cost: forming and inverting the Hessian requires $O(d^2)$ memory and $O(d^3)$ computation per step. For a neural network with $d = 10^8$ parameters, this is on the order of $10^{24}$ flops per step. Completely infeasible.
L-BFGS (Limited-memory BFGS) approximates $\big[\nabla^2 f\big]^{-1}$ using the last $m$ gradient differences (typically $m = 5$–$20$). This gives superlinear convergence in practice with $O(md)$ cost per step. It is the standard second-order method for non-neural-network ML (SVMs, logistic regression, etc.) and works extremely well when $d$ is in the thousands to low millions.
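In practice you rarely implement L-BFGS yourself. For example, SciPy's `minimize` exposes it as `L-BFGS-B`, with the `maxcor` option controlling the number of stored gradient pairs $m$ (sketch on synthetic logistic-regression data; the data and names here are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

# Logistic regression fitted with L-BFGS-B on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = (X @ w_true > 0).astype(float)

def loss_and_grad(w):
    p = 1 / (1 + np.exp(-(X @ w)))   # sigmoid probabilities
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    grad = X.T @ (p - y) / len(y)
    return loss, grad

res = minimize(loss_and_grad, np.zeros(5), jac=True,
               method="L-BFGS-B", options={"maxcor": 10})  # keep last 10 pairs
print(res.fun)  # final loss, far below the ln(2) ~ 0.693 starting value
```

Returning `(loss, grad)` from one function with `jac=True` avoids recomputing the forward pass for the gradient.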
For neural networks, the story is different — and we will return to it when we discuss adaptive methods like Adam, which can be understood as cheap diagonal approximations to the inverse Hessian.
The convergence hierarchy so far:
| Method | Assumption | Rate |
|---|---|---|
| GD | $L$-smooth, convex | $O(1/t)$ |
| AGD | $L$-smooth, convex | $O(1/t^2)$ — optimal |
| GD | $L$-smooth, $\mu$-SC | $O\big((1 - \mu/L)^t\big)$ |
| AGD | $L$-smooth, $\mu$-SC | $O\big((1 - \sqrt{\mu/L})^t\big)$ — optimal |
| Newton | Lipschitz Hessian (local) | Quadratic — but $O(d^3)$/step |
The next post confronts a different problem entirely: what happens when the dataset is too large to compute $\nabla f$ at all? When both the sample count $n$ and the dimension $d$ are huge, even a single full-gradient evaluation is prohibitive. The answer is Stochastic Gradient Descent — and the price of cheapening each step is a fundamental, unavoidable variance that forces the convergence rate back down to $O(1/\sqrt{t})$.
Part of the Mathematics of Data series — mathematical notes on EE-556 at EPFL.