Gradient Descent: From Intuition to Convergence Guarantees
March 2025 · optimization · gradient-descent · convergence · convexity
Gradient Descent is the algorithm that underlies almost everything in machine learning. The update rule is a single line:
x_{k+1} = x_k − α ∇f(x_k)
But the question of why it works, how fast it works, and when it fails is far more interesting than the rule itself. This post answers all three — carefully, with full proofs.
1. Setup and the Oracle Model
We want to solve:
min_{x ∈ ℝᵈ} f(x)
In most real problems — maximum likelihood estimation, neural network training, least squares — there is no closed-form solution. We cannot just set ∇f=0 and solve. We must iterate.
The oracle model formalizes what we are allowed to do at each step. A first-order oracle, given a query point x, returns two things:
The function value f(x)
The gradient ∇f(x)
This is local information. We do not know the global shape of f, where the minimum is, or how the gradient behaves elsewhere. Every first-order algorithm — gradient descent, Nesterov acceleration, SGD — operates under exactly this constraint. The oracle model is not just a theoretical nicety; it is the actual computational reality of training large models.
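To make the interface concrete, here is a minimal sketch of a first-order oracle in Python (the function name is illustrative, not from any particular library), for the toy objective f(x) = ½∥x∥²:

```python
def first_order_oracle(x):
    """First-order oracle for the toy objective f(x) = 0.5 * ||x||^2.

    Given a query point x, return exactly the two pieces of local
    information a first-order method is allowed to see:
    the value f(x) and the gradient grad_f(x) (here grad_f(x) = x).
    """
    value = 0.5 * sum(xi * xi for xi in x)
    grad = list(x)
    return value, grad
```

An optimizer built on top of this interface never sees the global shape of f, only the pairs (f(x), ∇f(x)) at the points it chooses to query.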
Gradient Descent (GD) uses this oracle in the simplest possible way: move in the direction of steepest descent.
x_{k+1} = x_k − α ∇f(x_k)
The scalar α>0 is the step size (or learning rate). Choosing it correctly is the central technical challenge.
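The update rule translates almost line for line into code. A minimal sketch in pure Python (names are ours), run on the quadratic f(x) = ½(x₁² + 10·x₂²), which is L-smooth with L = 10:

```python
def grad_f(x):
    """Gradient of f(x) = 0.5 * (x[0]**2 + 10 * x[1]**2)."""
    return [x[0], 10.0 * x[1]]

def gradient_descent(grad, x0, alpha, steps):
    """Iterate x_{k+1} = x_k - alpha * grad(x_k) for a fixed number of steps."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x

# L = 10 for this f, so the theory prescribes alpha = 1/L = 0.1.
x_final = gradient_descent(grad_f, x0=[5.0, 5.0], alpha=0.1, steps=200)
```

After 200 steps the iterate is numerically at the minimizer x∗ = 0: the second coordinate is annihilated in one step (its curvature exactly matches 1/α), while the first contracts by a factor 0.9 per step.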
2. The Descent Lemma — Why α=1/L Works
From Post 1, recall the L-smoothness upper bound: for an L-smooth function,
f(y) ≤ f(x) + ∇f(x)ᵀ(y − x) + (L/2) ∥y − x∥²
This is valid for all x, y. We now apply it specifically to one step of gradient descent.
Lemma (Per-Step Decrease): If f is L-smooth and we run GD with step size α = 1/L, then:
f(x_{k+1}) ≤ f(x_k) − (1/(2L)) ∥∇f(x_k)∥²
Proof. Substitute x = x_k and y = x_{k+1} = x_k − (1/L) ∇f(x_k) into the smoothness bound:
f(x_{k+1}) ≤ f(x_k) + ∇f(x_k)ᵀ(−(1/L) ∇f(x_k)) + (L/2) ∥(1/L) ∇f(x_k)∥²
= f(x_k) − (1/L) ∥∇f(x_k)∥² + (1/(2L)) ∥∇f(x_k)∥²
= f(x_k) − (1/(2L)) ∥∇f(x_k)∥² □
Every step decreases f by at least (1/(2L)) ∥∇f(x_k)∥². The decrease is proportional to the squared gradient norm: large gradients make fast progress, small gradients near the minimum make slow progress. This is exactly the right behavior.
Why α = 1/L specifically? With a general step size α, the same calculation gives a decrease of α(1 − αL/2) ∥∇f∥². This is maximized at α = 1/L, giving decrease (1/(2L)) ∥∇f∥². Smaller α is wastefully conservative; for α > 2/L the guaranteed decrease becomes negative, the smoothness bound no longer certifies progress, and the step may increase f.
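The lemma is easy to sanity-check numerically. A sketch on a hand-rolled quadratic with L = 10 (names are ours):

```python
def f(x):
    """f(x) = 0.5 * (x[0]**2 + 10 * x[1]**2), which is L-smooth with L = 10."""
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)

def grad_f(x):
    return [x[0], 10.0 * x[1]]

L = 10.0
x = [3.0, -2.0]
g = grad_f(x)
x_next = [xi - gi / L for xi, gi in zip(x, g)]  # one GD step with alpha = 1/L
grad_norm_sq = sum(gi * gi for gi in g)

# Per-step decrease guarantee: f(x_next) <= f(x) - (1/(2L)) * ||grad||^2
assert f(x_next) <= f(x) - grad_norm_sq / (2.0 * L) + 1e-12
```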
3. Convergence for Smooth Convex Functions — O(1/k) Rate
We now prove the first main convergence theorem. The only assumptions are L-smoothness and convexity.
Theorem: Let f be L-smooth and convex with minimizer x∗. After k steps of GD with α = 1/L:
f(x_k) − f∗ ≤ L ∥x_0 − x∗∥² / (2k)
Proof. Let δ_k = f(x_k) − f∗ denote the suboptimality at step k.
Step 1: Bound δ_k using the gradient. From convexity (first-order condition):
f∗ = f(x∗) ≥ f(x_k) + ∇f(x_k)ᵀ(x∗ − x_k)
Rearranging: δ_k = f(x_k) − f∗ ≤ ∇f(x_k)ᵀ(x_k − x∗)
By Cauchy–Schwarz, δ_k ≤ ∥∇f(x_k)∥ · ∥x_k − x∗∥. We will not use this directly; a tighter path is available.
Step 2: Track ∥x_k − x∗∥². Write out the update:
∥x_{k+1} − x∗∥² = ∥x_k − (1/L) ∇f(x_k) − x∗∥²
= ∥x_k − x∗∥² − (2/L) ∇f(x_k)ᵀ(x_k − x∗) + (1/L²) ∥∇f(x_k)∥²
From convexity: ∇f(x_k)ᵀ(x_k − x∗) ≥ f(x_k) − f∗ = δ_k. So:
∥x_{k+1} − x∗∥² ≤ ∥x_k − x∗∥² − (2/L) δ_k + (1/L²) ∥∇f(x_k)∥²
Step 3: Use the descent lemma. The descent lemma gives (1/(2L)) ∥∇f(x_k)∥² ≤ δ_k − δ_{k+1}, so (1/L²) ∥∇f(x_k)∥² ≤ (2/L)(δ_k − δ_{k+1}). Substituting this into the bound from Step 2:
∥x_{k+1} − x∗∥² ≤ ∥x_k − x∗∥² − (2/L) δ_{k+1}
Step 4: Sum and telescope. Combining Steps 2 and 3 gives ∥x_{j+1} − x∗∥² ≤ ∥x_j − x∗∥² − (2/L) δ_{j+1}. Summing over j = 0, …, k−1, the distance terms telescope:
(2/L) ∑_{j=0}^{k−1} δ_{j+1} ≤ ∥x_0 − x∗∥² − ∥x_k − x∗∥² ≤ ∥x_0 − x∗∥²
The descent lemma gives δ_{k+1} ≤ δ_k, so the sequence is non-increasing and each summand is at least δ_k:
k δ_k ≤ ∑_{j=1}^{k} δ_j ≤ (L/2) ∥x_0 − x∗∥²
Therefore:
δ_k ≤ L ∥x_0 − x∗∥² / (2k) □
This is a sublinear convergence rate. To reach δ_k ≤ ϵ, we need k ≥ L ∥x_0 − x∗∥² / (2ϵ) steps. To halve the error from ϵ to ϵ/2, you need twice as many steps. This is slow, and it turns out to be improvable, as Post 3 will show.
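The O(1/k) bound can be checked directly on a smooth convex example. A sketch (illustrative quadratic with L = 10, for which x∗ = 0 and f∗ = 0):

```python
def f(x):
    """f(x) = 0.5 * (x[0]**2 + 10 * x[1]**2); smooth and convex, L = 10."""
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)

def grad_f(x):
    return [x[0], 10.0 * x[1]]

L = 10.0
x0 = [5.0, 5.0]
R_sq = sum(xi * xi for xi in x0)  # ||x0 - x*||^2, with x* = 0 and f* = 0

x = list(x0)
for k in range(1, 101):
    g = grad_f(x)
    x = [xi - gi / L for xi, gi in zip(x, g)]
    # Theorem: f(x_k) - f* <= L * ||x0 - x*||^2 / (2k)
    assert f(x) <= L * R_sq / (2.0 * k) + 1e-12
```

On this particular instance the actual progress is far better than the worst-case bound, which is typical: the O(1/k) rate is tight only for adversarially chosen smooth convex functions.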
4. Convergence for Strongly Convex Functions — Linear Rate
When f is also μ-strongly convex, the picture changes dramatically. The sublinear O(1/k) rate becomes a linear (geometric) rate — the error shrinks by a constant factor at every step.
Theorem: Let f be L-smooth and μ-strongly convex. GD with α = 1/L satisfies:
∥x_k − x∗∥² ≤ (1 − μ/L)^k ∥x_0 − x∗∥²
Proof. We track ∥x_{k+1} − x∗∥² directly.
∥x_{k+1} − x∗∥² = ∥x_k − (1/L) ∇f(x_k) − x∗∥²
= ∥x_k − x∗∥² − (2/L) ∇f(x_k)ᵀ(x_k − x∗) + (1/L²) ∥∇f(x_k)∥²   (∗)
We need a lower bound on the inner product term. The simple strong-convexity bound ∇f(x_k)ᵀ(x_k − x∗) ≥ (μ/2) ∥x_k − x∗∥² is not enough on its own, because it leaves the ∥∇f(x_k)∥² term in (∗) uncontrolled. The key tool is the sharper co-coercivity (interpolation) inequality, which holds when f is both L-smooth and μ-strongly convex:
∇f(x_k)ᵀ(x_k − x∗) ≥ (μL/(μ+L)) ∥x_k − x∗∥² + (1/(μ+L)) ∥∇f(x_k)∥²
Substituting this into (∗):
∥x_{k+1} − x∗∥² ≤ (1 − 2μ/(μ+L)) ∥x_k − x∗∥² + (1/L² − 2/(L(μ+L))) ∥∇f(x_k)∥²
Since μ ≤ L, the coefficient of ∥∇f(x_k)∥² is non-positive and can be dropped, and 1 − 2μ/(μ+L) ≤ 1 − μ/L. Hence:
∥x_{k+1} − x∗∥² ≤ (1 − μ/L) ∥x_k − x∗∥²
Applying this recursively over k steps:
∥x_k − x∗∥² ≤ (1 − μ/L)^k ∥x_0 − x∗∥² = (1 − 1/κ)^k ∥x_0 − x∗∥² □
where κ = L/μ.
This is linear convergence: the distance to the optimum shrinks by a factor of (1 − 1/κ) per step. Using 1 − x ≤ e^{−x}:
∥x_k − x∗∥² ≤ e^{−k/κ} ∥x_0 − x∗∥²
To reach ∥x_k − x∗∥² ≤ ϵ ∥x_0 − x∗∥², we need k ≥ κ log(1/ϵ) steps.
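Again, a quick numeric check of the linear rate, on the quadratic f(x) = ½(x₁² + 10·x₂²) with μ = 1 and L = 10 (names are ours):

```python
def grad_f(x):
    """Gradient of f(x) = 0.5 * (x[0]**2 + 10 * x[1]**2); mu = 1, L = 10."""
    return [x[0], 10.0 * x[1]]

mu, L = 1.0, 10.0
x0 = [5.0, 5.0]
dist0_sq = sum(xi * xi for xi in x0)  # ||x0 - x*||^2, with x* = 0

x = list(x0)
for k in range(1, 51):
    g = grad_f(x)
    x = [xi - gi / L for xi, gi in zip(x, g)]
    dist_sq = sum(xi * xi for xi in x)
    # Theorem: ||x_k - x*||^2 <= (1 - mu/L)^k * ||x0 - x*||^2
    assert dist_sq <= (1.0 - mu / L) ** k * dist0_sq + 1e-12
```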
The Condition Number is the Bottleneck
The condition number κ=L/μ appears as a multiplicative factor in the iteration count. This deserves a concrete example.
Suppose κ = 1000 (not unusual in practice; many ML problems are ill-conditioned). To reduce the squared distance to the optimum by a factor of 10⁻³:
k ≥ κ log(10³) = 1000 × 6.9 ≈ 6900 steps
For a well-conditioned problem with κ = 2:
k ≥ 2 × 6.9 ≈ 14 steps
The difference is a factor of 500. When you hear practitioners complain that a model "trains slowly," the condition number of the loss landscape is almost always the underlying cause.
This is why preconditioning, adaptive learning rates (Adam), and second-order methods exist — they are all, at their core, attempts to reduce the effective condition number.
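The k ≥ κ log(1/ϵ) estimate is easy to tabulate. A small sketch (the helper name is ours):

```python
import math

def iters_upper_bound(kappa, eps):
    """Iterations sufficient for (1 - 1/kappa)^k <= eps,
    using the bound k >= kappa * log(1/eps) from the linear-rate analysis."""
    return math.ceil(kappa * math.log(1.0 / eps))

# Iteration counts for a 10^-3 reduction at various condition numbers.
for kappa in (2, 10, 100, 1000):
    print(kappa, iters_upper_bound(kappa, 1e-3))
```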
5. Step Size in Practice
The theory prescribes α=1/L, but in practice L is rarely known.
Backtracking line search (Armijo rule): Start with a large α, then shrink by factor β∈(0,1) until the sufficient decrease condition holds:
f(x_k − α ∇f(x_k)) ≤ f(x_k) − c α ∥∇f(x_k)∥²
for some c ∈ (0, 1) (typically c = 10⁻⁴). This automatically adapts to the local smoothness constant, and convergence guarantees are preserved with slightly worse constants.
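Backtracking is only a few lines of code. A sketch of the Armijo loop (function names are ours):

```python
def backtracking_step_size(f, grad_f, x, alpha0=1.0, beta=0.5, c=1e-4):
    """Shrink alpha geometrically until the sufficient-decrease condition
    f(x - alpha * g) <= f(x) - c * alpha * ||g||^2 holds."""
    g = grad_f(x)
    grad_norm_sq = sum(gi * gi for gi in g)
    fx = f(x)
    alpha = alpha0
    while f([xi - alpha * gi for xi, gi in zip(x, g)]) > fx - c * alpha * grad_norm_sq:
        alpha *= beta
    return alpha

def f(x):
    """Test objective: 0.5 * (x[0]**2 + 10 * x[1]**2), with L = 10."""
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)

def grad_f(x):
    return [x[0], 10.0 * x[1]]

# Starting from alpha0 = 1, backtracking settles near the 1/L regime
# without ever being told L.
alpha = backtracking_step_size(f, grad_f, [3.0, -2.0])
```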
Exact line search: α_k = argmin_{α>0} f(x_k − α ∇f(x_k)). Elegant in theory, usually too expensive in practice.
Fixed α with grid search: for deep learning, practitioners typically run a small sweep over α ∈ {10⁻¹, 10⁻², 10⁻³, 10⁻⁴} and pick the one that decreases the loss fastest in the first few hundred steps.
These rates are correct, but they are not optimal. The next post shows that GD is provably suboptimal for smooth convex functions: there exists an algorithm that achieves O(1/k²) with the same oracle cost per step. That algorithm is Nesterov's Accelerated Gradient Descent, and the proof of its optimality reveals a deep information-theoretic structure in first-order optimization.
Part of the Mathematics of Data series — mathematical notes on EE-556 at EPFL.