Mathematics of Data
The previous posts assumed we could evaluate the gradient $\nabla f(x)$ exactly at each step. In modern machine learning, this assumption fails catastrophically.
A typical loss function looks like:

$$f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x),$$

where $n$ is the number of training examples and each $f_i$ is the loss on a single data point. Computing the full gradient $\nabla f(x)$ requires one forward-backward pass over the entire dataset. At the scale of modern training, this is hours per step.
Stochastic Gradient Descent (SGD) escapes this by never computing the full gradient at all.
Every supervised learning objective has the finite-sum structure:

$$f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x),$$

where $f_i(x)$ is the loss on the $i$-th training example $a_i$ with label $b_i$.
Full Gradient Descent computes $\nabla f(x_k) = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x_k)$ at each step. Cost: $n$ gradient evaluations per step.
SGD instead samples a random index $i_k$ uniformly from $\{1, \dots, n\}$ and uses:

$$x_{k+1} = x_k - \alpha_k \nabla f_{i_k}(x_k).$$

Cost: $1$ gradient evaluation per step. $n$ times cheaper.
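As a concrete sketch, here is this update on a small synthetic least-squares problem. The problem, dimensions, step size, and iteration count are illustrative assumptions, not from the post:

```python
import numpy as np

# Illustrative sketch: SGD on f(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2,
# a synthetic least-squares problem (all constants are assumptions).
rng = np.random.default_rng(0)
n, d = 1000, 5
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
b = A @ x_true + 0.1 * rng.normal(size=n)

def grad_i(x, i):
    """Gradient of the single-example loss f_i: a_i * (a_i^T x - b_i)."""
    return A[i] * (A[i] @ x - b[i])

x = np.zeros(d)
alpha = 0.01
for k in range(20000):
    i = rng.integers(n)            # sample a random index uniformly
    x = x - alpha * grad_i(x, i)   # one gradient evaluation per step

print(np.linalg.norm(x - x_true))  # small: x ends near the minimizer
```

Each step touches a single row of `A`, which is the whole point: the per-step cost is independent of $n$.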
The key property that makes this sensible is unbiasedness:

$$\mathbb{E}_{i_k}\!\left[\nabla f_{i_k}(x_k)\right] = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x_k) = \nabla f(x_k).$$
In expectation, the stochastic gradient equals the true gradient. SGD is gradient descent with noise added — and the noise is zero-mean.
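This identity is easy to check numerically. In the sketch below (a synthetic least-squares setup, an illustrative assumption), averaging the per-example gradients over *all* indices recovers the full gradient exactly, while a finite random sample matches only approximately:

```python
import numpy as np

# Sketch: unbiasedness of the stochastic gradient, checked numerically.
rng = np.random.default_rng(1)
n, d = 100, 3
A = rng.normal(size=(n, d))
b = rng.normal(size=n)
x = rng.normal(size=d)

full_grad = A.T @ (A @ x - b) / n         # the true gradient ∇f(x)
per_example = A * (A @ x - b)[:, None]    # row i holds ∇f_i(x)

# Exact average over all i: an identity, zero up to floating point.
exact = np.linalg.norm(per_example.mean(axis=0) - full_grad)

# Empirical average over 50,000 random draws: merely close.
sample = per_example[rng.integers(n, size=50000)].mean(axis=0)
approx = np.linalg.norm(sample - full_grad)
print(exact, approx)
```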
The fact that $\mathbb{E}\!\left[\nabla f_{i_k}(x_k)\right] = \nabla f(x_k)$ is reassuring. But the variance is what causes trouble.
Define:

$$\sigma^2 := \sup_x \frac{1}{n} \sum_{i=1}^{n} \left\| \nabla f_i(x) - \nabla f(x) \right\|^2.$$

This measures how much individual gradients deviate from the true gradient. It is bounded for well-behaved problems and typically does not go to zero as $x_k \to x^\star$ (the noise persists even at the optimum, because different data points give different gradient signals).
With a fixed step size $\alpha$, SGD does not converge to $x^\star$. Instead, it oscillates in a neighborhood around $x^\star$ of radius proportional to $\sqrt{\alpha}\,\sigma$.
To see why: near $x^\star$, the true gradient is small (close to zero). But the stochastic gradient still has variance — it pushes the iterate in random directions even when we are already close. With fixed $\alpha$, these random kicks never diminish, so the iterate keeps bouncing.
```
 Fixed α:                  Decaying αₖ:

      ╭──────╮                  ╭──╮
  ~~~~│  x*  │~~~~           ~~~│x*│
      ╰──────╯                  ╰──╯
 (orbit, never stops)       (spiral in)
```
To guarantee convergence, we must decay $\alpha_k$. The classical condition (Robbins–Monro):

$$\sum_{k=0}^{\infty} \alpha_k = \infty, \qquad \sum_{k=0}^{\infty} \alpha_k^2 < \infty.$$
The first condition ensures we can still reach from anywhere (steps don't shrink too fast). The second ensures the cumulative noise vanishes (steps shrink fast enough that variance accumulates to a finite total).
The standard choice satisfying both: $\alpha_k = \alpha_0/(k+1)$ or, more generally, $\alpha_k = \alpha_0/(k+1)^p$ with $p \in (1/2, 1]$.
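The two regimes are easy to see numerically. The sketch below runs SGD on a one-dimensional quadratic with artificial unit-variance gradient noise; all constants are illustrative assumptions:

```python
import numpy as np

# Sketch: fixed vs decaying step sizes on f(x) = x^2 / 2, so x* = 0.
# The fixed step orbits the noise floor; alpha_k = 1/(k+1) keeps shrinking.
rng = np.random.default_rng(2)

def run_sgd(step_size, steps=20000, tail=1000):
    x, sq = 5.0, []
    for k in range(steps):
        g = x + rng.normal()         # stochastic gradient: ∇f(x) + noise
        x = x - step_size(k) * g
        if k >= steps - tail:
            sq.append(x * x)
    return float(np.sqrt(np.mean(sq)))  # RMS distance to x* over the tail

rms_fixed = run_sgd(lambda k: 0.1)           # plateaus near sqrt(alpha)*sigma
rms_decay = run_sgd(lambda k: 1.0 / (k + 1))  # Robbins-Monro: keeps shrinking
print(rms_fixed, rms_decay)
```

Measuring the RMS distance over the final stretch (rather than one final iterate) is deliberate: it exposes the persistent orbit of the fixed-step run.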
Theorem: For $L$-smooth, $\mu$-strongly convex $f$, SGD using $\alpha_k = \frac{c}{k+1}$ where $c > \frac{1}{2\mu}$:

$$\mathbb{E}\!\left[\|x_k - x^\star\|^2\right] \le \frac{C}{k},$$

where $C$ depends on $c$, $\mu$, $\sigma^2$, and $\|x_0 - x^\star\|^2$.
This is $O(1/k)$ — the same rate as GD on a merely convex function, despite strong convexity. The noise has erased the linear rate we worked hard to establish in Post 2. Full GD on the same problem achieves the linear rate $O\big((1 - \mu/L)^k\big)$ — exponentially faster.
The loss of the linear rate is the direct price of variance. And it is not avoidable with vanilla SGD — it is a fundamental consequence of the noise floor $\sigma^2$.
Instead of a single sample, use a mini-batch $B_k$ of $b$ random indices:

$$x_{k+1} = x_k - \alpha_k \cdot \frac{1}{b} \sum_{i \in B_k} \nabla f_i(x_k).$$

The expectation is still $\nabla f(x_k)$. The variance reduces by a factor of $b$:

$$\mathbb{E}\!\left[\left\| \frac{1}{b} \sum_{i \in B_k} \nabla f_i(x_k) - \nabla f(x_k) \right\|^2\right] \le \frac{\sigma^2}{b}.$$
This follows directly from the variance of an average of i.i.d. random variables.
The cost per step is $b$ gradient evaluations. So mini-batching trades compute for variance reduction at a 1:1 ratio — doubling the batch halves the variance but doubles the cost per step.
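The $\sigma^2/b$ scaling can be verified empirically. The sketch below reuses a synthetic least-squares setup (an illustrative assumption) and samples indices with replacement, for which the $1/b$ factor holds exactly:

```python
import numpy as np

# Sketch: mini-batch gradient variance shrinks by the batch size b.
rng = np.random.default_rng(3)
n, d = 1000, 5
A = rng.normal(size=(n, d))
b_vec = rng.normal(size=n)
x = rng.normal(size=d)

full = A.T @ (A @ x - b_vec) / n           # true gradient ∇f(x)
per_example = A * (A @ x - b_vec)[:, None]  # row i holds ∇f_i(x)

def batch_variance(batch_size, trials=20000):
    """Mean squared deviation of the mini-batch gradient from the full one."""
    idx = rng.integers(n, size=(trials, batch_size))  # with replacement
    g = per_example[idx].mean(axis=1)                 # (trials, d) gradients
    return np.mean(np.sum((g - full) ** 2, axis=1))

v1, v16 = batch_variance(1), batch_variance(16)
print(v1 / v16)  # close to 16: variance drops by the batch size
```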
Is it worth it? Only up to a point. Consider the total gradient evaluations needed to reach $\epsilon$-accuracy: with variance $\sigma^2/b$, the number of steps scales like $\sigma^2/(b\epsilon^2)$, and each step costs $b$ evaluations, so

$$\underbrace{\frac{\sigma^2}{b\,\epsilon^2}}_{\text{steps}} \times \underbrace{b}_{\text{cost/step}} \;=\; \frac{\sigma^2}{\epsilon^2} \qquad \text{(up to constants, independent of } b\text{)}.$$
Mini-batching does not reduce the total number of gradient evaluations needed — it just parallelizes them. The real benefit of large batches is hardware utilization: GPUs are more efficient computing 256 gradients simultaneously than one at a time.
Beyond a certain batch size (the "critical batch size"), you get no further convergence benefit and only waste compute. Typical language model training has a finite critical batch size in practice, and training with batch sizes beyond it is mostly parallelism engineering, not optimization efficiency.
The oscillation of SGD around $x^\star$ under a fixed step size is unavoidable — but we can average out the noise.
Polyak–Ruppert averaging: instead of returning the final iterate $x_k$, return the running average:

$$\bar{x}_k = \frac{1}{k} \sum_{j=1}^{k} x_j.$$
Theorem (Averaged SGD): For $L$-smooth convex $f$ with constant step size $\alpha \le 1/L$:

$$\mathbb{E}\!\left[f(\bar{x}_k)\right] - f^\star \le \frac{R^2}{2\alpha k} + \frac{\alpha \sigma^2}{2},$$

where $R = \|x_0 - x^\star\|$.
Proof sketch. Expand $\|x_{j+1} - x^\star\|^2$ using the SGD update, take expectations, use unbiasedness to handle the gradient term, and use convexity to lower-bound $\langle \nabla f(x_j), x_j - x^\star \rangle$ by $f(x_j) - f^\star$. Sum from $j = 1$ to $k$ and telescope. The averaging step (via Jensen's inequality) then converts the sum of suboptimalities into the suboptimality at the average point.
Plugging in $\alpha = \frac{R}{\sigma \sqrt{k}}$:

$$\mathbb{E}\!\left[f(\bar{x}_k)\right] - f^\star \le \frac{R\sigma}{2\sqrt{k}} + \frac{R\sigma}{2\sqrt{k}} = \frac{R\sigma}{\sqrt{k}}.$$

So:

$$\mathbb{E}\!\left[f(\bar{x}_k)\right] - f^\star = O\!\left(\frac{1}{\sqrt{k}}\right).$$
The $O(1/\sqrt{k})$ rate is slower than full GD's $O(1/k)$. But the cost per step is $1$ gradient evaluation rather than $n$, so compare at the same total budget of $T$ gradient evaluations: GD takes $T/n$ steps and reaches error $O(n/T)$, while SGD-A takes $T$ steps and reaches error $O(1/\sqrt{T})$.
For large $n$ and moderate target accuracy (concretely, whenever $\sqrt{T} < n$), SGD-A wins. This is the fundamental justification for stochastic optimization in the large-$n$ regime.
Why does averaging help? Intuitively: the iterates oscillate around $x^\star$, spending roughly equal time on either side. Averaging cancels the oscillation — the mean of the orbit is close to $x^\star$ even when individual iterates are far.
Formally: while individual iterates may be far from $x^\star$, the convexity of $f$ guarantees $f(\bar{x}_k) \le \frac{1}{k} \sum_{j=1}^{k} f(x_j)$, and the right-hand side is an average of function values that are individually close to $f^\star$ in expectation.
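A small simulation makes the cancellation visible. On a one-dimensional quadratic with artificial gradient noise (constants are illustrative assumptions), the last iterate under a constant step size keeps orbiting the noise floor, while the running average settles much closer to $x^\star$:

```python
import numpy as np

# Sketch: Polyak-Ruppert averaging on f(x) = x^2 / 2 with noisy gradients.
rng = np.random.default_rng(4)

x, alpha, avg = 5.0, 0.1, 0.0
K = 100000
for k in range(K):
    g = x + rng.normal()        # stochastic gradient of f; x* = 0
    x = x - alpha * g           # fixed step: x never stops bouncing
    avg += (x - avg) / (k + 1)  # running average of the iterates

print(abs(x), abs(avg))  # the running average typically sits far closer to x*
```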
| Method | Rate | Cost/step | Notes |
|---|---|---|---|
| GD | $O(1/k)$ | $n$ | $L$-smooth, convex |
| GD | $O\big((1-\mu/L)^k\big)$ | $n$ | $L$-smooth, $\mu$-SC |
| AGD | $O(1/k^2)$ | $n$ | $L$-smooth, convex — optimal |
| SGD-A | $O(1/\sqrt{k})$ | $1$ | $L$-smooth, convex |
| SGD | $O(1/k)$ | $1$ | $L$-smooth, $\mu$-SC, decaying $\alpha_k$ |
The SGD rows are striking: strong convexity, which gave GD a linear rate, only gives SGD an $O(1/k)$ rate. The noise floor $\sigma^2$ is the bottleneck, not the function geometry.
This raises a sharp question: can we have both? A cost of $1$ gradient evaluation per step like SGD, but a linear convergence rate like GD?
The answer is yes — and the algorithm that achieves it is the subject of the next post.
Part of the Mathematics of Data series — mathematical notes on EE-556 at EPFL.