Variance Reduction: Getting Linear Convergence for Free with SVRG
March 2025 · optimization · variance-reduction · svrg · convergence
Post 4 ended with a tension that deserves a resolution.
Full GD achieves linear convergence — error decaying as $(1-1/\kappa)^k$ — but costs O(n) per step. SGD costs O(1) per step, but the variance forces the rate back down to O(1/k). For large n and large κ, neither is satisfying.
The question Post 4 raised: can we get O(1) cost per step and linear convergence?
SVRG (Stochastic Variance Reduced Gradient, Johnson & Zhang 2013) answers yes. The idea is elegant: use a periodically-computed full gradient as a correction signal that drives the stochastic variance to zero as the algorithm converges. The noise doesn't just average out — it self-destructs.
1. Why the Variance Prevents Linear Convergence
Recall from Post 4: for μ-strongly convex f, the per-step progress of SGD satisfies:
$$\mathbb{E}[\|x_{k+1}-x^*\|^2] \le (1-\mu\alpha)\,\|x_k-x^*\|^2 + \alpha^2\sigma^2$$
The first term is the contractive progress from strong convexity — it would give linear convergence on its own. The second term is the variance penalty from the stochastic gradient noise.
With fixed α, the iteration converges to a ball around $x^*$, not to $x^*$ itself: setting the right-hand side equal to $\|x_k-x^*\|^2$ shows the squared distance stabilizes near $\alpha\sigma^2/\mu$. To make this ball shrink to zero, we need α→0 — which kills the linear rate in the first term.
The fix must address the source: make $\sigma^2$ itself go to zero as $x_k \to x^*$, even with fixed α.
2. The SVRG Update
SVRG operates in epochs. At the start of epoch s, pick a snapshot point $\tilde{x}$ (typically the last iterate or an average of the last epoch's iterates) and compute the full gradient:
$$\tilde{\mu} = \nabla f(\tilde{x}) = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(\tilde{x})$$
This costs O(n) — one full pass over the data.
Then run m inner steps. At each inner step k, sample $i_k \sim \mathrm{Uniform}\{1,\dots,n\}$ and update:
$$x_{k+1} = x_k - \alpha v_k, \qquad v_k = \nabla f_{i_k}(x_k) - \nabla f_{i_k}(\tilde{x}) + \tilde{\mu}$$
The corrected gradient $v_k$ is the key object. It looks like an SGD gradient, but with the correction term $-\nabla f_{i_k}(\tilde{x}) + \tilde{\mu}$ added.
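To make the epoch structure concrete, here is a minimal sketch in Python with NumPy. The test problem is a hypothetical least-squares objective $f_i(x) = \frac{1}{2}(a_i^\top x - b_i)^2$; the function names are illustrative, not from any library, and the step size and epoch length are not tuned to the theorem's constants.

```python
import numpy as np

def svrg(grad_i, full_grad, x0, alpha, m, epochs, n, rng):
    """Minimal SVRG: each epoch computes one full snapshot gradient,
    then runs m corrected stochastic inner steps."""
    x_tilde = x0.copy()
    for _ in range(epochs):
        mu = full_grad(x_tilde)          # snapshot gradient: one O(n) pass
        x = x_tilde.copy()
        for _ in range(m):
            i = rng.integers(n)          # i_k ~ Uniform{1,...,n}
            # corrected gradient: v = grad_i(x) - grad_i(x_tilde) + mu
            v = grad_i(x, i) - grad_i(x_tilde, i) + mu
            x -= alpha * v
        x_tilde = x                      # next snapshot: last inner iterate
    return x_tilde

# Hypothetical least-squares test problem: f_i(x) = (1/2)(a_i @ x - b_i)^2
rng = np.random.default_rng(0)
n, d = 200, 5
A, b = rng.normal(size=(n, d)), rng.normal(size=n)
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]
full_grad = lambda x: A.T @ (A @ x - b) / n
x_hat = svrg(grad_i, full_grad, np.zeros(d), alpha=0.01, m=2 * n,
             epochs=30, n=n, rng=rng)
```

Despite the fixed step size, the iterates settle on the least-squares solution rather than hovering in a noise ball, which is exactly the behavior the correction term buys.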
Why is this correction the right one? Let's check two properties.
Unbiasedness
$$\mathbb{E}_{i_k}[v_k] = \mathbb{E}_{i_k}[\nabla f_{i_k}(x_k)] - \mathbb{E}_{i_k}[\nabla f_{i_k}(\tilde{x})] + \tilde{\mu} = \nabla f(x_k) - \nabla f(\tilde{x}) + \nabla f(\tilde{x}) = \nabla f(x_k) \;\checkmark$$
$v_k$ is an unbiased estimate of $\nabla f(x_k)$, just like the plain SGD gradient. So the update is still valid.
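This can be checked numerically on a toy least-squares problem (data and names here are illustrative): averaging $v_k$ over all n choices of $i_k$ recovers $\nabla f(x_k)$ exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
A, b = rng.normal(size=(n, d)), rng.normal(size=n)

grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]   # gradient of f_i at x
full_grad = lambda x: A.T @ (A @ x - b) / n      # gradient of f at x

x, x_tilde = rng.normal(size=d), rng.normal(size=d)
mu = full_grad(x_tilde)

# average the corrected gradient over every possible sample i
v_mean = np.mean([grad_i(x, i) - grad_i(x_tilde, i) + mu
                  for i in range(n)], axis=0)
assert np.allclose(v_mean, full_grad(x))         # E[v_k] = grad f(x_k)
```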
Shrinking variance
For L-smooth functions there is a useful variance bound: for any x,
$$\mathbb{E}_i[\|\nabla f_i(x) - \nabla f_i(x^*)\|^2] \le 2L\,(f(x) - f^*)$$
This follows from the smoothness of each $f_i$ and the fact that the $\nabla f_i(x^*)$ average to $\nabla f(x^*) = 0$.
Applying this to both terms:
$$\mathbb{E}[\|v_k - \nabla f(x_k)\|^2] \le 4L\,(f(x_k) - f^*) + 4L\,(f(\tilde{x}) - f^*)$$
This is the central estimate of SVRG. The variance is not a fixed constant $\sigma^2$ — it is proportional to the suboptimality at the current point and at the snapshot. As $x_k \to x^*$ and $\tilde{x} \to x^*$, both terms go to zero. The noise self-corrects.
This is the mechanism that breaks the SGD barrier: unlike plain SGD, whose noise floor is a fixed $\sigma^2$, SVRG's effective noise shrinks with the algorithm's progress.
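A small experiment makes the contrast visible. On a hypothetical least-squares problem, we measure the empirical variance of the plain SGD gradient and of the SVRG estimator at points approaching $x^*$ (with the snapshot kept comparably close): the SGD variance levels off at a fixed floor, while the SVRG variance keeps shrinking.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 4
A, b = rng.normal(size=(n, d)), rng.normal(size=n)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]    # minimizer of (1/2n)||Ax-b||^2

grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]
full_grad = lambda x: A.T @ (A @ x - b) / n

def variances(x, x_tilde):
    """Empirical variance of the SGD gradient and the SVRG estimator at x."""
    g, mu = full_grad(x), full_grad(x_tilde)
    sgd = np.mean([np.sum((grad_i(x, i) - g) ** 2) for i in range(n)])
    svrg = np.mean([np.sum((grad_i(x, i) - grad_i(x_tilde, i) + mu - g) ** 2)
                    for i in range(n)])
    return sgd, svrg

# move the current point and the snapshot toward x* together
u, w = rng.normal(size=d), rng.normal(size=d)
results = [variances(x_star + t * u, x_star + t * w) for t in (1.0, 0.1, 0.01)]
```

As t shrinks, the SVRG variance drops roughly like $t^2$, while the SGD variance settles near its floor $\mathbb{E}_i\|\nabla f_i(x^*)\|^2$, which is nonzero whenever the individual $\nabla f_i(x^*)$ do not all vanish.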
3. Convergence Theorem
Theorem (SVRG Linear Convergence): Let f be L-smooth and μ-strongly convex. Run SVRG with epoch length m, step size $\alpha < \frac{1}{4L}$, and set the next snapshot $\tilde{x}_{s+1}$ to a uniformly random iterate from epoch s. Then:
$$\mathbb{E}[f(\tilde{x}_s) - f^*] \le \rho^s\,(f(\tilde{x}_0) - f^*)$$
where
$$\rho = \frac{1}{\mu\alpha(1-2L\alpha)m} + \frac{2L\alpha}{1-2L\alpha}$$
For ρ<1, choose $m = \frac{50L}{\mu} = 50\kappa$ and $\alpha = \frac{1}{10L}$. Then:
$$\rho = \frac{1}{\frac{\mu}{10L}\cdot\frac{4}{5}\cdot\frac{50L}{\mu}} + \frac{\frac{2}{10}}{1-\frac{2}{10}} = \frac{1}{4} + \frac{1}{4} = \frac{1}{2}$$
A contraction factor of ρ=1/2 per epoch regardless of κ.
Proof sketch. Track E[∥xk+1−x∗∥2] for one inner step. Expand using the SVRG update, then apply:
Strong convexity to bound $\nabla f(x_k)^\top (x_k - x^*)$
The variance bound from Section 2 to control the noise term
The per-step inequality: $$\mathbb{E}[\|x_{k+1}-x^*\|^2] \le (1-2\mu\alpha)\,\|x_k-x^*\|^2 + 2\alpha^2\big[4L(f(x_k)-f^*) + 4L(f(\tilde{x})-f^*)\big]$$
The variance bound connects the noise to suboptimality, which connects to distance via strong convexity. This closes the loop: the contraction in distance-to-optimum drives down the variance, which drives down the distance further.
Summing over m inner steps, taking expectation over the random snapshot selection, and rearranging gives the epoch-level contraction ρ. □
4. Cost Comparison: The Payoff
Let's count total gradient evaluations to reach ϵ-accuracy.
Epoch cost for SVRG:
Snapshot gradient: n evaluations (one full pass)
Inner loop of $m = 50\kappa$ steps: 50κ evaluations (each step samples one $\nabla f_i$)
Total per epoch: $n + 50\kappa$
Epochs to ϵ-accuracy:
Since $\rho \le 1/2$, after s epochs the error is at most $(1/2)^s$ times the initial error. To reach ϵ from initial error Δ: $s = \log_2(\Delta/\epsilon) = O(\log(1/\epsilon))$ epochs.
Total gradient evaluations for SVRG: $O\big((n+\kappa)\log\frac{1}{\epsilon}\big)$
Compare to full GD:
GD needs O(κ) steps, each costing O(n) evaluations:
$$O\Big(n\kappa\log\frac{1}{\epsilon}\Big)$$
Compare to SGD:
SGD with rate O(1/k) needs O(1/ϵ) steps of O(1) cost:
$$O\Big(\frac{1}{\epsilon}\Big)$$
SVRG achieves linear convergence (like GD) with a factor of $n+\kappa$ instead of $n\kappa$. When κ≫1 — the regime where GD is slow — SVRG outperforms GD by a factor of $\frac{n\kappa}{n+\kappa}$, which is $\approx \frac{n\kappa}{\kappa} = n$ when κ≫n, or $\frac{n\kappa}{n} = \kappa$ when n≫κ.
Concretely: $n = 10^6$ data points, $\kappa = 10^3$, target $\epsilon = 10^{-6}$ (so $\log(1/\epsilon) \approx 14$):
GD: $10^6 \times 10^3 \times 14 \approx 1.4\times 10^{10}$ evaluations
SVRG: $(10^6 + 10^3) \times 14 \approx 1.4\times 10^7$ evaluations
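The arithmetic is easy to reproduce; the factor 14 stands in for $\log(1/\epsilon) \approx \ln 10^6$, and constants hidden by the O-notation are dropped.

```python
n, kappa, log_eps = 10**6, 10**3, 14   # log(1/eps) for eps = 1e-6 (natural log)

gd_cost = n * kappa * log_eps          # O(n * kappa * log(1/eps)) evaluations
svrg_cost = (n + kappa) * log_eps      # O((n + kappa) * log(1/eps)) evaluations
speedup = gd_cost / svrg_cost          # ~ n*kappa / (n + kappa) ~ kappa here
```

Since $n \gg \kappa$ in this example, the speedup is governed by the $n \gg \kappa$ branch above and comes out close to κ = 1000.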
SVRG is 1000x cheaper. With identical convergence guarantees.
5. Continual Learning Extension: CSVRG
SVRG assumes a fixed dataset {f1,…,fn}. In many modern settings — online learning, continual learning, streaming data — new data points arrive over time and the model must update without forgetting old knowledge.
The naive approach: when $f_{n+1}$ arrives, re-run SVRG from scratch on $\{f_1,\dots,f_{n+1}\}$. Cost: O(n) per epoch. For a stream of T new points, total cost O(Tn) — quadratic in the total data.
CSVRG (Continual SVRG) addresses this by maintaining a stale snapshot gradient $\tilde{\mu}$ that is updated incrementally. When $f_{n+1}$ arrives:
Update the snapshot gradient: $\tilde{\mu}_{\text{new}} = \frac{n\tilde{\mu} + \nabla f_{n+1}(\tilde{x})}{n+1}$ — a single-gradient running-average update whose cost is independent of n
Continue running SVRG inner steps, now with the updated $\tilde{\mu}_{\text{new}}$
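The incremental update in step 1 is just a running average. A sketch of that step (CSVRG as described in this post; the function name is illustrative):

```python
import numpy as np

def update_snapshot_grad(mu, n, new_grad):
    """Fold a new point's gradient at the snapshot x_tilde into the
    running average mu = (1/n) * sum_i grad_i(x_tilde).
    Costs O(d) — no pass over the old data."""
    return (n * mu + new_grad) / (n + 1)
```

Because only the average is stored, the old per-point gradients never need to be revisited; the price is that $\tilde{\mu}$ now refers to a snapshot that predates the newest data.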
The staleness of $\tilde{\mu}$ introduces a bounded extra variance term that degrades gracefully — the convergence rate remains linear, with a slightly worse constant that shrinks as the model reprocesses old data.
CSVRG trades exact variance reduction for memory efficiency. It is the right tool when data arrives faster than you can take full passes, and is one of several variance-reduced methods finding application in federated learning (where clients have local data and cannot share raw gradients).
Where We Stand
SVRG closes the loop opened in Post 4. The tension between cost and convergence rate — which seemed fundamental to stochastic optimization — is resolved by exploiting the finite-sum structure. The variance is not an intrinsic property of the problem; it is a property of the estimator, and a cleverer estimator eliminates it.
The next post turns to a different structural challenge: what happens when the objective is not even differentiable? The Lasso, total variation regularization, and group sparsity penalties all have kinks. Subgradients are the right generalization of gradients in this world — and the proximal operator is the right generalization of the gradient step.
Part of the Mathematics of Data series — mathematical notes on EE-556 at EPFL.