
Mathematics of Data

Variance Reduction: Getting Linear Convergence for Free with SVRG

March 2025 · optimization · variance-reduction · svrg · convergence

Post 4 ended with a tension that deserves a resolution.

Full GD achieves linear convergence — error decaying as $(1 - 1/\kappa)^k$ — but costs $O(n)$ per step. SGD costs $O(1)$ per step, but the variance forces the rate back down to $O(1/k)$. For large $n$ and large $\kappa$, neither is satisfying.

The question Post 4 raised: can we get $O(1)$ cost per step and linear convergence?

SVRG (Stochastic Variance Reduced Gradient, Johnson & Zhang 2013) answers yes. The idea is elegant: use a periodically-computed full gradient as a correction signal that drives the stochastic variance to zero as the algorithm converges. The noise doesn't just average out — it self-destructs.


1. Why the Variance Prevents Linear Convergence

Recall from Post 4: for $\mu$-strongly convex $f$, the per-step progress of SGD satisfies:

$$\mathbb{E}[\|x^{k+1} - x^*\|^2] \leq (1 - \mu\alpha)\|x^k - x^*\|^2 + \alpha^2 \sigma^2$$

The first term is the contractive progress from strong convexity — it would give linear convergence on its own. The second term is the variance penalty from the stochastic gradient noise.

With fixed $\alpha$, the iteration converges to a neighborhood of $x^*$ of squared radius $\frac{\alpha \sigma^2}{\mu}$ (the fixed point of the recursion), not to $x^*$ itself. To make this neighborhood shrink to zero, we need $\alpha \to 0$ — which kills the linear rate in the first term.

The fix must address the source: make $\sigma^2$ itself go to zero as $x^k \to x^*$, even with fixed $\alpha$.
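To see the noise floor concretely, here is a small synthetic least-squares experiment (the setup and step sizes are my illustrative choices, not from the post): fixed-step SGD stalls at a squared distance to $x^*$ that scales with the step size.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.5 * rng.normal(size=n)  # label noise => gradient variance at x*
x_star = np.linalg.lstsq(A, b, rcond=None)[0]          # minimizer of the finite sum

def sgd_final_error(alpha, steps):
    """Fixed-step SGD on f(x) = (1/n) sum_i 0.5*(a_i^T x - b_i)^2."""
    x = np.zeros(d)
    for _ in range(steps):
        i = rng.integers(n)
        x = x - alpha * (A[i] @ x - b[i]) * A[i]  # gradient of the sampled component
    return np.linalg.norm(x - x_star) ** 2

# Both runs have long passed their transient; each sits at its noise floor.
err_big = np.mean([sgd_final_error(0.02, 10_000) for _ in range(5)])
err_small = np.mean([sgd_final_error(0.002, 10_000) for _ in range(5)])
```

Shrinking the step size lowers the floor but slows the contraction, which is exactly the tension described above.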


2. The SVRG Update

SVRG operates in epochs. At the start of epoch $s$, pick a snapshot point $\tilde{x}$ (typically the last iterate, or an average of the last epoch's iterates) and compute the full gradient:

$$\tilde{\mu} = \nabla f(\tilde{x}) = \frac{1}{n}\sum_{i=1}^n \nabla f_i(\tilde{x})$$

This costs $O(n)$ — one full pass over the data.

Then run $m$ inner steps. At each inner step $k$, sample $i_k \sim \text{Uniform}\{1,\ldots,n\}$ and update:

$$x^{k+1} = x^k - \alpha \underbrace{\left[\nabla f_{i_k}(x^k) - \nabla f_{i_k}(\tilde{x}) + \tilde{\mu}\right]}_{v^k}$$

The corrected gradient $v^k$ is the key object. It looks like an SGD step, but with a correction term $-\nabla f_{i_k}(\tilde{x}) + \tilde{\mu}$ added.

Why is this correction the right one? Let's check two properties.

Unbiasedness

$$\mathbb{E}_{i_k}[v^k] = \mathbb{E}_{i_k}[\nabla f_{i_k}(x^k)] - \mathbb{E}_{i_k}[\nabla f_{i_k}(\tilde{x})] + \tilde{\mu}$$

$$= \nabla f(x^k) - \nabla f(\tilde{x}) + \nabla f(\tilde{x}) = \nabla f(x^k) \checkmark$$

So $v^k$ is an unbiased estimate of $\nabla f(x^k)$, just like the plain SGD gradient, and the update is still valid.
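The unbiasedness can be checked numerically. A quick sketch on a synthetic least-squares finite sum (the setup is mine, not the post's): averaging $v^k$ over all $n$ choices of $i_k$ recovers $\nabla f(x^k)$ exactly, because the two correction terms cancel in expectation.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 4
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

grad_i = lambda i, x: (A[i] @ x - b[i]) * A[i]  # gradient of f_i(x) = 0.5*(a_i^T x - b_i)^2
full_grad = lambda x: A.T @ (A @ x - b) / n     # gradient of f = (1/n) sum_i f_i

x = rng.normal(size=d)        # current iterate
x_tilde = rng.normal(size=d)  # snapshot point
mu = full_grad(x_tilde)

# Expectation over i_k ~ Uniform{1,...,n} is just the average over all i.
v_mean = np.mean([grad_i(i, x) - grad_i(i, x_tilde) + mu for i in range(n)], axis=0)
# v_mean coincides with full_grad(x) up to floating-point error.
```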

Variance Goes to Zero

Now the key calculation. Since $v^k$ is unbiased, its variance is at most its second moment: $\mathbb{E}[\|v^k - \nabla f(x^k)\|^2] \leq \mathbb{E}[\|v^k\|^2]$. Add and subtract $\nabla f_{i_k}(x^*)$ inside $v^k$ and apply $\|a - b\|^2 \leq 2\|a\|^2 + 2\|b\|^2$:

$$\mathbb{E}[\|v^k\|^2] \leq 2\,\mathbb{E}[\|\nabla f_{i_k}(x^k) - \nabla f_{i_k}(x^*)\|^2] + 2\,\mathbb{E}[\|\nabla f_{i_k}(\tilde{x}) - \nabla f_{i_k}(x^*) - \nabla f(\tilde{x})\|^2]$$

The second expectation is the variance of $\nabla f_{i_k}(\tilde{x}) - \nabla f_{i_k}(x^*)$ — its mean is $\nabla f(\tilde{x})$, because $\nabla f_{i_k}(x^*)$ averages to $\nabla f(x^*) = 0$ — so it is at most the second moment $\mathbb{E}[\|\nabla f_{i_k}(\tilde{x}) - \nabla f_{i_k}(x^*)\|^2]$.

For convex, $L$-smooth $f_i$, a standard co-coercivity bound controls both remaining terms: for any $x$,

$$\mathbb{E}_i[\|\nabla f_i(x) - \nabla f_i(x^*)\|^2] \leq 2L(f(x) - f^*)$$

This follows by applying $\|\nabla f_i(x) - \nabla f_i(x^*)\|^2 \leq 2L\left(f_i(x) - f_i(x^*) - \nabla f_i(x^*)^T(x - x^*)\right)$ to each $f_i$ and averaging; the inner-product terms average to zero since $\nabla f_i(x^*)$ averages to $\nabla f(x^*) = 0$.

Applying this at $x^k$ and at $\tilde{x}$:

$$\boxed{\mathbb{E}[\|v^k - \nabla f(x^k)\|^2] \leq 4L(f(x^k) - f^*) + 4L(f(\tilde{x}) - f^*)}$$

This is the central estimate of SVRG. The variance is not a fixed constant $\sigma^2$ — it is proportional to the suboptimality at the current point and at the snapshot. As $x^k \to x^*$ and $\tilde{x} \to x^*$, both terms go to zero. The noise self-corrects.

This is the mechanism that breaks the SGD barrier: unlike plain SGD, whose noise floor is a fixed $\sigma^2$, SVRG's effective noise shrinks with the algorithm's progress.
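Putting the pieces together, here is a compact SVRG sketch on a synthetic least-squares problem (the function names, step size, and epoch length are my illustrative choices):

```python
import numpy as np

def svrg(grad_i, full_grad, n, x0, alpha, m, epochs, seed=0):
    """SVRG: one O(n) snapshot gradient per epoch, then m corrected stochastic steps."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(epochs):
        x_tilde = x.copy()
        mu = full_grad(x_tilde)            # full pass: the snapshot gradient
        for _ in range(m):
            i = rng.integers(n)
            v = grad_i(i, x) - grad_i(i, x_tilde) + mu  # corrected gradient v^k
            x = x - alpha * v
    return x

# Demo problem: f_i(x) = 0.5 * (a_i^T x - b_i)^2
rng = np.random.default_rng(1)
n, d = 500, 10
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]

gi = lambda i, x: (A[i] @ x - b[i]) * A[i]
fg = lambda x: A.T @ (A @ x - b) / n

x_hat = svrg(gi, fg, n, np.zeros(d), alpha=0.005, m=2 * n, epochs=30)
```

Unlike fixed-step SGD, the iterates keep contracting toward $x^*$ instead of stalling at a noise floor, even though the step size never decreases.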


3. Convergence Theorem

Theorem (SVRG Linear Convergence): Let $f$ be $L$-smooth and $\mu$-strongly convex. Run SVRG with epoch length $m$, step size $\alpha < \frac{1}{4L}$, and set the next snapshot $\tilde{x}^{s+1}$ to a uniformly random iterate from epoch $s$. Then:

$$\mathbb{E}[f(\tilde{x}^s) - f^*] \leq \rho^s \left(f(\tilde{x}^0) - f^*\right)$$

where

$$\rho = \frac{1}{\mu\alpha(1 - 2L\alpha)m} + \frac{2L\alpha}{1 - 2L\alpha}$$

For $\rho < 1$, choose $m = \frac{50L}{\mu} = 50\kappa$ and $\alpha = \frac{1}{10L}$. Then $\mu\alpha = \frac{1}{10\kappa}$ and $1 - 2L\alpha = \frac{4}{5}$, so:

$$\rho \leq \frac{1}{\frac{1}{10\kappa} \cdot \frac{4}{5} \cdot 50\kappa} + \frac{2/10}{1 - 2/10} = \frac{1}{4} + \frac{1}{4} = \frac{1}{2}$$

A contraction factor of $\rho = 1/2$ per epoch, regardless of $\kappa$.
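The arithmetic can be checked directly from the rate formula (a throwaway helper, not from the post): with $\alpha = \frac{1}{10L}$ and $m = 50\kappa$, each of the two terms evaluates to $1/4$.

```python
def svrg_rho(mu, L, alpha, m):
    """Epoch contraction factor from the theorem (requires alpha < 1/(4L))."""
    return 1.0 / (mu * alpha * (1 - 2 * L * alpha) * m) + (2 * L * alpha) / (1 - 2 * L * alpha)

L_, kappa = 1.0, 100.0
mu_ = L_ / kappa
rho = svrg_rho(mu_, L_, alpha=1 / (10 * L_), m=int(50 * kappa))  # 1/4 + 1/4 = 1/2
```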

Proof sketch. Track $\mathbb{E}[\|x^{k+1} - x^*\|^2]$ for one inner step. Expand using the SVRG update, then apply:

  1. Strong convexity to bound $\nabla f(x^k)^T(x^k - x^*)$ from below by $\mu\|x^k - x^*\|^2$
  2. The variance bound from Section 2 to control the noise term
  3. The resulting per-step inequality: $\mathbb{E}[\|x^{k+1} - x^*\|^2] \leq (1 - 2\mu\alpha)\|x^k - x^*\|^2 + \alpha^2\left[4L(f(x^k) - f^*) + 4L(f(\tilde{x}) - f^*)\right]$

The variance bound connects the noise to suboptimality, which connects to distance via strong convexity. This closes the loop: the contraction in distance-to-optimum drives down the variance, which drives down the distance further.

Summing over $m$ inner steps, taking expectation over the random snapshot selection, and rearranging gives the epoch-level contraction factor $\rho$. $\square$


4. Cost Comparison: The Payoff

Let's count total gradient evaluations to reach $\epsilon$-accuracy.

Epoch cost for SVRG:

  • Snapshot gradient: $n$ evaluations (one full pass)
  • Inner loop of $m = 50\kappa$ steps: $50\kappa$ evaluations (each step evaluates one fresh $\nabla f_{i_k}(x^k)$; the snapshot terms $\nabla f_{i_k}(\tilde{x})$ can be cached during the full pass)
  • Total per epoch: $n + 50\kappa$

Epochs to $\epsilon$-accuracy: Since $\rho \leq 1/2$, after $s$ epochs the error is $\leq (1/2)^s \cdot$ initial error. To reach $\epsilon$ from initial error $\Delta$: $s = \log_2(\Delta/\epsilon) = O(\log(1/\epsilon))$ epochs.

Total gradient evaluations for SVRG: $O\!\left((n + \kappa)\log\frac{1}{\epsilon}\right)$

Compare to full GD: GD needs $O(\kappa)$ steps, each costing $O(n)$ evaluations: $O\!\left(n\kappa \log\frac{1}{\epsilon}\right)$

Compare to SGD: SGD with rate $O(1/k)$ needs $O(1/\epsilon)$ steps of $O(1)$ cost: $O\!\left(\frac{1}{\epsilon}\right)$

| Method | Total gradient evals | Linear rate? |
|---|---|---|
| GD | $O(n\kappa \log 1/\epsilon)$ | Yes |
| SGD | $O(1/\epsilon)$ | No |
| SVRG | $O((n+\kappa)\log 1/\epsilon)$ | Yes |

SVRG achieves linear convergence (like GD) with a factor of $n + \kappa$ instead of $n\kappa$. When $\kappa \gg 1$ — the regime where GD is slow — SVRG outperforms GD by a factor of $\frac{n\kappa}{n + \kappa}$: roughly $n$ when $\kappa \gg n$, and roughly $\kappa$ when $n \gg \kappa$.

Concretely: $n = 10^6$ data points, $\kappa = 10^3$, target $\epsilon = 10^{-6}$ (so $\log(1/\epsilon) \approx 14$):

  • GD: $10^6 \times 10^3 \times 14 \approx 1.4 \times 10^{10}$ evaluations
  • SVRG: $(10^6 + 5 \times 10^4) \times 14 \approx 1.5 \times 10^7$ evaluations

SVRG is roughly 1000x cheaper. With identical convergence guarantees.
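The back-of-envelope count above as a tiny calculator (the constants are this post's, with the same $\log(1/\epsilon)$ factor applied to both methods, so treat the exact speedup as illustrative):

```python
import math

def gd_evals(n, kappa, eps):
    """GD: ~kappa*log(1/eps) iterations to eps-accuracy, n component gradients each."""
    return n * kappa * math.log(1 / eps)

def svrg_evals(n, kappa, eps, m_factor=50):
    """SVRG: ~log(1/eps) epochs at rho = 1/2, each costing n + m_factor*kappa."""
    return (n + m_factor * kappa) * math.log(1 / eps)

n, kappa, eps = 10**6, 10**3, 1e-6
speedup = gd_evals(n, kappa, eps) / svrg_evals(n, kappa, eps)  # close to 1000x
```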


5. Continual Learning Extension: CSVRG

SVRG assumes a fixed dataset $\{f_1, \ldots, f_n\}$. In many modern settings — online learning, continual learning, streaming data — new data points arrive over time and the model must update without forgetting old knowledge.

The naive approach: when $f_{n+1}$ arrives, re-run SVRG from scratch on $\{f_1, \ldots, f_{n+1}\}$. Cost: $O(n)$ per epoch. For a stream of $T$ new points, total cost $O(Tn)$ — quadratic in the total amount of data.

CSVRG (Continual SVRG) addresses this by maintaining a stale snapshot gradient $\tilde{\mu}$ that is updated incrementally. When $f_{n+1}$ arrives:

  1. Update the snapshot gradient: $\tilde{\mu}^{\text{new}} = \frac{n\tilde{\mu} + \nabla f_{n+1}(\tilde{x})}{n+1}$ — a running-average update costing a single gradient evaluation
  2. Continue running SVRG inner steps, now with the updated $\tilde{\mu}^{\text{new}}$
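Step 1 can be sketched as follows (a hypothetical helper; the post gives no code, and the function name and state layout are my assumptions):

```python
import numpy as np

def extend_snapshot(mu, n, grad_new, x_tilde):
    """Fold a newly arrived component gradient into the stale snapshot mean.

    mu:       current snapshot gradient (mean of n component gradients at x_tilde)
    grad_new: callable returning the new component's gradient at a point
    Costs one gradient evaluation, versus O(n) for recomputing from scratch.
    """
    g = grad_new(x_tilde)
    return (n * mu + g) / (n + 1), n + 1

# Toy check with linear component gradients g_i(x) = x - c_i:
c = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
x_t = np.array([0.5, 0.5])
mu = np.mean([x_t - ci for ci in c], axis=0)  # exact mean over the 2 old components
mu, n_new = extend_snapshot(mu, 2, lambda x: x - np.array([3.0, 3.0]), x_t)
# mu now equals the exact mean over all 3 components
```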

The staleness of $\tilde{\mu}$ introduces a bounded extra variance term that degrades gracefully — the convergence rate remains linear, with a slightly worse constant that shrinks as the model reprocesses old data.

CSVRG trades exact variance reduction for memory efficiency. It is the right tool when data arrives faster than you can take full passes, and is one of several variance-reduced methods finding application in federated learning (where clients have local data and cannot share raw gradients).


Where We Stand

SVRG closes the loop opened in Post 4. The tension between cost and convergence rate — which seemed fundamental to stochastic optimization — is resolved by exploiting the finite-sum structure. The variance is not an intrinsic property of the problem; it is a property of the estimator, and a cleverer estimator eliminates it.

The next post turns to a different structural challenge: what happens when the objective is not even differentiable? The Lasso, total variation regularization, and group sparsity penalties all have kinks. Subgradients are the right generalization of gradients in this world — and the proximal operator is the right generalization of the gradient step.


Part of the Mathematics of Data series — mathematical notes on EE-556 at EPFL.