
Mathematics of Data

Frank-Wolfe: Optimization Without Projections

March 2025 · optimization · frank-wolfe · constrained · matrix-completion

Post 6 handled non-smooth objectives by splitting them into a smooth part and a non-smooth part with a tractable proximal operator. But there is a second class of problems where that approach fails: constrained optimization over sets where projection is computationally expensive.

The Frank-Wolfe algorithm — also called the Conditional Gradient method — sidesteps projection entirely by replacing it with a simpler operation: minimizing a linear function over the constraint set. In many important cases, this linear minimization is orders of magnitude cheaper than projection.


1. The Projection Bottleneck

Consider constrained optimization:

$$\min_{x \in \mathcal{C}} f(x)$$

The standard approach is projected gradient descent:

$$x^{k+1} = \Pi_\mathcal{C}\!\left(x^k - \alpha \nabla f(x^k)\right)$$

where $\Pi_\mathcal{C}(z) = \arg\min_{x \in \mathcal{C}} \|x - z\|^2$ is the Euclidean projection onto $\mathcal{C}$.
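For concreteness, here is a minimal NumPy sketch of projected gradient descent on a box constraint; the example problem and variable names are mine, not from the post:

```python
import numpy as np

# Minimize f(x) = ||x - b||^2 over the box [0, 1]^3.
# f is L-smooth with L = 2, so any step size alpha < 1/L = 0.5 converges.
b = np.array([1.5, -0.3, 0.4])
x = np.zeros(3)
alpha = 0.25

for _ in range(100):
    grad = 2 * (x - b)
    x = np.clip(x - alpha * grad, 0.0, 1.0)   # gradient step, then project

# the solution is b clamped coordinate-wise to the box: [1.0, 0.0, 0.4]
```

Each iteration is a free gradient step followed by the cheap coordinate-wise clamp; the whole difficulty of the method lives in that projection.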

For simple constraint sets, projection is cheap:

  • $\ell_2$ ball $\{x : \|x\| \leq r\}$: scale $x$ if $\|x\| > r$. $O(d)$.
  • Simplex $\{x \geq 0 : \sum_i x_i = 1\}$: $O(d \log d)$ by sorting.
  • Box $[a,b]^d$: clamp each coordinate. $O(d)$.
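These three projections are short enough to write out. A sketch in NumPy (function names are mine; the simplex routine follows the standard sort-based algorithm):

```python
import numpy as np

def project_l2_ball(x, r):
    """Scale x back to the ball if it lies outside: O(d)."""
    nrm = np.linalg.norm(x)
    return x if nrm <= r else (r / nrm) * x

def project_simplex(x):
    """Euclidean projection onto the probability simplex: O(d log d) by sorting."""
    u = np.sort(x)[::-1]                      # sorted descending
    css = np.cumsum(u)
    j = np.arange(1, len(x) + 1)
    rho = np.nonzero(u + (1.0 - css) / j > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)      # shift that makes the result sum to 1
    return np.maximum(x + theta, 0.0)

def project_box(x, a, b):
    """Clamp each coordinate: O(d)."""
    return np.clip(x, a, b)
```

Each routine does the exact Euclidean projection for its set, which is what makes projected gradient descent practical there.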

But for the nuclear norm ball $\mathcal{C} = \{X \in \mathbb{R}^{m \times n} : \|X\|_* \leq \tau\}$, projection requires:

  1. Full SVD of the gradient-step matrix: $O(mn\min(m,n))$
  2. Soft-threshold the singular values
  3. Reconstruct

For a $10000 \times 10000$ matrix, the full SVD costs roughly $10^{12}$ flops per step. This is the bottleneck in matrix completion, collaborative filtering, and semidefinite relaxations.


2. The Frank-Wolfe Update

Frank-Wolfe replaces the projection with a Linear Minimization Oracle (LMO):

$$s^k = \arg\min_{s \in \mathcal{C}} \langle \nabla f(x^k),\, s \rangle$$

Then update via a convex combination:

$$x^{k+1} = (1 - \gamma_k) x^k + \gamma_k s^k$$

where $\gamma_k \in [0,1]$ is the step size.

Why does this stay in $\mathcal{C}$? Because $x^k \in \mathcal{C}$, $s^k \in \mathcal{C}$, and $\mathcal{C}$ is convex. The convex combination of two points in a convex set is in the set. No projection needed.

Interpretation. At each step, Frank-Wolfe:

  1. Linearizes $f$ around $x^k$: $f(x) \approx f(x^k) + \nabla f(x^k)^T(x - x^k)$
  2. Minimizes this linear approximation over $\mathcal{C}$ → finds $s^k$
  3. Steps toward $s^k$ from the current point

The LMO finds the direction in $\mathcal{C}$ most aligned with the negative gradient — the "best feasible descent direction."
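Putting the pieces together, a generic Frank-Wolfe loop needs only a gradient and an LMO. A sketch using the $\ell_1$ ball as the constraint set, whose LMO just picks the best signed coordinate vertex (all names here are mine):

```python
import numpy as np

def lmo_l1_ball(grad, r=1.0):
    """LMO for {x : ||x||_1 <= r}: the minimizing vertex is -r*sign(g_i)*e_i
    at the coordinate i where the gradient is largest in magnitude."""
    i = np.argmax(np.abs(grad))
    s = np.zeros_like(grad)
    s[i] = -r * np.sign(grad[i])
    return s

def frank_wolfe(grad_f, lmo, x0, max_iter=1000, tol=1e-8):
    x = x0.astype(float).copy()
    for k in range(max_iter):
        g = grad_f(x)
        s = lmo(g)
        gap = g @ (x - s)            # Frank-Wolfe gap: a suboptimality certificate
        if gap < tol:
            break
        gamma = 2.0 / (k + 2)        # standard open-loop step size
        x = (1 - gamma) * x + gamma * s
    return x

# minimize ||x - b||^2 over the unit l1 ball, with b outside the ball
b = np.array([2.0, 0.5])
x = frank_wolfe(lambda x: 2 * (x - b), lmo_l1_ball, np.zeros(2))
# converges to the vertex [1, 0]
```

Note that no projection appears anywhere: every iterate is a convex combination of feasible points, so feasibility is automatic.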


3. The Frank-Wolfe Gap

Before the convergence proof, we need a certificate of optimality. Define the Frank-Wolfe gap:

$$G^k = \langle \nabla f(x^k),\, x^k - s^k \rangle$$

Claim: $G^k \geq 0$, with $G^k = 0$ iff $x^k$ is optimal.

Proof. By definition of $s^k$: $\langle \nabla f(x^k), s^k \rangle \leq \langle \nabla f(x^k), x \rangle$ for all $x \in \mathcal{C}$. In particular, setting $x = x^*$:

$$\langle \nabla f(x^k), s^k \rangle \leq \langle \nabla f(x^k), x^* \rangle$$

Therefore:

$$G^k = \langle \nabla f(x^k), x^k - s^k \rangle \geq \langle \nabla f(x^k), x^k - x^* \rangle \geq f(x^k) - f(x^*)$$

where the last step uses convexity. So $G^k \geq f(x^k) - f^* \geq 0$. $\square$

The gap $G^k$ is computable — we only need $x^k$ and $s^k$, both of which we have. This makes it a practical stopping criterion: when $G^k < \epsilon$, we know $f(x^k) - f^* < \epsilon$.

Per-Step Descent Bound

Lemma. For $L$-smooth $f$ and step size $\gamma_k$:

$$f(x^{k+1}) \leq f(x^k) - \gamma_k G^k + \frac{\gamma_k^2 L}{2} D^2$$

where $D = \operatorname{diam}(\mathcal{C}) = \max_{x,y \in \mathcal{C}} \|x - y\|$.

Proof. Apply $L$-smoothness at $x = x^k$, $y = x^{k+1} = (1-\gamma_k)x^k + \gamma_k s^k$:

$$f(x^{k+1}) \leq f(x^k) + \nabla f(x^k)^T(x^{k+1} - x^k) + \frac{L}{2}\|x^{k+1} - x^k\|^2$$

Since $x^{k+1} - x^k = \gamma_k(s^k - x^k)$:

$$f(x^{k+1}) \leq f(x^k) + \gamma_k \nabla f(x^k)^T(s^k - x^k) + \frac{\gamma_k^2 L}{2}\|s^k - x^k\|^2 = f(x^k) - \gamma_k G^k + \frac{\gamma_k^2 L}{2}\|s^k - x^k\|^2$$

Bounding $\|s^k - x^k\| \leq D$ gives the result. $\square$

The descent is controlled by two competing terms: $-\gamma_k G^k$ (progress toward the optimum) and $\frac{\gamma_k^2 L D^2}{2}$ (error from the linear approximation). Minimizing the bound over $\gamma_k \in [0,1]$ gives $\gamma_k^* = \min\!\left(1,\, G^k/(LD^2)\right)$; when this minimum is interior, the resulting decrease is $(G^k)^2/(2LD^2)$.


4. Convergence: $O(1/k)$ Rate

Theorem. Frank-Wolfe with step size $\gamma_k = \frac{2}{k+2}$ gives:

$$f(x^k) - f^* \leq \frac{2LD^2}{k+2}$$

Proof by induction. Let $h_k = f(x^k) - f^*$.

Base case: with $\gamma_0 = 1$, the descent lemma and $G^0 \geq h_0$ give $h_1 \leq (1-\gamma_0)\,h_0 + \frac{\gamma_0^2 LD^2}{2} = \frac{LD^2}{2} \leq \frac{2LD^2}{1+2}$, so the bound holds at $k = 1$.

Inductive step: Assume $h_k \leq \frac{2LD^2}{k+2}$. Using the descent lemma with $\gamma_k = \frac{2}{k+2}$ and $G^k \geq h_k$:

$$h_{k+1} \leq h_k - \gamma_k h_k + \frac{\gamma_k^2 L D^2}{2} = h_k(1 - \gamma_k) + \frac{\gamma_k^2 LD^2}{2} \leq \frac{2LD^2}{k+2}\cdot \frac{k}{k+2} + \frac{4}{(k+2)^2}\cdot\frac{LD^2}{2} = \frac{2LD^2\,(k+1)}{(k+2)^2} \leq \frac{2LD^2}{k+3}$$

where the last inequality uses $(k+1)(k+3) \leq (k+2)^2$, completing the induction. $\square$

Note on the rate. The $O(1/k)$ rate matches ISTA and smooth GD — but with a different constant, involving the diameter $D$ rather than the initial distance $\|x^0 - x^*\|$. The price of avoiding projections is a dependence on the global geometry of $\mathcal{C}$ rather than just the local distance to the optimum.

Frank-Wolfe cannot be accelerated to $O(1/k^2)$ in general — this is a provable limitation: $O(1/k)$ is optimal for methods that access $\mathcal{C}$ only through an LMO. The trade-off is the cheaper per-step cost.
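The bound is easy to check numerically. A sketch for $f(x) = \|x - b\|^2$ over the probability simplex with $b$ feasible, so $f^* = 0$, $L = 2$, and $D = \sqrt{2}$ (the problem setup is mine):

```python
import numpy as np

d = 5
b = np.full(d, 1.0 / d)              # the optimum b lies in the simplex, so f* = 0
x = np.zeros(d); x[0] = 1.0          # start at a vertex
L, D2 = 2.0, 2.0                     # f is L-smooth with L = 2; diam(simplex)^2 = 2

for k in range(200):
    g = 2 * (x - b)
    s = np.zeros(d); s[np.argmin(g)] = 1.0   # simplex LMO: best vertex
    gamma = 2.0 / (k + 2)
    x = (1 - gamma) * x + gamma * s
    # the iterate after this step is x^{k+1}: check h_{k+1} <= 2 L D^2 / (k+3)
    assert np.dot(x - b, x - b) <= 2 * L * D2 / (k + 3)
```

Every iterate satisfies the theorem's bound with the stated constants; the asserts never fire.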


5. The Nuclear Norm Application: Matrix Completion

The motivating application is matrix completion: recover a low-rank matrix $M \in \mathbb{R}^{m \times n}$ from a subset $\Omega$ of its entries.

$$\min_{X \in \mathbb{R}^{m \times n}} \|P_\Omega(X - M)\|_F^2 \quad \text{subject to} \quad \|X\|_* \leq \tau$$

where $P_\Omega$ zeroes out entries not in $\Omega$.

LMO for the Nuclear Norm Ball

$$s^k = \arg\min_{\|S\|_* \leq \tau} \langle G^k, S \rangle$$

where $G^k = \nabla f(X^k) = -2P_\Omega(M - X^k)$ (the gradient of the squared loss).

Claim: $s^k = -\tau u_1 v_1^T$, where $u_1, v_1$ are the top left and right singular vectors of $G^k$.

Proof. The minimum of $\langle G, S \rangle$ over the nuclear norm ball follows from the duality between the nuclear norm and the spectral norm:

$$\min_{\|S\|_* \leq \tau} \langle G, S \rangle = -\tau \max_{\|S\|_* \leq 1} \langle G, S \rangle = -\tau \|G\|_2 = -\tau \sigma_{\max}(G)$$

The maximum of $\langle G, S \rangle$ over the unit nuclear norm ball is achieved at $S = u_1 v_1^T$ (a rank-1 matrix formed from the top singular vectors). So the LMO solution is $s^k = -\tau u_1 v_1^T$. $\square$
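A quick numerical check of the claim; the random matrix and names are mine, and a full SVD stands in for power iteration at this small size:

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((40, 30))        # stand-in gradient matrix
tau = 5.0

U, sig, Vt = np.linalg.svd(G)
s = -tau * np.outer(U[:, 0], Vt[0, :])   # LMO output: rank-1, scaled by tau

# the achieved value is -tau * sigma_max(G), the minimum over the ball
assert np.isclose(np.sum(G * s), -tau * sig[0])
```

At scale, only the top singular pair $(u_1, v_1)$ is needed, which is exactly what power iteration (or Lanczos) provides without forming the full SVD.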

Cost Comparison

| Operation | Cost |
|---|---|
| Full SVD (for projection onto the nuclear norm ball) | $O(mn\min(m,n))$ |
| Top singular vector via power iteration (for the LMO) | $O(mn)$ |

For an $m \times n$ matrix, the LMO is $\min(m,n)$ times cheaper than projection. For a $10000 \times 10000$ matrix: projection costs $O(10^{12})$ flops, the LMO $O(10^8)$ — a factor of $10^4$ cheaper per step.

Rank-1 Update Structure

Each Frank-Wolfe step moves $X^{k+1}$ from $X^k$ toward $-\tau u_1 v_1^T$:

$$X^{k+1} = (1-\gamma_k)X^k - \gamma_k \tau u_1 v_1^T$$

This is a rank-1 update of the current iterate. If $X^k$ has rank $r$, then $X^{k+1}$ has rank at most $r+1$. Starting from $X^0 = 0$, after $k$ steps the iterate has rank at most $k$.

This is the sparsity structure of Frank-Wolfe applied to matrix problems: the iterates are naturally low-rank, and the low-rank structure is preserved and grows incrementally. This is exactly right for matrix completion — the ground truth $M$ is low-rank, and we want the iterates to stay low-rank throughout the optimization.
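A compact end-to-end sketch of Frank-Wolfe for matrix completion; the dimensions, sampling rate, and the choice $\tau = \|M\|_*$ are mine, and a full SVD again stands in for the top-singular-pair computation at this size:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 60, 50, 3
M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # rank-3 ground truth
mask = rng.random((m, n)) < 0.4                                  # observed entries Omega
tau = np.linalg.svd(M, compute_uv=False).sum()                   # nuclear norm budget

X = np.zeros((m, n))
for k in range(25):
    G = 2 * mask * (X - M)                   # gradient of ||P_Omega(X - M)||_F^2
    U, _, Vt = np.linalg.svd(G)              # only the top pair is needed at scale
    S = -tau * np.outer(U[:, 0], Vt[0, :])   # LMO output: rank-1 atom
    gamma = 2.0 / (k + 2)
    X = (1 - gamma) * X + gamma * S          # rank grows by at most 1 per step
```

After $k$ steps the iterate is a combination of at most $k$ rank-1 atoms, so it stays inside the nuclear norm ball and its rank never exceeds the iteration count.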


6. Sparsity in Frank-Wolfe

For polyhedral constraint sets $\mathcal{C}$ (convex hulls of a finite set of vertices, like the simplex or the $\ell_1$ ball), the LMO always returns a vertex of $\mathcal{C}$. Each Frank-Wolfe step is a convex combination of $x^k$ and a vertex, so the iterates are convex combinations of at most $k+1$ vertices.

This gives natural sparsity: after $k$ steps, $x^k$ is a combination of at most $k+1$ extreme points of $\mathcal{C}$. For the simplex, this means $x^k$ has at most $k+1$ non-zero coordinates.
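The sparsity claim is easy to observe on the simplex (the problem data here is mine):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 1000
b = np.abs(rng.standard_normal(d))           # target for f(x) = ||x - b||^2
x = np.zeros(d); x[0] = 1.0                  # start at a vertex: 1 nonzero

for k in range(10):
    g = 2 * (x - b)
    s = np.zeros(d); s[np.argmin(g)] = 1.0   # LMO returns a single vertex e_i
    gamma = 2.0 / (k + 2)
    x = (1 - gamma) * x + gamma * s
    assert np.count_nonzero(x) <= k + 2      # at most k+2 vertices mixed so far
```

In 1000 dimensions, ten iterations touch at most eleven coordinates; the rest stay exactly zero, with no thresholding needed.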

Compare to proximal methods on the same problem:

  • Proximal/projection: Dense iterates, but projection is the hard operation
  • Frank-Wolfe: Sparse iterates, LMO is easy

The two approaches are complementary tools. When the constraint set has an easy proximal operator, use ISTA/FISTA. When the LMO is cheap and projection is hard, use Frank-Wolfe. For nuclear norm constraints on large matrices, Frank-Wolfe wins decisively.


Looking Forward

The pattern across the last three posts: optimization structure determines the right algorithm.

  • Composite $f + g$ with cheap prox → ISTA/FISTA
  • Constrained with cheap LMO → Frank-Wolfe
  • No structure at all, just convexity → subgradient descent ($O(1/\sqrt{k})$, unavoidable)

Post 8 fills in this last case. Subgradients are the natural tool for fully non-smooth functions — but the price is real, and the $O(1/\sqrt{k})$ lower bound is not improvable. That post closes the convergence rate hierarchy for the first-order optimization landscape.


Part of the Mathematics of Data series — mathematical notes on EE-556 at EPFL.