March 2025 · optimization · frank-wolfe · constrained · matrix-completion
Post 6 handled non-smooth objectives by splitting them into a smooth part and a non-smooth part with a tractable proximal operator. But there is a second class of problems where that approach fails: constrained optimization over sets where projection is computationally expensive.
The Frank-Wolfe algorithm — also called the Conditional Gradient method — sidesteps projection entirely by replacing it with a simpler operation: minimizing a linear function over the constraint set. In many important cases, this linear minimization is orders of magnitude cheaper than projection.
## 1. The Projection Bottleneck
Consider constrained optimization:
$$\min_{x \in C} f(x)$$
The standard approach is projected gradient descent:
$$x_{k+1} = \Pi_C\left(x_k - \alpha \nabla f(x_k)\right)$$
where $\Pi_C(z) = \arg\min_{x \in C} \|x - z\|_2$ is the Euclidean projection onto $C$.
For simple constraint sets, projection is cheap:
- $\ell_2$ ball $\{x : \|x\| \le r\}$: scale $x$ if $\|x\| > r$. $O(d)$.
- Simplex $\{x \ge 0 : \sum_i x_i = 1\}$: $O(d \log d)$ by sorting.
- Box $[a, b]^d$: clamp each coordinate. $O(d)$.
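These three projections are simple enough to sketch directly. A minimal NumPy version (function names are mine; `project_simplex` follows the standard sort-based algorithm):

```python
import numpy as np

def project_l2_ball(x, r):
    """Project onto {x : ||x||_2 <= r}: rescale if outside. O(d)."""
    norm = np.linalg.norm(x)
    return x if norm <= r else (r / norm) * x

def project_box(x, a, b):
    """Project onto the box [a, b]^d: clamp each coordinate. O(d)."""
    return np.clip(x, a, b)

def project_simplex(x):
    """Project onto {x >= 0 : sum_i x_i = 1} via the O(d log d) sort-based method."""
    u = np.sort(x)[::-1]                      # sorted descending
    css = np.cumsum(u)
    ks = np.arange(1, len(x) + 1)
    k = ks[u + (1 - css) / ks > 0][-1]        # largest k with positive threshold
    tau = (1 - css[k - 1]) / k
    return np.maximum(x + tau, 0)
```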
But for the nuclear norm ball $C = \{X \in \mathbb{R}^{m \times n} : \|X\|_* \le \tau\}$, projection requires:
1. a full SVD of the gradient-step matrix, costing $O(mn \min(m, n))$,
2. soft-thresholding the singular values,
3. reconstructing the matrix.
For a $10000 \times 10000$ matrix, the full SVD costs roughly $10^{12}$ flops per step. This is the bottleneck in matrix completion, collaborative filtering, and semidefinite relaxations.
## 2. The Frank-Wolfe Update
Frank-Wolfe replaces the projection with a Linear Minimization Oracle (LMO):
$$s_k = \arg\min_{s \in C} \langle \nabla f(x_k), s \rangle$$
Then update via a convex combination:
$$x_{k+1} = (1 - \gamma_k) x_k + \gamma_k s_k$$
where $\gamma_k \in [0, 1]$ is the step size.
Why does this stay in $C$? Because $x_k \in C$, $s_k \in C$, and $C$ is convex. A convex combination of two points in a convex set stays in the set. No projection needed.
Interpretation. At each step, Frank-Wolfe:
1. linearizes $f$ around $x_k$: $f(x) \approx f(x_k) + \nabla f(x_k)^T (x - x_k)$,
2. minimizes this linear approximation over $C$ to find $s_k$,
3. steps toward $s_k$ from the current point.
The LMO finds the direction in C most aligned with the negative gradient — the "best feasible descent direction."
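For intuition, here is what the LMO looks like for two standard sets. Over the $\ell_1$ ball the minimizer of a linear function is a signed, scaled coordinate vector; over the simplex it is the vertex with the smallest gradient entry (a NumPy sketch; function names are mine):

```python
import numpy as np

def lmo_l1_ball(grad, tau):
    """argmin_{||s||_1 <= tau} <grad, s>: all mass on the largest-|gradient| coordinate."""
    i = np.argmax(np.abs(grad))
    s = np.zeros_like(grad)
    s[i] = -tau * np.sign(grad[i])
    return s

def lmo_simplex(grad):
    """argmin over the probability simplex: the vertex e_i with the smallest gradient entry."""
    s = np.zeros_like(grad)
    s[np.argmin(grad)] = 1.0
    return s
```

Both cost $O(d)$: no sorting, no projection, just a scan for the extreme coordinate.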
## 3. The Frank-Wolfe Gap
Before the convergence proof, we need a certificate of optimality. Define the Frank-Wolfe gap:
$$G_k = \langle \nabla f(x_k), x_k - s_k \rangle$$
Claim: $G_k \ge 0$, with $G_k = 0$ iff $x_k$ is optimal.
Proof. By definition of $s_k$: $\langle \nabla f(x_k), s_k \rangle \le \langle \nabla f(x_k), x \rangle$ for all $x \in C$. In particular, setting $x = x^*$:
$$\langle \nabla f(x_k), s_k \rangle \le \langle \nabla f(x_k), x^* \rangle$$
Therefore:
$$G_k = \langle \nabla f(x_k), x_k - s_k \rangle \ge \langle \nabla f(x_k), x_k - x^* \rangle \ge f(x_k) - f(x^*)$$
where the last step uses convexity. So $G_k \ge f(x_k) - f^* \ge 0$. □
The gap $G_k$ is computable: we only need $x_k$ and $s_k$, both of which we have. This makes it a practical stopping criterion: when $G_k < \epsilon$, we know $f(x_k) - f^* < \epsilon$.
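Putting the update and the gap together gives the full loop (a sketch assuming callables `grad_f` and `lmo` are supplied; the step size is the standard $\gamma_k = 2/(k+2)$ schedule analyzed in the next section):

```python
import numpy as np

def frank_wolfe(grad_f, lmo, x0, tol=1e-6, max_iter=1000):
    """Frank-Wolfe with the FW gap G_k = <grad, x_k - s_k> as stopping criterion."""
    x = x0.copy()
    for k in range(max_iter):
        g = grad_f(x)
        s = lmo(g)
        gap = g @ (x - s)          # G_k >= f(x_k) - f*, so gap < tol certifies near-optimality
        if gap < tol:
            break
        gamma = 2.0 / (k + 2)      # standard step-size schedule
        x = (1 - gamma) * x + gamma * s
    return x, gap
```

Note that feasibility is automatic: `x` is only ever updated by convex combination with an LMO output, so no projection appears anywhere in the loop.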
### Per-Step Descent Bound
Lemma. For $L$-smooth $f$ and step size $\gamma_k$:
$$f(x_{k+1}) \le f(x_k) - \gamma_k G_k + \frac{\gamma_k^2 L D^2}{2}$$
where $D = \operatorname{diam}(C) = \max_{x, y \in C} \|x - y\|$.
Proof. Apply $L$-smoothness at $x = x_k$, $y = x_{k+1} = (1 - \gamma_k) x_k + \gamma_k s_k$:
$$f(x_{k+1}) \le f(x_k) + \nabla f(x_k)^T (x_{k+1} - x_k) + \frac{L}{2} \|x_{k+1} - x_k\|^2$$
Since $x_{k+1} - x_k = \gamma_k (s_k - x_k)$, the right-hand side equals
$$f(x_k) + \gamma_k \nabla f(x_k)^T (s_k - x_k) + \frac{\gamma_k^2 L}{2} \|s_k - x_k\|^2 = f(x_k) - \gamma_k G_k + \frac{\gamma_k^2 L}{2} \|s_k - x_k\|^2$$
Bounding $\|s_k - x_k\| \le D$ gives the result. □
The descent is controlled by two competing terms: $-\gamma_k G_k$ (progress toward the optimum) and $\frac{\gamma_k^2 L D^2}{2}$ (error from the linear approximation). Optimizing over $\gamma_k$ gives $\gamma_k^* = G_k / (L D^2)$ and a maximum decrease of $G_k^2 / (2 L D^2)$.
## 4. Convergence: O(1/k) Rate
Theorem. Frank-Wolfe with step size $\gamma_k = \frac{2}{k+2}$ gives:
$$f(x_k) - f^* \le \frac{2 L D^2}{k + 2}$$
Proof by induction. Let $h_k = f(x_k) - f^*$.
Base case: with $\gamma_0 = \frac{2}{0+2} = 1$, the descent lemma and $G_0 \ge h_0$ give $h_1 \le (1 - \gamma_0) h_0 + \frac{\gamma_0^2 L D^2}{2} = \frac{L D^2}{2} \le \frac{2 L D^2}{3}$, so the bound holds at $k = 1$.
Inductive step: assume $h_k \le \frac{2 L D^2}{k + 2}$. Using the descent lemma with $\gamma_k = \frac{2}{k+2}$ and $G_k \ge h_k$:
$$h_{k+1} \le h_k - \gamma_k h_k + \frac{\gamma_k^2 L D^2}{2} = (1 - \gamma_k) h_k + \frac{\gamma_k^2 L D^2}{2}$$
$$\le \frac{2 L D^2}{k + 2} \cdot \frac{k}{k + 2} + \frac{4}{(k + 2)^2} \cdot \frac{L D^2}{2} = \frac{L D^2 (2k + 2)}{(k + 2)^2} = \frac{2 L D^2}{k + 2} \cdot \frac{k + 1}{k + 2} \le \frac{2 L D^2}{k + 3}$$
where the final inequality uses $(k+1)(k+3) \le (k+2)^2$, completing the induction. □
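The bound can be checked numerically. The sketch below runs Frank-Wolfe on a small quadratic over the simplex and asserts $h_k \le \frac{2LD^2}{k+2}$ at every iterate (assumed setup: $f(x) = \|x - c\|^2$ with $c$ on the simplex, so $L = 2$, $f^* = 0$, and the simplex diameter satisfies $D^2 = 2$):

```python
import numpy as np

c = np.array([0.2, 0.3, 0.5])      # the optimum: c is feasible, so f* = 0
L, D2 = 2.0, 2.0                   # f(x) = ||x - c||^2 is 2-smooth; simplex diameter sqrt(2)

x = np.array([1.0, 0.0, 0.0])      # x_0 at a vertex
for k in range(200):
    g = 2 * (x - c)
    s = np.zeros(3)
    s[np.argmin(g)] = 1.0          # simplex LMO: vertex with smallest gradient entry
    gamma = 2.0 / (k + 2)
    x = (1 - gamma) * x + gamma * s        # x_{k+1}
    h = np.sum((x - c) ** 2)               # h_{k+1} = f(x_{k+1}) - f*
    assert h <= 2 * L * D2 / (k + 3) + 1e-12, "theorem bound violated"
```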
Note on the rate. The $O(1/k)$ rate matches ISTA and smooth GD, but with a different constant involving the diameter $D$ rather than the initial distance $\|x_0 - x^*\|$. The price of avoiding projections is a dependence on the global geometry of $C$ rather than just the local distance to the optimum.
Frank-Wolfe cannot be accelerated to $O(1/k^2)$ in general; this is a provable limitation, and the $O(1/k)$ rate is optimal for the conditional gradient method. The trade-off is the cheaper per-step cost.
## 5. The Nuclear Norm Application: Matrix Completion
The motivating application is matrix completion: recover a low-rank matrix $M \in \mathbb{R}^{m \times n}$ from a subset $\Omega$ of its entries.
$$\min_{X \in \mathbb{R}^{m \times n}} \|P_\Omega(X - M)\|_F^2 \quad \text{subject to} \quad \|X\|_* \le \tau$$
where $P_\Omega$ zeroes out entries not in $\Omega$.
### LMO for the Nuclear Norm Ball
$$s_k = \arg\min_{\|S\|_* \le \tau} \langle G_k, S \rangle$$
where $G_k = \nabla f(X_k) = -2 P_\Omega(M - X_k)$ (the gradient of the squared loss).
Claim: $s_k = -\tau u_1 v_1^T$, where $u_1, v_1$ are the top left and right singular vectors of $G_k$.
Proof. The nuclear norm and the spectral norm are dual, so $\max_{\|S\|_* \le \tau} \langle G, S \rangle = \tau \|G\|_2 = \tau \sigma_{\max}(G)$, with the maximum attained at $S = \tau u_1 v_1^T$, the rank-1 matrix built from the top singular vectors of $G$. Flipping the sign to minimize gives $s_k = -\tau u_1 v_1^T$, achieving $\langle G_k, s_k \rangle = -\tau \sigma_{\max}(G_k)$. □
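The LMO only needs the top singular pair, which a few power iterations on $G^T G$ recover (a NumPy sketch; in practice a sparse iterative solver like `scipy.sparse.linalg.svds` would be used instead):

```python
import numpy as np

def lmo_nuclear(G, tau, iters=50):
    """argmin_{||S||_* <= tau} <G, S>  =  -tau * u1 v1^T, via power iteration."""
    rng = np.random.default_rng(0)
    v = rng.standard_normal(G.shape[1])
    for _ in range(iters):
        v = G.T @ (G @ v)              # power iteration on G^T G -> top right singular vector
        v /= np.linalg.norm(v)
    u = G @ v
    u /= np.linalg.norm(u)             # matching top left singular vector
    return -tau * np.outer(u, v)       # rank-1 minimizer
```

Each power step costs one multiply by $G$ and one by $G^T$: $O(mn)$, with no SVD anywhere.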
### Cost Comparison
| Operation | Cost |
|---|---|
| Full SVD (for projection onto the nuclear norm ball) | $O(mn \min(m, n))$ |
| Top singular vector via power iteration (for the LMO) | $O(mn)$ |
For an $m \times n$ matrix, the LMO is a factor of $\min(m, n)$ cheaper than projection. For a $10000 \times 10000$ matrix: projection costs $O(10^{12})$ flops, the LMO $O(10^8)$, a factor of $10^4$ cheaper per step.
### Rank-1 Update Structure
Each Frank-Wolfe step moves $X_{k+1}$ from $X_k$ toward $-\tau u_1 v_1^T$:
$$X_{k+1} = (1 - \gamma_k) X_k - \gamma_k \tau u_1 v_1^T$$
This is a rank-1 update of the current iterate. If $X_k$ has rank $r$, then $X_{k+1}$ has rank at most $r + 1$. Starting from $X_0 = 0$, after $k$ steps the iterate has rank at most $k$.
This is the sparsity structure of Frank-Wolfe applied to matrix problems: the iterates are naturally low-rank, and the low-rank structure is preserved and grows incrementally. This is exactly right for matrix completion: the ground truth $M$ is low-rank, and we want our iterates to be low-rank approximations throughout.
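A matrix-completion loop along these lines fits in a few lines (a NumPy sketch; `omega` is an assumed 0/1 observation mask, and `np.linalg.svd` stands in for the cheaper power-iteration LMO for brevity):

```python
import numpy as np

def fw_matrix_completion(M, omega, tau, iters=500):
    """Frank-Wolfe for  min ||P_Omega(X - M)||_F^2  s.t.  ||X||_* <= tau."""
    X = np.zeros_like(M)
    for k in range(iters):
        G = 2 * omega * (X - M)                  # gradient; the mask implements P_Omega
        U, _, Vt = np.linalg.svd(G)              # only the top pair is needed
        S = -tau * np.outer(U[:, 0], Vt[0])      # rank-1 LMO solution
        gamma = 2.0 / (k + 2)
        X = (1 - gamma) * X + gamma * S          # rank grows by at most 1 per step
    return X
```

Storing the rank-1 factors $(u_1, v_1, \gamma_k)$ instead of the dense `X` would keep the memory footprint proportional to the current rank, which is the real payoff at scale.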
## 6. Sparsity in Frank-Wolfe
For polyhedral constraint sets $C$ (convex hulls of a finite set of vertices, like the simplex or the $\ell_1$ ball), the LMO always returns a vertex of $C$. Each Frank-Wolfe step is a convex combination of $x_k$ and a vertex, so the iterates are convex combinations of at most $k + 1$ vertices.
This gives natural sparsity: after $k$ steps, $x_k$ is a combination of at most $k + 1$ extreme points of $C$. For the simplex, this means $x_k$ has at most $k + 1$ non-zero coordinates.
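The sparsity is easy to observe: run a handful of Frank-Wolfe steps over a high-dimensional simplex and count non-zeros (a small sketch using the quadratic $f(x) = \|x - c\|^2$ as a stand-in objective):

```python
import numpy as np

d = 100
c = np.full(d, 1.0 / d)            # target: the uniform distribution on the simplex
x = np.zeros(d)
x[0] = 1.0                         # start at a vertex: 1 non-zero coordinate

for k in range(9):
    g = 2 * (x - c)
    s = np.zeros(d)
    s[np.argmin(g)] = 1.0          # simplex LMO returns a vertex
    gamma = 2.0 / (k + 2)
    x = (1 - gamma) * x + gamma * s

print(np.count_nonzero(x))         # at most k+1 = 10 non-zeros, despite d = 100
```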
Compare to proximal methods on the same problem:
- Proximal/projection: dense iterates, but projection is the hard operation.
- Frank-Wolfe: sparse iterates, and the LMO is easy.
The two approaches are complementary tools. When the constraint set has an easy proximal operator, use ISTA/FISTA. When the LMO is cheap and projection is hard, use Frank-Wolfe. For nuclear norm constraints on large matrices, Frank-Wolfe wins decisively.
## Looking Forward
The pattern across the last three posts: optimization structure determines the right algorithm.
- Composite $f + g$ with a cheap prox → ISTA/FISTA
- Constrained with a cheap LMO → Frank-Wolfe
- No structure at all, just convexity → subgradient descent ($O(1/\sqrt{k})$, unavoidable)
Post 8 fills in this last case. Subgradients are the natural tool for fully non-smooth functions, but the price is real: the $O(1/\sqrt{k})$ rate is a lower bound that cannot be improved. That post closes the convergence-rate hierarchy for the first-order optimization landscape.
Part of the Mathematics of Data series — mathematical notes on EE-556 at EPFL.