Non-Smooth Optimization: Proximal Operators and FISTA
March 2025 · optimization · proximal · fista · lasso · non-smooth
Every algorithm we have seen so far assumes f is differentiable. In Post 1, we saw that the ℓ1 norm's pointy geometry is precisely what induces sparsity in solutions. But that pointy geometry comes with a price: ∥x∥₁ is not differentiable at any point where some xᵢ = 0, which is exactly the region of interest.
The Lasso objective is F(x) = ½∥Ax − b∥² + λ∥x∥₁. The squared loss is smooth. The regularizer is not. We cannot simply compute ∇F and run gradient descent.
This post develops the right framework for this structure: the proximal operator, which handles the non-smooth part exactly, and FISTA, which recovers the optimal O(1/k²) convergence rate.
1. The Composite Problem
The problems we care about have the form:
min_{x ∈ ℝᵈ} F(x) = f(x) + g(x)
where:
f is smooth: differentiable, L-smooth, and convex
g is convex but non-smooth: known closed form, but not differentiable everywhere
The canonical example: f(x) = ½∥Ax − b∥² and g(x) = λ∥x∥₁.
The naive approach is subgradient descent on F directly. But this discards the structure entirely: it treats F as a single black-box non-smooth function and pays the slow O(1/√k) rate that comes with non-smoothness (we will derive this lower bound in Post 8).
We can do much better by exploiting the split: f is smooth so we can take gradient steps on it; g has known structure so we can handle it exactly. The algorithm that does this is built around the proximal operator.
2. The Proximal Operator
Definition. For a convex function g and step size α>0, the proximal operator is:
prox_{αg}(x) = argmin_{u ∈ ℝᵈ} { g(u) + (1/(2α))∥u − x∥² }
The minimization always has a unique solution (the objective is strongly convex in u), so prox_{αg} is well-defined everywhere.
Interpretation. prox_{αg}(x) finds the point u that minimizes g while not straying too far from x. The parameter α controls the trade-off: large α favors minimizing g; small α keeps u close to x. It is a soft projection: rather than projecting onto the minimizer of g, it compromises between minimizing g and staying near the current point.
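The definition can be sanity-checked numerically. A small sketch (my own setup, not from the original post): for g(u) = |u| in one dimension, brute-force minimization of the prox objective over a fine grid should land on the soft-thresholded value derived below.

```python
import numpy as np

# Compute prox_{alpha*g}(x) for g(u) = |u| by brute-force grid search
# over the prox objective g(u) + (u - x)^2 / (2*alpha).
alpha, x = 0.5, 1.3
u = np.linspace(-3.0, 3.0, 600001)          # grid spacing 1e-5
objective = np.abs(u) + (u - x) ** 2 / (2.0 * alpha)
u_star = u[np.argmin(objective)]
print(u_star)  # close to x - alpha = 0.8, the soft-thresholded value
```

Since x = 1.3 exceeds the threshold α = 0.5, the minimizer shrinks x toward zero by exactly α, matching the case analysis for the ℓ1 prox below.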
Soft Thresholding: The Proximal Operator of ℓ1
Let g(x) = λ∥x∥₁. The proximal operator decouples across coordinates, so we solve the scalar problem for each i:
[prox_{αg}(x)]ᵢ = argmin_{u ∈ ℝ} { λ|u| + (1/(2α))(u − xᵢ)² }
Call this h(u) = λ|u| + (1/(2α))(u − xᵢ)². We find the minimizer by cases.
Case 1: u > 0. Then h(u) = λu + (1/(2α))(u − xᵢ)², which is smooth. Setting h′(u) = 0:
λ + (u − xᵢ)/α = 0 ⟹ u = xᵢ − αλ
This is valid (gives u > 0) iff xᵢ > αλ.
Case 2: u < 0. Then h(u) = −λu + (1/(2α))(u − xᵢ)². Setting h′(u) = 0:
−λ + (u − xᵢ)/α = 0 ⟹ u = xᵢ + αλ
Valid (gives u < 0) iff xᵢ < −αλ.
Case 3: u = 0. The subdifferential condition 0 ∈ ∂h(0) = λ[−1, 1] − xᵢ/α gives |xᵢ| ≤ αλ. Combining the three cases:
[prox_{αg}(x)]ᵢ = sign(xᵢ) · max(|xᵢ| − αλ, 0)
This is the soft-thresholding operator S_{αλ}. Its behavior:
If |xᵢ| > αλ: shrink toward zero by αλ (keep the sign, reduce the magnitude)
If |xᵢ| ≤ αλ: set to exactly zero
This is the sparsity mechanism of Lasso made explicit. The connection to Post 1's geometry: the ℓ1 ball has corners on the coordinate axes, and soft-thresholding is precisely the algebraic operation that pushes solutions onto those corners. Components too small to "survive" the threshold are zeroed out; components large enough survive but are shrunk.
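In code, soft-thresholding is a one-liner. A NumPy sketch (the name `soft_threshold` is mine):

```python
import numpy as np

def soft_threshold(x, tau):
    """Soft-thresholding S_tau: shrink each entry toward zero by tau,
    setting entries with |x_i| <= tau exactly to zero."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

x = np.array([3.0, -0.5, 1.2, -4.0, 0.1])
# Entries above the threshold 1.0 shrink by 1.0; the rest become zero:
print(soft_threshold(x, 1.0))  # values [2, 0, 0.2, -3, 0]
```

Note how the second and fifth entries are mapped to exactly zero, not merely made small: this is where the sparsity comes from.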
If g = δ_C is the indicator function of a convex set C (zero on C, +∞ off it), the proximal operator reduces to Euclidean projection onto C. Projection is thus a special case of the proximal operator, and projected gradient descent is ISTA with g = δ_C.
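For a box constraint this specializes to coordinatewise clipping, which makes the projection-as-prox identity concrete (an illustrative sketch; `prox_box` is my own name):

```python
import numpy as np

# For g = delta_C with C = [lo, hi]^d a box, prox_{alpha*g} is Euclidean
# projection onto C, i.e. coordinatewise clipping (alpha drops out because
# the indicator is 0 or +inf regardless of scaling).
def prox_box(x, lo, hi):
    return np.clip(x, lo, hi)

x = np.array([-2.0, 0.3, 5.0])
print(prox_box(x, 0.0, 1.0))  # projection onto [0,1]^3: [0, 0.3, 1]
```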
3. ISTA — The Proximal Gradient Algorithm
With the proximal operator in hand, the Iterative Shrinkage-Thresholding Algorithm (ISTA) is immediate:
xₖ₊₁ = prox_{αg}(xₖ − α∇f(xₖ))
The interpretation is clean: take a gradient step on the smooth part f, then apply the proximal operator for the non-smooth part g. Each half of the problem is handled by the right tool.
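The whole algorithm fits in a few lines. A minimal NumPy sketch for the Lasso (function names and the synthetic test problem are my own, not from the post):

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def ista(A, b, lam, n_iters=500):
    """ISTA for (1/2)||Ax - b||^2 + lam*||x||_1 (a minimal sketch)."""
    L = np.linalg.norm(A, 2) ** 2       # ||A||_2^2 is a Lipschitz constant of grad f
    alpha = 1.0 / L
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ x - b)                            # gradient step on f
        x = soft_threshold(x - alpha * grad, alpha * lam)   # prox step on g
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100))
x_true = np.zeros(100)
x_true[:5] = rng.standard_normal(5)
b = A @ x_true
x_hat = ista(A, b, lam=0.1)
print(np.count_nonzero(np.abs(x_hat) > 1e-8), "nonzero coordinates")
```

Because the prox step zeroes small coordinates exactly, many entries of the iterate are identically zero, not just numerically small.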
Proximal Descent Lemma
Lemma. For L-smooth f and any convex g, ISTA with α = 1/L satisfies:
F(xₖ₊₁) ≤ F(xₖ) − (L/2)∥xₖ₊₁ − xₖ∥²
More precisely, substituting y = xₖ₊₁ into the L-smoothness bound on f around xₖ and adding g(xₖ₊₁) gives:
F(xₖ₊₁) ≤ f(xₖ) + ⟨∇f(xₖ), xₖ₊₁ − xₖ⟩ + (L/2)∥xₖ₊₁ − xₖ∥² + g(xₖ₊₁)
Using the optimality condition for the proximal step (that xₖ₊₁ minimizes g(u) + (1/(2α))∥u − (xₖ − α∇f(xₖ))∥²), one shows:
F(xₖ₊₁) ≤ F(xₖ) − (1/(2L))∥∇_L F(xₖ)∥²
where ∇_L F(xₖ) = L(xₖ − xₖ₊₁) is the proximal gradient, the composite analogue of ∇F.
Convergence: O(1/k)
Theorem. ISTA with α=1/L on L-smooth f and convex g gives:
F(xₖ) − F* ≤ L∥x₀ − x*∥² / (2k)
The proof mirrors the smooth GD proof from Post 2 exactly, replacing ∇f with the proximal gradient ∇LF. The non-smoothness of g does not affect the rate because we never take a subgradient of g — we apply its proximal operator exactly.
This is the power of exploiting composite structure: we get the same O(1/k) rate as smooth GD, despite g being non-smooth.
4. FISTA — Nesterov Acceleration for Composite Problems
Nesterov's momentum trick from Post 3 applies directly to ISTA. The result is FISTA (Fast ISTA):
Initialize x₀ = y₁, t₁ = 1. For k = 1, 2, …:
xₖ = prox_{αg}(yₖ − α∇f(yₖ))
tₖ₊₁ = (1 + √(1 + 4tₖ²)) / 2
yₖ₊₁ = xₖ + ((tₖ − 1)/tₖ₊₁)(xₖ − xₖ₋₁)
The only change from ISTA: the gradient step is taken at the momentum point yₖ rather than xₖ, and yₖ₊₁ is an extrapolation of xₖ beyond the previous iterate.
The momentum coefficient (tₖ − 1)/tₖ₊₁ follows the specific schedule dictated by the sequence tₖ. For large k, tₖ ≈ k/2, so the coefficient approaches (k/2 − 1)/((k + 1)/2) → 1. The algorithm becomes increasingly aggressive about extrapolating as it progresses, the same behavior as AGD.
The tk Sequence
The recurrence tₖ₊₁ = (1 + √(1 + 4tₖ²))/2 with t₁ = 1 gives:
t₁ = 1,  t₂ = (1 + √5)/2 ≈ 1.618,  t₃ ≈ 2.194,  tₖ ≈ (k + 1)/2
The choice of this specific sequence is not arbitrary. The convergence proof requires tₖ₊₁² − tₖ₊₁ ≤ tₖ², which the recurrence satisfies (with equality) by construction. This algebraic condition is what makes the Lyapunov argument go through.
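Both properties are easy to confirm numerically (a small sanity sketch of my own):

```python
import math

# Generate the FISTA schedule t_{k+1} = (1 + sqrt(1 + 4*t_k^2)) / 2, t_1 = 1.
t = [1.0]                     # t[0] holds t_1
for _ in range(999):
    t.append((1.0 + math.sqrt(1.0 + 4.0 * t[-1] ** 2)) / 2.0)

print(round(t[1], 3))         # t_2 = (1 + sqrt(5))/2, about 1.618
# The recurrence satisfies t_{k+1}^2 - t_{k+1} = t_k^2 (up to float error):
print(max(abs(t[i + 1] ** 2 - t[i + 1] - t[i] ** 2) for i in range(999)) < 1e-6)
print(2.0 * t[999] / 1000.0)  # close to 1, consistent with t_k growing like (k+1)/2
```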
Convergence: O(1/k2)
Theorem (Beck & Teboulle 2009). FISTA with α=1/L satisfies:
F(xₖ) − F* ≤ 2L∥x₀ − x*∥² / (k + 1)²
This matches the Nesterov lower bound — FISTA is optimal for composite minimization. The proof uses the same Lyapunov energy function as AGD:
Eₖ = tₖ²(F(xₖ) − F*) + (L/2)∥vₖ − x*∥²
where vₖ = xₖ₋₁ + tₖ(xₖ − xₖ₋₁) is an auxiliary sequence. One shows Eₖ₊₁ ≤ Eₖ using the proximal descent lemma and the tₖ recurrence, then divides by tₖ² ≈ k²/4.
The rate hierarchy for composite problems now mirrors smooth optimization exactly: ISTA matches GD at O(1/k), and FISTA matches AGD at O(1/k²).
ISTA and FISTA both handle the non-smooth g exactly via the proximal step. The gap between them is purely the presence or absence of momentum, the same gap as between GD and AGD in the smooth setting.
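The change from the ISTA sketch above is two lines: the momentum sequence and the extrapolation. A minimal FISTA sketch (names and test problem are mine):

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def fista(A, b, lam, n_iters=500):
    """FISTA for (1/2)||Ax - b||^2 + lam*||x||_1 (a minimal sketch)."""
    L = np.linalg.norm(A, 2) ** 2             # Lipschitz constant of grad f
    alpha = 1.0 / L
    x = np.zeros(A.shape[1])
    y = x.copy()
    t = 1.0
    for _ in range(n_iters):
        # prox-gradient step taken at the momentum point y_k, not x_k
        x_new = soft_threshold(y - alpha * (A.T @ (A @ y - b)), alpha * lam)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)  # extrapolate past x_k
        x, t = x_new, t_new
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100))
x_true = np.zeros(100)
x_true[:5] = rng.standard_normal(5)
b = A @ x_true
x_hat = fista(A, b, lam=0.1)
obj = lambda z: 0.5 * np.sum((A @ z - b) ** 2) + 0.1 * np.sum(np.abs(z))
print(obj(x_hat) < obj(np.zeros(100)))  # objective decreased from the start
```

On problems like this, running both solvers for the same iteration budget typically shows FISTA reaching a given objective value in far fewer iterations, consistent with O(1/k²) versus O(1/k).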
5. When is the Proximal Operator Tractable?
The proximal framework is useful only when proxαg can be computed efficiently. Fortunately, the common regularizers in machine learning all admit closed-form proximal operators.
Group sparsity: g(x) = λ Σⱼ ∥x_{Gⱼ}∥₂ (where the Gⱼ partition the coordinates into groups):
[prox_{αg}(x)]_{Gⱼ} = (1 − αλ/∥x_{Gⱼ}∥₂)₊ · x_{Gⱼ}
This is block soft-thresholding: entire groups are zeroed out when ∥x_{Gⱼ}∥₂ ≤ αλ.
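Block soft-thresholding is a direct generalization of the scalar case (illustrative sketch; the group encoding is my own choice):

```python
import numpy as np

def block_soft_threshold(x, groups, tau):
    """Prox of lam * sum_j ||x_Gj||_2 at threshold tau = alpha*lam.
    `groups` is a list of index arrays partitioning the coordinates."""
    out = np.zeros_like(x)
    for G in groups:
        norm = np.linalg.norm(x[G])
        if norm > tau:
            out[G] = (1.0 - tau / norm) * x[G]  # shrink the whole group
        # else: the group stays identically zero
    return out

x = np.array([3.0, 4.0, 0.1, -0.1])
groups = [np.array([0, 1]), np.array([2, 3])]
# First group has norm 5 > 1, so it shrinks by factor 0.8;
# second group has norm ~0.14 <= 1, so it is zeroed out entirely.
print(block_soft_threshold(x, groups, 1.0))
```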
Nuclear norm: g(X) = λ∥X∥∗. Compute the SVD X = UΣVᵀ, soft-threshold the singular values, reassemble:
prox_{αg}(X) = U S_{αλ}(Σ) Vᵀ
Simplex projection: g = δ_{Δᵈ} where Δᵈ = {x ≥ 0 : Σᵢ xᵢ = 1}; the projection is solvable in O(d log d) by sorting.
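Singular value thresholding is a few lines on top of the SVD (a sketch; `prox_nuclear` is my own name):

```python
import numpy as np

def prox_nuclear(X, tau):
    """Prox of tau*||X||_*: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 5))
Y = prox_nuclear(X, tau=1.0)
# Thresholding can only kill singular values, so the rank can only drop:
print(np.linalg.matrix_rank(Y), "<=", np.linalg.matrix_rank(X))
```

Just as soft-thresholding zeroes coordinates, this operator zeroes small singular values, so it induces low rank rather than sparsity.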
Moreau decomposition. There is a general duality relationship:
prox_{αg}(x) + α · prox_{g*/α}(x/α) = x
where g* is the convex conjugate of g. This means that knowing prox_{αg} gives you the proximal operator of g* for free. For example, the conjugate of the ℓ1 norm is the indicator of the ℓ∞ unit ball, so Moreau turns soft-thresholding into projection onto that ball; symmetrically, projection onto the ℓ1 ball yields the proximal operator of the ℓ∞ norm (the dual norm of ℓ1) with no additional work.
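The decomposition can be verified numerically for g = ∥·∥₁ (a sketch of my own): the conjugate of the ℓ1 norm is the indicator of the ℓ∞ unit ball, whose prox at any scaling is projection, i.e. clipping.

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

# Moreau check for g = ||.||_1: prox_{alpha*g} is soft-thresholding, and
# prox_{g*/alpha} is projection onto the l_inf unit ball (clipping).
rng = np.random.default_rng(1)
x = rng.standard_normal(10)
alpha = 0.7
lhs = soft_threshold(x, alpha) + alpha * np.clip(x / alpha, -1.0, 1.0)
print(np.allclose(lhs, x))  # prints True: the two pieces reassemble x
```

Each coordinate either exceeds the threshold (soft-threshold keeps xᵢ − α·sign, projection contributes α·sign) or not (soft-threshold gives 0, projection contributes xᵢ), so the identity holds exactly.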
The Unified Picture
Starting from Post 1's geometric insight that the ℓ1 ball's corners produce sparsity, we now have the complete algorithmic story:
The sparsity-inducing geometry of ∥·∥₁ translates into the soft-thresholding proximal operator, which ISTA applies at each step to handle the non-smooth regularizer exactly. FISTA adds Nesterov momentum to recover the optimal O(1/k²) rate, identical to AGD on smooth functions, now extended to the composite setting.
The next post steps back to ask: what if we cannot use proximal operators at all? When the constraint set is complex (like the set of positive semidefinite matrices with bounded trace), both projections and proximal steps become expensive. The Frank-Wolfe algorithm offers an alternative that replaces these hard operations with a much simpler one.
Part of the Mathematics of Data series — mathematical notes on EE-556 at EPFL.