Mathematics of Data

Non-Smooth Optimization: Proximal Operators and FISTA

March 2025 · optimization · proximal · fista · lasso · non-smooth

Every algorithm we have seen so far assumes $f$ is differentiable. In Post 1, we saw that the $\ell_1$ norm's pointy geometry is precisely what induces sparsity in solutions. But that pointy geometry comes at a price: $\|x\|_1$ is not differentiable at any point where some $x_i = 0$, which is exactly the region of interest.

The Lasso objective is $F(x) = \frac{1}{2}\|Ax - b\|^2 + \lambda\|x\|_1$. The squared loss is smooth. The regularizer is not. We cannot simply compute $\nabla F$ and run gradient descent.

This post develops the right framework for this structure: the proximal operator, which handles the non-smooth part exactly, and FISTA, which recovers the optimal $O(1/k^2)$ convergence rate.


1. The Composite Problem

The problems we care about have the form:

$$\min_{x \in \mathbb{R}^d} F(x) = f(x) + g(x)$$

where:

  • $f$ is smooth: differentiable, $L$-smooth, and convex
  • $g$ is convex but non-smooth: known closed form, but not differentiable everywhere

The canonical example: $f(x) = \frac{1}{2}\|Ax-b\|^2$ and $g(x) = \lambda\|x\|_1$.

The naive approach is subgradient descent on $F$ directly. But this discards the structure entirely — it treats $F$ as a single black-box non-smooth function and pays the $O(1/\sqrt{k})$ rate that comes with non-smoothness (we will derive this lower bound in Post 8).

We can do much better by exploiting the split: $f$ is smooth, so we can take gradient steps on it; $g$ has known structure, so we can handle it exactly. The algorithm that does this is built around the proximal operator.


2. The Proximal Operator

Definition. For a convex function $g$ and step size $\alpha > 0$, the proximal operator is:

$$\operatorname{prox}_{\alpha g}(x) = \arg\min_{u \in \mathbb{R}^d} \left\{ g(u) + \frac{1}{2\alpha}\|u - x\|^2 \right\}$$

The minimization always has a unique solution (the objective is strongly convex in $u$), so $\operatorname{prox}_{\alpha g}$ is well-defined everywhere.

Interpretation. $\operatorname{prox}_{\alpha g}(x)$ finds the point $u$ that minimizes $g$ while not straying too far from $x$. The parameter $\alpha$ controls the trade-off: large $\alpha$ favors minimizing $g$; small $\alpha$ stays close to $x$. It is a soft projection — not a projection onto the minimum of $g$, but a compromise between minimizing $g$ and staying near the current point.

Soft Thresholding: The Proximal Operator of $\ell_1$

Let $g(x) = \lambda\|x\|_1$. The proximal operator decouples across coordinates, so we solve the scalar problem for each $i$:

$$[\operatorname{prox}_{\alpha g}(x)]_i = \arg\min_{u \in \mathbb{R}} \left\{ \lambda|u| + \frac{1}{2\alpha}(u - x_i)^2 \right\}$$

Call this $h(u) = \lambda|u| + \frac{1}{2\alpha}(u - x_i)^2$. We find the minimizer by cases.

Case 1: $u > 0$. Then $h(u) = \lambda u + \frac{1}{2\alpha}(u-x_i)^2$, which is smooth. Setting $h'(u) = 0$:

$$\lambda + \frac{u - x_i}{\alpha} = 0 \implies u = x_i - \alpha\lambda$$

This is valid (gives $u > 0$) iff $x_i > \alpha\lambda$.

Case 2: $u < 0$. Then $h(u) = -\lambda u + \frac{1}{2\alpha}(u-x_i)^2$. Setting $h'(u) = 0$:

$$-\lambda + \frac{u - x_i}{\alpha} = 0 \implies u = x_i + \alpha\lambda$$

Valid (gives $u < 0$) iff $x_i < -\alpha\lambda$.

Case 3: $u = 0$. The subdifferential condition $0 \in \partial h(0)$ gives:

$$0 \in \alpha\lambda[-1,1] + (0 - x_i) \implies x_i \in [-\alpha\lambda, \alpha\lambda]$$

Combining all three cases:

$$[\operatorname{prox}_{\alpha g}(x)]_i = \mathcal{S}_{\alpha\lambda}(x_i) := \operatorname{sign}(x_i)\max(|x_i| - \alpha\lambda,\ 0)$$

This is the soft-thresholding operator $\mathcal{S}_{\alpha\lambda}$. Its behavior:

  • If $|x_i| > \alpha\lambda$: shrink toward zero by $\alpha\lambda$ (keep the sign, reduce the magnitude)
  • If $|x_i| \leq \alpha\lambda$: set to exactly zero

This is the sparsity mechanism of Lasso made explicit. The connection to Post 1's geometry: the $\ell_1$ ball has corners on the coordinate axes, and soft-thresholding is precisely the algebraic operation that pushes solutions onto those corners. Components too small to "survive" the threshold are zeroed out; components large enough survive but are shrunk.
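The case analysis above collapses into two NumPy lines. A minimal sketch (the function name `soft_threshold` is my choice, not a library API):

```python
import numpy as np

def soft_threshold(x, tau):
    """S_tau(x): keep the sign, shrink the magnitude by tau, zero out small entries."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

x = np.array([3.0, -0.5, 1.2, -2.0, 0.1])
print(soft_threshold(x, 1.0))  # entries with |x_i| <= 1 become exactly 0
```

Note that the zeros are exact, not merely small — this is what produces genuinely sparse iterates.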

Other Proximal Operators

| $g(x)$ | $\operatorname{prox}_{\alpha g}(x)$ | Notes |
|---|---|---|
| $\lambda\Vert x\Vert_1$ | $\mathcal{S}_{\alpha\lambda}(x)$ (soft thresholding) | Component-wise |
| $\delta_C(x)$ (indicator of convex $C$) | $\Pi_C(x)$ (projection onto $C$) | Prox = projection |
| $\lambda\Vert x\Vert_2$ | $\max\bigl(1 - \frac{\alpha\lambda}{\Vert x\Vert_2},\, 0\bigr)\, x$ | Soft shrinkage toward $0$ |
| $\lambda\Vert X\Vert_*$ (nuclear norm) | $U\mathcal{S}_{\alpha\lambda}(\Sigma)V^T$ | Singular value soft-thresholding |
| $\frac{\lambda}{2}\Vert x\Vert^2$ | $\frac{x}{1+\alpha\lambda}$ | Scaling |

The indicator function case reveals that projection is a special case of the proximal operator — projected gradient descent is ISTA with $g = \delta_C$.


3. ISTA — The Proximal Gradient Algorithm

With the proximal operator in hand, the Iterative Shrinkage-Thresholding Algorithm (ISTA) is immediate:

$$x^{k+1} = \operatorname{prox}_{\alpha g}\!\left(x^k - \alpha \nabla f(x^k)\right)$$

The interpretation is clean: take a gradient step on the smooth part $f$, then apply the proximal operator for the non-smooth part $g$. Each half of the problem is handled by the right tool.
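For the Lasso, this means a least-squares gradient step followed by soft thresholding. A minimal sketch (helper names `soft_threshold` and `ista` are mine; the step size is $\alpha = 1/L$ with $L = \|A\|_2^2$):

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def ista(A, b, lam, n_iters=500):
    """Proximal gradient for F(x) = 0.5*||Ax - b||^2 + lam*||x||_1."""
    alpha = 1.0 / np.linalg.norm(A, 2) ** 2   # step 1/L, L = ||A||_2^2 (spectral norm squared)
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ x - b)              # gradient of the smooth part f
        x = soft_threshold(x - alpha * grad, alpha * lam)  # prox of lam*||.||_1
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
x_true = np.zeros(100)
x_true[:5] = 1.0
b = A @ x_true
x_hat = ista(A, b, lam=0.1)
print(np.count_nonzero(x_hat))  # most coordinates are exactly zero
```

Because each iteration ends with a soft-thresholding step, the iterates themselves are sparse — not just the limit.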

Proximal Descent Lemma

Lemma. For $L$-smooth $f$ and any convex $g$, ISTA with $\alpha = 1/L$ satisfies:

$$F(x^{k+1}) \leq F(x^k) - \frac{L}{2}\|x^{k+1} - x^k\|^2$$

The proof starts from the $L$-smoothness bound on $f$ between $x^k$ and $x^{k+1}$, with $g(x^{k+1})$ added to both sides:

$$F(x^{k+1}) \leq f(x^k) + \nabla f(x^k)^T(x^{k+1}-x^k) + \frac{L}{2}\|x^{k+1}-x^k\|^2 + g(x^{k+1})$$

Using the optimality condition for the proximal step — that $x^{k+1}$ minimizes $g(u) + \frac{1}{2\alpha}\|u - (x^k - \alpha\nabla f(x^k))\|^2$ — one then shows:

$$F(x^{k+1}) \leq F(x^k) - \frac{1}{2L}\|\nabla_L F(x^k)\|^2$$

where $\nabla_L F(x^k) = L(x^k - x^{k+1})$ is the proximal gradient — the composite analogue of $\nabla F$.

Convergence: $O(1/k)$

Theorem. ISTA with $\alpha = 1/L$ on $L$-smooth $f$ and convex $g$ gives:

$$F(x^k) - F^* \leq \frac{L\|x^0 - x^*\|^2}{2k}$$

The proof mirrors the smooth GD proof from Post 2 exactly, replacing $\nabla f$ with the proximal gradient $\nabla_L F$. The non-smoothness of $g$ does not affect the rate because we never take a subgradient of $g$ — we apply its proximal operator exactly.

This is the power of exploiting composite structure: we get the same $O(1/k)$ rate as smooth GD, despite $g$ being non-smooth.


4. FISTA — Nesterov Acceleration for Composite Problems

Nesterov's momentum trick from Post 3 applies directly to ISTA. The result is FISTA (Fast ISTA):

Initialize $x^0 = y^1$, $t_1 = 1$. For $k = 1, 2, \ldots$:

$$x^k = \operatorname{prox}_{\alpha g}\!\left(y^k - \alpha \nabla f(y^k)\right)$$

$$t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}$$

$$y^{k+1} = x^k + \frac{t_k - 1}{t_{k+1}}(x^k - x^{k-1})$$

The only change from ISTA: the gradient step is taken at the momentum point $y^k$ rather than $x^k$, and $y^{k+1}$ is an extrapolation of $x^k$ beyond the previous iterate.

The momentum coefficient $\frac{t_k - 1}{t_{k+1}}$ follows the specific schedule dictated by the sequence $t_k$. For large $k$, $t_k \approx k/2$, so the coefficient approaches $\frac{k/2 - 1}{(k+1)/2} \to 1$. The algorithm becomes increasingly aggressive about extrapolating as it progresses — the same behavior as AGD.
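The full loop is only a few lines longer than ISTA. A sketch on the Lasso objective, under the same assumptions as before ($\alpha = 1/L$, $L = \|A\|_2^2$; function names are mine):

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def fista(A, b, lam, n_iters=500):
    """FISTA for the Lasso: prox-gradient steps taken at the momentum point y^k."""
    alpha = 1.0 / np.linalg.norm(A, 2) ** 2
    x_prev = np.zeros(A.shape[1])
    y, t = x_prev.copy(), 1.0
    for _ in range(n_iters):
        x = soft_threshold(y - alpha * A.T @ (A @ y - b), alpha * lam)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)   # extrapolate past x^k
        x_prev, t = x, t_next
    return x_prev

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
b = A @ np.concatenate([np.ones(5), np.zeros(95)])
x_hat = fista(A, b, lam=0.1)
```

One caveat: unlike ISTA, FISTA is not monotone in $F$ (the extrapolation can overshoot), so practical implementations sometimes add a restart rule; the sketch above is the plain version.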

The $t_k$ Sequence

The recurrence $t_{k+1} = \frac{1+\sqrt{1+4t_k^2}}{2}$ with $t_1 = 1$ gives:

$$t_1 = 1, \quad t_2 = \frac{1+\sqrt{5}}{2} \approx 1.618, \quad t_3 \approx 2.194, \quad t_k \approx \frac{k+1}{2}$$

The choice of this specific sequence is not arbitrary. The convergence proof requires $t_{k+1}^2 - t_{k+1} \leq t_k^2$, which the recurrence satisfies (with equality) by construction. This algebraic condition is what makes the Lyapunov argument go through.
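Both the identity and the growth rate are easy to check numerically (a throwaway sketch):

```python
import numpy as np

# Generate t_1, ..., t_201 from the FISTA recurrence.
ts = [1.0]
for _ in range(200):
    t = ts[-1]
    ts.append((1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0)

# The recurrence satisfies t_{k+1}^2 - t_{k+1} = t_k^2 exactly...
for k in range(200):
    assert abs(ts[k + 1] ** 2 - ts[k + 1] - ts[k] ** 2) < 1e-6

# ...and t_k grows like (k+1)/2: t_2 is the golden ratio, t_200 is near 100.
print(ts[1], ts[199])
```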

Convergence: $O(1/k^2)$

Theorem (Beck & Teboulle 2009). FISTA with $\alpha = 1/L$ satisfies:

$$F(x^k) - F^* \leq \frac{2L\|x^0 - x^*\|^2}{(k+1)^2}$$

This matches the Nesterov lower bound — FISTA is optimal for composite minimization. The proof uses the same Lyapunov energy function as AGD:

$$E^k = t_k^2(F(x^k) - F^*) + \frac{L}{2}\|v^k - x^*\|^2$$

where $v^k = x^{k-1} + \frac{1}{t_{k-1}}(x^k - x^{k-1})$ is an auxiliary sequence. One shows $E^{k+1} \leq E^k$ using the proximal descent lemma and the $t_k$ recurrence, then divides by $t_k^2 \approx k^2/4$.

The rate hierarchy for composite problems now mirrors smooth optimization exactly:

| Algorithm | Rate | Optimal? |
|---|---|---|
| Subgradient descent | $O(1/\sqrt{k})$ | Yes (for non-smooth) |
| ISTA | $O(1/k)$ | No |
| FISTA | $O(1/k^2)$ | Yes (for composite) |

ISTA and FISTA both handle the non-smooth $g$ exactly via the proximal step. The gap between them is purely the presence or absence of momentum — the same gap as between GD and AGD in the smooth setting.


5. When is the Proximal Operator Tractable?

The proximal framework is useful only when $\operatorname{prox}_{\alpha g}$ can be computed efficiently. Fortunately, the common regularizers in machine learning all admit closed-form proximal operators.

Group sparsity: $g(x) = \lambda \sum_j \|x_{G_j}\|_2$ (where the $G_j$ partition the coordinates into groups):

$$[\operatorname{prox}_{\alpha g}(x)]_{G_j} = \left(1 - \frac{\alpha\lambda}{\|x_{G_j}\|_2}\right)_+ x_{G_j}$$

This is block soft-thresholding: entire groups are zeroed out when $\|x_{G_j}\|_2 \leq \alpha\lambda$.
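A sketch of the block rule (the helper name `prox_group_lasso` and the encoding of groups as index arrays are my choices):

```python
import numpy as np

def prox_group_lasso(x, groups, tau):
    """Block soft-thresholding: shrink each group's norm by tau; zero small groups."""
    out = np.zeros_like(x)
    for g in groups:
        norm = np.linalg.norm(x[g])
        if norm > tau:                        # otherwise the whole group stays 0
            out[g] = (1.0 - tau / norm) * x[g]
    return out

x = np.array([3.0, 4.0, 0.1, -0.1])
groups = [np.array([0, 1]), np.array([2, 3])]
print(prox_group_lasso(x, groups, 1.0))
# group (3, 4) has norm 5, rescaled to norm 4; group (0.1, -0.1) is zeroed
```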

Nuclear norm: $g(X) = \lambda\|X\|_*$. Compute the SVD $X = U\Sigma V^T$, soft-threshold the singular values, reassemble:

$$\operatorname{prox}_{\alpha g}(X) = U \mathcal{S}_{\alpha\lambda}(\Sigma) V^T$$
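Singular value thresholding in code (a sketch; `prox_nuclear` is my name for it):

```python
import numpy as np

def prox_nuclear(X, tau):
    """Prox of tau*||X||_*: soft-threshold the singular values of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

X = np.diag([3.0, 1.0, 0.2])
Y = prox_nuclear(X, 0.5)
print(np.round(Y, 6))  # singular values 3, 1, 0.2 shrink to 2.5, 0.5, 0: rank drops
```

Zeroing small singular values is what makes this the matrix analogue of sparsity: the prox step reduces rank the way soft thresholding reduces support.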

Simplex projection: $g = \delta_{\Delta_d}$ where $\Delta_d = \{x \geq 0 : \sum_i x_i = 1\}$: solvable in $O(d \log d)$ by sorting.
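A sketch of the classic sort-and-threshold scheme (the function name is mine):

```python
import numpy as np

def project_simplex(x):
    """Euclidean projection onto {u >= 0 : sum(u) = 1}, O(d log d) via sorting."""
    u = np.sort(x)[::-1]                            # sort descending
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, x.size + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1] + 1   # last index where the test holds
    theta = css[rho - 1] / rho                      # optimal shift
    return np.maximum(x - theta, 0.0)

p = project_simplex(np.array([0.5, 1.2, -0.3]))
print(p, p.sum())  # nonnegative components summing to 1
```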

Moreau decomposition. There is a general duality relationship:

$$\operatorname{prox}_{\alpha g}(x) + \alpha \operatorname{prox}_{g^*/\alpha}(x/\alpha) = x$$

where $g^*$ is the convex conjugate of $g$. This means: if you know $\operatorname{prox}_{\alpha g}$, you get $\operatorname{prox}_{g^*}$ for free. For example, since we know the proximal operator of $\|\cdot\|_1$, Moreau gives us the proximal operator of $\|\cdot\|_\infty$ (its dual norm) without any additional work.
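A quick numerical check of the identity with $\alpha = 1$ and $g = \|\cdot\|_1$: the conjugate $g^*$ is the indicator of the unit $\ell_\infty$ ball, whose prox is clipping to $[-1, 1]$, so the two proxes must split $x$ exactly.

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

# Moreau with alpha = 1: prox_g(x) + prox_{g*}(x) = x, where for g = ||.||_1
# the prox of g* is projection onto the l_inf unit ball, i.e. np.clip.
x = np.array([2.5, -0.4, 1.0, -3.0])
recon = soft_threshold(x, 1.0) + np.clip(x, -1.0, 1.0)
print(np.allclose(recon, x))  # True
```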


The Unified Picture

Starting from Post 1's geometric insight that the $\ell_1$ ball's corners produce sparsity, we now have the complete algorithmic story:

The sparsity-inducing geometry of $\|\cdot\|_1$ translates into the soft-thresholding proximal operator, which ISTA applies at each step to handle the non-smooth regularizer exactly. FISTA adds Nesterov momentum to recover the optimal $O(1/k^2)$ rate — identical to AGD on smooth functions, now extended to the composite setting.

The next post steps back to ask: what if we cannot use proximal operators at all? When the constraint set is complex (like the set of positive semidefinite matrices with bounded trace), both projections and proximal steps become expensive. The Frank-Wolfe algorithm offers an alternative that replaces these hard operations with a much simpler one.


Part of the Mathematics of Data series — mathematical notes on EE-556 at EPFL.