
Mathematics of Data

Subgradients and the Lasso: When Functions Have Kinks

March 2025 · optimization · subgradients · lasso · non-smooth · convexity

Consider $f(x) = |x|$ at $x = 0$. The left derivative is $-1$; the right derivative is $+1$. The gradient does not exist. Yet $f$ is convex, has a unique minimum at $x = 0$, and is perfectly well-behaved in every other sense. Gradient descent cannot proceed — but clearly optimization should be possible.

The subgradient resolves this by generalizing the gradient from a single vector to a set. This post develops the subdifferential calculus, uses it to derive the KKT conditions for the Lasso, and proves that the $O(1/\sqrt{k})$ convergence rate of subgradient descent is not a deficiency of the algorithm but a fundamental lower bound.

This post pairs with Post 6: there we handled non-smooth regularizers by applying their proximal operators exactly, bypassing subgradients. Here we ask what happens without that structure — and see the full cost of non-smoothness.


1. Why the Gradient Fails at Kinks

The gradient at $x$ is the unique vector $g$ satisfying:

$$\lim_{y \to x} \frac{f(y) - f(x) - g^T(y-x)}{\|y-x\|} = 0$$

At $x = 0$ for $f(x) = |x|$: the limit from the right gives $g = 1$, from the left gives $g = -1$. No single $g$ satisfies both simultaneously. The gradient does not exist.

However, the directional derivative always exists for convex functions. Define:

$$f'(x; d) = \lim_{t \to 0^+} \frac{f(x + td) - f(x)}{t}$$

For $f(x) = |x|$ at $x = 0$: $f'(0; d) = |d|$ for all $d$. The function has well-defined directional behavior — it just isn't captured by a single vector.

The directional derivative is positively homogeneous ($f'(x; td) = t f'(x; d)$ for $t > 0$) and subadditive ($f'(x; d_1 + d_2) \leq f'(x; d_1) + f'(x; d_2)$). By the Hahn-Banach theorem, there exists a linear functional (a vector $g$) that minorizes the directional derivative: $g^T d \leq f'(x; d)$ for all $d$. These vectors are the subgradients.
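A quick finite-difference check that $f'(0; d) = |d|$ for $f(x) = |x|$ — a minimal sketch; `dir_deriv` is an ad-hoc helper, not a standard routine:

```python
def dir_deriv(f, x, d, t=1e-8):
    """One-sided finite-difference estimate of the directional derivative f'(x; d)."""
    return (f(x + t * d) - f(x)) / t

# At the kink x = 0 of f(x) = |x|, the directional derivative equals |d|
# in every direction d, even though no gradient exists there.
for d in [2.0, -3.0, 0.5]:
    print(d, dir_deriv(abs, 0.0, d))  # matches |d|
```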


2. The Subdifferential

Definition. A vector $g \in \mathbb{R}^d$ is a subgradient of $f$ at $x$ if:

$$f(y) \geq f(x) + g^T(y - x) \qquad \text{for all } y \in \mathbb{R}^d$$

The subdifferential $\partial f(x)$ is the set of all subgradients at $x$.

This is identical to the first-order convexity condition from Post 1 — except now we allow a set of valid linearizations instead of requiring a unique one. When $f$ is differentiable at $x$, the subdifferential is a singleton that coincides with the gradient: $\partial f(x) = \{\nabla f(x)\}$.

Computing $\partial|x|$

For $x > 0$: $f$ is differentiable, $\partial f(x) = \{1\}$.

For $x < 0$: $f$ is differentiable, $\partial f(x) = \{-1\}$.

For $x = 0$: we need all $g$ such that $|y| \geq gy$ for all $y \in \mathbb{R}$. This requires $g \leq 1$ (from $y > 0$) and $g \geq -1$ (from $y < 0$). So $\partial f(0) = [-1, 1]$.

Visually: at a smooth point, the subdifferential is the single tangent-line slope. At the kink at $x = 0$, it is the entire interval of slopes of lines that pass through $(0,0)$ and lie below the graph of $|x|$.

        |x|
         |    /
         |   /
         |  /  ← slopes in ∂f(0) = [-1,1]
         | /
---------0---------
        /|
       / |

Computing $\partial\|x\|_1$

Since $\|x\|_1 = \sum_i |x_i|$, the subdifferential is the Cartesian product of coordinate subdifferentials:

$$\partial\|x\|_1 = \partial|x_1| \times \cdots \times \partial|x_d|$$

So $g \in \partial\|x\|_1$ iff for each coordinate $i$:

$$g_i = \begin{cases} \operatorname{sign}(x_i) & \text{if } x_i \neq 0 \\ \text{any value in } [-1, 1] & \text{if } x_i = 0 \end{cases}$$

At a sparse vector (with many $x_i = 0$), the subdifferential is a high-dimensional hypercube in the zero-coordinate directions — a whole family of valid linearizations.
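The coordinate formula can be sanity-checked numerically: build a $g$ with arbitrary values in $[-1,1]$ on the zero coordinates and verify the defining inequality at random points. A small sketch, not from the post:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.5, 0.0, -2.0, 0.0])  # sparse point: two zero coordinates

# A subgradient of ||.||_1 at x: sign(x_i) on the support, anything in [-1,1] elsewhere
g = np.where(x != 0, np.sign(x), rng.uniform(-1.0, 1.0, size=x.shape))

# Check the subgradient inequality ||y||_1 >= ||x||_1 + g^T (y - x) at random points y
ok = all(
    np.abs(y).sum() >= np.abs(x).sum() + g @ (y - x) - 1e-12
    for y in rng.normal(size=(1000, 4))
)
print(ok)
```

Any choice of the free coordinates in $[-1,1]$ passes the check, which is exactly the "hypercube of linearizations" picture.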

Subdifferential Calculus

The standard calculus rules extend naturally:

Sum rule: $\partial(f + h)(x) = \partial f(x) + \partial h(x)$ (Minkowski sum), when $f$ or $h$ is continuous.

Chain rule: If $F(x) = f(Ax + b)$, then $\partial F(x) = A^T \partial f(Ax + b)$.

Optimality: $x^*$ minimizes $f$ iff $0 \in \partial f(x^*)$.

This last condition is the master optimality criterion for non-smooth convex optimization.
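The sum and chain rules can be exercised on $F(x) = \|Ax - b\|_1$: the chain rule says $A^T s$ is a subgradient of $F$ at $x$ whenever $s$ is a subgradient of $\|\cdot\|_1$ at $Ax - b$, and the subgradient inequality confirms it numerically. A sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 3))
b = rng.normal(size=5)

F = lambda x: np.abs(A @ x - b).sum()  # F(x) = ||Ax - b||_1

x = rng.normal(size=3)
# Chain rule: s = sign(Ax - b) is a subgradient of ||.||_1 at Ax - b
# (sign(0) = 0 lies in [-1, 1]), so A^T s is a subgradient of F at x.
g = A.T @ np.sign(A @ x - b)

# The subgradient inequality F(y) >= F(x) + g^T (y - x) must hold everywhere
ok = all(F(y) >= F(x) + g @ (y - x) - 1e-10 for y in rng.normal(size=(500, 3)))
print(ok)
```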


3. Lasso: KKT Conditions via Subdifferentials

The Lasso problem:

$$\min_{x \in \mathbb{R}^d} F(x) = \frac{1}{2}\|Ax - b\|^2 + \lambda\|x\|_1$$

Optimality: $0 \in \partial F(x^*)$.

By the sum rule: $\partial F(x^*) = \nabla\!\left(\tfrac{1}{2}\|Ax^* - b\|^2\right) + \lambda\,\partial\|x^*\|_1 = A^T(Ax^* - b) + \lambda\,\partial\|x^*\|_1$.

So $x^*$ is optimal iff:

$$0 \in A^T(Ax^* - b) + \lambda\,\partial\|x^*\|_1$$

Writing this coordinate-by-coordinate (the KKT conditions for Lasso):

$$[A^T(Ax^* - b)]_i = \begin{cases} -\lambda\operatorname{sign}(x^*_i) & \text{if } x^*_i \neq 0 \\ \text{something in } [-\lambda, \lambda] & \text{if } x^*_i = 0 \end{cases}$$

The second condition tells us exactly when coordinate $i$ is zero at the optimum: the $i$-th residual correlation satisfies $|[A^T(Ax^* - b)]_i| \leq \lambda$. When the residual has small correlation with feature $i$, that feature is excluded from the model.

Connection to soft thresholding. The proximal step from Post 6, $x^{k+1} = \operatorname{prox}_{\alpha g}(x^k - \alpha A^T(Ax^k - b))$, solves this optimality condition coordinate by coordinate. Soft thresholding zeroes out coordinate $i$ whenever $|x_i^k - \alpha [A^T(Ax^k - b)]_i| \leq \alpha\lambda$; at a fixed point with $x_i = 0$, this is exactly the KKT condition $|[A^T(Ax - b)]_i| \leq \lambda$. ISTA is the algorithm that enforces the subdifferential optimality conditions iteratively.

This completes the circle from Post 1: the $\ell_1$ ball's geometry (corners on axes) → the subdifferential of $\|\cdot\|_1$ is large at zero → the KKT conditions force small-correlation features to exactly zero → sparse solutions.
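The KKT conditions can be verified numerically on a random instance, using ISTA (the soft-thresholding iteration from Post 6) as the solver — a sketch with arbitrarily chosen problem sizes and loose tolerances:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 30, 10, 1.0
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

# ISTA: gradient step on the smooth part, then soft thresholding
# (the prox of alpha*lam*||.||_1), with step 1/L where L = ||A||_2^2.
alpha = 1.0 / np.linalg.norm(A, 2) ** 2
x = np.zeros(d)
for _ in range(5000):
    z = x - alpha * A.T @ (A @ x - b)
    x = np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)

corr = A.T @ (A @ x - b)
support = np.abs(x) > 1e-8
print(np.allclose(corr[support], -lam * np.sign(x[support]), atol=1e-4))  # active coords
print(np.all(np.abs(corr[~support]) <= lam + 1e-4))                       # excluded coords
```

On the support the residual correlation equals $-\lambda\operatorname{sign}(x_i)$; off the support it stays inside $[-\lambda, \lambda]$ — exactly the two KKT cases.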


4. Subgradient Descent

Given $g^k \in \partial f(x^k)$, the subgradient descent update is:

$$x^{k+1} = x^k - \alpha_k g^k$$

This looks identical to gradient descent, but there is a critical difference.

Subgradient descent is not a descent method. A gradient step on a smooth function always decreases $f$ (for small enough $\alpha$). A subgradient step can increase $f$: the negative subgradient does decrease the distance to the minimizer for a small enough step, but the resulting move can still go uphill in function value.

Because of this, we track the best iterate:

$$f_{\text{best}}^k = \min_{0 \leq j \leq k} f(x^j)$$
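The non-monotonicity is easy to see on $f(x) = |x|$ itself: a fixed-step iteration overshoots the kink and the function value goes back up, while the best-iterate value is monotone by construction. A toy sketch:

```python
# Fixed-step subgradient descent on f(x) = |x|, starting right of the kink.
def subgrad_abs(x):
    return 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)  # any value in [-1,1] works at 0

x, alpha = 1.0, 0.3
vals = []
for _ in range(8):
    vals.append(abs(x))
    x -= alpha * subgrad_abs(x)

f_best = [min(vals[: j + 1]) for j in range(len(vals))]
print([round(v, 2) for v in vals])    # f goes back up once the iterate overshoots 0
print([round(v, 2) for v in f_best])  # the best-iterate value never increases
```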

Convergence Analysis

Theorem. For convex $f$ with $\|g\| \leq G$ for all $g \in \partial f(x)$ and all $x$, subgradient descent satisfies:

$$f_{\text{best}}^k - f^* \leq \frac{\|x^0 - x^*\|^2 + G^2 \sum_{j=0}^{k-1} \alpha_j^2}{2\sum_{j=0}^{k-1} \alpha_j}$$

Proof. Track $\|x^{k+1} - x^*\|^2$:

$$\|x^{k+1} - x^*\|^2 = \|x^k - \alpha_k g^k - x^*\|^2 = \|x^k - x^*\|^2 - 2\alpha_k (g^k)^T(x^k - x^*) + \alpha_k^2\|g^k\|^2$$

From the subgradient definition: $(g^k)^T(x^k - x^*) \geq f(x^k) - f(x^*) = f(x^k) - f^*$.

So: $\|x^{k+1} - x^*\|^2 \leq \|x^k - x^*\|^2 - 2\alpha_k(f(x^k) - f^*) + \alpha_k^2 G^2$.

Rearranging: $2\alpha_k(f(x^k) - f^*) \leq \|x^k - x^*\|^2 - \|x^{k+1} - x^*\|^2 + \alpha_k^2 G^2$.

Sum from $j = 0$ to $k - 1$ and telescope:

$$2\sum_{j=0}^{k-1}\alpha_j(f(x^j) - f^*) \leq \|x^0 - x^*\|^2 + G^2\sum_{j=0}^{k-1}\alpha_j^2$$

Since $f_{\text{best}}^k \leq f(x^j)$ for all $j$:

$$2\left(\sum_{j=0}^{k-1}\alpha_j\right)(f_{\text{best}}^k - f^*) \leq \|x^0 - x^*\|^2 + G^2\sum_{j=0}^{k-1}\alpha_j^2$$

Dividing gives the result. $\square$

With $\alpha_j = R/(G\sqrt{j})$ where $R = \|x^0 - x^*\|$:

$$\sum_{j=1}^k \alpha_j \approx \frac{2R}{G}\sqrt{k}, \qquad \sum_{j=1}^k \alpha_j^2 \approx \frac{R^2}{G^2}\log k$$

So:

$$f_{\text{best}}^k - f^* \leq \frac{R^2 + R^2 \log k}{2 \cdot \frac{2R}{G}\sqrt{k}} = \frac{RG(1 + \log k)}{4\sqrt{k}} = O\!\left(\frac{\log k}{\sqrt{k}}\right)$$

which is $O(1/\sqrt{k})$ up to a logarithmic factor. With the optimal fixed-budget step $\alpha = \frac{R}{G\sqrt{k}}$ (which requires knowing $k$ in advance), the bound is exactly $\frac{RG}{\sqrt{k}}$ — no log factor.
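Both schedules can be tried on the simplest instance, $f(x) = |x|$ with $x^0 = 1$ (so $R = G = 1$ and $f^* = 0$); the best iterate easily beats the $RG/\sqrt{k}$ bound here. A toy sketch; `run` is an ad-hoc helper:

```python
import math

# Subgradient descent on f(x) = |x| (G = 1), from x0 = 1 (R = 1); f* = 0.
def run(steps):
    x, best = 1.0, float("inf")
    for a in steps:
        best = min(best, abs(x))
        g = 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)  # a subgradient of |x|
        x -= a * g
    return best

k = 100
decaying = [1.0 / math.sqrt(j) for j in range(1, k + 1)]  # alpha_j = R/(G sqrt(j))
fixed = [1.0 / math.sqrt(k)] * k                          # alpha = R/(G sqrt(k)), fixed budget
print(run(decaying), run(fixed), "bound:", 1.0 / math.sqrt(k))
```

The bounds are worst-case over the whole function class; on any single easy instance the actual gap is typically far smaller.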


5. The Lower Bound: $O(1/\sqrt{k})$ is Optimal

The $O(1/\sqrt{k})$ rate is not a shortcoming of subgradient descent — it is provably the best any first-order method can achieve on non-smooth problems.

Theorem (Lower Bound). For any first-order method and any $k \geq 1$, there exists a $G$-Lipschitz convex function $f$ such that:

$$f(x^k) - f^* \geq \frac{G\|x^0 - x^*\|}{2\sqrt{k+1}}$$

Proof sketch. The hard instance is built from $f(x) = G \cdot \max_{1 \leq i \leq k+1} x_i$ (or, equivalently, a shifted $G\|x\|_\infty$-type function). Any first-order method started from $x^0 = 0$ produces iterates in the span of the subgradients $\{g^0, \ldots, g^{k-1}\}$, and each subgradient of the max is supported on a single maximizing coordinate. After $k$ steps, at most $k$ of the $k+1$ coordinates have been touched, and the untouched coordinate keeps the function value at least $\frac{G\|x^0 - x^*\|}{2\sqrt{k+1}}$ above the minimum. $\square$
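A toy illustration of the span argument (not Nesterov's exact construction): on $f(x) = \max_i x_i$, each subgradient is supported on one maximizing coordinate, so $k$ steps from $x^0 = 0$ touch at most $k$ of the $k+1$ coordinates:

```python
import numpy as np

# f(x) = max_i x_i over d = k+1 coordinates; a subgradient at x is e_i
# for any maximizing coordinate i (here: the first argmax).
k = 5
d = k + 1
x = np.zeros(d)
touched = set()
for _ in range(k):
    i = int(np.argmax(x))   # a maximizing coordinate
    x[i] -= 0.1             # subgradient step along -e_i
    touched.add(i)

# At most k coordinates can have been touched; at least one is always left
# untouched, which keeps the function value bounded away from the minimum.
print(sorted(touched), "untouched:", d - len(touched))
```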

Subgradient descent matches this lower bound: it is optimal for the class of $G$-Lipschitz convex functions.


The Full Convergence Hierarchy

We now have the complete map of first-order convergence rates:

| Class | Best Algorithm | Rate | Lower Bound |
|---|---|---|---|
| $L$-smooth, convex | AGD | $O(1/k^2)$ | $\Omega(1/k^2)$ ✓ |
| $L$-smooth, $\mu$-strongly convex | AGD | $O((1 - 1/\sqrt{\kappa})^k)$ | Optimal ✓ |
| $L$-smooth $+$ non-smooth $g$ | FISTA | $O(1/k^2)$ | $\Omega(1/k^2)$ ✓ |
| $G$-Lipschitz, convex | Subgradient | $O(1/\sqrt{k})$ | $\Omega(1/\sqrt{k})$ ✓ |

Smoothness is worth a factor of $k^{3/2}$ in convergence rate: AGD gives $O(1/k^2)$ versus $O(1/\sqrt{k})$ for non-smooth problems. When a function has composite structure (smooth $+$ non-smooth), exploiting that structure via FISTA recovers the smooth rate entirely.

The message is architectural: structure is speed. Gradient descent, subgradient descent, ISTA, FISTA, Frank-Wolfe — these are not just different algorithms for the same problem. They are the right tools for different problem structures, each achieving a rate that matches the information-theoretic lower bound for its class.

The next post steps into deep learning theory, where these optimization foundations meet a genuinely mysterious phenomenon: models with more parameters than data that somehow generalize better as they get bigger.


Part of the Mathematics of Data series — mathematical notes on EE-556 at EPFL.