Subgradients and the Lasso: When Functions Have Kinks
March 2025 · optimization · subgradients · lasso · non-smooth · convexity
Consider f(x)=∣x∣ at x=0. The left derivative is −1; the right derivative is +1. The gradient does not exist. Yet f is convex, has a unique minimum at x=0, and is perfectly well-behaved in every other sense. Gradient descent cannot proceed — but clearly optimization should be possible.
The subgradient resolves this by generalizing the gradient from a single vector to a set. This post develops the subdifferential calculus, uses it to derive the KKT conditions for Lasso, and proves that the O(1/√k) convergence rate of subgradient descent is not a deficiency of the algorithm but a fundamental lower bound.
This post pairs with Post 6: there we handled non-smooth regularizers by applying their proximal operators exactly, bypassing subgradients. Here we ask what happens without that structure — and see the full cost of non-smoothness.
1. Why the Gradient Fails at Kinks
The gradient at x is the unique vector g satisfying:
$$\lim_{y \to x} \frac{f(y) - f(x) - g^T(y - x)}{\|y - x\|} = 0$$
At x=0 for f(x)=∣x∣: the limit from the right gives g=1, from the left gives g=−1. No single g satisfies both simultaneously. The gradient does not exist.
However, the directional derivative always exists for convex functions. Define:
$$f'(x; d) = \lim_{t \to 0^+} \frac{f(x + td) - f(x)}{t}$$
For f(x)=∣x∣ at x=0: f′(0;d)=∣d∣ for all d. The function has well-defined directional behavior — it just isn't captured by a single vector.
The directional derivative is positively homogeneous (f′(x;td)=tf′(x;d) for t>0) and subadditive (f′(x;d1+d2)≤f′(x;d1)+f′(x;d2)). By the Hahn-Banach theorem, there exists a linear functional (a vector g) that minorizes the directional derivative: gTd≤f′(x;d) for all d. These vectors are the subgradients.
2. The Subdifferential
Definition. A vector g∈Rd is a subgradient of f at x if:
$$f(y) \ge f(x) + g^T(y - x) \quad \text{for all } y \in \mathbb{R}^d$$
The subdifferential ∂f(x) is the set of all subgradients at x.
This is identical to the first-order convexity condition from Post 1 — except now we allow a set of valid linearizations instead of requiring a unique one. When f is differentiable at x, then ∂f(x)={∇f(x)} — the subdifferential is a singleton and coincides with the gradient.
Computing ∂∣x∣
For x>0: f is differentiable, ∂f(x)={1}.
For x<0: f is differentiable, ∂f(x)={−1}.
For x=0: We need all g such that ∣y∣≥gy for all y∈R. This requires g≤1 (from y>0) and g≥−1 (from y<0). So ∂f(0)=[−1,1].
Visually: at a smooth point, the subdifferential is the single tangent line slope. At the kink at x=0, it is the entire interval of slopes of lines that pass through (0,0) and lie below the graph of ∣x∣.
Since ∥x∥1=∑i∣xi∣, the subdifferential is the Cartesian product of coordinate subdifferentials:
∂∥x∥1=∂∣x1∣×⋯×∂∣xd∣
So g∈∂∥x∥1 iff for each coordinate i:
$$g_i = \begin{cases} \operatorname{sign}(x_i) & \text{if } x_i \neq 0 \\ \text{any value in } [-1, 1] & \text{if } x_i = 0 \end{cases}$$
At a sparse vector (with many xi=0), the subdifferential is a high-dimensional hypercube in the zero-coordinate directions — a whole family of valid linearizations.
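These coordinate-wise conditions are mechanical to check. A small sketch in plain Python (the function name is mine, not from any library) tests whether a candidate vector g belongs to ∂∥x∥₁:

```python
def in_l1_subdifferential(g, x):
    # g is in the subdifferential of ||.||_1 at x iff, per coordinate:
    #   g_i = sign(x_i)   when x_i != 0
    #   g_i in [-1, 1]    when x_i == 0
    for gi, xi in zip(g, x):
        if xi != 0:
            if gi != (1.0 if xi > 0 else -1.0):
                return False
        elif not -1.0 <= gi <= 1.0:
            return False
    return True
```

At x = (2, 0, −5), for instance, (1, 0.3, −1) is a valid subgradient but (1, 1.5, −1) is not: the middle coordinate is free to roam only within [−1, 1].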
Subdifferential Calculus
The standard calculus rules extend naturally:
Sum rule: ∂(f+h)(x) = ∂f(x) + ∂h(x) (Minkowski sum), when f or h is continuous.
Chain rule: If F(x)=f(Ax+b), then ∂F(x)=AT∂f(Ax+b).
Optimality: x∗ minimizes f iff 0 ∈ ∂f(x∗).
This last condition is the master optimality criterion for non-smooth convex optimization.
3. Lasso: KKT Conditions via Subdifferentials
The Lasso problem:
$$\min_{x \in \mathbb{R}^d} F(x) = \tfrac{1}{2}\|Ax - b\|^2 + \lambda \|x\|_1$$
Optimality: 0∈∂F(x∗).
By the sum rule: ∂F(x∗) = ∇(½∥Ax∗−b∥²) + λ∂∥x∗∥₁ = Aᵀ(Ax∗−b) + λ∂∥x∗∥₁.
So x∗ is optimal iff:
$$0 \in A^T(Ax^* - b) + \lambda\, \partial \|x^*\|_1$$
Writing this coordinate-by-coordinate (the KKT conditions for Lasso):
$$[A^T(Ax^* - b)]_i = \begin{cases} -\lambda\, \operatorname{sign}(x_i^*) & \text{if } x_i^* \neq 0 \\ \text{some value in } [-\lambda, \lambda] & \text{if } x_i^* = 0 \end{cases}$$
The second condition tells us exactly when coordinate i is zero at the optimum: precisely when the residual correlation satisfies ∣[Aᵀ(Ax∗−b)]i∣ ≤ λ. When the residual has small correlation with feature i, that feature is excluded from the model.
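These conditions can be verified numerically for a candidate solution. The sketch below (plain Python, hypothetical helper name; A given as a list of rows) checks both KKT cases coordinate-by-coordinate. For an orthogonal design A = I with b = (3, 0.5) and λ = 1, the Lasso solution is the soft-threshold of b, namely (2, 0), and the check confirms it:

```python
def lasso_kkt_holds(A, b, x, lam, tol=1e-9):
    # Check the Lasso KKT conditions coordinate-by-coordinate:
    #   [A^T(Ax - b)]_i = -lam * sign(x_i)    if x_i != 0
    #   |[A^T(Ax - b)]_i| <= lam              if x_i == 0
    m, d = len(A), len(A[0])
    r = [sum(A[i][j] * x[j] for j in range(d)) - b[i] for i in range(m)]
    c = [sum(A[i][j] * r[i] for i in range(m)) for j in range(d)]
    for ci, xi in zip(c, x):
        if xi != 0 and abs(ci + lam * (1.0 if xi > 0 else -1.0)) > tol:
            return False
        if xi == 0 and abs(ci) > lam + tol:
            return False
    return True
```

The unshrunken least-squares solution (3, 0.5) fails the check: its first coordinate violates the sign condition, confirming it is not the Lasso optimum.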
Connection to soft thresholding. The proximal step from Post 6, xk+1 = proxαg(xk − αAᵀ(Axk−b)), solves exactly this optimality condition one coordinate update at a time. The soft-thresholding operation zeroes out coordinate i whenever ∣xk,i − α[Aᵀ(Axk−b)]i∣ ≤ αλ; at a fixed point with xi∗ = 0 this reduces to ∣[Aᵀ(Ax∗−b)]i∣ ≤ λ — exactly the KKT condition. ISTA is the algorithm that enforces the subdifferential optimality conditions iteratively.
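A minimal ISTA sketch in plain Python (function names are mine; A is a small dense matrix as a list of rows, and the step α is assumed to satisfy α ≤ 1/L for L the largest eigenvalue of AᵀA):

```python
def soft_threshold(v, t):
    # Coordinate-wise prox of t * ||.||_1: shrink toward 0 by t, clip at 0.
    out = []
    for vi in v:
        s = 1.0 if vi > 0 else (-1.0 if vi < 0 else 0.0)
        out.append(s * max(abs(vi) - t, 0.0))
    return out

def ista(A, b, lam, alpha, iters=200):
    # Iterate x_{k+1} = soft_threshold(x_k - alpha * A^T(A x_k - b), alpha * lam).
    m, d = len(A), len(A[0])
    x = [0.0] * d
    for _ in range(iters):
        r = [sum(A[i][j] * x[j] for j in range(d)) - b[i] for i in range(m)]
        grad = [sum(A[i][j] * r[i] for i in range(m)) for j in range(d)]
        x = soft_threshold([x[j] - alpha * grad[j] for j in range(d)], alpha * lam)
    return x
```

With A = I (so L = 1 and α = 1 is valid), b = (3, 0.5), λ = 1, the iterates reach the soft-thresholded solution (2, 0) and stay there: a fixed point of the prox step is exactly a point satisfying the KKT conditions.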
This completes the circle from Post 1: the ℓ1 ball's geometry (corners on axes) → the subdifferential of ∥⋅∥1 is large at zero → the KKT conditions force small-correlation features to exactly zero → sparse solutions.
4. Subgradient Descent
Given gk∈∂f(xk), the subgradient descent update is:
$$x_{k+1} = x_k - \alpha_k g_k$$
This looks identical to gradient descent but there is a critical difference.
Subgradient descent is not a descent method. A gradient step on a smooth function always decreases f (for small enough α). A subgradient step can increase f. The subgradient at a non-smooth point points "away from" the minimizer in a generalized sense, but the resulting step can move uphill in function value.
Because of this, we track the best iterate:
$$f^k_{\text{best}} = \min_{0 \le j \le k} f(x_j)$$
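Both phenomena — uphill steps and best-iterate tracking — show up already in one dimension on f(x) = ∣x∣. A sketch in plain Python (the diminishing steps αk = 1/(k+1) are one common choice, not prescribed above; at x = 0 we pick the subgradient 0, which is valid since ∂f(0) = [−1, 1]):

```python
def subgradient_descent_abs(x0, steps):
    # Subgradient descent on f(x) = |x|, tracking the best iterate.
    x = x0
    f_best = abs(x0)
    increased = False  # records whether any single step moved uphill in f
    for k in range(steps):
        g = 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)  # a subgradient of |x|
        x_new = x - g / (k + 1)                          # step alpha_k = 1/(k+1)
        increased = increased or abs(x_new) > abs(x)
        x = x_new
        f_best = min(f_best, abs(x))
    return f_best, increased
```

Starting near the kink (say x0 = 0.1), the very first step overshoots to −0.9 — an increase in f — yet the best iterate still converges toward the minimum, which is why the analysis is stated for f_best rather than f(x_k).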
Convergence Analysis
Theorem. For convex f with ∥g∥ ≤ G for all g ∈ ∂f(x), and R = ∥x0 − x∗∥, subgradient descent with step sizes αj satisfies:

$$f^k_{\text{best}} - f^* \;\le\; \frac{R^2 + G^2 \sum_{j=0}^{k-1} \alpha_j^2}{2 \sum_{j=0}^{k-1} \alpha_j}$$

With the optimal fixed-budget step α = R/(G√k) (which requires knowing k in advance), the bound becomes exactly GR/√k.
5. The Lower Bound: O(1/√k) is Optimal
The O(1/√k) rate is not a shortcoming of subgradient descent — it is provably the best any first-order method can achieve on non-smooth problems.
Theorem (Lower Bound). For any first-order method and any k≥1, there exists a G-Lipschitz convex function f such that:
$$f(x_k) - f^* \;\ge\; \frac{G \|x_0 - x^*\|}{2\sqrt{k+1}}$$
Proof sketch. The hard instance is f(x) = G⋅max1≤i≤k+1 xi (or, equivalently, a shifted G∥x∥∞). Any first-order method, after k steps starting from x0 = 0, produces iterates in the span of the subgradients {g0,…,gk−1} returned by the oracle. By the structure of f, each oracle call reveals at most one new coordinate, so this span cannot capture enough coordinates to get within G∥x0−x∗∥/(2√(k+1)) of the minimum. □
Subgradient descent matches this lower bound. It is optimal for the class of G-Lipschitz convex functions.
The Full Convergence Hierarchy
We now have the complete map of first-order convergence rates:
Smoothness is worth exactly a factor of k^{3/2} in convergence rate: AGD gives O(1/k²) on smooth problems vs O(1/√k) for non-smooth. When a function has composite structure (smooth + non-smooth), exploiting that structure via FISTA recovers the smooth rate entirely.
The message is architectural: structure is speed. Gradient descent, subgradient descent, ISTA, FISTA, Frank-Wolfe — these are not just different algorithms for the same problem. They are the right tools for different problem structures, each achieving a rate that matches the information-theoretic lower bound for its class.
The next post steps into deep learning theory, where these optimization foundations meet a genuinely mysterious phenomenon: models with more parameters than data that somehow generalize better as they get bigger.
Part of the Mathematics of Data series — mathematical notes on EE-556 at EPFL.