
Mathematics of Data

The Geometry of Optimization: Norms, Convexity, and Why Shape Matters

March 2025 · optimization · convexity · linear-algebra · foundations

Every optimization algorithm lives inside a geometry. Before the first gradient is computed, before the first step is taken, the shape of the space and the shape of the function determine what is even possible. This post is about that shape.

We will cover four ideas in sequence: the geometry of norms (and why the $\ell_1$ ball's pointy corners matter enormously), matrix norms and their analogies, the precise meaning of convexity and its characterizations, and the two structural properties — smoothness and strong convexity — that will govern every convergence rate in this series.


1. Vector Norms and Their Geometry

Definition

A norm $\|\cdot\|$ on $\mathbb{R}^d$ is a function satisfying three axioms:

  1. Positive definiteness: $\|x\| \geq 0$, with equality iff $x = 0$
  2. Absolute homogeneity: $\|\alpha x\| = |\alpha| \|x\|$ for all $\alpha \in \mathbb{R}$
  3. Triangle inequality: $\|x + y\| \leq \|x\| + \|y\|$

The family of $\ell_p$-norms provides a continuum of geometries:

$$\|x\|_p = \left(\sum_{i=1}^d |x_i|^p\right)^{1/p}, \quad p \in [1, \infty)$$

with the limiting case $\|x\|_\infty = \max_i |x_i|$.

The three most important cases:

  • $\ell_1$ (Manhattan): $\|x\|_1 = \sum_i |x_i|$
  • $\ell_2$ (Euclidean): $\|x\|_2 = \sqrt{\sum_i x_i^2}$
  • $\ell_\infty$ (Chebyshev): $\|x\|_\infty = \max_i |x_i|$
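As a quick sanity check, all three norms are one call away in numpy; the `ord` argument selects which one (the vector `x` below is just an illustrative choice):

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])

# The three workhorse norms, via numpy.linalg.norm's `ord` parameter:
l1   = np.linalg.norm(x, 1)        # |3| + |-4| + |0| = 7
l2   = np.linalg.norm(x, 2)        # sqrt(9 + 16)     = 5
linf = np.linalg.norm(x, np.inf)   # max(|3|, |4|, 0) = 4

assert (l1, l2, linf) == (7.0, 5.0, 4.0)
```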

The Unit Ball Reveals the Geometry

The unit ball $\mathcal{B}_p = \{x \in \mathbb{R}^2 : \|x\|_p \leq 1\}$ is the most direct way to see what a norm looks like. In two dimensions:

         ℓ₁ ball             ℓ₂ ball             ℓ∞ ball
           (0,1)               (0,1)            (-1,1)──(1,1)
            /\                  ╭─╮                |      |
           /  \               ╭   ╮               |      |
    (-1,0)╳    ╳(1,0)       (─     ─)          (-1,-1)─(1,-1)
           \  /               ╰   ╯
            \/                  ╰─╯
           (0,-1)              (0,-1)
         Diamond             Circle             Square

These shapes are not just visual curiosities. They encode fundamentally different notions of "small."

Why the $\ell_1$ Ball Produces Sparse Solutions

This is one of the most important geometric insights in all of machine learning, and it is elementary once you see it.

Consider the constrained optimization problem:

$$\min_{x \in \mathbb{R}^2} f(x) \quad \text{subject to} \quad \|x\|_1 \leq \tau$$

Imagine the level curves of $f$ — ellipses if $f$ is a smooth convex function — shrinking inward toward the unconstrained minimum. The constrained minimum is the first point where a level curve touches the $\ell_1$ ball.

Now look at the $\ell_1$ ball. It has four corners: $(\pm\tau, 0)$ and $(0, \pm\tau)$. These corners sit on the coordinate axes. A generic level curve will make first contact at a corner rather than along an edge. At a corner, one or more coordinates are exactly zero.

For the $\ell_2$ ball (a circle), there are no corners — the boundary is smooth everywhere, and the first contact point generically lies off every coordinate axis. No sparsity.

This is the entire geometric explanation for why $\ell_1$ regularization (Lasso) induces sparsity. No probabilistic argument, no Bayesian prior needed — it is pure geometry. In higher dimensions, the $\ell_1$ ball is a cross-polytope whose $2d$ vertices all lie on coordinate axes, amplifying this effect.
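The same geometry shows up in the penalized (unconstrained) form. A minimal sketch, with helper names of my own choosing: the one-dimensional proximal step for an $\ell_1$ penalty produces exact zeros (soft-thresholding), while the corresponding $\ell_2$-squared penalty only shrinks:

```python
import numpy as np

# Minimizing 0.5*(x - v)^2 + lam*|x|   gives soft-thresholding -> exact zeros.
# Minimizing 0.5*(x - v)^2 + lam*x^2   gives uniform shrinkage -> never zero.
def prox_l1(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def prox_l2_sq(v, lam):
    return v / (1.0 + 2.0 * lam)

v = np.array([3.0, -0.4, 0.05, -2.0, 0.2])
print(prox_l1(v, lam=0.5))     # entries below the threshold become exactly 0
print(prox_l2_sq(v, lam=0.5))  # every entry shrinks, none is exactly 0
```

With `lam=0.5`, three of the five entries of `prox_l1(v, 0.5)` are exactly zero; `prox_l2_sq` zeroes none.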

Dual Norms

Given a norm $\|\cdot\|$, its dual norm is:

$$\|y\|_* = \sup_{\|x\| \leq 1} \langle x, y \rangle$$

The dual norm measures how well $y$ can extract value from the unit ball of $\|\cdot\|$ — the maximum inner product achievable against a unit-norm vector.

Claim: The dual of $\ell_p$ is $\ell_q$ where $\frac{1}{p} + \frac{1}{q} = 1$. In particular:

  • The dual of $\ell_1$ is $\ell_\infty$
  • The dual of $\ell_\infty$ is $\ell_1$
  • The dual of $\ell_2$ is $\ell_2$ (self-dual)

This follows from Hölder's inequality, which we now prove.

Theorem (Hölder's Inequality): For $\frac{1}{p} + \frac{1}{q} = 1$ with $p, q \geq 1$:

$$|\langle x, y \rangle| \leq \|x\|_p \|y\|_q$$

Proof. If either side is zero, the result is immediate. Otherwise, by homogeneity we may assume $\|x\|_p = \|y\|_q = 1$. We need to show $\sum_i |x_i y_i| \leq 1$.

Apply Young's inequality: for $a, b \geq 0$ and $\frac{1}{p} + \frac{1}{q} = 1$,

$$ab \leq \frac{a^p}{p} + \frac{b^q}{q}$$

This follows from the concavity of $\log$: $\log\!\left(\frac{a^p}{p} + \frac{b^q}{q}\right) \geq \frac{1}{p}\log(a^p) + \frac{1}{q}\log(b^q) = \log(ab)$, so $\frac{a^p}{p} + \frac{b^q}{q} \geq ab$.

Applying Young's inequality to each term:

$$\sum_i |x_i||y_i| \leq \sum_i \left(\frac{|x_i|^p}{p} + \frac{|y_i|^q}{q}\right) = \frac{\|x\|_p^p}{p} + \frac{\|y\|_q^q}{q} = \frac{1}{p} + \frac{1}{q} = 1 \qquad \square$$

Deriving the dual of $\ell_1$: We want to compute $\sup_{\|x\|_1 \leq 1} \langle x, y \rangle$.

$$\langle x, y \rangle = \sum_i x_i y_i \leq \sum_i |x_i| |y_i| \leq \max_j |y_j| \cdot \sum_i |x_i| = \|y\|_\infty \|x\|_1 \leq \|y\|_\infty$$

The bound is achieved by setting $x = \operatorname{sign}(y_j)\, e_j$ where $j = \arg\max_i |y_i|$. So $\|y\|_{1,*} = \|y\|_\infty$. $\square$

Dual norms appear constantly in analysis — whenever you bound an inner product $\langle g, d \rangle$ (e.g., gradient times step direction), Hölder gives $|\langle g, d \rangle| \leq \|g\|_* \|d\|$. The primal and dual norms are always the right pairing for this bound.
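The derivation above can be checked numerically. Since the supremum of a linear function over the $\ell_1$ ball is attained at a vertex $\pm e_j$ of the cross-polytope, maximizing over the vertices alone should recover $\|y\|_\infty$ (the random `y` below is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=5)

# sup of <x, y> over the l1 ball is attained at a vertex +-e_j:
vertices = np.vstack([np.eye(5), -np.eye(5)])   # the 2d vertices of the cross-polytope
dual_via_vertices = np.max(vertices @ y)

# ...and that maximum equals the l_inf norm of y, as derived above.
assert np.isclose(dual_via_vertices, np.max(np.abs(y)))
```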


2. Matrix Norms

A matrix can be flattened into a vector, but it also carries operator structure, so several distinct norms are useful. Three matrix norms will appear throughout this series.

Frobenius Norm

$$\|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2} = \sqrt{\operatorname{tr}(A^T A)}$$

This is simply the $\ell_2$ norm of the vectorized matrix. It treats the matrix as a flat bag of numbers. Differentiating through $\|A\|_F^2$ is easy: $\nabla_A \|A\|_F^2 = 2A$.

Spectral Norm

$$\|A\|_2 = \sigma_{\max}(A) = \sqrt{\lambda_{\max}(A^T A)}$$

where $\sigma_{\max}$ is the largest singular value of $A$. This is the operator norm induced by the $\ell_2$ vector norm: $\|A\|_2 = \sup_{\|x\|_2 = 1} \|Ax\|_2$. It measures the maximum "stretch" the matrix applies.

The spectral norm is the dual of the nuclear norm (defined below) — a fact that parallels $\ell_1$/$\ell_\infty$ duality.

Nuclear Norm

$$\|A\|_* = \sum_i \sigma_i(A)$$

the sum of all singular values. This is the matrix analogue of the $\ell_1$ norm — and the analogy is exact.

Just as the $\ell_1$ norm promotes sparsity in vectors, the nuclear norm promotes low rank in matrices. The singular values of $A$ are the "coordinates" in the spectral domain. Minimizing their sum — an $\ell_1$ norm on the spectrum — drives small singular values to zero, reducing the effective rank.

The parallel:

| Vector world | Matrix world |
|---|---|
| $\ell_1$ norm: $\sum_i \lvert x_i \rvert$ | Nuclear norm: $\sum_i \sigma_i(A)$ |
| Promotes sparse $x$ | Promotes low-rank $A$ |
| Geometric: corners on coordinate axes | Geometric: corners in the spectral domain |
| Used in: Lasso, compressed sensing | Used in: matrix completion, recommendation |

The proximal operator of the nuclear norm is singular value soft-thresholding — the matrix analogue of the scalar soft-threshold from Lasso. We will encounter this in Post 6.
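All three matrix norms are available directly in numpy, and all three are functions of the singular values alone — a quick sketch to make that concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))                  # an arbitrary test matrix
sigma = np.linalg.svd(A, compute_uv=False)   # singular values, descending

fro      = np.linalg.norm(A, 'fro')          # sqrt of sum of squared entries
spectral = np.linalg.norm(A, 2)              # largest singular value
nuclear  = np.linalg.norm(A, 'nuc')          # sum of singular values

# Each norm is a function of the spectrum alone:
assert np.isclose(fro, np.sqrt(np.sum(sigma**2)))
assert np.isclose(spectral, sigma.max())
assert np.isclose(nuclear, sigma.sum())
```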


3. Convex Sets and Functions

Convex Sets

A set $C \subseteq \mathbb{R}^d$ is convex if for all $x, y \in C$ and $\lambda \in [0, 1]$:

$$\lambda x + (1-\lambda)y \in C$$

In words: the line segment between any two points in $C$ stays inside $C$. Balls are convex. Polytopes are convex. Non-convex: donuts, stars, any set with a "dent."

The $\ell_p$ unit balls $\mathcal{B}_p$ for $p \geq 1$ are all convex. (For $p < 1$, the "norm" is no longer a norm — it violates the triangle inequality — and the unit ball is non-convex with an inward-bowing boundary.)

Convex Functions

A function $f: C \to \mathbb{R}$ on a convex set $C$ is convex if for all $x, y \in C$ and $\lambda \in [0,1]$:

$$f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y)$$

The function value at any convex combination of two points lies on or below the chord connecting them. Visually: the function lies below all its secant lines.

Convexity is the "holy grail" of optimization for one overwhelming reason: every local minimum is a global minimum.

Proof: Suppose $x^*$ is a local minimum but not global. Then there exists $z$ with $f(z) < f(x^*)$. By convexity, $f(\lambda z + (1-\lambda)x^*) \leq \lambda f(z) + (1-\lambda)f(x^*) < f(x^*)$ for all $\lambda \in (0,1]$. But the points $\lambda z + (1-\lambda)x^*$ can be made arbitrarily close to $x^*$ (take $\lambda \to 0$), so every neighborhood of $x^*$ contains points with strictly smaller function value, contradicting local minimality. $\square$

First-Order Characterization

Theorem: A differentiable function $f$ is convex iff for all $x, y$:

$$f(y) \geq f(x) + \nabla f(x)^T(y - x)$$

The tangent line (a hyperplane in higher dimensions) is a global lower bound everywhere. This is a remarkable property — local linear information (the gradient at $x$) tells you something true everywhere.

Proof ($\Rightarrow$): Fix $x, y$. By convexity, for any $t \in (0,1]$:

$$f(x + t(y-x)) \leq (1-t)f(x) + t f(y)$$

Rearranging: $f(y) \geq f(x) + \frac{f(x + t(y-x)) - f(x)}{t}$.

Taking $t \to 0^+$, the right side converges to $f(x) + \nabla f(x)^T(y-x)$. $\square$

Proof ($\Leftarrow$): Suppose the tangent lower bound holds. For $z = \lambda x + (1-\lambda)y$:

$$f(x) \geq f(z) + \nabla f(z)^T(x - z), \qquad f(y) \geq f(z) + \nabla f(z)^T(y - z)$$

Multiply the first by $\lambda$, the second by $(1-\lambda)$, and add:

$$\lambda f(x) + (1-\lambda)f(y) \geq f(z) + \nabla f(z)^T\big(\lambda(x-z) + (1-\lambda)(y-z)\big)$$

Since $z = \lambda x + (1-\lambda)y$, we have $\lambda(x-z) + (1-\lambda)(y-z) = 0$. So:

$$\lambda f(x) + (1-\lambda)f(y) \geq f(z) = f(\lambda x + (1-\lambda)y) \qquad \square$$

The first-order condition has an immediate and crucial consequence: if $\nabla f(x^*) = 0$, then for all $y$: $f(y) \geq f(x^*) + 0 = f(x^*)$. So any stationary point of a convex function is a global minimum. This is why checking $\nabla f = 0$ is sufficient for convex problems.
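The tangent lower bound is easy to check numerically (a sanity check, not a proof). For the convex function $f(x) = \log(1 + e^x)$, the tangent line at any point should lie below $f$ everywhere on a grid:

```python
import numpy as np

f  = lambda x: np.logaddexp(0.0, x)        # log(1 + e^x), numerically stable
df = lambda x: 1.0 / (1.0 + np.exp(-x))   # its derivative (the sigmoid)

xs = np.linspace(-5, 5, 201)
for x0 in [-3.0, 0.0, 2.5]:
    tangent = f(x0) + df(x0) * (xs - x0)
    # first-order condition: the tangent is a global lower bound
    assert np.all(f(xs) >= tangent - 1e-12)
```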

Second-Order Characterization

Theorem: A twice-differentiable function $f$ is convex iff for all $x$:

$$\nabla^2 f(x) \succeq 0$$

where $\succeq 0$ means the Hessian matrix is positive semi-definite (PSD).

A symmetric matrix $H$ is PSD iff any of these equivalent conditions holds:

  • All eigenvalues $\lambda_i \geq 0$
  • $v^T H v \geq 0$ for all $v \in \mathbb{R}^d$
  • $H = A^T A$ for some matrix $A$

Geometrically: $v^T \nabla^2 f(x) v$ is the second directional derivative of $f$ in direction $v$ at $x$. Requiring it to be non-negative for all $v$ means the function curves upward (or is flat) in every direction — no concave dip anywhere.

A few examples:

  • $f(x) = \|x\|_2^2 = x^T x$: Hessian is $2I \succ 0$ — positive definite, strictly convex.
  • $f(x) = \max_i x_i$: not differentiable everywhere, but convex. The second-order characterization applies only where $f$ is $C^2$.
  • $f(x) = x_1^2 - x_2^2$: Hessian is $\operatorname{diag}(2, -2)$, indefinite. Not convex.
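For a quadratic $f(x) = \frac{1}{2}x^T H x$, the Hessian is constant and the eigenvalue test is a one-liner; a minimal sketch (the helper name is mine) covering the first and third examples above:

```python
import numpy as np

# Convexity test for f(x) = 0.5 x^T H x via the eigenvalues of H.
def is_convex_quadratic(H):
    H = 0.5 * (H + H.T)                        # symmetrize first
    return bool(np.all(np.linalg.eigvalsh(H) >= -1e-12))

assert is_convex_quadratic(np.diag([2.0, 2.0]))        # ||x||^2: convex
assert not is_convex_quadratic(np.diag([2.0, -2.0]))   # saddle: not convex
```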

4. $L$-Smoothness and $\mu$-Strong Convexity

Convexity alone is not enough to derive fast convergence rates. Two additional structural conditions — smoothness and strong convexity — will appear in every theorem of this series.

$L$-Smoothness: Bounding the Gradient's Variation

A differentiable function $f$ is $L$-smooth if its gradient is $L$-Lipschitz:

$$\|\nabla f(x) - \nabla f(y)\|_2 \leq L \|x - y\|_2 \quad \forall x, y$$

This says the gradient cannot change too rapidly. For twice-differentiable convex $f$, it is equivalent to $\nabla^2 f(x) \preceq LI$ everywhere (all eigenvalues of the Hessian are at most $L$).

Lemma (Descent Lemma): If $f$ is $L$-smooth, then for all $x, y$:

$$f(y) \leq f(x) + \nabla f(x)^T(y - x) + \frac{L}{2}\|y - x\|_2^2$$

Proof. By the fundamental theorem of calculus and Cauchy–Schwarz:

$$f(y) - f(x) - \nabla f(x)^T(y-x) = \int_0^1 [\nabla f(x + t(y-x)) - \nabla f(x)]^T (y-x)\, dt$$

$$\leq \int_0^1 \|\nabla f(x + t(y-x)) - \nabla f(x)\|_2 \|y-x\|_2\, dt$$

Applying $L$-smoothness, $\|\nabla f(x + t(y-x)) - \nabla f(x)\|_2 \leq L t\|y-x\|_2$, so the integral is at most:

$$\int_0^1 L t \|y-x\|_2^2\, dt = \frac{L}{2}\|y-x\|_2^2 \qquad \square$$

Geometrically: a smooth function is bounded above by a quadratic with curvature $L$, tangent at any chosen point. This quadratic upper bound is the workhorse of gradient descent analysis — in the next post we will minimize it over $y$ to derive the algorithm's optimal step size.
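The descent lemma can be spot-checked numerically. For a quadratic $f(x) = \frac{1}{2}x^T A x$ with $A \succeq 0$, the gradient is $Ax$ and the smoothness constant is $L = \lambda_{\max}(A)$; the quadratic upper bound should then hold at every pair of random points:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(3, 3))
A = B.T @ B                                # a PSD Hessian
L = np.linalg.eigvalsh(A).max()            # smoothness constant

f    = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

for _ in range(100):
    x, y = rng.normal(size=3), rng.normal(size=3)
    upper = f(x) + grad(x) @ (y - x) + 0.5 * L * np.sum((y - x) ** 2)
    assert f(y) <= upper + 1e-9            # the descent lemma's upper bound
```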

$\mu$-Strong Convexity: Bounding Curvature From Below

A function $f$ is $\mu$-strongly convex ($\mu > 0$) if for all $x, y$:

$$f(y) \geq f(x) + \nabla f(x)^T(y - x) + \frac{\mu}{2}\|y - x\|_2^2$$

This strengthens the first-order convexity condition: the tangent-plane lower bound is replaced by a tighter quadratic lower bound. For twice-differentiable $f$, it is equivalent to $\nabla^2 f(x) \succeq \mu I$ everywhere — all eigenvalues of the Hessian are at least $\mu$.

Equivalently, $f$ is $\mu$-strongly convex iff $f(x) - \frac{\mu}{2}\|x\|^2$ is convex. Strong convexity means $f$ curves upward by at least $\mu$ in every direction.

Strong convexity guarantees a unique global minimizer: if $\nabla f(x^*) = 0$, then for all $y \neq x^*$, $f(y) \geq f(x^*) + \frac{\mu}{2}\|y - x^*\|^2 > f(x^*)$.

The Sandwich: When Both Hold

When $f$ is both $L$-smooth and $\mu$-strongly convex (with $0 < \mu \leq L$), the function is sandwiched between two paraboloids:

$$f(x) + \nabla f(x)^T(y-x) + \frac{\mu}{2}\|y-x\|^2 \;\leq\; f(y) \;\leq\; f(x) + \nabla f(x)^T(y-x) + \frac{L}{2}\|y-x\|^2$$

For the gradients, this sandwich implies:

$$\mu \|x - y\|^2 \leq (\nabla f(x) - \nabla f(y))^T(x - y) \leq L\|x-y\|^2$$

The left inequality is strong monotonicity of the gradient. It says the gradient doesn't just change in the right direction — it changes by at least $\mu$ per unit of displacement. This will be the key estimate in proving linear convergence of gradient descent.

Example: The quadratic $f(x) = \frac{1}{2}x^T A x - b^T x$ with $\mu I \preceq A \preceq LI$ is $\mu$-strongly convex and $L$-smooth, with the best constants $\mu = \lambda_{\min}(A)$ and $L = \lambda_{\max}(A)$.
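For that quadratic example, the gradient sandwich reduces to a Rayleigh-quotient bound, since $\nabla f(x) - \nabla f(y) = A(x - y)$; a quick numerical check:

```python
import numpy as np

# For f(x) = 0.5 x^T A x - b^T x, grad f(x) = A x - b, so
# (grad f(x) - grad f(y))^T (x - y) = d^T A d with d = x - y,
# which lies in [mu * ||d||^2, L * ||d||^2] by the Rayleigh quotient.
rng = np.random.default_rng(2)
B = rng.normal(size=(4, 4))
A = B.T @ B + 0.1 * np.eye(4)              # positive definite
mu, L = np.linalg.eigvalsh(A)[[0, -1]]     # smallest and largest eigenvalues

for _ in range(100):
    d = rng.normal(size=4)                 # d = x - y
    quad = d @ A @ d
    assert mu * (d @ d) - 1e-9 <= quad <= L * (d @ d) + 1e-9
```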

The Condition Number

The ratio

$$\kappa = \frac{L}{\mu}$$

is the condition number of the optimization problem. It is one of the most important quantities in numerical optimization.

When $\kappa$ is small (close to 1), the function is nearly spherical — level sets are nearly round balls, and gradient descent finds the minimum quickly from any direction.

When $\kappa$ is large, the level sets are elongated ellipsoids. The gradient can point in a very different direction from the minimizer. Gradient descent zig-zags: it overshoots in the steep direction while barely moving in the flat direction.

         κ ≈ 1               κ ≫ 1
        (nice bowl)         (narrow valley)

         ╭──╮                 ───────
        ╭    ╮              ─────────
       ╭      ╮            ─────────────
       ╰      ╯            ─────────────
        ╰    ╯              ─────────
         ╰──╯                 ───────

Concretely: gradient descent on a $\mu$-strongly convex, $L$-smooth function needs $O(\kappa \log(1/\epsilon))$ iterations to reach $\epsilon$-accuracy. When $\kappa = 1000$, that is roughly $7000$ iterations to reach error $10^{-3}$, versus just $7$ or so for a perfectly conditioned problem.
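This linear dependence on $\kappa$ is easy to see on a toy quadratic. A minimal sketch (helper name mine): gradient descent with step $1/L$ on $f(x) = \frac{1}{2}x^T \operatorname{diag}(\mu, L)\, x$, counting iterations until $\|x\| \leq \epsilon$:

```python
import numpy as np

def gd_iters(mu, L, eps=1e-3):
    """Iterations of gradient descent (step 1/L) on f(x) = 0.5 x^T diag(mu, L) x."""
    x = np.array([1.0, 1.0])
    H = np.diag([mu, L])
    k = 0
    while np.linalg.norm(x) > eps:
        x = x - (1.0 / L) * (H @ x)        # gradient step; slow coordinate
        k += 1                             # contracts by only (1 - mu/L)
    return k

print(gd_iters(1.0, 1.0))     # kappa = 1: the step is exact here, 1 iteration
print(gd_iters(1.0, 100.0))   # kappa = 100: on the order of 100x more iterations
```

The iteration count grows roughly in proportion to $\kappa$, matching the $O(\kappa \log(1/\epsilon))$ rate quoted above.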

Every algorithm in this series — from Nesterov acceleration to SVRG — is, at its core, trying to reduce the dependence on $\kappa$.


Looking Forward

We have built the geometric vocabulary for the entire series:

  • Norms give us ways to measure distance and size; the $\ell_1$ norm's geometry explains sparsity
  • Dual norms and Hölder's inequality give us the right way to bound inner products
  • Matrix norms extend the story to linear operators; the nuclear norm is $\ell_1$ on the singular values
  • Convexity ensures local minima are global; the first-order condition makes gradients globally informative
  • $L$-smoothness and $\mu$-strong convexity quantify how curved the function is, from above and below
  • The condition number $\kappa = L/\mu$ is the single number that will appear in every convergence theorem

The next post puts these tools to work: we derive the gradient descent algorithm from the descent lemma and prove, carefully, that it converges at rate $O(1/k)$ for smooth convex functions and at a linear rate for strongly convex ones — and explain exactly why the condition number is the bottleneck.


This post is part of a series of mathematical notes on EE-556: Mathematics of Data (EPFL). The series covers optimization theory from first principles through modern deep learning.