Mathematics of Data
Every optimization algorithm lives inside a geometry. Before the first gradient is computed, before the first step is taken, the shape of the space and the shape of the function determine what is even possible. This post is about that shape.
We will cover four ideas in sequence: the geometry of norms (and why the ball's pointy corners matter enormously), matrix norms and their analogies, the precise meaning of convexity and its characterizations, and the two structural properties — smoothness and strong convexity — that will govern every convergence rate in this series.
A norm on $\mathbb{R}^n$ is a function $\|\cdot\| : \mathbb{R}^n \to \mathbb{R}_{\ge 0}$ satisfying three axioms:

1. **Definiteness:** $\|x\| = 0$ if and only if $x = 0$.
2. **Absolute homogeneity:** $\|\alpha x\| = |\alpha|\,\|x\|$ for all $\alpha \in \mathbb{R}$.
3. **Triangle inequality:** $\|x + y\| \le \|x\| + \|y\|$.
The family of $\ell_p$-norms provides a continuum of geometries: for $p \ge 1$,

$$\|x\|_p = \Big(\sum_{i=1}^n |x_i|^p\Big)^{1/p},$$

with the limiting case $\|x\|_\infty = \max_i |x_i|$.
The three most important cases: $\ell_1$ (the sum of absolute values), $\ell_2$ (the Euclidean norm), and $\ell_\infty$ (the largest absolute entry).
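These three norms are available directly in numpy; a minimal sketch (numpy is an assumption — the post itself contains no code):

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0, 1.0])

l1 = np.linalg.norm(x, 1)         # sum of absolute values
l2 = np.linalg.norm(x, 2)         # Euclidean length, sqrt(26)
linf = np.linalg.norm(x, np.inf)  # largest absolute entry

print(l1, l2, linf)
```

Note how the same vector has three different "sizes": $\|x\|_1 = 8$, $\|x\|_2 = \sqrt{26} \approx 5.1$, $\|x\|_\infty = 4$, and in general $\|x\|_\infty \le \|x\|_2 \le \|x\|_1$.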
The unit ball is the most direct way to see what a norm looks like. In two dimensions:
```
      ℓ₁ ball               ℓ₂ ball              ℓ∞ ball

       (0,1)                 ____          (-1,1)┌─────┐(1,1)
        /\                  /    \               │     │
 (-1,0)/  \(1,0)           |      |              │     │
       \  /                 \____/               │     │
        \/                                 (-1,-1)└────┘(1,-1)
       (0,-1)

      Diamond               Circle               Square
```
These shapes are not just visual curiosities. They encode fundamentally different notions of "small."
This is one of the most important geometric insights in all of machine learning, and it is elementary once you see it.
Consider the constrained optimization problem:

$$\min_x \; f(x) \quad \text{subject to} \quad \|x\|_1 \le t.$$
Imagine the level curves of $f$ — ellipses if $f$ is a smooth convex function — shrinking inward toward the unconstrained minimum. The constrained minimum is the first point where a level curve touches the $\ell_1$ ball.
Now look at the $\ell_1$ ball. It has four corners: $(\pm t, 0)$ and $(0, \pm t)$. These corners sit on the coordinate axes. A generic level curve will almost certainly hit a corner before it hits an edge. At a corner, one or more coordinates are exactly zero.
For the $\ell_2$ ball (a circle), there are no corners — the boundary is smooth everywhere, and the first contact point is generically interior to all coordinate hyperplanes. No sparsity.
This is the entire geometric explanation for why $\ell_1$ regularization (Lasso) induces sparsity. No probabilistic argument, no Bayesian prior needed — it is pure geometry. In higher dimensions, the $\ell_1$ ball is a cross-polytope whose vertices all lie on coordinate axes, amplifying this effect.
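The contrast is easy to see numerically on the separable problem $\min_x \frac12\|x - b\|^2 + \lambda \cdot \mathrm{pen}(x)$, where both penalties admit closed-form solutions: soft-thresholding for $\ell_1$, uniform shrinkage for squared $\ell_2$. A minimal numpy sketch (the variable names are illustrative, not from the post):

```python
import numpy as np

b = np.array([2.0, 0.3, -1.5, 0.05])
lam = 0.5

# l1 penalty: closed-form soft-threshold -- small coordinates become exactly zero
x_l1 = np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

# squared-l2 penalty (ridge): uniform shrinkage -- nothing ever hits zero
x_l2 = b / (1.0 + 2.0 * lam)

print(x_l1)  # [ 1.5   0.   -1.    0.   ]  -- exact zeros
print(x_l2)  # [ 1.    0.15 -0.75  0.025]  -- shrunk, never zero
```

The $\ell_1$ solution kills the two small coordinates exactly; the $\ell_2$ solution merely scales everything down.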
Given a norm $\|\cdot\|$, its dual norm is:

$$\|y\|_* = \sup_{\|x\| \le 1} \langle x, y \rangle.$$
The dual norm measures how well $y$ can extract value from the unit ball of $\|\cdot\|$ — the maximum inner product achievable against a unit-norm vector.
Claim: The dual of $\ell_p$ is $\ell_q$ where $\frac{1}{p} + \frac{1}{q} = 1$. In particular: $\ell_1$ and $\ell_\infty$ are dual to each other, and $\ell_2$ is self-dual.
This follows from Hölder's inequality, which we now prove.
Theorem (Hölder's Inequality): For $x, y \in \mathbb{R}^n$ and $p, q > 1$ with $\frac{1}{p} + \frac{1}{q} = 1$:

$$\sum_{i=1}^n |x_i y_i| \le \|x\|_p \, \|y\|_q.$$
Proof. If either side is zero, the result is immediate. Otherwise, by homogeneity we may assume $\|x\|_p = \|y\|_q = 1$. We need to show $\sum_i |x_i y_i| \le 1$.
Apply Young's inequality: for $a, b \ge 0$,

$$ab \le \frac{a^p}{p} + \frac{b^q}{q}.$$

This follows from the concavity of $\log$: $\log\!\Big(\frac{a^p}{p} + \frac{b^q}{q}\Big) \ge \frac{1}{p}\log(a^p) + \frac{1}{q}\log(b^q) = \log(ab)$, so $\frac{a^p}{p} + \frac{b^q}{q} \ge ab$.
Applying Young's to each term:

$$\sum_i |x_i y_i| \le \sum_i \frac{|x_i|^p}{p} + \sum_i \frac{|y_i|^q}{q} = \frac{1}{p} + \frac{1}{q} = 1. \qquad \blacksquare$$
Deriving the dual of $\ell_p$: We want to compute $\|y\|_* = \sup_{\|x\|_p \le 1} \langle x, y \rangle$. Hölder gives the upper bound $\langle x, y \rangle \le \|x\|_p \, \|y\|_q \le \|y\|_q$.

The bound is achieved by setting $x_i = \mathrm{sign}(y_i)\,|y_i|^{q-1} / \|y\|_q^{q-1}$: since $(q-1)p = q$, this $x$ has $\|x\|_p = 1$ and $\langle x, y \rangle = \|y\|_q$. So the dual of $\|\cdot\|_p$ is $\|\cdot\|_q$.
Dual norms appear constantly in analysis — whenever you bound an inner product (e.g., gradient times step direction), Hölder gives you $\langle a, b \rangle \le \|a\| \, \|b\|_*$. The primal and dual norms are always the right pairing for this.
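Both the inequality and the maximizer from the derivation above can be checked numerically; a sketch assuming numpy, with $p = 3$, $q = 3/2$:

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 3.0, 1.5                 # Hölder conjugates: 1/p + 1/q = 1
x = rng.standard_normal(5)
y = rng.standard_normal(5)

# Hölder's inequality: |<x, y>| <= ||x||_p ||y||_q
lhs = np.abs(x @ y)
rhs = np.linalg.norm(x, p) * np.linalg.norm(y, q)
assert lhs <= rhs + 1e-12

# the maximizer x_i ∝ sign(y_i)|y_i|^{q-1} attains <x, y> = ||y||_q
x_star = np.sign(y) * np.abs(y) ** (q - 1)
x_star /= np.linalg.norm(x_star, p)
assert np.isclose(x_star @ y, np.linalg.norm(y, q))
```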
Vectors are matrices, so norms extend. Three matrix norms will appear throughout this series.
The Frobenius norm $\|A\|_F = \big(\sum_{i,j} A_{ij}^2\big)^{1/2}$ is simply the $\ell_2$ norm of the vectorized matrix. It treats the matrix as a flat bag of numbers. Differentiating through it is easy: $\nabla_A \tfrac12 \|A\|_F^2 = A$.
The spectral norm is $\|A\|_2 = \sigma_{\max}(A)$, where $\sigma_{\max}(A)$ is the largest singular value of $A$. This is the operator norm induced by the $\ell_2$ norm on vectors: $\|A\|_2 = \sup_{x \ne 0} \|Ax\|_2 / \|x\|_2$. It measures the maximum "stretch" the matrix applies.
The spectral norm is the dual of the nuclear norm (defined below) — a fact that parallels the $\ell_1$/$\ell_\infty$ duality.
The nuclear norm is $\|A\|_* = \sum_i \sigma_i(A)$, the sum of all singular values. This is the matrix analogue of the $\ell_1$ norm — and the analogy is exact.
Just as the $\ell_1$ norm promotes sparsity in vectors, the nuclear norm promotes low rank in matrices. The singular values of $A$ are the "coordinates" in the spectral domain. Minimizing their sum drives small singular values to zero, reducing the effective rank.
The parallel:
| Vector world | Matrix world |
|---|---|
| $\ell_1$ norm: $\lVert x \rVert_1 = \sum_i \lvert x_i \rvert$ | Nuclear norm: $\lVert A \rVert_* = \sum_i \sigma_i(A)$ |
| Promotes sparse $x$ | Promotes low-rank $A$ |
| Geometric: corners on coordinate axes | Geometric: corners in spectral domain |
| Used in: Lasso, compressed sensing | Used in: matrix completion, recommendation |
The proximal operator of the nuclear norm is singular value soft-thresholding — the matrix analogue of the scalar soft-threshold from Lasso. We will encounter this in Post 6.
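Singular value soft-thresholding itself fits in a few lines; a numpy sketch (the helper name `svt` and the rank-2 test matrix are illustrative assumptions, not from the post):

```python
import numpy as np

def svt(A, tau):
    """Prox of tau * nuclear norm: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 6))  # rank 2
A_noisy = A + 0.1 * rng.standard_normal((6, 6))                # full rank

B = svt(A_noisy, tau=1.0)
print(np.linalg.matrix_rank(A_noisy), np.linalg.matrix_rank(B))
```

Thresholding wipes out the small noise singular values, so `B` has much lower rank than the noisy input — the matrix counterpart of Lasso zeroing small coordinates.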
A set $C \subseteq \mathbb{R}^n$ is convex if for all $x, y \in C$ and $\lambda \in [0, 1]$:

$$\lambda x + (1 - \lambda) y \in C.$$
In words: the line segment between any two points in $C$ stays inside $C$. Balls are convex. Polytopes are convex. Non-convex: donuts, stars, any set with a "dent."
The unit balls $\{x : \|x\|_p \le 1\}$ for $p \ge 1$ are all convex. (For $p < 1$, the "$\ell_p$ norm" is no longer a norm — it violates the triangle inequality — and the unit ball is non-convex with an inward-bowing boundary.)
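The $p < 1$ failure is easy to witness: take two points on the $p = \tfrac12$ unit "ball" and check that their midpoint escapes it. A numpy sketch (the helper `p_quasi_norm` is an illustrative name):

```python
import numpy as np

def p_quasi_norm(x, p):
    """(sum |x_i|^p)^(1/p) -- a norm only when p >= 1."""
    return (np.abs(x) ** p).sum() ** (1.0 / p)

a = np.array([1.0, 0.0])   # on the p = 1/2 unit "ball"
b = np.array([0.0, 1.0])   # also on it
mid = 0.5 * (a + b)

print(p_quasi_norm(a, 0.5), p_quasi_norm(b, 0.5))  # both 1.0
print(p_quasi_norm(mid, 0.5))                      # 2.0 -- midpoint lies outside
```

The midpoint of two unit-"norm" points has quasi-norm 2, so the ball cannot be convex — exactly the inward-bowing boundary described above.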
A function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if for all $x, y$ and $\lambda \in [0, 1]$:

$$f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y).$$
The function value at any convex combination of two points lies below or on the chord connecting them. Visually: the function lies below all its secant lines.
Convexity is the "holy grail" of optimization for one overwhelming reason: every local minimum is a global minimum.
Proof: Suppose $x$ is a local minimum but not global. Then there exists $y$ with $f(y) < f(x)$. By convexity, for all $\lambda \in (0, 1]$:

$$f(x + \lambda(y - x)) \le (1 - \lambda) f(x) + \lambda f(y) < f(x).$$

But the points $x + \lambda(y - x)$ can be made arbitrarily close to $x$ (take $\lambda \to 0$), contradicting $x$ being a local minimum. $\blacksquare$
Theorem: A differentiable function $f$ is convex iff for all $x, y$:

$$f(y) \ge f(x) + \langle \nabla f(x), \, y - x \rangle.$$
The tangent line (or hyperplane in higher dimensions) is a global lower bound everywhere. This is a remarkable property — local linear information (the gradient at $x$) tells you something true everywhere.
Proof ($\Rightarrow$): Fix $x, y$. By convexity, for any $\lambda \in (0, 1]$:

$$f(x + \lambda(y - x)) \le (1 - \lambda) f(x) + \lambda f(y).$$

Rearranging: $f(y) \ge f(x) + \dfrac{f(x + \lambda(y - x)) - f(x)}{\lambda}$.

Taking $\lambda \to 0^+$, the right side converges to $f(x) + \langle \nabla f(x), y - x \rangle$.
Proof ($\Leftarrow$): Suppose the tangent lower bound holds. For $z = \lambda x + (1 - \lambda) y$:

$$f(x) \ge f(z) + \langle \nabla f(z), x - z \rangle, \qquad f(y) \ge f(z) + \langle \nabla f(z), y - z \rangle.$$

Multiply the first by $\lambda$, the second by $1 - \lambda$, add:

$$\lambda f(x) + (1 - \lambda) f(y) \ge f(z) + \langle \nabla f(z), \, \lambda x + (1 - \lambda) y - z \rangle.$$

Since $z = \lambda x + (1 - \lambda) y$, the inner-product term vanishes. So:

$$\lambda f(x) + (1 - \lambda) f(y) \ge f(z) = f(\lambda x + (1 - \lambda) y). \qquad \blacksquare$$
The first-order condition has an immediate and crucial consequence: if $\nabla f(x^*) = 0$, then for all $y$: $f(y) \ge f(x^*) + \langle \nabla f(x^*), y - x^* \rangle = f(x^*)$. So any stationary point of a convex function is a global minimum. This is why checking $\nabla f = 0$ is sufficient for convex problems.
Theorem: A twice-differentiable function $f$ is convex iff for all $x$:

$$\nabla^2 f(x) \succeq 0,$$

where $\succeq 0$ means the Hessian matrix is positive semi-definite (PSD).
A symmetric matrix $M$ is PSD iff any of these equivalent conditions hold:

- $v^\top M v \ge 0$ for all $v \in \mathbb{R}^n$;
- all eigenvalues of $M$ are non-negative;
- $M = B^\top B$ for some matrix $B$.
Geometrically: $v^\top \nabla^2 f(x) \, v$ is the second directional derivative of $f$ in direction $v$ at $x$. Requiring it to be non-negative for all $v$ means the function curves upward in every direction — no concave dip anywhere.
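These criteria are directly checkable in numpy. As a sanity check (an illustrative sketch, not from the post), take log-sum-exp — a standard convex function whose Hessian has the closed form $\operatorname{diag}(p) - pp^\top$, with $p$ the softmax of $x$:

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])
p = np.exp(x) / np.exp(x).sum()     # softmax probabilities
H = np.diag(p) - np.outer(p, p)     # Hessian of log-sum-exp at x

# eigenvalue criterion: all eigenvalues non-negative
eigs = np.linalg.eigvalsh(H)
assert np.all(eigs >= -1e-12)

# quadratic-form criterion: v^T H v >= 0 for a random direction
v = np.random.default_rng(2).standard_normal(3)
assert v @ H @ v >= -1e-12
```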
A few examples: $f(x) = \tfrac12 x^\top Q x$ is convex iff $Q \succeq 0$ (its Hessian is $Q$ itself); $f(x) = e^x$ is convex on $\mathbb{R}$ since $f''(x) = e^x > 0$; and every norm is convex directly from the triangle inequality and homogeneity, no differentiability required.
Convexity alone is not enough to derive fast convergence rates. Two additional structural conditions — smoothness and strong convexity — will appear in every theorem of this series.
A differentiable function $f$ is $L$-smooth if its gradient is $L$-Lipschitz:

$$\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\| \quad \text{for all } x, y.$$
This says the gradient cannot change too rapidly. Equivalently (for twice-differentiable $f$), $\nabla^2 f(x) \preceq L I$ everywhere (all eigenvalues of the Hessian are at most $L$).
Lemma (Descent Lemma): If $f$ is $L$-smooth, then for all $x, y$:

$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2} \|y - x\|^2.$$
Proof. By the fundamental theorem of calculus and Cauchy–Schwarz:

$$f(y) - f(x) - \langle \nabla f(x), y - x \rangle = \int_0^1 \langle \nabla f(x + t(y - x)) - \nabla f(x), \, y - x \rangle \, dt \le \int_0^1 \|\nabla f(x + t(y - x)) - \nabla f(x)\| \, \|y - x\| \, dt.$$

Applying $L$-smoothness, $\|\nabla f(x + t(y - x)) - \nabla f(x)\| \le L t \|y - x\|$, so:

$$f(y) - f(x) - \langle \nabla f(x), y - x \rangle \le \int_0^1 L t \, \|y - x\|^2 \, dt = \frac{L}{2} \|y - x\|^2. \qquad \blacksquare$$
Geometrically: a smooth function is sandwiched from above by a quadratic with curvature $L$. This quadratic upper bound is the workhorse of gradient descent analysis — we will use it in the next post to derive the gradient descent algorithm and its optimal step size.
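The quadratic upper bound can be verified numerically on a quadratic $f(x) = \tfrac12 x^\top Q x$, for which $L = \lambda_{\max}(Q)$; a numpy sketch (random test points, not from the post):

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((4, 4))
Q = B.T @ B                        # symmetric PSD Hessian
L = np.linalg.eigvalsh(Q).max()    # smoothness constant = lambda_max(Q)

f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x

x = rng.standard_normal(4)
y = rng.standard_normal(4)

# descent lemma: f(y) <= f(x) + <grad f(x), y - x> + (L/2)||y - x||^2
upper = f(x) + grad(x) @ (y - x) + 0.5 * L * (y - x) @ (y - x)
assert f(y) <= upper + 1e-9
```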
A function $f$ is $\mu$-strongly convex ($\mu > 0$) if for all $x, y$:

$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2} \|y - x\|^2.$$
This strengthens the first-order convexity condition: the tangent-plane lower bound is replaced by a tighter quadratic lower bound. Equivalently (for twice-differentiable $f$), $\nabla^2 f(x) \succeq \mu I$ everywhere — all eigenvalues of the Hessian are at least $\mu$.
If $f$ is $\mu$-strongly convex, then $f(x) - \frac{\mu}{2}\|x\|^2$ is convex. Strong convexity means $f$ curves upward by at least $\mu$ in every direction.
Strong convexity guarantees a unique global minimum. If $\nabla f(x^*) = 0$ and $f$ is $\mu$-SC, then for all $x$: $f(x) \ge f(x^*) + \frac{\mu}{2} \|x - x^*\|^2$.
When $f$ is both $L$-smooth and $\mu$-strongly convex (with $\mu \le L$), the function is sandwiched between two paraboloids:

$$f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2} \|y - x\|^2 \;\le\; f(y) \;\le\; f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2} \|y - x\|^2.$$
For the gradients, this sandwich implies:

$$\mu \|x - y\| \le \|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|.$$

The left inequality (a consequence of the strong monotonicity of the gradient) says the gradient doesn't just change in the right direction — it changes by at least $\mu$ per unit of displacement. This will be the key estimate in proving linear convergence of gradient descent.
Example: The quadratic $f(x) = \frac12 x^\top Q x$ with $Q \succ 0$ is $\mu$-SC and $L$-smooth with $\mu = \lambda_{\min}(Q)$, $L = \lambda_{\max}(Q)$.
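For a concrete instance, take a deliberately elongated $Q$; the constants fall out of `eigvalsh` directly (a numpy sketch):

```python
import numpy as np

Q = np.array([[100.0, 0.0],
              [0.0,   1.0]])      # f(x) = 0.5 x^T Q x, stretched 100x along x1

eigs = np.linalg.eigvalsh(Q)      # sorted ascending
mu, L = eigs[0], eigs[-1]
kappa = L / mu

print(mu, L, kappa)  # 1.0 100.0 100.0
```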
The ratio

$$\kappa = \frac{L}{\mu} \ge 1$$

is the condition number of the optimization problem. It is one of the most important quantities in numerical optimization.
When $\kappa$ is small (close to 1), the function is nearly spherical — level sets are nearly round balls, and gradient descent finds the minimum quickly in any direction.
When $\kappa$ is large, the level sets are elongated ellipsoids. The gradient points in a very different direction from the minimum. Gradient descent zig-zags: it overshoots in the steep direction while barely moving in the flat direction.
```
      κ ≈ 1                           κ ≫ 1
    (nice bowl)                   (narrow valley)

      ╭────╮             ╭─────────────────────────╮
     │ ╭──╮ │            │ ╭─────────────────────╮ │
     │ ╰──╯ │            │ ╰─────────────────────╯ │
      ╰────╯             ╰─────────────────────────╯
```
Concretely: gradient descent on a $\mu$-SC, $L$-smooth function needs $O(\kappa \log(1/\varepsilon))$ iterations to reach $\varepsilon$-accuracy — an ill-conditioned problem pays a multiplicative factor of $\kappa$ over a well-conditioned one, for which $O(\log(1/\varepsilon))$ iterations suffice.
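A tiny experiment makes the $\kappa$-dependence visible (a sketch assuming numpy; the helper `gd_iters` is illustrative, not from the post):

```python
import numpy as np

def gd_iters(kappa, eps=1e-6):
    """Iterations of GD with step 1/L on f(x) = 0.5*(kappa*x1^2 + x2^2),
    started at (1, 1), until f(x) <= eps.  Here L = kappa and mu = 1."""
    Q = np.diag([kappa, 1.0])
    x = np.array([1.0, 1.0])
    step = 1.0 / kappa
    it = 0
    while 0.5 * x @ Q @ x > eps:
        x = x - step * (Q @ x)
        it += 1
    return it

# iteration count grows roughly linearly with the condition number
print(gd_iters(1.0), gd_iters(10.0), gd_iters(100.0))
```

The flat $x_2$ direction shrinks only by a factor $(1 - 1/\kappa)$ per step, which is exactly where the $\kappa$ in the rate comes from.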
Every algorithm in this series — from Nesterov acceleration to SVRG — is, at its core, trying to reduce the dependence on $\kappa$.
We have built the geometric vocabulary for the entire series: norms, their unit balls, and their duals; the $\ell_1$-to-nuclear-norm analogy linking sparsity and low rank; the zeroth-, first-, and second-order characterizations of convexity; and the smoothness/strong-convexity sandwich whose ratio $\kappa = L/\mu$ governs every convergence rate to come.
The next post puts these tools to work: we derive the gradient descent algorithm from the descent lemma and prove, carefully, that it converges at rate $O(1/k)$ for smooth convex functions and at a linear rate for strongly convex ones — and explain exactly why the condition number $\kappa$ is the bottleneck.
This post is part of a series of mathematical notes on EE-556: Mathematics of Data (EPFL). The series covers optimization theory from first principles through modern deep learning.