Blog

Mathematical notes on optimization, machine learning theory, and the mathematics of data — written as I work through EE-556 at EPFL. Rigorous but readable; proofs included.

Series: Mathematics of Data (13 posts)

March 2025

Adversarial Robustness: Minimax Games, PGD, and the Cost of Security

Neural networks can be fooled by imperceptible perturbations. Defending against them is a minimax optimization problem. This post derives FGSM and PGD attacks, proves Danskin's theorem to justify adversarial training, analyzes catastrophic overfitting and the GradAlign fix, and quantifies the robustness-accuracy trade-off.

adversarial, robustness, minimax, optimization
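A minimal NumPy sketch of the PGD attack the post derives: repeated signed gradient ascent on the loss, projected back onto the ℓ∞ ball around the clean input. The linear toy loss below is purely illustrative, not from the post.

```python
import numpy as np

def pgd_attack(x0, grad_loss, eps=0.1, alpha=0.02, steps=10):
    """Projected gradient ascent on the loss inside an l-infinity ball.

    FGSM is the one-step special case (steps=1, alpha=eps)."""
    x = x0.copy()
    for _ in range(steps):
        x = x + alpha * np.sign(grad_loss(x))   # signed ascent step
        x = np.clip(x, x0 - eps, x0 + eps)      # project onto B_inf(x0, eps)
    return x

# Toy linear "loss" L(x) = w . x: the attack saturates the box along sign(w).
w = np.array([1.0, -2.0, 0.5])
x_adv = pgd_attack(np.zeros(3), grad_loss=lambda x: w)
```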
March 2025

Score Matching and Diffusion Models: Generating Data by Learning to Denoise

Modern generative AI rests on a single mathematical insight: the gradient of a probability distribution's log-density — the score — can be recovered by learning to denoise. This post derives Hyvärinen's score-matching trick, Tweedie's formula, and the stochastic differential equations behind diffusion models.

generative-models, diffusion, score-matching, sde
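Tweedie's formula can be checked in closed form. A sketch using a scalar Gaussian prior, where the posterior mean is known exactly and must agree with the score-based denoiser:

```python
import numpy as np

# Tweedie's formula: for y = x + sigma * z with z ~ N(0, 1),
#   E[x | y] = y + sigma^2 * score(y),  where score(y) = d/dy log p(y).
# With a Gaussian prior x ~ N(0, tau^2), the marginal is N(0, tau^2 + sigma^2)
# and the posterior mean has a textbook closed form -- the two must agree.
tau, sigma, y = 2.0, 1.0, 1.5

score = -y / (tau**2 + sigma**2)                    # score of the marginal
tweedie = y + sigma**2 * score                      # denoiser built from the score
posterior_mean = y * tau**2 / (tau**2 + sigma**2)   # exact Gaussian posterior mean
```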
March 2025

The Neural Tangent Kernel: Why Infinitely Wide Networks Are Convex

Neural network training is non-convex — yet gradient descent finds good solutions reliably. The Neural Tangent Kernel explains why: in the infinite-width limit, the loss landscape becomes convex, training dynamics reduce to a linear ODE, and convergence to zero training loss is guaranteed. This post derives the NTK, solves the ODE, and confronts the theory's limitations.

deep-learning, ntk, kernel-methods, optimization
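The linear ODE at the heart of the NTK story can be solved directly: the training residual decays independently along the kernel's eigendirections. A sketch with a hypothetical 2×2 Gram matrix:

```python
import numpy as np

# In the NTK regime the residual r_t = f_t(X) - y obeys dr/dt = -K r for a
# fixed PSD kernel matrix K, so r_t = exp(-K t) r_0 -- a linear ODE.
K = np.array([[2.0, 0.5],
              [0.5, 1.0]])          # toy NTK Gram matrix
r0 = np.array([1.0, -1.0])          # initial residual f_0(X) - y

lam, Q = np.linalg.eigh(K)
def residual(t):
    # matrix exponential applied via the eigenbasis of K
    return Q @ (np.exp(-lam * t) * (Q.T @ r0))

# r_t -> 0 as t grows: convergence to zero training loss is guaranteed.
```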
March 2025

Implicit Bias: How the Optimizer Secretly Regularizes Your Model

When gradient descent trains an overparameterized model, there are infinitely many solutions with zero training loss. GD quietly picks a specific one — the max-margin classifier for logistic regression, the minimum nuclear norm solution for matrix factorization. This post proves both results and explains the hidden regularizer inside your optimizer.

deep-learning, implicit-bias, optimization, generalization
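The two results in the post need real machinery, but the simplest instance of the phenomenon fits in a few lines: for underdetermined least squares, gradient descent started at zero converges to the minimum-Euclidean-norm interpolant. A sketch on synthetic data (this linear setting is an illustration, not one of the post's two cases):

```python
import numpy as np

# Underdetermined least squares has infinitely many zero-loss solutions, yet
# gradient descent from the origin picks out exactly one: the minimum-norm
# interpolant, i.e. the pseudoinverse solution A^+ b.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 6))          # 3 equations, 6 unknowns
b = rng.standard_normal(3)

x = np.zeros(6)
for _ in range(20000):
    x -= 0.02 * A.T @ (A @ x - b)        # gradient of (1/2)||Ax - b||^2

x_min_norm = np.linalg.pinv(A) @ b       # the minimum-norm solution
```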
March 2025

Double Descent and the Failure of Classical Generalization Theory

Classical statistics says bigger models overfit. Modern deep learning violates this completely. The double descent curve is the empirical fact that broke the old theory — this post derives the bias-variance decomposition, analyzes minimum-norm interpolation, and explains why Rademacher complexity fails to explain deep learning.

deep-learning, generalization, bias-variance, rademacher
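A quick numerical look at the mechanism behind the second descent (a sketch with random features, not the post's analysis): at the interpolation threshold p = n the minimum-norm interpolant's norm blows up, and it shrinks again as p grows well past n.

```python
import numpy as np

# Minimum-norm interpolation with random features: for p >= n the pseudoinverse
# solution fits the data exactly; its norm peaks near p = n and falls as p grows.
rng = np.random.default_rng(1)
n = 20                                     # fixed sample size
norms = {}
for p in [20, 40, 400]:                    # at, and well past, the threshold
    X = rng.standard_normal((n, p)) / np.sqrt(p)
    y = rng.standard_normal(n)
    w = np.linalg.pinv(X) @ y              # minimum-norm interpolant
    assert np.allclose(X @ w, y, atol=1e-6)
    norms[p] = np.linalg.norm(w)
# Typically norms[20] >> norms[400]: more parameters, smaller interpolant.
```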
March 2025

Subgradients and the Lasso: When Functions Have Kinks

The gradient fails at kinks. The subgradient is its correct generalization — a set-valued object that captures all valid linearizations of a convex function at non-smooth points. This post derives the subdifferential calculus, works out the KKT conditions for Lasso, and proves the O(1/√k) lower bound for non-smooth optimization.

optimization, subgradients, lasso, non-smooth, convexity
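The subdifferential optimality condition for the scalar Lasso solves in closed form to the soft-thresholding operator. A sketch that checks the formula against brute force on a grid:

```python
import numpy as np

# Scalar Lasso: min_x (1/2)(x - z)^2 + lam * |x|. Optimality demands
# 0 in (x - z) + lam * d|x|, which solves to soft-thresholding.
def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z, lam = 1.3, 0.5
x_closed = soft_threshold(z, lam)         # = sign(z) * (|z| - lam) = 0.8

# Brute-force check on a fine grid.
xs = np.linspace(-3.0, 3.0, 600001)
x_grid = xs[np.argmin(0.5 * (xs - z) ** 2 + lam * np.abs(xs))]
```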
March 2025

Frank-Wolfe: Optimization Without Projections

When projecting onto the constraint set is expensive, Frank-Wolfe replaces the projection with a linear minimization — far cheaper and often sufficient. This post derives the algorithm, proves the O(1/k) convergence rate, and shows why it's the right tool for large-scale matrix problems.

optimization, frank-wolfe, constrained, matrix-completion
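On the probability simplex the linear minimization oracle is trivial — pick the coordinate with the smallest gradient entry — which is the whole appeal. A sketch on a toy quadratic:

```python
import numpy as np

def frank_wolfe(grad, x0, steps=5000):
    """Frank-Wolfe over the probability simplex with the standard 2/(k+2) step."""
    x = x0.copy()
    for k in range(steps):
        s = np.zeros_like(x)
        s[np.argmin(grad(x))] = 1.0      # LMO: best vertex of the simplex
        gamma = 2.0 / (k + 2.0)          # step schedule behind the O(1/k) rate
        x = (1 - gamma) * x + gamma * s  # convex combination: always feasible
    return x

# min ||x - c||^2 over the simplex; c lies inside it, so the optimum is c.
c = np.array([0.2, 0.5, 0.3])
x = frank_wolfe(lambda x: 2.0 * (x - c), np.array([1.0, 0.0, 0.0]))
```

No projection ever happens: feasibility is maintained for free by the convex combination, which is exactly why the method suits constraint sets (like the nuclear-norm ball) whose projection is expensive but whose LMO is cheap.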
March 2025

Non-Smooth Optimization: Proximal Operators and FISTA

The best statistical models — Lasso, group sparsity, total variation — have non-differentiable objectives. Proximal operators are the right generalization of gradient steps for this world, and FISTA recovers the optimal O(1/k²) rate even without smoothness.

optimization, proximal, fista, lasso, non-smooth
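A compact FISTA sketch for the Lasso, assuming nothing beyond NumPy: the proximal step is soft-thresholding, and the t-sequence supplies the momentum.

```python
import numpy as np

def fista(A, b, lam, steps=2000):
    """FISTA for min (1/2)||Ax - b||^2 + lam * ||x||_1."""
    L = np.linalg.norm(A, 2) ** 2                  # gradient Lipschitz constant
    soft = lambda z, t: np.sign(z) * np.maximum(np.abs(z) - t, 0.0)
    x = np.zeros(A.shape[1]); y = x.copy(); t = 1.0
    for _ in range(steps):
        x_new = soft(y - A.T @ (A @ y - b) / L, lam / L)  # proximal gradient step
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)     # momentum extrapolation
        x, t = x_new, t_new
    return x

rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 20)), rng.standard_normal(50)
x = fista(A, b, lam=5.0)
# At the optimum, x is a fixed point of the proximal gradient map.
```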
March 2025

Variance Reduction: Getting Linear Convergence for Free with SVRG

SGD is cheap but noisy; full GD is exact but expensive. SVRG achieves the best of both — linear convergence at near-SGD cost — by using a snapshot gradient to correct the stochastic noise. This post derives the variance bound and convergence theorem from scratch.

optimization, variance-reduction, svrg, convergence
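A minimal SVRG sketch on a least-squares toy problem (the setup and names here are illustrative): each inner step uses the stochastic gradient minus its value at the snapshot, plus the snapshot's full gradient — unbiased, with variance that vanishes as both points approach the optimum.

```python
import numpy as np

def svrg(grad_i, full_grad, n, x0, lr=0.01, epochs=60, m=1000, seed=0):
    """SVRG: snapshot-corrected stochastic gradient steps."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(epochs):
        snap, mu = x.copy(), full_grad(x)   # snapshot and its full gradient
        for _ in range(m):
            i = rng.integers(n)
            # unbiased estimate; its variance -> 0 as x, snap -> optimum
            x = x - lr * (grad_i(i, x) - grad_i(i, snap) + mu)
    return x

# f(x) = (1/2n)||Ax - b||^2, split into per-row components f_i.
rng = np.random.default_rng(1)
n, d = 40, 5
A, b = rng.standard_normal((n, d)), rng.standard_normal(n)
grad_i = lambda i, x: A[i] * (A[i] @ x - b[i])       # gradient of f_i
full_grad = lambda x: A.T @ (A @ x - b) / n
x = svrg(grad_i, full_grad, n, np.zeros(d))
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
```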
March 2025

Stochastic Gradient Descent: Speed, Noise, and the Learning Rate Dilemma

When your dataset has a billion points, computing a full gradient is impossible. SGD trades exactness for speed — this post derives exactly what that trade-off costs, why variance is the fundamental obstacle, and what iterate averaging buys you.

optimization, sgd, stochastic, convergence
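The learning-rate dilemma in miniature (a sketch, not from the post): constant-step SGD on a toy mean-estimation problem hovers in a noise ball around the optimum, and averaging the tail of the trajectory washes the noise out.

```python
import numpy as np

# min_x (1/2n) sum_i (x - a_i)^2 has optimum x* = mean(a). Constant-step SGD
# reaches a noise ball of radius ~ sqrt(lr * Var) and then stops improving;
# Polyak-Ruppert averaging of the iterates recovers x* far more accurately.
rng = np.random.default_rng(0)
a = rng.standard_normal(1000) + 3.0
x, lr, trail = 0.0, 0.1, []
for _ in range(5000):
    i = rng.integers(a.size)
    x -= lr * (x - a[i])               # stochastic gradient of (1/2)(x - a_i)^2
    trail.append(x)
x_avg = np.mean(trail[2500:])          # average the tail of the trajectory
```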
March 2025

Nesterov's Acceleration: How to Beat Gradient Descent

Standard gradient descent is provably suboptimal — there is a gap between what GD achieves and what any first-order method could theoretically achieve. Nesterov's 1983 trick closes that gap entirely, matching the O(1/k²) lower bound with almost no extra work per iteration.

optimization, acceleration, nesterov, convergence
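The gap is easy to see numerically. On an ill-conditioned quadratic, plain GD with step 1/L contracts at roughly (1 − 1/κ) per step, while Nesterov's momentum (here the constant-β variant for strongly convex problems) contracts at roughly (1 − 1/√κ):

```python
import numpy as np

# f(x) = (1/2) x^T H x with condition number kappa = 100.
H = np.diag([1.0, 100.0]); L, kappa = 100.0, 100.0
beta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)   # momentum coefficient

x_gd = np.array([1.0, 1.0])
x, x_prev = np.array([1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(200):
    x_gd = x_gd - H @ x_gd / L           # plain gradient descent
    y = x + beta * (x - x_prev)          # Nesterov look-ahead point
    x_prev, x = x, y - H @ y / L         # gradient step taken at y

err_gd, err_nes = np.linalg.norm(x_gd), np.linalg.norm(x)
# After the same 200 steps, err_nes sits orders of magnitude below err_gd.
```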
March 2025

Gradient Descent: From Intuition to Convergence Guarantees

Gradient Descent is simple to state and surprisingly subtle to analyze. This post builds the full convergence theory from scratch — the descent lemma, the O(1/k) rate for convex functions, and the linear rate for strongly convex ones — and shows why the condition number is the true bottleneck.

optimization, gradient-descent, convergence, convexity
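The linear rate and the condition-number bottleneck in a few lines: on a quadratic, the slowest eigendirection contracts by exactly (1 − μ/L) per step, so κ = L/μ sets the pace.

```python
import numpy as np

# f(x) = (1/2) x^T H x with mu = 1, L = 50, so kappa = 50.
mu, L = 1.0, 50.0
H = np.diag([mu, L])
x = np.array([1.0, 1.0])
for _ in range(100):
    x = x - H @ x / L              # gradient step with the standard size 1/L
# The L-direction dies in one step; the mu-direction decays as (1 - mu/L)^k.
predicted = (1.0 - mu / L) ** 100
```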
March 2025

The Geometry of Optimization: Norms, Convexity, and Why Shape Matters

Before any algorithm runs, the geometry of the space you're optimizing over determines everything. This post builds that geometric intuition from the ground up — norms, dual norms, convexity, smoothness, and the condition number that will haunt every algorithm that follows.

optimization, convexity, linear-algebra, foundations
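One concrete instance of the geometry at work: the dual of the ℓ1 norm is the ℓ∞ norm, because a linear functional over the ℓ1 ball is maximized at one of its vertices.

```python
import numpy as np

# Dual norm: ||g||_* = max { <g, x> : ||x||_1 <= 1 }. The l1 ball's extreme
# points are the signed basis vectors, so the maximum is attained at one of
# them, giving ||g||_* = max_i |g_i| = ||g||_inf.
rng = np.random.default_rng(0)
g = rng.standard_normal(5)

vertices = np.vstack([np.eye(5), -np.eye(5)])   # extreme points of the l1 ball
dual_via_vertices = (vertices @ g).max()
dual_exact = np.abs(g).max()                    # the l-infinity norm
```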