
Mathematics of Data

The Neural Tangent Kernel: Why Infinitely Wide Networks Are Convex

March 2025 · deep-learning · ntk · kernel-methods · optimization

Neural network training is, in general, a non-convex optimization problem. The loss $L(\theta) = \frac{1}{n}\sum_i \ell(f(x_i;\theta), y_i)$ is non-convex in $\theta$ because $f$ is a composition of nonlinear functions. Non-convex landscapes have local minima, saddle points, and flat regions — gradient descent could get stuck anywhere.

Yet in practice, gradient descent on neural networks reliably finds solutions with near-zero training loss and good generalization. The empirical success far outstrips the theoretical guarantees.

The Neural Tangent Kernel (NTK), introduced by Jacot, Gabriel, and Hongler (2018), provides the most rigorous explanation we have. In the limit of infinite network width, the training dynamics simplify dramatically: the loss landscape becomes convex, the kernel remains constant throughout training, and convergence is guaranteed. This post derives that result and confronts its limitations.


1. The Non-Convexity Problem

Consider a two-layer network:

$$f(x; W_1, W_2) = W_2 \sigma(W_1 x)$$

The loss $L(W_1, W_2) = \|f(x; W_1, W_2) - y\|^2$ is non-convex in $(W_1, W_2)$ jointly: for a positively homogeneous activation such as ReLU, the rescaling $W_2 \to 2W_2$, $W_1 \to W_1/2$ leaves $f$ (and hence the loss) unchanged, so minima come in continuous families, and the segment between two equivalent minima generally passes through points of strictly higher loss. Non-convexity means we cannot guarantee that gradient descent finds a global minimum.
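The rescaling symmetry gives a concrete witness of non-convexity. Below is a minimal sketch (toy numbers of my choosing, ReLU activation): two parameter settings that compute the same function and both achieve zero loss, whose midpoint does not.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def f(x, W1, W2):
    # Two-layer network: f(x) = W2 @ relu(W1 @ x)
    return W2 @ relu(W1 @ x)

x = np.array([1.0])
W1 = np.array([[1.0], [-1.0]])
W2 = np.array([[1.0, 1.0]])
y = f(x, W1, W2)                    # label chosen so theta_a is a global minimum

theta_a = (W1, W2)                          # loss 0
theta_b = (W1 / 2, W2 * 2)                  # ReLU is positively homogeneous: same function, loss 0
theta_m = ((W1 + W1 / 2) / 2, (W2 + W2 * 2) / 2)   # midpoint of the two minima

loss = lambda th: float(np.sum((f(x, *th) - y) ** 2))
print(loss(theta_a), loss(theta_b), loss(theta_m))
# → 0.0 0.0 0.015625
# Convexity would force loss(midpoint) <= max of the endpoint losses (both 0) — it isn't.
```

If the loss were convex, any point on the segment between two global minima would also be a global minimum; the strictly positive midpoint loss rules that out.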

Why does it work anyway? The answer requires understanding what happens as networks become very wide.


2. The Linearization Intuition

Suppose the weights barely move during training — they stay close to their initialization $\theta_0$. Then we can Taylor-expand the network output:

$$f(x; \theta) \approx f(x; \theta_0) + \nabla_\theta f(x; \theta_0)^T (\theta - \theta_0)$$

In this approximation, $f$ is linear in $\theta - \theta_0$. Linear models (with a convex loss such as the squared error) have convex loss landscapes. If weights barely move, the optimization problem is essentially convex.
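The quality of this first-order expansion is easy to probe numerically. A sketch with a tiny two-layer tanh network (all names and sizes here are illustrative): the gap between the network and its linearization shrinks roughly quadratically in the perturbation size, as a Taylor expansion should.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny two-layer network f(x; W1, w2) = w2 . tanh(W1 x), parameters flattened into theta.
d, N = 3, 8
x = rng.normal(size=d)

def unpack(theta):
    W1 = theta[:N * d].reshape(N, d)
    w2 = theta[N * d:]
    return W1, w2

def f(theta):
    W1, w2 = unpack(theta)
    return float(w2 @ np.tanh(W1 @ x))

def grad_f(theta):
    # Analytic gradient of f with respect to all parameters.
    W1, w2 = unpack(theta)
    h = np.tanh(W1 @ x)
    dW1 = np.outer(w2 * (1 - h ** 2), x)    # chain rule through tanh
    return np.concatenate([dW1.ravel(), h])

theta0 = rng.normal(size=N * d + N)
g = grad_f(theta0)
direction = rng.normal(size=theta0.shape)
direction /= np.linalg.norm(direction)

for delta in [1e-1, 1e-2, 1e-3]:
    lin = f(theta0) + delta * g @ direction
    err = abs(f(theta0 + delta * direction) - lin)
    print(delta, err)   # linearization error shrinks roughly like delta**2
```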

When do weights barely move? When the network is very wide. In a wide network with $N$ neurons per layer, under the NTK scaling each layer's output is multiplied by $1/\sqrt{N}$, so the output gradient with respect to any individual weight is $O(1/\sqrt{N})$, and each weight moves by $O(\eta/\sqrt{N})$ per gradient step. As $N \to \infty$, each weight moves vanishingly little relative to its $O(1)$ initialization — the linearization becomes exact.

This is the lazy training or kernel regime: the network computes a fixed feature map $\nabla_\theta f(x; \theta_0)$ and learns a linear model on top.


3. The Neural Tangent Kernel

Definition. The Neural Tangent Kernel is:

$$K(x, x') = \nabla_\theta f(x; \theta_0)^T \nabla_\theta f(x'; \theta_0) \in \mathbb{R}$$

This is an inner product of the gradient vectors (the Jacobians of the network output with respect to parameters), evaluated at two inputs $x$ and $x'$.

Interpretation. $K(x,x')$ measures how similarly inputs $x$ and $x'$ respond to weight perturbations. If changing weights has similar effects on $f(x;\theta)$ and $f(x';\theta)$, then $K(x,x')$ is large. The NTK is a kernel function — symmetric, positive semi-definite — and defines a reproducing kernel Hilbert space (RKHS).
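Computing the empirical NTK is mechanical: stack the parameter-gradients into a Jacobian and form the Gram matrix. A minimal numpy sketch (a two-layer tanh network with the $1/\sqrt{N}$ output scaling; the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, n = 2, 64, 5                      # input dim, width, number of inputs
X = rng.normal(size=(n, d))
W1 = rng.normal(size=(N, d))
a = rng.normal(size=N)

def jacobian(x):
    # Gradient of f(x) = a . tanh(W1 x) / sqrt(N) with respect to all parameters.
    h = np.tanh(W1 @ x)
    dW1 = np.outer(a * (1 - h ** 2), x) / np.sqrt(N)
    da = h / np.sqrt(N)
    return np.concatenate([dW1.ravel(), da])

J = np.stack([jacobian(x) for x in X])  # n x (number of parameters)
K = J @ J.T                              # empirical NTK matrix

# Sanity checks: a Gram matrix is symmetric and positive semi-definite.
print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min() >= -1e-10)
# → True True
```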

The Infinite-Width Limit

As network width $N \to \infty$, two remarkable things happen:

1. $K$ becomes deterministic. At initialization, $\theta_0$ is random. The NTK is a sum of $O(N)$ terms, each $O(1/N)$, so by the law of large numbers it concentrates around its expectation. In the limit, $K(x,x')$ is a fixed deterministic kernel that depends only on the architecture, not on the random initialization.

2. $K$ stays constant during training. Because each individual weight moves by only $O(1/\sqrt{N})$ over the course of training, the feature map $\nabla_\theta f(x;\theta)$ barely changes, and the kernel's deviation from its initial value vanishes as $N \to \infty$ (at rate $O(1/\sqrt{N})$ in standard analyses). The kernel freezes at its initial value $K_0$: training does not change the feature map — only the linear combination of features.

These two properties make the infinite-width limit analytically tractable.
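Property 2 can be observed directly at finite width. The sketch below (a hypothetical two-layer tanh setup of my own, not from the original papers) trains networks of two widths on the same tiny dataset and measures the relative change of the empirical NTK; the drift should shrink as width grows.

```python
import numpy as np

def ntk_drift(N, steps=200, lr=0.05, seed=0):
    """Relative Frobenius change of the empirical NTK of f(x) = a . tanh(Wx)/sqrt(N)
    after fitting n = 4 random points by gradient descent."""
    rng = np.random.default_rng(seed)
    d, n = 2, 4
    X = rng.normal(size=(n, d))
    y = rng.normal(size=n)
    W = rng.normal(size=(N, d))
    a = rng.normal(size=N)

    def jac(W, a):
        H = np.tanh(X @ W.T)                            # n x N hidden activations
        rows = []
        for i in range(n):
            dW = np.outer(a * (1 - H[i] ** 2), X[i])    # gradient w.r.t. W
            rows.append(np.concatenate([dW.ravel(), H[i]]) / np.sqrt(N))
        return np.stack(rows)                           # n x (#parameters)

    J0 = jac(W, a)
    K_init = J0 @ J0.T
    for _ in range(steps):
        J = jac(W, a)
        out = np.tanh(X @ W.T) @ a / np.sqrt(N)
        g = J.T @ (out - y)                             # gradient of 0.5 * ||out - y||^2
        W -= lr * g[:N * d].reshape(N, d)
        a -= lr * g[N * d:]
    Jf = jac(W, a)
    K_final = Jf @ Jf.T
    return np.linalg.norm(K_final - K_init) / np.linalg.norm(K_init)

print(ntk_drift(50), ntk_drift(5000))   # kernel drift shrinks as width grows
```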


4. Gradient Flow Analysis: The Linear ODE

Let $\hat{y}(t) \in \mathbb{R}^n$ be the vector of network outputs on the training set at time $t$, and $y$ the vector of true labels. Define the squared loss:

$$L(\theta) = \frac{1}{2}\|\hat{y}(t) - y\|^2$$

Under gradient flow $\dot{\theta} = -\nabla_\theta L(\theta)$, how does $\hat{y}(t)$ evolve?

By the chain rule:

$$\dot{\hat{y}}_i = \nabla_\theta f(x_i; \theta)^T \dot{\theta} = -\nabla_\theta f(x_i;\theta)^T \nabla_\theta L(\theta)$$

The gradient of $L$ with respect to $\theta$ is:

$$\nabla_\theta L = \sum_{j=1}^n (\hat{y}_j - y_j) \nabla_\theta f(x_j; \theta)$$

Substituting:

$$\dot{\hat{y}}_i = -\sum_{j=1}^n (\hat{y}_j - y_j) \underbrace{\nabla_\theta f(x_i;\theta)^T \nabla_\theta f(x_j;\theta)}_{K(x_i, x_j; \theta)}$$

In matrix form, defining the empirical NTK matrix $\mathbf{K}(t) \in \mathbb{R}^{n \times n}$ with $[\mathbf{K}]_{ij} = K(x_i, x_j; \theta(t))$:

$$\dot{\hat{y}}(t) = -\mathbf{K}(t)(\hat{y}(t) - y)$$

In the infinite-width limit, $\mathbf{K}(t) = \mathbf{K}_0$ is constant. This becomes a linear ODE:

$$\dot{\hat{y}}(t) = -\mathbf{K}_0 (\hat{y}(t) - y)$$

Solving the ODE

This is a linear system with constant coefficients. The solution is:

$$\hat{y}(t) - y = e^{-\mathbf{K}_0 t}(\hat{y}(0) - y)$$

where $e^{-\mathbf{K}_0 t}$ is the matrix exponential. In the eigenbasis of $\mathbf{K}_0$ (which is symmetric PSD, so diagonalizable with non-negative eigenvalues $\lambda_1 \geq \cdots \geq \lambda_n \geq 0$):

$$[\hat{y}(t) - y]_k = e^{-\lambda_k t} [\hat{y}(0) - y]_k$$

Each component of the error decays exponentially at rate $\lambda_k$.

Key result: If $\mathbf{K}_0 \succ 0$ (positive definite — all eigenvalues strictly positive), then $\lambda_{\min}(\mathbf{K}_0) > 0$ and all components decay to zero:

$$\|\hat{y}(t) - y\|^2 \leq e^{-2\lambda_{\min}(\mathbf{K}_0) t} \|\hat{y}(0) - y\|^2 \to 0$$

Infinite-width networks achieve zero training loss under gradient flow, at a linear rate determined by the smallest eigenvalue of the NTK matrix.
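The closed-form solution can be cross-checked against a direct numerical integration of the ODE. A sketch, with a random PSD matrix standing in for $\mathbf{K}_0$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
A = rng.normal(size=(n, n))
K0 = A @ A.T / n                        # random symmetric PSD stand-in for the NTK matrix
y = rng.normal(size=n)                  # labels
y_hat0 = rng.normal(size=n)             # outputs at initialization

# Closed form: y_hat(t) - y = exp(-K0 t)(y_hat(0) - y), via eigendecomposition.
lam, U = np.linalg.eigh(K0)
def closed_form(t):
    return y + U @ (np.exp(-lam * t) * (U.T @ (y_hat0 - y)))

# Forward-Euler integration of  d(y_hat)/dt = -K0 (y_hat - y).
t_end, dt = 5.0, 1e-4
y_hat = y_hat0.copy()
for _ in range(int(t_end / dt)):
    y_hat -= dt * K0 @ (y_hat - y)

print(np.linalg.norm(y_hat - closed_form(t_end)))   # small discretization error
```

The two trajectories agree up to the Euler discretization error, and both contract toward the labels $y$ at rates set by the eigenvalues of $\mathbf{K}_0$.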

This is the main theorem of NTK theory. The non-convex neural network training problem reduces, in the infinite-width limit, to fitting a kernel regression model with kernel $K_0$ — a convex problem with a known unique solution.


5. The NTK Spectrum and Generalization

The eigenvalues of $\mathbf{K}_0$ carry information beyond just convergence speed.

Convergence: Directions with large $\lambda_k$ converge fast (within $O(1/\lambda_k)$ time). Directions with small $\lambda_k$ converge slowly — they are effectively not learned within a finite training budget.

Generalization: After training for time $T$, the network output is:

$$\hat{y}(T) = y - e^{-\mathbf{K}_0 T}(y - \hat{y}(0)) \approx (\mathbb{I} - e^{-\mathbf{K}_0 T})\, y$$

where the approximation takes the output at initialization to be negligible, $\hat{y}(0) \approx 0$.

In the eigenbasis, the $k$-th component of the learned function is $(1 - e^{-\lambda_k T}) y_k$. Large eigenvalue directions are learned faithfully; small eigenvalue directions are suppressed. The NTK performs spectral filtering: it implicitly regularizes by not learning directions where $\mathbf{K}_0$ has small eigenvalues.
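The filter factors themselves are one line of numpy. With hypothetical eigenvalues spanning several orders of magnitude:

```python
import numpy as np

# Spectral filtering: after training time T, the component of the target along
# the k-th NTK eigenvector is learned up to the factor 1 - exp(-lambda_k * T).
lambdas = np.array([10.0, 1.0, 0.1, 0.001])   # hypothetical NTK eigenvalues
T = 10.0
learned_fraction = 1 - np.exp(-lambdas * T)
for lam, frac in zip(lambdas, learned_fraction):
    print(f"lambda = {lam:6.3f} -> fraction learned = {frac:.4f}")
# Large-eigenvalue directions are fit almost exactly; the smallest is barely touched.
```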

This connects NTK to classical kernel regression. The RKHS norm of the NTK solution bounds the generalization error — the same machinery as in kernel SVM or Gaussian process regression.

$\mu$P (Maximal Update Parameterization). The lazy limit is not the only consistent way to scale a network with width. The $\mu$P parameterization (Yang & Hu, 2022) chooses initialization variances and per-layer learning rates so that every layer continues to learn features — its updates have an $O(1)$ effect on the output — as width grows, rather than freezing into the kernel regime. This enables hyperparameter transfer: the optimal learning rate found on a small model transfers directly to a large model, saving enormous compute in practice.


6. Limitations: What NTK Theory Gets Wrong

The NTK theory is mathematically beautiful and provides our most rigorous framework for understanding neural network training. But it is a theory of a regime neural networks rarely operate in.

Lazy training vs. feature learning. In the NTK regime, weights barely move — the feature map $\nabla_\theta f(x;\theta)$ is fixed at initialization. Real networks learn features: early layers develop edge detectors, mid layers develop texture detectors, and so on. This is representation learning, and it requires weights to move significantly. NTK theory misses it entirely.

Depth. In the infinite-width limit, depth enters the NTK only through a fixed layer-wise recursion: the model remains regression with a frozen kernel, however deep the network. But depth clearly matters in practice — a 1-layer network with width $10^6$ performs much worse than a 100-layer network on natural images, and NTK regression does not reproduce that gap.

Finite width. For finite-width networks, $\mathbf{K}(t)$ changes during training. The linear ODE becomes nonlinear, and the nice closed-form solution breaks down. Corrections are $O(1/N)$ but can compound over long training.

Practical learning rates. NTK theory holds for infinitesimally small learning rates (continuous gradient flow). Practical training uses large discrete steps, which can push networks out of the lazy regime.

The NTK is best understood as a rigorous baseline: it describes the simplest possible regime of neural network training, and real networks deviate from it in exactly the ways that make them powerful. Understanding those deviations is the frontier of deep learning theory.


Part of the Mathematics of Data series — mathematical notes on EE-556 at EPFL.