Deep double descent explained (¾)

15/06/2021

The role of inductive biases with two linear examples (linear regression with Gaussian noise & Random Fourier Features)

This series of blog posts was co-written with Marc Lafon. A PDF version is available here as well.


\global\def\norm#1{\lVert #1 \rVert} \global\def\1{\mathbb{I}} \global\def\N{\mathcal{N}} \global\def\R{\mathbb{R}} \global\def\E{\mathbb{E}} \global\def\argmin{\text{argmin}} \global\def\Ap{A_{\sim p}} \global\def\Aq{A_{\sim q}} \global\def\Xp{X_{\sim p}} \global\def\Xq{X_{\sim q}} \global\def\p#1{#1_{\sim p}} \global\def\q#1{#1_{\sim q}} \global\def\vp{v_{\sim p}} \global\def\vq{v_{\sim q}} \global\def\xp{x_{\sim p}} \global\def\xq{x_{\sim q}} \global\def\yp{y_{\sim p}} \global\def\yq{y_{\sim q}} \global\def\wp{w_{\sim p}} \global\def\wq{w_{\sim q}}

In this post, we consider two settings where double descent can be empirically observed and mathematically justified, in order to give the reader some intuition on the role of inductive biases. The next post concludes with some references to recent related works studying optimization in the over-parameterized regime, or linking double descent to a physical phenomenon known as jamming.

Fully understanding the mechanisms behind this phenomenon in deep learning remains an open question, but inductive biases (introduced in the previous post) seem to play a key role.

In the over-parameterized regime, empirical risk minimizers are able to interpolate the data. Intuitively:


Fitting degree-$d$ Legendre polynomials (orange curve) to $n=20$ noisy samples (red dots), from a polynomial of degree 3 (blue curve). Gradient descent is used to minimize the squared error, which leads to the smallest-norm solution (considering the norm of the vector of coefficients). Taken from this blog post.

Left: $d=1$, middle: $d=20$, right: $d=1000$.
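To make this intuition concrete, here is a minimal sketch of a similar experiment (not the exact code behind the figure): we fit a degree-$d$ Legendre expansion to $n=20$ noisy samples of a cubic polynomial and take the minimum-norm least-squares solution, which is also the solution gradient descent converges to from a zero initialization. The ground-truth polynomial, noise level and sample locations are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, noise = 20, 0.3                          # sample size and noise level (assumed values)
f = lambda x: 2 * x**3 - x                  # arbitrary ground-truth polynomial of degree 3
x_train = rng.uniform(-1, 1, n)
y_train = f(x_train) + noise * rng.standard_normal(n)

def fit_legendre(d):
    """Minimum-norm least-squares coefficients of a degree-d Legendre expansion."""
    Phi = np.polynomial.legendre.legvander(x_train, d)    # n x (d+1) feature matrix
    coef, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)  # min-norm solution when d+1 > n
    return coef

x_test = np.linspace(-1, 1, 500)
for d in (1, 20, 1000):
    pred = np.polynomial.legendre.legval(x_test, fit_legendre(d))
    print(f"d={d}: test MSE vs ground truth = {np.mean((pred - f(x_test))**2):.4f}")
```

As in the figure, the test error typically worsens around $d \approx n$ and improves again for very large $d$, where the minimum-norm fit becomes smoother.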

Linear Regression with Gaussian Noise

In this section we consider the family of classes $(\mathcal{H}_p)_{p\in\left[ 1,d\right]}$ of linear functions $h:\R^d\rightarrow \R$ whose weight vector has exactly $p$ non-zero components ($1\leq p\leq d$). We will study the generalization error obtained with ERM when increasing $p$ (which is regarded as the class complexity).

[Figure: double_descent_gaussian_model]

Plot of the risk $\E[(y-x^T\hat{w})^2]$ as a function of $p$, under the random selection model of the subset of $p$ features. Here $\norm{w}^2=1$, $\sigma^2=1/25$, $d=100$ and $n=40$. Taken from Belkin et al., 2020 1.

The class of predictors $\mathcal{H}_p$ is defined as follows:

Definition 17. For $p \in \left[ 1,d\right]$, $\mathcal{H}_p$ is the set of functions $h:\R^d\rightarrow \R$ of the form:

$$h(u)=u^Tw,\quad \text{for }u \in \R^d$$

with $w \in \R^d$ having exactly $p$ non-zero elements.

Let $(X, \mathbf{\epsilon})\in \R^d\times\R$ be independent random variables with $X \sim \N(0,I)$ and $\mathbf{\epsilon} \sim \N(0,1)$. Let $h^* \in \mathcal{H}_d$ and define the random variable

$$Y=h^*(X)+\sigma \mathbf{\epsilon}=X^Tw+\sigma \mathbf{\epsilon}$$

with $\sigma>0$ and $w \in \R^d$ defined by $h^*$. We consider $(X_i, Y_i)_{i=1}^n$, $n$ i.i.d. copies of $(X,Y)$. We are interested in the following problem:

$$\min_{h\in \mathcal{H}_d}\E[(h(X) - Y)^2] \tag{5}$$

Let $\mathbf{X}\in \R^{n\times d}$ be the random matrix whose rows are the $X_i^T$ and $\mathbf{Y} =(Y_1,\dots, Y_n)^T \in \R^n$. In the following we will assume that $\mathbf{X}$ has full row rank and that $n \ll d$. Applying empirical risk minimization we can write:

$$\min_{z\in \R^d} \frac{1}{2}\norm{\mathbf{X} z - \mathbf{Y}}^2 \tag{6}$$

Definition 18 (Random p-submatrix / p-subvector). For any $(p,q) \in \left[ 1, d\right]^2$ such that $p+q=d$, any matrix $\mathbf{A} \in \R^{n\times d}$ and any column vector $v\in \R^d$, we will denote by $\mathbf{\Ap}$ (resp. $\vp$) the sub-matrix (resp. sub-vector) obtained by randomly selecting a subset of $p$ columns (resp. elements), and by $\mathbf{\Aq} \in \R^{n\times q}$ and $\vq\in \R^{q}$ their discarded counterparts. (This notation for random p-submatrices and p-subvectors is not standard and is introduced here for clarity.)

In order to solve (6) we will search for a solution in $\mathcal{H}_p \subset \mathcal{H}_d$ and increase $p$ progressively, which is a form of structural empirical risk minimization as $\mathcal{H}_p \subset \mathcal{H}_{p+1}$ for any $p<d$.

Let $p \in \left[ 1, d\right]$; we are then interested in the following sub-problem:

$$\min_{z\in \R^p} \frac{1}{2}\norm{\mathbf{\Xp} z - \mathbf{Y}}^2$$

We have seen in Proposition 12 of the previous post that the least-norm solution is $\p{\hat w}=\mathbf{\Xp}^+\mathbf{Y}$. If we define $\q{\hat w} := 0$, then we will consider as a solution of the global problem (5) $\hat w:=\phi_p(\p{\hat w},\q{\hat w})$, where $\phi_p: \R^p\times\R^{q}\rightarrow \R^d$ is the map rearranging the terms of $\p{\hat w}$ and $\q{\hat w}$ to match the initial indices of $w$.
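As a quick illustration, here is a minimal sketch of this estimator in numpy (the function name and argument names are ours, not from the original paper): select $p$ feature indices at random, take the least-norm solution on those columns, and put zeros everywhere else.

```python
import numpy as np

def fit_subset(X, Y, p, rng):
    """Return w_hat in R^d: least-norm fit on p random columns of X, zeros elsewhere."""
    n, d = X.shape
    T = rng.choice(d, size=p, replace=False)    # random subset of p feature indices
    w_hat = np.zeros(d)
    w_hat[T] = np.linalg.pinv(X[:, T]) @ Y      # w_hat_p = pinv(X_p) @ Y
    return w_hat
```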

Theorem 19. Let $(x, \epsilon)\in \R^d\times\R$ be independent random variables with $x \sim \N(0,I)$ and $\epsilon \sim \N(0,1)$, and let $w \in \R^d$. We assume that the response variable $y$ is defined as $y=x^Tw +\sigma \epsilon$ with $\sigma > 0$, and that $\mathbf{X}$ and $\mathbf{Y}$ are built from $n$ i.i.d. copies of $(x,y)$ as above. Let $(p,q) \in \left[ 1, d\right]^2$ be such that $p+q=d$, and let $\mathbf{\Xp}$ be the randomly selected $p$-column sub-matrix of $\mathbf{X}$. Define $\hat w:=\phi_p(\p{\hat w},\q{\hat w})$ with $\p{\hat w}=\mathbf{\Xp}^+\mathbf{Y}$ and $\q{\hat w} = 0$.

The risk of the predictor associated with $\hat w$ is:

$$\E[(y-x^T\hat w)^2] = \begin{cases} (\norm{\wq}^2+\sigma^2)\left(1+\frac{p}{n-p-1}\right) &\quad\text{if } p\leq n-2\\ +\infty &\quad\text{if }n-1 \leq p\leq n+1\\ \norm{\wp}^2\left(1-\frac{n}{p}\right) + (\norm{\wq}^2+\sigma^2)\left(1+\frac{n}{p-n-1}\right) &\quad\text{if }p\geq n+2\end{cases}$$

Proof. Because $x$ has zero mean and identity covariance, and because $x$ and $\epsilon$ are independent:

$$\begin{aligned} \E\left[(y-x^T\hat w)^2\right] &= \E\left[(x^T(w-\hat w) + \sigma \epsilon)^2\right] \\ &= \sigma^2 + \E\left[\norm{w-\hat w}^2\right] \\ &= \sigma^2 + \E\left[\norm{\wp-\p{\hat w}}^2\right]+\E\left[\norm{\wq-\q{\hat w}}^2\right] \end{aligned}$$

and because $\q{\hat w}=0$, we have:

$$\E\left[(y-x^T\hat w)^2\right] = \sigma^2 + \E\left[\norm{\wp-\p{\hat w}}^2\right]+\norm{\wq}^2$$

The classical regime ($p\leq n$) has been treated in Breiman & Freedman, 1983 2. We will therefore only consider the interpolating regime ($p \geq n$). Recall that $\mathbf{X}$ is assumed to have rank $n$. Let $\eta = \mathbf{Y} - \mathbf{\Xp} \wp$. We can write:

$$\begin{aligned} \wp-\p{\hat w} &= \wp - \mathbf{\Xp}^+\mathbf{Y} \\ &= \wp - \mathbf{\Xp}^+(\eta + \mathbf{\Xp} \wp) \\ &= (\mathbf{I}- \mathbf{\Xp}^+\mathbf{\Xp})\wp - \mathbf{\Xp}^+ \eta \end{aligned}$$

It is easy to show (left as an exercise) that $(\mathbf{I}- \mathbf{\Xp}^+\mathbf{\Xp})$ is the matrix of the orthogonal projection onto $\text{Ker}(\mathbf{\Xp})$. Furthermore, $-\mathbf{\Xp}^+ \eta \in \text{Im}(\mathbf{\Xp}^+)=\text{Im}(\mathbf{\Xp}^T)$. Hence $(\mathbf{I}- \mathbf{\Xp}^+\mathbf{\Xp})\wp$ and $-\mathbf{\Xp}^+ \eta$ are orthogonal and the Pythagorean theorem gives:

$$\norm{\wp-\p{\hat w}}^2 = \norm{(\mathbf{I}- \mathbf{\Xp}^+\mathbf{\Xp})\wp}^2 + \norm{\mathbf{\Xp}^+ \eta}^2$$
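For the exercise above, one possible argument uses the Moore–Penrose identities $\mathbf{\Xp}\mathbf{\Xp}^+\mathbf{\Xp} = \mathbf{\Xp}$ and $(\mathbf{\Xp}^+\mathbf{\Xp})^T = \mathbf{\Xp}^+\mathbf{\Xp}$:

$$(\mathbf{I}- \mathbf{\Xp}^+\mathbf{\Xp})^2 = \mathbf{I}- 2\mathbf{\Xp}^+\mathbf{\Xp} + \mathbf{\Xp}^+(\mathbf{\Xp}\mathbf{\Xp}^+\mathbf{\Xp}) = \mathbf{I}- \mathbf{\Xp}^+\mathbf{\Xp}$$

so $\mathbf{I}- \mathbf{\Xp}^+\mathbf{\Xp}$ is idempotent and symmetric, i.e. an orthogonal projection; its image is contained in $\text{Ker}(\mathbf{\Xp})$ since $\mathbf{\Xp}(\mathbf{I}- \mathbf{\Xp}^+\mathbf{\Xp}) = 0$, and it fixes any $u \in \text{Ker}(\mathbf{\Xp})$ since $\mathbf{\Xp}^+\mathbf{\Xp} u = 0$.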

We now treat each term of the right-hand side of the Pythagorean identity separately. Since $\mathbf{\Xp}$ is Gaussian, $\text{Ker}(\mathbf{\Xp})$ is a uniformly random subspace of dimension $p-n$, so the first term has expectation $\norm{\wp}^2\left(1-\frac{n}{p}\right)$. Moreover $\eta = \mathbf{\Xq}\wq + \sigma\boldsymbol{\epsilon}$ (with $\boldsymbol{\epsilon}$ the vector of the $n$ noise terms) is independent of $\mathbf{\Xp}$, with zero mean and covariance $(\norm{\wq}^2+\sigma^2)\mathbf{I}_n$, so the second term has expectation $(\norm{\wq}^2+\sigma^2)\,\E\left[\text{tr}\left((\mathbf{\Xp}\mathbf{\Xp}^T)^{-1}\right)\right] = (\norm{\wq}^2+\sigma^2)\frac{n}{p-n-1}$, an inverse-Wishart expectation which is finite for $p\geq n+2$. Plugging these into the decomposition above gives the case $p\geq n+2$; we refer to Belkin et al., 2020 for the detailed computations. ◻

Corollary 1. Let $T$ be a uniformly random subset of $\left[ 1, d\right]$ of cardinality $p$. Under the setting of Theorem 19 and taking the expectation with respect to $T$, the risk of the predictor associated with $\hat w$ is:

$$\E[(Y-X^T\hat w)^2] = \begin{cases} \left(\left(1-\frac{p}{d}\right)\norm{w}^2+\sigma^2\right)\left(1+\frac{p}{n-p-1}\right) &\quad\text{if } p\leq n-2\\ \norm{w}^2\left(1-\frac{n}{d}\left(2- \frac{d-n-1}{p-n-1}\right)\right) +\sigma^2\left(1+\frac{n}{p-n-1}\right) &\quad\text{if }p\geq n+2\end{cases}$$

Proof. Since $T$ is a uniformly random subset of $\left[ 1, d\right]$ of cardinality $p$:

$$\E[\norm{\wp}^2] = \E\Big[\sum_{i\in T}w_i^2\Big]= \E\Big[\sum_{i=1}^{d}w_i^2 \1_{T}(i)\Big]=\sum_{i=1}^{d}w_i^2\, \E[\1_{T}(i)]=\sum_{i=1}^{d}w_i^2\, \mathbb{P}[i \in T]=\norm{w}^2 \frac{p}{d}$$

and, similarly:

$$\E[\norm{\wq}^2] =\norm{w}^2 \left(1-\frac{p}{d}\right)$$

Plugging into Theorem 19 ends the proof. ◻
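As a sanity check, the following sketch compares a Monte-Carlo estimate of the risk with the closed-form expression of Corollary 1, using the same parameter values as in the figure above ($d=100$, $n=40$, $\sigma^2=1/25$, $\norm{w}^2=1$); it inlines the random-subset estimator sketched earlier, and all names are ours.

```python
import numpy as np

d, n, sigma = 100, 40, 1 / 5                     # values from the figure above
rng = np.random.default_rng(0)
w = rng.standard_normal(d)
w /= np.linalg.norm(w)                           # ||w||^2 = 1

def corollary_risk(p):
    """Closed-form risk of Corollary 1 (diverges for n-1 <= p <= n+1)."""
    w2 = 1.0                                     # ||w||^2
    if p <= n - 2:
        return ((1 - p / d) * w2 + sigma**2) * (1 + p / (n - p - 1))
    if p >= n + 2:
        return w2 * (1 - (n / d) * (2 - (d - n - 1) / (p - n - 1))) \
               + sigma**2 * (1 + n / (p - n - 1))
    return float("inf")

def empirical_risk(p, n_trials=200):
    """Monte-Carlo estimate of E[(y - x^T w_hat)^2] over data, noise and feature subsets."""
    errs = []
    for _ in range(n_trials):
        X = rng.standard_normal((n, d))
        Y = X @ w + sigma * rng.standard_normal(n)
        T = rng.choice(d, size=p, replace=False)
        w_hat = np.zeros(d)
        w_hat[T] = np.linalg.pinv(X[:, T]) @ Y
        # since x ~ N(0, I): E[(y - x^T w_hat)^2 | w_hat] = sigma^2 + ||w - w_hat||^2
        errs.append(sigma**2 + np.sum((w - w_hat) ** 2))
    return np.mean(errs)

for p in (10, 30, 60, 100):
    print(f"p={p}: empirical {empirical_risk(p):.3f} vs closed form {corollary_risk(p):.3f}")
```

The empirical values peak near $p = n$ and drop again for large $p$, reproducing the double descent curve of the figure.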

Random Fourier Features

In this section we consider the RFF model family (Rahimi & Recht, 2007 3) as our class of predictors $\mathcal{H}_N$.

Definition 20. We call a Random Fourier Features (RFF) model any function $h: \mathbb{R}^d \rightarrow \mathbb{R}$ of the form:

$$h(x) = \beta^T z(x)$$

with $\beta \in \mathbb{R}^N$ the parameters of the model, and

$$z(x) = \sqrt{\frac{2}{N}} \begin{bmatrix}\cos(\omega_1^T x + b_1) \\ \vdots \\ \cos(\omega_N^T x + b_N)\end{bmatrix}$$

$$\forall i \in \left[ 1,N \right],\quad \begin{cases}\omega_i \sim \mathcal{N}(0, \sigma^2 I_d) \\ b_i \sim \mathcal{U}([0, 2\pi])\end{cases}$$

The vectors $\omega_i$ and the scalars $b_i$ are sampled before fitting the model, and $z$ is called a randomized map.
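A minimal sketch of this randomized map in numpy (the function name and the `sigma_omega` parameter, the standard deviation of the frequencies, are our own notation):

```python
import numpy as np

def make_rff_map(d, N, sigma_omega, rng):
    """Sample (omega_i, b_i) once, then return the map z of Definition 20."""
    omega = rng.normal(scale=sigma_omega, size=(N, d))  # omega_i ~ N(0, sigma_omega^2 I_d)
    b = rng.uniform(0, 2 * np.pi, size=N)               # b_i ~ U([0, 2*pi])
    def z(X):
        """X: array of shape (m, d); returns the (m, N) matrix of random features."""
        return np.sqrt(2 / N) * np.cos(X @ omega.T + b)
    return z
```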

The RFF family is a popular class of models that are linear w.r.t. the parameters $\beta$ but non-linear w.r.t. the input $x$, and can be seen as two-layer neural networks with fixed weights in the first layer. In a classification setting, using these models with the hinge loss amounts to fitting a linear SVM to $n$ feature vectors (of dimension $N$). RFF models are typically used to approximate the Gaussian kernel and reduce the computational cost when $N \ll n$ (e.g. kernel ridge regression when using the squared loss and an $l_2$ regularization term). In our case however, we will go beyond $N=n$ to observe the double descent phenomenon.

Remark 7. Clearly, we have $\mathcal{H}_N \subset \mathcal{H}_{N+1}$ for any $N \geq 0$.

Let $k:(x,y) \mapsto e^{-\frac{\sigma^2}{2}||x-y||^2}$ be the Gaussian kernel on $\mathbb{R}^d$ (parameterized so that its bandwidth $1/\sigma$ matches the frequency distribution of Definition 20), and let $\mathcal{H}_{\infty}$ be a class of predictors where empirical risk minimizers on $\mathcal{D}_n = \{(x_1, y_1), \dots, (x_n, y_n)\}$ can be expressed as $h: x \mapsto \sum_{k=1}^n \alpha_k k(x_k, x)$. Then, as $N \rightarrow \infty$, $\mathcal{H}_N$ becomes a closer and closer approximation of $\mathcal{H}_{\infty}$.

For any $x, y \in \mathbb{R}^d$, with the vectors $\omega_k \in \mathbb{R}^d$ sampled from $\mathcal{N}(0, \sigma^2 I_d)$:

$$\begin{aligned} k(x,y) &= e^{-\frac{\sigma^2}{2}(x-y)^T(x-y)} \\ &\overset{(1)}{=} \mathbb{E}_{\omega \sim \mathcal{N}(0, \sigma^2 I_d)}[e^{i \omega^T(x-y)}] \\ &= \mathbb{E}_{\omega \sim \mathcal{N}(0, \sigma^2 I_d)}[\cos(\omega^T(x-y))] && \text{since } k(x,y) \in \mathbb{R} \\ &\approx \frac{1}{N} \sum_{k=1}^N \cos(\omega_k^T(x-y)) \\ &\approx \frac{1}{N} \sum_{k=1}^N 2 \cos(\omega_k^T x + b_k) \cos(\omega_k^T y + b_k) \\ &\overset{(2)}{=} z(x)^T z(y) \end{aligned}$$

where $(1)$ and $(2)$ are left as an exercise, with indications here if needed. (For the second $\approx$, note that $2\cos(\omega_k^T x + b_k)\cos(\omega_k^T y + b_k) = \cos(\omega_k^T(x-y)) + \cos(\omega_k^T(x+y) + 2b_k)$, and the second term has zero mean over $b_k$.)
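The approximation above can be checked numerically; the following sketch (arbitrary test points, $\sigma = 1$) compares $z(x)^T z(y)$ with the kernel value $k(x,y)$ as $N$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 5, 1.0
x, y = rng.standard_normal(d), rng.standard_normal(d)
k_xy = np.exp(-0.5 * sigma**2 * np.sum((x - y) ** 2))   # k(x, y) as defined above

for N in (10, 100, 10_000):
    omega = rng.normal(scale=sigma, size=(N, d))        # omega_i ~ N(0, sigma^2 I_d)
    b = rng.uniform(0, 2 * np.pi, size=N)
    z = lambda u: np.sqrt(2 / N) * np.cos(omega @ u + b)
    print(f"N={N}: z(x)^T z(y) = {z(x) @ z(y):.4f}  vs  k(x, y) = {k_xy:.4f}")
```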

Hence, for $h \in \mathcal{H}_{\infty}$:

$$h(x) = \sum_{k=1}^n \alpha_k k(x_k, x) \approx \underbrace{\left(\sum_{k=1}^n \alpha_k z(x_k) \right)^T}_{\beta^T} z(x)$$

A complete definition is outside the scope of this post, but $\mathcal{H}_{\infty}$ is actually the Reproducing Kernel Hilbert Space (RKHS) corresponding to the Gaussian kernel, for which RFF models are a good approximation when sampling the random vectors $\omega_i$ from a normal distribution.

We use ERM to find the predictor $h_{n,N} \in \mathcal{H}_N$ and, in the interpolation regime where multiple minimizers exist, we choose the one whose parameters $\beta \in \mathbb{R}^N$ have the smallest $l_2$ norm. This training procedure allows us to observe a model-wise double descent (figure below).
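A minimal sketch of this training procedure, assuming a regression setting with the squared loss (`sigma_omega` is again the frequency standard deviation, and the pseudo-inverse gives the minimum-norm least-squares solution when $N > n$):

```python
import numpy as np

def fit_rff(X_train, y_train, N, sigma_omega, rng):
    """Min-norm least-squares fit of an RFF model with N random features."""
    _, d = X_train.shape
    omega = rng.normal(scale=sigma_omega, size=(N, d))
    b = rng.uniform(0, 2 * np.pi, size=N)
    z = lambda X: np.sqrt(2 / N) * np.cos(X @ omega.T + b)   # (m, N) feature matrix
    beta = np.linalg.pinv(z(X_train)) @ y_train              # smallest l2-norm minimizer
    predict = lambda X: z(X) @ beta
    return predict, np.linalg.norm(beta)                     # predictor and ||beta||
```

Sweeping $N$ from small values to well past $n$ and recording the test error of `predict` (together with the returned norm of $\beta$) is enough to reproduce the qualitative shape of the curve below on small datasets.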

[Figure: double_descent_rff_model]

Model-wise double descent risk curve for the RFF model on a subset of MNIST ($n=10^4$, 10 classes), choosing the smallest-norm predictor $h_{n,N}$ when $N > n$. The interpolation threshold is reached at $N=10^4$. Taken from Belkin et al., 2019 4, which uses an equivalent but slightly different definition of RFF models.

Indeed, in the under-parameterized regime, statistical analyses suggest choosing $N \propto \sqrt{n} \log(n)$ for good test risk guarantees (Rahimi & Recht, 2007). As we approach the interpolation point (around $N = n$), the test risk increases, before decreasing again in the over-parameterized regime.

In the over-parameterized regime ($N \geq n$), multiple predictors are able to perfectly fit the training data. As $\mathcal{H}_N \subset \mathcal{H}_{N+1}$, increasing $N$ leads to richer model classes and allows constructing interpolating predictors that are more regular, with smaller norm (eventually converging to $h_{n,\infty}$ obtained from $\mathcal{H}_{\infty}$). As detailed in Theorem 22 (in a noiseless setting), the small-norm inductive bias is indeed powerful enough to ensure a small generalization error.

Theorem 22 (Belkin et al., 2019). Fix any $h^* \in \mathcal{H}_{\infty}$. Let $(X_1, Y_1), \dots ,(X_n, Y_n)$ be i.i.d. random variables, where $X_i$ is drawn uniformly at random from a compact cube $\Omega \subset \mathbb{R}^d$, and $Y_i = h^*(X_i)$ for all $i$. There exist constants $A,B > 0$ such that, for any interpolating $h \in \mathcal{H}_{\infty}$ (i.e., $h(X_i) = Y_i$ for all $i$), with high probability:

$$\sup_{x \in \Omega}|h(x) - h^*(x)| < A e^{-B(n/\log n)^{1/d}} \left(||h^*||_{\mathcal{H}_{\infty}} + ||h||_{\mathcal{H}_{\infty}}\right)$$

Proof. We refer the reader directly to Belkin et al., 2019 for the proof. ◻


  1. M. Belkin, D. Hsu, J. Xu, Two models of double descent for weak features, 2020 

  2. L. Breiman, D. Freedman, How many variables should be entered in a regression equation?, 1983 

  3. A. Rahimi, B. Recht, Random Features for Large-Scale Kernel Machines, 2007 

  4. M. Belkin, D. Hsu, S. Ma, S. Mandal, Reconciling modern machine-learning practice and the classical bias–variance trade-off, 2019