The role of inductive biases with two linear examples (linear regression with Gaussian noise & Random Fourier Features)
This series of blog posts was co-written with Marc Lafon. A PDF version is available here as well.
In this post, we consider two settings where double descent can be
empirically observed and mathematically justified, in order to give the
reader some intuition on the role of inductive biases. The next post concludes with
some references to recent related works studying optimization in the
over-parameterized regime, or linking double descent to a physical
phenomenon known as jamming.
Fully understanding the mechanisms behind this phenomenon in deep
learning remains an open question, but inductive biases (introduced
in the previous post) seem to play a
key role.
In the over-parameterized regime, empirical risk minimizers are able to
interpolate the data. Intuitively:
Near the interpolation point, there are very few solutions that fit the training data perfectly. Hence, any noise in the data or model misspecification destroys the global structure of the model, leading to an irregular solution that generalizes badly (figure with d=20).
As the effective model capacity grows, many more interpolating solutions exist, including some that generalize better and can be selected thanks to the right inductive bias, e.g. a smaller norm (figure with d=1000), or ensemble methods.
Fitting degree d Legendre polynomials (orange curve) to n=20 noisy samples (red dots), from a polynomial of degree 3 (blue curve).
Gradient descent is used to minimize the squared error, which leads to the smallest norm solution (considering the norm of the vector of coefficients). Taken from this blog post.
Left: d=1, middle: d=20, right: d=1000
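To make this concrete, here is a minimal numpy sketch in the spirit of the figure above: it fits Legendre features of degree $d$ to $n=20$ noisy samples of a degree-3 polynomial, using np.linalg.lstsq as a stand-in for the gradient-descent procedure described in the caption (initialized at zero, gradient descent converges to the same minimum-norm least-squares solution). The cubic coefficients and noise level below are made up for illustration.

```python
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(0)

# n = 20 noisy samples of a degree-3 polynomial on [-1, 1]
n = 20
x = rng.uniform(-1, 1, size=n)
true_coefs = np.array([0.0, -1.0, 0.0, 2.0])      # illustrative cubic (Legendre basis)
y = legendre.legval(x, true_coefs) + 0.3 * rng.normal(size=n)

for d in (1, 20, 1000):
    # Design matrix of Legendre polynomials P_0, ..., P_d evaluated at x
    Phi = legendre.legvander(x, d)                # shape (n, d + 1)
    # Minimum-norm least-squares fit of the coefficient vector
    coefs, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    x_test = np.linspace(-1, 1, 200)
    y_test = legendre.legval(x_test, true_coefs)
    y_pred = legendre.legval(x_test, coefs)
    print(f"d={d:4d}  coef norm={np.linalg.norm(coefs):8.2f}  "
          f"test MSE={np.mean((y_pred - y_test) ** 2):.3f}")
```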
Linear Regression with Gaussian Noise
In this section we consider the family $(\mathcal{H}_p)_{p \in [1,d]}$ of linear functions
$h : \mathbb{R}^d \to \mathbb{R}$ whose weight vector has at most $p$ non-zero components
($1 \le p \le d$). We will study the generalization error obtained with
ERM when increasing $p$ (which is regarded as the class complexity).
Plot of risk $\mathbb{E}[(y - x^T \hat{w})^2]$ as a function of $p$, under the random selection model of the subset of $p$ features. Here $\|w\|^2 = 1$, $\sigma^2 = 1/25$, $d = 100$ and $n = 40$. Taken from Belkin et al., 2020 1.
The class of predictors $\mathcal{H}_p$ is defined as follows:
Definition 17.
For $p \in [1,d]$, $\mathcal{H}_p$ is the set of functions $h : \mathbb{R}^d \to \mathbb{R}$ of the form:
$$h(u) = u^T w, \quad \text{for } u \in \mathbb{R}^d,$$
with $w \in \mathbb{R}^d$ having at most $p$ non-zero elements.
Let $(X, \epsilon) \in \mathbb{R}^d \times \mathbb{R}$ be independent random
variables with $X \sim \mathcal{N}(0, I)$ and $\epsilon \sim \mathcal{N}(0, \sigma^2)$, $\sigma > 0$.
Let $h^* \in \mathcal{H}_d$, with weight vector $w \in \mathbb{R}^d$, and define the random variable
$$Y = h^*(X) + \epsilon = X^T w + \epsilon.$$
We consider $(X_i, Y_i)_{i=1}^{n}$, $n$ iid copies of $(X, Y)$. We are interested in the
following problem:
$$\min_{h \in \mathcal{H}_d} \mathbb{E}\left[(h(X) - Y)^2\right] \qquad (5)$$
Let $X \in \mathbb{R}^{n \times d}$ be the random matrix whose rows are the
$X_i^T$, and let $Y = (Y_1, \ldots, Y_n)^T \in \mathbb{R}^n$. In the following we
will assume that $X$ has full row rank and that $n \ll d$. Applying
empirical risk minimization we can write:
$$\min_{z \in \mathbb{R}^d} \frac{1}{2}\left\|X z - Y\right\|^2 \qquad (6)$$
Definition 18 (Random p-submatrix / p-subvector). The notation used here for the random p-submatrix and p-subvector is not standard and is introduced for clarity.
For any $(p, q) \in [1,d]^2$ such that $p + q = d$, any
matrix $A \in \mathbb{R}^{n \times d}$ and any column vector $v \in \mathbb{R}^d$, we
will denote by $A_{\sim p}$ (resp. $v_{\sim p}$) the sub-matrix (resp.
sub-vector) obtained by randomly selecting a subset of $p$ columns (resp.
elements), and by $A_{\sim q} \in \mathbb{R}^{n \times q}$ and $v_{\sim q} \in \mathbb{R}^{q}$
their discarded counterparts.
In order to solve (6) we will search for a solution in
$\mathcal{H}_p \subset \mathcal{H}_d$ and increase $p$ progressively,
which is a form of structural empirical risk minimization, since
$\mathcal{H}_p \subset \mathcal{H}_{p+1}$ for any $p < d$.
Let $p \in [1,d]$. We are then interested in the following sub-problem:
$$\min_{z \in \mathbb{R}^p} \frac{1}{2}\left\|X_{\sim p} z - Y\right\|^2$$
We have seen in proposition 12 of
the previous post
that the least-norm solution is $\hat{w}_{\sim p} = X_{\sim p}^{+} Y$. If we define $\hat{w}_{\sim q} := 0$, we can then
consider as a solution of the global problem (5) the vector $\hat{w} := \phi_p(\hat{w}_{\sim p}, \hat{w}_{\sim q})$,
where $\phi_p : \mathbb{R}^p \times \mathbb{R}^q \to \mathbb{R}^d$ is the map rearranging the
entries of $\hat{w}_{\sim p}$ and $\hat{w}_{\sim q}$ to match the original indices of
$w$.
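As an illustration, here is a minimal numpy sketch of this construction: a random subset of $p$ features is selected, the least-norm solution is computed with the pseudo-inverse, and the result is zero-padded back to $\mathbb{R}^d$ (playing the role of $\phi_p$). The dimensions follow the figure above ($d = 100$, $n = 40$, $\sigma^2 = 1/25$); the value of $p$ is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, p, sigma2 = 100, 40, 60, 1 / 25            # d, n, sigma^2 from the figure above

# Ground-truth weight vector with unit norm, Gaussian design, Gaussian noise
w = rng.normal(size=d); w /= np.linalg.norm(w)
X = rng.normal(size=(n, d))
Y = X @ w + np.sqrt(sigma2) * rng.normal(size=n)

# Randomly select p features (the sub-matrix X_{~p} of Definition 18)
T = rng.choice(d, size=p, replace=False)
X_p = X[:, T]

# Least-norm solution of min_z ||X_{~p} z - Y||^2, then zero-pad back to R^d (phi_p)
w_hat_p = np.linalg.pinv(X_p) @ Y
w_hat = np.zeros(d)
w_hat[T] = w_hat_p

# Risk estimated on fresh samples
X_test = rng.normal(size=(10_000, d))
Y_test = X_test @ w + np.sqrt(sigma2) * rng.normal(size=10_000)
print("estimated risk:", np.mean((Y_test - X_test @ w_hat) ** 2))
```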
Theorem 19.
Let $(x, \epsilon) \in \mathbb{R}^d \times \mathbb{R}$ be independent random variables with $x \sim \mathcal{N}(0, I)$ and
$\epsilon \sim \mathcal{N}(0, \sigma^2)$, and let $w \in \mathbb{R}^d$. We assume that the
response variable is defined as $y = x^T w + \epsilon$. Let
$(p, q) \in [1,d]^2$ be such that $p + q = d$, and let $X_{\sim p}$ be
the randomly selected $p$-column sub-matrix of $X$. Define
$\hat{w} := \phi_p(\hat{w}_{\sim p}, \hat{w}_{\sim q})$ with $\hat{w}_{\sim p} = X_{\sim p}^{+} Y$
and $\hat{w}_{\sim q} = 0$. Then the risk of this predictor, for a fixed choice of the $p$ selected features, is:
$$\mathbb{E}\left[(y - x^T \hat{w})^2\right] = \begin{cases} \left(\|w_{\sim q}\|^2 + \sigma^2\right)\left(1 + \dfrac{p}{n - p - 1}\right) & \text{if } p \le n-2,\\[4pt] \|w_{\sim p}\|^2\left(1 - \dfrac{n}{p}\right) + \left(\|w_{\sim q}\|^2 + \sigma^2\right)\left(1 + \dfrac{n}{p - n - 1}\right) & \text{if } p \ge n+2,\\[4pt] +\infty & \text{otherwise.} \end{cases}$$
Proof (sketch). The classical regime ($p \le n$) has been treated in Breiman & Freedman, 1983 2.
We will therefore consider the interpolating regime ($p \ge n$). Recall that
$X$ is assumed to be of rank $n$. Let $\eta = Y - X_{\sim p} w_{\sim p}$. We can
write:
$$w_{\sim p} - \hat{w}_{\sim p} = w_{\sim p} - X_{\sim p}^{+}\left(X_{\sim p} w_{\sim p} + \eta\right) = \left(I - X_{\sim p}^{+} X_{\sim p}\right) w_{\sim p} - X_{\sim p}^{+} \eta$$
It is easy to show (left as an exercise) that
$(I - X_{\sim p}^{+} X_{\sim p})$ is the matrix of the orthogonal
projection on $\mathrm{Ker}(X_{\sim p})$. Furthermore,
$-X_{\sim p}^{+} \eta \in \mathrm{Im}(X_{\sim p}^{+}) = \mathrm{Im}(X_{\sim p}^T)$. Then
$(I - X_{\sim p}^{+} X_{\sim p}) w_{\sim p}$ and $-X_{\sim p}^{+} \eta$ are orthogonal
and the Pythagorean theorem gives:
$$\left\|w_{\sim p} - \hat{w}_{\sim p}\right\|^2 = \left\|\left(I - X_{\sim p}^{+} X_{\sim p}\right) w_{\sim p}\right\|^2 + \left\|X_{\sim p}^{+} \eta\right\|^2$$
We will treat each term of the right hand side of this equality
separately.
$\left\|\left(I - X_{\sim p}^{+} X_{\sim p}\right) w_{\sim p}\right\|^2$:
$X_{\sim p}^{+} X_{\sim p}$ is the matrix of the orthogonal projection on
$\mathrm{Im}(X_{\sim p}^T) = \mathrm{Im}(X_{\sim p}^{+})$, then using the
Pythagorean theorem again gives:
$$\left\|\left(I - X_{\sim p}^{+} X_{\sim p}\right) w_{\sim p}\right\|^2 = \left\|w_{\sim p}\right\|^2 - \left\|X_{\sim p}^{+} X_{\sim p} w_{\sim p}\right\|^2$$
Because $X_{\sim p}^{+} X_{\sim p}$ is the matrix of the orthogonal
projection on $\mathrm{Im}(X_{\sim p}^T)$, the vector
$X_{\sim p}^{+} X_{\sim p} w_{\sim p}$ is the projection of $w_{\sim p}$ onto the span of the rows of $X_{\sim p}$.
Since the $x_i$ are i.i.d. standard Gaussian vectors, this row space is a uniformly
distributed $n$-dimensional subspace of $\mathbb{R}^p$, hence:
$$\mathbb{E}\left[\left\|X_{\sim p}^{+} X_{\sim p} w_{\sim p}\right\|^2\right] = \frac{n}{p}\left\|w_{\sim p}\right\|^2$$
then
$$\mathbb{E}\left[\left\|\left(I - X_{\sim p}^{+} X_{\sim p}\right) w_{\sim p}\right\|^2\right] = \left\|w_{\sim p}\right\|^2\left(1 - \frac{n}{p}\right)$$
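As a quick Monte Carlo sanity check of this identity, the following sketch averages $\|X_{\sim p}^{+} X_{\sim p} w_{\sim p}\|^2$ over random Gaussian matrices for a fixed unit vector $w_{\sim p}$; the values of $n$, $p$ and the number of trials are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, trials = 40, 60, 2000                      # interpolating regime: p >= n

w_p = rng.normal(size=p); w_p /= np.linalg.norm(w_p)   # fixed unit vector
acc = 0.0
for _ in range(trials):
    X_p = rng.normal(size=(n, p))                # i.i.d. standard Gaussian rows
    proj = np.linalg.pinv(X_p) @ (X_p @ w_p)     # projection of w_p onto Im(X_p^T)
    acc += np.sum(proj ** 2)

print("Monte Carlo estimate:", acc / trials)     # should be close to n / p
print("theory (n/p):        ", n / p)
```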
$\left\|X_{\sim p}^{+} \eta\right\|^2$:
The calculation of this term uses the "trace trick" and the distribution of the
pseudo-inverse of a Gaussian matrix (an inverse-Wishart distribution), and is beyond the scope
of this lecture. It can be shown that, for $p \ge n+2$:
$$\mathbb{E}\left[\left\|X_{\sim p}^{+} \eta\right\|^2\right] = \left(\|w_{\sim q}\|^2 + \sigma^2\right)\frac{n}{p - n - 1}.$$
Combining the two terms, and adding the contribution $\|w_{\sim q}\|^2 + \sigma^2$ of the discarded features and of the noise on a fresh test point, yields the risk stated in Theorem 19 for the interpolating regime.
Corollary 1.
Let $T$ be a uniformly random subset of $[1,d]$ of cardinality $p$. Under the setting of Theorem 19, and taking the expectation with
respect to $T$ as well (so that $\mathbb{E}_T\|w_{\sim p}\|^2 = \frac{p}{d}\|w\|^2$ and $\mathbb{E}_T\|w_{\sim q}\|^2 = (1 - \frac{p}{d})\|w\|^2$), the risk of the predictor associated to $\hat{w}$ is:
$$\mathbb{E}\left[(y - x^T \hat{w})^2\right] = \begin{cases} \left(\left(1 - \frac{p}{d}\right)\|w\|^2 + \sigma^2\right)\left(1 + \dfrac{p}{n - p - 1}\right) & \text{if } p \le n-2,\\[4pt] \dfrac{p - n}{d}\|w\|^2 + \left(\left(1 - \frac{p}{d}\right)\|w\|^2 + \sigma^2\right)\left(1 + \dfrac{n}{p - n - 1}\right) & \text{if } p \ge n+2,\\[4pt] +\infty & \text{otherwise.} \end{cases}$$
This is the risk plotted in the figure at the beginning of this section.
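The following sketch estimates this risk curve by Monte Carlo with the parameters of the figure at the top of this section ($\|w\|^2 = 1$, $\sigma^2 = 1/25$, $d = 100$, $n = 40$); the numbers of trials and test points are arbitrary. The estimated risk should spike around $p = n$ and decrease again for large $p$, reproducing the double descent shape.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma2 = 100, 40, 1 / 25
trials, n_test = 200, 2000

w = rng.normal(size=d); w /= np.linalg.norm(w)       # ||w||^2 = 1

risks = {}
for p in range(1, d + 1, 3):
    total = 0.0
    for _ in range(trials):
        X = rng.normal(size=(n, d))
        Y = X @ w + np.sqrt(sigma2) * rng.normal(size=n)
        T = rng.choice(d, size=p, replace=False)     # random feature subset
        w_hat = np.zeros(d)
        w_hat[T] = np.linalg.pinv(X[:, T]) @ Y       # least-norm fit on the subset
        X_test = rng.normal(size=(n_test, d))
        Y_test = X_test @ w + np.sqrt(sigma2) * rng.normal(size=n_test)
        total += np.mean((Y_test - X_test @ w_hat) ** 2)
    risks[p] = total / trials

for p, r in risks.items():
    print(f"p={p:3d}  risk≈{r:.3f}")
```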
Random Fourier Features
In this section we consider the RFF model family (Rahimi & Recht, 2007 3) as our
class of predictors $\mathcal{H}_N$.
Definition 20.
We call Random Fourier Features (RFF) model any function
$h : \mathbb{R}^d \to \mathbb{R}$ of the form:
$$h(x) = \beta^T z(x)$$
with $\beta \in \mathbb{R}^N$ the parameters of the model, and
$$z(x) = \sqrt{\frac{2}{N}}\begin{bmatrix}\cos(\omega_1^T x + b_1)\\ \vdots\\ \cos(\omega_N^T x + b_N)\end{bmatrix}, \qquad \forall i \in [1,N]: \quad \omega_i \sim \mathcal{N}\left(0, \sigma^{-2} I_d\right), \quad b_i \sim \mathcal{U}([0, 2\pi]).$$
The vectors $\omega_i$ and the scalars $b_i$ are sampled once, before fitting
the model, and $z$ is called a randomized map.
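As an illustration, here is a small numpy sketch of this randomized map, together with an empirical check that $z(x)^T z(y)$ approximates the Gaussian kernel discussed below; the input dimension, bandwidth $\sigma$ and number of features $N$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, sigma = 5, 2000, 1.0           # input dim, number of features, kernel bandwidth

# Sample the random frequencies and phases once, before fitting
omega = rng.normal(scale=1.0 / sigma, size=(N, d))     # omega_i ~ N(0, sigma^{-2} I_d)
b = rng.uniform(0.0, 2.0 * np.pi, size=N)              # b_i ~ U([0, 2*pi])

def z(x):
    """Randomized Fourier feature map z : R^d -> R^N."""
    return np.sqrt(2.0 / N) * np.cos(omega @ x + b)

x, y = rng.normal(size=d), rng.normal(size=d)
approx = z(x) @ z(y)
exact = np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))   # Gaussian kernel k(x, y)
print(f"RFF approximation: {approx:.4f}   exact kernel: {exact:.4f}")
```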
The RFF family is a popular class of models that are linear w.r.t. the
parameters β but non-linear w.r.t. the input x, and can be seen
as two-layer neural networks with fixed weights in the first layer. In a
classification setting, using these models with the hinge loss amounts
to fitting a linear SVM to $n$ feature vectors (of dimension $N$). RFF
models are typically used to approximate the Gaussian kernel and reduce
the computational cost when $N \ll n$ (e.g. kernel ridge regression when
using the squared loss and an $\ell_2$ regularization term). In our case
however, we will go beyond N=n to observe the double descent
phenomenon.
Remark 7. Clearly, we have $\mathcal{H}_N \subset \mathcal{H}_{N+1}$ for any
$N \ge 0$.
Let $k : (x, y) \mapsto e^{-\frac{1}{2\sigma^2}\|x - y\|^2}$ be the
Gaussian kernel on $\mathbb{R}^d$, and let $\mathcal{H}_\infty$ be a
class of predictors where empirical risk minimizers on
$\mathcal{D}_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ can be expressed as
$h : x \mapsto \sum_{k=1}^{n} \alpha_k k(x_k, x)$. Then, as
$N \to \infty$, $\mathcal{H}_N$ becomes a closer and closer
approximation of $\mathcal{H}_\infty$.
For any $x, y \in \mathbb{R}^d$, with the vectors
$\omega_k \in \mathbb{R}^d$ sampled from $\mathcal{N}(0, \sigma^{-2} I_d)$:
$$k(x, y) \overset{(1)}{=} \mathbb{E}_{\omega}\left[\cos\left(\omega^T (x - y)\right)\right] \overset{(2)}{=} \mathbb{E}_{\omega, b}\left[2\cos\left(\omega^T x + b\right)\cos\left(\omega^T y + b\right)\right] \approx \frac{1}{N}\sum_{k=1}^{N} 2\cos\left(\omega_k^T x + b_k\right)\cos\left(\omega_k^T y + b_k\right) = z(x)^T z(y)$$
where (1) and (2) are left as an exercise, with
indications here if needed.
Hence, for $h \in \mathcal{H}_\infty$:
$$h(x) = \sum_{k=1}^{n} \alpha_k k(x_k, x) \approx \left(\sum_{k=1}^{n} \alpha_k z(x_k)\right)^T z(x) = \beta^T z(x), \quad \text{with } \beta := \sum_{k=1}^{n} \alpha_k z(x_k).$$
A complete definition is outside the scope of this lecture, but
$\mathcal{H}_\infty$ is actually the Reproducing Kernel Hilbert Space
(RKHS) corresponding to the Gaussian kernel, for which RFF models are a
good approximation when sampling the random vectors $\omega_i$ from a
normal distribution.
We use ERM to find the predictor $h_{n,N} \in \mathcal{H}_N$ and, in
the interpolation regime where multiple minimizers exist, we choose the
one whose parameters $\beta \in \mathbb{R}^N$ have the smallest $\ell_2$
norm. This training procedure allows us to observe a model-wise double
descent (figure below).
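Here is a minimal sketch of this training procedure on synthetic data (not the MNIST setup of the figure below): for each $N$, the minimum-norm least-squares fit is obtained from the pseudo-inverse of the random feature matrix. All sizes and the data-generating function are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, n_test, sigma = 5, 100, 2000, 1.0

# Synthetic regression data (illustrative target function)
def target(X):
    return np.sin(X.sum(axis=1))

X_train = rng.normal(size=(n, d))
y_train = target(X_train) + 0.1 * rng.normal(size=n)
X_test = rng.normal(size=(n_test, d))
y_test = target(X_test)

def rff_features(X, omega, b):
    """Stack z(x) for each row of X into a (len(X), N) feature matrix."""
    N = len(b)
    return np.sqrt(2.0 / N) * np.cos(X @ omega.T + b)

for N in (10, 50, 100, 500, 5000):               # below, at, and above N = n
    omega = rng.normal(scale=1.0 / sigma, size=(N, d))
    b = rng.uniform(0.0, 2.0 * np.pi, size=N)
    Z_train = rff_features(X_train, omega, b)
    # Minimum-norm empirical risk minimizer (pseudo-inverse)
    beta = np.linalg.pinv(Z_train) @ y_train
    test_mse = np.mean((rff_features(X_test, omega, b) @ beta - y_test) ** 2)
    print(f"N={N:5d}  ||beta||={np.linalg.norm(beta):8.2f}  test MSE={test_mse:.4f}")
```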
Model-wise double descent risk curve for the RFF model on a subset of MNIST ($n = 10^4$, 10 classes), choosing the smallest-norm predictor $h_{n,N}$ when $N > n$. The interpolation threshold is achieved at $N = 10^4$. Taken from Belkin et al., 2019 4, which uses an equivalent but slightly different definition of RFF models.
Indeed, in the under-parameterized regime,
statistical analyses suggest choosing $N \propto \sqrt{n}\log(n)$ for
good test risk guarantees (Rahimi & Recht, 2007). As we approach the
interpolation point (around $N = n$), we observe that the test risk
increases, then decreases again.
In the over-parameterized regime ($N \ge n$), multiple predictors are
able to fit the training data perfectly. As
$\mathcal{H}_N \subset \mathcal{H}_{N+1}$, increasing $N$ leads to richer
model classes and allows constructing interpolating predictors that are
more regular, with a smaller norm (eventually converging to $h_{n,\infty}$
obtained from $\mathcal{H}_\infty$). As detailed in Theorem 22 (in a noiseless setting), the small-norm
inductive bias is indeed powerful enough to ensure a small generalization error.
Theorem 22 (Belkin et al., 2019).
Fix any $h^* \in \mathcal{H}_\infty$. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be
i.i.d. random variables, where $X_i$ is drawn uniformly at random from a
compact cube $\Omega \subset \mathbb{R}^d$, and $Y_i = h^*(X_i)$ for all
$i$. There exist constants $A, B > 0$ such that, for any
$h \in \mathcal{H}_\infty$ interpolating the data (i.e., $h(X_i) = Y_i$ for all $i$),
with high probability: