Note: Chapter 4 is the problem set; it contains 14 problems.
STAT 578: Topics in
High Dimensional Statistics
Jingbo Liu
This version: March 2, 2021
Contents

0 Course Information
1 Linear Regression
  1.1 Gaussian Sequence Model
    1.1.1 Sparsity adaptive thresholding
    1.1.2 Stein's paradox and the James-Stein estimator
  1.2 Gaussian Linear Model and the Least Squares Regression
  1.3 $\ell_0$ and $\ell_1$ Regularized Regressions
  1.4 Basis Pursuit and the Null Space Condition
  1.5 NSC for Random Designs
  1.6 More on Convex Geometry
    1.6.1 A General Result on the Intersection of Convex Cones
    1.6.2 Implication for Basis Pursuit
    1.6.3 Proof Sketch
2 Graphical Lasso
  2.1 Gaussian Graphical Model
  2.2 Dual Formulation
  2.3 Blockwise Coordinate Descent
3 Matrix Estimation
  3.1 Multivariate Regression: Setup
  3.2 Penalization by Rank
  3.3 Matrix Completion
4 Problem Set
Chapter 0
Course Information
Welcome! This is a graduate-level topics course on high dimensional statistics. Traditional (parametric) statistics, which concerns methods and the analysis of settings with a fixed number of parameters and an increasing number of samples, is by now well understood. Over the past decades, however, it has become increasingly important to consider models in which the number of parameters (a.k.a. features, predictors, covariates) grows with the number of samples. The reasons are, arguably, that with greater computational power it is possible to handle larger numbers of parameters and larger sample sizes, and that for larger sample sizes, better prediction can be achieved using more model parameters.
Foundational work on high dimensional statistics and the related nonparametric statistics was laid by the Russian school since the seventies and by Donoho and Johnstone in the early nineties. The simplest example problem in high dimensional statistics is sparse linear regression, which can be solved by the Lasso algorithm for reasonable data sizes. We will discuss the Lasso, its analysis, as well as other similar algorithms. Moreover, while sparse recovery is certainly well studied, recent years have seen growing interest in matrix and tensor recovery, which share some common ideas and techniques.
Below is a tentative list of topics by week.
1. Introduction; Gaussian sequence model
2. Lasso; restricted eigenvalue condition; fast rate
3. Nullspace condition
4. Statistical dimension
5. Graphical Lasso
6. Matrix estimation
7. Robust PCA
8. Information-theoretic technique for lower bounds
9. Empirical distribution of error: leave-one-out
10. Empirical distribution of error: replica
11. Iterative soft/hard thresholding
12. Approximate message passing
Topics 1-5 mostly concern sparse linear regression and related problems. Topics 2-4 introduce some tools for analyzing the Lasso and provide guarantees on the risk. Topic 8 is about general ideas for showing lower bounds on the risk. Topics 9 and 10 are techniques for a more refined analysis that even provides the empirical distribution of the errors, and 11 and 12 are additional algorithms for sparse regression. Topics 9, 10, and 12 are conceptually more challenging than the others in the list, but are also of greater interest in recent research (in my opinion). Papers on these topics are generally challenging to read, which, in my opinion, is partly because their primary aim is to present original results for credit. In this course, however, my goal is tutorial, so I will "deconstruct" their techniques by working on the simplest models possible. Nevertheless, you are not required to understand all the topics deeply (see grading below).
Required background.
Basic courses in probability and
statistics are assumed. You should also know basic notions of linear
algebra, such as eigenvalue decomposition or singular value decomposition.
Grading.
• Homework (50%) I will leave a few problems as homework (see
last chapter of this document). There is no TA for this course,
so the homework will not be graded weekly. I recommend
submitting once by the midterm and once by the end of the
term. Also, you are not required to learn all the topics, and
hence not required to solve all the problems. Number of credits
will be given next to each problem, and you only need to solve
enough so that the sum ≥ 50.
• Midterm (25%) Take home exam, more or less a homework but
within a fixed time (perhaps 1 week).
• Final project (25%) Read any of the papers below (or some other paper you find interesting; drop me an email to discuss). Write a 10-page report. I also recommend presenting a summary to the class (or uploading a video). When I was doing a reading project as a PhD student, I was always able to find something new and publish a paper. It would be great if you can do the same, but it is not required.
Reading Project Examples [updated periodically]
1. D. Amelunxen, M. Lotz, M. McCoy, J. Tropp, Living on the edge: Phase transitions in convex programs with random data, Information and Inference, 2014.
2. D. Donoho and J. Tanner, Neighborliness of randomly projected simplices in high dimensions, PNAS, 2005.
3. T. Cover, Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition, IEEE Trans. on Electronic Computers, 1965.
4. P. Sur, Y. Chen, and E. Candes, The likelihood ratio test in high-dimensional logistic regression is asymptotically a rescaled chi-square, 2017.
5. Narayana P. Santhanam, Martin J. Wainwright, Information-Theoretic Limits of Selecting Binary Graphical Models in High Dimensions, https://people.eecs.berkeley.edu/~wainwrig/Papers/SanWai12.pdf
6. I. Daubechies, M. Defrise, C. De Mol, An iterative thresholding algorithm for linear inverse problems with a sparsity constraint, https://onlinelibrary.wiley.com/doi/abs/10.1002/cpa.20042
7. Benjamin Recht, Weiyu Xu, Babak Hassibi, Null Space Conditions and Thresholds for Rank Minimization, https://people.eecs.berkeley.edu/~brecht/papers/10.RecXuHas.Thresholds.pdf
8. Yuchen Zhang, Distributed machine learning with communication constraints, https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-47.pdf
9. Noureddine El Karoui, Derek Bean, Peter Bickel, Chingway Lim and Bin Yu, On robust regression with high-dimensional predictors, https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.294.5920&rep=rep1&type=pdf
10. Noureddine El Karoui, On the impact of predictor geometry on the performance of high-dimensional ridge-regularized generalized robust regression estimators, Probab. Theory Relat. Fields (2018) 170:95-175.
11. Mohsen Bayati, Andrea Montanari, The dynamics of message passing on dense graphs, with applications to compressed sensing, https://arxiv.org/pdf/1001.3448.pdf
12. Yash Deshpande, Andrea Montanari, Information-theoretically Optimal Sparse PCA, https://arxiv.org/pdf/1402.2238.pdf
13. The replica trick for the analysis of random matrices, https://meisong541.github.io/jekyll/update/2019/08/04/Replica_method_1.html#ref1
Chapter 1
Linear Regression
Regression is a fundamental problem in statistics. Given paired observations $(X_1, Y_1), (X_2, Y_2), \dots, (X_n, Y_n)$, the statistician wants to find a relationship $f$ between the predictors $X_i$ and the responses $Y_i$. The response for a fresh input $X$ in the future can then be predicted as $f(X)$.
Often, each $Y_i$ is a real number, whereas each $X_i$ is a vector in $\mathbb{R}^d$. Making the ansatz that $f(X) = X^\top\theta$ is a linear function, we may reduce the problem to linear regression, i.e., to finding the unknown coefficient vector $\theta$.
To evaluate algorithms for regression, it is useful to analyze simple generative models for the data. The Gaussian linear model is given by
$$ Y_i = X_i^\top\theta + \xi_i, \qquad i = 1, \dots, n, \tag{1.1} $$
where $\xi_1, \dots, \xi_n$ are i.i.d. Gaussian; or, in matrix form,
$$ Y = X\theta + \xi. \tag{1.2} $$
Regarding the genesis of the matrix $X$ (equivalently, its row vectors $X_1, \dots, X_n$), there are two types of models that are commonly studied:
• Fixed design models: where $X$ is a deterministic matrix chosen by the statistician.
• Random design models: where $X_1, \dots, X_n$ are i.i.d. random vectors drawn from some distribution on $\mathbb{R}^d$.
1.1 Gaussian Sequence Model

The model is
$$ Y_i = \theta_i + \xi_i, \qquad i = 1, \dots, d, \tag{1.3} $$
where $\xi_1, \dots, \xi_d$ are i.i.d. $N(0, \sigma^2)$ random variables. This corresponds to a Gaussian linear model with the identity fixed design matrix $X = I_d$. The Gaussian sequence model is perhaps the simplest, since the observations of the coordinates of the coefficient vector are decoupled. Yet, the model sheds light on many key ideas that will carry over to general linear regression models and beyond. One key idea, which we are going to explore next, is that we can do better than simply returning $\hat\theta = Y$, by using an idea called shrinkage.
1.1.1 Sparsity adaptive thresholding
Let us explore how shrinkage helps in the case of sparse signals. We shall use the following notation: for $\theta \in \mathbb{R}^d$,
$$ \|\theta\|_0 := \sum_{i=1}^d 1\{\theta_i \neq 0\}. \tag{1.4} $$
The set of $k$-sparse vectors is denoted as
$$ B_0(k) := \{\theta : \|\theta\|_0 = k\}. \tag{1.5} $$
Note that the naive estimator $\hat\theta(Y) = Y$ has squared error
$$
\begin{aligned}
\mathbb{E}\|\hat\theta(Y) - \theta\|_2^2 &= \mathbb{E}\|\hat\theta(\theta + \xi) - \theta\|_2^2 && (1.6)\\
&= \mathbb{E}\|\xi\|_2^2 && (1.7)\\
&= d\sigma^2. && (1.8)
\end{aligned}
$$
In contrast, if the estimator has the oracle information of $\mathrm{supp}(\theta) := \{i : \theta_i \neq 0\}$, then the statistician can use the estimator
$$ \hat\theta_i = Y_i\,1\{i \in \mathrm{supp}(\theta)\}, \tag{1.9} $$
and the squared error is $k\sigma^2$ instead, by a similar calculation as above. If $k$ scales linearly with $d$, the naive estimator achieves the performance of the oracle estimator up to a constant factor of $d/k$. However, more interesting is the case where $k$ is much smaller than $d$ asymptotically (as a model example, think of $k$ as a fractional power of $d$, say $k = d^{0.5}$). This is the setting where the following estimator shows its power:
Hard thresholding estimator:
• Input: vector $Y \in \mathbb{R}^d$ and parameter $\tau \in (0, \infty)$.
• Output: $\hat\theta$ where
$$ \hat\theta_i := Y_i\,1\{|Y_i| > \tau\}. \tag{1.10} $$
How should we pick the threshold $\tau$? The idea is to pick it just large enough so that there is no type-I error, with constant probability. Using the union bound, we can see that
$$ \max_{1\le i\le d} |\xi_i| \le \sigma\sqrt{2\log(2d/\delta)} \tag{1.11} $$
with probability at least $1 - \delta$.
Theorem 1. Consider the Gaussian sequence model and the hard thresholding estimator with
$$ \tau = 2\sigma\sqrt{2\log(2d/\delta)}. \tag{1.12} $$
For any $\theta \in B_0(k)$, we have
$$ \|\hat\theta - \theta\|_2^2 \lesssim \sigma^2 k \log\frac{2d}{\delta} \tag{1.13} $$
with probability at least $1 - \delta$.
Proof. As mentioned, with probability at least $1 - \delta$ we have
$$ \max_{1\le i\le d} |\xi_i| \le \tau/2, \tag{1.14} $$
which we call the event $A$. Now under $A$, we have
$$
\begin{aligned}
|\hat\theta_i - \theta_i| &= |Y_i - \theta_i|\,1\{|Y_i| > \tau\} + |\theta_i|\,1\{|Y_i| \le \tau\} && (1.15)\\
&= |\xi_i|\,1\{|\theta_i + \xi_i| > \tau\} + |\theta_i|\,1\{|Y_i| \le \tau\} && (1.16)\\
&\le \frac{\tau}{2}\,1\{|\theta_i + \xi_i| > \tau\} + |\theta_i|\,1\{|Y_i| \le \tau\} && (1.17)\\
&\le \frac{\tau}{2}\,1\{|\theta_i| > \tfrac{\tau}{2}\} + |\theta_i|\,1\{|\theta_i| \le \tfrac{3}{2}\tau\} && (1.18)\\
&\lesssim \min\{|\theta_i|, \tau\} && (1.19)
\end{aligned}
$$
for each $i = 1, \dots, d$. This yields
$$ \|\hat\theta - \theta\|_2^2 \lesssim \sum_{i=1}^d \min\{|\theta_i|^2, \tau^2\} \lesssim \|\theta\|_0\,\tau^2. \tag{1.20} $$
Remarkably, within a factor of $\log\frac{2d}{\delta}$, the hard thresholding estimator achieves the performance of the oracle estimator. Moreover, the estimator is adaptive in the sense that $\tau$ is selected without reference to the sparsity level $k$.
The following estimator shares similar desirable properties (achieving $O(\sigma^2 k\log\frac{2d}{\delta})$ error with probability at least $1 - \delta$, and adaptivity), while having the advantage of being continuous:
Soft thresholding estimator:
• Input: vector $Y \in \mathbb{R}^d$ and parameter $\tau \in (0, \infty)$.
• Output: $\hat\theta$ where
$$ \hat\theta_i := (Y_i - \tau)\,1\{Y_i > \tau\} + (Y_i + \tau)\,1\{Y_i < -\tau\}. \tag{1.21} $$
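As a quick illustration of the two estimators above, here is a minimal NumPy sketch (not part of the original notes; the dimensions, sparsity level, and $\delta$ are arbitrary choices of mine) that simulates the Gaussian sequence model (1.3) and compares the naive, hard thresholding, and soft thresholding losses.

```python
# Simulation of the Gaussian sequence model with hard/soft thresholding.
# Illustrative sketch only; sizes and delta are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
d, k, sigma, delta = 100_000, 50, 1.0, 0.1

tau = 2 * sigma * np.sqrt(2 * np.log(2 * d / delta))   # threshold (1.12)

theta = np.zeros(d)
theta[:k] = 3 * tau                                    # a k-sparse ground truth, well above tau
Y = theta + sigma * rng.standard_normal(d)             # Y_i = theta_i + xi_i, model (1.3)

theta_hard = Y * (np.abs(Y) > tau)                          # hard thresholding (1.10)
theta_soft = np.sign(Y) * np.maximum(np.abs(Y) - tau, 0.0)  # soft thresholding (1.21)

print("naive loss:", np.sum((Y - theta) ** 2))           # about d * sigma^2
print("hard  loss:", np.sum((theta_hard - theta) ** 2))  # far below d * sigma^2 when k << d
print("soft  loss:", np.sum((theta_soft - theta) ** 2))  # also of order k * tau^2
```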
1.1.2 Stein's paradox and the James-Stein estimator
In this section we do not impose any sparsity constraint on the ground truth $\theta$. By translation invariance (is it really true? see Exercise 3), it then seems that nothing more can be done than the naive estimator
$$ \hat\theta^{\mathrm{naive}} = Y. \tag{1.22} $$
It is therefore paradoxical that the following actually holds:
Theorem 2. The estimator $\hat\theta^{\mathrm{naive}}$ is inadmissible, in the sense that there exists another estimator $\hat\theta$ such that
$$ \mathbb{E}\|\hat\theta - \theta\|_2^2 \le \mathbb{E}\|\hat\theta^{\mathrm{naive}} - \theta\|_2^2 \tag{1.23} $$
for any ground truth $\theta \in \mathbb{R}^d$, and moreover, there exists at least one $\theta$ such that the inequality is strict.
The James-Stein estimator is one such estimator $\hat\theta$. It is defined as
$$ \hat\theta^{\mathrm{JS}} := \left(1 - \frac{\sigma^2(d-2)}{\|Y\|_2^2}\right) Y. \tag{1.24} $$
Clearly, the idea is still based on shrinkage. In retrospect, the paradox is resolved since the problem is not completely translation invariant: the noise vector $\xi$ is centered (see Exercise 3). Intuitively, a larger $\|Y\|$ suggests a larger "signal-to-noise ratio", and that the estimator should shrink less.
Stein considered estimators of the following form, which clearly encapsulates the estimator of (1.24):
$$ \hat\theta = g(Y)\,Y, \tag{1.25} $$
where $g\colon \mathbb{R}^d \to \mathbb{R}$ is a function to be chosen.
Stein's unbiased risk estimator (SURE) gives an estimate of the mean squared error (risk) for (1.25):
Lemma 3. Assume that $g$ is a function satisfying regularity conditions (see [Tsy08]). Suppose that $\hat\theta$ is given by (1.25). Then an unbiased estimator of the risk $\mathbb{E}[\|\hat\theta - \theta\|^2]$ is given by
$$ \mathrm{SURE} = \sigma^2 d\,(2g(Y) - 1) + 2\sigma^2 \sum_{i=1}^d Y_i \frac{\partial g}{\partial y_i}(Y) + \|Y\|_2^2\,(1 - g(Y))^2. \tag{1.26} $$
Note that SURE is a function only of $Y$ (not of the unknown $\theta$), and it will be shown that its expectation equals $\mathbb{E}[\|\hat\theta - \theta\|^2]$; this is the meaning of "unbiased estimator of the risk". Let us remark that estimators of the risk are broadly useful in statistics: for example, in nonparametric statistics the nuisance parameters (such as the bandwidth) can be tuned by minimizing an estimator of the risk which can be computed from the observations. Another common technique for risk estimation is cross-validation.
Proof of Lemma 3. We begin with the expansion
$$
\begin{aligned}
\mathbb{E}\|\hat\theta - \theta\|_2^2 &= \sum_{i=1}^d \mathbb{E}(g(Y)Y_i - \theta_i)^2 && (1.27)\\
&= \sum_{i=1}^d \Big( \mathbb{E}(Y_i - \theta_i)^2 + 2\,\mathbb{E}[(\theta_i - Y_i)(1 - g(Y))Y_i] + \mathbb{E}[Y_i^2(1 - g(Y))^2] \Big). && (1.28)
\end{aligned}
$$
Clearly $\mathbb{E}(Y_i - \theta_i)^2 = \sigma^2$. We now apply Gaussian integration by parts (Exercise 4) to find
$$ \mathbb{E}[(\theta_i - Y_i)(1 - g(Y))Y_i] = -\sigma^2\,\mathbb{E}\Big[ 1 - g(Y) - Y_i \frac{\partial g}{\partial y_i}(Y) \Big], \tag{1.29} $$
which gets rid of the dependence on $\theta$, and the claim follows.
Applying Lemma 3 to the James-Stein estimator, we obtain the following.
Theorem 4. Let $d \ge 3$. For any $\theta \in \mathbb{R}^d$,
$$ \mathbb{E}\|\hat\theta^{\mathrm{JS}} - \theta\|^2 = d\sigma^2 - \mathbb{E}\left[\frac{\sigma^4(d-2)^2}{\|Y\|_2^2}\right], \tag{1.30} $$
which is strictly smaller than $\mathbb{E}\|Y - \theta\|^2$. In particular, $\hat\theta = Y$ is inadmissible.
Proof. From the expression of SURE we see that
$$ \mathbb{E}\|\hat\theta - \theta\|^2 = d\sigma^2 + \mathbb{E}[W(Y)], \tag{1.31} $$
where we have defined the function
$$ W(y) := -2\sigma^2 d\,(1 - g(y)) + 2\sigma^2 \sum_{i=1}^d y_i \frac{\partial g}{\partial y_i}(y) + \|y\|_2^2\,(1 - g(y))^2. \tag{1.32} $$
If $g(y) = 1 - \frac{c}{\|y\|_2^2}$ for some $c > 0$, then we can explicitly compute
$$ W(y) = \frac{1}{\|y\|_2^2}\left( -2dc\sigma^2 + 4\sigma^2 c + c^2 \right). \tag{1.33} $$
Minimizing $W(y)$ over $c > 0$ yields
$$ c_{\mathrm{opt}} = \sigma^2(d-2), \tag{1.34} $$
and the claim follows by plugging $c_{\mathrm{opt}}$ into SURE.
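To see Stein's paradox numerically, here is a small Monte Carlo sketch (my own illustration, not from the notes; the dimension, ground truth, and number of repetitions are arbitrary choices) comparing the risks of the naive estimator and the James-Stein estimator (1.24).

```python
# Monte Carlo comparison of the naive and James-Stein estimators; sigma is known.
import numpy as np

rng = np.random.default_rng(1)
d, sigma, n_rep = 20, 1.0, 20000
theta = np.full(d, 2.0)                      # any fixed ground truth

risk_naive, risk_js = 0.0, 0.0
for _ in range(n_rep):
    Y = theta + sigma * rng.standard_normal(d)
    shrink = 1.0 - sigma**2 * (d - 2) / np.sum(Y**2)   # James-Stein factor (1.24)
    risk_naive += np.sum((Y - theta) ** 2)
    risk_js += np.sum((shrink * Y - theta) ** 2)

print("naive risk      :", risk_naive / n_rep)   # approx d * sigma^2 = 20
print("James-Stein risk:", risk_js / n_rep)      # strictly smaller, cf. (1.30)
```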
1.2 Gaussian Linear Model and the Least Squares Regression

The next few sections concern the Gaussian linear model (1.2):
$$ Y = X\theta + \xi, \tag{1.35} $$
where $X \in \mathbb{R}^{n\times d}$. First, let us review the basic least squares estimator, which will reveal the typical behavior of the risk. Solving
$$ \hat\theta^{\mathrm{LS}} \in \mathrm{argmin}_{\theta\in\mathbb{R}^d} \|Y - X\theta\|_2^2 \tag{1.36} $$
gives
$$ \hat\theta^{\mathrm{LS}} = (X^\top X)^{-1} X^\top Y \tag{1.37} $$
when $X$ has full column rank. In general, $(X^\top X)^{-1}$ is replaced by the Moore-Penrose pseudoinverse of $X^\top X$.
Theorem 5. The mean square error of the least squares estimator is
$$ \mathbb{E}\|X\hat\theta^{\mathrm{LS}} - X\theta\|_2^2 = d\sigma^2. \tag{1.38} $$
Proof. Assuming $\sigma^2 = 1$, we have
$$
\begin{aligned}
\mathbb{E}\|X\hat\theta^{\mathrm{LS}} - X\theta\|_2^2 &= \mathbb{E}\|X(X^\top X)^{-1}X^\top Y - X\theta\|_2^2 && (1.39)\\
&= \mathbb{E}\|X(X^\top X)^{-1}X^\top \xi\|_2^2 && (1.40)\\
&= \mathrm{tr}(X(X^\top X)^{-1}X^\top) && (1.41)\\
&= \mathrm{tr}((X^\top X)^{-1}X^\top X) && (1.42)\\
&= \mathrm{tr}(I_d) && (1.43)\\
&= d. && (1.44)
\end{aligned}
$$
Remark 1. We see that the scaling of the mean square risk is the same as in the Gaussian sequence model (1.8): it is the number of degrees of freedom multiplied by the noise variance. Moreover, the result does not depend on (the full rank) $X$ at all!
Remark 2. In general, if $X$ has rank $r$, we can show that $\mathbb{E}\|X\hat\theta^{\mathrm{LS}} - X\theta\|_2^2 = r\sigma^2$. Moreover, if the noise $\xi_i$ is subgaussian instead of Gaussian, a similar result $\mathbb{E}\|X\hat\theta^{\mathrm{LS}} - X\theta\|_2^2 \lesssim r\sigma^2$ holds, using a different derivation via the maximal inequality [Rig, Theorem 2.2].
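A quick simulation (illustrative only; the design and the sizes below are arbitrary choices) confirms the dimension-only behavior in Theorem 5.

```python
# Numerical check of Theorem 5: E||X theta_LS - X theta||^2 = d sigma^2
# for a full-column-rank design X, regardless of which X is used.
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma, n_rep = 200, 10, 1.0, 5000
X = rng.standard_normal((n, d))
theta = rng.standard_normal(d)

total = 0.0
for _ in range(n_rep):
    Y = X @ theta + sigma * rng.standard_normal(n)
    theta_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)   # least squares estimator (1.37)
    total += np.sum((X @ (theta_ls - theta)) ** 2)

print("empirical risk:", total / n_rep, " expected:", d * sigma**2)
```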
1.3 $\ell_0$ and $\ell_1$ Regularized Regressions

The least squares estimator does not apply to the case of $d > n$. Moreover, as we saw in the case of the Gaussian sequence model (Theorem 1), when $\theta$ is $k$-sparse we should expect the mean square error to scale as $k\sigma^2$ (up to logarithmic factors) instead of $d\sigma^2$. Motivated by the soft and hard thresholding estimators in the Gaussian sequence model, it is natural to consider the Bayes Information Criterion (BIC) estimator and the Lasso estimator:
$$ \hat\theta^{\mathrm{BIC}} \in \mathrm{argmin}_{\theta\in\mathbb{R}^d} \left\{ \|Y - X\theta\|_2^2 + \tau^2\|\theta\|_0 \right\}; \tag{1.45} $$
$$ \hat\theta^{\mathrm{L}} \in \mathrm{argmin}_{\theta\in\mathbb{R}^d} \left\{ \|Y - X\theta\|_2^2 + 2\tau\|\theta\|_1 \right\}, \tag{1.46} $$
where we recall that
$$ \|\theta\|_0 := \sum_{i=1}^d 1\{\theta_i \neq 0\}; \tag{1.47} $$
$$ \|\theta\|_1 := \sum_{i=1}^d |\theta_i|. \tag{1.48} $$
Indeed, we can check that when $n = d$ and $X = I_d$, BIC and Lasso reduce to the hard thresholding and the soft thresholding estimators (Exercise 5).
Let us remark that the contents of this and the following sections are mostly from [Rig], where $X$ has spectral norm scaling as $\sqrt{n}$. However, in this note we rescale $X$ so that the spectral norm is of order 1, so that the correspondence to the Gaussian sequence model (the $X = I_d$ case) is cleaner. Correspondingly, there is no extra $\frac{1}{n}$ factor in front of the 2-norms in (1.45)-(1.46).
Computation issues. Exactly solving BIC is known to be NP-hard, since it requires enumerating all possible sparsity patterns $\mathrm{supp}(\theta) := \{i : \theta_i \neq 0\}$. In contrast, the Lasso estimator is a convex optimization problem and can be solved efficiently by a number of algorithms. One of the most popular methods is coordinate descent, which is implemented in the glmnet package in R.
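For concreteness, here is a minimal proximal-gradient (iterative soft-thresholding, cf. topic 11) sketch for the Lasso objective in (1.46). This is an illustration rather than the coordinate-descent algorithm used by glmnet, and the problem sizes and the choice of $\tau$ below are arbitrary.

```python
# ISTA (proximal gradient) sketch for the Lasso objective ||Y - X theta||_2^2 + 2 tau ||theta||_1.
import numpy as np

def soft(v, t):
    """Entrywise soft thresholding, the proximal map of t*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, Y, tau, n_iter=2000):
    d = X.shape[1]
    L = 2.0 * np.linalg.norm(X, 2) ** 2      # Lipschitz constant of the smooth gradient
    theta = np.zeros(d)
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ theta - Y)   # gradient of ||Y - X theta||_2^2
        theta = soft(theta - grad / L, 2.0 * tau / L)
    return theta

# toy usage (sizes and sparsity level are arbitrary choices)
rng = np.random.default_rng(3)
n, d, k, sigma = 100, 300, 5, 0.1
X = rng.standard_normal((n, d)) / np.sqrt(n)   # columns with norm about 1, as in these notes
theta_star = np.zeros(d); theta_star[:k] = 1.0
Y = X @ theta_star + sigma * rng.standard_normal(n)
tau = 4 * sigma * np.sqrt(np.log(2 * d))       # same order as (1.50)
theta_hat = lasso_ista(X, Y, tau)
print("prediction error:", np.sum((X @ (theta_hat - theta_star)) ** 2))
```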
Fast rate for the Lasso. Under a restricted eigenvalue condition on $X$, we will show that the Lasso can achieve a mean square error of the same $k\sigma^2\log(\cdots)$ order as the soft/hard thresholding estimators for the Gaussian sequence model. This is called the fast rate for the Lasso, as opposed to the slow rate, which holds for a more relaxed class of $X$ (see [Rig, Section 2.4]).
Definition 6. We say that $X$ satisfies the $(k, \kappa)$-restricted eigenvalue condition (RE) if
$$ \inf_{|S|\le k}\ \inf_{\theta\in C_S} \frac{\|X\theta\|_2^2}{\|\theta_S\|_2^2} \ge \kappa, \tag{1.49} $$
where $C_S = \{\theta : \|\theta_{S^c}\|_1 \le 3\|\theta_S\|_1\}$.
Remark 3. Our definition (1.49) is slightly different from some of the literature by a factor of $n$, and the denominator has $\theta_S$ instead of $\theta$; in [RWY10] the restricted eigenvalue condition is defined as $\frac{1}{n}\inf_{|S|\le k}\inf_{\theta\in C_S}\frac{\|X\theta\|_2^2}{\|\theta\|_2^2} \ge \kappa$ (and as a consequence other parts of the theory are rescaled).
Remark 4. A "typical behavior" for random matrices is that RE is satisfied when $n \gtrsim k\log d$; see Corollary 1 and Section 3.2 in [RWY10] for more precise statements.
The following result is taken from [Rig, Theorem 2.18].
Theorem 7. Consider the Gaussian linear model (1.2). Let $n \ge 2$. Assume that the ground truth satisfies $\|\theta^*\|_0 \le k$, that $X$ satisfies the $(k, 1/2)$-restricted eigenvalue condition, and that each column satisfies $\|X_j\|_2^2 \le 2$. Then the Lasso estimator $\hat\theta^{\mathrm{L}}$ with regularization parameter satisfying
$$ 2\tau = 8\sigma\left( \sqrt{\log(2d)} + \sqrt{\log\frac{1}{\delta}} \right) \tag{1.50} $$
satisfies
$$ \|X\hat\theta^{\mathrm{L}} - X\theta^*\|_2^2 \lesssim k\sigma^2\log(2d/\delta) \tag{1.51} $$
with probability at least $1 - \delta$.
Proof. First, we write out the optimality condition for the Lasso estimator:
$$ \|Y - X\hat\theta^{\mathrm{L}}\|_2^2 \le \|Y - X\theta^*\|_2^2 + 2\tau\|\theta^*\|_1 - 2\tau\|\hat\theta^{\mathrm{L}}\|_1. \tag{1.52} $$
Comparing with our goal of bounding $\|X\hat\theta^{\mathrm{L}} - X\theta^*\|_2$, we see that we should use $Y = X\theta^* + \xi$ to open up the squares in the above inequality. This yields
$$ \|X\hat\theta^{\mathrm{L}} - X\theta^*\|_2^2 \le 2\xi^\top X(\hat\theta^{\mathrm{L}} - \theta^*) + 2\tau\|\theta^*\|_1 - 2\tau\|\hat\theta^{\mathrm{L}}\|_1. \tag{1.53} $$
The hard part now is to control $\xi^\top X(\hat\theta^{\mathrm{L}} - \theta^*)$. The difficulty lies in the fact that $\hat\theta^{\mathrm{L}}$ and $\xi$ are correlated. A few intuitive ideas for handling similar situations in high-dimensional statistics include:
• "Sup-out" $\hat\theta$, i.e., use $\frac{\xi^\top X(\hat\theta^{\mathrm{L}} - \theta^*)}{\|X(\hat\theta^{\mathrm{L}} - \theta^*)\|_2} \le \sup_{x\colon\|x\|_2\le 1}\xi^\top x \lesssim \sqrt{n}$ (with high probability), so that $\xi^\top X(\hat\theta^{\mathrm{L}} - \theta^*) \lesssim \sqrt{n}\,\|X(\hat\theta^{\mathrm{L}} - \theta^*)\|_2$. An example of this argument can be found in [Rig, Theorem 2.2] about the mean square error of the least squares estimator.
• The $\ell_1$-regularization forces $\hat\theta^{\mathrm{L}}$ to be "low-complexity", so that it cannot be too correlated with $\xi$. (In the extreme case where $\hat\theta^{\mathrm{L}}$ is a constant, we obviously have $\mathbb{E}[\xi^\top X(\hat\theta^{\mathrm{L}} - \theta^*)] = 0$.)
The derivation here will essentially flesh out the second idea. With probability $\ge 1 - \delta$ we have
$$
\begin{aligned}
\xi^\top X(\hat\theta^{\mathrm{L}} - \theta^*) &\le \|\xi^\top X\|_\infty \|\hat\theta^{\mathrm{L}} - \theta^*\|_1 && (1.54)\\
&\le \frac{\tau}{2}\,\|\hat\theta^{\mathrm{L}} - \theta^*\|_1, && (1.55)
\end{aligned}
$$
where the first step follows by Hölder's inequality and the second follows by the assumption $\|X_j\|_2^2 \le 2$ for each column and the bound on the Gaussian max in Exercise 1. Combining (1.55) and (1.53), we obtain
$$
\begin{aligned}
\|X\hat\theta^{\mathrm{L}} - X\theta^*\|_2^2 &\le \tau\|\hat\theta^{\mathrm{L}} - \theta^*\|_1 + 2\tau\|\theta^*\|_1 - 2\tau\|\hat\theta^{\mathrm{L}}\|_1 && (1.56)\\
&= \tau\|\hat\theta^{\mathrm{L}}_S - \theta^*_S\|_1 - \tau\|\hat\theta^{\mathrm{L}}_{S^c}\|_1 + 2\tau\|\theta^*_S\|_1 - 2\tau\|\hat\theta^{\mathrm{L}}_S\|_1 && (1.57)\\
&\le 3\tau\|\hat\theta^{\mathrm{L}}_S - \theta^*\|_1 - \tau\|\hat\theta^{\mathrm{L}}_{S^c}\|_1, && (1.58)
\end{aligned}
$$
where we have used $\theta^*_S = \theta^*$ and the triangle inequality. Thus, $3\|\hat\theta^{\mathrm{L}}_S - \theta^*\|_1 \ge \|\hat\theta^{\mathrm{L}}_{S^c}\|_1$ holds, and by Cauchy-Schwarz and the restricted eigenvalue condition,
$$ \|\hat\theta^{\mathrm{L}}_S - \theta^*\|_1 \le \sqrt{k}\,\|\hat\theta^{\mathrm{L}}_S - \theta^*\|_2 \le \sqrt{2k}\,\|X(\hat\theta^{\mathrm{L}} - \theta^*)\|_2. \tag{1.59} $$
Plugging this into (1.58), we find $\|X\hat\theta^{\mathrm{L}} - X\theta^*\|_2^2 \le 3\tau\sqrt{2k}\,\|X(\hat\theta^{\mathrm{L}} - \theta^*)\|_2$, or equivalently
$$ \|X\hat\theta^{\mathrm{L}} - X\theta^*\|_2 \le 3\tau\sqrt{2k}. \tag{1.60} $$
1.4 Basis Pursuit and the Null Space Condition

In this section we introduce the Basis Pursuit (BP) algorithm and its exact recovery property in the noiseless setting under a null space condition. Recall that the Lasso algorithm (1.46) has inputs $Y$ and $X$, and selects $\hat\theta$ by solving the following:
$$ \hat\theta^{\mathrm{L}} \in \mathrm{argmin}_{\theta\in\mathbb{R}^d} \left\{ \|Y - X\theta\|_2^2 + 2\tau\|\theta\|_1 \right\}. \tag{1.61} $$
Now if $X$ has full column rank and $\tau \downarrow 0$ (meaning that $\|Y - X\theta\|_2^2$ has a much higher weight than $\|\theta\|_1$), we see that $\hat\theta^{\mathrm{L}}$ converges to the solution of the following basis pursuit linear program (introduced by [CHE98]):
$$ \hat\theta^{\mathrm{BP}} \in \mathrm{argmin}_{\theta\colon Y = X\theta} \|\theta\|_1. \tag{1.62} $$
Definition 8. Fix $X \in \mathbb{R}^{n\times d}$ and $k \in \{1, 2, \dots\}$. We say $X$ satisfies the exact recovery property of order $k$ if for any $k$-sparse vector $\theta^* \in \mathbb{R}^d$ and
$$ y := X\theta^*, \tag{1.63} $$
the BP algorithm with inputs $X$ and $y$ returns the solution $\hat\theta^{\mathrm{BP}} = \theta^*$.
Note that there is no additive noise in (1.63), hence the generative model is deterministic and we used the lowercase letter $y$.
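Since (1.62) is a linear program, it can be solved with any LP solver. The following is a hedged SciPy sketch (my own formulation via the standard positive/negative split $\theta = u - v$), together with a toy exact-recovery check; the sizes are arbitrary.

```python
# Basis pursuit (1.62) as a linear program: min 1^T(u+v) s.t. X(u-v)=y, u,v >= 0.
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(X, y):
    n, d = X.shape
    c = np.ones(2 * d)                       # objective 1^T u + 1^T v = ||theta||_1
    A_eq = np.hstack([X, -X])                # X(u - v) = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None))
    return res.x[:d] - res.x[d:]

# toy exact-recovery check with a Gaussian design (illustrative sizes only)
rng = np.random.default_rng(4)
n, d, k = 60, 120, 5
X = rng.standard_normal((n, d)) / np.sqrt(n)
theta_star = np.zeros(d)
theta_star[rng.choice(d, k, replace=False)] = rng.standard_normal(k)
theta_bp = basis_pursuit(X, X @ theta_star)
print("max recovery error:", np.max(np.abs(theta_bp - theta_star)))
```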
Definition 9. Fix $X \in \mathbb{R}^{n\times d}$ and $k \in \{1, 2, \dots\}$. We say $X$ satisfies the null space condition of order $k$ if
$$ \|z_S\|_1 < \|z_{S^c}\|_1, \qquad \forall z \in \mathcal{N}(X)\setminus\{0\},\ S\colon |S| \le k, \tag{1.64} $$
where $\mathcal{N}(X)$ denotes the null space of $X$.
The definition of the null space condition appeared in the earlier work of Donoho and Huo [DH06] and Feuer and Nemirovski [FN03], and was further explored by Cohen et al. [CDD09]. The significance of the null space condition is seen in its equivalence to exact recovery by BP in the noiseless setting:
Theorem 10. For given $X \in \mathbb{R}^{n\times d}$ and $k \in \{1, 2, \dots\}$, the null space condition and the exact recovery condition of order $k$ are equivalent.
The equivalence of the null space condition and exact recovery has been discussed in a number of papers; see Cohen et al. [CDD09], Donoho and Huo [DH06], Elad and Bruckstein [EB02], and Feuer and Nemirovski [FN03] for more discussion of restricted null spaces and their equivalence to exact recovery by basis pursuit.
Proof of Theorem 10. First, assume that the null space condition is satisfied. Assume that $\theta^*$ is a $k$-sparse vector with support $S$, and $y := X\theta^*$. To show the exact recovery property, we need to show that for any $\theta' \in \mathbb{R}^d$, $\theta' \neq \theta^*$ satisfying $X\theta' = X\theta^*$, we have
$$ \|\theta'\|_1 > \|\theta^*\|_1. \tag{1.65} $$
Defining $z := \theta' - \theta^* \neq 0$, we see that $z \in \mathcal{N}(X)$, and hence $\|z_S\|_1 < \|z_{S^c}\|_1$ by the null space property. Therefore,
$$
\begin{aligned}
\|\theta'\|_1 &= \|\theta'_S\|_1 + \|\theta'_{S^c}\|_1 && (1.66)\\
&= \|\theta'_S\|_1 + \|z_{S^c}\|_1 && (1.67)\\
&> \|\theta'_S\|_1 + \|z_S\|_1 && (1.68)\\
&\ge \|\theta'_S - z_S\|_1 && (1.69)\\
&= \|\theta^*_S\|_1 && (1.70)\\
&= \|\theta^*\|_1, && (1.71)
\end{aligned}
$$
as desired, hence the exact recovery property holds.
Conversely, suppose that the exact recovery property holds. Pick any $z \in \mathcal{N}(X)$, $z \neq 0$, and any $S$ with $|S| \le k$. Define $\theta^* := z_S$ (where we recall that $z_S$ is defined by $z_S(i) := z(i)\,1_{i\in S}$, $i = 1, \dots, d$) and $\theta' := -z_{S^c}$. We have $X\theta^* = X\theta'$, and therefore the exact recovery property implies $\|\theta^*\|_1 < \|\theta'\|_1$. This is equivalent to $\|z_S\|_1 < \|z_{S^c}\|_1$, and hence the null space property holds.
The equivalence between exact recovery and the null space condition can be extended to certain nonconvex optimization problems (Exercise 6). Furthermore, a version of the null space condition can be shown to be equivalent to the robust recovery property under the noisy observation model $y = X\theta^* + \xi$, that is,
$$ \frac{\|\hat\theta^{\mathrm{BP}} - \theta^*\|_2}{\|\xi\|_2} \le c \tag{1.72} $$
for some constant $c$ independent of $\xi$ and $\theta^*$; see [LJG15] for details.
Comparison with RE.
Proposition 11. The restricted eigenvalue condition (RE) in Definition 6 is stronger than the null space property.
Proof. Suppose that $(k, \kappa)$-RE holds for $X$ (with $\kappa > 0$) but the null space property of order $k$ fails. Then there exist $z \in \mathcal{N}(X)$, $z \neq 0$, and $S$ with $|S| \le k$ such that $\|z_S\|_1 \ge \|z_{S^c}\|_1$. This implies $z_S \neq 0$ and $z \in C_S$, where $C_S$ is as defined in Definition 6. The restricted eigenvalue condition implies
$$ \|Xz\|_2^2 \ge \kappa\|z_S\|_2^2 > 0. \tag{1.73} $$
This contradicts $z \in \mathcal{N}(X)$, hence the null space condition must hold.
1.5 NSC for Random Designs

In this section we show that the null space condition holds with high probability under the random design where $X \in \mathbb{R}^{n\times d}$ has i.i.d. $N(0, 1/n)$ entries and $k \lesssim n/\log d$. Up to the logarithmic term, this is the best we can expect. Our proof is based on Gordon's escape through the mesh theorem.
Definition 12. The Gaussian width of a set $K \subseteq \mathbb{R}^d$ is defined as
$$ w(K) := \mathbb{E}\sup_{x\in K} G^\top x, \tag{1.74} $$
where $G \sim N(0, I_d)$.
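The Gaussian width can be estimated by Monte Carlo whenever the supremum in (1.74) is computable. The sketch below (my own illustration, with arbitrary sizes) does this for two simple sets where the supremum has a closed form.

```python
# Monte Carlo estimate of the Gaussian width (1.74) for two simple sets.
import numpy as np

rng = np.random.default_rng(5)
d, k, n_rep = 400, 40, 2000
G = rng.standard_normal((n_rep, d))

# sup over the whole sphere S^{d-1} is ||G||_2, so w(S^{d-1}) = E||G||_2 ~ sqrt(d)
w_sphere = np.mean(np.linalg.norm(G, axis=1))

# sup over (k-dim coordinate subspace) intersected with the sphere is the norm of
# the first k coordinates of G, so the width is about sqrt(k)
w_subspace_cap = np.mean(np.linalg.norm(G[:, :k], axis=1))

print("w(S^{d-1})      ~", w_sphere, " vs sqrt(d) =", np.sqrt(d))
print("w(subspace cap) ~", w_subspace_cap, " vs sqrt(k) =", np.sqrt(k))
```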
Theorem 13 (Gordon). Let $K$ be a subset of the unit Euclidean sphere $S^{d-1}$ in $\mathbb{R}^d$. Let $\nu$ be a uniformly distributed² random $(d-n)$-dimensional subspace of $\mathbb{R}^d$. Assume that
$$ w(K) < \sqrt{n}. \tag{1.75} $$
Then $\nu \cap K = \emptyset$ with probability at least
$$ 1 - 2.5\exp\left( -\frac{\big(n/\sqrt{n+1} - w(K)\big)^2}{18} \right). \tag{1.76} $$
The proof of Theorem 13 in [Gor88] is based on Gaussian comparison inequalities, which are a type of result establishing inequalities between the (expectations of the) extremal values of two Gaussian processes with related covariance structures. This is usually a topic of a high dimensional probability course which we shall not get into.
With the help of Gordon's theorem we can establish the null space property under i.i.d. Gaussian random designs when $n \gtrsim k\log d$ [RV08], which is essentially the best we can hope for.
² That is, the distribution of the subspace is rotation-invariant; or, the distribution is the Haar measure on the Grassmannian.
Theorem 14. Let $d \ge 10$, $k \ge 1$, and $n \ge 400\,k\log d$. Let $X \in \mathbb{R}^{n\times d}$ be a random matrix where each entry is i.i.d. $N(0, 1/n)$. Then $X$ satisfies the null space condition of order $k$ with probability at least $1 - 2.5\,d^{-k/18}$.
Proof. For any $S \subseteq \{1, 2, \dots, d\}$, define
$$ K_S := \{z \in S^{d-1} : \|z_{S^c}\|_1 \le \|z_S\|_1\} \tag{1.77} $$
and define
$$ K := \bigcup_{S\colon |S|=k} K_S. \tag{1.78} $$
We now wish to upper bound $w(K)$ by controlling the upper tail of $\sup_{x\in K} G^\top x$, which, in turn, is based on controlling the upper tail of $\sup_{x\in K_S} G^\top x$ and applying the union bound over $S$. Note that for any $x \in K_S$, we have $G^\top x = G_S^\top x_S + G_{S^c}^\top x_{S^c}$. Moreover,
$$ G_S^\top x_S \le \|G_S\|_2 \tag{1.79} $$
and
$$
\begin{aligned}
G_{S^c}^\top x_{S^c} &\le \|G_{S^c}\|_\infty \|x_{S^c}\|_1 && (1.80)\\
&\le \|G\|_\infty \|x_{S^c}\|_1 && (1.81)\\
&\le \|G\|_\infty \|x_S\|_1 && (1.82)\\
&\le \|G\|_\infty \sqrt{k}\,\|x_S\|_2 && (1.83)\\
&\le \sqrt{k}\,\|G\|_\infty. && (1.84)
\end{aligned}
$$
Therefore $\sup_{x\in K} G^\top x \le \sup_{S\colon |S|=k} \|G_S\|_2 + \sqrt{k}\,\|G\|_\infty$. We have $\mathbb{E}[\sqrt{k}\,\|G\|_\infty] \le \sqrt{k}\,(\sqrt{2\log 2d} + 1)$ (this can be shown from the bound on the Gaussian max, Exercise 1), whereas
$$
\begin{aligned}
\mathbb{E}\Big[\sup_{S\colon |S|=k}\|G_S\|_2\Big] &= \int_0^\infty \mathbb{P}\Big[\sup_{S\colon |S|=k}\|G_S\|_2 > \lambda\Big]\,d\lambda && (1.85)\\
&\le \sqrt{k} + \sqrt{2k\log d} + \int_{t=k\log d}^\infty \mathbb{P}\Big[\sup_{S\colon |S|=k}\|G_S\|_2 > \sqrt{k} + \sqrt{2t}\Big]\,d(\sqrt{2t}) && (1.86)\\
&\le 2\sqrt{2k\log d} + 2d^k \int_{k\log d}^\infty e^{-t}\,d(\sqrt{2t}) && (1.87)\\
&= 2\sqrt{2k\log d} + 2d^k \int_{\sqrt{2k\log d}}^\infty e^{-s^2/2}\,ds && (1.88)\\
&\le 2\sqrt{2k\log d} + 2, && (1.89)
\end{aligned}
$$
where (1.87) follows by the tail bound on the chi-square distribution (Exercise 7) together with the union bound, which gives an additional $\binom{d}{k} \le d^k$ factor. Thus we have shown that
$$ w(K) \le \sqrt{k}\,(\sqrt{2\log 2d} + 1) + 2\sqrt{2k\log d} + 2 \tag{1.90} $$
$$ \le 10\sqrt{k\log d}, \tag{1.91} $$
which means that the condition (1.75) in Gordon's theorem is satisfied. Applying Gordon's theorem with $\nu := \mathcal{N}(X)$ shows that $\mathcal{N}(X) \cap K = \emptyset$ with probability at least $1 - 2.5\,d^{-k/18}$.
Remark 5. For the restricted eigenvalue condition (which is stronger than the null space condition according to Proposition 11), it is also true that $n \gtrsim k\log d$ suffices; see Corollary 1 in [RWY10] for i.i.d. random designs. In fact, the bound in Corollary 1 of [RWY10] holds for the more general class of random designs where the rows are arbitrary Gaussian vectors with possibly correlated entries. The argument of [RWY10] is lengthier than our Theorem 14, but the ideas are rather standard by today's standards: the goal is essentially to bound the tail probability of a Rayleigh quotient. This is based on two components: 1) the expectation of the Rayleigh quotient is bounded using a comparison theorem for Gaussian processes (similar to Gordon's theorem); 2) the deviation from the expectation is controlled using concentration inequalities. Again, these are common topics in a high dimensional probability course, which we will not touch deeply in our high dimensional statistics course.
1.6 More on Convex Geometry

In the previous section we saw a proof of the null space condition using Gordon's theorem, which says that a set on the sphere with small Gaussian width will miss a random subspace with high probability. As mentioned, the Gaussian width is essentially the mean width from convex geometry. In this section, we discuss a different approach for establishing the null space condition (among other results in high dimensional regression) by Amelunxen, Lotz, McCoy and Tropp [ALMT14]. While Gordon's theorem [Gor88] relied on the Gaussian comparison theorem, the approach of [ALMT14] has a more geometric flavor and also yields more general results.
1.6.1 A General Result on the Intersection of Convex Cones

First definition of the statistical dimension.
A set $K$ in $\mathbb{R}^d$ is said to be a convex cone if it is a cone (i.e., $x \in K$ implies $\lambda x \in K$ for all $\lambda \ge 0$) and is convex. We define the statistical dimension of a closed convex cone $K$ as
$$ \delta(K) := \mathbb{E}[\|P_K(G)\|_2^2], \tag{1.92} $$
where $G \sim N(0, I_d)$ and $P_K(G) := \mathrm{argmin}_{z\in K}\|G - z\|_2$.
An equivalent definition of the statistical dimension (which applies to general sets in $\mathbb{R}^d$) will be given in Section 1.6.3. The statistical dimension of $K$ equals the square of the Gaussian width of $K \cap S^{d-1}$ up to an additive constant; see Exercise 9.
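The statistical dimension can likewise be estimated by Monte Carlo whenever the projection $P_K$ is explicit. The following sketch (my own toy example, not from the notes) checks (1.92) on a coordinate subspace and on the nonnegative orthant.

```python
# Monte Carlo estimate of the statistical dimension (1.92) for two cones with
# explicit projections: a j-dimensional coordinate subspace (delta = j) and the
# nonnegative orthant (delta = d/2).
import numpy as np

rng = np.random.default_rng(6)
d, j, n_rep = 100, 30, 5000
G = rng.standard_normal((n_rep, d))

delta_subspace = np.mean(np.sum(G[:, :j] ** 2, axis=1))        # P_K keeps the first j coords
delta_orthant = np.mean(np.sum(np.maximum(G, 0) ** 2, axis=1)) # P_K is the positive part

print("subspace: MC", delta_subspace, " exact", j)
print("orthant : MC", delta_orthant, " exact", d / 2)
```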
The following result is found in Theorem I in [ALMT14].
Theorem 15. Fix $\eta \in (0, 1)$. Let $C$ and $K$ be closed convex cones in $\mathbb{R}^d$, and let $Q \in \mathbb{R}^{d\times d}$ be a random orthogonal matrix (with rotationally invariant distribution). Then
$$ \delta(C) + \delta(K) \le d - a_\eta\sqrt{d} \implies \mathbb{P}[C \cap QK \neq \{0\}] \le \eta, \tag{1.93} $$
$$ \delta(C) + \delta(K) \ge d + a_\eta\sqrt{d} \implies \mathbb{P}[C \cap QK \neq \{0\}] \ge 1 - \eta, \tag{1.94} $$
where $a_\eta := \sqrt{8\log(4/\eta)}$.
If $K$ is a $(d-n)$-dimensional subspace, it is easy to see from the definition that $\delta(K) = d - n$. Then (1.93) implies that if $\delta(C) \le n - a_\eta\sqrt{d}$ then $\mathbb{P}[C \cap QK \neq \{0\}] \le \eta$. This is already implied by Gordon's theorem (Theorem 13), since $w^2(C \cap S^{d-1}) \le \delta(C) \le w^2(C \cap S^{d-1}) + 1$ (Exercise 9). On the other hand, Gordon's theorem did not provide the converse part (1.94). Note, however, that a converse of Gordon's theorem for specific choices of $C$ is easy to check (Exercise 8), and for the application to the null space condition it is easy to see the near sharpness of the $n \gtrsim k\log d$ bound. Nevertheless, Theorem 15 is surprising since it shows that the statistical dimension/Gaussian width alone is enough to determine the intersection probability, and that a phase transition exists for all convex cones (with a window size of $\sim\sqrt{d}$ for the statistical dimension).
1.6.2 Implication for Basis Pursuit

Given $\theta \in \mathbb{R}^d$, define the descent cone
$$ D(\theta) := \{z \in \mathbb{R}^d : \exists \epsilon > 0,\ \|\theta + \epsilon z\|_1 \le \|\theta\|_1\}, \tag{1.95} $$
that is, the set of directions in which the objective of the basis pursuit is decreased. Using a similar argument as in the proof of the equivalence between NSC and exact recovery (Section 1.4), it is easy to see that
$$ \text{Given } X, \theta,\ \text{BP exactly recovers } \theta \iff D(\theta) \cap \mathcal{N}(X) = \{0\}. \tag{1.96} $$
The difference between this claim and Theorem 10 is that the former is about a condition on $X$ and $\theta$, while the latter is about a condition on $X$ and $k$.
If $\theta \in \mathbb{R}^d$ is $k$-sparse, using an argument similar to Theorem 14 we can show that $\delta(D(\theta)) = \Theta(k\log d)$ (i.e., it equals $k\log d$ up to constant factors); see Exercise 10. Combining this observation with Theorem 15, we obtain:
Proposition 16. Suppose that $X \in \mathbb{R}^{n\times d}$ and its null space is a (uniformly) random $(d-n)$-subspace, and $\theta \in \mathbb{R}^d$ is $k$-sparse ($d \ge 2k$). Then there exist universal constants $c, C > 0$ such that
$$ n \ge C\Big(k\log d + \sqrt{d\log\tfrac{1}{\eta}}\Big) \implies \mathbb{P}[\text{BP exactly recovers } \theta] \ge 1 - \eta; \tag{1.97} $$
$$ n \le c\Big(k\log d - \sqrt{d\log\tfrac{1}{\eta}}\Big) \implies \mathbb{P}[\text{BP exactly recovers } \theta] \le \eta. \tag{1.98} $$
The part (1.97) is already implied by Theorem 14. Moreover, Theorem 14 is stronger since it provides a lower bound on the probability that BP recovers any $k$-sparse vector. Let us also remark that the factor $\log d$ in front of $k$ is not removed in (1.97) even though we are looking at a specific instance of $\theta$; indeed, while the $\log d$ factor can be removed in (1.89) when we bound $\mathbb{E}[\|G_S\|_2]$ instead, the $\log d$ factor in $\mathbb{E}[\sqrt{k}\,\|G\|_\infty]$ still remains.
Finally, let us remark that Theorem 15 has many applications in linear inverse problems beyond the analysis of BP, including demixing and low rank matrix recovery. Please refer to [ALMT14] for more details.
1.6.3 Proof Sketch

In this section we mention some ingredients of the proof of Theorem 15 in [ALMT14]. The material requires more background and is only intended for interested readers.
Intrinsic volumes.
A central idea of the proof is to define the so-called intrinsic volume random variable [MT14, LMN+20], show its concentration property, and establish its connection to the statistical dimension. The intrinsic volume random variable has a distribution on $\{0, 1, \dots, d\}$, where the probabilities are called the (normalized) intrinsic volumes. There are two equivalent definitions:
• One definition, given in Definition 5.1 in [ALMT14], is as follows. First, if $K$ is a polyhedral cone (i.e., an intersection of finitely many half-spaces), then $\tilde{V}_j(K)$ is defined as the probability that $P_K(G)$ lies in the relative interior of a $j$-dimensional face of $K$. As before, $P_K(G)$ denotes the projection of a standard Gaussian vector onto $K$. Then the definition can be extended to a general closed convex cone $K$ via approximation (in the conic Hausdorff metric) by polyhedral cones. It is clear from the definition that
$$ \sum_{j=0}^d \tilde{V}_j(K) = 1. \tag{1.99} $$
• Another definition is through the spherical Steiner formula, for which we refer to [MT14]. Here, for simplicity, let us describe instead the intrinsic volumes of convex bodies (compact convex sets with nonempty interior) via the standard (i.e., not spherical) Steiner formula. Note, though, that convex bodies are not cones and the corresponding intrinsic volumes are different, albeit related, concepts. The intrinsic volumes for convex bodies are closer to our intuitions about volumes in Euclidean space.
The following definition through the mixed volumes is taken from Definition 1.6 in [LMN+20]. Let $K$ be a convex body (compact and having nonempty interior). Let $B$ be the unit ball in $\mathbb{R}^d$. The Steiner formula states that for any $t \ge 0$,
$$ \mathrm{vol}(K + tB) = \sum_{j=0}^d \binom{d}{j} W_j^{(d)}(K)\,t^j, \tag{1.100} $$
where $W_j^{(d)}(K)$, $j = 0, \dots, d$, are nonnegative numbers, called the quermassintegrals. In fact, in general, $\mathrm{vol}(K + tC)$ is a polynomial in $t$ for any convex sets $K$ and $C$, which can be shown by expressing the volume using the support functions of the convex sets (see e.g. [SVH19]), with coefficients being the mixed volumes. The intrinsic volumes are defined by
$$ V_{d-j}(K) = \frac{1}{\kappa_j}\binom{d}{j} W_j^{(d)}(K) \tag{1.101} $$
for $j = 0, \dots, d$, where $\kappa_j$ denotes the volume of the unit ball in $\mathbb{R}^j$. As a matter of fact, the intrinsic volumes are intrinsic: if $K$ is embedded in a higher dimensional space, the intrinsic volumes defined by (1.101) do not change. Now the normalized intrinsic volumes are defined by
$$ \tilde{V}_j(K) := \frac{V_j(K)}{W(K)}, \qquad j = 0, \dots, d, \tag{1.102} $$
where
$$ W(K) := \sum_{j=0}^d V_j(K). \tag{1.103} $$
Finally, let us comment that the distribution of the (conic) intrinsic volume random variable is invariant under scaling, since if $K$ is a closed convex cone, then $tK = K$ for any $t > 0$. In contrast, the distribution of the intrinsic volume random variable of a convex body is not invariant under scaling. Indeed, $V_j$ is $j$-homogeneous; in fact, $\lim_{t\to\infty} \tilde{V}_j(tK) = 1\{j = d\}$ for a convex body $K$.
Second definition of the statistical dimension.
The statistical dimension defined in (1.92) can be alternatively defined as the expectation of the intrinsic volume random variable. The
proof of equivalence is shown in [MT14].
Conic kinematic formula.
The proof of Theorem 15 in [ALMT14] relied on the following conic kinematic formula.
Theorem 17. Let $C$ and $K$ be closed convex cones in $\mathbb{R}^d$, at least one of which is not a subspace³, and let $Q$ be a (uniformly) random orthogonal matrix. Then
$$ \mathbb{P}[C \cap QK \neq \{0\}] = \sum_{i=0}^d \sum_{j=0}^{i-1} \big(1 + (-1)^{i-j+1}\big)\,\tilde{V}_i(C)\,\tilde{V}_{d-j}(K). \tag{1.104} $$
³ The formula does not hold when both $C$ and $K$ are linear subspaces, in which case $\tilde{V}$ becomes the indicator of the dimension. One might think of approximating a subspace by convex cones which are not subspaces and applying a continuity argument for $\tilde{V}$; however, approximation of a subspace is not so easy due to the convex cone constraint.
We refer to p. 260 of [SW08] for this result. In the well-cited paper [ALMT14], this result was presented in Fact 2.1 as the following formula:
$$ \mathbb{P}[C \cap QK \neq \{0\}] = \sum_{i=0}^d \big(1 + (-1)^{i+1}\big) \sum_{j=i}^d \tilde{V}_i(C)\,\tilde{V}_{d+i-j}(K), \tag{1.105} $$
which appears to be incorrect (the formula is not symmetric in the roles of $C$ and $K$, and it fails when we consider the example where the intrinsic volume random variables for $C$ and $K$ are concentrated around some $i_0$ and $j_0$).
In fact, (5.8) there states an equivalent but more compact formula:
$$ \mathbb{P}[C \cap QK \neq \{0\}] = \sum_{k=d+1}^{2d} \big(1 + (-1)^{k-d+1}\big)\,\tilde{V}_k(C \times K), \tag{1.106} $$
where $C \times K$ denotes the product set in $\mathbb{R}^{2d}$. Using the fact that the conic intrinsic volumes of a product set can be computed by convolution (Corollary 5.1 in [MT14]),
$$ \tilde{V}_k(C \times K) = \sum_{i+j=k} \tilde{V}_i(C)\,\tilde{V}_j(K), \tag{1.107} $$
we obtain (1.104).
Concentration of the intrinsic volume random variable.
The last key technical ingredient for the proof of Theorem 15 is the concentration property of the intrinsic volume random variable. This was shown in Theorem 6.1 in [ALMT14] and improved in [MT14] for convex cones, and in Theorem 1.11 in [LMN+20] for convex bodies (via a beautiful information theoretic argument). Concentration states that the intrinsic volume random variable is close to its mean with high probability. For example, a basic (though not the strongest possible) result from Theorem 1.11 in [LMN+20] states that
$$ \mathrm{Var}(Z_K) \le 4d, \tag{1.108} $$
where $Z_K$ denotes the intrinsic volume random variable associated with the convex body $K$. (This is nontrivial since a random variable on $\{0, 1, \dots, d\}$ may have a variance as large as $d^2/4$.) An analogous variance bound for the conic intrinsic volume random variable can be found in Theorem 4.5 in [LMN+20]. High probability bounds can then be deduced from the variance bound via the Chebyshev inequality.
Using Theorem 17 and the concentration of the intrinsic volume random variables $Z_K$ and $Z_C$, Theorem 15 can be proved in about one page (see [ALMT14]). Here we give a short heuristic proof just to illustrate the idea. Note that the double sum in (1.104) is over $i$ and $j$ such that $i - j$ is odd. An interlacing property [ALMT14] shows that we can approximate the sum by removing this parity constraint while dropping the factor of 2. Thus
$$
\begin{aligned}
\mathbb{P}[C \cap QK \neq \{0\}] &\approx \sum_{i=0}^d \sum_{j=0}^{i-1} \tilde{V}_i(C)\,\tilde{V}_{d-j}(K) && (1.109)\\
&= \sum_{i=0}^d \sum_{l=0}^d \tilde{V}_i(C)\,\tilde{V}_l(K)\,1_{i+l\ge d+1} && (1.110)\\
&= \mathbb{E}[1\{Z_C + Z_K \ge d+1\}] && (1.111)\\
&\approx 1\{\mathbb{E}[Z_C + Z_K] \ge d+1\}, && (1.112)
\end{aligned}
$$
where the last approximation follows since $Z_C$ and $Z_K$ are close to their expectations with high probability (concentration). Theorem 15 then follows since $\mathbb{E}[Z_C] = \delta(C)$.
Chapter 2
Graphical Lasso
This chapter is about graphical lasso, which is an algorithm for learning the covariance matrix under a sparsity assumption on the precision matrix (inverse of the covariance matrix).
2.1 Gaussian Graphical Model

Consider a random vector $X \in \mathbb{R}^p$ following the normal distribution $N(0, \Sigma)$. The precision matrix is defined as
$$ \Theta = \Sigma^{-1}. \tag{2.1} $$
We can construct an undirected graph $(V, E)$ to encode the conditional independence structure of the coordinates of $X$: let $V = \{1, \dots, p\}$ and $E \subseteq V\times V$, where
$$ (u, v) \notin E \iff X_u \perp X_v \mid X_{V\setminus\{u,v\}}, \tag{2.2} $$
where $X_u \perp X_v \mid X_{V\setminus\{u,v\}}$ denotes conditional independence (that is, there is a Markov chain $X_u - X_{V\setminus\{u,v\}} - X_v$). By the Hammersley-Clifford theorem¹, for any $u, v \in \{1, \dots, p\}$,
$$ X_u \perp X_v \mid X_{V\setminus\{u,v\}} \iff \Theta_{u,v} = 0. \tag{2.3} $$
In fact, for Gaussian distributions, we can directly prove (2.3) by computing the conditional covariance (Exercise 11). By (2.3), we see that in many applications $\Theta$ is a sparse matrix, since the underlying graphs in many graphical models are sparse.
Suppose that there are $n$ i.i.d. samples $x^{(1)}, \dots, x^{(n)}$ from the distribution $N(0, \Theta^{-1})$; how can we estimate the covariance? The first idea is maximum likelihood estimation. Note that, up to additive constants, the log-likelihood function is
$$
\begin{aligned}
\ell(\Theta) &= \sum_{i=1}^n \log\Big( \det{}^{1/2}(\Theta)\,\exp\big(-\tfrac{1}{2}x^{(i)\top}\Theta x^{(i)}\big) \Big) && (2.4)\\
&= \frac{n}{2}\log\det(\Theta) - \frac{1}{2}\sum_{i=1}^n \mathrm{tr}\big(x^{(i)}x^{(i)\top}\Theta\big) && (2.5)\\
&= \frac{n}{2}\log\det(\Theta) - \frac{n}{2}\mathrm{tr}(S\Theta), && (2.6)
\end{aligned}
$$
where $S = \frac{1}{n}\sum_{i=1}^n x^{(i)}x^{(i)\top}$ denotes the sample covariance matrix.
¹ The theorem states that for a general Markov network, the Markov and factorization properties are equivalent if the probability distribution has a strictly positive mass or density (e.g. [LW13]). The latter is always fulfilled in the Gaussian case.
Classical theory tells us that for fixed $p$, the maximum likelihood estimator (MLE) converges to the truth as $n \to \infty$. However, in the regime of $p > n$, the MLE does not even exist (Exercise 12).
The graphical lasso [FHT08] addresses this issue by leveraging the sparsity of $\Theta$ and adding an $\ell_1$ penalty to the objective function:
$$ \mathrm{minimize}_{\Theta \succeq 0}\quad -\log\det\Theta + \mathrm{tr}(S\Theta) + \lambda\|\Theta\|_1. \tag{2.7} $$
Here, $A \succeq B$ means $A - B$ is a positive-semidefinite matrix; $\|\Theta\|_1$ denotes the sum of the absolute values of the entries of $\Theta$, and $\lambda > 0$ is a tuning parameter. This is a convex optimization problem (Exercise 13) with $\sim p^2$ parameters to optimize.
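In practice, (2.7) can be solved with off-the-shelf software. Below is a hedged usage sketch with scikit-learn's GraphicalLasso (its penalty parameter is called alpha, and implementation details such as the treatment of the diagonal may differ slightly from (2.7) as written); the graph, sizes, and alpha are arbitrary choices of mine.

```python
# Usage sketch of the graphical lasso on data from a sparse precision matrix.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(7)
p, n = 20, 200

# chain-graph ground truth: tridiagonal, positive-definite precision matrix
Theta = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
Sigma = np.linalg.inv(Theta)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

model = GraphicalLasso(alpha=0.1).fit(X)
Theta_hat = model.precision_
print("estimated nonzero pattern (first 5x5 block):")
print((np.abs(Theta_hat[:5, :5]) > 1e-3).astype(int))
```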
2.2 Dual Formulation

The problem (2.7) is convex since each summand is convex in $\Theta$. The dual convex problem is the following:
$$ \mathrm{maximize}_{\Gamma\colon \|\Gamma\|_\infty \le \lambda}\quad \log\det(S + \Gamma) + p. \tag{2.8} $$
The equivalence between (2.7) and (2.8) can be shown by using the KKT conditions to obtain properties of the optimizers [MH12]. Here, however, we introduce a systematic way of writing out the dual of any convex optimization problem.
Theorem 18. Let $\Lambda_j\colon \mathbb{R}^p \to \mathbb{R}\cup\{+\infty\}$ be convex functions, $j = 1, \dots, k$, for some positive integer $k$. Suppose that there exist some $u_1, \dots, u_k$ such that $u_1 + \cdots + u_k = 0$ and $\Lambda_j(u_j) < \infty$, $j = 1, \dots, k$, and $\Lambda_j$ is upper semicontinuous at $u_j$ for some $j$. Then
$$ -\inf_{v\in\mathbb{R}^p} \sum_{j=1}^k \Lambda_j^*(v) = \inf_{u_1,\dots,u_k\in\mathbb{R}^p\colon u_1+\cdots+u_k=0}\ \sum_{j=1}^k \Lambda_j(u_j), \tag{2.9} $$
where $\Lambda_j^*$ denotes the convex conjugate of $\Lambda_j$.
Theorem 18 is a generalization of Fenchel's duality theorem in convex analysis, the latter being the special case of $k = 2$ (and usually presented in a slightly different form). In fact, Theorem 18 holds in a very general setting where $\mathbb{R}^p$ is replaced by a general topological vector space; see Theorem 4 in [LCCV18].
Equipped with Theorem 18, we can easily show:
Theorem 19. (2.7) and (2.8) are equivalent.
Proof. We would like to have
$$ \Lambda_1^*(\Theta) = -\log\det(\Theta); \tag{2.10} $$
$$ \Lambda_2^*(\Theta) = \mathrm{tr}(S\Theta); \tag{2.11} $$
$$ \Lambda_3^*(\Theta) = \lambda\|\Theta\|_1 \tag{2.12} $$
for $\Theta \in \mathbb{R}^{p^2}$; this is a $p^2$-dimensional space with inner product $\langle A, B\rangle = \mathrm{tr}(AB)$. If $\Theta$ is not positive definite then it is understood that $\Lambda_1^*(\Theta) = +\infty$. Assuming that the double conjugates are the functions themselves, we can compute
$$
\begin{aligned}
\Lambda_1(\Gamma) &= \sup_{\Theta\in\mathbb{R}^{p^2}} \{\mathrm{tr}(\Gamma\Theta) + \log\det(\Theta)\} && (2.13)\\
&= \sup_{\Theta\ \mathrm{psd}} \{\mathrm{tr}(\Gamma\Theta) + \log\det(\Theta)\} && (2.14)\\
&= -p - \log\det(-\Gamma); && (2.15)\\
\Lambda_2(\Gamma) &= \sup_{\Theta\in\mathbb{R}^{p^2}} \{\mathrm{tr}(\Gamma\Theta) - \mathrm{tr}(S\Theta)\} && (2.16)\\
&= \begin{cases} 0 & \Gamma = S \\ \infty & \text{otherwise;} \end{cases} && (2.17)\\
\Lambda_3(\Gamma) &= \sup_{\Theta\in\mathbb{R}^{p^2}} \{\mathrm{tr}(\Gamma\Theta) - \lambda\|\Theta\|_1\} && (2.18)\\
&= \begin{cases} 0 & \|\Gamma\|_\infty \le \lambda \\ \infty & \text{otherwise.} \end{cases} && (2.19)
\end{aligned}
$$
Indeed, after obtaining these formulae we can directly check that $\Lambda_j^*(\Theta) = \sup_\Gamma\{\mathrm{tr}(\Gamma\Theta) - \Lambda_j(\Gamma)\}$ holds. Now
$$
\begin{aligned}
\sup_{\Gamma_1+\Gamma_2+\Gamma_3=0} \{-\Lambda_1(\Gamma_1) - \Lambda_2(\Gamma_2) - \Lambda_3(\Gamma_3)\} &= \sup_{\|\Gamma_3\|_\infty\le\lambda,\ \Gamma_2=S} \{p + \log\det(\Gamma_2 + \Gamma_3)\} && (2.20)\\
&= \sup_{\|\Gamma_3\|_\infty\le\lambda} \{p + \log\det(S + \Gamma_3)\}, && (2.21)
\end{aligned}
$$
which is the formula for the dual of the graphical lasso. The equivalence to the graphical lasso is seen from Theorem 18.
2.3 Blockwise Coordinate Descent
The problem (1.46) can be solved very quickly using the coordinate descent algorithm [WL+08], whereby the coordinates are updated iteratively via line search. The graphical lasso problem (2.7) can also be solved by a blockwise coordinate descent algorithm (Friedman, Hastie, and Tibshirani [FHT08]), which can easily handle problem sizes of $p = 1000$.
The blockwise coordinate descent algorithm actually uses a lasso solver as a subroutine. The latter, of course, is a convex optimization problem and can be solved in the primal or the dual form. Moreover, the graphical lasso problem (2.7) itself can also be approached in either the primal or the dual form. Thus there are at least four versions of the algorithm from these combinations [MH12]. Here, we briefly describe the one which is perhaps the simplest: the so-called primal-glasso (P-glasso) in [MH12].
By setting the gradient of (2.7) to zero, we see that the optimizer $\Theta$ must satisfy
$$ -\Theta^{-1} + S + \lambda\,\mathrm{sign}(\Theta) = 0. \tag{2.22} $$
The algorithm proceeds by iteratively updating $\Theta$, leveraging (2.22). Suppose that at the $i$-th iteration the matrix obtained previously is $\Theta^{(i-1)}$, and we pick an index $l \in \{1, \dots, p\}$; in the $i$-th iteration we update the $l$-th row and column of $\Theta$. The index $l$ is selected cyclically through the iterations; for simplicity of the subsequent discussion, we suppose that $l = p$ is the last index. The matrix $\Theta^{(i)}$ can then be viewed as a $2\times 2$ block matrix, represented as
$$ \begin{pmatrix} \Theta_{11}^{(i)} & \Theta_{12}^{(i)} \\ \Theta_{21}^{(i)} & \Theta_{22}^{(i)} \end{pmatrix}. \tag{2.23} $$
Hereafter, note that notations such as $\Theta_{12}^{(i)}$ denote submatrices in this block form (rather than coordinates of the original matrix).
Checking the condition (2.22) for the upper right block, we get
$$ \frac{1}{\Theta_{22} - \Theta_{21}\Theta_{11}^{-1}\Theta_{12}}\,\Theta_{11}^{-1}\Theta_{12} + S_{12} + \lambda\,\mathrm{sign}(\Theta_{12}) = 0, \tag{2.24} $$
where we have invoked the Schur complement theorem for the block matrix inverse. While (2.24) is a complicated formula in $\Theta$, the idea of iterative algorithms is to replace the complicated parts by the values of the previous round. We posit that
$$ A\,\Theta_{12}^{(i)} + S_{12} + \lambda\,\mathrm{sign}(\Theta_{12}^{(i)}) = 0, \tag{2.25} $$
where
$$ A := \frac{1}{\Theta_{22}^{(i-1)} - \Theta_{21}^{(i-1)}(\Theta_{11}^{(i-1)})^{-1}\Theta_{12}^{(i-1)}}\,(\Theta_{11}^{(i-1)})^{-1}. \tag{2.26} $$
Now, (2.25) indicates that we can find $\Theta_{12}^{(i)}$ by solving the following lasso problem:
$$ \mathrm{minimize}_{\beta\in\mathbb{R}^{p-1}}\quad \frac{1}{2}\beta^\top A\beta + S_{12}^\top\beta + \lambda\|\beta\|_1. \tag{2.27} $$
Once $\Theta_{12}^{(i)}$ is obtained, we compute $\Theta_{22}^{(i)}$: checking the condition (2.22) for the lower right block, we get
$$ (\Theta^{-1})_{22} = s_{22} + \lambda. \tag{2.28} $$
This suggests that we can compute
$$ \Theta_{22}^{(i)} = \frac{1}{(\Theta^{-1})_{22}} + \Theta_{21}^{(i)}(\Theta_{11}^{(i-1)})^{-1}\Theta_{12}^{(i)} \tag{2.29} $$
according to the Schur complement theorem, where $(\Theta^{-1})_{22}$ is computed from (2.28). Finally, after $\Theta_{12}^{(i)}$ (and consequently $\Theta_{21}^{(i)}$) and $\Theta_{22}^{(i)}$ are computed, the $i$-th iteration is complete. The iterations continue until convergence.
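To make the inner step concrete, here is a minimal coordinate-descent sketch for the lasso subproblem (2.27). This is only the subroutine, not the full blockwise loop, and the variable names and the test matrix are my own.

```python
# Coordinate descent for minimize (1/2) b^T A b + s^T b + lam * ||b||_1, A positive definite.
import numpy as np

def lasso_subproblem(A, s, lam, n_sweeps=200):
    p = len(s)
    b = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            # r_j = s_j + sum_{k != j} A[j, k] b_k is the "partial residual" for coordinate j
            r_j = s[j] + A[j] @ b - A[j, j] * b[j]
            # one-dimensional problem in b_j is solved exactly by soft thresholding
            b[j] = np.sign(-r_j) * max(abs(r_j) - lam, 0.0) / A[j, j]
    return b

# tiny usage example with a random positive-definite A (illustrative only)
rng = np.random.default_rng(8)
p = 10
M = rng.standard_normal((p, p))
A = M @ M.T + p * np.eye(p)
s = rng.standard_normal(p)
print(lasso_subproblem(A, s, lam=0.5))
```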
Chapter 3
Matrix Estimation
The problem of estimating a sparse vector discussed in Chapter 1 has been well studied, and many of its basic properties are now understood. In recent years, matrices and tensors have become increasingly popular in high-dimensional statistics. Some argue that many practical problems are matrix recovery in disguise; one famous application is collaborative filtering (as popularized by the Netflix prize). For a recent video tutorial on the subject, see [YS]. While matrices and tensors can certainly be thought of as extensions of vectors, not all the properties and results immediately carry over, and many basic statistical and computational questions about matrix and tensor estimation are under very active research today.
In this chapter we get a glimpse of matrix estimation through a basic problem of prediction in an additive noise model with linear measurements. There are other common variants of the problem, such as estimating a matrix from incomplete measurements of its entries, which we will not touch on. However, a common theme in these problems is that the low-rankness of the matrix is taken advantage of, just as sparsity is exploited in vector estimation.
3.1 Multivariate Regression: Setup

We consider the following multivariate linear regression model:
$$ Y = X\Theta^* + E, \tag{3.1} $$
where $X \in \mathbb{R}^{n\times d}$ is the design matrix, $\Theta^* \in \mathbb{R}^{d\times T}$ is the matrix of unknown parameters, $E$ is a Gaussian noise matrix with independent $N(0, \sigma^2)$ components, and $Y \in \mathbb{R}^{n\times T}$ contains the observations. The goal is prediction, i.e., to estimate $X\Theta^*$ given $X$ and $Y$.
The following is an example application scenario: suppose that there are $n$ movies, $d$ features of the movies (e.g., duration, music quality, plot quality, ...), and $T$ persons. The matrix $X$ contains values quantifying the quality of each movie with respect to each feature. The matrix $\Theta^*$ contains values quantifying the preference of each person for each feature. Thus, it makes sense that $Y$ quantifies the level of preference of each person for each movie.
Note that each row of $\Theta^*$ indicates how a particular feature is preferred by all the persons. It is possible that a certain feature (e.g. the duration of the movie) is not quite relevant for the rating, in which case the corresponding row of $\Theta^*$ is zero. In general, we might expect that there are many zero rows in $\Theta^*$; in other words, the columns of $\Theta^*$ share the same sparsity pattern.
If we assume that there are at most $k$ nonzero rows in $\Theta^*$, then naively we can just apply the Lasso column by column to obtain $\hat\Theta$. By Theorem 7, the mean square error is then
$$ \|X\hat\Theta - X\Theta^*\|_F^2 \lesssim \sigma^2 kT\log d \tag{3.2} $$
with constant probability. Here, $\|A\|_F := \sqrt{\sum_{ij} A_{ij}^2}$ denotes the Frobenius norm.
In the next section, we will show that we can do better than (3.2) by taking into account the interactions among the columns of $\Theta^*$. First, the sparsity level $k$ can be reduced to the rank of $\Theta^*$. Second, the factor $\log d$, which is the price to pay for not knowing the sparsity pattern, can be removed.
3.2 Penalization by Rank

The low-rankness of $\Theta^*$ prompts the consideration of the following estimator:
$$ \hat\Theta^{\mathrm{RK}} := \mathrm{argmin}_{\Theta\in\mathbb{R}^{d\times T}} \left\{ \frac{1}{n}\|Y - X\Theta\|_F^2 + 2\tau^2\,\mathrm{rank}(\Theta) \right\}. \tag{3.3} $$
We call this the rank-penalized estimator with regularization parameter $\tau^2$. At first sight, this looks similar to the BIC estimator introduced in Chapter 1. However, unlike BIC, $\hat\Theta^{\mathrm{RK}}$ can be computed efficiently (Exercise 14).
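Concretely, the reduction in Exercise 14 suggests the following computation of (3.3): project $Y$ onto the column space of $X$, then keep exactly those singular values whose squared size exceeds the penalty $2n\tau^2$. The sketch below is my own implementation under that reduction, with arbitrary toy sizes.

```python
# Rank-penalized estimator (3.3) via projection + SVD truncation (cf. Exercise 14).
import numpy as np

def rank_penalized(X, Y, tau):
    n = X.shape[0]
    Y_bar = X @ np.linalg.pinv(X) @ Y            # orthogonal projection of Y onto im(X)
    U, svals, Vt = np.linalg.svd(Y_bar, full_matrices=False)
    keep = svals**2 > 2 * n * tau**2             # dropping sigma_j costs sigma_j^2/n,
                                                 # keeping it costs 2*tau^2 in the penalty
    Y_fit = (U[:, keep] * svals[keep]) @ Vt[keep]
    return np.linalg.pinv(X) @ Y_fit             # a Theta with X @ Theta = Y_fit

# toy usage: a rank-3 ground truth is typically recovered with the correct rank
rng = np.random.default_rng(9)
n, d, T, r, sigma = 100, 20, 15, 3, 0.5
X = rng.standard_normal((n, d))
Theta_star = rng.standard_normal((d, r)) @ rng.standard_normal((r, T))
Y = X @ Theta_star + sigma * rng.standard_normal((n, T))
tau = 2 * sigma * np.sqrt(max(d, T) / n)
print("estimated rank:", np.linalg.matrix_rank(rank_penalized(X, Y, tau), tol=1e-8))
```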
Now, let us also show that $\hat\Theta^{\mathrm{RK}}$ enjoys a good statistical property (the following is adapted from Theorem 5.5 in [Rig]).
Theorem 20. Let
$$ \tau := 4\sigma\sqrt{\frac{\log(12)\max\{d, T\}}{n}} + 2\sigma\sqrt{\frac{2\log(1/\delta)}{n}}. $$
Then with probability $1-\delta$,
$$ \|X\hat\Theta^{\mathrm{RK}} - X\Theta^*\|_F^2 \le 8n\,\mathrm{rank}(\Theta^*)\,\tau^2 \lesssim \sigma^2\,\mathrm{rank}(\Theta^*)\Big(\max\{d, T\} + \log\frac{1}{\delta}\Big). \tag{3.4} $$
It is interesting to note that $X$ does not enter the bound.
Proof. As usual, the optimality condition implies that
$$ \|Y - X\hat\Theta^{\mathrm{RK}}\|_F^2 + 2n\tau^2\,\mathrm{rank}(\hat\Theta^{\mathrm{RK}}) \le \|Y - X\Theta^*\|_F^2 + 2n\tau^2\,\mathrm{rank}(\Theta^*), \tag{3.5} $$
which is equivalent to
$$ \|X\hat\Theta^{\mathrm{RK}} - X\Theta^*\|_F^2 \le 2\big\langle E, X\hat\Theta^{\mathrm{RK}} - X\Theta^*\big\rangle - 2n\tau^2\,\mathrm{rank}(\hat\Theta^{\mathrm{RK}}) + 2n\tau^2\,\mathrm{rank}(\Theta^*). \tag{3.6} $$
As usual, the inner product term is the main item we need to control; intuitively we need to leverage the "low complexity" of $\hat\Theta^{\mathrm{RK}}$ to show that the two terms in the inner product are approximately uncorrelated. By Young's inequality,
$$ 2\big\langle E, X\hat\Theta^{\mathrm{RK}} - X\Theta^*\big\rangle \le 2\langle E, U\rangle^2 + \frac{1}{2}\|X\hat\Theta^{\mathrm{RK}} - X\Theta^*\|_F^2, \tag{3.7} $$
where
$$ U := \frac{X\hat\Theta^{\mathrm{RK}} - X\Theta^*}{\|X\hat\Theta^{\mathrm{RK}} - X\Theta^*\|_F} \tag{3.8} $$
is a unit direction. We then "sup-out" the unit direction to decouple the noise $E$ and the optimizer $\hat\Theta^{\mathrm{RK}}$. For notational simplicity, let us assume below that $X$ has full column rank (if not, we can consider all inner products in the column space of $X$ instead; this can be done as an exercise, or see Theorem 5.5 in [Rig]). Then
$$
\begin{aligned}
\langle E, U\rangle^2 &\le \|E\|_{\mathrm{op}}^2\,\|U\|_{s_1}^2 && (3.9)\\
&\le \|E\|_{\mathrm{op}}^2\,\mathrm{rank}(U) && (3.10)\\
&= \|E\|_{\mathrm{op}}^2\,\mathrm{rank}(\hat\Theta^{\mathrm{RK}} - \Theta^*) && (3.11)\\
&\le \|E\|_{\mathrm{op}}^2\big(\mathrm{rank}(\hat\Theta^{\mathrm{RK}}) + \mathrm{rank}(\Theta^*)\big), && (3.12)
\end{aligned}
$$
where $\|\cdot\|_{s_p}$ denotes the Schatten $p$-norm and $\|\cdot\|_{\mathrm{op}} = \|\cdot\|_{s_\infty}$ denotes the operator norm of a matrix, i.e., the largest singular value. The first inequality above is the matrix Hölder inequality; the second inequality follows since, by Cauchy-Schwarz, $\|U\|_{s_1}^2 \le \mathrm{rank}(U)\,\|U\|_F^2 = \mathrm{rank}(U)$. Finally, random matrix theory provides us the estimate
$$ \|E\|_{\mathrm{op}}^2 \le n\tau^2 \tag{3.13} $$
for the top singular value with probability $1-\delta$, which gives
$$ \langle E, U\rangle^2 \le n\tau^2\big(\mathrm{rank}(\hat\Theta^{\mathrm{RK}}) + \mathrm{rank}(\Theta^*)\big). \tag{3.14} $$
But (3.6) and (3.7) tell us
$$ \|X\hat\Theta^{\mathrm{RK}} - X\Theta^*\|_F^2 \le 4\langle E, U\rangle^2 - 4n\tau^2\,\mathrm{rank}(\hat\Theta^{\mathrm{RK}}) + 4n\tau^2\,\mathrm{rank}(\Theta^*), \tag{3.15} $$
which, together with (3.14), gives the desired result.
Remark 6. More generally, the result of Theorem 20 holds in the setting where the error matrix $E$ is subgaussian, i.e., has Gaussian-like tails when projected onto any direction; see [Rig].
Remark 7. While the rank-penalized estimator can be computed efficiently, it is also worth considering the "$\ell_1$ counterpart", that is,
$$ \mathrm{argmin}_{\Theta\in\mathbb{R}^{d\times T}} \left\{ \frac{1}{n}\|Y - X\Theta\|_F^2 + \tau\|\Theta\|_{s_1} \right\}. \tag{3.16} $$
Its error property is similar to (3.4), but with an additional multiplicative factor of the condition number of $X$. Its analysis, however, is quite involved, rather than a simple adaptation of the proof of the error estimates for the Lasso. See [KLT+11].
3.3 Matrix Completion

Let $\Theta^* \in \mathbb{R}^{d\times T}$ be an unknown matrix. Let $Y$ be a set of incomplete observations of the entries of $\Theta^*$; we may think of $Y$ as a matrix obtained by replacing some entries of $\Theta^*$ by a question mark.
More generally, we may formulate a problem where $Y = \mathcal{A}(\Theta^*)$ is a set of linear measurements of $\Theta^*$:
$$ Y_1 := \mathcal{A}_1(\Theta^*) = \langle A_1, \Theta^*\rangle \tag{3.17} $$
$$ \vdots \tag{3.18} $$
$$ Y_n := \mathcal{A}_n(\Theta^*) = \langle A_n, \Theta^*\rangle, \tag{3.19} $$
where $A_1, \dots, A_n \in \mathbb{R}^{d\times T}$.
If $\Theta^*$ is low-rank, we may recover it from $Y$ (and the known $A_1, \dots, A_n$) by the following:
$$ \mathrm{argmin}_{\Theta\colon Y = \mathcal{A}(\Theta)}\ \mathrm{rank}(\Theta). \tag{3.20} $$
Unfortunately, unlike the multivariate regression problem, where the linear observation is of the specialized form $\mathcal{A}(\Theta^*) = X\Theta^*$, we cannot solve (3.20) efficiently. Nevertheless, we can consider the convex relaxation
$$ \mathrm{argmin}_{\Theta\colon Y = \mathcal{A}(\Theta)}\ \|\Theta\|_{s_1}, \tag{3.21} $$
which can also be recast as a semidefinite programming problem; see [CR09].
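For the entrywise-observation special case, the relaxation (3.21) can be handed to a generic convex solver. The following is a hedged CVXPY sketch (assuming the cvxpy package is available; the sizes and observation rate are arbitrary choices of mine).

```python
# Nuclear-norm relaxation (3.21) for matrix completion with a generic convex solver.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(10)
d, T, r = 15, 15, 2
Theta_star = rng.standard_normal((d, r)) @ rng.standard_normal((r, T))
mask = (rng.random((d, T)) < 0.6).astype(float)     # observe roughly 60% of the entries

Theta = cp.Variable((d, T))
constraints = [cp.multiply(mask, Theta) == mask * Theta_star]  # match observed entries
problem = cp.Problem(cp.Minimize(cp.norm(Theta, "nuc")), constraints)
problem.solve()

err = np.linalg.norm(Theta.value - Theta_star) / np.linalg.norm(Theta_star)
print("relative recovery error:", err)   # often small when enough entries are observed
```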
Chapter 4
Problem Set
[updated weekly. A total of ≥ 25 points of problems due before the
midterm and another ≥ 25 points of problems due before the final.]
Exercise 1 (2 points). Use the union bound to prove the estimate
of the Gaussian max in (1.11). In fact, note that the bound holds
even when ξi’s are correlated.
Exercise 2 (2 points). Show that the soft thresholding estimator (see (1.21)) achieves the oracle $\ell_2$ error up to a logarithmic factor.
Exercise 3 (2 points). In the section on the James-Stein estimator,
suppose that the noise may be biased, i.e., ξi ∼ N (µ, σ 2) for some
µ ∈ R. Find an analogue of the James-Stein estimator in this setting,
so that the naive estimator θ̂naive = Y −µ is shown to be inadmissible.
Exercise 4 (4 points). [Gaussian integration by parts] Suppose that $f$ is a smooth function with compact support on $\mathbb{R}^d$. Let $Y \sim N(\theta, \sigma^2 I_{d\times d})$. Use integration by parts to show that
$$ \mathbb{E}_\theta[(\theta_i - Y_i)f(Y)] = -\sigma^2\,\mathbb{E}_\theta\Big[\frac{\partial f}{\partial y_i}(Y)\Big] \tag{4.1} $$
for $i = 1, \dots, d$.
Note: once the formula is proven for nice (smooth, compactly supported) functions, it can be extended to a more general class of functions by standard approximation arguments.
Exercise 5 (2 points). Show that when n = d and X = Id, BIC and
Lasso are reduced to the hard-thresholding and the soft thresholding
estimators for the Gaussian sequence model.
Exercise 6 (4 points). Suppose that the basis pursuit algorithm in (1.62) is replaced by the following¹:
$$ \hat\theta^{\mathrm{BP}} \in \mathrm{argmin}_{\theta\colon Y = X\theta} \|\theta\|_q, \tag{4.2} $$
where $q \in (0, \infty)$ and $\|\theta\|_q := \big(\sum_{j=1}^d |\theta_j|^q\big)^{1/q}$. For such general $q$, write a counterpart of the null space condition (Definition 9), and prove a counterpart of the equivalence result, Theorem 10.
¹ Note this is no longer a convex optimization problem when $q < 1$. In practice we can try to solve this optimization by gradient descent, and the performance turns out to be quite good empirically, although there are no theoretical guarantees of finding the global minimum.
Exercise 7 (2 points). Use Chernoff's inequality to prove the following chi-square tail bound: for $X^d \sim N(0, I_d)$ and any $t > 0$,
$$ \mathbb{P}\big[\,\big|\|X^d\|_2^2 - d\big| \ge 2\sqrt{dt} + 2t\,\big] \le 2e^{-t}. \tag{4.3} $$
[Hint: p. 43 of [BLM13].]
Exercise 8 (4 points). Observe that Gordon's theorem (Theorem 13) shows that whenever $w(K) \le 0.5\sqrt{n}$ and $n$ is large, then $K \cap \nu = \emptyset$ with high probability. Find an example of $K$ satisfying $w(K) \le \sqrt{n}$ yet $K \cap \nu \neq \emptyset$ almost surely. (This shows the sharpness of Gordon's theorem.)
Exercise 9 (3+3 points). Let K be a convex cone in R^d. Show the
following relation between the statistical dimension and the Gaussian
width:
    w²(K ∩ S^{d−1}) ≤ δ(K) ≤ w²(K ∩ S^{d−1}) + 1,        (4.4)
where S^{d−1} denotes the unit sphere in R^d. Each side of the inequality
is worth 3 points. [Hint: the left inequality follows relatively easily
from Jensen's inequality. To show the right inequality, we need
Var(sup_{z∈K∩S^{d−1}} z^⊤G) ≤ 1, which can be shown using the fact
that a 1-Lipschitz function of a standard Gaussian vector has variance
≤ 1 (the Gaussian Poincaré inequality).]
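For intuition, (4.4) can be checked numerically on one concrete cone. The sketch below uses the nonnegative orthant K = R^d_+, for which the projection of G onto K is the coordinatewise positive part and (up to an event of probability 2^{−d}) the supremum over K ∩ S^{d−1} of z^⊤G equals the norm of that positive part; it also assumes the standard definition δ(K) = E‖Π_K(G)‖₂². The dimension and sample size are arbitrary.

import numpy as np

rng = np.random.default_rng(6)
d, n = 30, 200_000
G = rng.standard_normal((n, d))
Gplus = np.maximum(G, 0.0)                        # projection of G onto K = R^d_+

sup = np.linalg.norm(Gplus, axis=1)               # sup over K ∩ S^{d-1} of z^T G (ignoring the all-negative event)
w = sup.mean()                                    # Gaussian width w(K ∩ S^{d-1})
delta = np.mean(np.sum(Gplus**2, axis=1))         # statistical dimension E||Pi_K(G)||^2 (equals d/2 for the orthant)
print(w**2, delta, w**2 + 1)                      # numerically: w^2 <= delta <= w^2 + 1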
Exercise 10 (4 points). Recall the descent cone D(θ) defined in
(1.95). Show that if θ ∈ R^d is k-sparse (d ≥ 2k) then the statistical
dimension has the order δ(D(θ)) = Θ(k log d). [Hint: similar to
Theorem 14. The difference is that the set of interest here is not a
union over all k-sparse vectors.]
Exercise 11 (2 points). Suppose that X ∈ R^p follows the distribution
N(0, Θ^{−1}). Let u, v ∈ {1, . . . , p} with u ≠ v. Show that the
distribution of (X_u, X_v) given X_{{1,...,p}\{u,v}} = 0 is N(0, Θ̄^{−1}), where
Θ̄ ∈ R^{2×2} denotes the restriction of Θ to the u-th and v-th rows and
columns. Then verify the claim in (2.3).
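A numerical sanity check of the claimed conditional law, for a randomly generated positive-definite precision matrix (an arbitrary choice for the sketch): the conditional covariance computed from the Schur complement of Σ should equal the inverse of the 2×2 restriction of Θ.

import numpy as np

rng = np.random.default_rng(7)
p, u, v = 6, 1, 4
A = rng.standard_normal((p, p))
Theta = A @ A.T + p * np.eye(p)                   # a random positive-definite precision matrix
Sigma = np.linalg.inv(Theta)

keep = [u, v]
rest = [i for i in range(p) if i not in keep]
# conditional covariance of (X_u, X_v) given the remaining coordinates (Schur complement of Sigma):
cond_cov = Sigma[np.ix_(keep, keep)] - Sigma[np.ix_(keep, rest)] @ np.linalg.solve(
    Sigma[np.ix_(rest, rest)], Sigma[np.ix_(rest, keep)])
print(cond_cov)
print(np.linalg.inv(Theta[np.ix_(keep, keep)]))   # should match: inverse of the 2x2 restriction of Theta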
Exercise 12 (2 points). Let S be a rank-deficient positive-semidefinite
matrix. Show that the maximum likelihood estimate of the precision
matrix in the Gaussian graphical model (see (2.6)) does not exist
(i.e. there is no Θ achieving the maximum).
Exercise 13 (2 points). Verify that the graphical lasso (2.7) is a convex
optimization problem.
Exercise 14 (2+2 points). Let Y ∈ R^{n×T} be a given matrix.
• Let k ≤ min{n, T}. Find an explicit formula for
    argmin_{Θ∈R^{n×T}, rank(Θ)≤k} ‖Y − Θ‖²_F.
  [Hint: consider the singular value decomposition of Y.]
• Let X ∈ R^{n×d} be another given matrix. Let k ≤ min{d, T}.
  Find an explicit formula for
    argmin_{Θ∈R^{d×T}, rank(Θ)≤k} ‖Y − XΘ‖²_F.
  [Hint: use the decomposition ‖Y − XΘ‖²_F = ‖Y − Ȳ‖²_F + ‖Ȳ − XΘ‖²_F,
  where Ȳ := X(X^⊤X)†X^⊤Y denotes the orthogonal projection of Y
  onto the image space of X.]
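Once you have a candidate formula for the first bullet, it can be tested numerically; the sketch below compares the rank-k truncation of the SVD of Y (the natural candidate suggested by the hint) against an arbitrary competitor of rank at most k. The sizes, the rank k, and the random matrices are arbitrary.

import numpy as np

rng = np.random.default_rng(8)
n, T, k = 15, 10, 3
Y = rng.standard_normal((n, T))

U, s, Vt = np.linalg.svd(Y, full_matrices=False)
Y_k = (U[:, :k] * s[:k]) @ Vt[:k, :]              # rank-k truncation of the SVD of Y

B = rng.standard_normal((n, k)) @ rng.standard_normal((k, T))   # an arbitrary rank-<=k competitor
print(np.linalg.norm(Y - Y_k), np.linalg.norm(Y - B))           # the truncation should never do worse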
Bibliography
[ALMT14] Dennis Amelunxen, Martin Lotz, Michael B. McCoy, and Joel A. Tropp. Living on the edge: Phase transitions in convex programs with random data. Information and Inference: A Journal of the IMA, 3(3):224–294, 2014.

[BLM13] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

[CDD09] Albert Cohen, Wolfgang Dahmen, and Ronald DeVore. Compressed sensing and best k-term approximation. Journal of the American Mathematical Society, 22(1):211–231, 2009.

[CHE98] S. S. Chen. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.

[CR09] E. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 2009.

[DH06] D. L. Donoho and X. Huo. Uncertainty principles and ideal atomic decomposition. IEEE Transactions on Information Theory, 47(7):2845–2862, 2006.

[EB02] Michael Elad and Alfred M. Bruckstein. A generalized uncertainty principle and sparse representation in pairs of bases. IEEE Transactions on Information Theory, 48(9):2558–2567, 2002.

[FHT08] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.

[FN03] Arie Feuer and Arkadi Nemirovski. On sparse representation in pairs of bases. IEEE Transactions on Information Theory, 49(6):1579–1581, 2003.

[Gor88] Yehoram Gordon. On Milman's inequality and random subspaces which escape through a mesh in R^n. In Geometric Aspects of Functional Analysis, pages 84–106. Springer, 1988.

[KLT+11] Vladimir Koltchinskii, Karim Lounici, Alexandre B. Tsybakov, et al. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39(5):2302–2329, 2011.

[LCCV18] Jingbo Liu, Thomas A. Courtade, Paul W. Cuff, and Sergio Verdú. A forward-reverse Brascamp-Lieb inequality: Entropic duality and Gaussian optimality. Entropy, 20(6):418, 2018.

[LJG15] Jingbo Liu, Jian Jin, and Yuantao Gu. Robustness of sparse recovery via f-minimization: A topological viewpoint. IEEE Transactions on Information Theory, 61(7):3996–4014, 2015.

[LMN+20] Martin Lotz, Michael B. McCoy, Ivan Nourdin, Giovanni Peccati, and Joel A. Tropp. Concentration of the intrinsic volumes of a convex body. In Geometric Aspects of Functional Analysis, pages 139–167. Springer, 2020.

[LW13] Po-Ling Loh and Martin J. Wainwright. Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses. The Annals of Statistics, pages 3022–3049, 2013.

[MH12] Rahul Mazumder and Trevor Hastie. The graphical lasso: New insights and alternatives. Electronic Journal of Statistics, 6:2125, 2012.

[MT14] Michael B. McCoy and Joel A. Tropp. From Steiner formulas for cones to concentration of intrinsic volumes. Discrete & Computational Geometry, 51(4):926–963, 2014.

[Rig] Philippe Rigollet. High dimensional statistics lecture notes. https://ocw.mit.edu/courses/mathematics/18-s997-high-dimensional-statistics-spring-2015/lecture-notes/.

[RV08] Mark Rudelson and Roman Vershynin. On sparse reconstruction from Fourier and Gaussian measurements. Communications on Pure and Applied Mathematics, 61(8):1025–1045, 2008.

[RWY10] Garvesh Raskutti, Martin J. Wainwright, and Bin Yu. Restricted eigenvalue properties for correlated Gaussian designs. The Journal of Machine Learning Research, 11:2241–2259, 2010.

[SVH19] Yair Shenfeld and Ramon van Handel. Mixed volumes and the Bochner method. Proceedings of the American Mathematical Society, 147(12):5385–5402, 2019.

[SW08] Rolf Schneider and Wolfgang Weil. Stochastic and Integral Geometry. Springer Science & Business Media, 2008.

[Tsy08] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Science & Business Media, 2008.

[WL+08] Tong Tong Wu, Kenneth Lange, et al. Coordinate descent algorithms for lasso penalized regression. Annals of Applied Statistics, 2(1):224–244, 2008.

[YS] Christina Lee Yu and Devavrat Shah. What do we know about matrix estimation?
[Appended handwritten notes, Week 1 (see also the typed lecture notes above): the Gaussian linear model with fixed or random design, the Gaussian sequence model and its connection to nonparametric regression, the naive and oracle risks, the hard thresholding estimator and its risk bound, and the soft thresholding estimator, which is adaptive in that its threshold does not depend on the sparsity level k and achieves risk of order σ²k log d.]
[Appended handwritten notes, Week 5: the Gaussian graphical model and the precision matrix, conditional independence and the Hammersley-Clifford theorem, the Gaussian log-likelihood and the non-existence of the MLE when the sample covariance S is not of full rank, the graphical lasso, its dual formulation via Fenchel duality, and blockwise coordinate descent on the dual, as in [FHT08].]
[Appended handwritten notes, Week 2: Stein's paradox and the inadmissibility of the naive estimator, the James-Stein estimator, Stein's unbiased risk estimator (SURE) and its proof via Gaussian integration by parts, least squares regression and its risk, the Bayes information criterion (BIC) and the Lasso, their reduction to hard and soft thresholding when X = I_d, and the restricted eigenvalue condition with the resulting fast rate for the Lasso.]
Week 1
( see also
typed lecture
parameter f
→
in
data
fxi
Y, )
data
1×2
→ data
fo
fo
Yi
‘t
response
linear
p
regression
Y
=
XO
–
Gaussian linear model
–
Fixed
Random
design
design
:
:
xn in
fixed
or
)
random
function
Xi E
,
Ht
Rd
,
FERD
noise
m)
3
i
–
,
–
‘
Gn
T
–
)
+ noise
–
noise
(
linear
fTXi
T
H
‘t
predictors
is
=
:
welcome)
noise
↳
fo Ki )
,
→
!
either
ki ) t
fo
i
compass
notes
T
Model
Sequence
*
X
=
,
–
–
–
Xn
iid
)
is
iid
No ol )
,
deterministic
random
vectors
in
Rd
.
Gaussian sequence model
*
:
Yi
⇐
Remark
in
:
Id
=
=
nonparametric regression
SIX)
f-
=
Xi
-13
i
¥✓
,
¥11
)
( x)
+
W
:
S
:
f
Si
Fi
=
t
unknown
Hallo
Bo
naive
Risk
:
( R)
=
It
,
Fourier
:
Ioi to
fo
:
Hollo
–
–
k}
EHF Olli
–
Hashi
–
:
if supple )
for:b
=
Yi Ii
e
–
,
do
known
is
supp
–
H’
–
:
of
Fourier
FHKY
:
–
oracle
=
.
I
]
function
of f
.
{ 3 i3
:
[o
smooth
Fourier trans
:
S
Define
motion
observation
:
F
Si
Brownian
s
of
Edie }
Risk
Improves
Input
ing
f
:
Output
–
naive
threshold
Hard
E
for
estimator
Rd
–
T
)
,
( with
Fi ed
:
Fein !
.
:
13
–
⇐
>
I
>
d.
E
E
for
T
so
)
discovery)
( false
.
probability
constant
Fito
but
large enough
just
:
fit
By
A
–
error
–
IP (
.
Yi ‘t Hit T
=
Type
Fi O
want :
Ccd
:
T E lo
s
k
drastically if
estimator
picking
-1
no
‘
f
:
ai
Idea
ko
–
]
to
HT
8
a
past
.
no
–
d ZIP [ Sisto)
–
ZI
–
e
-64252
.
that
.
want
g
=
T.ir#g2gdF
Thin 1
with
:
for
T
2T
–
–
.
AE Bock )
any
EHF Olli
probability
Example
k
:
I
7
–
–
H)
in
as
have
we
,
:
,
notation
:
asb
f
ifac
Td
–
To
otkdegzgd
E
–
with
where
.
#
E HIM
–
const .b
olliarirdlyd
Ttlloorade Allie ord
Elton -01122 sad
–
‘m
”
⇐
pneofi
with
probability
FEED Kil
under t
F til
–
I
e
–
f
,
-42
←
event
A
,
=
=
⇐
E
Hi
–
13:12
oil ‘t# set toil
Ilyi let
to its
Iz It oitsil
E-
–
.
pet
>
E t
Hill Hitt
tail Itoi -15 let
–
.
Hail Et toi
>
Hoti left
S
Hoh Olli
E
–
min
FE
flail
min
,
.
)
T
{ If it
,
I
‘
)
Hollo
fu
T
‘
t
Remarkg
HT
:
b/c I
②
z.fr#g
=
soft
For
:
I
have
we
up
:
.
I
–
–
–
ing
ly; H1
)
t÷
HT
which
threshold
similar
S
!
actually adaptive
is
–
cist +
–
properties
estimator
Hit E)
(
depend
doesn’t
i.e
.
on
k
.
where
It
–
.
e
Rish
–
e
=O(oZklyzgd)