nips nips2010 nips2010-265 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mohsen Bayati, José Pereira, Andrea Montanari
Abstract: We consider the problem of learning a coefficient vector x0 ∈ RN from noisy linear observation y = Ax0 + w ∈ Rn . In many contexts (ranging from model selection to image processing) it is desirable to construct a sparse estimator x. In this case, a popular approach consists in solving an ℓ1 -penalized least squares problem known as the LASSO or Basis Pursuit DeNoising (BPDN). For sequences of matrices A of increasing dimensions, with independent gaussian entries, we prove that the normalized risk of the LASSO converges to a limit, and we obtain an explicit expression for this limit. Our result is the first rigorous derivation of an explicit formula for the asymptotic mean square error of the LASSO for random instances. The proof technique is based on the analysis of AMP, a recently developed efficient algorithm, that is inspired from graphical models ideas. Through simulations on real data matrices (gene expression data and hospital medical records) we observe that these results can be relevant in a broad array of practical applications.
Reference: text
sentIndex sentText sentNum sentScore
1 The LASSO risk: asymptotic results and real world examples Mohsen Bayati Stanford University bayati@stanford. [sent-1, score-0.167]
2 In many contexts (ranging from model selection to image processing) it is desirable to construct a sparse estimator x. [sent-5, score-0.093]
3 For sequences of matrices A of increasing dimensions, with independent gaussian entries, we prove that the normalized risk of the LASSO converges to a limit, and we obtain an explicit expression for this limit. [sent-7, score-0.212]
4 Our result is the first rigorous derivation of an explicit formula for the asymptotic mean square error of the LASSO for random instances. [sent-8, score-0.363]
5 Through simulations on real data matrices (gene expression data and hospital medical records) we observe that these results can be relevant in a broad array of practical applications. [sent-10, score-0.424]
6 1 Introduction Let x0 ∈ RN be an unknown vector, and assume that a vector y ∈ Rn of noisy linear measurements of x0 is available. [sent-11, score-0.064]
7 The problem of reconstructing x0 from such measurements arises in a number of disciplines, ranging from statistical learning to signal processing. [sent-12, score-0.158]
8 In many contexts the measurements are modeled by y = Ax0 + w , (1. [sent-13, score-0.103]
9 1) where A ∈ Rn×N is a known measurement matrix, and w is a noise vector. [sent-14, score-0.073]
10 Further v p ≡ ( i=1 vi )1/p p denotes the ℓp -norm of a vector v ∈ R (the subscript p will often be omitted if p = 2). [sent-22, score-0.069]
11 For instance, the best bound on the mean square error (MSE) of the estimator (1. [sent-29, score-0.177]
12 Their result estimates the mean square error only up to an unknown numerical multiplicative factor. [sent-33, score-0.123]
13 Work by Candes and Tao [CT07] on the analogous Dantzig selector, upper bounds the mean square error up to a factor C log N , under somewhat different assumptions. [sent-34, score-0.123]
14 The objective of this paper is to complement this type of ‘rough but robust’ bounds by proving asymptotically exact expressions for the mean square error. [sent-35, score-0.17]
15 Our asymptotic result holds almost surely for sequences of random matrices A with fixed aspect ratio and independent gaussian entries. [sent-36, score-0.441]
16 While this setting is admittedly specific, the careful study of such matrix ensembles has a long tradition both in statistics and communications theory and has spurred many insights [Joh06, Tel99]. [sent-37, score-0.127]
17 Further, our main result provides asymptotically exact expressions for other operating characteristics of the LASSO as well (e. [sent-38, score-0.088]
18 We carried out simulations on real data matrices with continuous entries (gene expression data) and binary feature matrices (hospital medical records). [sent-41, score-0.538]
19 Although our rigorous results are asymptotic in the problem dimensions, numerical simulations have shown that they are accurate already on problems with a few hundreds of variables. [sent-43, score-0.305]
20 Further, they seem to enjoy a remarkable universality property and to hold for a fairly broad family of matrices [DMM10]. [sent-44, score-0.31]
21 Both these phenomena are analogous to ones in random matrix theory, where delicate asymptotic properties of gaussian ensembles were subsequently proved to hold for much broader classes of random matrices. [sent-45, score-0.359]
22 Also, asymptotic statements in random matrix theory have been replaced over time by concrete probability bounds in finite dimensions. [sent-46, score-0.213]
23 As a consequence, universality and non-asymptotic results in random matrix theory cannot be directly exported to the present problem. [sent-49, score-0.103]
24 Our proof is based on the analysis of an efficient iterative algorithm first proposed by [DMM09], and called AMP, for approximate message passing. [sent-51, score-0.119]
25 Extensive simulations [DMM10] showed that, in a number of settings, AMP performances are statistically indistinguishable to the ones of LASSO, while its complexity is essentially as low as the one of the simplest greedy algorithms. [sent-53, score-0.065]
26 Here instead we prove an asymptotically exact characterization of a rather non-trivial iterative algorithm. [sent-58, score-0.079]
27 The algorithm constructs a sequence of estimates xt ∈ RN , and residuals z t ∈ Rn , according to the iteration xt+1 = η(A∗ z t + xt ; θt ), (1. [sent-64, score-0.508]
28 Here A∗ denotes the transpose of matrix A, and xt 0 is number of nonzero entries of xt . [sent-66, score-0.634]
29 z t = y − Axt + 2 As already mentioned, we will consider sequences of instances of increasing sizes, along which the LASSO behavior has a non-trivial limit. [sent-72, score-0.06]
30 i=1 (b) The empirical distribution of the entries of w(N ) converges weakly to a probability measure pW on R with bounded second moment. [sent-76, score-0.136]
31 i=1 (c) If {ei }1≤i≤N , ei ∈ RN denotes the standard basis, then maxi∈[N ] A(N )ei 2 , mini∈[N ] A(N )ei 2 → 1, as N → ∞ where [N ] ≡ {1, 2, . [sent-78, score-0.058]
32 For a converging sequence of instances, and an arbitrary sequence of thresholds {θt }t≥0 (independent of N ), the asymptotic behavior of the recursion (1. [sent-82, score-0.514]
33 ) Our next proposition that was conjectured in [DMM09] and proved in [BM10a]. [sent-88, score-0.123]
34 Let {x0 (N ), w(N ), A(N )}N ∈N be a converging sequence of instances with the entries of A(N ) iid normal with mean 0 and variance 1/n and let ψ : R × R → R be a pseudoLipschitz function. [sent-92, score-0.501]
35 Then, almost surely 1 N →∞ N N ψ xt+1 , x0,i = E ψ η(X0 + τt Z; θt ), X0 i lim , (1. [sent-93, score-0.165]
36 In order to establish the connection with the LASSO, a specific policy has to be chosen for the thresholds {θt }t≥0 . [sent-95, score-0.063]
37 In other words, 2 the sequence {τt }t≥0 is given by the recursion τt+1 = F(τt2 , ατt ). [sent-97, score-0.098]
38 2 Main result Before stating our results, we have to describe a calibration mapping between α and λ that was introduced in [DMM10] (Propositions 2, 3 and Corollary 4). [sent-100, score-0.092]
39 1 The probability distribution that puts a point mass 1/N at each of the N entries of the vector. [sent-109, score-0.136]
40 This function defines a correspondence (calibration) between the sequence of thresholds {θt }t≥0 and the regularization parameter λ. [sent-111, score-0.119]
41 It should be intuitively clear that larger λ corresponds to larger thresholds and hence larger α since both cases yield smaller estimates of x0 . [sent-112, score-0.063]
42 Let {x0 (N ), w(N ), A(N )}N ∈N be a converging sequence of instances with the entries of A(N ) iid normal with mean 0 and variance 1/n. [sent-125, score-0.501]
43 Then, almost surely 1 N →∞ N N lim ψ xi , x0,i = E ψ η(X0 + τ∗ Z; θ∗ ), X0 , (1. [sent-127, score-0.165]
44 The assumption of a converging problem-sequence is important for the result to hold, while the hypothesis of gaussian measurement matrices A(N ) is necessary for the proof technique to be correct. [sent-133, score-0.368]
45 Several papers provide performance guarantees for the LASSO or similar convex optimization methods [CRT06, CT07], by proving upper bounds on the resulting mean square error. [sent-138, score-0.082]
46 These works assume an appropriate ‘isometry’ condition to hold for A. [sent-139, score-0.106]
47 While such condition hold with high probability for some random matrices, it is often difficult to verify them explicitly. [sent-140, score-0.106]
48 Guarantees have been proved for correct support recovery in [ZY06], under an appropriate ‘irrepresentibility’ assumption on A. [sent-143, score-0.086]
49 model selection), the metric considered in the present paper (mean square error) provides complementary information and is quite standard in many different fields. [sent-146, score-0.082]
50 Closer to the spirit of this paper [RFG09] derived expressions for the mean square error under the same model considered here. [sent-147, score-0.17]
51 These papers argue that a sharp asymptotic characterization of the LASSO risk can provide valuable 4 guidance in practical applications. [sent-149, score-0.254]
52 2 Numerical illustrations Theorem 2 assumes that the entries of matrix A are iid gaussians. [sent-153, score-0.301]
53 We expect however that our predictions to be robust and hold for much larger family of matrices. [sent-154, score-0.061]
54 Rigorous evidence in this direction is presented in [KM10] where the normalized cost C(x)/N is shown to have a limit as N → ∞ which is universal with respect to random matrices A with iid entries. [sent-155, score-0.231]
55 ) ij ij Further, our result is asymptotic, while and one might wonder how accurate it is for instances of moderate dimensions. [sent-157, score-0.06]
56 Numerical simulations were carried out in [DMM10] and suggest that the result is robust and relevant already for N of the order of a few hundreds. [sent-158, score-0.065]
57 As an illustration, we present in Figures 1-3 the outcome of such simulations for four types of real data and random matrices. [sent-159, score-0.065]
58 We generated the signal vector randomly with entries in {+1, 0, −1} and P(x0,i = +1) = P(x0,i = −1) = 0. [sent-160, score-0.183]
59 Continuous lines corresponds to the asymptotic prediction by Theorem 2 for 2 ψ(a, b) = (a − b)2 , namely MSE = limN →∞ N −1 x − x0 2 = E η(X0 + τ∗ Z; θ∗ ) − X0 = 2 δ(τ∗ − σ 2 ). [sent-171, score-0.223]
60 The four figures correspond to measurement matrices A: Figure 1(a): Data consist of 2253 measurements of expression level of 7077 genes (this data is provided to us by Broad Institute). [sent-173, score-0.3]
61 From this matrix we took sub-matrices A of aspect ratio δ for each N . [sent-174, score-0.147]
62 We standardized all columns of A to have mean 0 and variance 1. [sent-176, score-0.061]
63 Figure 1(b): From a data set of 1932 patient records we extracted 4833 binary features describing demographic information, medical history, lab results, medications etc. [sent-177, score-0.193]
64 Similar to genes data, for each N , the sub-matrices A with aspect ratio δ were selected and standardized. [sent-180, score-0.101]
65 √ Figure 2(a): Random ±1 matrices with aspect ratio δ. [sent-181, score-0.213]
66 Figure 2(b): Random gaussian matrices with aspect ratio δ and iid N(0, 1/n) entries (as in Theorem 2). [sent-183, score-0.468]
67 Also the asymptotic prediction has a minimum as a function of λ. [sent-185, score-0.223]
68 The length of each error bar 5 (a) Gene expression data (b) Hospital records 0. [sent-190, score-0.183]
69 5 Figure 1: Mean square error (MSE) as a function of the regularization parameter λ compared to the asymptotic prediction for δ = . [sent-212, score-0.346]
70 In plot (a) the measurement matrix A is a real valued (standardized) matrix of gene expression data and in plot (b) A is a (standardized) 0-1 feature matrix of hospital records. [sent-215, score-0.414]
71 Each point in these plots is generated by finding the LASSO predictor x using a measurement vector y = Ax0 + w for an independent signal vector x0 and an independent noise vector w. [sent-216, score-0.12]
72 5 √ Figure 2: As in Figure 1, but the measurement matrix A has iid entries that are equal to ±1/ n with equal probabilities in plot (a), and has iid N(0, 1/n) entries in plot (b). [sent-239, score-0.629]
73 The predictions for FPR and TPR are obtained by applying Theorem 2 to ψf pr (a, b) ≡ I{a=0} I{b=0} and ψtpr (a, b) = I{a=0} I{b=0} , which yields 1 1 lim FPR = 2Φ(−α), lim TPR = Φ(−α + ) + Φ(−α − ) (2. [sent-244, score-0.208]
74 3 A structural property and proof of the main theorem We will prove the following theorem which implies our main result, Theorem 2. [sent-248, score-0.191]
75 Then limt→∞ limN →∞ N −1 xt (N ) − x(λ; N ) 2 = 0, almost surely. [sent-252, score-0.226]
76 5 Figure 3: Average of MSE, FPR and TPR versus λ for medical data, using 20 samples per λ and N . [sent-288, score-0.062]
77 The rest of the paper is devoted to the proof of this theorem. [sent-291, score-0.091]
78 2 uses this property together with a few lemmas to prove Theorem 3. [sent-295, score-0.076]
79 Proofs of lemmas and more details can be found in [BM10b]. [sent-296, score-0.076]
80 We then obtain i 1 N →∞ N N lim 1 t→∞ N →∞ N N ψ xt+1 , x0,i = E{ψ(η(X0 + τ∗ Z; θ∗ ), X0 )} , i ψ(xi , x0,i ) = lim lim i=1 i=1 where we used Theorem 1 and Proposition 2. [sent-299, score-0.312]
81 Also define m 1 m the scalar product u, v ≡ m i=1 ui vi for u, v ∈ R . [sent-303, score-0.069]
82 In particular, remember that the subgradient of the ℓ1 norm, x → x 1 is given by ∂ x 1 = v ∈ Rm such that |vi | ≤ 1 ∀i and xi = 0 ⇒ vi = sign(xi ) . [sent-305, score-0.123]
83 2 Proof of Theorem 3 The proof is based on a series of Lemmas that are used to check the assumptions of Lemma 1 The next lemma implies that submatrices of A constructed using the first t iterations of the AMP algorithm are non-singular (more precisely, have singular values bounded away from 0). [sent-331, score-0.157]
84 Then there exists a1 = a1 (c) > 0 (independent of t) and a2 = a2 (c, t) > 0 (depending on t and c) such that minS ′ {σmin (AS∪S ′ ) : S ′ ⊆ [N ], |S ′ | ≤ a1 N } ≥ a2 , with probability converging to 1 as N → ∞. [sent-340, score-0.13]
85 We will apply this lemma to a specific choice of the set S. [sent-341, score-0.065]
86 Namely, defining vt ≡ 1 θt−1 (xt−1 + A∗ z t−1 − xt ) , (3. [sent-342, score-0.226]
87 2) our last lemma shows convergence of a particular sequence of sets provided by v t . [sent-343, score-0.121]
88 There exist constants γ1 ∈ (0, 1), γ2 , γ3 > 0 and tmin < ∞ such that, for any t ≥ tmin , min σmin (ASt (γ1 )∪S ′ ) : S ′ ⊆ [N ] , |S ′ | ≤ γ2 N ≥ γ3 with probability converging to 1 as N → ∞. [sent-349, score-0.328]
89 We apply Lemma 1 to x = xt , the AMP estimate and r = x − xt the distance from the LASSO optimum. [sent-351, score-0.452]
90 , c5 > 0 and, for each ε > 0 some t = t(ε) such that 1–5 hold with probability going to 1 as N → ∞. [sent-356, score-0.061]
91 Condition 1 holds since limN →∞ x, x and limN →∞ xt , xt for all t are finite. [sent-357, score-0.452]
92 Further it can be shown that v t = (1/λ)[A∗ (y − Axt ) + sg(C, xt )], with sg(C, xt ) a subgradient satisfying limt→∞ limN →∞ N −1 sg(C, xt ) 2 = 0. [sent-365, score-0.732]
93 This proves condition 3 and condition 4 holds by Proposition 5. [sent-366, score-0.09]
94 Montanari, The LASSO risk: asymptotic results and real world examples, Long version (in preparation), 2010. [sent-377, score-0.167]
95 Montanari, The dynamics of message passing on dense graphs, with applications to compressed sensing, Proceedings of IEEE International Symposium on Inform. [sent-380, score-0.203]
96 , The LASSO risk for gaussian matrices, 2010, preprint available in http://arxiv. [sent-384, score-0.099]
97 Tao, Stable signal recovery from incomplete and inaccurate measurements, Communications on Pure and Applied Mathematics 59 (2006), 1207–1223. [sent-396, score-0.084]
98 Shamai, A single-letter characterization of optimal noisy compressed sensing, 47th Annual Allerton Conference (Monticello, IL), September 2009. [sent-413, score-0.13]
99 Tanaka, A typical reconstruction limit for compressed sensing based on lp-norm minimization, J. [sent-433, score-0.192]
100 Goyal, Asymptotic analysis of map estimation via the replica method and applications to compressed sensing, PUT NIPS REF, 2009. [sent-445, score-0.146]
wordName wordTfidf (topN-words)
[('lasso', 0.273), ('amp', 0.248), ('mse', 0.232), ('tpr', 0.229), ('xt', 0.226), ('montanari', 0.218), ('fpr', 0.187), ('asymptotic', 0.167), ('sg', 0.146), ('limn', 0.14), ('entries', 0.136), ('converging', 0.13), ('bayati', 0.125), ('iid', 0.119), ('matrices', 0.112), ('rn', 0.106), ('lim', 0.104), ('sensing', 0.1), ('hospital', 0.093), ('compressed', 0.092), ('records', 0.091), ('tao', 0.09), ('candes', 0.087), ('limt', 0.086), ('square', 0.082), ('lemmas', 0.076), ('proposition', 0.074), ('measurement', 0.073), ('rigorous', 0.073), ('axt', 0.071), ('kkl', 0.071), ('pseudolipschitz', 0.071), ('pw', 0.071), ('tmin', 0.071), ('donoho', 0.07), ('vi', 0.069), ('theorem', 0.069), ('message', 0.066), ('simulations', 0.065), ('lemma', 0.065), ('aspect', 0.064), ('measurements', 0.064), ('thresholds', 0.063), ('medical', 0.062), ('bpdn', 0.062), ('bento', 0.062), ('hold', 0.061), ('standardized', 0.061), ('pursuit', 0.061), ('surely', 0.061), ('instances', 0.06), ('gene', 0.059), ('ei', 0.058), ('universality', 0.057), ('maleki', 0.057), ('prediction', 0.056), ('sequence', 0.056), ('min', 0.056), ('estimator', 0.054), ('subgradient', 0.054), ('replica', 0.054), ('selector', 0.054), ('romberg', 0.054), ('proof', 0.053), ('rm', 0.053), ('expression', 0.051), ('preprint', 0.05), ('risk', 0.049), ('proved', 0.049), ('dantzig', 0.049), ('calibration', 0.049), ('signal', 0.047), ('expressions', 0.047), ('reconstructing', 0.047), ('matrix', 0.046), ('stanford', 0.045), ('condition', 0.045), ('passing', 0.045), ('communications', 0.045), ('stating', 0.043), ('recursion', 0.042), ('error', 0.041), ('ax', 0.041), ('asymptotically', 0.041), ('broad', 0.041), ('false', 0.04), ('patient', 0.04), ('developments', 0.04), ('singular', 0.039), ('contexts', 0.039), ('remarkable', 0.039), ('basis', 0.038), ('devoted', 0.038), ('characterization', 0.038), ('corollary', 0.038), ('recovery', 0.037), ('ratio', 0.037), ('st', 0.036), ('ensembles', 0.036), ('denoising', 0.036)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000008 265 nips-2010-The LASSO risk: asymptotic results and real world examples
Author: Mohsen Bayati, José Pereira, Andrea Montanari
Abstract: We consider the problem of learning a coefficient vector x0 ∈ RN from noisy linear observation y = Ax0 + w ∈ Rn . In many contexts (ranging from model selection to image processing) it is desirable to construct a sparse estimator x. In this case, a popular approach consists in solving an ℓ1 -penalized least squares problem known as the LASSO or Basis Pursuit DeNoising (BPDN). For sequences of matrices A of increasing dimensions, with independent gaussian entries, we prove that the normalized risk of the LASSO converges to a limit, and we obtain an explicit expression for this limit. Our result is the first rigorous derivation of an explicit formula for the asymptotic mean square error of the LASSO for random instances. The proof technique is based on the analysis of AMP, a recently developed efficient algorithm, that is inspired from graphical models ideas. Through simulations on real data matrices (gene expression data and hospital medical records) we observe that these results can be relevant in a broad array of practical applications.
2 0.17060249 148 nips-2010-Learning Networks of Stochastic Differential Equations
Author: José Pereira, Morteza Ibrahimi, Andrea Montanari
Abstract: We consider linear models for stochastic dynamics. To any such model can be associated a network (namely a directed graph) describing which degrees of freedom interact under the dynamics. We tackle the problem of learning such a network from observation of the system trajectory over a time interval T . We analyze the ℓ1 -regularized least squares algorithm and, in the setting in which the underlying network is sparse, we prove performance guarantees that are uniform in the sampling rate as long as this is sufficiently high. This result substantiates the notion of a well defined ‘time complexity’ for the network inference problem. keywords: Gaussian processes, model selection and structure learning, graphical models, sparsity and feature selection. 1 Introduction and main results Let G = (V, E) be a directed graph with weight A0 ∈ R associated to the directed edge (j, i) from ij j ∈ V to i ∈ V . To each node i ∈ V in this network is associated an independent standard Brownian motion bi and a variable xi taking values in R and evolving according to A0 xj (t) dt + dbi (t) , ij dxi (t) = j∈∂+ i where ∂+ i = {j ∈ V : (j, i) ∈ E} is the set of ‘parents’ of i. Without loss of generality we shall take V = [p] ≡ {1, . . . , p}. In words, the rate of change of xi is given by a weighted sum of the current values of its neighbors, corrupted by white noise. In matrix notation, the same system is then represented by dx(t) = A0 x(t) dt + db(t) , p (1) 0 p×p with x(t) ∈ R , b(t) a p-dimensional standard Brownian motion and A ∈ R a matrix with entries {A0 }i,j∈[p] whose sparsity pattern is given by the graph G. We assume that the linear system ij x(t) = A0 x(t) is stable (i.e. that the spectrum of A0 is contained in {z ∈ C : Re(z) < 0}). Further, ˙ we assume that x(t = 0) is in its stationary state. More precisely, x(0) is a Gaussian random variable 1 independent of b(t), distributed according to the invariant measure. Under the stability assumption, this a mild restriction, since the system converges exponentially to stationarity. A portion of time length T of the system trajectory {x(t)}t∈[0,T ] is observed and we ask under which conditions these data are sufficient to reconstruct the graph G (i.e., the sparsity pattern of A0 ). We are particularly interested in computationally efficient procedures, and in characterizing the scaling of the learning time for large networks. Can the network structure be learnt in a time scaling linearly with the number of its degrees of freedom? As an example application, chemical reactions can be conveniently modeled by systems of nonlinear stochastic differential equations, whose variables encode the densities of various chemical species [1, 2]. Complex biological networks might involve hundreds of such species [3], and learning stochastic models from data is an important (and challenging) computational task [4]. Considering one such chemical reaction network in proximity of an equilibrium point, the model (1) can be used to trace fluctuations of the species counts with respect to the equilibrium values. The network G would represent in this case the interactions between different chemical factors. Work in this area focused so-far on low-dimensional networks, i.e. on methods that are guaranteed to be correct for fixed p, as T → ∞, while we will tackle here the regime in which both p and T diverge. Before stating our results, it is useful to stress a few important differences with respect to classical graphical model learning problems: (i) Samples are not independent. This can (and does) increase the sample complexity. (ii) On the other hand, infinitely many samples are given as data (in fact a collection indexed by the continuous parameter t ∈ [0, T ]). Of course one can select a finite subsample, for instance at regularly spaced times {x(i η)}i=0,1,... . This raises the question as to whether the learning performances depend on the choice of the spacing η. (iii) In particular, one expects that choosing η sufficiently large as to make the configurations in the subsample approximately independent can be harmful. Indeed, the matrix A0 contains more information than the stationary distribution of the above process (1), and only the latter can be learned from independent samples. (iv) On the other hand, letting η → 0, one can produce an arbitrarily large number of distinct samples. However, samples become more dependent, and intuitively one expects that there is limited information to be harnessed from a given time interval T . Our results confirm in a detailed and quantitative way these intuitions. 1.1 Results: Regularized least squares Regularized least squares is an efficient and well-studied method for support recovery. We will discuss relations with existing literature in Section 1.3. In the present case, the algorithm reconstructs independently each row of the matrix A0 . The rth row, A0 , is estimated by solving the following convex optimization problem for Ar ∈ Rp r minimize L(Ar ; {x(t)}t∈[0,T ] ) + λ Ar 1 , (2) where the likelihood function L is defined by L(Ar ; {x(t)}t∈[0,T ] ) = 1 2T T 0 (A∗ x(t))2 dt − r 1 T T 0 (A∗ x(t)) dxr (t) . r (3) (Here and below M ∗ denotes the transpose of matrix/vector M .) To see that this likelihood function is indeed related to least squares, one can formally write xr (t) = dxr (t)/dt and complete the square ˙ for the right hand side of Eq. (3), thus getting the integral (A∗ x(t) − xr (t))2 dt − xr (t)2 dt. ˙ ˙ r The first term is a sum of square residuals, and the second is independent of A. Finally the ℓ1 regularization term in Eq. (2) has the role of shrinking to 0 a subset of the entries Aij thus effectively selecting the structure. Let S 0 be the support of row A0 , and assume |S 0 | ≤ k. We will refer to the vector sign(A0 ) as to r r the signed support of A0 (where sign(0) = 0 by convention). Let λmax (M ) and λmin (M ) stand for r 2 the maximum and minimum eigenvalue of a square matrix M respectively. Further, denote by Amin the smallest absolute value among the non-zero entries of row A0 . r When stable, the diffusion process (1) has a unique stationary measure which is Gaussian with covariance Q0 ∈ Rp×p given by the solution of Lyapunov’s equation [5] A0 Q0 + Q0 (A0 )∗ + I = 0. (4) Our guarantee for regularized least squares is stated in terms of two properties of the covariance Q0 and one assumption on ρmin (A0 ) (given a matrix M , we denote by ML,R its submatrix ML,R ≡ (Mij )i∈L,j∈R ): (a) We denote by Cmin ≡ λmin (Q0 0 ,S 0 ) the minimum eigenvalue of the restriction of Q0 to S the support S 0 and assume Cmin > 0. (b) We define the incoherence parameter α by letting |||Q0 (S 0 )C ,S 0 Q0 S 0 ,S 0 and assume α > 0. (Here ||| · |||∞ is the operator sup norm.) −1 |||∞ = 1 − α, ∗ (c) We define ρmin (A0 ) = −λmax ((A0 + A0 )/2) and assume ρmin (A0 ) > 0. Note this is a stronger form of stability assumption. Our main result is to show that there exists a well defined time complexity, i.e. a minimum time interval T such that, observing the system for time T enables us to reconstruct the network with high probability. This result is stated in the following theorem. Theorem 1.1. Consider the problem of learning the support S 0 of row A0 of the matrix A0 from a r sample trajectory {x(t)}t∈[0,T ] distributed according to the model (1). If T > 104 k 2 (k ρmin (A0 )−2 + A−2 ) 4pk min log , 2 α2 ρmin (A0 )Cmin δ (5) then there exists λ such that ℓ1 -regularized least squares recovers the signed support of A0 with r probability larger than 1 − δ. This is achieved by taking λ = 36 log(4p/δ)/(T α2 ρmin (A0 )) . The time complexity is logarithmic in the number of variables and polynomial in the support size. Further, it is roughly inversely proportional to ρmin (A0 ), which is quite satisfying conceptually, since ρmin (A0 )−1 controls the relaxation time of the mixes. 1.2 Overview of other results So far we focused on continuous-time dynamics. While, this is useful in order to obtain elegant statements, much of the paper is in fact devoted to the analysis of the following discrete-time dynamics, with parameter η > 0: x(t) = x(t − 1) + ηA0 x(t − 1) + w(t), t ∈ N0 . (6) Here x(t) ∈ Rp is the vector collecting the dynamical variables, A0 ∈ Rp×p specifies the dynamics as above, and {w(t)}t≥0 is a sequence of i.i.d. normal vectors with covariance η Ip×p (i.e. with independent components of variance η). We assume that consecutive samples {x(t)}0≤t≤n are given and will ask under which conditions regularized least squares reconstructs the support of A0 . The parameter η has the meaning of a time-step size. The continuous-time model (1) is recovered, in a sense made precise below, by letting η → 0. Indeed we will prove reconstruction guarantees that are uniform in this limit as long as the product nη (which corresponds to the time interval T in the previous section) is kept constant. For a formal statement we refer to Theorem 3.1. Theorem 1.1 is indeed proved by carefully controlling this limit. The mathematical challenge in this problem is related to the fundamental fact that the samples {x(t)}0≤t≤n are dependent (and strongly dependent as η → 0). Discrete time models of the form (6) can arise either because the system under study evolves by discrete steps, or because we are subsampling a continuous time system modeled as in Eq. (1). Notice that in the latter case the matrices A0 appearing in Eq. (6) and (1) coincide only to the zeroth order in η. Neglecting this technical complication, the uniformity of our reconstruction guarantees as η → 0 has an appealing interpretation already mentioned above. Whenever the samples spacing is not too large, the time complexity (i.e. the product nη) is roughly independent of the spacing itself. 3 1.3 Related work A substantial amount of work has been devoted to the analysis of ℓ1 regularized least squares, and its variants [6, 7, 8, 9, 10]. The most closely related results are the one concerning high-dimensional consistency for support recovery [11, 12]. Our proof follows indeed the line of work developed in these papers, with two important challenges. First, the design matrix is in our case produced by a stochastic diffusion, and it does not necessarily satisfies the irrepresentability conditions used by these works. Second, the observations are not corrupted by i.i.d. noise (since successive configurations are correlated) and therefore elementary concentration inequalities are not sufficient. Learning sparse graphical models via ℓ1 regularization is also a topic with significant literature. In the Gaussian case, the graphical LASSO was proposed to reconstruct the model from i.i.d. samples [13]. In the context of binary pairwise graphical models, Ref. [11] proves high-dimensional consistency of regularized logistic regression for structural learning, under a suitable irrepresentability conditions on a modified covariance. Also this paper focuses on i.i.d. samples. Most of these proofs builds on the technique of [12]. A naive adaptation to the present case allows to prove some performance guarantee for the discrete-time setting. However the resulting bounds are not uniform as η → 0 for nη = T fixed. In particular, they do not allow to prove an analogous of our continuous time result, Theorem 1.1. A large part of our effort is devoted to producing more accurate probability estimates that capture the correct scaling for small η. Similar issues were explored in the study of stochastic differential equations, whereby one is often interested in tracking some slow degrees of freedom while ‘averaging out’ the fast ones [14]. The relevance of this time-scale separation for learning was addressed in [15]. Let us however emphasize that these works focus once more on system with a fixed (small) number of dimensions p. Finally, the related topic of learning graphical models for autoregressive processes was studied recently in [16, 17]. The convex relaxation proposed in these papers is different from the one developed here. Further, no model selection guarantee was proved in [16, 17]. 2 Illustration of the main results It might be difficult to get a clear intuition of Theorem 1.1, mainly because of conditions (a) and (b), which introduce parameters Cmin and α. The same difficulty arises with analogous results on the high-dimensional consistency of the LASSO [11, 12]. In this section we provide concrete illustration both via numerical simulations, and by checking the condition on specific classes of graphs. 2.1 Learning the laplacian of graphs with bounded degree Given a simple graph G = (V, E) on vertex set V = [p], its laplacian ∆G is the symmetric p × p matrix which is equal to the adjacency matrix of G outside the diagonal, and with entries ∆G = ii −deg(i) on the diagonal [18]. (Here deg(i) denotes the degree of vertex i.) It is well known that ∆G is negative semidefinite, with one eigenvalue equal to 0, whose multiplicity is equal to the number of connected components of G. The matrix A0 = −m I + ∆G fits into the setting of Theorem 1.1 for m > 0. The corresponding model (1.1) describes the over-damped dynamics of a network of masses connected by springs of unit strength, and connected by a spring of strength m to the origin. We obtain the following result. Theorem 2.1. Let G be a simple connected graph of maximum vertex degree k and consider the model (1.1) with A0 = −m I + ∆G where ∆G is the laplacian of G and m > 0. If k+m 5 4pk T ≥ 2 · 105 k 2 , (7) (k + m2 ) log m δ then there exists λ such that ℓ1 -regularized least squares recovers the signed support of A0 with r probability larger than 1 − δ. This is achieved by taking λ = 36(k + m)2 log(4p/δ)/(T m3 ). In other words, for m bounded away from 0 and ∞, regularized least squares regression correctly reconstructs the graph G from a trajectory of time length which is polynomial in the degree and logarithmic in the system size. Notice that once the graph is known, the laplacian ∆G is uniquely determined. Also, the proof technique used for this example is generalizable to other graphs as well. 4 2800 Min. # of samples for success prob. = 0.9 1 0.9 p = 16 p = 32 0.8 Probability of success p = 64 0.7 p = 128 p = 256 0.6 p = 512 0.5 0.4 0.3 0.2 0.1 0 0 50 100 150 200 250 300 T=nη 350 400 2600 2400 2200 2000 1800 1600 1400 1200 1 10 450 2 3 10 10 p Figure 1: (left) Probability of success vs. length of the observation interval nη. (right) Sample complexity for 90% probability of success vs. p. 2.2 Numerical illustrations In this section we present numerical validation of the proposed method on synthetic data. The results confirm our observations in Theorems 1.1 and 3.1, below, namely that the time complexity scales logarithmically with the number of nodes in the network p, given a constant maximum degree. Also, the time complexity is roughly independent of the sampling rate. In Fig. 1 and 2 we consider the discrete-time setting, generating data as follows. We draw A0 as a random sparse matrix in {0, 1}p×p with elements chosen independently at random with P(A0 = 1) = k/p, k = 5. The ij process xn ≡ {x(t)}0≤t≤n is then generated according to Eq. (6). We solve the regularized least 0 square problem (the cost function is given explicitly in Eq. (8) for the discrete-time case) for different values of n, the number of observations, and record if the correct support is recovered for a random row r using the optimum value of the parameter λ. An estimate of the probability of successful recovery is obtained by repeating this experiment. Note that we are estimating here an average probability of success over randomly generated matrices. The left plot in Fig.1 depicts the probability of success vs. nη for η = 0.1 and different values of p. Each curve is obtained using 211 instances, and each instance is generated using a new random matrix A0 . The right plot in Fig.1 is the corresponding curve of the sample complexity vs. p where sample complexity is defined as the minimum value of nη with probability of success of 90%. As predicted by Theorem 2.1 the curve shows the logarithmic scaling of the sample complexity with p. In Fig. 2 we turn to the continuous-time model (1). Trajectories are generated by discretizing this stochastic differential equation with step δ much smaller than the sampling rate η. We draw random matrices A0 as above and plot the probability of success for p = 16, k = 4 and different values of η, as a function of T . We used 211 instances for each curve. As predicted by Theorem 1.1, for a fixed observation interval T , the probability of success converges to some limiting value as η → 0. 3 Discrete-time model: Statement of the results Consider a system evolving in discrete time according to the model (6), and let xn ≡ {x(t)}0≤t≤n 0 be the observed portion of the trajectory. The rth row A0 is estimated by solving the following r convex optimization problem for Ar ∈ Rp minimize L(Ar ; xn ) + λ Ar 0 where L(Ar ; xn ) ≡ 0 1 2η 2 n 1 , (8) n−1 2 t=0 {xr (t + 1) − xr (t) − η A∗ x(t)} . r (9) Apart from an additive constant, the η → 0 limit of this cost function can be shown to coincide with the cost function in the continuous time case, cf. Eq. (3). Indeed the proof of Theorem 1.1 will amount to a more precise version of this statement. Furthermore, L(Ar ; xn ) is easily seen to be the 0 log-likelihood of Ar within model (6). 5 1 1 0.9 0.95 0.9 0.7 Probability of success Probability of success 0.8 η = 0.04 η = 0.06 0.6 η = 0.08 0.5 η = 0.1 0.4 η = 0.14 0.3 η = 0.22 η = 0.18 0.85 0.8 0.75 0.7 0.65 0.2 0.6 0.1 0 50 100 150 T=nη 200 0.55 0.04 250 0.06 0.08 0.1 0.12 η 0.14 0.16 0.18 0.2 0.22 Figure 2: (right)Probability of success vs. length of the observation interval nη for different values of η. (left) Probability of success vs. η for a fixed length of the observation interval, (nη = 150) . The process is generated for a small value of η and sampled at different rates. As before, we let S 0 be the support of row A0 , and assume |S 0 | ≤ k. Under the model (6) x(t) has r a Gaussian stationary state distribution with covariance Q0 determined by the following modified Lyapunov equation A0 Q0 + Q0 (A0 )∗ + ηA0 Q0 (A0 )∗ + I = 0 . (10) It will be clear from the context whether A0 /Q0 refers to the dynamics/stationary matrix from the continuous or discrete time system. We assume conditions (a) and (b) introduced in Section 1.1, and adopt the notations already introduced there. We use as a shorthand notation σmax ≡ σmax (I +η A0 ) where σmax (.) is the maximum singular value. Also define D ≡ 1 − σmax /η . We will assume D > 0. As in the previous section, we assume the model (6) is initiated in the stationary state. Theorem 3.1. Consider the problem of learning the support S 0 of row A0 from the discrete-time r trajectory {x(t)}0≤t≤n . If nη > 4pk 104 k 2 (kD−2 + A−2 ) min log , 2 DC 2 α δ min (11) then there exists λ such that ℓ1 -regularized least squares recovers the signed support of A0 with r probability larger than 1 − δ. This is achieved by taking λ = (36 log(4p/δ))/(Dα2 nη). In other words the discrete-time sample complexity, n, is logarithmic in the model dimension, polynomial in the maximum network degree and inversely proportional to the time spacing between samples. The last point is particularly important. It enables us to derive the bound on the continuoustime sample complexity as the limit η → 0 of the discrete-time sample complexity. It also confirms our intuition mentioned in the Introduction: although one can produce an arbitrary large number of samples by sampling the continuous process with finer resolutions, there is limited amount of information that can be harnessed from a given time interval [0, T ]. 4 Proofs In the following we denote by X ∈ Rn×p the matrix whose (t + 1)th column corresponds to the configuration x(t), i.e. X = [x(0), x(1), . . . , x(n − 1)]. Further ∆X ∈ Rn×p is the matrix containing configuration changes, namely ∆X = [x(1) − x(0), . . . , x(n) − x(n − 1)]. Finally we write W = [w(1), . . . , w(n − 1)] for the matrix containing the Gaussian noise realization. Equivalently, The r th row of W is denoted by Wr . W = ∆X − ηA X . In order to lighten the notation, we will omit the reference to xn in the likelihood function (9) and 0 simply write L(Ar ). We define its normalized gradient and Hessian by G = −∇L(A0 ) = r 1 ∗ XWr , nη Q = ∇2 L(A0 ) = r 6 1 XX ∗ . n (12) 4.1 Discrete time In this Section we outline our prove for our main result for discrete-time dynamics, i.e., Theorem 3.1. We start by stating a set of sufficient conditions for regularized least squares to work. Then we present a series of concentration lemmas to be used to prove the validity of these conditions, and finally we sketch the outline of the proof. As mentioned, the proof strategy, and in particular the following proposition which provides a compact set of sufficient conditions for the support to be recovered correctly is analogous to the one in [12]. A proof of this proposition can be found in the supplementary material. Proposition 4.1. Let α, Cmin > 0 be be defined by λmin (Q0 0 ,S 0 ) ≡ Cmin , S |||Q0 0 )C ,S 0 Q0 0 ,S 0 S (S −1 |||∞ ≡ 1 − α . (13) If the following conditions hold then the regularized least square solution (8) correctly recover the signed support sign(A0 ): r λα Amin Cmin G ∞≤ , GS 0 ∞ ≤ − λ, (14) 3 4k α Cmin α Cmin √ , √ . |||QS 0 ,S 0 − Q0 0 ,S 0 |||∞ ≤ (15) |||Q(S 0 )C ,S 0 − Q0 0 )C ,S 0 |||∞ ≤ S (S 12 k 12 k Further the same statement holds for the continuous model 3, provided G and Q are the gradient and the hessian of the likelihood (3). The proof of Theorem 3.1 consists in checking that, under the hypothesis (11) on the number of consecutive configurations, conditions (14) to (15) will hold with high probability. Checking these conditions can be regarded in turn as concentration-of-measure statements. Indeed, if expectation is taken with respect to a stationary trajectory, we have E{G} = 0, E{Q} = Q0 . 4.1.1 Technical lemmas In this section we will state the necessary concentration lemmas for proving Theorem 3.1. These are non-trivial because G, Q are quadratic functions of dependent random variables the samples {x(t)}0≤t≤n . The proofs of Proposition 4.2, of Proposition 4.3, and Corollary 4.4 can be found in the supplementary material provided. Our first Proposition implies concentration of G around 0. Proposition 4.2. Let S ⊆ [p] be any set of vertices and ǫ < 1/2. If σmax ≡ σmax (I + η A0 ) < 1, then 2 P GS ∞ > ǫ ≤ 2|S| e−n(1−σmax ) ǫ /4 . (16) We furthermore need to bound the matrix norms as per (15) in proposition 4.1. First we relate bounds on |||QJS − Q0 JS |||∞ with bounds on |Qij − Q0 |, (i ∈ J, i ∈ S) where J and S are any ij subsets of {1, ..., p}. We have, P(|||QJS − Q0 )|||∞ > ǫ) ≤ |J||S| max P(|Qij − Q0 | > ǫ/|S|). JS ij i,j∈J (17) Then, we bound |Qij − Q0 | using the following proposition ij Proposition 4.3. Let i, j ∈ {1, ..., p}, σmax ≡ σmax (I + ηA0 ) < 1, T = ηn > 3/D and 0 < ǫ < 2/D where D = (1 − σmax )/η then, P(|Qij − Q0 )| > ǫ) ≤ 2e ij n − 32η2 (1−σmax )3 ǫ2 . (18) Finally, the next corollary follows from Proposition 4.3 and Eq. (17). Corollary 4.4. Let J, S (|S| ≤ k) be any two subsets of {1, ..., p} and σmax ≡ σmax (I + ηA0 ) < 1, ǫ < 2k/D and nη > 3/D (where D = (1 − σmax )/η) then, P(|||QJS − Q0 |||∞ > ǫ) ≤ 2|J|ke JS 7 n − 32k2 η2 (1−σmax )3 ǫ2 . (19) 4.1.2 Outline of the proof of Theorem 3.1 With these concentration bounds we can now easily prove Theorem 3.1. All we need to do is to compute the probability that the conditions given by Proposition 4.1 hold. From the statement of the theorem we have that the first two conditions (α, Cmin > 0) of Proposition 4.1 hold. In order to make the first condition on G imply the second condition on G we assume that λα/3 ≤ (Amin Cmin )/(4k) − λ which is guaranteed to hold if λ ≤ Amin Cmin /8k. (20) We also combine the two last conditions on Q, thus obtaining the following |||Q[p],S 0 − Q0 0 |||∞ ≤ [p],S α Cmin √ , 12 k (21) since [p] = S 0 ∪ (S 0 )C . We then impose that both the probability of the condition on Q failing and the probability of the condition on G failing are upper bounded by δ/2 using Proposition 4.2 and Corollary 4.4. It is shown in the supplementary material that this is satisfied if condition (11) holds. 4.2 Outline of the proof of Theorem 1.1 To prove Theorem 1.1 we recall that Proposition 4.1 holds provided the appropriate continuous time expressions are used for G and Q, namely G = −∇L(A0 ) = r 1 T T x(t) dbr (t) , 0 Q = ∇2 L(A0 ) = r 1 T T x(t)x(t)∗ dt . (22) 0 These are of course random variables. In order to distinguish these from the discrete time version, we will adopt the notation Gn , Qn for the latter. We claim that these random variables can be coupled (i.e. defined on the same probability space) in such a way that Gn → G and Qn → Q almost surely as n → ∞ for fixed T . Under assumption (5), it is easy to show that (11) holds for all n > n0 with n0 a sufficiently large constant (for a proof see the provided supplementary material). Therefore, by the proof of Theorem 3.1, the conditions in Proposition 4.1 hold for gradient Gn and hessian Qn for any n ≥ n0 , with probability larger than 1 − δ. But by the claimed convergence Gn → G and Qn → Q, they hold also for G and Q with probability at least 1 − δ which proves the theorem. We are left with the task of showing that the discrete and continuous time processes can be coupled in such a way that Gn → G and Qn → Q. With slight abuse of notation, the state of the discrete time system (6) will be denoted by x(i) where i ∈ N and the state of continuous time system (1) by x(t) where t ∈ R. We denote by Q0 the solution of (4) and by Q0 (η) the solution of (10). It is easy to check that Q0 (η) → Q0 as η → 0 by the uniqueness of stationary state distribution. The initial state of the continuous time system x(t = 0) is a N(0, Q0 ) random variable independent of b(t) and the initial state of the discrete time system is defined to be x(i = 0) = (Q0 (η))1/2 (Q0 )−1/2 x(t = 0). At subsequent times, x(i) and x(t) are assumed are generated by the respective dynamical systems using the same matrix A0 using common randomness provided by the standard Brownian motion {b(t)}0≤t≤T in Rp . In order to couple x(t) and x(i), we construct w(i), the noise driving the discrete time system, by letting w(i) ≡ (b(T i/n) − b(T (i − 1)/n)). The almost sure convergence Gn → G and Qn → Q follows then from standard convergence of random walk to Brownian motion. Acknowledgments This work was partially supported by a Terman fellowship, the NSF CAREER award CCF-0743978 and the NSF grant DMS-0806211 and by a Portuguese Doctoral FCT fellowship. 8 References [1] D.T. Gillespie. Stochastic simulation of chemical kinetics. Annual Review of Physical Chemistry, 58:35–55, 2007. [2] D. Higham. Modeling and Simulating Chemical Reactions. SIAM Review, 50:347–368, 2008. [3] N.D.Lawrence et al., editor. Learning and Inference in Computational Systems Biology. MIT Press, 2010. [4] T. Toni, D. Welch, N. Strelkova, A. Ipsen, and M.P.H. Stumpf. Modeling and Simulating Chemical Reactions. J. R. Soc. Interface, 6:187–202, 2009. [5] K. Zhou, J.C. Doyle, and K. Glover. Robust and optimal control. Prentice Hall, 1996. [6] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288, 1996. [7] D.L. Donoho. For most large underdetermined systems of equations, the minimal l1-norm near-solution approximates the sparsest near-solution. Communications on Pure and Applied Mathematics, 59(7):907–934, 2006. [8] D.L. Donoho. For most large underdetermined systems of linear equations the minimal l1norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59(6):797–829, 2006. [9] T. Zhang. Some sharp performance bounds for least squares regression with L1 regularization. Annals of Statistics, 37:2109–2144, 2009. [10] M.J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using l1constrained quadratic programming (Lasso). IEEE Trans. Information Theory, 55:2183–2202, 2009. [11] M.J. Wainwright, P. Ravikumar, and J.D. Lafferty. High-Dimensional Graphical Model Selection Using l-1-Regularized Logistic Regression. Advances in Neural Information Processing Systems, 19:1465, 2007. [12] P. Zhao and B. Yu. On model selection consistency of Lasso. The Journal of Machine Learning Research, 7:2541–2563, 2006. [13] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432, 2008. [14] K. Ball, T.G. Kurtz, L. Popovic, and G. Rempala. Modeling and Simulating Chemical Reactions. Ann. Appl. Prob., 16:1925–1961, 2006. [15] G.A. Pavliotis and A.M. Stuart. Parameter estimation for multiscale diffusions. J. Stat. Phys., 127:741–781, 2007. [16] J. Songsiri, J. Dahl, and L. Vandenberghe. Graphical models of autoregressive processes. pages 89–116, 2010. [17] J. Songsiri and L. Vandenberghe. Topology selection in graphical models of autoregressive processes. Journal of Machine Learning Research, 2010. submitted. [18] F.R.K. Chung. Spectral Graph Theory. CBMS Regional Conference Series in Mathematics, 1997. [19] P. Ravikumar, M.J. Wainwright, and J. Lafferty. High-dimensional Ising model selection using l1-regularized logistic regression. Annals of Statistics, 2008. 9
3 0.16642882 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
Author: Hongbo Zhou, Qiang Cheng
Abstract: Regularization technique has become a principled tool for statistics and machine learning research and practice. However, in most situations, these regularization terms are not well interpreted, especially on how they are related to the loss function and data. In this paper, we propose a robust minimax framework to interpret the relationship between data and regularization terms for a large class of loss functions. We show that various regularization terms are essentially corresponding to different distortions to the original data matrix. This minimax framework includes ridge regression, lasso, elastic net, fused lasso, group lasso, local coordinate coding, multiple kernel learning, etc., as special cases. Within this minimax framework, we further give mathematically exact definition for a novel representation called sparse grouping representation (SGR), and prove a set of sufficient conditions for generating such group level sparsity. Under these sufficient conditions, a large set of consistent regularization terms can be designed. This SGR is essentially different from group lasso in the way of using class or group information, and it outperforms group lasso when there appears group label noise. We also provide some generalization bounds in a classification setting. 1
4 0.16307224 117 nips-2010-Identifying graph-structured activation patterns in networks
Author: James Sharpnack, Aarti Singh
Abstract: We consider the problem of identifying an activation pattern in a complex, largescale network that is embedded in very noisy measurements. This problem is relevant to several applications, such as identifying traces of a biochemical spread by a sensor network, expression levels of genes, and anomalous activity or congestion in the Internet. Extracting such patterns is a challenging task specially if the network is large (pattern is very high-dimensional) and the noise is so excessive that it masks the activity at any single node. However, typically there are statistical dependencies in the network activation process that can be leveraged to fuse the measurements of multiple nodes and enable reliable extraction of highdimensional noisy patterns. In this paper, we analyze an estimator based on the graph Laplacian eigenbasis, and establish the limits of mean square error recovery of noisy patterns arising from a probabilistic (Gaussian or Ising) model based on an arbitrary graph structure. We consider both deterministic and probabilistic network evolution models, and our results indicate that by leveraging the network interaction structure, it is possible to consistently recover high-dimensional patterns even when the noise variance increases with network size. 1
5 0.15099612 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
Author: Jean Morales, Charles A. Micchelli, Massimiliano Pontil
Abstract: We study the problem of learning a sparse linear regression vector under additional conditions on the structure of its sparsity pattern. We present a family of convex penalty functions, which encode this prior knowledge by means of a set of constraints on the absolute values of the regression coefficients. This family subsumes the ℓ1 norm and is flexible enough to include different models of sparsity patterns, which are of practical and theoretical importance. We establish some important properties of these functions and discuss some examples where they can be computed explicitly. Moreover, we present a convergent optimization algorithm for solving regularized least squares with these penalty functions. Numerical simulations highlight the benefit of structured sparsity and the advantage offered by our approach over the Lasso and other related methods.
6 0.15037784 172 nips-2010-Multi-Stage Dantzig Selector
7 0.14988711 5 nips-2010-A Dirty Model for Multi-task Learning
8 0.12375899 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
9 0.11894866 26 nips-2010-Adaptive Multi-Task Lasso: with Application to eQTL Detection
10 0.1167411 222 nips-2010-Random Walk Approach to Regret Minimization
11 0.11493398 219 nips-2010-Random Conic Pursuit for Semidefinite Programming
12 0.11041851 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
13 0.10545329 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
14 0.10520937 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
15 0.10516623 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
16 0.099720292 170 nips-2010-Moreau-Yosida Regularization for Grouped Tree Structure Learning
17 0.099101856 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
18 0.098212086 211 nips-2010-Predicting Execution Time of Computer Programs Using Sparse Polynomial Regression
19 0.097830258 134 nips-2010-LSTD with Random Projections
20 0.096834727 243 nips-2010-Smoothness, Low Noise and Fast Rates
topicId topicWeight
[(0, 0.258), (1, 0.008), (2, 0.163), (3, 0.142), (4, 0.089), (5, -0.111), (6, -0.055), (7, 0.092), (8, -0.185), (9, -0.007), (10, 0.03), (11, 0.089), (12, -0.116), (13, 0.092), (14, -0.01), (15, -0.044), (16, 0.018), (17, 0.019), (18, 0.083), (19, 0.072), (20, 0.012), (21, 0.117), (22, 0.016), (23, 0.146), (24, -0.006), (25, -0.006), (26, 0.06), (27, -0.004), (28, -0.071), (29, -0.164), (30, -0.051), (31, 0.032), (32, -0.039), (33, 0.047), (34, -0.001), (35, -0.074), (36, 0.023), (37, 0.064), (38, -0.028), (39, -0.049), (40, 0.014), (41, 0.049), (42, -0.099), (43, -0.022), (44, 0.062), (45, -0.036), (46, -0.039), (47, 0.068), (48, 0.016), (49, 0.035)]
simIndex simValue paperId paperTitle
same-paper 1 0.95466596 265 nips-2010-The LASSO risk: asymptotic results and real world examples
Author: Mohsen Bayati, José Pereira, Andrea Montanari
Abstract: We consider the problem of learning a coefficient vector x0 ∈ RN from noisy linear observation y = Ax0 + w ∈ Rn . In many contexts (ranging from model selection to image processing) it is desirable to construct a sparse estimator x. In this case, a popular approach consists in solving an ℓ1 -penalized least squares problem known as the LASSO or Basis Pursuit DeNoising (BPDN). For sequences of matrices A of increasing dimensions, with independent gaussian entries, we prove that the normalized risk of the LASSO converges to a limit, and we obtain an explicit expression for this limit. Our result is the first rigorous derivation of an explicit formula for the asymptotic mean square error of the LASSO for random instances. The proof technique is based on the analysis of AMP, a recently developed efficient algorithm, that is inspired from graphical models ideas. Through simulations on real data matrices (gene expression data and hospital medical records) we observe that these results can be relevant in a broad array of practical applications.
2 0.77827209 5 nips-2010-A Dirty Model for Multi-task Learning
Author: Ali Jalali, Sujay Sanghavi, Chao Ruan, Pradeep K. Ravikumar
Abstract: We consider multi-task learning in the setting of multiple linear regression, and where some relevant features could be shared across the tasks. Recent research has studied the use of ℓ1 /ℓq norm block-regularizations with q > 1 for such blocksparse structured problems, establishing strong guarantees on recovery even under high-dimensional scaling where the number of features scale with the number of observations. However, these papers also caution that the performance of such block-regularized methods are very dependent on the extent to which the features are shared across tasks. Indeed they show [8] that if the extent of overlap is less than a threshold, or even if parameter values in the shared features are highly uneven, then block ℓ1 /ℓq regularization could actually perform worse than simple separate elementwise ℓ1 regularization. Since these caveats depend on the unknown true parameters, we might not know when and which method to apply. Even otherwise, we are far away from a realistic multi-task setting: not only do the set of relevant features have to be exactly the same across tasks, but their values have to as well. Here, we ask the question: can we leverage parameter overlap when it exists, but not pay a penalty when it does not ? Indeed, this falls under a more general question of whether we can model such dirty data which may not fall into a single neat structural bracket (all block-sparse, or all low-rank and so on). With the explosion of such dirty high-dimensional data in modern settings, it is vital to develop tools – dirty models – to perform biased statistical estimation tailored to such data. Here, we take a first step, focusing on developing a dirty model for the multiple regression problem. Our method uses a very simple idea: we estimate a superposition of two sets of parameters and regularize them differently. We show both theoretically and empirically, our method strictly and noticeably outperforms both ℓ1 or ℓ1 /ℓq methods, under high-dimensional scaling and over the entire range of possible overlaps (except at boundary cases, where we match the best method). 1 Introduction: Motivation and Setup High-dimensional scaling. In fields across science and engineering, we are increasingly faced with problems where the number of variables or features p is larger than the number of observations n. Under such high-dimensional scaling, for any hope of statistically consistent estimation, it becomes vital to leverage any potential structure in the problem such as sparsity (e.g. in compressed sensing [3] and LASSO [14]), low-rank structure [13, 9], or sparse graphical model structure [12]. It is in such high-dimensional contexts in particular that multi-task learning [4] could be most useful. Here, 1 multiple tasks share some common structure such as sparsity, and estimating these tasks jointly by leveraging this common structure could be more statistically efficient. Block-sparse Multiple Regression. A common multiple task learning setting, and which is the focus of this paper, is that of multiple regression, where we have r > 1 response variables, and a common set of p features or covariates. The r tasks could share certain aspects of their underlying distributions, such as common variance, but the setting we focus on in this paper is where the response variables have simultaneously sparse structure: the index set of relevant features for each task is sparse; and there is a large overlap of these relevant features across the different regression problems. Such “simultaneous sparsity” arises in a variety of contexts [15]; indeed, most applications of sparse signal recovery in contexts ranging from graphical model learning, kernel learning, and function estimation have natural extensions to the simultaneous-sparse setting [12, 2, 11]. It is useful to represent the multiple regression parameters via a matrix, where each column corresponds to a task, and each row to a feature. Having simultaneous sparse structure then corresponds to the matrix being largely “block-sparse” – where each row is either all zero or mostly non-zero, and the number of non-zero rows is small. A lot of recent research in this setting has focused on ℓ1 /ℓq norm regularizations, for q > 1, that encourage the parameter matrix to have such blocksparse structure. Particular examples include results using the ℓ1 /ℓ∞ norm [16, 5, 8], and the ℓ1 /ℓ2 norm [7, 10]. Dirty Models. Block-regularization is “heavy-handed” in two ways. By strictly encouraging sharedsparsity, it assumes that all relevant features are shared, and hence suffers under settings, arguably more realistic, where each task depends on features specific to itself in addition to the ones that are common. The second concern with such block-sparse regularizers is that the ℓ1 /ℓq norms can be shown to encourage the entries in the non-sparse rows taking nearly identical values. Thus we are far away from the original goal of multitask learning: not only do the set of relevant features have to be exactly the same, but their values have to as well. Indeed recent research into such regularized methods [8, 10] caution against the use of block-regularization in regimes where the supports and values of the parameters for each task can vary widely. Since the true parameter values are unknown, that would be a worrisome caveat. We thus ask the question: can we learn multiple regression models by leveraging whatever overlap of features there exist, and without requiring the parameter values to be near identical? Indeed this is an instance of a more general question on whether we can estimate statistical models where the data may not fall cleanly into any one structural bracket (sparse, block-sparse and so on). With the explosion of dirty high-dimensional data in modern settings, it is vital to investigate estimation of corresponding dirty models, which might require new approaches to biased high-dimensional estimation. In this paper we take a first step, focusing on such dirty models for a specific problem: simultaneously sparse multiple regression. Our approach uses a simple idea: while any one structure might not capture the data, a superposition of structural classes might. Our method thus searches for a parameter matrix that can be decomposed into a row-sparse matrix (corresponding to the overlapping or shared features) and an elementwise sparse matrix (corresponding to the non-shared features). As we show both theoretically and empirically, with this simple fix we are able to leverage any extent of shared features, while allowing disparities in support and values of the parameters, so that we are always better than both the Lasso or block-sparse regularizers (at times remarkably so). The rest of the paper is organized as follows: In Sec 2. basic definitions and setup of the problem are presented. Main results of the paper is discussed in sec 3. Experimental results and simulations are demonstrated in Sec 4. Notation: For any matrix M , we denote its j th row as Mj , and its k-th column as M (k) . The set of all non-zero rows (i.e. all rows with at least one non-zero element) is denoted by RowSupp(M ) (k) and its support by Supp(M ). Also, for any matrix M , let M 1,1 := j,k |Mj |, i.e. the sums of absolute values of the elements, and M 1,∞ := j 2 Mj ∞ where, Mj ∞ (k) := maxk |Mj |. 2 Problem Set-up and Our Method Multiple regression. We consider the following standard multiple linear regression model: ¯ y (k) = X (k) θ(k) + w(k) , k = 1, . . . , r, where y (k) ∈ Rn is the response for the k-th task, regressed on the design matrix X (k) ∈ Rn×p (possibly different across tasks), while w(k) ∈ Rn is the noise vector. We assume each w(k) is drawn independently from N (0, σ 2 ). The total number of tasks or target variables is r, the number of features is p, while the number of samples we have for each task is n. For notational convenience, ¯ we collate these quantities into matrices Y ∈ Rn×r for the responses, Θ ∈ Rp×r for the regression n×r parameters and W ∈ R for the noise. ¯ Dirty Model. In this paper we are interested in estimating the true parameter Θ from data by lever¯ aging any (unknown) extent of simultaneous-sparsity. In particular, certain rows of Θ would have many non-zero entries, corresponding to features shared by several tasks (“shared” rows), while certain rows would be elementwise sparse, corresponding to those features which are relevant for some tasks but not all (“non-shared rows”), while certain rows would have all zero entries, corresponding to those features that are not relevant to any task. We are interested in estimators Θ that automatically adapt to different levels of sharedness, and yet enjoy the following guarantees: Support recovery: We say an estimator Θ successfully recovers the true signed support if ¯ sign(Supp(Θ)) = sign(Supp(Θ)). We are interested in deriving sufficient conditions under which ¯ the estimator succeeds. We note that this is stronger than merely recovering the row-support of Θ, which is union of its supports for the different tasks. In particular, denoting Uk for the support of the ¯ k-th column of Θ, and U = k Uk . Error bounds: We are also interested in providing bounds on the elementwise ℓ∞ norm error of the estimator Θ, ¯ Θ−Θ 2.1 ∞ = max max j=1,...,p k=1,...,r (k) Θj (k) ¯ − Θj . Our Method Our method explicitly models the dirty block-sparse structure. We estimate a sum of two parameter matrices B and S with different regularizations for each: encouraging block-structured row-sparsity in B and elementwise sparsity in S. The corresponding “clean” models would either just use blocksparse regularizations [8, 10] or just elementwise sparsity regularizations [14, 18], so that either method would perform better in certain suited regimes. Interestingly, as we will see in the main results, by explicitly allowing to have both block-sparse and elementwise sparse component, we are ¯ able to outperform both classes of these “clean models”, for all regimes Θ. Algorithm 1 Dirty Block Sparse Solve the following convex optimization problem: (S, B) ∈ arg min S,B 1 2n r k=1 y (k) − X (k) S (k) + B (k) 2 2 + λs S 1,1 + λb B 1,∞ . (1) Then output Θ = B + S. 3 Main Results and Their Consequences We now provide precise statements of our main results. A number of recent results have shown that the Lasso [14, 18] and ℓ1 /ℓ∞ block-regularization [8] methods succeed in recovering signed supports with controlled error bounds under high-dimensional scaling regimes. Our first two theorems extend these results to our dirty model setting. In Theorem 1, we consider the case of deterministic design matrices X (k) , and provide sufficient conditions guaranteeing signed support recovery, and elementwise ℓ∞ norm error bounds. In Theorem 2, we specialize this theorem to the case where the 3 rows of the design matrices are random from a general zero mean Gaussian distribution: this allows us to provide scaling on the number of observations required in order to guarantee signed support recovery and bounded elementwise ℓ∞ norm error. Our third result is the most interesting in that it explicitly quantifies the performance gains of our method vis-a-vis Lasso and the ℓ1 /ℓ∞ block-regularization method. Since this entailed finding the precise constants underlying earlier theorems, and a correspondingly more delicate analysis, we follow Negahban and Wainwright [8] and focus on the case where there are two-tasks (i.e. r = 2), and where we have standard Gaussian design matrices as in Theorem 2. Further, while each of two tasks depends on s features, only a fraction α of these are common. It is then interesting to see how the behaviors of the different regularization methods vary with the extent of overlap α. Comparisons. Negahban and Wainwright [8] show that there is actually a “phase transition” in the scaling of the probability of successful signed support-recovery with the number of observations. n Denote a particular rescaling of the sample-size θLasso (n, p, α) = s log(p−s) . Then as Wainwright [18] show, when the rescaled number of samples scales as θLasso > 2 + δ for any δ > 0, Lasso succeeds in recovering the signed support of all columns with probability converging to one. But when the sample size scales as θLasso < 2−δ for any δ > 0, Lasso fails with probability converging to one. For the ℓ1 /ℓ∞ -reguralized multiple linear regression, define a similar rescaled sample size n θ1,∞ (n, p, α) = s log(p−(2−α)s) . Then as Negahban and Wainwright [8] show there is again a transition in probability of success from near zero to near one, at the rescaled sample size of θ1,∞ = (4 − 3α). Thus, for α < 2/3 (“less sharing”) Lasso would perform better since its transition is at a smaller sample size, while for α > 2/3 (“more sharing”) the ℓ1 /ℓ∞ regularized method would perform better. As we show in our third theorem, the phase transition for our method occurs at the rescaled sample size of θ1,∞ = (2 − α), which is strictly before either the Lasso or the ℓ1 /ℓ∞ regularized method except for the boundary cases: α = 0, i.e. the case of no sharing, where we match Lasso, and for α = 1, i.e. full sharing, where we match ℓ1 /ℓ∞ . Everywhere else, we strictly outperform both methods. Figure 3 shows the empirical performance of each of the three methods; as can be seen, they agree very well with the theoretical analysis. (Further details in the experiments Section 4). 3.1 Sufficient Conditions for Deterministic Designs We first consider the case where the design matrices X (k) for k = 1, · · ·, r are deterministic, and start by specifying the assumptions we impose on the model. We note that similar sufficient conditions for the deterministic X (k) ’s case were imposed in papers analyzing Lasso [18] and block-regularization methods [8, 10]. (k) A0 Column Normalization Xj 2 ≤ √ 2n for all j = 1, . . . , p, k = 1, . . . , r. ¯ Let Uk denote the support of the k-th column of Θ, and U = supports for each task. Then we require that k r A1 Incoherence Condition γb := 1 − max c j∈U (k) (k) Xj , XUk (k) (k) XUk , XUk Uk denote the union of −1 c We will also find it useful to define γs := 1−max1≤k≤r maxj∈Uk (k) > 0. 1 k=1 (k) Xj , XUk Note that by the incoherence condition A1, we have γs > 0. A2 Eigenvalue Condition Cmin := min λmin 1≤k≤r A3 Boundedness Condition Dmax := max 1≤k≤r 1 (k) (k) XUk , XUk n 1 (k) (k) XUk , XUk n (k) (k) XUk , XUk −1 . 1 > 0. −1 ∞,1 < ∞. Further, we require the regularization penalties be set as λs > 2(2 − γs )σ log(pr) √ γs n and 4 λb > 2(2 − γb )σ log(pr) √ . γb n (2) 1 0.9 0.8 0.8 Dirty Model L1/Linf Reguralizer Probability of Success Probability of Success 1 0.9 0.7 0.6 0.5 0.4 LASSO 0.3 0.2 0 0.5 1 1.5 1.7 2 2.5 Control Parameter θ 3 3.1 3.5 0.6 0.5 0.4 L1/Linf Reguralizer 0.3 LASSO 0.2 p=128 p=256 p=512 0.1 Dirty Model 0.7 p=128 p=256 p=512 0.1 0 0.5 4 1 1.333 (a) α = 0.3 1.5 2 Control Parameter θ (b) α = 2.5 3 2 3 1 0.9 Dirty Model Probability of Success 0.8 0.7 L1/Linf Reguralizer 0.6 0.5 LASSO 0.4 0.3 0.2 p=128 p=256 p=512 0.1 0 0.5 1 1.2 1.5 1.6 2 Control Parameter θ 2.5 (c) α = 0.8 Figure 1: Probability of success in recovering the true signed support using dirty model, Lasso and ℓ1 /ℓ∞ regularizer. For a 2-task problem, the probability of success for different values of feature-overlap fraction α is plotted. As we can see in the regimes that Lasso is better than, as good as and worse than ℓ1 /ℓ∞ regularizer ((a), (b) and (c) respectively), the dirty model outperforms both of the methods, i.e., it requires less number of observations for successful recovery of the true signed support compared to Lasso and ℓ1 /ℓ∞ regularizer. Here p s = ⌊ 10 ⌋ always. Theorem 1. Suppose A0-A3 hold, and that we obtain estimate Θ from our algorithm with regularization parameters chosen according to (2). Then, with probability at least 1 − c1 exp(−c2 n) → 1, we are guaranteed that the convex program (1) has a unique optimum and (a) The estimate Θ has no false inclusions, and has bounded ℓ∞ norm error so that ¯ Supp(Θ) ⊆ Supp(Θ), and ¯ Θ−Θ ∞,∞ 4σ 2 log (pr) + λs Dmax . n Cmin ≤ bmin ¯ (b) sign(Supp(Θ)) = sign Supp(Θ) provided that min ¯ (j,k)∈Supp(Θ) ¯(k) θj > bmin . Here the positive constants c1 , c2 depend only on γs , γb , λs , λb and σ, but are otherwise independent of n, p, r, the problem dimensions of interest. Remark: Condition (a) guarantees that the estimate will have no false inclusions; i.e. all included features will be relevant. If in addition, we require that it have no false exclusions and that recover the support exactly, we need to impose the assumption in (b) that the non-zero elements are large enough to be detectable above the noise. 3.2 General Gaussian Designs Often the design matrices consist of samples from a Gaussian ensemble. Suppose that for each task (k) k = 1, . . . , r the design matrix X (k) ∈ Rn×p is such that each row Xi ∈ Rp is a zero-mean Gaussian random vector with covariance matrix Σ(k) ∈ Rp×p , and is independent of every other (k) row. Let ΣV,U ∈ R|V|×|U | be the submatrix of Σ(k) with rows corresponding to V and columns to U . We require these covariance matrices to satisfy the following conditions: r C1 Incoherence Condition γb := 1 − max c j∈U (k) (k) Σj,Uk , ΣUk ,Uk k=1 5 −1 >0 1 C2 Eigenvalue Condition Cmin := min λmin Σ(k),Uk Uk > 0 so that the minimum eigenvalue 1≤k≤r is bounded away from zero. C3 Boundedness Condition Dmax := (k) ΣUk ,Uk −1 ∞,1 < ∞. These conditions are analogues of the conditions for deterministic designs; they are now imposed on the covariance matrix of the (randomly generated) rows of the design matrix. Further, defining s := maxk |Uk |, we require the regularization penalties be set as 1/2 λs > 1/2 4σ 2 Cmin log(pr) √ γs nCmin − 2s log(pr) and λb > 4σ 2 Cmin r(r log(2) + log(p)) . √ γb nCmin − 2sr(r log(2) + log(p)) (3) Theorem 2. Suppose assumptions C1-C3 hold, and that the number of samples scale as n > max 2s log(pr) 2sr r log(2)+log(p) 2 2 Cmin γs , Cmin γb . Suppose we obtain estimate Θ from algorithm (3). Then, with probability at least 1 − c1 exp (−c2 (r log(2) + log(p))) − c3 exp(−c4 log(rs)) → 1 for some positive numbers c1 − c4 , we are guaranteed that the algorithm estimate Θ is unique and satisfies the following conditions: (a) the estimate Θ has no false inclusions, and has bounded ℓ∞ norm error so that ¯ Supp(Θ) ⊆ Supp(Θ), and ¯ Θ−Θ ∞,∞ ≤ 50σ 2 log(rs) + λs nCmin 4s √ + Dmax . Cmin n gmin ¯ (b) sign(Supp(Θ)) = sign Supp(Θ) provided that 3.3 min ¯ (j,k)∈Supp(Θ) ¯(k) θj > gmin . Sharp Transition for 2-Task Gaussian Designs This is one of the most important results of this paper. Here, we perform a more delicate and finer analysis to establish precise quantitative gains of our method. We focus on the special case where r = 2 and the design matrix has rows generated from the standard Gaussian distribution N (0, In×n ), so that C1 − C3 hold, with Cmin = Dmax = 1. As we will see both analytically and experimentally, our method strictly outperforms both Lasso and ℓ1 /ℓ∞ -block-regularization over for all cases, except at the extreme endpoints of no support sharing (where it matches that of Lasso) and full support sharing (where it matches that of ℓ1 /ℓ∞ ). We now present our analytical results; the empirical comparisons are presented next in Section 4. The results will be in terms of a particular rescaling of the sample size n as θ(n, p, s, α) := n . (2 − α)s log (p − (2 − α)s) We will also require the assumptions that 4σ 2 (1 − F1 λs > F2 λb > s/n)(log(r) + log(p − (2 − α)s)) 1/2 (n)1/2 − (s)1/2 − ((2 − α) s (log(r) + log(p − (2 − α)s)))1/2 4σ 2 (1 − s/n)r(r log(2) + log(p − (2 − α)s)) , 1/2 (n)1/2 − (s)1/2 − ((1 − α/2) sr (r log(2) + log(p − (2 − α)s)))1/2 . Theorem 3. Consider a 2-task regression problem (n, p, s, α), where the design matrix has rows generated from the standard Gaussian distribution N (0, In×n ). 6 Suppose maxj∈B∗ ∗(1) Θj − ∗(2) Θj = o(λs ), where B ∗ is the submatrix of Θ∗ with rows where both entries are non-zero. Then the estimate Θ of the problem (1) satisfies the following: (Success) Suppose the regularization coefficients satisfy F1 − F2. Further, assume that the number of samples scales as θ(n, p, s, α) > 1. Then, with probability at least 1 − c1 exp(−c2 n) for some positive numbers c1 and c2 , we are guaranteed that Θ satisfies the support-recovery and ℓ∞ error bound conditions (a-b) in Theorem 2. ˆ ˆ (Failure) If θ(n, p, s, α) < 1 there is no solution (B, S) for any choices of λs and λb such that ¯ sign Supp(Θ) = sign Supp(Θ) . We note that we require the gap ∗(1) Θj ∗(2) − Θj to be small only on rows where both entries are non-zero. As we show in a more general theorem in the appendix, even in the case where the gap is large, the dependence of the sample scaling on the gap is quite weak. 4 Empirical Results In this section, we investigate the performance of our dirty block sparse estimator on synthetic and real-world data. The synthetic experiments explore the accuracy of Theorem 3, and compare our estimator with LASSO and the ℓ1 /ℓ∞ regularizer. We see that Theorem 3 is very accurate indeed. Next, we apply our method to a real world datasets containing hand-written digits for classification. Again we compare against LASSO and the ℓ1 /ℓ∞ . (a multi-task regression dataset) with r = 2 tasks. In both of this real world dataset, we show that dirty model outperforms both LASSO and ℓ1 /ℓ∞ practically. For each method, the parameters are chosen via cross-validation; see supplemental material for more details. 4.1 Synthetic Data Simulation We consider a r = 2-task regression problem as discussed in Theorem 3, for a range of parameters (n, p, s, α). The design matrices X have each entry being i.i.d. Gaussian with mean 0 and variance 1. For each fixed set of (n, s, p, α), we generate 100 instances of the problem. In each instance, ¯ given p, s, α, the locations of the non-zero entries of the true Θ are chosen at randomly; each nonzero entry is then chosen to be i.i.d. Gaussian with mean 0 and variance 1. n samples are then generated from this. We then attempt to estimate using three methods: our dirty model, ℓ1 /ℓ∞ regularizer and LASSO. In each case, and for each instance, the penalty regularizer coefficients are found by cross validation. After solving the three problems, we compare the signed support of the solution with the true signed support and decide whether or not the program was successful in signed support recovery. We describe these process in more details in this section. Performance Analysis: We ran the algorithm for five different values of the overlap ratio α ∈ 2 {0.3, 3 , 0.8} with three different number of features p ∈ {128, 256, 512}. For any instance of the ˆ ¯ problem (n, p, s, α), if the recovered matrix Θ has the same sign support as the true Θ, then we count it as success, otherwise failure (even if one element has different sign, we count it as failure). As Theorem 3 predicts and Fig 3 shows, the right scaling for the number of oservations is n s log(p−(2−α)s) , where all curves stack on the top of each other at 2 − α. Also, the number of observations required by dirty model for true signed support recovery is always less than both LASSO and ℓ1 /ℓ∞ regularizer. Fig 1(a) shows the probability of success for the case α = 0.3 (when LASSO is better than ℓ1 /ℓ∞ regularizer) and that dirty model outperforms both methods. When α = 2 3 (see Fig 1(b)), LASSO and ℓ1 /ℓ∞ regularizer performs the same; but dirty model require almost 33% less observations for the same performance. As α grows toward 1, e.g. α = 0.8 as shown in Fig 1(c), ℓ1 /ℓ∞ performs better than LASSO. Still, dirty model performs better than both methods in this case as well. 7 4 p=128 p=256 p=512 Phase Transition Threshold 3.5 L1/Linf Regularizer 3 2.5 LASSO 2 Dirty Model 1.5 1 0 0.1 0.2 0.3 0.4 0.5 0.6 Shared Support Parameter α 0.7 0.8 0.9 1 Figure 2: Verification of the result of the Theorem 3 on the behavior of phase transition threshold by changing the parameter α in a 2-task (n, p, s, α) problem for dirty model, LASSO and ℓ1 /ℓ∞ regularizer. The y-axis p n is s log(p−(2−α)s) , where n is the number of samples at which threshold was observed. Here s = ⌊ 10 ⌋. Our dirty model method shows a gain in sample complexity over the entire range of sharing α. The pre-constant in Theorem 3 is also validated. n 10 20 40 Average Classification Error Variance of Error Average Row Support Size Average Support Size Average Classification Error Variance of Error Average Row Support Size Average Support Size Average Classification Error Variance of Error Average Row Support Size Average Support Size Our Model 8.6% 0.53% B:165 B + S:171 S:18 B + S:1651 3.0% 0.56% B:211 B + S:226 S:34 B + S:2118 2.2% 0.57% B:270 B + S:299 S:67 B + S:2761 ℓ1 /ℓ∞ 9.9% 0.64% 170 1700 3.5% 0.62% 217 2165 3.2% 0.68% 368 3669 LASSO 10.8% 0.51% 123 539 4.1% 0.68% 173 821 2.8% 0.85% 354 2053 Table 1: Handwriting Classification Results for our model, ℓ1 /ℓ∞ and LASSO Scaling Verification: To verify that the phase transition threshold changes linearly with α as predicted by Theorem 3, we plot the phase transition threshold versus α. For five different values of 2 α ∈ {0.05, 0.3, 3 , 0.8, 0.95} and three different values of p ∈ {128, 256, 512}, we find the phase transition threshold for dirty model, LASSO and ℓ1 /ℓ∞ regularizer. We consider the point where the probability of success in recovery of signed support exceeds 50% as the phase transition threshold. We find this point by interpolation on the closest two points. Fig 2 shows that phase transition threshold for dirty model is always lower than the phase transition for LASSO and ℓ1 /ℓ∞ regularizer. 4.2 Handwritten Digits Dataset We use the handwritten digit dataset [1], containing features of handwritten numerals (0-9) extracted from a collection of Dutch utility maps. This dataset has been used by a number of papers [17, 6] as a reliable dataset for handwritten recognition algorithms. There are thus r = 10 tasks, and each handwritten sample consists of p = 649 features. Table 1 shows the results of our analysis for different sizes n of the training set . We measure the classification error for each digit to get the 10-vector of errors. Then, we find the average error and the variance of the error vector to show how the error is distributed over all tasks. We compare our method with ℓ1 /ℓ∞ reguralizer method and LASSO. Again, in all methods, parameters are chosen via cross-validation. For our method we separate out the B and S matrices that our method finds, so as to illustrate how many features it identifies as “shared” and how many as “non-shared”. For the other methods we just report the straight row and support numbers, since they do not make such a separation. Acknowledgements We acknowledge support from NSF grant IIS-101842, and NSF CAREER program, Grant 0954059. 8 References [1] A. Asuncion and D.J. Newman. UCI Machine Learning Repository, http://www.ics.uci.edu/ mlearn/MLRepository.html. University of California, School of Information and Computer Science, Irvine, CA, 2007. [2] F. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179–1225, 2008. [3] R. Baraniuk. Compressive sensing. IEEE Signal Processing Magazine, 24(4):118–121, 2007. [4] R. Caruana. Multitask learning. Machine Learning, 28:41–75, 1997. [5] C.Zhang and J.Huang. Model selection consistency of the lasso selection in high-dimensional linear regression. Annals of Statistics, 36:1567–1594, 2008. [6] X. He and P. Niyogi. Locality preserving projections. In NIPS, 2003. [7] K. Lounici, A. B. Tsybakov, M. Pontil, and S. A. van de Geer. Taking advantage of sparsity in multi-task learning. In 22nd Conference On Learning Theory (COLT), 2009. [8] S. Negahban and M. J. Wainwright. Joint support recovery under high-dimensional scaling: Benefits and perils of ℓ1,∞ -regularization. In Advances in Neural Information Processing Systems (NIPS), 2008. [9] S. Negahban and M. J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. In ICML, 2010. [10] G. Obozinski, M. J. Wainwright, and M. I. Jordan. Support union recovery in high-dimensional multivariate regression. Annals of Statistics, 2010. [11] P. Ravikumar, H. Liu, J. Lafferty, and L. Wasserman. Sparse additive models. Journal of the Royal Statistical Society, Series B. [12] P. Ravikumar, M. J. Wainwright, and J. Lafferty. High-dimensional ising model selection using ℓ1 -regularized logistic regression. Annals of Statistics, 2009. [13] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. In Allerton Conference, Allerton House, Illinois, 2007. [14] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996. [15] J. A. Tropp, A. C. Gilbert, and M. J. Strauss. Algorithms for simultaneous sparse approximation. Signal Processing, Special issue on “Sparse approximations in signal and image processing”, 86:572–602, 2006. [16] B. Turlach, W.N. Venables, and S.J. Wright. Simultaneous variable selection. Techno- metrics, 27:349–363, 2005. [17] M. van Breukelen, R.P.W. Duin, D.M.J. Tax, and J.E. den Hartog. Handwritten digit recognition by combined classifiers. Kybernetika, 34(4):381–386, 1998. [18] M. J. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1 -constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55: 2183–2202, 2009. 9
3 0.77718401 172 nips-2010-Multi-Stage Dantzig Selector
Author: Ji Liu, Peter Wonka, Jieping Ye
Abstract: We consider the following sparse signal recovery (or feature selection) problem: given a design matrix X ∈ Rn×m (m n) and a noisy observation vector y ∈ Rn satisfying y = Xβ ∗ + where is the noise vector following a Gaussian distribution N (0, σ 2 I), how to recover the signal (or parameter vector) β ∗ when the signal is sparse? The Dantzig selector has been proposed for sparse signal recovery with strong theoretical guarantees. In this paper, we propose a multi-stage Dantzig selector method, which iteratively refines the target signal β ∗ . We show that if X obeys a certain condition, then with a large probability the difference between the solution ˆ β estimated by the proposed method and the true solution β ∗ measured in terms of the lp norm (p ≥ 1) is bounded as ˆ β − β∗ p ≤ C(s − N )1/p log m + ∆ σ, where C is a constant, s is the number of nonzero entries in β ∗ , ∆ is independent of m and is much smaller than the first term, and N is the number of entries of √ β ∗ larger than a certain value in the order of O(σ log m). The proposed method improves the estimation bound of the standard Dantzig selector approximately √ √ from Cs1/p log mσ to C(s − N )1/p log mσ where the value N depends on the number of large entries in β ∗ . When N = s, the proposed algorithm achieves the oracle solution with a high probability. In addition, with a large probability, the proposed method can select the same number of correct features under a milder condition than the Dantzig selector. 1
4 0.7380324 148 nips-2010-Learning Networks of Stochastic Differential Equations
Author: José Pereira, Morteza Ibrahimi, Andrea Montanari
Abstract: We consider linear models for stochastic dynamics. To any such model can be associated a network (namely a directed graph) describing which degrees of freedom interact under the dynamics. We tackle the problem of learning such a network from observation of the system trajectory over a time interval T . We analyze the ℓ1 -regularized least squares algorithm and, in the setting in which the underlying network is sparse, we prove performance guarantees that are uniform in the sampling rate as long as this is sufficiently high. This result substantiates the notion of a well defined ‘time complexity’ for the network inference problem. keywords: Gaussian processes, model selection and structure learning, graphical models, sparsity and feature selection. 1 Introduction and main results Let G = (V, E) be a directed graph with weight A0 ∈ R associated to the directed edge (j, i) from ij j ∈ V to i ∈ V . To each node i ∈ V in this network is associated an independent standard Brownian motion bi and a variable xi taking values in R and evolving according to A0 xj (t) dt + dbi (t) , ij dxi (t) = j∈∂+ i where ∂+ i = {j ∈ V : (j, i) ∈ E} is the set of ‘parents’ of i. Without loss of generality we shall take V = [p] ≡ {1, . . . , p}. In words, the rate of change of xi is given by a weighted sum of the current values of its neighbors, corrupted by white noise. In matrix notation, the same system is then represented by dx(t) = A0 x(t) dt + db(t) , p (1) 0 p×p with x(t) ∈ R , b(t) a p-dimensional standard Brownian motion and A ∈ R a matrix with entries {A0 }i,j∈[p] whose sparsity pattern is given by the graph G. We assume that the linear system ij x(t) = A0 x(t) is stable (i.e. that the spectrum of A0 is contained in {z ∈ C : Re(z) < 0}). Further, ˙ we assume that x(t = 0) is in its stationary state. More precisely, x(0) is a Gaussian random variable 1 independent of b(t), distributed according to the invariant measure. Under the stability assumption, this a mild restriction, since the system converges exponentially to stationarity. A portion of time length T of the system trajectory {x(t)}t∈[0,T ] is observed and we ask under which conditions these data are sufficient to reconstruct the graph G (i.e., the sparsity pattern of A0 ). We are particularly interested in computationally efficient procedures, and in characterizing the scaling of the learning time for large networks. Can the network structure be learnt in a time scaling linearly with the number of its degrees of freedom? As an example application, chemical reactions can be conveniently modeled by systems of nonlinear stochastic differential equations, whose variables encode the densities of various chemical species [1, 2]. Complex biological networks might involve hundreds of such species [3], and learning stochastic models from data is an important (and challenging) computational task [4]. Considering one such chemical reaction network in proximity of an equilibrium point, the model (1) can be used to trace fluctuations of the species counts with respect to the equilibrium values. The network G would represent in this case the interactions between different chemical factors. Work in this area focused so-far on low-dimensional networks, i.e. on methods that are guaranteed to be correct for fixed p, as T → ∞, while we will tackle here the regime in which both p and T diverge. Before stating our results, it is useful to stress a few important differences with respect to classical graphical model learning problems: (i) Samples are not independent. This can (and does) increase the sample complexity. (ii) On the other hand, infinitely many samples are given as data (in fact a collection indexed by the continuous parameter t ∈ [0, T ]). Of course one can select a finite subsample, for instance at regularly spaced times {x(i η)}i=0,1,... . This raises the question as to whether the learning performances depend on the choice of the spacing η. (iii) In particular, one expects that choosing η sufficiently large as to make the configurations in the subsample approximately independent can be harmful. Indeed, the matrix A0 contains more information than the stationary distribution of the above process (1), and only the latter can be learned from independent samples. (iv) On the other hand, letting η → 0, one can produce an arbitrarily large number of distinct samples. However, samples become more dependent, and intuitively one expects that there is limited information to be harnessed from a given time interval T . Our results confirm in a detailed and quantitative way these intuitions. 1.1 Results: Regularized least squares Regularized least squares is an efficient and well-studied method for support recovery. We will discuss relations with existing literature in Section 1.3. In the present case, the algorithm reconstructs independently each row of the matrix A0 . The rth row, A0 , is estimated by solving the following convex optimization problem for Ar ∈ Rp r minimize L(Ar ; {x(t)}t∈[0,T ] ) + λ Ar 1 , (2) where the likelihood function L is defined by L(Ar ; {x(t)}t∈[0,T ] ) = 1 2T T 0 (A∗ x(t))2 dt − r 1 T T 0 (A∗ x(t)) dxr (t) . r (3) (Here and below M ∗ denotes the transpose of matrix/vector M .) To see that this likelihood function is indeed related to least squares, one can formally write xr (t) = dxr (t)/dt and complete the square ˙ for the right hand side of Eq. (3), thus getting the integral (A∗ x(t) − xr (t))2 dt − xr (t)2 dt. ˙ ˙ r The first term is a sum of square residuals, and the second is independent of A. Finally the ℓ1 regularization term in Eq. (2) has the role of shrinking to 0 a subset of the entries Aij thus effectively selecting the structure. Let S 0 be the support of row A0 , and assume |S 0 | ≤ k. We will refer to the vector sign(A0 ) as to r r the signed support of A0 (where sign(0) = 0 by convention). Let λmax (M ) and λmin (M ) stand for r 2 the maximum and minimum eigenvalue of a square matrix M respectively. Further, denote by Amin the smallest absolute value among the non-zero entries of row A0 . r When stable, the diffusion process (1) has a unique stationary measure which is Gaussian with covariance Q0 ∈ Rp×p given by the solution of Lyapunov’s equation [5] A0 Q0 + Q0 (A0 )∗ + I = 0. (4) Our guarantee for regularized least squares is stated in terms of two properties of the covariance Q0 and one assumption on ρmin (A0 ) (given a matrix M , we denote by ML,R its submatrix ML,R ≡ (Mij )i∈L,j∈R ): (a) We denote by Cmin ≡ λmin (Q0 0 ,S 0 ) the minimum eigenvalue of the restriction of Q0 to S the support S 0 and assume Cmin > 0. (b) We define the incoherence parameter α by letting |||Q0 (S 0 )C ,S 0 Q0 S 0 ,S 0 and assume α > 0. (Here ||| · |||∞ is the operator sup norm.) −1 |||∞ = 1 − α, ∗ (c) We define ρmin (A0 ) = −λmax ((A0 + A0 )/2) and assume ρmin (A0 ) > 0. Note this is a stronger form of stability assumption. Our main result is to show that there exists a well defined time complexity, i.e. a minimum time interval T such that, observing the system for time T enables us to reconstruct the network with high probability. This result is stated in the following theorem. Theorem 1.1. Consider the problem of learning the support S 0 of row A0 of the matrix A0 from a r sample trajectory {x(t)}t∈[0,T ] distributed according to the model (1). If T > 104 k 2 (k ρmin (A0 )−2 + A−2 ) 4pk min log , 2 α2 ρmin (A0 )Cmin δ (5) then there exists λ such that ℓ1 -regularized least squares recovers the signed support of A0 with r probability larger than 1 − δ. This is achieved by taking λ = 36 log(4p/δ)/(T α2 ρmin (A0 )) . The time complexity is logarithmic in the number of variables and polynomial in the support size. Further, it is roughly inversely proportional to ρmin (A0 ), which is quite satisfying conceptually, since ρmin (A0 )−1 controls the relaxation time of the mixes. 1.2 Overview of other results So far we focused on continuous-time dynamics. While, this is useful in order to obtain elegant statements, much of the paper is in fact devoted to the analysis of the following discrete-time dynamics, with parameter η > 0: x(t) = x(t − 1) + ηA0 x(t − 1) + w(t), t ∈ N0 . (6) Here x(t) ∈ Rp is the vector collecting the dynamical variables, A0 ∈ Rp×p specifies the dynamics as above, and {w(t)}t≥0 is a sequence of i.i.d. normal vectors with covariance η Ip×p (i.e. with independent components of variance η). We assume that consecutive samples {x(t)}0≤t≤n are given and will ask under which conditions regularized least squares reconstructs the support of A0 . The parameter η has the meaning of a time-step size. The continuous-time model (1) is recovered, in a sense made precise below, by letting η → 0. Indeed we will prove reconstruction guarantees that are uniform in this limit as long as the product nη (which corresponds to the time interval T in the previous section) is kept constant. For a formal statement we refer to Theorem 3.1. Theorem 1.1 is indeed proved by carefully controlling this limit. The mathematical challenge in this problem is related to the fundamental fact that the samples {x(t)}0≤t≤n are dependent (and strongly dependent as η → 0). Discrete time models of the form (6) can arise either because the system under study evolves by discrete steps, or because we are subsampling a continuous time system modeled as in Eq. (1). Notice that in the latter case the matrices A0 appearing in Eq. (6) and (1) coincide only to the zeroth order in η. Neglecting this technical complication, the uniformity of our reconstruction guarantees as η → 0 has an appealing interpretation already mentioned above. Whenever the samples spacing is not too large, the time complexity (i.e. the product nη) is roughly independent of the spacing itself. 3 1.3 Related work A substantial amount of work has been devoted to the analysis of ℓ1 regularized least squares, and its variants [6, 7, 8, 9, 10]. The most closely related results are the one concerning high-dimensional consistency for support recovery [11, 12]. Our proof follows indeed the line of work developed in these papers, with two important challenges. First, the design matrix is in our case produced by a stochastic diffusion, and it does not necessarily satisfies the irrepresentability conditions used by these works. Second, the observations are not corrupted by i.i.d. noise (since successive configurations are correlated) and therefore elementary concentration inequalities are not sufficient. Learning sparse graphical models via ℓ1 regularization is also a topic with significant literature. In the Gaussian case, the graphical LASSO was proposed to reconstruct the model from i.i.d. samples [13]. In the context of binary pairwise graphical models, Ref. [11] proves high-dimensional consistency of regularized logistic regression for structural learning, under a suitable irrepresentability conditions on a modified covariance. Also this paper focuses on i.i.d. samples. Most of these proofs builds on the technique of [12]. A naive adaptation to the present case allows to prove some performance guarantee for the discrete-time setting. However the resulting bounds are not uniform as η → 0 for nη = T fixed. In particular, they do not allow to prove an analogous of our continuous time result, Theorem 1.1. A large part of our effort is devoted to producing more accurate probability estimates that capture the correct scaling for small η. Similar issues were explored in the study of stochastic differential equations, whereby one is often interested in tracking some slow degrees of freedom while ‘averaging out’ the fast ones [14]. The relevance of this time-scale separation for learning was addressed in [15]. Let us however emphasize that these works focus once more on system with a fixed (small) number of dimensions p. Finally, the related topic of learning graphical models for autoregressive processes was studied recently in [16, 17]. The convex relaxation proposed in these papers is different from the one developed here. Further, no model selection guarantee was proved in [16, 17]. 2 Illustration of the main results It might be difficult to get a clear intuition of Theorem 1.1, mainly because of conditions (a) and (b), which introduce parameters Cmin and α. The same difficulty arises with analogous results on the high-dimensional consistency of the LASSO [11, 12]. In this section we provide concrete illustration both via numerical simulations, and by checking the condition on specific classes of graphs. 2.1 Learning the laplacian of graphs with bounded degree Given a simple graph G = (V, E) on vertex set V = [p], its laplacian ∆G is the symmetric p × p matrix which is equal to the adjacency matrix of G outside the diagonal, and with entries ∆G = ii −deg(i) on the diagonal [18]. (Here deg(i) denotes the degree of vertex i.) It is well known that ∆G is negative semidefinite, with one eigenvalue equal to 0, whose multiplicity is equal to the number of connected components of G. The matrix A0 = −m I + ∆G fits into the setting of Theorem 1.1 for m > 0. The corresponding model (1.1) describes the over-damped dynamics of a network of masses connected by springs of unit strength, and connected by a spring of strength m to the origin. We obtain the following result. Theorem 2.1. Let G be a simple connected graph of maximum vertex degree k and consider the model (1.1) with A0 = −m I + ∆G where ∆G is the laplacian of G and m > 0. If k+m 5 4pk T ≥ 2 · 105 k 2 , (7) (k + m2 ) log m δ then there exists λ such that ℓ1 -regularized least squares recovers the signed support of A0 with r probability larger than 1 − δ. This is achieved by taking λ = 36(k + m)2 log(4p/δ)/(T m3 ). In other words, for m bounded away from 0 and ∞, regularized least squares regression correctly reconstructs the graph G from a trajectory of time length which is polynomial in the degree and logarithmic in the system size. Notice that once the graph is known, the laplacian ∆G is uniquely determined. Also, the proof technique used for this example is generalizable to other graphs as well. 4 2800 Min. # of samples for success prob. = 0.9 1 0.9 p = 16 p = 32 0.8 Probability of success p = 64 0.7 p = 128 p = 256 0.6 p = 512 0.5 0.4 0.3 0.2 0.1 0 0 50 100 150 200 250 300 T=nη 350 400 2600 2400 2200 2000 1800 1600 1400 1200 1 10 450 2 3 10 10 p Figure 1: (left) Probability of success vs. length of the observation interval nη. (right) Sample complexity for 90% probability of success vs. p. 2.2 Numerical illustrations In this section we present numerical validation of the proposed method on synthetic data. The results confirm our observations in Theorems 1.1 and 3.1, below, namely that the time complexity scales logarithmically with the number of nodes in the network p, given a constant maximum degree. Also, the time complexity is roughly independent of the sampling rate. In Fig. 1 and 2 we consider the discrete-time setting, generating data as follows. We draw A0 as a random sparse matrix in {0, 1}p×p with elements chosen independently at random with P(A0 = 1) = k/p, k = 5. The ij process xn ≡ {x(t)}0≤t≤n is then generated according to Eq. (6). We solve the regularized least 0 square problem (the cost function is given explicitly in Eq. (8) for the discrete-time case) for different values of n, the number of observations, and record if the correct support is recovered for a random row r using the optimum value of the parameter λ. An estimate of the probability of successful recovery is obtained by repeating this experiment. Note that we are estimating here an average probability of success over randomly generated matrices. The left plot in Fig.1 depicts the probability of success vs. nη for η = 0.1 and different values of p. Each curve is obtained using 211 instances, and each instance is generated using a new random matrix A0 . The right plot in Fig.1 is the corresponding curve of the sample complexity vs. p where sample complexity is defined as the minimum value of nη with probability of success of 90%. As predicted by Theorem 2.1 the curve shows the logarithmic scaling of the sample complexity with p. In Fig. 2 we turn to the continuous-time model (1). Trajectories are generated by discretizing this stochastic differential equation with step δ much smaller than the sampling rate η. We draw random matrices A0 as above and plot the probability of success for p = 16, k = 4 and different values of η, as a function of T . We used 211 instances for each curve. As predicted by Theorem 1.1, for a fixed observation interval T , the probability of success converges to some limiting value as η → 0. 3 Discrete-time model: Statement of the results Consider a system evolving in discrete time according to the model (6), and let xn ≡ {x(t)}0≤t≤n 0 be the observed portion of the trajectory. The rth row A0 is estimated by solving the following r convex optimization problem for Ar ∈ Rp minimize L(Ar ; xn ) + λ Ar 0 where L(Ar ; xn ) ≡ 0 1 2η 2 n 1 , (8) n−1 2 t=0 {xr (t + 1) − xr (t) − η A∗ x(t)} . r (9) Apart from an additive constant, the η → 0 limit of this cost function can be shown to coincide with the cost function in the continuous time case, cf. Eq. (3). Indeed the proof of Theorem 1.1 will amount to a more precise version of this statement. Furthermore, L(Ar ; xn ) is easily seen to be the 0 log-likelihood of Ar within model (6). 5 1 1 0.9 0.95 0.9 0.7 Probability of success Probability of success 0.8 η = 0.04 η = 0.06 0.6 η = 0.08 0.5 η = 0.1 0.4 η = 0.14 0.3 η = 0.22 η = 0.18 0.85 0.8 0.75 0.7 0.65 0.2 0.6 0.1 0 50 100 150 T=nη 200 0.55 0.04 250 0.06 0.08 0.1 0.12 η 0.14 0.16 0.18 0.2 0.22 Figure 2: (right)Probability of success vs. length of the observation interval nη for different values of η. (left) Probability of success vs. η for a fixed length of the observation interval, (nη = 150) . The process is generated for a small value of η and sampled at different rates. As before, we let S 0 be the support of row A0 , and assume |S 0 | ≤ k. Under the model (6) x(t) has r a Gaussian stationary state distribution with covariance Q0 determined by the following modified Lyapunov equation A0 Q0 + Q0 (A0 )∗ + ηA0 Q0 (A0 )∗ + I = 0 . (10) It will be clear from the context whether A0 /Q0 refers to the dynamics/stationary matrix from the continuous or discrete time system. We assume conditions (a) and (b) introduced in Section 1.1, and adopt the notations already introduced there. We use as a shorthand notation σmax ≡ σmax (I +η A0 ) where σmax (.) is the maximum singular value. Also define D ≡ 1 − σmax /η . We will assume D > 0. As in the previous section, we assume the model (6) is initiated in the stationary state. Theorem 3.1. Consider the problem of learning the support S 0 of row A0 from the discrete-time r trajectory {x(t)}0≤t≤n . If nη > 4pk 104 k 2 (kD−2 + A−2 ) min log , 2 DC 2 α δ min (11) then there exists λ such that ℓ1 -regularized least squares recovers the signed support of A0 with r probability larger than 1 − δ. This is achieved by taking λ = (36 log(4p/δ))/(Dα2 nη). In other words the discrete-time sample complexity, n, is logarithmic in the model dimension, polynomial in the maximum network degree and inversely proportional to the time spacing between samples. The last point is particularly important. It enables us to derive the bound on the continuoustime sample complexity as the limit η → 0 of the discrete-time sample complexity. It also confirms our intuition mentioned in the Introduction: although one can produce an arbitrary large number of samples by sampling the continuous process with finer resolutions, there is limited amount of information that can be harnessed from a given time interval [0, T ]. 4 Proofs In the following we denote by X ∈ Rn×p the matrix whose (t + 1)th column corresponds to the configuration x(t), i.e. X = [x(0), x(1), . . . , x(n − 1)]. Further ∆X ∈ Rn×p is the matrix containing configuration changes, namely ∆X = [x(1) − x(0), . . . , x(n) − x(n − 1)]. Finally we write W = [w(1), . . . , w(n − 1)] for the matrix containing the Gaussian noise realization. Equivalently, The r th row of W is denoted by Wr . W = ∆X − ηA X . In order to lighten the notation, we will omit the reference to xn in the likelihood function (9) and 0 simply write L(Ar ). We define its normalized gradient and Hessian by G = −∇L(A0 ) = r 1 ∗ XWr , nη Q = ∇2 L(A0 ) = r 6 1 XX ∗ . n (12) 4.1 Discrete time In this Section we outline our prove for our main result for discrete-time dynamics, i.e., Theorem 3.1. We start by stating a set of sufficient conditions for regularized least squares to work. Then we present a series of concentration lemmas to be used to prove the validity of these conditions, and finally we sketch the outline of the proof. As mentioned, the proof strategy, and in particular the following proposition which provides a compact set of sufficient conditions for the support to be recovered correctly is analogous to the one in [12]. A proof of this proposition can be found in the supplementary material. Proposition 4.1. Let α, Cmin > 0 be be defined by λmin (Q0 0 ,S 0 ) ≡ Cmin , S |||Q0 0 )C ,S 0 Q0 0 ,S 0 S (S −1 |||∞ ≡ 1 − α . (13) If the following conditions hold then the regularized least square solution (8) correctly recover the signed support sign(A0 ): r λα Amin Cmin G ∞≤ , GS 0 ∞ ≤ − λ, (14) 3 4k α Cmin α Cmin √ , √ . |||QS 0 ,S 0 − Q0 0 ,S 0 |||∞ ≤ (15) |||Q(S 0 )C ,S 0 − Q0 0 )C ,S 0 |||∞ ≤ S (S 12 k 12 k Further the same statement holds for the continuous model 3, provided G and Q are the gradient and the hessian of the likelihood (3). The proof of Theorem 3.1 consists in checking that, under the hypothesis (11) on the number of consecutive configurations, conditions (14) to (15) will hold with high probability. Checking these conditions can be regarded in turn as concentration-of-measure statements. Indeed, if expectation is taken with respect to a stationary trajectory, we have E{G} = 0, E{Q} = Q0 . 4.1.1 Technical lemmas In this section we will state the necessary concentration lemmas for proving Theorem 3.1. These are non-trivial because G, Q are quadratic functions of dependent random variables the samples {x(t)}0≤t≤n . The proofs of Proposition 4.2, of Proposition 4.3, and Corollary 4.4 can be found in the supplementary material provided. Our first Proposition implies concentration of G around 0. Proposition 4.2. Let S ⊆ [p] be any set of vertices and ǫ < 1/2. If σmax ≡ σmax (I + η A0 ) < 1, then 2 P GS ∞ > ǫ ≤ 2|S| e−n(1−σmax ) ǫ /4 . (16) We furthermore need to bound the matrix norms as per (15) in proposition 4.1. First we relate bounds on |||QJS − Q0 JS |||∞ with bounds on |Qij − Q0 |, (i ∈ J, i ∈ S) where J and S are any ij subsets of {1, ..., p}. We have, P(|||QJS − Q0 )|||∞ > ǫ) ≤ |J||S| max P(|Qij − Q0 | > ǫ/|S|). JS ij i,j∈J (17) Then, we bound |Qij − Q0 | using the following proposition ij Proposition 4.3. Let i, j ∈ {1, ..., p}, σmax ≡ σmax (I + ηA0 ) < 1, T = ηn > 3/D and 0 < ǫ < 2/D where D = (1 − σmax )/η then, P(|Qij − Q0 )| > ǫ) ≤ 2e ij n − 32η2 (1−σmax )3 ǫ2 . (18) Finally, the next corollary follows from Proposition 4.3 and Eq. (17). Corollary 4.4. Let J, S (|S| ≤ k) be any two subsets of {1, ..., p} and σmax ≡ σmax (I + ηA0 ) < 1, ǫ < 2k/D and nη > 3/D (where D = (1 − σmax )/η) then, P(|||QJS − Q0 |||∞ > ǫ) ≤ 2|J|ke JS 7 n − 32k2 η2 (1−σmax )3 ǫ2 . (19) 4.1.2 Outline of the proof of Theorem 3.1 With these concentration bounds we can now easily prove Theorem 3.1. All we need to do is to compute the probability that the conditions given by Proposition 4.1 hold. From the statement of the theorem we have that the first two conditions (α, Cmin > 0) of Proposition 4.1 hold. In order to make the first condition on G imply the second condition on G we assume that λα/3 ≤ (Amin Cmin )/(4k) − λ which is guaranteed to hold if λ ≤ Amin Cmin /8k. (20) We also combine the two last conditions on Q, thus obtaining the following |||Q[p],S 0 − Q0 0 |||∞ ≤ [p],S α Cmin √ , 12 k (21) since [p] = S 0 ∪ (S 0 )C . We then impose that both the probability of the condition on Q failing and the probability of the condition on G failing are upper bounded by δ/2 using Proposition 4.2 and Corollary 4.4. It is shown in the supplementary material that this is satisfied if condition (11) holds. 4.2 Outline of the proof of Theorem 1.1 To prove Theorem 1.1 we recall that Proposition 4.1 holds provided the appropriate continuous time expressions are used for G and Q, namely G = −∇L(A0 ) = r 1 T T x(t) dbr (t) , 0 Q = ∇2 L(A0 ) = r 1 T T x(t)x(t)∗ dt . (22) 0 These are of course random variables. In order to distinguish these from the discrete time version, we will adopt the notation Gn , Qn for the latter. We claim that these random variables can be coupled (i.e. defined on the same probability space) in such a way that Gn → G and Qn → Q almost surely as n → ∞ for fixed T . Under assumption (5), it is easy to show that (11) holds for all n > n0 with n0 a sufficiently large constant (for a proof see the provided supplementary material). Therefore, by the proof of Theorem 3.1, the conditions in Proposition 4.1 hold for gradient Gn and hessian Qn for any n ≥ n0 , with probability larger than 1 − δ. But by the claimed convergence Gn → G and Qn → Q, they hold also for G and Q with probability at least 1 − δ which proves the theorem. We are left with the task of showing that the discrete and continuous time processes can be coupled in such a way that Gn → G and Qn → Q. With slight abuse of notation, the state of the discrete time system (6) will be denoted by x(i) where i ∈ N and the state of continuous time system (1) by x(t) where t ∈ R. We denote by Q0 the solution of (4) and by Q0 (η) the solution of (10). It is easy to check that Q0 (η) → Q0 as η → 0 by the uniqueness of stationary state distribution. The initial state of the continuous time system x(t = 0) is a N(0, Q0 ) random variable independent of b(t) and the initial state of the discrete time system is defined to be x(i = 0) = (Q0 (η))1/2 (Q0 )−1/2 x(t = 0). At subsequent times, x(i) and x(t) are assumed are generated by the respective dynamical systems using the same matrix A0 using common randomness provided by the standard Brownian motion {b(t)}0≤t≤T in Rp . In order to couple x(t) and x(i), we construct w(i), the noise driving the discrete time system, by letting w(i) ≡ (b(T i/n) − b(T (i − 1)/n)). The almost sure convergence Gn → G and Qn → Q follows then from standard convergence of random walk to Brownian motion. Acknowledgments This work was partially supported by a Terman fellowship, the NSF CAREER award CCF-0743978 and the NSF grant DMS-0806211 and by a Portuguese Doctoral FCT fellowship. 8 References [1] D.T. Gillespie. Stochastic simulation of chemical kinetics. Annual Review of Physical Chemistry, 58:35–55, 2007. [2] D. Higham. Modeling and Simulating Chemical Reactions. SIAM Review, 50:347–368, 2008. [3] N.D.Lawrence et al., editor. Learning and Inference in Computational Systems Biology. MIT Press, 2010. [4] T. Toni, D. Welch, N. Strelkova, A. Ipsen, and M.P.H. Stumpf. Modeling and Simulating Chemical Reactions. J. R. Soc. Interface, 6:187–202, 2009. [5] K. Zhou, J.C. Doyle, and K. Glover. Robust and optimal control. Prentice Hall, 1996. [6] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288, 1996. [7] D.L. Donoho. For most large underdetermined systems of equations, the minimal l1-norm near-solution approximates the sparsest near-solution. Communications on Pure and Applied Mathematics, 59(7):907–934, 2006. [8] D.L. Donoho. For most large underdetermined systems of linear equations the minimal l1norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59(6):797–829, 2006. [9] T. Zhang. Some sharp performance bounds for least squares regression with L1 regularization. Annals of Statistics, 37:2109–2144, 2009. [10] M.J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using l1constrained quadratic programming (Lasso). IEEE Trans. Information Theory, 55:2183–2202, 2009. [11] M.J. Wainwright, P. Ravikumar, and J.D. Lafferty. High-Dimensional Graphical Model Selection Using l-1-Regularized Logistic Regression. Advances in Neural Information Processing Systems, 19:1465, 2007. [12] P. Zhao and B. Yu. On model selection consistency of Lasso. The Journal of Machine Learning Research, 7:2541–2563, 2006. [13] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432, 2008. [14] K. Ball, T.G. Kurtz, L. Popovic, and G. Rempala. Modeling and Simulating Chemical Reactions. Ann. Appl. Prob., 16:1925–1961, 2006. [15] G.A. Pavliotis and A.M. Stuart. Parameter estimation for multiscale diffusions. J. Stat. Phys., 127:741–781, 2007. [16] J. Songsiri, J. Dahl, and L. Vandenberghe. Graphical models of autoregressive processes. pages 89–116, 2010. [17] J. Songsiri and L. Vandenberghe. Topology selection in graphical models of autoregressive processes. Journal of Machine Learning Research, 2010. submitted. [18] F.R.K. Chung. Spectral Graph Theory. CBMS Regional Conference Series in Mathematics, 1997. [19] P. Ravikumar, M.J. Wainwright, and J. Lafferty. High-dimensional Ising model selection using l1-regularized logistic regression. Annals of Statistics, 2008. 9
5 0.70140958 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
Author: Hongbo Zhou, Qiang Cheng
Abstract: Regularization technique has become a principled tool for statistics and machine learning research and practice. However, in most situations, these regularization terms are not well interpreted, especially on how they are related to the loss function and data. In this paper, we propose a robust minimax framework to interpret the relationship between data and regularization terms for a large class of loss functions. We show that various regularization terms are essentially corresponding to different distortions to the original data matrix. This minimax framework includes ridge regression, lasso, elastic net, fused lasso, group lasso, local coordinate coding, multiple kernel learning, etc., as special cases. Within this minimax framework, we further give mathematically exact definition for a novel representation called sparse grouping representation (SGR), and prove a set of sufficient conditions for generating such group level sparsity. Under these sufficient conditions, a large set of consistent regularization terms can be designed. This SGR is essentially different from group lasso in the way of using class or group information, and it outperforms group lasso when there appears group label noise. We also provide some generalization bounds in a classification setting. 1
6 0.68903846 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
7 0.66854358 45 nips-2010-CUR from a Sparse Optimization Viewpoint
8 0.64567798 26 nips-2010-Adaptive Multi-Task Lasso: with Application to eQTL Detection
9 0.60805249 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
10 0.55962765 122 nips-2010-Improving the Asymptotic Performance of Markov Chain Monte-Carlo by Inserting Vortices
11 0.54834902 211 nips-2010-Predicting Execution Time of Computer Programs Using Sparse Polynomial Regression
12 0.54255617 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
13 0.54187709 154 nips-2010-Learning sparse dynamic linear systems using stable spline kernels and exponential hyperpriors
14 0.54139286 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
15 0.53341526 12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups
16 0.53288084 219 nips-2010-Random Conic Pursuit for Semidefinite Programming
17 0.49709803 289 nips-2010-b-Bit Minwise Hashing for Estimating Three-Way Similarities
18 0.49418476 222 nips-2010-Random Walk Approach to Regret Minimization
19 0.47930712 134 nips-2010-LSTD with Random Projections
20 0.47813478 117 nips-2010-Identifying graph-structured activation patterns in networks
topicId topicWeight
[(13, 0.092), (17, 0.02), (21, 0.017), (23, 0.21), (27, 0.048), (30, 0.058), (35, 0.043), (45, 0.161), (50, 0.063), (52, 0.06), (60, 0.055), (77, 0.056), (78, 0.018), (90, 0.045)]
simIndex simValue paperId paperTitle
same-paper 1 0.81653905 265 nips-2010-The LASSO risk: asymptotic results and real world examples
Author: Mohsen Bayati, José Pereira, Andrea Montanari
Abstract: We consider the problem of learning a coefficient vector x0 ∈ RN from noisy linear observation y = Ax0 + w ∈ Rn . In many contexts (ranging from model selection to image processing) it is desirable to construct a sparse estimator x. In this case, a popular approach consists in solving an ℓ1 -penalized least squares problem known as the LASSO or Basis Pursuit DeNoising (BPDN). For sequences of matrices A of increasing dimensions, with independent gaussian entries, we prove that the normalized risk of the LASSO converges to a limit, and we obtain an explicit expression for this limit. Our result is the first rigorous derivation of an explicit formula for the asymptotic mean square error of the LASSO for random instances. The proof technique is based on the analysis of AMP, a recently developed efficient algorithm, that is inspired from graphical models ideas. Through simulations on real data matrices (gene expression data and hospital medical records) we observe that these results can be relevant in a broad array of practical applications.
2 0.74357301 9 nips-2010-A New Probabilistic Model for Rank Aggregation
Author: Tao Qin, Xiubo Geng, Tie-yan Liu
Abstract: This paper is concerned with rank aggregation, which aims to combine multiple input rankings to get a better ranking. A popular approach to rank aggregation is based on probabilistic models on permutations, e.g., the Luce model and the Mallows model. However, these models have their limitations in either poor expressiveness or high computational complexity. To avoid these limitations, in this paper, we propose a new model, which is defined with a coset-permutation distance, and models the generation of a permutation as a stagewise process. We refer to the new model as coset-permutation distance based stagewise (CPS) model. The CPS model has rich expressiveness and can therefore be used in versatile applications, because many different permutation distances can be used to induce the coset-permutation distance. The complexity of the CPS model is low because of the stagewise decomposition of the permutation probability and the efficient computation of most coset-permutation distances. We apply the CPS model to supervised rank aggregation, derive the learning and inference algorithms, and empirically study their effectiveness and efficiency. Experiments on public datasets show that the derived algorithms based on the CPS model can achieve state-ofthe-art ranking accuracy, and are much more efficient than previous algorithms.
3 0.72294372 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
Author: Surya Ganguli, Haim Sompolinsky
Abstract: Recent proposals suggest that large, generic neuronal networks could store memory traces of past input sequences in their instantaneous state. Such a proposal raises important theoretical questions about the duration of these memory traces and their dependence on network size, connectivity and signal statistics. Prior work, in the case of gaussian input sequences and linear neuronal networks, shows that the duration of memory traces in a network cannot exceed the number of neurons (in units of the neuronal time constant), and that no network can out-perform an equivalent feedforward network. However a more ethologically relevant scenario is that of sparse input sequences. In this scenario, we show how linear neural networks can essentially perform compressed sensing (CS) of past inputs, thereby attaining a memory capacity that exceeds the number of neurons. This enhanced capacity is achieved by a class of “orthogonal” recurrent networks and not by feedforward networks or generic recurrent networks. We exploit techniques from the statistical physics of disordered systems to analytically compute the decay of memory traces in such networks as a function of network size, signal sparsity and integration time. Alternately, viewed purely from the perspective of CS, this work introduces a new ensemble of measurement matrices derived from dynamical systems, and provides a theoretical analysis of their asymptotic performance. 1
4 0.71864748 117 nips-2010-Identifying graph-structured activation patterns in networks
Author: James Sharpnack, Aarti Singh
Abstract: We consider the problem of identifying an activation pattern in a complex, largescale network that is embedded in very noisy measurements. This problem is relevant to several applications, such as identifying traces of a biochemical spread by a sensor network, expression levels of genes, and anomalous activity or congestion in the Internet. Extracting such patterns is a challenging task specially if the network is large (pattern is very high-dimensional) and the noise is so excessive that it masks the activity at any single node. However, typically there are statistical dependencies in the network activation process that can be leveraged to fuse the measurements of multiple nodes and enable reliable extraction of highdimensional noisy patterns. In this paper, we analyze an estimator based on the graph Laplacian eigenbasis, and establish the limits of mean square error recovery of noisy patterns arising from a probabilistic (Gaussian or Ising) model based on an arbitrary graph structure. We consider both deterministic and probabilistic network evolution models, and our results indicate that by leveraging the network interaction structure, it is possible to consistently recover high-dimensional patterns even when the noise variance increases with network size. 1
5 0.69989234 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
Author: Jean Morales, Charles A. Micchelli, Massimiliano Pontil
Abstract: We study the problem of learning a sparse linear regression vector under additional conditions on the structure of its sparsity pattern. We present a family of convex penalty functions, which encode this prior knowledge by means of a set of constraints on the absolute values of the regression coefficients. This family subsumes the ℓ1 norm and is flexible enough to include different models of sparsity patterns, which are of practical and theoretical importance. We establish some important properties of these functions and discuss some examples where they can be computed explicitly. Moreover, we present a convergent optimization algorithm for solving regularized least squares with these penalty functions. Numerical simulations highlight the benefit of structured sparsity and the advantage offered by our approach over the Lasso and other related methods.
6 0.69831872 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
7 0.69734448 148 nips-2010-Learning Networks of Stochastic Differential Equations
8 0.69264144 222 nips-2010-Random Walk Approach to Regret Minimization
9 0.69150829 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
10 0.69023365 31 nips-2010-An analysis on negative curvature induced by singularity in multi-layer neural-network learning
11 0.68964231 30 nips-2010-An Inverse Power Method for Nonlinear Eigenproblems with Applications in 1-Spectral Clustering and Sparse PCA
12 0.68944108 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior
13 0.68759376 122 nips-2010-Improving the Asymptotic Performance of Markov Chain Monte-Carlo by Inserting Vortices
14 0.6862036 261 nips-2010-Supervised Clustering
15 0.6860382 221 nips-2010-Random Projections for $k$-means Clustering
16 0.68518782 5 nips-2010-A Dirty Model for Multi-task Learning
17 0.6851806 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
18 0.68454987 136 nips-2010-Large-Scale Matrix Factorization with Missing Data under Additional Constraints
19 0.68442166 258 nips-2010-Structured sparsity-inducing norms through submodular functions
20 0.68394995 63 nips-2010-Distributed Dual Averaging In Networks