jmlr jmlr2010 jmlr2010-99 knowledge-graph by maker-knowledge-mining

99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs


Source: pdf

Author: Garvesh Raskutti, Martin J. Wainwright, Bin Yu

Abstract: Methods based on ℓ1 -relaxation, such as basis pursuit and the Lasso, are very popular for sparse regression in high dimensions. The conditions for success of these methods are now well-understood: (1) exact recovery in the noiseless setting is possible if and only if the design matrix X satisfies the restricted nullspace property, and (2) the squared ℓ2 -error of a Lasso estimate decays at the minimax optimal rate k log p / n, where k is the sparsity of the p-dimensional regression problem with additive Gaussian noise, whenever the design satisfies a restricted eigenvalue condition. The key issue is thus to determine when the design matrix X satisfies these desirable properties. Thus far, there have been numerous results showing that the restricted isometry property, which implies both the restricted nullspace and eigenvalue conditions, is satisfied when all entries of X are independent and identically distributed (i.i.d.), or the rows are unitary. This paper proves directly that the restricted nullspace and eigenvalue conditions hold with high probability for quite general classes of Gaussian matrices for which the predictors may be highly dependent, and hence restricted isometry conditions can be violated with high probability. In this way, our results extend the attractive theoretical guarantees on ℓ1 -relaxations to a much broader class of problems than the case of completely independent or unitary designs. Keywords: Lasso, basis pursuit, random matrix theory, Gaussian comparison inequality, concentration of measure

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Department of Statistics, University of California, Berkeley, CA 94720-1776, USA. Editor: John Lafferty. Abstract: Methods based on ℓ1 -relaxation, such as basis pursuit and the Lasso, are very popular for sparse regression in high dimensions. [sent-8, score-0.195]

2 Thus far, there have been numerous results showing that the restricted isometry property, which implies both the restricted nullspace and eigenvalue conditions, is satisfied when all entries of X are independent and identically distributed (i.i.d.). [sent-11, score-0.978]

3 This paper proves directly that the restricted nullspace and eigenvalue conditions hold with high probability for quite general classes of Gaussian matrices for which the predictors may be highly dependent, and hence restricted isometry conditions can be violated with high probability. [sent-15, score-1.189]

4 Keywords: Lasso, basis pursuit, random matrix theory, Gaussian comparison inequality, concentration of measure. [sent-17, score-0.192]

5 Using the ℓ1 -norm to enforce sparsity has been very successful, as evidenced by the widespread use of methods such as basis pursuit (Chen et al. [sent-31, score-0.234]

6 There is now a well-developed theory on what conditions are required on the design matrix X ∈ Rn×p for such ℓ1 -based relaxations to reliably estimate β∗ . [sent-33, score-0.196]

7 In the case of noiseless observation models, it is known that imposing a restricted nullspace property on the design matrix X ∈ Rn×p is both necessary and sufficient for the basis pursuit linear program to recover β∗ exactly. [sent-34, score-0.842]

8 The nullspace property and its link to the basis pursuit linear program have been discussed in various papers (Cohen et al. [sent-35, score-0.555]

9 To this end, various sufficient conditions for the success of ℓ1 -relaxations have been proposed, including restricted eigenvalue conditions (Bickel et al. [sent-38, score-0.435]

10 Of the conditions mentioned, one of the weakest known sufficient conditions for bounding the ℓ2 -error is the restricted eigenvalue (RE) condition due to Bickel et al. [sent-40, score-0.497]

11 In this paper, we consider a restricted eigenvalue condition that is essentially equivalent to the RE condition in Bickel et al. [sent-42, score-0.589]

12 Thus, in the setting of linear regression with random design, the interesting question is the following: for what ensembles of design matrices do the restricted nullspace and eigenvalue conditions hold with high probability? [sent-46, score-0.899]

13 To date, the main routes to establishing these properties have been either via incoherence conditions (Donoho and Huo, 2001; Feuer and Nemirovski, 2003) or via the restricted isometry property (Candes and Tao, 2005), both of which are sufficient but not necessary conditions (Cohen et al. [sent-47, score-0.432]

14 The restricted isometry property (RIP) holds with high probability for various classes of random matrices with i.i.d. entries. [sent-49, score-0.403]

15 As a concrete example, suppose that we are fitting a linear regression model to predict heart disease on the basis of a set of p covariates (e. [sent-67, score-0.19]

16 Nonetheless, at least in practice, ℓ1 -methods still work very well in settings where the covariates are correlated and non-unitary, but currently lacking is the corresponding theory that guarantees the performance of ℓ1 -relaxations for dependent designs. [sent-74, score-0.203]

17 The main contribution of this paper is a direct proof that both the restricted nullspace and eigenvalue conditions hold with high probability for a broad class of dependent Gaussian design matrices. [sent-75, score-0.804]

18 The class of matrices covered by our result allows for correlation among different covariates, and hence covers many matrices for which restricted isometry or incoherence conditions fail to hold but the restricted eigenvalue condition holds. [sent-80, score-1.005]

19 Interestingly, one can even sample the rows of the design matrix X from a multivariate Gaussian with a degenerate covariance matrix Σ, and nonetheless, our results still guarantee that the restricted nullspace and eigenvalue conditions will hold with high probability (see Section 3. [sent-81, score-1.11]

20 We begin in Section 2 with background on sparse linear models, the basis pursuit and Lasso ℓ1 -relaxations, and sufficient conditions for their success. [sent-85, score-0.257]

21 Indeed, any vector β∗ in the nullspace of X, which has dimension at least p − n, leads to y = 0 when w = 0. [sent-101, score-0.356]

22 In the case of noiseless observations, obtained by setting w = 0 in the observation model (1), a closely related convex program is the basis pursuit linear program (Chen et al. [sent-108, score-0.21]

23 Accordingly, as we discuss here, there is a good understanding of the necessary and sufficient conditions for the success of ℓ1 -based relaxations such as basis pursuit and the Lasso. [sent-116, score-0.257]

24 RESTRICTED NULLSPACE IN NOISELESS SETTING: In the noiseless setting (w = 0), it is known that the basis pursuit linear program (LP) (2) recovers β∗ exactly if and only if the design matrix X satisfies a restricted nullspace condition. [sent-119, score-0.805]

25 For a given sparsity index k ≤ p, we say that the matrix X satisfies the restricted nullspace (RN) condition of order k if null(X) ∩ C (S; 1) = {0} for all subsets S of cardinality k. [sent-124, score-0.763]

26 Although this definition appeared in earlier work (Donoho and Huo, 2001; Feuer and Nemirovski, 2003), the terminology of restricted nullspace is due to Cohen et al. [sent-125, score-0.494]

27 This restricted nullspace property is important, because the basis pursuit LP recovers any k-sparse vector β∗ exactly if and only if X satisfies the restricted nullspace property of order k. [sent-127, score-1.224]
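The following is a minimal, self-contained sketch of the basis pursuit linear program referred to here (min ‖β‖₁ subject to Xβ = y), recast as a standard LP over the positive and negative parts of β and solved with scipy.optimize.linprog. It is not code from the paper; the problem sizes, seed, and variable names are purely illustrative.

```python
# Minimal basis pursuit sketch: min ||beta||_1 subject to X beta = y,
# rewritten as an LP over (beta_plus, beta_minus) >= 0.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, p, k = 50, 200, 5                         # illustrative sizes
X = rng.standard_normal((n, p)) / np.sqrt(n)
beta_star = np.zeros(p)
beta_star[rng.choice(p, k, replace=False)] = rng.standard_normal(k)
y = X @ beta_star                            # noiseless observations

c = np.ones(2 * p)                           # objective: sum(beta_plus + beta_minus)
A_eq = np.hstack([X, -X])                    # X (beta_plus - beta_minus) = y
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
beta_hat = res.x[:p] - res.x[p:]
print("exact recovery:", np.allclose(beta_hat, beta_star, atol=1e-6))
```

With an i.i.d. Gaussian design and n comfortably larger than k log p, the restricted nullspace property holds with high probability and the script should report exact recovery.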

28 , 2009; Donoho and Huo, 2001; Elad and Bruckstein, 2002; Feuer and Nemirovski, 2003) for more discussion of restricted nullspaces and equivalence to exact recovery of basis pursuit. [sent-129, score-0.224]

29 Various conditions have been used to analyze the ℓ2 norm convergence rate of ℓ1 -based methods, including the restricted isometry property (Candes and Tao, 2007), various types of restricted eigenvalue conditions (van de Geer, 2007; Bickel et al. [sent-133, score-0.743]

30 Of all these conditions, the least restrictive are the restricted eigenvalue conditions due to Bickel et al. [sent-135, score-0.373]

31 (2009), their restricted eigenvalue (RE) condition is less severe than both the RIP condition (Candes and Tao, 2007) and an earlier set of restricted eigenvalue conditions due to Meinshausen and Yu (2009). [sent-138, score-0.962]

32 All of these conditions involve lower bounds on ‖Xθ‖₂ that hold uniformly over the previously defined set C (S; α). Here we state a condition that is essentially equivalent to the restricted eigenvalue condition due to Bickel et al. [sent-139, score-0.683]

33 In particular, we say that the p × p sample covariance matrix X^T X/n satisfies the restricted eigenvalue (RE) condition over S with parameters (α, γ) ∈ [1, ∞) × (0, ∞) if (1/n) θ^T X^T X θ = (1/n) ‖Xθ‖₂² ≥ γ² ‖θ‖₂² for all θ ∈ C (S; α). [sent-141, score-0.648]

34 If this condition holds uniformly for all subsets S with cardinality k, we say that X T X/n satisfies a restricted eigenvalue condition of order k with parameters (α, γ). [sent-142, score-0.589]
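Verifying the RE condition exactly is combinatorial over supports S and requires minimizing over the cone C(S; α) = {θ : ‖θ_{S^c}‖₁ ≤ α‖θ_S‖₁}, so the snippet below is only a heuristic Monte Carlo probe for a single fixed support, not a certificate. It samples directions from the cone and reports the worst observed value of ‖Xθ‖₂²/(n‖θ‖₂²) as a rough stand-in for γ²; the function name and sampling scheme are invented for illustration.

```python
# Heuristic Monte Carlo probe of the RE condition over the cone
# C(S; alpha) = {theta : ||theta_{S^c}||_1 <= alpha * ||theta_S||_1}
# for one fixed support S. A sanity check only, not a certificate.
import numpy as np

def re_probe(X, S, alpha=1.0, n_draws=5000, rng=None):
    rng = rng or np.random.default_rng(0)
    n, p = X.shape
    Sc = np.setdiff1d(np.arange(p), np.asarray(S))
    worst = np.inf
    for _ in range(n_draws):
        theta = np.zeros(p)
        theta[S] = rng.standard_normal(len(S))
        # random mass on S^c, rescaled so the vector lies in the cone
        off = rng.standard_normal(len(Sc)) * rng.exponential()
        budget = alpha * np.abs(theta[S]).sum()
        l1_off = np.abs(off).sum()
        if l1_off > budget:
            off *= budget / l1_off
        theta[Sc] = off
        ratio = np.sum((X @ theta) ** 2) / (n * np.sum(theta ** 2))
        worst = min(worst, ratio)
    return worst   # rough lower-bound estimate of gamma^2 over sampled directions

# usage (names illustrative): gamma2_hat = re_probe(X, S=np.arange(5), alpha=1.0)
```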

35 On occasion, we will also say that a deterministic p × p covariance matrix Σ satisfies an RE condition, by which we mean that ‖Σ^{1/2} θ‖₂ ≥ γ ‖θ‖₂ for all θ ∈ C (S; α). [sent-143, score-0.198]

36 It is straightforward to show that the RE condition for some α implies the restricted nullspace condition for the same α, so that the RE condition is slightly stronger than the RN property. [sent-144, score-0.911]

37 Again, the RE condition is important because it yields guarantees on the ℓ2 -error of any Lasso estimate β. [sent-145, score-0.193]

38 Main Result and Its Consequences Thus, in order to provide performance guarantees (either exact recovery or ℓ2 -error bounds) for ℓ1 relaxations applied to sparse linear models, it is sufficient to show that the RE or RN conditions hold. [sent-154, score-0.227]

39 Given that our interest is in providing sufficient conditions for these properties, the remainder of the paper focuses on providing conditions for the RE condition to hold for random designs, which implies that the RN condition is satisfied. [sent-155, score-0.434]

40 1 Statement of Main Result Our main result guarantees that the restricted eigenvalue (and hence restricted nullspace) conditions hold for a broad class of random Gaussian designs. [sent-157, score-0.597]

41 In intuitive terms, Theorem 1 provides some insight into the eigenstructure of the sample covariance matrix Σ = X T X/n. [sent-171, score-0.198]

42 One implication of the lower bound (3) is that the nullspace of X cannot contain any vectors that are “overly” sparse. [sent-172, score-0.356]

43 In particular, for any vector v ∈ R p such that ‖v‖₁ / ‖Σ^{1/2} v‖₂ = o(√(n / log p)), the right-hand side of the lower bound (3) will be strictly positive, showing that v cannot belong to the nullspace of X. [sent-173, score-0.399]

44 In the following corollary, we formalize this intuition by showing how Theorem 1 guarantees that whenever the population covariance Σ satisfies the RE condition of order k, then the sample covariance Σ = X T X/n satisfies the same property as long as the sample size is sufficiently large. [sent-174, score-0.597]

45 Corollary 1 (Restricted eigenvalue property) Suppose that Σ satisfies the RE condition of order k with parameters (α, γ). [sent-175, score-0.312]

46 Then for universal positive constants c, c′ , c′′ , if the sample size satisfies n > c′′ ρ²(Σ) (1 + α)² k log p / γ²   (4), then the matrix Σ = X^T X/n satisfies the RE condition with parameters (α, γ/8) with probability at least 1 − c′ exp(−cn). [sent-176, score-0.273]

47 Under the assumed scaling (4) of the sample size, we have 9 (1 + α) ρ(Σ) √(k log p / n) ≤ γ/8, which shows that the RE condition holds for X^T X/n with parameter (α, γ/8) as claimed. [sent-181, score-0.218]

48 Remarks: (a) From the definitions, it is easy to see that if the RE condition holds with α = 1 and any γ > 0 (even arbitrarily small), then the RN condition also holds. [sent-182, score-0.278]

49 , 2009; van de Geer, 2000; van de Geer and Buhlmann, 2009) that if X^T X/n satisfies the RE condition, then the ℓ2 error of the Lasso under the sparse linear model with Gaussian noise satisfies the bound ‖β̂ − β∗‖₂ = O(√(k log p / n)) with high probability. [sent-185, score-0.216]
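As a hedged numerical illustration of this rate (not an experiment from the paper), the sketch below fits scikit-learn's Lasso on a correlated Toeplitz Gaussian design and prints the ℓ2 error next to √(k log p / n) for increasing n. The regularization level λ ∝ σ√(log p / n) and all constants are arbitrary choices made in the spirit of the theory.

```python
# Illustrative check that the Lasso l2-error tracks sqrt(k log p / n)
# on a correlated (Toeplitz) Gaussian design. Constants are arbitrary.
import numpy as np
from scipy.linalg import toeplitz, cholesky
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
p, k, a, sigma = 500, 10, 0.7, 0.5
Sigma = toeplitz(a ** np.arange(p))          # Sigma_ij = a^|i-j|
L = cholesky(Sigma, lower=True)

beta_star = np.zeros(p)
beta_star[rng.choice(p, k, replace=False)] = 1.0

for n in [200, 400, 800, 1600]:
    X = rng.standard_normal((n, p)) @ L.T    # rows ~ N(0, Sigma)
    y = X @ beta_star + sigma * rng.standard_normal(n)
    lam = 2 * sigma * np.sqrt(np.log(p) / n) # theory-inspired choice
    fit = Lasso(alpha=lam, max_iter=50000).fit(X, y)
    err = np.linalg.norm(fit.coef_ - beta_star)
    print(f"n={n:5d}  l2 error={err:.3f}  sqrt(k log p / n)={np.sqrt(k * np.log(p) / n):.3f}")
```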

50 Consequently, in order to ensure that the ℓ2 -error is bounded, the sample size must scale as n = Ω(k log p), which matches the scaling (4) required in Corollary 1, as long as the sequence of covariance matrices Σ have diagonal entries that stay bounded. [sent-186, score-0.354]

51 (c) Finally, we note that Theorem 1 guarantees that the sample covariance X T X/n satisfies a property that is slightly stronger than the RE condition. [sent-187, score-0.231]

52 3, we demonstrate that Gaussian design matrices where the covariance matrix is a Toeplitz matrix satisfy the RE condition under the milder scaling requirement n = Ω(k log(p)). [sent-197, score-0.569]

53 1 also provides sufficient conditions for a restricted eigenvalue condition to hold for design matrices with dependent columns. [sent-202, score-0.682]

54 They show that if the true covariance matrix satisfies an RE condition, and if the elementwise maximum ‖Σ̂ − Σ‖∞ between the sample covariance Σ̂ = X^T X/n and the true covariance Σ is suitably bounded, then the sample covariance also satisfies the RE condition. [sent-203, score-0.655]

55 Their result, applied to the case of Gaussian random matrices, guarantees that Σ̂ satisfies the RE property as long as n = Ω(k² log p) and Σ satisfies the RE condition. [sent-204, score-0.229]

56 By contrast, Corollary 1 guarantees the RE condition with the less restrictive scaling n = Ω(k log p). [sent-205, score-0.272]

57 Note that if k = O(√n), our scaling condition is satisfied while their condition fails. [sent-206, score-0.314]

58 6 in her paper establishes that certain families of sub-Gaussian matrices satisfy the RE condition w. [sent-214, score-0.234]

59 We will see that Corollary 1 applies to many sequences of covariance matrices Σ = Σ p×p that have much more structure than the identity matrix. [sent-222, score-0.235]

60 For suitable choices, these same examples show that the RE condition can hold with high probability, even when the restricted isometry property (RIP) of Candes and Tao (2005) is violated with probability converging to one. [sent-229, score-0.479]

61 Example 1 (Toeplitz matrices) Consider a covariance matrix with the Toeplitz structure Σ_{ij} = a^{|i−j|} for some parameter a ∈ [0, 1). [sent-230, score-0.198]

62 The minimum eigenvalue of Σ is 1 − a > 0, independent of the dimension p, so that the population matrix Σ clearly satisfies the RE condition. [sent-232, score-0.318]

63 Since ρ2 (Σ) = 1, Theorem 1 implies that the sample covariance matrix Σ = X T X/n obtained by sampling from this distribution will also satisfy the RE condition with high probability as long as n = Ω(k log p). [sent-233, score-0.38]

64 This provides an example of a matrix family with substantial correlation between covariates for which the RE property still holds. [sent-234, score-0.244]

65 However, regardless of the sample size, the submatrices of the sample covariance Σ will not satisfy restricted isometry properties (RIP) if the parameter a is sufficiently large. [sent-235, score-0.411]

66 Consequently, imposing RIP amounts to requiring that the population condition number λmax (ΣSS )/λmin (ΣSS ) be very close to one. [sent-242, score-0.226]
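A small numerical illustration of this contrast for the Toeplitz family, under assumed parameter values only: the smallest eigenvalue of the population matrix stays bounded away from zero for every a < 1, while the condition number of a k × k principal submatrix Σ_SS grows with a, and that condition number is exactly the quantity an RIP-style requirement would need to keep close to one.

```python
# Toeplitz family Sigma_ij = a^|i-j|: the minimum eigenvalue of Sigma stays
# bounded away from zero, while k x k principal submatrices become
# ill-conditioned as a grows: RE-friendly but RIP-unfriendly. Illustrative only.
import numpy as np
from scipy.linalg import toeplitz

p, k = 200, 20
for a in [0.3, 0.6, 0.9]:
    Sigma = toeplitz(a ** np.arange(p))
    eig_full = np.linalg.eigvalsh(Sigma)            # ascending eigenvalues
    eig_sub = np.linalg.eigvalsh(Sigma[:k, :k])     # S = {1, ..., k}
    print(f"a={a}: lambda_min(Sigma)={eig_full[0]:.3f}, "
          f"cond(Sigma_SS)={eig_sub[-1] / eig_sub[0]:.1f}")
```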

67 We now consider a matrix family with an even higher amount of dependency among the covariates, where the RIP constants are actually unbounded as the sparsity k increases, but the RE condition is still satisfied. [sent-244, score-0.302]

68 Example 2 (Spiked identity model) For a parameter a ∈ [0, 1), the spiked identity model is given by the family of covariance matrices Σ := (1 − a) I_{p×p} + a 1 1^T, where 1 ∈ R p is the vector of all ones. [sent-245, score-0.278]

69 The minimum eigenvalue of Σ is 1 − a, so that the population covariance clearly satisfies the RE condition for any fixed a ∈ [0, 1). [sent-246, score-0.539]

70 Since this covariance matrix has maximum variance ρ2 (Σ) = 1, Corollary 1 implies that a sample covariance matrix Σ = X T X/n will satisfy the RE property with high probability with sample size n = Ω(k log p). [sent-247, score-0.476]

71 On the other hand, the spiked identity matrix Σ has very poorly conditioned sub-matrices, which implies that a sample covariance matrix Σ = X T X/n will violate the restricted isometry property (RIP) with high probability as n grows. [sent-248, score-0.607]

72 An easy calculation shows that λmin(ΣSS) = 1 − a > 0 and λmax(ΣSS) = 1 + a(k − 1), so that the population condition number of this sub-matrix is λmax(ΣSS)/λmin(ΣSS) = (1 + a(k − 1))/(1 − a). [sent-250, score-0.226]
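The same calculation is easy to check numerically; the sketch below builds the spiked identity covariance and compares the eigenvalue-based condition number of Σ_SS with the closed-form expression (1 + a(k − 1))/(1 − a). Parameter values are illustrative.

```python
# Spiked identity model Sigma = (1 - a) I + a * ones ones^T:
# lambda_min(Sigma) = 1 - a for any p, while k x k submatrices have
# condition number (1 + a(k-1)) / (1 - a), growing linearly in k.
import numpy as np

p, a = 500, 0.5
Sigma = (1 - a) * np.eye(p) + a * np.ones((p, p))
print("lambda_min(Sigma) =", np.linalg.eigvalsh(Sigma)[0])    # ~ 1 - a

for k in [10, 50, 200]:
    S = np.arange(k)
    eig = np.linalg.eigvalsh(Sigma[np.ix_(S, S)])              # ascending
    print(f"k={k:4d}: cond(Sigma_SS) = {eig[-1] / eig[0]:.1f}  "
          f"(formula: {(1 + a * (k - 1)) / (1 - a):.1f})")
```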

73 In both of the preceding examples, the minimum eigenvalue of Σ was bounded from below and the diagonal entries of Σ were bounded from above, which allowed us to assert immediately that the RE condition was satisfied for the population covariance matrix. [sent-263, score-0.579]

74 As a final example, we now consider sampling from population covariance matrices that are actually rank degenerate, but for which our theory still provides guarantees. [sent-264, score-0.322]

75 Example 3 (Highly degenerate covariance matrices) Let Σ be any matrix with bounded diagonal that satisfies the RE property of some order k. [sent-265, score-0.285]

76 Suppose that we sample n times from a N(0, Σ) distribution, and then form the empirical covariance matrix Σ = X T X/n. [sent-266, score-0.198]

77 Now if we condition on the original design matrix X in the high probability set, we may view Σ as a fixed but highly rank-degenerate matrix. [sent-269, score-0.24]

78 We may then form a second empirical covariance matrix Σ = (1/n) X^T X. [sent-275, score-0.198]

79 Conditionally on Σ having the RE property of order k and a bounded diagonal, Corollary 1 shows that the resampled empirical covariance Σ will also have the RE property of order k with high probability, again for n = Ω(k log p). [sent-276, score-0.257]

80 This simple example shows that in the high-dimensional setting p ≫ n, it is possible for the RE condition to hold with high probability even when the original population covariance matrix (Σ in this example) has a p − n-dimensional nullspace. [sent-277, score-0.456]

81 Note moreover that this is not an isolated phenomenon—rather, it will hold for almost every sample covariance matrix Σ constructed in the way that we have described. [sent-278, score-0.23]
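A sketch of this resampling construction (illustrative, not the authors' code): draw n < p rows from N(0, Σ), form the rank-degenerate empirical covariance Σ̂ = X^T X/n, take a symmetric square root of Σ̂ via its eigendecomposition, and resample from N(0, Σ̂) to obtain a second empirical covariance.

```python
# Sketch of Example 3: the first empirical covariance has rank at most n < p,
# yet one can still sample from N(0, Sigma_hat) and form a second empirical
# covariance. Illustrative construction, not the authors' code.
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(2)
p, n = 300, 100
Sigma = toeplitz(0.5 ** np.arange(p))       # any covariance with RE + bounded diagonal
L = np.linalg.cholesky(Sigma)

X = rng.standard_normal((n, p)) @ L.T       # rows ~ N(0, Sigma)
Sigma_hat = X.T @ X / n
print("rank(Sigma_hat) =", np.linalg.matrix_rank(Sigma_hat))   # at most n << p

# symmetric square root of the rank-degenerate Sigma_hat via eigendecomposition
w, V = np.linalg.eigh(Sigma_hat)
root = V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T

X2 = rng.standard_normal((n, p)) @ root     # rows ~ N(0, Sigma_hat)
Sigma_tilde = X2.T @ X2 / n                 # second empirical covariance
print("rank(Sigma_tilde) =", np.linalg.matrix_rank(Sigma_tilde))
```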

82 The main ingredients are the Gordon-Slepian comparison inequalities (Gordon, 1985) for Gaussian processes, concentration of measure for Lipschitz functions (Ledoux, 2001), and a peeling argument. [sent-281, score-0.247]

83 M(r, X) := 1 − inf_{v∈V(r)} ‖Xv‖₂/√n = sup_{v∈V(r)} (1 − ‖Xv‖₂/√n). Our first step is to upper bound the expectation E[M(r, X)], where the expectation is taken over the random Gaussian matrix X. [sent-294, score-0.255]

84 (3) Third, we use a peeling argument to show that our analysis holds with high probability and uniformly over all possible choice of the ℓ1 -radius r, which then implies that the condition (5) holds with high probability as claimed. [sent-296, score-0.225]

85 In particular, we have −inf_{v∈V(r)} ‖Xv‖₂ = −inf_{v∈V(r)} sup_{u∈S^{n−1}} u^T Xv = sup_{v∈V(r)} inf_{u∈S^{n−1}} u^T Xv. [sent-303, score-0.473]

86 v∈V u∈U v∈V u∈U We use Gordon’s inequality to show that E[M(r, X)] = 1 + E[ sup inf Yu,v ] ≤ 1 + E[ sup v∈V(r) u∈S n−1 2251 inf Zu,v ], v∈V(r) u∈S n−1 R ASKUTTI , WAINWRIGHT AND Y U where we recall that Yu,v = uT Xv and Zu,v is a different Gaussian process to be defined shortly. [sent-308, score-0.426]

87 Consequently, we may apply Gordon's inequality to conclude that E[sup_{v∈V(r)} inf_{u∈S^{n−1}} u^T Xv] ≤ E[sup_{v∈V(r)} inf_{u∈S^{n−1}} Z_{u,v}] = E[inf_{u∈S^{n−1}} g^T u] + E[sup_{v∈V(r)} h^T Σ^{1/2} v] = −E[‖g‖₂] + E[sup_{v∈V(r)} h^T Σ^{1/2} v]. [sent-325, score-0.741]

88 We now observe that by definition of V(r), we have sup_{v∈V(r)} |h^T Σ^{1/2} v| ≤ sup_{v∈V(r)} ‖v‖₁ ‖Σ^{1/2} h‖∞ ≤ r ‖Σ^{1/2} h‖∞. [sent-326, score-0.236]

89 By concentration of measure for Lipschitz functions of Gaussians (see Appendix B), this tail bound will follow if we show that the Lipschitz constant of M(r, X) as a function of the Gaussian random matrix is less than 1/√n. [sent-338, score-0.194]

90 We find that √n (h(W) − h(W′)) = sup_{v∈V(r)} (−‖W Σ^{1/2} v‖₂) − sup_{v∈V(r)} (−‖W′ Σ^{1/2} v‖₂). [sent-340, score-0.236]

91 Therefore sup_{v∈V(r)} (−‖W Σ^{1/2} v‖₂) − sup_{v∈V(r)} (−‖W′ Σ^{1/2} v‖₂) = −‖W Σ^{1/2} v̂‖₂ − sup_{v∈V(r)} (−‖W′ Σ^{1/2} v‖₂) ≤ ‖W′ Σ^{1/2} v̂‖₂ − ‖W Σ^{1/2} v̂‖₂ ≤ sup_{v∈V(r)} ‖(W − W′) Σ^{1/2} v‖₂. [sent-344, score-0.472]

92 With this notation, we can bound the Lipschitz constant of h as √n (h(W) − h(W′)) ≤ sup_{v∈V(r)} ‖(W − W′) Σ^{1/2} v‖₂ (a)≤ sup_{v∈V(r)} ‖Σ^{1/2} v‖₂ |||W − W′|||₂ (b)≤ sup_{v∈V(r)} ‖Σ^{1/2} v‖₂ |||W − W′|||F (c)= |||W − W′|||F. [sent-346, score-0.354]

93 4 Extension to All Vectors Via Peeling Thus far, we have shown that M(r, X) = 1 − inf v∈V(r) Xv 2 Xv 2 √ = sup 1 − √ n n v∈V(r) ≥ 3t(r)/2, (8) 1 with probability no larger than 2 exp(−nt 2 (r)/8) where t(r) = 4 + 3 r ρ(Σ) log p . [sent-352, score-0.24]

94 Conclusion Methods based on ℓ1 -relaxations are very popular, and the weakest possible conditions on the design matrix X required to provide performance guarantees—namely, the restricted nullspace and eigenvalue conditions—are well-understood. [sent-368, score-0.83]

95 In this paper, we have proved that these conditions hold with high probability for a broad class of Gaussian design matrices allowing for quite general dependency among the columns, as captured by a covariance matrix Σ representing the dependence among the different covariates. [sent-369, score-0.43]

96 As a corollary, our result guarantees that known performance guarantees for ℓ1 -relaxations such as basis pursuit and Lasso hold with high probability for such problems, provided the population matrix Σ satisfies the RE condition. [sent-370, score-0.447]

97 Interestingly, our theory shows that ℓ1 -methods can perform well when the covariates are sampled from a Gaussian distribution with a degenerate covariance matrix. [sent-371, score-0.339]

98 Finally, although this paper provides various conditions under which the RE condition holds with high probability, it does not address the issue of how to determine whether a given sample covariance matrix Σ = X^T X/n satisfies the RE condition. [sent-376, score-0.457]

99 Since for any v ∈ A_m, we have g(h(v)) ≤ 2^m µ, we combine these inequalities to obtain P[E] ≤ ∑_{m=1}^∞ P[ sup_{h(v) ≤ g^{−1}(2^m µ)} f(v, X) ≥ 2^m µ ] ≤ ∑_{m=1}^∞ 2 exp(−c a n [g(g^{−1}(2^m µ))]²) = 2 ∑_{m=1}^∞ exp(−c a n 2^{2m} µ²), from which the stated claim follows by upper bounding this geometric sum. [sent-400, score-0.264]

100 Restricted isometry property of matrices with independent columns and neighborly polytopes by random sampling. [sent-413, score-0.265]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('nullspace', 0.356), ('geer', 0.224), ('ss', 0.212), ('xv', 0.202), ('estricted', 0.176), ('eigenvalue', 0.173), ('rip', 0.173), ('askutti', 0.169), ('orrelated', 0.169), ('igenvalue', 0.16), ('covariates', 0.149), ('esigns', 0.144), ('covariance', 0.14), ('condition', 0.139), ('restricted', 0.138), ('isometry', 0.133), ('roperties', 0.13), ('pursuit', 0.121), ('sup', 0.118), ('candes', 0.117), ('toeplitz', 0.115), ('aussian', 0.112), ('lasso', 0.111), ('raskutti', 0.106), ('wainwright', 0.102), ('bickel', 0.101), ('matrices', 0.095), ('feuer', 0.094), ('concentration', 0.093), ('gaussian', 0.088), ('ut', 0.087), ('population', 0.087), ('peeling', 0.086), ('satis', 0.08), ('inf', 0.079), ('lipschitz', 0.079), ('zu', 0.075), ('tao', 0.075), ('sparsity', 0.072), ('van', 0.07), ('dantzig', 0.067), ('re', 0.066), ('huo', 0.064), ('info', 0.064), ('ledoux', 0.064), ('davidson', 0.064), ('donoho', 0.063), ('buhlmann', 0.062), ('conditions', 0.062), ('meinshausen', 0.058), ('unitary', 0.058), ('matrix', 0.058), ('rn', 0.057), ('haupt', 0.056), ('szarek', 0.056), ('zv', 0.056), ('mendelson', 0.055), ('guarantees', 0.054), ('nemirovski', 0.053), ('sensing', 0.051), ('degenerate', 0.05), ('noiseless', 0.048), ('pajor', 0.048), ('gordon', 0.047), ('selector', 0.047), ('recovery', 0.045), ('design', 0.043), ('tail', 0.043), ('cohen', 0.043), ('spiked', 0.043), ('log', 0.043), ('minimax', 0.042), ('basis', 0.041), ('ht', 0.04), ('gu', 0.04), ('entries', 0.04), ('exp', 0.039), ('adamczak', 0.037), ('elementwise', 0.037), ('garvesh', 0.037), ('negahban', 0.037), ('rudelson', 0.037), ('property', 0.037), ('corollary', 0.037), ('scaling', 0.036), ('radius', 0.036), ('yu', 0.036), ('inequalities', 0.036), ('sn', 0.035), ('relaxations', 0.033), ('banach', 0.033), ('nt', 0.033), ('berkeley', 0.033), ('sparse', 0.033), ('constants', 0.033), ('inequality', 0.032), ('mjw', 0.032), ('ingredients', 0.032), ('hold', 0.032), ('claim', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs

Author: Garvesh Raskutti, Martin J. Wainwright, Bin Yu

Abstract: Methods based on ℓ1 -relaxation, such as basis pursuit and the Lasso, are very popular for sparse regression in high dimensions. The conditions for success of these methods are now well-understood: (1) exact recovery in the noiseless setting is possible if and only if the design matrix X satisfies the restricted nullspace property, and (2) the squared ℓ2 -error of a Lasso estimate decays at the minimax optimal rate k log p / n, where k is the sparsity of the p-dimensional regression problem with additive Gaussian noise, whenever the design satisfies a restricted eigenvalue condition. The key issue is thus to determine when the design matrix X satisfies these desirable properties. Thus far, there have been numerous results showing that the restricted isometry property, which implies both the restricted nullspace and eigenvalue conditions, is satisfied when all entries of X are independent and identically distributed (i.i.d.), or the rows are unitary. This paper proves directly that the restricted nullspace and eigenvalue conditions hold with high probability for quite general classes of Gaussian matrices for which the predictors may be highly dependent, and hence restricted isometry conditions can be violated with high probability. In this way, our results extend the attractive theoretical guarantees on ℓ1 -relaxations to a much broader class of problems than the case of completely independent or unitary designs. Keywords: Lasso, basis pursuit, random matrix theory, Gaussian comparison inequality, concentration of measure

2 0.23471038 45 jmlr-2010-High-dimensional Variable Selection with Sparse Random Projections: Measurement Sparsity and Statistical Efficiency

Author: Dapo Omidiran, Martin J. Wainwright

Abstract: We consider the problem of high-dimensional variable selection: given n noisy observations of a k-sparse vector β∗ ∈ R p , estimate the subset of non-zero entries of β∗ . A significant body of work has studied behavior of ℓ1 -relaxations when applied to random measurement matrices that are dense (e.g., Gaussian, Bernoulli). In this paper, we analyze sparsified measurement ensembles, and consider the trade-off between measurement sparsity, as measured by the fraction γ of nonzero entries, and the statistical efficiency, as measured by the minimal number of observations n required for correct variable selection with probability converging to one. Our main result is to prove that it is possible to let the fraction on non-zero entries γ → 0 at some rate, yielding measurement matrices with a vanishing fraction of non-zeros per row, while retaining the same statistical efficiency as dense ensembles. A variety of simulation results confirm the sharpness of our theoretical predictions. Keywords: variable selection, sparse random projections, high-dimensional statistics, Lasso, consistency, ℓ1 -regularization

3 0.18502542 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming

Author: Ming Yuan

Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity

4 0.13994309 96 jmlr-2010-Rate Minimaxity of the Lasso and Dantzig Selector for thelqLoss inlrBalls

Author: Fei Ye, Cun-Hui Zhang

Abstract: We consider the estimation of regression coefficients in a high-dimensional linear model. For regression coefficients in ℓr balls, we provide lower bounds for the minimax ℓq risk and minimax quantiles of the ℓq loss for all design matrices. Under an ℓ0 sparsity condition on a target coefficient vector, we sharpen and unify existing oracle inequalities for the Lasso and Dantzig selector. We derive oracle inequalities for target coefficient vectors with many small elements and smaller threshold levels than the universal threshold. These oracle inequalities provide sufficient conditions on the design matrix for the rate minimaxity of the Lasso and Dantzig selector for the ℓq risk and loss in ℓr balls, 0 ≤ r ≤ 1 ≤ q ≤ ∞. By allowing q = ∞, our risk bounds imply the variable selection consistency of threshold Lasso and Dantzig selectors. Keywords: variable selection, estimation, oracle inequality, minimax, linear regression, penalized least squares, linear programming

5 0.092130415 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization

Author: Tong Zhang

Abstract: We consider learning formulations with non-convex objective functions that often occur in practical applications. There are two approaches to this problem: • Heuristic methods such as gradient descent that only find a local minimum. A drawback of this approach is the lack of theoretical guarantee showing that the local minimum gives a good solution. • Convex relaxation such as L1 -regularization that solves the problem under some conditions. However it often leads to a sub-optimal solution in reality. This paper tries to remedy the above gap between theory and practice. In particular, we present a multi-stage convex relaxation scheme for solving problems with non-convex objective functions. For learning formulations with sparse regularization, we analyze the behavior of a specific multistage relaxation scheme. Under appropriate conditions, we show that the local solution obtained by this procedure is superior to the global solution of the standard L1 convex relaxation for learning sparse targets. Keywords: sparsity, non-convex optimization, convex relaxation, multi-stage convex relaxation

6 0.070227638 72 jmlr-2010-Matrix Completion from Noisy Entries

7 0.0690789 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes

8 0.064078704 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding

9 0.061994188 65 jmlr-2010-Learning Translation Invariant Kernels for Classification

10 0.060904864 41 jmlr-2010-Gaussian Processes for Machine Learning (GPML) Toolbox

11 0.059751336 82 jmlr-2010-On Learning with Integral Operators

12 0.055234924 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring

13 0.055053744 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning

14 0.054977566 108 jmlr-2010-Stochastic Complexity and Generalization Error of a Restricted Boltzmann Machine in Bayesian Estimation

15 0.051244833 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression

16 0.049448077 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization

17 0.049411088 66 jmlr-2010-Linear Algorithms for Online Multitask Classification

18 0.048222754 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions

19 0.046402685 84 jmlr-2010-On Spectral Learning

20 0.046019599 31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.223), (1, -0.162), (2, 0.186), (3, -0.087), (4, -0.238), (5, -0.18), (6, 0.096), (7, -0.183), (8, -0.05), (9, -0.073), (10, 0.149), (11, -0.105), (12, -0.027), (13, 0.149), (14, -0.13), (15, -0.019), (16, -0.096), (17, 0.025), (18, 0.041), (19, 0.111), (20, 0.071), (21, -0.051), (22, -0.082), (23, -0.107), (24, -0.111), (25, 0.095), (26, 0.059), (27, 0.006), (28, 0.048), (29, -0.095), (30, -0.048), (31, 0.022), (32, -0.111), (33, -0.021), (34, -0.065), (35, 0.033), (36, 0.014), (37, 0.008), (38, 0.05), (39, 0.011), (40, -0.031), (41, -0.016), (42, 0.012), (43, -0.055), (44, 0.058), (45, 0.002), (46, -0.081), (47, 0.053), (48, -0.008), (49, 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95617008 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs

Author: Garvesh Raskutti, Martin J. Wainwright, Bin Yu

Abstract: Methods based on ℓ1 -relaxation, such as basis pursuit and the Lasso, are very popular for sparse regression in high dimensions. The conditions for success of these methods are now well-understood: (1) exact recovery in the noiseless setting is possible if and only if the design matrix X satisfies the restricted nullspace property, and (2) the squared ℓ2 -error of a Lasso estimate decays at the minimax optimal rate k log p , where k is the sparsity of the p-dimensional regression problem with additive n Gaussian noise, whenever the design satisfies a restricted eigenvalue condition. The key issue is thus to determine when the design matrix X satisfies these desirable properties. Thus far, there have been numerous results showing that the restricted isometry property, which implies both the restricted nullspace and eigenvalue conditions, is satisfied when all entries of X are independent and identically distributed (i.i.d.), or the rows are unitary. This paper proves directly that the restricted nullspace and eigenvalue conditions hold with high probability for quite general classes of Gaussian matrices for which the predictors may be highly dependent, and hence restricted isometry conditions can be violated with high probability. In this way, our results extend the attractive theoretical guarantees on ℓ1 -relaxations to a much broader class of problems than the case of completely independent or unitary designs. Keywords: Lasso, basis pursuit, random matrix theory, Gaussian comparison inequality, concentration of measure

2 0.81119555 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming

Author: Ming Yuan

Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity

3 0.80582935 96 jmlr-2010-Rate Minimaxity of the Lasso and Dantzig Selector for thelqLoss inlrBalls

Author: Fei Ye, Cun-Hui Zhang

Abstract: We consider the estimation of regression coefficients in a high-dimensional linear model. For regression coefficients in ℓr balls, we provide lower bounds for the minimax ℓq risk and minimax quantiles of the ℓq loss for all design matrices. Under an ℓ0 sparsity condition on a target coefficient vector, we sharpen and unify existing oracle inequalities for the Lasso and Dantzig selector. We derive oracle inequalities for target coefficient vectors with many small elements and smaller threshold levels than the universal threshold. These oracle inequalities provide sufficient conditions on the design matrix for the rate minimaxity of the Lasso and Dantzig selector for the ℓq risk and loss in ℓr balls, 0 ≤ r ≤ 1 ≤ q ≤ ∞. By allowing q = ∞, our risk bounds imply the variable selection consistency of threshold Lasso and Dantzig selectors. Keywords: variable selection, estimation, oracle inequality, minimax, linear regression, penalized least squares, linear programming

4 0.77380836 45 jmlr-2010-High-dimensional Variable Selection with Sparse Random Projections: Measurement Sparsity and Statistical Efficiency

Author: Dapo Omidiran, Martin J. Wainwright

Abstract: We consider the problem of high-dimensional variable selection: given n noisy observations of a k-sparse vector β∗ ∈ R p , estimate the subset of non-zero entries of β∗ . A significant body of work has studied behavior of ℓ1 -relaxations when applied to random measurement matrices that are dense (e.g., Gaussian, Bernoulli). In this paper, we analyze sparsified measurement ensembles, and consider the trade-off between measurement sparsity, as measured by the fraction γ of nonzero entries, and the statistical efficiency, as measured by the minimal number of observations n required for correct variable selection with probability converging to one. Our main result is to prove that it is possible to let the fraction on non-zero entries γ → 0 at some rate, yielding measurement matrices with a vanishing fraction of non-zeros per row, while retaining the same statistical efficiency as dense ensembles. A variety of simulation results confirm the sharpness of our theoretical predictions. Keywords: variable selection, sparse random projections, high-dimensional statistics, Lasso, consistency, ℓ1 -regularization

5 0.45616174 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring

Author: Jean-Yves Audibert, Sébastien Bubeck

Abstract: This work deals with four classical prediction settings, namely full information, bandit, label efficient and bandit label efficient as well as four different notions of regret: pseudo-regret, expected regret, high probability regret and tracking the best expert regret. We introduce a new forecaster, INF (Implicitly Normalized Forecaster) based on an arbitrary function ψ for which we propose a unified γ analysis of its pseudo-regret in the four games we consider. In particular, for ψ(x) = exp(ηx) + K , INF reduces to the classical exponentially weighted average forecaster and our analysis of the pseudo-regret recovers known results while for the expected regret we slightly tighten the bounds. γ η q On the other hand with ψ(x) = −x + K , which defines a new forecaster, we are able to remove the extraneous logarithmic factor in the pseudo-regret bounds for bandits games, and thus fill in a long open gap in the characterization of the minimax rate for the pseudo-regret in the bandit game. We also provide high probability bounds depending on the cumulative reward of the optimal action. Finally, we consider the stochastic bandit game, and prove that an appropriate modification of the upper confidence bound policy UCB1 (Auer et al., 2002a) achieves the distribution-free optimal rate while still having a distribution-dependent rate logarithmic in the number of plays. Keywords: Bandits (adversarial and stochastic), regret bound, minimax rate, label efficient, upper confidence bound (UCB) policy, online learning, prediction with limited feedback.

6 0.42612901 72 jmlr-2010-Matrix Completion from Noisy Entries

7 0.35199514 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes

8 0.3118324 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization

9 0.30968821 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding

10 0.3031666 108 jmlr-2010-Stochastic Complexity and Generalization Error of a Restricted Boltzmann Machine in Bayesian Estimation

11 0.27340198 41 jmlr-2010-Gaussian Processes for Machine Learning (GPML) Toolbox

12 0.26120952 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization

13 0.25498658 66 jmlr-2010-Linear Algorithms for Online Multitask Classification

14 0.25240916 65 jmlr-2010-Learning Translation Invariant Kernels for Classification

15 0.25043148 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression

16 0.24915607 84 jmlr-2010-On Spectral Learning

17 0.24528882 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing

18 0.24490656 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning

19 0.23926653 102 jmlr-2010-Semi-Supervised Novelty Detection

20 0.23215148 76 jmlr-2010-Message-passing for Graph-structured Linear Programs: Proximal Methods and Rounding Schemes


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.039), (8, 0.068), (15, 0.012), (21, 0.015), (32, 0.027), (36, 0.018), (37, 0.582), (52, 0.011), (75, 0.099), (85, 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.97086424 41 jmlr-2010-Gaussian Processes for Machine Learning (GPML) Toolbox

Author: Carl Edward Rasmussen, Hannes Nickisch

Abstract: The GPML toolbox provides a wide range of functionality for Gaussian process (GP) inference and prediction. GPs are specified by mean and covariance functions; we offer a library of simple mean and covariance functions and mechanisms to compose more complex ones. Several likelihood functions are supported including Gaussian and heavy-tailed for regression as well as others suitable for classification. Finally, a range of inference methods is provided, including exact and variational inference, Expectation Propagation, and Laplace’s method dealing with non-Gaussian likelihoods and FITC for dealing with large regression tasks. Keywords: Gaussian processes, nonparametric Bayes, probabilistic regression and classification Gaussian processes (GPs) (Rasmussen and Williams, 2006) have convenient properties for many modelling tasks in machine learning and statistics. They can be used to specify distributions over functions without having to commit to a specific functional form. Applications range from regression over classification to reinforcement learning, spatial models, survival and other time series1 models. Predictions of GP models come with a natural confidence measure: predictive error-bars. Although the implementation of the basic principles in the simplest case is straight forward, various complicating features are often desired in practice. For example, a GP is determined by a mean function and a covariance function, but these functions are mostly difficult to specify fully a priori, and typically they are given in terms of hyperparameters, that is, parameters which have to be inferred. Another source of difficulty is the likelihood function. For Gaussian likelihoods, inference is analytically tractable; however, in many tasks, Gaussian likelihoods are not appropriate, and approximate inference methods such as Expectation Propagation (EP) (Minka, 2001), Laplace’s approximation (LA) (Williams and Barber, 1998) and variational bounds (VB) (Gibbs and MacKay, 2000) become necessary (Nickisch and Rasmussen, 2008). In case of large training data, approximations (Candela and Rasmussen, 2005) like FITC (Snelson and Ghahramani, 2006) are needed. The GPML toolbox is designed to overcome these hurdles with its variety of mean, covariance and likelihood functions as well as inference methods, while being simple to use and easy to extend. ∗. Also at Max Planck Institute for Biological Cybernetics, Spemannstraße 38, 72076 T¨ bingen, Germany. u 1. Note, that here we typically think of GPs with a more general index set than time. ©2010 Carl Edward Rasmussen and Hannes Nickisch. R ASMUSSEN AND N ICKISCH 1. Implementation The GPML toolbox can be obtained from http://gaussianprocess.org/gpml/code/matlab/ and also http://mloss.org/software/view/263/ under the FreeBSD license. Based on simple interfaces for covariance, mean, likelihood functions as well as inference methods, we offer full compatibility to both Matlab 7.x2 and GNU Octave 3.2.x.3 Special attention has been given to properly disentangle covariance, likelihood and mean hyperparameters. Also, care has been taken to avoid numerical inaccuracies, for example, safe likelihood evaluations for extreme inputs and stable matrix operations. For example, the covariance matrix K can become numerically close to singular making its naive inversion numerically unsafe. We handle these situations in a principled way4 such that Cholesky decompositions are computed of well-conditioned matrices only. 
As a result, our code shows a high level of robustness along the full spectrum of possible hyperparameters. The focus of the toolbox is on approximate inference using dense matrix algebra. We currently do not support covariance matrix approximation techniques to deal with large numbers of training examples n. Looking at the (growing) body of literature on sparse approximations, this knowledge is still somewhat in flux, and consensus on the best approaches has not yet been reached. We provide stable and modular code checked by an exhaustive suite of test cases. A single function gp.m serves as main interface to the user—it can make inference and predictions and allows the mean, covariance and likelihood function as well as the inference methods to be specified freely. Furthermore, gp.m enables convenient learning of the hyperparameters by maximising the log marginal likelihood ln Z. One of the particularly appealing properties of GP models is that principled and practical approaches exist for learning the parameters of mean, covariance and likelihood functions. Good adaptation of such parameters can be essential to obtain both high quality predictions and insights into the properties of the data. The GPML toolbox is particularly flexible, including a large library of different covariance and mean functions, and flexible ways to combine these into more expressive, specialised functions. The user can choose between two gradient-based optimisers: one uses conjugate gradients (CG)5 and the other one relies on a quasi-Newton scheme.6 ∂ Computing the derivatives w.r.t. hyperparameters ∂θi ln Z with gp.m does not need any extra programming effort; every inference method automatically collects the respective derivatives from the mean, covariance and likelihood functions and passes them to gp.m. Our documentation comes in two pieces: a hypertext user documentation7 doc/index.html with examples and code browsing and a technical documentation8 doc/manual.pdf focusing on the interfaces and more technical issues. A casual user will use the hypertext document to quickly get his data analysed, however a power user will consult the pdf document once he wants to include his own mean, covariance, likelihood and inference routines or learn about implementation details. 2. 3. 4. 5. 6. 7. 8. Matlab is available from MathWorks, http://www.mathworks.com/. Octave is available from the Free Software Foundation, http://www.gnu.org/software/octave/. We do not consider the “blind” addition of a “small ridge” to K a principled way. Carl Rasmussen’s code is available at http://www.kyb.tuebingen.mpg.de/bs/people/carl/code/minimize/. Peter Carbonetto’s wrapper can be found at http://www.cs.ubc.ca/˜pcarbo/lbfgsb-for-matlab.html. Documentation can be found at http://www.gaussianprocess.org/gpml/code/matlab/doc/index.html. Technical docs are available at http://www.gaussianprocess.org/gpml/code/matlab/doc/manual.pdf. 3012 G AUSSIAN P ROCESSES FOR M ACHINE L EARNING T OOLBOX 2. The GPML Toolbox We illustrate the modular structure of the GPML toolbox by means of a simple code example. GPs are used to formalise and update knowledge about distributions over functions. A GP prior distribution on an unknown latent function f ∼ GP (mφ (x), kψ (x, x′ )), consists of a mean function m(x) = E[ f (x)], and a covariance function k(x, x) = E[( f (x) − m(x))( f (x′ ) − m(x′ ))], both of which typically contain hyperparameters φ and ψ, which we want to fit in the light of data. 
We generally assume independent observations, that is, input/output pairs (xi , yi ) of f with joint likelihood Pρ (y|f) = ∏n Pρ (yi | f (xi )) factorising over cases. Finally, after specification of the prior and i=1 fitting of the hyperparameters θ = {φ, ψ, ρ}, we wish to compute predictive distributions for test cases. 1 2 3 4 5 6 7 % 1) SET UP THE GP : COVARIANCE ; MEAN , LIKELIHOOD , INFERENCE METHOD mf = { ’ meanSum ’ ,{ ’ meanLinear ’, @meanConst }}; a = 2; b = 1; % m(x) = a*x+b cf = { ’ covSEiso ’}; sf = 1; ell = 0.7; % squared exponential covariance funct lf = ’ likLaplace ’; sn = 0.2; % assume Laplace noise with variance sn ˆ2 hyp0 . mean = [a;b ]; hyp0 . cov = log([ ell ; sf ]); hyp0 . lik = log( sn ); % hypers inf = ’ infEP ’; % specify expectation propagation as inference method % 2) MINIMISE NEGATIVE LOG MARGINAL LIKELIHOOD nlZ wrt . hyp ; do 50 CG steps Ncg = 50; [hyp , nlZ ] = minimize ( hyp0 , ’gp ’, -Ncg , inf , mf , cf , lf , X , y ); % 3) PREDICT AT UNKNOWN TEST INPUTS [ymu , ys2 ] = gp (hyp , inf , mf , cf , lf , X , y , Xs ); % test input Xs In line 1, we specify the mean mφ (x) = a⊤ x + b of the GP with hyperparameters φ = {a, b}. First, the functional form of the mean function is given and its parameters are initialised. The desired mean function, happens not to exist in the library of mean functions; instead we have to make a composite mean function from simple constituents. This is done using a nested cell array containing the algebraic expression for m(x): As the sum of a linear (mean/meanLinear.m) and a constant mean function (mean/meanConst.m) it is an affine function. In addition to linear and constant mean functions, the toolbox offers m(x) = 0 and m(x) = 1. These simple mean functions can be combined by composite mean functions to obtain sums (mean/meanSum.m) m(x) = ∑ j m j (x), products m(x) = ∏ j m j (x), scaled versions m(x) = αm0 (x) and powers m(x) = m0 (x)d . This flexible mechanism is used for convenient specification of an extensible algebra of mean functions. Note that functions are referred to either as name strings ’meanConst’ or alternatively function handles @meanConst. The order of components of the hyperparameters φ is the same as in the specification of the cell array. Every mean function implements its evaluation m = mφ (X) and first derivative ∂ computation mi = ∂φi mφ (X) on a data set X. In the same spirit, the squared exponential covariance kψ (x, x′ ) = σ f ² exp(− x − x′ 2 /2ℓ2 ) (cov/covSEiso.m) with hyperparameters ψ = {ln ℓ, ln σ f } is set up in line 2. Note, that the hyperparameters are represented by the logarithms, as these parameters are naturally positive. Many other simple covariance functions are contained in the toolbox. Among others, we offer linear, constant, Mat´ rn, rational quadratic, polynomial, periodic, neural network and finite support coe variance functions. Composite covariance functions allow for sums k(x, x′ ) = ∑ j k j (x, x′ ), products k(x, x′ ) = ∏ j k j (x, x′ ), positive scaling k(x, x′ ) = σ2 k0 (x, x′ ) and masking of components f k(x, x′ ) = k0 (xI , x′ ) with I ⊆ [1, 2, .., D], x ∈ RD . Again, the interface is simple since only the I ∂ evaluation of the covariance matrix K = kψ (X) and its derivatives ∂i K = ∂ψi kψ (X) on a data set X are required. Furthermore, we need cross terms k∗ = kψ (X, x∗ ) and k∗∗ = kψ (x∗ , x∗ ) for prediction. There are no restrictions on the composition of both mean and covariance functions—any combination is allowed including nested composition. 
3013 R ASMUSSEN AND N ICKISCH √ √ The Laplace (lik/likLaplace.m) likelihood Pρ (y| f ) = exp(− 2/σn |y − f |)/ 2σn with hyperparameters ρ = {ln σn } is specified in line 3. There are only simple likelihood functions: Gaussian, Sech-squared, Laplacian and Student’s t for ordinary and sparse regression as well as the error and the logistic function for classification. Again, the same inference code is used for any likelihood function. Although the specification of likelihood functions is simple for the user, writing new likelihood functions is slightly more involved as different inference methods require access to different properties; for example, LA requires second derivatives and EP requires derivatives of moments. All hyperparameters θ = {φ, ψ, ρ} are stored in a struct hyp.{mean,cov,lik}, which is initialised in line 4; we select the approximate inference algorithm EP (inf/infEP.m) in line 5. We optimise the hyperparameters θ ≡ hyp by calling the CG optimiser (util/minimize.m) with initial value θ0 ≡ hyp0 in line 6 allowing at most N = 50 evaluations of the EP approximation to the marginal likelihood ZEP (θ) as done by gp.m. Here, D = (X, y) ≡ (X,y) is the training data where X = {x1 , .., xn } and y ∈ Rn . Under the hood, gp.m computes in every step a Gaussian ∂ posterior approximation and the derivatives ∂θ ln ZEP (θ) of the marginal likelihood by calling EP. Predictions with optimised hyperparameters are done in line 7, where we call gp.m with the unseen test inputs X∗ ≡ Xs as additional argument. As a result, we obtain the approximate marginal predictive mean E[P(y∗ |D , X∗ )] ≡ ymu and the predictive variance V[P(y∗ |D , X∗ )] ≡ ys2. Likelihood \ Inference Gaussian Sech-squared Laplacian Student’s t Error function Logistic function Exact FITC EP Laplace VB Type, Output Domain regression, R regression, R regression, R regression, R classification, {±1} classification, {±1} Alternate Name logistic distribution double exponential probit regression logit regression Table 1: Likelihood ↔ inference compatibility in the GPML toolbox Table 1 gives the legal likelihood/inference combinations. Exact inference and the FITC approximation support the Gaussian likelihood only. Variational Bayesian (VB) inference is applicable to all likelihoods. Expectation propagation (EP) for the Student’s t likelihood is inherently unstable due to its non-log-concavity. The Laplace approximation (LA) for Laplace likelihoods is not sensible due to the non-differentiable peak of the Laplace likelihood. Special care has been taken for the non-convex optimisation problem imposed by the combination Student’s t likelihood and LA. If the number of training examples is larger than a few thousand, dense matrix computations become too slow. We provide the FITC approximation for regression with Gaussian likelihood where ˜ instead of the exact covariance matrix K, a low-rank plus diagonal matrix K = Q + diag(K − Q) ⊤ K−1 K is used. The matrices K and K contain covariances and cross-covariances where Q = Ku uu u uu u of and between inducing inputs ui and data points x j . Using inf/infFITC.m together with any covariance function wrapped into cov/covFITC.m makes the computations feasible for large n. Acknowledgments Thanks to Ed Snelson for assisting with the FITC approximation. 3014 G AUSSIAN P ROCESSES FOR M ACHINE L EARNING T OOLBOX References Joaquin Qui˜ onero Candela and Carl E. Rasmussen. A unifying view of sparse approximate Gausn sian process regression. Journal of Machine Learning Research, 6(6):1935–1959, 2005. 
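The low-rank plus diagonal surrogate K̃ = Q + diag(K − Q) can also be written out directly for intuition. The snippet below is purely illustrative and reuses X from the sketch above; the inducing inputs u, the jitter term, and the covSEiso calling convention are assumptions, and in practice inf/infFITC.m with a covariance wrapped by cov/covFITC.m performs this computation internally and more efficiently.

    % Illustrative construction of Ktilde = Q + diag(K - Q) with Q = Ku' * inv(Kuu) * Ku.
    u    = linspace(0, 8, 10)';                      % hypothetical inducing inputs u_i
    hypc = log([1.0; 1.0]);                          % covSEiso hyperparameters [ln ell; ln sf]
    K    = covSEiso(hypc, X);                        % n x n covariance of the data points x_j
    Kuu  = covSEiso(hypc, u) + 1e-6*eye(numel(u));   % m x m inducing covariance, small jitter added
    Ku   = covSEiso(hypc, u, X);                     % m x n cross-covariances between u_i and x_j
    Q    = Ku' * (Kuu \ Ku);                         % low-rank part of the approximation
    Ktilde = Q + diag(diag(K - Q));                  % restore the exact diagonal of K

With the toolbox itself, one would instead wrap the covariance, for example as {@covFITC, {@covSEiso}, u}, and select inf/infFITC.m with the Gaussian likelihood; the exact wrapping syntax here is an assumption based on the v3 interface.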

same-paper 2 0.91707873 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs

Author: Garvesh Raskutti, Martin J. Wainwright, Bin Yu

Abstract: Methods based on ℓ1-relaxation, such as basis pursuit and the Lasso, are very popular for sparse regression in high dimensions. The conditions for success of these methods are now well-understood: (1) exact recovery in the noiseless setting is possible if and only if the design matrix X satisfies the restricted nullspace property, and (2) the squared ℓ2-error of a Lasso estimate decays at the minimax optimal rate k log p / n, where k is the sparsity of the p-dimensional regression problem with additive Gaussian noise, whenever the design satisfies a restricted eigenvalue condition. The key issue is thus to determine when the design matrix X satisfies these desirable properties. Thus far, there have been numerous results showing that the restricted isometry property, which implies both the restricted nullspace and eigenvalue conditions, is satisfied when all entries of X are independent and identically distributed (i.i.d.), or the rows are unitary. This paper proves directly that the restricted nullspace and eigenvalue conditions hold with high probability for quite general classes of Gaussian matrices for which the predictors may be highly dependent, and hence restricted isometry conditions can be violated with high probability. In this way, our results extend the attractive theoretical guarantees on ℓ1-relaxations to a much broader class of problems than the case of completely independent or unitary designs. Keywords: Lasso, basis pursuit, random matrix theory, Gaussian comparison inequality, concentration of measure

3 0.83409333 110 jmlr-2010-The SHOGUN Machine Learning Toolbox

Author: Sören Sonnenburg, Gunnar Rätsch, Sebastian Henschel, Christian Widmer, Jonas Behr, Alexander Zien, Fabio de Bona, Alexander Binder, Christian Gehl, Vojtěch Franc

Abstract: We have developed a machine learning toolbox, called SHOGUN, which is designed for unified large-scale learning for a broad range of feature types and learning settings. It offers a considerable number of machine learning models such as support vector machines, hidden Markov models, multiple kernel learning, linear discriminant analysis, and more. Most of the specific algorithms are able to deal with several different data classes. We have used this toolbox in several applications from computational biology, some of them coming with no less than 50 million training examples and others with 7 billion test examples. With more than a thousand installations worldwide, SHOGUN is already widely adopted in the machine learning community and beyond. SHOGUN is implemented in C++ and interfaces to MATLAB™, R, Octave, and Python, and has a stand-alone command line interface. The source code is freely available under the GNU General Public License, Version 3 at http://www.shogun-toolbox.org. Keywords: support vector machines, kernels, large-scale learning, Python, Octave, R

4 0.80174339 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data

Author: Gideon S. Mann, Andrew McCallum

Abstract: In this paper, we present an overview of generalized expectation criteria (GE), a simple, robust, scalable method for semi-supervised training using weakly-labeled data. GE fits model parameters by favoring models that match certain expectation constraints, such as marginal label distributions, on the unlabeled data. This paper shows how to apply generalized expectation criteria to two classes of parametric models: maximum entropy models and conditional random fields. Experimental results demonstrate accuracy improvements over supervised training and a number of other state-of-the-art semi-supervised learning methods for these models. Keywords: generalized expectation criteria, semi-supervised learning, logistic regression, conditional random fields

5 0.79730481 57 jmlr-2010-Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models

Author: Fang-Lan Huang, Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin

Abstract: Maximum entropy (Maxent) is useful in natural language processing and many other areas. Iterative scaling (IS) methods are one of the most popular approaches to solve Maxent. With many variants of IS methods, it is difficult to understand them and see the differences. In this paper, we create a general and unified framework for iterative scaling methods. This framework also connects iterative scaling and coordinate descent methods. We prove general convergence results for IS methods and analyze their computational complexity. Based on the proposed framework, we extend a coordinate descent method for linear SVM to Maxent. Results show that it is faster than existing iterative scaling methods. Keywords: maximum entropy, iterative scaling, coordinate descent, natural language processing, optimization

6 0.61601919 104 jmlr-2010-Sparse Spectrum Gaussian Process Regression

7 0.58851123 118 jmlr-2010-libDAI: A Free and Open Source C++ Library for Discrete Approximate Inference in Graphical Models

8 0.50791109 116 jmlr-2010-WEKA−Experiences with a Java Open-Source Project

9 0.49392581 96 jmlr-2010-Rate Minimaxity of the Lasso and Dantzig Selector for thelqLoss inlrBalls

10 0.48742199 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming

11 0.47174707 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions

12 0.47041589 45 jmlr-2010-High-dimensional Variable Selection with Sparse Random Projections: Measurement Sparsity and Statistical Efficiency

13 0.46885884 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring

14 0.46566486 24 jmlr-2010-Collective Inference for Extraction MRFs Coupled with Symmetric Clique Potentials

15 0.43703249 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning

16 0.43082988 109 jmlr-2010-Stochastic Composite Likelihood

17 0.41472471 100 jmlr-2010-SFO: A Toolbox for Submodular Function Optimization

18 0.40835059 15 jmlr-2010-Approximate Tree Kernels

19 0.40333161 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data

20 0.40332261 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes