jmlr jmlr2013 jmlr2013-119 knowledge-graph by maker-knowledge-mining

119 jmlr-2013-Variable Selection in High-Dimension with Random Designs and Orthogonal Matching Pursuit


Source: pdf

Author: Antony Joseph

Abstract: The performance of orthogonal matching pursuit (OMP) for variable selection is analyzed for random designs. When contrasted with the deterministic case, since the performance is here measured after averaging over the distribution of the design matrix, one can have far less stringent sparsity constraints on the coefficient vector. We demonstrate that for exact sparse vectors, the performance of the OMP is similar to known results on the Lasso algorithm (Wainwright, 2009). Moreover, variable selection under a more relaxed sparsity assumption on the coefficient vector, whereby one has only control on the ℓ1 norm of the smaller coefficients, is also analyzed. As a consequence of these results, we also show that the coefficient estimate satisfies strong oracle type inequalities. Keywords: high dimensional regression, greedy algorithms, Lasso, compressed sensing

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Department of Statistics, University of California, Berkeley, Berkeley, CA 94720, USA. Editor: Xiaotong Shen. Abstract: The performance of orthogonal matching pursuit (OMP) for variable selection is analyzed for random designs. [sent-4, score-0.122]

2 , 1993) is a variant of the matching pursuit algorithm (Mallat and Zhang, 1993), where successive fits are computed through the least squares projection of Y on the current set of selected terms. [sent-17, score-0.099]

3 With random designs one can have reliable detection of the support with far less stringent c 2013 Antony Joseph. [sent-22, score-0.117]

4 Firstly, we give results on partial support recovery, which is important since exact recovery of support places strong requirements on n if some of the non-zero elements are small in magnitude. [sent-26, score-0.158]

5 , p} to be the set of indices corresponding to columns in the X matrix. [sent-49, score-0.08]

6 Defining d(i) = d(i − 1) ∪ a(i) as the set of detected columns after i steps, step i + 1 of the algorithm only operates on the columns in Ji+1 = J − d(i), that is, the columns not detected in the previous steps. [sent-54, score-0.21]

7 In other words, indices detected in previous steps remain detected. [sent-55, score-0.075]

8 In practice, however, the algorithm should be performed after standardizing the columns of X to have average 0 and norm √n. [sent-60, score-0.096]
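
As a rough illustration of this preprocessing step (my own sketch, not code from the paper), the following numpy function centers each column of an n-by-p design matrix and rescales it to norm √n:

    import numpy as np

    def standardize_columns(X):
        """Center each column to average 0 and rescale it to have norm sqrt(n)."""
        n = X.shape[0]
        Xc = X - X.mean(axis=0)                 # average 0 per column
        norms = np.linalg.norm(Xc, axis=0)
        norms[norms == 0] = 1.0                 # guard against constant columns
        return Xc * (np.sqrt(n) / norms)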

9 Also, the newly selected term a(i) may be equivalently expressed as a(i) = arg min j∈J inf w∈R ‖Y − Fiti−1 − wXj‖², where Fiti−1 is the least squares fit of Y on the columns in d(i − 1). [sent-74, score-0.114]
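
To make the iteration concrete, here is a minimal, generic sketch of OMP built around this selection rule (an illustration, not the paper's implementation). The stopping rule used here, a cap of max_steps iterations plus a small residual tolerance, is an assumption for the sketch; the analysis in the text instead stops via a thresholded statistic.

    import numpy as np

    def omp(X, Y, max_steps, tol=1e-12):
        """Greedy OMP sketch: at each step pick the undetected column whose best
        single-coefficient fit to the current residual reduces the squared error
        the most, then refit Y by least squares on all detected columns.
        Columns of X are assumed nonzero (e.g., standardized as above)."""
        n, p = X.shape
        detected = []                       # d(i): indices detected so far
        residual = Y.copy()                 # R_0 = Y, since Fit_0 = 0
        coef = np.zeros(0)
        for _ in range(max_steps):
            # inf_w ||R - w X_j||^2 = ||R||^2 - (X_j^T R)^2 / ||X_j||^2, so the
            # minimizing j is the one maximizing |X_j^T R| / ||X_j||.
            scores = np.abs(X.T @ residual) / np.linalg.norm(X, axis=0)
            scores[detected] = -np.inf      # indices detected earlier stay detected
            j = int(np.argmax(scores))
            detected.append(j)
            # Fit_i: least squares projection of Y on the detected columns.
            coef, *_ = np.linalg.lstsq(X[:, detected], Y, rcond=None)
            residual = Y - X[:, detected] @ coef
            if np.linalg.norm(residual) < tol:
                break
        beta = np.zeros(p)
        beta[detected] = coef
        return beta, detected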

10 A similar statistic was used by Fletcher and Rangan (2011) for an asymptotic analysis of the OMP for exact support recovery using i.i.d. designs. [sent-88, score-0.157]

11 Further, denote Ŝ as the estimate of S0 obtained using either method, and E = {Ŝ ≠ S0} the error event that one is not able to recover the support exactly. [sent-104, score-0.099]

12 A common sufficient condition on X for this type of recovery is the mutual incoherence condition, which requires that the inner product between distinct columns be small. [sent-108, score-0.206]
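
For intuition, the incoherence of a given design matrix can be checked directly; the short sketch below (my own illustration, assuming nonzero columns) computes the largest absolute correlation between distinct normalized columns:

    import numpy as np

    def mutual_incoherence(X):
        """Largest absolute inner product between distinct normalized columns."""
        Xn = X / np.linalg.norm(X, axis=0)     # normalize columns
        G = np.abs(Xn.T @ Xn)                  # pairwise correlations
        np.fill_diagonal(G, 0.0)               # ignore self products
        return G.max()

For an i.i.d. Gaussian design this quantity is typically of order √(log p / n), which is roughly why deterministic incoherence-based guarantees become restrictive once the sparsity k is large relative to √(n/ log p).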

13 Similar requirements are needed for the irrepresentable condition to hold. [sent-120, score-0.106]

14 Recovery using the irrepresentable condition has been shown for Lasso (Zhao and Yu, 2006; Wainwright, 2009), and for the OMP (Zhang, 2009a; Cai and Wang, 2011). [sent-121, score-0.077]

15 A natural question is what requirements on X ensure recovery in an average sense, as opposed to the strong sense described above. [sent-122, score-0.136]

16 It is shown that under certain conditions on Σ, which can be described as population counterparts of the conditions for deterministic X’s, one can recover S0 with high probability with n = Ω(k log p) observations, with the constant depending inversely on β²min. [sent-128, score-0.1]

17 The form of n is in a sense ideal since now k = O(n/ log p) is nearly the same as n, if we ignore the log p factor. [sent-129, score-0.122]

18 As mentioned earlier, apart from establishing similar properties to hold for the OMP with k-sparse vectors, we also demonstrate strong support recovery results under a more general notion of sparsity. [sent-130, score-0.134]

19 Notation: For a set A ⊆ J, we denote by XA the sub-matrix of X consisting of the columns with indices in A. [sent-141, score-0.08]

20 i.i.d. designs and for k-sparse vectors, similar to that in Section 2. [sent-153, score-0.117]

21 However, there the analysis was for exact support recovery and was asymptotic in nature. [sent-155, score-0.129]

22 One consequence of our results is that n = Ω(k log p) samples are sufficient for the recovery of any coefficient vector with βmin that is at least the same order as the noise level. [sent-158, score-0.169]

23 Condition 2, which bounds the ℓ2 norm of the noise vector, is required for controlling the norm of the residuals Ri . [sent-214, score-0.107]

24 Below, we state the theorem giving sufficient conditions on n for reliable recovery of the support of β. [sent-216, score-0.107]

25 (16) In particular, if β²min > α then Ŝ = S0, that is, the support is recovered exactly, with probability at least 1 − perr, k. [sent-234, score-0.083]

26 We remark that the proof of the theorem shows that the algorithm stops in at most k steps, with probability at least 1 − perr,k . [sent-236, score-0.102]

27 Further, if α is taken to be less than β²min, then the above theorem guarantees exact recovery of the support. [sent-240, score-0.175]

28 Correspondingly, from (15) and (12), one sees that if n = max{b1 k, b̄2/β²min} log p for some b1, b̄2 > 0, then the support can be recovered exactly with high probability. [sent-241, score-0.15]

29 1, the algorithm should be implemented after standardizing the X matrix by subtracting out this estimated mean vector, followed by scaling the columns to ensure that they have norm √n. [sent-284, score-0.096]

30 There exist smin, smax > 0 so that λmin(ΣT T) ≥ smin and λmax(ΣT T) ≤ smax, (17) uniformly for all subsets T with |T| = k. [sent-295, score-0.526]

31 This is essentially the population analog of the irrepresentable condition (5). [sent-299, score-0.104]

32 In this case, assumptions (17, 18) for exactly sparse vectors are identical to the sufficient conditions for support recovery for the Lasso by Wainwright (2009). [sent-309, score-0.128]

33 As an example, for the standard Gaussian design, condition (17) is satisfied with smin = smax = 1. [sent-310, score-0.282]

34 It is well known (see, for example, Cai and Wang, 2011; Tropp, 2004) that if the correlations between any two distinct columns are small, as required by the incoherence condition, then both the sparse eigenvalue condition (17) and the irrepresentable condition (18) hold. [sent-315, score-0.197]

35 (22) Define smin = 1 − ω0/2, smax = 1 + ω0/2, (23) and ν1 = ν + ω0η, ν̃1 = ω0η, ω = ω0. (24) Then conditions (17)-(19) hold for k = 1, . . . [sent-323, score-0.258]

36 , k̄, with the above values of smin, smax, ω, ν1 and ν̃1. [sent-326, score-0.258]

37 Equation (21) controls the maximum correlation between distinct columns and can be regarded as the population analog of the incoherence condition (4). [sent-329, score-0.126]

38 Further, the quantities smin, smax, ω, ν1 and ν̃1 will be as in (23) and (24). [sent-332, score-0.282]

39 We note that this goal is different from that required in Zhang and Huang (2008) for support recovery with approximately sparse β. [sent-336, score-0.128]

40 These will now be expressed as functions of ν, ω0 and η, using the various quantities smin, smax, ω, ν1 and ν̃1 defined in (23) and (24). [sent-342, score-0.282]

41 We define the values of λmin, λmax and λ, with hu = (1 + h), in the following manner: λmin = smin hℓ and λmax = smax hu. [sent-347, score-0.398]

42 Before stating the analog of Corollary 2, as an aside, we give implications of the above theorem for exact recovery of support for k-sparse vectors and i.i.d. designs. [sent-368, score-0.156]

43 i.i.d. Gaussian designs, that there is a sharp threshold, namely n ≍ 2k log p, for exact recovery of the support as n, p, k, as well as kβ²min/σ², tend to infinity. [sent-374, score-0.213]

44 For i.i.d. Gaussian designs and exact sparse vectors, smin = smax = 1 and ω, ν1, ν̃1 and η are all zero. [sent-378, score-0.418]

45 Accordingly, from (29), one sees that if n ≈ 2(1 + a)k log p, for large k, p, one can recover the support exactly, with probability at least 1 − perr, k . [sent-387, score-0.158]
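
As a quick numerical illustration of this threshold (my own sketch, with the natural logarithm assumed and a playing the role of the constant in the display above):

    import math

    def sample_size_exact_recovery(k, p, a):
        """Rough sample-size guide n ≈ 2(1 + a) k log p from the discussion above."""
        return math.ceil(2 * (1 + a) * k * math.log(p))

    # e.g., k = 50, p = 10000, a = 0.1 gives n ≈ 1014 observations
    print(sample_size_exact_recovery(50, 10000, 0.1))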

46 In this case, one gets the threshold n ≈ 2k log p for exact recovery. [sent-389, score-0.127]

47 Also, as in the case of sub-Gaussian designs, the proof demonstrates that the algorithm stops within k steps, with probability at least 1 − perr, k. [sent-405, score-0.102]

48 From (31), one sees that the larger coefficients, that is, those with magnitude Ω(kµn), are contained in Ŝ with high probability. [sent-407, score-0.089]

49 More explicitly, (β̂j : j ∈ Ŝ) is simply the least squares estimate when Y is regressed on XŜ, and β̂j = 0 for j ∈ Ŝc. [sent-419, score-0.076]
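
A minimal sketch of this refitting step (an illustration of the definition as stated, not the paper's code), where S_hat is the set of detected indices:

    import numpy as np

    def refit_least_squares(X, Y, S_hat):
        """Least squares fit of Y on the columns in S_hat; zero elsewhere."""
        p = X.shape[1]
        beta_hat = np.zeros(p)
        idx = list(S_hat)
        if idx:
            coef, *_ = np.linalg.lstsq(X[:, idx], Y, rcond=None)
            beta_hat[idx] = coef
        return beta_hat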

50 For the above values of η, ω0 and with ν = 1, evaluate the quantities smin, smax as well as ν1, ν̃1 and ω using expressions (23) and (24). [sent-423, score-0.282]

51 For fixed such β, if n ≥ ξ̄k̄ log p, then the following holds with probability at least 1 − perr, k: ‖β̂ − β‖² ≤ C ∑_{j=1}^p min(βj², σ²µn²), where C = (4/9)r². [sent-428, score-0.121]

52 If n ≥ ξ∗k̄ log p, then for C1 = (4/9)(r∗)², the following holds except on a set with probability perr, k: ‖β̂ − β‖² ≤ C1 ∑_{j=1}^p min(βj², σ²µn²). [sent-442, score-0.121]

53 Denote Zi = max j∈S0 |Zij| and Z̃i = max j∈S0^c |Zij|. Notice that if Zi > τ and Zi > Z̃i, then the index detected in step i, that is a(i), belongs to S. [sent-448, score-0.147]
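
As a sketch of how this comparison could be checked for a known support S0 (purely illustrative; here I take Zij = Xj^T Ri−1/‖Ri−1‖, matching the form of the T statistics described below, and assume S0 is a nonempty proper subset of the column indices):

    import numpy as np

    def step_selects_true_support(X, residual, S0, tau):
        """Check the sufficient condition: the best in-support statistic exceeds
        tau and beats every out-of-support statistic."""
        Z = X.T @ residual / np.linalg.norm(residual)   # Z_{ij} for the current step
        mask = np.zeros(X.shape[1], dtype=bool)
        mask[list(S0)] = True
        Z_in = np.max(np.abs(Z[mask]))      # Z_i: max over j in S0
        Z_out = np.max(np.abs(Z[~mask]))    # Z~_i: max over j outside S0
        return bool(Z_in > tau and Z_in > Z_out)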

54 Let Eu be the event that statement (16) does not hold for the truncated problem. [sent-470, score-0.11]

55 Denote Ti = max j∈S0 Xj^T R̃i−1/‖R̃i−1‖ and T̃i = max j∈S0^c Xj^T R̃i−1/‖R̃i−1‖, for i = 1, . . . [sent-472, score-0.108]

56 Notice that the statistics Ti, T̃i are similar to Zi, Z̃i; the only difference is that the residuals involved in the former arise from running the algorithm on the truncated problem, whereas in the latter they arise from consideration of the original problem. [sent-476, score-0.077]

57 Further, let Ef be the event Ef = { T̃i > τ, T̃i ≥ Ti for some i ≤ m + 1 }. [sent-477, score-0.097]

58 ϕ̂ is the least squares estimate when U is regressed on HS2 and ϕ̂j = 0 for j not in S2; we use the following Lemma, the proof of which is similar to the analysis by Zhang (2009a). [sent-511, score-0.107]

59 Lemma 9 Let β̂ls be the least squares fit when Y is regressed on XS0. [sent-532, score-0.076]

60 Correspondingly, from Lemma 13(b), the event {max j∈J |Z1j| > τ} has probability at most perr, 0 = 2/p^a. [sent-552, score-0.111]

61 Correspondingly, from Lemma 13(b), one gets that max j |Zj| is less than (max j σj)τ, except on a set with probability 2k/p^(1+a). [sent-560, score-0.094]

62 Now, using ξ(δ∗) = 16r², notice that ξ(δ∗)kτ² = ξ̄k̄ log p. [sent-567, score-0.112]

63 Correspondingly, since n ≥ ξ̄k̄ log p, one gets that n = ξ̄(δ)kτ², (37) for some δ ≥ δ∗. [sent-568, score-0.078]

64 Analogous to before, we initially pretend that X = XS and β = βS, and run the algorithm on the truncated problem to get residuals R̃0, R̃1, R̃2, . . . [sent-582, score-0.098]

65 Further, as before, let Eu be the event that statement (30) is not met for this truncated problem. [sent-587, score-0.133]

66 Lemma 10 Parts (i)-(iii) of this lemma demonstrate that requirements (i)-(iii) of Lemma 8 are satisfied with high probability. [sent-597, score-0.098]

67 , m + 1, Vji = bj W^T R̃i−1/‖R̃i−1‖ + Eji, where Eji = Zj^T R̃i−1/‖R̃i−1‖. [sent-619, score-0.234]

68 Here, the first inequality follows from using (42) and aj^T XS^T R̃i−1/‖R̃i−1‖ ≤ ‖aj‖1 Ti, along with the fact that |Vji| is bounded by (1 − ω)τ1 on Ec. [sent-628, score-0.078]

69 What remains to be seen is that the probability of the event Eu can be bounded as before. [sent-637, score-0.111]

70 If k = 0, we will show that the probability that max j∈J |Z1j| exceeds τ1 is at most perr, 0. [sent-647, score-0.091]

71 Further, using ‖Y‖/σY ≤ (1 + µn), which holds with probability at least 1 − 1/p from Lemma 14, one has that the first term on the right side of (46) is at most ν1τ(1 + k^(−1/2)), with probability at least 1 − 1/p. [sent-653, score-0.098]

72 Denoting τ2 = [ν1(1 + k^(−1/2)) + 1]τ, one sees that max j∈J |Z1j| ≤ τ2, with probability at least 1 − perr, 0. [sent-655, score-0.149]

73 Notice that since τ1 ≥ τ2, the event {max j∈J |Z̃1j| ≤ τ1} also has probability at least 1 − perr, 0. [sent-656, score-0.165]

74 As before, taking ᾱ(δ) = σ²/[(1 + δ)k] and ξ̄(δ) = ξ(ᾱ(δ), δ), we notice that ρ2 ξ̄(δ∗)kτ² = ξ̄k̄ log p, where δ∗ = 3. [sent-659, score-0.112]

75 Correspondingly, using part (b) of Lemma 8, with τ0 = τ1 and r2 = r̃2, one gets that ‖β̂S − βS‖ ≤ r̃2στ1(k/n) / (1 − τ1r1(k/n)), (48) with probability at least 1 − perr, k. [sent-668, score-0.077]

76 Remark: Since we take n > 2k log p, we have θn ≤ 1. [sent-683, score-0.092]

77 Now, taking r = µn, one has, using the above, that with probability at least 1 − 2/p the following holds: hℓ‖D^(1/2)v‖ ≤ ((1/n)‖Uv‖²)^(1/2) ≤ hu‖D^(1/2)v‖ for all v ∈ Rk. [sent-698, score-0.084]

78 Now, notice that since XS = UΣSS^(1/2), one has from the above that, with probability at least 1 − 2/p, hℓ‖ΣSS^(1/2)v‖ ≤ ((1/n)‖XSv‖²)^(1/2) ≤ hu‖ΣSS^(1/2)v‖ for all v ∈ Rk. [sent-699, score-0.158]

79 Correspondingly, from (17), one has smin ≤ ‖ΣSS^(1/2)v‖²/‖v‖² ≤ smax, which implies that, with probability at least 1 − 2/p, λmin‖v‖² ≤ (1/n)‖XSv‖² ≤ λmax‖v‖² for all v ∈ Rk, where λmin, λmax are as in (26). [sent-700, score-0.077]

80 Now from Lemma 14, the probability of the event ‖ε‖²/(nσ²) > (1 + µn)² is bounded by 1/p. [sent-705, score-0.037]

81 Correspondingly, from Lemma 14, the event {‖W‖/√n > (1 + µn)} has probability at most 1/p. [sent-729, score-0.111]

82 Consequently, the Eji’s are standard normal random variables; indeed, conditional on the R̃i’s, they follow N(0, 1), and hence follow the same distribution unconditionally. [sent-734, score-0.078]

83 Accordingly, using the same logic as in the proof of Theorem 1, the event {max 1≤i≤m+1, j∈Sc Eji > τ} (50) has probability bounded by √(2/π)(k + 1)/(τp^a). [sent-735, score-0.274]

84 Consequently, using the bounds on |bj| and the above, one gets that, except on a set with probability 1/p + √(2/π)(k + 1)/(τp^a), one has max j∈Sc |Vji| ≤ ν1µn√n(1 + µn) + τ. [sent-736, score-0.209]

85 As mentioned earlier, one drawback of the analysis is the crude manner in which the probability of event (50), that no terms outside of S are selected, is bounded. [sent-751, score-0.111]

86 In Fletcher and Rangan (2011), a more careful analysis was carried out for exact recovery with i.i.d. designs. [sent-753, score-0.129]

87 Their analysis carries over to the general case analyzed here, by noting that the random variables Eji, for i = 1, . . . [sent-757, score-0.099]

88 This should improve the probability of the event (50) to something a. [sent-764, score-0.111]

89 To be consistent with their notation, let’s assume that the entries of X are scaled so that the columns have norm equal (or nearly equal) to one. [sent-768, score-0.096]

90 Consequently, using Lemma 17 and the above, one has that max j∈u(i) Hj^T Ri ≥ √n ρ2 (‖Ûd(i) − ÛS‖/√n) / (|u(i)|(‖ϕu(i)‖ + σ)), where ρ2 = ρ1/λmax. [sent-837, score-0.1]

91 Now, use the fact that ‖ϕ̂ls − ϕ‖∞ is bounded by c0στ0/√n, along with the fact that ‖ÛS − US‖/√n ≥ √λmin‖ϕ̂ − ϕ̂ls‖, to get from (52) and (53) that ‖ϕ̂F̂2‖ ≤ c0στ0|F̂2|/√n + τ0r1|F̂2|(‖ϕ̂F̂2‖ + σ)/√n, (54) when the algorithm stops. [sent-845, score-0.134]

92 This leads us to (35), which completes the proof of part (a). For part (b), notice that ‖ϕ̂ − ϕ‖ ≤ √k‖ϕ̂ls − ϕ‖∞ + ‖ϕ̂ls − ϕ̂‖. [sent-850, score-0.132]

93 Use the fact (Cai and Wang, 2011, Lemma 2) that 1 − γ(k − 1) ≤ smin ≤ smax ≤ 1 + γ(k − 1). [sent-858, score-0.258]

94 Orthogonal matching pursuit for sparse signal recovery with noise. [sent-916, score-0.203]

95 Near-optimal signal recovery from random projections: Universal encoding e strategies? [sent-947, score-0.107]

96 Stable recovery of sparse overcomplete representations in the presence of noise. [sent-969, score-0.128]

97 A simple lemma for optimization in a Hilbert space, with application to projection pursuit and neural net training. [sent-998, score-0.117]

98 Signal recovery from random measurements via orthogonal matching pursuit. [sent-1076, score-0.16]

99 Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1 constrained quadratic programming (lasso). [sent-1083, score-0.14]

100 The sparsity and bias of the lasso selection in high-dimensional linear regression. [sent-1091, score-0.087]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('sc', 0.51), ('perr', 0.423), ('omp', 0.255), ('xs', 0.188), ('econd', 0.186), ('correspondingly', 0.184), ('ss', 0.178), ('eu', 0.155), ('oseph', 0.155), ('ri', 0.147), ('smin', 0.134), ('ls', 0.134), ('smax', 0.124), ('ssc', 0.124), ('designs', 0.117), ('els', 0.114), ('recovery', 0.107), ('zi', 0.1), ('ti', 0.095), ('election', 0.086), ('cand', 0.078), ('ji', 0.078), ('event', 0.074), ('notice', 0.074), ('tropp', 0.074), ('lemma', 0.069), ('sees', 0.058), ('lasso', 0.054), ('max', 0.054), ('irrepresentable', 0.053), ('rangan', 0.052), ('regressed', 0.052), ('cov', 0.048), ('pursuit', 0.048), ('fletcher', 0.048), ('hu', 0.047), ('min', 0.046), ('coef', 0.045), ('jt', 0.045), ('barron', 0.044), ('columns', 0.044), ('undetected', 0.041), ('consequently', 0.041), ('residuals', 0.041), ('gets', 0.04), ('detected', 0.039), ('ith', 0.039), ('log', 0.038), ('probability', 0.037), ('wainwright', 0.036), ('truncated', 0.036), ('indices', 0.036), ('cai', 0.034), ('stops', 0.034), ('sparsity', 0.033), ('gilbert', 0.032), ('pa', 0.032), ('ud', 0.032), ('magnitude', 0.031), ('entries', 0.031), ('zhang', 0.031), ('htw', 0.031), ('jw', 0.031), ('standardizing', 0.031), ('proof', 0.031), ('incoherence', 0.031), ('rows', 0.029), ('requirements', 0.029), ('corollary', 0.028), ('statistic', 0.028), ('joseph', 0.027), ('matching', 0.027), ('analog', 0.027), ('apart', 0.027), ('threshold', 0.027), ('completes', 0.027), ('orthogonal', 0.026), ('compressed', 0.026), ('recover', 0.025), ('accordingly', 0.025), ('tao', 0.025), ('side', 0.024), ('condition', 0.024), ('quantities', 0.024), ('squares', 0.024), ('noise', 0.024), ('iii', 0.023), ('let', 0.023), ('averaging', 0.022), ('meinshausen', 0.022), ('exact', 0.022), ('pi', 0.021), ('analyzed', 0.021), ('sparse', 0.021), ('antony', 0.021), ('buhlmann', 0.021), ('fiti', 0.021), ('pretend', 0.021), ('norm', 0.021), ('hd', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 119 jmlr-2013-Variable Selection in High-Dimension with Random Designs and Orthogonal Matching Pursuit

Author: Antony Joseph

Abstract: The performance of orthogonal matching pursuit (OMP) for variable selection is analyzed for random designs. When contrasted with the deterministic case, since the performance is here measured after averaging over the distribution of the design matrix, one can have far less stringent sparsity constraints on the coefficient vector. We demonstrate that for exact sparse vectors, the performance of the OMP is similar to known results on the Lasso algorithm (Wainwright, 2009). Moreover, variable selection under a more relaxed sparsity assumption on the coefficient vector, whereby one has only control on the ℓ1 norm of the smaller coefficients, is also analyzed. As a consequence of these results, we also show that the coefficient estimate satisfies strong oracle type inequalities. Keywords: high dimensional regression, greedy algorithms, Lasso, compressed sensing

2 0.11899038 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering

Author: Eva L. Dyer, Aswin C. Sankaranarayanan, Richard G. Baraniuk

Abstract: Unions of subspaces provide a powerful generalization of single subspace models for collections of high-dimensional data; however, learning multiple subspaces from data is challenging due to the fact that segmentation—the identification of points that live in the same subspace—and subspace estimation must be performed simultaneously. Recently, sparse recovery methods were shown to provide a provable and robust strategy for exact feature selection (EFS)—recovering subsets of points from the ensemble that live in the same subspace. In parallel with recent studies of EFS with ℓ1 -minimization, in this paper, we develop sufficient conditions for EFS with a greedy method for sparse signal recovery known as orthogonal matching pursuit (OMP). Following our analysis, we provide an empirical study of feature selection strategies for signals living on unions of subspaces and characterize the gap between sparse recovery methods and nearest neighbor (NN)-based approaches. In particular, we demonstrate that sparse recovery methods provide significant advantages over NN methods and that the gap between the two approaches is particularly pronounced when the sampling of subspaces in the data set is sparse. Our results suggest that OMP may be employed to reliably recover exact feature sets in a number of regimes where NN approaches fail to reveal the subspace membership of points in the ensemble. Keywords: subspace clustering, unions of subspaces, hybrid linear models, sparse approximation, structured sparsity, nearest neighbors, low-rank approximation

3 0.10767374 64 jmlr-2013-Lovasz theta function, SVMs and Finding Dense Subgraphs

Author: Vinay Jethava, Anders Martinsson, Chiranjib Bhattacharyya, Devdatt Dubhashi

Abstract: In this paper we establish that the Lovász ϑ function on a graph can be restated as a kernel learning problem. We introduce the notion of SVM − ϑ graphs, on which the Lovász ϑ function can be approximated well by a Support vector machine (SVM). We show that Erdős-Rényi random G(n, p) graphs are SVM − ϑ graphs for log⁴n/n ≤ p < 1. Even if we embed a large clique of size Θ(√(np/(1−p))) in a G(n, p) graph the resultant graph still remains an SVM − ϑ graph. This immediately suggests an SVM based algorithm for recovering a large planted clique in random graphs. Associated with the ϑ function is the notion of orthogonal labellings. We introduce common orthogonal labellings which extend the idea of orthogonal labellings to multiple graphs. This allows us to propose a Multiple Kernel learning (MKL) based solution which is capable of identifying a large common dense subgraph in multiple graphs. Both in the planted clique case and common subgraph detection problem the proposed solutions beat the state of the art by an order of magnitude. Keywords: orthogonal labellings of graphs, planted cliques, random graphs, common dense subgraph

4 0.090095922 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

Author: Tingni Sun, Cun-Hui Zhang

Abstract: We propose a new method of learning a sparse nonnegative-definite target matrix. Our primary example of the target matrix is the inverse of a population covariance or correlation matrix. The algorithm first estimates each column of the target matrix by the scaled Lasso and then adjusts the matrix estimator to be symmetric. The penalty level of the scaled Lasso for each column is completely determined by data via convex minimization, without using cross-validation. We prove that this scaled Lasso method guarantees the fastest proven rate of convergence in the spectrum norm under conditions of weaker form than those in the existing analyses of other ℓ1 regularized algorithms, and has faster guaranteed rate of convergence when the ratio of the ℓ1 and spectrum norms of the target inverse matrix diverges to infinity. A simulation study demonstrates the computational feasibility and superb performance of the proposed method. Our analysis also provides new performance bounds for the Lasso and scaled Lasso to guarantee higher concentration of the error at a smaller threshold level than previous analyses, and to allow the use of the union bound in column-by-column applications of the scaled Lasso without an adjustment of the penalty level. In addition, the least squares estimation after the scaled Lasso selection is considered and proven to guarantee performance bounds similar to that of the scaled Lasso. Keywords: precision matrix, concentration matrix, inverse matrix, graphical model, scaled Lasso, linear regression, spectrum norm

5 0.075359397 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning

Author: Pinghua Gong, Jieping Ye, Changshui Zhang

Abstract: Multi-task sparse feature learning aims to improve the generalization performance by exploiting the shared features among tasks. It has been successfully applied to many applications including computer vision and biomedical informatics. Most of the existing multi-task sparse feature learning algorithms are formulated as a convex sparse regularization problem, which is usually suboptimal, due to its looseness for approximating an ℓ0 -type regularizer. In this paper, we propose a non-convex formulation for multi-task sparse feature learning based on a novel non-convex regularizer. To solve the non-convex optimization problem, we propose a Multi-Stage Multi-Task Feature Learning (MSMTFL) algorithm; we also provide intuitive interpretations, detailed convergence and reproducibility analysis for the proposed algorithm. Moreover, we present a detailed theoretical analysis showing that MSMTFL achieves a better parameter estimation error bound than the convex formulation. Empirical studies on both synthetic and real-world data sets demonstrate the effectiveness of MSMTFL in comparison with the state of the art multi-task sparse feature learning algorithms. Keywords: multi-task learning, multi-stage, non-convex, sparse learning

6 0.07262487 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

7 0.059246976 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning

8 0.054637399 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres

9 0.046903234 27 jmlr-2013-Consistent Selection of Tuning Parameters via Variable Selection Stability

10 0.044315774 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation

11 0.041190132 104 jmlr-2013-Sparse Single-Index Model

12 0.038024299 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach

13 0.037971038 114 jmlr-2013-The Rate of Convergence of AdaBoost

14 0.037819237 84 jmlr-2013-PC Algorithm for Nonparanormal Graphical Models

15 0.037188787 51 jmlr-2013-Greedy Sparsity-Constrained Optimization

16 0.036409486 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification

17 0.034699321 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows

18 0.03409249 76 jmlr-2013-Nonparametric Sparsity and Regularization

19 0.033934847 20 jmlr-2013-CODA: High Dimensional Copula Discriminant Analysis

20 0.032771807 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.182), (1, 0.065), (2, 0.046), (3, 0.062), (4, 0.133), (5, -0.038), (6, 0.025), (7, -0.026), (8, 0.178), (9, 0.03), (10, 0.005), (11, -0.018), (12, -0.086), (13, 0.132), (14, 0.05), (15, 0.079), (16, -0.086), (17, -0.125), (18, -0.155), (19, -0.308), (20, 0.213), (21, -0.051), (22, -0.12), (23, 0.137), (24, 0.057), (25, 0.071), (26, 0.166), (27, 0.1), (28, -0.007), (29, 0.021), (30, -0.025), (31, 0.059), (32, -0.011), (33, 0.049), (34, 0.035), (35, -0.025), (36, 0.198), (37, -0.101), (38, -0.057), (39, -0.044), (40, -0.04), (41, -0.178), (42, -0.038), (43, -0.027), (44, 0.036), (45, 0.02), (46, 0.002), (47, -0.019), (48, 0.076), (49, 0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92623508 119 jmlr-2013-Variable Selection in High-Dimension with Random Designs and Orthogonal Matching Pursuit

Author: Antony Joseph

Abstract: The performance of orthogonal matching pursuit (OMP) for variable selection is analyzed for random designs. When contrasted with the deterministic case, since the performance is here measured after averaging over the distribution of the design matrix, one can have far less stringent sparsity constraints on the coefficient vector. We demonstrate that for exact sparse vectors, the performance of the OMP is similar to known results on the Lasso algorithm (Wainwright, 2009). Moreover, variable selection under a more relaxed sparsity assumption on the coefficient vector, whereby one has only control on the ℓ1 norm of the smaller coefficients, is also analyzed. As consequence of these results, we also show that the coefficient estimate satisfies strong oracle type inequalities. Keywords: high dimensional regression, greedy algorithms, Lasso, compressed sensing

2 0.68464822 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering

Author: Eva L. Dyer, Aswin C. Sankaranarayanan, Richard G. Baraniuk

Abstract: Unions of subspaces provide a powerful generalization of single subspace models for collections of high-dimensional data; however, learning multiple subspaces from data is challenging due to the fact that segmentation—the identification of points that live in the same subspace—and subspace estimation must be performed simultaneously. Recently, sparse recovery methods were shown to provide a provable and robust strategy for exact feature selection (EFS)—recovering subsets of points from the ensemble that live in the same subspace. In parallel with recent studies of EFS with ℓ1 -minimization, in this paper, we develop sufficient conditions for EFS with a greedy method for sparse signal recovery known as orthogonal matching pursuit (OMP). Following our analysis, we provide an empirical study of feature selection strategies for signals living on unions of subspaces and characterize the gap between sparse recovery methods and nearest neighbor (NN)-based approaches. In particular, we demonstrate that sparse recovery methods provide significant advantages over NN methods and that the gap between the two approaches is particularly pronounced when the sampling of subspaces in the data set is sparse. Our results suggest that OMP may be employed to reliably recover exact feature sets in a number of regimes where NN approaches fail to reveal the subspace membership of points in the ensemble. Keywords: subspace clustering, unions of subspaces, hybrid linear models, sparse approximation, structured sparsity, nearest neighbors, low-rank approximation

3 0.51150537 64 jmlr-2013-Lovasz theta function, SVMs and Finding Dense Subgraphs

Author: Vinay Jethava, Anders Martinsson, Chiranjib Bhattacharyya, Devdatt Dubhashi

Abstract: In this paper we establish that the Lovász ϑ function on a graph can be restated as a kernel learning problem. We introduce the notion of SVM − ϑ graphs, on which the Lovász ϑ function can be approximated well by a Support vector machine (SVM). We show that Erdős-Rényi random G(n, p) graphs are SVM − ϑ graphs for log⁴n/n ≤ p < 1. Even if we embed a large clique of size Θ(√(np/(1−p))) in a G(n, p) graph the resultant graph still remains an SVM − ϑ graph. This immediately suggests an SVM based algorithm for recovering a large planted clique in random graphs. Associated with the ϑ function is the notion of orthogonal labellings. We introduce common orthogonal labellings which extend the idea of orthogonal labellings to multiple graphs. This allows us to propose a Multiple Kernel learning (MKL) based solution which is capable of identifying a large common dense subgraph in multiple graphs. Both in the planted clique case and common subgraph detection problem the proposed solutions beat the state of the art by an order of magnitude. Keywords: orthogonal labellings of graphs, planted cliques, random graphs, common dense subgraph

4 0.41382077 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres

Author: Tony Cai, Jianqing Fan, Tiefeng Jiang

Abstract: This paper studies the asymptotic behaviors of the pairwise angles among n randomly and uniformly distributed unit vectors in R p as the number of points n → ∞, while the dimension p is either fixed or growing with n. For both settings, we derive the limiting empirical distribution of the random angles and the limiting distributions of the extreme angles. The results reveal interesting differences in the two settings and provide a precise characterization of the folklore that “all high-dimensional random vectors are almost always nearly orthogonal to each other”. Applications to statistics and machine learning and connections with some open problems in physics and mathematics are also discussed. Keywords: random angle, uniform distribution on sphere, empirical law, maximum of random variables, minimum of random variables, extreme-value distribution, packing on sphere

5 0.39355683 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning

Author: Pinghua Gong, Jieping Ye, Changshui Zhang

Abstract: Multi-task sparse feature learning aims to improve the generalization performance by exploiting the shared features among tasks. It has been successfully applied to many applications including computer vision and biomedical informatics. Most of the existing multi-task sparse feature learning algorithms are formulated as a convex sparse regularization problem, which is usually suboptimal, due to its looseness for approximating an ℓ0 -type regularizer. In this paper, we propose a non-convex formulation for multi-task sparse feature learning based on a novel non-convex regularizer. To solve the non-convex optimization problem, we propose a Multi-Stage Multi-Task Feature Learning (MSMTFL) algorithm; we also provide intuitive interpretations, detailed convergence and reproducibility analysis for the proposed algorithm. Moreover, we present a detailed theoretical analysis showing that MSMTFL achieves a better parameter estimation error bound than the convex formulation. Empirical studies on both synthetic and real-world data sets demonstrate the effectiveness of MSMTFL in comparison with the state of the art multi-task sparse feature learning algorithms. Keywords: multi-task learning, multi-stage, non-convex, sparse learning

6 0.3592844 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

7 0.34463841 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

8 0.31077275 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning

9 0.30761737 81 jmlr-2013-Optimal Discovery with Probabilistic Expert Advice: Finite Time Analysis and Macroscopic Optimality

10 0.29429764 51 jmlr-2013-Greedy Sparsity-Constrained Optimization

11 0.28091377 104 jmlr-2013-Sparse Single-Index Model

12 0.23892485 27 jmlr-2013-Consistent Selection of Tuning Parameters via Variable Selection Stability

13 0.23837321 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation

14 0.2310936 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification

15 0.21809033 110 jmlr-2013-Sub-Local Constraint-Based Learning of Bayesian Networks Using A Joint Dependence Criterion

16 0.2126576 61 jmlr-2013-Learning Theory Analysis for Association Rules and Sequential Event Prediction

17 0.20591697 114 jmlr-2013-The Rate of Convergence of AdaBoost

18 0.1894501 116 jmlr-2013-Truncated Power Method for Sparse Eigenvalue Problems

19 0.18896337 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components

20 0.18728857 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.026), (5, 0.151), (6, 0.044), (10, 0.046), (14, 0.425), (20, 0.023), (23, 0.039), (68, 0.033), (70, 0.034), (75, 0.025), (85, 0.03), (87, 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.80371314 13 jmlr-2013-Approximating the Permanent with Fractional Belief Propagation

Author: Michael Chertkov, Adam B. Yedidia

Abstract: We discuss schemes for exact and approximate computations of permanents, and compare them with each other. Specifically, we analyze the belief propagation (BP) approach and its fractional belief propagation (FBP) generalization for computing the permanent of a non-negative matrix. Known bounds and Conjectures are verified in experiments, and some new theoretical relations, bounds and Conjectures are proposed. The fractional free energy (FFE) function is parameterized by a scalar parameter γ ∈ [−1; 1], where γ = −1 corresponds to the BP limit and γ = 1 corresponds to the exclusion principle (but ignoring perfect matching constraints) mean-field (MF) limit. FFE shows monotonicity and continuity with respect to γ. For every non-negative matrix, we define its special value γ∗ ∈ [−1; 0] to be the γ for which the minimum of the γ-parameterized FFE function is equal to the permanent of the matrix, where the lower and upper bounds of the γ-interval corresponds to respective bounds for the permanent. Our experimental analysis suggests that the distribution of γ∗ varies for different ensembles but γ∗ always lies within the [−1; −1/2] interval. Moreover, for all ensembles considered, the behavior of γ∗ is highly distinctive, offering an empirical practical guidance for estimating permanents of non-negative matrices via the FFE approach. Keywords: permanent, graphical models, belief propagation, exact and approximate algorithms, learning flows

same-paper 2 0.68608481 119 jmlr-2013-Variable Selection in High-Dimension with Random Designs and Orthogonal Matching Pursuit

Author: Antony Joseph

Abstract: The performance of orthogonal matching pursuit (OMP) for variable selection is analyzed for random designs. When contrasted with the deterministic case, since the performance is here measured after averaging over the distribution of the design matrix, one can have far less stringent sparsity constraints on the coefficient vector. We demonstrate that for exact sparse vectors, the performance of the OMP is similar to known results on the Lasso algorithm (Wainwright, 2009). Moreover, variable selection under a more relaxed sparsity assumption on the coefficient vector, whereby one has only control on the ℓ1 norm of the smaller coefficients, is also analyzed. As a consequence of these results, we also show that the coefficient estimate satisfies strong oracle type inequalities. Keywords: high dimensional regression, greedy algorithms, Lasso, compressed sensing

3 0.4317781 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering

Author: Eva L. Dyer, Aswin C. Sankaranarayanan, Richard G. Baraniuk

Abstract: Unions of subspaces provide a powerful generalization of single subspace models for collections of high-dimensional data; however, learning multiple subspaces from data is challenging due to the fact that segmentation—the identification of points that live in the same subspace—and subspace estimation must be performed simultaneously. Recently, sparse recovery methods were shown to provide a provable and robust strategy for exact feature selection (EFS)—recovering subsets of points from the ensemble that live in the same subspace. In parallel with recent studies of EFS with ℓ1 -minimization, in this paper, we develop sufficient conditions for EFS with a greedy method for sparse signal recovery known as orthogonal matching pursuit (OMP). Following our analysis, we provide an empirical study of feature selection strategies for signals living on unions of subspaces and characterize the gap between sparse recovery methods and nearest neighbor (NN)-based approaches. In particular, we demonstrate that sparse recovery methods provide significant advantages over NN methods and that the gap between the two approaches is particularly pronounced when the sampling of subspaces in the data set is sparse. Our results suggest that OMP may be employed to reliably recover exact feature sets in a number of regimes where NN approaches fail to reveal the subspace membership of points in the ensemble. Keywords: subspace clustering, unions of subspaces, hybrid linear models, sparse approximation, structured sparsity, nearest neighbors, low-rank approximation

4 0.4002077 53 jmlr-2013-Improving CUR Matrix Decomposition and the Nystrom Approximation via Adaptive Sampling

Author: Shusen Wang, Zhihua Zhang

Abstract: The CUR matrix decomposition and the Nyström approximation are two important low-rank matrix approximation techniques. The Nyström method approximates a symmetric positive semidefinite matrix in terms of a small number of its columns, while CUR approximates an arbitrary data matrix by a small number of its columns and rows. Thus, CUR decomposition can be regarded as an extension of the Nyström approximation. In this paper we establish a more general error bound for the adaptive column/row sampling algorithm, based on which we propose more accurate CUR and Nyström algorithms with expected relative-error bounds. The proposed CUR and Nyström algorithms also have low time complexity and can avoid maintaining the whole data matrix in RAM. In addition, we give theoretical analysis for the lower error bounds of the standard Nyström method and the ensemble Nyström method. The main theoretical results established in this paper are novel, and our analysis makes no special assumption on the data matrices. Keywords: large-scale matrix computation, CUR matrix decomposition, the Nyström method, randomized algorithms, adaptive sampling

5 0.39831218 114 jmlr-2013-The Rate of Convergence of AdaBoost

Author: Indraneel Mukherjee, Cynthia Rudin, Robert E. Schapire

Abstract: The AdaBoost algorithm was designed to combine many “weak” hypotheses that perform slightly better than random guessing into a “strong” hypothesis that has very low error. We study the rate at which AdaBoost iteratively converges to the minimum of the “exponential loss.” Unlike previous work, our proofs do not require a weak-learning assumption, nor do they require that minimizers of the exponential loss are finite. Our first result shows that the exponential loss of AdaBoost’s computed parameter vector will be at most ε more than that of any parameter vector of ℓ1 -norm bounded by B in a number of rounds that is at most a polynomial in B and 1/ε. We also provide lower bounds showing that a polynomial dependence is necessary. Our second result is that within C/ε iterations, AdaBoost achieves a value of the exponential loss that is at most ε more than the best possible value, where C depends on the data set. We show that this dependence of the rate on ε is optimal up to constant factors, that is, at least Ω(1/ε) rounds are necessary to achieve within ε of the optimal exponential loss. Keywords: AdaBoost, optimization, coordinate descent, convergence rate

6 0.39657521 64 jmlr-2013-Lovasz theta function, SVMs and Finding Dense Subgraphs

7 0.38598818 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization

8 0.38048404 88 jmlr-2013-Perturbative Corrections for Approximate Inference in Gaussian Latent Variable Models

9 0.38001493 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression

10 0.3785477 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

11 0.3784866 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning

12 0.37726972 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees

13 0.37615865 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

14 0.37410146 100 jmlr-2013-Similarity-based Clustering by Left-Stochastic Matrix Factorization

15 0.37300882 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation

16 0.3725754 15 jmlr-2013-Bayesian Canonical Correlation Analysis

17 0.37171206 2 jmlr-2013-A Binary-Classification-Based Metric between Time-Series Distributions and Its Use in Statistical and Learning Problems

18 0.3715232 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems

19 0.37043366 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning

20 0.36940378 52 jmlr-2013-How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis