jmlr jmlr2011 jmlr2011-6 knowledge-graph by maker-knowledge-mining

6 jmlr-2011-A Simpler Approach to Matrix Completion


Source: pdf

Author: Benjamin Recht

Abstract: This paper provides the best bounds to date on the number of randomly sampled entries required to reconstruct an unknown low-rank matrix. These results improve on prior work by Candès and Recht (2009), Candès and Tao (2009), and Keshavan et al. (2009). The reconstruction is accomplished by minimizing the nuclear norm, or sum of the singular values, of the hidden matrix subject to agreement with the provided entries. If the underlying matrix satisfies a certain incoherence condition, then the number of entries required is equal to a quadratic logarithmic factor times the number of parameters in the singular value decomposition. The proof of this assertion is short, self-contained, and uses very elementary analysis. The novel techniques herein are based on recent work in quantum information theory. Keywords: matrix completion, low-rank matrices, convex optimization, nuclear norm minimization, random matrices, operator Chernoff bound, compressed sensing

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI 53706. Editor: Francis Bach. Abstract: This paper provides the best bounds to date on the number of randomly sampled entries required to reconstruct an unknown low-rank matrix. [sent-3, score-0.183]

2 The reconstruction is accomplished by minimizing the nuclear norm, or sum of the singular values, of the hidden matrix subject to agreement with the provided entries. [sent-6, score-0.332]

3 If the underlying matrix satisfies a certain incoherence condition, then the number of entries required is equal to a quadratic logarithmic factor times the number of parameters in the singular value decomposition. [sent-7, score-0.218]

4 The novel techniques herein are based on recent work in quantum information theory. [sent-9, score-0.107]

5 Keywords: matrix completion, low-rank matrices, convex optimization, nuclear norm minimization, random matrices, operator Chernoff bound, compressed sensing 1. [sent-10, score-0.4]

6 Introduction Recovering a low-rank matrix from a partial sampling of its entries is a recurring problem in collaborative filtering (Rennie and Srebro, 2005; Koren et al. [sent-11, score-0.166]

7 While a variety of heuristics have been developed across many disciplines, the general problem of finding the lowest rank matrix satisfying equality constraints is NP-hard. [sent-17, score-0.096]

8 All known algorithms which can compute the lowest rank solution for all instances require time at least exponential in the dimensions of the matrix in both theory and practice (Chistov and Grigoriev, 1984). [sent-18, score-0.096]

9 The nuclear norm is equal to the sum of the singular values of a matrix and is the best convex lower bound of the rank function on the set of matrices whose singular values are all bounded by 1. [sent-20, score-0.576]

10 The intuition behind this heuristic is that whereas the rank function counts the number of nonvanishing singular values, the nuclear norm sums their amplitude, much like how the ℓ1 norm is a useful surrogate for counting the number of nonzeros in a vector. [sent-21, score-0.493]

11 Moreover, the nuclear norm can be minimized subject to equality constraints via semidefinite programming. [sent-22, score-0.286]
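
A minimal sketch of this semidefinite-programming route in Python; numpy and cvxpy are assumed dependencies, and the matrix sizes, rank, and sampled entries below are illustrative rather than taken from the paper.

# Sketch: exact matrix completion by nuclear norm minimization.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n1, n2, r = 30, 40, 2
M = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))  # hidden rank-r matrix

m = 600                                                 # number of revealed entries
idx = rng.choice(n1 * n2, size=m, replace=False)
omega = [(i // n2, i % n2) for i in idx]

X = cp.Variable((n1, n2))
constraints = [X[a, b] == M[a, b] for (a, b) in omega]  # agree with observed entries
cp.Problem(cp.Minimize(cp.normNuc(X)), constraints).solve()

print("relative recovery error:",
      np.linalg.norm(X.value - M, "fro") / np.linalg.norm(M, "fro"))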

12 (2010), where the authors developed probabilistic techniques to study average case behavior and showed that the nuclear norm heuristic could solve most instances of the linearly constrained rank-minimization problem assuming the number of linear constraints was sufficiently large. [sent-27, score-0.307]

13 (2009) to show that one could, in special cases, reconstruct a low-rank matrix by observing a set of entries of size at most a polylogarithmic factor larger than the intrinsic dimension of the variety of rank r matrices. [sent-31, score-0.206]

14 (2009) to provide a bound on the number of entries required to reconstruct a low-rank matrix which is optimal up to a small numerical constant and one logarithmic factor. [sent-33, score-0.148]

15 Candès and Recht observed that it is impossible to recover a matrix which is equal to zero in nearly all of its entries unless all of the entries of the matrix are observed (consider, for example, the rank one matrix which is equal to 1 in one entry and zeros everywhere else). [sent-37, score-0.391]

16 µ(U) = (n/r) max_{1≤i≤n} ‖PU ei‖². Note that for any subspace, the smallest µ(U) can be is 1, achieved, for example, if U is spanned by vectors whose entries all have magnitude 1/√n. [sent-41, score-0.087]
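
As a small numerical illustration of this coherence quantity (numpy assumed; the subspaces below are hypothetical examples, not from the paper):

# Sketch: coherence mu(U) = (n / r) * max_i ||P_U e_i||^2 for an n x r orthonormal basis U.
import numpy as np

def coherence(U):
    n, r = U.shape
    leverage = np.sum(U**2, axis=1)        # ||P_U e_i||^2 = sum_k U[i, k]^2
    return (n / r) * leverage.max()

rng = np.random.default_rng(0)
n, r = 200, 5
Q, _ = np.linalg.qr(rng.standard_normal((n, r)))
print(coherence(Q))                         # a generic random subspace has modest coherence
print(coherence(np.eye(n)[:, :r]))          # spanned by standard basis vectors: n / r, the maximum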

17 If a matrix has row and column spaces with low coherence, then each entry can be expected to provide about the same amount of information. [sent-43, score-0.083]

18 Recall that the nuclear norm of an n1 × n2 matrix X is the sum of the singular values of X, ‖X‖∗ = ∑_{k=1}^{min{n1, n2}} σk(X), where, here and below, σk(X) denotes the kth largest singular value of X. [sent-44, score-0.48]
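
The nuclear norm itself is easy to evaluate; a one-line check with numpy (assumed dependency):

# Sketch: nuclear norm as the sum of singular values.
import numpy as np

X = np.random.default_rng(0).standard_normal((6, 4))
print(np.linalg.svd(X, compute_uv=False).sum())   # sum of singular values
print(np.linalg.norm(X, "nuc"))                   # built-in nuclear norm, same value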

19 The main result of this paper is the following Theorem 2 Let M be an n1 × n2 matrix of rank r with singular value decomposition U ΣV ∗ . [sent-45, score-0.164]

20 A1 The matrix UV∗ has a maximum entry bounded by µ1 √(r/(n1 n2)) in absolute value for some positive µ1. [sent-48, score-0.083]

21 Suppose m entries of M are observed with locations sampled uniformly at random. [sent-49, score-0.174]

22 If r > log(n2), the number of entries can be reduced to cu r(n1 + n2) log⁴(n2). [sent-57, score-0.116]

23 Similarly, there is a numerical constant ci such that ci µ0² r(n1 + n2) log³(n2) entries are sufficient to recover a matrix of arbitrary rank r whose singular vectors have entries with magnitudes bounded by µ0/n1. [sent-58, score-0.338]

24 Many of their results also require restrictions on the rank of M , and their bounds depend superlinearly on µ0 . [sent-62, score-0.076]

25 (2009) require the matrix rank to be no more than log(n2 ), and require bounds on the maximum magnitude of the entries in M and the ratios σ1 (M )/σr (M ) and n2 /n1 . [sent-64, score-0.201]

26 It is a consequence of the coupon collector’s problem that at least n2 log n2 uniformly sampled entries are necessary just to guarantee that at least one entry in every row and column is observed with high probability. [sent-68, score-0.244]
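
A quick Monte Carlo sketch of this coupon-collector effect (numpy assumed; the dimensions and trial counts are illustrative):

# Sketch: fraction of trials in which m uniformly sampled entries touch every row and column.
import numpy as np

rng = np.random.default_rng(0)
n1, n2, trials = 100, 100, 200

def hit_all_rows_and_cols(m):
    hits = 0
    for _ in range(trials):
        idx = rng.integers(0, n1 * n2, size=m)        # uniform sampling with replacement
        if len(np.unique(idx // n2)) == n1 and len(np.unique(idx % n2)) == n2:
            hits += 1
    return hits / trials

for m in (int(0.5 * n2 * np.log(n2)), int(2 * n2 * np.log(n2))):
    print(m, hit_all_rows_and_cols(m))                # fails well below n2 log n2, succeeds above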

27 In addition, rank r matrices have r(n1 + n2 − r) parameters, a fact that can be verified by counting the number of degrees of freedom in the singular value decomposition. [sent-69, score-0.184]

28 Interestingly, Candès and Tao (2009) showed that Cµ0 n2 r log(n2) entries were necessary for completion when the entries are sampled uniformly at random. [sent-70, score-0.354]

29 The present work only uses basic matrix analysis, elementary large deviation bounds, and a noncommutative version of Bernstein’s Inequality proven here in the Appendix. [sent-75, score-0.215]

30 (2010) in quantum information which considered the problem of reconstructing the density matrix of a quantum ensemble using as few measurements as possible. [sent-77, score-0.252]

31 Their work extended results from Candès and Recht (2009) to the quantum regime by using special algebraic properties of quantum measurements. [sent-78, score-0.214]

32 In Section 3 I show how the sampling with replacement model bounds probabilities in the uniform sampling model, and present very short proofs of some of the main results in Candès and Recht (2009). [sent-82, score-0.163]

33 Surprisingly, this yields a simple proof of Theorem 2, provided in Section 4, which has the least restrictive assumptions of any assertion proven thus far. [sent-83, score-0.087]

34 I closely follow the conventions established in Candès and Recht (2009), and invite the reader to consult this reference for a more thorough discussion of the matrix completion problem and the associated convex geometry. [sent-86, score-0.152]

35 Matrices are bold capital, vectors are bold lowercase and scalars or entries are not bold. [sent-89, score-0.087]

36 [u1, . . . , ud] will denote the n × d matrix whose kth column is uk. [sent-95, score-0.103]

37 The spectral norm (the top singular value) of such an operator will be denoted by ‖A‖ = sup_{X: ‖X‖_F ≤ 1} ‖A(X)‖_F. [sent-108, score-0.184]

38 It is convenient to introduce the orthogonal decomposition R^{n1×n2} = T ⊕ T⊥ where T is the linear space spanned by elements of the form uk y∗ and x vk∗, 1 ≤ k ≤ r, where x and y are arbitrary, and T⊥ is its orthogonal complement. [sent-119, score-0.093]

39 T ⊥ is the subspace of matrices spanned by the family (xy ∗ ), where x (respectively y) is any vector orthogonal to U (respectively V ). [sent-120, score-0.101]

40 (2010), providing bounds on recovering low-rank matrices in almost any basis. [sent-123, score-0.096]

41 Note here that while PU and PV are matrices, PT is a linear operator mapping matrices to matrices. [sent-127, score-0.114]

42 It follows from the definition (3) of PT that PT(ea eb∗) = (PU ea)eb∗ + ea(PV eb)∗ − (PU ea)(PV eb)∗. [sent-129, score-0.735]

43 This gives ‖PT(ea eb∗)‖²_F = ⟨PT(ea eb∗), ea eb∗⟩ = ‖PU ea‖² + ‖PV eb‖² − ‖PU ea‖² ‖PV eb‖², where ‖PU ea‖² ≤ µ(U)r/n1 and ‖PV eb‖² ≤ µ(V)r/n2. [sent-130, score-0.996]
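
This identity is easy to verify numerically; a sketch assuming numpy, with hypothetical random row and column spaces:

# Sketch: check ||P_T(e_a e_b^*)||_F^2 = ||P_U e_a||^2 + ||P_V e_b||^2 - ||P_U e_a||^2 ||P_V e_b||^2.
import numpy as np

rng = np.random.default_rng(0)
n1, n2, r = 15, 20, 3
U, _ = np.linalg.qr(rng.standard_normal((n1, r)))
V, _ = np.linalg.qr(rng.standard_normal((n2, r)))
PU, PV = U @ U.T, V @ V.T

a, b = 4, 7
E = np.outer(np.eye(n1)[:, a], np.eye(n2)[:, b])
PT_E = PU @ E + E @ PV - PU @ E @ PV                   # P_T(Z) = P_U Z + Z P_V - P_U Z P_V
pu_a = np.linalg.norm(PU[:, a]) ** 2                    # ||P_U e_a||^2
pv_b = np.linalg.norm(PV[:, b]) ** 2                    # ||P_V e_b||^2
print(np.linalg.norm(PT_E, "fro") ** 2, pu_a + pv_b - pu_a * pv_b)   # the two values agree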

44 Sampling with Replacement. As discussed above, the main contribution of this work is an analysis of uniformly sampled sets of entries via the study of a sampling with replacement model. [sent-134, score-0.278]

45 In all of these results, the theorem statements concerned sampling sets of m entries uniformly, but it was shown that the probability of failure under Bernoulli sampling with p = m/(n1 n2) closely approximated the probability of failure under uniform sampling. [sent-140, score-0.361]

46 The present work will analyze the situation where each entry index is sampled independently from the uniform distribution on {1, . [sent-141, score-0.1]

47 It would appear that sampling with replacement is not suitable for analyzing matrix completion as one might encounter duplicate entries. [sent-149, score-0.235]

48 However, just as is the case with Bernoulli sampling, bounding the likelihood of error when sampling with replacement allows us to bound the probability of the nuclear norm heuristic failing under uniform sampling. [sent-150, score-0.411]

49 Proposition 3 The probability that the nuclear norm heuristic fails when the set of observed entries is sampled uniformly from the collection of sets of size m is less than or equal to the probability that the heuristic fails when m entries are sampled independently with replacement. [sent-151, score-0.708]

50 Let Ωk denote a set of entries of size k sampled uniformly from all collections of entries of size k. [sent-162, score-0.261]

51 That is, the probability decreases as the number of entries revealed is increased. [sent-165, score-0.106]

52 Surprisingly, changing the sampling model makes most of the theorems from Candès and Recht (2009) simple consequences of a noncommutative variant of Bernstein's Inequality. [sent-166, score-0.179]

53 Let us now record two theorems, proven for the Bernoulli model in Candès and Recht (2009), that admit very simple proofs in the sampling with replacement model. [sent-184, score-0.143]

54 Let Ω = {(ak, bk)}_{k=1}^{l} be a collection of indices sampled uniformly with replacement. [sent-186, score-0.182]

55 Set RΩ to be the operator RΩ(Z) = ∑_{k=1}^{|Ω|} ⟨e_{ak} e_{bk}∗, Z⟩ e_{ak} e_{bk}∗. [sent-187, score-0.112]
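
A short sketch of this sampling operator (numpy assumed, sizes illustrative); it also shows why duplicated indices keep RΩ from being a projection:

# Sketch: R_Omega(Z) = sum_k <e_{a_k} e_{b_k}^*, Z> e_{a_k} e_{b_k}^*, indices drawn with replacement.
import numpy as np

rng = np.random.default_rng(0)
n1, n2, m = 5, 6, 40
rows = rng.integers(0, n1, size=m)
cols = rng.integers(0, n2, size=m)

def R_omega(Z):
    out = np.zeros_like(Z)
    for a, b in zip(rows, cols):
        out[a, b] += Z[a, b]              # a duplicated index adds Z[a, b] again
    return out

Z = rng.standard_normal((n1, n2))
print(np.allclose(R_omega(R_omega(Z)), R_omega(Z)))   # False once any index repeats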

56 Unlike in previous work on matrix completion, RΩ is not a projection operator if there are duplicates in Ω. [sent-190, score-0.094]

57 Note that for a fixed entry, the probability it is sampled more than t times is equal to the probability of more than t heads occurring in a sequence of m tosses where the probability of a head is 1/(n1 n2). [sent-195, score-0.074]

58 Applying the union bound over all of the n1 n2 entries and the fact that m/(n1 n2) < 1, we have P[any entry is selected more than (8/3)β log(n2) times] ≤ n1 n2 ((8/3)β log(n2))^{−(8/3)β log(n2)} exp((8/3)β log(n2)) ≤ n2^{2−2β} when n2 ≥ 9. [sent-197, score-0.175]

59 In addition to this bound on the norm of RΩ , the following theorem asserts that the operator PT RΩ PT is also very close to an isometry on T if the number of sampled entries is sufficiently large. [sent-201, score-0.313]

60 Theorem 6 Suppose Ω is a set of entries of size m sampled independently and uniformly with replacement. [sent-205, score-0.174]

61 Proof Decompose any matrix Z as Z = ∑_{a,b} ⟨Z, ea eb∗⟩ ea eb∗ so that PT(Z) = ∑_{a,b} ⟨PT(Z), ea eb∗⟩ ea eb∗ = ∑_{a,b} ⟨Z, PT(ea eb∗)⟩ ea eb∗. [sent-207, score-1.103]

62 Then RΩ PT(Z) = ∑_{k=1}^{m} ⟨Z, PT(e_{ak} e_{bk}∗)⟩ e_{ak} e_{bk}∗, which gives (PT RΩ PT)(Z) = ∑_{k=1}^{m} ⟨Z, PT(e_{ak} e_{bk}∗)⟩ PT(e_{ak} e_{bk}∗). [sent-218, score-0.176]

63 Now the fact that the operator PT RΩ PT does not deviate from its expected value E(PT RΩ PT) = PT (E RΩ) PT = PT ((m/(n1 n2)) I) PT = (m/(n1 n2)) PT in the spectral norm can be proven using the Noncommutative Bernstein Inequality. [sent-219, score-0.155]

64 This operator is rank one, has operator norm ‖Tab‖ = ‖PT(ea eb∗)‖²_F, and we have PT = ∑_{a,b} Tab by (5). [sent-221, score-0.38]

65 Using this fact, we can compute the bound ‖T_{ak bk} − (1/(n1 n2)) PT‖ ≤ max{‖PT(e_{ak} e_{bk}∗)‖²_F, 1/(n1 n2)} ≤ µ0 r (n1 + n2)/(n1 n2), where the final inequality follows from (4). [sent-227, score-0.129]

66 We also have E[(T_{ak bk} − (1/(n1 n2)) PT)²] = E[‖PT(e_{ak} e_{bk}∗)‖²_F T_{ak bk}] − (1/(n1² n2²)) PT ≤ max{‖E[‖PT(e_{ak} e_{bk}∗)‖²_F T_{ak bk}]‖, 1/(n1² n2²)} ≤ max{µ0 r (n1 + n2)/(n1 n2) ‖E[T_{ak bk}]‖, 1/(n1² n2²)} ≤ µ0 r (n1 + n2)/(n1² n2²). [sent-228, score-0.38]

67 This theorem asserts that for a fixed matrix, if one sets all of the entries not in Ω to zero it remains close to a multiple of the original matrix in the operator norm. [sent-234, score-0.236]

68 Theorem 7 Suppose Ω is a set of entries of size m sampled independently and uniformly with replacement and let Z be a fixed n1 × n2 matrix. [sent-235, score-0.237]

69 Proof First observe that the operator norm can be upper bounded by a multiple of the matrix infinity norm: ‖Z‖ = sup_{‖x‖=1, ‖y‖=1} ∑_{a,b} Zab ya xb ≤ (∑_{a,b} Z²ab y²a)^{1/2} (∑_{a,b} x²b)^{1/2} ≤ √n1 √n2 max_{a,b} |Zab| = √(n1 n2) ‖Z‖∞. [sent-237, score-0.214]
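
A one-line numerical check of this operator-norm versus infinity-norm bound (numpy assumed):

# Sketch: verify ||Z|| <= sqrt(n1 * n2) * max_{a,b} |Z_{ab}| on a random matrix.
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 12, 17
Z = rng.standard_normal((n1, n2))
print(np.linalg.norm(Z, 2) <= np.sqrt(n1 * n2) * np.abs(Z).max())   # True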

70 Note that (n1 n2/m) RΩ(Z) − Z = (1/m) ∑_{k=1}^{m} (n1 n2 Z_{ak bk} e_{ak} e_{bk}∗ − Z). [sent-238, score-0.271]

71 This is a sum of zero-mean random matrices, and ‖n1 n2 Z_{ak bk} e_{ak} e_{bk}∗ − Z‖ ≤ ‖n1 n2 Z_{ak bk} e_{ak} e_{bk}∗‖ + ‖Z‖ < (3/2) n1 n2 ‖Z‖∞ for n1 ≥ 2. [sent-239, score-0.542]

72 We also have ‖E[(n1 n2 Z_{ak bk} e_{ak} e_{bk}∗ − Z)∗(n1 n2 Z_{ak bk} e_{ak} e_{bk}∗ − Z)]‖ = ‖n1 n2 ∑_{c,d} Z²cd ed ed∗ − Z∗Z‖ ≤ max{‖n1 n2 ∑_{c,d} Z²cd ed ed∗‖, ‖Z∗Z‖} ≤ n1² n2 ‖Z‖²∞, where we again use the fact that ‖A − B‖ ≤ max{‖A‖, ‖B‖} for positive semidefinite A and B. [sent-240, score-0.542]

73 A similar calculation holds for (n1 n2 Z_{ak bk} e_{ak} e_{bk}∗ − Z)(n1 n2 Z_{ak bk} e_{ak} e_{bk}∗ − Z)∗. [sent-241, score-0.542]

74 Succinctly, it says that for a fixed matrix in T , the operator PT RΩ does not increase the matrix infinity norm. [sent-244, score-0.132]

75 Lemma 8 Suppose Ω is a set of entries of size m sampled independently and uniformly with replacement and let Z ∈ T be a fixed n1 × n2 matrix. [sent-245, score-0.237]

76 For each matrix index (c, d), sample (a, b) uniformly at random to define the random variable ξcd = ⟨ec ed∗, n1 n2 ⟨ea eb∗, Z⟩ PT(ea eb∗) − Z⟩. [sent-249, score-0.307]

77 We have E[ξcd] = 0, |ξcd| ≤ µ0 r(n1 + n2) ‖Z‖∞, and E[ξ²cd] = (1/(n1 n2)) ∑_{a,b} ⟨ec ed∗, n1 n2 ⟨ea eb∗, Z⟩ PT(ea eb∗) − Z⟩² = n1 n2 ∑_{a,b} ⟨PT(ec ed∗), ea eb∗⟩² ⟨ea eb∗, Z⟩² − Z²cd ≤ n1 n2 ‖PT(ec ed∗)‖²_F ‖Z‖²∞ ≤ µ0 r(n1 + n2) ‖Z‖²∞. [sent-250, score-0.704]

78 The main idea is to approximate a dual feasible solution of (2) which certifies that M is the unique minimum nuclear norm solution. [sent-258, score-0.286]

79 This change in the sampling model allows one to avoid decoupling inequalities and gives rise to the dramatic simplification here. [sent-266, score-0.079]

80 To proceed, recall again that by Proposition 3 it suffices to consider the scenario when the entries are sampled independently and uniformly with replacement. [sent-267, score-0.174]

81 For Z ∈ ker RΩ, pick U⊥ and V⊥ such that [U, U⊥] and [V, V⊥] are unitary matrices and that ⟨U⊥ V⊥∗, PT⊥(Z)⟩ = ‖PT⊥(Z)‖∗. [sent-277, score-0.101]

82 A S IMPLER A PPROACH TO M ATRIX C OMPLETION The first inequality holds from the variational characterization of the nuclear norm. [sent-279, score-0.26]

83 That is, if any X ≠ M has Xab = Mab for all (a, b) ∈ Ω, then X has strictly larger nuclear norm than M, and hence M is the unique minimizer of (2). [sent-282, score-0.286]

84 With m satisfying the bound in the main theorem statement, the first inequality in (6) fails to hold with probability at most 2n2^{2−2β} by Theorem 6, and the second inequality fails to hold with probability at most n2^{2−2β^{1/2}} by Proposition 5. [sent-310, score-0.166]

85 For all k, (8) fails to hold with probability at most 2n2^{2−2β}, (9) fails to hold with probability at most 2n2^{2−2β}, and (10) fails to hold with probability at most (n1 + n2)^{1−2β}. [sent-311, score-0.096]

86 Nonetheless, all prior results on matrix completion have imposed an assumption like A1 (i. [sent-320, score-0.131]

87 In many matrix completion scenarios of interest in machine learning, the provided entries are corrupted by noise. [sent-324, score-0.218]

88 Hence, under the sampling assumptions of Theorem 2, low-rank matrices can be approximated from noisy data by solving a quadratically constrained nuclear norm problem. [sent-331, score-0.385]
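
A sketch of such a quadratically constrained nuclear norm problem, again with numpy and cvxpy as assumed dependencies; the noise level, the budget delta, and all data are illustrative rather than the paper's:

# Sketch: noisy completion -- minimize ||X||_* subject to ||P_Omega(X - Y)||_F <= delta.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n1, n2, r, m, sigma = 30, 40, 2, 700, 0.05
M = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))

idx = rng.choice(n1 * n2, size=m, replace=False)
mask = np.zeros((n1, n2))
mask.flat[idx] = 1.0                                    # indicator of the observed set Omega
Y = mask * (M + sigma * rng.standard_normal((n1, n2)))  # noisy observed entries, zero elsewhere
delta = sigma * np.sqrt(m)                              # a plausible noise budget

X = cp.Variable((n1, n2))
prob = cp.Problem(cp.Minimize(cp.normNuc(X)),
                  [cp.norm(cp.multiply(mask, X) - Y, "fro") <= delta])
prob.solve()
print("relative error:", np.linalg.norm(X.value - M, "fro") / np.linalg.norm(M, "fro"))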

89 The noncommutative versions of Chernoff and Bernstein's Inequalities may be useful throughout machine learning and statistical signal processing, and a fruitful line of inquiry would examine how to apply these tools from quantum information to the study of classical signals and systems. [sent-333, score-0.263]

90 For square matrices A, the matrix exponential will be denoted exp(A) and is given by the power series exp(A) = ∑_{k=0}^{∞} A^k / k!. [sent-343, score-0.096]
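
A quick check of this power series against scipy's matrix exponential (numpy and scipy assumed; the matrix is an arbitrary small example):

# Sketch: truncated series sum_{k} A^k / k! versus scipy.linalg.expm(A).
import numpy as np
from scipy.linalg import expm

A = 0.1 * np.random.default_rng(0).standard_normal((4, 4))
series, term = np.zeros((4, 4)), np.eye(4)
for k in range(1, 30):
    series += term                        # add A^(k-1) / (k-1)!
    term = term @ A / k                   # next term A^k / k!
print(np.allclose(series, expm(A)))       # True up to truncation error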

91 The following theorem is a generalization of Markov's inequality originally proven in Ahlswede and Winter (2002). [sent-345, score-0.107]

92 Next I will derive a noncommutative version of the Chernoff bound. [sent-354, score-0.138]

93 The version stated here is more general in that the random matrices need not be identically distributed, but the proof is essentially the same. [sent-359, score-0.085]

94 P[∑_{k=1}^{n} Xk ⋠ nA] = P[∑_{k=1}^{n} (Xk − A) ⋠ 0] = P[∑_{k=1}^{n} T(Xk − A)T∗ ⋠ 0] = P[exp(∑_{k=1}^{n} T(Xk − A)T∗) ⋠ Id] ≤ Tr E exp(∑_{k=1}^{n} T(Xk − A)T∗) = E Tr exp(∑_{k=1}^{n} T(Xk − A)T∗) ≤ E Tr[exp(∑_{k=1}^{n−1} T(Xk − A)T∗) exp(T(Xn − A)T∗)] [sent-369, score-0.172]

95 = E_{1,…,n−1} Tr[exp(∑_{k=1}^{n−1} T(Xk − A)T∗) E[exp(T(Xn − A)T∗)]] ≤ E_{1,…,n−1} Tr exp(∑_{k=1}^{n−1} T(Xk − A)T∗) ‖E[exp(T(Xn − A)T∗)]‖. [sent-372, score-0.086]

96 This is just another statement of the duality between the nuclear and operator norms. [sent-381, score-0.282]

97 The first inequality follows from the triangle inequality and the fact that E[Yk] = 0, the second inequality follows from (12), and the final inequality follows from the fact that 1 + x ≤ exp(x) for all x. [sent-401, score-0.136]

98 A rank minimization heuristic with application to minimum order system approximation. [sent-450, score-0.079]

99 On the rank minimization problem over a positive semidefinite linear matrix inequality. [sent-484, score-0.096]

100 Guaranteed minimum rank solutions of matrix equations via nuclear norm minimization. [sent-497, score-0.382]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pt', 0.56), ('cand', 0.417), ('recht', 0.249), ('nuclear', 0.226), ('ea', 0.213), ('eak', 0.176), ('noncommutative', 0.138), ('xk', 0.115), ('echt', 0.113), ('quantum', 0.107), ('impler', 0.1), ('keshavan', 0.1), ('ompletion', 0.1), ('pu', 0.1), ('bk', 0.095), ('bernstein', 0.095), ('completion', 0.093), ('chernoff', 0.093), ('gross', 0.088), ('zak', 0.088), ('entries', 0.087), ('pv', 0.086), ('failure', 0.079), ('ahlswede', 0.075), ('tak', 0.075), ('tao', 0.073), ('yk', 0.073), ('singular', 0.068), ('tr', 0.067), ('atrix', 0.066), ('pproach', 0.066), ('wk', 0.064), ('replacement', 0.063), ('norm', 0.06), ('rank', 0.058), ('matrices', 0.058), ('operator', 0.056), ('sampled', 0.055), ('winter', 0.053), ('eb', 0.048), ('semide', 0.046), ('entry', 0.045), ('uk', 0.045), ('fazel', 0.043), ('ker', 0.043), ('exp', 0.043), ('sampling', 0.041), ('cd', 0.041), ('proven', 0.039), ('decoupling', 0.038), ('emmanuel', 0.038), ('matrix', 0.038), ('maryam', 0.038), ('tab', 0.038), ('zab', 0.038), ('zcd', 0.038), ('theorem', 0.034), ('inequality', 0.034), ('benjamin', 0.033), ('fails', 0.032), ('nathan', 0.032), ('obeying', 0.032), ('uniformly', 0.032), ('xa', 0.029), ('cu', 0.029), ('ak', 0.028), ('proof', 0.027), ('log', 0.025), ('chistov', 0.025), ('hagerup', 0.025), ('incoherence', 0.025), ('mesbahi', 0.025), ('zpv', 0.025), ('bernoulli', 0.025), ('ec', 0.024), ('orthogonal', 0.024), ('stephen', 0.023), ('reconstruct', 0.023), ('heuristic', 0.021), ('asserts', 0.021), ('consult', 0.021), ('linial', 0.021), ('terence', 0.021), ('xab', 0.021), ('ab', 0.021), ('assertion', 0.021), ('kth', 0.02), ('compressed', 0.02), ('lt', 0.02), ('recovering', 0.02), ('andrea', 0.019), ('heads', 0.019), ('koren', 0.019), ('revealed', 0.019), ('subspace', 0.019), ('ix', 0.018), ('fruitful', 0.018), ('rennie', 0.018), ('bounds', 0.018), ('eigenvalues', 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 6 jmlr-2011-A Simpler Approach to Matrix Completion

Author: Benjamin Recht

Abstract: This paper provides the best bounds to date on the number of randomly sampled entries required to reconstruct an unknown low-rank matrix. These results improve on prior work by Candès and Recht (2009), Candès and Tao (2009), and Keshavan et al. (2009). The reconstruction is accomplished by minimizing the nuclear norm, or sum of the singular values, of the hidden matrix subject to agreement with the provided entries. If the underlying matrix satisfies a certain incoherence condition, then the number of entries required is equal to a quadratic logarithmic factor times the number of parameters in the singular value decomposition. The proof of this assertion is short, self-contained, and uses very elementary analysis. The novel techniques herein are based on recent work in quantum information theory. Keywords: matrix completion, low-rank matrices, convex optimization, nuclear norm minimization, random matrices, operator Chernoff bound, compressed sensing

2 0.058086541 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models

Author: Shuheng Zhou, Philipp Rütimann, Min Xu, Peter Bühlmann

Abstract: Undirected graphs are often used to describe high dimensional distributions. Under sparsity conditions, the graph can be estimated using ℓ1 -penalization methods. We propose and study the following method. We combine a multiple regression approach with ideas of thresholding and refitting: first we infer a sparse undirected graphical model structure via thresholding of each among many ℓ1 -norm penalized regression functions; we then estimate the covariance matrix and its inverse using the maximum likelihood estimator. We show that under suitable conditions, this approach yields consistent estimation in terms of graphical structure and fast convergence rates with respect to the operator and Frobenius norm for the covariance matrix and its inverse. We also derive an explicit bound for the Kullback Leibler divergence. Keywords: graphical model selection, covariance estimation, Lasso, nodewise regression, thresholding. © 2011 Shuheng Zhou, Philipp Rütimann, Min Xu and Peter Bühlmann.

3 0.039531525 21 jmlr-2011-Cumulative Distribution Networks and the Derivative-sum-product Algorithm: Models and Inference for Cumulative Distribution Functions on Graphs

Author: Jim C. Huang, Brendan J. Frey

Abstract: We present a class of graphical models for directly representing the joint cumulative distribution function (CDF) of many random variables, called cumulative distribution networks (CDNs). Unlike graphs for probability density and mass functions, for CDFs the marginal probabilities for any subset of variables are obtained by computing limits of functions in the model, and conditional probabilities correspond to computing mixed derivatives. We will show that the conditional independence properties in a CDN are distinct from the conditional independence properties of directed, undirected and factor graphs, but include the conditional independence properties of bi-directed graphs. In order to perform inference in such models, we describe the ‘derivative-sum-product’ (DSP) message-passing algorithm in which messages correspond to derivatives of the joint CDF. We will then apply CDNs to the problem of learning to rank players in multiplayer team-based games and suggest several future directions for research. Keywords: graphical models, cumulative distribution function, message-passing algorithm, inference

4 0.038615186 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms

Author: Adam D. Bull

Abstract: In the efficient global optimization problem, we minimize an unknown function f , using as few observations f (x) as possible. It can be considered a continuum-armed-bandit problem, with noiseless data, and simple regret. Expected-improvement algorithms are perhaps the most popular methods for solving the problem; in this paper, we provide theoretical results on their asymptotic behaviour. Implementing these algorithms requires a choice of Gaussian-process prior, which determines an associated space of functions, its reproducing-kernel Hilbert space (RKHS). When the prior is fixed, expected improvement is known to converge on the minimum of any function in its RKHS. We provide convergence rates for this procedure, optimal for functions of low smoothness, and describe a modified algorithm attaining optimal rates for smoother functions. In practice, however, priors are typically estimated sequentially from the data. For standard estimators, we show this procedure may never find the minimum of f . We then propose alternative estimators, chosen to minimize the constants in the rate of convergence, and show these estimators retain the convergence rates of a fixed prior. Keywords: convergence rates, efficient global optimization, expected improvement, continuumarmed bandit, Bayesian optimization

5 0.036439948 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach

Author: Gilles Meyer, Silvère Bonnabel, Rodolphe Sepulchre

Abstract: The paper addresses the problem of learning a regression model parameterized by a fixed-rank positive semidefinite matrix. The focus is on the nonlinear nature of the search space and on scalability to high-dimensional problems. The mathematical developments rely on the theory of gradient descent algorithms adapted to the Riemannian geometry that underlies the set of fixedrank positive semidefinite matrices. In contrast with previous contributions in the literature, no restrictions are imposed on the range space of the learned matrix. The resulting algorithms maintain a linear complexity in the problem size and enjoy important invariance properties. We apply the proposed algorithms to the problem of learning a distance function parameterized by a positive semidefinite matrix. Good performance is observed on classical benchmarks. Keywords: linear regression, positive semidefinite matrices, low-rank approximation, Riemannian geometry, gradient-based learning

6 0.035306223 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

7 0.03416739 55 jmlr-2011-Learning Multi-modal Similarity

8 0.029658042 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices

9 0.028971219 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding

10 0.027312517 35 jmlr-2011-Forest Density Estimation

11 0.026154039 87 jmlr-2011-Stochastic Methods forl1-regularized Loss Minimization

12 0.025594596 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

13 0.02493092 3 jmlr-2011-A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis

14 0.024284255 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods

15 0.024155503 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing

16 0.023924517 91 jmlr-2011-The Sample Complexity of Dictionary Learning

17 0.023277726 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

18 0.02272651 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

19 0.022556048 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods

20 0.022016397 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.123), (1, 0.016), (2, 0.025), (3, 0.015), (4, 0.022), (5, -0.003), (6, -0.007), (7, 0.016), (8, -0.054), (9, 0.04), (10, -0.022), (11, 0.009), (12, 0.034), (13, -0.066), (14, 0.051), (15, 0.043), (16, 0.125), (17, 0.054), (18, 0.025), (19, -0.061), (20, 0.093), (21, 0.023), (22, -0.058), (23, -0.06), (24, -0.056), (25, 0.026), (26, 0.161), (27, 0.084), (28, 0.205), (29, 0.241), (30, 0.164), (31, 0.133), (32, -0.025), (33, 0.168), (34, -0.119), (35, -0.058), (36, -0.113), (37, -0.162), (38, -0.271), (39, 0.227), (40, 0.226), (41, -0.06), (42, 0.083), (43, 0.062), (44, -0.109), (45, -0.148), (46, -0.053), (47, -0.222), (48, 0.053), (49, -0.064)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96836287 6 jmlr-2011-A Simpler Approach to Matrix Completion

Author: Benjamin Recht

Abstract: This paper provides the best bounds to date on the number of randomly sampled entries required to reconstruct an unknown low-rank matrix. These results improve on prior work by Candès and Recht (2009), Candès and Tao (2009), and Keshavan et al. (2009). The reconstruction is accomplished by minimizing the nuclear norm, or sum of the singular values, of the hidden matrix subject to agreement with the provided entries. If the underlying matrix satisfies a certain incoherence condition, then the number of entries required is equal to a quadratic logarithmic factor times the number of parameters in the singular value decomposition. The proof of this assertion is short, self-contained, and uses very elementary analysis. The novel techniques herein are based on recent work in quantum information theory. Keywords: matrix completion, low-rank matrices, convex optimization, nuclear norm minimization, random matrices, operator Chernoff bound, compressed sensing

2 0.31625837 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models

Author: Shuheng Zhou, Philipp Rütimann, Min Xu, Peter Bühlmann

Abstract: Undirected graphs are often used to describe high dimensional distributions. Under sparsity conditions, the graph can be estimated using ℓ1 -penalization methods. We propose and study the following method. We combine a multiple regression approach with ideas of thresholding and refitting: first we infer a sparse undirected graphical model structure via thresholding of each among many ℓ1 -norm penalized regression functions; we then estimate the covariance matrix and its inverse using the maximum likelihood estimator. We show that under suitable conditions, this approach yields consistent estimation in terms of graphical structure and fast convergence rates with respect to the operator and Frobenius norm for the covariance matrix and its inverse. We also derive an explicit bound for the Kullback Leibler divergence. Keywords: graphical model selection, covariance estimation, Lasso, nodewise regression, thresholding. © 2011 Shuheng Zhou, Philipp Rütimann, Min Xu and Peter Bühlmann.

3 0.288032 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

Author: Tony Jebara

Abstract: A multitask learning framework is developed for discriminative classification and regression where multiple large-margin linear classifiers are estimated for different prediction problems. These classifiers operate in a common input space but are coupled as they recover an unknown shared representation. A maximum entropy discrimination (MED) framework is used to derive the multitask algorithm which involves only convex optimization problems that are straightforward to implement. Three multitask scenarios are described. The first multitask method produces multiple support vector machines that learn a shared sparse feature selection over the input space. The second multitask method produces multiple support vector machines that learn a shared conic kernel combination. The third multitask method produces a pooled classifier as well as adaptively specialized individual classifiers. Furthermore, extensions to regression, graphical model structure estimation and other sparse methods are discussed. The maximum entropy optimization problems are implemented via a sequential quadratic programming method which leverages recent progress in fast SVM solvers. Fast monotonic convergence bounds are provided by bounding the MED sparsifying cost function with a quadratic function and ensuring only a constant factor runtime increase above standard independent SVM solvers. Results are shown on multitask data sets and favor multitask learning over single-task or tabula rasa methods. Keywords: meta-learning, support vector machines, feature selection, kernel selection, maximum entropy, large margin, Bayesian methods, variational bounds, classification, regression, Lasso, graphical model structure estimation, quadratic programming, convex programming

4 0.23997401 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

Author: Jinfeng Zhuang, Ivor W. Tsang, Steven C.H. Hoi

Abstract: Previous studies of Non-Parametric Kernel Learning (NPKL) usually formulate the learning task as a Semi-Definite Programming (SDP) problem that is often solved by some general purpose SDP solvers. However, for N data examples, the time complexity of NPKL using a standard interiorpoint SDP solver could be as high as O(N 6.5 ), which prohibits NPKL methods applicable to real applications, even for data sets of moderate size. In this paper, we present a family of efficient NPKL algorithms, termed “SimpleNPKL”, which can learn non-parametric kernels from a large set of pairwise constraints efficiently. In particular, we propose two efficient SimpleNPKL algorithms. One is SimpleNPKL algorithm with linear loss, which enjoys a closed-form solution that can be efficiently computed by the Lanczos sparse eigen decomposition technique. Another one is SimpleNPKL algorithm with other loss functions (including square hinge loss, hinge loss, square loss) that can be re-formulated as a saddle-point optimization problem, which can be further resolved by a fast iterative algorithm. In contrast to the previous NPKL approaches, our empirical results show that the proposed new technique, maintaining the same accuracy, is significantly more efficient and scalable. Finally, we also demonstrate that the proposed new technique is also applicable to speed up many kernel learning tasks, including colored maximum variance unfolding, minimum volume embedding, and structure preserving embedding. Keywords: non-parametric kernel learning, semi-definite programming, semi-supervised learning, side information, pairwise constraints

5 0.23884803 81 jmlr-2011-Robust Approximate Bilinear Programming for Value Function Approximation

Author: Marek Petrik, Shlomo Zilberstein

Abstract: Value function approximation methods have been successfully used in many applications, but the prevailing techniques often lack useful a priori error bounds. We propose a new approximate bilinear programming formulation of value function approximation, which employs global optimization. The formulation provides strong a priori guarantees on both robust and expected policy loss by minimizing specific norms of the Bellman residual. Solving a bilinear program optimally is NP-hard, but this worst-case complexity is unavoidable because the Bellman-residual minimization itself is NP-hard. We describe and analyze the formulation as well as a simple approximate algorithm for solving bilinear programs. The analysis shows that this algorithm offers a convergent generalization of approximate policy iteration. We also briefly analyze the behavior of bilinear programming algorithms under incomplete samples. Finally, we demonstrate that the proposed approach can consistently minimize the Bellman residual on simple benchmark problems. Keywords: value function approximation, approximate dynamic programming, Markov decision processes

6 0.23811466 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

7 0.22537301 83 jmlr-2011-Scikit-learn: Machine Learning in Python

8 0.22332248 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach

9 0.21387814 21 jmlr-2011-Cumulative Distribution Networks and the Derivative-sum-product Algorithm: Models and Inference for Cumulative Distribution Functions on Graphs

10 0.20859116 62 jmlr-2011-MSVMpack: A Multi-Class Support Vector Machine Package

11 0.2078764 22 jmlr-2011-Differentially Private Empirical Risk Minimization

12 0.1806104 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices

13 0.17648306 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem

14 0.17533985 91 jmlr-2011-The Sample Complexity of Dictionary Learning

15 0.16890338 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms

16 0.16693909 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

17 0.16366766 55 jmlr-2011-Learning Multi-modal Similarity

18 0.16317606 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms

19 0.15790059 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing

20 0.14408325 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.653), (9, 0.031), (10, 0.017), (24, 0.028), (31, 0.045), (32, 0.014), (60, 0.012), (73, 0.025), (78, 0.058), (90, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94396335 6 jmlr-2011-A Simpler Approach to Matrix Completion

Author: Benjamin Recht

Abstract: This paper provides the best bounds to date on the number of randomly sampled entries required to reconstruct an unknown low-rank matrix. These results improve on prior work by Candès and Recht (2009), Candès and Tao (2009), and Keshavan et al. (2009). The reconstruction is accomplished by minimizing the nuclear norm, or sum of the singular values, of the hidden matrix subject to agreement with the provided entries. If the underlying matrix satisfies a certain incoherence condition, then the number of entries required is equal to a quadratic logarithmic factor times the number of parameters in the singular value decomposition. The proof of this assertion is short, self-contained, and uses very elementary analysis. The novel techniques herein are based on recent work in quantum information theory. Keywords: matrix completion, low-rank matrices, convex optimization, nuclear norm minimization, random matrices, operator Chernoff bound, compressed sensing

2 0.92144495 97 jmlr-2011-Union Support Recovery in Multi-task Learning

Author: Mladen Kolar, John Lafferty, Larry Wasserman

Abstract: We sharply characterize the performance of different penalization schemes for the problem of selecting the relevant variables in the multi-task setting. Previous work focuses on the regression problem where conditions on the design matrix complicate the analysis. A clearer and simpler picture emerges by studying the Normal means model. This model, often used in the field of statistics, is a simplified model that provides a laboratory for studying complex procedures. Keywords: high-dimensional inference, multi-task learning, sparsity, normal means, minimax estimation

3 0.91954505 54 jmlr-2011-Learning Latent Tree Graphical Models

Author: Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar, Alan S. Willsky

Abstract: We study the problem of learning a latent tree graphical model where samples are available only from a subset of variables. We propose two consistent and computationally efficient algorithms for learning minimal latent trees, that is, trees without any redundant hidden nodes. Unlike many existing methods, the observed nodes (or variables) are not constrained to be leaf nodes. Our algorithms can be applied to both discrete and Gaussian random variables and our learned models are such that all the observed and latent variables have the same domain (state space). Our first algorithm, recursive grouping, builds the latent tree recursively by identifying sibling groups using so-called information distances. One of the main contributions of this work is our second algorithm, which we refer to as CLGrouping. CLGrouping starts with a pre-processing procedure in which a tree over the observed variables is constructed. This global step groups the observed nodes that are likely to be close to each other in the true latent tree, thereby guiding subsequent recursive grouping (or equivalent procedures such as neighbor-joining) on much smaller subsets of variables. This results in more accurate and efficient learning of latent trees. We also present regularized versions of our algorithms that learn latent tree approximations of arbitrary distributions. We compare the proposed algorithms to other methods by performing extensive numerical experiments on various latent tree graphical models such as hidden Markov models and star graphs. In addition, we demonstrate the applicability of our methods on real-world data sets by modeling the dependency structure of monthly stock returns in the S&P; index and of the words in the 20 newsgroups data set. Keywords: graphical models, Markov random fields, hidden variables, latent tree models, structure learning c 2011 Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar and Alan S. Willsky. C HOI , TAN , A NANDKUMAR AND W ILLSKY

4 0.62893063 59 jmlr-2011-Learning with Structured Sparsity

Author: Junzhou Huang, Tong Zhang, Dimitris Metaxas

Abstract: This paper investigates a learning formulation called structured sparsity, which is a natural extension of the standard sparsity concept in statistical learning and compressive sensing. By allowing arbitrary structures on the feature set, this concept generalizes the group sparsity idea that has become popular in recent years. A general theory is developed for learning with structured sparsity, based on the notion of coding complexity associated with the structure. It is shown that if the coding complexity of the target signal is small, then one can achieve improved performance by using coding complexity regularization methods, which generalize the standard sparse regularization. Moreover, a structured greedy algorithm is proposed to efficiently solve the structured sparsity problem. It is shown that the greedy algorithm approximately solves the coding complexity optimization problem under appropriate conditions. Experiments are included to demonstrate the advantage of structured sparsity over standard sparsity on some real applications. Keywords: structured sparsity, standard sparsity, group sparsity, tree sparsity, graph sparsity, sparse learning, feature selection, compressive sensing

5 0.5388664 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

Author: Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky

Abstract: The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed. It is shown that this algorithm is both structurally consistent and risk consistent and the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. For the high-dimensional scenario where the size of the model d and the number of edges k scale with the number of samples n, sufficient conditions on (n, d, k) are given for the algorithm to satisfy structural and risk consistencies. In addition, the extremal structures for learning are identified; we prove that the independent (resp., tree) model is the hardest (resp., easiest) to learn using the proposed algorithm in terms of error rates for structure learning. Keywords: graphical models, forest distributions, structural consistency, risk consistency, method of types

6 0.51335847 91 jmlr-2011-The Sample Complexity of Dictionary Learning

7 0.51335412 21 jmlr-2011-Cumulative Distribution Networks and the Derivative-sum-product Algorithm: Models and Inference for Cumulative Distribution Functions on Graphs

8 0.50162578 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices

9 0.46304011 104 jmlr-2011-X-Armed Bandits

10 0.44017169 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning

11 0.42549324 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms

12 0.41925481 35 jmlr-2011-Forest Density Estimation

13 0.41666028 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models

14 0.4119277 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

15 0.40641299 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

16 0.40245858 7 jmlr-2011-Adaptive Exact Inference in Graphical Models

17 0.39974496 40 jmlr-2011-Hyper-Sparse Optimal Aggregation

18 0.39241624 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

19 0.39115718 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

20 0.37373841 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing