nips nips2011 nips2011-159 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Rina Foygel, Ohad Shamir, Nati Srebro, Ruslan Salakhutdinov
Abstract: We provide rigorous guarantees on learning with the weighted trace-norm under arbitrary sampling distributions. We show that the standard weighted-trace norm might fail when the sampling distribution is not a product distribution (i.e. when row and column indexes are not selected independently), present a corrected variant for which we establish strong learning guarantees, and demonstrate that it works better in practice. We provide guarantees when weighting by either the true or empirical sampling distribution, and suggest that even if the true distribution is known (or is uniform), weighting by the empirical distribution may be beneficial.
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract We provide rigorous guarantees on learning with the weighted trace-norm under arbitrary sampling distributions. [sent-6, score-0.316]
2 We show that the standard weighted-trace norm might fail when the sampling distribution is not a product distribution (i.e. when row and column indexes are not selected independently). [sent-7, score-0.22]
3 We provide guarantees when weighting by either the true or empirical sampling distribution, and suggest that even if the true distribution is known (or is uniform), weighting by the empirical distribution may be beneficial. [sent-10, score-0.666]
4 1 Introduction. One of the most common approaches to collaborative filtering and matrix completion is trace-norm regularization [1, 2, 3, 4, 5]. [sent-11, score-0.189]
5 In this approach we attempt to complete an unknown matrix, based on a small subset of revealed entries, by finding a matrix with small trace-norm, which matches those entries as best as possible. [sent-12, score-0.184]
6 This approach has repeatedly shown good performance in practice, and is theoretically well understood for the case where revealed entries are sampled uniformly [6, 7, 8, 9, 10, 11]. [sent-13, score-0.159]
7 Under such uniform sampling, Θ(n log(n)) entries are sufficient for good completion of an n × n matrix. [sent-14, score-0.2]
8 However, for arbitrary sampling distributions, the worst-case sample complexity lies between a lower bound of Ω(n^{4/3}) [12] and an upper bound of O(n^{3/2}) [13]. [sent-17, score-0.334]
9 Motivated by these issues, Salakhutdinov and Srebro [12] proposed to use a weighted variant of the trace-norm, which takes the distribution of the entries into account, and showed experimentally that this variant indeed leads to superior performance. [sent-20, score-0.264]
10 However, although this recent paper established that the weighted trace-norm corrects a specific situation where the standard trace-norm fails, no general learning guarantees are provided, and it is not clear if indeed the weighted trace-norm always leads to the desired behavior. [sent-21, score-0.338]
11 The only theoretical analysis of the weighted trace-norm that we are aware of is a recent report by Negahban and Wainwright [10] that provides reconstruction guarantees for a low-rank matrix with i.i.d. noise. [sent-22, score-0.299]
12 1 In this paper we rigorously study learning with a weighted trace-norm under an arbitrary sampling distribution, and show that this situation is indeed more complicated, requiring a correction to the weighting. [sent-31, score-0.345]
13 We also rigorously consider weighting according to either the true sampling distribution (as in [10]) or the empirical frequencies, as is actually done in practice, and present evidence that weighting by the empirical frequencies might be advantageous. [sent-33, score-0.573]
14 We consider an arbitrary unknown n × m target matrix Y, where a subset of s entries {Y_{i_t,j_t}}, indexed by S = {(i_1, j_1), ..., (i_s, j_s)}, is revealed to us, with the index pairs sampled according to an unknown distribution p(i, j) on [n] × [m]. [sent-39, score-0.206]
15 Based on this subset of entries, we would like to fill in the missing entries and obtain a prediction matrix X̂_S ∈ R^{n×m} with low expected loss L_p(X̂_S) = E_{ij∼p}[ℓ((X̂_S)_{ij}, Y_{ij})], where ℓ(x, y) is some loss function. [sent-48, score-0.282]
16 Note that we measure the loss with respect to the same distribution p(i, j) from which the training set is drawn (this is also the case in [12, 10, 13]). [sent-49, score-0.158]
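To make the two quantities concrete, here is a minimal numpy sketch (the helper names and the squared loss below are our own illustrative choices, not taken from the paper):

    import numpy as np

    def expected_loss(X_hat, Y, p, loss):
        # L_p(X_hat) = E_{(i,j)~p} [ loss(X_hat[i,j], Y[i,j]) ];
        # p is an n x m array of sampling probabilities that sums to 1.
        return float(np.sum(p * loss(X_hat, Y)))

    def empirical_loss(X_hat, Y, S, loss):
        # hat L_S(X_hat): average loss over the s observed index pairs in S.
        rows = [i for i, _ in S]
        cols = [j for _, j in S]
        return float(np.mean(loss(X_hat[rows, cols], Y[rows, cols])))

    squared = lambda x, y: (x - y) ** 2  # one possible choice of loss (Lipschitz on bounded entries)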
17 Given some distribution p(i, j) on [n] × [m], the weighted trace-norm of X is given by [12] ‖X‖_tr(p^r,p^c) = ‖diag(p^r)^{1/2} · X · diag(p^c)^{1/2}‖_tr, where p^r ∈ R^n and p^c ∈ R^m denote the vectors of row- and column-marginals respectively. [sent-51, score-0.251]
18 Note that the weighted trace-norm only depends on these marginals (but not their joint distribution), and that if p^r and p^c are uniform, then ‖X‖_tr(p^r,p^c) = (1/√(nm)) ‖X‖_tr. [sent-52, score-0.873]
19 The weighted trace-norm does not generally scale with n and m, and in particular, if X has rank r and entries bounded in [−1, 1], then ‖X‖_tr(p^r,p^c) ≤ √r regardless of which p(i, j) is used. [sent-53, score-0.298]
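A minimal numpy sketch of this definition (our own helper; p^r and p^c are assumed to be probability vectors over rows and columns):

    import numpy as np

    def weighted_trace_norm(X, pr, pc):
        # ||X||_{tr(pr,pc)} = || diag(pr)^{1/2} X diag(pc)^{1/2} ||_tr  (sum of singular values)
        W = np.sqrt(pr)[:, None] * X * np.sqrt(pc)[None, :]
        return np.linalg.norm(W, 'nuc')

    # Uniform marginals recover (1/sqrt(nm)) times the ordinary trace-norm:
    n, m = 50, 40
    X = np.random.randn(n, m)
    assert np.isclose(weighted_trace_norm(X, np.full(n, 1 / n), np.full(m, 1 / m)),
                      np.linalg.norm(X, 'nuc') / np.sqrt(n * m))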
20 Although our main results are stated in an inductive model, where the sample is drawn i.i.d. and the guarantee is on generalization for future samples drawn from the same distribution, our results can also be stated in a transductive model, where a training set and a test set are created by splitting a fixed subset of entries uniformly at random (as in [13]). [sent-59, score-0.448]
21 The transductive setting is discussed, and transductive variants of our Theorems are given, in Section 4. [sent-60, score-0.362]
22 We consider the hypothesis class W_r[p] of matrices whose weighted trace-norm is at most √r, i.e. when the weighting is according to the sampling distribution p(i, j). [sent-64, score-0.222]
23 Given a sample S = {(i_1, j_1), ..., (i_s, j_s)} of indexes in [n] × [m], the empirical Rademacher complexity of the class (with respect to S) is given by R̂_S(X) = E_{σ∼{±1}^s} [ sup_{X∈X} (1/s) Σ_{t=1}^s σ_t X_{i_t j_t} ], where σ is a vector of signs drawn uniformly at random. [sent-70, score-0.279]
24 Intuitively, RS (X ) measures the extent to which the class X can “overfit” data, by finding a matrix X which correlates as strongly as possible to a sample from a matrix of random noise. [sent-71, score-0.17]
25 For a loss ℓ(x, y) that is Lipschitz in x, the Rademacher complexity can be used to uniformly bound the deviations |L_p(X) − L̂_S(X)| for all X ∈ X, yielding a learning guarantee on the empirical risk minimizer [14]. [sent-72, score-0.351]
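For intuition, R̂_S can be estimated by Monte Carlo; the sketch below does this for the weighted-trace-norm ball, using the trace-norm/spectral-norm duality invoked in the proof of Theorem 1, and assuming (as the proof sketch suggests) that W_r[p] is the set of matrices with weighted trace-norm at most √r:

    import numpy as np

    def rademacher_Wr(S, pr, pc, r, n, m, n_mc=200, rng=None):
        # Monte Carlo estimate of hat R_S(W_r[p]) =
        #   E_sigma sup_{||X||_{tr(pr,pc)} <= sqrt(r)} (1/s) * sum_t sigma_t * X[i_t, j_t].
        # By duality the supremum has the closed form
        #   (sqrt(r)/s) * || sum_t sigma_t e_{i_t} e_{j_t}^T / sqrt(pr[i_t] pc[j_t]) ||_sp .
        rng = np.random.default_rng() if rng is None else rng
        s = len(S)
        vals = []
        for _ in range(n_mc):
            sigma = rng.choice([-1.0, 1.0], size=s)
            M = np.zeros((n, m))
            for sgn, (i, j) in zip(sigma, S):
                M[i, j] += sgn / np.sqrt(pr[i] * pc[j])
            vals.append(np.sqrt(r) / s * np.linalg.norm(M, 2))  # spectral norm of the sign matrix
        return float(np.mean(vals))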
26 2.1 Guarantees for Special Sampling Distributions. We begin by providing guarantees for an arbitrary, possibly unbounded, Lipschitz loss ℓ(x, y), but only under sampling distributions which are either product distributions (i.e. p(i, j) = p^r(i)p^c(j)) or have uniform marginals (i.e. p^r and p^c are uniform, but perhaps the rows and columns are not independent). [sent-74, score-0.318]
29 Theorem 1: For an l-Lipschitz loss ℓ, fix any matrix Y, sample size s, and distribution p, such that p is either a product distribution or has uniform marginals. [sent-82, score-0.328]
30 Then, in expectation over a training set of size s drawn i.i.d. from the distribution p, L_p(X̂_S) ≤ inf_{X∈W_r[p]} L_p(X) + O( l · √(rn log(n)/s) ). (1)
31 Following [11], by including the weights and using the duality between the spectral norm ‖·‖_sp and the trace-norm, we compute E_S R̂_S(W_r[p]) = (√r / s) · E_{S,σ} ‖ Σ_{t=1}^s σ_t e_{i_t,j_t} / √(p^r(i_t) p^c(j_t)) ‖_sp = (√r / s) · E_{S,σ} ‖ Σ_{t=1}^s Q_t ‖_sp, where e_{i,j} = e_i e_j^T and Q_t = σ_t e_{i_t,j_t} / √(p^r(i_t) p^c(j_t)) ∈ R^{n×m}. [sent-94, score-1.091]
32 The resulting bound depends on the quantity min_{i,j} {n p^r(i) · m p^c(j)}. If p has uniform row- and column-marginals, then for all i, j, n p^r(i) = m p^c(j) = 1.
33 (Here we assume s > n log(n), since otherwise we need only establish that the excess error is O(l√r), which holds trivially for any matrix in W_r[p].)
34 By a similar calculation of the expected spectral norms, we can now bound E_S R̂_S(Z) ≤ O(√(rn log(n)/s)), from which we can bound the deviation L_p(X̂_S) − L̂_S(X̂_S).
35 We similarly bound L̂_S(X*) − L_p(X*), where X* = arg min_{X∈W_r[p]} L_p(X).
36 Since L̂_S(X̂_S) ≤ L̂_S(X*), this yields the desired bound on the excess error. [sent-114, score-0.194]
37 More precisely, if p satisfies p^r(i) ≥ 1/(Cn) and p^c(j) ≥ 1/(Cm) for all i, j, then the bound (1) holds up to a factor of C. [sent-117, score-0.485]
38 This is important to note since the examples where the unweighted trace-norm fails under a non-uniform distribution are situations where some marginals are very high (but none are too low) [12]. [sent-121, score-0.318]
39 This suggests that the low-probability marginals could perhaps be “smoothed” to satisfy a lower bound, without removing the advantages of the weighted trace-norm. [sent-122, score-0.379]
40 We will exploit this in Section 3 to give a guarantee that holds more generally for arbitrary p, when smoothing is applied. [sent-123, score-0.267]
41 2.2 Guarantees for bounded loss. In Theorem 1, we showed a strong bound on excess error, but only for a restricted class of distributions p. [sent-125, score-0.337]
42 We now show that if the loss function is bounded, then we can give a non-trivial, but weaker, learning guarantee that holds uniformly over all distributions p. [sent-126, score-0.247]
43 Since we are in any case discussing Lipschitz loss functions, requiring that the loss function be bounded essentially amounts to requiring that the entries of the matrices involved be bounded. [sent-127, score-0.329]
44 Theorem 2: For an l-Lipschitz loss ℓ bounded by b, fix any matrix Y, sample size s, and any distribution p.
45 Then, in expectation over a training set of size s drawn i.i.d. from the distribution p, L_p(X̂_S) ≤ inf_{X∈W_r[p]} L_p(X) + O( (l + b) · ∛(rn log(n)/s) ). (2)
46 2.3 Problems with the standard weighting. In the previous sections, we showed that for distributions p that are either product distributions or have uniform marginals, we can prove a square-root bound on excess error, as shown in (1).
47 For arbitrary p, the only learning guarantee we obtain is a cube-root bound given in (2), for the special case of bounded loss. [sent-141, score-0.236]
48 We would like to know whether the square-root bound might hold uniformly over all distributions p, and if not, whether the cube-root bound is the strongest result that we can give for the bounded-loss setting, and whether any bound will hold uniformly over all p in the unbounded-loss setting. [sent-142, score-0.399]
49 In Example 2, we show that with an unbounded (Lipschitz) loss, we cannot bound the excess error by anything better than a constant, by giving a construction for arbitrary n = m and arbitrary s ≤ nm where the excess error is Ω(1).
50 We note that both examples can be modified to fit the transductive setting, demonstrating that smoothing is necessary in the transductive setting as well. [sent-147, score-0.477]
51 Let p(1, 1) = 1/s, and p(i, 1) = p(1, j) = 0 for all i, j > 1, yielding p^r(1) = p^c(1) = 1/s.
52 The following table summarizes the learning guarantees that can be established for the (standard) weighted trace-norm. [sent-167, score-0.21]
53
                                    p = product       p^r, p^c = uniform    p arbitrary
    1-Lipschitz, 1-bounded loss     √(rn log(n)/s)    √(rn log(n)/s)        ∛(rn log(n)/s)
    1-Lipschitz, unbounded loss     √(rn log(n)/s)    √(rn log(n)/s)        1
3 Smoothing the weighted trace norm. Considering Theorem 1 and the degenerate examples in Section 2.3, we are led to correct the weighting by smoothing the marginals.
54 The Rademacher complexity computations in the proof of Theorem 1 show that the problem lies not with large entries in the vectors p^r and p^c (i.e. if p^r and/or p^c are “spiky”), but with the small entries in these vectors.
56 In Section 3.1, we present such a smoothing, and provide guarantees for learning with a smoothed weighted trace-norm.
57 In Section 3.2, we check the smoothing correction to the weighted trace-norm on real data, and observe that indeed it can also be beneficial in practice.
58 3.1 Learning guarantee for arbitrary distributions. Fix a distribution p and a constant α ∈ (0, 1), and let p̃ denote the smoothed marginals: p̃^r(i) = α · p^r(i) + (1 − α) · 1/n, p̃^c(j) = α · p^c(j) + (1 − α) · 1/m.
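A one-line numpy sketch of this smoothing (our own helper):

    import numpy as np

    def smooth_marginals(pr, pc, alpha):
        # p~^r(i) = alpha * p^r(i) + (1 - alpha)/n ,  p~^c(j) = alpha * p^c(j) + (1 - alpha)/m
        pr, pc = np.asarray(pr, float), np.asarray(pc, float)
        return alpha * pr + (1 - alpha) / len(pr), alpha * pc + (1 - alpha) / len(pc)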
59 Theorem 3: For an l-Lipschitz loss ℓ, fix any matrix Y, sample size s, and any distribution p.
60 Then, in expectation over a training set of size s drawn i.i.d. from the distribution p, L_p(X̂_S) ≤ inf_{X∈W_r[p̃]} L_p(X) + O( l · √(rn log(n)/s) ).
61 Then ‖Q_t‖_sp ≤ max_{i,j} 1/√(p̃^r(i) p̃^c(j)) ≤ 2√(nm) = R, and ‖Σ_{t=1}^s E[Q_t Q_t^T]‖_sp = s · max_i Σ_j p(i, j)/(p̃^r(i) p̃^c(j)) ≤ s · max_i Σ_j p(i, j)/(p̃^r(i) · (1/(2m))) ≤ 4sm, and similarly for ‖Σ_{t=1}^s E[Q_t^T Q_t]‖_sp.
62 Moving from Theorem 1 to Theorem 3, we are competing with a different class of matrices: the guarantee compares to inf_{X∈W_r[p̃]} L_p(X) rather than inf_{X∈W_r[p]} L_p(X).
63 For example, we consider the low-rank matrix reconstruction problem, where the trace-norm bound is used as a surrogate for rank. [sent-200, score-0.164]
64 In order for the (squared) weighted trace-norm to be a lower bound on the rank, we would need to assume ‖diag(p^r)^{1/2} · X · diag(p^c)^{1/2}‖² ≤ 1 [11].
65 Netflix also provides a qualification set containing 1,408,395 ratings, but due to the sampling scheme, ratings from users with few ratings are overrepresented relative to the training set. [sent-211, score-0.289]
66 To avoid dealing with different training and test distributions, we also created our own validation and test sets, each containing 100,000 ratings set aside from the training set. [sent-212, score-0.165]
67 For both k = 30 and k = 100, the weighted trace-norm with smoothing (α = 0.9) significantly outperforms the weighted trace-norm without smoothing (α = 1), even on the differently-sampled Netflix qualification set.
69 The proposed weighted trace-norm with smoothing outperforms max-norm regularization [19], and performs comparably to “geometric” smoothing [12]. [sent-222, score-0.386]
70 The empirically-weighted trace norm. In practice, the sampling distribution p is not known exactly — it can only be estimated via the locations of the entries which are observed in the sample.
71 Defining the empirical marginals p̂^r(i) = #{t : i_t = i}/s and p̂^c(j) = #{t : j_t = j}/s, we would like to give a learning guarantee when X̂_S is estimated via regularization on the p̂-weighted trace-norm, rather than the p-weighted trace-norm.
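A minimal sketch of this estimate from the observed index pairs (our own helper):

    import numpy as np

    def empirical_marginals(S, n, m):
        # hat p^r(i) = #{t : i_t = i} / s ,  hat p^c(j) = #{t : j_t = j} / s
        s = len(S)
        pr_hat = np.bincount([i for i, _ in S], minlength=n) / s
        pc_hat = np.bincount([j for _, j in S], minlength=m) / s
        return pr_hat, pc_hat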
72 In Section 4.1, we give bounds on excess error when learning with smoothed empirical marginals, which show that there is no theoretical disadvantage as compared to learning with the smoothed true marginals.
73 In Section 4.2, we introduce the transductive learning setting, and give a result based on the empirical marginals which implies a sample complexity bound that is better by a factor of log^{1/2}(n).
74 In Section 4.3, we show that in low-rank matrix reconstruction simulations, using empirical marginals indeed yields better reconstructions.
75 Theorem 4: For an l-Lipschitz loss ℓ, fix any matrix Y, sample size s, and any distribution p.
76 The proof of this Theorem (in the Supplementary Material) uses Theorem 3 and involves showing that when s = Ω(n log(n)), which is required for all Theorems so far to be meaningful, the true and empirical marginals are the same up to a constant factor. [sent-280, score-0.375]
77 The log(n) factor (e.g. in the proof of Theorem 1) arises from the bound on the expected spectral norm of a matrix, which, for a diagonal matrix, is just a bound on the deviation of empirical frequencies.
78 Although we could not establish such a result in the inductive setting, we now turn to the transductive setting, where we could indeed obtain a better guarantee. [sent-286, score-0.233]
79 4.2 Guarantee for the transductive setting. In the transductive model, we fix a set of 2s index pairs in [n] × [m], and then randomly split it into a training set S and a test set T of equal size s.
80 The goal is to obtain a good estimator for the entries in T based on the values of the entries in S, as well as the locations (indexes) of all elements on S. [sent-289, score-0.194]
81 We will use the smoothed empirical marginals of S, for the weighted trace-norm. [sent-290, score-0.608]
82 We now show that, for bounded loss, there may be a benefit to weighting with the smoothed empirical marginals — the sample size requirement can be lowered to s = O(rn log^{1/2}(n)).
83 Theorem 5: For an l-Lipschitz loss ℓ bounded by b, fix any matrix Y and sample size s.
84 Let p denote the smoothed empirical marginals of S. [sent-295, score-0.48]
85 Then, in expectation over the splitting into S and T, the empirical risk minimizer X̂_S = arg min_{X∈W_r[p]} L̂_S(X) satisfies L_T(X̂_S) ≤ inf_{X∈W_r[p]} L_T(X) + O( l · √(rn log^{1/2}(n)/s) + b/√s ). (6)
86 This result (proved in the Supplementary Materials) is stated in the transductive setting, with a somewhat different sampling procedure and evaluation criteria, but we believe the main difference is in the use of the empirical weights.
87 Although it is usually straightforward to convert a transductive guarantee to an inductive one, the situation here is more complicated, since the hypothesis class depends on the weighting, and hence on the sample S. [sent-299, score-0.358]
88 Nevertheless, we believe such a conversion might be possible, establishing a similar guarantee for learning with the (smoothed) empirically weighted trace-norm also in the inductive setting.
89 Furthermore, since the empirical marginals are close to the true marginals when s = Θ(n log(n)), it might be possible to obtain a learning guarantee for the true (non-empirical) weighting with a sample of size s = O(n(r log^{1/2}(n) + log(n))).
90 Theorem 5 can be viewed as a transductive analog to Theorem 3 (with weights based on the combined sample S). [sent-302, score-0.229]
91 In the Supplementary Materials we give transductive analogs to Theorems 1 and 2. [sent-303, score-0.208]
92 As noted in Section 2.3, our lower bound examples can also be stated in the transductive setting, and thus all our guarantees and lower bounds can also be obtained in this setting.
93 4.3 Simulations with empirical weights. In order to numerically investigate the possible advantage of empirical weighting, we performed simulations on low-rank matrix reconstruction under uniform sampling with the unweighted, and the smoothed empirically weighted, trace-norms.
94 We choose to work with uniform sampling in order to emphasize the benefit of empirical weights, even in situations where one might not consider using any weights at all.
95 We performed two types of simulations. Sample complexity comparison in the noiseless setting: we define Y = M, and compute X̂_S = arg min { ‖X‖ : L̂_S(X) = 0 }, where ‖X‖ = ‖X‖_tr or ‖X‖ = ‖X‖_tr(p^r,p^c), as appropriate.
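A sketch of this noiseless fitting step using an off-the-shelf convex solver (cvxpy is our own choice here; the paper does not specify a solver, and for the weighted variant pr, pc would be the smoothed empirical marginals, while uniform marginals recover ordinary trace-norm minimization up to a constant factor):

    import numpy as np
    import cvxpy as cp

    def complete_noiseless(Y, S, pr, pc):
        # minimize ||diag(pr)^{1/2} X diag(pc)^{1/2}||_tr  s.t.  X[i, j] = Y[i, j] for (i, j) in S
        n, m = Y.shape
        X = cp.Variable((n, m))
        W = np.diag(np.sqrt(pr)) @ X @ np.diag(np.sqrt(pc))
        constraints = [X[i, j] == Y[i, j] for (i, j) in S]
        cp.Problem(cp.Minimize(cp.normNuc(W)), constraints).solve()
        return X.value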
96 In Figure 1(b), we plot the resulting average squared error (over 100 repetitions) over a range of sample sizes s and noise levels ν, with both uniform weighting and empirical weighting. [sent-323, score-0.358]
97 5 Discussion. In this paper, we prove learning guarantees for the weighted trace-norm by analyzing expected Rademacher complexities.
98 We show that weighting with smoothed marginals eliminates degenerate scenarios that can arise in the case of a non-product sampling distribution, and demonstrate in experiments on the Netflix and MovieLens datasets that this correction can be useful in applied settings. [sent-344, score-0.674]
99 We also give results for empirically-weighted trace-norm regularization, and see indications that using the empirical distribution may be better than using the true distribution, even if it is available. [sent-345, score-0.165]
100 Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. [sent-402, score-0.189]
wordName wordTfidf (topN-words)
[('wr', 0.489), ('xs', 0.289), ('marginals', 0.251), ('lp', 0.22), ('pr', 0.213), ('ls', 0.201), ('pc', 0.197), ('transductive', 0.181), ('smoothed', 0.158), ('qt', 0.152), ('weighted', 0.128), ('weighting', 0.125), ('movielens', 0.122), ('ix', 0.119), ('excess', 0.119), ('smoothing', 0.115), ('rademacher', 0.109), ('sp', 0.109), ('entries', 0.097), ('net', 0.093), ('rn', 0.088), ('tr', 0.084), ('ratings', 0.084), ('guarantees', 0.082), ('rs', 0.078), ('guarantee', 0.077), ('nm', 0.075), ('bound', 0.075), ('empirical', 0.071), ('inf', 0.066), ('srebro', 0.066), ('mpc', 0.065), ('loss', 0.062), ('matrix', 0.061), ('salakhutdinov', 0.06), ('sampling', 0.058), ('npr', 0.057), ('theorem', 0.056), ('log', 0.053), ('uniform', 0.053), ('inductive', 0.052), ('collaborative', 0.05), ('completion', 0.05), ('correction', 0.049), ('arbitrary', 0.048), ('sample', 0.048), ('arg', 0.046), ('jt', 0.046), ('distributions', 0.045), ('eit', 0.043), ('qual', 0.043), ('erm', 0.043), ('quali', 0.043), ('lipschitz', 0.042), ('supplementary', 0.041), ('yij', 0.041), ('unbounded', 0.041), ('materials', 0.04), ('distribution', 0.039), ('foygel', 0.038), ('users', 0.037), ('rank', 0.037), ('expectation', 0.037), ('simulations', 0.037), ('bounded', 0.036), ('uniformly', 0.036), ('requiring', 0.036), ('spiky', 0.035), ('squared', 0.035), ('degenerate', 0.033), ('xit', 0.033), ('js', 0.033), ('indexes', 0.032), ('theorems', 0.032), ('drawn', 0.031), ('rina', 0.031), ('yit', 0.031), ('might', 0.03), ('ij', 0.03), ('trace', 0.03), ('complexity', 0.03), ('aside', 0.029), ('elsewhere', 0.029), ('true', 0.028), ('reconstruction', 0.028), ('norm', 0.028), ('regularization', 0.028), ('ltering', 0.028), ('unweighted', 0.028), ('spectral', 0.027), ('give', 0.027), ('arxiv', 0.027), ('product', 0.026), ('revealed', 0.026), ('error', 0.026), ('negahban', 0.026), ('chicago', 0.026), ('rigorously', 0.026), ('training', 0.026), ('proof', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
Author: Rina Foygel, Ohad Shamir, Nati Srebro, Ruslan Salakhutdinov
Abstract: We provide rigorous guarantees on learning with the weighted trace-norm under arbitrary sampling distributions. We show that the standard weighted-trace norm might fail when the sampling distribution is not a product distribution (i.e. when row and column indexes are not selected independently), present a corrected variant for which we establish strong learning guarantees, and demonstrate that it works better in practice. We provide guarantees when weighting by either the true or empirical sampling distribution, and suggest that even if the true distribution is known (or is uniform), weighting by the empirical distribution may be beneficial. 1
2 0.16669071 265 nips-2011-Sparse recovery by thresholded non-negative least squares
Author: Martin Slawski, Matthias Hein
Abstract: Non-negative data are commonly encountered in numerous fields, making nonnegative least squares regression (NNLS) a frequently used tool. At least relative to its simplicity, it often performs rather well in practice. Serious doubts about its usefulness arise for modern high-dimensional linear models. Even in this setting − unlike first intuition may suggest − we show that for a broad class of designs, NNLS is resistant to overfitting and works excellently for sparse recovery when combined with thresholding, experimentally even outperforming 1 regularization. Since NNLS also circumvents the delicate choice of a regularization parameter, our findings suggest that NNLS may be the method of choice. 1
3 0.15490539 170 nips-2011-Message-Passing for Approximate MAP Inference with Latent Variables
Author: Jiarong Jiang, Piyush Rai, Hal Daume
Abstract: We consider a general inference setting for discrete probabilistic graphical models where we seek maximum a posteriori (MAP) estimates for a subset of the random variables (max nodes), marginalizing over the rest (sum nodes). We present a hybrid message-passing algorithm to accomplish this. The hybrid algorithm passes a mix of sum and max messages depending on the type of source node (sum or max). We derive our algorithm by showing that it falls out as the solution of a particular relaxation of a variational framework. We further show that the Expectation Maximization algorithm can be seen as an approximation to our algorithm. Experimental results on synthetic and real-world datasets, against several baselines, demonstrate the efficacy of our proposed algorithm. 1
4 0.14967221 165 nips-2011-Matrix Completion for Multi-label Image Classification
Author: Ricardo S. Cabral, Fernando Torre, Joao P. Costeira, Alexandre Bernardino
Abstract: Recently, image categorization has been an active research topic due to the urgent need to retrieve and browse digital images via semantic keywords. This paper formulates image categorization as a multi-label classification problem using recent advances in matrix completion. Under this setting, classification of testing data is posed as a problem of completing unknown label entries on a data matrix that concatenates training and testing features with training labels. We propose two convex algorithms for matrix completion based on a Rank Minimization criterion specifically tailored to visual data, and prove its convergence properties. A major advantage of our approach w.r.t. standard discriminative classification methods for image categorization is its robustness to outliers, background noise and partial occlusions both in the feature and label space. Experimental validation on several datasets shows how our method outperforms state-of-the-art algorithms, while effectively capturing semantic concepts of classes. 1
5 0.14203475 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models
Author: Le Song, Eric P. Xing, Ankur P. Parikh
Abstract: Latent tree graphical models are natural tools for expressing long range and hierarchical dependencies among many variables which are common in computer vision, bioinformatics and natural language processing problems. However, existing models are largely restricted to discrete and Gaussian variables due to computational constraints; furthermore, algorithms for estimating the latent tree structure and learning the model parameters are largely restricted to heuristic local search. We present a method based on kernel embeddings of distributions for latent tree graphical models with continuous and non-Gaussian variables. Our method can recover the latent tree structures with provable guarantees and perform local-minimum free parameter learning and efficient inference. Experiments on simulated and real data show the advantage of our proposed approach. 1
6 0.12614292 141 nips-2011-Large-Scale Category Structure Aware Image Categorization
7 0.10458764 8 nips-2011-A Model for Temporal Dependencies in Event Streams
8 0.10152671 80 nips-2011-Efficient Online Learning via Randomized Rounding
9 0.10015263 158 nips-2011-Learning unbelievable probabilities
10 0.094500072 21 nips-2011-Active Learning with a Drifting Distribution
11 0.092697628 12 nips-2011-A Two-Stage Weighting Framework for Multi-Source Domain Adaptation
12 0.092688739 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
13 0.087862082 198 nips-2011-On U-processes and clustering performance
14 0.083503924 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data
15 0.07990253 289 nips-2011-Trace Lasso: a trace norm regularization for correlated designs
16 0.079404414 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
17 0.071702532 189 nips-2011-Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels
18 0.071251981 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning
19 0.06804733 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
20 0.066018745 40 nips-2011-Automated Refinement of Bayes Networks' Parameters based on Test Ordering Constraints
topicId topicWeight
[(0, 0.202), (1, -0.053), (2, -0.058), (3, -0.128), (4, -0.021), (5, 0.007), (6, 0.019), (7, -0.085), (8, 0.04), (9, -0.071), (10, 0.058), (11, 0.053), (12, -0.044), (13, 0.107), (14, -0.011), (15, -0.081), (16, 0.02), (17, -0.092), (18, 0.04), (19, -0.005), (20, -0.114), (21, 0.031), (22, 0.002), (23, -0.143), (24, 0.091), (25, 0.154), (26, 0.002), (27, 0.099), (28, 0.113), (29, 0.104), (30, 0.034), (31, 0.033), (32, 0.082), (33, 0.042), (34, 0.033), (35, 0.241), (36, 0.024), (37, -0.003), (38, -0.003), (39, -0.099), (40, 0.019), (41, 0.036), (42, -0.13), (43, 0.121), (44, 0.08), (45, 0.091), (46, 0.111), (47, 0.122), (48, 0.056), (49, -0.048)]
simIndex simValue paperId paperTitle
same-paper 1 0.92642576 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
Author: Rina Foygel, Ohad Shamir, Nati Srebro, Ruslan Salakhutdinov
Abstract: We provide rigorous guarantees on learning with the weighted trace-norm under arbitrary sampling distributions. We show that the standard weighted-trace norm might fail when the sampling distribution is not a product distribution (i.e. when row and column indexes are not selected independently), present a corrected variant for which we establish strong learning guarantees, and demonstrate that it works better in practice. We provide guarantees when weighting by either the true or empirical sampling distribution, and suggest that even if the true distribution is known (or is uniform), weighting by the empirical distribution may be beneficial. 1
2 0.65751433 265 nips-2011-Sparse recovery by thresholded non-negative least squares
Author: Martin Slawski, Matthias Hein
Abstract: Non-negative data are commonly encountered in numerous fields, making nonnegative least squares regression (NNLS) a frequently used tool. At least relative to its simplicity, it often performs rather well in practice. Serious doubts about its usefulness arise for modern high-dimensional linear models. Even in this setting − unlike first intuition may suggest − we show that for a broad class of designs, NNLS is resistant to overfitting and works excellently for sparse recovery when combined with thresholding, experimentally even outperforming 1 regularization. Since NNLS also circumvents the delicate choice of a regularization parameter, our findings suggest that NNLS may be the method of choice. 1
3 0.58925879 170 nips-2011-Message-Passing for Approximate MAP Inference with Latent Variables
Author: Jiarong Jiang, Piyush Rai, Hal Daume
Abstract: We consider a general inference setting for discrete probabilistic graphical models where we seek maximum a posteriori (MAP) estimates for a subset of the random variables (max nodes), marginalizing over the rest (sum nodes). We present a hybrid message-passing algorithm to accomplish this. The hybrid algorithm passes a mix of sum and max messages depending on the type of source node (sum or max). We derive our algorithm by showing that it falls out as the solution of a particular relaxation of a variational framework. We further show that the Expectation Maximization algorithm can be seen as an approximation to our algorithm. Experimental results on synthetic and real-world datasets, against several baselines, demonstrate the efficacy of our proposed algorithm. 1
4 0.44041479 165 nips-2011-Matrix Completion for Multi-label Image Classification
Author: Ricardo S. Cabral, Fernando Torre, Joao P. Costeira, Alexandre Bernardino
Abstract: Recently, image categorization has been an active research topic due to the urgent need to retrieve and browse digital images via semantic keywords. This paper formulates image categorization as a multi-label classification problem using recent advances in matrix completion. Under this setting, classification of testing data is posed as a problem of completing unknown label entries on a data matrix that concatenates training and testing features with training labels. We propose two convex algorithms for matrix completion based on a Rank Minimization criterion specifically tailored to visual data, and prove its convergence properties. A major advantage of our approach w.r.t. standard discriminative classification methods for image categorization is its robustness to outliers, background noise and partial occlusions both in the feature and label space. Experimental validation on several datasets shows how our method outperforms state-of-the-art algorithms, while effectively capturing semantic concepts of classes. 1
5 0.42983273 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models
Author: Le Song, Eric P. Xing, Ankur P. Parikh
Abstract: Latent tree graphical models are natural tools for expressing long range and hierarchical dependencies among many variables which are common in computer vision, bioinformatics and natural language processing problems. However, existing models are largely restricted to discrete and Gaussian variables due to computational constraints; furthermore, algorithms for estimating the latent tree structure and learning the model parameters are largely restricted to heuristic local search. We present a method based on kernel embeddings of distributions for latent tree graphical models with continuous and non-Gaussian variables. Our method can recover the latent tree structures with provable guarantees and perform local-minimum free parameter learning and efficient inference. Experiments on simulated and real data show the advantage of our proposed approach. 1
6 0.42236799 73 nips-2011-Divide-and-Conquer Matrix Factorization
7 0.41632488 8 nips-2011-A Model for Temporal Dependencies in Event Streams
8 0.39483872 284 nips-2011-The Impact of Unlabeled Patterns in Rademacher Complexity Theory for Kernel Classifiers
9 0.38701174 172 nips-2011-Minimax Localization of Structural Information in Large Noisy Matrices
10 0.38473809 33 nips-2011-An Exact Algorithm for F-Measure Maximization
11 0.38196731 95 nips-2011-Fast and Accurate k-means For Large Datasets
12 0.37373096 241 nips-2011-Scalable Training of Mixture Models via Coresets
13 0.37268588 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
14 0.36431438 21 nips-2011-Active Learning with a Drifting Distribution
15 0.36418754 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data
16 0.3588081 233 nips-2011-Rapid Deformable Object Detection using Dual-Tree Branch-and-Bound
17 0.35723719 297 nips-2011-Universal low-rank matrix recovery from Pauli measurements
18 0.34943259 198 nips-2011-On U-processes and clustering performance
19 0.34849602 203 nips-2011-On the accuracy of l1-filtering of signals with block-sparse structure
20 0.33443865 239 nips-2011-Robust Lasso with missing and grossly corrupted observations
topicId topicWeight
[(0, 0.056), (4, 0.061), (15, 0.142), (20, 0.035), (26, 0.033), (31, 0.089), (33, 0.011), (43, 0.079), (45, 0.142), (57, 0.079), (74, 0.049), (83, 0.046), (99, 0.091)]
simIndex simValue paperId paperTitle
1 0.91093779 81 nips-2011-Efficient anomaly detection using bipartite k-NN graphs
Author: Kumar Sricharan, Alfred O. Hero
Abstract: Learning minimum volume sets of an underlying nominal distribution is a very effective approach to anomaly detection. Several approaches to learning minimum volume sets have been proposed in the literature, including the K-point nearest neighbor graph (K-kNNG) algorithm based on the geometric entropy minimization (GEM) principle [4]. The K-kNNG detector, while possessing several desirable characteristics, suffers from high computation complexity, and in [4] a simpler heuristic approximation, the leave-one-out kNNG (L1O-kNNG) was proposed. In this paper, we propose a novel bipartite k-nearest neighbor graph (BPkNNG) anomaly detection scheme for estimating minimum volume sets. Our bipartite estimator retains all the desirable theoretical properties of the K-kNNG, while being computationally simpler than the K-kNNG and the surrogate L1OkNNG detectors. We show that BP-kNNG is asymptotically consistent in recovering the p-value of each test point. Experimental results are given that illustrate the superior performance of BP-kNNG as compared to the L1O-kNNG and other state of the art anomaly detection schemes.
2 0.89786988 219 nips-2011-Predicting response time and error rates in visual search
Author: Bo Chen, Vidhya Navalpakkam, Pietro Perona
Abstract: A model of human visual search is proposed. It predicts both response time (RT) and error rates (RT) as a function of image parameters such as target contrast and clutter. The model is an ideal observer, in that it optimizes the Bayes ratio of target present vs target absent. The ratio is computed on the firing pattern of V1/V2 neurons, modeled by Poisson distributions. The optimal mechanism for integrating information over time is shown to be a ‘soft max’ of diffusions, computed over the visual field by ‘hypercolumns’ of neurons that share the same receptive field and have different response properties to image features. An approximation of the optimal Bayesian observer, based on integrating local decisions, rather than diffusions, is also derived; it is shown experimentally to produce very similar predictions to the optimal observer in common psychophysics conditions. A psychophyisics experiment is proposed that may discriminate between which mechanism is used in the human brain. A B C Figure 1: Visual search. (A) Clutter and camouflage make visual search difficult. (B,C) Psychologists and neuroscientists build synthetic displays to study visual search. In (B) the target ‘pops out’ (∆θ = 450 ), while in (C) the target requires more time to be detected (∆θ = 100 ) [1]. 1
same-paper 3 0.88596541 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
Author: Rina Foygel, Ohad Shamir, Nati Srebro, Ruslan Salakhutdinov
Abstract: We provide rigorous guarantees on learning with the weighted trace-norm under arbitrary sampling distributions. We show that the standard weighted-trace norm might fail when the sampling distribution is not a product distribution (i.e. when row and column indexes are not selected independently), present a corrected variant for which we establish strong learning guarantees, and demonstrate that it works better in practice. We provide guarantees when weighting by either the true or empirical sampling distribution, and suggest that even if the true distribution is known (or is uniform), weighting by the empirical distribution may be beneficial. 1
4 0.83620298 186 nips-2011-Noise Thresholds for Spectral Clustering
Author: Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, Aarti Singh
Abstract: Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating our results. 1
5 0.83122402 29 nips-2011-Algorithms and hardness results for parallel large margin learning
Author: Phil Long, Rocco Servedio
Abstract: We study the fundamental problem of learning an unknown large-margin halfspace in the context of parallel computation. Our main positive result is a parallel algorithm for learning a large-margin halfspace that is based on interior point methods from convex optimization and fast parallel algorithms for matrix computations. We show that this algorithm learns an unknown γ-margin halfspace over n dimensions using poly(n, 1/γ) processors ˜ and runs in time O(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have Ω(1/γ 2 ) runtime dependence on γ. Our main negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We give an information-theoretic proof that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized: the ability to call the weak learner multiple times in parallel within a single boosting stage does not reduce the overall number of successive stages of boosting that are required. 1
6 0.82475775 263 nips-2011-Sparse Manifold Clustering and Embedding
7 0.82252741 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
8 0.81743848 233 nips-2011-Rapid Deformable Object Detection using Dual-Tree Branch-and-Bound
9 0.81622624 258 nips-2011-Sparse Bayesian Multi-Task Learning
10 0.81597644 171 nips-2011-Metric Learning with Multiple Kernels
11 0.81573093 102 nips-2011-Generalised Coupled Tensor Factorisation
12 0.81527537 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
13 0.81437427 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data
14 0.81398749 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models
15 0.81394219 150 nips-2011-Learning a Distance Metric from a Network
16 0.81368744 32 nips-2011-An Empirical Evaluation of Thompson Sampling
17 0.81184953 80 nips-2011-Efficient Online Learning via Randomized Rounding
18 0.8117497 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
19 0.81172025 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
20 0.81162041 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity