nips nips2012 nips2012-208 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Rina Foygel, Nathan Srebro, Ruslan Salakhutdinov
Abstract: We introduce a new family of matrix norms, the “local max” norms, generalizing existing methods such as the max norm, the trace norm (nuclear norm), and the weighted or smoothed weighted trace norms, which have been extensively used in the literature as regularizers for matrix reconstruction problems. We show that this new family can be used to interpolate between the (weighted or unweighted) trace norm and the more conservative max norm. We test this interpolation on simulated data and on the large-scale Netflix and MovieLens ratings data, and find improved accuracy relative to the existing matrix norms. We also provide theoretical results showing learning guarantees for some of the new norms. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Matrix reconstruction with the local max norm Rina Foygel Department of Statistics Stanford University rinafb@stanford. [sent-1, score-0.809]
2 We show that this new family can be used to interpolate between the (weighted or unweighted) trace norm and the more conservative max norm. [sent-8, score-1.238]
3 We test this interpolation on simulated data and on the large-scale Netflix and MovieLens ratings data, and find improved accuracy relative to the existing matrix norms. [sent-9, score-0.262]
4 1 Introduction In the matrix reconstruction problem, we are given a matrix Y ∈ Rn×m whose entries are only partly observed, and would like to reconstruct the unobserved entries as accurately as possible. [sent-11, score-0.357]
5 This problem has often been approached using regularization with matrix norms that promote low-rank or approximately-low-rank solutions, including the trace norm (also known as the nuclear norm) and the max norm, as well as several adaptations of the trace norm described below. [sent-15, score-2.424]
6 In this paper, we introduce a unifying family of norms that generalizes these existing matrix norms, and that can be used to interpolate between the trace and max norms. [sent-16, score-1.085]
7 We show that this family includes new norms, lying strictly between the trace and max norms, that give empirical and theoretical improvements over the existing norms. [sent-17, score-0.75]
8 We give results allowing for large-scale optimization with norms from the new family. [sent-18, score-0.256]
9 We write p_{i•} = Σ_j p_{ij} to denote the marginal probability of row i, and p_row = (p_{1•}, ..., p_{n•}) for the vector of row marginals, with p_col defined analogously for columns. [sent-30, score-0.331]
10 1.1 Trace norm and max norm A common regularizer used in matrix reconstruction, and other matrix problems, is the trace norm ‖X‖_tr, equal to the sum of the singular values of X. [sent-37, score-2.339]
11 This norm can also be defined via a factorization of X [1]: (1/√(nm)) ‖X‖_tr = min_{AB^T = X} (1/2) [ (1/n) Σ_i ‖A_(i)‖_2^2 + (1/m) Σ_j ‖B_(j)‖_2^2 ], (1) where the minimum is taken over factorizations of X of arbitrary dimension—that is, the number of columns in A and B is unbounded. [sent-38, score-0.731]
12 Note that we choose to scale the trace norm by 1/√(nm) in order to emphasize that we are averaging the squared row norms of A and B. [sent-39, score-1.294]
13 Regularization with the trace norm gives good theoretical and empirical results, as long as the locations of observed entries are sampled uniformly (i.e. when the sampling distribution p is uniform). [sent-40, score-1.076]
14 The factorized definition of the trace norm (1) allows for an intuitive comparison with the max norm, defined as [1]: ‖X‖_max = min_{AB^T = X} (1/2) [ sup_i ‖A_(i)‖_2^2 + sup_j ‖B_(j)‖_2^2 ]. (2) [sent-43, score-1.561]
15 We see that the max norm measures the largest row norms in the factorization, while the rescaled trace norm instead considers the average row norms. [sent-44, score-2.007]
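To make the comparison between (1) and (2) concrete, here is a small NumPy sketch (my own illustration, not code from the paper) that evaluates both factorization objectives at a balanced SVD factorization; the balancing constant c is an assumption of the sketch, chosen so that the averaged objective attains the rescaled trace norm.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 40, 30, 3
X = rng.standard_normal((n, k)) @ rng.standard_normal((k, m))

# Rescaled trace norm: sum of singular values divided by sqrt(nm).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
trace_rescaled = s.sum() / np.sqrt(n * m)

# Balanced SVD factorization X = A @ B.T; the constant c equalizes the two
# averaged terms in (1), so this particular factorization attains the minimum.
c = (n / m) ** 0.25
A = c * U * np.sqrt(s)              # rows A_(i), shape (n, k)
B = (1.0 / c) * Vt.T * np.sqrt(s)   # rows B_(j), shape (m, k)

sq_rows_A = (A ** 2).sum(axis=1)    # squared row norms ||A_(i)||^2
sq_rows_B = (B ** 2).sum(axis=1)    # squared row norms ||B_(j)||^2

objective_avg = 0.5 * (sq_rows_A.mean() + sq_rows_B.mean())   # objective of (1)
objective_sup = 0.5 * (sq_rows_A.max() + sq_rows_B.max())     # objective of (2)

print(np.isclose(objective_avg, trace_rescaled))   # True: (1) is attained here
print(bool(objective_sup >= trace_rescaled))       # True: the sup objective dominates
```

The sup-based objective computed this way only upper-bounds the max norm, since a single factorization need not attain the minimum in (2), but it already dominates the rescaled trace norm, consistent with the ordering described above.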
16 The max norm is therefore an upper bound on the rescaled trace norm, and can be viewed as a more conservative regularizer. [sent-45, score-1.175]
17 For the more general setting where p may not be uniform, Foygel and Srebro [4] show that the max norm is still an effective regularizer (in particular, bounds on error for the max norm are not affected by p). [sent-46, score-1.412]
18 On the other hand, Salakhutdinov and Srebro [5] show that the trace norm is not robust to non-uniform sampling—regularizing with the trace norm may yield large error due to over-fitting on the rows and columns with high marginals. [sent-47, score-2.022]
19 1.2 The weighted trace norm To reduce overfitting on the rows and columns with high marginal probabilities under the distribution p, Salakhutdinov and Srebro propose regularizing with the p-weighted trace norm, ‖X‖_tr(p) := ‖ diag(p_row)^{1/2} · X · diag(p_col)^{1/2} ‖_tr. [sent-50, score-1.681]
20 If the rows and columns of entries to be observed are sampled independently (i.e. [sent-51, score-0.193]
21 p = prow · pcol is a product distribution), then the p-weighted trace norm can be used to obtain good learning guarantees even when prow and pcol are non-uniform [3, 6]. [sent-53, score-1.733]
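The weighted trace norm is easy to evaluate numerically; the sketch below (my own illustration, not code from the paper) computes it for arbitrary weight vectors and checks that uniform weights recover the rescaled trace norm from (1).

```python
import numpy as np

def weighted_trace_norm(X, r, c):
    """(r, c)-weighted trace norm: nuclear norm of diag(r)^{1/2} X diag(c)^{1/2}."""
    scaled = np.sqrt(r)[:, None] * X * np.sqrt(c)[None, :]
    return np.linalg.norm(scaled, ord="nuc")

rng = np.random.default_rng(0)
n, m = 50, 40
X = rng.standard_normal((n, m))

# Uniform weights recover the trace norm rescaled by 1/sqrt(nm), as in (1).
uniform = weighted_trace_norm(X, np.full(n, 1.0 / n), np.full(m, 1.0 / m))
print(np.isclose(uniform, np.linalg.norm(X, ord="nuc") / np.sqrt(n * m)))  # True
```

Plugging in r = p_row and c = p_col gives the p-weighted trace norm; empirical or smoothed marginals give the variants discussed next.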
22 However, for non-uniform non-product sampling distributions, even the p-weighted trace norm can yield poor generalization performance. [sent-54, score-0.967]
23 The smoothed empirically-weighted trace norm is also studied in [6], where p_{i•} is replaced with p̂_{i•} = (# observations in row i) / (total # observations), the empirical marginal probability of row i (and same for p̂_{•j}). [sent-57, score-1.342]
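Concretely, the empirical marginals and their ζ-smoothed versions (the smoothing form that reappears in (6) and (7) below; equation (3) itself is not reproduced in this extraction) can be computed as in the following sketch, which is my own illustration with hypothetical variable names.

```python
import numpy as np

def smoothed_empirical_marginals(rows_obs, cols_obs, n, m, zeta):
    """Empirical row/column marginals from observed (i, j) pairs,
    smoothed toward uniform: (1 - zeta) * p_hat + zeta * (1 / dimension)."""
    s = len(rows_obs)
    p_row_hat = np.bincount(rows_obs, minlength=n) / s
    p_col_hat = np.bincount(cols_obs, minlength=m) / s
    p_row = (1 - zeta) * p_row_hat + zeta / n
    p_col = (1 - zeta) * p_col_hat + zeta / m
    return p_row, p_col

# Hypothetical usage with a small synthetic observed sample:
rng = np.random.default_rng(1)
rows = rng.integers(0, 100, size=5000)
cols = rng.integers(0, 80, size=5000)
p_row, p_col = smoothed_empirical_marginals(rows, cols, 100, 80, zeta=0.5)
print(p_row.sum(), p_col.sum())  # each sums to 1
```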
24 More generally, for any weight vectors r ∈ ∆_[n] and c ∈ ∆_[m] and a matrix X ∈ R^{n×m}, the (r, c)-weighted trace norm is given by ‖X‖_tr(r,c) = ‖ diag(r)^{1/2} · X · diag(c)^{1/2} ‖_tr. [sent-59, score-1.03]
25 2 Of course, we can easily obtain the existing methods of the uniform trace norm, (empirically) weighted trace norm, and smoothed (empirically) weighted trace norm as special cases of this formulation. [sent-61, score-2.495]
26 Furthermore, the max norm is equal to a supremum over all possible weightings [7]: ‖X‖_max = sup_{r∈∆_[n], c∈∆_[m]} ‖X‖_tr(r,c). [sent-62, score-0.969]
27 2 The local max norm We consider a generalization of these norms, the local max norm ‖X‖_(R,C) := sup_{r∈R, c∈C} ‖X‖_tr(r,c) for subsets R ⊆ ∆_[n] and C ⊆ ∆_[m], which lies “in between” the trace norm and max norm. [sent-63, score-1.868]
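Because the local max norm is a supremum of weighted trace norms over R × C, any finite set of candidate weights gives a computable lower bound; the following sketch (my own illustration, not the paper's optimization code) does this for the full-simplex case, where the supremum is the max norm.

```python
import numpy as np

def local_max_norm_lower_bound(X, sample_r, sample_c, trials=200):
    """Monte-Carlo lower bound on the local max norm sup_{r in R, c in C} ||X||_tr(r,c):
    evaluate the weighted trace norm at sampled candidate weights and take the maximum."""
    best = 0.0
    for _ in range(trials):
        r, c = sample_r(), sample_c()
        scaled = np.sqrt(r)[:, None] * X * np.sqrt(c)[None, :]
        best = max(best, np.linalg.norm(scaled, ord="nuc"))
    return best

# Example with R and C equal to the whole simplex, so the supremum is the max norm:
rng = np.random.default_rng(2)
n, m = 30, 20
X = rng.standard_normal((n, 2)) @ rng.standard_normal((2, m))
approx = local_max_norm_lower_bound(
    X,
    sample_r=lambda: rng.dirichlet(np.ones(n)),
    sample_c=lambda: rng.dirichlet(np.ones(m)),
)
print(approx)  # lower-bounds ||X||_max; singleton R = {r}, C = {c} recovers ||X||_tr(r,c)
```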
28 This gives a norm on matrices, except in the trivial case where, for some i or some j, ri = 0 for all r ∈ R or cj = 0 for all c ∈ C. [sent-65, score-0.7]
29 We now show some existing and novel norms that can be obtained using local max norms. [sent-66, score-0.489]
30 2.1 Trace norm and max norm We can obtain the max norm by taking the largest possible R and C, i.e. [sent-68, score-1.841]
31 ‖X‖_max = ‖X‖_(∆_[n], ∆_[m]), and similarly we can obtain the (r, c)-weighted trace norm by taking the singleton sets R = {r} and C = {c}. [sent-70, score-1.176]
32 As discussed above, this includes the standard trace norm (when r and c are uniform), as well as the weighted, empirically weighted, and smoothed weighted trace norm. [sent-71, score-1.814]
33 2.2 Arbitrary smoothing When using the smoothed weighted trace norm, we need to choose the amount of smoothing to apply to the marginals, that is, we need to choose ζ in our definition of the smoothed row and column weights, as given in (3). [sent-73, score-1.003]
34 Alternately, we could regularize simultaneously over all possible amounts of smoothing by considering the local max norm with R = {(1 − ζ) · prow + ζ · 1/n : any ζ ∈ [0, 1]} , and same for C. [sent-74, score-0.995]
35 That is, R and C are line segments in the simplex—they are larger than any single point as for the uniform or weighted trace norm (or smoothed weighted trace norm for a fixed amount of smoothing), but smaller than the entire simplex as for the max norm. [sent-75, score-2.707]
36 It may therefore be useful to consider a penalty function of the form: Penalty_(β,τ)(X) = min_{AB^T = X} [ β ( max_i ‖A_(i)‖_2^2 + max_j ‖B_(j)‖_2^2 ) + τ ( Σ_i ‖A_(i)‖_2^2 + Σ_j ‖B_(j)‖_2^2 ) ]. (4) [sent-82, score-0.412]
37 (Note that max{ max_i ‖A_(i)‖_2 , max_j ‖B_(j)‖_2 } is replaced here with max_i ‖A_(i)‖_2^2 + max_j ‖B_(j)‖_2^2.) This penalty function does not appear to be convex in X. [sent-87, score-0.303]
38 However, the proposition below (proved in the Supplementary Materials) shows that we can use a (convex) local max norm penalty to compute a solution to any objective function with a penalty function of the form (4): Proposition 1. [sent-88, score-0.85]
39 Then, for some penalty parameter µ ≥ 0 and some t ∈ [0, 1], X̂ = arg min_X { Loss(X) + µ · ‖X‖_(R,C) }, where R = { r ∈ ∆_[n] : r_i ≥ t / (1 + (n − 1)t) ∀i } and C = { c ∈ ∆_[m] : c_j ≥ t / (1 + (m − 1)t) ∀j }. [sent-90, score-0.241]
40 Here the sets R and C impose a lower bound on each of the weights, and this lower bound can be used to interpolate between the max and trace norms: when t = 1, each r_i is lower bounded by 1/n (and similarly for c_j), i.e. [sent-92, score-0.809]
41 R and C are singletons containing only the uniform weights and we obtain the trace norm. [sent-94, score-0.337]
42 When t = 0 the lower bound vanishes, i.e. R and C are each the entire simplex and we obtain the max norm. [sent-97, score-0.214]
43 Intermediate values of t interpolate between the trace norm and max norm and correspond to different balances between β and τ . [sent-98, score-1.708]
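A quick way to see this interpolation is to look at the lower bound from Proposition 1 as a function of t; the short sketch below (my own check of the endpoints, using the bound as stated above) is all that is involved.

```python
def row_weight_lower_bound(t, n):
    """Lower bound on each r_i in Proposition 1: r_i >= t / (1 + (n - 1) * t)."""
    return t / (1.0 + (n - 1) * t)

n = 100
print(row_weight_lower_bound(1.0, n))  # 1/n: R is the uniform singleton -> trace norm
print(row_weight_lower_bound(0.0, n))  # 0.0: R is the whole simplex     -> max norm
print(row_weight_lower_bound(0.3, n))  # strictly between 0 and 1/n      -> interpolation
```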
44 2.4 Interpolating between trace norm and max norm We next turn to an interpolation which relies on an upper bound, rather than a lower bound, on the weights. [sent-100, score-1.713]
45 Consider R_ε = { r ∈ ∆_[n] : r_i ≤ ε ∀i } and C_δ = { c ∈ ∆_[m] : c_j ≤ δ ∀j }, (5) for some ε ∈ [1/n, 1] and δ ∈ [1/m, 1]. [sent-101, score-0.179]
46 The (R , Cδ )-norm is then equal to the (rescaled) trace norm when we choose = 1/n and δ = 1/m, and is equal to the max norm when we choose = δ = 1. [sent-102, score-1.681]
47 We can generalize this to an interpolation between the max norm and a smoothed weighted trace norm, which we will use in our experimental results. [sent-104, score-1.593]
48 The first is multiplicative: R^×_{ζ,γ} := { r ∈ ∆_[n] : r_i ≤ γ · ((1 − ζ) · p_{i•} + ζ · 1/n) ∀i }, (6) where γ = 1 corresponds to choosing the singleton set R^×_{ζ,1} = {(1 − ζ) · p_row + ζ · 1/n} (i.e. [sent-106, score-0.342]
49 the smoothed weighted trace norm), while γ = ∞ corresponds to the max norm (for any choice of ζ) since we would get R^×_{ζ,∞} = ∆_[n]. [sent-108, score-1.519]
50 The second option for an interpolation is instead defined with an exponent: R_{ζ,τ} := { r ∈ ∆_[n] : r_i ≤ ((1 − ζ) · p_{i•} + ζ · 1/n)^{1−τ} ∀i }. (7) [sent-109, score-0.189]
51 Here τ = 0 will yield the singleton set corresponding to the smoothed weighted trace norm, while τ = 1 will yield R_{ζ,1} = ∆_[n], i.e. the max norm. [sent-110, score-0.881]
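The two families (6) and (7) differ only in how the per-coordinate upper bound is built from the smoothed marginal; the following sketch (my own illustration) constructs both and checks the endpoint behaviour described above.

```python
import numpy as np

def multiplicative_upper_bounds(p_smoothed, gamma):
    """Per-coordinate upper bounds defining R^x_{zeta,gamma} in (6)."""
    return gamma * p_smoothed

def exponent_upper_bounds(p_smoothed, tau):
    """Per-coordinate upper bounds defining R_{zeta,tau} in (7)."""
    return p_smoothed ** (1.0 - tau)

rng = np.random.default_rng(3)
p = rng.dirichlet(np.ones(1000))   # stands in for the smoothed row marginals
print(np.allclose(exponent_upper_bounds(p, 0.0), p))        # tau = 0: singleton set (weighted trace norm)
print(np.all(exponent_upper_bounds(p, 1.0) == 1.0))         # tau = 1: whole simplex (max norm)
print(np.allclose(multiplicative_upper_bounds(p, 1.0), p))  # gamma = 1: singleton set again
```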
52 3 Optimization with the local max norm One appeal of both the trace norm and the max norm is that they are both SDP representable [9, 10], and thus easily optimizable, at least in small-scale problems. [sent-117, score-2.396]
53 As we show in Theorem 1 below, a similar factorization-optimization approach is also possible for any local max norm with convex R and C. [sent-121, score-0.764]
54 We further give a simplified representation which is applicable when R and C are specified through element-wise upper bounds R ∈ R^n_+ and C ∈ R^m_+, respectively: R = { r ∈ ∆_[n] : r_i ≤ R_i ∀i } and C = { c ∈ ∆_[m] : c_j ≤ C_j ∀j }, (8) with 0 ≤ R_i ≤ 1, Σ_i R_i ≥ 1, 0 ≤ C_j ≤ 1, Σ_j C_j ≥ 1 to avoid triviality. [sent-122, score-0.232]
55 If R and C are convex, then the (R, C)-norm can be calculated with the factorization ‖X‖_(R,C) = inf_{AB^T = X} (1/2) [ sup_{r∈R} Σ_i r_i ‖A_(i)‖_2^2 + sup_{c∈C} Σ_j c_j ‖B_(j)‖_2^2 ]. (9) [sent-126, score-0.207]
56 In the special case when R and C are defined by (8), writing (x)_+ := max{0, x}, this simplifies to ‖X‖_(R,C) = inf_{AB^T = X} (1/2) inf_{a,b ≥ 0} [ a + Σ_i R_i ( ‖A_(i)‖_2^2 − a )_+ + b + Σ_j C_j ( ‖B_(j)‖_2^2 − b )_+ ]. (10) [sent-127, score-0.405]
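Under the simplified form (10) as reconstructed above, the inner infimum over a (and b) is a one-dimensional convex piecewise-linear problem whose minimum is attained at a = 0 or at one of the breakpoints ‖A_(i)‖_2^2, so the penalty can be evaluated exactly for any fixed factorization. The sketch below is my own illustration of that computation, not the optimization code used in the paper.

```python
import numpy as np

def half_penalty(sq_row_norms, upper_bounds):
    """inf_{a >= 0} [ a + sum_i upper_bounds[i] * max(0, sq_row_norms[i] - a) ].
    The objective is convex and piecewise linear in a, so it suffices to
    evaluate it at a = 0 and at each breakpoint a = sq_row_norms[i]."""
    candidates = np.concatenate(([0.0], sq_row_norms))
    values = [a + np.sum(upper_bounds * np.clip(sq_row_norms - a, 0.0, None))
              for a in candidates]
    return min(values)

def local_max_penalty(A, B, R_ub, C_ub):
    """Objective of (10) for a fixed factorization X = A @ B.T."""
    vA = (A ** 2).sum(axis=1)   # ||A_(i)||_2^2
    vB = (B ** 2).sum(axis=1)   # ||B_(j)||_2^2
    return 0.5 * (half_penalty(vA, R_ub) + half_penalty(vB, C_ub))

# Sanity checks against the two extremes:
rng = np.random.default_rng(4)
n, m, k = 20, 15, 4
A, B = rng.standard_normal((n, k)), rng.standard_normal((m, k))
vA, vB = (A ** 2).sum(axis=1), (B ** 2).sum(axis=1)

max_case = local_max_penalty(A, B, np.ones(n), np.ones(m))                # R_i = C_j = 1
trace_case = local_max_penalty(A, B, np.full(n, 1 / n), np.full(m, 1 / m))
print(np.isclose(max_case, 0.5 * (vA.max() + vB.max())))     # matches objective (2)
print(np.isclose(trace_case, 0.5 * (vA.mean() + vB.mean()))) # matches objective (1)
```

Plugging in R_i = C_j = 1 recovers the max-norm objective (2), and R_i = 1/n, C_j = 1/m recovers the averaged trace-norm objective (1), which is a useful sanity check on the representation.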
57 Here, for the next-to-last step of the proof we set C = diag(r)^{1/2} A and D = diag(c)^{1/2} B, and the last step follows because sup inf ≤ inf sup always (weak duality). [sent-130, score-0.498]
58 4 An approximate convex hull and a learning guarantee In this section, we look for theoretical bounds on error for the problem of estimating unobserved entries in a matrix Y that is approximately low-rank. [sent-132, score-0.345]
59 We begin with a result comparing the (R, C)-norm unit ball to a convex hull of rank-1 matrices, which will be useful for proving our learning guarantee. [sent-134, score-0.177]
60 1 Convex hull To gain a better theoretical understanding of the (R, C) norm, we first need to define corresponding vector norms on Rn and Rm . [sent-136, score-0.295]
61 For any u ∈ R^n, let ‖u‖_R := sup_{r∈R} ( Σ_i r_i u_i^2 )^{1/2} = sup_{r∈R} ‖ diag(r)^{1/2} · u ‖_2. [sent-137, score-0.339]
62 We can think of this norm as a way to interpolate between the ℓ_2 and ℓ_∞ vector norms. [sent-138, score-0.566]
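When R is given by the element-wise upper bounds of (8), the supremum defining ‖u‖_R is a linear program over a capped simplex and is solved greedily by loading weight onto the largest entries of u first; the sketch below is my own implementation of that observation.

```python
import numpy as np

def vector_R_norm(u, upper_bounds):
    """sup over r in the simplex with r_i <= upper_bounds[i] of sqrt(sum_i r_i * u_i^2).
    Greedy: spend the unit weight budget on the coordinates with largest u_i^2 first."""
    order = np.argsort(-np.abs(u))
    budget, value = 1.0, 0.0
    for i in order:
        w = min(upper_bounds[i], budget)
        value += w * u[i] ** 2
        budget -= w
        if budget <= 1e-12:
            break
    return np.sqrt(value)

u = np.array([3.0, -1.0, 0.5, 2.0])
print(vector_R_norm(u, np.ones(4)))        # whole simplex: the l_infinity norm, 3.0
print(vector_R_norm(u, np.full(4, 0.25)))  # uniform singleton: ||u||_2 / sqrt(4)
print(np.linalg.norm(u) / 2.0)             # matches the previous line
```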
63 Defining ‖v‖_C analogously for v ∈ R^m, we can now relate these vector norms to the (R, C)-norm on matrices. [sent-140, score-0.231]
64 For any convex R ⊆ ∆_[n] and C ⊆ ∆_[m], the (R, C)-norm unit ball is bounded above and below by a convex hull as: Conv{ uv^T : ‖u‖_R = ‖v‖_C = 1 } ⊆ { X : ‖X‖_(R,C) ≤ 1 } ⊆ K_G · Conv{ uv^T : ‖u‖_R = ‖v‖_C = 1 }, where K_G ≤ 1.79 is Grothendieck's constant. [sent-142, score-0.329]
65 This result is a nontrivial extension of Srebro and Shraibman [1]’s analysis for the max norm and the trace norm. [sent-144, score-1.142]
66 Srebro and Shraibman prove such a bound for the max norm, i.e. when R = ∆_[n] and C = ∆_[m], and show that the trace norm unit ball is exactly equal to the corresponding convex hull (see Corollary 2 and Section 3 of [1]). [sent-147, score-1.144]
67 For the second inclusion, we state a weighted version of Grothendieck's Inequality (proof in the Supplementary Materials): sup{ ⟨Y, U V^T⟩ : U ∈ R^{n×k}, V ∈ R^{m×k}, ‖U_(i)‖_2 ≤ a_i ∀i, ‖V_(j)‖_2 ≤ b_j ∀j } ≤ K_G · sup{ ⟨Y, u v^T⟩ : u ∈ R^n, v ∈ R^m, |u_i| ≤ a_i ∀i, |v_j| ≤ b_j ∀j }. [sent-152, score-0.473]
68 We then apply this weighted inequality to the dual norm of the (R, C)-norm to prove the desired inclusion, as in Srebro and Shraibman [1]’s work for the max norm case (see Corollary 2 in their paper). [sent-153, score-1.301]
69 2 Learning guarantee We now give our main matrix reconstruction result, which provides error bounds for a family of norms interpolating between the max norm and the smoothed weighted trace norm. [sent-156, score-2.088]
70 Then, in expectation over the sample S, Σ_ij p_ij |Y_ij − X̂_ij| ≤ inf_{‖X‖_(R,C) ≤ √k} Σ_ij p_ij |Y_ij − X_ij| (Approximation error) + O( √(kn/s) · ( 1 + √(log(n)/γ) ) ) (Excess error), where X̂ = arg min_{‖X‖_(R,C) ≤ √k} Σ_{t=1}^{s} |Y_{i_t j_t} − X_{i_t j_t}|. [sent-167, score-0.379]
71 The main idea is to use the convex hull formulation from Theorem 2 to show that, for any X with ‖X‖_(R,C) ≤ √k, there exists a decomposition X = X′ + X″ with ‖X′‖_max ≤ O(√k) and ‖X″‖_tr(p̃) ≤ O(√(k/γ)), where p̃ represents the smoothed marginals with smoothing parameter ζ = 1/2 as in (3). [sent-170, score-0.626]
72 We then apply known bounds on the Rademacher complexity of the max norm unit ball [1] and the smoothed weighted trace norm unit ball [6], to bound the Rademacher complexity of { X : ‖X‖_(R,C) ≤ √k }. [sent-171, score-2.194]
73 As special cases of this theorem, we can re-derive the existing results for the max norm and smoothed weighted trace norm. [sent-174, score-1.548]
74 Specifically, choosing γ = ∞ gives us an excess error term of order √(kn/s) for the max norm, previously shown by [1], while setting γ = 1 yields an excess error term of order √(kn log(n)/s) for the smoothed weighted trace norm as long as s ≥ n log(n), as shown in [6]. [sent-175, score-1.812]
75 What advantage does this new result offer over the existing results for the max norm and for the smoothed weighted trace norm? [sent-176, score-1.548]
77 Table 1: Matrix fitting for the five methods used in experiments; the table lists each norm and its per-entry error. Methods: max norm; (uniform) trace norm; empirically-weighted trace norm; arbitrarily-smoothed empirically-weighted trace norm; local max norm. [sent-190, score-2.471]
78 Figure 1: Simulation results for matrix reconstruction with a rank-2 (left) or rank-4 (right) signal, corrupted by noise; per-entry error is shown against the matrix dimension n ∈ {30, 60, 120, 240}. [sent-192, score-1.785]
79 For the rank-4 experiment, max norm error exceeded 0. [sent-194, score-0.712]
76 Then, comparing to the max norm result (when γ = ∞), we see that the excess error term is the same in both cases (up to a constant), but the approximation error term may in general be much lower for the local max norm than for the max norm. [sent-178, score-0.672]
81 Comparing next to the weighted trace norm (when γ = 1), we see that the excess error term is lower by a factor of log(n) for the local max norm. [sent-197, score-1.438]
82 5 Experiments We test the local max norm on simulated and real matrix reconstruction tasks, and compare its performance to the max norm, the uniform and empirically-weighted trace norms, and the smoothed empirically-weighted trace norm. [sent-200, score-2.303]
83 Methods We use the two-parameter family of norms defined in (7), but replacing the true marginals p_{i•} and p_{•j} with the empirical marginals p̂_{i•} and p̂_{•j}. [sent-212, score-0.354]
84 We see that the local max norm results in lower error than any of the tested existing norms, across all the settings used. [sent-225, score-0.795]
85 Setting τ = 0 corresponds to the uniform or weighted or smoothed weighted trace norm (depending on ζ), while τ = 1 corresponds to the max norm for any ζ value. [sent-227, score-2.198]
86 2 Movie ratings data We next compare several different matrix norms on two collaborative filtering movie ratings datasets, the Netflix [14] and MovieLens [15] datasets. [sent-283, score-0.52]
87 We also test τ = 1 (the max norm—here ζ is arbitrary) and ζ = 1, τ = 0 (the uniform trace norm). [sent-291, score-0.716]
88 For both the MovieLens and Netflix data, the local max norm with τ = 0. [sent-294, score-0.726]
89 05 gives strictly better accuracy than any previously-known norm studied in this setting. [sent-296, score-0.521]
90 ) For the MovieLens data, the local max norm achieves RMSE of 0. [sent-298, score-0.726]
91 7831 achieved by the smoothed empirically-weighted trace norm with ζ = 0. [sent-300, score-1.212]
92 For the Netflix dataset the local max norm achieves RMSE of 0. [sent-302, score-0.726]
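Throughout these experiments the reported metric is root-mean-squared error on held-out ratings; for completeness, a minimal sketch of that computation (my own, with hypothetical variable names) is:

```python
import numpy as np

def rmse(Y_true, X_pred, test_rows, test_cols):
    """Root-mean-squared error of the reconstruction on a held-out set of entries."""
    diff = Y_true[test_rows, test_cols] - X_pred[test_rows, test_cols]
    return float(np.sqrt(np.mean(diff ** 2)))

# Hypothetical usage, assuming Y holds the ratings, X_hat the fitted matrix, and
# (test_rows, test_cols) the indices of the held-out test ratings:
# score = rmse(Y, X_hat, test_rows, test_cols)
```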
93 6 Summary In this paper, we introduce a unifying family of matrix norms, called the “local max” norms, that generalizes existing methods for matrix reconstruction, such as the max norm and trace norm. [sent-305, score-1.345]
94 We examine some interesting sub-families of local max norms, and consider several different options for interpolating between the trace (or smoothed weighted trace) and max norms. [sent-306, score-1.316]
95 We find norms lying strictly between the trace norm and the max norm that give improved accuracy in matrix reconstruction for both simulated data and real movie ratings data. [sent-307, score-2.181]
96 We show that regularizing with any local max norm is fairly simple to optimize, and give a theoretical result suggesting improved matrix reconstruction using new norms in this family. [sent-308, score-1.166]
97 For MovieLens, the local max norm gives an RMSE of 0. [sent-319, score-0.75]
98 For Netflix, the local max norm gives an RMSE of 0. [sent-324, score-0.75]
99 05, while the smoothed weighted trace norm gives an RMSE of 0. [sent-326, score-1.368]
100 Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. [sent-344, score-0.223]
wordName wordTfidf (topN-words)
[('norm', 0.497), ('trace', 0.47), ('smoothed', 0.245), ('norms', 0.231), ('prow', 0.213), ('max', 0.175), ('pcol', 0.17), ('ix', 0.132), ('weighted', 0.132), ('movielens', 0.13), ('sup', 0.122), ('srebro', 0.106), ('net', 0.099), ('ri', 0.095), ('rmse', 0.092), ('cj', 0.084), ('reconstruction', 0.083), ('foygel', 0.081), ('ab', 0.078), ('tr', 0.077), ('ratings', 0.075), ('interpolation', 0.074), ('excess', 0.07), ('interpolate', 0.069), ('inf', 0.066), ('hull', 0.064), ('entries', 0.064), ('matrix', 0.063), ('salakhutdinov', 0.063), ('penalty', 0.062), ('pi', 0.06), ('uv', 0.057), ('diag', 0.057), ('factorization', 0.057), ('smoothing', 0.056), ('local', 0.054), ('rn', 0.053), ('row', 0.052), ('jt', 0.052), ('uniform', 0.05), ('kn', 0.049), ('ball', 0.048), ('marginals', 0.048), ('kg', 0.044), ('interpolating', 0.044), ('materials', 0.044), ('grothendieck', 0.043), ('shraibman', 0.043), ('hazan', 0.042), ('movie', 0.041), ('yij', 0.04), ('pij', 0.04), ('error', 0.04), ('simplex', 0.039), ('convex', 0.038), ('regularizing', 0.038), ('sdp', 0.038), ('rm', 0.038), ('collaborative', 0.035), ('singleton', 0.034), ('rademacher', 0.034), ('rescaled', 0.033), ('representable', 0.031), ('xij', 0.031), ('supplementary', 0.03), ('existing', 0.029), ('inclusion', 0.029), ('factorizations', 0.028), ('guarantee', 0.028), ('validation', 0.028), ('bounds', 0.028), ('maxj', 0.028), ('family', 0.027), ('columns', 0.027), ('sketch', 0.027), ('unit', 0.027), ('conv', 0.026), ('marginal', 0.026), ('rennie', 0.025), ('exponent', 0.025), ('give', 0.025), ('theorem', 0.024), ('lying', 0.024), ('gives', 0.024), ('colt', 0.023), ('nm', 0.023), ('arbitrary', 0.022), ('options', 0.021), ('nuclear', 0.021), ('rows', 0.021), ('choose', 0.021), ('locations', 0.021), ('unifying', 0.021), ('test', 0.021), ('rank', 0.02), ('unobserved', 0.02), ('tted', 0.02), ('option', 0.02), ('bj', 0.02), ('completion', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 208 nips-2012-Matrix reconstruction with the local max norm
Author: Rina Foygel, Nathan Srebro, Ruslan Salakhutdinov
Abstract: We introduce a new family of matrix norms, the “local max” norms, generalizing existing methods such as the max norm, the trace norm (nuclear norm), and the weighted or smoothed weighted trace norms, which have been extensively used in the literature as regularizers for matrix reconstruction problems. We show that this new family can be used to interpolate between the (weighted or unweighted) trace norm and the more conservative max norm. We test this interpolation on simulated data and on the large-scale Netflix and MovieLens ratings data, and find improved accuracy relative to the existing matrix norms. We also provide theoretical results showing learning guarantees for some of the new norms. 1
2 0.24253729 319 nips-2012-Sparse Prediction with the $k$-Support Norm
Author: Andreas Argyriou, Rina Foygel, Nathan Srebro
Abstract: We derive a novel norm that corresponds to the tightest convex relaxation of sparsity combined with an ℓ2 penalty. We show that this new k-support norm provides a tighter relaxation than the elastic net and can thus be advantageous in sparse prediction problems. We also bound the looseness of the elastic net, thus shedding new light on it and providing justification for its use. 1
3 0.20003195 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach
Author: Xinhua Zhang, Dale Schuurmans, Yao-liang Yu
Abstract: Sparse learning models typically combine a smooth loss with a nonsmooth penalty, such as trace norm. Although recent developments in sparse approximation have offered promising solution methods, current approaches either apply only to matrix-norm constrained problems or provide suboptimal convergence rates. In this paper, we propose a boosting method for regularized learning that guarantees accuracy within O(1/ ) iterations. Performance is further accelerated by interlacing boosting with fixed-rank local optimization—exploiting a simpler local objective than previous work. The proposed method yields state-of-the-art performance on large-scale problems. We also demonstrate an application to latent multiview learning for which we provide the first efficient weak-oracle. 1
4 0.14016154 247 nips-2012-Nonparametric Reduced Rank Regression
Author: Rina Foygel, Michael Horrell, Mathias Drton, John D. Lafferty
Abstract: We propose an approach to multivariate nonparametric regression that generalizes reduced rank regression for linear models. An additive model is estimated for each dimension of a q-dimensional response, with a shared p-dimensional predictor variable. To control the complexity of the model, we employ a functional form of the Ky-Fan or nuclear norm, resulting in a set of function estimates that have low rank. Backfitting algorithms are derived and justified using a nonparametric form of the nuclear norm subdifferential. Oracle inequalities on excess risk are derived that exhibit the scaling behavior of the procedure in the high dimensional setting. The methods are illustrated on gene expression data. 1
5 0.11773591 246 nips-2012-Nonparametric Max-Margin Matrix Factorization for Collaborative Prediction
Author: Minjie Xu, Jun Zhu, Bo Zhang
Abstract: We present a probabilistic formulation of max-margin matrix factorization and build accordingly a nonparametric Bayesian model which automatically resolves the unknown number of latent factors. Our work demonstrates a successful example that integrates Bayesian nonparametrics and max-margin learning, which are conventionally two separate paradigms and enjoy complementary advantages. We develop an efficient variational algorithm for posterior inference, and our extensive empirical studies on large-scale MovieLens and EachMovie data sets appear to justify the aforementioned dual advantages. 1
6 0.10673597 199 nips-2012-Link Prediction in Graphs with Autoregressive Features
7 0.09902367 293 nips-2012-Relax and Randomize : From Value to Algorithms
8 0.098581158 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification
9 0.093989976 86 nips-2012-Convex Multi-view Subspace Learning
10 0.086489715 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods
11 0.082272284 64 nips-2012-Calibrated Elastic Regularization in Matrix Completion
12 0.06761127 290 nips-2012-Recovery of Sparse Probability Measures via Convex Programming
13 0.064124331 212 nips-2012-Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL
14 0.063949533 284 nips-2012-Q-MKL: Matrix-induced Regularization in Multi-Kernel Learning with Applications to Neuroimaging
15 0.063824117 320 nips-2012-Spectral Learning of General Weighted Automata via Constrained Matrix Completion
17 0.0614217 70 nips-2012-Clustering by Nonnegative Matrix Factorization Using Graph Random Walk
18 0.061244309 49 nips-2012-Automatic Feature Induction for Stagewise Collaborative Filtering
19 0.060634468 75 nips-2012-Collaborative Ranking With 17 Parameters
20 0.060242832 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions
topicId topicWeight
[(0, 0.174), (1, 0.035), (2, 0.112), (3, -0.091), (4, 0.083), (5, 0.059), (6, -0.024), (7, 0.064), (8, 0.02), (9, -0.004), (10, 0.027), (11, 0.032), (12, -0.139), (13, 0.028), (14, 0.106), (15, 0.047), (16, 0.084), (17, 0.021), (18, 0.067), (19, -0.088), (20, -0.019), (21, 0.041), (22, -0.021), (23, -0.062), (24, -0.17), (25, 0.035), (26, 0.029), (27, 0.022), (28, 0.024), (29, 0.023), (30, -0.04), (31, 0.013), (32, -0.044), (33, 0.091), (34, -0.009), (35, 0.246), (36, -0.016), (37, -0.06), (38, -0.06), (39, -0.029), (40, -0.067), (41, 0.033), (42, 0.119), (43, 0.066), (44, 0.129), (45, 0.035), (46, -0.005), (47, 0.05), (48, 0.024), (49, 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.98445344 208 nips-2012-Matrix reconstruction with the local max norm
Author: Rina Foygel, Nathan Srebro, Ruslan Salakhutdinov
Abstract: We introduce a new family of matrix norms, the “local max” norms, generalizing existing methods such as the max norm, the trace norm (nuclear norm), and the weighted or smoothed weighted trace norms, which have been extensively used in the literature as regularizers for matrix reconstruction problems. We show that this new family can be used to interpolate between the (weighted or unweighted) trace norm and the more conservative max norm. We test this interpolation on simulated data and on the large-scale Netflix and MovieLens ratings data, and find improved accuracy relative to the existing matrix norms. We also provide theoretical results showing learning guarantees for some of the new norms. 1
2 0.85053968 319 nips-2012-Sparse Prediction with the $k$-Support Norm
Author: Andreas Argyriou, Rina Foygel, Nathan Srebro
Abstract: We derive a novel norm that corresponds to the tightest convex relaxation of sparsity combined with an ℓ2 penalty. We show that this new k-support norm provides a tighter relaxation than the elastic net and can thus be advantageous in sparse prediction problems. We also bound the looseness of the elastic net, thus shedding new light on it and providing justification for its use. 1
3 0.76321787 247 nips-2012-Nonparametric Reduced Rank Regression
Author: Rina Foygel, Michael Horrell, Mathias Drton, John D. Lafferty
Abstract: We propose an approach to multivariate nonparametric regression that generalizes reduced rank regression for linear models. An additive model is estimated for each dimension of a q-dimensional response, with a shared p-dimensional predictor variable. To control the complexity of the model, we employ a functional form of the Ky-Fan or nuclear norm, resulting in a set of function estimates that have low rank. Backfitting algorithms are derived and justified using a nonparametric form of the nuclear norm subdifferential. Oracle inequalities on excess risk are derived that exhibit the scaling behavior of the procedure in the high dimensional setting. The methods are illustrated on gene expression data. 1
4 0.73348606 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach
Author: Xinhua Zhang, Dale Schuurmans, Yao-liang Yu
Abstract: Sparse learning models typically combine a smooth loss with a nonsmooth penalty, such as trace norm. Although recent developments in sparse approximation have offered promising solution methods, current approaches either apply only to matrix-norm constrained problems or provide suboptimal convergence rates. In this paper, we propose a boosting method for regularized learning that guarantees accuracy within O(1/ ) iterations. Performance is further accelerated by interlacing boosting with fixed-rank local optimization—exploiting a simpler local objective than previous work. The proposed method yields state-of-the-art performance on large-scale problems. We also demonstrate an application to latent multiview learning for which we provide the first efficient weak-oracle. 1
5 0.71508473 86 nips-2012-Convex Multi-view Subspace Learning
Author: Martha White, Xinhua Zhang, Dale Schuurmans, Yao-liang Yu
Abstract: Subspace learning seeks a low dimensional representation of data that enables accurate reconstruction. However, in many applications, data is obtained from multiple sources rather than a single source (e.g. an object might be viewed by cameras at different angles, or a document might consist of text and images). The conditional independence of separate sources imposes constraints on their shared latent representation, which, if respected, can improve the quality of a learned low dimensional representation. In this paper, we present a convex formulation of multi-view subspace learning that enforces conditional independence while reducing dimensionality. For this formulation, we develop an efficient algorithm that recovers an optimal data reconstruction by exploiting an implicit convex regularizer, then recovers the corresponding latent representation and reconstruction model, jointly and optimally. Experiments illustrate that the proposed method produces high quality results. 1
6 0.58348608 125 nips-2012-Factoring nonnegative matrices with linear programs
7 0.58092326 64 nips-2012-Calibrated Elastic Regularization in Matrix Completion
8 0.5548535 320 nips-2012-Spectral Learning of General Weighted Automata via Constrained Matrix Completion
9 0.54336655 212 nips-2012-Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL
10 0.53707886 199 nips-2012-Link Prediction in Graphs with Autoregressive Features
11 0.5191285 221 nips-2012-Multi-Stage Multi-Task Feature Learning
12 0.49557447 16 nips-2012-A Polynomial-time Form of Robust Regression
13 0.48828742 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance
14 0.48645696 17 nips-2012-A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter Bound
15 0.48636124 301 nips-2012-Scaled Gradients on Grassmann Manifolds for Matrix Completion
16 0.47000745 246 nips-2012-Nonparametric Max-Margin Matrix Factorization for Collaborative Prediction
18 0.45493224 135 nips-2012-Forging The Graphs: A Low Rank and Positive Semidefinite Graph Learning Approach
19 0.43438649 290 nips-2012-Recovery of Sparse Probability Measures via Convex Programming
20 0.41377446 76 nips-2012-Communication-Efficient Algorithms for Statistical Optimization
topicId topicWeight
[(0, 0.025), (9, 0.022), (38, 0.33), (42, 0.027), (53, 0.012), (54, 0.029), (55, 0.032), (74, 0.04), (76, 0.164), (78, 0.095), (80, 0.073), (92, 0.043)]
simIndex simValue paperId paperTitle
same-paper 1 0.97073293 208 nips-2012-Matrix reconstruction with the local max norm
Author: Rina Foygel, Nathan Srebro, Ruslan Salakhutdinov
Abstract: We introduce a new family of matrix norms, the “local max” norms, generalizing existing methods such as the max norm, the trace norm (nuclear norm), and the weighted or smoothed weighted trace norms, which have been extensively used in the literature as regularizers for matrix reconstruction problems. We show that this new family can be used to interpolate between the (weighted or unweighted) trace norm and the more conservative max norm. We test this interpolation on simulated data and on the large-scale Netflix and MovieLens ratings data, and find improved accuracy relative to the existing matrix norms. We also provide theoretical results showing learning guarantees for some of the new norms. 1
2 0.96430194 268 nips-2012-Perfect Dimensionality Recovery by Variational Bayesian PCA
Author: Shinichi Nakajima, Ryota Tomioka, Masashi Sugiyama, S. D. Babacan
Abstract: The variational Bayesian (VB) approach is one of the best tractable approximations to the Bayesian estimation, and it was demonstrated to perform well in many applications. However, its good performance was not fully understood theoretically. For example, VB sometimes produces a sparse solution, which is regarded as a practical advantage of VB, but such sparsity is hardly observed in the rigorous Bayesian estimation. In this paper, we focus on probabilistic PCA and give more theoretical insight into the empirical success of VB. More specifically, for the situation where the noise variance is unknown, we derive a sufficient condition for perfect recovery of the true PCA dimensionality in the large-scale limit when the size of an observed matrix goes to infinity. In our analysis, we obtain bounds for a noise variance estimator and simple closed-form solutions for other parameters, which themselves are actually very useful for better implementation of VB-PCA. 1
3 0.9582231 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking
Author: Kevin Swersky, Brendan J. Frey, Daniel Tarlow, Richard S. Zemel, Ryan P. Adams
Abstract: In categorical data there is often structure in the number of variables that take on each label. For example, the total number of objects in an image and the number of highly relevant documents per query in web search both tend to follow a structured distribution. In this paper, we study a probabilistic model that explicitly includes a prior distribution over such counts, along with a count-conditional likelihood that defines probabilities over all subsets of a given size. When labels are binary and the prior over counts is a Poisson-Binomial distribution, a standard logistic regression model is recovered, but for other count distributions, such priors induce global dependencies and combinatorics that appear to complicate learning and inference. However, we demonstrate that simple, efficient learning procedures can be derived for more general forms of this model. We illustrate the utility of the formulation by exploring applications to multi-object classification, learning to rank, and top-K classification. 1
4 0.95760381 165 nips-2012-Iterative ranking from pair-wise comparisons
Author: Sahand Negahban, Sewoong Oh, Devavrat Shah
Abstract: The question of aggregating pairwise comparisons to obtain a global ranking over a collection of objects has been of interest for a very long time: be it ranking of online gamers (e.g. MSR’s TrueSkill system) and chess players, aggregating social opinions, or deciding which product to sell based on transactions. In most settings, in addition to obtaining ranking, finding ‘scores’ for each object (e.g. player’s rating) is of interest to understanding the intensity of the preferences. In this paper, we propose a novel iterative rank aggregation algorithm for discovering scores for objects from pairwise comparisons. The algorithm has a natural random walk interpretation over the graph of objects with edges present between two objects if they are compared; the scores turn out to be the stationary probability of this random walk. The algorithm is model independent. To establish the efficacy of our method, however, we consider the popular Bradley-Terry-Luce (BTL) model in which each object has an associated score which determines the probabilistic outcomes of pairwise comparisons between objects. We bound the finite sample error rates between the scores assumed by the BTL model and those estimated by our algorithm. This, in essence, leads to order-optimal dependence on the number of samples required to learn the scores well by our algorithm. Indeed, the experimental evaluation shows that our (model independent) algorithm performs as well as the Maximum Likelihood Estimator of the BTL model and outperforms a recently proposed algorithm by Ammar and Shah [1]. 1
5 0.95741993 181 nips-2012-Learning Multiple Tasks using Shared Hypotheses
Author: Koby Crammer, Yishay Mansour
Abstract: In this work we consider a setting where we have a very large number of related tasks with few examples from each individual task. Rather than either learning each task individually (and having a large generalization error) or learning all the tasks together using a single hypothesis (and suffering a potentially large inherent error), we consider learning a small pool of shared hypotheses. Each task is then mapped to a single hypothesis in the pool (hard association). We derive VC dimension generalization bounds for our model, based on the number of tasks, shared hypothesis and the VC dimension of the hypotheses class. We conducted experiments with both synthetic problems and sentiment of reviews, which strongly support our approach. 1
6 0.95709407 240 nips-2012-Newton-Like Methods for Sparse Inverse Covariance Estimation
7 0.95548552 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach
8 0.95433104 84 nips-2012-Convergence Rate Analysis of MAP Coordinate Minimization Algorithms
9 0.95280594 50 nips-2012-Bandit Algorithms boost Brain Computer Interfaces for motor-task selection of a brain-controlled button
10 0.94972688 143 nips-2012-Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins
11 0.94100565 285 nips-2012-Query Complexity of Derivative-Free Optimization
12 0.94038689 86 nips-2012-Convex Multi-view Subspace Learning
13 0.93736517 262 nips-2012-Optimal Neural Tuning Curves for Arbitrary Stimulus Distributions: Discrimax, Infomax and Minimum $L p$ Loss
14 0.9356941 17 nips-2012-A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter Bound
15 0.93471265 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification
16 0.93253225 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback
17 0.93040425 187 nips-2012-Learning curves for multi-task Gaussian process regression
18 0.92863226 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization
19 0.92611116 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization
20 0.92334944 324 nips-2012-Stochastic Gradient Descent with Only One Projection