nips nips2010 nips2010-210 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jason Lee, Ben Recht, Nathan Srebro, Joel Tropp, Ruslan Salakhutdinov
Abstract: The max-norm was proposed as a convex matrix regularizer in [1] and was shown to be empirically superior to the trace-norm for collaborative filtering problems. Although the max-norm can be computed in polynomial time, there are currently no practical algorithms for solving large-scale optimization problems that incorporate the max-norm. The present work uses a factorization technique of Burer and Monteiro [2] to devise scalable first-order algorithms for convex programs involving the max-norm. These algorithms are applied to solve huge collaborative filtering, graph cut, and clustering problems. Empirically, the new methods outperform mature techniques from all three areas. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: The max-norm was proposed as a convex matrix regularizer in [1] and was shown to be empirically superior to the trace-norm for collaborative filtering problems. [sent-9, score-0.282]
2 The present work uses a factorization technique of Burer and Monteiro [2] to devise scalable first-order algorithms for convex programs involving the max-norm. [sent-11, score-0.122]
3 These algorithms are applied to solve huge collaborative filtering, graph cut, and clustering problems. [sent-12, score-0.419]
4 In a wide variety of applications, such as collaborative filtering, multi-task learning, multi-class learning and clustering of multivariate observations, matrices offer a natural way to tabulate data. [sent-15, score-0.295]
5 For such matrix models, the matrix rank provides an intellectually appealing way to describe complexity. [sent-16, score-0.145]
6 Unfortunately, optimization problems involving rank constraints are computationally intractable except in a few basic cases. [sent-20, score-0.124]
7 A particular example of a low-rank regularizer that has received a huge amount of recent attention is the trace-norm, equal to the sum of the matrix’s singular values (See the comprehensive survey [3] and its bibliography). [sent-22, score-0.144]
8 The trace-norm promotes low-rank decompositions because it minimizes the ℓ1 norm of the vector of singular values, which encourages many zero singular values. [sent-23, score-0.207]
9 Although the trace-norm is a very successful regularizer in many applications, it does not seem to be widely known or appreciated that there are many other interesting norms that promote low rank. [sent-24, score-0.191]
10 The current work focuses on another rank-promoting regularizer, sometimes called the max-norm, that has been proposed as an alternative to the rank for collaborative filtering problems [1, 5]. [sent-26, score-0.226]
11 The max-norm can be defined via matrix factorizations: ||X||_max := inf { ||U||_{2,∞} ||V||_{2,∞} : X = U V^T }, (1) where ||A||_{2,∞} denotes the maximum row norm of a matrix: ||A||_{2,∞} := max_j (∑_k A_jk^2)^{1/2}. [sent-27, score-0.242]
12 When X is positive semidefinite, we may force U = V and then verify that ||X||_max = max_j x_jj, which should explain the terminology. [sent-29, score-0.122]
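As a concrete illustration of definition (1), the following NumPy sketch (helper names are ours, not from the paper) evaluates the bound ||U||_{2,∞} ||V||_{2,∞} supplied by one particular factorization X = U V^T; the max-norm itself is the infimum of this quantity over all factorizations.

```python
import numpy as np

def max_row_norm(A):
    """||A||_{2,inf}: the largest Euclidean norm of a row of A."""
    return np.sqrt((A ** 2).sum(axis=1)).max()

def factored_bound(U, V):
    """Upper bound on ||X||_max from one factorization X = U @ V.T, as in (1)."""
    return max_row_norm(U) * max_row_norm(V)

# Any single factorization gives an upper bound on the max-norm.
rng = np.random.default_rng(0)
U, V = rng.standard_normal((6, 3)), rng.standard_normal((4, 3))
X = U @ V.T
print(factored_bound(U, V))  # >= ||X||_max
```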
13 The fundamental result in the metric theory of tensor products, due to Grothendieck, states that the max-norm is comparable with a nuclear norm (see Chapter 10 of [6]): ||X||_max ≈ inf { ||σ||_1 : X = ∑_j σ_j u_j v_j^T where ||u_j||_∞ = 1 and ||v_j||_∞ = 1 }. [sent-30, score-0.451]
14 The trace-norm, on the other hand, is equal to ||X||_tr := inf { ||σ||_1 : X = ∑_j σ_j u_j v_j^T where ||u_j||_2 = 1 and ||v_j||_2 = 1 }. [sent-34, score-0.278]
15 This perspective reveals that the max-norm promotes low-rank decompositions with factors in ℓ∞, rather than the ℓ2 factors produced by the trace-norm! [sent-35, score-0.079]
16 The literature already contains theoretical and empirical evidence that the max-norm is superior to the trace-norm for certain types of problems. [sent-37, score-0.077]
17 Indeed, the max-norm offers better generalization error bounds for collaborative filtering [5], and it outperforms the trace-norm in small-scale experiments [1]. [sent-38, score-0.14]
18 The paper [7] provides further evidence that the max-norm serves better for collaborative filtering with nonuniform sampling patterns. [sent-39, score-0.227]
19 We believe that the max-norm has not achieved the same prominence as the trace-norm because of an apprehension that it is challenging to solve optimization problems involving a max-norm regularizer. [sent-40, score-0.104]
20 In particular, we study convex programs of the form min_X f(X) + µ ||X||_max, (2) where f is a smooth function and µ is a positive penalty parameter. [sent-43, score-0.12]
21 We also study the bound-constrained problem min_X f(X) subject to ||X||_max ≤ B. (3) [sent-45, score-0.123]
22 Section 3 provides a projected gradient method for (3), and Section 5 develops a stochastic implementation that is appropriate for decomposable loss functions. [sent-47, score-0.341]
23 In Section 6, we apply these new algorithms to large-scale collaborative filtering problems, and we demonstrate performance superior to methods based on the trace-norm. [sent-49, score-0.179]
24 We apply the algorithms to solve enormous instances of graph cut problems, and we establish that clustering based on these cuts outperforms spectral clustering on several data sets. [sent-50, score-0.723]
25 2 The SDP and Factorization Approaches. The max-norm of an m × n matrix X can be expressed as the solution to a semidefinite program: ||X||_max = min t subject to [W1, X; X^T, W2] ⪰ 0, diag(W1) ≤ t, diag(W2) ≤ t. (4) [sent-51, score-0.169]
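For small instances, the semidefinite characterization (4) can be handed directly to a generic SDP modeling tool. The sketch below assumes the CVXPY library and synthetic data; it is a sanity check only since, as noted next, off-the-shelf SDP solvers do not scale to the problem sizes of interest here.

```python
import cvxpy as cp
import numpy as np

X = np.random.randn(8, 5)                        # small synthetic data matrix
m, n = X.shape
t = cp.Variable()
Z = cp.Variable((m + n, m + n), symmetric=True)  # Z plays the role of [W1, X; X^T, W2]
constraints = [Z >> 0,                           # the block matrix is PSD
               Z[:m, m:] == X,                   # off-diagonal block equals X
               cp.diag(Z) <= t]                  # diag(W1) <= t and diag(W2) <= t
cp.Problem(cp.Minimize(t), constraints).solve()
print(t.value)                                   # the max-norm of the synthetic X
```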
26 For large-scale problems, we use an alternative formulation suggested by (1) that explicitly works with a factorization of the decision variable X. [sent-53, score-0.088]
27 Burer and Monteiro showed that, as long as L and R have sufficiently many columns, the global optimum of (4) is equal to that of ||X||_max = min_{(L,R): L R^T = X} max{ ||L||_{2,∞}^2 , ||R||_{2,∞}^2 }. (5) [sent-56, score-0.082]
28 This formulation of the max-norm is nonconvex because it involves a constraint on the product L R^T, but Burer and Monteiro proved that each local minimum of the reformulated problem is also a global optimum [9]. [sent-58, score-0.099]
29 On the other hand, the new formulation is nonconvex with respect to L and R so it might not be efficiently solvable. [sent-60, score-0.099]
30 3 Projected Gradient Method The constrained formulation (3) admits a simple projected gradient algorithm. [sent-62, score-0.34]
31 We replace X with the product L R^T and use the factored form of the max-norm (5) to obtain minimize_{(L,R)} f(L R^T) subject to max{ ||L||_{2,∞}^2 , ||R||_{2,∞}^2 } ≤ B. (6) [sent-63, score-0.086]
32 The projected gradient descent method fixes a step size τ and computes updates with the rule [L; R] ← P_B([L − τ ∇f(L R^T) R ; R − τ ∇f(L R^T)^T L]), where P_B denotes the Euclidean projection onto the set {(L, R) : max( ||L||_{2,∞}^2 , ||R||_{2,∞}^2 ) ≤ B}. [sent-64, score-0.298]
33 This projection can be computed by re-scaling the rows of the current iterate whose norms exceed √B so that their norms equal √B. [sent-65, score-0.224]
34 Rows with norms less than √B are unchanged by the projection. [sent-66, score-0.091]
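A minimal NumPy sketch of this update follows; the projection rescales rows whose squared norm exceeds B, and the gradient callback `grad_f` and helper names are ours rather than the paper's code.

```python
import numpy as np

def project_max_row_norm(A, B):
    """Rescale rows of A whose Euclidean norm exceeds sqrt(B) down to sqrt(B)."""
    norms = np.sqrt((A ** 2).sum(axis=1, keepdims=True))
    scale = np.minimum(1.0, np.sqrt(B) / np.maximum(norms, 1e-12))
    return A * scale

def projected_gradient_step(L, R, grad_f, tau, B):
    """One step of the projected gradient method for (6), with X represented as L @ R.T."""
    G = grad_f(L @ R.T)               # gradient of f at the current X = L R^T
    L_new = L - tau * (G @ R)         # chain rule: d f(L R^T) / dL = grad_f(X) R
    R_new = R - tau * (G.T @ L)       # d f(L R^T) / dR = grad_f(X)^T L
    return project_max_row_norm(L_new, B), project_max_row_norm(R_new, B)
```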
35 The projected gradient algorithm is elegant and simple, and it has an online implementation, described below. [sent-67, score-0.298]
36 We employ a classical proximal point method, proposed by Fukushima and Mine [8], which forms the algorithmic foundation of many popular first-order methods for ℓ1-norm minimization [11, 12] and trace-norm minimization [13, 14]. [sent-72, score-0.148]
37 The new cost function can then be minimized in closed form. [sent-75, score-0.08]
38 Before describing the proximal point algorithm in detail, we first discuss how a simple max-norm problem (the Frobenius norm plus a max-norm penalty) admits an explicit formula for its unique optimal solution. [sent-76, score-0.188]
39 Consider the simple regularization problem minimize_W ||W − V||_F^2 + β ||W||_{2,∞}^2. (7) Algorithm 1 (Compute W = squash(V, β)) Require: A d × D matrix V, a positive scalar β. [sent-77, score-0.092]
40 1: for k = 1 to d set n_k ← ||v_k||_2; 2: sort {n_k} in descending order. [sent-79, score-0.116]
41 We call this procedure squash because the rows of V with large norm have their magnitude clipped at a critical value η = η(V , β). [sent-86, score-0.508]
42 Note that squash can be computed in O(d max{log(d), D}) flops. [sent-90, score-0.392]
43 Computing the row norms requires O(dD) flops, and then the sort requires O(d log d) flops. [sent-91, score-0.137]
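Only the first steps of Algorithm 1 survive in this extraction, so the following NumPy sketch reconstructs a plausible squash(V, β) from the surrounding description and the optimality conditions verified in the appendix below: sort the row norms, scan for the number q of clipped rows and the critical value η = (sum of the q largest norms)/(q + β), then rescale every row whose norm exceeds η down to η. It should be read as an illustration rather than the paper's exact pseudocode.

```python
import numpy as np

def squash(V, beta):
    """Clip the largest row norms of V at a critical value eta.

    A sketch of the solution of (7), reconstructed from the description in the text.
    """
    norms = np.sqrt((V ** 2).sum(axis=1))        # n_k = ||v_k||_2
    sorted_norms = -np.sort(-norms)              # descending order
    csum = np.cumsum(sorted_norms)
    d = len(norms)
    eta = sorted_norms[0] / (1.0 + beta)
    for q in range(1, d + 1):
        candidate = csum[q - 1] / (q + beta)     # eta implied by clipping the q largest rows
        if q < d and sorted_norms[q] > candidate:
            continue                             # the next row would also need clipping
        eta = candidate
        break
    scale = np.minimum(1.0, eta / np.maximum(norms, 1e-12))
    return V * scale[:, None]
```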
44 With the squash function in hand, we can now describe our proximal-point algorithm. [sent-94, score-0.392]
45 Using the squash algorithm, we can solve minimize_A ⟨∇f̃(A_k), A⟩ + τ_k^{-1} ||A − A_k||_F^2 + µ ||A||_{2,∞}^2 (9) in closed form. [sent-100, score-0.498]
46 That is, the optimal solution of (9) is squash(A_k − τ_k ∇f̃(A_k), τ_k µ). [sent-103, score-0.392]
47 The cost function f is replaced with a quadratic approximation localized at the previous iterate Ak , and the resulting approximation (9) can be solved in closed form. [sent-106, score-0.08]
48 This algorithm is guaranteed to converge to a critical point of (8) as long as the step sizes are chosen commensurate with the norm of the Hessian of f̃ [8]. [sent-109, score-0.074]
49 In particular, Nesterov has recently shown that if f has a Lipschitz-continuous gradient with Lipschitz constant L, then the algorithm will converge at a rate of 1/k where k is the iteration counter [15]. [sent-110, score-0.151]
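A hedged sketch of the resulting iteration (fixed step size; the Armijo-style backtracking and stopping tolerance of Algorithm 2 are omitted, and `grad_f` is a user-supplied gradient of the smooth term) reuses the squash sketch above.

```python
def proximal_point(A0, grad_f, mu, tau=1.0, n_iters=100):
    """Proximal-point iteration for min_A f(A) + mu * ||A||_{2,inf}^2.

    Each step solves the quadratic model (9) in closed form via squash.
    """
    A = A0
    for _ in range(n_iters):
        A = squash(A - tau * grad_f(A), tau * mu)
    return A
```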
50 Algorithm 2 (A proximal-point method for max-norm regularization) Require: Algorithm parameters α > 0, 1 > γ > 0, tol > 0. [sent-111, score-0.099]
51 When dealing with very large datasets, S may consist of hundreds of millions of pairs, and there are algorithmic advantages to utilizing stochastic gradient methods that only query a very small subset of S at each iteration. [sent-120, score-0.157]
52 We can also implement an efficient algorithm for stochastic gradient descent for problem (2). [sent-126, score-0.157]
53 If we wanted to apply the squash algorithm to such a stochastic gradient step, only the norms corresponding to L_i and R_j would be modified. [sent-127, score-0.64]
54 Hence, in Algorithm 1, if the set of row norms of L and R is sorted from the previous iteration, we can implement a balanced-tree data structure that allows us to perform individual updates in amortized logarithmic time. [sent-128, score-0.091]
55 In the experiments, however, we demonstrate that the proximal point method is still quite efficient and fast when dealing with stochastic gradient updates corresponding to medium-size batches {(i, j)} selected from S, even if a full sort is performed at each squash operation. [sent-130, score-0.743]
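One plausible way to organize the minibatch variant is sketched below (squared loss on observed entries; the factors are stacked into a single matrix so that ||[L; R]||_{2,∞} equals max(||L||_{2,∞}, ||R||_{2,∞}) and a single squash call covers both; the balanced-tree bookkeeping mentioned above is skipped for clarity). The function and variable names are ours.

```python
import numpy as np

def minibatch_prox_step(L, R, batch, ratings, mu, tau):
    """One stochastic proximal step for a decomposable squared loss over observed pairs.

    `batch` is a list of (i, j) index pairs and `ratings` maps (i, j) to the observed value.
    """
    GL, GR = np.zeros_like(L), np.zeros_like(R)
    for (i, j) in batch:
        resid = L[i] @ R[j] - ratings[(i, j)]
        GL[i] += 2.0 * resid * R[j]       # gradient with respect to row L_i
        GR[j] += 2.0 * resid * L[i]       # gradient with respect to row R_j
    A = np.vstack([L - tau * GL, R - tau * GR])
    A = squash(A, tau * mu)               # joint squash over the stacked factors
    return A[:L.shape[0]], A[L.shape[0]:]
```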
56 We tested our proximal point and projected gradient methods on the Netflix dataset, which is the largest publicly available collaborative filtering dataset. [sent-132, score-0.623]
57 The training set contains 100,480,507 ratings from 480,189 anonymous users on 17,770 movie titles. [sent-133, score-0.122]
58 The “qualification set” pairs were selected by Netflix from the most recent ratings for a subset of the users. [sent-135, score-0.082]
59 It includes users with over 10,000 ratings as well as users who rated fewer than 5 movies. [sent-142, score-0.162]
60 [Figure 1 residue: a small table listing RMSE and training RMSE per algorithm for objectives of the form f(X) + µ ||X||_max, with the x-axis giving the number of epochs (0–40).] [sent-146, score-0.164]
61 Figure 1: Performance of regularization methods on the Netflix dataset. [sent-161, score-0.092]
62 In our experiments, all ratings were normalized to be zero-mean by subtracting 3. [sent-165, score-0.082]
63 Both proximal-point and projected gradient methods performed 40 epochs (or passes through the training set), with parameters {L, R} updated after each minibatch. [sent-168, score-0.344]
64 For the proximal-point method, µ was set to 5×10−4 , and for the projected gradient algorithm, B was set to 2. [sent-173, score-0.298]
65 5 GHz Intel Xeon, our implementation of projected gradient takes 20. [sent-177, score-0.298]
66 Figure 1 shows predictive performance of both the proximal-point and projected gradient algorithms on the training and qualification set. [sent-180, score-0.298]
67 Observe that the proximal-point algorithm converges considerably faster than projected gradient, but both algorithms achieve a similar RMSE of 0. [sent-181, score-0.184]
68 Figure 1, left panel, further shows that the max-norm based regularization significantly outperforms the corresponding trace-norm based regularization, which is widely used in many large-scale collaborative filtering applications. [sent-184, score-0.186]
67 In our nonconvex formulation, this optimization becomes minimize_A −∑_{(i,j)∈E} (1 − A_i A_j^T) subject to ||A||_{2,∞}^2 ≤ 1, i.e., we maximize the relaxed cut value over matrices whose rows lie in the unit ball. [sent-188, score-0.137]
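A compact projected gradient sketch for this max-cut formulation follows (unweighted edges given as an integer array of shape (|E|, 2); step size, width k, and iteration count are illustrative choices, not the paper's settings).

```python
import numpy as np

def maxcut_projected_gradient(n, edges, k=20, tau=0.01, n_iters=500, seed=0):
    """Maximize sum over (i, j) in E of (1 - A_i . A_j) subject to ||A||_{2,inf}^2 <= 1."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, k)) / np.sqrt(k)
    i_idx, j_idx = edges[:, 0], edges[:, 1]
    for _ in range(n_iters):
        G = np.zeros_like(A)
        np.add.at(G, i_idx, A[j_idx])     # gradient of sum A_i . A_j with respect to A_i
        np.add.at(G, j_idx, A[i_idx])     # ... and with respect to A_j
        A = A - tau * G                   # descend on the negated cut value
        norms = np.sqrt((A ** 2).sum(axis=1, keepdims=True))
        A = A / np.maximum(norms, 1.0)    # project each row into the unit ball
    return A
```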
70 Table 1: Performance of projected gradient on Gset graphs, for graphs with |V| between 2000 and 20000 vertices and |E| between roughly 10000 and 40000 edges. [sent-235, score-0.298]
71 The columns report the primal objective within 1% of optimal, the running time to reach 1% of optimal, the number of iterations to reach 1% of optimal, the primal objective using SDPLR, the running time of SDPLR, the number of vertices, and the number of edges. [sent-239, score-0.127]
72 Figure 2: Comparison of spectral clustering (left, panel (a)) with MAX-CUT clustering (right, panel (b)). [sent-241, score-0.155]
73 We tested our projected gradient algorithm on graphs drawn from the Gset, a collection of graphs designed for testing the efficacy of max-cut algorithms [17]. [sent-242, score-0.298]
74 On the same modern hardware, a Matlab implementation of our projected gradient method can reach . [sent-244, score-0.298]
75 For the 2-class clustering problem, we first build a K-nearest neighbor graph with K = 10 and weights w_ij defined as w_ij = max(s_i(j), s_j(i)), with s_i(j) = exp(−||x_i − x_j||^2 / (2σ_i^2)) and σ_i equal to the distance from x_i to its Kth closest neighbor. [sent-247, score-0.339]
76 We solve the MAX-CUT problem on the graph Q to find our cluster assignments. [sent-250, score-0.081]
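A dense O(n^2) sketch of this graph construction is given below (X is the n × D data matrix; the helper name is ours); it builds the weighted K-nearest-neighbor affinity described above.

```python
import numpy as np

def knn_affinity(X, K=10):
    """Build the symmetric affinity w_ij = max(s_i(j), s_j(i)) with local scales sigma_i."""
    n = X.shape[0]
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)     # squared pairwise distances
    order = np.argsort(D2, axis=1)
    sigma = np.sqrt(D2[np.arange(n), order[:, K]])           # distance to the K-th neighbor
    S = np.exp(-D2 / (2.0 * sigma[:, None] ** 2))            # s_i(j)
    W = np.maximum(S, S.T)                                    # symmetrize
    mask = np.zeros((n, n), dtype=bool)
    rows = np.repeat(np.arange(n), K)
    mask[rows, order[:, 1:K + 1].ravel()] = True              # keep the K nearest neighbors
    W = np.where(mask | mask.T, W, 0.0)
    np.fill_diagonal(W, 0.0)
    return W
```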
77 For the two moons experiments, we fix D = 100, n = 2000 and σ = . [sent-253, score-0.106]
78 For the clustering experiments, we repeat the randomized rounding technique [16] for 100 trials, and we choose the rounding with highest primal objective. [sent-257, score-0.282]
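The rounding step can be sketched as follows (unweighted edges; each trial signs the rows of A against a random Gaussian direction, and the labeling with the largest cut value, equivalently the highest rounded objective, is kept). The helper is illustrative only.

```python
import numpy as np

def round_maxcut(A, edges, n_trials=100, seed=0):
    """Goemans-Williamson style rounding: sign(A g) for random Gaussian directions g."""
    rng = np.random.default_rng(seed)
    best_labels, best_cut = None, -np.inf
    for _ in range(n_trials):
        g = rng.standard_normal(A.shape[1])
        labels = np.sign(A @ g)
        labels[labels == 0] = 1.0
        cut = np.sum(labels[edges[:, 0]] != labels[edges[:, 1]])  # edges crossing the cut
        if cut > best_cut:
            best_cut, best_labels = cut, labels
    return best_labels, best_cut
```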
79 We compare our MAX-CUT clusterings with the spectral clustering method [20] and the Total Variation Graph Cut algorithm [19]. [sent-258, score-0.247]
80 Figure 2 shows the clustering results for spectral clustering and max-cut clustering. [sent-259, score-0.402]
81 In all the trials, spectral clustering incorrectly clustered the two ends of both half-circles. [sent-260, score-0.247]
82 For the clustering problems, the two measures of performance we consider are misclassification error rate (number of misclassified points divided by n) and cut cost. [sent-261, score-0.395]
83 The MAX-CUT clustering obtained smaller misclassification error in 98 of the 100 trials we performed and smaller cut cost in every trial. [sent-263, score-0.441]
84 Error rate, cut cost, and running time comparison for the max-cut, spectral, and total variation (TV) algorithms. [sent-282, score-0.275]
85 The balance of each cut is reported relative to |V1| + |V2|, and the two moons results are averaged over 100 trials. [sent-283, score-0.24]
86 The two moons experiments take approximately 1 minute to run 1,000 iterations. [sent-286, score-0.106]
87 Our MAX-CUT clustering algorithm again performs substantially better than the spectral method. [sent-289, score-0.247]
88 7 Summary In this paper we presented practical methods for solving very large scale optimization problems involving a max-norm constraint or regularizer. [sent-290, score-0.071]
89 Using these approaches, we showed evidence that the max-norm can often be superior to established techniques such as trace-norm regularization and spectral clustering, supplementing previous evidence on small-scale problems. [sent-291, score-0.253]
90 JAT was supported by ONR award N00014-08-1-0883, DARPA award N66001-08-1-2065, and AFOSR award FA9550-09-1-0643. [sent-294, score-0.129]
91 ∑_{i=1}^d p_i = β, (d) ||w_i||_2^2 ≤ t. With our candidate W = squash(V, β), we need only find t and p to verify the optimality conditions. [sent-297, score-0.107]
92 Let π be as in Algorithm 1 and set t = η^2 and p_k = ||v_k||/η − 1 for 1 ≤ π(k) ≤ q and p_k = 0 otherwise. This definition of p immediately gives (a). [sent-298, score-0.07]
93 For (b), note that by the definition of q, ||v_k|| ≥ η for 1 ≤ π(k) ≤ q. [sent-299, score-0.07]
94 Moreover, ∑_{k=1}^d p_k = ∑_{1 ≤ π(k) ≤ q} ||v_k||/η − q = (q + β) − q = β, yielding (c). [sent-301, score-0.07]
95 Guaranteed minimum rank solutions of matrix equations via nuclear norm minimization. [sent-315, score-0.19]
96 A generalized proximal point algorithm for certain non-convex minimization problems. [sent-345, score-0.148]
97 A singular value thresholding algorithm for matrix completion. [sent-376, score-0.09]
98 Fixed point and Bregman iterative methods for matrix rank minimization. [sent-383, score-0.099]
99 Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. [sent-400, score-0.273]
100 A total variation-based graph clustering algorithm for Cheeger ratio cuts. [sent-414, score-0.203]
wordName wordTfidf (topN-words)
[('squash', 0.392), ('ak', 0.315), ('cut', 0.24), ('projected', 0.184), ('sdplr', 0.181), ('lr', 0.169), ('burer', 0.159), ('clustering', 0.155), ('proximal', 0.148), ('ix', 0.143), ('collaborative', 0.14), ('fukushima', 0.121), ('gset', 0.121), ('gradient', 0.114), ('preprint', 0.114), ('ltering', 0.107), ('net', 0.106), ('moons', 0.106), ('quali', 0.097), ('monteiro', 0.097), ('spectral', 0.092), ('norms', 0.091), ('email', 0.089), ('rj', 0.087), ('ratings', 0.082), ('max', 0.082), ('pb', 0.076), ('yij', 0.074), ('uj', 0.072), ('vk', 0.07), ('rmse', 0.07), ('wij', 0.068), ('nathan', 0.068), ('semide', 0.067), ('samuel', 0.062), ('mine', 0.062), ('grothendieck', 0.06), ('maxnorm', 0.06), ('minimizew', 0.06), ('nonconvex', 0.057), ('regularizer', 0.057), ('primal', 0.057), ('wi', 0.054), ('li', 0.053), ('rank', 0.053), ('tol', 0.053), ('pi', 0.053), ('nuclear', 0.051), ('srebro', 0.051), ('vj', 0.05), ('nonuniform', 0.049), ('graph', 0.048), ('matrix', 0.046), ('factorization', 0.046), ('sort', 0.046), ('regularization', 0.046), ('cost', 0.046), ('epochs', 0.046), ('kkt', 0.046), ('factored', 0.045), ('singular', 0.044), ('mnist', 0.044), ('award', 0.043), ('promote', 0.043), ('armijo', 0.043), ('stochastic', 0.043), ('huge', 0.043), ('http', 0.042), ('formulation', 0.042), ('rows', 0.042), ('misclassi', 0.042), ('dd', 0.041), ('decompositions', 0.041), ('ruslan', 0.041), ('subject', 0.041), ('norm', 0.04), ('users', 0.04), ('maxj', 0.04), ('recht', 0.04), ('minimize', 0.039), ('superior', 0.039), ('programs', 0.038), ('promotes', 0.038), ('involving', 0.038), ('evidence', 0.038), ('counter', 0.037), ('pd', 0.037), ('available', 0.037), ('running', 0.035), ('rounding', 0.035), ('jth', 0.034), ('balancing', 0.034), ('sdp', 0.034), ('closed', 0.034), ('inf', 0.034), ('critical', 0.034), ('solve', 0.033), ('nesterov', 0.033), ('jason', 0.033), ('problems', 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999881 210 nips-2010-Practical Large-Scale Optimization for Max-norm Regularization
Author: Jason Lee, Ben Recht, Nathan Srebro, Joel Tropp, Ruslan Salakhutdinov
Abstract: The max-norm was proposed as a convex matrix regularizer in [1] and was shown to be empirically superior to the trace-norm for collaborative filtering problems. Although the max-norm can be computed in polynomial time, there are currently no practical algorithms for solving large-scale optimization problems that incorporate the max-norm. The present work uses a factorization technique of Burer and Monteiro [2] to devise scalable first-order algorithms for convex programs involving the max-norm. These algorithms are applied to solve huge collaborative filtering, graph cut, and clustering problems. Empirically, the new methods outperform mature techniques from all three areas. 1
2 0.17361596 48 nips-2010-Collaborative Filtering in a Non-Uniform World: Learning with the Weighted Trace Norm
Author: Nathan Srebro, Ruslan Salakhutdinov
Abstract: We show that matrix completion with trace-norm regularization can be significantly hurt when entries of the matrix are sampled non-uniformly, but that a properly weighted version of the trace-norm regularizer works well with non-uniform sampling. We show that the weighted trace-norm regularization indeed yields significant gains on the highly non-uniformly sampled Netflix dataset.
3 0.15645291 221 nips-2010-Random Projections for $k$-means Clustering
Author: Christos Boutsidis, Anastasios Zouzias, Petros Drineas
Abstract: This paper discusses the topic of dimensionality reduction for k-means clustering. We prove that any set of n points in d dimensions (rows in a matrix A ∈ Rn×d ) can be projected into t = Ω(k/ε2 ) dimensions, for any ε ∈ (0, 1/3), in O(nd⌈ε−2 k/ log(d)⌉) time, such that with constant probability the optimal k-partition of the point set is preserved within a factor of 2 + √ The projection is done ε. √ by post-multiplying A with a d × t random matrix R having entries +1/ t or −1/ t with equal probability. A numerical implementation of our technique and experiments on a large face images dataset verify the speed and the accuracy of our theoretical results.
Author: Matthias Hein, Thomas Bühler
Abstract: Many problems in machine learning and statistics can be formulated as (generalized) eigenproblems. In terms of the associated optimization problem, computing linear eigenvectors amounts to finding critical points of a quadratic function subject to quadratic constraints. In this paper we show that a certain class of constrained optimization problems with nonquadratic objective and constraints can be understood as nonlinear eigenproblems. We derive a generalization of the inverse power method which is guaranteed to converge to a nonlinear eigenvector. We apply the inverse power method to 1-spectral clustering and sparse PCA which can naturally be formulated as nonlinear eigenproblems. In both applications we achieve state-of-the-art results in terms of solution quality and runtime. Moving beyond the standard eigenproblem should be useful also in many other applications and our inverse power method can be easily adapted to new problems. 1
5 0.12256341 230 nips-2010-Robust Clustering as Ensembles of Affinity Relations
Author: Hairong Liu, Longin J. Latecki, Shuicheng Yan
Abstract: In this paper, we regard clustering as ensembles of k-ary affinity relations and clusters correspond to subsets of objects with maximal average affinity relations. The average affinity relation of a cluster is relaxed and well approximated by a constrained homogenous function. We present an efficient procedure to solve this optimization problem, and show that the underlying clusters can be robustly revealed by using priors systematically constructed from the data. Our method can automatically select some points to form clusters, leaving other points un-grouped; thus it is inherently robust to large numbers of outliers, which has seriously limited the applicability of classical methods. Our method also provides a unified solution to clustering from k-ary affinity relations with k ≥ 2, that is, it applies to both graph-based and hypergraph-based clustering problems. Both theoretical analysis and experimental results show the superiority of our method over classical solutions to the clustering problem, especially when there exists a large number of outliers.
6 0.11872406 273 nips-2010-Towards Property-Based Classification of Clustering Paradigms
7 0.11602692 162 nips-2010-Link Discovery using Graph Feature Tracking
8 0.10831907 275 nips-2010-Transduction with Matrix Completion: Three Birds with One Stone
9 0.10569905 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
10 0.10446548 181 nips-2010-Network Flow Algorithms for Structured Sparsity
11 0.10394668 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods
12 0.09308406 12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups
13 0.085178405 63 nips-2010-Distributed Dual Averaging In Networks
14 0.08215826 195 nips-2010-Online Learning in The Manifold of Low-Rank Matrices
15 0.07989309 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
16 0.07604599 261 nips-2010-Supervised Clustering
17 0.074767575 258 nips-2010-Structured sparsity-inducing norms through submodular functions
18 0.072378695 166 nips-2010-Minimum Average Cost Clustering
19 0.072185963 136 nips-2010-Large-Scale Matrix Factorization with Missing Data under Additional Constraints
20 0.070150331 124 nips-2010-Inductive Regularized Learning of Kernel Functions
topicId topicWeight
[(0, 0.221), (1, 0.042), (2, 0.112), (3, 0.045), (4, 0.051), (5, -0.12), (6, 0.051), (7, 0.025), (8, 0.061), (9, -0.018), (10, 0.105), (11, -0.04), (12, 0.166), (13, 0.015), (14, 0.172), (15, -0.041), (16, 0.03), (17, 0.002), (18, 0.001), (19, 0.071), (20, 0.117), (21, 0.06), (22, -0.052), (23, -0.098), (24, 0.018), (25, -0.077), (26, 0.001), (27, -0.038), (28, 0.064), (29, -0.029), (30, -0.066), (31, -0.078), (32, 0.091), (33, -0.055), (34, -0.096), (35, 0.069), (36, 0.035), (37, -0.106), (38, 0.037), (39, -0.032), (40, 0.125), (41, -0.08), (42, 0.048), (43, -0.006), (44, 0.055), (45, 0.076), (46, -0.082), (47, -0.06), (48, -0.065), (49, -0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.95432955 210 nips-2010-Practical Large-Scale Optimization for Max-norm Regularization
Author: Jason Lee, Ben Recht, Nathan Srebro, Joel Tropp, Ruslan Salakhutdinov
Abstract: The max-norm was proposed as a convex matrix regularizer in [1] and was shown to be empirically superior to the trace-norm for collaborative filtering problems. Although the max-norm can be computed in polynomial time, there are currently no practical algorithms for solving large-scale optimization problems that incorporate the max-norm. The present work uses a factorization technique of Burer and Monteiro [2] to devise scalable first-order algorithms for convex programs involving the max-norm. These algorithms are applied to solve huge collaborative filtering, graph cut, and clustering problems. Empirically, the new methods outperform mature techniques from all three areas. 1
2 0.70232296 110 nips-2010-Guaranteed Rank Minimization via Singular Value Projection
Author: Prateek Jain, Raghu Meka, Inderjit S. Dhillon
Abstract: Minimizing the rank of a matrix subject to affine constraints is a fundamental problem with many important applications in machine learning and statistics. In this paper we propose a simple and fast algorithm SVP (Singular Value Projection) for rank minimization under affine constraints (ARMP) and show that SVP recovers the minimum rank solution for affine constraints that satisfy a restricted isometry property (RIP). Our method guarantees geometric convergence rate even in the presence of noise and requires strictly weaker assumptions on the RIP constants than the existing methods. We also introduce a Newton-step for our SVP framework to speed-up the convergence with substantial empirical gains. Next, we address a practically important application of ARMP - the problem of lowrank matrix completion, for which the defining affine constraints do not directly obey RIP, hence the guarantees of SVP do not hold. However, we provide partial progress towards a proof of exact recovery for our algorithm by showing a more restricted isometry property and observe empirically that our algorithm recovers low-rank incoherent matrices from an almost optimal number of uniformly sampled entries. We also demonstrate empirically that our algorithms outperform existing methods, such as those of [5, 18, 14], for ARMP and the matrix completion problem by an order of magnitude and are also more robust to noise and sampling schemes. In particular, results show that our SVP-Newton method is significantly robust to noise and performs impressively on a more realistic power-law sampling scheme for the matrix completion problem. 1
3 0.70201856 48 nips-2010-Collaborative Filtering in a Non-Uniform World: Learning with the Weighted Trace Norm
Author: Nathan Srebro, Ruslan Salakhutdinov
Abstract: We show that matrix completion with trace-norm regularization can be significantly hurt when entries of the matrix are sampled non-uniformly, but that a properly weighted version of the trace-norm regularizer works well with non-uniform sampling. We show that the weighted trace-norm regularization indeed yields significant gains on the highly non-uniformly sampled Netflix dataset.
4 0.69506234 221 nips-2010-Random Projections for $k$-means Clustering
Author: Christos Boutsidis, Anastasios Zouzias, Petros Drineas
Abstract: This paper discusses the topic of dimensionality reduction for k-means clustering. We prove that any set of n points in d dimensions (rows in a matrix A ∈ Rn×d ) can be projected into t = Ω(k/ε2 ) dimensions, for any ε ∈ (0, 1/3), in O(nd⌈ε−2 k/ log(d)⌉) time, such that with constant probability the optimal k-partition of the point set is preserved within a factor of 2 + √ The projection is done ε. √ by post-multiplying A with a d × t random matrix R having entries +1/ t or −1/ t with equal probability. A numerical implementation of our technique and experiments on a large face images dataset verify the speed and the accuracy of our theoretical results.
Author: Matthias Hein, Thomas Bühler
Abstract: Many problems in machine learning and statistics can be formulated as (generalized) eigenproblems. In terms of the associated optimization problem, computing linear eigenvectors amounts to finding critical points of a quadratic function subject to quadratic constraints. In this paper we show that a certain class of constrained optimization problems with nonquadratic objective and constraints can be understood as nonlinear eigenproblems. We derive a generalization of the inverse power method which is guaranteed to converge to a nonlinear eigenvector. We apply the inverse power method to 1-spectral clustering and sparse PCA which can naturally be formulated as nonlinear eigenproblems. In both applications we achieve state-of-the-art results in terms of solution quality and runtime. Moving beyond the standard eigenproblem should be useful also in many other applications and our inverse power method can be easily adapted to new problems. 1
6 0.63920391 136 nips-2010-Large-Scale Matrix Factorization with Missing Data under Additional Constraints
7 0.61666572 248 nips-2010-Sparse Inverse Covariance Selection via Alternating Linearization Methods
8 0.59310335 230 nips-2010-Robust Clustering as Ensembles of Affinity Relations
9 0.55604446 162 nips-2010-Link Discovery using Graph Feature Tracking
10 0.55457991 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
11 0.54238141 273 nips-2010-Towards Property-Based Classification of Clustering Paradigms
12 0.54213339 195 nips-2010-Online Learning in The Manifold of Low-Rank Matrices
13 0.52634305 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods
14 0.52286446 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
15 0.50546199 45 nips-2010-CUR from a Sparse Optimization Viewpoint
16 0.49826318 202 nips-2010-Parallelized Stochastic Gradient Descent
17 0.48782787 12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups
18 0.48165759 31 nips-2010-An analysis on negative curvature induced by singularity in multi-layer neural-network learning
19 0.47986704 287 nips-2010-Worst-Case Linear Discriminant Analysis
20 0.46439809 106 nips-2010-Global Analytic Solution for Variational Bayesian Matrix Factorization
topicId topicWeight
[(13, 0.114), (17, 0.014), (27, 0.046), (30, 0.068), (35, 0.024), (45, 0.186), (50, 0.059), (51, 0.239), (52, 0.044), (60, 0.048), (77, 0.048), (78, 0.011), (90, 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.81319255 210 nips-2010-Practical Large-Scale Optimization for Max-norm Regularization
Author: Jason Lee, Ben Recht, Nathan Srebro, Joel Tropp, Ruslan Salakhutdinov
Abstract: The max-norm was proposed as a convex matrix regularizer in [1] and was shown to be empirically superior to the trace-norm for collaborative filtering problems. Although the max-norm can be computed in polynomial time, there are currently no practical algorithms for solving large-scale optimization problems that incorporate the max-norm. The present work uses a factorization technique of Burer and Monteiro [2] to devise scalable first-order algorithms for convex programs involving the max-norm. These algorithms are applied to solve huge collaborative filtering, graph cut, and clustering problems. Empirically, the new methods outperform mature techniques from all three areas. 1
2 0.76589394 134 nips-2010-LSTD with Random Projections
Author: Mohammad Ghavamzadeh, Alessandro Lazaric, Odalric Maillard, Rémi Munos
Abstract: We consider the problem of reinforcement learning in high-dimensional spaces when the number of features is bigger than the number of samples. In particular, we study the least-squares temporal difference (LSTD) learning algorithm when a space of low dimension is generated with a random projection from a highdimensional space. We provide a thorough theoretical analysis of the LSTD with random projections and derive performance bounds for the resulting algorithm. We also show how the error of LSTD with random projections is propagated through the iterations of a policy iteration algorithm and provide a performance bound for the resulting least-squares policy iteration (LSPI) algorithm. 1
3 0.72338408 44 nips-2010-Brain covariance selection: better individual functional connectivity models using population prior
Author: Gael Varoquaux, Alexandre Gramfort, Jean-baptiste Poline, Bertrand Thirion
Abstract: Spontaneous brain activity, as observed in functional neuroimaging, has been shown to display reproducible structure that expresses brain architecture and carries markers of brain pathologies. An important view of modern neuroscience is that such large-scale structure of coherent activity reflects modularity properties of brain connectivity graphs. However, to date, there has been no demonstration that the limited and noisy data available in spontaneous activity observations could be used to learn full-brain probabilistic models that generalize to new data. Learning such models entails two main challenges: i) modeling full brain connectivity is a difficult estimation problem that faces the curse of dimensionality and ii) variability between subjects, coupled with the variability of functional signals between experimental runs, makes the use of multiple datasets challenging. We describe subject-level brain functional connectivity structure as a multivariate Gaussian process and introduce a new strategy to estimate it from group data, by imposing a common structure on the graphical model in the population. We show that individual models learned from functional Magnetic Resonance Imaging (fMRI) data using this population prior generalize better to unseen data than models based on alternative regularization schemes. To our knowledge, this is the first report of a cross-validated model of spontaneous brain activity. Finally, we use the estimated graphical model to explore the large-scale characteristics of functional architecture and show for the first time that known cognitive networks appear as the integrated communities of functional connectivity graph. 1
4 0.70703983 261 nips-2010-Supervised Clustering
Author: Pranjal Awasthi, Reza B. Zadeh
Abstract: Despite the ubiquity of clustering as a tool in unsupervised learning, there is not yet a consensus on a formal theory, and the vast majority of work in this direction has focused on unsupervised clustering. We study a recently proposed framework for supervised clustering where there is access to a teacher. We give an improved generic algorithm to cluster any concept class in that model. Our algorithm is query-efficient in the sense that it involves only a small amount of interaction with the teacher. We also present and study two natural generalizations of the model. The model assumes that the teacher response to the algorithm is perfect. We eliminate this limitation by proposing a noisy model and give an algorithm for clustering the class of intervals in this noisy model. We also propose a dynamic model where the teacher sees a random subset of the points. Finally, for datasets satisfying a spectrum of weak to strong properties, we give query bounds, and show that a class of clustering functions containing Single-Linkage will find the target clustering under the strongest property.
5 0.7065931 221 nips-2010-Random Projections for $k$-means Clustering
Author: Christos Boutsidis, Anastasios Zouzias, Petros Drineas
Abstract: This paper discusses the topic of dimensionality reduction for k-means clustering. We prove that any set of n points in d dimensions (rows in a matrix A ∈ Rn×d ) can be projected into t = Ω(k/ε2 ) dimensions, for any ε ∈ (0, 1/3), in O(nd⌈ε−2 k/ log(d)⌉) time, such that with constant probability the optimal k-partition of the point set is preserved within a factor of 2 + √ The projection is done ε. √ by post-multiplying A with a d × t random matrix R having entries +1/ t or −1/ t with equal probability. A numerical implementation of our technique and experiments on a large face images dataset verify the speed and the accuracy of our theoretical results.
6 0.70309913 117 nips-2010-Identifying graph-structured activation patterns in networks
7 0.69999456 265 nips-2010-The LASSO risk: asymptotic results and real world examples
8 0.69757193 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
9 0.69360834 45 nips-2010-CUR from a Sparse Optimization Viewpoint
10 0.69303054 262 nips-2010-Switched Latent Force Models for Movement Segmentation
11 0.69283521 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
12 0.69256216 30 nips-2010-An Inverse Power Method for Nonlinear Eigenproblems with Applications in 1-Spectral Clustering and Sparse PCA
13 0.69200277 284 nips-2010-Variational bounds for mixed-data factor analysis
14 0.69189346 136 nips-2010-Large-Scale Matrix Factorization with Missing Data under Additional Constraints
15 0.69003999 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
16 0.68978149 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
17 0.68950629 222 nips-2010-Random Walk Approach to Regret Minimization
18 0.68923706 48 nips-2010-Collaborative Filtering in a Non-Uniform World: Learning with the Weighted Trace Norm
19 0.68749744 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
20 0.68695158 146 nips-2010-Learning Multiple Tasks using Manifold Regularization