nips nips2012 nips2012-29 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Xinhua Zhang, Dale Schuurmans, Yao-liang Yu
Abstract: Sparse learning models typically combine a smooth loss with a nonsmooth penalty, such as the trace norm. Although recent developments in sparse approximation have offered promising solution methods, current approaches either apply only to matrix-norm constrained problems or provide suboptimal convergence rates. In this paper, we propose a boosting method for regularized learning that guarantees ε accuracy within O(1/ε) iterations. Performance is further accelerated by interlacing boosting with fixed-rank local optimization—exploiting a simpler local objective than previous work. The proposed method yields state-of-the-art performance on large-scale problems. We also demonstrate an application to latent multiview learning for which we provide the first efficient weak-oracle.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Sparse learning models typically combine a smooth loss with a nonsmooth penalty, such as the trace norm. [sent-3, score-0.238]
2 Although recent developments in sparse approximation have offered promising solution methods, current approaches either apply only to matrix-norm constrained problems or provide suboptimal convergence rates. [sent-4, score-0.121]
3 In this paper, we propose a boosting method for regularized learning that guarantees ε accuracy within O(1/ε) iterations. [sent-5, score-0.215]
4 Performance is further accelerated by interlacing boosting with fixed-rank local optimization—exploiting a simpler local objective than previous work. [sent-6, score-0.418]
5 We also demonstrate an application to latent multiview learning for which we provide the first efficient weak-oracle. [sent-8, score-0.169]
6 1 Introduction Our focus in this paper is on unsupervised learning problems such as matrix factorization or latent subspace identification. [sent-9, score-0.111]
7 We focus in particular on coding or dictionary learning problems, where one seeks to decompose a data matrix X into an approximate factorization X̂ = U V that minimizes reconstruction error while satisfying other properties like low rank or sparsity in the factors. [sent-12, score-0.27]
8 Since imposing a bound on the rank or number of non-zero elements generally makes the problem intractable, such constraints are usually replaced by carefully designed regularizers that promote low rank or sparse solutions [1–3]. [sent-13, score-0.257]
9 Interestingly, for a variety of dictionary constraints and regularizers, the problem is equivalent to a matrix-norm regularized problem on the reconstruction matrix X̂ [1, 4]. [sent-14, score-0.197]
10 One intensively studied example is the trace norm, which corresponds to bounding the Euclidean norm of the code vectors in U while penalizing V via its ℓ2,1 norm. [sent-15, score-0.247]
11 [8] converts the trace norm problem into an optimization over positive semidefinite (PSD) matrices, then solves the problem via greedy sparse approximation [9, 10]. [sent-18, score-0.39]
12 [11] further generalizes the algorithm from trace norm to gauge functions [12], dispensing with the PSD conversion. [sent-19, score-0.299]
13 Despite their theoretical equivalence, many practical applications require the solution to the regularized problem, e. [sent-21, score-0.092]
14 In this paper, we optimize the regularized objective directly by reformulating the problem in the framework of ℓ1 penalized boosting [13, 14], allowing it to be solved with a general procedure developed in Section 2. [sent-24, score-0.35]
15 Each iteration of this procedure calls an oracle to find a weak hypothesis. (∗ Xinhua Zhang is now at National ICT Australia (NICTA), Machine Learning Group.) [sent-25, score-0.155]
16 Our first key contribution is to establish that, when the loss is convex and smooth, the procedure finds an ε accurate solution within O(1/ε) iterations. [sent-28, score-0.222]
17 To the best of our knowledge, this is the first O(1/ε) objective-value rate that has been rigorously established for ℓ1 regularized boosting. [sent-29, score-0.167]
18 [15] considered a similar boosting approach, but required totally corrective updates. [sent-30, score-0.302]
19 We also show in Section 3 how the empirical performance of ℓ1 penalized boosting can be greatly improved by introducing an auxiliary rank-constrained local optimization within each iteration. [sent-33, score-0.232]
20 Interlacing rank constrained optimization with sparse updates has been shown effective in semi-definite programming [19–21]. [sent-34, score-0.225]
21 [22] applied the idea to trace norm optimization by factoring the reconstruction matrix into two orthonormal matrices and a positive semi-definite matrix. [sent-35, score-0.401]
22 Unfortunately, this strategy creates a very difficult constrained optimization problem, compelling [22] to resort to manifold techniques. [sent-36, score-0.112]
23 Instead, we use a simpler variational representation of matrix norms that leads to a new local objective that is both unconstrained and smooth. [sent-37, score-0.252]
24 This allows the application of much simpler and much more efficient solvers to greatly accelerate the overall optimization. [sent-38, score-0.119]
25 Underlying standard sparse approximation methods is an oracle that efficiently selects a weak hypothesis (using boosting terminology). [sent-39, score-0.349]
26 Our next major contribution, in Section 4, is to formulate an efficient oracle for latent multiview factorization models [2, 4], based on a positive semi-definite relaxation that we prove incurs no gap. [sent-41, score-0.278]
27 Finally, we point out that our focus in this paper is on the optimization of convex problems that relax the “hard” rank constraint. [sent-42, score-0.194]
28 Notation We use γ_K to denote the gauge induced by a set K; || · ||_∗ to denote the dual norm of || · ||; and || · ||_F, || · ||_tr and || · ||_sp to denote the Frobenius norm, trace norm and spectral norm respectively. [sent-44, score-0.695]
29 ||X||_{R,1} denotes the row-wise norm Σ_i ||X_{i:}||_R, while ⟨X, Y⟩ := tr(X^T Y) denotes the inner product. [sent-45, score-0.141]
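To make the notation concrete, here is a small helper sketch (illustrative only; it fixes || · ||_R to the Euclidean norm, one of several allowed choices, and the function names are ours):

```python
import numpy as np

def block_norm_R1(X):
    """||X||_{R,1}: sum of the row norms of X; here ||.||_R is the Euclidean norm."""
    return float(np.sum(np.linalg.norm(X, axis=1)))

def inner(X, Y):
    """<X, Y> := tr(X^T Y), the standard matrix inner product."""
    return float(np.trace(X.T @ Y))
```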
30 2 The Boosting Framework with ℓ1 Regularization Consider a coding problem where one is presented with an n×m matrix Z, whose columns correspond to m training examples. [sent-47, score-0.116]
31 Our goal is to learn an n×k dictionary matrix U , consisting of k basis vectors, and a k × m coefficient matrix V , such that U V approximates Z under some loss L(U V ). [sent-48, score-0.262]
32 The bases, i.e. the columns of U, are constrained to the unit ball of some norm || · ||_C. [sent-52, score-0.141]
33 We do not fix the number of bases k, and instead impose regularizers on the matrix V that encourage a low rank or sparse solution. [sent-56, score-0.198]
34 To be more specific, the following optimization problem lies at the heart of many sparse learning models [e.g. 1, 3, 4, 24, 25]: [sent-57, score-0.106]
35 min_{U: ||U_{:i}||_C ≤ 1} min_{Ṽ} L(U Ṽ) + λ ||Ṽ||_{R,1},   (1)   where λ ≥ 0 specifies the tradeoff between loss and regularization. [sent-59, score-0.196]
36 The || · ||_R norm in the block (R,1) norm provides the flexibility of promoting useful structures in the solution, e.g. [sent-60, score-0.282]
37 the ℓ1 norm for sparse solutions, the ℓ2 norm for low-rank solutions, and block structured norms for group sparsity. [sent-62, score-0.44]
38 Now (1) can be reformulated by introducing the reconstruction matrix X := U Ṽ:  (1) = min_X L(X) + λ min_{U,Ṽ: ||U_{:i}||_C ≤ 1, U Ṽ = X} ||Ṽ||_{R,1} = min_X L(X) + λ min_{σ,U,V: σ ≥ 0, U Σ V = X} Σ_i σ_i,   (2)   where Σ = diag{σ_i}, and U and V in the last minimization also carry norm constraints. [sent-64, score-0.361]
39 First it reveals that the regularizer essentially seeks a rank-one decomposition of the reconstruction matrix X, and penalizes the ℓ1 norm of the combination coefficients as a proxy of the “rank”. [sent-66, score-0.26]
40 Second, the regularizer in (2) is now expressed precisely in the form of an ℓ1 norm on the coefficients σ_i (Algorithm 1: the vanilla boosting algorithm). [sent-67, score-0.177]
41 An even more aggressive scheme is the totally corrective update [15], which in Step 4 finds the weights for all A_i's selected so far:  min_{σ_i ≥ 0} L(Σ_{i=1}^k σ_i A_i) + λ Σ_{i=1}^k σ_i.   (7) [sent-88, score-0.184]
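For concreteness, below is a minimal sketch of this ℓ1-penalized boosting loop with a totally corrective weight update, specialized to the trace-norm case (both norms Euclidean) and a squared loss; the function names, the dense-SVD oracle and the L-BFGS-B weight solver are illustrative choices of ours, not the paper's implementation:

```python
import numpy as np
from scipy.optimize import minimize

def oracle_top_singular_pair(G):
    """Weak-hypothesis oracle for the trace-norm case: the rank-one atom
    u v^T (unit Euclidean norms) maximizing <-G, u v^T>."""
    U, s, Vt = np.linalg.svd(-G)
    return np.outer(U[:, 0], Vt[0, :])

def boost_trace_norm(Z, lam, iters=50):
    """l1-penalized boosting for min_X 0.5*||X - Z||_F^2 + lam*||X||_tr,
    with a totally corrective weight update at every iteration."""
    m, n = Z.shape
    atoms = []
    sigma = np.zeros(0)
    X = np.zeros((m, n))
    for _ in range(iters):
        G = X - Z                                   # gradient of the squared loss at X
        atoms.append(oracle_top_singular_pair(G))   # Step 3: weak hypothesis
        A = np.stack([a.ravel() for a in atoms], axis=1)
        z = Z.ravel()

        def obj(s_):                                # totally corrective update, cf. (7)
            r = A @ s_ - z
            return 0.5 * r @ r + lam * s_.sum(), A.T @ r + lam

        res = minimize(obj, np.append(sigma, 0.0), jac=True, method="L-BFGS-B",
                       bounds=[(0.0, None)] * len(atoms))
        sigma = res.x
        X = (A @ sigma).reshape(m, n)
    return X, sigma
```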
42 For boosting without regularization, the 1/ε rate of convergence is known to be optimal [27]. [sent-90, score-0.178]
43 If a better solution can be found locally (e.g. by local greedy search), then the overall optimization can benefit from it. [sent-98, score-0.157]
44 Perform local optimization and recover {βi , Bi }. [sent-109, score-0.12]
45 Proposition 1 For the gauge γ_K induced by K, the convex hull of A in (3), we have  γ_K(X) = min_{U,V: U V = X} (1/2) Σ_i ( ||U_{:i}||_C^2 + ||V_{i:}||_R^2 ). [sent-118, score-0.147]
46 If || · ||_R = || · ||_C = || · ||_2, then γ_K becomes the trace norm (as we saw before), and Σ_i ( ||U_{:i}||_C^2 + ||V_{i:}||_R^2 ) is simply ||U||_F^2 + ||V||_F^2. [sent-121, score-0.247]
47 Then Proposition 1 is a well-known variational form of the trace norm [28]. [sent-122, score-0.247]
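As a quick numerical sanity check of Proposition 1 in this Euclidean special case, the following sketch builds a balanced SVD-based factorization and compares both sides of the identity (synthetic data, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4)) @ rng.standard_normal((4, 6))   # a synthetic matrix

trace_norm = np.linalg.svd(X, compute_uv=False).sum()            # ||X||_tr

# A balanced factorization X = U V attaining the minimum in Proposition 1.
P, s, Qt = np.linalg.svd(X, full_matrices=False)
U = P * np.sqrt(s)                 # columns of P scaled by sqrt(singular values)
V = np.sqrt(s)[:, None] * Qt       # rows of Q^T scaled the same way
assert np.allclose(U @ V, X)
print(trace_norm, 0.5 * (np.linalg.norm(U, "fro") ** 2 + np.linalg.norm(V, "fro") ** 2))
```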
48 Proposition 2 For any U ∈ R^{m×k} and V ∈ R^{k×n}, there exist σ_i ≥ 0, u_i ∈ R^m, and v_i ∈ R^n such that  U V = Σ_{i=1}^k σ_i u_i v_i^T,  ||u_i||_C ≤ 1,  ||v_i||_R ≤ 1,  Σ_{i=1}^k σ_i = (1/2) Σ_{i=1}^k ( ||U_{:i}||_C^2 + ||V_{i:}||_R^2 ).   (10) [sent-124, score-0.531]
49 Now we can specify concrete details for local optimization in the context of matrix norms: 1. [sent-125, score-0.165]
50 Initialize: given {σ_i ≥ 0, u_i v_i^T ∈ A}_{i=1}^k, set (U_init, V_init) to satisfy g(U_init, V_init) = f({σ_i, u_i v_i^T}): U_init = (√σ_1 u_1, . . . , √σ_k u_k) and V_init = (√σ_1 v_1, . . . , √σ_k v_k)^T. [sent-126, score-0.354]
51 Recovery: use Proposition 2 to (conceptually) recover {βi , ui , vi } from (U ∗ , V ∗ ). [sent-136, score-0.177]
52 The key advantage of this procedure is that Proposition 2 allows X_k and s_k to be computed directly from (U*, V*), keeping the recovery completely implicit:  X_k = Σ_{i=1}^k β_i û_i v̂_i^T = U* V*,  and  s_k = Σ_{i=1}^k σ_i = (1/2) Σ_{i=1}^k ( ||U*_{:i}||_C^2 + ||V*_{i:}||_R^2 ).   (12) [sent-137, score-0.293]
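In code, this implicit recovery is a one-liner for the Euclidean case (a sketch; the argument names are ours and the β_i, u_i, v_i are never materialized):

```python
import numpy as np

def implicit_recovery(U_star, V_star):
    """X_k and s_k of (12) computed directly from a local solution (U*, V*)
    in the Euclidean case; the rank-one decomposition itself is never formed."""
    X_k = U_star @ V_star
    s_k = 0.5 * (np.linalg.norm(U_star, "fro") ** 2 + np.linalg.norm(V_star, "fro") ** 2)
    return X_k, s_k
```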
53 (12) i=1 In addition, Proposition 2 ensures that locally improving the solution does not incur an increment in the number of weak hypotheses. [sent-138, score-0.109]
54 Unlike the local optimization for the trace norm in [21], which naturally works on the original objective, our scheme requires a nontrivial (variational) reformulation of the objective based on Propositions 1 and 2. [sent-140, score-0.441]
55 Compared to the local optimization in [22], which is hampered by orthogonal and PSD constraints, our (local) objective in (9) is unconstrained and smooth for many instances of || · ||_C and || · ||_R. [sent-142, score-0.224]
56 This is possible because no other constraints (besides the norm constraint), such as orthogonality, are imposed on U and V in Proposition 2. [sent-143, score-0.141]
57 Thus the local optimization we face, albeit non-convex in general, is more amenable to efficient solvers such as L-BFGS. [sent-144, score-0.211]
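A sketch of one such local step for the trace-norm instance of (9), g(U, V) = 0.5 ||U V − Z||_F^2 + (λ/2)(||U||_F^2 + ||V||_F^2), handed to L-BFGS; this is an illustrative reimplementation under a squared loss, not the authors' code:

```python
import numpy as np
from scipy.optimize import minimize

def local_improve(U0, V0, Z, lam):
    """Locally optimize the unconstrained smooth surrogate (9), Euclidean case:
    g(U, V) = 0.5*||U V - Z||_F^2 + 0.5*lam*(||U||_F^2 + ||V||_F^2)."""
    m, k = U0.shape
    _, n = V0.shape

    def fun(x):
        U = x[:m * k].reshape(m, k)
        V = x[m * k:].reshape(k, n)
        R = U @ V - Z
        f = 0.5 * np.sum(R * R) + 0.5 * lam * (np.sum(U * U) + np.sum(V * V))
        gU = R @ V.T + lam * U          # gradient w.r.t. U
        gV = U.T @ R + lam * V          # gradient w.r.t. V
        return f, np.concatenate([gU.ravel(), gV.ravel()])

    x0 = np.concatenate([U0.ravel(), V0.ravel()])
    res = minimize(fun, x0, jac=True, method="L-BFGS-B")
    U = res.x[:m * k].reshape(m, k)
    V = res.x[m * k:].reshape(k, n)
    return U, V
```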
58 Remark: Consider what happens if one performs the totally corrective update as in (7). [sent-145, score-0.152]
59 For example, in the case of trace norm, this leads to a full SVD on U ∗ V ∗ . [sent-147, score-0.106]
60 4 Latent Generative Model with Multiple Views Underlying most boosting algorithms is an oracle that identifies the steepest descent weak hypothesis (Step 3 of Algorithm 1). [sent-149, score-0.332]
61 When || · ||_R and || · ||_C are both Euclidean norms, this oracle can be efficiently computed via the leading left and right singular vector pair. [sent-151, score-0.103]
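For large matrices this pair can be approximated iteratively; below is a sketch using plain power iteration on the negative gradient (in practice a Lanczos-type sparse SVD routine would typically be preferred):

```python
import numpy as np

def leading_pair_power_iteration(G, iters=100, seed=0):
    """Approximate the leading singular pair of -G by alternating power iteration;
    returns (u, v, sigma) with u, v unit Euclidean vectors."""
    rng = np.random.default_rng(seed)
    A = -G
    u = rng.standard_normal(A.shape[0])
    for _ in range(iters):
        v = A.T @ u
        v /= np.linalg.norm(v)
        u = A @ v
        u /= np.linalg.norm(u)
    return u, v, u @ (A @ v)
```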
62 However, for most other interesting cases like low rank tensors, such an oracle is intractable [29]. [sent-152, score-0.146]
63 In this section we discover that, for an important multiview learning problem, the oracle can surprisingly be solved in polynomial time, yielding an efficient computational strategy. [sent-153, score-0.212]
64 In this case, beyond a single dictionary U and coefficient matrix V that model a single view Z (1) , multiple dictionaries U (k) are needed to reconstruct multiple views Z (k) , while keeping the latent representation V shared across all views. [sent-155, score-0.148]
65 Formally, the problem in multiview factorization is to optimize a joint objective (13) over all dictionaries U^(t), each with columns constrained as ||U^(t)_{:i}||_C ≤ 1, together with the shared coefficient matrix V [2, 4]. [sent-156, score-0.225]
66 We can easily re-express the problem as an equivalent “single” view formulation (1) by stacking all {U^(t)} into the rows of a big matrix U, with a new column norm ||U_{:i}||_C := max_t ||U^(t)_{:i}||_C. [sent-160, score-0.186]
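A small sketch of that stacked column norm, assuming Euclidean per-view norms (the function name and block layout are our conventions):

```python
import numpy as np

def stacked_column_norm(U_blocks):
    """Column norm of the stacked multiview dictionary U = [U^(1); ...; U^(T)]:
    ||U_{:i}||_C := max_t ||U^(t)_{:i}||_2."""
    per_view = np.stack([np.linalg.norm(Ut, axis=0) for Ut in U_blocks])  # T x k
    return per_view.max(axis=0)
```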
67 The only challenge left is the oracle problem in (4), which takes the following form when all norms are Euclidean:  max_{||u||_C ≤ 1, ||v||_2 ≤ 1} u^T Λ v = max_{||u||_C ≤ 1} ||Λ^T u||_2 = max_{u: ∀t, ||u_t||_2 ≤ 1} || Σ_t Λ_t^T u_t ||_2. [sent-172, score-0.2]
68 (QP) minimizes ⟨M0, z z^T⟩ subject to ⟨M1, z z^T⟩ ≤ 0, ⟨M2, z z^T⟩ ≤ 0 and ⟨I00, z z^T⟩ = 1,  (16)  where z stacks r, u_1 and u_2, the matrices M0, M1 and M2 are built from c, Λ1, Λ2 and identity blocks, and I00 is a zero matrix with only the (1, 1)-th entry being 1. [sent-186, score-0.285]
69 Letting X = z z^T, a semi-definite programming relaxation for (QP) is obtained by dropping the rank-one constraint:  (SP)  min_{X ⪰ 0} ⟨M0, X⟩,  s.t. the constraints of (16) with z z^T replaced by X. [sent-187, score-0.112]
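For prototyping, the relaxation (SP) can be written down with an off-the-shelf SDP modeling tool such as CVXPY; this is only an illustrative sketch (CVXPY is an extra dependency not used in the paper), with M0, M1, M2 taken as given inputs built as in (16), and the recovery of a rank-one z from the returned X omitted:

```python
import numpy as np
import cvxpy as cp

def solve_sp_relaxation(M0, M1, M2):
    """Solve the (SP) relaxation: min <M0, X> s.t. <M1, X> <= 0, <M2, X> <= 0,
    <I00, X> = 1, X PSD. M0, M1, M2 are given d x d symmetric matrices."""
    d = M0.shape[0]
    I00 = np.zeros((d, d))
    I00[0, 0] = 1.0
    X = cp.Variable((d, d), PSD=True)
    constraints = [cp.trace(M1 @ X) <= 0,
                   cp.trace(M2 @ X) <= 0,
                   cp.trace(I00 @ X) == 1]
    prob = cp.Problem(cp.Minimize(cp.trace(M0 @ X)), constraints)
    prob.solve()
    return X.value
```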
70 5 Experimental Results We compared our Algorithm 2 with three state-of-the-art solvers for trace norm regularized objectives: MMBS [22], DHM [15], and JS [8]. [sent-205, score-0.403]
71 JS addresses the constrained problem ||X||_tr ≤ ζ, which makes it hard to compare with solvers for the penalized problem: min_X L(X) + λ ||X||_tr. [sent-208, score-0.198]
72 Traditional solvers such as proximal methods [6] were not included because they are much slower. [sent-214, score-0.124]
73 [Figure 1: MovieLens100k; panel (b): test NMAE vs running time in seconds (semilogx); methods: Ours, MMBS, DHM, JS.] [sent-245, score-0.273]
74 Comparison 1: Matrix completion We first compared all methods on a matrix completion problem, using the standard datasets MovieLens100k, MovieLens1M, and MovieLens10M [6, 8, 21], which are sized 943 × 1682, 6040 × 3706, and 69878 × 10677 respectively (#user × #movie). [sent-248, score-0.111]
75 In Figures 1 to 3, we show how fast the various algorithms drive down the training objective, the training loss L (squared Euclidean distance), and the normalized mean absolute error (NMAE) on the test data. [sent-251, score-0.237]
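For reference, a sketch of the test metric, assuming the usual definition of NMAE for MovieLens-style ratings (mean absolute error normalized by the rating range):

```python
import numpy as np

def nmae(pred, truth, r_min=1.0, r_max=5.0):
    """Normalized mean absolute error over observed test ratings,
    normalized by the rating range (1..5 for the MovieLens data)."""
    return float(np.mean(np.abs(pred - truth)) / (r_max - r_min))
```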
76 From Figures 1(a), 2(a) and 3(a), it is clear that our method needs much less CPU time to reduce the objective value (solid line) and the loss L (dashed line). [sent-255, score-0.206]
77 This implies that local search and partially corrective updates in our method are very effective. [sent-256, score-0.155]
78 However, it is still slower because its local search is conducted on a constrained manifold. [sent-258, score-0.108]
79 In contrast, our local search objective is entirely unconstrained and smooth, which we manage to solve efficiently by L-BFGS. [sent-259, score-0.162]
80 We observed that DHM kept running coordinate descent with a constant step size, while the totally corrective update was rarely taken. [sent-261, score-0.206]
81 We tried accelerating it by using a smaller estimate of the Lipschitz constant of the gradient of L, but this led to divergence after a rapid decrease of the objective over the first few iterations. [sent-262, score-0.104]
82 For this we compared the matrix reconstruction after each iteration against the ground truth. [sent-265, score-0.092]
83 Comparison 2: multitask and multiclass learning Secondly, we tested on a multiclass classification problem with a synthetic dataset. [sent-267, score-0.174]
84 We used the logistic loss for a model matrix W ∈ RD×C . [sent-274, score-0.177]
85 For each training example x_i with label y_i ∈ {1, . . . , C}, we defined an individual loss L_i(W) = − log p(y_i | x_i; W). [sent-281, score-0.298]
86 Then L(W) is defined by combining the individual losses L_i(W) over the training set. [sent-283, score-0.132]
87 [Figures: Multiclass (λ = 0.01) and School multitask (λ = 0.001); panels (a): objective and loss (training) vs running time in seconds (loglog); other panels: test error and test regression error; methods: Ours, DHM, JS.] [sent-284, score-0.578]
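A sketch of these losses (it sums the L_i over the training set and uses 0-based labels with a softmax link; whether the paper sums or averages is not recoverable from the surviving text):

```python
import numpy as np

def multiclass_logistic_loss(W, X, y):
    """Sum over examples of L_i(W) = -log p(y_i | x_i; W) with a softmax link,
    for W in R^{D x C}, rows of X in R^D, and labels y in {0, ..., C-1}."""
    scores = X @ W                                  # N x C class scores
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    log_Z = np.log(np.exp(scores).sum(axis=1))      # log partition per example
    return float(np.sum(log_Z - scores[np.arange(len(y)), y]))
```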
88 We also applied the solvers to a multitask learning problem with the school dataset [25]. [sent-300, score-0.201]
89 Again, as shown in Figure 5, our approach is much faster than DHM and JS in finding the optimal solution for the training objective and test error. [sent-304, score-0.172]
90 As the problem requires a large λ, the trace norm penalty is small, making the loss close to the objective. [sent-305, score-0.132]
91 [Figure 6: Multiview training.] [sent-306, score-0.132]
92 Comparison 3: Multiview learning Finally we perform an initial test on our global optimization technique for learning latent models with multiple views. [sent-307, score-0.133]
93 We used squared loss for the first view, and logistic loss for the other views. [sent-312, score-0.264]
94 We compared our method with a local optimization approach to solving (13). [sent-313, score-0.12]
95 The local method first fixes all U (t) and minimizes V , which is a convex problem that can be solved by FISTA [32]. [sent-314, score-0.121]
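A sketch of that convex V-subproblem solved by FISTA, specialized to a squared loss on a single (stacked) view and an ℓ2,1 regularizer on V; step sizes, iteration counts and names are illustrative choices, not the baseline's actual settings:

```python
import numpy as np

def row_soft_threshold(V, tau):
    """Proximal operator of tau * ||V||_{2,1}: shrink each row's l2 norm."""
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return scale * V

def fista_V_step(U, Z, lam, iters=200):
    """Solve min_V 0.5*||U V - Z||_F^2 + lam*||V||_{2,1} by FISTA,
    with U held fixed as in the alternating baseline."""
    k, n = U.shape[1], Z.shape[1]
    L = np.linalg.norm(U, 2) ** 2          # Lipschitz constant of the smooth gradient
    V = np.zeros((k, n))
    Y = np.zeros((k, n))
    t = 1.0
    for _ in range(iters):
        G = U.T @ (U @ Y - Z)              # gradient of the smooth part at Y
        V_new = row_soft_threshold(Y - G / L, lam / L)
        t_new = 0.5 * (1 + np.sqrt(1 + 4 * t * t))
        Y = V_new + ((t - 1) / t_new) * (V_new - V)
        V, t = V_new, t_new
    return V
```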
96 Our method also reduces both the objective and the loss slightly faster than Alt. [sent-318, score-0.206]
97 6 Conclusion and Outlook We have proposed a new boosting algorithm for a wide range of matrix norm regularized problems. [sent-319, score-0.401]
98 We also applied the method to a novel problem, latent multiview learning, for which we designed a new efficient oracle. [sent-322, score-0.169]
99 We plan to study randomized boosting with ℓ1 regularization [34], and to extend the framework to more general nonlinear regularization [3]. [sent-323, score-0.216]
100 An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. [sent-363, score-0.299]
wordName wordTfidf (topN-words)
[('dhm', 0.526), ('obj', 0.316), ('mmbs', 0.263), ('js', 0.256), ('boosting', 0.15), ('norm', 0.141), ('multiview', 0.135), ('loss', 0.132), ('loglog', 0.119), ('semilogx', 0.119), ('uinit', 0.119), ('vinit', 0.119), ('nmae', 0.117), ('vi', 0.115), ('qp', 0.111), ('trace', 0.106), ('corrective', 0.097), ('solvers', 0.091), ('vs', 0.091), ('movielens', 0.087), ('seconds', 0.087), ('ai', 0.082), ('zz', 0.08), ('sp', 0.078), ('oracle', 0.077), ('xk', 0.075), ('objective', 0.074), ('multitask', 0.072), ('rank', 0.069), ('regularized', 0.065), ('convex', 0.063), ('ui', 0.062), ('alt', 0.062), ('optimization', 0.062), ('local', 0.058), ('semide', 0.056), ('totally', 0.055), ('bi', 0.055), ('running', 0.054), ('weak', 0.052), ('gauge', 0.052), ('multiclass', 0.051), ('constrained', 0.05), ('uv', 0.048), ('interlacing', 0.048), ('auxiliary', 0.047), ('reconstruction', 0.047), ('recovery', 0.046), ('bases', 0.046), ('matrix', 0.045), ('norms', 0.045), ('proposition', 0.044), ('siam', 0.044), ('sparse', 0.044), ('psd', 0.044), ('xinhua', 0.042), ('dictionary', 0.04), ('regularizers', 0.04), ('ut', 0.039), ('school', 0.038), ('coding', 0.037), ('greedy', 0.037), ('flickr', 0.037), ('test', 0.037), ('tr', 0.036), ('penalized', 0.035), ('sk', 0.035), ('solutions', 0.035), ('training', 0.034), ('latent', 0.034), ('sd', 0.034), ('fortunately', 0.034), ('xes', 0.033), ('proximal', 0.033), ('completion', 0.033), ('regularization', 0.033), ('min', 0.032), ('factorization', 0.032), ('duality', 0.031), ('locally', 0.03), ('accelerated', 0.03), ('unconstrained', 0.03), ('gradient', 0.03), ('views', 0.029), ('ensemble', 0.028), ('propositions', 0.028), ('sdp', 0.028), ('minx', 0.028), ('rate', 0.028), ('accelerate', 0.028), ('regularizer', 0.027), ('steepest', 0.027), ('solution', 0.027), ('bach', 0.027), ('euclidean', 0.027), ('optimize', 0.026), ('srebro', 0.026), ('zhang', 0.026), ('singular', 0.026), ('hypothesis', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach
Author: Xinhua Zhang, Dale Schuurmans, Yao-liang Yu
Abstract: Sparse learning models typically combine a smooth loss with a nonsmooth penalty, such as the trace norm. Although recent developments in sparse approximation have offered promising solution methods, current approaches either apply only to matrix-norm constrained problems or provide suboptimal convergence rates. In this paper, we propose a boosting method for regularized learning that guarantees ε accuracy within O(1/ε) iterations. Performance is further accelerated by interlacing boosting with fixed-rank local optimization—exploiting a simpler local objective than previous work. The proposed method yields state-of-the-art performance on large-scale problems. We also demonstrate an application to latent multiview learning for which we provide the first efficient weak-oracle.
2 0.20003195 208 nips-2012-Matrix reconstruction with the local max norm
Author: Rina Foygel, Nathan Srebro, Ruslan Salakhutdinov
Abstract: We introduce a new family of matrix norms, the “local max” norms, generalizing existing methods such as the max norm, the trace norm (nuclear norm), and the weighted or smoothed weighted trace norms, which have been extensively used in the literature as regularizers for matrix reconstruction problems. We show that this new family can be used to interpolate between the (weighted or unweighted) trace norm and the more conservative max norm. We test this interpolation on simulated data and on the large-scale Netflix and MovieLens ratings data, and find improved accuracy relative to the existing matrix norms. We also provide theoretical results showing learning guarantees for some of the new norms. 1
3 0.12465386 86 nips-2012-Convex Multi-view Subspace Learning
Author: Martha White, Xinhua Zhang, Dale Schuurmans, Yao-liang Yu
Abstract: Subspace learning seeks a low dimensional representation of data that enables accurate reconstruction. However, in many applications, data is obtained from multiple sources rather than a single source (e.g. an object might be viewed by cameras at different angles, or a document might consist of text and images). The conditional independence of separate sources imposes constraints on their shared latent representation, which, if respected, can improve the quality of a learned low dimensional representation. In this paper, we present a convex formulation of multi-view subspace learning that enforces conditional independence while reducing dimensionality. For this formulation, we develop an efficient algorithm that recovers an optimal data reconstruction by exploiting an implicit convex regularizer, then recovers the corresponding latent representation and reconstruction model, jointly and optimally. Experiments illustrate that the proposed method produces high quality results. 1
4 0.12020247 246 nips-2012-Nonparametric Max-Margin Matrix Factorization for Collaborative Prediction
Author: Minjie Xu, Jun Zhu, Bo Zhang
Abstract: We present a probabilistic formulation of max-margin matrix factorization and build accordingly a nonparametric Bayesian model which automatically resolves the unknown number of latent factors. Our work demonstrates a successful example that integrates Bayesian nonparametrics and max-margin learning, which are conventionally two separate paradigms and enjoy complementary advantages. We develop an efficient variational algorithm for posterior inference, and our extensive empirical studies on large-scale MovieLens and EachMovie data sets appear to justify the aforementioned dual advantages. 1
5 0.11938702 319 nips-2012-Sparse Prediction with the $k$-Support Norm
Author: Andreas Argyriou, Rina Foygel, Nathan Srebro
Abstract: We derive a novel norm that corresponds to the tightest convex relaxation of sparsity combined with an ℓ2 penalty. We show that this new k-support norm provides a tighter relaxation than the elastic net and can thus be advantageous in sparse prediction problems. We also bound the looseness of the elastic net, thus shedding new light on it and providing justification for its use.
6 0.11081449 282 nips-2012-Proximal Newton-type methods for convex optimization
7 0.10709526 247 nips-2012-Nonparametric Reduced Rank Regression
8 0.10541451 27 nips-2012-A quasi-Newton proximal splitting method
9 0.10410269 300 nips-2012-Scalable nonconvex inexact proximal splitting
10 0.1018545 16 nips-2012-A Polynomial-time Form of Robust Regression
11 0.09965767 301 nips-2012-Scaled Gradients on Grassmann Manifolds for Matrix Completion
12 0.098646052 227 nips-2012-Multiclass Learning with Simplex Coding
13 0.095374756 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation
14 0.095100082 199 nips-2012-Link Prediction in Graphs with Autoregressive Features
15 0.088411786 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods
16 0.088211298 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models
17 0.088022076 290 nips-2012-Recovery of Sparse Probability Measures via Convex Programming
18 0.084593393 120 nips-2012-Exact and Stable Recovery of Sequences of Signals with Sparse Increments via Differential 1-Minimization
19 0.08203274 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation
20 0.080427118 200 nips-2012-Local Supervised Learning through Space Partitioning
topicId topicWeight
[(0, 0.224), (1, 0.044), (2, 0.115), (3, -0.119), (4, 0.12), (5, 0.075), (6, -0.014), (7, 0.008), (8, 0.087), (9, 0.004), (10, -0.015), (11, 0.083), (12, -0.046), (13, 0.062), (14, 0.063), (15, 0.071), (16, 0.008), (17, 0.019), (18, 0.078), (19, -0.039), (20, -0.023), (21, 0.035), (22, -0.04), (23, 0.001), (24, -0.127), (25, -0.003), (26, -0.032), (27, -0.012), (28, -0.013), (29, 0.046), (30, -0.046), (31, 0.032), (32, 0.011), (33, 0.082), (34, -0.021), (35, 0.117), (36, 0.029), (37, -0.026), (38, -0.018), (39, -0.038), (40, 0.035), (41, -0.014), (42, 0.085), (43, -0.017), (44, 0.102), (45, -0.065), (46, -0.047), (47, 0.114), (48, 0.006), (49, 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.94553423 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach
Author: Xinhua Zhang, Dale Schuurmans, Yao-liang Yu
Abstract: Sparse learning models typically combine a smooth loss with a nonsmooth penalty, such as the trace norm. Although recent developments in sparse approximation have offered promising solution methods, current approaches either apply only to matrix-norm constrained problems or provide suboptimal convergence rates. In this paper, we propose a boosting method for regularized learning that guarantees ε accuracy within O(1/ε) iterations. Performance is further accelerated by interlacing boosting with fixed-rank local optimization—exploiting a simpler local objective than previous work. The proposed method yields state-of-the-art performance on large-scale problems. We also demonstrate an application to latent multiview learning for which we provide the first efficient weak-oracle.
2 0.80937749 208 nips-2012-Matrix reconstruction with the local max norm
Author: Rina Foygel, Nathan Srebro, Ruslan Salakhutdinov
Abstract: We introduce a new family of matrix norms, the “local max” norms, generalizing existing methods such as the max norm, the trace norm (nuclear norm), and the weighted or smoothed weighted trace norms, which have been extensively used in the literature as regularizers for matrix reconstruction problems. We show that this new family can be used to interpolate between the (weighted or unweighted) trace norm and the more conservative max norm. We test this interpolation on simulated data and on the large-scale Netflix and MovieLens ratings data, and find improved accuracy relative to the existing matrix norms. We also provide theoretical results showing learning guarantees for some of the new norms. 1
3 0.79848444 86 nips-2012-Convex Multi-view Subspace Learning
Author: Martha White, Xinhua Zhang, Dale Schuurmans, Yao-liang Yu
Abstract: Subspace learning seeks a low dimensional representation of data that enables accurate reconstruction. However, in many applications, data is obtained from multiple sources rather than a single source (e.g. an object might be viewed by cameras at different angles, or a document might consist of text and images). The conditional independence of separate sources imposes constraints on their shared latent representation, which, if respected, can improve the quality of a learned low dimensional representation. In this paper, we present a convex formulation of multi-view subspace learning that enforces conditional independence while reducing dimensionality. For this formulation, we develop an efficient algorithm that recovers an optimal data reconstruction by exploiting an implicit convex regularizer, then recovers the corresponding latent representation and reconstruction model, jointly and optimally. Experiments illustrate that the proposed method produces high quality results. 1
4 0.77811092 319 nips-2012-Sparse Prediction with the $k$-Support Norm
Author: Andreas Argyriou, Rina Foygel, Nathan Srebro
Abstract: We derive a novel norm that corresponds to the tightest convex relaxation of sparsity combined with an ℓ2 penalty. We show that this new k-support norm provides a tighter relaxation than the elastic net and can thus be advantageous in sparse prediction problems. We also bound the looseness of the elastic net, thus shedding new light on it and providing justification for its use.
5 0.67783469 125 nips-2012-Factoring nonnegative matrices with linear programs
Author: Ben Recht, Christopher Re, Joel Tropp, Victor Bittorf
Abstract: This paper describes a new approach, based on linear programming, for computing nonnegative matrix factorizations (NMFs). The key idea is a data-driven model for the factorization where the most salient features in the data are used to express the remaining features. More precisely, given a data matrix X, the algorithm identifies a matrix C that satisfies X ≈ CX and some linear constraints. The constraints are chosen to ensure that the matrix C selects features; these features can then be used to find a low-rank NMF of X. A theoretical analysis demonstrates that this approach has guarantees similar to those of the recent NMF algorithm of Arora et al. (2012). In contrast with this earlier work, the proposed method extends to more general noise models and leads to efficient, scalable algorithms. Experiments with synthetic and real datasets provide evidence that the new approach is also superior in practice. An optimized C++ implementation can factor a multigigabyte matrix in a matter of minutes. 1
6 0.67338812 301 nips-2012-Scaled Gradients on Grassmann Manifolds for Matrix Completion
7 0.62273443 320 nips-2012-Spectral Learning of General Weighted Automata via Constrained Matrix Completion
8 0.61838448 16 nips-2012-A Polynomial-time Form of Robust Regression
9 0.61449248 247 nips-2012-Nonparametric Reduced Rank Regression
10 0.60713941 27 nips-2012-A quasi-Newton proximal splitting method
11 0.60046685 246 nips-2012-Nonparametric Max-Margin Matrix Factorization for Collaborative Prediction
12 0.588871 290 nips-2012-Recovery of Sparse Probability Measures via Convex Programming
13 0.56199956 17 nips-2012-A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter Bound
14 0.56094837 199 nips-2012-Link Prediction in Graphs with Autoregressive Features
15 0.55879635 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance
16 0.55641609 221 nips-2012-Multi-Stage Multi-Task Feature Learning
17 0.5542767 225 nips-2012-Multi-task Vector Field Learning
18 0.55059367 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods
19 0.5429644 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions
20 0.53533268 285 nips-2012-Query Complexity of Derivative-Free Optimization
topicId topicWeight
[(0, 0.037), (9, 0.016), (21, 0.013), (38, 0.563), (42, 0.026), (54, 0.019), (55, 0.014), (74, 0.032), (76, 0.1), (80, 0.076), (92, 0.029)]
simIndex simValue paperId paperTitle
Author: Zhuo Wang, Alan Stocker, Daniel Lee
Abstract: In this work we study how the stimulus distribution influences the optimal coding of an individual neuron. Closed-form solutions to the optimal sigmoidal tuning curve are provided for a neuron obeying Poisson statistics under a given stimulus distribution. We consider a variety of optimality criteria, including maximizing discriminability, maximizing mutual information and minimizing estimation error under a general Lp norm. We generalize the Cramer-Rao lower bound and show how the Lp loss can be written as a functional of the Fisher Information in the asymptotic limit, by proving the moment convergence of certain functions of Poisson random variables. In this manner, we show how the optimal tuning curve depends upon the loss function, and the equivalence of maximizing mutual information with minimizing Lp loss in the limit as p goes to zero. 1
same-paper 2 0.98518169 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach
Author: Xinhua Zhang, Dale Schuurmans, Yao-liang Yu
Abstract: Sparse learning models typically combine a smooth loss with a nonsmooth penalty, such as the trace norm. Although recent developments in sparse approximation have offered promising solution methods, current approaches either apply only to matrix-norm constrained problems or provide suboptimal convergence rates. In this paper, we propose a boosting method for regularized learning that guarantees ε accuracy within O(1/ε) iterations. Performance is further accelerated by interlacing boosting with fixed-rank local optimization—exploiting a simpler local objective than previous work. The proposed method yields state-of-the-art performance on large-scale problems. We also demonstrate an application to latent multiview learning for which we provide the first efficient weak-oracle.
3 0.98228192 84 nips-2012-Convergence Rate Analysis of MAP Coordinate Minimization Algorithms
Author: Ofer Meshi, Amir Globerson, Tommi S. Jaakkola
Abstract: Finding maximum a posteriori (MAP) assignments in graphical models is an important task in many applications. Since the problem is generally hard, linear programming (LP) relaxations are often used. Solving these relaxations efficiently is thus an important practical problem. In recent years, several authors have proposed message passing updates corresponding to coordinate descent in the dual LP. However, these are generally not guaranteed to converge to a global optimum. One approach to remedy this is to smooth the LP, and perform coordinate descent on the smoothed dual. However, little is known about the convergence rate of this procedure. Here we perform a thorough rate analysis of such schemes and derive primal and dual convergence rates. We also provide a simple dual to primal mapping that yields feasible primal solutions with a guaranteed rate of convergence. Empirical evaluation supports our theoretical claims and shows that the method is highly competitive with state of the art approaches that yield global optima. 1
4 0.97606373 165 nips-2012-Iterative ranking from pair-wise comparisons
Author: Sahand Negahban, Sewoong Oh, Devavrat Shah
Abstract: The question of aggregating pairwise comparisons to obtain a global ranking over a collection of objects has been of interest for a very long time: be it ranking of online gamers (e.g. MSR’s TrueSkill system) and chess players, aggregating social opinions, or deciding which product to sell based on transactions. In most settings, in addition to obtaining ranking, finding ‘scores’ for each object (e.g. player’s rating) is of interest to understanding the intensity of the preferences. In this paper, we propose a novel iterative rank aggregation algorithm for discovering scores for objects from pairwise comparisons. The algorithm has a natural random walk interpretation over the graph of objects with edges present between two objects if they are compared; the scores turn out to be the stationary probability of this random walk. The algorithm is model independent. To establish the efficacy of our method, however, we consider the popular Bradley-Terry-Luce (BTL) model in which each object has an associated score which determines the probabilistic outcomes of pairwise comparisons between objects. We bound the finite sample error rates between the scores assumed by the BTL model and those estimated by our algorithm. This, in essence, leads to order-optimal dependence on the number of samples required to learn the scores well by our algorithm. Indeed, the experimental evaluation shows that our (model independent) algorithm performs as well as the Maximum Likelihood Estimator of the BTL model and outperforms a recently proposed algorithm by Ammar and Shah [1]. 1
5 0.95416415 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking
Author: Kevin Swersky, Brendan J. Frey, Daniel Tarlow, Richard S. Zemel, Ryan P. Adams
Abstract: In categorical data there is often structure in the number of variables that take on each label. For example, the total number of objects in an image and the number of highly relevant documents per query in web search both tend to follow a structured distribution. In this paper, we study a probabilistic model that explicitly includes a prior distribution over such counts, along with a count-conditional likelihood that defines probabilities over all subsets of a given size. When labels are binary and the prior over counts is a Poisson-Binomial distribution, a standard logistic regression model is recovered, but for other count distributions, such priors induce global dependencies and combinatorics that appear to complicate learning and inference. However, we demonstrate that simple, efficient learning procedures can be derived for more general forms of this model. We illustrate the utility of the formulation by exploring applications to multi-object classification, learning to rank, and top-K classification. 1
6 0.95238256 268 nips-2012-Perfect Dimensionality Recovery by Variational Bayesian PCA
7 0.95196193 181 nips-2012-Learning Multiple Tasks using Shared Hypotheses
8 0.93866181 240 nips-2012-Newton-Like Methods for Sparse Inverse Covariance Estimation
9 0.936988 143 nips-2012-Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins
10 0.93175423 208 nips-2012-Matrix reconstruction with the local max norm
11 0.89155883 285 nips-2012-Query Complexity of Derivative-Free Optimization
12 0.88568228 86 nips-2012-Convex Multi-view Subspace Learning
13 0.88470304 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization
14 0.87836981 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification
15 0.86832213 17 nips-2012-A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter Bound
16 0.86665189 324 nips-2012-Stochastic Gradient Descent with Only One Projection
17 0.86496657 30 nips-2012-Accuracy at the Top
18 0.86020046 187 nips-2012-Learning curves for multi-task Gaussian process regression
19 0.85908931 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback