jmlr jmlr2012 jmlr2012-83 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Uri Shalit, Daphna Weinshall, Gal Chechik
Abstract: When learning models that are represented in matrix forms, enforcing a low-rank constraint can dramatically improve the memory and run time complexity, while providing a natural regularization of the model. However, naive approaches to minimizing functions over the set of low-rank matrices are either prohibitively time consuming (repeated singular value decomposition of the matrix) or numerically unstable (optimizing a factored representation of the low-rank matrix). We build on recent advances in optimization over manifolds, and describe an iterative online learning procedure, consisting of a gradient step, followed by a second-order retraction back to the manifold. While the ideal retraction is costly to compute, and so is the projection operator that approximates it, we describe another retraction that can be computed efficiently. It has run time and memory complexity of O((n + m)k) for a rank-k matrix of dimension m × n, when using an online procedure with rank-one gradients. We use this algorithm, LORETA, to learn a matrix-form similarity measure over pairs of documents represented as high dimensional vectors. LORETA improves the mean average precision over a passive-aggressive approach in a factorized model, and also improves over a full model trained on pre-selected features using the same memory requirements. We further adapt LORETA to learn positive semi-definite low-rank matrices, providing an online algorithm for low-rank metric learning. LORETA also shows consistent improvement over standard weakly supervised methods in a large (1600 classes and 1 million images, using ImageNet) multi-label image classification task. Keywords: low rank, Riemannian manifolds, metric learning, retractions, multitask learning, online learning
Reference: text
sentIndex sentText sentNum sentScore
1 We build on recent advances in optimization over manifolds, and describe an iterative online learning procedure, consisting of a gradient step, followed by a second-order retraction back to the manifold. [sent-11, score-0.386]
2 While the ideal retraction is costly to compute, and so is the projection operator that approximates it, we describe another retraction that can be computed efficiently. [sent-12, score-0.51]
3 Low-rank matrix models could therefore scale to handle substantially more features and classes than models with full-rank dense matrices. [sent-30, score-0.282]
4 Unfortunately, the rank constraint is non-convex, and in the general case, minimizing a convex function subject to a rank constraint is NP-hard (Natarajan, 1995). [sent-31, score-0.404]
5 Sometimes, a matrix W ∈ R^{n×m} of rank k is represented as a product of two low-dimensional matrices W = AB^T, A ∈ R^{n×k}, B ∈ R^{m×k}, and simple gradient descent techniques are applied to each of the product terms separately (Bai et al. [sent-33, score-0.48]
6 Second, projected gradient algorithms can be applied by repeatedly taking a gradient step and projecting back to the manifold of low-rank matrices. [sent-35, score-0.372]
7 Unfortunately, computing the projection to that manifold becomes prohibitively costly for large matrices and cannot be computed after every gradient step. [sent-36, score-0.424]
8 In this paper we propose new algorithms for online learning on the manifold of low-rank matrices. [sent-44, score-0.264]
9 It is based on an operation called retraction, which is an operator that maps from a vector space that is tangent to the manifold, into the manifold (Do Carmo, 1992; Absil et al. [sent-45, score-0.368]
10 LORETA has a memory and run time complexity of O((n + m)k) per update when the gradients have rank one. [sent-49, score-0.255]
11 LORETA performed better than other techniques that operate on a factorized model, and also improves retrieval precision by 33% compared with training a full-rank model over the most informative pre-selected features, using a comparable memory footprint. [sent-53, score-0.375]
12 LORETA significantly improved over full-rank models, using a fraction of the memory required. [sent-55, score-0.28]
13 An embedded manifold is a smooth subset of an ambient space Rn . [sent-75, score-0.355]
14 For instance, the set {x : ||x||_2 = 1, x ∈ R^n}, the unit sphere, is an (n−1)-dimensional manifold embedded in the n-dimensional space R^n. [sent-76, score-0.258]
15 As another example, the orthogonal group O_n, which comprises the set of orthogonal n × n matrices, is an n(n−1)/2-dimensional manifold embedded in R^{n×n}. [sent-77, score-0.338]
16 Here we focus on the manifold of low-rank matrices, namely, the set of n × m matrices of rank k where k < m, n. [sent-78, score-0.502]
17 It is an ((n + m)k − k²)-dimensional manifold embedded in R^{n×m}, which we denote M^k_{n,m}, or plainly M. [sent-79, score-0.258]
18 Motivated by online learning, we focus here on developing a stochastic gradient descent procedure to minimize a loss function L over the manifold of low-rank matrices M^k_{n,m}: min_W L(W) s.t. W ∈ M^k_{n,m}. [sent-82, score-0.515]
19 At every step t of the algorithm, a gradient step update W t − ∇L (W t ) takes the model outside of the manifold M and has to be mapped back onto the manifold. [sent-86, score-0.315]
20 Unfortunately, the projection operation is very expensive to compute for the manifold of low-rank matrices, since it basically involves a singular value decomposition. [sent-88, score-0.238]
21 To explain how retractions are computed, we first describe the notion of a tangent space and the Riemannian gradient of a function on a manifold. [sent-90, score-0.383]
22 Riemannian Gradient and the Tangent Space. Each point W in an embedded manifold M has a tangent space associated with it, denoted T_W M, as shown in Figure 2 (for a formal definition of the tangent space, see Appendix A). [sent-92, score-0.594]
23 The tangent space is a vector space of the same dimension as the manifold that can be identified in a natural way with a subspace of the ambient space. (Figure 1 caption: Projection onto the manifold is just a particular case of a retraction.) [sent-93, score-0.597]
24 It is usually simple to compute the linear projection PW of any point in the ambient space onto the tangent space TW M . [sent-96, score-0.332]
25 Given a manifold M and a differentiable function L : M → R, the Riemannian gradient ∇L (W ) of L on M at a point W is a vector in the tangent space TW M . [sent-97, score-0.454]
26 A very useful property of embedded manifolds is the following: given a differentiable function f defined on the ambient space (and thus on the manifold), the Riemannian gradient of f at point W is simply the linear projection PW of the Euclidean gradient of f onto the tangent space TW M . [sent-98, score-0.651]
27 The mathematically ideal retraction is called the exponential mapping (Do Carmo, 1992, Chapter 3): it maps the tangent vector ξ ∈ T_W M to a point along a geodesic curve which goes through W in the direction of ξ (Figure 1). [sent-107, score-0.404]
28 Unfortunately, for many manifolds (including the low-rank manifold considered here) calculating the geodesic curve is computationally expensive (Vandereycken et al. [sent-108, score-0.289]
29 A major insight from the field of Riemannian manifold optimization is that one can use retractions which merely approximate the exponential mapping. [sent-110, score-0.329]
30 Definition 1 Given a point W in an embedded manifold M , a retraction is any function RW : TW M → M which satisfies the following two conditions (Absil et al. [sent-112, score-0.494]
31 It can be shown that any such retraction approximates the exponential mapping to first order (Absil et al. [sent-117, score-0.236]
32 Second-order retractions, which approximate the exponential mapping to second order around W, have to satisfy in addition the following stricter condition: P_W ( d²R_W(τξ)/dτ² |_{τ=0} ) = 0, for all ξ ∈ T_W M, where P_W is the linear projection from the ambient space onto the tangent space T_W M. [sent-119, score-0.332]
33 When viewed intrinsically, the curve RW (τξ) defined by a second-order retraction has zero acceleration at point W , namely, its second order derivatives are all normal to the manifold. [sent-120, score-0.236]
34 The best known example of a second-order retraction onto embedded manifolds is the projection operation (Absil and Malick, 2010), which maps a point X to the point Y ∈ M which is closest to it in the Frobenius norm. [sent-121, score-0.45]
35 Given the tangent space and a retraction, we now define a Riemannian gradient descent procedure for the loss L at point W t ∈ M . [sent-123, score-0.291]
36 Step 2: Riemannian gradient: linearly project the ambient gradient onto the tangent space T_W M. [sent-127, score-0.38]
37 With a proper choice of step size, this procedure can be proved to have local convergence for any retraction (Absil et al. [sent-131, score-0.236]
38 Step 2: linearly project ambient gradient onto tangent space TW M in order to get the Riemannian gradient ξt . [sent-136, score-0.466]
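The three steps extracted above (ambient gradient, linear projection onto the tangent space, retraction) can be written as one generic loop. The sketch below is an illustration only: `ambient_grad`, `project_to_tangent`, and `retract` are placeholder names for the problem-specific routines, and the fixed step size is a simplification.

```python
def riemannian_sgd_step(W, ambient_grad, project_to_tangent, retract, step_size=0.1):
    """One iteration of the gradient-step-plus-retraction procedure (NumPy arrays assumed)."""
    # Step 1: Euclidean (ambient) gradient of the loss at the current point W.
    Z = ambient_grad(W)
    # Step 2: Riemannian gradient = linear projection of Z onto the tangent space T_W M.
    xi = project_to_tangent(W, Z)
    # Step 3: retract the update -step_size * xi from the tangent space back onto M.
    return retract(W, -step_size * xi)
```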
39 Online Learning on the Low-rank Manifold. Based on the retractions described above, we now present an online algorithm for learning low-rank matrices, by performing stochastic gradient descent on the manifold of low-rank matrices. [sent-139, score-0.544]
40 At every iteration the algorithm suffers some loss, and performs a Riemannian gradient step followed by a retraction to the manifold Mkn,m . [sent-141, score-0.522]
41 Section 3.2 discusses the very common case where the online updates induce a gradient of rank r = 1. [sent-145, score-0.352]
42 Algorithm 1: Online algorithm for learning in the manifold of low-rank matrices. Input: initial low-rank model matrix W^0 ∈ M^k_{n,m}. [sent-146, score-0.355]
43 The set of n × k matrices of rank k is denoted R^{n×k}_*. [sent-159, score-0.302]
44 As a first step we find its linear projection onto the tangent space, Ẑ = P_W(Z). [sent-162, score-0.235]
45 We start with a lemma that gives a representation of the tangent space TW M (Figure 2), extending the constructions given by Vandereycken and Vandewalle (2010) to the general manifold of low-rank matrices. [sent-163, score-0.368]
46 We note that the assumption that A and B are both of full column rank is tantamount to assuming that the model W is exactly of rank k, and no less. [sent-168, score-0.429]
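To make the projection concrete, the sketch below computes the orthogonal (Frobenius-metric) projection of an ambient matrix Z onto the tangent space at W = AB^T, using the standard formula P_W(Z) = P_A Z + Z P_B − P_A Z P_B, where P_A and P_B are the orthogonal projectors onto the column spaces of A and B. It assumes A and B have full column rank, as noted above, and it forms dense n × m intermediates, so it illustrates the operation rather than the paper's O((n + m)k) factored computation (which works with the matrices M, N1, N2 described below).

```python
import numpy as np

def project_to_tangent(A, B, Z):
    """Orthogonally project an ambient matrix Z (n x m) onto the tangent space of the
    rank-k manifold at W = A @ B.T, assuming A (n x k) and B (m x k) have full
    column rank.  Dense illustration only; not the paper's factored routine."""
    # Orthonormal bases for the column spaces of A and B.
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    Pa = Qa @ Qa.T          # projector onto col(A)
    Pb = Qb @ Qb.T          # projector onto col(B)
    # P_W(Z) = Pa Z + Z Pb - Pa Z Pb
    return Pa @ Z + Z @ Pb - Pa @ Z @ Pb
```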
47 The following theorem presents the retraction we will apply. [sent-184, score-0.236]
48 The mapping R_W(ξ) = V_1 W† V_2 is a second order retraction from a neighborhood Θ_W ⊂ T_W M^k_{n,m} to M^k_{n,m}. [sent-187, score-0.236]
49 A more succinct representation of this retraction is the following. Lemma 4: the retraction R_W(ξ) can be presented as R_W(ξ) = [ A(I_k + (1/2)M − (1/8)M²) + A_⊥ N_2 (I_k − (1/2)M) ] · [ B(I_k + (1/2)M^T − (1/8)(M^T)²) + B_⊥ N_1 (I_k − (1/2)M^T) ]^T. [sent-189, score-0.472]
50 As a result of Lemma 4, we can calculate the retraction as the product of two low-rank factors: the first is an n × k matrix, the second a k × m matrix. [sent-191, score-0.236]
51 Given a gradient ∇L (x) in the ambient space, we can calculate the matrices M, N1 and N2 which allow us to represent its projection onto the tangent space, and furthermore allow us to calculate the retraction. [sent-192, score-0.518]
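For reference, the costly alternative mentioned earlier, the projection retraction, maps W + ξ to the nearest rank-k matrix in Frobenius norm via a truncated SVD. The sketch below implements only that baseline; it is not the efficient Lemma-4 retraction, whose whole point is to avoid this computation.

```python
import numpy as np

def projective_retraction(W, xi, k):
    """Retract W + xi back to the rank-k manifold by truncated SVD (the costly
    'projection' retraction discussed above; the paper's factored retraction
    avoids this dense decomposition)."""
    U, s, Vt = np.linalg.svd(W + xi, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]
```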
52 Algorithm 2 explicitly computes and stores the orthogonal complement matrices A_⊥ and B_⊥, which in the low-rank case k ≪ m, n have size O(mn), the same as the full-sized W. [sent-202, score-0.367]
53 Using the factorized gradient we can reformulate the algorithm to keep in memory only matrices of size at most max(n, m) × k or max(n, m) × r. [sent-208, score-0.281]
54 3.2 LORETA with Rank-one Gradients. In many learning problems, the gradient matrix ∇L(W) required for a gradient step update has rank one. [sent-212, score-0.429]
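As an illustration of why rank-one gradients arise, consider a hinge-type ranking loss on the bilinear score q^T W p with W = AB^T. The specific loss below is our example (in the spirit of the passive-aggressive / OASIS-style objectives compared against later, not the paper's exact training loss): its subgradient with respect to W is an outer product of two vectors, so it can be stored in O(n + m) memory.

```python
def rank_one_subgradient(A, B, q, p_pos, p_neg, margin=1.0):
    """Subgradient of the hinge ranking loss
       l(W) = max(0, margin - q^T W p_pos + q^T W p_neg),  W = A @ B.T,
    returned in factored form (u, v) with grad = u v^T (rank one), or None if no
    loss is suffered.  Inputs are NumPy arrays: q in R^n, p_pos, p_neg in R^m."""
    qA = q @ A                                   # k-vector: A^T q, computed once
    score_pos = qA @ (B.T @ p_pos)               # q^T W p_pos without forming W
    score_neg = qA @ (B.T @ p_neg)
    if margin - score_pos + score_neg <= 0.0:
        return None                              # zero subgradient
    u = -q                                       # n-vector
    v = p_pos - p_neg                            # m-vector
    return u, v                                  # grad(W) = u v^T, stored in O(n + m)
```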
55 The set of n-by-n PSD matrices of rank k forms a manifold of dimension nk − k(k−1)/2, embedded in the Euclidean space R^{n×n} (Vandereycken et al. [sent-261, score-0.426]
56 Given a rank-r gradient matrix Z, and using the projection matrices P_Y and P_{Y⊥}, they show that the two components of the tangent vector are P_Y ((Z + Z^T)/2) P_Y and ξ_P = P_{Y⊥} ((Z + Z^T)/2) P_Y + P_Y ((Z + Z^T)/2) P_{Y⊥}. (2) [sent-272, score-0.279]
57 Using this characterization of the tangent vector when given an ambient gradient Z, one can define a retraction analogous to that defined in Section 3. [sent-273, score-0.264]
58 This retraction is referred to as RW in Vandereycken and Vandewalle (2010). [sent-274, score-0.236]
59 The mapping R_W^PSD(ξ) = V W† V is a second order retraction from a neighborhood Θ_W ⊂ T_W S_+(k, n) to S_+(k, n). [sent-277, score-0.236]
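For comparison, a straightforward (but O(n³)) way to map W + ξ back to a rank-k PSD matrix is to symmetrize and keep the top-k non-negative eigenvalues. The sketch below shows only this eigendecomposition baseline, not the efficient R_W^PSD retraction referred to above.

```python
import numpy as np

def retract_to_psd_rank_k(W, xi, k):
    """Map W + xi to a nearby PSD matrix of rank at most k by symmetrizing and
    keeping the top-k non-negative eigenvalues.  Reference baseline only."""
    S = W + xi
    S = 0.5 * (S + S.T)                       # symmetrize
    w, V = np.linalg.eigh(S)                  # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]             # indices of the top-k eigenvalues
    w_top = np.clip(w[idx], 0.0, None)        # clip negatives to stay PSD
    return (V[:, idx] * w_top) @ V[:, idx].T
```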
60 This approach is related to the idea of manifold identification (Oberlin and Wright, 2007), where the learning of A, B "identifies" a manifold of rank k and the inner steps operate to tune the representation within that subspace. [sent-320, score-0.602]
61 In this work, Meyer and colleagues develop a framework for Riemannian stochastic gradient descent on the manifold of PSD matrices, and apply it to the problem of kernel learning and the learning of Mahalanobis distances. [sent-350, score-0.351]
62 Their main technical tool is that of quotient manifolds mentioned above, as opposed to the embedded manifold we use in this work. [sent-351, score-0.377]
63 Another paper which uses a quotient manifold representation is that of Journée et al. [sent-352, score-0.23]
64 introduced a retraction for PSD matrices in the context of modeling systems of partial differential equations. [sent-355, score-0.336]
65 (2009) deal with learning low-rank PSD matrices, and use the rank-preserving log-det divergence and clever factorization and optimization in order to derive an update rule with runtime complexity of O(nk²) for an n × n matrix of rank k. [sent-368, score-0.281]
66 (2008) use online learning in order to find a minimal rank square matrix under approximate affine constraints. [sent-370, score-0.321]
67 Experiments. We tested LORETA in two learning tasks: learning a similarity measure between pairs of text documents using the 20-newsgroups data collected by Lang (1995), and learning to rank image label annotations based on a multi-label annotated set, using the ImageNet data set (Deng et al. [sent-379, score-0.339]
68 We varied the number of features n and the rank of the matrix k so as to use a fixed amount of memory. [sent-399, score-0.257]
69 We view every document q in the test set as a query, and rank the remaining test documents p by their similarity scores q^T W p. [sent-425, score-0.313]
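With the factored model W = AB^T, these similarity scores can be computed for retrieval without ever forming the dense matrix, as in the following sketch (the function name, variable names, and shapes are ours for illustration).

```python
import numpy as np

def rank_documents(q, docs, A, B):
    """Rank candidate documents by the bilinear similarity q^T W p with W = A @ B.T,
    computed as (A^T q) . (B^T p) so the dense W is never formed.
    `q` has shape (n,), `docs` has shape (num_docs, m); returns indices sorted
    from most to least similar."""
    qA = q @ A                     # (k,) : A^T q
    scores = (docs @ B) @ qA       # (num_docs,) : each row p gives q^T A B^T p
    return np.argsort(-scores)     # descending similarity
```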
70 (2011), this is in fact equivalent to Riemannian stochastic GD in the manifold of PSD matrices when this manifold is endowed with a certain metric the authors call the flat metric. [sent-445, score-0.564]
71 This approach was more stable in the PSD case than in the general case, probably because the invariant space here is only the group of orthogonal matrices (which are well-conditioned), as opposed to the group of invertible matrices which might be ill-conditioned. [sent-448, score-0.24]
72 We compared with two full rank online metric learning methods, LEGO (Jain et al. [sent-459, score-0.327]
73 We have also compared with both full-rank models using rank 2000, that is, 4 times the memory constraint. [sent-463, score-0.255]
74 The reason may be that similarity was defined based on two samples belonging to a common class, and this relation is symmetric and transitive, two properties which are respected by PSD matrices but not by general similarity matrices. [sent-474, score-0.248]
75 Interestingly, for both LORETA algorithms, learning a low-rank model of rank 30 using the best 16660 features was significantly more precise than learning a much fuller model of rank 100 and 5000 features, or a model using the full 50000 word vocabulary but with rank 10. [sent-476, score-0.631]
76 The matrix G is rank one, unless no loss was suffered, in which case it is 0. [sent-506, score-0.257]
77 This matrix is not included in the low-rank manifold Mkn,m , since its rank is less than k. [sent-508, score-0.457]
78 We therefore perform a simple pre-training session in which we add up subgradients until a matrix of rank k is obtained. [sent-509, score-0.257]
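A minimal sketch of such a warm-up, under the assumption that the accumulated subgradients are available as rank-one pairs (u_i, v_i): sum them implicitly as UV^T and truncate to rank k with a small SVD to obtain the initial factors. The exact schedule (for example, the choice of 2k subgradients mentioned below) is the authors'; the implementation details here are ours.

```python
import numpy as np

def pretrain_initial_factors(rank_one_grads, k):
    """Build an initial rank-k factorization W0 = A0 @ B0.T by summing a list of
    rank-one subgradients [(u_1, v_1), ...] (each contributing u v^T) and
    truncating the sum to rank k.  Assumes the number of accumulated gradients
    (e.g., 2k) is small relative to n and m."""
    U = np.column_stack([u for u, _ in rank_one_grads])    # n x r
    V = np.column_stack([v for _, v in rank_one_grads])    # m x r
    # The sum of outer products is U @ V.T; factor it without forming an n x m matrix.
    Qu, Ru = np.linalg.qr(U)
    Qv, Rv = np.linalg.qr(V)
    Us, s, Vst = np.linalg.svd(Ru @ Rv.T)                  # small r x r SVD
    A0 = Qu @ (Us[:, :k] * s[:k])                          # n x k
    B0 = Qv @ Vst[:k, :].T                                 # m x k
    return A0, B0
```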
79 [Figure 3 legend: curves for ranks k = 10, 20, 30, 40, 50, 75, 100.] [sent-516, score-1.414]
80 [Figure 3 legend: ranks k = 10, 40, 100, with and without the PSD constraint.] [sent-522, score-1.212]
81 Figure 3: (a) Mean average precision (mAP) over the 20 newsgroups test set as traced along LORETA learning for various ranks. [sent-527, score-0.402]
82 For each rank, a different number of features was selected using an information gain criterion, such that the total memory requirement is kept fixed (number of features × rank is constant). [sent-532, score-0.255]
83 We chose 2k because we wanted to ensure that the matrix we obtain has rank greater or equal to k. [sent-536, score-0.257]
84 [Figure 4 legend: PA / zero, Matrix Perceptron, Group MC Perceptron; x-axis: matrix rank k.] Figure 4: ImageNet data. [sent-559, score-0.514]
85 Mean average precision (mAP) as a function of the rank k. [sent-560, score-0.255]
86 [Figure panel titles: rank = 150, rank = 250, rank = 400.] [sent-603, score-0.606]
87 Matrix Perceptron (black squares) and Group Multi-Class Perceptron (purple crosses) are both full rank (rank=1000), and their curves are reproduced on all six panels for comparison. [sent-613, score-0.227]
88 After a sufficient number of rounds, the model is typically full rank and dense. [sent-626, score-0.227]
89 Results: Figure 4 plots the mAP of LORETA and PA for different model ranks, while showing on the right the mAP of the full rank-1000 Matrix Perceptron and (2, 1)-norm algorithms. [sent-635, score-0.28]
90 However, we note that LORETA, being a non-convex algorithm, does depend significantly on the method of initialization, with the zero-padded identity matrix being the best initialization for lower rank models, and the zero matrix the best initialization for higher rank models (rank ≥ 150). [sent-637, score-0.514]
91 Discussion: We presented LORETA, an algorithm which learns a low-rank matrix based on stochastic Riemannian gradient descent and an efficient retraction to the manifold of low-rank matrices. [sent-646, score-0.642]
92 Proof of Lemma 2. We formally define the tangent space of a manifold at a point on the manifold, and then describe an auxiliary parametrization of the tangent space to the manifold M^k_{n,m} at a point W ∈ M^k_{n,m}. [sent-679, score-0.736]
93 Definition 7 The tangent space TW M to a manifold M ⊂ Rn at a point W ∈ M is the linear space spanned by all the tangent vectors at 0 to smooth curves γ : R → M such that γ(0) = W . [sent-680, score-0.536]
94 For any such curve, because of the rank-k assumption, we may assume that for all t ∈ R there exist (non-unique) matrices A(t) ∈ R^{n×k}_*, B(t) ∈ R^{m×k}, such that γ(t) = A(t)B(t)^T. [sent-683, score-0.302]
95 This is because any choice of matrices X, Y such that X = Ḃ, Y = Ȧ will give us some tangent vector, and for any tangent vector there exist such matrices. [sent-687, score-0.436]
96 The mapping R_W(ξ) = V_1 W† V_2 (7) is a second order retraction from a neighborhood Θ_W ⊂ T_W M^k_{n,m} to M^k_{n,m}. [sent-697, score-0.436]
97 A sufficient condition for the matrices Z1 and Z2 to be of full rank is that the matrix M is of limited norm. [sent-700, score-0.382]
98 Thus, for all tangent vectors lying in some neighborhood ΘW ⊂ TW Mkn,m of 0 ∈ TW Mkn,m , the above relation is indeed a retraction to the manifold. [sent-701, score-0.404]
99 In practice this is never a problem, as the set of matrices not of full rank is of zero measure, and we have found these matrices to always be of full rank. [sent-702, score-0.452]
100 This concludes the proof that the retraction is a second order retraction. [sent-713, score-0.236]
wordName wordTfidf (topN-words)
[('oreta', 0.386), ('bt', 0.263), ('retraction', 0.236), ('rw', 0.211), ('ab', 0.206), ('rank', 0.202), ('manifold', 0.2), ('tw', 0.196), ('riemannian', 0.195), ('abw', 0.184), ('abt', 0.175), ('tangent', 0.168), ('psd', 0.154), ('einshall', 0.138), ('halit', 0.138), ('hechik', 0.138), ('mbedded', 0.129), ('retractions', 0.129), ('vandereycken', 0.11), ('imagenet', 0.101), ('matrices', 0.1), ('anifold', 0.099), ('ambient', 0.097), ('atrices', 0.091), ('manifolds', 0.089), ('gradient', 0.086), ('absil', 0.085), ('loreta', 0.083), ('nline', 0.078), ('pw', 0.078), ('ik', 0.077), ('similarity', 0.074), ('ambt', 0.073), ('bw', 0.071), ('nk', 0.068), ('pa', 0.064), ('shalit', 0.064), ('online', 0.064), ('perceptron', 0.062), ('embedded', 0.058), ('matrix', 0.055), ('vandewalle', 0.055), ('rn', 0.053), ('precision', 0.053), ('memory', 0.053), ('earning', 0.048), ('chechik', 0.047), ('meyer', 0.046), ('lw', 0.042), ('bilinear', 0.042), ('factorized', 0.042), ('py', 0.042), ('orthogonal', 0.04), ('uri', 0.039), ('newsgroups', 0.039), ('projection', 0.038), ('documents', 0.037), ('descent', 0.037), ('daphna', 0.037), ('fro', 0.037), ('journ', 0.037), ('gd', 0.037), ('kulis', 0.037), ('metric', 0.036), ('deng', 0.035), ('rm', 0.034), ('ranking', 0.032), ('bai', 0.031), ('meka', 0.031), ('pb', 0.031), ('jain', 0.031), ('quotient', 0.03), ('fazel', 0.03), ('map', 0.03), ('onto', 0.029), ('zb', 0.029), ('supervision', 0.028), ('gal', 0.028), ('stochastic', 0.028), ('elseif', 0.028), ('grangier', 0.028), ('lego', 0.028), ('manning', 0.028), ('oasis', 0.028), ('pqt', 0.028), ('retract', 0.028), ('sqrt', 0.028), ('wsabie', 0.028), ('zpb', 0.028), ('qt', 0.027), ('yy', 0.027), ('cpu', 0.026), ('jerusalem', 0.026), ('image', 0.026), ('full', 0.025), ('sw', 0.024), ('rk', 0.024), ('runtime', 0.024), ('svd', 0.024), ('weinshall', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices
Author: Uri Shalit, Daphna Weinshall, Gal Chechik
Abstract: When learning models that are represented in matrix forms, enforcing a low-rank constraint can dramatically improve the memory and run time complexity, while providing a natural regularization of the model. However, naive approaches to minimizing functions over the set of low-rank matrices are either prohibitively time consuming (repeated singular value decomposition of the matrix) or numerically unstable (optimizing a factored representation of the low-rank matrix). We build on recent advances in optimization over manifolds, and describe an iterative online learning procedure, consisting of a gradient step, followed by a second-order retraction back to the manifold. While the ideal retraction is costly to compute, and so is the projection operator that approximates it, we describe another retraction that can be computed efficiently. It has run time and memory complexity of O ((n + m)k) for a rank-k matrix of dimension m × n, when using an online procedure with rank-one gradients. We use this algorithm, L ORETA, to learn a matrix-form similarity measure over pairs of documents represented as high dimensional vectors. L ORETA improves the mean average precision over a passive-aggressive approach in a factorized model, and also improves over a full model trained on pre-selected features using the same memory requirements. We further adapt L ORETA to learn positive semi-definite low-rank matrices, providing an online algorithm for low-rank metric learning. L ORETA also shows consistent improvement over standard weakly supervised methods in a large (1600 classes and 1 million images, using ImageNet) multi-label image classification task. Keywords: low rank, Riemannian manifolds, metric learning, retractions, multitask learning, online learning
2 0.13451968 50 jmlr-2012-Human Gesture Recognition on Product Manifolds
Author: Yui Man Lui
Abstract: Action videos are multidimensional data and can be naturally represented as data tensors. While tensor computing is widely used in computer vision, the geometry of tensor space is often ignored. The aim of this paper is to demonstrate the importance of the intrinsic geometry of tensor space which yields a very discriminating structure for action recognition. We characterize data tensors as points on a product manifold and model it statistically using least squares regression. To this aim, we factorize a data tensor relating to each order of the tensor using Higher Order Singular Value Decomposition (HOSVD) and then impose each factorized element on a Grassmann manifold. Furthermore, we account for underlying geometry on manifolds and formulate least squares regression as a composite function. This gives a natural extension from Euclidean space to manifolds. Consequently, classification is performed using geodesic distance on a product manifold where each factor manifold is Grassmannian. Our method exploits appearance and motion without explicitly modeling the shapes and dynamics. We assess the proposed method using three gesture databases, namely the Cambridge hand-gesture, the UMD Keck body-gesture, and the CHALEARN gesture challenge data sets. Experimental results reveal that not only does the proposed method perform well on the standard benchmark data sets, but also it generalizes well on the one-shot-learning gesture challenge. Furthermore, it is based on a simple statistical model and the intrinsic geometry of tensor space. Keywords: gesture recognition, action recognition, Grassmann manifolds, product manifolds, one-shot-learning, kinect data
3 0.10317044 68 jmlr-2012-Minimax Manifold Estimation
Author: Christopher Genovese, Marco Perone-Pacifico, Isabella Verdinelli, Larry Wasserman
Abstract: We find the minimax rate of convergence in Hausdorff distance for estimating a manifold M of dimension d embedded in RD given a noisy sample from the manifold. Under certain conditions, we show that the optimal rate of convergence is n−2/(2+d) . Thus, the minimax rate depends only on the dimension of the manifold, not on the dimension of the space in which M is embedded. Keywords: manifold learning, minimax estimation
4 0.091394342 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
Author: Sangkyun Lee, Stephen J. Wright
Abstract: Iterative methods that calculate their steps from approximate subgradient directions have proved to be useful for stochastic learning problems over large and streaming data sets. When the objective consists of a loss function plus a nonsmooth regularization term, the solution often lies on a lowdimensional manifold of parameter space along which the regularizer is smooth. (When an ℓ1 regularizer is used to induce sparsity in the solution, for example, this manifold is defined by the set of nonzero components of the parameter vector.) This paper shows that a regularized dual averaging algorithm can identify this manifold, with high probability, before reaching the solution. This observation motivates an algorithmic strategy in which, once an iterate is suspected of lying on an optimal or near-optimal manifold, we switch to a “local phase” that searches in this manifold, thus converging rapidly to a near-optimal point. Computational results are presented to verify the identification property and to illustrate the effectiveness of this approach. Keywords: regularization, dual averaging, partly smooth manifold, manifold identification
5 0.071495362 43 jmlr-2012-Fast Approximation of Matrix Coherence and Statistical Leverage
Author: Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, David P. Woodruff
Abstract: The statistical leverage scores of a matrix A are the squared row-norms of the matrix containing its (top) left singular vectors and the coherence is the largest leverage score. These quantities are of interest in recently-popular problems such as matrix completion and Nyström-based low-rank matrix approximation as well as in large-scale statistical data analysis applications more generally; moreover, they are of interest since they define the key structural nonuniformity that must be dealt with in developing fast randomized matrix algorithms. Our main result is a randomized algorithm that takes as input an arbitrary n × d matrix A, with n ≫ d, and that returns as output relative-error approximations to all n of the statistical leverage scores. The proposed algorithm runs (under assumptions on the precise values of n and d) in O(nd log n) time, as opposed to the O(nd²) time required by the naïve algorithm that involves computing an orthogonal basis for the range of A. Our analysis may be viewed in terms of computing a relative-error approximation to an underconstrained least-squares approximation problem, or, relatedly, it may be viewed as an application of Johnson-Lindenstrauss type ideas. Several practically-important extensions of our basic result are also described, including the approximation of so-called cross-leverage scores, the extension of these ideas to matrices with n ≈ d, and the extension to streaming environments. Keywords: matrix coherence, statistical leverage, randomized algorithm
6 0.070606701 97 jmlr-2012-Regularization Techniques for Learning with Matrices
7 0.062220734 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
8 0.061777528 52 jmlr-2012-Iterative Reweighted Algorithms for Matrix Rank Minimization
9 0.06040027 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
10 0.059416831 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise
11 0.058173608 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation
12 0.057498354 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks
13 0.050909273 108 jmlr-2012-Sparse and Unique Nonnegative Matrix Factorization Through Data Preprocessing
14 0.043986719 33 jmlr-2012-Distance Metric Learning with Eigenvalue Optimization
15 0.042865481 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization
16 0.039116528 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
17 0.038706567 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
18 0.03834689 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
19 0.038023997 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
20 0.034951322 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
topicId topicWeight
[(0, -0.187), (1, -0.078), (2, 0.069), (3, 0.02), (4, -0.05), (5, -0.077), (6, 0.086), (7, -0.136), (8, 0.151), (9, -0.285), (10, -0.003), (11, 0.126), (12, -0.165), (13, 0.095), (14, 0.083), (15, 0.023), (16, 0.004), (17, 0.038), (18, 0.004), (19, -0.182), (20, 0.112), (21, 0.021), (22, 0.049), (23, 0.134), (24, 0.036), (25, -0.021), (26, -0.042), (27, 0.092), (28, 0.132), (29, 0.065), (30, -0.074), (31, -0.018), (32, -0.099), (33, -0.018), (34, 0.146), (35, 0.005), (36, -0.058), (37, -0.091), (38, 0.024), (39, -0.054), (40, -0.007), (41, -0.133), (42, -0.016), (43, 0.039), (44, 0.02), (45, -0.064), (46, 0.091), (47, 0.029), (48, 0.035), (49, 0.07)]
simIndex simValue paperId paperTitle
same-paper 1 0.94346273 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices
Author: Uri Shalit, Daphna Weinshall, Gal Chechik
Abstract: When learning models that are represented in matrix forms, enforcing a low-rank constraint can dramatically improve the memory and run time complexity, while providing a natural regularization of the model. However, naive approaches to minimizing functions over the set of low-rank matrices are either prohibitively time consuming (repeated singular value decomposition of the matrix) or numerically unstable (optimizing a factored representation of the low-rank matrix). We build on recent advances in optimization over manifolds, and describe an iterative online learning procedure, consisting of a gradient step, followed by a second-order retraction back to the manifold. While the ideal retraction is costly to compute, and so is the projection operator that approximates it, we describe another retraction that can be computed efficiently. It has run time and memory complexity of O ((n + m)k) for a rank-k matrix of dimension m × n, when using an online procedure with rank-one gradients. We use this algorithm, L ORETA, to learn a matrix-form similarity measure over pairs of documents represented as high dimensional vectors. L ORETA improves the mean average precision over a passive-aggressive approach in a factorized model, and also improves over a full model trained on pre-selected features using the same memory requirements. We further adapt L ORETA to learn positive semi-definite low-rank matrices, providing an online algorithm for low-rank metric learning. L ORETA also shows consistent improvement over standard weakly supervised methods in a large (1600 classes and 1 million images, using ImageNet) multi-label image classification task. Keywords: low rank, Riemannian manifolds, metric learning, retractions, multitask learning, online learning
2 0.62544286 50 jmlr-2012-Human Gesture Recognition on Product Manifolds
Author: Yui Man Lui
Abstract: Action videos are multidimensional data and can be naturally represented as data tensors. While tensor computing is widely used in computer vision, the geometry of tensor space is often ignored. The aim of this paper is to demonstrate the importance of the intrinsic geometry of tensor space which yields a very discriminating structure for action recognition. We characterize data tensors as points on a product manifold and model it statistically using least squares regression. To this aim, we factorize a data tensor relating to each order of the tensor using Higher Order Singular Value Decomposition (HOSVD) and then impose each factorized element on a Grassmann manifold. Furthermore, we account for underlying geometry on manifolds and formulate least squares regression as a composite function. This gives a natural extension from Euclidean space to manifolds. Consequently, classification is performed using geodesic distance on a product manifold where each factor manifold is Grassmannian. Our method exploits appearance and motion without explicitly modeling the shapes and dynamics. We assess the proposed method using three gesture databases, namely the Cambridge hand-gesture, the UMD Keck body-gesture, and the CHALEARN gesture challenge data sets. Experimental results reveal that not only does the proposed method perform well on the standard benchmark data sets, but also it generalizes well on the one-shot-learning gesture challenge. Furthermore, it is based on a simple statistical model and the intrinsic geometry of tensor space. Keywords: gesture recognition, action recognition, Grassmann manifolds, product manifolds, one-shot-learning, kinect data
3 0.56402379 68 jmlr-2012-Minimax Manifold Estimation
Author: Christopher Genovese, Marco Perone-Pacifico, Isabella Verdinelli, Larry Wasserman
Abstract: We find the minimax rate of convergence in Hausdorff distance for estimating a manifold M of dimension d embedded in RD given a noisy sample from the manifold. Under certain conditions, we show that the optimal rate of convergence is n−2/(2+d) . Thus, the minimax rate depends only on the dimension of the manifold, not on the dimension of the space in which M is embedded. Keywords: manifold learning, minimax estimation
4 0.56030738 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
Author: Sangkyun Lee, Stephen J. Wright
Abstract: Iterative methods that calculate their steps from approximate subgradient directions have proved to be useful for stochastic learning problems over large and streaming data sets. When the objective consists of a loss function plus a nonsmooth regularization term, the solution often lies on a lowdimensional manifold of parameter space along which the regularizer is smooth. (When an ℓ1 regularizer is used to induce sparsity in the solution, for example, this manifold is defined by the set of nonzero components of the parameter vector.) This paper shows that a regularized dual averaging algorithm can identify this manifold, with high probability, before reaching the solution. This observation motivates an algorithmic strategy in which, once an iterate is suspected of lying on an optimal or near-optimal manifold, we switch to a “local phase” that searches in this manifold, thus converging rapidly to a near-optimal point. Computational results are presented to verify the identification property and to illustrate the effectiveness of this approach. Keywords: regularization, dual averaging, partly smooth manifold, manifold identification
5 0.48912287 52 jmlr-2012-Iterative Reweighted Algorithms for Matrix Rank Minimization
Author: Karthik Mohan, Maryam Fazel
Abstract: The problem of minimizing the rank of a matrix subject to affine constraints has applications in several areas including machine learning, and is known to be NP-hard. A tractable relaxation for this problem is nuclear norm (or trace norm) minimization, which is guaranteed to find the minimum rank matrix under suitable assumptions. In this paper, we propose a family of Iterative Reweighted Least Squares algorithms IRLS-p (with 0 ≤ p ≤ 1), as a computationally efficient way to improve over the performance of nuclear norm minimization. The algorithms can be viewed as (locally) minimizing certain smooth approximations to the rank function. When p = 1, we give theoretical guarantees similar to those for nuclear norm minimization, that is, recovery of low-rank matrices under certain assumptions on the operator defining the constraints. For p < 1, IRLSp shows better empirical performance in terms of recovering low-rank matrices than nuclear norm minimization. We provide an efficient implementation for IRLS-p, and also present a related family of algorithms, sIRLS-p. These algorithms exhibit competitive run times and improved recovery when compared to existing algorithms for random instances of the matrix completion problem, as well as on the MovieLens movie recommendation data set. Keywords: matrix rank minimization, matrix completion, iterative algorithms, null-space property
6 0.38691634 43 jmlr-2012-Fast Approximation of Matrix Coherence and Statistical Leverage
7 0.28860965 108 jmlr-2012-Sparse and Unique Nonnegative Matrix Factorization Through Data Preprocessing
8 0.28752011 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
9 0.27580929 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation
10 0.27577129 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
11 0.25956947 97 jmlr-2012-Regularization Techniques for Learning with Matrices
12 0.25870988 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise
13 0.24141027 37 jmlr-2012-Eliminating Spammers and Ranking Annotators for Crowdsourced Labeling Tasks
14 0.21706827 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
15 0.21551211 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks
16 0.20712109 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
17 0.20682847 17 jmlr-2012-An Active Learning Algorithm for Ranking from Pairwise Preferences with an Almost Optimal Query Complexity
18 0.20588505 75 jmlr-2012-NIMFA : A Python Library for Nonnegative Matrix Factorization
19 0.20115855 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization
20 0.19931191 93 jmlr-2012-Quantum Set Intersection and its Application to Associative Memory
topicId topicWeight
[(7, 0.035), (21, 0.027), (26, 0.047), (27, 0.013), (29, 0.021), (35, 0.013), (56, 0.017), (69, 0.042), (75, 0.056), (77, 0.016), (79, 0.016), (81, 0.42), (92, 0.064), (96, 0.12)]
simIndex simValue paperId paperTitle
1 0.83939528 63 jmlr-2012-Mal-ID: Automatic Malware Detection Using Common Segment Analysis and Meta-Features
Author: Gil Tahan, Lior Rokach, Yuval Shahar
Abstract: This paper proposes several novel methods, based on machine learning, to detect malware in executable files without any need for preprocessing, such as unpacking or disassembling. The basic method (Mal-ID) is a new static (form-based) analysis methodology that uses common segment analysis in order to detect malware files. By using common segment analysis, Mal-ID is able to discard malware parts that originate from benign code. In addition, Mal-ID uses a new kind of feature, termed meta-feature, to better capture the properties of the analyzed segments. Rather than using the entire file, as is usually the case with machine learning based techniques, the new approach detects malware on the segment level. This study also introduces two Mal-ID extensions that improve the Mal-ID basic method in various aspects. We rigorously evaluated Mal-ID and its two extensions with more than ten performance measures, and compared them to the highly rated boosted decision tree method under identical settings. The evaluation demonstrated that Mal-ID and the two Mal-ID extensions outperformed the boosted decision tree method in almost all respects. In addition, the results indicated that by extracting meaningful features, it is sufficient to employ one simple detection rule for classifying executable files. Keywords: computer security, malware detection, common segment analysis, supervised learning
2 0.83501369 78 jmlr-2012-Nonparametric Guidance of Autoencoder Representations using Label Information
Author: Jasper Snoek, Ryan P. Adams, Hugo Larochelle
Abstract: While unsupervised learning has long been useful for density modeling, exploratory data analysis and visualization, it has become increasingly important for discovering features that will later be used for discriminative tasks. Discriminative algorithms often work best with highly-informative features; remarkably, such features can often be learned without the labels. One particularly effective way to perform such unsupervised learning has been to use autoencoder neural networks, which find latent representations that are constrained but nevertheless informative for reconstruction. However, pure unsupervised learning with autoencoders can find representations that may or may not be useful for the ultimate discriminative task. It is a continuing challenge to guide the training of an autoencoder so that it finds features which will be useful for predicting labels. Similarly, we often have a priori information regarding what statistical variation will be irrelevant to the ultimate discriminative task, and we would like to be able to use this for guidance as well. Although a typical strategy would be to include a parametric discriminative model as part of the autoencoder training, here we propose a nonparametric approach that uses a Gaussian process to guide the representation. By using a nonparametric model, we can ensure that a useful discriminative function exists for a given set of features, without explicitly instantiating it. We demonstrate the superiority of this guidance mechanism on four data sets, including a real-world application to rehabilitation research. We also show how our proposed approach can learn to explicitly ignore statistically significant covariate information that is label-irrelevant, by evaluating on the small NORB image recognition problem in which pose and lighting labels are available. Keywords: autoencoder, gaussian process, gaussian process latent variable model, representation learning, unsupervised learning
same-paper 3 0.74548632 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices
Author: Uri Shalit, Daphna Weinshall, Gal Chechik
Abstract: When learning models that are represented in matrix forms, enforcing a low-rank constraint can dramatically improve the memory and run time complexity, while providing a natural regularization of the model. However, naive approaches to minimizing functions over the set of low-rank matrices are either prohibitively time consuming (repeated singular value decomposition of the matrix) or numerically unstable (optimizing a factored representation of the low-rank matrix). We build on recent advances in optimization over manifolds, and describe an iterative online learning procedure, consisting of a gradient step, followed by a second-order retraction back to the manifold. While the ideal retraction is costly to compute, and so is the projection operator that approximates it, we describe another retraction that can be computed efficiently. It has run time and memory complexity of O ((n + m)k) for a rank-k matrix of dimension m × n, when using an online procedure with rank-one gradients. We use this algorithm, L ORETA, to learn a matrix-form similarity measure over pairs of documents represented as high dimensional vectors. L ORETA improves the mean average precision over a passive-aggressive approach in a factorized model, and also improves over a full model trained on pre-selected features using the same memory requirements. We further adapt L ORETA to learn positive semi-definite low-rank matrices, providing an online algorithm for low-rank metric learning. L ORETA also shows consistent improvement over standard weakly supervised methods in a large (1600 classes and 1 million images, using ImageNet) multi-label image classification task. Keywords: low rank, Riemannian manifolds, metric learning, retractions, multitask learning, online learning
4 0.43253672 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
Author: Jun Zhu, Amr Ahmed, Eric P. Xing
Abstract: A supervised topic model can use side information such as ratings or labels associated with documents or images to discover more predictive low dimensional topical representations of the data. However, existing supervised topic models predominantly employ likelihood-driven objective functions for learning and inference, leaving the popular and potentially powerful max-margin principle unexploited for seeking predictive representations of data and more discriminative topic bases for the corpus. In this paper, we propose the maximum entropy discrimination latent Dirichlet allocation (MedLDA) model, which integrates the mechanism behind the max-margin prediction models (e.g., SVMs) with the mechanism behind the hierarchical Bayesian topic models (e.g., LDA) under a unified constrained optimization framework, and yields latent topical representations that are more discriminative and more suitable for prediction tasks such as document classification or regression. The principle underlying the MedLDA formalism is quite general and can be applied for jointly max-margin and maximum likelihood learning of directed or undirected topic models when supervising side information is available. Efficient variational methods for posterior inference and parameter estimation are derived and extensive empirical studies on several real data sets are also provided. Our experimental results demonstrate qualitatively and quantitatively that MedLDA could: 1) discover sparse and highly discriminative topical representations; 2) achieve state of the art prediction performance; and 3) be more efficient than existing supervised topic models, especially for classification. Keywords: supervised topic models, max-margin learning, maximum entropy discrimination, latent Dirichlet allocation, support vector machines
5 0.38733938 104 jmlr-2012-Security Analysis of Online Centroid Anomaly Detection
Author: Marius Kloft, Pavel Laskov
Abstract: Security issues are crucial in a number of machine learning applications, especially in scenarios dealing with human activity rather than natural phenomena (e.g., information ranking, spam detection, malware detection, etc.). In such cases, learning algorithms may have to cope with manipulated data aimed at hampering decision making. Although some previous work addressed the issue of handling malicious data in the context of supervised learning, very little is known about the behavior of anomaly detection methods in such scenarios. In this contribution,1 we analyze the performance of a particular method—online centroid anomaly detection—in the presence of adversarial noise. Our analysis addresses the following security-related issues: formalization of learning and attack processes, derivation of an optimal attack, and analysis of attack efficiency and limitations. We derive bounds on the effectiveness of a poisoning attack against centroid anomaly detection under different conditions: attacker’s full or limited control over the traffic and bounded false positive rate. Our bounds show that whereas a poisoning attack can be effectively staged in the unconstrained case, it can be made arbitrarily difficult (a strict upper bound on the attacker’s gain) if external constraints are properly used. Our experimental evaluation, carried out on real traces of HTTP and exploit traffic, confirms the tightness of our theoretical bounds and the practicality of our protection mechanisms. Keywords: anomaly detection, adversarial, security analysis, support vector data description, computer security, network intrusion detection
6 0.38427019 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
7 0.38052219 106 jmlr-2012-Sign Language Recognition using Sub-Units
8 0.36641672 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
9 0.36117166 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
10 0.36034006 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
11 0.35629615 45 jmlr-2012-Finding Recurrent Patterns from Continuous Sign Language Sentences for Automated Extraction of Signs
12 0.3550908 100 jmlr-2012-Robust Kernel Density Estimation
13 0.35393709 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks
14 0.35250735 57 jmlr-2012-Learning Symbolic Representations of Hybrid Dynamical Systems
15 0.34819147 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
16 0.34708035 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
17 0.34597224 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
18 0.34415281 19 jmlr-2012-An Introduction to Artificial Prediction Markets for Classification
19 0.34312892 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
20 0.34013829 118 jmlr-2012-Variational Multinomial Logit Gaussian Process