nips nips2010 nips2010-195 knowledge-graph by maker-knowledge-mining

195 nips-2010-Online Learning in The Manifold of Low-Rank Matrices

Source: pdf

Author: Uri Shalit, Daphna Weinshall, Gal Chechik

Abstract: When learning models that are represented in matrix forms, enforcing a low-rank constraint can dramatically improve the memory and run time complexity, while providing a natural regularization of the model. However, naive approaches for minimizing functions over the set of low-rank matrices are either prohibitively time consuming (repeated singular value decomposition of the matrix) or numerically unstable (optimizing a factored representation of the low rank matrix). We build on recent advances in optimization over manifolds, and describe an iterative online learning procedure, consisting of a gradient step, followed by a second-order retraction back to the manifold. While the ideal retraction is hard to compute, and so is the projection operator that approximates it, we describe another second-order retraction that can be computed efficiently, with run time and memory complexity of O((n + m)k) for a rank-k matrix of dimension m × n, given rank-one gradients. We use this algorithm, LORETA, to learn a matrix-form similarity measure over pairs of documents represented as high dimensional vectors. LORETA improves the mean average precision over a passive-aggressive approach in a factorized model, and also improves over a full model trained over pre-selected features using the same memory requirements. LORETA also showed consistent improvement over standard methods in a large (1600 classes) multi-label image classification task.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 However, naive approaches for minimizing functions over the set of low-rank matrices are either prohibitively time consuming (repeated singular value decomposition of the matrix) or numerically unstable (optimizing a factored representation of the low rank matrix). [sent-11, score-0.592]

2 We build on recent advances in optimization over manifolds, and describe an iterative online learning procedure, consisting of a gradient step, followed by a second-order retraction back to the manifold. [sent-12, score-0.579]

3 While the ideal retraction is hard to compute, and so is the projection operator that approximates it, we describe another second-order retraction that can be computed efficiently, with run time and memory complexity of O((n + m)k) for a rank-k matrix of dimension m × n, given rank-one gradients. [sent-13, score-0.941]

4 LORETA improves the mean average precision over a passive-aggressive approach in a factorized model, and also improves over a full model trained over pre-selected features using the same memory requirements. [sent-15, score-0.262]

5 In many of these models, a natural way to regularize the model is to limit the rank of the corresponding matrix. [sent-19, score-0.361]

6 In metric learning, a low rank constraint makes it possible to learn a low dimensional representation of the data in a discriminative way. [sent-20, score-0.534]

7 In multi-task problems, low rank constraints provide a way to tie together different tasks. [sent-21, score-0.409]

8 In all cases, low-rank matrices can be represented in a factorized form that dramatically reduces the memory and run-time complexity of learning and inference with that model. [sent-22, score-0.318]

9 Low-rank matrix models could therefore scale to handle substantially many more features and classes than with full rank dense matrices. [sent-23, score-0.523]

10 As with many other problems, the rank constraint is non-convex, and in the general case, minimizing a convex function subject to a rank constraint is NP-hard [1]. [sent-24, score-0.722]

11 Sometimes, a matrix $W \in \mathbb{R}^{n \times m}$ of rank $k$ is represented as a product of two low dimension matrices $W = AB^T$, $A \in \mathbb{R}^{n \times k}$, $B \in \mathbb{R}^{m \times k}$, and simple gradient descent techniques are applied to each of the product terms separately [3]. [sent-26, score-0.819]
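The factored form $W = AB^T$ can be used without ever building the dense matrix. The sketch below is an illustration only, assuming a bilinear score $q^T W p$ (the form used later in the text); the function name `factored_score` and the sizes are hypothetical. It evaluates the score in O((n + m)k) operations.

```python
import numpy as np

def factored_score(A, B, q, p):
    """Compute q^T (A B^T) p without forming the n x m product."""
    return (q @ A) @ (B.T @ p)

rng = np.random.default_rng(0)
n, m, k = 200, 150, 5                      # illustrative sizes
A = rng.standard_normal((n, k))
B = rng.standard_normal((m, k))
q = rng.standard_normal(n)
p = rng.standard_normal(m)

dense = q @ (A @ B.T) @ p                  # reference: builds the full matrix
fast = factored_score(A, B, q, p)          # touches only the two factors
```

Storing `A` and `B` takes (n + m)k numbers, compared with nm for the dense matrix.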

12 Second, projected gradient algorithms can be applied by repeatedly taking a gradient step and projecting back to the manifold of low-rank matrices. [sent-27, score-0.518]

13 Unfortunately, computing the projection to that manifold becomes prohibitively costly for large matrices and cannot be computed after every gradient step. [sent-28, score-0.561]

14 The first step computes the Riemannian gradient ξ (the projection of the gradient onto the tangent space $T_x\mathcal{M}_k^{n,m}$), yielding $x^{t+\frac{1}{2}} = x^t + \eta^t \nabla L(x^t)$. [sent-32, score-0.441]

15 The second step computes the retraction onto the manifold $x^{t+1} = R_x(\xi^t)$. [sent-33, score-0.661]

16 In this paper we propose new algorithms for online learning on the manifold of low-rank matrices, which are based on an operation called retraction. [sent-34, score-0.366]

17 Retractions are operators that map from a vector space that is tangent to the manifold, into the manifold. [sent-35, score-0.231]

18 They include the projection operator as a special case, but also include other retractions that can be computed dramatically more efficiently. [sent-36, score-0.319]

19 We use second order retractions to develop LORETA – an online algorithm for learning low rank matrices. [sent-37, score-0.729]

20 It has a memory and run time complexity of O ((n + m)k) when the gradients have rank one, a case which is relevant to numerous online learning problems as we show below. [sent-38, score-0.573]

21 Loreta performed better than other techniques that operate on a factorized model, and also improves retrieval precision by 33% as compared with training a full rank model over pre-selected most informative features, using comparable memory footprint. [sent-41, score-0.642]

22 Loreta significantly improved over full rank models, using a fraction of the memory required. [sent-43, score-0.465]

23 We then derive our low-rank online learning algorithm, and test it in two applications: learning similarity of text documents, and multi-label ranking for images. [sent-47, score-0.274]

24 2 Optimization on Riemannian manifolds. The field of numerical optimization on smooth manifolds has advanced significantly in the past few years. [sent-48, score-0.262]

25 An embedded manifold is a smooth subset of an ambient space $\mathbb{R}^n$. [sent-50, score-0.445]

26 For instance, the set $\{x : \|x\|_2 = 1,\ x \in \mathbb{R}^n\}$, the unit sphere, is an $(n-1)$-dimensional manifold embedded in the $n$-dimensional space $\mathbb{R}^n$. [sent-51, score-0.366]

27 Here we focus on the manifold of low-rank matrices, namely, the set of n × m matrices of rank k where k < m, n. [sent-52, score-0.725]

28 It is an $((n + m)k - k^2)$-dimensional manifold embedded in $\mathbb{R}^{n \times m}$, which we denote $\mathcal{M}_k^{n,m}$. [sent-53, score-0.342]
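To get a feel for the dimension formula, the short sketch below compares the manifold dimension (n + m)k − k² with the nm entries of the ambient space. The helper name and the sizes are illustrative assumptions, not values from the paper.

```python
def low_rank_manifold_dim(n, m, k):
    """Dimension of the manifold of n x m matrices of rank k."""
    return (n + m) * k - k * k

n, m, k = 1000, 800, 10                    # hypothetical sizes
dim = low_rank_manifold_dim(n, m, k)       # degrees of freedom on the manifold
ambient = n * m                            # entries of a dense n x m matrix
```

With these sizes the manifold has 17900 degrees of freedom against 800000 ambient entries.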

29 Embedded manifolds inherit many properties from the ambient space, a fact which simplifies their analysis. [sent-54, score-0.247]

30 For example, the Riemannian metric for embedded manifolds is simply the Euclidean metric restricted to the manifold. [sent-55, score-0.282]

31 Motivated by online learning, we focus here on developing a stochastic gradient descent procedure to minimize a loss function $L$ over the manifold of low-rank matrices $\mathcal{M}_k^{n,m}$: $\min_x L(x)$ s.t. $x \in \mathcal{M}_k^{n,m}$. [sent-56, score-0.746]

32 At every step $t$ of the algorithm, a gradient step update takes $x^{t+\frac{1}{2}}$ outside of the manifold $\mathcal{M}$ and has to be mapped back onto the manifold. [sent-61, score-0.492]

33 Unfortunately, the projection operation is very expensive to compute for the manifold of low rank matrices, since it basically involves a singular value decomposition. [sent-63, score-0.776]
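The projection referred to here can be written as a truncated singular value decomposition. The sketch below assumes a dense matrix small enough to decompose; `project_to_rank_k` is a hypothetical helper name. It is meant to show why this step is costly: every call runs a full SVD of the n × m matrix.

```python
import numpy as np

def project_to_rank_k(W, k):
    """Closest rank-k matrix to W in Frobenius norm, via a truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
W = rng.standard_normal((40, 30))          # a generic full-rank matrix
k = 5
W_k = project_to_rank_k(W, k)
s = np.linalg.svd(W, compute_uv=False)     # singular values, for checking the error
```

The approximation error equals the norm of the discarded singular values, which is the Eckart-Young property of the truncated SVD.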

34 Here we describe a wider class of operations called retractions, that serve a similar purpose: they ﬁnd a point on the manifold that is in the direction of the gradient. [sent-64, score-0.234]

35 Importantly, we describe a specific retraction that can be computed efficiently. [sent-65, score-0.33]

36 Its runtime complexity depends on 4 quantities: the model matrix dimensions m and n; its rank k; and the rank of the gradient matrix, r. [sent-66, score-0.979]

37 To explain how retractions are computed, we first describe the notion of a tangent space and the Riemannian gradient of a function on a manifold. [sent-68, score-0.534]

38 Riemannian gradient and the tangent space Each point x in an embedded manifold M has a tangent space associated with it, denoted Tx M (see Fig. [sent-69, score-0.81]

39 The tangent space is a vector space of the same dimension as the manifold that can be identified in a natural way with a linear subspace of the ambient space. [sent-71, score-0.591]

40 It is usually simple to compute the linear projection Px of any point in the ambient space onto the tangent space Tx M. [sent-72, score-0.464]

41 Given a manifold M and a differentiable function L : M → R, the Riemannian gradient ∇L(x) of L on M at a point x is a vector in the tangent space Tx M. [sent-73, score-0.548]

42 A very useful property of embedded manifolds is the following: given a differentiable function $f$ defined on the ambient space (and thus on the manifold), the Riemannian gradient of $f$ at point $x$ is simply the linear projection $P_x$ of the ordinary gradient of $f$ onto the tangent space $T_x\mathcal{M}$. [sent-74, score-0.938]

43 An important consequence follows in case the manifold represents the set of points obeying a certain constraint. [sent-75, score-0.234]

44 In this case the Riemannian gradient of f is equivalent to the ordinary gradient of the f minus the component which is normal to the constraint. [sent-76, score-0.272]
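As a small illustration of "ordinary gradient minus the normal component", the sketch below uses the unit sphere from the earlier example rather than the low-rank manifold. On the sphere the normal direction at x is x itself. The function name is hypothetical.

```python
import numpy as np

def riemannian_grad_sphere(x, euclid_grad):
    """Project an ambient gradient onto the tangent space of the unit sphere at x."""
    # The normal direction at x is x itself, so remove that component.
    return euclid_grad - (x @ euclid_grad) * x

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
x /= np.linalg.norm(x)                     # a point on the unit sphere
g = rng.standard_normal(5)                 # stand-in for an ambient gradient at x
xi = riemannian_grad_sphere(x, g)          # tangent vector at x
```

The result is orthogonal to x, and projecting it a second time leaves it unchanged.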

45 The Riemannian gradient allows us to compute $x^{t+\frac{1}{2}} = x^t + \eta^t \nabla L(x^t)$, for a given iterate point $x^t$ and step size $\eta^t$. [sent-78, score-0.321]

46 The mathematically ideal retraction is called the exponential mapping: it maps the tangent vector ξ ∈ Tx M to a point along a geodesic curve which goes through x in the direction of ξ. [sent-81, score-0.553]

47 Unfortunately, for many manifolds (including the low-rank manifold considered here) calculating the geodesic curve is computationally expensive. [sent-82, score-0.421]

48 A major insight from the field of Riemannian manifold optimization is that using the exponential mapping is unnecessary since computationally cheaper retractions exist. [sent-83, score-0.482]

49 Formally, for a point $x$ in an embedded manifold $\mathcal{M}$, a retraction is any function $R_x : T_x\mathcal{M} \to \mathcal{M}$ which satisfies the following two conditions [4]: (1) Centering: $R_x(0) = x$. [sent-84, score-0.635]

50 It can be shown that any such retraction approximates the exponential mapping to first order [4]. [sent-86, score-0.358]

51 Second-order retractions, which approximate the exponential mapping to second order around $x$, have to satisfy the following stricter condition: $P_x\left(\left.\frac{d^2 R_x(\tau\xi)}{d\tau^2}\right|_{\tau=0}\right) = 0$ for all $\xi \in T_x\mathcal{M}$, where $P_x$ is the linear projection from the ambient space onto the tangent space $T_x\mathcal{M}$. [sent-87, score-0.492]
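The second-order condition can be checked numerically on a simple manifold. The sketch below uses the unit sphere with the "step, then renormalize" projection retraction, which is an assumption made for illustration and not the retraction proposed in the paper. It estimates the second derivative of τ → R_x(τξ) at 0 by finite differences and projects it onto the tangent space.

```python
import numpy as np

def retract_sphere(x, xi):
    """Projection retraction on the unit sphere: step, then renormalize."""
    y = x + xi
    return y / np.linalg.norm(y)

rng = np.random.default_rng(1)
x = rng.standard_normal(6)
x /= np.linalg.norm(x)                     # point on the sphere
xi = rng.standard_normal(6)
xi -= (x @ xi) * x                         # make xi tangent at x

h = 1e-3
# Central finite difference for the second derivative of tau -> R_x(tau * xi) at 0.
accel = (retract_sphere(x, h * xi) - 2 * retract_sphere(x, 0 * xi)
         + retract_sphere(x, -h * xi)) / h**2
tangential_accel = accel - (x @ accel) * x  # P_x applied to the acceleration
```

The acceleration itself is not zero, since it points along the normal direction x. Only its tangential part vanishes, which is what the condition asks for.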

52 When viewed intrinsically, the curve $R_x(\tau\xi)$ defined by a second-order retraction has zero acceleration at point $x$, namely, its second order derivatives are all normal to the manifold. [sent-88, score-0.358]

53 The best known example of a second-order retraction onto embedded manifolds is the projection operation [5]. [sent-89, score-0.697]

54 Given the tangent space and a retraction, we can now define a Riemannian gradient descent step for the loss $L$ at point $x^t \in \mathcal{M}$: (1) Gradient step: Compute $x^{t+\frac{1}{2}} = x^t + \xi^t$, with $\xi^t = \nabla L(x^t) = P_{x^t}(\tilde{\nabla} L(x^t))$, where $\tilde{\nabla} L(x^t)$ is the ordinary gradient of $L$ in the ambient space. [sent-91, score-0.875]

55 For a proper step size, this procedure can be proved to have local convergence for any retraction [4]. [sent-93, score-0.396]

56 3 Online learning on the low rank manifold. Based on the retractions described above, we now present an online algorithm for learning low-rank matrices, by performing stochastic gradient descent on the manifold of low rank matrices. [sent-94, score-1.824]

57 At every iteration the algorithm suffers a loss, and performs a Riemannian gradient step followed by a retraction to the manifold $\mathcal{M}_k^{n,m}$. [sent-95, score-0.725]
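The loop described here (gradient, projection to the tangent space, retraction) can be sketched on a toy problem. The example below is not LORETA. It swaps in the expensive truncated-SVD projection as the retraction and uses dense matrices, purely to show the structure of one iteration. The tangent projection formula, the toy loss, the step size and all function names are assumptions made for illustration.

```python
import numpy as np

def tangent_project(A, B, Z):
    """Project Z onto the tangent space at x = A @ B.T (A, B full column rank)."""
    PA = A @ np.linalg.pinv(A)             # projector onto the column space of A
    PB = B @ np.linalg.pinv(B)             # projector onto the column space of B
    return PA @ Z + Z @ PB - PA @ Z @ PB

def svd_retract(Y, k):
    """Map Y back to rank k by truncated SVD; returns factors with x = A @ B.T."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U[:, :k] * s[:k], Vt[:k, :].T

rng = np.random.default_rng(0)
n, m, k, eta = 30, 20, 3, 0.2
target = rng.standard_normal((n, k)) @ rng.standard_normal((k, m))
A, B = rng.standard_normal((n, k)), rng.standard_normal((m, k))

def loss(A, B):
    return 0.5 * np.linalg.norm(A @ B.T - target) ** 2

initial_loss = loss(A, B)
for _ in range(200):
    grad = A @ B.T - target                # ambient gradient of the toy loss
    xi = tangent_project(A, B, grad)       # Riemannian gradient
    A, B = svd_retract(A @ B.T - eta * xi, k)  # descent step, then retract
final_loss = loss(A, B)
```

Each iteration here costs a full SVD, which is the expense the cheaper retraction in the paper is designed to avoid.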

58 Section 3.2 discusses the very common case where the online updates induce a gradient of rank $r = 1$. [sent-99, score-0.584]

59 In what follows, a lowercase x denotes an abstract point on the manifold, lowercase Greek letters like ξ denote an abstract tangent vector, and uppercase Roman letters like A denote concrete matrix representations as kept in memory (taking n × m float numbers to store). [sent-100, score-0.393]

60 The set of n × k matrices of rank k is denoted $\mathbb{R}_*^{n \times k}$. [sent-102, score-0.491]

61 1 The general LORETA algorithm We start with a Lemma that gives a representation of the tangent space Tx M, extending the constructions given in [6] to the general manifold of low-rank matrices. [sent-104, score-0.425]

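62 The tangent space to $\mathcal{M}_k^{n,m}$ at $x$ is: $T_x\mathcal{M} = \left\{ [A \; A_\perp] \begin{bmatrix} M & N_1^T \\ N_2 & 0 \end{bmatrix} \begin{bmatrix} B^T \\ B_\perp^T \end{bmatrix} : M \in \mathbb{R}^{k \times k},\ N_1 \in \mathbb{R}^{(m-k) \times k},\ N_2 \in \mathbb{R}^{(n-k) \times k} \right\}$ (2). Let $\xi \in T_x\mathcal{M}_k^{n,m}$ be a tangent vector to $x = AB^T$. [sent-109, score-0.358]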

63 In online learning we are repeatedly given a rank-r gradient matrix Z, and want to compute a step on $\mathcal{M}_k^{n,m}$ in the direction of Z. [sent-111, score-0.33]

64 As a first step we wish to find its projection $P_x(Z)$ onto the tangent space. [sent-112, score-0.338]
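One way to see how the projection onto the tangent space relates to the block form of the tangent space is to compute it both ways on a small dense example. The sketch below assumes the block form with orthonormal complements $A_\perp$, $B_\perp$; the block formulas for M, N1, N2 are derived here for illustration and are not quoted from the paper. The result is compared with the projector expression $P_A Z + Z P_B - P_A Z P_B$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 12, 9, 3
A, B = rng.standard_normal((n, k)), rng.standard_normal((m, k))
Z = rng.standard_normal((n, m))            # stand-in for a gradient matrix

A_pinv, B_pinv = np.linalg.pinv(A), np.linalg.pinv(B)
# Orthonormal complements taken from complete QR factorizations.
A_perp = np.linalg.qr(A, mode="complete")[0][:, k:]
B_perp = np.linalg.qr(B, mode="complete")[0][:, k:]

# Blocks of the projected tangent vector.
M = A_pinv @ Z @ B_pinv.T                  # k x k
N1 = (A_pinv @ Z @ B_perp).T               # (m - k) x k
N2 = A_perp.T @ Z @ B_pinv.T               # (n - k) x k
proj_blocks = A @ M @ B.T + A @ N1.T @ B_perp.T + A_perp @ N2 @ B.T

# The same projection written with the projectors P_A and P_B.
PA, PB = A @ A_pinv, B @ B_pinv
proj_direct = PA @ Z + Z @ PB - PA @ Z @ PB
```

The zero block in the tangent space shows up as a vanishing component of the projection between the two orthogonal complements.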

65 The following theorem defines the retraction that we use. [sent-118, score-0.33]

66 The mapping $R_x(\xi) = w_1 x^\dagger w_2$ is a second order retraction from a neighborhood $\Theta_x \subset T_x\mathcal{M}_k^{n,m}$ to $\mathcal{M}_k^{n,m}$. [sent-123, score-0.358]

67 $G_1 G_2^T = -\eta\xi = -\eta\tilde{\nabla} L(x) \in \mathbb{R}^{n \times m}$, where $\tilde{\nabla} L(x)$ is the gradient in the ambient space and η > 0 is the step size. [sent-130, score-0.301]

68 Algorithm 1 explicitly computes and stores the orthogonal complement matrices $A_\perp$ and $B_\perp$, which in the low rank case k ≪ m, n, have size O(mn) as the original x. [sent-134, score-0.574]

69 To improve the memory complexity, we use the fact that the matrices A⊥ and B⊥ always operate with their transpose. [sent-135, score-0.235]

70 Because of the orthogonal complementarity, these projection matrices are equal to $I_n - P_A$ and $I_m - P_B$ respectively. [sent-137, score-0.239]
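The identity $A_\perp A_\perp^T = I_n - P_A$ means the complement never has to be stored. The sketch below assumes $P_A = A A^\dagger$ with A of full column rank; the helper name is hypothetical. It applies $I_n - P_A$ to a vector using only n × k factors and compares the result with an explicitly built orthonormal complement.

```python
import numpy as np

def apply_complement_projector(A, A_pinv, v):
    """Return (I - P_A) v using only n x k factors, where P_A = A @ A_pinv."""
    return v - A @ (A_pinv @ v)

rng = np.random.default_rng(0)
n, k = 50, 4
A = rng.standard_normal((n, k))
A_pinv = np.linalg.pinv(A)
v = rng.standard_normal(n)

# Reference: explicit orthonormal complement from a complete QR (n x (n - k)).
Q, _ = np.linalg.qr(A, mode="complete")
A_perp = Q[:, k:]
explicit = A_perp @ (A_perp.T @ v)
implicit = apply_complement_projector(A, A_pinv, v)
```

The implicit version costs O(nk) time and memory per vector, while the explicit complement needs O(n(n − k)) storage.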

71 $G_1 G_2^T = -\eta\xi = -\eta\tilde{\nabla} L(x) \in \mathbb{R}^{n \times m}$, where $\tilde{\nabla} L(x)$ is the gradient in the ambient space and η > 0 is the step size. [sent-146, score-0.301]

72 2 LORETA with rank-one gradients In many learning problems, the gradient matrix required for a gradient step update has a rank of one. [sent-152, score-0.714]

73 We now show how to reduce the complexity of each iteration to be linear in the model rank k when the rank of the gradient matrix r is one. [sent-159, score-0.949]

74 The memory requirement of Loreta-1 is about 4nk (assuming m = n), since it receives four input matrices of size nk ($A$, $B$, $A^\dagger$, $B^\dagger$) and assuming it can compute the four outputs ($Z_1$, $Z_2$, $Z_1^\dagger$, $Z_2^\dagger$) in-place while destroying previously computed terms. [sent-167, score-0.238]
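Keeping the factors together with their pseudoinverses is enough to represent both x and its pseudoinverse. The sketch below relies on the standard identity $(AB^T)^\dagger = (B^\dagger)^T A^\dagger$ for full-column-rank A and B, which is assumed here rather than quoted from the paper. It checks the four Moore-Penrose conditions and counts the stored numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 15, 11, 3
A, B = rng.standard_normal((n, k)), rng.standard_normal((m, k))
A_pinv, B_pinv = np.linalg.pinv(A), np.linalg.pinv(B)   # k x n and k x m

x = A @ B.T                    # formed here only to check the identity
x_pinv = B_pinv.T @ A_pinv     # candidate pseudoinverse, m x n

floats_factored = A.size + B.size + A_pinv.size + B_pinv.size  # 2(n + m)k
floats_dense = x.size                                           # n * m
```

With m = n the factored count 2(n + m)k reduces to the 4nk figure mentioned above.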

75 More specific to the field of low rank matrix manifolds, some work has been done on the general problem of optimization with low rank positive semi-definite (PSD) matrices. [sent-169, score-0.887]

76 These include [10] and [6]; the latter introduced the retraction for PSD matrices which we extended here to general low-rank matrices. [sent-170, score-0.46]

77 The problem of minimizing a convex function over the set of low rank matrices was addressed by several authors, including [11], and [12], which also considers additional affine constraints and its connection to recent advances in compressed sensing. [sent-171, score-0.409]

78 The first deals with learning low rank PSD matrices, and uses the rank-preserving log-det divergence and clever factorization and optimization in order to derive an update rule with runtime complexity of $O(nk^2)$ for an n × n matrix of rank k. [sent-177, score-0.904]

79 The second uses online learning in order to find a minimal rank square matrix under approximate affine constraints. [sent-178, score-0.53]

80 5 Experiments We tested Loreta-1 in two learning tasks: learning a similarity measure between pairs of text documents using the 20-newsgroups data collected by [15], and learning to rank image label annotations based on a multi-label annotated set, using the imagenet dataset [16]. [sent-181, score-0.667]

81 In each of our experiments, we selected a subset of n features, and trained a rank k model. [sent-192, score-0.361]

82 We varied the number of features n and the rank of the matrix k so as to use a ﬁxed amount of memory. [sent-193, score-0.469]

83 We want the model to assign a higher similarity score to the pair $(q, p_1)$ than to the pair $(q, p_2)$, and hence use the online ranking hinge loss defined as $l_W(q, p_1, p_2) = [1 - S_W(q, p_1) + S_W(q, p_2)]_+$. [sent-198, score-0.275]
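The ranking hinge loss yields a rank-one gradient, which is the case the efficient variant of the algorithm targets. The sketch below assumes the bilinear similarity $S_W(q, p) = q^T W p$ used in the evaluation; the function names and the toy sizes are hypothetical. It computes the loss and its (sub)gradient with respect to W.

```python
import numpy as np

def ranking_hinge_loss(W, q, p1, p2):
    """[1 - S_W(q, p1) + S_W(q, p2)]_+ with S_W(q, p) = q^T W p."""
    return max(0.0, 1.0 - q @ W @ p1 + q @ W @ p2)

def ranking_hinge_gradient(W, q, p1, p2):
    """Subgradient with respect to W: a rank-one outer product, or zero."""
    if ranking_hinge_loss(W, q, p1, p2) > 0:
        return -np.outer(q, p1 - p2)
    return np.zeros_like(W)

rng = np.random.default_rng(0)
n = 8
W = np.zeros((n, n))                       # with W = 0 the loss is exactly 1
q, p1, p2 = rng.standard_normal((3, n))
loss_value = ranking_hinge_loss(W, q, p1, p2)
G = ranking_hinge_gradient(W, q, p1, p2)
```

Because `G` is an outer product of two vectors, it can be kept as the pair `(q, p1 - p2)` instead of a dense n × n matrix.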

84 We view every document q in the test set as a query, and rank the remaining test documents p by their similarity scores $q^T W p$. [sent-213, score-0.561]
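Ranking by similarity score is typically summarized per query by average precision, and the mean over queries gives mAP. The sketch below uses the common textbook definition of average precision, which is assumed here since the listing does not spell out the exact evaluation protocol; names and data are illustrative.

```python
import numpy as np

def average_precision(scores, relevant):
    """Average of precision@rank taken at the rank of every relevant item."""
    order = np.argsort(-scores)            # best score first
    hits = relevant[order]                 # relevance in ranked order
    if not hits.any():
        return 0.0
    cum_hits = np.cumsum(hits)
    ranks = np.flatnonzero(hits) + 1       # 1-based ranks of the relevant items
    return float(np.mean(cum_hits[hits] / ranks))

scores = np.array([0.9, 0.8, 0.7, 0.6])    # similarity scores for one query
relevant = np.array([True, False, True, False])
ap = average_precision(scores, relevant)   # precision 1/1 and 2/3, averaged
```

Mean average precision is then the mean of `average_precision` over all queries.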

85 This procedure is numerically more stable because of the normalization by the norms of the matrices multiplied by the gradient factors. [sent-230, score-0.281]

86 Full rank similarity learning models. [sent-232, score-0.441]

87 More importantly, learning a low rank model of rank 30, using the best 16660 features, is significantly more precise than learning a much fuller model of rank 100 and 5000 features. [sent-237, score-1.131]

88 The intuition is that Loreta can be viewed as adaptively learning a linear projection of the data into low dimensional space, which is tailored to the pairwise similarity task. [sent-238, score-0.239]

89 2 Image multilabel ranking Our second set of experiments tackled the problem of learning to rank labels for images taken from a large number of classes (L = 1661) with multiple labels per image. [sent-240, score-0.533]

90 Given a ground truth labeling, a good model would rank the true labels higher than the false ones. [sent-246, score-0.392]

91 Imposing a low rank constraint on the model implies that these sub-models are linear combinations of a smaller number of latent models. [sent-248, score-0.409]

92 The matrix G is rank one, unless no loss was suffered, in which case it is 0. [sent-262, score-0.466]

93 [Figure 2 residue: legend rank = 100, 75, 50, 40, 30, 20, 10; x-axis label: matrix rank k.] [sent-267, score-2.527]

94 Figure 2: (a) Mean average precision (mAP) over 20 newsgroups test set as traced along Loreta learning for various ranks. [sent-283, score-0.573]

95 For each rank, a different number of features was selected using an information gain criterion, such that the total memory requirement is kept ﬁxed (number of features × rank is constant). [sent-286, score-0.54]
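The fixed-memory trade-off can be made concrete with a little arithmetic. The sketch below takes the rank-100, 5000-feature model mentioned earlier as the budget, which is an assumption about how the budget was set; the helper name is hypothetical. It computes how many features fit at each rank when features × rank is held constant.

```python
def features_for_rank(budget, k):
    """Largest feature count n such that n * k stays within the budget."""
    return budget // k

budget = 5000 * 100                        # features x rank of the rank-100 model
ranks = [10, 20, 30, 40, 50, 75, 100]      # ranks listed in the figure legend
table = {k: features_for_rank(budget, k) for k in ranks}
```

At rank 30 this gives 16666 features, close to the 16660 quoted in the text.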

96 (2) Matrix Perceptron: a full rank conservative gradient descent. (3) Group Multi-Class Perceptron: a mixed (2,1) norm online mirror descent algorithm [21]. [sent-318, score-0.735]

97 2c plots the mAP precision of Loreta and PA for different model ranks, while showing on the right the mAP of the full rank 1000 gradient descent and (2, 1) norm algorithms. [sent-322, score-0.641]

98 6 Discussion. We presented Loreta, an algorithm which learns a low-rank matrix based on stochastic Riemannian gradient descent and efficient retraction to the manifold of low-rank matrices. [sent-324, score-0.851]

99 Loreta yields a factorized representation of the low rank matrix. [sent-326, score-0.46]

100 A discriminative kernel-based model to rank images from text queries. [sent-371, score-0.396]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('loreta', 0.477), ('rank', 0.361), ('retraction', 0.33), ('manifold', 0.234), ('retractions', 0.22), ('riemannian', 0.207), ('tangent', 0.167), ('tx', 0.134), ('pa', 0.132), ('manifolds', 0.131), ('matrices', 0.13), ('gradient', 0.123), ('ambient', 0.116), ('rx', 0.1), ('online', 0.1), ('ab', 0.096), ('gbproj', 0.092), ('rojag', 0.092), ('gt', 0.086), ('xt', 0.08), ('similarity', 0.08), ('memory', 0.077), ('newsgroups', 0.075), ('projection', 0.074), ('embedded', 0.071), ('documents', 0.069), ('matrix', 0.069), ('precision', 0.068), ('imagenet', 0.063), ('descent', 0.062), ('ranking', 0.059), ('onto', 0.059), ('px', 0.058), ('lego', 0.055), ('oasis', 0.055), ('zpb', 0.055), ('rn', 0.055), ('factorized', 0.051), ('absil', 0.048), ('low', 0.048), ('bilinear', 0.044), ('lw', 0.044), ('sw', 0.044), ('perceptron', 0.042), ('metric', 0.04), ('map', 0.04), ('gd', 0.039), ('features', 0.039), ('rm', 0.038), ('step', 0.038), ('dimensional', 0.037), ('daphna', 0.037), ('zb', 0.037), ('loss', 0.036), ('ik', 0.036), ('text', 0.035), ('complexity', 0.035), ('orthogonal', 0.035), ('kulis', 0.035), ('image', 0.034), ('psd', 0.033), ('stochastic', 0.033), ('operation', 0.032), ('ilan', 0.032), ('gonda', 0.032), ('labels', 0.031), ('nk', 0.031), ('runtime', 0.03), ('retrieval', 0.03), ('shalit', 0.03), ('mapping', 0.028), ('bt', 0.028), ('operate', 0.028), ('procedure', 0.028), ('curve', 0.028), ('lowercase', 0.028), ('meka', 0.028), ('geodesic', 0.028), ('classes', 0.027), ('singular', 0.027), ('full', 0.027), ('document', 0.027), ('unstable', 0.026), ('pseudoinverse', 0.026), ('mk', 0.026), ('chechik', 0.026), ('iterative', 0.026), ('ordinary', 0.026), ('dimension', 0.026), ('complements', 0.025), ('gal', 0.025), ('label', 0.025), ('dramatically', 0.025), ('scores', 0.024), ('kept', 0.024), ('multilabel', 0.024), ('space', 0.024), ('siam', 0.023), ('pt', 0.023), ('pb', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999923 195 nips-2010-Online Learning in The Manifold of Low-Rank Matrices

Author: Uri Shalit, Daphna Weinshall, Gal Chechik

2 0.17269517 232 nips-2010-Sample Complexity of Testing the Manifold Hypothesis

Author: Hariharan Narayanan, Sanjoy Mitter

Abstract: The hypothesis that high dimensional data tends to lie in the vicinity of a low dimensional manifold is the basis of a collection of methodologies termed Manifold Learning. In this paper, we study statistical aspects of the question of fitting a manifold with a nearly optimal least squared error. Given upper bounds on the dimension, volume, and curvature, we show that Empirical Risk Minimization can produce a nearly optimal manifold using a number of random samples that is independent of the ambient dimension of the space in which data lie. We obtain an upper bound on the required number of samples that depends polynomially on the curvature, exponentially on the intrinsic dimension, and linearly on the intrinsic volume. For constant error, we prove a matching minimax lower bound on the sample complexity that shows that this dependence on intrinsic dimension, volume and curvature is unavoidable. Whether the known lower bound of $O\left(\frac{k}{\epsilon^2} + \frac{\log \frac{1}{\delta}}{\epsilon^2}\right)$ for the sample complexity of Empirical Risk minimization on k-means applied to data in a unit ball of arbitrary dimension is tight, has been an open question since 1997 [3]. Here $\epsilon$ is the desired bound on the error and $\delta$ is a bound on the probability of failure. We improve the best currently known upper bound [14] of $O\left(\frac{k^2}{\epsilon^2} + \frac{\log \frac{1}{\delta}}{\epsilon^2}\right)$ to $O\left(\frac{k}{\epsilon^2} \min\left(k, \frac{\log^4 \frac{k}{\epsilon}}{\epsilon^2}\right) + \frac{\log \frac{1}{\delta}}{\epsilon^2}\right)$. Based on these results, we devise a simple algorithm for k-means and another that uses a family of convex programs to fit a piecewise linear curve of a specified length to high dimensional data, where the sample complexity is independent of the ambient dimension.

3 0.15924658 146 nips-2010-Learning Multiple Tasks using Manifold Regularization

Author: Arvind Agarwal, Samuel Gerber, Hal Daume

Abstract: We present a novel method for multitask learning (MTL) based on manifold regularization: assume that all task parameters lie on a manifold. This is the generalization of a common assumption made in the existing literature: task parameters share a common linear subspace. One proposed method uses the projection distance from the manifold to regularize the task parameters. The manifold structure and the task parameters are learned using an alternating optimization framework. When the manifold structure is fixed, our method decomposes across tasks which can be learnt independently. An approximation of the manifold regularization scheme is presented that preserves the convexity of the single task learning problem, and makes the proposed MTL framework efficient and easy to implement. We show the efficacy of our method on several datasets.

4 0.10814212 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average

Author: Wei Chen, Tie-yan Liu, Zhi-ming Ma

Abstract: This paper is concerned with the generalization analysis on learning to rank for information retrieval (IR). In IR, data are hierarchically organized, i.e., consisting of queries and documents. Previous generalization analysis for ranking, however, has not fully considered this structure, and cannot explain how the simultaneous change of query number and document number in the training data will affect the performance of the learned ranking model. In this paper, we propose performing generalization analysis under the assumption of two-layer sampling, i.e., the i.i.d. sampling of queries and the conditional i.i.d. sampling of documents per query. Such a sampling can better describe the generation mechanism of real data, and the corresponding generalization analysis can better explain the real behaviors of learning to rank algorithms. However, it is challenging to perform such analysis, because the documents associated with different queries are not identically distributed, and the documents associated with the same query become no longer independent after being represented by features extracted from query-document matching. To tackle the challenge, we decompose the expected risk according to the two layers, and make use of the new concept of two-layer Rademacher average. The generalization bounds we obtained are quite intuitive and are in accordance with previous empirical studies on the performances of ranking algorithms.

5 0.10503173 275 nips-2010-Transduction with Matrix Completion: Three Birds with One Stone

Author: Andrew Goldberg, Ben Recht, Junming Xu, Robert Nowak, Xiaojin Zhu

Abstract: We pose transductive classification as a matrix completion problem. By assuming the underlying matrix has a low rank, our formulation is able to handle three problems simultaneously: i) multi-label learning, where each item has more than one label, ii) transduction, where most of these labels are unspecified, and iii) missing data, where a large number of features are missing. We obtained satisfactory results on several real-world tasks, suggesting that the low rank assumption may not be as restrictive as it seems. Our method allows for different loss functions to apply on the feature and label entries of the matrix. The resulting nuclear norm minimization problem is solved with a modified fixed-point continuation method that is guaranteed to find the global optimum.

6 0.10498556 110 nips-2010-Guaranteed Rank Minimization via Singular Value Projection

7 0.10397886 48 nips-2010-Collaborative Filtering in a Non-Uniform World: Learning with the Weighted Trace Norm

8 0.09454824 222 nips-2010-Random Walk Approach to Regret Minimization

9 0.089332059 136 nips-2010-Large-Scale Matrix Factorization with Missing Data under Additional Constraints

10 0.087014563 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability

11 0.08215826 210 nips-2010-Practical Large-Scale Optimization for Max-norm Regularization

12 0.07933972 114 nips-2010-Humans Learn Using Manifolds, Reluctantly

13 0.079006203 124 nips-2010-Inductive Regularized Learning of Kernel Functions

14 0.078001998 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models

15 0.076645046 162 nips-2010-Link Discovery using Graph Feature Tracking

16 0.074061304 9 nips-2010-A New Probabilistic Model for Rank Aggregation

17 0.070386589 240 nips-2010-Simultaneous Object Detection and Ranking with Weak Supervision

18 0.070237733 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks

19 0.068173781 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery

20 0.067977823 131 nips-2010-Joint Analysis of Time-Evolving Binary Matrices and Associated Documents

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.198), (1, 0.036), (2, 0.092), (3, -0.029), (4, 0.04), (5, 0.009), (6, 0.059), (7, -0.051), (8, -0.15), (9, 0.019), (10, 0.127), (11, 0.008), (12, 0.124), (13, 0.04), (14, 0.041), (15, 0.041), (16, 0.019), (17, -0.182), (18, -0.014), (19, 0.092), (20, 0.041), (21, 0.027), (22, -0.046), (23, -0.183), (24, 0.074), (25, -0.021), (26, -0.116), (27, -0.113), (28, -0.049), (29, 0.017), (30, -0.129), (31, 0.018), (32, 0.115), (33, 0.066), (34, -0.06), (35, -0.048), (36, 0.068), (37, 0.02), (38, -0.052), (39, 0.072), (40, 0.084), (41, 0.076), (42, -0.003), (43, 0.026), (44, -0.082), (45, -0.088), (46, -0.091), (47, 0.067), (48, 0.046), (49, -0.076)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94146878 195 nips-2010-Online Learning in The Manifold of Low-Rank Matrices

Author: Uri Shalit, Daphna Weinshall, Gal Chechik

2 0.71546298 110 nips-2010-Guaranteed Rank Minimization via Singular Value Projection

Author: Prateek Jain, Raghu Meka, Inderjit S. Dhillon

Abstract: Minimizing the rank of a matrix subject to affine constraints is a fundamental problem with many important applications in machine learning and statistics. In this paper we propose a simple and fast algorithm SVP (Singular Value Projection) for rank minimization under affine constraints (ARMP) and show that SVP recovers the minimum rank solution for affine constraints that satisfy a restricted isometry property (RIP). Our method guarantees geometric convergence rate even in the presence of noise and requires strictly weaker assumptions on the RIP constants than the existing methods. We also introduce a Newton-step for our SVP framework to speed-up the convergence with substantial empirical gains. Next, we address a practically important application of ARMP - the problem of low-rank matrix completion, for which the defining affine constraints do not directly obey RIP, hence the guarantees of SVP do not hold. However, we provide partial progress towards a proof of exact recovery for our algorithm by showing a more restricted isometry property and observe empirically that our algorithm recovers low-rank incoherent matrices from an almost optimal number of uniformly sampled entries. We also demonstrate empirically that our algorithms outperform existing methods, such as those of [5, 18, 14], for ARMP and the matrix completion problem by an order of magnitude and are also more robust to noise and sampling schemes. In particular, results show that our SVP-Newton method is significantly robust to noise and performs impressively on a more realistic power-law sampling scheme for the matrix completion problem.

3 0.63533688 136 nips-2010-Large-Scale Matrix Factorization with Missing Data under Additional Constraints

Author: Kaushik Mitra, Sameer Sheorey, Rama Chellappa

Abstract: Matrix factorization in the presence of missing data is at the core of many computer vision problems such as structure from motion (SfM), non-rigid SfM and photometric stereo. We formulate the problem of matrix factorization with missing data as a low-rank semidefinite program (LRSDP) with the advantage that: 1) an efficient quasi-Newton implementation of the LRSDP enables us to solve large-scale factorization problems, and 2) additional constraints such as orthonormality, required in orthographic SfM, can be directly incorporated in the new formulation. Our empirical evaluations suggest that, under the conditions of matrix completion theory, the proposed algorithm finds the optimal solution, and also requires fewer observations compared to the current state-of-the-art algorithms. We further demonstrate the effectiveness of the proposed algorithm in solving the affine SfM problem, non-rigid SfM and photometric stereo problems.

4 0.625965 48 nips-2010-Collaborative Filtering in a Non-Uniform World: Learning with the Weighted Trace Norm

Author: Nathan Srebro, Ruslan Salakhutdinov

Abstract: We show that matrix completion with trace-norm regularization can be significantly hurt when entries of the matrix are sampled non-uniformly, but that a properly weighted version of the trace-norm regularizer works well with non-uniform sampling. We show that the weighted trace-norm regularization indeed yields significant gains on the highly non-uniformly sampled Netflix dataset.

5 0.54249686 146 nips-2010-Learning Multiple Tasks using Manifold Regularization

Author: Arvind Agarwal, Samuel Gerber, Hal Daume

6 0.51339871 275 nips-2010-Transduction with Matrix Completion: Three Birds with One Stone

7 0.48949638 9 nips-2010-A New Probabilistic Model for Rank Aggregation

8 0.4864603 114 nips-2010-Humans Learn Using Manifolds, Reluctantly

9 0.47389293 232 nips-2010-Sample Complexity of Testing the Manifold Hypothesis

10 0.45633647 219 nips-2010-Random Conic Pursuit for Semidefinite Programming

11 0.45409811 210 nips-2010-Practical Large-Scale Optimization for Max-norm Regularization

12 0.44733047 220 nips-2010-Random Projection Trees Revisited

13 0.44725296 162 nips-2010-Link Discovery using Graph Feature Tracking

14 0.44714016 31 nips-2010-An analysis on negative curvature induced by singularity in multi-layer neural-network learning

15 0.42487383 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average

16 0.42298099 74 nips-2010-Empirical Bernstein Inequalities for U-Statistics

17 0.38976404 231 nips-2010-Robust PCA via Outlier Pursuit

18 0.37709481 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability

19 0.36601594 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models

20 0.36501792 131 nips-2010-Joint Analysis of Time-Evolving Binary Matrices and Associated Documents

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.093), (17, 0.027), (27, 0.048), (28, 0.177), (30, 0.099), (35, 0.016), (45, 0.195), (50, 0.055), (52, 0.039), (60, 0.037), (77, 0.033), (78, 0.068), (90, 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.87470323 281 nips-2010-Using body-anchored priors for identifying actions in single images

Author: Leonid Karlinsky, Michael Dinerstein, Shimon Ullman

Abstract: This paper presents an approach to the visual recognition of human actions using only single images as input. The task is easy for humans but difficult for current approaches to object recognition, because instances of different actions may be similar in terms of body pose, and often require detailed examination of relations between participating objects and body parts in order to be recognized. The proposed approach applies a two-stage interpretation procedure to each training and test image. The first stage produces accurate detection of the relevant body parts of the actor, forming a prior for the local evidence needed to be considered for identifying the action. The second stage extracts features that are anchored to the detected body parts, and uses these features and their feature-to-part relations in order to recognize the action. The body anchored priors we propose apply to a large range of human actions. These priors allow focusing on the relevant regions and relations, thereby significantly simplifying the learning process and increasing recognition performance.

2 0.84725463 54 nips-2010-Copula Processes

Author: Andrew Wilson, Zoubin Ghahramani

Abstract: We define a copula process which describes the dependencies between arbitrarily many random variables independently of their marginal distributions. As an example, we develop a stochastic volatility model, Gaussian Copula Process Volatility (GCPV), to predict the latent standard deviations of a sequence of random variables. To make predictions we use Bayesian inference, with the Laplace approximation, and with Markov chain Monte Carlo as an alternative. We find our model can outperform GARCH on simulated and financial data. And unlike GARCH, GCPV can easily handle missing data, incorporate covariates other than time, and model a rich class of covariance structures.

Imagine measuring the distance of a rocket as it leaves Earth, and wanting to know how these measurements correlate with one another. How much does the value of the measurement at fifteen minutes depend on the measurement at five minutes? Once we’ve learned this correlation structure, suppose we want to compare it to the dependence between measurements of the rocket’s velocity. To do this, it is convenient to separate dependence from the marginal distributions of our measurements. At any given time, a rocket’s distance from Earth could have a Gamma distribution, while its velocity could have a Gaussian distribution. And separating dependence from marginal distributions is precisely what a copula function does. While copulas have recently become popular, especially in financial applications [1, 2], as Nelsen [3] writes, “the study of copulas and the role they play in probability, statistics, and stochastic processes is a subject still in its infancy. There are many open problems...” Typically only bivariate (and recently trivariate) copulas are being used and studied. In our introductory example, we are interested in learning the correlations in different stochastic processes, and comparing them.
It would therefore be useful to have a copula process, which can describe the dependencies between arbitrarily many random variables independently of their marginal distributions. We define such a process. And as an example, we develop a stochastic volatility model, Gaussian Copula Process Volatility (GCPV). In doing so, we provide a Bayesian framework for learning the marginal distributions and dependency structure of what we call a Gaussian copula process.

The volatility of a random variable is its standard deviation. Stochastic volatility models are used to predict the volatilities of a heteroscedastic sequence, that is, a sequence of random variables with different variances, like distance measurements of a rocket as it leaves the Earth. As the rocket gets further away, the variance on the measurements increases. Heteroscedasticity is especially important in econometrics; the returns on equity indices, like the S&P 500, or on currency exchanges, are heteroscedastic. Indeed, in 2003, Robert Engle won the Nobel Prize in economics “for methods of analyzing economic time series with time-varying volatility”. GARCH [4], a generalized version of Engle’s ARCH, is arguably unsurpassed for predicting the volatility of returns on equity indices and currency exchanges [5, 6, 7]. GCPV can outperform GARCH, and is competitive on financial data that especially suits GARCH [8, 9, 10]. Before discussing GCPV, we first introduce copulas and the copula process. For a review of Gaussian processes, see Rasmussen and Williams [11].

Author footnotes: ∗ http://mlg.eng.cam.ac.uk/andrew † Also at the machine learning department at Carnegie Mellon University.

1 Copulas

Copulas are important because they separate the dependency structure between random variables from their marginal distributions. Intuitively, we can describe the dependency structure of any multivariate joint distribution $H(x_1, \dots, x_n) = P(X_1 \le x_1, \dots, X_n \le x_n)$ through a two step process.
First we take each univariate random variable $X_i$ and transform it through its cumulative distribution function (cdf) $F_i$ to get $U_i = F_i(X_i)$, a uniform random variable. We then express the dependencies between these transformed variables through the n-copula $C(u_1, \dots, u_n)$. Formally, an n-copula $C : [0, 1]^n \to [0, 1]$ is a multivariate cdf with uniform univariate marginals: $C(u_1, u_2, \dots, u_n) = P(U_1 \le u_1, U_2 \le u_2, \dots, U_n \le u_n)$, where $U_1, U_2, \dots, U_n$ are standard uniform random variables. Sklar [12] precisely expressed our intuition in the theorem below.

Theorem 1.1 (Sklar’s theorem). Let $H$ be an n-dimensional distribution function with marginal distribution functions $F_1, F_2, \dots, F_n$. Then there exists an n-copula $C$ such that for all $(x_1, x_2, \dots, x_n) \in [-\infty, \infty]^n$,

$$H(x_1, x_2, \dots, x_n) = C(F_1(x_1), F_2(x_2), \dots, F_n(x_n)) = C(u_1, u_2, \dots, u_n). \tag{1}$$

If $F_1, F_2, \dots, F_n$ are all continuous then $C$ is unique; otherwise $C$ is uniquely determined on $\mathrm{Range}\,F_1 \times \mathrm{Range}\,F_2 \times \cdots \times \mathrm{Range}\,F_n$. Conversely, if $C$ is an n-copula and $F_1, F_2, \dots, F_n$ are distribution functions, then the function $H$ is an n-dimensional distribution function with marginal distribution functions $F_1, F_2, \dots, F_n$.

As a corollary, if $F_i^{(-1)}(u) = \inf\{x : F_i(x) \ge u\}$, the quasi-inverse of $F_i$, then for all $(u_1, u_2, \dots, u_n) \in [0, 1]^n$,

$$C(u_1, u_2, \dots, u_n) = H(F_1^{(-1)}(u_1), F_2^{(-1)}(u_2), \dots, F_n^{(-1)}(u_n)). \tag{2}$$

In other words, (2) can be used to construct a copula. For example, the bivariate Gaussian copula is defined as

$$C(u, v) = \Phi_\rho(\Phi^{-1}(u), \Phi^{-1}(v)), \tag{3}$$

where $\Phi_\rho$ is a bivariate Gaussian cdf with correlation coefficient $\rho$, and $\Phi$ is the standard univariate Gaussian cdf. Li [2] popularised the bivariate Gaussian copula, by showing how it could be used to study financial risk and default correlation, using credit derivatives as an example.
By substituting $F(x)$ for $u$ and $G(y)$ for $v$ in equation (3), we have a bivariate distribution $H(x, y)$, with a Gaussian dependency structure, and marginals $F$ and $G$. Regardless of $F$ and $G$, the resulting $H(x, y)$ can still be uniquely expressed as a Gaussian copula, so long as $F$ and $G$ are continuous. It is then a copula itself that captures the underlying dependencies between random variables, regardless of their marginal distributions. For this reason, copulas have been called dependence functions [13, 14]. Nelsen [3] contains an extensive discussion of copulas.

2 Copula Processes

Imagine choosing a covariance function, and then drawing a sample function at some finite number of points from a Gaussian process. The result is a sample from a collection of Gaussian random variables, with a dependency structure encoded by the specified covariance function. Now, suppose we transform each of these values through a univariate Gaussian cdf, such that we have a sample from a collection of uniform random variables. These uniform random variables also have this underlying Gaussian process dependency structure. One might call the resulting values a draw from a Gaussian-Uniform Process. We could subsequently put these values through an inverse beta cdf, to obtain a draw from what could be called a Gaussian-Beta Process: the values would be a sample from beta random variables, again with an underlying Gaussian process dependency structure. We could also transform the uniform values with different inverse cdfs, which would give a sample from different random variables, with dependencies encoded by the Gaussian process. The above procedure is a means to generate samples from arbitrarily many random variables, with arbitrary marginal distributions, and desired dependencies. It is an example of how to use what we call a copula process; in this case, a Gaussian copula process, since a Gaussian copula describes the dependency structure of a finite number of samples.
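The sampling procedure just described can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions rather than code from the paper: the function names, the squared exponential kernel with unit amplitude, the small diagonal jitter, and the choice of a Beta(shape, 1) marginal (picked only because its inverse cdf, $u^{1/\text{shape}}$, has a closed form) are all hypothetical.

```python
import math
import random


def sq_exp_kernel(t1, t2, lengthscale=1.0):
    """Squared exponential covariance with unit amplitude, so k(t, t) = 1."""
    return math.exp(-((t1 - t2) ** 2) / lengthscale ** 2)


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def std_normal_cdf(x):
    """Standard univariate Gaussian cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sample_gaussian_beta_process(times, lengthscale=1.0, shape=2.0,
                                 jitter=1e-6, rng=None):
    """Return (f, u, w): a GP draw, its uniform transform, and Beta(shape, 1) values.

    The jitter is an assumption added for numerical stability; it perturbs the
    unit marginal variance by a negligible amount.
    """
    rng = rng or random.Random(0)
    n = len(times)
    cov = [[sq_exp_kernel(times[i], times[j], lengthscale)
            + (jitter if i == j else 0.0) for j in range(n)] for i in range(n)]
    low = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Correlated Gaussian values: a draw from the Gaussian process.
    f = [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(n)]
    # Gaussian cdf gives uniform marginals (a "Gaussian-Uniform" draw).
    u = [std_normal_cdf(v) for v in f]
    # Inverse cdf of Beta(shape, 1), whose cdf is x**shape on [0, 1].
    w = [ui ** (1.0 / shape) for ui in u]
    return f, u, w
```

Because both transforms are monotonic, the ordering of the values in `w` matches the ordering in `f`; only the marginal distribution changes, which is the separation of dependence from marginals that the text describes.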
We now formally define a copula process.

Definition 2.1 (Copula Process). Let $\{W_t\}$ be a collection of random variables indexed by $t \in T$, with marginal distribution functions $F_t$, and let $Q_t = F_t(W_t)$. Further, let $\mu$ be a stochastic process measure with marginal distribution functions $G_t$, and joint distribution function $H$. Then $W_t$ is copula process distributed with base measure $\mu$, or $W_t \sim \mathcal{CP}(\mu)$, if and only if for all $n \in \mathbb{N}$, $a_i \in \mathbb{R}$,

$$P\Big(\bigcap_{i=1}^{n} \{G_{t_i}^{(-1)}(Q_{t_i}) \le a_i\}\Big) = H_{t_1, t_2, \dots, t_n}(a_1, a_2, \dots, a_n). \tag{4}$$

Each $Q_{t_i} \sim \mathrm{Uniform}(0, 1)$, and $G_{t_i}^{(-1)}$ is the quasi-inverse of $G_{t_i}$, as previously defined.

Definition 2.2 (Gaussian Copula Process). $W_t$ is Gaussian copula process distributed if it is copula process distributed and the base measure $\mu$ is a Gaussian process. If there is a mapping $\Psi$ such that $\Psi(W_t) \sim \mathcal{GP}(m(t), k(t, t'))$, then we write $W_t \sim \mathcal{GCP}(\Psi, m(t), k(t, t'))$.

For example, if we have $W_t \sim \mathcal{GCP}$ with $m(t) = 0$ and $k(t, t) = 1$, then in the definition of a copula process, $G_t = \Phi$, the standard univariate Gaussian cdf, and $H$ is the usual GP joint distribution function. Supposing this GCP is a Gaussian-Beta process, then $\Psi = \Phi^{-1} \circ F_B$, where $F_B$ is a univariate Beta cdf. One could similarly define other copula processes. We described generally how a copula process can be used to generate samples of arbitrarily many random variables with desired marginals and dependencies. We now develop a specific and practical application of this framework. We introduce a stochastic volatility model, Gaussian Copula Process Volatility (GCPV), as an example of how to learn the joint distribution of arbitrarily many random variables, the marginals of these random variables, and to make predictions. To do this, we fit a Gaussian copula process by using a type of Warped Gaussian Process [15]. However, our methodology varies substantially from Snelson et al.
[15], since we are doing inference on latent variables as opposed to observations, which is a much greater undertaking that involves approximations, and we are doing so in a different context.

3 Gaussian Copula Process Volatility

Assume we have a sequence of observations $\mathbf{y} = (y_1, \dots, y_n)^\top$ at times $\mathbf{t} = (t_1, \dots, t_n)^\top$. The observations are random variables with different latent standard deviations. We therefore have $n$ unobserved standard deviations, $\sigma_1, \dots, \sigma_n$, and want to learn the correlation structure between these standard deviations, and also to predict the distribution of $\sigma_*$ at some unrealised time $t_*$. We model the standard deviation function as a Gaussian copula process:

$$\sigma_t \sim \mathcal{GCP}(g^{-1}, 0, k(t, t')). \tag{5}$$

Specifically,

$$f(t) \sim \mathcal{GP}(m(t) = 0, k(t, t')) \tag{6}$$
$$\sigma(t) = g(f(t), \omega) \tag{7}$$
$$y(t) \sim \mathcal{N}(0, \sigma^2(t)), \tag{8}$$

where $g$ is a monotonic warping function, parametrized by $\omega$. For each of the observations $\mathbf{y} = (y_1, \dots, y_n)^\top$ we have corresponding GP latent function values $\mathbf{f} = (f_1, \dots, f_n)^\top$, where $\sigma(t_i) = g(f_i, \omega)$, using the shorthand $f_i$ to mean $f(t_i)$. $\sigma_t \sim \mathcal{GCP}$, because any finite sequence $(\sigma_1, \dots, \sigma_p)$ is distributed as a Gaussian copula:

$$\begin{aligned}
P(\sigma_1 \le a_1, \dots, \sigma_p \le a_p) &= P(g^{-1}(\sigma_1) \le g^{-1}(a_1), \dots, g^{-1}(\sigma_p) \le g^{-1}(a_p)) \\
&= \Phi_\Gamma(g^{-1}(a_1), \dots, g^{-1}(a_p)) \\
&= \Phi_\Gamma(\Phi^{-1}(F(a_1)), \dots, \Phi^{-1}(F(a_p))) \\
&= \Phi_\Gamma(\Phi^{-1}(u_1), \dots, \Phi^{-1}(u_p)) = C(u_1, \dots, u_p),
\end{aligned} \tag{9}$$

where $\Phi$ is the standard univariate Gaussian cdf (supposing $k(t, t) = 1$), $\Phi_\Gamma$ is a multivariate Gaussian cdf with covariance matrix $\Gamma_{ij} = \mathrm{cov}(g^{-1}(\sigma_i), g^{-1}(\sigma_j))$, and $F$ is the marginal distribution of each $\sigma_i$. In (5), we have $\Psi = g^{-1}$, because it is $g^{-1}$ which maps $\sigma_t$ to a GP. The specification in (5) is equivalently expressed by (6) and (7). With GCPV, the form of $g$ is learned so that $g^{-1}(\sigma_t)$ is best modelled by a GP. By learning $g$, we learn the marginal of each $\sigma$: $F(a) = \Phi(g^{-1}(a))$ for $a \in \mathbb{R}$.
Recently, a different sort of ‘kernel copula process’ has been used, where the marginals of the variables being modelled are not learned [16] (see footnote 1). Further, we also consider a more subtle and flexible form of our model, where the function $g$ itself is indexed by time: $g = g_t(f(t), \omega)$. We only assume that the marginal distributions of $\sigma_t$ are stationary over ‘small’ time periods, and for each of these time periods (5)-(7) hold true. We return to this in the final discussion section. Here we have assumed that each observation, conditioned on knowing its variance, is normally distributed with zero mean. This is a common assumption in heteroscedastic models. The zero mean and normality assumptions can be relaxed and are not central to this paper.

4 Predictions with GCPV

Ultimately, we wish to infer $p(\sigma(t_*) | \mathbf{y}, z)$, where $z = \{\theta, \omega\}$, and $\theta$ are the hyperparameters of the GP covariance function. To do this, we sample from

$$p(f_* | \mathbf{y}, z) = \int p(f_* | \mathbf{f}, \theta)\, p(\mathbf{f} | \mathbf{y}, z)\, d\mathbf{f} \tag{10}$$

and then transform these samples by $g$. Letting $(C_f)_{ij} = \delta_{ij}\, g(f_i, \omega)^2$, where $\delta_{ij}$ is the Kronecker delta, $K_{ij} = k(t_i, t_j)$, $(\mathbf{k}_*)_i = k(t_*, t_i)$, we have

$$p(\mathbf{f} | \mathbf{y}, z) = \mathcal{N}(\mathbf{f}; 0, K)\, \mathcal{N}(\mathbf{y}; 0, C_f) / p(\mathbf{y} | z), \tag{11}$$
$$p(f_* | \mathbf{f}, \theta) = \mathcal{N}(\mathbf{k}_*^\top K^{-1} \mathbf{f},\; k(t_*, t_*) - \mathbf{k}_*^\top K^{-1} \mathbf{k}_*). \tag{12}$$

We also wish to learn $z$, which we can do by finding the $\hat{z}$ that maximizes the marginal likelihood,

$$p(\mathbf{y} | z) = \int p(\mathbf{y} | \mathbf{f}, \omega)\, p(\mathbf{f} | \theta)\, d\mathbf{f}. \tag{13}$$

Unfortunately, for many functions $g$, (10) and (13) are intractable. Our methods of dealing with this can be used in very general circumstances, where one has a Gaussian process prior, but an (optionally parametrized) non-Gaussian likelihood. We use the Laplace approximation to estimate $p(\mathbf{f} | \mathbf{y}, z)$ as a Gaussian. Then we can integrate (10) for a Gaussian approximation to $p(f_* | \mathbf{y}, z)$, which we sample from to make predictions of $\sigma_*$. Using Laplace, we can also find an expression for an approximate marginal likelihood, which we maximize to determine $z$.
Once we have found $z$ with Laplace, we use Markov chain Monte Carlo to sample from $p(f_* | \mathbf{y}, z)$, and compare that to using Laplace to sample from $p(f_* | \mathbf{y}, z)$. In the supplement we relate this discussion to (9).

4.1 Laplace Approximation

The goal is to approximate (11) with a Gaussian, so that we can evaluate (10) and (13) and make predictions. In doing so, we follow Rasmussen and Williams [11] in their treatment of Gaussian process classification, except we use a parametrized likelihood, and modify Newton’s method. First, consider as an objective function the logarithm of an unnormalized (11):

$$s(\mathbf{f} | \mathbf{y}, z) = \log p(\mathbf{y} | \mathbf{f}, \omega) + \log p(\mathbf{f} | \theta). \tag{14}$$

The Laplace approximation uses a second order Taylor expansion about the $\hat{\mathbf{f}}$ which maximizes (14), to find an approximate objective $\tilde{s}(\mathbf{f} | \mathbf{y}, z)$. So the first step is to find $\hat{\mathbf{f}}$, for which we use Newton’s method. The Newton update is $\mathbf{f}^{\text{new}} = \mathbf{f} - (\nabla\nabla s(\mathbf{f}))^{-1} \nabla s(\mathbf{f})$. Differentiating (14),

$$\nabla s(\mathbf{f} | \mathbf{y}, z) = \nabla \log p(\mathbf{y} | \mathbf{f}, \omega) - K^{-1} \mathbf{f} \tag{15}$$
$$\nabla\nabla s(\mathbf{f} | \mathbf{y}, z) = \nabla\nabla \log p(\mathbf{y} | \mathbf{f}, \omega) - K^{-1} = -W - K^{-1}, \tag{16}$$

where $W$ is the diagonal matrix $-\nabla\nabla \log p(\mathbf{y} | \mathbf{f}, \omega)$.

Footnote 1: Note added in proof: Also, for a very recent related model, see Rodríguez et al. [17].

If the likelihood function $p(\mathbf{y} | \mathbf{f}, \omega)$ is not log concave, then $W$ may have negative entries. Vanhatalo et al. [18] found this to be problematic when doing Gaussian process regression with a Student-t likelihood. They instead use an expectation-maximization (EM) algorithm for finding $\hat{\mathbf{f}}$, and iterate ordered rank one Cholesky updates to evaluate the Laplace approximate marginal likelihood. But EM can converge slowly, especially near a local optimum, and each of the rank one updates is vulnerable to numerical instability. With a small modification of Newton’s method, we often get close to quadratic convergence for finding $\hat{\mathbf{f}}$, and can evaluate the Laplace approximate marginal likelihood in a numerically stable fashion, with no approximate Cholesky factors, and optimal computational requirements.
Some comments are in the supplementary material but, in short, we use an approximate negative Hessian, $-\nabla\nabla s \approx M + K^{-1}$, which is guaranteed to be positive definite, since $M$ is formed on each iteration by zeroing the negative entries of $W$. For stability, we reformulate our optimization in terms of $B = I + M^{1/2} K M^{1/2}$, and let $Q = M^{1/2} B^{-1} M^{1/2}$, $\mathbf{b} = M\mathbf{f} + \nabla \log p(\mathbf{y} | \mathbf{f})$, $\mathbf{a} = \mathbf{b} - QK\mathbf{b}$. Since $(K^{-1} + M)^{-1} = K - KQK$, the Newton update becomes $\mathbf{f}^{\text{new}} = K\mathbf{a}$. With these updates we find $\hat{\mathbf{f}}$ and get an expression for $\tilde{s}$ which we use to approximate (13) and (11). The approximate marginal likelihood $q(\mathbf{y} | z)$ is given by $\int \exp(\tilde{s})\, d\mathbf{f}$. Taking its logarithm,

$$\log q(\mathbf{y} | z) = -\frac{1}{2} \hat{\mathbf{f}}^\top \mathbf{a}_{\hat{f}} + \log p(\mathbf{y} | \hat{\mathbf{f}}) - \frac{1}{2} \log |B_{\hat{f}}|, \tag{17}$$

where $B_{\hat{f}}$ is $B$ evaluated at $\hat{\mathbf{f}}$, and $\mathbf{a}_{\hat{f}}$ is a numerically stable evaluation of $K^{-1} \hat{\mathbf{f}}$.

To learn the parameters $z$, we use conjugate gradient descent to maximize (17) with respect to $z$. Since $\hat{\mathbf{f}}$ is a function of $z$, we initialize $z$, and update $\hat{\mathbf{f}}$ every time we vary $z$. Once we have found an optimum $\hat{z}$, we can make predictions. By exponentiating $\tilde{s}$, we find a Gaussian approximation to the posterior (11), $q(\mathbf{f} | \mathbf{y}, z) = \mathcal{N}(\hat{\mathbf{f}}, K - KQK)$. The product of this approximate posterior with $p(f_* | \mathbf{f})$ is Gaussian. Integrating this product, we approximate $p(f_* | \mathbf{y}, z)$ as

$$q(f_* | \mathbf{y}, z) = \mathcal{N}(\mathbf{k}_*^\top \nabla \log p(\mathbf{y} | \hat{\mathbf{f}}),\; k(t_*, t_*) - \mathbf{k}_*^\top Q \mathbf{k}_*). \tag{18}$$

Given $n$ training observations, the cost of each Newton iteration is dominated by computing the Cholesky decomposition of $B$, which takes $O(n^3)$ operations. The objective function typically changes by less than $10^{-6}$ after 3 iterations. Once Newton’s method has converged, it takes only $O(1)$ operations to draw from $q(f_* | \mathbf{y}, z)$ and make predictions.

4.2 Markov chain Monte Carlo

We use Markov chain Monte Carlo (MCMC) to sample from (11), so that we can later sample from $p(\sigma_* | \mathbf{y}, z)$ to make predictions. Sampling from (11) is difficult, because the variables $\mathbf{f}$ are strongly coupled by a Gaussian process prior.
We use a new technique, Elliptical Slice Sampling [19], and find it extremely effective for this purpose. It was specifically designed to sample from posteriors with correlated Gaussian priors. It has no free parameters, and jointly updates every element of $\mathbf{f}$. For our setting, it is over 100 times as fast as axis aligned slice sampling with univariate updates. To make predictions, we take $J$ samples of $p(\mathbf{f} | \mathbf{y}, z)$, $\{\mathbf{f}^1, \dots, \mathbf{f}^J\}$, and then approximate (10) as a mixture of $J$ Gaussians:

$$p(f_* | \mathbf{y}, z) \approx \frac{1}{J} \sum_{i=1}^{J} p(f_* | \mathbf{f}^i, \theta). \tag{19}$$

Each of the Gaussians in this mixture has equal weight. So for each sample of $f_* | \mathbf{y}$, we uniformly choose a random $p(f_* | \mathbf{f}^i, \theta)$ and draw a sample. In the limit $J \to \infty$, we are sampling from the exact $p(f_* | \mathbf{y}, z)$. Mapping these samples through $g$ gives samples from $p(\sigma_* | \mathbf{y}, z)$. After one $O(n^3)$ and one $O(J)$ operation, a draw from (19) takes $O(1)$ operations.

4.3 Warping Function

The warping function, $g$, maps $f_i$, a GP function value, to $\sigma_i$, a standard deviation. Since $f_i$ can take any value in $\mathbb{R}$, and $\sigma_i$ can take any non-negative real value, $g : \mathbb{R} \to \mathbb{R}^+$. For each $f_i$ to correspond to a unique deviation, $g$ must also be one-to-one. We use

$$g(x, \omega) = \sum_{j=1}^{K} a_j \log[\exp[b_j(x + c_j)] + 1], \qquad a_j, b_j > 0. \tag{20}$$

This is monotonic, positive, infinitely differentiable, asymptotic towards zero as $x \to -\infty$, and tends to $(\sum_{j=1}^{K} a_j b_j)\, x$ as $x \to \infty$. In practice, it is useful to add a small constant to (20), to avoid rare situations where the parameters $\omega$ are trained to make $g$ extremely small for certain inputs, at the expense of a good overall fit; this can happen when the parameters $\omega$ are learned by optimizing a likelihood. A suitable constant could be one tenth the absolute value of the smallest nonzero observation. By inferring the parameters of the warping function, or distributions of these parameters, we are learning a transformation which will best model $\sigma_t$ with a Gaussian process.
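The warping function in (20) is a weighted sum of softplus terms, and a minimal Python sketch may make its properties easier to check. The names `softplus` and `warp`, the representation of $\omega$ as a list of `(a_j, b_j, c_j)` tuples, and the `offset` argument (standing in for the small added constant mentioned above) are assumptions made for illustration, not the authors' implementation.

```python
import math


def softplus(x):
    """Numerically stable evaluation of log(exp(x) + 1)."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def warp(x, omega, offset=0.0):
    """Evaluate g(x, omega) = sum_j a_j * log(exp(b_j * (x + c_j)) + 1) + offset.

    omega is a list of (a_j, b_j, c_j) tuples with a_j, b_j > 0; offset is the
    optional small constant added to keep g away from zero.
    """
    total = offset
    for a, b, c in omega:
        if a <= 0 or b <= 0:
            raise ValueError("a_j and b_j must be positive")
        total += a * softplus(b * (x + c))
    return total
```

With a single term, `warp(x, [(a, b, c)])` grows with slope close to `a * b` for large `x` and decays towards zero for very negative `x`, matching the limits stated for (20). The branch in `softplus` avoids overflow in `exp` for large positive arguments.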
The more flexible the warping function, the more potential there is to improve the GCPV fit; in other words, the better we can estimate the ‘perfect’ transformation. To test the importance of this flexibility, we also try a simple unparametrized warping function, $g(x) = e^x$. In related work, Goldberg et al. [20] place a GP prior on the log noise level in a standard GP regression model on observations, except for inference they use Gibbs sampling, and a high level of ‘jitter’ for conditioning. Once $g$ is trained, we can infer the marginal distribution of each $\sigma$: $F(a) = \Phi(g^{-1}(a))$, for $a \in \mathbb{R}$. This suggests an alternate way to initialize $g$: we can initialize $F$ as a mixture of Gaussians, and then map through $\Phi^{-1}$ to find $g^{-1}$. Since mixtures of Gaussians are dense in the set of probability distributions, we could in principle find the ‘perfect’ $g$ using an infinite mixture of Gaussians [21].

5 Experiments

In our experiments, we predict the latent standard deviations $\sigma$ of observations $y$ at times $t$, and also $\sigma_*$ at unobserved times $t_*$. To do this, we use two versions of GCPV. The first variant, which we simply refer to as GCPV, uses the warping function (20) with $K = 1$, and squared exponential covariance function, $k(t, t') = A \exp(-(t - t')^2 / l^2)$, with $A = 1$. The second variant, which we call GP-EXP, uses the unparametrized warping function $e^x$, and the same covariance function, except the amplitude $A$ is a trained hyperparameter. The other hyperparameter $l$ is called the lengthscale of the covariance function. The greater $l$, the greater the covariance between $\sigma_t$ and $\sigma_{t+a}$ for $a \in \mathbb{R}$. We train hyperparameters by maximizing the Laplace approximate log marginal likelihood (17). We then sample from $p(f_* | \mathbf{y})$ using the Laplace approximation (18). We also do this using MCMC (19) with $J = 10000$, after discarding a previous 10000 samples of $p(\mathbf{f} | \mathbf{y})$ as burn-in.
We pass these samples of $f_* | \mathbf{y}$ through $g$ and $g^2$ to draw from $p(\sigma_* | \mathbf{y})$ and $p(\sigma_*^2 | \mathbf{y})$, and compute the sample mean and variance of $\sigma_* | \mathbf{y}$. We use the sample mean as a point predictor, and the sample variance for error bounds on these predictions, and we use 10000 samples to compute these quantities. For GCPV we use Laplace and MCMC for inference, but for GP-EXP we only use Laplace. We compare predictions to GARCH(1,1), which has been shown in extensive and recent reviews to be competitive with other GARCH variants, and more sophisticated models [5, 6, 7]. GARCH(p,q) specifies $y(t) \sim \mathcal{N}(0, \sigma^2(t))$, and lets the variance be a deterministic function of the past: $\sigma_t^2 = a_0 + \sum_{i=1}^{p} a_i y_{t-i}^2 + \sum_{j=1}^{q} b_j \sigma_{t-j}^2$. We use the Matlab Econometrics Toolbox implementation of GARCH, where the parameters $a_0$, $a_i$ and $b_j$ are estimated using a constrained maximum likelihood. We make forecasts of volatility, and we predict historical volatility. By ‘historical volatility’ we mean the volatility at observed time points, or between these points. Uncovering historical volatility is important. It could, for instance, be used to study what causes fluctuations in the stock market, or to understand physical systems. To evaluate our model, we use the Mean Squared Error (MSE) between the true variance, or proxy for the truth, and the predicted variance. Although likelihood has advantages, we are limited in space, and we wish to harmonize with the econometrics literature, and other assessments of volatility models, where MSE is the standard. In a similar assessment of volatility models, Brownlees et al. [7] found that MSE and quasi-likelihood rankings were comparable. When the true variance is unknown we follow Brownlees et al. [7] and use squared observations as a proxy for the truth, to compare our model to GARCH (see footnote 2). The more observations, the more reliable these performance estimates will be. However, not many observations (e.g.
100) are needed for a stable ranking of competing models; in Brownlees et al. [7], the rankings derived from high frequency squared observations are similar to those derived using daily squared observations.

Footnote 2: Since each observation $y$ is assumed to have zero mean and variance $\sigma^2$, $E[y^2] = \sigma^2$.

5.1 Simulations

We simulate observations from $\mathcal{N}(0, \sigma^2(t))$, using $\sigma(t) = \sin(t)\cos(t^2) + 1$, at $\mathbf{t} = (0, 0.02, 0.04, \dots, 4)^\top$. We call this data set TRIG. We also simulate using a standard deviation that jumps from 0.1 to 7 and back, at times $\mathbf{t} = (0, 0.1, 0.2, \dots, 6)^\top$. We call this data set JUMP. To forecast, we use all observations up until the current time point, and make 1, 7, and 30 step ahead predictions. So, for example, in TRIG we start by observing $t = 0$, and make forecasts at $t = 0.02, 0.14, 0.60$. Then we observe $t = 0, 0.02$ and make forecasts at $t = 0.04, 0.16, 0.62$, and so on, until all data points have been observed. For historical volatility, we predict the latent $\sigma_t$ at the observation times, which is safe since we are comparing to the true volatility, which is not used in training; the results are similar if we interpolate. Figure 1 panels a) and b) show the true volatility for TRIG and JUMP respectively, alongside GCPV Laplace, GCPV MCMC, GP-EXP Laplace, and GARCH(1,1) predictions of historical volatility. Table 1 shows the results for forecasting and historical volatility. In panel a) we see that GCPV more accurately captures the dependencies between $\sigma$ at different time points than GARCH: if we manually decrease the lengthscale in the GCPV covariance function, we can replicate the erratic GARCH behaviour, which inaccurately suggests that the covariance between $\sigma_t$ and $\sigma_{t+a}$ decreases quickly with increases in $a$. We also see that GCPV with an unparametrized exponential warping function tends to overestimate peaks and underestimate troughs.
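The TRIG simulation, the GARCH(1,1) variance recursion, and the MSE criterion described above can be sketched as follows. This is a rough illustration only: the helper names are hypothetical, the GARCH parameters in the usage lines are arbitrary placeholder values rather than maximum likelihood estimates, and the paper itself uses the Matlab Econometrics Toolbox, not this code.

```python
import math
import random


def simulate_trig(rng):
    """Draw y(t) ~ N(0, sigma(t)^2), sigma(t) = sin(t) cos(t^2) + 1, t = 0, 0.02, ..., 4."""
    times = [0.02 * i for i in range(201)]
    sigma = [math.sin(t) * math.cos(t * t) + 1.0 for t in times]
    y = [rng.gauss(0.0, s) for s in sigma]
    return times, sigma, y


def garch11_variance(y, a0, a1, b1, initial_variance):
    """GARCH(1,1) recursion: sigma_t^2 = a0 + a1 * y_{t-1}^2 + b1 * sigma_{t-1}^2."""
    var = [initial_variance]
    for t in range(1, len(y)):
        var.append(a0 + a1 * y[t - 1] ** 2 + b1 * var[t - 1])
    return var


def mse(predicted, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


# Usage sketch. The parameters a0, a1, b1 are hypothetical, not fitted.
rng = random.Random(0)
times, sigma, y = simulate_trig(rng)
var_hat = garch11_variance(y, a0=0.1, a1=0.2, b1=0.7, initial_variance=1.0)
mse_vs_truth = mse(var_hat, [s * s for s in sigma])   # true variance known
mse_vs_proxy = mse(var_hat, [v * v for v in y])       # squared observations as proxy
```

In the simulated setting the true variance is available, so `mse_vs_truth` is the relevant score; `mse_vs_proxy` mirrors what can be computed on real data, where squared observations stand in for the unknown variance.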
In panel b), the volatility is extremely difficult to reconstruct or forecast: with no warning it will immediately and dramatically increase or decrease. This behaviour is not suited to a smooth squared exponential covariance function. Nevertheless, GCPV outperforms GARCH, especially in regions of low volatility. We also see this in panel a) for $t \in (1.5, 2)$. GARCH is known to respond slowly to large returns, and to overpredict volatility [22]. In JUMP, the greater the peaks, and the smaller the troughs, the more GARCH suffers, while GCPV is mostly robust to these changes.

5.2 Financial Data

The returns on the daily exchange rate between the Deutschmark (DM) and the Great Britain Pound (GBP) from 1984 to 1992 have become a benchmark for assessing the performance of GARCH models [8, 9, 10]. This exchange data, which we refer to as DMGBP, can be obtained from www.datastream.com, and the returns are calculated as $r_t = \log(P_{t+1}/P_t)$, where $P_t$ is the number of DM to GBP on day $t$. The returns are assumed to have a zero mean function. We use a rolling window of the previous 120 days of returns to make 1, 7, and 30 day ahead volatility forecasts, starting at the beginning of January 1988, and ending at the beginning of January 1992 (659 trading days). Every 7 days, we retrain the parameters of GCPV and GARCH. Every time we retrain parameters, we predict historical volatility over the past 120 days. The average MSE for these historical predictions is given in Table 1, although they should be observed with caution; unlike with the simulations, the DMGBP historical predictions are trained using the same data they are assessed on. In Figure 1c), we see that the GARCH one day ahead forecasts are lifted above the GCPV forecasts, but unlike in the simulations, they are now operating on a similar lengthscale. This suggests that GARCH could still be overpredicting volatility, but that GCPV has adapted its estimation of how $\sigma_t$ and $\sigma_{t+a}$ correlate with one another.
Since GARCH is suited to this financial data set, it is reassuring that GCPV predictions have a similar time varying structure. Overall, GCPV and GARCH are competitive with one another for forecasting currency exchange returns, as seen in Table 1. Moreover, a learned warping function g outperforms an unparametrized one, and a full Laplace solution is comparable to using MCMC for inference, in accuracy and speed. This is also true for the simulations. Therefore we recommend whichever is more convenient to implement.

Table 1: MSE for predicting volatility.

Data set          Model          Historical   1 step   7 step   30 step
TRIG              GCPV (LA)      0.0953       0.588    0.951    1.71
                  GCPV (MCMC)    0.0760       0.622    0.979    1.76
                  GP-EXP         0.193        0.646    1.36     1.15
                  GARCH          0.938        1.04     1.79     5.12
JUMP (×10^3)      GCPV (LA)      0.588        0.891    1.38     1.35
                  GCPV (MCMC)    1.21         0.951    1.37     1.35
                  GP-EXP         1.43         1.76     6.95     14.7
                  GARCH          1.88         1.58     3.43     5.65
DMGBP (×10^−9)    GCPV (LA)      2.43         3.00     3.08     3.17
                  GCPV (MCMC)    2.39         3.00     3.08     3.17
                  GP-EXP         2.52         3.20     3.46     5.14
                  GARCH          2.83         3.03     3.12     3.32

[Figure 1 image not reproduced. Panels: (a) TRIG, (b) JUMP, and (c) DMGBP plot volatility against time (days in c); (d) DMGBP plots probability density against σ.]

Figure 1: Predicting volatility and learning its marginal pdf. For a) and b), the true volatility, and GCPV (MCMC), GCPV (LA), GP-EXP, and GARCH predictions, are shown respectively by a thick green line, a dashed thick blue line, a dashed black line, a cyan line, and a red line. a) shows predictions of historical volatility for TRIG, where the shade is a 95% confidence interval about GCPV (MCMC) predictions. b) shows predictions of historical volatility for JUMP. In c), a black line and a dashed red line respectively show GCPV (LA) and GARCH one day ahead volatility forecasts for DMGBP. In d), a black line and a dashed blue line respectively show the GCPV learned marginal pdf of $\sigma_t$ in DMGBP and a Gamma(4.15,0.00045) pdf.

6 Discussion

We defined a copula process, and as an example, developed a stochastic volatility model, GCPV, which can outperform GARCH. With GCPV, the volatility $\sigma_t$ is distributed as a Gaussian Copula Process, which separates the modelling of the dependencies between volatilities at different times from their marginal distributions, arguably the most useful property of a copula. Further, GCPV fits the marginals in the Gaussian copula process by learning a warping function. If we had simply chosen an unparametrized exponential warping function, we would incorrectly be assuming that the log volatilities are marginally Gaussian distributed. Indeed, for the DMGBP data, we trained the warping function g over a 120 day period, and mapped its inverse through the univariate standard Gaussian cdf Φ, and differenced, to estimate the marginal probability density function (pdf) of $\sigma_t$ over this period. The learned marginal pdf, shown in Figure 1d), is similar to a Gamma(4.15,0.00045) distribution. However, in using a rolling window to retrain the parameters of g, we do not assume that the marginals of $\sigma_t$ are stationary; we have a time changing warping function.

While GARCH is successful, and its simplicity is attractive, our model is also simple and has a number of advantages. We can effortlessly handle missing data, we can easily incorporate covariates other than time (like interest rates) in our covariance function, and we can choose from a rich class of covariance functions: squared exponential, Brownian motion, Matérn, periodic, etc. In fact, the volatility of high frequency intradaily returns on equity indices and currency exchanges is cyclical [23], and GCPV with a periodic covariance function is uniquely well suited to this data.
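The marginal pdf estimate described in the discussion (map the inverse warping through the standard Gaussian cdf Φ, then difference) can be sketched as below. This is an illustrative sketch only: it assumes the latent function value is marginally standard normal and that the warping g is increasing, and it uses the unparametrized exponential warping g = exp (so the inverse is log) in place of the learned warping, which is not available here. The helper names are hypothetical.

```python
import math

def std_normal_cdf(x):
    """Standard Gaussian cdf Phi(x), via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def marginal_pdf_by_differencing(g_inverse, grid):
    """Estimate the pdf of sigma = g(f), f ~ N(0, 1), for increasing g.

    The cdf of sigma is F(s) = Phi(g^{-1}(s)); finite differences of F
    over consecutive grid points approximate the pdf at the midpoints.
    """
    cdf = [std_normal_cdf(g_inverse(s)) for s in grid]
    mids, pdf = [], []
    for s0, s1, c0, c1 in zip(grid, grid[1:], cdf, cdf[1:]):
        mids.append(0.5 * (s0 + s1))
        pdf.append((c1 - c0) / (s1 - s0))
    return mids, pdf

# Assumed exponential warping: g = exp, so g^{-1} = log and sigma is log-normal.
grid = [0.01 + 0.01 * i for i in range(2000)]
mids, pdf = marginal_pdf_by_differencing(math.log, grid)
```

With the exponential warping the result is a log-normal density, which is exactly the "log volatilities are marginally Gaussian" assumption the text argues against; substituting the inverse of a learned warping for `math.log` would give the kind of estimate plotted in Figure 1d).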
And the parameters of GCPV, like the covariance function lengthscale, or the learned warping function, provide insight into the underlying source of volatility, unlike the parameters of GARCH.

Finally, copulas are rapidly becoming popular in applications, but often only bivariate copulas are being used. With our copula process one can learn the dependencies between arbitrarily many random variables independently of their marginal distributions. We hope the Gaussian Copula Process Volatility model will encourage other applications of copula processes. More generally, we hope our work will help bring together the machine learning and econometrics communities.

Acknowledgments: Thanks to Carl Edward Rasmussen and Ferenc Huszár for helpful conversations. AGW is supported by an NSERC grant.

References

[1] Paul Embrechts, Alexander McNeil, and Daniel Straumann. Correlation and dependence in risk management: Properties and pitfalls. In Risk Management: Value at risk and beyond, pages 176–223. Cambridge University Press, 1999.
[2] David X. Li. On default correlation: A copula function approach. Journal of Fixed Income, 9(4):43–54, 2000.
[3] Roger B. Nelsen. An Introduction to Copulas. Springer Series in Statistics, second edition, 2006.
[4] Tim Bollerslev. Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31(3):307–327, 1986.
[5] Ser-Huang Poon and Clive W.J. Granger. Practical issues in forecasting volatility. Financial Analysts Journal, 61(1):45–56, 2005.
[6] Peter Reinhard Hansen and Asger Lunde. A forecast comparison of volatility models: Does anything beat a GARCH(1,1)? Journal of Applied Econometrics, 20(7):873–889, 2005.
[7] Christian T. Brownlees, Robert F. Engle, and Bryan T. Kelly. A practical guide to volatility forecasting through calm and storm, 2009. Available at SSRN: http://ssrn.com/abstract=1502915.
[8] T. Bollerslev and E. Ghysels. Periodic autoregressive conditional heteroscedasticity. Journal of Business and Economic Statistics, 14:139–151, 1996.
[9] B.D. McCullough and C.G. Renfro. Benchmarks and software standards: A case study of GARCH procedures. Journal of Economic and Social Measurement, 25:59–71, 1998.
[10] C. Brooks, S.P. Burke, and G. Persand. Benchmarks and the accuracy of GARCH model estimation. International Journal of Forecasting, 17:45–56, 2001.
[11] Carl Edward Rasmussen and Christopher K.I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
[12] Abe Sklar. Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris, 8:229–231, 1959.
[13] P. Deheuvels. Caractérisation complète des lois extrêmes multivariées et de la convergence des types extrêmes. Publications de l'Institut de Statistique de l'Université de Paris, 23:1–36, 1978.
[14] G. Kimeldorf and A. Sampson. Uniform representations of bivariate distributions. Communications in Statistics, 4:617–627, 1982.
[15] Edward Snelson, Carl Edward Rasmussen, and Zoubin Ghahramani. Warped Gaussian Processes. In NIPS, 2003.
[16] Sebastian Jaimungal and Eddie K.H. Ng. Kernel-based Copula processes. In ECML PKDD, 2009.
[17] A. Rodríguez, D.B. Dunson, and A.E. Gelfand. Latent stick-breaking processes. Journal of the American Statistical Association, 105(490):647–659, 2010.
[18] Jarno Vanhatalo, Pasi Jylanki, and Aki Vehtari. Gaussian process regression with Student-t likelihood. In NIPS, 2009.
[19] Iain Murray, Ryan Prescott Adams, and David J.C. MacKay. Elliptical Slice Sampling. In AISTATS, 2010.
[20] Paul W. Goldberg, Christopher K.I. Williams, and Christopher M. Bishop. Regression with input-dependent noise: A Gaussian process treatment. In NIPS, 1998.
[21] Carl Edward Rasmussen. The Infinite Gaussian Mixture Model. In NIPS, 2000.
[22] Ruey S. Tsay. Analysis of Financial Time Series. John Wiley & Sons, 2002.
[23] Torben G. Andersen and Tim Bollerslev. Intraday periodicity and volatility persistence in financial markets. Journal of Empirical Finance, 4(2-3):115–158, 1997.

same-paper 3 0.83358914 195 nips-2010-Online Learning in The Manifold of Low-Rank Matrices

Author: Uri Shalit, Daphna Weinshall, Gal Chechik

4 0.81854868 222 nips-2010-Random Walk Approach to Regret Minimization

Author: Hariharan Narayanan, Alexander Rakhlin

Abstract: We propose a computationally efficient random walk on a convex body which rapidly mixes to a time-varying Gibbs distribution. In the setting of online convex optimization and repeated games, the algorithm yields low regret and presents a novel efficient method for implementing mixture forecasting strategies.

5 0.78520685 154 nips-2010-Learning sparse dynamic linear systems using stable spline kernels and exponential hyperpriors

Author: Alessandro Chiuso, Gianluigi Pillonetto

Abstract: We introduce a new Bayesian nonparametric approach to identification of sparse dynamic linear systems. The impulse responses are modeled as Gaussian processes whose autocovariances encode the BIBO stability constraint, as defined by the recently introduced “Stable Spline kernel”. Sparse solutions are obtained by placing exponential hyperpriors on the scale factors of such kernels. Numerical experiments regarding estimation of ARMAX models show that this technique provides a definite advantage over a group LAR algorithm and state-of-the-art parametric identification techniques based on prediction error minimization.

6 0.77958477 265 nips-2010-The LASSO risk: asymptotic results and real world examples

7 0.77661884 270 nips-2010-Tight Sample Complexity of Large-Margin Learning

8 0.77621996 81 nips-2010-Evaluating neuronal codes for inference using Fisher information

9 0.77587795 221 nips-2010-Random Projections for $k$-means Clustering

10 0.77488339 261 nips-2010-Supervised Clustering

11 0.77457142 117 nips-2010-Identifying graph-structured activation patterns in networks

12 0.77378255 7 nips-2010-A Family of Penalty Functions for Structured Sparsity

13 0.77294886 36 nips-2010-Avoiding False Positive in Multi-Instance Learning

14 0.77255982 63 nips-2010-Distributed Dual Averaging In Networks

15 0.77188629 74 nips-2010-Empirical Bernstein Inequalities for U-Statistics

16 0.77163178 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability

17 0.77107525 27 nips-2010-Agnostic Active Learning Without Constraints

18 0.7710641 243 nips-2010-Smoothness, Low Noise and Fast Rates

19 0.77078766 88 nips-2010-Extensions of Generalized Binary Search to Group Identification and Exponential Costs

20 0.77002251 236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information