jmlr jmlr2011 jmlr2011-80 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Gilles Meyer, Silvère Bonnabel, Rodolphe Sepulchre
Abstract: The paper addresses the problem of learning a regression model parameterized by a fixed-rank positive semidefinite matrix. The focus is on the nonlinear nature of the search space and on scalability to high-dimensional problems. The mathematical developments rely on the theory of gradient descent algorithms adapted to the Riemannian geometry that underlies the set of fixedrank positive semidefinite matrices. In contrast with previous contributions in the literature, no restrictions are imposed on the range space of the learned matrix. The resulting algorithms maintain a linear complexity in the problem size and enjoy important invariance properties. We apply the proposed algorithms to the problem of learning a distance function parameterized by a positive semidefinite matrix. Good performance is observed on classical benchmarks. Keywords: linear regression, positive semidefinite matrices, low-rank approximation, Riemannian geometry, gradient-based learning
Reference: text
sentIndex sentText sentNum sentScore
1 The mathematical developments rely on the theory of gradient descent algorithms adapted to the Riemannian geometry that underlies the set of fixed-rank positive semidefinite matrices. [sent-11, score-0.278]
2 Learning a parametric model in S+ (r, d) amounts to jointly learning an r-dimensional subspace and a quadratic distance in this subspace. [sent-24, score-0.211]
3 Our approach makes use of two quotient geometries of the set S+ (r, d) that have been recently studied by Journée et al. [sent-29, score-0.236]
4 As it is generally not possible to compute F(W) explicitly, batch learning algorithms instead minimize the empirical cost f_n(W) = (1/(2n)) Σ_{i=1}^{n} (ŷ_i − y_i)^2, (1) which is the average loss computed over a finite number of samples {(X_i, y_i)}_{i=1}^{n}. [sent-68, score-0.327]
5 At time t, the online learning algorithm minimizes the instantaneous cost f_t(W) = (1/2)(ŷ_t − y_t)^2. [sent-70, score-0.24]
6 The single necessary change to convert an online algorithm into its batch counterpart is to perform, at each iteration, the minimization of the empirical cost fn instead of the minimization of the instantaneous cost ft . [sent-72, score-0.327]
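As a concrete illustration of the two costs above, the following sketch evaluates the empirical cost (1) and the instantaneous cost; the trace-form prediction ŷ = Tr(W Sym(X)) and the helper names are illustrative assumptions, not code taken from the paper.

```python
import numpy as np

def sym(X):
    """Symmetric part of a square matrix."""
    return 0.5 * (X + X.T)

def predict(W, X):
    """Linear model: y_hat = Tr(W Sym(X))."""
    return np.trace(W @ sym(X))

def instantaneous_cost(W, X_t, y_t):
    """Online cost f_t(W) = 0.5 * (y_hat_t - y_t)**2."""
    return 0.5 * (predict(W, X_t) - y_t) ** 2

def empirical_cost(W, Xs, ys):
    """Batch cost f_n(W) = (1/(2n)) * sum_i (y_hat_i - y_i)**2."""
    residuals = np.array([predict(W, X) - y for X, y in zip(Xs, ys)])
    return 0.5 * float(np.mean(residuals ** 2))
```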
7 (2008), an abstract gradient descent algorithm can then be derived based on the update formula W_{t+1} = R_{W_t}(−s_t grad f(W_t)). (2) [sent-77, score-0.498]
8 The gradient grad f(W_t) is an element of the tangent space T_{W_t}W. [sent-78, score-0.538]
9 The retraction R_{W_t} is a mapping from the tangent space T_{W_t}W to the Riemannian manifold. [sent-80, score-0.278]
10 Under mild conditions on the retraction R, the classical convergence theory of line-search algorithms in linear spaces generalizes to Riemannian manifolds (see Absil et al., 2008). [sent-81, score-0.218]
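The abstract update (2) only needs two ingredients: a Riemannian gradient and a retraction. A minimal sketch with both passed in as callables is shown below; the function names are chosen for this illustration and the Euclidean special case recovers the standard update (3).

```python
def riemannian_gradient_step(W, grad_f, retract, step):
    """One step of the abstract update (2): W_next = R_W(-step * grad f(W))."""
    xi = -step * grad_f(W)   # tangent vector at W (steepest-descent direction)
    return retract(W, xi)    # retraction maps the tangent vector back onto the manifold

# In a flat (Euclidean) search space the retraction is simply W + xi, which
# recovers the standard linear-regression update (3).
euclidean_retraction = lambda W, xi: W + xi
```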
11 Observe that the standard (online) learning algorithm for linear regression in R^d, w_{t+1} = w_t − s_t (w_t^T x_t − y_t) x_t, (3) can be interpreted as a particular case of (2) for the linear model ŷ = w^T x in the linear search space R^d. [sent-83, score-0.533]
12 This example illustrates that the main ingredients to obtain a concrete algorithm are convenient formulas for the gradient and for the retraction mapping. [sent-87, score-0.259]
13 Each of those sets will be equipped with quotient Riemannian geometries that provide convenient formulas for the gradient and for the retractions. [sent-89, score-0.331]
14 However, when the search space is endowed with a particular manifold structure, it is possible to design an exploration strategy that is consistent with the geometry of the problem and that appropriately turns the problem into an unconstrained optimization problem. [sent-97, score-0.243]
15 One usually makes the distinction between embedded submanifolds (subsets of larger manifolds) and quotient manifolds (manifolds described by a set of equivalence classes). [sent-100, score-0.292]
16 A typical example of a quotient manifold is the set of r-dimensional subspaces in R^d, viewed as a collection of r-dimensional orthogonal frames that cannot be superposed by a rotation. [sent-102, score-0.268]
17 The rotational variants of a given frame thus define an equivalence class (denoted using square brackets [·]), which is identified as a single point on the quotient manifold. [sent-103, score-0.238]
18 The updated point Wt+1 automatically remains inside the manifold thanks to the retraction mapping. [sent-107, score-0.231]
19 For quotient manifolds W = W̄/∼, where W̄ is the total space and ∼ is the equivalence relation that defines the quotient, the tangent space T_{[W]}W at [W] is sufficiently described by the directions that do not induce any displacement in the set of equivalence classes [W]. [sent-111, score-0.443]
20 Provided that the metric ḡ_W in the total space is invariant along the equivalence classes, it defines a metric in the quotient space, g_{[W]}(ξ_{[W]}, ζ_{[W]}) ≜ ḡ_W(ξ̄_W, ζ̄_W). [sent-113, score-0.448]
21 The horizontal gradient grad f(W) is obtained by projecting the gradient in the total space onto the set of horizontal vectors ξ̄_W at W. [sent-114, score-0.934]
22 Natural displacements at W in a direction ξW on the manifold are performed by following geodesics (paths of shortest length on the manifold) starting from W and tangent to ξW . [sent-115, score-0.222]
23 It is a quotient representation of the set of r-dimensional subspaces in Rd , that is, the Grassmann manifold Gr(r, d). [sent-132, score-0.268]
24 The quotient geometries of Gr(r, d) have been well studied (Edelman et al., 1998). [sent-133, score-0.236]
25 The metric g_{[U]}(ξ_{[U]}, ζ_{[U]}) ≜ ḡ_U(ξ̄_U, ζ̄_U) is induced by the standard metric in R^{d×r}, ḡ_U(Δ_1, Δ_2) = Tr(Δ_1^T Δ_2), which is invariant along the fibers, that is, the equivalence classes. [sent-136, score-0.247]
26 Therefore, the gradient admits the simple horizontal representation grad f(U) = Π_U grad f̄(U), (5) where grad f̄(U) is defined by the identity D f(U)[Δ] = ḡ_U(Δ, grad f̄(U)). [sent-138, score-1.547]
27 A possible advantage of the retraction (7) over the retraction (6) is that, in contrast to the SVD computation, the QR decomposition is computed in a fixed number O(dr^2) of arithmetic operations. [sent-143, score-0.328]
28 With the formulas (5) and (7) applied to the cost function (4), the abstract update (2) becomes U_{t+1} = qf(U_t + s_t (I − U_t U_t^T) x_t x_t^T U_t), which is Oja's update for subspace tracking (Oja, 1992). [sent-144, score-0.372]
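A sketch of this subspace-tracking iteration, with qf(·) implemented as the Q factor of a thin QR decomposition; the variable names are illustrative.

```python
import numpy as np

def qf(A):
    """Q factor of the thin QR decomposition of A (sign-fixed so that diag(R) >= 0)."""
    Q, R = np.linalg.qr(A)
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs

def oja_step(U, x_t, s_t):
    """Subspace-tracking step: U <- qf(U + s_t (I - U U^T) x_t x_t^T U)."""
    proj_x = x_t - U @ (U.T @ x_t)            # (I - U U^T) x_t
    return qf(U + s_t * np.outer(proj_x, x_t @ U))
```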
29 The quotient geometries of S+ (d) are rooted in the matrix factorization W = GG^T, G ∈ GL(d), where GL(d) is the set of all invertible d × d matrices. [sent-153, score-0.236]
30 Because the factorization is invariant by rotation, G → GO, O ∈ O(d), the search space is once again identified with the quotient S+ (d) ≃ GL(d)/O(d), which represents the set of equivalence classes [G] = {GO s.t. O ∈ O(d)}. [sent-154, score-0.27]
31 A flat metric on S+ (d): the metric on the quotient GL(d)/O(d), g_{[G]}(ξ_{[G]}, ζ_{[G]}) ≜ ḡ_G(ξ̄_G, ζ̄_G), is induced by the standard metric in R^{d×d}, ḡ_G(Δ_1, Δ_2) = Tr(Δ_1^T Δ_2), which is invariant by rotation along the set of equivalence classes. [sent-159, score-0.448]
32 With this geometry, a tangent vector ξ_{[G]} at [G] is represented by a horizontal tangent vector ξ̄_G at G given by ξ̄_G = Sym(Δ)G, Δ ∈ R^{d×d}. [sent-161, score-0.271]
33 The horizontal gradient of f(G) = ℓ(ŷ, y) = (1/2)(Tr(GG^T Sym(X)) − y)^2, (8) is the unique horizontal vector grad f(G) that satisfies D f(G)[Δ] = ḡ_G(Δ, grad f(G)). [sent-162, score-0.87]
34 Elementary computations yield grad f(G) = 2(ŷ − y) Sym(X) G. [sent-163, score-0.36]
35 Those formulas applied to the cost (8) turn the abstract update (2) into the simple formula G_{t+1} = G_t − 2 s_t (ŷ_t − y_t) Sym(X_t) G_t, (9) for an online gradient algorithm, and G_{t+1} = G_t − 2 s_t (1/n) Σ_{i=1}^{n} (ŷ_i − y_i) Sym(X_i) G_t, (10) for a batch gradient algorithm. [sent-165, score-0.656]
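A sketch of the online update (9) and batch update (10) on the factor G; it assumes the trace model ŷ = Tr(GG^T Sym(X)) of cost (8), with helper names chosen for this illustration only.

```python
import numpy as np

def sym(X):
    return 0.5 * (X + X.T)

def flat_online_step(G, X_t, y_t, s_t):
    """Online update (9): G <- G - 2 s_t (y_hat_t - y_t) Sym(X_t) G."""
    y_hat = np.trace(G @ G.T @ sym(X_t))
    return G - 2.0 * s_t * (y_hat - y_t) * (sym(X_t) @ G)

def flat_batch_step(G, Xs, ys, s_t):
    """Batch update (10): the same direction averaged over the n training samples."""
    direction = np.zeros_like(G)
    for X_i, y_i in zip(Xs, ys):
        y_hat = np.trace(G @ G.T @ sym(X_i))
        direction += (y_hat - y_i) * (sym(X_i) @ G)
    return G - 2.0 * s_t * direction / len(ys)
```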
36 The affine-invariant geometry of S+ (d) has been well studied, in particular in the context of information geometry (Smith, 2005). [sent-172, score-0.352]
37 The gradient grad f(W) is given by D f(W)[Δ] = g_W(Δ, grad f(W)). [sent-178, score-0.784]
38 Applying this formula to (5) yields grad f(W) = (ŷ − y) W Sym(X) W. [sent-179, score-0.36]
39 The formulas (12) and (13) applied to the cost (5) turn the abstract update (2) into W_{t+1} = W_t^{1/2} exp(−s_t (ŷ_t − y_t) W_t^{1/2} Sym(X_t) W_t^{1/2}) W_t^{1/2}. [sent-182, score-0.224]
40 With the alternative retraction (14), the update becomes W_{t+1} = W_t − s_t (ŷ_t − y_t) W_t Sym(X_t) W_t, which is the update of Davis et al. (2007). [sent-183, score-0.472]
41 The gradient of this cost function is given by grad f(S) = (ŷ_t − y_t) Sym(X_t), and the retraction is R_S(s ξ_S) = exp(log W + s ξ_S). [sent-192, score-0.745]
42 The corresponding gradient descent update is W_{t+1} = exp(log W_t − s_t (ŷ_t − y_t) Sym(X_t)), which is the update of Tsuda et al. (2005). [sent-193, score-0.41]
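A sketch of this log-Euclidean update using the matrix exponential and logarithm from SciPy; it is a direct transcription of the displayed formula, not the authors' implementation, and it assumes W_t is strictly positive definite so that log W_t exists.

```python
import numpy as np
from scipy.linalg import expm, logm

def sym(X):
    return 0.5 * (X + X.T)

def log_euclidean_step(W_t, X_t, y_t, s_t):
    """W_{t+1} = exp(log W_t - s_t (y_hat_t - y_t) Sym(X_t)), W_t positive definite."""
    y_hat = np.trace(W_t @ sym(X_t))
    S = logm(W_t).real - s_t * (y_hat - y_t) * sym(X_t)
    return expm(sym(S))   # re-symmetrize to absorb numerical round-off
```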
43 Indeed, the flat quotient geometry of the manifold S+ (d) ≃ GL(d)/O(d) is generalized to the quotient geometry of S+ (r, d) ≃ R_*^{d×r}/O(r) by a mere adaptation of matrix dimension, leading to the updates (9) and (10) for matrices G_t ∈ R_*^{d×r}. [sent-202, score-0.863]
44 (2010), where the quotient geometry of S+ (r, d) ≃ R_*^{d×r}/O(r) is studied in detail. [sent-204, score-0.377]
45 In the next section, we propose an alternative geometry that jointly learns an r-dimensional subspace and a full-rank quadratic model in this subspace. [sent-205, score-0.302]
46 However, a generalization is possible by considering the polar matrix factorization G = UR, U ∈ St(r, d), R ∈ S+ (r). [sent-208, score-0.257]
47 This gives a polar parameterization of S+ (r, d): W = U R^2 U^T. [sent-210, score-0.257]
48 This development leads to the quotient representation S+ (r, d) ≃ (St(r, d) × S+ (r))/O(r), (15) based on the invariance of W to the transformation (U, R^2) → (UO, O^T R^2 O), O ∈ O(r). [sent-211, score-0.237]
49 A tangent vector ξ_{[W]} = (ξ_U, ξ_{R^2})_{[U,R^2]} at [(U, R^2)] is described by a horizontal tangent vector ξ̄_W = (ξ̄_U, ξ̄_{R^2})_{(U,R^2)} at (U, R^2) given by ξ̄_U = Π_U Δ, Δ ∈ R^{d×r}, and ξ̄_{R^2} = R Sym(Ψ) R, Ψ ∈ R^{r×r}. [sent-217, score-0.271]
50 The proposed metric is invariant along the set of equivalence classes and thus induces a quotient structure on S+ (r, d). [sent-219, score-0.359]
51 Combining the gradient of (16) with the retractions (18) and (19) gives U_{t+1} = qf(U_t − 2λ s_t (ŷ_t − y_t)(I − U_t U_t^T) Sym(X_t) U_t R_t^2) and R_{t+1}^2 = R_t exp(−(1 − λ) s_t (ŷ_t − y_t) R_t U_t^T Sym(X_t) U_t R_t) R_t. [sent-228, score-0.406]
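A sketch of this coupled update of U_t and R_t^2, with qf(·) taken as the Q factor of a thin QR decomposition, the matrix exponential from SciPy, and λ ∈ [0, 1] weighting subspace versus distance learning. The names and the recovery of R_{t+1} via a symmetric square root are illustrative choices, not the paper's code.

```python
import numpy as np
from scipy.linalg import expm, sqrtm

def polar_step(U, R, X_t, y_t, s_t, lam):
    """One online step in the polar parameterization W = U R^2 U^T (0 <= lam <= 1)."""
    SX = 0.5 * (X_t + X_t.T)                      # Sym(X_t)
    y_hat = np.trace(U @ R @ R @ U.T @ SX)        # prediction Tr(W Sym(X_t))
    err = y_hat - y_t

    # U <- qf(U - 2 lam s_t err (I - U U^T) Sym(X_t) U R^2); qf = Q factor of thin QR
    horiz = (SX @ U - U @ (U.T @ (SX @ U))) @ (R @ R)
    U_next, _ = np.linalg.qr(U - 2.0 * lam * s_t * err * horiz)

    # R^2 <- R exp(-(1 - lam) s_t err R U^T Sym(X_t) U R) R, then recover R
    M = R @ (U.T @ (SX @ U)) @ R
    R2_next = R @ expm(-(1.0 - lam) * s_t * err * M) @ R
    R_next = sqrtm(0.5 * (R2_next + R2_next.T)).real

    return U_next, R_next
```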
52 A proper tuning of this parameter allows us to place more emphasis either on the learning of the subspace U or on the distance R^2 in that subspace. [sent-238, score-0.337]
53 Intermediate values of λ continuously interpolate between the subspace learning problem and the distance learning problem at fixed range space. [sent-242, score-0.211]
54 A proper tuning of λ is of interest when a good estimate of the subspace is available (for instance a subspace given by a proper dimension reduction technique) or when too few observations are available to jointly estimate the subspace and the distance within that subspace. [sent-243, score-0.463]
55 In the latter case, one has the choice to favor either subspace or distance learning. [sent-244, score-0.211]
56 We here present strategies for choosing the step size in both the batch and online cases. [sent-263, score-0.263]
57 In this paper, we use the Armijo step s_A defined at each iteration by the condition f(R_{W_t}(−s_A grad f(W_t))) ≤ f(W_t) − c s_A ‖grad f(W_t)‖^2_{W_t}, (22) where W_t ∈ S+ (r, d) is the current iterate, c ∈ (0, 1), f is the empirical cost (1) and R_W is the chosen retraction. [sent-267, score-0.752]
58 In order to reduce the dependence on s_max in a particular problem, it is chosen inversely proportional to the norm of the gradient at each iteration, s_max = s_0 / ‖grad f(W_t)‖. [sent-270, score-0.486]
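A backtracking sketch of the Armijo selection (22), capped by s_max chosen inversely proportional to the gradient norm as described above; the halving factor, the loop bound, and the Euclidean norm used in place of the Riemannian metric are illustrative assumptions.

```python
import numpy as np

def armijo_step(W, f, grad_f, retract, s0=1.0, c=1e-4, max_halvings=30):
    """Backtracking search for a step satisfying the Armijo condition (22)."""
    g = grad_f(W)
    g_norm2 = float(np.sum(g * g))     # squared gradient norm (Euclidean proxy for the metric)
    s = s0 / np.sqrt(g_norm2)          # cap s_max inversely proportional to the gradient norm
    f_W = f(W)
    for _ in range(max_halvings):
        if f(retract(W, -s * g)) <= f_W - c * s * g_norm2:
            return s
        s *= 0.5
    return s
```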
59 In this paper, the step size schedule s_t is chosen as s_t = (s/μ̂_grad) × (n t_0)/(n t_0 + t), (23) where s > 0, n is the number of considered learning samples, μ̂_grad is an estimate of the average gradient norm ‖grad f(W_0)‖_{W_0}, and t_0 > 0 controls the annealing rate of s_t. [sent-275, score-0.757]
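A direct transcription of schedule (23); how mu_grad_hat is estimated (for instance from the gradient norm at W_0 on a few initial samples) is left to the caller and is an assumption of this sketch.

```python
def annealed_step_size(t, s, mu_grad_hat, n, t0):
    """Schedule (23): s_t = (s / mu_grad_hat) * (n * t0) / (n * t0 + t)."""
    return (s / mu_grad_hat) * (n * t0) / (n * t0 + t)
```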
60 For batch algorithms, the local convergence results follow from the convergence theory of line-search algorithms on Riemannian manifolds (see, for example, Absil et al., 2008). [sent-295, score-0.234]
61 For online algorithms, one can prove that the algorithm based on the flat geometry enjoys almost sure asymptotic convergence to a local minimum of the expected cost. [sent-297, score-0.259]
62 In contrast, when the polar parameterization is used, the convergence results presented by Bottou (1998) do not apply directly because of the quotient nature of the search space. [sent-299, score-0.458]
63 Because the extension would require technical arguments beyond the scope of the present paper, we refrain from stating a formal convergence result for the online algorithm based on the polar geometry, even though the result is quite plausible. [sent-300, score-0.34]
64 One solves (25) by solving the algebraic equation grad D(W, W_t) = −s_t grad ℓ(ŷ_{t+1}, y_{t+1}), (26) which is a first-order (necessary) optimality condition. [sent-313, score-0.845]
65 If the search space W is a Riemannian manifold and if the closeness measure D(W, W_t) is the Riemannian distance, the solution of (26) is W_{t+1} = Exp_{W_t}(−s_t grad ℓ(ŷ_{t+1}, y_{t+1})). [sent-314, score-0.592]
66 However, ŷ_{t+1} is generally replaced by ŷ_t (which is equal to ŷ_{t+1} up to first order terms in s_t), which gives the update (2) where the exponential mapping is chosen as a retraction. [sent-316, score-0.272]
67 We have shown in Section 5 that the resulting updates can be interpreted as line-search updates for the log-Euclidean metric and the affine-invariant metric of S+ (d) and for specific choices of the retraction mapping. [sent-319, score-0.342]
68 where the θ_i's are the principal angles between the subspaces spanned by W and W_t (Golub and Van Loan, 1996), and the second term is the affine-invariant distance of S+ (d) between the matrices R^2 and R_t^2 involved in the polar representation of W and W_t. [sent-322, score-0.384]
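The closeness measure itself is not reproduced in this extract; the sketch below only illustrates its two ingredients as described above: principal angles between the ranges of W and W_t (via the singular values of U^T U_t) and the affine-invariant distance between R^2 and R_t^2. How the paper weights and combines the two terms is not shown here.

```python
import numpy as np
from scipy.linalg import logm, sqrtm

def principal_angles(U1, U2):
    """Principal angles between the subspaces spanned by orthonormal bases U1, U2."""
    s = np.linalg.svd(U1.T @ U2, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))

def affine_invariant_distance(A, B):
    """d(A, B) = || logm(A^{-1/2} B A^{-1/2}) ||_F for symmetric positive definite A, B."""
    A_inv_sqrt = np.linalg.inv(sqrtm(A).real)
    M = A_inv_sqrt @ B @ A_inv_sqrt
    return float(np.linalg.norm(logm(0.5 * (M + M.T)).real, 'fro'))
```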
69 A potential area of future work is the application of the proposed online algorithm (9) for adapting a batch solution to slight modifications of the dissimilarities over time. [sent-346, score-0.263]
70 Overall, the experiments support the conclusion that jointly estimating a subspace and a low-dimensional distance in that subspace is a major advantage of the proposed algorithms over methods that estimate the matrix for a subspace that is fixed beforehand. [sent-413, score-0.463]
71 Recent methods compute that subspace of reduced dimension using principal component analysis (Davis and Dhillon, 2008; Weinberger and Saul, 2009), that is, a subspace that captures a maximal amount of variance in the data. [sent-445, score-0.252]
72 However, in general, there is no reason why the subspace spanned by the top principal components should coincide with the subspace that is defined by the target model. [sent-446, score-0.252]
73 Therefore, a more appropriate approach consists in jointly learning the subspace and a distance within that subspace that best fit the observations. [sent-447, score-0.337]
74 The top plot shows that the learned subspace (which coincides with the target subspace) is indeed very different from the subspace spanned by the top two principal components. [sent-453, score-0.252]
75 [Figure: three-dimensional synthetic data plotted along axes x1, x2, x3, together with the target model subspace, the PCA subspace, and the learned subspace.]
76 Top: the learned subspace is very different from the subspace computed from a classical heuristic. [sent-502, score-0.252]
77 subspace and the distance in that subspace are learned jointly. [sent-505, score-0.337]
78 This experiment shows that when a large number of samples is available (80,000 training samples and 20,000 test samples for learning a parameter W* in S+ (10, 50)), online algorithms minimize the test error more rapidly than batch ones. [sent-545, score-0.329]
79 Figure 5 reports the test error as a function of the learning time, that is, the time after each iteration for batch algorithms and the time after each epoch for online algorithms. [sent-548, score-0.263]
80 For the algorithm based on the polar geometry, the mini-batch extension is strongly recommended to amortize the larger cost of each update. [sent-549, score-0.289]
81 For a large number of samples, online algorithms reduce the test error much more rapidly than batch ones. [sent-553, score-0.263]
82 Distance constraints are generated as ŷ_ij ≤ y_ij (1 − α) for identically labeled samples and ŷ_ij ≥ y_ij (1 + α) for differently labeled samples, where α ≥ 0 is a scaling factor, y_ij = ‖φ(x_i) − φ(x_j)‖^2 and ŷ_ij = Tr(W(e_i − e_j)(e_i − e_j)^T) = (e_i − e_j)^T W (e_i − e_j). [sent-562, score-0.324]
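A sketch of how such a pair can be evaluated: the learned distance uses X_ij = (e_i − e_j)(e_i − e_j)^T, and the constraint direction depends on whether the two samples share a label. The scaling α and the kernel-space distances y_ij are assumed to be given; the helper names are illustrative.

```python
import numpy as np

def model_distance(W, i, j):
    """y_hat_ij = (e_i - e_j)^T W (e_i - e_j): learned distance between samples i and j (i != j)."""
    e = np.zeros(W.shape[0])
    e[i], e[j] = 1.0, -1.0
    return float(e @ W @ e)

def constraint_target(y_ij, same_label, alpha):
    """Upper bound y_ij (1 - alpha) for same-label pairs, lower bound y_ij (1 + alpha) otherwise."""
    return y_ij * (1.0 - alpha) if same_label else y_ij * (1.0 + alpha)
```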
83 [Figure: classification accuracy (%) and clustering score (NMI) as a function of the number of constraints (0 to 1000), comparing batch flat geometry, batch polar geometry (λ = 0.5 and λ = 0), LogDet-KL, KSR, and the original kernel.]
86 The algorithm based on the polar geometry competes with LogDet-KL. [sent-595, score-0.433]
87 We compare the proposed batch methods with the LogDet-KL algorithm, the only competing algorithm that also learns directly from distance constraints. [sent-606, score-0.265]
88 In this full-rank learning setting, the algorithm based on the polar geometry competes with the LogDet-KL algorithm. [sent-611, score-0.433]
89 The convergence time of the algorithm based on the polar geometry is however much faster (0. [sent-612, score-0.433]
90 The algorithm based on the flat geometry has inferior performance when too few constraints are provided. [sent-614, score-0.221]
91 The figure shows that KSR, LogDet-KL and the algorithm based on the polar geometry with λ = 0 perform similarly. [sent-626, score-0.433]
92 These methods are however outperformed by the proposed algorithms (flat geometry and polar geometry with λ = 0.5). [sent-627, score-0.609]
93 This experiment also highlights the flexibility of the polar geometry, which allows us to fix the subspace in situations where too few constraints are available. [sent-629, score-0.428]
94 [Figure: legend — batch flat geometry, batch polar geometry, LogDet-KL, KSR, MVU, kernel PCA, and the original kernel.]
96 30 seconds for the algorithm based on the polar geometry. [sent-657, score-0.257]
97 [Figure: classification error (%) on UCI data sets (Wine, Ionosphere, Bal., ...) for batch flat geometry, batch polar geometry, ITML, LEGO, POLA, LMNN, and a Euclidean baseline.]
98 (2009), we demonstrate that the proposed batch algorithms compete with state-of-the-art full-rank Mahalanobis distance learning algorithms on several UCI data sets (Figure 8). [sent-704, score-0.265]
99 We have not included the online versions of our algorithms in this comparison because we consider that the batch approaches are more relevant on such small data sets. [sent-705, score-0.263]
100 The rich Riemannian geometry of the set of fixed-rank PSD matrices is exploited through a geometric optimization approach. [sent-723, score-0.218]
wordName wordTfidf (topN-words)
[('grad', 0.36), ('polar', 0.257), ('wt', 0.211), ('sym', 0.206), ('quotient', 0.201), ('batch', 0.18), ('geometry', 0.176), ('riemannian', 0.173), ('retraction', 0.164), ('epulchre', 0.144), ('eyer', 0.144), ('iemannian', 0.144), ('onnabel', 0.144), ('mahalanobis', 0.141), ('subspace', 0.126), ('yt', 0.125), ('tol', 0.123), ('ank', 0.122), ('atrices', 0.122), ('tangent', 0.114), ('gw', 0.113), ('st', 0.111), ('ixed', 0.11), ('psd', 0.108), ('ut', 0.103), ('ksr', 0.103), ('rt', 0.096), ('kulis', 0.094), ('pproach', 0.094), ('gt', 0.092), ('metric', 0.089), ('flat', 0.087), ('distance', 0.085), ('rd', 0.085), ('online', 0.083), ('absil', 0.082), ('egression', 0.076), ('itml', 0.072), ('lego', 0.072), ('lmnn', 0.072), ('logdet', 0.072), ('utt', 0.072), ('tr', 0.069), ('kernel', 0.067), ('manifold', 0.067), ('gradient', 0.064), ('bonnabel', 0.062), ('rwt', 0.062), ('qf', 0.061), ('gl', 0.061), ('semide', 0.06), ('davis', 0.057), ('pca', 0.056), ('xt', 0.054), ('manifolds', 0.054), ('weinberger', 0.054), ('nmi', 0.051), ('pola', 0.051), ('sepulchre', 0.051), ('constraints', 0.045), ('horizontal', 0.043), ('matrices', 0.042), ('tsuda', 0.042), ('yi', 0.041), ('geodesics', 0.041), ('closeness', 0.04), ('isolet', 0.039), ('bregman', 0.038), ('descent', 0.038), ('rank', 0.038), ('equivalence', 0.037), ('asuncion', 0.036), ('newman', 0.036), ('invariance', 0.036), ('tw', 0.036), ('update', 0.036), ('ggt', 0.035), ('geometries', 0.035), ('uo', 0.035), ('multidimensional', 0.034), ('divergences', 0.034), ('samples', 0.033), ('regression', 0.032), ('clustering', 0.032), ('invariant', 0.032), ('jain', 0.032), ('cost', 0.032), ('mvu', 0.031), ('meyer', 0.031), ('classification', 0.031), ('formulas', 0.031), ('af', 0.031), ('xi', 0.031), ('gyrb', 0.031), ('retractions', 0.031), ('smax', 0.031), ('twt', 0.031), ('wsym', 0.031), ('xxt', 0.031), ('gu', 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
Author: Gilles Meyer, Silvère Bonnabel, Rodolphe Sepulchre
Abstract: The paper addresses the problem of learning a regression model parameterized by a fixed-rank positive semidefinite matrix. The focus is on the nonlinear nature of the search space and on scalability to high-dimensional problems. The mathematical developments rely on the theory of gradient descent algorithms adapted to the Riemannian geometry that underlies the set of fixedrank positive semidefinite matrices. In contrast with previous contributions in the literature, no restrictions are imposed on the range space of the learned matrix. The resulting algorithms maintain a linear complexity in the problem size and enjoy important invariance properties. We apply the proposed algorithms to the problem of learning a distance function parameterized by a positive semidefinite matrix. Good performance is observed on classical benchmarks. Keywords: linear regression, positive semidefinite matrices, low-rank approximation, Riemannian geometry, gradient-based learning
2 0.18631899 87 jmlr-2011-Stochastic Methods forl1-regularized Loss Minimization
Author: Shai Shalev-Shwartz, Ambuj Tewari
Abstract: We describe and analyze two stochastic methods for ℓ1 regularized loss minimization problems, such as the Lasso. The first method updates the weight of a single feature at each iteration while the second method updates the entire weight vector but only uses a single training example at each iteration. In both methods, the choice of feature or example is uniformly at random. Our theoretical runtime analysis suggests that the stochastic methods should outperform state-of-the-art deterministic approaches, including their deterministic counterparts, when the size of the problem is large. We demonstrate the advantage of stochastic methods by experimenting with synthetic and natural data sets.1 Keywords: L1 regularization, optimization, coordinate descent, mirror descent, sparsity
3 0.14826575 36 jmlr-2011-Generalized TD Learning
Author: Tsuyoshi Ueno, Shin-ichi Maeda, Motoaki Kawanabe, Shin Ishii
Abstract: Since the invention of temporal difference (TD) learning (Sutton, 1988), many new algorithms for model-free policy evaluation have been proposed. Although they have brought much progress in practical applications of reinforcement learning (RL), there still remain fundamental problems concerning statistical properties of the value function estimation. To solve these problems, we introduce a new framework, semiparametric statistical inference, to model-free policy evaluation. This framework generalizes TD learning and its extensions, and allows us to investigate statistical properties of both of batch and online learning procedures for the value function estimation in a unified way in terms of estimating functions. Furthermore, based on this framework, we derive an optimal estimating function with the minimum asymptotic variance and propose batch and online learning algorithms which achieve the optimality. Keywords: reinforcement learning, model-free policy evaluation, TD learning, semiparametirc model, estimating function
4 0.1390972 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
Author: John Duchi, Elad Hazan, Yoram Singer
Abstract: We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. Metaphorically, the adaptation allows us to find needles in haystacks in the form of very predictive but rarely seen features. Our paradigm stems from recent advances in stochastic optimization and online learning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. We give several efficient algorithms for empirical risk minimization problems with common and important regularization functions and domain constraints. We experimentally study our theoretical analysis and show that adaptive subgradient methods outperform state-of-the-art, yet non-adaptive, subgradient algorithms. Keywords: subgradient methods, adaptivity, online learning, stochastic convex optimization
5 0.13232724 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation
Author: Ryota Tomioka, Taiji Suzuki, Masashi Sugiyama
Abstract: We analyze the convergence behaviour of a recently proposed algorithm for regularized estimation called Dual Augmented Lagrangian (DAL). Our analysis is based on a new interpretation of DAL as a proximal minimization algorithm. We theoretically show under some conditions that DAL converges super-linearly in a non-asymptotic and global sense. Due to a special modelling of sparse estimation problems in the context of machine learning, the assumptions we make are milder and more natural than those made in conventional analysis of augmented Lagrangian algorithms. In addition, the new interpretation enables us to generalize DAL to wide varieties of sparse estimation problems. We experimentally confirm our analysis in a large scale ℓ1 -regularized logistic regression problem and extensively compare the efficiency of DAL algorithm to previously proposed algorithms on both synthetic and benchmark data sets. Keywords: dual augmented Lagrangian, proximal minimization, global convergence, sparse estimation, convex optimization
6 0.089073956 55 jmlr-2011-Learning Multi-modal Similarity
7 0.076378822 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
8 0.068967715 14 jmlr-2011-Better Algorithms for Benign Bandits
9 0.065684125 3 jmlr-2011-A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis
10 0.063851871 60 jmlr-2011-Locally Defined Principal Curves and Surfaces
11 0.059801783 28 jmlr-2011-Double Updating Online Learning
12 0.051474869 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
13 0.049506269 56 jmlr-2011-Learning Transformation Models for Ranking and Survival Analysis
14 0.049182698 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
15 0.044235874 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms
16 0.042930093 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
17 0.041274782 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
18 0.040951792 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm
19 0.037315492 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes
20 0.036439948 6 jmlr-2011-A Simpler Approach to Matrix Completion
topicId topicWeight
[(0, 0.242), (1, 0.334), (2, -0.034), (3, -0.152), (4, -0.091), (5, -0.142), (6, -0.023), (7, -0.021), (8, -0.017), (9, -0.004), (10, 0.04), (11, 0.051), (12, 0.019), (13, -0.016), (14, -0.077), (15, -0.009), (16, 0.199), (17, 0.019), (18, -0.066), (19, -0.126), (20, 0.049), (21, 0.065), (22, -0.005), (23, -0.005), (24, 0.027), (25, 0.027), (26, -0.011), (27, -0.056), (28, -0.031), (29, 0.0), (30, 0.099), (31, 0.031), (32, -0.059), (33, -0.056), (34, -0.068), (35, -0.015), (36, 0.045), (37, -0.003), (38, -0.012), (39, -0.034), (40, 0.03), (41, -0.047), (42, 0.033), (43, 0.071), (44, -0.044), (45, 0.001), (46, -0.022), (47, -0.006), (48, 0.023), (49, -0.053)]
simIndex simValue paperId paperTitle
same-paper 1 0.94248533 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
Author: Gilles Meyer, Silvère Bonnabel, Rodolphe Sepulchre
Abstract: The paper addresses the problem of learning a regression model parameterized by a fixed-rank positive semidefinite matrix. The focus is on the nonlinear nature of the search space and on scalability to high-dimensional problems. The mathematical developments rely on the theory of gradient descent algorithms adapted to the Riemannian geometry that underlies the set of fixedrank positive semidefinite matrices. In contrast with previous contributions in the literature, no restrictions are imposed on the range space of the learned matrix. The resulting algorithms maintain a linear complexity in the problem size and enjoy important invariance properties. We apply the proposed algorithms to the problem of learning a distance function parameterized by a positive semidefinite matrix. Good performance is observed on classical benchmarks. Keywords: linear regression, positive semidefinite matrices, low-rank approximation, Riemannian geometry, gradient-based learning
2 0.70279527 36 jmlr-2011-Generalized TD Learning
Author: Tsuyoshi Ueno, Shin-ichi Maeda, Motoaki Kawanabe, Shin Ishii
Abstract: Since the invention of temporal difference (TD) learning (Sutton, 1988), many new algorithms for model-free policy evaluation have been proposed. Although they have brought much progress in practical applications of reinforcement learning (RL), there still remain fundamental problems concerning statistical properties of the value function estimation. To solve these problems, we introduce a new framework, semiparametric statistical inference, to model-free policy evaluation. This framework generalizes TD learning and its extensions, and allows us to investigate statistical properties of both of batch and online learning procedures for the value function estimation in a unified way in terms of estimating functions. Furthermore, based on this framework, we derive an optimal estimating function with the minimum asymptotic variance and propose batch and online learning algorithms which achieve the optimality. Keywords: reinforcement learning, model-free policy evaluation, TD learning, semiparametirc model, estimating function
3 0.66833538 87 jmlr-2011-Stochastic Methods forl1-regularized Loss Minimization
Author: Shai Shalev-Shwartz, Ambuj Tewari
Abstract: We describe and analyze two stochastic methods for ℓ1 regularized loss minimization problems, such as the Lasso. The first method updates the weight of a single feature at each iteration while the second method updates the entire weight vector but only uses a single training example at each iteration. In both methods, the choice of feature or example is uniformly at random. Our theoretical runtime analysis suggests that the stochastic methods should outperform state-of-the-art deterministic approaches, including their deterministic counterparts, when the size of the problem is large. We demonstrate the advantage of stochastic methods by experimenting with synthetic and natural data sets.1 Keywords: L1 regularization, optimization, coordinate descent, mirror descent, sparsity
4 0.62459671 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation
Author: Ryota Tomioka, Taiji Suzuki, Masashi Sugiyama
Abstract: We analyze the convergence behaviour of a recently proposed algorithm for regularized estimation called Dual Augmented Lagrangian (DAL). Our analysis is based on a new interpretation of DAL as a proximal minimization algorithm. We theoretically show under some conditions that DAL converges super-linearly in a non-asymptotic and global sense. Due to a special modelling of sparse estimation problems in the context of machine learning, the assumptions we make are milder and more natural than those made in conventional analysis of augmented Lagrangian algorithms. In addition, the new interpretation enables us to generalize DAL to wide varieties of sparse estimation problems. We experimentally confirm our analysis in a large scale ℓ1 -regularized logistic regression problem and extensively compare the efficiency of DAL algorithm to previously proposed algorithms on both synthetic and benchmark data sets. Keywords: dual augmented Lagrangian, proximal minimization, global convergence, sparse estimation, convex optimization
5 0.50603348 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
Author: Jinfeng Zhuang, Ivor W. Tsang, Steven C.H. Hoi
Abstract: Previous studies of Non-Parametric Kernel Learning (NPKL) usually formulate the learning task as a Semi-Definite Programming (SDP) problem that is often solved by some general purpose SDP solvers. However, for N data examples, the time complexity of NPKL using a standard interiorpoint SDP solver could be as high as O(N 6.5 ), which prohibits NPKL methods applicable to real applications, even for data sets of moderate size. In this paper, we present a family of efficient NPKL algorithms, termed “SimpleNPKL”, which can learn non-parametric kernels from a large set of pairwise constraints efficiently. In particular, we propose two efficient SimpleNPKL algorithms. One is SimpleNPKL algorithm with linear loss, which enjoys a closed-form solution that can be efficiently computed by the Lanczos sparse eigen decomposition technique. Another one is SimpleNPKL algorithm with other loss functions (including square hinge loss, hinge loss, square loss) that can be re-formulated as a saddle-point optimization problem, which can be further resolved by a fast iterative algorithm. In contrast to the previous NPKL approaches, our empirical results show that the proposed new technique, maintaining the same accuracy, is significantly more efficient and scalable. Finally, we also demonstrate that the proposed new technique is also applicable to speed up many kernel learning tasks, including colored maximum variance unfolding, minimum volume embedding, and structure preserving embedding. Keywords: non-parametric kernel learning, semi-definite programming, semi-supervised learning, side information, pairwise constraints
6 0.46392581 55 jmlr-2011-Learning Multi-modal Similarity
7 0.44927028 3 jmlr-2011-A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis
8 0.43349341 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
9 0.38907802 60 jmlr-2011-Locally Defined Principal Curves and Surfaces
10 0.37155202 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
11 0.34793514 28 jmlr-2011-Double Updating Online Learning
12 0.31094787 56 jmlr-2011-Learning Transformation Models for Ranking and Survival Analysis
13 0.29275802 14 jmlr-2011-Better Algorithms for Benign Bandits
14 0.2776866 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm
15 0.2661199 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms
16 0.26493254 6 jmlr-2011-A Simpler Approach to Matrix Completion
17 0.23979324 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood
18 0.23725623 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
19 0.2307947 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
20 0.22734815 48 jmlr-2011-Kernel Analysis of Deep Networks
topicId topicWeight
[(4, 0.036), (9, 0.03), (10, 0.023), (11, 0.449), (24, 0.044), (31, 0.082), (32, 0.03), (41, 0.019), (60, 0.015), (65, 0.011), (70, 0.022), (73, 0.057), (78, 0.079), (90, 0.015)]
simIndex simValue paperId paperTitle
1 0.86202431 41 jmlr-2011-Improved Moves for Truncated Convex Models
Author: M. Pawan Kumar, Olga Veksler, Philip H.S. Torr
Abstract: We consider the problem of obtaining an approximate maximum a posteriori estimate of a discrete random field characterized by pairwise potentials that form a truncated convex model. For this problem, we propose two st-MINCUT based move making algorithms that we call Range Swap and Range Expansion. Our algorithms can be thought of as extensions of αβ-Swap and α-Expansion respectively that fully exploit the form of the pairwise potentials. Specifically, instead of dealing with one or two labels at each iteration, our methods explore a large search space by considering a range of labels (that is, an interval of consecutive labels). Furthermore, we show that Range Expansion provides the same multiplicative bounds as the standard linear programming (LP) relaxation in polynomial time. Compared to previous approaches based on the LP relaxation, for example interior-point algorithms or tree-reweighted message passing (TRW), our methods are faster as they use only the efficient st-MINCUT algorithm in their design. We demonstrate the usefulness of the proposed approaches on both synthetic and standard real data problems. Keywords: truncated convex models, move making algorithms, range moves, multiplicative bounds, linear programming relaxation
same-paper 2 0.74686688 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
Author: Gilles Meyer, Silvère Bonnabel, Rodolphe Sepulchre
Abstract: The paper addresses the problem of learning a regression model parameterized by a fixed-rank positive semidefinite matrix. The focus is on the nonlinear nature of the search space and on scalability to high-dimensional problems. The mathematical developments rely on the theory of gradient descent algorithms adapted to the Riemannian geometry that underlies the set of fixedrank positive semidefinite matrices. In contrast with previous contributions in the literature, no restrictions are imposed on the range space of the learned matrix. The resulting algorithms maintain a linear complexity in the problem size and enjoy important invariance properties. We apply the proposed algorithms to the problem of learning a distance function parameterized by a positive semidefinite matrix. Good performance is observed on classical benchmarks. Keywords: linear regression, positive semidefinite matrices, low-rank approximation, Riemannian geometry, gradient-based learning
3 0.3231225 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
Author: Jinfeng Zhuang, Ivor W. Tsang, Steven C.H. Hoi
Abstract: Previous studies of Non-Parametric Kernel Learning (NPKL) usually formulate the learning task as a Semi-Definite Programming (SDP) problem that is often solved by some general purpose SDP solvers. However, for N data examples, the time complexity of NPKL using a standard interiorpoint SDP solver could be as high as O(N 6.5 ), which prohibits NPKL methods applicable to real applications, even for data sets of moderate size. In this paper, we present a family of efficient NPKL algorithms, termed “SimpleNPKL”, which can learn non-parametric kernels from a large set of pairwise constraints efficiently. In particular, we propose two efficient SimpleNPKL algorithms. One is SimpleNPKL algorithm with linear loss, which enjoys a closed-form solution that can be efficiently computed by the Lanczos sparse eigen decomposition technique. Another one is SimpleNPKL algorithm with other loss functions (including square hinge loss, hinge loss, square loss) that can be re-formulated as a saddle-point optimization problem, which can be further resolved by a fast iterative algorithm. In contrast to the previous NPKL approaches, our empirical results show that the proposed new technique, maintaining the same accuracy, is significantly more efficient and scalable. Finally, we also demonstrate that the proposed new technique is also applicable to speed up many kernel learning tasks, including colored maximum variance unfolding, minimum volume embedding, and structure preserving embedding. Keywords: non-parametric kernel learning, semi-definite programming, semi-supervised learning, side information, pairwise constraints
4 0.29321888 91 jmlr-2011-The Sample Complexity of Dictionary Learning
Author: Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein
Abstract: A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L2 error in representation when the dictionary is used. For the case of l1 regularized coefficient selection we provide a generalnp ln(mλ)/m , where n is the dimension, p is the number of ization bound of the order of O elements in the dictionary, λ is a bound on the l1 norm of the coefficient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order O( np ln(mk)/m) under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as 1/m, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements. Keywords: dictionary learning, generalization bound, sparse representation
5 0.29156131 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
Author: Mauricio A. Álvarez, Neil D. Lawrence
Abstract: Recently there has been an increasing interest in regression methods that deal with multiple outputs. This has been motivated partly by frameworks like multitask learning, multisensor networks or structured output data. From a Gaussian processes perspective, the problem reduces to specifying an appropriate covariance function that, whilst being positive semi-definite, captures the dependencies between all the data points and across all the outputs. One approach to account for non-trivial correlations between outputs employs convolution processes. Under a latent function interpretation of the convolution transform we establish dependencies between output variables. The main drawbacks of this approach are the associated computational and storage demands. In this paper we address these issues. We present different efficient approximations for dependent output Gaussian processes constructed through the convolution formalism. We exploit the conditional independencies present naturally in the model. This leads to a form of the covariance similar in spirit to the so called PITC and FITC approximations for a single output. We show experimental results with synthetic and real data, in particular, we show results in school exams score prediction, pollution prediction and gene expression data. Keywords: Gaussian processes, convolution processes, efficient approximations, multitask learning, structured outputs, multivariate processes
6 0.29154843 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
7 0.29023892 12 jmlr-2011-Bayesian Co-Training
8 0.28992495 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
9 0.2887527 35 jmlr-2011-Forest Density Estimation
10 0.28780675 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood
11 0.28641391 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
12 0.28460467 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
13 0.28245607 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
14 0.28204697 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
15 0.28193218 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
16 0.28138229 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
17 0.28082141 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
18 0.28067055 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
19 0.28064641 16 jmlr-2011-Clustering Algorithms for Chains
20 0.2805073 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity