jmlr jmlr2011 jmlr2011-80 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Gilles Meyer, Silvère Bonnabel, Rodolphe Sepulchre
Abstract: The paper addresses the problem of learning a regression model parameterized by a fixed-rank positive semidefinite matrix. The focus is on the nonlinear nature of the search space and on scalability to high-dimensional problems. The mathematical developments rely on the theory of gradient descent algorithms adapted to the Riemannian geometry that underlies the set of fixedrank positive semidefinite matrices. In contrast with previous contributions in the literature, no restrictions are imposed on the range space of the learned matrix. The resulting algorithms maintain a linear complexity in the problem size and enjoy important invariance properties. We apply the proposed algorithms to the problem of learning a distance function parameterized by a positive semidefinite matrix. Good performance is observed on classical benchmarks. Keywords: linear regression, positive semidefinite matrices, low-rank approximation, Riemannian geometry, gradient-based learning
Reference: text
sentIndex sentText sentNum sentScore
1 The mathematical developments rely on the theory of gradient descent algorithms adapted to the Riemannian geometry that underlies the set of fixed-rank positive semidefinite matrices. [sent-11, score-0.278]
2 Learning a parametric model in S+ (r, d) amounts to jointly learning an r-dimensional subspace and a quadratic distance in this subspace. [sent-24, score-0.211]
3 Our approach makes use of two quotient geometries of the set S+ (r, d) that have been recently studied by Journée et al. [sent-29, score-0.236]
4 As it is generally not possible to compute F(W) explicitly, batch learning algorithms instead minimize the empirical cost f_n(W) = (1/(2n)) Σ_{i=1}^{n} (ŷ_i − y_i)^2, (1) which is the average loss computed over a finite number of samples {(X_i, y_i)}_{i=1}^{n}. [sent-68, score-0.327]
5 At time t, the online learning algorithm minimizes the instantaneous cost f_t(W) = (1/2)(ŷ_t − y_t)^2. [sent-70, score-0.24]
6 The single necessary change to convert an online algorithm into its batch counterpart is to perform, at each iteration, the minimization of the empirical cost fn instead of the minimization of the instantaneous cost ft . [sent-72, score-0.327]
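As a concrete illustration of the two costs above, the following sketch evaluates the empirical cost (1) and the instantaneous cost; the trace-form prediction ŷ = Tr(W Sym(X)) and the helper names are illustrative assumptions, not code taken from the paper.

```python
import numpy as np

def sym(X):
    """Symmetric part of a square matrix."""
    return 0.5 * (X + X.T)

def predict(W, X):
    """Linear model: y_hat = Tr(W Sym(X))."""
    return np.trace(W @ sym(X))

def instantaneous_cost(W, X_t, y_t):
    """Online cost f_t(W) = 0.5 * (y_hat_t - y_t)**2."""
    return 0.5 * (predict(W, X_t) - y_t) ** 2

def empirical_cost(W, Xs, ys):
    """Batch cost f_n(W) = (1/(2n)) * sum_i (y_hat_i - y_i)**2."""
    residuals = np.array([predict(W, X) - y for X, y in zip(Xs, ys)])
    return 0.5 * float(np.mean(residuals ** 2))
```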
7 (2008), an abstract gradient descent algorithm can then be derived based on the update formula W_{t+1} = R_{W_t}(−s_t grad f(W_t)). (2) [sent-77, score-0.498]
8 The gradient grad f(W_t) is an element of the tangent space T_{W_t}W. [sent-78, score-0.538]
9 The retraction R_{W_t} is a mapping from the tangent space T_{W_t}W to the Riemannian manifold. [sent-80, score-0.278]
10 Under mild conditions on the retraction R, the classical convergence theory of line-search algorithms in linear spaces generalizes to Riemannian manifolds (see Absil et al., 2008). [sent-81, score-0.218]
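The abstract update (2) only needs two ingredients: a Riemannian gradient and a retraction. A minimal sketch with both passed in as callables is shown below; the function names are chosen for this illustration and the Euclidean special case recovers the standard update (3).

```python
def riemannian_gradient_step(W, grad_f, retract, step):
    """One step of the abstract update (2): W_next = R_W(-step * grad f(W))."""
    xi = -step * grad_f(W)   # tangent vector at W (steepest-descent direction)
    return retract(W, xi)    # retraction maps the tangent vector back onto the manifold

# In a flat (Euclidean) search space the retraction is simply W + xi, which
# recovers the standard linear-regression update (3).
euclidean_retraction = lambda W, xi: W + xi
```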
11 Observe that the standard (online) learning algorithm for linear regression in R^d, w_{t+1} = w_t − s_t (w_t^T x_t − y_t) x_t, (3) can be interpreted as a particular case of (2) for the linear model ŷ = w^T x in the linear search space R^d. [sent-83, score-0.533]
12 This example illustrates that the main ingredients to obtain a concrete algorithm are convenient formulas for the gradient and for the retraction mapping. [sent-87, score-0.259]
13 Each of those sets will be equipped with quotient Riemannian geometries that provide convenient formulas for the gradient and for the retractions. [sent-89, score-0.331]
14 However, when the search space is endowed with a particular manifold structure, it is possible to design an exploration strategy that is consistent with the geometry of the problem and that appropriately turns the problem into an unconstrained optimization problem. [sent-97, score-0.243]
15 One usually makes the distinction between embedded submanifolds (subsets of larger manifolds) and quotient manifolds (manifolds described by a set of equivalence classes). [sent-100, score-0.292]
16 A typical example of a quotient manifold is the set of r-dimensional subspaces in R^d, viewed as a collection of r-dimensional orthogonal frames that cannot be superposed by a rotation. [sent-102, score-0.268]
17 The rotational variants of a given frame thus define an equivalence class (denoted using square brackets [·]), which is identified as a single point on the quotient manifold. [sent-103, score-0.238]
18 The updated point Wt+1 automatically remains inside the manifold thanks to the retraction mapping. [sent-107, score-0.231]
19 For quotient manifolds W = W̄/∼, where W̄ is the total space and ∼ is the equivalence relation that defines the quotient, the tangent space T_{[W]}W at [W] is sufficiently described by the directions that do not induce any displacement in the set of equivalence classes [W]. [sent-111, score-0.443]
20 Provided that the metric ḡ_W in the total space is invariant along the equivalence classes, it defines a metric in the quotient space, g_{[W]}(ξ_{[W]}, ζ_{[W]}) ≜ ḡ_W(ξ̄_W, ζ̄_W). [sent-113, score-0.448]
21 The horizontal gradient grad f(W) is obtained by projecting the gradient in the total space onto the set of horizontal vectors ξ̄_W at W. [sent-114, score-0.934]
22 Natural displacements at W in a direction ξW on the manifold are performed by following geodesics (paths of shortest length on the manifold) starting from W and tangent to ξW . [sent-115, score-0.222]
23 It is a quotient representation of the set of r-dimensional subspaces in Rd , that is, the Grassmann manifold Gr(r, d). [sent-132, score-0.268]
24 The quotient geometries of Gr(r, d) have been well studied (Edelman et al., 1998). [sent-133, score-0.236]
25 The metric g_{[U]}(ξ_{[U]}, ζ_{[U]}) ≜ ḡ_U(ξ̄_U, ζ̄_U) is induced by the standard metric in R^{d×r}, ḡ_U(Δ_1, Δ_2) = Tr(Δ_1^T Δ_2), which is invariant along the fibers, that is, the equivalence classes. [sent-136, score-0.247]
26 Therefore, the gradient admits the simple horizontal representation grad f(U) = Π_U grad f̄(U), (5) where grad f̄(U) is defined by the identity D f(U)[Δ] = ḡ_U(Δ, grad f̄(U)). [sent-138, score-1.547]
27 A possible advantage of the retraction (7) over the retraction (6) is that, in contrast to the SVD computation, the QR decomposition is computed in a fixed number O(dr^2) of arithmetic operations. [sent-143, score-0.328]
28 With the formulas (5) and (7) applied to the cost function (4), the abstract update (2) becomes U_{t+1} = qf(U_t + s_t (I − U_t U_t^T) x_t x_t^T U_t), which is Oja's update for subspace tracking (Oja, 1992). [sent-144, score-0.372]
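A sketch of this subspace-tracking iteration, with qf(·) implemented as the Q factor of a thin QR decomposition; the variable names are illustrative.

```python
import numpy as np

def qf(A):
    """Q factor of the thin QR decomposition of A (sign-fixed so that diag(R) >= 0)."""
    Q, R = np.linalg.qr(A)
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs

def oja_step(U, x_t, s_t):
    """Subspace-tracking step: U <- qf(U + s_t (I - U U^T) x_t x_t^T U)."""
    proj_x = x_t - U @ (U.T @ x_t)            # (I - U U^T) x_t
    return qf(U + s_t * np.outer(proj_x, x_t @ U))
```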
29 The quotient geometries of S+ (d) are rooted in the matrix factorization W = GG^T, G ∈ GL(d), where GL(d) is the set of all invertible d × d matrices. [sent-153, score-0.236]
30 Because the factorization is invariant by rotation, G → GO, O ∈ O(d), the search space is once again identified with the quotient S+ (d) ≃ GL(d)/O(d), which represents the set of equivalence classes [G] = {GO s.t. O ∈ O(d)}. [sent-154, score-0.27]
31 A flat metric on S+ (d): the metric on the quotient GL(d)/O(d), g_{[G]}(ξ_{[G]}, ζ_{[G]}) ≜ ḡ_G(ξ̄_G, ζ̄_G), is induced by the standard metric in R^{d×d}, ḡ_G(Δ_1, Δ_2) = Tr(Δ_1^T Δ_2), which is invariant by rotation along the set of equivalence classes. [sent-159, score-0.448]
32 With this geometry, a tangent vector ξ_{[G]} at [G] is represented by a horizontal tangent vector ξ̄_G at G given by ξ̄_G = Sym(Δ)G, Δ ∈ R^{d×d}. [sent-161, score-0.271]
33 The horizontal gradient of f(G) = ℓ(ŷ, y) = (1/2)(Tr(GG^T Sym(X)) − y)^2, (8) is the unique horizontal vector grad f(G) that satisfies D f(G)[Δ] = ḡ_G(Δ, grad f(G)). [sent-162, score-0.87]
34 Elementary computations yield grad f(G) = 2(ŷ − y) Sym(X) G. [sent-163, score-0.36]
35 Those formulas applied to the cost (8) turn the abstract update (2) into the simple formula G_{t+1} = G_t − 2 s_t (ŷ_t − y_t) Sym(X_t) G_t, (9) for an online gradient algorithm, and G_{t+1} = G_t − 2 s_t (1/n) Σ_{i=1}^{n} (ŷ_i − y_i) Sym(X_i) G_t, (10) for a batch gradient algorithm. [sent-165, score-0.656]
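A sketch of the online update (9) and batch update (10) on the factor G; it assumes the trace model ŷ = Tr(GG^T Sym(X)) of cost (8), with helper names chosen for this illustration only.

```python
import numpy as np

def sym(X):
    return 0.5 * (X + X.T)

def flat_online_step(G, X_t, y_t, s_t):
    """Online update (9): G <- G - 2 s_t (y_hat_t - y_t) Sym(X_t) G."""
    y_hat = np.trace(G @ G.T @ sym(X_t))
    return G - 2.0 * s_t * (y_hat - y_t) * (sym(X_t) @ G)

def flat_batch_step(G, Xs, ys, s_t):
    """Batch update (10): the same direction averaged over the n training samples."""
    direction = np.zeros_like(G)
    for X_i, y_i in zip(Xs, ys):
        y_hat = np.trace(G @ G.T @ sym(X_i))
        direction += (y_hat - y_i) * (sym(X_i) @ G)
    return G - 2.0 * s_t * direction / len(ys)
```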
36 The affine-invariant geometry of S+ (d) has been well studied, in particular in the context of information geometry (Smith, 2005). [sent-172, score-0.352]
37 The gradient grad f(W) is given by D f(W)[Δ] = g_W(Δ, grad f(W)). [sent-178, score-0.784]
38 Applying this formula to (5) yields grad f(W) = (ŷ − y) W Sym(X) W. [sent-179, score-0.36]
39 The formulas (12) and (13) applied to the cost (5) turn the abstract update (2) into W_{t+1} = W_t^{1/2} exp(−s_t (ŷ_t − y_t) W_t^{1/2} Sym(X_t) W_t^{1/2}) W_t^{1/2}. [sent-182, score-0.224]
40 With the alternative retraction (14), the update becomes W_{t+1} = W_t − s_t (ŷ_t − y_t) W_t Sym(X_t) W_t, which is the update of Davis et al. (2007). [sent-183, score-0.472]
41 The gradient of this cost function is given by grad f(S) = (ŷ_t − y_t) Sym(X_t), and the retraction is R_S(s ξ_S) = exp(log W + s ξ_S). [sent-192, score-0.745]
42 The corresponding gradient descent update is W_{t+1} = exp(log W_t − s_t (ŷ_t − y_t) Sym(X_t)), which is the update of Tsuda et al. (2005). [sent-193, score-0.41]
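A sketch of this log-Euclidean update using the matrix exponential and logarithm from SciPy; it is a direct transcription of the displayed formula, not the authors' implementation, and it assumes W_t is strictly positive definite so that log W_t exists.

```python
import numpy as np
from scipy.linalg import expm, logm

def sym(X):
    return 0.5 * (X + X.T)

def log_euclidean_step(W_t, X_t, y_t, s_t):
    """W_{t+1} = exp(log W_t - s_t (y_hat_t - y_t) Sym(X_t)), W_t positive definite."""
    y_hat = np.trace(W_t @ sym(X_t))
    S = logm(W_t).real - s_t * (y_hat - y_t) * sym(X_t)
    return expm(sym(S))   # re-symmetrize to absorb numerical round-off
```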
43 Indeed, the flat quotient geometry of the manifold S+ (d) ≃ GL(d)/O(d) is generalized to the quotient geometry of S+ (r, d) ≃ R_*^{d×r}/O(r) by a mere adaptation of matrix dimension, leading to the updates (9) and (10) for matrices G_t ∈ R_*^{d×r}. [sent-202, score-0.863]
44 (2010), where the quotient geometry of S+ (r, d) ≃ R_*^{d×r}/O(r) is studied in detail. [sent-204, score-0.377]
45 In the next section, we propose an alternative geometry that jointly learns an r-dimensional subspace and a full-rank quadratic model in this subspace. [sent-205, score-0.302]
46 However, a generalization is possible by considering the polar matrix factorization G = UR, U ∈ St(r, d), R ∈ S+ (r). [sent-208, score-0.257]
47 This gives a polar parameterization of S+ (r, d): W = U R^2 U^T. [sent-210, score-0.257]
48 This development leads to the quotient representation S+ (r, d) ≃ (St(r, d) × S+ (r))/O(r), (15) based on the invariance of W to the transformation (U, R^2) → (UO, O^T R^2 O), O ∈ O(r). [sent-211, score-0.237]
49 A tangent vector ξ_{[W]} = (ξ_U, ξ_{R^2})_{[U,R^2]} at [(U, R^2)] is described by a horizontal tangent vector ξ̄_W = (ξ̄_U, ξ̄_{R^2})_{(U,R^2)} at (U, R^2) given by ξ̄_U = Π_U Δ, Δ ∈ R^{d×r}, and ξ̄_{R^2} = R Sym(Ψ) R, Ψ ∈ R^{r×r}. [sent-217, score-0.271]
50 The proposed metric is invariant along the set of equivalence classes and thus induces a quotient structure on S+ (r, d). [sent-219, score-0.359]
51 Combining the gradient of (16) with the retractions (18) and (19) gives U_{t+1} = qf(U_t − 2λ s_t (ŷ_t − y_t)(I − U_t U_t^T) Sym(X_t) U_t R_t^2) and R_{t+1}^2 = R_t exp(−(1 − λ) s_t (ŷ_t − y_t) R_t U_t^T Sym(X_t) U_t R_t) R_t. [sent-228, score-0.406]
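A sketch of this coupled update of U_t and R_t^2, with qf(·) taken as the Q factor of a thin QR decomposition, the matrix exponential from SciPy, and λ ∈ [0, 1] weighting subspace versus distance learning. The names and the recovery of R_{t+1} via a symmetric square root are illustrative choices, not the paper's code.

```python
import numpy as np
from scipy.linalg import expm, sqrtm

def polar_step(U, R, X_t, y_t, s_t, lam):
    """One online step in the polar parameterization W = U R^2 U^T (0 <= lam <= 1)."""
    SX = 0.5 * (X_t + X_t.T)                      # Sym(X_t)
    y_hat = np.trace(U @ R @ R @ U.T @ SX)        # prediction Tr(W Sym(X_t))
    err = y_hat - y_t

    # U <- qf(U - 2 lam s_t err (I - U U^T) Sym(X_t) U R^2); qf = Q factor of thin QR
    horiz = (SX @ U - U @ (U.T @ (SX @ U))) @ (R @ R)
    U_next, _ = np.linalg.qr(U - 2.0 * lam * s_t * err * horiz)

    # R^2 <- R exp(-(1 - lam) s_t err R U^T Sym(X_t) U R) R, then recover R
    M = R @ (U.T @ (SX @ U)) @ R
    R2_next = R @ expm(-(1.0 - lam) * s_t * err * M) @ R
    R_next = sqrtm(0.5 * (R2_next + R2_next.T)).real

    return U_next, R_next
```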
52 A proper tuning of this parameter allows us to place more emphasis either on the learning of the subspace U or on the distance R^2 in that subspace. [sent-238, score-0.337]
53 Intermediate values of λ continuously interpolate between the subspace learning problem and the distance learning problem at fixed range space. [sent-242, score-0.211]
54 A proper tuning of λ is of interest when a good estimate of the subspace is available (for instance a subspace given by a proper dimension reduction technique) or when too few observations are available to jointly estimate the subspace and the distance within that subspace. [sent-243, score-0.463]
55 In the latter case, one has the choice to favor either subspace or distance learning. [sent-244, score-0.211]
56 We here present strategies for choosing the step size in both the batch and online cases. [sent-263, score-0.263]
57 In this paper, we use the Armijo step s_A defined at each iteration by the condition f(R_{W_t}(−s_A grad f(W_t))) ≤ f(W_t) − c s_A ‖grad f(W_t)‖^2_{W_t}, (22) where W_t ∈ S+ (r, d) is the current iterate, c ∈ (0, 1), f is the empirical cost (1) and R_W is the chosen retraction. [sent-267, score-0.752]
58 In order to reduce the dependence on s_max in a particular problem, it is chosen inversely proportional to the norm of the gradient at each iteration, s_max = s_0 / ‖grad f(W_t)‖. [sent-270, score-0.486]
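A backtracking sketch of the Armijo selection (22), capped by s_max chosen inversely proportional to the gradient norm as described above; the halving factor, the loop bound, and the Euclidean norm used in place of the Riemannian metric are illustrative assumptions.

```python
import numpy as np

def armijo_step(W, f, grad_f, retract, s0=1.0, c=1e-4, max_halvings=30):
    """Backtracking search for a step satisfying the Armijo condition (22)."""
    g = grad_f(W)
    g_norm2 = float(np.sum(g * g))     # squared gradient norm (Euclidean proxy for the metric)
    s = s0 / np.sqrt(g_norm2)          # cap s_max inversely proportional to the gradient norm
    f_W = f(W)
    for _ in range(max_halvings):
        if f(retract(W, -s * g)) <= f_W - c * s * g_norm2:
            return s
        s *= 0.5
    return s
```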
59 In this paper, the step size schedule s_t is chosen as s_t = (s/μ̂_grad) × (n t_0)/(n t_0 + t), (23) where s > 0, n is the number of considered learning samples, μ̂_grad is an estimate of the average gradient norm ‖grad f(W_0)‖_{W_0}, and t_0 > 0 controls the annealing rate of s_t. [sent-275, score-0.757]
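A direct transcription of schedule (23); how mu_grad_hat is estimated (for instance from the gradient norm at W_0 on a few initial samples) is left to the caller and is an assumption of this sketch.

```python
def annealed_step_size(t, s, mu_grad_hat, n, t0):
    """Schedule (23): s_t = (s / mu_grad_hat) * (n * t0) / (n * t0 + t)."""
    return (s / mu_grad_hat) * (n * t0) / (n * t0 + t)
```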
60 For batch algorithms, the local convergence results follow from the convergence theory of line-search algorithms on Riemannian manifolds (see, for example, Absil et al., 2008). [sent-295, score-0.234]
61 For online algorithms, one can prove that the algorithm based on the flat geometry enjoys almost sure asymptotic convergence to a local minimum of the expected cost. [sent-297, score-0.259]
62 In contrast, when the polar parameterization is used, the convergence results presented by Bottou (1998) do not apply directly because of the quotient nature of the search space. [sent-299, score-0.458]
63 Because the extension would require technical arguments beyond the scope of the present paper, we refrain from stating a formal convergence result for the online algorithm based on the polar geometry, even though the result is quite plausible. [sent-300, score-0.34]
64 One solves (25) by solving the algebraic equation grad D(W, W_t) = −s_t grad ℓ(ŷ_{t+1}, y_{t+1}), (26) which is a first-order (necessary) optimality condition. [sent-313, score-0.845]
65 If the search space W is a Riemannian manifold and if the closeness measure D(W, W_t) is the Riemannian distance, the solution of (26) is W_{t+1} = Exp_{W_t}(−s_t grad ℓ(ŷ_{t+1}, y_{t+1})). [sent-314, score-0.592]
66 However, ŷ_{t+1} is generally replaced by ŷ_t (which is equal to ŷ_{t+1} up to first order terms in s_t), which gives the update (2) where the exponential mapping is chosen as a retraction. [sent-316, score-0.272]
67 We have shown in Section 5 that the resulting updates can be interpreted as line-search updates for the log-Euclidean metric and the affine-invariant metric of S+ (d) and for specific choices of the retraction mapping. [sent-319, score-0.342]
68 where the θ_i's are the principal angles between the subspaces spanned by W and W_t (Golub and Van Loan, 1996), and the second term is the affine-invariant distance of S+ (d) between the matrices R^2 and R_t^2 involved in the polar representation of W and W_t. [sent-322, score-0.384]
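The closeness measure itself is not reproduced in this extract; the sketch below only illustrates its two ingredients as described above: principal angles between the ranges of W and W_t (via the singular values of U^T U_t) and the affine-invariant distance between R^2 and R_t^2. How the paper weights and combines the two terms is not shown here.

```python
import numpy as np
from scipy.linalg import logm, sqrtm

def principal_angles(U1, U2):
    """Principal angles between the subspaces spanned by orthonormal bases U1, U2."""
    s = np.linalg.svd(U1.T @ U2, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))

def affine_invariant_distance(A, B):
    """d(A, B) = || logm(A^{-1/2} B A^{-1/2}) ||_F for symmetric positive definite A, B."""
    A_inv_sqrt = np.linalg.inv(sqrtm(A).real)
    M = A_inv_sqrt @ B @ A_inv_sqrt
    return float(np.linalg.norm(logm(0.5 * (M + M.T)).real, 'fro'))
```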
69 A potential area of future work is the application of the proposed online algorithm (9) for adapting a batch solution to slight modifications of the dissimilarities over time. [sent-346, score-0.263]
70 Overall, the experiments support the conclusion that jointly estimating a subspace and a low-dimensional distance in that subspace is a major advantage of the proposed algorithms over methods that estimate the matrix for a subspace that is fixed beforehand. [sent-413, score-0.463]
71 Recent methods compute that subspace of reduced dimension using principal component analysis (Davis and Dhillon, 2008; Weinberger and Saul, 2009), that is, a subspace that captures a maximal amount of variance in the data. [sent-445, score-0.252]
72 However, in general, there is no reason why the subspace spanned by the top principal components should coincide with the subspace that is defined by the target model. [sent-446, score-0.252]
73 Therefore, a more appropriate approach consists in jointly learning the subspace and a distance within that subspace that best fit the observations. [sent-447, score-0.337]
74 The top plot shows that the learned subspace (which coincides with the target subspace) is indeed very different from the subspace spanned by the top two principal components. [sent-453, score-0.252]
75 [Figure: three-dimensional synthetic data plotted along axes x1, x2, x3, together with the target model subspace, the PCA subspace, and the learned subspace.]
76 Top: the learned subspace is very different from the subspace computed from a classical heuristic. [sent-502, score-0.252]
77 subspace and the distance in that subspace are learned jointly. [sent-505, score-0.337]
78 This experiment shows that when a large number of samples is available (80,000 training samples and 20,000 test samples for learning a parameter W* in S+ (10, 50)), online algorithms minimize the test error more rapidly than batch ones. [sent-545, score-0.329]
79 Figure 5 reports the test error as a function of the learning time, that is, the time after each iteration for batch algorithms and the time after each epoch for online algorithms. [sent-548, score-0.263]
80 For the algorithm based on the polar geometry, the mini-batch extension is strongly recommended to amortize the larger cost of each update. [sent-549, score-0.289]
81 For a large number of samples, online algorithms reduce the test error much more rapidly than batch ones. [sent-553, score-0.263]
82 Distance constraints are generated as ŷ_ij ≤ y_ij (1 − α) for identically labeled samples and ŷ_ij ≥ y_ij (1 + α) for differently labeled samples, where α ≥ 0 is a scaling factor, y_ij = ‖φ(x_i) − φ(x_j)‖^2 and ŷ_ij = Tr(W(e_i − e_j)(e_i − e_j)^T) = (e_i − e_j)^T W (e_i − e_j). [sent-562, score-0.324]
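A sketch of how such a pair can be evaluated: the learned distance uses X_ij = (e_i − e_j)(e_i − e_j)^T, and the constraint direction depends on whether the two samples share a label. The scaling α and the kernel-space distances y_ij are assumed to be given; the helper names are illustrative.

```python
import numpy as np

def model_distance(W, i, j):
    """y_hat_ij = (e_i - e_j)^T W (e_i - e_j): learned distance between samples i and j (i != j)."""
    e = np.zeros(W.shape[0])
    e[i], e[j] = 1.0, -1.0
    return float(e @ W @ e)

def constraint_target(y_ij, same_label, alpha):
    """Upper bound y_ij (1 - alpha) for same-label pairs, lower bound y_ij (1 + alpha) otherwise."""
    return y_ij * (1.0 - alpha) if same_label else y_ij * (1.0 + alpha)
```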
83 [Figure: classification accuracy (%) and clustering score (NMI) as a function of the number of constraints (0 to 1000), comparing batch flat geometry, batch polar geometry (λ = 0.5 and λ = 0), LogDet-KL, KSR, and the original kernel.]
86 The algorithm based on the polar geometry competes with LogDet-KL. [sent-595, score-0.433]
87 We compare the proposed batch methods with the LogDet-KL algorithm, the only competing algorithm that also learns directly from distance constraints. [sent-606, score-0.265]
88 In this full-rank learning setting, the algorithm based on the polar geometry competes with the LogDet-KL algorithm. [sent-611, score-0.433]
89 The convergence time of the algorithm based on the polar geometry is however much faster (0. [sent-612, score-0.433]
90 The algorithm based on the flat geometry has inferior performance when too few constraints are provided. [sent-614, score-0.221]
91 The figure shows that KSR, LogDet-KL and the algorithm based on the polar geometry with λ = 0 perform similarly. [sent-626, score-0.433]
92 These methods are however outperformed by the proposed algorithms (flat geometry and polar geometry with λ = 0.5). [sent-627, score-0.609]
93 This experiment also highlights the flexibility of the polar geometry, which allows us to fix the subspace in situations where too few constraints are available. [sent-629, score-0.428]
94 [Figure: legend — batch flat geometry, batch polar geometry, LogDet-KL, KSR, MVU, kernel PCA, and the original kernel.]
96 30 seconds for the algorithm based on the polar geometry. [sent-657, score-0.257]
97 [Figure: classification error (%) on UCI data sets (Wine, Ionosphere, Bal., ...) for batch flat geometry, batch polar geometry, ITML, LEGO, POLA, LMNN, and a Euclidean baseline.]
98 (2009), we demonstrate that the proposed batch algorithms compete with state-of-the-art full-rank Mahalanobis distance learning algorithms on several UCI data sets (Figure 8). [sent-704, score-0.265]
99 We have not included the online versions of our algorithms in this comparison because we consider that the batch approaches are more relevant on such small data sets. [sent-705, score-0.263]
100 The rich Riemannian geometry of the set of fixed-rank PSD matrices is exploited through a geometric optimization approach. [sent-723, score-0.218]
wordName wordTfidf (topN-words)
[('grad', 0.36), ('polar', 0.257), ('wt', 0.211), ('sym', 0.206), ('quotient', 0.201), ('batch', 0.18), ('geometry', 0.176), ('riemannian', 0.173), ('retraction', 0.164), ('epulchre', 0.144), ('eyer', 0.144), ('iemannian', 0.144), ('onnabel', 0.144), ('mahalanobis', 0.141), ('subspace', 0.126), ('yt', 0.125), ('tol', 0.123), ('ank', 0.122), ('atrices', 0.122), ('tangent', 0.114), ('gw', 0.113), ('st', 0.111), ('ixed', 0.11), ('psd', 0.108), ('ut', 0.103), ('ksr', 0.103), ('rt', 0.096), ('kulis', 0.094), ('pproach', 0.094), ('gt', 0.092), ('metric', 0.089), ('flat', 0.087), ('distance', 0.085), ('rd', 0.085), ('online', 0.083), ('absil', 0.082), ('egression', 0.076), ('itml', 0.072), ('lego', 0.072), ('lmnn', 0.072), ('logdet', 0.072), ('utt', 0.072), ('tr', 0.069), ('kernel', 0.067), ('manifold', 0.067), ('gradient', 0.064), ('bonnabel', 0.062), ('rwt', 0.062), ('qf', 0.061), ('gl', 0.061), ('semide', 0.06), ('davis', 0.057), ('pca', 0.056), ('xt', 0.054), ('manifolds', 0.054), ('weinberger', 0.054), ('nmi', 0.051), ('pola', 0.051), ('sepulchre', 0.051), ('constraints', 0.045), ('horizontal', 0.043), ('matrices', 0.042), ('tsuda', 0.042), ('yi', 0.041), ('geodesics', 0.041), ('closeness', 0.04), ('isolet', 0.039), ('bregman', 0.038), ('descent', 0.038), ('rank', 0.038), ('equivalence', 0.037), ('asuncion', 0.036), ('newman', 0.036), ('invariance', 0.036), ('tw', 0.036), ('update', 0.036), ('ggt', 0.035), ('geometries', 0.035), ('uo', 0.035), ('multidimensional', 0.034), ('divergences', 0.034), ('samples', 0.033), ('regression', 0.032), ('clustering', 0.032), ('invariant', 0.032), ('jain', 0.032), ('cost', 0.032), ('mvu', 0.031), ('meyer', 0.031), ('classification', 0.031), ('formulas', 0.031), ('af', 0.031), ('xi', 0.031), ('gyrb', 0.031), ('retractions', 0.031), ('smax', 0.031), ('twt', 0.031), ('wsym', 0.031), ('xxt', 0.031), ('gu', 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
Author: Gilles Meyer, Silvère Bonnabel, Rodolphe Sepulchre
Abstract: The paper addresses the problem of learning a regression model parameterized by a fixed-rank positive semidefinite matrix. The focus is on the nonlinear nature of the search space and on scalability to high-dimensional problems. The mathematical developments rely on the theory of gradient descent algorithms adapted to the Riemannian geometry that underlies the set of fixedrank positive semidefinite matrices. In contrast with previous contributions in the literature, no restrictions are imposed on the range space of the learned matrix. The resulting algorithms maintain a linear complexity in the problem size and enjoy important invariance properties. We apply the proposed algorithms to the problem of learning a distance function parameterized by a positive semidefinite matrix. Good performance is observed on classical benchmarks. Keywords: linear regression, positive semidefinite matrices, low-rank approximation, Riemannian geometry, gradient-based learning
2 0.18631899 87 jmlr-2011-Stochastic Methods forl1-regularized Loss Minimization
Author: Shai Shalev-Shwartz, Ambuj Tewari
Abstract: We describe and analyze two stochastic methods for ℓ1 regularized loss minimization problems, such as the Lasso. The first method updates the weight of a single feature at each iteration while the second method updates the entire weight vector but only uses a single training example at each iteration. In both methods, the choice of feature or example is uniformly at random. Our theoretical runtime analysis suggests that the stochastic methods should outperform state-of-the-art deterministic approaches, including their deterministic counterparts, when the size of the problem is large. We demonstrate the advantage of stochastic methods by experimenting with synthetic and natural data sets.1 Keywords: L1 regularization, optimization, coordinate descent, mirror descent, sparsity
3 0.14826575 36 jmlr-2011-Generalized TD Learning
Author: Tsuyoshi Ueno, Shin-ichi Maeda, Motoaki Kawanabe, Shin Ishii
Abstract: Since the invention of temporal difference (TD) learning (Sutton, 1988), many new algorithms for model-free policy evaluation have been proposed. Although they have brought much progress in practical applications of reinforcement learning (RL), there still remain fundamental problems concerning statistical properties of the value function estimation. To solve these problems, we introduce a new framework, semiparametric statistical inference, to model-free policy evaluation. This framework generalizes TD learning and its extensions, and allows us to investigate statistical properties of both of batch and online learning procedures for the value function estimation in a unified way in terms of estimating functions. Furthermore, based on this framework, we derive an optimal estimating function with the minimum asymptotic variance and propose batch and online learning algorithms which achieve the optimality. Keywords: reinforcement learning, model-free policy evaluation, TD learning, semiparametirc model, estimating function
4 0.1390972 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
Author: John Duchi, Elad Hazan, Yoram Singer
Abstract: We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. Metaphorically, the adaptation allows us to find needles in haystacks in the form of very predictive but rarely seen features. Our paradigm stems from recent advances in stochastic optimization and online learning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. We give several efficient algorithms for empirical risk minimization problems with common and important regularization functions and domain constraints. We experimentally study our theoretical analysis and show that adaptive subgradient methods outperform state-of-the-art, yet non-adaptive, subgradient algorithms. Keywords: subgradient methods, adaptivity, online learning, stochastic convex optimization
5 0.13232724 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation
Author: Ryota Tomioka, Taiji Suzuki, Masashi Sugiyama
Abstract: We analyze the convergence behaviour of a recently proposed algorithm for regularized estimation called Dual Augmented Lagrangian (DAL). Our analysis is based on a new interpretation of DAL as a proximal minimization algorithm. We theoretically show under some conditions that DAL converges super-linearly in a non-asymptotic and global sense. Due to a special modelling of sparse estimation problems in the context of machine learning, the assumptions we make are milder and more natural than those made in conventional analysis of augmented Lagrangian algorithms. In addition, the new interpretation enables us to generalize DAL to wide varieties of sparse estimation problems. We experimentally confirm our analysis in a large scale ℓ1 -regularized logistic regression problem and extensively compare the efficiency of DAL algorithm to previously proposed algorithms on both synthetic and benchmark data sets. Keywords: dual augmented Lagrangian, proximal minimization, global convergence, sparse estimation, convex optimization
6 0.089073956 55 jmlr-2011-Learning Multi-modal Similarity
7 0.076378822 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
8 0.068967715 14 jmlr-2011-Better Algorithms for Benign Bandits
9 0.065684125 3 jmlr-2011-A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis
10 0.063851871 60 jmlr-2011-Locally Defined Principal Curves and Surfaces
11 0.059801783 28 jmlr-2011-Double Updating Online Learning
12 0.051474869 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
13 0.049506269 56 jmlr-2011-Learning Transformation Models for Ranking and Survival Analysis
14 0.049182698 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
15 0.044235874 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms
16 0.042930093 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
17 0.041274782 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
18 0.040951792 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm
19 0.037315492 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes
20 0.036439948 6 jmlr-2011-A Simpler Approach to Matrix Completion
topicId topicWeight
[(0, 0.242), (1, 0.334), (2, -0.034), (3, -0.152), (4, -0.091), (5, -0.142), (6, -0.023), (7, -0.021), (8, -0.017), (9, -0.004), (10, 0.04), (11, 0.051), (12, 0.019), (13, -0.016), (14, -0.077), (15, -0.009), (16, 0.199), (17, 0.019), (18, -0.066), (19, -0.126), (20, 0.049), (21, 0.065), (22, -0.005), (23, -0.005), (24, 0.027), (25, 0.027), (26, -0.011), (27, -0.056), (28, -0.031), (29, 0.0), (30, 0.099), (31, 0.031), (32, -0.059), (33, -0.056), (34, -0.068), (35, -0.015), (36, 0.045), (37, -0.003), (38, -0.012), (39, -0.034), (40, 0.03), (41, -0.047), (42, 0.033), (43, 0.071), (44, -0.044), (45, 0.001), (46, -0.022), (47, -0.006), (48, 0.023), (49, -0.053)]
simIndex simValue paperId paperTitle
same-paper 1 0.94248533 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
Author: Gilles Meyer, Silvère Bonnabel, Rodolphe Sepulchre
Abstract: The paper addresses the problem of learning a regression model parameterized by a fixed-rank positive semidefinite matrix. The focus is on the nonlinear nature of the search space and on scalability to high-dimensional problems. The mathematical developments rely on the theory of gradient descent algorithms adapted to the Riemannian geometry that underlies the set of fixedrank positive semidefinite matrices. In contrast with previous contributions in the literature, no restrictions are imposed on the range space of the learned matrix. The resulting algorithms maintain a linear complexity in the problem size and enjoy important invariance properties. We apply the proposed algorithms to the problem of learning a distance function parameterized by a positive semidefinite matrix. Good performance is observed on classical benchmarks. Keywords: linear regression, positive semidefinite matrices, low-rank approximation, Riemannian geometry, gradient-based learning
2 0.70279527 36 jmlr-2011-Generalized TD Learning
Author: Tsuyoshi Ueno, Shin-ichi Maeda, Motoaki Kawanabe, Shin Ishii
Abstract: Since the invention of temporal difference (TD) learning (Sutton, 1988), many new algorithms for model-free policy evaluation have been proposed. Although they have brought much progress in practical applications of reinforcement learning (RL), there still remain fundamental problems concerning statistical properties of the value function estimation. To solve these problems, we introduce a new framework, semiparametric statistical inference, to model-free policy evaluation. This framework generalizes TD learning and its extensions, and allows us to investigate statistical properties of both of batch and online learning procedures for the value function estimation in a unified way in terms of estimating functions. Furthermore, based on this framework, we derive an optimal estimating function with the minimum asymptotic variance and propose batch and online learning algorithms which achieve the optimality. Keywords: reinforcement learning, model-free policy evaluation, TD learning, semiparametirc model, estimating function
3 0.66833538 87 jmlr-2011-Stochastic Methods forl1-regularized Loss Minimization
Author: Shai Shalev-Shwartz, Ambuj Tewari
Abstract: We describe and analyze two stochastic methods for ℓ1 regularized loss minimization problems, such as the Lasso. The first method updates the weight of a single feature at each iteration while the second method updates the entire weight vector but only uses a single training example at each iteration. In both methods, the choice of feature or example is uniformly at random. Our theoretical runtime analysis suggests that the stochastic methods should outperform state-of-the-art deterministic approaches, including their deterministic counterparts, when the size of the problem is large. We demonstrate the advantage of stochastic methods by experimenting with synthetic and natural data sets.1 Keywords: L1 regularization, optimization, coordinate descent, mirror descent, sparsity
4 0.62459671 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation
Author: Ryota Tomioka, Taiji Suzuki, Masashi Sugiyama
Abstract: We analyze the convergence behaviour of a recently proposed algorithm for regularized estimation called Dual Augmented Lagrangian (DAL). Our analysis is based on a new interpretation of DAL as a proximal minimization algorithm. We theoretically show under some conditions that DAL converges super-linearly in a non-asymptotic and global sense. Due to a special modelling of sparse estimation problems in the context of machine learning, the assumptions we make are milder and more natural than those made in conventional analysis of augmented Lagrangian algorithms. In addition, the new interpretation enables us to generalize DAL to wide varieties of sparse estimation problems. We experimentally confirm our analysis in a large scale ℓ1 -regularized logistic regression problem and extensively compare the efficiency of DAL algorithm to previously proposed algorithms on both synthetic and benchmark data sets. Keywords: dual augmented Lagrangian, proximal minimization, global convergence, sparse estimation, convex optimization
5 0.50603348 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
Author: Jinfeng Zhuang, Ivor W. Tsang, Steven C.H. Hoi
Abstract: Previous studies of Non-Parametric Kernel Learning (NPKL) usually formulate the learning task as a Semi-Definite Programming (SDP) problem that is often solved by some general purpose SDP solvers. However, for N data examples, the time complexity of NPKL using a standard interiorpoint SDP solver could be as high as O(N 6.5 ), which prohibits NPKL methods applicable to real applications, even for data sets of moderate size. In this paper, we present a family of efficient NPKL algorithms, termed “SimpleNPKL”, which can learn non-parametric kernels from a large set of pairwise constraints efficiently. In particular, we propose two efficient SimpleNPKL algorithms. One is SimpleNPKL algorithm with linear loss, which enjoys a closed-form solution that can be efficiently computed by the Lanczos sparse eigen decomposition technique. Another one is SimpleNPKL algorithm with other loss functions (including square hinge loss, hinge loss, square loss) that can be re-formulated as a saddle-point optimization problem, which can be further resolved by a fast iterative algorithm. In contrast to the previous NPKL approaches, our empirical results show that the proposed new technique, maintaining the same accuracy, is significantly more efficient and scalable. Finally, we also demonstrate that the proposed new technique is also applicable to speed up many kernel learning tasks, including colored maximum variance unfolding, minimum volume embedding, and structure preserving embedding. Keywords: non-parametric kernel learning, semi-definite programming, semi-supervised learning, side information, pairwise constraints
6 0.46392581 55 jmlr-2011-Learning Multi-modal Similarity
7 0.44927028 3 jmlr-2011-A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis
8 0.43349341 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
9 0.38907802 60 jmlr-2011-Locally Defined Principal Curves and Surfaces
10 0.37155202 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
11 0.34793514 28 jmlr-2011-Double Updating Online Learning
12 0.31094787 56 jmlr-2011-Learning Transformation Models for Ranking and Survival Analysis
13 0.29275802 14 jmlr-2011-Better Algorithms for Benign Bandits
14 0.2776866 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm
15 0.2661199 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms
16 0.26493254 6 jmlr-2011-A Simpler Approach to Matrix Completion
17 0.23979324 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood
18 0.23725623 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
19 0.2307947 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
20 0.22734815 48 jmlr-2011-Kernel Analysis of Deep Networks
topicId topicWeight
[(4, 0.036), (9, 0.03), (10, 0.023), (11, 0.449), (24, 0.044), (31, 0.082), (32, 0.03), (41, 0.019), (60, 0.015), (65, 0.011), (70, 0.022), (73, 0.057), (78, 0.079), (90, 0.015)]
simIndex simValue paperId paperTitle
1 0.86202431 41 jmlr-2011-Improved Moves for Truncated Convex Models
Author: M. Pawan Kumar, Olga Veksler, Philip H.S. Torr
Abstract: We consider the problem of obtaining an approximate maximum a posteriori estimate of a discrete random field characterized by pairwise potentials that form a truncated convex model. For this problem, we propose two st-MINCUT based move making algorithms that we call Range Swap and Range Expansion. Our algorithms can be thought of as extensions of αβ-Swap and α-Expansion respectively that fully exploit the form of the pairwise potentials. Specifically, instead of dealing with one or two labels at each iteration, our methods explore a large search space by considering a range of labels (that is, an interval of consecutive labels). Furthermore, we show that Range Expansion provides the same multiplicative bounds as the standard linear programming (LP) relaxation in polynomial time. Compared to previous approaches based on the LP relaxation, for example interior-point algorithms or tree-reweighted message passing (TRW), our methods are faster as they use only the efficient st-MINCUT algorithm in their design. We demonstrate the usefulness of the proposed approaches on both synthetic and standard real data problems. Keywords: truncated convex models, move making algorithms, range moves, multiplicative bounds, linear programming relaxation
same-paper 2 0.74686688 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
Author: Gilles Meyer, Silvère Bonnabel, Rodolphe Sepulchre
Abstract: The paper addresses the problem of learning a regression model parameterized by a fixed-rank positive semidefinite matrix. The focus is on the nonlinear nature of the search space and on scalability to high-dimensional problems. The mathematical developments rely on the theory of gradient descent algorithms adapted to the Riemannian geometry that underlies the set of fixedrank positive semidefinite matrices. In contrast with previous contributions in the literature, no restrictions are imposed on the range space of the learned matrix. The resulting algorithms maintain a linear complexity in the problem size and enjoy important invariance properties. We apply the proposed algorithms to the problem of learning a distance function parameterized by a positive semidefinite matrix. Good performance is observed on classical benchmarks. Keywords: linear regression, positive semidefinite matrices, low-rank approximation, Riemannian geometry, gradient-based learning
3 0.3231225 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
Author: Jinfeng Zhuang, Ivor W. Tsang, Steven C.H. Hoi
Abstract: Previous studies of Non-Parametric Kernel Learning (NPKL) usually formulate the learning task as a Semi-Definite Programming (SDP) problem that is often solved by some general purpose SDP solvers. However, for N data examples, the time complexity of NPKL using a standard interiorpoint SDP solver could be as high as O(N 6.5 ), which prohibits NPKL methods applicable to real applications, even for data sets of moderate size. In this paper, we present a family of efficient NPKL algorithms, termed “SimpleNPKL”, which can learn non-parametric kernels from a large set of pairwise constraints efficiently. In particular, we propose two efficient SimpleNPKL algorithms. One is SimpleNPKL algorithm with linear loss, which enjoys a closed-form solution that can be efficiently computed by the Lanczos sparse eigen decomposition technique. Another one is SimpleNPKL algorithm with other loss functions (including square hinge loss, hinge loss, square loss) that can be re-formulated as a saddle-point optimization problem, which can be further resolved by a fast iterative algorithm. In contrast to the previous NPKL approaches, our empirical results show that the proposed new technique, maintaining the same accuracy, is significantly more efficient and scalable. Finally, we also demonstrate that the proposed new technique is also applicable to speed up many kernel learning tasks, including colored maximum variance unfolding, minimum volume embedding, and structure preserving embedding. Keywords: non-parametric kernel learning, semi-definite programming, semi-supervised learning, side information, pairwise constraints
4 0.29321888 91 jmlr-2011-The Sample Complexity of Dictionary Learning
Author: Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein
Abstract: A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L2 error in representation when the dictionary is used. For the case of l1 regularized coefficient selection we provide a generalnp ln(mλ)/m , where n is the dimension, p is the number of ization bound of the order of O elements in the dictionary, λ is a bound on the l1 norm of the coefficient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order O( np ln(mk)/m) under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as 1/m, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements. Keywords: dictionary learning, generalization bound, sparse representation
5 0.29156131 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
Author: Mauricio A. Álvarez, Neil D. Lawrence
Abstract: Recently there has been an increasing interest in regression methods that deal with multiple outputs. This has been motivated partly by frameworks like multitask learning, multisensor networks or structured output data. From a Gaussian processes perspective, the problem reduces to specifying an appropriate covariance function that, whilst being positive semi-definite, captures the dependencies between all the data points and across all the outputs. One approach to account for non-trivial correlations between outputs employs convolution processes. Under a latent function interpretation of the convolution transform we establish dependencies between output variables. The main drawbacks of this approach are the associated computational and storage demands. In this paper we address these issues. We present different efficient approximations for dependent output Gaussian processes constructed through the convolution formalism. We exploit the conditional independencies present naturally in the model. This leads to a form of the covariance similar in spirit to the so called PITC and FITC approximations for a single output. We show experimental results with synthetic and real data, in particular, we show results in school exams score prediction, pollution prediction and gene expression data. Keywords: Gaussian processes, convolution processes, efficient approximations, multitask learning, structured outputs, multivariate processes
6 0.29154843 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
7 0.29023892 12 jmlr-2011-Bayesian Co-Training
8 0.28992495 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
9 0.2887527 35 jmlr-2011-Forest Density Estimation
10 0.28780675 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood
11 0.28641391 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
12 0.28460467 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
13 0.28245607 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
14 0.28204697 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
15 0.28193218 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
16 0.28138229 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
17 0.28082141 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
18 0.28067055 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
19 0.28064641 16 jmlr-2011-Clustering Algorithms for Chains
20 0.2805073 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity