jmlr jmlr2011 jmlr2011-87 knowledge-graph by maker-knowledge-mining

87 jmlr-2011-Stochastic Methods for ℓ1-regularized Loss Minimization


Source: pdf

Author: Shai Shalev-Shwartz, Ambuj Tewari

Abstract: We describe and analyze two stochastic methods for ℓ1 regularized loss minimization problems, such as the Lasso. The first method updates the weight of a single feature at each iteration while the second method updates the entire weight vector but only uses a single training example at each iteration. In both methods, the choice of feature or example is uniformly at random. Our theoretical runtime analysis suggests that the stochastic methods should outperform state-of-the-art deterministic approaches, including their deterministic counterparts, when the size of the problem is large. We demonstrate the advantage of stochastic methods by experimenting with synthetic and natural data sets. Keywords: L1 regularization, optimization, coordinate descent, mirror descent, sparsity

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Computer Science Department, The University of Texas at Austin, Austin, TX 78701, USA. Editor: Leon Bottou. Abstract: We describe and analyze two stochastic methods for ℓ1 regularized loss minimization problems, such as the Lasso. [sent-6, score-0.175]

2 Our theoretical runtime analysis suggests that the stochastic methods should outperform state-of-the-art deterministic approaches, including their deterministic counterparts, when the size of the problem is large. [sent-9, score-0.241]

3 Keywords: L1 regularization, optimization, coordinate descent, mirror descent, sparsity. [sent-11, score-0.222]

4 Introduction. We present optimization procedures for solving problems of the form: min_{w ∈ R^d} (1/m) ∑_{i=1}^m L(⟨w, x_i⟩, y_i) + λ‖w‖₁, (1) where (x₁, y₁), . [sent-12, score-0.085]

5 The first method we propose is a stochastic version of the familiar coordinate descent approach. [sent-31, score-0.279]

6 The coordinate descent approach for solving ℓ1 regularized problems is not new (as we survey below in Section 1. [sent-32, score-0.289]

7 At each iteration of coordinate descent, a single element of w is updated. [sent-34, score-0.109]

8 This simple modification enables us to show that the runtime required to achieve ε (expected) accuracy is upper bounded by O(m d β ‖w⋆‖₂² / ε), (2) where β is a constant which only depends on the loss function (e. [sent-40, score-0.156]

9 This bound tells us that the runtime grows only linearly with the size of the problem. [sent-43, score-0.16]

10 Another well-known stochastic method that has been successfully applied to loss minimization problems is stochastic gradient descent (e. [sent-45, score-0.34]

11 In stochastic gradient descent, at each iteration, we pick one example from the training set, uniformly at random, and update the weight vector based on the chosen example. [sent-49, score-0.151]

12 The attractiveness of stochastic gradient descent methods is that their runtime does not depend at all on the number of examples, and can even sometimes decrease with the number of examples (see Bottou and Bousquet, 2008; Shalev-Shwartz and Srebro, 2008). [sent-50, score-0.378]

13 Unfortunately, the stochastic gradient descent method fails to produce sparse solutions, which makes the algorithm both slower and less attractive as sparsity is one of the major reasons to use ℓ1 regularization. [sent-51, score-0.285]

14 (2008) suggested replacing the ℓ1 regularization term with a constraint of the form ‖w‖₁ ≤ B, and then using a stochastic gradient projection procedure. [sent-54, score-0.14]

15 In this approach, the elements of w that cross 0 after the stochastic gradient step are truncated to 0, hence sparsity is achieved. [sent-57, score-0.151]
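A minimal sketch of this truncation idea: take an ordinary stochastic (sub)gradient step, then zero out any coordinate that changed sign during the step. The function and parameter names below are illustrative, not the cited authors' notation, and `loss_grad` is an assumed user-supplied derivative of the loss.

```python
import numpy as np

def truncated_sgd_step(w, x, y, eta, lam, loss_grad):
    """One stochastic (sub)gradient step on a single example (x, y);
    coordinates of w that cross zero during the step are truncated to 0."""
    g = loss_grad(np.dot(w, x), y) * x            # gradient of the loss term
    w_new = w - eta * (g + lam * np.sign(w))      # subgradient step for lam * ||w||_1
    crossed = np.sign(w_new) * np.sign(w) < 0     # sign flip means the coordinate crossed zero
    w_new[crossed] = 0.0
    return w_new
```

For the squared loss one would take `loss_grad(a, y) = a - y`; coordinates that flip sign in a single step are set exactly to zero, which is how sparsity is retained.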

16 (2009) methods is that, in some situations, their runtime might grow quadratically with the dimension d, even if the optimal predictor w⋆ is very sparse (see Section 1. [sent-60, score-0.158]

17 This quadratic dependence on d can be avoided if one uses mirror descent updates (Beck and Teboulle, 2003) such as the exponentiated gradient approach (Littlestone, 1988; Kivinen and Warmuth, 1997; Beck and Teboulle, 2003). [sent-62, score-0.338]

18 , 2009) with another variant of stochastic mirror descent, which is based on p-norm updates (Grove et al. [sent-65, score-0.194]

19 In particular, for the logistic-loss and the squared-loss we obtain the following upper bound on the runtime to achieving ε expected accuracy: O(d log(d) ‖w⋆‖₁² / ε²). [sent-70, score-0.16]

20 (3) Comparing the above with the runtime bound of the stochastic coordinate descent method given in (2), we note three major differences. [sent-71, score-0.439]

21 First, while the bound in (2) depends on the number of examples, m, the runtime of SMIDAS does not depend on m at all. [sent-72, score-0.16]

22 On the flip side, the dependence of stochastic coordinate descent on the dimension is better, both because of the lack of the log(d) term and because ‖w⋆‖₂² is always smaller than ‖w⋆‖₁² (the ratio is at most d). [sent-73, score-0.294]

23 Finally, we would like to point out that while the stochastic coordinate descent method is parameter free, the success of SMIDAS and of the method of Langford et al. [sent-76, score-0.279]

24 Related Work. We now survey several existing methods and in particular show how our stochastic twist enables us to give superior runtime guarantees. [sent-79, score-0.217]

25 (2007) described a coordinate descent method (called BBR) for minimizing ℓ1 regularized objectives. [sent-83, score-0.289]

26 First, and most important, at each iteration we choose a coordinate uniformly at random. [sent-85, score-0.131]

27 In a series of experiments, they observed that cyclic coordinate descent outperforms many alternative popular methods such as LARS (Efron et al. [sent-96, score-0.229]

28 Luo and Tseng (1992) established a linear convergence result for coordinate descent algorithms. [sent-103, score-0.214]

29 In an attempt to improve the dependence on the size of the problem, Tseng and Yun (2009) recently studied other variants of block coordinate descent for optimizing ‘smooth plus separable’ objectives. [sent-106, score-0.229]

30 In particular, ℓ1 regularized loss minimization (1) is of this form, provided that the loss function is smooth. [sent-107, score-0.128]

31 Translated to our notation, the runtime bound given in Tseng and Yun (2009) is O(m d² β ‖w⋆‖₂² / ε). [sent-109, score-0.16]

32 This bound is inferior to our runtime bound for stochastic coordinate descent given in (2) by a factor of the dimension d. [sent-110, score-0.476]

33 Coordinate Descent Methods for ℓ1 Domain Constraints. A different, but related, optimization problem is to minimize the loss, (1/m) ∑_i L(⟨w, x_i⟩, y_i), subject to a domain constraint of the form ‖w‖₁ ≤ B. [sent-113, score-0.085]

34 Since at each iteration of the algorithm one needs to calculate the gradient of the loss at w, the runtime of each iteration is O(m d). [sent-121, score-0.252]

35 Therefore, the total runtime becomes O(m d β ‖w⋆‖₂² / ε). [sent-122, score-0.138]

36 This seems to imply that any deterministic method that goes over the entire data set at each iteration will have a runtime inferior to the one we derive for stochastic coordinate descent. [sent-128, score-0.458]

37 Stochastic Gradient Descent and Mirror Descent. Stochastic gradient descent (SGD) is considered to be one of the best methods for large scale loss minimization, when we measure how fast a method achieves a certain generalization error. [sent-131, score-0.193]

38 They also provide bounds on the runtime of the resulting algorithm. [sent-137, score-0.154]

39 , without assuming low objective relative to ε), their analysis implies the following runtime bound: O(d ‖w⋆‖₂² X₂² / ε²), (4) where X₂² = (1/m) ∑_i ‖x_i‖₂² is the average squared norm of an instance. [sent-140, score-0.212]

40 Specifically, if w⋆ has only k ≪ d non-zero elements and each x_i is dense (say x_i ∈ {−1, +1}^d), then the ratio between the above bound of SGD and the bound in (3) becomes d / (k log(d)) ≫ 1. [sent-142, score-0.146]
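Reading the comparison above as a ratio of roughly d / (k log d) between the SGD bound (4) and the bound in (3), a quick sanity check of its magnitude; the values of d and k below are illustrative, not from the paper's experiments:

```python
import math

# Ratio between the SGD bound (4) and the bound in (3) when w* is k-sparse
# and the instances are dense: roughly d / (k * log(d)).
d, k = 10**6, 100          # illustrative dimension and sparsity level
ratio = d / (k * math.log(d))
```

For these values the ratio is in the hundreds, so the dimension dependence of SGD dominates even for moderately sparse optima.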

41 […] regularization, he should also believe that w⋆ is sparse, and thus our runtime bound in (3) is likely to be superior. [sent-147, score-0.16]

42 The reader familiar with the online learning and mirror descent literature will not be surprised by the above discussion. [sent-148, score-0.245]

43 Recent Works Dealing with Stochastic Methods for Large Scale Regularized Loss Minimization. Since the publication of the conference version (Shalev-Shwartz and Tewari, 2009) of this paper, several papers proposing stochastic algorithms for regularized loss minimization have appeared. [sent-157, score-0.175]

44 Viewing the average loss in (1) leads to interesting connections with the area of Stochastic Convex Optimization that deals with minimizing a convex function given access to an oracle that can return unbiased estimates of the gradient of the convex function at any query point. [sent-165, score-0.092]

45 Finally, Nesterov (2010) has analyzed randomized versions of coordinate descent for unconstrained and constrained minimization of smooth convex functions. [sent-168, score-0.246]

46 Stochastic Coordinate Descent. To simplify the notation throughout this section, we rewrite the problem in (1) as min_{w ∈ R^d} P(w), where P(w) = (1/m) ∑_{i=1}^m L(⟨w, x_i⟩, y_i) + λ‖w‖₁. [sent-170, score-0.085]

47 We are now ready to present the stochastic coordinate descent algorithm. [sent-174, score-0.279]

48 At each iteration, we pick a coordinate j uniformly at random from [d]. [sent-176, score-0.105]

49 That is, g_j = (1/m) ∑_{i=1}^m L′(⟨w, x_i⟩, y_i) x_{i,j}, where L′ is the derivative of the loss function with respect to its first argument. [sent-181, score-0.123]
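The coordinate update itself is not spelled out in the fragments above; a common closed-form step for a β-smooth loss plus an ℓ1 term is a soft-threshold of w_j − g_j/β. The sketch below follows that standard form under stated assumptions; the function name, the `loss_deriv` callback, and the use of a NumPy random generator are ours, not the paper's pseudocode:

```python
import numpy as np

def scd_step(w, X, y, beta, lam, loss_deriv, rng):
    """One stochastic coordinate descent step for
    (1/m) * sum_i L(<w, x_i>, y_i) + lam * ||w||_1,
    where beta bounds the curvature of the scalar loss L."""
    m, d = X.shape
    j = rng.integers(d)                               # coordinate chosen uniformly at random
    g_j = np.mean(loss_deriv(X @ w, y) * X[:, j])     # g_j = (1/m) sum_i L'(<w, x_i>, y_i) x_ij
    z = w[j] - g_j / beta
    w[j] = np.sign(z) * max(abs(z) - lam / beta, 0.0) # soft-threshold keeps w sparse
    return w
```

Each step touches a single column of the data matrix, which is the source of the per-iteration cost savings discussed in the text.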

50 Runtime Guarantee. The following theorem establishes a runtime guarantee for SCD. [sent-211, score-0.138]

51 Define the potential Φ(w_t) = (1/2)‖w⋆ − w_t‖₂², and let ∆_{t,j} = Φ(w_{t−1}) − Φ(w_{t−1} + η_j e_j) be the change in the potential assuming we update w_{t−1} using coordinate j. [sent-220, score-0.88]

52 Note that we have E[‖w_t‖₁ | w_{t−1}] = (1/d) ∑_{k=1}^d ‖w_{t−1} + η_k e_k‖₁ = (1/d) ∑_{k=1}^d (‖w_{t−1}‖₁ − |w_{t−1,k}| + |w_{t−1,k} + η_k|) = (1 − 1/d)‖w_{t−1}‖₁ + (1/d) ∑_{k=1}^d |w_{t−1,k} + η_k|. [sent-225, score-0.778]
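The display above computes the expected ℓ1 norm after updating one uniformly chosen coordinate. The identity holds exactly for any vectors, and can be checked numerically; the vectors below are arbitrary illustrative values, not data from the paper:

```python
import numpy as np

# E over a uniform coordinate k of ||w_{t-1} + eta_k e_k||_1 equals
# (1 - 1/d) * ||w_{t-1}||_1 + (1/d) * sum_k |w_{t-1,k} + eta_k|.
w_prev = np.array([0.5, -1.0, 0.0])
eta = np.array([0.2, 0.3, -0.4])       # per-coordinate step sizes (illustrative)
d = len(w_prev)
lhs = np.mean([np.abs(w_prev + eta[k] * np.eye(d)[k]).sum() for k in range(d)])
rhs = (1 - 1 / d) * np.abs(w_prev).sum() + np.abs(w_prev + eta).sum() / d
assert abs(lhs - rhs) < 1e-12
```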

53 d k=1 Plugging this above gives us, βE[Φ(wt−1 ) − Φ(wt ) | wt−1 ] ≥ E[C(wt ) + λ wt 1 | wt−1 ] −C(wt−1 ) − λ wt−1 = E[P(wt ) | wt−1 ] − P(wt−1 ) + 1+ C(wt−1 ) + λ wt−1 ⋆ 1 −C(w ) − λ w⋆ 1 d P(wt−1 ) − P(w⋆ ) . [sent-226, score-0.759]

54 Next, we specify the runtime bound for the case of ℓ1 regularized logistic-regression and squared loss. [sent-240, score-0.235]

55 Since the average cost of each iteration is s m, where s is as defined in (8), we end up with the total runtime O(s m d ((1/4)‖w⋆‖₂² + 2) / ε). [sent-244, score-0.164]

56 The above is the runtime required to achieve expected ε-accuracy. [sent-245, score-0.138]

57 Using Theorem 2, the required runtime to achieve ε-accuracy with a probability of at least 1 − δ is O(s m d ((1/2)‖w⋆‖₂² + 4)(2 + ⌈log(1/δ)⌉) / ε). [sent-246, score-0.159]

58 Assuming that the targets are normalized so that C(0) ≤ 1, and using a similar derivation, we obtain the total runtime bound O(s m d (2‖w⋆‖₂² + 4)(2 + ⌈log(1/δ)⌉) / ε). [sent-248, score-0.181]

59 Stochastic Mirror Descent Made Sparse. In this section, we describe our mirror descent approach for ℓ1 regularized loss minimization that maintains intermediate sparse solutions. [sent-250, score-0.375]

60 Recall that we rewrite the problem in (1) as min_{w ∈ R^d} P(w), where P(w) = C(w) + λ‖w‖₁ and C(w) = (1/m) ∑_{i=1}^m L(⟨w, x_i⟩, y_i). [sent-251, score-0.085]

61 (11) Mirror descent algorithms (Nemirovski and Yudin, 1978, Chapter 3) maintain two weight vectors: primal w and dual θ. [sent-252, score-0.131]

62 In our mirror descent variant, we use the p-norm link function. [sent-256, score-0.263]

63 We first describe how mirror descent algorithms can be applied to the objective C(w) without the ℓ1 regularization term. [sent-262, score-0.291]

64 We then estimate the gradient of C(w) by calculating the vector v = L′(⟨w, x_i⟩, y_i) x_i. [sent-267, score-0.166]

65 If the link function is the identity mapping, this step is identical to the update of stochastic gradient descent. [sent-271, score-0.147]
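A sketch of a sparse mirror-descent step with a p-norm link function, in the spirit of the update described here: a gradient step in the dual vector θ, then a shrink of the dual coordinates toward zero to promote sparsity. The function names, the truncation-on-the-dual placement, and the `loss_deriv` callback are our assumptions for illustration, not the paper's exact pseudocode:

```python
import numpy as np

def pnorm_link_inv(theta, p):
    """Inverse p-norm link f^{-1}(theta): maps the dual vector back to the primal."""
    norm = np.linalg.norm(theta, p)
    if norm == 0.0:
        return np.zeros_like(theta)
    return np.sign(theta) * np.abs(theta) ** (p - 1) / norm ** (p - 2)

def sparse_mirror_step(theta, x, y, eta, lam, p, loss_deriv):
    """One sparse mirror-descent step on a single example (x, y)."""
    w = pnorm_link_inv(theta, p)
    v = loss_deriv(np.dot(w, x), y) * x                                 # stochastic gradient
    theta = theta - eta * v                                             # dual gradient step
    theta = np.sign(theta) * np.maximum(np.abs(theta) - eta * lam, 0.0) # truncation toward 0
    return theta, pnorm_link_inv(theta, p)
```

With p = 2 the link is the identity and the step reduces to the truncated stochastic gradient update; with p of order log d one obtains the dimension dependence discussed in the runtime bounds.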

66 1 Runtime Guarantee We now provide runtime guarantees for Algorithm 3. [sent-288, score-0.138]

67 , m} let v = L′(⟨w, x_i⟩, y_i) x_i (L′ is the derivative of L). [sent-300, score-0.142]

68 Denote the value of w at the end of iteration t by w_t (with w_0 = 0) and set w_o = w_r for r chosen uniformly at random from 0, . [sent-305, score-0.871]

69 Let θ_t be the value of θ at the beginning of iteration t of the algorithm, let v_t be the value of v, and let θ̃_t = θ_t − η v_t. [sent-317, score-0.099]

70 Let w_t = f⁻¹(θ_t) and w̃_t = f⁻¹(θ̃_t), where f⁻¹ is as defined in (12). [sent-318, score-0.759]

71 Consider the Bregman divergence ∆_F(w, w′) = F(w) − F(w′) − ⟨∇F(w′), w − w′⟩ = F(w) − F(w′) − ⟨f(w′), w − w′⟩, and define the potential Ψ(w) = ∆_F(w⋆, w). [sent-320, score-0.759]

72 4 of Shalev-Shwartz, 2007), we have ∆_F(w̃_t, w_t) ≥ ((q − 1)/2) ‖w̃_t − w_t‖_q². [sent-326, score-2.277]

73 Moreover, using the Fenchel-Young inequality with the conjugate functions g(x) = ((q − 1)/2)‖x‖_q² and g⋆(x) = (1/(2(q − 1)))‖x‖_p², we have |⟨η v_t, w̃_t − w⋆⟩| ≤ (η²/(2(q − 1)))‖v_t‖_p² + ((q − 1)/2)‖w̃_t − w⋆‖_q². [sent-327, score-1.591]

74 2 q and g⋆ (x) = S HALEV-S HWARTZ AND T EWARI From (13), we obtain that vt Thus, 2 p ≤ vt ∞d 1/p 2 ≤ ρ2 d 2/p = ρ2 e . [sent-330, score-0.146]

75 (18) Ψ(w_t) − Ψ(w̃_t) ≥ η(L(⟨w_t, x_i⟩, y_i) − L(⟨w⋆, x_i⟩, y_i)) − η²(p − 1)ρ² e / 2. [sent-331, score-0.929]

76 (19) So far, our analysis has followed the standard analysis of mirror descent (see, e. [sent-332, score-0.245]

77 (20) To show this, we begin the same way as we did to obtain (16): Ψ(w̃_t) − Ψ(w_{t+1}) = ∆_F(w_{t+1}, w̃_t) + ⟨f(w̃_t) − f(w_{t+1}), w_{t+1} − w⋆⟩ = ∆_F(w_{t+1}, w̃_t) + ⟨θ̃_t − θ_{t+1}, w_{t+1} − w⋆⟩ ≥ ⟨θ̃_t − θ_{t+1}, w_{t+1} − w⋆⟩ = ⟨θ̃_t − θ_{t+1}, w_{t+1}⟩ − ⟨θ̃_t − θ_{t+1}, w⋆⟩. [sent-336, score-1.518]

78 Note that this equality is crucial and does not hold for the Bregman potential corresponding to the exponentiated gradient algorithm. [sent-340, score-0.081]

79 Combining the lower bounds (19) and (20) and plugging them into (15), we get Ψ(w_t) − Ψ(w_{t+1}) ≥ η(L(⟨w_t, x_i⟩, y_i) − L(⟨w⋆, x_i⟩, y_i)) − η²(p − 1)ρ² e / 2 + ηλ(‖w_{t+1}‖₁ − ‖w⋆‖₁). [sent-342, score-0.971]

80 Taking expectation with respect to the random choice of i from {1, . . . , m}, we get E[Ψ(w_t) − Ψ(w_{t+1})] ≥ ηE[C(w_t) − C(w⋆)] − η²(p − 1)ρ² e / 2 + ηλE[‖w_{t+1}‖₁ − ‖w⋆‖₁] = ηE[P(w_t) − P(w⋆)] − η²(p − 1)ρ² e / 2 + ηλE[‖w_{t+1}‖₁ − ‖w_t‖₁]. [sent-346, score-0.759]

81 When (14) holds, instead of the bound (18) we have ‖v_t‖_p² ≤ ‖v_t‖_∞² d^{2/p} ≤ ρ L(⟨w_t, x_i⟩, y_i) d^{2/p} = ρ L(⟨w_t, x_i⟩, y_i) e. [sent-352, score-1.856]

82 Denote the value of w at the beginning of iteration t by wt and set wo = wr for r chosen uniformly at random from 0, . [sent-363, score-0.871]

83 SCD is the stochastic coordinate descent algorithm given in Section 2 above. [sent-395, score-0.279]

84 The coordinate to be updated at each iteration is greedily chosen in a deterministic manner to maximize a lower bound on the guaranteed decrease in the objective function. [sent-397, score-0.165]

85 Since choosing a coordinate (or feature, in our case) in a deterministic manner involves significant computation for large data sets, we expect that the deterministic algorithm will converge much more slowly than the stochastic algorithm. [sent-399, score-0.186]

86 We also tried the cyclic version of coordinate descent that just cycles through the coordinates. [sent-400, score-0.229]

87 SMIDAS is the mirror descent algorithm given in Section 3 above. [sent-402, score-0.245]

88 One plot shows the regularized objective function plotted against the number of data accesses, that is, the number of times the algorithm accesses the data matrix (xi, j ). [sent-414, score-0.107]

89 GCD is much slower because, as we mentioned above, it spends a lot of time in finding the best coordinate to update. [sent-424, score-0.083]

90 The coordinate descent algorithms SCD and GCD are quicker to reach the minimum on MAGIC 04 S. [sent-447, score-0.214]

91 Recall that we denote the average loss by C(w) = (1/m) ∑_{i=1}^m L(⟨w, x_i⟩, y_i). [sent-466, score-0.103]

92 Proof. Note that, by the assumption on L, for any i, j we have L(⟨w + ηe_j, x_i⟩, y_i) = L(⟨w, x_i⟩ + ηx_{i,j}, y_i) ≤ L(⟨w, x_i⟩, y_i) + ηL′(⟨w, x_i⟩, y_i) x_{i,j} + β η² x_{i,j}² / 2 ≤ L(⟨w, x_i⟩, y_i) + ηL′(⟨w, x_i⟩, y_i) x_{i,j} + β η² / 2, where the last inequality follows because x_{i,j} ∈ [−1, +1]. [sent-506, score-0.51]

93 , m and dividing by m, we get C(w + ηe_j) ≤ C(w) + (η/m) ∑_{i=1}^m L′(⟨w, x_i⟩, y_i) x_{i,j} + β η² / 2 = C(w) + η(∇C(w))_j + β η² / 2. [sent-510, score-0.085]
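The second-order bound used in these two steps, L(a + δ, y) ≤ L(a, y) + δ L′(a, y) + β δ²/2, can be spot-checked for the logistic loss, which is smooth with β = 1/4. This is a numerical check over a few illustrative points, not the paper's proof:

```python
import math

def logistic(a, y):
    """Logistic loss L(a, y) = log(1 + exp(-y * a))."""
    return math.log1p(math.exp(-y * a))

def logistic_deriv(a, y):
    """Derivative of the logistic loss in its first argument."""
    return -y / (1.0 + math.exp(y * a))

beta, y = 0.25, 1.0
for a in (-2.0, 0.0, 1.5):
    for delta in (-1.0, 0.3, 2.0):
        bound = logistic(a, y) + delta * logistic_deriv(a, y) + beta * delta ** 2 / 2
        assert logistic(a + delta, y) <= bound + 1e-12
```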

94 Mirror descent and nonlinear projected subgradient methods for convex optimization. [sent-525, score-0.146]

95 Regularized paths for generalized linear models via coordinate descent. [sent-583, score-0.083]

96 Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, 2011. [sent-598, score-0.165]

97 On the convergence of coordinate descent method for convex differentiable minimization. [sent-641, score-0.229]

98 Efficiency of coordinate descent methods on huge-scale optimization problems. [sent-650, score-0.214]

99 A block-coordinate gradient descent method for linearly constrained nonsmooth separable optimization. [sent-685, score-0.175]

100 Dual averaging method for regularized stochastic learning and online optimization. [sent-711, score-0.14]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('wt', 0.759), ('smidas', 0.257), ('magic', 0.219), ('runc', 0.154), ('ewari', 0.144), ('inimization', 0.144), ('runtime', 0.138), ('scd', 0.134), ('descent', 0.131), ('tochastic', 0.131), ('hwartz', 0.122), ('mirror', 0.114), ('rad', 0.114), ('oss', 0.11), ('ethods', 0.087), ('coordinate', 0.083), ('regularized', 0.075), ('vt', 0.073), ('langford', 0.068), ('stochastic', 0.065), ('ey', 0.065), ('arcene', 0.063), ('sgd', 0.062), ('gcd', 0.051), ('varx', 0.051), ('yun', 0.051), ('sign', 0.049), ('duke', 0.049), ('yi', 0.048), ('gradient', 0.044), ('tseng', 0.042), ('escent', 0.041), ('genkin', 0.041), ('wo', 0.04), ('xi', 0.037), ('abp', 0.035), ('duchi', 0.032), ('regularization', 0.031), ('ambuj', 0.031), ('pt', 0.03), ('kivinen', 0.029), ('dense', 0.028), ('iteration', 0.026), ('rda', 0.026), ('ab', 0.026), ('plugging', 0.026), ('sparsity', 0.025), ('bottou', 0.024), ('rd', 0.024), ('wr', 0.024), ('beck', 0.023), ('uniformly', 0.022), ('tewari', 0.022), ('gj', 0.022), ('egularization', 0.022), ('bound', 0.022), ('boxed', 0.021), ('ghadimi', 0.021), ('oles', 0.021), ('smd', 0.021), ('ttic', 0.021), ('derivative', 0.02), ('sparse', 0.02), ('warmuth', 0.02), ('composite', 0.02), ('update', 0.02), ('deterministic', 0.019), ('ek', 0.019), ('boosting', 0.019), ('exponentiated', 0.019), ('loss', 0.018), ('potential', 0.018), ('teboulle', 0.018), ('ex', 0.018), ('minimizer', 0.018), ('link', 0.018), ('iterations', 0.018), ('accesses', 0.017), ('austin', 0.017), ('oordinate', 0.017), ('minimization', 0.017), ('truncated', 0.017), ('concludes', 0.016), ('bounds', 0.016), ('hebrew', 0.016), ('jerusalem', 0.016), ('luo', 0.016), ('cyclic', 0.015), ('inferior', 0.015), ('dependence', 0.015), ('objective', 0.015), ('convex', 0.015), ('jth', 0.015), ('updates', 0.015), ('grove', 0.014), ('shai', 0.014), ('littlestone', 0.014), ('twist', 0.014), ('koh', 0.014), ('truncate', 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 87 jmlr-2011-Stochastic Methods for ℓ1-regularized Loss Minimization

Author: Shai Shalev-Shwartz, Ambuj Tewari

Abstract: We describe and analyze two stochastic methods for ℓ1 regularized loss minimization problems, such as the Lasso. The first method updates the weight of a single feature at each iteration while the second method updates the entire weight vector but only uses a single training example at each iteration. In both methods, the choice of feature or example is uniformly at random. Our theoretical runtime analysis suggests that the stochastic methods should outperform state-of-the-art deterministic approaches, including their deterministic counterparts, when the size of the problem is large. We demonstrate the advantage of stochastic methods by experimenting with synthetic and natural data sets.1 Keywords: L1 regularization, optimization, coordinate descent, mirror descent, sparsity

2 0.43503231 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation

Author: Ryota Tomioka, Taiji Suzuki, Masashi Sugiyama

Abstract: We analyze the convergence behaviour of a recently proposed algorithm for regularized estimation called Dual Augmented Lagrangian (DAL). Our analysis is based on a new interpretation of DAL as a proximal minimization algorithm. We theoretically show under some conditions that DAL converges super-linearly in a non-asymptotic and global sense. Due to a special modelling of sparse estimation problems in the context of machine learning, the assumptions we make are milder and more natural than those made in conventional analysis of augmented Lagrangian algorithms. In addition, the new interpretation enables us to generalize DAL to wide varieties of sparse estimation problems. We experimentally confirm our analysis in a large scale ℓ1 -regularized logistic regression problem and extensively compare the efficiency of DAL algorithm to previously proposed algorithms on both synthetic and benchmark data sets. Keywords: dual augmented Lagrangian, proximal minimization, global convergence, sparse estimation, convex optimization

3 0.24269111 36 jmlr-2011-Generalized TD Learning

Author: Tsuyoshi Ueno, Shin-ichi Maeda, Motoaki Kawanabe, Shin Ishii

Abstract: Since the invention of temporal difference (TD) learning (Sutton, 1988), many new algorithms for model-free policy evaluation have been proposed. Although they have brought much progress in practical applications of reinforcement learning (RL), there still remain fundamental problems concerning statistical properties of the value function estimation. To solve these problems, we introduce a new framework, semiparametric statistical inference, to model-free policy evaluation. This framework generalizes TD learning and its extensions, and allows us to investigate statistical properties of both of batch and online learning procedures for the value function estimation in a unified way in terms of estimating functions. Furthermore, based on this framework, we derive an optimal estimating function with the minimum asymptotic variance and propose batch and online learning algorithms which achieve the optimality. Keywords: reinforcement learning, model-free policy evaluation, TD learning, semiparametirc model, estimating function

4 0.18631899 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach

Author: Gilles Meyer, Silvère Bonnabel, Rodolphe Sepulchre

Abstract: The paper addresses the problem of learning a regression model parameterized by a fixed-rank positive semidefinite matrix. The focus is on the nonlinear nature of the search space and on scalability to high-dimensional problems. The mathematical developments rely on the theory of gradient descent algorithms adapted to the Riemannian geometry that underlies the set of fixedrank positive semidefinite matrices. In contrast with previous contributions in the literature, no restrictions are imposed on the range space of the learned matrix. The resulting algorithms maintain a linear complexity in the problem size and enjoy important invariance properties. We apply the proposed algorithms to the problem of learning a distance function parameterized by a positive semidefinite matrix. Good performance is observed on classical benchmarks. Keywords: linear regression, positive semidefinite matrices, low-rank approximation, Riemannian geometry, gradient-based learning

5 0.13925369 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

Author: Nicolò Cesa-Bianchi, Shai Shalev-Shwartz, Ohad Shamir

Abstract: We investigate three variants of budgeted learning, a setting in which the learner is allowed to access a limited number of attributes from training or test examples. In the “local budget” setting, where a constraint is imposed on the number of available attributes per training example, we design and analyze an efficient algorithm for learning linear predictors that actively samples the attributes of each training instance. Our analysis bounds the number of additional examples sufficient to compensate for the lack of full information on the training set. This result is complemented by a general lower bound for the easier “global budget” setting, where it is only the overall number of accessible training attributes that is being constrained. In the third, “prediction on a budget” setting, when the constraint is on the number of available attributes per test example, we show that there are cases in which there exists a linear predictor with zero error but it is statistically impossible to achieve arbitrary accuracy without full information on test examples. Finally, we run simple experiments on a digit recognition problem that reveal that our algorithm has a good performance against both partial information and full information baselines. Keywords: budgeted learning, statistical learning, linear predictors, learning with partial information, learning theory

6 0.10941669 56 jmlr-2011-Learning Transformation Models for Ranking and Survival Analysis

7 0.069728509 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

8 0.049079008 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms

9 0.048508342 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms

10 0.048063569 59 jmlr-2011-Learning with Structured Sparsity

11 0.044557583 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm

12 0.043002099 32 jmlr-2011-Exploitation of Machine Learning Techniques in Modelling Phrase Movements for Machine Translation

13 0.037118051 101 jmlr-2011-Variable Sparsity Kernel Learning

14 0.036777057 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding

15 0.035616729 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels

16 0.028825428 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership

17 0.026731256 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms

18 0.026581494 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods

19 0.026154039 6 jmlr-2011-A Simpler Approach to Matrix Completion

20 0.024803143 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.21), (1, 0.51), (2, 0.027), (3, -0.119), (4, -0.279), (5, -0.38), (6, -0.217), (7, 0.015), (8, 0.075), (9, 0.119), (10, -0.012), (11, -0.042), (12, -0.034), (13, 0.017), (14, 0.052), (15, -0.037), (16, -0.047), (17, 0.095), (18, -0.022), (19, 0.046), (20, 0.004), (21, -0.019), (22, -0.028), (23, 0.026), (24, -0.026), (25, 0.075), (26, 0.009), (27, -0.013), (28, 0.012), (29, -0.031), (30, 0.024), (31, -0.004), (32, 0.003), (33, -0.014), (34, 0.04), (35, 0.005), (36, -0.007), (37, -0.007), (38, -0.005), (39, 0.007), (40, 0.011), (41, -0.031), (42, 0.045), (43, -0.019), (44, -0.005), (45, 0.028), (46, -0.023), (47, -0.01), (48, 0.009), (49, 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98640263 87 jmlr-2011-Stochastic Methods forl1-regularized Loss Minimization


2 0.96053427 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation


3 0.73569125 36 jmlr-2011-Generalized TD Learning


4 0.60646462 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach


5 0.46764022 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

Author: Nicolò Cesa-Bianchi, Shai Shalev-Shwartz, Ohad Shamir

Abstract: We investigate three variants of budgeted learning, a setting in which the learner is allowed to access a limited number of attributes from training or test examples. In the “local budget” setting, where a constraint is imposed on the number of available attributes per training example, we design and analyze an efficient algorithm for learning linear predictors that actively samples the attributes of each training instance. Our analysis bounds the number of additional examples sufficient to compensate for the lack of full information on the training set. This result is complemented by a general lower bound for the easier “global budget” setting, where it is only the overall number of accessible training attributes that is being constrained. In the third, “prediction on a budget” setting, where the constraint is on the number of available attributes per test example, we show that there are cases in which there exists a linear predictor with zero error but it is statistically impossible to achieve arbitrary accuracy without full information on test examples. Finally, we run simple experiments on a digit recognition problem that show that our algorithm performs well against both partial information and full information baselines. Keywords: budgeted learning, statistical learning, linear predictors, learning with partial information, learning theory

6 0.40683737 56 jmlr-2011-Learning Transformation Models for Ranking and Survival Analysis

7 0.17557482 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms

8 0.16310574 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms

9 0.14786384 32 jmlr-2011-Exploitation of Machine Learning Techniques in Modelling Phrase Movements for Machine Translation

10 0.13696268 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm

11 0.13173507 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity

12 0.12965406 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding

13 0.12205182 97 jmlr-2011-Union Support Recovery in Multi-task Learning

14 0.12058564 59 jmlr-2011-Learning with Structured Sparsity

15 0.11674131 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership

16 0.10575747 58 jmlr-2011-Learning from Partial Labels

17 0.10524522 101 jmlr-2011-Variable Sparsity Kernel Learning

18 0.10249615 105 jmlr-2011-lp-Norm Multiple Kernel Learning

19 0.10207719 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing

20 0.098606922 28 jmlr-2011-Double Updating Online Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.035), (9, 0.032), (10, 0.022), (24, 0.018), (31, 0.056), (32, 0.023), (41, 0.017), (60, 0.016), (70, 0.019), (73, 0.022), (78, 0.601), (90, 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.98674351 33 jmlr-2011-Exploiting Best-Match Equations for Efficient Reinforcement Learning

Author: Harm van Seijen, Shimon Whiteson, Hado van Hasselt, Marco Wiering

Abstract: This article presents and evaluates best-match learning, a new approach to reinforcement learning that trades off the sample efficiency of model-based methods with the space efficiency of model-free methods. Best-match learning works by approximating the solution to a set of best-match equations, which combine a sparse model with a model-free Q-value function constructed from samples not used by the model. We prove that, unlike regular sparse model-based methods, best-match learning is guaranteed to converge to the optimal Q-values in the tabular case. Empirical results demonstrate that best-match learning can substantially outperform regular sparse model-based methods, as well as several model-free methods that strive to improve the sample efficiency of temporal-difference methods. In addition, we demonstrate that best-match learning can be successfully combined with function approximation. Keywords: reinforcement learning, on-line learning, temporal-difference methods, function approximation, data reuse

2 0.98476601 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models

Author: Shuheng Zhou, Philipp Rütimann, Min Xu, Peter Bühlmann

Abstract: Undirected graphs are often used to describe high dimensional distributions. Under sparsity conditions, the graph can be estimated using ℓ1-penalization methods. We propose and study the following method. We combine a multiple regression approach with ideas of thresholding and refitting: first we infer a sparse undirected graphical model structure via thresholding of each among many ℓ1-norm penalized regression functions; we then estimate the covariance matrix and its inverse using the maximum likelihood estimator. We show that under suitable conditions, this approach yields consistent estimation in terms of graphical structure and fast convergence rates with respect to the operator and Frobenius norm for the covariance matrix and its inverse. We also derive an explicit bound for the Kullback-Leibler divergence. Keywords: graphical model selection, covariance estimation, Lasso, nodewise regression, thresholding © 2011 Shuheng Zhou, Philipp Rütimann, Min Xu and Peter Bühlmann.

same-paper 3 0.97491097 87 jmlr-2011-Stochastic Methods for ℓ1-regularized Loss Minimization

Author: Shai Shalev-Shwartz, Ambuj Tewari

Abstract: We describe and analyze two stochastic methods for ℓ1 regularized loss minimization problems, such as the Lasso. The first method updates the weight of a single feature at each iteration while the second method updates the entire weight vector but only uses a single training example at each iteration. In both methods, the choice of feature or example is uniformly at random. Our theoretical runtime analysis suggests that the stochastic methods should outperform state-of-the-art deterministic approaches, including their deterministic counterparts, when the size of the problem is large. We demonstrate the advantage of stochastic methods by experimenting with synthetic and natural data sets. Keywords: L1 regularization, optimization, coordinate descent, mirror descent, sparsity
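The first method the abstract describes — updating a single, uniformly chosen feature per iteration — can be sketched for the squared-loss (Lasso) case. The closed-form soft-thresholding coordinate update below is the standard one for quadratic loss; all names, the step count, and the toy problem are illustrative, not the paper's exact algorithm.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * |.|: shrink z toward zero by t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def scd_lasso(X, y, lam, n_iters=5000, seed=0):
    """Stochastic coordinate descent for (1/2m)||Xw - y||^2 + lam * ||w||_1.
    Each iteration picks one coordinate j uniformly at random and
    minimizes the objective exactly over w[j] (soft-thresholding)."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    residual = X @ w - y                 # maintained incrementally
    col_sq = (X ** 2).sum(axis=0) / m    # per-coordinate curvature
    for _ in range(n_iters):
        j = rng.integers(d)
        if col_sq[j] == 0.0:
            continue
        grad_j = X[:, j] @ residual / m
        z = w[j] - grad_j / col_sq[j]
        w_new = soft_threshold(z, lam / col_sq[j])
        residual += (w_new - w[j]) * X[:, j]
        w[j] = w_new
    return w
```

With a diagonal design the coordinates decouple, so the solution can be checked against the closed-form soft-thresholded least-squares estimate.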

4 0.93940836 22 jmlr-2011-Differentially Private Empirical Risk Minimization

Author: Kamalika Chaudhuri, Claire Monteleoni, Anand D. Sarwate

Abstract: Privacy-preserving machine learning algorithms are crucial for the increasingly common setting in which personal data, such as medical or financial records, are analyzed. We provide general techniques to produce privacy-preserving approximations of classifiers learned via (regularized) empirical risk minimization (ERM). These algorithms are private under the ε-differential privacy definition due to Dwork et al. (2006). First we apply the output perturbation ideas of Dwork et al. (2006) to ERM classification. Then we propose a new method, objective perturbation, for privacy-preserving machine learning algorithm design. This method entails perturbing the objective function before optimizing over classifiers. If the loss and regularizer satisfy certain convexity and differentiability criteria, we prove theoretical results showing that our algorithms preserve privacy, and provide generalization bounds for linear and nonlinear kernels. We further present a privacy-preserving technique for tuning the parameters in general machine learning algorithms, thereby providing end-to-end privacy guarantees for the training process. We apply these results to produce privacy-preserving analogues of regularized logistic regression and support vector machines. We obtain encouraging results from evaluating their performance on real demographic and benchmark data sets. Our results show that both theoretically and empirically, objective perturbation is superior to the previous state-of-the-art, output perturbation, in managing the inherent tradeoff between privacy and learning performance. Keywords: privacy, classification, optimization, empirical risk minimization, support vector machines, logistic regression
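The output-perturbation step the abstract refers to can be sketched as adding suitably scaled noise to the trained weight vector; this is a minimal sketch, assuming a 1-Lipschitz loss with L2 regularization (which gives ERM an L2-sensitivity of 2/(mλ)), and all parameter names are illustrative.

```python
import numpy as np

def output_perturbation(w_hat, eps, lam, m, rng=None):
    """Sketch of output perturbation for regularized ERM: add a noise
    vector b with density proportional to exp(-eps * ||b|| / sensitivity),
    where sensitivity = 2 / (m * lam) for a 1-Lipschitz loss. The norm of
    such a vector is Gamma(d, sensitivity/eps) and its direction uniform."""
    rng = np.random.default_rng() if rng is None else rng
    d = w_hat.shape[0]
    sensitivity = 2.0 / (m * lam)
    norm = rng.gamma(shape=d, scale=sensitivity / eps)  # noise magnitude
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)              # uniform direction
    return w_hat + norm * direction
```

Note how the noise shrinks as the data set size m or the privacy budget eps grows, which is the tradeoff between privacy and learning performance the abstract highlights.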

5 0.84768647 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms

Author: Şeyda Ertekin, Cynthia Rudin

Abstract: We demonstrate that there are machine learning algorithms that can achieve success for two separate tasks simultaneously, namely the tasks of classification and bipartite ranking. This means that advantages gained from solving one task can be carried over to the other task, such as the ability to obtain conditional density estimates, and an order-of-magnitude reduction in computational time for training the algorithm. It also means that some algorithms are robust to the choice of evaluation metric used; they can theoretically perform well when performance is measured either by a misclassification error or by a statistic of the ROC curve (such as the area under the curve). Specifically, we provide such an equivalence relationship between a generalization of Freund et al.’s RankBoost algorithm, called the “P-Norm Push,” and a particular cost-sensitive classification algorithm that generalizes AdaBoost, which we call “P-Classification.” We discuss and validate the potential benefits of this equivalence relationship, and perform controlled experiments to understand P-Classification’s empirical performance. There is no established equivalence relationship for logistic regression and its ranking counterpart, so we introduce a logistic-regression-style algorithm that aims in between classification and ranking, and has promising experimental performance with respect to both tasks. Keywords: supervised classification, bipartite ranking, area under the curve, rank statistics, boosting, logistic regression

6 0.78575879 81 jmlr-2011-Robust Approximate Bilinear Programming for Value Function Approximation

7 0.78408533 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices

8 0.76988769 40 jmlr-2011-Hyper-Sparse Optimal Aggregation

9 0.76780117 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

10 0.75743234 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation

11 0.72780514 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

12 0.7220965 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms

13 0.69964939 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes

14 0.69676095 36 jmlr-2011-Generalized TD Learning

15 0.69129241 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem

16 0.69014174 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models

17 0.6896677 104 jmlr-2011-X-Armed Bandits

18 0.68667918 91 jmlr-2011-The Sample Complexity of Dictionary Learning

19 0.68329781 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm

20 0.67864597 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning