jmlr jmlr2011 jmlr2011-36 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Tsuyoshi Ueno, Shin-ichi Maeda, Motoaki Kawanabe, Shin Ishii
Abstract: Since the invention of temporal difference (TD) learning (Sutton, 1988), many new algorithms for model-free policy evaluation have been proposed. Although they have brought much progress in practical applications of reinforcement learning (RL), there still remain fundamental problems concerning statistical properties of the value function estimation. To solve these problems, we introduce a new framework, semiparametric statistical inference, to model-free policy evaluation. This framework generalizes TD learning and its extensions, and allows us to investigate statistical properties of both batch and online learning procedures for the value function estimation in a unified way in terms of estimating functions. Furthermore, based on this framework, we derive an optimal estimating function with the minimum asymptotic variance and propose batch and online learning algorithms which achieve the optimality. Keywords: reinforcement learning, model-free policy evaluation, TD learning, semiparametric model, estimating function
Reference: text
sentIndex sentText sentNum sentScore
1 This framework generalizes TD learning and its extensions, and allows us to investigate statistical properties of both batch and online learning procedures for the value function estimation in a unified way in terms of estimating functions. [sent-19, score-0.192]
2 Furthermore, based on this framework, we derive an optimal estimating function with the minimum asymptotic variance and propose batch and online learning algorithms which achieve the optimality. [sent-20, score-0.24]
3 Keywords: reinforcement learning, model-free policy evaluation, TD learning, semiparametric model, estimating function 1. [sent-21, score-0.122]
4 These successes were attributed to model-free policy evaluation, that is, the value function which evaluates the expected cumulative reward is estimated from a given sample trajectory without specifying the task environment. [sent-24, score-0.105]
5 We specify the MRP as a semiparametric model, where only the value function is modeled parametrically with a smaller number of parameters than necessary, while the other unspecified part of MRP corresponds to the nuisance parameters. [sent-39, score-0.098]
6 For estimating the parameters of interest in such models, estimating functions provide a well-established toolbox: they give consistent estimators (M-estimators) regardless of the nuisance parameters (Godambe, 1960, 1991; Huber and Ronchetti, 2009; van der Vaart, 2000). [sent-40, score-0.164]
7 Furthermore, by applying the asymptotic analysis, we derive the asymptotic estimation variance of general estimating functions (Lemma 3) and the optimal estimating function that yields the minimum asymptotic variance of estimation (Theorem 6). [sent-43, score-0.289]
8 One is the class of batch algorithms which obtain estimators in one shot by using all samples in the given trajectory such as least squares temporal difference (LSTD) learning (Bradtke and Barto, 1996). [sent-45, score-0.119]
9 The other is the class of online algorithms which update the estimators step-by-step such as temporal difference (TD) learning (Sutton, 1988). [sent-46, score-0.093]
10 In the batch algorithm, we assume that the value function is represented as a linear parametric function and derive a new least squares-type algorithm, gLSTD learning, which achieves the minimum asymptotic variance (Algorithm 1). [sent-47, score-0.144]
11 We then show that the online algorithms can achieve the same asymptotic performance as their batch counterparts if the parameters controlling learning processes are appropriately tuned (Lemma 9 and Theorem 10). [sent-49, score-0.154]
12 We derive the optimal choice of the estimating function and construct the online learning algorithm that achieves the minimum estimation error asymptotically (Algorithm 2). [sent-50, score-0.133]
13 First, we give background on MRPs and define the semiparametric statistical model for estimating the value function (Section 2). [sent-58, score-0.1]
14 After providing a short overview of estimating functions (Section 3), we present the main contribution, fundamental statistical analysis based on the estimating function theory (Section 4). [sent-59, score-0.113]
15 Then, we explain the construction of practical learning algorithms, derived from estimating functions, as both batch and online algorithms (Section 5). [sent-60, score-0.168]
16 Assumption 2 For any time t, reward rt is uniformly bounded. [sent-69, score-0.226]
17 Proposition 1 (Bertsekas and Tsitsiklis, 1996) Consider a conditional probability of {rt , st } given st−1 , p(rt , st |st−1 ) = p(rt |st , st−1 )p(st |st−1 ). [sent-73, score-0.604]
18 The function V that satisfies Equation (2) is unique and found to be the value function: V(s) ≡ lim_{T→∞} E[ ∑_{t=1}^{T} γ^{t−1} rt | s0 = s ]. [sent-76, score-0.31]
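To make this definition concrete, the following minimal Python sketch estimates V(s) by Monte Carlo averaging of truncated discounted returns; the `sample_transition` function is a hypothetical stand-in for the unknown MRP dynamics p(rt, st | st−1), not something specified in the paper.

```python
import random

def sample_transition(s):
    """Hypothetical MRP dynamics standing in for p(r_t, s_t | s_{t-1})."""
    s_next = random.choice([0, 1, 2])
    reward = 1.0 if s_next == 2 else 0.0
    return reward, s_next

def monte_carlo_value(s0, gamma=0.9, horizon=200, n_runs=1000):
    """Estimate V(s0) = lim_T E[sum_{t=1}^T gamma^{t-1} r_t | s_0 = s0]
    by averaging truncated discounted returns over independent runs."""
    total = 0.0
    for _ in range(n_runs):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            r, s = sample_transition(s)
            ret += discount * r
            discount *= gamma
        total += ret
    return total / n_runs
```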
19 In such a case, the joint distribution of the trajectory ZT is expressed as p(ZT; θ, ξ) = p(s0; ξ0) ∏_{t=1}^{T} p(rt, st | st−1; θ, ξs),  (5)  where ξ ≡ (ξ0, ξs). [sent-86, score-0.328]
20 This can be achieved by using estimating functions, a well-established technique to obtain a consistent estimator of the parameter without estimating the nuisance parameters (Godambe, 1960, 1991; Amari and Kawanabe, 1997; Huber and Ronchetti, 2009). [sent-92, score-0.168]
21 The advantages of considering such semiparametric models behind the model-free policy evaluation are: (a) we can characterize all possible model-free algorithms, (b) we can discuss asymptotic properties of the estimators in a unified way and obtain the optimal one with the minimum estimation error. [sent-93, score-0.144]
22 If there is an estimating function f(x, θ), we can obtain an estimator θ̂T, which has good asymptotic properties, by solving the following estimating equation: ∑_{t=1}^{T} f(xt, θ̂T) = 0. [sent-107, score-0.169]
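As a generic illustration of how an M-estimator is obtained from an estimating equation (not the paper's specific construction), the sketch below solves ∑_t f(x_t, θ) = 0 by Newton iterations; the sample-mean example at the end is purely hypothetical.

```python
import numpy as np

def solve_estimating_equation(f, jac, data, theta0, n_iter=50):
    """Solve sum_t f(x_t, theta) = 0 for theta by Newton's method.
    f(x_t, theta) returns an m-vector; jac(x_t, theta) its m x m Jacobian in theta."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        g = sum(f(xt, theta) for xt in data)    # stacked estimating function
        J = sum(jac(xt, theta) for xt in data)  # its Jacobian
        theta = theta - np.linalg.solve(J, g)
    return theta

# Hypothetical example: f(x, theta) = x - theta recovers the sample mean.
data = np.random.randn(500) + 3.0
theta_hat = solve_estimating_equation(
    lambda xt, th: np.array([xt - th[0]]),
    lambda xt, th: np.array([[-1.0]]),
    data, theta0=[0.0])
```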
23 For convenience, we write the triplet at time t as zt ≡ {st−1, st, rt} ∈ S² × R and the trajectory up to time t as Zt ≡ {s0, s1, r1, . . . , st, rt}. [sent-124, score-1.309]
24 Eθ,ξs[ψt(Zt, θ) | Zt−1] = 0 for all t; det|A| ≠ 0, where A ≡ lim_{t→∞} Eθ,ξ[∂θ ψt(Zt, θ)]; and lim_{t→∞} Eθ,ξ[‖ψt(Zt, θ)‖²] < ∞. [sent-134, score-0.186]
25 Let ε : S2 × R × Θ → R1 be the so-called temporal difference (TD) error, that is, εt ≡ ε(zt , θ) ≡ g(st−1 , θ) − γg(st , θ) − rt . [sent-140, score-0.222]
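In code, this TD error (with the paper's sign convention) for a linear value model g(s, θ) = φ(s)⊤θ is a one-liner; the feature map `phi` is an assumed input.

```python
import numpy as np

def td_error(theta, phi, s_prev, s_curr, r, gamma=0.9):
    """eps_t = g(s_{t-1}, theta) - gamma * g(s_t, theta) - r_t,
    using the linear model g(s, theta) = phi(s)^T theta."""
    return phi(s_prev) @ theta - gamma * phi(s_curr) @ theta - r
```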
26 Assumption 4 (a) The function wt : S^{t+1} × R^t × Θ → R^m is twice continuously differentiable with respect to the parameter θ for any t, and lim_{t→∞} Eθ,ξ[|∂θ wt(Zt, θ)|] < ∞ for any θ. [sent-144, score-0.719]
27 (b) There exists a limit of the matrix Eθ,ξ[wt−1(Zt−1, θ){∂θ ε(zt, θ)}⊤], and the matrix lim_{t→∞} Eθ,ξ[wt−1(Zt−1, θ){∂θ ε(zt, θ)}⊤] is nonsingular for any θ and ξ. [sent-145, score-0.11]
28 The estimator derived from the estimating Equation (15) has an asymptotic variance summarized in the following lemma. [sent-157, score-0.129]
29 Then, any martingale estimating function fT(ZT, θ) = ∑_{t=1}^{T} ψt(Zt, θ) in the semiparametric model {p(ZT; θ, ξ) | θ, ξ} of the MRP can be expressed as fT(ZT, θ) = ∑_{t=1}^{T} ψt(Zt, θ) = ∑_{t=1}^{T} wt−1(Zt−1, θ) ε(zt, θ). [sent-164, score-0.127]
30 2 Optimal Estimating Function Since Theorem 4 has specified the set of all martingale estimating functions, we can now discuss the optimal estimating function among them which gives an M-estimator with the minimum asymptotic variance. [sent-167, score-0.176]
31 The weight function wt (Zt−1 , θ) may depend not only on the current state st and the parameter θ, but also on the previous states and rewards. [sent-168, score-0.646]
32 Lemma 5 Let wt(Zt, θ) be any weight function that depends on the current and previous observations and the parameter, and satisfies the conditions in Assumption 4. [sent-170, score-0.344]
33 Then, there is necessarily a weight function depending only on the current state and the parameter whose corresponding estimator has the minimum asymptotic variance among all possible weight functions. [sent-171, score-0.128]
34 We next discuss the optimal weight function of Equation (14) in terms of asymptotic variance, which corresponds to the optimal estimating function. [sent-173, score-0.117]
35 The asymptotic variance of the optimal estimating function can be calculated from Lemma 3 and Theorem 6. [sent-179, score-0.134]
36 Lemma 7 The minimum asymptotic variance is given by Av = Av(θ̂T) = (1/T) Q^{−1}, where Q ≡ lim_{t→∞} Eθ∗,ξ∗[∂θ ψ∗t(zt, θ∗)] = lim_{t→∞} Eθ∗,ξ∗[ψ∗t(zt, θ∗) ψ∗t(zt, θ∗)⊤]. [sent-180, score-0.245]
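In practice, a plug-in estimate of this minimum variance could be formed by replacing Q with the empirical second moment of the optimal estimating-function terms; the sketch below assumes those terms are already available as vectors and is only an illustration, not a routine from the paper.

```python
import numpy as np

def plugin_asymptotic_variance(psi_star_terms):
    """Estimate Av = (1/T) Q^{-1}, with Q approximated by the empirical
    second moment (1/T) sum_t psi*_t psi*_t^T of the optimal terms."""
    Psi = np.asarray(psi_star_terms)      # shape (T, m)
    T = Psi.shape[0]
    Q_hat = Psi.T @ Psi / T
    return np.linalg.inv(Q_hat) / T
```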
37 Learning Algorithms In this section, we present two kinds of practical algorithms to obtain the solution of the estimating Equation (15): one is the batch learning procedure and the other is the online learning procedure. [sent-186, score-0.168]
38 This theoretical consideration allows us to obtain a new online learning algorithm that asymptotically converges faster than current online algorithms. [sent-194, score-0.132]
39 Then, estimating Equation (14) is given as ∑_{t=1}^{T} wt−1(Zt−1, θ) [(φ(st−1) − γφ(st))⊤ θ̂T − rt] = 0. [sent-197, score-0.254]
40 If the weight function does not depend on the parameter θ, the estimator θ̂T can be analytically obtained as θ̂T = [ ∑_{t=1}^{T} w̄t−1(Zt−1)(φ(st−1) − γφ(st))⊤ ]^{−1} ∑_{t=1}^{T} w̄t−1(Zt−1) rt, where w̄t : S^{t+1} × R^t → R^m is a function which depends only on the previous observations. [sent-198, score-0.377]
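A minimal sketch of this closed-form batch estimator (an LSTD-style weighted least-squares solve) is given below; the feature map `phi` and the weight function are caller-supplied assumptions.

```python
import numpy as np

def weighted_batch_estimate(trajectory, phi, weight_fn, gamma=0.9):
    """theta_hat = [sum_t w_{t-1}(phi(s_{t-1}) - gamma*phi(s_t))^T]^{-1} sum_t w_{t-1} r_t.
    trajectory: list of (s_prev, s_curr, r) triplets z_t;
    weight_fn(s_prev, past) returns the m-vector w_{t-1}, using only past data."""
    m = phi(trajectory[0][0]).shape[0]
    A = np.zeros((m, m))
    b = np.zeros(m)
    past = []
    for s_prev, s_curr, r in trajectory:
        w = weight_fn(s_prev, past)
        A += np.outer(w, phi(s_prev) - gamma * phi(s_curr))
        b += w * r
        past.append((s_prev, s_curr, r))
    return np.linalg.solve(A, b)

# LSTD is recovered by the weight choice w_{t-1} = phi(s_{t-1}):
# theta_lstd = weighted_batch_estimate(traj, phi, lambda s, past: phi(s))
```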
41 In fact, there are other online update rules derived from the same estimating function ft(Zt, θ) = ∑_{i=1}^{t} ψi(Zi, θ), such as θ̂t = θ̂t−1 − ηt R(θ̂t−1) ψt(Zt, θ̂t−1),  (21)  where R(θ) denotes an m×m nonsingular matrix depending only on θ (Amari, 1998). [sent-222, score-0.212]
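The online rule (21) can be sketched as a stochastic-approximation loop; here the Robbins–Monro stepsize ηt = η0/t satisfies the summability conditions stated a few sentences below, and R is a caller-supplied matrix function (the identity recovers the plain update). This is a schematic, not the paper's tuned algorithm.

```python
import numpy as np

def online_m_estimation(trajectory, psi, R_fn, theta0, eta0=0.1):
    """theta_t = theta_{t-1} - eta_t * R(theta_{t-1}) @ psi_t(Z_t, theta_{t-1})."""
    theta = np.asarray(theta0, dtype=float)
    Z = []
    for t, z in enumerate(trajectory, start=1):
        Z.append(z)                        # Z_t: trajectory observed so far
        eta = eta0 / t                     # sum eta_t = inf, sum eta_t^2 < inf
        theta = theta - eta * R_fn(theta) @ psi(Z, theta)
    return theta
```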
42 (b) For any t, there exist nonnegative constants c1 and c2 such that Eθ∗,ξs∗[ ‖R(θ̂t) ψt+1(Zt+1, θ̂t)‖² | st ] ≤ c1 + c2 ‖θ̂t − θ∗‖². [sent-232, score-0.302]
43 If the stepsizes {ηt} are all positive and satisfy ∑_{t=1}^{∞} ηt = ∞ and ∑_{t=1}^{∞} ηt² < ∞, then the online algorithm (21) almost surely converges to the true parameter θ∗. [sent-237, score-0.106]
44 Theorem 8 ensures that even if the original online learning algorithm (20) does not converge to the true parameter, we can construct an online learning algorithm with local consistency by appropriately choosing matrix R(θ). [sent-239, score-0.118]
45 2 Convergence Rate The convergence speed of an online algorithm could generally be slower than that of its batch counterpart, which tries to solve the estimating equation using all available samples. [sent-242, score-0.196]
46 Following the discussion in the previous work (Bottou and LeCun, 2004, 2005), we elucidate the convergence speed of online learning for estimating the value function in this section. [sent-244, score-0.133]
47 In other words, Lemma 9 indicates that online algorithms can converge with the same convergence speed as their batch counterparts through an appropriate choice of matrix R. [sent-254, score-0.129]
48 4 Accelerated TD Learning TD learning is a traditional online approach to model-free policy evaluation and has been one of the most important algorithms in the RL field. [sent-285, score-0.103]
49 This section discusses TD learning from the viewpoint of the method of estimating functions and proposes a new online algorithm that can achieve faster convergence than standard TD learning. [sent-287, score-0.12]
50 To simplify the following discussion, we have assumed that g(s, θ) is a linear function as in Equation (19) with which we can solve the linear estimating equation using both batch and online procedures. [sent-288, score-0.198]
51 When weight function wt (Zt , θ) in Equation (13) is set to φ(st ), the online and batch procedures correspond to the TD and LSTD algorithms, respectively. [sent-289, score-0.462]
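With the weight choice wt = φ(st), the online procedure reduces to the familiar TD(0) update for a linear value model; a minimal sketch (using the paper's sign convention for the TD error) follows.

```python
import numpy as np

def td0_linear(trajectory, phi, theta0, gamma=0.9, eta0=0.1):
    """TD(0): theta <- theta - eta_t * phi(s_{t-1}) * eps_t,
    with eps_t = phi(s_{t-1})^T theta - gamma * phi(s_t)^T theta - r_t."""
    theta = np.asarray(theta0, dtype=float)
    for t, (s_prev, s_curr, r) in enumerate(trajectory, start=1):
        eps = phi(s_prev) @ theta - gamma * phi(s_curr) @ theta - r
        theta = theta - (eta0 / t) * phi(s_prev) * eps
    return theta
```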
52 Since Rt−1 converges to A^{−1} = lim_{t→∞} Eθ∗,ξ∗[φ(st−1)(φ(st−1) − γφ(st))⊤]^{−1} and A^{−1} must be a positive definite matrix (see Lemma 6. [sent-293, score-0.107]
53 Since the online approximation of the weight function only depends on past observations, optimal TD learning is necessarily consistent even when the online approximation of the weight function is inaccurate. [sent-296, score-0.18]
54 The first is a batch procedure: θ̂T = [ ∑_{t=1}^{T} wt−1(φ(st−1) − γφ(st))⊤ ]^{−1} ∑_{t=1}^{T} wt−1 rt. [sent-307, score-0.263]
55 And the second is an online procedure: θ̂t = θ̂t−1 − ηt Rt wt−1 ε(zt, θ̂t−1), where wt is a weight function at time t. [sent-308, score-0.403]
56 It should be noted that GTD, GTD2, GTDc, iLSTD, and iLSTD (λ) are specific online implementations for solving corresponding estimating equations; however, these algorithms can also be interpreted as instances of the method of estimating functions we propose. [sent-327, score-0.159]
57 The asymptotic behavior of model-free policy evaluation has been analyzed within special contexts; Konda (2002) derived the asymptotic variance of LSTD (λ) and revealed that the convergence rate of TD (λ) was worse than that of LSTD (λ). [sent-329, score-0.15]
58 Figure 4: Boxplots of MSE for both the online (TD, Accelerated-TD and Optimal-TD) and batch (LSTD and gLSTD) algorithms. [sent-360, score-0.118]
59 In the online algorithms (TD, Accelerated-TD, and Optimal-TD), we used some batch procedures to obtain initial estimates of the parameters, as is often done in many online procedures. [sent-381, score-0.177]
60 More specifically, the first 10 steps in each episode were used to obtain initial estimators in a batch manner and the online algorithm started after the 10 steps. [sent-382, score-0.134]
61 As shown in Figure 4, the Optimal-TD and gLSTD algorithms achieved the minimum MSE among the online and batch algorithms, respectively. [sent-393, score-0.118]
62 Figure 5 shows how the estimation error of the estimator θ̂T behaves as the learning proceeds, both for online (upper panel) and batch (lower panel) learning algorithms. [sent-396, score-0.149]
63 Discussion and Future Work The contributions of this study are to present a new semiparametric approach to the model-free policy evaluation, which generalizes most of the current policy evaluation methods, and to clarify statistical properties of the policy evaluation problem. [sent-401, score-0.169]
64 Figure 6: Boxplots of MSE for both the online (TD, Accelerated-TD and Optimal-TD) and batch (LSTD and gLSTD) algorithms on a twenty-state Markov random walk problem. [sent-440, score-0.118]
65 1 Asymptotic Analysis in Misspecified Situations First, let us revisit the asymptotic variance of the estimating function (15). [sent-444, score-0.122]
66 In misspecified cases, estimating function (14) does not necessarily satisfy the martingale property, so its asymptotic variance can no longer be calculated by Equation (16). [sent-445, score-0.161]
67 Note that the class of estimators characterized by the above estimating function (30) is general enough for our theoretical consideration because it leads to almost all of the major algorithms for model-free policy evaluation that have been proposed so far (see Table 1). [sent-448, score-0.123]
68 Now we demonstrate with Lemma 11 that the asymptotic variance of the estimators θ̂T given by the estimating equation ∑_{t=1}^{T} ψ̄t(Zt, θ̂T) = 0. [sent-449, score-0.142]
69 Assume that: (a) There exists a parameter value θ̄ ∈ Θ such that lim_{t→∞} E[ψ̄t(Zt, θ̄)] = 0, where E[·] denotes the expectation with respect to p(ZT), and θ̂T converges to the parameter θ̄ in probability. [sent-451, score-0.118]
70 (b) There exists a limit of the matrix E[w̄t−1(Zt−1){∂θ ε(zt, θ̄)}⊤], and the matrix lim_{t→∞} E[w̄t−1(Zt−1){∂θ ε(zt, θ̄)}⊤] is nonsingular; and (c) E[‖w̄t−1(Zt−1) ε(zt, θ̄)‖²] is finite for any t. [sent-453, score-0.093]
71 Here, w̄t and cov[·, ·] denote the abbreviation of w̄t(Zt) and the covariance function, respectively. [sent-455, score-0.626]
72 Conclusions We introduced a framework of semiparametric statistical inference for value function estimation which can be applied to analyzing both batch learning and online learning procedures. [sent-484, score-0.179]
73 Based on this framework, we derived the general form of estimating functions for model-free value function estimation in MRPs, which provides a statistical basis to many batch and online learning algorithms available currently for policy evaluation. [sent-485, score-0.236]
74 Moreover, we found an optimal estimating function, which yields the minimum asymptotic estimation variance amongst the general class, and presented new learning algorithms based on it as both batch and online procedures. [sent-486, score-0.238]
75 If lim_{t→∞} E[f(Yt)²] is finite, then the central limit theorem holds for f̄, that is, √T ( f̄T − lim_{t→∞} E[f(Yt)] ) →_d N(0, σ²) as T → ∞, where σ² ≡ lim_{t→∞} E[f(Yt)²] + 2 lim_{t→∞} ∑_{t′=1}^{∞} cov[f(Yt), f(Yt+t′)]. [sent-528, score-0.411]
76 → lim_{t→∞} Eθ∗,ξ∗[∂θ wt−1(Zt−1, θ∗) ε(zt, θ∗)] (= 0) + lim_{t→∞} Eθ∗,ξ∗[wt−1(Zt−1, θ∗){∂θ ε(zt, θ∗)}⊤] (= A), and (1/√T) ∑_{t=1}^{T} ψt(Zt, θ∗) = (1/√T) ∑_{t=1}^{T} wt−1(Zt−1, θ∗) ε(zt, θ∗) →_d N(0, lim_{t→∞} Eθ∗,ξ∗[ε(zt, θ∗)² wt−1(Zt−1, θ∗) wt−1(Zt−1, θ∗)⊤]). [sent-561, score-0.279]
77 Proof of Theorem 4 Proof From Equation (2), for any t, the value function V (st ) = g(st , θ) must satisfy Eθ,ξs [rt+1 |st ] = g(st , θ) − Eθ,ξs [ g(st+1 , θ)| st ] , regardless of the nuisance parameter ξ. [sent-565, score-0.363]
78 Also, from the condition of martingale estimating functions, for any time t, the estimating function must satisfy Eθ,ξs [ft+1 (Zt+1 , θ) − ft (Zt , θ)|Zt ] = 0, (33) regardless of the nuisance parameter ξ. [sent-567, score-0.261]
79 If we can show from Equation (33) that ft+1 (Zt+1 , θ) − ft (Zt , θ) = wt (Zt , θ)ε(zt+1 , θ) holds, fT (ZT , θ) must have the form (17) by induction. [sent-568, score-0.386]
80 We first prove in a constructive manner that any simple function h(zt+1 , θ) which depends only on zt+1 and satisfy Eθ,ξs [h(zt+1 , θ)|st ] = 0 and Eθ,ξs [{h(zt+1 , θ)}2 ] < ∞ for any t, θ and ξs can be expressed as h(zt+1 , θ) = w(st , θ)ε(zt+1 , θ), where w(st , θ) is a function of st . [sent-571, score-0.328]
81 To prove the simple case first, for arbitrary fixed st and θ, we consider the set M (st , θ) of all probability distributions of rt+1 and st+1 with each of which the expectation of the TD error ε(zt+1 , θ) vanishes. [sent-574, score-0.313]
82 In the following discussion, st is treated as a fixed constant. [sent-575, score-0.302]
83 We remark that the set M (st , θ) and the domain of the nuisance parameter ξs depend on st and θ. [sent-577, score-0.35]
84 Since the whole argument holds for any st and θ, the first claim is proved. [sent-591, score-0.302]
85 In the general case for the function of Zt+1 , we just show that any function h(Zt+1 , θ) which satisfies Eθ,ξs [h(Zt+1 , θ)|Zt ] = 0 and Eθ,ξs [{h(Zt+1 , θ)}2 ] < ∞ for any st , θ and ξs can be expressed as h(Zt+1 , θ) = wt (Zt , θ)ε(zt+1 , θ), where wt (Zt , θ) is a function of Zt and θ. [sent-592, score-0.967]
86 Therefore, the problem reduces to the case that the function only depends on zt+1 , so that we can say that h(rt+1 , st+1 , Zt , θ) ∝ ε(rt+1 , st+1 , st ). [sent-594, score-0.315]
87 Since this relationship holds for any Zt , we conclude that the function h(Zt+1 , θ) must have the form wt (Zt , θ)ε(zt+1 , θ). [sent-595, score-0.326]
88 Proof of Lemma 5 Proof We show that the conditional expectation w̃t(st, θ) = Eξs[wt(Zt, θ) | st], which depends only on the current state st and the parameter θ, gives an equally good or better estimator than that obtained from the original weight function wt(Zt, θ). [sent-597, score-1.01]
89 Here, wt is an abbreviation of wt(Zt, θ∗). [sent-599, score-0.626]
90 Here, w̃t is an abbreviation of w̃t(st, θ). [sent-601, score-0.626]
91 Proof of Theorem 6 Proof As shown in Equation (16), the asymptotic variance of the estimator θ̂w is given by Av = (1/T) Aw^{−1} Σw (Aw^{−1})⊤, where Aw ≡ lim_{t→∞} Eθ∗,ξ∗[wt−1(Zt−1, θ∗){∂θ ε(zt, θ∗)}⊤], Σw ≡ lim_{t→∞} Eθ∗,ξ∗[ε(zt, θ∗)² wt−1(Zt−1, θ∗) wt−1(Zt−1, θ∗)⊤]. [sent-605, score-0.265]
92 Since the functional F(wt−1) is stationary for tiny variations in the function wt−1, the weight function w∗t−1 which minimizes the asymptotic estimation variance must satisfy ∂h G(h; w∗t−1, δt−1)|_{h=0} = 0, for an arbitrary choice of δt−1. [sent-612, score-0.126]
93 The matrices at w∗t−1 + δt−1 become Aw∗+δ = Q + lim_{t→∞} Eθ∗,ξ∗[δ̄t−1 {∂θ ε(zt, θ∗)}⊤], Σw∗+δ = Q + lim_{t→∞} Eθ∗,ξ∗[∂θ ε(zt, θ∗) δ̄t−1⊤] + lim_{t→∞} Eθ∗,ξ∗[δ̄t−1 {∂θ ε(zt, θ∗)}⊤] + lim_{t→∞} Eθ∗,ξ∗[εt² δ̄t−1 δ̄t−1⊤]. [sent-627, score-0.372]
94 The conditional expectation of the variation of ht can be derived as Eθ∗,ξs∗[ht+1 − ht | st] = −2ηt+1 θ̂t⊤ R(θ̂t) Eθ∗,ξs∗[ψt+1(Zt+1, θ̂t) | st] + η²t+1 Eθ∗,ξs∗[‖R(θ̂t) ψt+1(Zt+1, θ̂t)‖² | st]. [sent-639, score-0.651]
95 From Assumption 5, the second term of this equation is bounded by the second moment; thus we obtain Eθ∗,ξs∗[ht+1 − (1 + η²t+1 c2) ht | st] ≤ −2ηt+1 θ̂t⊤ R(θ̂t) Eθ∗,ξs∗[ψt+1(Zt+1, θ̂t) | st] + η²t+1 c1. [sent-640, score-0.319]
96 Multiplying both sides of Equation (42) by χt+1, we obtain Eθ∗,ξ∗[h′t+1 − h′t | Zt] ≤ −2ηt+1 χt+1 θ̂t⊤ R(θ̂t) Eθ∗,ξs∗[ψt+1(Zt+1, θ̂t) | st] + η²t+1 χt+1 c1. [sent-643, score-0.302]
97 2) guarantees that h′t converges to a nonnegative random variable almost surely, and ∑_{t=1}^{∞} ηt+1 χt+1 θ̂t⊤ R(θ̂t) Eθ∗,ξs∗[ψt+1(Zt+1, θ̂t) | st] < ∞. [sent-646, score-0.316]
98 we have θ̂t⊤ R(θ̂t) Eθ∗,ξs∗[ψt+1(Zt+1, θ̂t) | st] → 0. [sent-649, score-0.302]
99 By applying the central limit theorem to (1/√T) ∑_{t=1}^{T} k⊤ ψ̄t(Zt, θ̄) in Lemma 18, we have (1/√T) ∑_{t=1}^{T} k⊤ ψ̄t(Zt, θ̄) →_d N(0, k⊤ Σ̄ k), where k⊤ Σ̄ k = lim_{t→∞} k⊤ E[ε(zt, θ̄)² w̄t−1 w̄t−1⊤] k + 2 lim_{t→∞} ∑_{t′=1}^{∞} cov[ε(zt, θ̄) k⊤ w̄t−1, ε(zt+t′, θ̄) k⊤ w̄t+t′−1]. [sent-707, score-0.225]
100 Then, by using the covariance bound in Lemma 19, we obtain cov[ε(zt, θ̄) k⊤ w̄t−1, ε(zt+t′, θ̄) k⊤ w̄t+t′−1] ≤ 2√(ϕ(t′)) lim_{t→∞} k⊤ E[ε(zt, θ̄)² w̄t−1 w̄t−1⊤] k ≤ 2√C ρ^{t′/2} lim_{t→∞} k⊤ E[ε(zt, θ̄)² w̄t−1 w̄t−1⊤] k. [sent-709, score-0.201]
wordName wordTfidf (topN-words)
[('zt', 0.777), ('wt', 0.313), ('st', 0.302), ('rt', 0.204), ('td', 0.186), ('lstd', 0.131), ('lim', 0.093), ('glstd', 0.086), ('aeda', 0.082), ('awanabe', 0.082), ('eno', 0.082), ('shii', 0.082), ('ft', 0.073), ('mrp', 0.064), ('eneralized', 0.063), ('online', 0.059), ('batch', 0.059), ('av', 0.053), ('estimating', 0.05), ('nuisance', 0.048), ('amari', 0.046), ('aw', 0.044), ('policy', 0.044), ('bradtke', 0.037), ('sutton', 0.037), ('semiparametric', 0.037), ('asymptotic', 0.036), ('qt', 0.035), ('barto', 0.034), ('ilstd', 0.03), ('martingale', 0.027), ('geramifard', 0.026), ('godambe', 0.026), ('kyoto', 0.026), ('trajectory', 0.026), ('kawanabe', 0.026), ('misspeci', 0.024), ('variance', 0.023), ('lspe', 0.022), ('mrps', 0.022), ('stepsizes', 0.022), ('reward', 0.022), ('earning', 0.021), ('mse', 0.021), ('ti', 0.021), ('estimator', 0.02), ('rg', 0.02), ('nedi', 0.019), ('rl', 0.019), ('tr', 0.019), ('si', 0.019), ('ht', 0.018), ('bertsekas', 0.018), ('mixing', 0.018), ('weight', 0.018), ('lemma', 0.018), ('temporal', 0.018), ('xt', 0.017), ('nonsingular', 0.017), ('equation', 0.017), ('estimators', 0.016), ('geometrically', 0.015), ('ntd', 0.015), ('rensen', 0.015), ('semipositive', 0.015), ('reinforcement', 0.015), ('cov', 0.015), ('converges', 0.014), ('zi', 0.014), ('theorem', 0.014), ('panel', 0.013), ('mses', 0.013), ('function', 0.013), ('calculated', 0.012), ('lecun', 0.012), ('stationary', 0.012), ('mannor', 0.011), ('vaart', 0.011), ('huber', 0.011), ('boyan', 0.011), ('drt', 0.011), ('gtd', 0.011), ('ibragimov', 0.011), ('ronchetti', 0.011), ('tsuyoshi', 0.011), ('ueno', 0.011), ('estimation', 0.011), ('convergence', 0.011), ('yt', 0.011), ('proof', 0.011), ('rm', 0.011), ('expectation', 0.011), ('rn', 0.011), ('surely', 0.011), ('diffusion', 0.011), ('bottou', 0.01), ('central', 0.01), ('murata', 0.01), ('baird', 0.01), ('konishi', 0.01)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 36 jmlr-2011-Generalized TD Learning
Author: Tsuyoshi Ueno, Shin-ichi Maeda, Motoaki Kawanabe, Shin Ishii
Abstract: Since the invention of temporal difference (TD) learning (Sutton, 1988), many new algorithms for model-free policy evaluation have been proposed. Although they have brought much progress in practical applications of reinforcement learning (RL), there still remain fundamental problems concerning statistical properties of the value function estimation. To solve these problems, we introduce a new framework, semiparametric statistical inference, to model-free policy evaluation. This framework generalizes TD learning and its extensions, and allows us to investigate statistical properties of both of batch and online learning procedures for the value function estimation in a unified way in terms of estimating functions. Furthermore, based on this framework, we derive an optimal estimating function with the minimum asymptotic variance and propose batch and online learning algorithms which achieve the optimality. Keywords: reinforcement learning, model-free policy evaluation, TD learning, semiparametirc model, estimating function
2 0.24269111 87 jmlr-2011-Stochastic Methods forl1-regularized Loss Minimization
Author: Shai Shalev-Shwartz, Ambuj Tewari
Abstract: We describe and analyze two stochastic methods for ℓ1 regularized loss minimization problems, such as the Lasso. The first method updates the weight of a single feature at each iteration while the second method updates the entire weight vector but only uses a single training example at each iteration. In both methods, the choice of feature or example is uniformly at random. Our theoretical runtime analysis suggests that the stochastic methods should outperform state-of-the-art deterministic approaches, including their deterministic counterparts, when the size of the problem is large. We demonstrate the advantage of stochastic methods by experimenting with synthetic and natural data sets.1 Keywords: L1 regularization, optimization, coordinate descent, mirror descent, sparsity
3 0.18159679 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation
Author: Ryota Tomioka, Taiji Suzuki, Masashi Sugiyama
Abstract: We analyze the convergence behaviour of a recently proposed algorithm for regularized estimation called Dual Augmented Lagrangian (DAL). Our analysis is based on a new interpretation of DAL as a proximal minimization algorithm. We theoretically show under some conditions that DAL converges super-linearly in a non-asymptotic and global sense. Due to a special modelling of sparse estimation problems in the context of machine learning, the assumptions we make are milder and more natural than those made in conventional analysis of augmented Lagrangian algorithms. In addition, the new interpretation enables us to generalize DAL to wide varieties of sparse estimation problems. We experimentally confirm our analysis in a large scale ℓ1 -regularized logistic regression problem and extensively compare the efficiency of DAL algorithm to previously proposed algorithms on both synthetic and benchmark data sets. Keywords: dual augmented Lagrangian, proximal minimization, global convergence, sparse estimation, convex optimization
4 0.14826575 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
Author: Gilles Meyer, Silvère Bonnabel, Rodolphe Sepulchre
Abstract: The paper addresses the problem of learning a regression model parameterized by a fixed-rank positive semidefinite matrix. The focus is on the nonlinear nature of the search space and on scalability to high-dimensional problems. The mathematical developments rely on the theory of gradient descent algorithms adapted to the Riemannian geometry that underlies the set of fixedrank positive semidefinite matrices. In contrast with previous contributions in the literature, no restrictions are imposed on the range space of the learned matrix. The resulting algorithms maintain a linear complexity in the problem size and enjoy important invariance properties. We apply the proposed algorithms to the problem of learning a distance function parameterized by a positive semidefinite matrix. Good performance is observed on classical benchmarks. Keywords: linear regression, positive semidefinite matrices, low-rank approximation, Riemannian geometry, gradient-based learning
5 0.12339973 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
Author: John Duchi, Elad Hazan, Yoram Singer
Abstract: We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. Metaphorically, the adaptation allows us to find needles in haystacks in the form of very predictive but rarely seen features. Our paradigm stems from recent advances in stochastic optimization and online learning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. We give several efficient algorithms for empirical risk minimization problems with common and important regularization functions and domain constraints. We experimentally study our theoretical analysis and show that adaptive subgradient methods outperform state-of-the-art, yet non-adaptive, subgradient algorithms. Keywords: subgradient methods, adaptivity, online learning, stochastic convex optimization
6 0.11001399 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms
7 0.089434952 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes
8 0.087120853 14 jmlr-2011-Better Algorithms for Benign Bandits
9 0.080517784 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
10 0.067324914 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm
11 0.054466929 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
12 0.053588994 28 jmlr-2011-Double Updating Online Learning
13 0.049245596 56 jmlr-2011-Learning Transformation Models for Ranking and Survival Analysis
14 0.04243096 47 jmlr-2011-Inverse Reinforcement Learning in Partially Observable Environments
15 0.040274374 33 jmlr-2011-Exploiting Best-Match Equations for Efficient Reinforcement Learning
16 0.034461983 81 jmlr-2011-Robust Approximate Bilinear Programming for Value Function Approximation
17 0.030376831 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
18 0.029908642 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
19 0.025910819 101 jmlr-2011-Variable Sparsity Kernel Learning
20 0.025536107 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods
topicId topicWeight
[(0, 0.184), (1, 0.45), (2, -0.137), (3, -0.094), (4, -0.17), (5, -0.129), (6, -0.098), (7, 0.03), (8, 0.044), (9, 0.057), (10, 0.074), (11, 0.047), (12, -0.061), (13, -0.032), (14, -0.015), (15, -0.009), (16, 0.029), (17, -0.014), (18, -0.024), (19, 0.035), (20, 0.077), (21, -0.049), (22, 0.005), (23, -0.069), (24, 0.002), (25, -0.071), (26, -0.096), (27, -0.033), (28, 0.033), (29, 0.066), (30, -0.033), (31, -0.053), (32, 0.02), (33, -0.086), (34, -0.026), (35, -0.03), (36, 0.044), (37, 0.002), (38, -0.027), (39, -0.056), (40, -0.063), (41, 0.01), (42, -0.045), (43, 0.063), (44, 0.092), (45, -0.036), (46, 0.052), (47, 0.074), (48, 0.02), (49, -0.085)]
simIndex simValue paperId paperTitle
same-paper 1 0.99230725 36 jmlr-2011-Generalized TD Learning
Author: Tsuyoshi Ueno, Shin-ichi Maeda, Motoaki Kawanabe, Shin Ishii
Abstract: Since the invention of temporal difference (TD) learning (Sutton, 1988), many new algorithms for model-free policy evaluation have been proposed. Although they have brought much progress in practical applications of reinforcement learning (RL), there still remain fundamental problems concerning statistical properties of the value function estimation. To solve these problems, we introduce a new framework, semiparametric statistical inference, to model-free policy evaluation. This framework generalizes TD learning and its extensions, and allows us to investigate statistical properties of both of batch and online learning procedures for the value function estimation in a unified way in terms of estimating functions. Furthermore, based on this framework, we derive an optimal estimating function with the minimum asymptotic variance and propose batch and online learning algorithms which achieve the optimality. Keywords: reinforcement learning, model-free policy evaluation, TD learning, semiparametirc model, estimating function
2 0.73772544 87 jmlr-2011-Stochastic Methods forl1-regularized Loss Minimization
Author: Shai Shalev-Shwartz, Ambuj Tewari
Abstract: We describe and analyze two stochastic methods for ℓ1 regularized loss minimization problems, such as the Lasso. The first method updates the weight of a single feature at each iteration while the second method updates the entire weight vector but only uses a single training example at each iteration. In both methods, the choice of feature or example is uniformly at random. Our theoretical runtime analysis suggests that the stochastic methods should outperform state-of-the-art deterministic approaches, including their deterministic counterparts, when the size of the problem is large. We demonstrate the advantage of stochastic methods by experimenting with synthetic and natural data sets.1 Keywords: L1 regularization, optimization, coordinate descent, mirror descent, sparsity
Author: Ryota Tomioka, Taiji Suzuki, Masashi Sugiyama
Abstract: We analyze the convergence behaviour of a recently proposed algorithm for regularized estimation called Dual Augmented Lagrangian (DAL). Our analysis is based on a new interpretation of DAL as a proximal minimization algorithm. We theoretically show under some conditions that DAL converges super-linearly in a non-asymptotic and global sense. Due to a special modelling of sparse estimation problems in the context of machine learning, the assumptions we make are milder and more natural than those made in conventional analysis of augmented Lagrangian algorithms. In addition, the new interpretation enables us to generalize DAL to wide varieties of sparse estimation problems. We experimentally confirm our analysis in a large scale ℓ1 -regularized logistic regression problem and extensively compare the efficiency of DAL algorithm to previously proposed algorithms on both synthetic and benchmark data sets. Keywords: dual augmented Lagrangian, proximal minimization, global convergence, sparse estimation, convex optimization
4 0.63711083 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
Author: Gilles Meyer, Silvère Bonnabel, Rodolphe Sepulchre
Abstract: The paper addresses the problem of learning a regression model parameterized by a fixed-rank positive semidefinite matrix. The focus is on the nonlinear nature of the search space and on scalability to high-dimensional problems. The mathematical developments rely on the theory of gradient descent algorithms adapted to the Riemannian geometry that underlies the set of fixedrank positive semidefinite matrices. In contrast with previous contributions in the literature, no restrictions are imposed on the range space of the learned matrix. The resulting algorithms maintain a linear complexity in the problem size and enjoy important invariance properties. We apply the proposed algorithms to the problem of learning a distance function parameterized by a positive semidefinite matrix. Good performance is observed on classical benchmarks. Keywords: linear regression, positive semidefinite matrices, low-rank approximation, Riemannian geometry, gradient-based learning
5 0.49132496 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms
Author: Benoît Patra
Abstract: Motivated by the problem of effectively executing clustering algorithms on very large data sets, we address a model for large scale distributed clustering methods. To this end, we briefly recall some standards on the quantization problem and some results on the almost sure convergence of the competitive learning vector quantization (CLVQ) procedure. A general model for linear distributed asynchronous algorithms well adapted to several parallel computing architectures is also discussed. Our approach brings together this scalable model and the CLVQ algorithm, and we call the resulting technique the distributed asynchronous learning vector quantization algorithm (DALVQ). An indepth analysis of the almost sure convergence of the DALVQ algorithm is performed. A striking result is that we prove that the multiple versions of the quantizers distributed among the processors in the parallel architecture asymptotically reach a consensus almost surely. Furthermore, we also show that these versions converge almost surely towards the same nearly optimal value for the quantization criterion. Keywords: k-means, vector quantization, distributed, asynchronous, stochastic optimization, scalability, distributed consensus
6 0.3609696 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
7 0.33644378 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm
8 0.31718871 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes
9 0.27599189 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
10 0.26465467 33 jmlr-2011-Exploiting Best-Match Equations for Efficient Reinforcement Learning
11 0.25367782 28 jmlr-2011-Double Updating Online Learning
12 0.24377201 14 jmlr-2011-Better Algorithms for Benign Bandits
13 0.23478226 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
14 0.2060221 56 jmlr-2011-Learning Transformation Models for Ranking and Survival Analysis
15 0.19876026 10 jmlr-2011-Anechoic Blind Source Separation Using Wigner Marginals
16 0.18482436 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
17 0.14401518 47 jmlr-2011-Inverse Reinforcement Learning in Partially Observable Environments
18 0.13642295 51 jmlr-2011-Laplacian Support Vector Machines Trained in the Primal
19 0.12574394 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models
20 0.12549785 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
topicId topicWeight
[(4, 0.026), (9, 0.02), (10, 0.021), (24, 0.038), (31, 0.123), (32, 0.023), (41, 0.019), (60, 0.011), (65, 0.014), (67, 0.027), (71, 0.01), (73, 0.022), (78, 0.149), (84, 0.35), (90, 0.01)]
simIndex simValue paperId paperTitle
same-paper 1 0.78648967 36 jmlr-2011-Generalized TD Learning
Author: Tsuyoshi Ueno, Shin-ichi Maeda, Motoaki Kawanabe, Shin Ishii
Abstract: Since the invention of temporal difference (TD) learning (Sutton, 1988), many new algorithms for model-free policy evaluation have been proposed. Although they have brought much progress in practical applications of reinforcement learning (RL), there still remain fundamental problems concerning statistical properties of the value function estimation. To solve these problems, we introduce a new framework, semiparametric statistical inference, to model-free policy evaluation. This framework generalizes TD learning and its extensions, and allows us to investigate statistical properties of both of batch and online learning procedures for the value function estimation in a unified way in terms of estimating functions. Furthermore, based on this framework, we derive an optimal estimating function with the minimum asymptotic variance and propose batch and online learning algorithms which achieve the optimality. Keywords: reinforcement learning, model-free policy evaluation, TD learning, semiparametirc model, estimating function
2 0.48076224 22 jmlr-2011-Differentially Private Empirical Risk Minimization
Author: Kamalika Chaudhuri, Claire Monteleoni, Anand D. Sarwate
Abstract: Privacy-preserving machine learning algorithms are crucial for the increasingly common setting in which personal data, such as medical or financial records, are analyzed. We provide general techniques to produce privacy-preserving approximations of classifiers learned via (regularized) empirical risk minimization (ERM). These algorithms are private under the ε-differential privacy definition due to Dwork et al. (2006). First we apply the output perturbation ideas of Dwork et al. (2006), to ERM classification. Then we propose a new method, objective perturbation, for privacy-preserving machine learning algorithm design. This method entails perturbing the objective function before optimizing over classifiers. If the loss and regularizer satisfy certain convexity and differentiability criteria, we prove theoretical results showing that our algorithms preserve privacy, and provide generalization bounds for linear and nonlinear kernels. We further present a privacypreserving technique for tuning the parameters in general machine learning algorithms, thereby providing end-to-end privacy guarantees for the training process. We apply these results to produce privacy-preserving analogues of regularized logistic regression and support vector machines. We obtain encouraging results from evaluating their performance on real demographic and benchmark data sets. Our results show that both theoretically and empirically, objective perturbation is superior to the previous state-of-the-art, output perturbation, in managing the inherent tradeoff between privacy and learning performance. Keywords: privacy, classification, optimization, empirical risk minimization, support vector machines, logistic regression
3 0.48026231 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
Author: Graham W. Taylor, Geoffrey E. Hinton, Sam T. Roweis
Abstract: In this paper we develop a class of nonlinear generative models for high-dimensional time series. We first propose a model based on the restricted Boltzmann machine (RBM) that uses an undirected model with binary latent variables and real-valued “visible” variables. The latent and visible variables at each time step receive directed connections from the visible variables at the last few time-steps. This “conditional” RBM (CRBM) makes on-line inference efficient and allows us to use a simple approximate learning procedure. We demonstrate the power of our approach by synthesizing various sequences from a model trained on motion capture data and by performing on-line filling in of data lost during capture. We extend the CRBM in a way that preserves its most important computational properties and introduces multiplicative three-way interactions that allow the effective interaction weight between two variables to be modulated by the dynamic state of a third variable. We introduce a factoring of the implied three-way weight tensor to permit a more compact parameterization. The resulting model can capture diverse styles of motion with a single set of parameters, and the three-way interactions greatly improve its ability to blend motion styles or to transition smoothly among them. Videos and source code can be found at http://www.cs.nyu.edu/˜gwtaylor/publications/ jmlr2011. Keywords: unsupervised learning, restricted Boltzmann machines, time series, generative models, motion capture
4 0.46506432 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
Author: Nicolò Cesa-Bianchi, Shai Shalev-Shwartz, Ohad Shamir
Abstract: We investigate three variants of budgeted learning, a setting in which the learner is allowed to access a limited number of attributes from training or test examples. In the “local budget” setting, where a constraint is imposed on the number of available attributes per training example, we design and analyze an efficient algorithm for learning linear predictors that actively samples the attributes of each training instance. Our analysis bounds the number of additional examples sufficient to compensate for the lack of full information on the training set. This result is complemented by a general lower bound for the easier “global budget” setting, where it is only the overall number of accessible training attributes that is being constrained. In the third, “prediction on a budget” setting, when the constraint is on the number of available attributes per test example, we show that there are cases in which there exists a linear predictor with zero error but it is statistically impossible to achieve arbitrary accuracy without full information on test examples. Finally, we run simple experiments on a digit recognition problem that reveal that our algorithm has a good performance against both partial information and full information baselines. Keywords: budgeted learning, statistical learning, linear predictors, learning with partial information, learning theory
5 0.46503332 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms
Author: Şeyda Ertekin, Cynthia Rudin
Abstract: We demonstrate that there are machine learning algorithms that can achieve success for two separate tasks simultaneously, namely the tasks of classification and bipartite ranking. This means that advantages gained from solving one task can be carried over to the other task, such as the ability to obtain conditional density estimates, and an order-of-magnitude reduction in computational time for training the algorithm. It also means that some algorithms are robust to the choice of evaluation metric used; they can theoretically perform well when performance is measured either by a misclassification error or by a statistic of the ROC curve (such as the area under the curve). Specifically, we provide such an equivalence relationship between a generalization of Freund et al.’s RankBoost algorithm, called the “P-Norm Push,” and a particular cost-sensitive classification algorithm that generalizes AdaBoost, which we call “P-Classification.” We discuss and validate the potential benefits of this equivalence relationship, and perform controlled experiments to understand P-Classification’s empirical performance. There is no established equivalence relationship for logistic regression and its ranking counterpart, so we introduce a logistic-regression-style algorithm that aims in between classification and ranking, and has promising experimental performance with respect to both tasks. Keywords: supervised classification, bipartite ranking, area under the curve, rank statistics, boosting, logistic regression
6 0.46164262 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
7 0.46088946 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
8 0.45860747 87 jmlr-2011-Stochastic Methods forl1-regularized Loss Minimization
9 0.45689258 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms
10 0.45685339 12 jmlr-2011-Bayesian Co-Training
11 0.45648533 81 jmlr-2011-Robust Approximate Bilinear Programming for Value Function Approximation
12 0.45531002 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation
13 0.45400137 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
14 0.45384014 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
15 0.45070475 16 jmlr-2011-Clustering Algorithms for Chains
16 0.44738549 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
17 0.44718304 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
18 0.44658625 91 jmlr-2011-The Sample Complexity of Dictionary Learning
19 0.44563049 38 jmlr-2011-Hierarchical Knowledge Gradient for Sequential Sampling
20 0.44493809 104 jmlr-2011-X-Armed Bandits