jmlr-2005-5: A Generalization Error for Q-Learning (knowledge graph by maker-knowledge-mining)
Source: pdf
Author: Susan A. Murphy
Abstract: Planning problems that involve learning a policy from a single training set of finite horizon trajectories arise in both social science and medical fields. We consider Q-learning with function approximation for this setting and derive an upper bound on the generalization error. This upper bound is in terms of quantities minimized by a Q-learning algorithm, the complexity of the approximation space and an approximation term due to the mismatch between Q-learning and the goal of learning a policy that maximizes the value function. Keywords: multistage decisions, dynamic programming, reinforcement learning, batch data
Reference: text
sentIndex sentText
1. Department of Statistics, University of Michigan, Ann Arbor, MI 48109-1107, USA. Editor: Michael Littman. Abstract: Planning problems that involve learning a policy from a single training set of finite horizon trajectories arise in both social science and medical fields.
2. This upper bound is in terms of quantities minimized by a Q-learning algorithm, the complexity of the approximation space, and an approximation term due to the mismatch between Q-learning and the goal of learning a policy that maximizes the value function.
3. Keywords: multistage decisions, dynamic programming, reinforcement learning, batch data.
4. A training set, or batch, of n trajectories of T + 1 decision epochs is available for estimating a policy.
5. These are finite horizon problems with T generally quite small (T = 2 to 4) and a known exploration policy.
6. Alternately, the training set of n trajectories may be historical; for example, data in which clinicians and their patients are followed over time.
7. Again the goal is to estimate the best policy for managing the disease.
8. The goal is to estimate the best policy for deciding which clients should receive a mailing and the form of the mailing.
9. These latter planning problems can be viewed as infinite horizon problems, but only T decision epochs per client are recorded.
10. If T is large, the rewards are bounded, and the dynamics are stationary Markovian, then this finite horizon problem provides an approximation to the discounted infinite horizon problem (Kearns, Mansour and Ng, 2000).
11. All that is available is the n trajectories of T + 1 decision epochs.
12. One approach to learning a policy in this setting is Q-learning.
13. The generalization error is the difference in value when using the optimal policy as compared to using the greedy policy (from Q-learning) in generating a separate test set.
14. These upper bounds illuminate the mismatch between Q-learning with function approximation and the goal of finding a policy maximizing the value function (see the remark following Lemma 2 and the third remark following Theorem 2).
15. In the process of providing an upper bound on the average generalization error, finite sample bounds on the difference in average values resulting from different policies are derived.
16. From the work of Kearns, Mansour, and Ng (1999, 2000) and Peshkin and Shelton (2002), we expect, and find here as well, that the number of trajectories needed to guarantee a specified error level is exponential in the horizon time T.
17. For example, if the optimal Q-function belongs to the approximation space, then the upper bounds imply that batch Q-learning is a PAC reinforcement learning algorithm as in Fiechter (1994, 1997); see the first remark following Theorem 1.
18. In Section 5 we provide the two main results, both of which give the number of trajectories needed to achieve a given error level with a specified level of certainty.
19. Each of the n trajectories is composed of the sequence {O_0, A_0, O_1, A_1, . . . , O_T, A_T, O_{T+1}}.
20. The rewards are R_t = r_t(O_t, A_t, O_{t+1}) for r_t a reward function and for each 0 ≤ t ≤ T (if the Markov assumption holds, the full observation and action histories may be replaced by the most recent O_t and A_t).
21. We assume the trajectories are sampled at random according to a fixed distribution denoted by P.
22. Thus the trajectories are generated by one fixed distribution.
23. This distribution is composed of unknown conditional densities for the observations, {f_0, . . . , f_{T+1}}, and an exploration policy for generating the actions.
24. The exploration policy is a set of stochastic decision rules {p_0, . . . , p_T}, where the probability that action a is taken given the history (O_t, A_{t-1}) is p_t(a | O_t, A_{t-1}) (if the Markov assumption holds then, as before, the histories are replaced by the most recent observation and action).
25. We assume that p_t(a | o_t, a_{t-1}) > 0 for each action a ∈ A and for each possible value (o_t, a_{t-1}); that is, at each time all actions are possible.
26. Let P_π denote the distribution of a trajectory whereby the policy π is used to generate the actions.
27. Note that since (1) and (2) differ only in regard to the policy for generating actions, an expectation with respect to either P or P_π that does not involve integration over the policy results in the same quantity.
28. For example, E[R_t | O_t, A_t] = E_π[R_t | O_t, A_t] for any policy π.
29. In a finite horizon planning problem (permitting nonstationary, non-Markovian policies) the goal is to estimate a policy that maximizes $E_\pi\big[\sum_{j=1}^{T} R_j \mid O_0 = o_0\big]$ over π ∈ Π.
30 j=1 The t-value function for policy π is the value of the rewards summed from time t on and is T Vπ,t (ot , at−1 ) = Eπ ∑ R j Ot = ot , At−1 = at−1 . [sent-108, score-0.915]
31 j=t If the Markovian assumption holds then (ot , at−1 ) in the definition of Vπ,t is replaced by ot . [sent-109, score-0.83]
32 Then the value functions satisfy the following relationship: Vπ,t (ot , at−1 ) = Eπ [Rt +Vπ,t+1 (Ot+1 , At )|Ot = ot , At−1 = at−1 ] for t = 0, . [sent-112, score-0.83]
33 The time t Q-function for policy π is Qπ,t (ot , at ) = E[Rt +Vπ,t+1 (Ot+1 , At )|Ot = ot , At = at ]. [sent-116, score-0.908]
34. In Section 4 we express the difference in value functions for policy π̃ and policy π in terms of the advantages (as defined in Baird, 1993).
35. The advantage can be interpreted as the gain in performance obtained by following action a_t at time t and thereafter policy π, as compared to following policy π from time t on.
36. As is well known, the optimal value functions satisfy the Bellman equations (Bellman, 1957):
$$V_t^*(o_t, a_{t-1}) = \max_{a_t \in A} E\big[R_t + V_{t+1}^*(O_{t+1}, A_t) \mid O_t = o_t,\, A_t = a_t\big].$$
37. Optimal, deterministic, time t decision rules must satisfy
$$\pi_t^*(o_t, a_{t-1}) \in \arg\max_{a_t \in A} E\big[R_t + V_{t+1}^*(O_{t+1}, A_t) \mid O_t = o_t,\, A_t = a_t\big].$$
38. Batch Q-Learning. We consider a version of Q-learning for use in learning a non-stationary, non-Markovian policy with one training set of finite horizon trajectories.
39. Suppose that the Q-functions are approximated by linear combinations of p features, $\mathcal{Q}_t = \{\theta^T q_t(o_t, a_t) : \theta \in \mathbb{R}^p\}$; then, to achieve (3), solve backwards through time, t = T, T − 1, . . . , 0.
40. That is, for each t solve
$$0 = E_n\Big[\Big(R_t + \max_{a_{t+1}} Q_{t+1}(O_{t+1}, A_t, a_{t+1}; \theta_{t+1}) - Q_t(O_t, A_t; \theta_t)\Big)\, q_t(O_t, A_t)^T\Big] \qquad (4)$$
for θ_t.
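As an illustration of this backward fitting scheme, the following sketch (not from the paper; the variable names, the trajectory data layout, and the use of a normal-equations solve are my own assumptions) fits linear Q-functions backwards in time by solving the empirical estimating equation (4) at each t, with Q_{T+1} taken to be zero.

    import numpy as np

    def batch_q_learning(trajectories, feature_fn, T, p, actions=(0, 1)):
        """Fit theta_T, ..., theta_0 by solving equation (4) backwards in time.

        trajectories: list of dicts with 'obs' (length T+2), 'act' and 'rew'
                      (each length T+1); all names are hypothetical.
        feature_fn(t, o, a): returns the p-dimensional feature vector q_t(o, a).
        """
        thetas = [np.zeros(p) for _ in range(T + 1)]
        for t in range(T, -1, -1):                    # t = T, T-1, ..., 0
            A_mat = np.zeros((p, p))
            b_vec = np.zeros(p)
            for traj in trajectories:
                o_t, a_t, r_t = traj['obs'][t], traj['act'][t], traj['rew'][t]
                q_t = feature_fn(t, o_t, a_t)
                if t < T:
                    o_next = traj['obs'][t + 1]
                    # max over a_{t+1} of the already-fitted Q_{t+1}
                    target = r_t + max(thetas[t + 1] @ feature_fn(t + 1, o_next, a)
                                       for a in actions)
                else:
                    target = r_t                      # Q_{T+1} assumed to be zero
                # Accumulate the normal equations for E_n[(target - theta^T q_t) q_t] = 0
                A_mat += np.outer(q_t, q_t)
                b_vec += target * q_t
            thetas[t] = np.linalg.solve(A_mat, b_vec)
        return thetas

A pseudoinverse (np.linalg.pinv) would be the safer choice when the empirical matrix E_n[q_t q_t^T] is ill-conditioned.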
41. An incremental implementation, with updates between trajectories of (3) and (4), results in one-step Q-learning (Sutton and Barto, 1998, p. 148).
42. Then for each t, with $X = (O_{t+1}, A_t)$ and $\theta = \theta_t$,
$$f(X, \theta_t) = \Big(R_t + \max_{a_{t+1}} Q_{t+1}(O_{t+1}, A_t, a_{t+1}; \theta_{t+1}) - Q_t(O_t, A_t; \theta_t)\Big)\, q_t(O_t, A_t)^T$$
is a vector of dimension p.
43. The incremental update is
$$\theta_t^{(n+1)} \leftarrow \theta_t^{(n)} + \alpha_n \Big(R_t + \max_{a_{t+1}} Q_{t+1}(O_{t+1}, A_t, a_{t+1}; \theta_{t+1}^{(n+1)}) - Q_t(O_t, A_t; \theta_t^{(n)})\Big)\, q_t(O_t, A_t)$$
for t = 0, . . . , T.
44. This is one-step Q-learning with γ = 1, generalized to permit function approximation and nonstationary Q-functions, and is analogous to the TD(0) update of Tsitsiklis and Van Roy (1997), permitting non-Markovian, nonstationary value functions.
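For comparison with the batch solve above, here is a minimal sketch of the incremental (one-step) update; the single-trajectory interface, the fixed step size, and the backward loop order are my own assumptions rather than specifications from the paper.

    def one_step_update(thetas, traj, feature_fn, T, alpha, actions=(0, 1)):
        """Apply the incremental update to theta_0, ..., theta_T using one trajectory."""
        for t in range(T, -1, -1):                 # update backwards so theta_{t+1} is current
            o_t, a_t, r_t = traj['obs'][t], traj['act'][t], traj['rew'][t]
            q_t = feature_fn(t, o_t, a_t)
            if t < T:
                o_next = traj['obs'][t + 1]
                target = r_t + max(thetas[t + 1] @ feature_fn(t + 1, o_next, a)
                                   for a in actions)
            else:
                target = r_t                       # Q_{T+1} taken to be zero
            td_error = target - thetas[t] @ q_t    # temporal-difference residual
            thetas[t] = thetas[t] + alpha * td_error * q_t
        return thetas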
45. The estimated policy π satisfies π_t(o_t, a_{t-1}) ∈ arg max_{a_t} Q_t(o_t, a_t) for each t, where Q_t is the fitted Q-function.
46. For example, the Q-functions corresponding to the use of a policy π (Q_{π,t}, t = 0, . . . , T) need not belong to the approximation space.
47. Define Π_Q as the class of policies {π_{0,θ}, . . . , π_{T,θ}}, where each π_{t,θ}(o_t, a_{t-1}) ∈ arg max_{a_t} Q_t(o_t, a_t; θ) for some Q_t ∈ Q_t.
48. Generalization Error. Define the generalization error of a policy π at an observation o_0 as the average difference between the optimal value function and the value function resulting from the use of policy π in generating a separate test set.
49. The generalization error of policy π at observation o_0 can be written as
$$V^*(o_0) - V_\pi(o_0) = -E_\pi\Big[\sum_{t=0}^{T} \mu_t^*(O_t, A_t) \;\Big|\; O_0 = o_0\Big] \qquad (5)$$
where E_π denotes the expectation using the likelihood (2).
50. So the generalization error can be expressed in terms of the optimal advantages evaluated at actions determined by policy π, that is, when each A_t = π_t(O_t, A_{t-1}).
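To make (5) concrete, here is a small Monte Carlo sketch (my own illustration, not code from the paper): it assumes access to a simulator for the test distribution and, purely for illustration, to the optimal advantages mu_star, and it averages the right-hand side of (5) over rollouts generated by the policy pi.

    def generalization_error_mc(simulator, pi, mu_star, T, n_rollouts, o0):
        """Monte Carlo estimate of V*(o0) - V_pi(o0) via equation (5).

        simulator.step(t, o_hist, a_hist): returns the next observation (assumed API).
        pi(t, o_hist, a_hist): the policy's action at time t given the history.
        mu_star(t, o_hist, a_hist): optimal advantage of the chosen action (assumed known).
        """
        total = 0.0
        for _ in range(n_rollouts):
            o_hist, a_hist = [o0], []
            advantage_sum = 0.0
            for t in range(T + 1):
                a = pi(t, o_hist, a_hist)
                a_hist.append(a)
                advantage_sum += mu_star(t, o_hist, a_hist)
                o_hist.append(simulator.step(t, o_hist, a_hist))
            total += advantage_sum
        # Minus sign as in (5); optimal advantages are non-positive, so the result is >= 0.
        return -total / n_rollouts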
51. And $E_\pi\big[\sum_{t=0}^{T} R_t \mid O_T, A_T\big]$ is the expectation with respect to the distribution of O_{T+1} given the history (O_T, A_T); this is the density f_{T+1} from Section 2, and f_{T+1} is independent of the policy used to choose the actions.
52. Consider functions $\{Q_0, \dots, Q_T\}$ satisfying $\pi_t(o_t, a_{t-1}) \in \arg\max_{a_t} Q_t(o_t, a_t)$, t = 0, . . . , T.
53. Similarly, consider functions $\{\tilde Q_0, \dots, \tilde Q_T\}$ satisfying $\tilde\pi_t(o_t, a_{t-1}) \in \arg\max_{a_t} \tilde Q_t(o_t, a_t)$, t = 0, . . . , T.
54. We assume that there exists a positive constant L for which p_t(a_t | o_t, a_{t-1}) ≥ L^{-1} for each t and all pairs (o_t, a_{t-1}); if the stochastic decision rule p_t were uniform, then L would be the size of the action space.
55. Lemma 2. For all functions $Q_t$ satisfying $\pi_t(o_t, a_{t-1}) \in \arg\max_{a_t} Q_t(o_t, a_t)$, t = 0, . . . , T,
56. and all functions $\tilde Q_t$ satisfying $\tilde\pi_t(o_t, a_{t-1}) \in \arg\max_{a_t} \tilde Q_t(o_t, a_t)$, t = 0, . . . , T.
57. Note that in general $\arg\max_{a_t} Q_{\pi,t}(o_t, a_t)$ may not be $\pi_t$; thus we cannot choose $Q_t = Q_{\pi,t}$.
58. This result can be used to emphasize one aspect of the mismatch between estimating the optimal Q-function and the goal of learning a policy that maximizes the value function.
59. The generalization error satisfies
$$V^*(o_0) - V_\pi(o_0) \le \sum_{t=0}^{T} 2\, L^{(t+1)/2}\, E\big[(Q_t(O_t, A_t) - Q_t^*(O_t, A_t))^2 \mid O_0 = o_0\big]^{1/2}$$
for Q_t any function satisfying $\pi_t(o_t, a_{t-1}) \in \arg\max_{a_t} Q_t(o_t, a_t)$.
60. Thus, for the difference in value functions to be small, the average difference between $Q_t(o_t, a_t) - \max_{a_t'} Q_t(o_t, a_t')$ and $Q_t^*(o_t, a_t) - \max_{a_t'} Q_t^*(o_t, a_t')$ must be small; this does not require that the average difference between $Q_t(o_t, a_t)$ and $Q_t^*(o_t, a_t)$ be small.
61. For example, Baxter and Bartlett (2001) provide an example in which the approximation space for the value function includes a value function for which the greedy policy is optimal, yet the greedy policy found by temporal difference learning (TD(1)) is not optimal.
62. Define $\mu_t(o_t, a_t) = Q_t(o_t, a_t) - \max_{a_t'} Q_t(o_t, a_t')$ for each t; note that $\mu_t(o_t, a_t)$ evaluated at $a_t = \pi_t(o_t, a_{t-1})$ is zero.
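A tiny helper (my own illustration; the finite action set and the q_fn interface are assumptions) makes this advantage definition and its zero-at-the-greedy-action property explicit:

    def advantage(q_fn, o, a, actions):
        """mu(o, a) = Q(o, a) - max_a' Q(o, a'); zero when a is the greedy action."""
        return q_fn(o, a) - max(q_fn(o, a_prime) for a_prime in actions)

    # The greedy action always has zero advantage:
    # greedy = max(actions, key=lambda a: q_fn(o, a))
    # assert advantage(q_fn, o, greedy, actions) == 0.0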
63. Finite Sample Upper Bounds on the Average Generalization Error. Traditionally the performance of a policy π is evaluated in terms of the maximum generalization error, $\max_o\, [V^*(o) - V_\pi(o)]$ (Bertsekas and Tsitsiklis, 1996).
64. The choice of F with density f = f_0 (f_0 is the density of O_0 in likelihoods (1) and (2)) is particularly appealing in the development of a policy in many medical and social science applications.
65. The goal is to produce a good policy for this population of subjects.
66. If only a training set of trajectories is available for learning and we are unwilling to assume that the system dynamics are Markovian, then the choice of F is constrained by the following consideration.
67. The bounds hold for every choice of functions $Q_j \in \mathcal Q_j$, j = 0, . . . , T, with associated policy π defined by $\pi_j(o_j, a_{j-1}) \in \arg\max_{a_j} Q_j(o_j, a_j)$, and for every choice of functions $\tilde Q_j \in \mathcal Q_j$, j = 0, . . . , T.
68. Thus, as long as the covering numbers for each $\mathcal Q_t$ (and thus for $\mathcal F$) do not grow too fast, estimating each Q_t by minimizing $\mathrm{Err}_{n,Q_{t+1}}(Q_t)$ yields a policy that consistently achieves the optimal value.
69. In this case the training set size n sufficient for (10) to hold need only be polynomial in (1/δ, 1/ε), and batch Q-learning is a probably approximately correct (PAC) reinforcement learning algorithm as defined by Fiechter (1997).
70. As shown by Fiechter (1997), this algorithm can be converted to an efficient on-line reinforcement learning algorithm (here 'on-line' means updating the policy between trajectories).
71. If we do this, the result continues to hold when we set π̃ to an optimal policy π* and set $\tilde Q_t = Q_t^*$ for each t; the generalization error satisfies
$$\int V^*(o) - V_\pi(o)\, dF(o) \;\le\; 6\,M\,L^{1/2} \sum_{t=0}^{T} \sum_{i=t}^{T} 16^{\,i-t}\, L^{i}\, \mathrm{Err}_{n,Q_{i+1}}(Q_i)^{1/2} \;+\; 12\,M\,L^{1/2}\, \varepsilon$$
for all n satisfying (9).
72. This upper bound is consistent with the practice of using a policy π for which $\pi_t(o_t, a_{t-1}) \in \arg\max_{a_t} Q_t(o_t, a_t)$ and $Q_t \in \arg\min_{Q \in \mathcal Q_t} \mathrm{Err}_{n,Q_{t+1}}(Q)$.
73. Given that the covering numbers for the approximation space can be expressed in a sufficiently simple form (as in remark 4 below), this upper bound can be used to carry out model selection using structural risk minimization (Vapnik, 1982).
74. Note that the needed training set size n depends exponentially on the horizon time T but not on the dimension of the observation space.
75. This is not unexpected, as the upper bounds on the generalization error of both Kearns, Mansour and Ng (2000) and Peshkin and Shelton's (2002) policy search methods (the latter using a training set and importance sampling weights) also depend exponentially on the horizon time.
76. When F is infinite, we use covering numbers for the approximation space $\mathcal Q_t$ and then appeal to Lemma A2 in the appendix to derive a covering number for $\mathcal F$; the resulting bound on $N_1(\varepsilon, \mathcal F, n)$ is (T + 1) times the largest of the covering numbers of the spaces $\mathcal Q_t$, t = 0, . . . , T (at a rescaled radius).
77. For each t = 0, . . . , T, denote the feature vector by $q_t(o_t, a_t)$.
78. The approximation space is then $\mathcal Q_t = \{Q_t(o_t, a_t) = \theta^T q_t(o_t, a_t) : \theta \in \Theta\}$, where Θ is a subset of $\mathbb R^p$.
79. Define the error of $\{Q_0, \dots, Q_T\}$ on the training set by
$$\mathrm{Err}_{n,Q_{t+1}}(Q_t) = E_n\Big[\Big(R_t + \max_{a_{t+1}} Q_{t+1}(O_{t+1}, A_t, a_{t+1}) - Q_t(O_t, A_t)\Big)\, q_t(O_t, A_t)\Big]$$
for t = 0, . . . , T.
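The following sketch (an illustration under the same assumed trajectory layout as the earlier batch-fitting code, not the paper's own code) computes the empirical error vector Err_{n,Q_{t+1}}(Q_t) for a candidate pair (theta_t, theta_next) in the linear-feature case:

    import numpy as np

    def err_n(trajectories, feature_fn, t, theta_t, theta_next, T, actions=(0, 1)):
        """Empirical p-vector E_n[(R_t + max_{a'} Q_{t+1} - Q_t) q_t] for linear Q-functions."""
        total = None
        for traj in trajectories:
            o_t, a_t, r_t = traj['obs'][t], traj['act'][t], traj['rew'][t]
            q_t = feature_fn(t, o_t, a_t)
            if t < T:
                o_next = traj['obs'][t + 1]
                boot = max(theta_next @ feature_fn(t + 1, o_next, a) for a in actions)
            else:
                boot = 0.0                          # Q_{T+1} taken to be zero
            residual = r_t + boot - theta_t @ q_t
            total = residual * q_t if total is None else total + residual * q_t
        return total / len(trajectories)

At the theta_t returned by the batch solve shown earlier, this vector is numerically zero, which is exactly the estimating equation (4).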
80. In this theorem,
$$\mathcal F = \bigcup_{i=1}^{p} \bigcup_{t=1}^{T} \Big\{\big(r_t + \max_{a_{t+1}} Q_{t+1}(o_{t+1}, a_{t+1}; \theta_{t+1}) - Q_t(o_t, a_t; \theta_t)\big)\, q_{ti}(o_t, a_t) : \theta_t, \theta_{t+1} \in \Theta\Big\}.$$
81. For t = T, T − 1, . . . , 0, set $\bar Q_t(O_t, A_t)$ as the projection of $E\big[R_t + \bar Q_{t+1}(O_{t+1}, A_t, \bar\pi_{t+1}) \mid O_t, A_t\big]$ on the space spanned by $q_t$ (recall $\bar Q_{t+1}(O_{t+1}, A_t, \bar\pi_{t+1})$ is defined as $\bar Q_{t+1}(O_{t+1}, A_t, a_{t+1})$ with $a_{t+1}$ replaced by $\bar\pi_{t+1}(O_{t+1}, A_t)$).
82. And set $\bar\pi_t(o_t, a_{t-1}) \in \arg\max_{a_t} \bar Q_t(o_t, a_t)$.
83. These projections are with respect to P, the distribution which generated the trajectories in the training set (the likelihood is in (1)).
84. Also assume that Θ is a closed subset of $\{x \in \mathbb R^p : \|x\| \le M_\Theta\}$ and that for all (t, i) the i-th component of the vector $q_t$ is pointwise bounded, $|q_{ti}| \le M_Q$ for $M_Q$ a constant.
85. Estimating each $Q_t$ in this way, for t = T, T − 1, . . . , 0, yields a policy that consistently achieves the optimal value.
86. Recall that $\Pi_Q$ is the class of policies $\{\pi_{0,\theta}, \dots, \pi_{T,\theta}\}$, where each $\pi_{t,\theta}(o_t, a_{t-1}) \in \arg\max_{a_t} Q_t(o_t, a_t; \theta)$ for some $Q_t \in \mathcal Q_t$.
87. Ideally Q-learning would provide a policy that achieves the highest value as compared to other policies in $\Pi_Q$ (as is the case with $\tilde\pi$).
88. The policy $\bar\pi$ may not produce maximal value; that is, $\int \big(V_{\tilde\pi}(o) - V_{\bar\pi}(o)\big)\, dF(o)$ need not be zero (see also the remark following Lemma 2).
89. Approximating the Q-function yields an optimal policy if the approximating class is sufficiently rich.
90. Define an infinite-training-sample version of $\mathrm{Err}_n$ as
$$\mathrm{Err}_{Q_{t+1}}(Q_t) = E\Big[\Big(R_t + \max_{a_{t+1}} Q_{t+1}(O_{t+1}, A_t, a_{t+1}) - Q_t\Big)\, q_t\Big]
= E\Big[\Big(\bar Q_t + \max_{a_{t+1}} Q_{t+1}(O_{t+1}, A_t, a_{t+1}) - \bar Q_{t+1}(O_{t+1}, A_t, \bar\pi_{t+1}) - Q_t\Big)\, q_t\Big],$$
where $Q_t$ is an abbreviation for $Q_t(O_t, A_t)$.
91. To derive the last equality, recall that $\bar Q_t(O_t, A_t)$ is the projection of $E\big[R_t + \bar Q_{t+1}(O_{t+1}, A_t, \bar\pi_{t+1}) \mid O_t, A_t\big]$ on the space spanned by $q_t$.
92. Since $\bar Q_t$ is a projection, we can write $\bar Q_t = \theta_{\pi,t}^T q_t$ for some $\theta_{\pi,t} \in \Theta$.
93. Rearranging the terms in $\mathrm{Err}_{Q_{t+1}}$, and using the fact that $E[q_t q_t^T]$ is invertible, we obtain
$$\theta_{\pi,t} - \theta_t = \big(E[q_t q_t^T]\big)^{-1}\, \mathrm{Err}_{Q_{t+1}}(Q_t) - \big(E[q_t q_t^T]\big)^{-1}\, E\Big[\Big(\max_{a_{t+1}} Q_{t+1}(O_{t+1}, A_t, a_{t+1}) - \bar Q_{t+1}(O_{t+1}, A_t, \bar\pi_{t+1})\Big)\, q_t\Big].$$
94. Then
$$\big|(\theta_{\pi,t} - \theta_t)^T q_t\big| \le (1/\eta)\, \big\|\mathrm{Err}_{Q_{t+1}}(Q_t)\big\|\, \|q_t\| + (1/\eta)\, E\Big[\big|\max_{a_{t+1}} Q_{t+1}(O_{t+1}, A_t, a_{t+1}) - \bar Q_{t+1}(O_{t+1}, A_t, \bar\pi_{t+1})\big|\, \|q_t\|\Big]\, \|q_t\|$$
$$\le (1/\eta)\, \big\|\mathrm{Err}_{Q_{t+1}}(Q_t)\big\|\, \|q_t\| + (1/\eta)\, L\, E\big[\,|Q_{t+1} - \bar Q_{t+1}|\, \|q_t\|\,\big]\, \|q_t\|$$
$$\le (1/\eta)\, \sqrt{p}\, M_Q\, \big\|\mathrm{Err}_{Q_{t+1}}(Q_t)\big\| + (1/\eta)\, L\, p\, M_Q^2\, E\big[\,|Q_{t+1} - \bar Q_{t+1}|\,\big]$$
for t ≤ T.
95. For each component j and each t = 0, . . . , T, the probability that, for some $\theta_t, \theta_{t+1} \in \Theta$ (with $Q_t \in \mathcal Q_t$, $Q_{t+1} \in \mathcal Q_{t+1}$),
$$\big|\mathrm{Err}_{n,Q_{t+1}}(Q_t)_j - \mathrm{Err}_{Q_{t+1}}(Q_t)_j\big| > \varepsilon$$
is bounded above by
$$4\, N_1\Big(\frac{\varepsilon}{16 M}, \mathcal F, 2n\Big)\, \exp\Big(\frac{-n\, \varepsilon^2}{32\, M^2}\Big).$$
96. Discussion. Planning problems involving a single training set of trajectories are not unusual and can be expected to increase due to the widespread use of policies in the social and behavioral/medical sciences (see, for example, Rush et al.).
97. These training sets are collected under fixed exploration policies; thus, while they allow exploration, they do not allow exploitation, that is, online choice of the actions.
98. However, the mismatch between Q-learning and the goal of learning a policy that maximizes the value function has serious consequences and emphasizes the need to use all available science in choosing the approximation space.
99. Recall that the distributions P and P_π differ only with regard to the policy (see (1) and (2)).
100. The presence of the p_j's in the denominator below represents the price we pay because we only have access to training trajectories with distribution P; we do not have access to trajectories from distribution P_π.
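Sentences 99 and 100 describe the change of measure from P to P_π. As a hedged illustration of where the exploration probabilities p_t end up in the denominator, here is a sketch of per-trajectory importance weighting for estimating the value of a deterministic policy pi from trajectories collected under the exploration policy (my own construction; the data layout and the explore_prob interface are assumptions, and this is the standard importance-sampling identity rather than the paper's specific estimator).

    def importance_weighted_value(trajectories, pi, explore_prob, T):
        """Estimate E_pi[sum_t R_t] from trajectories gathered under the exploration policy.

        explore_prob(t, o_hist, a_hist, a): the probability p_t(a | history) that generated the data.
        pi(t, o_hist, a_hist): the (deterministic) policy being evaluated.
        """
        total = 0.0
        for traj in trajectories:
            weight, ret = 1.0, 0.0
            for t in range(T + 1):
                o_hist = traj['obs'][:t + 1]
                a_hist = traj['act'][:t + 1]
                a_t = traj['act'][t]
                # Indicator that pi would have chosen the observed action, divided by p_t;
                # the p_t in the denominator is the "price" mentioned in sentence 100.
                chosen = float(pi(t, o_hist, a_hist[:-1]) == a_t)
                weight *= chosen / explore_prob(t, o_hist, a_hist[:-1], a_t)
                ret += traj['rew'][t]
            total += weight * ret
        return total / len(trajectories)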
Most similar papers (sentence-level similarity):
1. (0.059) jmlr-2005-33: Fast Kernel Classifiers with Online and Active Learning (Antoine Bordes, Seyda Ertekin, Jason Weston, Léon Bottou)
2. (0.047) jmlr-2005-17: Change Point Problems in Linear Dynamical Systems (Onno Zoeter, Tom Heskes)
3. (0.040) jmlr-2005-68: Tree-Based Batch Mode Reinforcement Learning (Damien Ernst, Pierre Geurts, Louis Wehenkel)
4. (0.039) jmlr-2005-27: Dimension Reduction in Text Classification with Support Vector Machines (Hyunsoo Kim, Peg Howland, Haesun Park)
5. (0.033) jmlr-2005-61: Prioritization Methods for Accelerating MDP Solvers
6. (0.014) jmlr-2005-12: An MDP-Based Recommender System
7. (0.012) jmlr-2005-57: Multiclass Boosting for Weak Classifiers
8. (0.012) jmlr-2005-55: Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection
9. (0.010) jmlr-2005-45: Learning Multiple Tasks with Kernel Methods
10. (0.009) jmlr-2005-10: Adaptive Online Prediction by Following the Perturbed Leader
11. (0.008) jmlr-2005-3: A Classification Framework for Anomaly Detection
12. (0.008) jmlr-2005-40: Inner Product Spaces for Bayesian Networks
13. (0.008) jmlr-2005-18: Characterization of a Family of Algorithms for Generalized Discriminant Analysis on Undersampled Problems
14. (0.007) jmlr-2005-67: Stability of Randomized Learning Algorithms
15. (0.007) jmlr-2005-37: Generalization Bounds and Complexities Based on Sparsity and Clustering for Convex Combinations of Functions from Random Classes
16. (0.007) jmlr-2005-26: Diffusion Kernels on Statistical Manifolds
17. (0.006) jmlr-2005-14: Assessing Approximate Inference for Binary Gaussian Process Classification
18. (0.006) jmlr-2005-34: Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
19. (0.005) jmlr-2005-66: Smooth ε-Insensitive Regression by Loss Symmetrization
A second sentence-similarity ranking returns largely the same papers (33, 27, 17, 68, 61), followed by: jmlr-2005-50 Learning with Decision Lists of Data-Dependent Features; jmlr-2005-3 A Classification Framework for Anomaly Detection; jmlr-2005-25 Denoising Source Separation; jmlr-2005-4 A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data; jmlr-2005-38 Generalization Bounds for the Area Under the ROC Curve; jmlr-2005-55 Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection; jmlr-2005-57 Multiclass Boosting for Weak Classifiers; jmlr-2005-6 A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs; jmlr-2005-30 Estimating Functions for Blind Separation When Sources Have Variance Dependencies; jmlr-2005-42 Large Margin Methods for Structured and Interdependent Output Variables; jmlr-2005-2 A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior; jmlr-2005-64 Semigroup Kernels on Measures; jmlr-2005-45 Learning Multiple Tasks with Kernel Methods.
simIndex simValue paperId paperTitle
1 0.82221615 29 jmlr-2005-Efficient Margin Maximizing with Boosting
Author: Gunnar Rätsch, Manfred K. Warmuth
Abstract: AdaBoost produces a linear combination of base hypotheses and predicts with the sign of this linear combination. The linear combination may be viewed as a hyperplane in feature space where the base hypotheses form the features. It has been observed that the generalization error of the algorithm continues to improve even after all examples are on the correct side of the current hyperplane. The improvement is attributed to the experimental observation that the distances (margins) of the examples to the separating hyperplane are increasing even after all examples are on the correct side. We introduce a new version of AdaBoost, called AdaBoost∗ , that explicitly maximizes the ν minimum margin of the examples up to a given precision. The algorithm incorporates a current estimate of the achievable margin into its calculation of the linear coefficients of the base hypotheses. The bound on the number of iterations needed by the new algorithms is the same as the number needed by a known version of AdaBoost that must have an explicit estimate of the achievable margin as a parameter. We also illustrate experimentally that our algorithm requires considerably fewer iterations than other algorithms that aim to maximize the margin.
same-paper 2 0.80552071 5 jmlr-2005-A Generalization Error for Q-Learning
Author: Susan A. Murphy
Abstract: Planning problems that involve learning a policy from a single training set of finite horizon trajectories arise in both social science and medical fields. We consider Q-learning with function approximation for this setting and derive an upper bound on the generalization error. This upper bound is in terms of quantities minimized by a Q-learning algorithm, the complexity of the approximation space and an approximation term due to the mismatch between Q-learning and the goal of learning a policy that maximizes the value function. Keywords: multistage decisions, dynamic programming, reinforcement learning, batch data
3 0.79459429 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines
Author: Fabio Aiolli, Alessandro Sperduti
Abstract: Winner-take-all multiclass classifiers are built on the top of a set of prototypes each representing one of the available classes. A pattern is then classified with the label associated to the most ‘similar’ prototype. Recent proposal of SVM extensions to multiclass can be considered instances of the same strategy with one prototype per class. The multi-prototype SVM proposed in this paper extends multiclass SVM to multiple prototypes per class. It allows to combine several vectors in a principled way to obtain large margin decision functions. For this problem, we give a compact constrained quadratic formulation and we propose a greedy optimization algorithm able to find locally optimal solutions for the non convex objective function. This algorithm proceeds by reducing the overall problem into a series of simpler convex problems. For the solution of these reduced problems an efficient optimization algorithm is proposed. A number of pattern selection strategies are then discussed to speed-up the optimization process. In addition, given the combinatorial nature of the overall problem, stochastic search strategies are suggested to escape from local minima which are not globally optimal. Finally, we report experiments on a number of datasets. The performance obtained using few simple linear prototypes is comparable to that obtained by state-of-the-art kernel-based methods but with a significant reduction (of one or two orders) in response time. Keywords: multiclass classification, multi-prototype support vector machines, kernel machines, stochastic search optimization, large margin classifiers
4 0.62629974 64 jmlr-2005-Semigroup Kernels on Measures
Author: Marco Cuturi, Kenji Fukumizu, Jean-Philippe Vert
Abstract: We present a family of positive definite kernels on measures, characterized by the fact that the value of the kernel between two measures is a function of their sum. These kernels can be used to derive kernels on structured objects, such as images and texts, by representing these objects as sets of components, such as pixels or words, or more generally as measures on the space of components. Several kernels studied in this work make use of common quantities defined on measures such as entropy or generalized variance to detect similarities. Given an a priori kernel on the space of components itself, the approach is further extended by restating the previous results in a more efficient and flexible framework using the “kernel trick”. Finally, a constructive approach to such positive definite kernels through an integral representation theorem is proved, before presenting experimental results on a benchmark experiment of handwritten digits classification to illustrate the validity of the approach. Keywords: kernels on measures, semigroup theory, Jensen divergence, generalized variance, reproducing kernel Hilbert space
5 0.41155738 57 jmlr-2005-Multiclass Boosting for Weak Classifiers
Author: Günther Eibl, Karl-Peter Pfeiffer
Abstract: AdaBoost.M2 is a boosting algorithm designed for multiclass problems with weak base classifiers. The algorithm is designed to minimize a very loose bound on the training error. We propose two alternative boosting algorithms which also minimize bounds on performance measures. These performance measures are not as strongly connected to the expected error as the training error, but the derived bounds are tighter than the bound on the training error of AdaBoost.M2. In experiments the methods have roughly the same performance in minimizing the training and test error rates. The new algorithms have the advantage that the base classifier should minimize the confidence-rated error, whereas for AdaBoost.M2 the base classifier should minimize the pseudo-loss. This makes them more easily applicable to already existing base classifiers. The new algorithms also tend to converge faster than AdaBoost.M2. Keywords: boosting, multiclass, ensemble, classification, decision stumps
6 0.32838458 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
7 0.32157141 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
8 0.32053876 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification
9 0.31759822 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
10 0.31587523 68 jmlr-2005-Tree-Based Batch Mode Reinforcement Learning
11 0.30859271 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors
12 0.30710068 11 jmlr-2005-Algorithmic Stability and Meta-Learning
13 0.30133274 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets
14 0.30065787 67 jmlr-2005-Stability of Randomized Learning Algorithms
15 0.29485202 54 jmlr-2005-Managing Diversity in Regression Ensembles
16 0.28349796 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
17 0.28249836 65 jmlr-2005-Separating a Real-Life Nonlinear Image Mixture
18 0.2821117 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
19 0.27820691 48 jmlr-2005-Learning the Kernel Function via Regularization
20 0.27015102 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning