jmlr jmlr2012 jmlr2012-58 knowledge-graph by maker-knowledge-mining

58 jmlr-2012-Linear Fitted-Q Iteration with Multiple Reward Functions


Source: pdf

Author: Daniel J. Lizotte, Michael Bowling, Susan A. Murphy

Abstract: We present a general and detailed development of an algorithm for finite-horizon fitted-Q iteration with an arbitrary number of reward signals and linear value function approximation using an arbitrary number of state features. This includes a detailed treatment of the 3-reward function case using triangulation primitives from computational geometry and a method for identifying globally dominated actions. We also present an example of how our methods can be used to construct a real-world decision aid by considering symptom reduction, weight gain, and quality of life in sequential treatments for schizophrenia. Finally, we discuss future directions in which to take this work that will further enable our methods to make a positive impact on the field of evidence-based clinical decision support. Keywords: reinforcement learning, dynamic programming, decision making, linear regression, preference elicitation

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 For the ith patient in the trial, we obtain a trajectory of observations and actions of the form $o_1^i, a_1^i, o_2^i, a_2^i, \ldots$ [sent-22, score-0.272]

2 Here, $a_t^i$ represents the action (treatment) at time t, and $o_t^i$ represents measurements made of patient i after action $a_{t-1}^i$ and before action $a_t^i$. [sent-29, score-0.431]

3 To analyze these data using reinforcement learning methods, we must define two functions $s_t(o_1, a_1, \ldots, o_t)$ [sent-31, score-0.596]

4 and $r_t(s_t, a_t, o_{t+1})$, which map the patient’s current history to a state representation and a scalar reward signal, respectively. [sent-34, score-0.379]

5 That is, given the value st of the current state, the distribution of next state St+1 and current reward Rt is conditionally independent of s j , a j , r j for all j < t. [sent-42, score-0.796]

6 Clearly this can be achieved by including past history in st , but this may not be feasible. [sent-43, score-0.554]

7 The interplay between predictiveness, learnability, and interpretability makes the definition of st a challenging problem that requires a great deal of further investigation, particularly into the consequences of a non-Markov definition of state. [sent-46, score-0.554]

8 However, the question of how st should be defined can be answered at least in part by the data themselves together with expert knowledge and feature/model selection techniques analogous to those used in supervised learning settings (Keller et al.). [sent-47, score-0.554]

9 There are many reasonable reward functions one could define, since each patient record includes a multi-dimensional measurement of that patient’s overall well-being. [sent-52, score-0.286]

10 These different dimensions are typically better addressed by some treatments than by others, and therefore the choice of which dimension to use as the reward will affect the resulting learned policy. [sent-54, score-0.305]

11 In clinical practice, doctors, patients, and families decide on a treatment by weighing different measures of well-being, like symptoms and side-effects, according to subjective preferences that are not known to us at the time of data analysis. [sent-57, score-0.289]

12 Continuing our example, these preferences may lean more toward symptom reduction or side-effect reduction, depending on the individual decision makers involved, and an appropriate reward function definition should reflect these preferences. [sent-58, score-0.391]

13 Moreover, this approach is problematic because it does not give a complete picture of the quality of the available actions under different reward choices. [sent-60, score-0.41]

14 Indeed, the decision maker will not know when very small changes in preferences might lead to different actions, or when one action is optimal for a broad range of preferences, or when another action is not optimal for any preference. [sent-61, score-0.454]

15 Rather than eliciting preference and producing a policy that recommends a single action per state, our “inverse preference elicitation” approach is to first consider all of the actions available at the current state. [sent-62, score-0.627]

16 Furthermore, even if a preference is specified somehow, our methods allow the maker to immediately see if his or her preferences are near a “boundary”—that is, whether a small change in preference can lead to a different recommended action. [sent-65, score-0.348]

17 We are interested in efficient algorithms that can exactly compute the optimal policy for a range of reward functions to investigate how our choice of reward function influences the optimal policy, and in turn how we can offer more flexible choices among good actions. [sent-67, score-0.568]

18 (1998) demonstrated that an MDP with multiple reward signals can be well-defined and solved so long as we are given a fixed partial ordering on reward vectors. [sent-78, score-0.424]

19 Mannor and Shimkin (2004) offer a formalism where actions are chosen to ensure that the long-term average reward vector approaches a “target set”. [sent-79, score-0.41]

20 Natarajan and Tadepalli (2005) assume that a scalar reward function is constructed by taking a weighted sum of a reward vector, just as we will. [sent-81, score-0.424]

21 Early work in this direction (Barrett and Narayanan, 2008) explored the problem of simultaneously computing optimal policies for a class of reward functions over a small, finite state space in a framework where the model is known. [sent-85, score-0.285]

22 In this finite-horizon sequential decision making setting, the optimal policy in general depends on t (Bertsekas and Tsitsiklis, 1996), so we explicitly maintain separate rt , Qt , and Vt functions for each timepoint, and we define QT ≡ rT . [sent-101, score-0.295]

23 We consider sets of MDPs that all have the same St , At , and state transition dynamics, but whose expected reward functions rt (st , at , δ) have an additional parameter δ. [sent-104, score-0.379]

24 Each fixed δ identifies a single MDP by fixing a reward function, which has a corresponding optimal state-action value function defined in the usual way via the Bellman equation: $Q_t(s_t, a_t, \delta) = r_t(s_t, a_t, \delta) + \mathrm{E}_{S_{t+1} \mid s_t, a_t}[\max_a Q_{t+1}(S_{t+1}, a, \delta)]$. [sent-106, score-0.349]

25 Thus $r_t(s_t, a_t, \delta)$ is a convex combination of the basis rewards $r_t^{[0]}, \ldots, r_t^{[D-1]}$. [sent-120, score-0.289]

26 For example, if δ[d] = 1 for some index d, then rt (st , at , δ) = rt[d] (st , at ), and the optimal policy πt (st , δ) will choose actions that optimize the expected sum of rewards as determined by rt[d] . [sent-125, score-0.559]

27 Furthermore, we do not attempt to recover the reward function of any particular agent; we will instead attempt to learn the optimal policy for a set of reward functions. [sent-149, score-0.548]

28 In this setting, the range of possible reward functions can be indexed by a single scalar δ = δ[1] by defining rt (st , at , δ) = (1 − δ) · rt[0] (st , at ) + δ · rt[1] (st , at ). [sent-197, score-0.349]
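As a concrete illustration of this scalarization, here is a minimal Python sketch (names and numbers are illustrative, not from the paper) that forms the convex combination of basis rewards for a general D, reducing to (1 − δ) · r[0] + δ · r[1] when D = 2.

```python
import numpy as np

def scalarized_reward(basis_rewards, delta):
    """Convex combination of basis rewards.

    basis_rewards: shape (D,) holding r[0](s_t, a_t), ..., r[D-1](s_t, a_t)
    delta:         shape (D,) of nonnegative weights summing to 1
    For D = 2 this reduces to (1 - d) * r[0] + d * r[1] with d = delta[1].
    """
    basis_rewards = np.asarray(basis_rewards, dtype=float)
    delta = np.asarray(delta, dtype=float)
    assert delta.shape == basis_rewards.shape and np.isclose(delta.sum(), 1.0)
    return float(delta @ basis_rewards)

# D = 2 example: symptom reward 0.7, weight-gain reward 0.4, preference delta = 0.25
r = scalarized_reward([0.7, 0.4], [0.75, 0.25])   # 0.75*0.7 + 0.25*0.4 = 0.625
```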

29 Our representation also allows identification of the set of dominated actions, that is, the actions that are not optimal for any (st , δ) pair. [sent-229, score-0.265]

30 The Q-function for the last timepoint is equal to the terminal expected reward function rT , which is linear in δ for each state-action pair as defined in (2), so we have QT (sT , aT , δ) = (1 − δ) · rT [0] (sT , aT ) + δ · rT [1] (sT , aT ). [sent-237, score-0.27]

31 In general, we can recover the actions that are optimal on any interval [δ, δ′] by finding the upper-right convex hull of the points {(QT(sT, a, δ), QT(sT, a, δ′)) : a ∈ {1, ..., |A|}}. [sent-251, score-0.265]

32 Commonly-used convex hull routines produce output that is ordered, so it is easy to recover the list of actions that are optimal for some δ, along with the values of δ where the optimal action changes. [sent-258, score-0.43]

33 To recover the policy at this point, we may also retain a list of lists containing the actions that are optimal at each knot: [[1] [1 2] [2 4] [4]]. [sent-289, score-0.348]

34 This allows us to determine the action or actions that are optimal for any segment by taking the intersection of the action lists for the endpoints of the segment. [sent-290, score-0.456]
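The sketch below illustrates this kind of computation, assuming each Q_T(s_T, a, ·) is a line on [0, 1] determined by its values at δ = 0 and δ = 1. It brute-forces candidate knots from pairwise crossings rather than calling a convex hull routine, so treat it as an illustration of the idea, not the paper's algorithm; all names are made up.

```python
import numpy as np
from itertools import combinations

def value_knots(q0, q1, tol=1e-9):
    """Upper envelope of Q_a(d) = (1-d)*q0[a] + d*q1[a] over d in [0, 1].

    Returns (knots, values, opt_sets): sorted candidate breakpoints, the
    envelope value at each, and the set of actions optimal there.
    """
    q0, q1 = np.asarray(q0, float), np.asarray(q1, float)
    A = len(q0)
    cand = {0.0, 1.0}
    # crossing of two lines: (1-d)q0[a] + d q1[a] = (1-d)q0[b] + d q1[b]
    for a, b in combinations(range(A), 2):
        denom = (q1[b] - q1[a]) - (q0[b] - q0[a])
        if abs(denom) > tol:
            d = (q0[a] - q0[b]) / denom
            if 0.0 < d < 1.0:
                cand.add(float(d))
    knots = sorted(cand)
    values, opt_sets = [], []
    for d in knots:
        q = (1 - d) * q0 + d * q1
        values.append(float(q.max()))
        opt_sets.append({a for a in range(A) if q[a] >= q.max() - tol})
    return knots, values, opt_sets

def optimal_on_segment(opt_sets, k):
    """Actions optimal on the whole segment between knot k and knot k+1
    (the intersection of the endpoint action lists, as in the text)."""
    return opt_sets[k] & opt_sets[k + 1]

# Example: 3 actions with Q(0) = [1.0, 0.2, 0.6] and Q(1) = [0.1, 0.9, 0.6];
# actions 0, 2, 1 are optimal on roughly [0, 0.44], [0.44, 0.57], [0.57, 1].
knots, vals, opt = value_knots([1.0, 0.2, 0.6], [0.1, 0.9, 0.6])
```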

35 In practice, we can avoid running the convex hull algorithm over every interval by checking each interval’s end points: if for some action at∗ we find that Qt (st , at∗ , ·) is maximal at both ends of an interval in the current knot-list, then maxa Qt (st , a, ·) has no knots inside the interval. [sent-351, score-0.301]

36 Note that though we present our algorithms assuming the reward functions are linear, they will work for piecewise-linear reward functions as well. [sent-352, score-0.424]

37 Therefore, we can use the convex hull method to identify the actions that maximize value at a given sT, and use it to recover the knots in the piecewise-linear $\hat{V}_T(s_T, \cdot)$. [sent-428, score-0.351]

38 For example, a vertical slice at ψsT = −4 of the value function has three linear pieces where actions 10, 1, and 7 are optimal. [sent-436, score-0.268]

39 As before, we can also retain a list of the optimal actions at each of the knots in order to later recover the policy. [sent-463, score-0.35]

40 Thus to compute $\hat{V}_t(s_t^i, \cdot)$ for each $s_t^i$ in our data set, we apply Algorithm 2 for each $s_t^i$; using regression to compute $\hat{Q}_{t-1}(s_{t-1}, a_{t-1}, \cdot)$ from these functions then proceeds as we did for the $t = T - 1$ case. [sent-469, score-0.554]
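For intuition, the sketch below shows an ordinary finite-horizon fitted-Q backup with linear regression for one fixed preference δ (ridge-regularized least squares; all names are illustrative). The paper's contribution is to carry the entire piecewise-linear dependence on δ through this regression exactly, rather than fixing δ as done here.

```python
import numpy as np

def fitted_q_backup(Phi_t, R_t, V_next, ridge=1e-6):
    """One finite-horizon fitted-Q backup for a single fixed preference delta.

    Phi_t:  (N, p) design matrix of state-action features phi(s_t^i, a_t^i)
    R_t:    (N,) scalarized rewards r_t(s_t^i, a_t^i, delta)
    V_next: (N,) values max_a Qhat_{t+1}(s_{t+1}^i, a, delta)
            (all zeros at the terminal timepoint, where Q_T = r_T)
    Returns beta_t such that Qhat_t(s, a, delta) is approximately phi(s, a) @ beta_t.
    """
    y = R_t + V_next                                      # regression targets
    G = Phi_t.T @ Phi_t + ridge * np.eye(Phi_t.shape[1])  # ridge-regularized normal equations
    return np.linalg.solve(G, Phi_t.T @ y)
```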

41 The knots are the union of the knots of the $\hat{V}_T$; computing $\hat{V}_{T-1}(s^i_{T-1}, \cdot)$ using Algorithm 2 adds at most $|A| - 1$ new knots between each pair of knots in the union, for a total of $(|A| - 1)(N \cdot (|A| - 1) + 1) + 2$ knots. [sent-495, score-0.424]
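Plugging hypothetical sizes into this bound gives a feel for how the knot count grows; the numbers below are illustrative only.

```python
# Illustrative only: hypothetical sizes plugged into the knot bound from the text.
A, N = 4, 100                                # 4 actions, 100 training trajectories (made up)
bound = (A - 1) * (N * (A - 1) + 1) + 2      # (|A|-1)(N(|A|-1)+1) + 2
print(bound)                                 # 905
```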

42 Dominated Actions: The ability to compute $\hat{Q}$ and $\hat{V}$ for all preferences achieves our goal of informing the decision maker about the quality of available actions under different preferences, and of informing the decision maker about how the recommended policy changes with preference. [sent-685, score-0.547]

43 In addition to achieving these primary goals, our representations of $\hat{Q}$ and $\hat{V}$ allow us to compute whether or not an action is dominated (not optimal for a given state no matter what the preference) and whether it is globally dominated (not optimal for any state-preference pair). [sent-686, score-0.32]

44 The notion of domination arises in POMDP planning as well, where certain actions may not be optimal for any belief state, but the notion of global domination has no direct analog in the POMDP setting since it is a property of additional observed state s that is not part of a typical POMDP. [sent-687, score-0.271]

45 The concept of domination is central to the field of multi-criterion optimization (Ehrgott, 2005), and is important in the medical decision making setting because it identifies treatment actions that are not appropriate no matter what the decision maker’s preference is. [sent-688, score-0.449]

46 One may also consider actions that are dominated for all patient states in a population of interest, that is, the actions that are globally dominated. [sent-689, score-0.554]

47 Knowing this set of actions would be useful for developing a formulary—a list of treatments that should generally be made available to a patient population. [sent-690, score-0.358]

48 The general problem of analytically identifying the set of globally dominated actions is difficult, as we will illustrate, but we first provide solutions for low-dimensional problems and discuss the identification of globally dominated actions in higher dimensional settings. [sent-691, score-0.564]

49 A point $(s_t, a_t, \delta)$ where $\hat{Q}_t(s_t, a_t, \delta) = \hat{V}_t(s_t, \delta)$ is a certificate that action at is not dominated at state st, and that action at is therefore not globally dominated. [sent-693, score-0.906]

50 We showed above how to exactly represent the Qt (st , at , ·) and Vt (st , ·) functions for all st and at for D = 2 when the state space is finite, by representing them as lists of knots (vertices) and knot-values (vertex-values). [sent-697, score-0.69]

51 In addition to storing this information, we may also store the set of actions that are optimal at each knot, that is, we may store A ∗ (δ) = {at∗ : Qt (st , at∗ , δ) = Vt (st , δ)} for each δ in the knot list of Vt (st , ·). [sent-698, score-0.266]

52 Thus the set of actions that have a non-domination certificate at state st is given by $\bigcup_{k=1}^{|\Delta(\hat{V}_t(s_t,\cdot))|} A^*(\delta_k)$, [sent-704, score-0.782]

53 and any actions not in the above union are dominated at st . [sent-708, score-0.799]
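A small sketch of this bookkeeping, assuming the per-knot optimal-action sets (such as the opt_sets list produced by the envelope sketch earlier) are already available; names are illustrative.

```python
def nondominated_at_state(opt_sets):
    """Union of the optimal-action sets A*(delta_k) over all knots of Vhat_t(s_t, .)."""
    out = set()
    for s in opt_sets:
        out |= s
    return out

def globally_dominated(all_actions, opt_sets_by_state):
    """Actions never optimal at any (state, preference) pair, over a finite set of states."""
    covered = set()
    for opt_sets in opt_sets_by_state.values():
        covered |= nondominated_at_state(opt_sets)
    return set(all_actions) - covered
```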

54 It also allows us to find every globally dominated action by computing the above union at each finite state, taking the union of those sets, and identifying actions not present in the union. [sent-710, score-0.401]

55 Regression Case, D = 2 Basis Rewards, One State Feature: We now show how to identify all of the globally dominated actions in the linear function approximation setting. [sent-712, score-0.282]

56 We can then define $\hat{\beta}_{ta}(\delta)$ to be the $2 \times 1$ sub-vector of $\hat{\beta}_t(\delta)$ that aligns with the sub-vector of $\phi_{s_t,a_t}$ that is non-zero for $a_t = a$. [sent-715, score-0.554]
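Under the assumption that the state-action feature vector stacks one [1, ψ_s] block per action (selected by an action indicator), extracting this sub-vector for a fixed δ might look like the following illustrative snippet; the feature layout is an assumption, not taken from the paper.

```python
import numpy as np

def action_block(beta_t, a, block_size=2):
    """Coefficients of Qhat_t that multiply the features of action a (for a fixed delta).

    Assumes phi(s, a) places a [1, psi_s] block in the a-th position of a
    block-structured feature vector and zeros elsewhere, so beta_t is a
    concatenation of per-action 2 x 1 blocks (intercept + the single state feature).
    """
    beta_t = np.asarray(beta_t, dtype=float)
    return beta_t[a * block_size:(a + 1) * block_size]
```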

57 To find the globally dominated actions, we will search for certificates of non-domination in (ψsT , δ) space and identify the actions that do not have a certificate. [sent-717, score-0.282]

58 This procedure reveals all actions that are optimal for some (ψsT , δ), and thereby identifies any actions that are not optimal for any (ψsT , δ). [sent-734, score-0.436]

59 We have described the process of identifying globally dominated actions for bilinear functions; we can immediately extend this algorithm for piecewise bilinear functions by applying it between pairs of knots. [sent-744, score-0.306]

60 (2010) conjectured that an analogue of Lemma 5 holds in higher dimensions, and that identifying all globally dominated actions for more state variables and/or reward functions would require computing intersections of surfaces in higher dimensions. [sent-747, score-0.524]

61 We refine this conjecture, and we propose a solution method for finding globally dominated actions for $\hat{Q}_T(\psi_{s_T}, a, \delta) = [1\ \psi_{s_T}^{\mathsf{T}}]\, B_a\, \delta$. [sent-748, score-0.282]
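The snippet below evaluates this bilinear form and runs a coarse grid search for non-domination certificates. It is only a grid-level surrogate for the exact geometric procedure the paper proposes; the matrices, grids, and names are assumptions for illustration.

```python
import numpy as np
from itertools import product

def q_bilinear(psi, B_a, delta):
    """Qhat_T(psi, a, delta) = [1, psi^T] @ B_a @ delta."""
    x = np.concatenate(([1.0], np.atleast_1d(psi).astype(float)))
    return float(x @ B_a @ np.asarray(delta, dtype=float))

def grid_certified_actions(B, psi_grid, delta_grid, tol=1e-9):
    """Actions holding a non-domination certificate at some grid point.

    B maps each action to its (1 + len(psi)) x D matrix B_a; delta_grid should
    contain preference vectors on the simplex.  The complement of the returned
    set is only an approximate stand-in for the globally dominated actions --
    the paper computes that set exactly with computational-geometry primitives.
    """
    certified = set()
    for psi, delta in product(psi_grid, delta_grid):
        q = {a: q_bilinear(psi, B_a, delta) for a, B_a in B.items()}
        best = max(q.values())
        certified |= {a for a, v in q.items() if v >= best - tol}
    return certified
```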

62 Again, we have described the process of identifying globally dominated actions for functions linear in δ; we can immediately extend this algorithm to piecewiselinear functions by applying it within linear regions. [sent-771, score-0.282]

63 In the medical decision making field, however, preference elicitation is usually done at the population level and used to produce generic clinical guidelines, rather than to make recommendations tailored to individual patients. [sent-786, score-0.358]

64 Rather than trying to determine which of the uncountable number of possible preferences a user might have, we present, for each available action, the set of preferences for which that action is optimal. [sent-794, score-0.333]

65 We call this approach “inverse preference elicitation” because rather than eliciting a preference and recommending a treatment, we can easily and intuitively show for each treatment the set of preferences that are consistent with its recommendation. [sent-796, score-0.367]

66 In previous work, we used batch off-policy reinforcement learning to analyze data from this study using a single reward function (Shortreed et al.). [sent-810, score-0.254]

67 Since having larger PANSS is worse, for our first basis reward r[0] we use 100 minus the percentile of a patient’s PANSS at the end of their time in the study. [sent-819, score-0.276]

68 Since in this population having a larger BMI is worse, for our second basis reward r[1] we use 100 minus the percentile of a patient’s BMI at the end of their time in the study. [sent-825, score-0.276]

69 Since having a higher HQLS is better, for our third basis reward r[2] we use the percentile of a patient’s HQLS at the end of their time in the study. [sent-831, score-0.276]
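A sketch of how these three basis rewards might be computed from end-of-study measurements, assuming a simple rank-based percentile; the exact percentile convention used in the study is not specified here, so treat this as illustrative.

```python
import numpy as np
from scipy.stats import rankdata

def percentile_scores(x):
    """Rank-based percentile (0-100] of each value within the study population."""
    x = np.asarray(x, dtype=float)
    return 100.0 * rankdata(x) / len(x)

def basis_rewards(panss, bmi, hqls):
    """Stack r[0], r[1], r[2] as in the text: lower PANSS and BMI are better, higher HQLS is better."""
    r0 = 100.0 - percentile_scores(panss)   # symptom reduction
    r1 = 100.0 - percentile_scores(bmi)     # weight gain
    r2 = percentile_scores(hqls)            # quality of life
    return np.stack([r0, r1, r2], axis=1)   # one row of basis rewards per patient
```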

70 In Figures 8, 9 and 10 we will present plots of the piecewise-linear value function $\hat{V}_t(s_t, \cdot)$ for t = 1, 2 and for various representative values of st. [sent-835, score-0.578]

71 Thus from our plots one can see both the learned value and the learned policy as a function of δ, which enables us to easily see for each action the range of preferences for which that action looks best. [sent-837, score-0.515]

72 For all three states shown in the plot, the learned policy indicates that clozapine is the best action for a reward based only on PANSS (i.e., for δ = 0). [sent-864, score-0.516]

73 But not-clozapine (olanzapine or risperidone or quetiapine) is best for a reward based only on BMI (i.e., for δ = 1). [sent-866, score-0.279]

74 We see that, except for those with a strong preference for controlling BMI, clozapine appears to be the best choice among patients who found their Phase 1 treatment to be ineffective at controlling symptoms. [sent-870, score-0.267]

75 When we display value functions and learned policies in our examples, we set all of these indicators to 0 since they are not needed by the learned policy to select actions in the future. [sent-877, score-0.391]

76 Figure 8: Multiple rewards analysis showing learned value function and associated learned policy for Phase 2 Efficacy patients. [sent-882, score-0.27]

77 For a reward based only on PANSS (i.e., for δ = 0), the learned policy indicates that olanzapine is the best action for those with high or moderate incoming PANSS, and that risperidone is best for those with lower incoming PANSS. [sent-888, score-0.399]

78 Figure 9: Multiple rewards analysis showing learned value function and associated learned policy for Phase 2 Tolerability patients. [sent-910, score-0.27]

79 For all three states shown in the plot, the learned policy indicates that olanzapine is the best action for a reward based only on PANSS (i.e., for δ = 0). [sent-917, score-0.544]

80 We use the methods developed above to compute the value functions, which map a state st and a three-element preference vector δ to an estimated value. [sent-931, score-0.687]

81 Figure 10: Multiple rewards analysis showing learned value function and associated learned policy for Phase 1 patients. [sent-939, score-0.27]

82 In all examples, we show the policy for PANSS percentile st = 50. [sent-942, score-0.696]

83 Figure 11 shows the learned policy for patients with s2 = 50 whose Phase 1 treatment was not efficacious. [sent-947, score-0.253]

84 We see that clozapine appears best if the reward is based only on PANSS or on HQLS, and “not-clozapine” appears best only if the preference assigns a relatively large weight to BMI. [sent-949, score-0.384]

85 We also see that clozapine appears best for all preferences that consider only PANSS and HQLS (bottom edge) and for most preferences that consider only HQLS and BMI (upper-right edge). [sent-951, score-0.262]

86 Figure 11: Multiple rewards analysis using PANSS, BMI, and HQLS, showing learned policy for Phase 2 Efficacy patients with s2 = 50. [sent-957, score-0.299]

87 In this analysis it is clear that neither action is globally dominated because neither is dominated at state s2 = 50. [sent-960, score-0.28]

88 Here, we see that olanzapine appears best if the reward is based only on PANSS or on HQLS, and ziprasidone appears best if the preference assigns a relatively large weight to BMI. [sent-962, score-0.469]

89 Over much of preference space, these horizontal lines are completely contained within one treatment’s optimal region, meaning that given a weight for BMI, the policy usually does not depend on the relative preference for PANSS versus HQLS. [sent-967, score-0.351]

90 Figure 12: Multiple rewards analysis using PANSS, BMI, and HQLS, showing learned policy for Phase 2 Tolerability patients with s2 = 50. [sent-970, score-0.299]

91 One can think of the preference as setting an “exchange rate” for r[0] and r[1] : in this case, the basis rewards can be exchanged at a one-to-one rate and our happiness with the overall result of an action remains the same. [sent-994, score-0.348]

92 Figure 14: Multiple rewards analysis showing learned value function and associated learned policy for Phase 2 Tolerability patients, using 50·BMI as one basis reward. [sent-1008, score-0.296]

93 Note how scaling the BMI reward affects the regions where different actions are optimal. [sent-1009, score-0.41]

94 At δ = 0.5, the exchange rate would then be two percentiles of PANSS equals one percentile of BMI, and the preference at which the policy changes from one action to the other in Figure 9, for example, would shift to the left. [sent-1015, score-0.384]
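A stylized two-action example of this effect, with made-up Q-values: the preference δ* at which the two lines cross moves sharply toward 0 when the BMI basis reward is multiplied by 50.

```python
def crossover(p_a, q_a, p_b, q_b):
    """Preference d in [0, 1] where (1-d)*p_a + d*q_a = (1-d)*p_b + d*q_b."""
    return (p_a - p_b) / ((p_a - p_b) + (q_b - q_a))

# Hypothetical Q-values at delta = 0 (PANSS basis) and delta = 1 (BMI basis):
p_a, q_a = 60.0, 40.0        # action a: better on the PANSS basis reward
p_b, q_b = 50.0, 55.0        # action b: better on the BMI basis reward
print(crossover(p_a, q_a, p_b, q_b))                   # 0.4
scale = 50.0                                           # replace the BMI basis reward by 50*BMI
print(crossover(p_a, scale * q_a, p_b, scale * q_b))   # ~0.013: the switch point moves far left
```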

95 However, it also supports the use of the exact algorithms we have presented: even if the rewards are poorly scaled, the set of non-dominated actions remains the same, and they retain their ordering according to delta. [sent-1023, score-0.298]

96 In practice, it may make more sense to recommend more than one action for a particular preference if the Q-values of those actions are very similar. [sent-1046, score-0.42]

97 Practical Significance: Even if we have strong statistical evidence that one action has a higher Q-value than another, we may still wish to recommend a set of actions if that difference is too small to be practically meaningful. [sent-1057, score-0.317]

98 This included a detailed treatment of the 3-reward function case using triangulation primitives from computational geometry and a method for identifying globally dominated actions under linear function approximation. [sent-1067, score-0.393]

99 Since a is not optimal at a point where p + D + 1 actions are optimal, the region where a is optimal must have on its boundary points where k actions are simultaneously optimal for some k < p + D + 1. [sent-1097, score-0.478]

100 Efficient reinforcement learning with multiple reward functions for randomized controlled trial analysis. [sent-1316, score-0.254]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('st', 0.554), ('qt', 0.422), ('panss', 0.219), ('reward', 0.212), ('vt', 0.209), ('actions', 0.198), ('bmi', 0.147), ('rt', 0.137), ('action', 0.119), ('preferences', 0.107), ('knots', 0.106), ('policy', 0.104), ('preference', 0.103), ('rewards', 0.1), ('itted', 0.1), ('izotte', 0.1), ('owling', 0.1), ('urphy', 0.1), ('hqls', 0.095), ('teration', 0.085), ('eward', 0.085), ('phase', 0.076), ('clinical', 0.076), ('olanzapine', 0.076), ('patient', 0.074), ('unctions', 0.071), ('pieces', 0.07), ('risperidone', 0.067), ('ultiple', 0.066), ('pomdp', 0.062), ('patients', 0.062), ('treatments', 0.06), ('elicitation', 0.057), ('quetiapine', 0.057), ('ziprasidone', 0.057), ('treatment', 0.054), ('catie', 0.052), ('symptoms', 0.052), ('schizophrenia', 0.051), ('inear', 0.05), ('clozapine', 0.048), ('dominated', 0.047), ('certi', 0.045), ('polytopes', 0.044), ('barrett', 0.043), ('olan', 0.043), ('reinforcement', 0.042), ('si', 0.039), ('percentile', 0.038), ('pineau', 0.038), ('quet', 0.038), ('risp', 0.038), ('shortreed', 0.038), ('symptom', 0.038), ('globally', 0.037), ('lizotte', 0.037), ('triangulation', 0.037), ('maker', 0.035), ('terminal', 0.034), ('decision', 0.034), ('tolerability', 0.033), ('learned', 0.033), ('narayanan', 0.032), ('life', 0.032), ('state', 0.03), ('maxa', 0.029), ('antipsychotic', 0.029), ('backups', 0.029), ('laber', 0.029), ('fmax', 0.028), ('basis', 0.026), ('list', 0.026), ('vertices', 0.026), ('convex', 0.026), ('medical', 0.026), ('mdp', 0.025), ('simplices', 0.024), ('canadian', 0.024), ('backup', 0.024), ('cate', 0.024), ('piecewise', 0.024), ('timepoint', 0.024), ('policies', 0.023), ('belief', 0.023), ('cacy', 0.023), ('boundary', 0.022), ('knot', 0.022), ('tradeoffs', 0.022), ('hull', 0.021), ('weight', 0.021), ('optimal', 0.02), ('percentiles', 0.02), ('yt', 0.02), ('piece', 0.02), ('geometry', 0.02), ('allison', 0.019), ('cates', 0.019), ('drugs', 0.019), ('fsum', 0.019), ('hase', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000024 58 jmlr-2012-Linear Fitted-Q Iteration with Multiple Reward Functions

Author: Daniel J. Lizotte, Michael Bowling, Susan A. Murphy

Abstract: We present a general and detailed development of an algorithm for finite-horizon fitted-Q iteration with an arbitrary number of reward signals and linear value function approximation using an arbitrary number of state features. This includes a detailed treatment of the 3-reward function case using triangulation primitives from computational geometry and a method for identifying globally dominated actions. We also present an example of how our methods can be used to construct a real-world decision aid by considering symptom reduction, weight gain, and quality of life in sequential treatments for schizophrenia. Finally, we discuss future directions in which to take this work that will further enable our methods to make a positive impact on the field of evidence-based clinical decision support. Keywords: reinforcement learning, dynamic programming, decision making, linear regression, preference elicitation

2 0.21602033 84 jmlr-2012-Online Submodular Minimization

Author: Elad Hazan, Satyen Kale

Abstract: We consider an online decision problem over a discrete space in which the loss function is submodular. We give algorithms which are computationally efficient and are Hannan-consistent in both the full information and partial feedback settings. Keywords: submodular optimization, online learning, regret minimization

3 0.13423945 33 jmlr-2012-Distance Metric Learning with Eigenvalue Optimization

Author: Yiming Ying, Peng Li

Abstract: The main theme of this paper is to develop a novel eigenvalue optimization framework for learning a Mahalanobis metric. Within this context, we introduce a novel metric learning approach called DML-eig which is shown to be equivalent to a well-known eigenvalue optimization problem called minimizing the maximal eigenvalue of a symmetric matrix (Overton, 1988; Lewis and Overton, 1996). Moreover, we formulate LMNN (Weinberger et al., 2005), one of the state-of-the-art metric learning methods, as a similar eigenvalue optimization problem. This novel framework not only provides new insights into metric learning but also opens new avenues to the design of efficient metric learning algorithms. Indeed, first-order algorithms are developed for DML-eig and LMNN which only need the computation of the largest eigenvector of a matrix per iteration. Their convergence characteristics are rigorously established. Various experiments on benchmark data sets show the competitive performance of our new approaches. In addition, we report an encouraging result on a difficult and challenging face verification data set called Labeled Faces in the Wild (LFW). Keywords: metric learning, convex optimization, semi-definite programming, first-order methods, eigenvalue optimization, matrix factorization, face verification

4 0.11486877 41 jmlr-2012-Exploration in Relational Domains for Model-based Reinforcement Learning

Author: Tobias Lang, Marc Toussaint, Kristian Kersting

Abstract: A fundamental problem in reinforcement learning is balancing exploration and exploitation. We address this problem in the context of model-based reinforcement learning in large stochastic relational domains by developing relational extensions of the concepts of the E 3 and R- MAX algorithms. Efficient exploration in exponentially large state spaces needs to exploit the generalization of the learned model: what in a propositional setting would be considered a novel situation and worth exploration may in the relational setting be a well-known context in which exploitation is promising. To address this we introduce relational count functions which generalize the classical notion of state and action visitation counts. We provide guarantees on the exploration efficiency of our framework using count functions under the assumption that we had a relational KWIK learner and a near-optimal planner. We propose a concrete exploration algorithm which integrates a practically efficient probabilistic rule learner and a relational planner (for which there are no guarantees, however) and employs the contexts of learned relational rules as features to model the novelty of states and actions. Our results in noisy 3D simulated robot manipulation problems and in domains of the international planning competition demonstrate that our approach is more effective than existing propositional and factored exploration techniques. Keywords: reinforcement learning, statistical relational learning, exploration, relational transition models, robotics

5 0.10934602 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems

Author: Benedict C. May, Nathan Korda, Anthony Lee, David S. Leslie

Abstract: In sequential decision problems in an unknown environment, the decision maker often faces a dilemma over whether to explore to discover more about the environment, or to exploit current knowledge. We address the exploration-exploitation dilemma in a general setting encompassing both standard and contextualised bandit problems. The contextual bandit problem has recently resurfaced in attempts to maximise click-through rates in web based applications, a task with significant commercial interest. In this article we consider an approach of Thompson (1933) which makes use of samples from the posterior distributions for the instantaneous value of each action. We extend the approach by introducing a new algorithm, Optimistic Bayesian Sampling (OBS), in which the probability of playing an action increases with the uncertainty in the estimate of the action value. This results in better directed exploratory behaviour. We prove that, under unrestrictive assumptions, both approaches result in optimal behaviour with respect to the average reward criterion of Yang and Zhu (2002). We implement OBS and measure its performance in simulated Bernoulli bandit and linear regression domains, and also when tested with the task of personalised news article recommendation on a Yahoo! Front Page Today Module data set. We find that OBS performs competitively when compared to recently proposed benchmark algorithms and outperforms Thompson’s method throughout. Keywords: multi-armed bandits, contextual bandits, exploration-exploitation, sequential allocation, Thompson sampling

6 0.096758373 34 jmlr-2012-Dynamic Policy Programming

7 0.090976998 116 jmlr-2012-Transfer in Reinforcement Learning via Shared Features

8 0.067196116 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers

9 0.066405654 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning

10 0.064557225 5 jmlr-2012-A Local Spectral Method for Graphs: With Applications to Improving Graph Partitions and Exploring Data Graphs Locally

11 0.058255997 50 jmlr-2012-Human Gesture Recognition on Product Manifolds

12 0.055882592 43 jmlr-2012-Fast Approximation of Matrix Coherence and Statistical Leverage

13 0.055133659 97 jmlr-2012-Regularization Techniques for Learning with Matrices

14 0.050260246 46 jmlr-2012-Finite-Sample Analysis of Least-Squares Policy Iteration

15 0.050173651 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization

16 0.044122051 32 jmlr-2012-Discriminative Hierarchical Part-based Models for Human Parsing and Action Recognition

17 0.043285031 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting

18 0.040483985 17 jmlr-2012-An Active Learning Algorithm for Ranking from Pairwise Preferences with an Almost Optimal Query Complexity

19 0.03800549 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks

20 0.0336859 42 jmlr-2012-Facilitating Score and Causal Inference Trees for Large Observational Studies


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.169), (1, -0.24), (2, 0.08), (3, -0.203), (4, -0.052), (5, -0.14), (6, -0.251), (7, -0.072), (8, 0.059), (9, 0.109), (10, -0.087), (11, -0.021), (12, 0.026), (13, -0.112), (14, 0.054), (15, -0.071), (16, -0.076), (17, 0.182), (18, 0.092), (19, -0.121), (20, 0.111), (21, 0.035), (22, -0.238), (23, -0.138), (24, 0.06), (25, 0.036), (26, 0.061), (27, -0.047), (28, -0.054), (29, -0.204), (30, -0.007), (31, 0.006), (32, 0.018), (33, 0.004), (34, -0.005), (35, -0.078), (36, -0.095), (37, -0.013), (38, -0.003), (39, -0.142), (40, 0.016), (41, 0.057), (42, 0.05), (43, 0.159), (44, 0.006), (45, 0.07), (46, 0.067), (47, 0.039), (48, 0.076), (49, -0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98737842 58 jmlr-2012-Linear Fitted-Q Iteration with Multiple Reward Functions

Author: Daniel J. Lizotte, Michael Bowling, Susan A. Murphy

Abstract: We present a general and detailed development of an algorithm for finite-horizon fitted-Q iteration with an arbitrary number of reward signals and linear value function approximation using an arbitrary number of state features. This includes a detailed treatment of the 3-reward function case using triangulation primitives from computational geometry and a method for identifying globally dominated actions. We also present an example of how our methods can be used to construct a realworld decision aid by considering symptom reduction, weight gain, and quality of life in sequential treatments for schizophrenia. Finally, we discuss future directions in which to take this work that will further enable our methods to make a positive impact on the field of evidence-based clinical decision support. Keywords: reinforcement learning, dynamic programming, decision making, linear regression, preference elicitation

2 0.54612803 41 jmlr-2012-Exploration in Relational Domains for Model-based Reinforcement Learning

Author: Tobias Lang, Marc Toussaint, Kristian Kersting

Abstract: A fundamental problem in reinforcement learning is balancing exploration and exploitation. We address this problem in the context of model-based reinforcement learning in large stochastic relational domains by developing relational extensions of the concepts of the E 3 and R- MAX algorithms. Efficient exploration in exponentially large state spaces needs to exploit the generalization of the learned model: what in a propositional setting would be considered a novel situation and worth exploration may in the relational setting be a well-known context in which exploitation is promising. To address this we introduce relational count functions which generalize the classical notion of state and action visitation counts. We provide guarantees on the exploration efficiency of our framework using count functions under the assumption that we had a relational KWIK learner and a near-optimal planner. We propose a concrete exploration algorithm which integrates a practically efficient probabilistic rule learner and a relational planner (for which there are no guarantees, however) and employs the contexts of learned relational rules as features to model the novelty of states and actions. Our results in noisy 3D simulated robot manipulation problems and in domains of the international planning competition demonstrate that our approach is more effective than existing propositional and factored exploration techniques. Keywords: reinforcement learning, statistical relational learning, exploration, relational transition models, robotics

3 0.4806115 84 jmlr-2012-Online Submodular Minimization

Author: Elad Hazan, Satyen Kale

Abstract: We consider an online decision problem over a discrete space in which the loss function is submodular. We give algorithms which are computationally efficient and are Hannan-consistent in both the full information and partial feedback settings. Keywords: submodular optimization, online learning, regret minimization

4 0.46259177 33 jmlr-2012-Distance Metric Learning with Eigenvalue Optimization

Author: Yiming Ying, Peng Li

Abstract: The main theme of this paper is to develop a novel eigenvalue optimization framework for learning a Mahalanobis metric. Within this context, we introduce a novel metric learning approach called DML-eig which is shown to be equivalent to a well-known eigenvalue optimization problem called minimizing the maximal eigenvalue of a symmetric matrix (Overton, 1988; Lewis and Overton, 1996). Moreover, we formulate LMNN (Weinberger et al., 2005), one of the state-of-the-art metric learning methods, as a similar eigenvalue optimization problem. This novel framework not only provides new insights into metric learning but also opens new avenues to the design of efficient metric learning algorithms. Indeed, first-order algorithms are developed for DML-eig and LMNN which only need the computation of the largest eigenvector of a matrix per iteration. Their convergence characteristics are rigorously established. Various experiments on benchmark data sets show the competitive performance of our new approaches. In addition, we report an encouraging result on a difficult and challenging face verification data set called Labeled Faces in the Wild (LFW). Keywords: metric learning, convex optimization, semi-definite programming, first-order methods, eigenvalue optimization, matrix factorization, face verification

5 0.43134326 5 jmlr-2012-A Local Spectral Method for Graphs: With Applications to Improving Graph Partitions and Exploring Data Graphs Locally

Author: Michael W. Mahoney, Lorenzo Orecchia, Nisheeth K. Vishnoi

Abstract: The second eigenvalue of the Laplacian matrix and its associated eigenvector are fundamental features of an undirected graph, and as such they have found widespread use in scientific computing, machine learning, and data analysis. In many applications, however, graphs that arise have several local regions of interest, and the second eigenvector will typically fail to provide information fine-tuned to each local region. In this paper, we introduce a locally-biased analogue of the second eigenvector, and we demonstrate its usefulness at highlighting local properties of data graphs in a semi-supervised manner. To do so, we first view the second eigenvector as the solution to a constrained optimization problem, and we incorporate the local information as an additional constraint; we then characterize the optimal solution to this new problem and show that it can be interpreted as a generalization of a Personalized PageRank vector; and finally, as a consequence, we show that the solution can be computed in nearly-linear time. In addition, we show that this locally-biased vector can be used to compute an approximation to the best partition near an input seed set in a manner analogous to the way in which the second eigenvector of the Laplacian can be used to obtain an approximation to the best partition in the entire input graph. Such a primitive is useful for identifying and refining clusters locally, as it allows us to focus on a local region of interest in a semi-supervised manner. Finally, we provide a detailed empirical evaluation of our method by showing how it can be applied to finding locally-biased sparse cuts around an input vertex seed set in social and information networks. Keywords: spectral graph partitioning, local spectral algorithms, Laplacian matrix, semi-supervised learning, personalized pagerank

6 0.4053573 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems

7 0.33931926 116 jmlr-2012-Transfer in Reinforcement Learning via Shared Features

8 0.33633769 34 jmlr-2012-Dynamic Policy Programming

9 0.21139219 43 jmlr-2012-Fast Approximation of Matrix Coherence and Statistical Leverage

10 0.19524406 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks

11 0.17584449 32 jmlr-2012-Discriminative Hierarchical Part-based Models for Human Parsing and Action Recognition

12 0.16804686 50 jmlr-2012-Human Gesture Recognition on Product Manifolds

13 0.16167457 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning

14 0.15930137 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning

15 0.15612371 93 jmlr-2012-Quantum Set Intersection and its Application to Associative Memory

16 0.15603544 110 jmlr-2012-Static Prediction Games for Adversarial Learning Problems

17 0.15467083 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality

18 0.15364975 46 jmlr-2012-Finite-Sample Analysis of Least-Squares Policy Iteration

19 0.14940222 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization

20 0.14443263 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.014), (21, 0.024), (26, 0.025), (27, 0.014), (29, 0.036), (35, 0.018), (56, 0.015), (57, 0.037), (64, 0.013), (69, 0.018), (75, 0.032), (77, 0.02), (79, 0.469), (92, 0.083), (96, 0.078)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81922179 58 jmlr-2012-Linear Fitted-Q Iteration with Multiple Reward Functions

Author: Daniel J. Lizotte, Michael Bowling, Susan A. Murphy

Abstract: We present a general and detailed development of an algorithm for finite-horizon fitted-Q iteration with an arbitrary number of reward signals and linear value function approximation using an arbitrary number of state features. This includes a detailed treatment of the 3-reward function case using triangulation primitives from computational geometry and a method for identifying globally dominated actions. We also present an example of how our methods can be used to construct a real-world decision aid by considering symptom reduction, weight gain, and quality of life in sequential treatments for schizophrenia. Finally, we discuss future directions in which to take this work that will further enable our methods to make a positive impact on the field of evidence-based clinical decision support. Keywords: reinforcement learning, dynamic programming, decision making, linear regression, preference elicitation

2 0.7801097 17 jmlr-2012-An Active Learning Algorithm for Ranking from Pairwise Preferences with an Almost Optimal Query Complexity

Author: Nir Ailon

Abstract: Given a set V of n elements we wish to linearly order them given pairwise preference labels which may be non-transitive (due to irrationality or arbitrary noise). The goal is to linearly order the elements while disagreeing with as few pairwise preference labels as possible. Our performance is measured by two parameters: The number of disagreements (loss) and the query complexity (number of pairwise preference labels). Our algorithm adaptively queries at most O(ε^{-6} n log^5 n) preference labels for a regret of ε times the optimal loss. As a function of n, this is asymptotically better than standard (non-adaptive) learning bounds achievable for the same problem. Our main result takes us a step closer toward settling an open problem posed by learning-to-rank (from pairwise information) theoreticians and practitioners: What is a provably correct way to sample preference labels? To further show the power and practicality of our solution, we analyze a typical test case in which a large margin linear relaxation is used for efficiently solving the simpler learning problems in our decomposition. Keywords: statistical learning theory, active learning, ranking, pairwise ranking, preferences

3 0.72685075 48 jmlr-2012-High-Dimensional Gaussian Graphical Model Selection: Walk Summability and Local Separation Criterion

Author: Animashree Anandkumar, Vincent Y.F. Tan, Furong Huang, Alan S. Willsky

Abstract: We consider the problem of high-dimensional Gaussian graphical model selection. We identify a set of graphs for which an efficient estimation algorithm exists, and this algorithm is based on thresholding of empirical conditional covariances. Under a set of transparent conditions, we establish structural consistency (or sparsistency) for the proposed algorithm, when the number of samples n = Ω(J_min^{-2} log p), where p is the number of variables and J_min is the minimum (absolute) edge potential of the graphical model. The sufficient conditions for sparsistency are based on the notion of walk-summability of the model and the presence of sparse local vertex separators in the underlying graph. We also derive novel non-asymptotic necessary conditions on the number of samples required for sparsistency. Keywords: Gaussian graphical model selection, high-dimensional learning, local-separation property, walk-summability, necessary conditions for model selection

4 0.34247881 84 jmlr-2012-Online Submodular Minimization

Author: Elad Hazan, Satyen Kale

Abstract: We consider an online decision problem over a discrete space in which the loss function is submodular. We give algorithms which are computationally efficient and are Hannan-consistent in both the full information and partial feedback settings. Keywords: submodular optimization, online learning, regret minimization

5 0.34193373 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers

Author: Ofer Dekel, Claudio Gentile, Karthik Sridharan

Abstract: We present a new online learning algorithm in the selective sampling framework, where labels must be actively queried before they are revealed. We prove bounds on the regret of our algorithm and on the number of labels it queries when faced with an adaptive adversarial strategy of generating the instances. Our bounds both generalize and strictly improve over previous bounds in similar settings. Additionally, our selective sampling algorithm can be converted into an efficient statistical active learning algorithm. We extend our algorithm and analysis to the multiple-teacher setting, where the algorithm can choose which subset of teachers to query for each label. Finally, we demonstrate the effectiveness of our techniques on a real-world Internet search problem. Keywords: online learning, regret, label-efficient, crowdsourcing

6 0.32642609 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting

7 0.32108575 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models

8 0.31545821 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection

9 0.31381583 70 jmlr-2012-Multi-Assignment Clustering for Boolean Data

10 0.31345531 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems

11 0.31305093 5 jmlr-2012-A Local Spectral Method for Graphs: With Applications to Improving Graph Partitions and Exploring Data Graphs Locally

12 0.3130371 103 jmlr-2012-Sampling Methods for the Nyström Method

13 0.31287572 15 jmlr-2012-Algebraic Geometric Comparison of Probability Distributions

14 0.31039384 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning

15 0.30989614 114 jmlr-2012-Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies

16 0.30918682 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches

17 0.30799103 80 jmlr-2012-On Ranking and Generalization Bounds

18 0.30521435 42 jmlr-2012-Facilitating Score and Causal Inference Trees for Large Observational Studies

19 0.30420738 68 jmlr-2012-Minimax Manifold Estimation

20 0.30225068 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class