nips nips2008 nips2008-223 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Daniel Acuna, Paul R. Schrater
Abstract: We use graphical models and structure learning to explore how people learn policies in sequential decision making tasks. Studies of sequential decision-making in humans frequently find suboptimal performance relative to an ideal actor that knows the graph model that generates reward in the environment. We argue that the learning problem humans face also involves learning the graph structure for reward generation in the environment. We formulate the structure learning problem using mixtures of reward models, and solve the optimal action selection problem using Bayesian Reinforcement Learning. We show that structure learning in one and two armed bandit problems produces many of the qualitative behaviors deemed suboptimal in previous studies. Our argument is supported by the results of experiments that demonstrate humans rapidly learn and exploit new reward structure. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Studies of sequential decision-making in humans frequently find suboptimal performance relative to an ideal actor that knows the graph model that generates reward in the environment. [sent-8, score-0.63]
2 We argue that the learning problem humans face also involves learning the graph structure for reward generation in the environment. [sent-9, score-0.662]
3 We formulate the structure learning problem using mixtures of reward models, and solve the optimal action selection problem using Bayesian Reinforcement Learning. [sent-10, score-0.741]
4 We show that structure learning in one and two armed bandit problems produces many of the qualitative behaviors deemed suboptimal in previous studies. [sent-11, score-0.481]
5 Our argument is supported by the results of experiments that demonstrate humans rapidly learn and exploit new reward structure. [sent-12, score-0.515]
6 Using model-based (Bayesian) Reinforcement learning [1], it is possible to solve this problem optimally by finding policies that maximize the expected discounted future reward [2]. [sent-15, score-0.522]
7 For tasks simple enough to allow comparison between human behavior and normative theory, like the multi-armed bandit problem, human choices appear suboptimal. [sent-17, score-0.739]
8 Moreover, in one-armed bandit tasks where exploration is not necessary, people frequently converge to probability matching [7, 8, 9, 10], rather than the better option, even when subjects are aware which option is best [11]. [sent-19, score-0.607]
9 When this assumption is built into normative predictions, these models account much better for human choices in one-armed bandit problems [12], and potentially multi-armed problems [13]. [sent-22, score-0.586]
10 In these analyses, learning is assumed to involve only estimating parameters (e.g. reward probabilities) within a known model (a fixed causal graph structure) that encodes the relations between environmental states, rewards, and actions created by the experimenter. [sent-26, score-0.662]
11 Does reward accrue when the machine is not played for a while? [sent-33, score-0.453]
12 In this work, we assess the effect of uncertainty between two critical reward structures in terms of the need to explore. [sent-35, score-0.559]
13 The first structure is a one-arm bandit problem in which exploration is not necessary (reward generation is coupled across arms); greedy action is optimal [14]. [sent-36, score-0.871]
14 The second structure is a two-arm bandit problem in which exploration is necessary (reward generation is independent at each arm); each action must balance the exploration/exploitation tradeoff [15]. [sent-37, score-0.67]
15 We illustrate how structure learning affects action selection and the value of information gathering in a simple sequential choice task resembling a Multi-armed Bandit (MAB), but with uncertainty between the two previous models of reward coupling. [sent-38, score-0.934]
16 We develop a normative model of learning and action for this class of problems, illustrate the effect of model uncertainty on action selection, and show evidence that people perform structure learning. [sent-39, score-0.57]
17 Consider an environment with several distinct reward sites that can be sampled, but where the way rewards are generated is unknown. [sent-41, score-0.728]
18 Even if the reward sites are independent, they may share the same reward probability if they are homogeneous. [sent-43, score-0.501]
19 Uncertainty about which reward model is correct naturally produces a mixture as the appropriate learning model. [sent-44, score-0.518]
20 This structure learning model is a special case of Bayesian Reinforcement Learning (BRL), where the states of the environment are the reward sites and the transitions between states are determined by the action of sampling a reward site. [sent-45, score-1.278]
21 Uncertainty about reward dynamics and contingencies can be modeled by including within the belief state not only reward probabilities, but also the possibility of independent or coupled rewards. [sent-46, score-1.387]
22 Then, the optimal balance of exploration and exploitation in BRL results in action selection that seeks to maximize (1) expected rewards, (2) information about reward dynamics, and (3) information about task structure. [sent-47, score-0.631]
23 Given that the tasks tested in this research involve mixtures of Multi-Armed Bandit (MAB) problems, we borrow MAB language, calling a reward site an arm and a sample a choice or pull. [sent-48, score-0.498]
24 Let γ (0 < γ < 1) be a discounting factor such that a possibly stochastic reward x obtained t time steps in the future means γ^t x today. [sent-51, score-0.453]
25 Optimality requires an action selection policy that maximizes the expectation over the total discounted future reward, E_b[ x + γ x + γ^2 x + ... ]. [sent-52, score-0.599]
26 After observing reward xa , we compute a belief state posterior bxa ≡ p(b|xa ) ∝ p(xa |b)p(b). [sent-57, score-1.055]
27 Let f(xa | b) ≡ ∫ db p(xa | b) p(b) be the predicted probability of reward xa given belief b. [sent-58, score-0.95]
28 Let r(b, a) ≡ ∑_{xa} xa f(xa | b) be the expected reward of sampling arm a at state b. [sent-59, score-1.172]
29 The value of a belief state satisfies the Bellman equation V(b) = max_a [ r(b, a) + γ ∑_{xa} f(xa | b) V(bxa) ] (1), and the optimal action can be recovered by choosing the arm a* = arg max_a [ r(b, a) + γ ∑_{xa} f(xa | b) V(bxa) ] (2). [sent-61, score-0.824]
30 The belief over dynamics is effectively a probability distribution over the possible Markov Decision Processes that would explain the observables. [sent-62, score-0.497]
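The recursion in equations (1) and (2) can be made concrete with a small finite-horizon sketch. The code below is our own illustration, not the authors' implementation: the discount factor, the horizon cut-off, and all function names are assumptions chosen for the example, and the arms are treated as the independent Beta-Bernoulli beliefs described later in the text.

```python
from functools import lru_cache

GAMMA = 0.9    # discount factor (an illustrative choice, not taken from the paper)
HORIZON = 12   # finite-horizon truncation of the infinite discounted sum in Eq. (1)

def q_value(belief, arm, steps):
    """Bracketed term of Eqs. (1)-(2): r(b, a) + gamma * sum_x f(x | b) V(b_x)."""
    a, b = belief[arm]
    p = a / (a + b)                     # f(x = 1 | b): predictive reward probability
    succ = tuple((a + 1, b) if i == arm else belief[i] for i in range(len(belief)))
    fail = tuple((a, b + 1) if i == arm else belief[i] for i in range(len(belief)))
    return p + GAMMA * (p * value(succ, steps - 1) + (1 - p) * value(fail, steps - 1))

@lru_cache(maxsize=None)
def value(belief, steps):
    """Finite-horizon approximation of V(b) in Eq. (1); belief = ((alpha, beta), ...)."""
    if steps == 0:
        return 0.0
    return max(q_value(belief, arm, steps) for arm in range(len(belief)))

def optimal_arm(belief):
    """Eq. (2): the arm maximizing immediate reward plus discounted future value."""
    return max(range(len(belief)), key=lambda arm: q_value(belief, arm, HORIZON))

if __name__ == "__main__":
    prior = ((1, 1), (1, 1))   # uniform Beta(1, 1) beliefs about two independent arms
    print(value(prior, HORIZON), optimal_arm(prior))
```

Truncating the horizon approximates the infinite discounted sum; increasing HORIZON tightens the approximation at the cost of more recursion.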
31 Figure 1: Different graphical models for the generation of rewards at two known sites in the environment: (a) 2-arm bandit with no coupling; (b) 1-arm bandit with reward coupling; (c) mixture of the generative models. [sent-65, score-1.428]
32 The agent faces M bandit tasks, each comprising a random number N of choices. (a) Reward sites are independent. [sent-66, score-0.591]
33 (b) Rewards are dependent within a bandit task. (c) Mixture of generative models used by the learning model. [sent-67, score-0.461]
34 The causes of reward may be independent or coupled. [sent-68, score-0.496]
35 In Figure 1, we show the two reward structures considered in this paper. [sent-70, score-0.453]
36 When independent, rewards xa at arm a are samples from an unknown distribution p(xa |θa ). [sent-72, score-0.793]
37 When coupled, rewards xa depend on a “hidden” state of reward x3 sampled from p(x3 |θ3 ). [sent-73, score-0.913]
38 The MAB problem is a special case of BRL because we can partition the belief b into a disjoint set of beliefs about each arm {ba }. [sent-78, score-0.595]
39 Because beliefs about non-sampled arms remain frozen until sampled again and sampling one arm doesn’t affect the belief about any other, independent learning and action selection for each arm is possible. [sent-79, score-1.407]
40 Let λa be the reward of a deterministic arm such that the two terms inside the maximization in V(ba) = max { λa /(1 − γ), r(ba , a) + γ ∑_{xa} f(xa | ba) V(bxa) } are equal. [sent-80, score-0.82]
41 Gittins [16] proved that it is optimal to choose the arm a with the highest such reward λa (called the Gittins Index). [sent-81, score-0.868]
42 This allows a speedup of computation by transforming a many-arm bandit problem into many 2-arm bandit problems. [sent-82, score-0.718]
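A hedged sketch of how a Gittins index could be approximated for a Beta-Bernoulli arm: bisect on the deterministic reward λ until retiring (λ/(1 − γ)) and continuing to sample the arm have equal value. The discount factor, horizon truncation, and tolerance below are our own choices, and this is only an approximation to the exact index of [16].

```python
from functools import lru_cache

GAMMA = 0.9   # discount factor (illustrative choice)
DEPTH = 30    # horizon truncation for the stopping problem

def gittins_index(alpha, beta, tol=1e-4):
    """Approximate Gittins index of a Beta(alpha, beta) Bernoulli arm."""
    def retire(lam):
        return lam / (1.0 - GAMMA)

    def cont_value(a, b, lam, depth):
        # Value of sampling the arm once more, with retirement still allowed later.
        if depth == 0:
            return retire(lam)
        p = a / (a + b)
        return (p * (1.0 + GAMMA * value(a + 1, b, lam, depth - 1))
                + (1.0 - p) * GAMMA * value(a, b + 1, lam, depth - 1))

    @lru_cache(maxsize=None)
    def value(a, b, lam, depth):
        # Optimal stopping: retire on the deterministic arm or keep sampling.
        return max(retire(lam), cont_value(a, b, lam, depth))

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        value.cache_clear()   # lam changed, so discard the memoized values
        if cont_value(alpha, beta, lam, DEPTH) > retire(lam):
            lo = lam          # continuing still preferred: the index is larger than lam
        else:
            hi = lam
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    # With a uniform prior the index exceeds the mean 0.5: an exploration bonus.
    print(gittins_index(1, 1), gittins_index(5, 1), gittins_index(1, 5))
```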
43 In our task, the belief about a binary reward may be represented by a Beta distribution with sufficient-statistic parameters α, β (both > 0) such that xa ∼ p(xa | θa) = θa^xa (1 − θa)^(1−xa), where θa ∼ p(θa ; αa , βa) ∝ θa^(αa−1) (1 − θa)^(βa−1). [sent-83, score-0.926]
44 Thus, the expected reward r(αa , βa , a) and predicted probability of reward f(xa = 1 | αa , βa) are both αa /(αa + βa). [sent-84, score-0.953]
45 The belief state transition is bxa = ⟨αa + xa , βa + 1 − xa⟩. [sent-85, score-0.897]
46 Therefore, the optimal action is to choose the arm a with the highest expected reward, r(α3 , β3 , a) = α3 /(α3 + β3) for a = 1 and β3 /(α3 + β3) for a = 2. The belief state transitions are b1 = ⟨α3 + x1 , β3 + 1 − x1⟩ and b2 = ⟨α3 + 1 − x2 , β3 + x2⟩. [sent-92, score-1.217]
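The bookkeeping for the two reward structures reduces to simple count updates. The sketch below follows the update rules just stated; the function and variable names are ours, and the dictionary/tuple representation is one possible choice.

```python
def update_independent(counts, arm, x):
    """Independent model: an outcome x in {0, 1} on `arm` updates only that arm's Beta counts."""
    alpha, beta = counts[arm]
    counts[arm] = (alpha + x, beta + (1 - x))
    return counts

def update_coupled(alpha3, beta3, arm, x):
    """Coupled model: both arms share the hidden arm 3.

    A success on arm 1 (or a failure on arm 2) counts as a success of arm 3;
    a failure on arm 1 (or a success on arm 2) counts as a failure of arm 3.
    """
    hidden_success = x if arm == 1 else 1 - x
    return alpha3 + hidden_success, beta3 + (1 - hidden_success)

def expected_reward_coupled(alpha3, beta3, arm):
    """r(alpha3, beta3, a): alpha3/(alpha3+beta3) for arm 1, beta3/(alpha3+beta3) for arm 2."""
    return alpha3 / (alpha3 + beta3) if arm == 1 else beta3 / (alpha3 + beta3)

if __name__ == "__main__":
    ind = {1: (1, 1), 2: (1, 1)}                 # independent arms, uniform priors
    ind = update_independent(ind, arm=1, x=1)    # success on arm 1
    a3, b3 = update_coupled(1, 1, arm=1, x=1)    # same observation under coupling
    print(ind, (a3, b3), expected_reward_coupled(a3, b3, 1), expected_reward_coupled(a3, b3, 2))
```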
47 3 Learning and acting with model uncertainty In this section, we consider the case where there is uncertainty about the reward model. [sent-93, score-0.581]
48 The agent’s belief is captured by a graphical model for a family of reward structures that may or may not be coupled. [sent-94, score-0.631]
49 The agent is presented with a block of M bandit tasks, each with initially unknown Bernoulli reward probabilities and coupling. [sent-97, score-0.943]
50 Figure 1(c) shows the mixture of two possible reward models shown in figure 1(a) and (b). [sent-99, score-0.522]
51 Node c switches the mixture between the two possible reward models and encodes part of the belief state of the process. [sent-100, score-0.758]
52 Because c is unknown, p(c = 0) is the mixing proportion for the independent reward structure and p(c = 1) is the mixing proportion for the coupled reward structure. [sent-102, score-1.405]
53 With mixing proportion φ, successes of arm 1 and failures of arm 2 are attributed to successes on the shared “hidden” arm 3, whereas failures of arm 1 and successes of arm 2 are attributed to failures of arm 3. [sent-106, score-2.81]
54 At the beginning of each bandit task, we assume the agent “resets” its belief about arms (si = fi = 0), but the posterior over p(c) is carried over and used as the prior on the next bandit task. [sent-108, score-1.181]
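Under the Beta-Bernoulli assumptions above, the posterior over the coupling node c can be updated sequentially from the predictive probability of each observation under the two component models, while the arm counts reset between tasks and p(c) is carried over. The sketch below is our illustration of that scheme, not the authors' code; the class name, the uniform priors, and the reset details are assumptions.

```python
def predictive(alpha, beta, x):
    """Predicted probability of outcome x in {0, 1} under a Beta(alpha, beta) belief."""
    p1 = alpha / (alpha + beta)
    return p1 if x == 1 else 1.0 - p1

class MixtureLearner:
    """Tracks p(c = coupled) across bandit tasks; arm beliefs reset each task."""

    def __init__(self, p_coupled=0.5):
        self.p_coupled = p_coupled
        self.reset_task()

    def reset_task(self):
        # si = fi = 0 at the start of each task (uniform Beta(1, 1) priors).
        self.ind = {1: [1, 1], 2: [1, 1]}   # independent-model counts per arm
        self.a3, self.b3 = 1, 1             # coupled-model counts for hidden arm 3

    def observe(self, arm, x):
        # Predictive probability of this outcome under each component model.
        p_ind = predictive(*self.ind[arm], x)
        p_cpl = predictive(self.a3, self.b3, x if arm == 1 else 1 - x)
        # Bayes rule on the mixture node c.
        num = self.p_coupled * p_cpl
        self.p_coupled = num / (num + (1.0 - self.p_coupled) * p_ind)
        # Update the sufficient statistics of both component models.
        self.ind[arm][0] += x
        self.ind[arm][1] += 1 - x
        s3 = x if arm == 1 else 1 - x
        self.a3 += s3
        self.b3 += 1 - s3

if __name__ == "__main__":
    learner = MixtureLearner()
    # Anti-correlated outcomes (arm 1 pays, arm 2 does not) push p(c) toward coupling.
    for _ in range(20):
        learner.observe(1, 1)
        learner.observe(2, 0)
    print(round(learner.p_coupled, 3))
    learner.reset_task()   # new bandit task: arm counts reset, p(c) carried over
    print(round(learner.p_coupled, 3))
```

The sequential update is exact here because the marginal likelihood of each component model factorizes into its one-step predictive probabilities.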
55 A block of four bandit tasks of 50 trials each for each environment. [sent-121, score-0.433]
56 Marginal beliefs on reward probabilities and coupling are shown as functions of time. [sent-122, score-0.701]
57 Note that how well the reward probabilities sum to one forms critical evidence for or against coupling. [sent-127, score-0.55]
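The point that the evidence for coupling hinges on how well the two reward probabilities sum to one can be made concrete with closed-form Beta-Bernoulli marginal likelihoods. The batch Bayes-factor computation below is our own illustration; the uniform priors and the toy counts are assumptions.

```python
from math import lgamma, exp

def log_beta_marginal(successes, failures, alpha=1.0, beta=1.0):
    """Log marginal likelihood of Bernoulli counts under a Beta(alpha, beta) prior."""
    def log_B(a, b):
        return lgamma(a) + lgamma(b) - lgamma(a + b)
    return log_B(alpha + successes, beta + failures) - log_B(alpha, beta)

def log_bayes_factor_coupled(s1, f1, s2, f2):
    """log p(data | coupled) - log p(data | independent) for two arms' counts.

    Under coupling, successes of arm 1 and failures of arm 2 are successes of
    the shared hidden arm 3, so the counts pool into (s1 + f2, f1 + s2).
    """
    coupled = log_beta_marginal(s1 + f2, f1 + s2)
    independent = log_beta_marginal(s1, f1) + log_beta_marginal(s2, f2)
    return coupled - independent

if __name__ == "__main__":
    # Empirical reward rates that sum to one (0.8 and 0.2): coupling is favoured.
    print(exp(log_bayes_factor_coupled(16, 4, 4, 16)))
    # Rates that do not sum to one (0.8 and 0.7): independence is favoured.
    print(exp(log_bayes_factor_coupled(16, 4, 14, 6)))
```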
58 4 Simulation Results In Figure 2, we perform simulations of learning on blocks of four bandit tasks, each comprising 50 trials. [sent-130, score-0.41]
59 The belief about the coupling parameter is important because it has a decisive influence on exploratory behavior. [sent-133, score-0.428]
60 Coupling between the two arms corresponds to the case where one arm is a winner and the other is a loser by experimenter design. [sent-134, score-0.638]
61 When playing coupled arms, evidence that an arm is “good” (e.g., a reward) is simultaneously evidence that the other arm is “bad”: playing either arm is informative about both, so there is no need to explore the apparently worse option. [sent-135, score-0.608]
62 An agent learning a coupling parameter while sampling arms can manifest a range of exploratory behaviors that depend critically on both the recent reward history and the current state of the belief about c, illustrated in figure 3. [sent-140, score-1.224]
63 The top row shows the value of both arms as a function of coupling belief p(c) after different amounts of evidence for the success of arm 2. [sent-141, score-0.971]
64 The plots show that optimal actions stick with the winner when belief in coupling is high, even for small amounts of data. [sent-142, score-0.498]
65 Thus belief in coupling produces underexploration compared to a model assuming independence, and generates behavior similar to a “win stay, lose switch” heuristic early in learning. [sent-143, score-0.397]
66 Figure 3 (lower left) shows that uncertainty about c provides an exploratory bonus to the lower probability arm which incentivizes switching, and hence overexploration. [sent-145, score-0.598]
67 Figure 3 (to the right) shows that p(c) together with the probability of the better arm determines the transition between exploration vs. exploitation. [sent-147, score-0.446]
68 These results show that optimal action selection with model uncertainty can generate several kinds of behavior typically labeled suboptimal in multi-armed bandit experiments. [sent-149, score-0.7]
69 Next we provide evidence that people are capable of learning and exploiting coupling–evidence that structure learning may play a role in apparent failures of humans to behave optimally in multi-armed bandit tasks. [sent-150, score-0.735]
70 The priors are uniform (αj = βj = 1, 1 ≤ j ≤ 3), the evidence for arm 1 remains fixed for all cases (s1 = 1, f1 = 0), and the successes of arm 2 remain fixed as well (s2 = 5). [sent-201, score-0.856]
71 Upper left: Belief that arms are coupled (p(c)) versus reward per unit time (V (1 − γ), where V is the value) of arm 1 (dashed line) and arm 2 (solid line). [sent-203, score-1.588]
72 In all cases, an independent model would choose arm 1 to pull. [sent-204, score-0.41]
73 Vertical line shows the critical coupling belief value where the structure learning model switches to exploitative behavior. [sent-205, score-0.54]
74 Right panel: Critical coupling belief values for exploitative behavior vs. [sent-207, score-0.435]
75 5 Human Experiments Each of 16 subjects ran 32 bandit tasks: a block of 16 in an independent environment and a block of 16 in a coupled environment. [sent-210, score-0.599]
76 Each arm is shown on the screen as a slot machine. [sent-214, score-0.43]
77 When pulled, an animation of the lever is shown; 200 msec later the reward appears on the machine's screen, and a sound mimicking dropping coins lasts for a time proportional to the amount gathered. [sent-216, score-0.491]
78 At the top, the machine shows the number of pulls, total reward, and average reward per pull so far. [sent-218, score-0.479]
79 The machine’s total reward is shown as a pile of coins underneath it. [sent-221, score-0.491]
80 For each agent, whether human or not, we compute the (empirical) probability that it selects the oracle-best action versus the optimal belief that a block of tasks is coupled. [sent-224, score-0.531]
81 The idea behind this measure is to show how the belief about task structure changes behavior, and which of the models better captures human behavior. [sent-225, score-0.462]
82 We run 1000 agents for each of three models: one with task uncertainty (optimal model), one that assumes a coupled reward task (coupled model), and one that assumes an independent reward task (independent model), under the same conditions that subjects faced on both the coupled and independent blocks of tasks. [sent-226, score-1.709]
83 For each of the decisions of these models and the 33904 decisions performed by the 16 subjects, we compute the optimal belief about coupling according to our model and bin the proportion of times the agent chooses the (oracle) best arm according to this belief. [sent-227, score-0.954]
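The analysis just described amounts to binning each decision by the model's optimal coupling belief and computing, per bin, the empirical probability of choosing the oracle-best arm. A minimal sketch, with a bin grid of our own choosing:

```python
from collections import defaultdict

def bin_best_choice_by_belief(decisions, n_bins=10):
    """decisions: iterable of (p_coupled, chose_best) pairs, with chose_best in {0, 1}.

    Returns, per coupling-belief bin, the empirical probability of picking the
    oracle-best arm.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for p_coupled, chose_best in decisions:
        b = min(int(p_coupled * n_bins), n_bins - 1)
        hits[b] += chose_best
        totals[b] += 1
    return {b / n_bins: hits[b] / totals[b] for b in sorted(totals)}

if __name__ == "__main__":
    # Toy data: an agent that exploits more as its belief in coupling grows.
    toy = [(0.1, 0), (0.15, 1), (0.4, 1), (0.45, 0), (0.8, 1), (0.85, 1), (0.9, 1)]
    print(bin_best_choice_by_belief(toy, n_bins=5))
```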
84 The independent model tends to perform equally well on both coupled and independent reward tasks. [sent-229, score-0.735]
85 The coupled model tends to perform well only in the coupled task and worse in the independent tasks. [sent-230, score-0.485]
86 For each of the decisions of subjects and simulated models under the same conditions, we compute the optimal belief on coupling according to the model proposed in this paper and bin the proportion of times an agent chooses the (oracle) best arm according to this belief. [sent-248, score-1.025]
87 This plot represents the empirical probability that an agent would pick the best arm at a given belief on coupling. [sent-249, score-0.649]
88 This is evidence that human behavior shares the characteristics of the optimal model; namely, it incorporates task uncertainty and exploits knowledge of the task structure to maximize its gains. [sent-251, score-0.497]
89 Because the subjects are not told the coupling state of the environment and the arms appear as separate options, we conclude that people are capable of learning and exploiting task structure. [sent-253, score-0.681]
90 The idea of treating sequential decision making under uncertainty as a structure learning problem naturally extends previous work on structure learning in Bayesian models of cognition [17, 18] and animal learning [19] to sequential decision making problems. [sent-256, score-0.56]
91 It also extends previous work on Bayesian approaches to modeling sequential decision making in the multi-armed bandit [20] by adding structure learning. [sent-257, score-0.551]
92 It is important to note that we have intentionally focused on reward structure, ignoring issues involving dependencies across trials. [sent-258, score-0.453]
93 Clearly reward structure learning must be integrated with learning about temporal dependencies [21]. [sent-259, score-0.571]
94 Although we focused on learning coupling between arms, there are other kinds of reward structure learning that may account for a broad variety of human decision making performance. [sent-260, score-0.875]
95 In particular, allowing dependence between the probability of reward at a site and previous actions can produce large changes in decision making behavior. [sent-261, score-0.595]
96 For instance, in a “foraging” model where reward is collected from a site and probabilistically replenished, optimal strategies will produce choice sequences that alternate between reward sites. [sent-262, score-1.005]
97 Thus uncertainty about the independence of reward from previous actions can produce a continuum of behavior, from maximization to probability matching. [sent-263, score-0.585]
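As a toy illustration of the foraging point, if the reward available at a site builds back up while it is left unvisited, even a greedy value-maximizer ends up alternating between sites rather than maximizing on one. The replenishment rule and rates below are hypothetical and purely illustrative.

```python
def simulate_foraging(rates=(0.3, 0.2), steps=12):
    """Greedy choices when each site's available reward replenishes while unvisited.

    `avail[i]` is the expected reward currently waiting at site i; harvesting
    collects it and resets it to zero, after which it grows back at rates[i] per step.
    """
    avail = [0.0, 0.0]
    choices = []
    for _ in range(steps):
        # Rewards accrue at every site on each step.
        avail = [a + r for a, r in zip(avail, rates)]
        site = max(range(len(avail)), key=avail.__getitem__)   # greedy harvest
        choices.append(site)
        avail[site] = 0.0
    return choices

if __name__ == "__main__":
    # The greedy policy alternates between sites instead of sticking to the richer one.
    print(simulate_foraging())
```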
98 Instead of explaining behavior in terms of the idiosyncrasies of a learning rule, structure learning constitutes a fully rational response to uncertainty about the causal structure of rewards in the environment. [sent-265, score-0.484]
99 A class of bandit problems yielding myopic optimal strategies. [sent-324, score-0.407]
100 Bayesian modeling of human sequential decision-making on the multi-armed bandit problem. [sent-358, score-0.539]
wordName wordTfidf (topN-words)
[('reward', 0.453), ('arm', 0.367), ('bandit', 0.359), ('xa', 0.295), ('arms', 0.205), ('coupled', 0.196), ('belief', 0.178), ('coupling', 0.176), ('rewards', 0.131), ('mab', 0.116), ('action', 0.114), ('beta', 0.105), ('failures', 0.098), ('bxa', 0.095), ('human', 0.093), ('sequential', 0.087), ('gittins', 0.083), ('agent', 0.08), ('successes', 0.077), ('uncertainty', 0.076), ('exploratory', 0.074), ('normative', 0.074), ('subjects', 0.071), ('structure', 0.07), ('environment', 0.068), ('humans', 0.062), ('bonus', 0.057), ('brl', 0.057), ('exploration', 0.055), ('people', 0.053), ('acting', 0.052), ('bellman', 0.051), ('ba', 0.051), ('site', 0.051), ('task', 0.05), ('beliefs', 0.05), ('sites', 0.048), ('optimal', 0.048), ('causal', 0.046), ('tasks', 0.045), ('evidence', 0.045), ('behavior', 0.043), ('independent', 0.043), ('reinforcement', 0.042), ('proportion', 0.042), ('mixed', 0.041), ('mixture', 0.041), ('winner', 0.039), ('coins', 0.038), ('exploitative', 0.038), ('schrater', 0.038), ('twin', 0.038), ('screen', 0.037), ('economic', 0.037), ('decision', 0.035), ('decisions', 0.035), ('state', 0.034), ('acuna', 0.033), ('actions', 0.032), ('selection', 0.032), ('choices', 0.032), ('bayesian', 0.031), ('switch', 0.031), ('nih', 0.03), ('cities', 0.03), ('contingencies', 0.03), ('minnesota', 0.03), ('paul', 0.03), ('critical', 0.03), ('psychology', 0.03), ('block', 0.029), ('generation', 0.029), ('xor', 0.028), ('models', 0.028), ('suboptimal', 0.028), ('frozen', 0.027), ('experimenter', 0.027), ('daw', 0.027), ('comprising', 0.027), ('banks', 0.026), ('slot', 0.026), ('pull', 0.026), ('pulls', 0.026), ('av', 0.026), ('exploitation', 0.025), ('reset', 0.025), ('stick', 0.025), ('states', 0.024), ('probability', 0.024), ('learning', 0.024), ('joshua', 0.024), ('daniel', 0.024), ('switches', 0.024), ('dynamic', 0.023), ('mark', 0.023), ('expected', 0.023), ('horizon', 0.023), ('uncertain', 0.023), ('probabilities', 0.022), ('maximize', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 223 nips-2008-Structure Learning in Human Sequential Decision-Making
Author: Daniel Acuna, Paul R. Schrater
Abstract: We use graphical models and structure learning to explore how people learn policies in sequential decision making tasks. Studies of sequential decision-making in humans frequently find suboptimal performance relative to an ideal actor that knows the graph model that generates reward in the environment. We argue that the learning problem humans face also involves learning the graph structure for reward generation in the environment. We formulate the structure learning problem using mixtures of reward models, and solve the optimal action selection problem using Bayesian Reinforcement Learning. We show that structure learning in one and two armed bandit problems produces many of the qualitative behaviors deemed suboptimal in previous studies. Our argument is supported by the results of experiments that demonstrate humans rapidly learn and exploit new reward structure. 1
2 0.50985593 140 nips-2008-Mortal Multi-Armed Bandits
Author: Deepayan Chakrabarti, Ravi Kumar, Filip Radlinski, Eli Upfal
Abstract: We formulate and study a new variant of the k-armed bandit problem, motivated by e-commerce applications. In our model, arms have (stochastic) lifetime after which they expire. In this setting an algorithm needs to continuously explore new arms, in contrast to the standard k-armed bandit model in which arms are available indefinitely and exploration is reduced once an optimal arm is identified with nearcertainty. The main motivation for our setting is online-advertising, where ads have limited lifetime due to, for example, the nature of their content and their campaign budgets. An algorithm needs to choose among a large collection of ads, more than can be fully explored within the typical ad lifetime. We present an optimal algorithm for the state-aware (deterministic reward function) case, and build on this technique to obtain an algorithm for the state-oblivious (stochastic reward function) case. Empirical studies on various reward distributions, including one derived from a real-world ad serving application, show that the proposed algorithms significantly outperform the standard multi-armed bandit approaches applied to these settings. 1
3 0.41241208 17 nips-2008-Algorithms for Infinitely Many-Armed Bandits
Author: Yizao Wang, Jean-yves Audibert, Rémi Munos
Abstract: We consider multi-armed bandit problems where the number of arms is larger than the possible number of experiments. We make a stochastic assumption on the mean-reward of a new selected arm which characterizes its probability of being a near-optimal arm. Our assumption is weaker than in previous works. We describe algorithms based on upper-confidence-bounds applied to a restricted set of randomly selected arms and provide upper-bounds on the resulting expected regret. We also derive a lower-bound which matches (up to a logarithmic factor) the upper-bound in some cases. 1
4 0.22991695 170 nips-2008-Online Optimization in X-Armed Bandits
Author: Sébastien Bubeck, Gilles Stoltz, Csaba Szepesvári, Rémi Munos
Abstract: We consider a generalization of stochastic bandit problems where the set of arms, X, is allowed to be a generic topological space. We constrain the mean-payoff function with a dissimilarity function over X in a way that is more general than Lipschitz. We construct an arm selection policy whose regret improves upon previous results for a large class of problems. In particular, our results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally Hölder with a known exponent, then the expected regret is bounded up to a logarithmic factor by √n, i.e., the rate of the growth of the regret is independent of the dimension of the space. Moreover, we prove the minimax optimality of our algorithm for the class of mean-payoff functions we consider. 1
5 0.15876053 230 nips-2008-Temporal Difference Based Actor Critic Learning - Convergence and Neural Implementation
Author: Dotan D. Castro, Dmitry Volkinshtein, Ron Meir
Abstract: Actor-critic algorithms for reinforcement learning are achieving renewed popularity due to their good convergence properties in situations where other approaches often fail (e.g., when function approximation is involved). Interestingly, there is growing evidence that actor-critic approaches based on phasic dopamine signals play a key role in biological learning through cortical and basal ganglia loops. We derive a temporal difference based actor critic learning algorithm, for which convergence can be proved without assuming widely separated time scales for the actor and the critic. The approach is demonstrated by applying it to networks of spiking neurons. The established relation between phasic dopamine and the temporal difference signal lends support to the biological relevance of such algorithms. 1
6 0.13776162 46 nips-2008-Characterizing response behavior in multisensory perception with conflicting cues
7 0.13748299 150 nips-2008-Near-optimal Regret Bounds for Reinforcement Learning
8 0.13711327 187 nips-2008-Psychiatry: Insights into depression through normative decision-making models
9 0.11493686 131 nips-2008-MDPs with Non-Deterministic Policies
10 0.11470558 40 nips-2008-Bounds on marginal probability distributions
11 0.11366725 121 nips-2008-Learning to Use Working Memory in Partially Observable Environments through Dopaminergic Reinforcement
12 0.10319617 222 nips-2008-Stress, noradrenaline, and realistic prediction of mouse behaviour using reinforcement learning
14 0.10088551 87 nips-2008-Fitted Q-iteration by Advantage Weighted Regression
15 0.096607931 141 nips-2008-Multi-Agent Filtering with Infinitely Nested Beliefs
16 0.094301984 206 nips-2008-Sequential effects: Superstition or rational behavior?
17 0.093636096 37 nips-2008-Biasing Approximate Dynamic Programming with a Lower Discount Factor
18 0.082810707 181 nips-2008-Policy Search for Motor Primitives in Robotics
19 0.081506878 94 nips-2008-Goal-directed decision making in prefrontal cortex: a computational framework
20 0.079824567 96 nips-2008-Hebbian Learning of Bayes Optimal Decisions
topicId topicWeight
[(0, -0.185), (1, 0.405), (2, -0.036), (3, -0.214), (4, -0.178), (5, -0.413), (6, -0.189), (7, 0.043), (8, 0.043), (9, 0.07), (10, 0.085), (11, -0.122), (12, -0.093), (13, -0.05), (14, -0.094), (15, 0.035), (16, 0.05), (17, -0.049), (18, 0.009), (19, 0.025), (20, 0.059), (21, -0.059), (22, -0.1), (23, 0.006), (24, 0.063), (25, -0.043), (26, 0.006), (27, -0.021), (28, -0.053), (29, 0.006), (30, -0.052), (31, -0.002), (32, -0.043), (33, -0.004), (34, -0.025), (35, -0.042), (36, 0.003), (37, -0.083), (38, -0.012), (39, 0.013), (40, 0.073), (41, 0.045), (42, 0.018), (43, -0.069), (44, 0.012), (45, 0.028), (46, 0.006), (47, 0.011), (48, 0.055), (49, -0.041)]
simIndex simValue paperId paperTitle
same-paper 1 0.96073413 223 nips-2008-Structure Learning in Human Sequential Decision-Making
Author: Daniel Acuna, Paul R. Schrater
Abstract: We use graphical models and structure learning to explore how people learn policies in sequential decision making tasks. Studies of sequential decision-making in humans frequently find suboptimal performance relative to an ideal actor that knows the graph model that generates reward in the environment. We argue that the learning problem humans face also involves learning the graph structure for reward generation in the environment. We formulate the structure learning problem using mixtures of reward models, and solve the optimal action selection problem using Bayesian Reinforcement Learning. We show that structure learning in one and two armed bandit problems produces many of the qualitative behaviors deemed suboptimal in previous studies. Our argument is supported by the results of experiments that demonstrate humans rapidly learn and exploit new reward structure. 1
2 0.89313018 140 nips-2008-Mortal Multi-Armed Bandits
Author: Deepayan Chakrabarti, Ravi Kumar, Filip Radlinski, Eli Upfal
Abstract: We formulate and study a new variant of the k-armed bandit problem, motivated by e-commerce applications. In our model, arms have (stochastic) lifetime after which they expire. In this setting an algorithm needs to continuously explore new arms, in contrast to the standard k-armed bandit model in which arms are available indefinitely and exploration is reduced once an optimal arm is identified with nearcertainty. The main motivation for our setting is online-advertising, where ads have limited lifetime due to, for example, the nature of their content and their campaign budgets. An algorithm needs to choose among a large collection of ads, more than can be fully explored within the typical ad lifetime. We present an optimal algorithm for the state-aware (deterministic reward function) case, and build on this technique to obtain an algorithm for the state-oblivious (stochastic reward function) case. Empirical studies on various reward distributions, including one derived from a real-world ad serving application, show that the proposed algorithms significantly outperform the standard multi-armed bandit approaches applied to these settings. 1
3 0.82379442 17 nips-2008-Algorithms for Infinitely Many-Armed Bandits
Author: Yizao Wang, Jean-yves Audibert, Rémi Munos
Abstract: We consider multi-armed bandit problems where the number of arms is larger than the possible number of experiments. We make a stochastic assumption on the mean-reward of a new selected arm which characterizes its probability of being a near-optimal arm. Our assumption is weaker than in previous works. We describe algorithms based on upper-confidence-bounds applied to a restricted set of randomly selected arms and provide upper-bounds on the resulting expected regret. We also derive a lower-bound which matches (up to a logarithmic factor) the upper-bound in some cases. 1
4 0.58137286 170 nips-2008-Online Optimization in X-Armed Bandits
Author: Sébastien Bubeck, Gilles Stoltz, Csaba Szepesvári, Rémi Munos
Abstract: We consider a generalization of stochastic bandit problems where the set of arms, X , is allowed to be a generic topological space. We constraint the mean-payoff function with a dissimilarity function over X in a way that is more general than Lipschitz. We construct an arm selection policy whose regret improves upon previous result for a large class of problems. In particular, our results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally H¨ lder √ a known exponent, then the expected o with regret is bounded up to a logarithmic factor by n, i.e., the rate of the growth of the regret is independent of the dimension of the space. Moreover, we prove the minimax optimality of our algorithm for the class of mean-payoff functions we consider. 1 Introduction and motivation Bandit problems arise in many settings, including clinical trials, scheduling, on-line parameter tuning of algorithms or optimization of controllers based on simulations. In the classical bandit problem there are a finite number of arms that the decision maker can select at discrete time steps. Selecting an arm results in a random reward, whose distribution is determined by the identity of the arm selected. The distributions associated with the arms are unknown to the decision maker whose goal is to maximize the expected sum of the rewards received. In many practical situations the arms belong to a large set. This set could be continuous [1; 6; 3; 2; 7], hybrid-continuous, or it could be the space of infinite sequences over a finite alphabet [4]. In this paper we consider stochastic bandit problems where the set of arms, X , is allowed to be an arbitrary topological space. We assume that the decision maker knows a dissimilarity function defined over this space that constraints the shape of the mean-payoff function. In particular, the dissimilarity function is assumed to put a lower bound on the mean-payoff function from below at each maxima. We also assume that the decision maker is able to cover the space of arms in a recursive manner, successively refining the regions in the covering such that the diameters of these sets shrink at a known geometric rate when measured with the dissimilarity. ∗ Csaba Szepesv´ ri is on leave from MTA SZTAKI. He also greatly acknowledges the support received from the a Alberta Ingenuity Fund, iCore and NSERC. 1 Our work generalizes and improves previous works on continuum-armed bandit problems: Kleinberg [6] and Auer et al. [2] focussed on one-dimensional problems. Recently, Kleinberg et al. [7] considered generic metric spaces assuming that the mean-payoff function is Lipschitz with respect to the (known) metric of the space. They proposed an interesting algorithm that achieves essentially the best possible regret in a minimax sense with respect to these environments. The goal of this paper is to further these works in a number of ways: (i) we allow the set of arms to be a generic topological space; (ii) we propose a practical algorithm motivated by the recent very successful tree-based optimization algorithms [8; 5; 4] and show that the algorithm is (iii) able to exploit higher order smoothness. In particular, as we shall argue in Section 7, (i) improves upon the results of Auer et al. [2], while (i), (ii) and (iii) improve upon the work of Kleinberg et al. [7]. Compared to Kleinberg et al. 
[7], our work represents an improvement in that, just like Auer et al. [2], we make use of the local properties of the mean-payoff function around the maxima only, and not a global property such as Lipschitzness over the whole space. This allows us to obtain a regret that scales as Õ(√n) when, e.g., the space is the unit hypercube and the mean-payoff function is locally Hölder with known exponent in the neighborhood of any maximum (the maxima being finite in number) and bounded away from the maxima outside of these neighborhoods. (We write u_n = Õ(v_n) when u_n = O(v_n) up to a logarithmic factor.) Thus, we get the desirable property that the rate of growth of the regret is independent of the dimensionality of the input space. We also prove a minimax lower bound that matches our upper bound up to logarithmic factors, showing that the performance of our algorithm is essentially unimprovable in a minimax sense. Besides these theoretical advances, the algorithm is anytime and easy to implement. Since it is based on ideas that have proved to be efficient, we expect it to perform well in practice and to make a significant impact on how on-line global optimization is performed. 2 Problem setup, notation We consider a topological space X, whose elements will be referred to as arms. A decision maker "pulls" the arms in X one at a time at discrete time steps. Each pull results in a reward that depends on the arm chosen and which the decision maker learns of. The goal of the decision maker is to choose the arms so as to maximize the sum of the rewards that he receives. In this paper we are concerned with stochastic environments. Such an environment M associates to each arm x ∈ X a distribution M_x on the real line. The support of these distributions is assumed to be uniformly bounded with a known bound. For the sake of simplicity, we assume this bound is 1. We denote by f(x) the expectation of M_x, which is assumed to be measurable (all measurability concepts are with respect to the Borel algebra over X). The function f : X → R thus defined is called the mean-payoff function. When in round n the decision maker pulls arm X_n ∈ X, he receives a reward Y_n drawn from M_{X_n}, independently of the past arm choices and rewards. A pulling strategy of a decision maker is determined by a sequence ϕ = (ϕ_n)_{n≥1} of measurable mappings, where each ϕ_n maps the history space H_n = (X × [0, 1])^{n−1} to the space of probability measures over X. By convention, ϕ_1 does not take any argument. A strategy is deterministic if for every n the range of ϕ_n contains only Dirac distributions. According to the process that was already informally described, a pulling strategy ϕ and an environment M jointly determine a random process (X_1, Y_1, X_2, Y_2, . . .) in the following way: In round one, the decision maker draws an arm X_1 at random from ϕ_1 and gets a payoff Y_1 drawn from M_{X_1}. In round n ≥ 2, first, X_n is drawn at random according to ϕ_n(X_1, Y_1, . . . , X_{n−1}, Y_{n−1}), but otherwise independently of the past. Then the decision maker gets a reward Y_n drawn from M_{X_n}, independently of all other random variables in the past given X_n. Let f* = sup_{x∈X} f(x) be the maximal expected payoff. The cumulative regret of a pulling strategy in environment M is R_n = n f* − Σ_{t=1}^{n} Y_t, and the cumulative pseudo-regret is R̄_n = n f* − Σ_{t=1}^{n} f(X_t). In the sequel, we restrict our attention to the expected regret E[R_n], which in fact equals the expected pseudo-regret E[R̄_n], as can be seen by an application of the tower rule. 
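To make the protocol above concrete, the following minimal sketch (not from the paper) implements the pull/reward loop and the cumulative-regret bookkeeping for an illustrative one-dimensional arm space; the payoff function, the Bernoulli reward model, and the uniform baseline strategy are stand-ins chosen only for illustration.

```python
import random

# Minimal sketch of the Section 2 protocol: arms are points of X = [0, 1],
# each pull of arm x returns a Bernoulli reward with mean f(x) (so rewards
# lie in [0, 1] as assumed), and we track the cumulative regret
# R_n = n * f_star - sum_t Y_t.  The specific f used here is illustrative.

def f(x):
    # mean-payoff function with a single maximum at x = 0.5 (f_star = 1.0)
    return 1.0 - abs(x - 0.5) ** 0.5

F_STAR = 1.0

def pull(x, rng):
    # stochastic environment M_x: Bernoulli reward with mean f(x)
    return 1.0 if rng.random() < f(x) else 0.0

def uniform_strategy(history, rng):
    # a trivial pulling strategy phi_n: ignore the history, pick an arm uniformly
    return rng.random()

def run(strategy, n_rounds, seed=0):
    rng = random.Random(seed)
    history, total_reward = [], 0.0
    for _ in range(n_rounds):
        x = strategy(history, rng)
        y = pull(x, rng)
        history.append((x, y))
        total_reward += y
    return n_rounds * F_STAR - total_reward   # cumulative regret R_n

print(run(uniform_strategy, 10000))
```

Any pulling strategy that uses the history can be plugged in place of the uniform baseline; the HOO strategy of the next section is one such choice.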
3 The Hierarchical Optimistic Optimization (HOO) strategy. 3.1 Trees of coverings. We first introduce the notion of a tree of coverings. Our algorithm will require such a tree as an input. Definition 1 (Tree of coverings). A tree of coverings is a family of measurable subsets (P_{h,i})_{1≤i≤2^h, h≥0} of X such that for every fixed integer h ≥ 0 the covering ∪_{1≤i≤2^h} P_{h,i} = X holds. Moreover, the elements of the covering are obtained recursively: each subset P_{h,i} is covered by the two subsets P_{h+1,2i−1} and P_{h+1,2i}. A tree of coverings can be represented, as the name suggests, by a binary tree T. The whole domain X = P_{0,1} corresponds to the root of the tree and P_{h,i} corresponds to the i-th node of depth h, which will be referred to as node (h, i) in the sequel. The fact that each P_{h,i} is covered by the two subsets P_{h+1,2i−1} and P_{h+1,2i} corresponds to the childhood relationship in the tree. Although the definition allows the child regions of a node to cover a larger part of the space, typically the size of the regions shrinks as the depth h increases (cf. Assumption 1). Remark 1. Our algorithm will instantiate the nodes of the tree on an "as needed" basis, one by one. In fact, at any round n it will only need n nodes connected to the root. 3.2 Statement of the HOO strategy. The algorithm picks at each round a node in the infinite tree T as follows. In the first round, it chooses the root node (0, 1). Now, consider round n + 1 with n ≥ 1. Let us denote by T_n the set of nodes that have been picked in previous rounds and by S_n the nodes which are not in T_n but whose parent is. The algorithm picks at round n + 1 a node (H_{n+1}, I_{n+1}) ∈ S_n according to the deterministic rule that will be described below. After selecting the node, the algorithm further chooses an arm X_{n+1} ∈ P_{H_{n+1}, I_{n+1}}. This selection can be stochastic or deterministic; we do not put any further restriction on it. The algorithm then gets a reward Y_{n+1} as described above and the procedure goes on: (H_{n+1}, I_{n+1}) is added to T_n to form T_{n+1}, and the children of (H_{n+1}, I_{n+1}) are added to S_n to give rise to S_{n+1}. Let us now turn to how (H_{n+1}, I_{n+1}) is selected. Along with the nodes, the algorithm stores what we call B-values. The node (H_{n+1}, I_{n+1}) ∈ S_n to expand at round n + 1 is picked by following a path from the root to a node in S_n, where at each node along the path the child with the larger B-value is selected (ties are broken arbitrarily). In order to define a node's B-value, we need a few quantities. Let C(h, i) be the set that collects (h, i) and its descendants. We let N_{h,i}(n) = Σ_{t=1}^{n} I{(H_t, I_t) ∈ C(h, i)} be the number of times the node (h, i) was visited. A given node (h, i) is picked at most once, but since its descendants may be picked afterwards, subsequent paths in the tree can go through it. Consequently, 1 ≤ N_{h,i}(n) ≤ n for all nodes (h, i) ∈ T_n. Let µ_{h,i}(n) be the empirical average of the rewards received for the time points when the path followed by the algorithm went through (h, i): µ_{h,i}(n) = (1 / N_{h,i}(n)) Σ_{t=1}^{n} Y_t I{(H_t, I_t) ∈ C(h, i)}. The corresponding upper confidence bound is, by definition, U_{h,i}(n) = µ_{h,i}(n) + √( (2 ln n) / N_{h,i}(n) ) + ν_1 ρ^h, where 0 < ρ < 1 and ν_1 > 0 are parameters of the algorithm (to be chosen later by the decision maker, see Assumption 1). For nodes not in T_n, by convention, U_{h,i}(n) = +∞. Now, for a node (h, i) in S_n, we define its B-value to be B_{h,i}(n) = +∞. The B-values for nodes in T_n are given by B_{h,i}(n) = min{ U_{h,i}(n), max{ B_{h+1,2i−1}(n), B_{h+1,2i}(n) } }. 
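The selection rule above can be sketched in code. The following is a simplified, illustrative implementation (not the authors' code): it uses X = [0, 1] with the dyadic cells P_{h,i} = [(i−1)/2^h, i/2^h], placeholder values for ν_1 and ρ, and a toy Bernoulli payoff. It walks down the tree along the larger B-values, expands one node per round, and updates the counts N_{h,i} and empirical means µ_{h,i} of every node on the followed path.

```python
import math
import random

# Illustrative HOO sketch over X = [0, 1] with dyadic cells P_{h,i} = [(i-1)/2^h, i/2^h].
# NU1, RHO and the test payoff function are placeholder choices, not values from the paper.

NU1, RHO = 1.0, 0.5

def cell(h, i):
    # cell P_{h,i} of the dyadic tree of coverings of [0, 1]
    return ((i - 1) / 2 ** h, i / 2 ** h)

class HOO:
    def __init__(self):
        self.stats = {}          # (h, i) -> (N_{h,i}, mu_{h,i}) for expanded nodes (the set T_n)

    def _U(self, node, n):
        if node not in self.stats:
            return math.inf      # convention: U = +infinity for nodes not in T_n
        N, mu = self.stats[node]
        return mu + math.sqrt(2.0 * math.log(n) / N) + NU1 * RHO ** node[0]

    def _B(self, node, n):
        if node not in self.stats:
            return math.inf      # nodes in S_n have B = +infinity
        h, i = node
        return min(self._U(node, n),
                   max(self._B((h + 1, 2 * i - 1), n), self._B((h + 1, 2 * i), n)))

    def select(self, n):
        # follow the child with the larger B-value until reaching an unexpanded node
        node, path = (0, 1), [(0, 1)]
        while node in self.stats:
            h, i = node
            left, right = (h + 1, 2 * i - 1), (h + 1, 2 * i)
            node = left if self._B(left, n) >= self._B(right, n) else right
            path.append(node)
        return node, path

    def update(self, path, reward):
        # every ancestor of the expanded node lies in C(h, i), so update the whole path
        for node in path:
            N, mu = self.stats.get(node, (0, 0.0))
            self.stats[node] = (N + 1, mu + (reward - mu) / (N + 1))

def payoff(x, rng):
    # toy mean-payoff function with maximum at x = 0.3, rewards in {0, 1}
    return 1.0 if rng.random() < 1.0 - abs(x - 0.3) ** 0.5 else 0.0

rng = random.Random(0)
hoo = HOO()
for n in range(1, 2001):
    node, path = hoo.select(n)
    lo, hi = cell(*node)
    x = rng.uniform(lo, hi)      # any arm inside P_{h,i}, as the statement allows
    hoo.update(path, payoff(x, rng))
```

Note that, as in Remark 1, nodes are instantiated only when first reached, so after n rounds the dictionary holds exactly n expanded nodes.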
Note that the algorithm is deterministic (apart, maybe, from the arbitrary random choice of Xt in PHt ,It ). Its total space requirement is linear in n while total running time at round n is at most quadratic in n, though we conjecture that it is O(n log n) on average. 4 Assumptions made on the model and statement of the main result We suppose that X is equipped with a dissimilarity , that is a non-negative mapping : X 2 → R satisfying (x, x) = 0. The diameter (with respect to ) of a subset A of X is given by diam A = supx,y∈A (x, y). Given the dissimilarity , the “open” ball with radius ε > 0 and center c ∈ X is B(c, ε) = { x ∈ X : (c, x) < ε } (we do not require the topology induced by to be related to the topology of X .) In what follows when we refer to an (open) ball, we refer to the ball defined with respect to . The dissimilarity will be used to capture the smoothness of the mean-payoff function. The decision maker chooses and the tree of coverings. The following assumption relates this choice to the parameters ρ and ν1 of the algorithm: Assumption 1. There exist ρ < 1 and ν1 , ν2 > 0 such that for all integers h ≥ 0 and all i = 1, . . . , 2h , the diameter of Ph,i is bounded by ν1 ρh , and Ph,i contains an open ball Ph,i of radius ν2 ρh . For a given h, the Ph,i are disjoint for 1 ≤ i ≤ 2h . Remark 2. A typical choice for the coverings in a cubic domain is to let the domains be hyper-rectangles. They can be obtained, e.g., in a dyadic manner, by splitting at each step hyper-rectangles in the middle along their longest side, in an axis parallel manner; if all sides are equal, we split them along the√ axis. In first this example, if X = [0, 1]D and (x, y) = x − y α then we can take ρ = 2−α/D , ν1 = ( D/2)α and ν2 = 1/8α . The next assumption concerns the environment. Definition 2. We say that f is weakly Lipschitz with respect to if for all x, y ∈ X , f ∗ − f (y) ≤ f ∗ − f (x) + max f ∗ − f (x), (x, y) . (1) Note that weak Lipschitzness is satisfied whenever f is 1–Lipschitz, i.e., for all x, y ∈ X , one has |f (x) − f (y)| ≤ (x, y). On the other hand, weak Lipschitzness implies local (one-sided) 1–Lipschitzness at any maxima. Indeed, at an optimal arm x∗ (i.e., such that f (x∗ ) = f ∗ ), (1) rewrites to f (x∗ ) − f (y) ≤ (x∗ , y). However, weak Lipschitzness does not constraint the growth of the loss in the vicinity of other points. Further, weak Lipschitzness, unlike Lipschitzness, does not constraint the local decrease of the loss at any point. Thus, weak-Lipschitzness is a property that lies somewhere between a growth condition on the loss around optimal arms and (one-sided) Lipschitzness. Note that since weak Lipschitzness is defined with respect to a dissimilarity, it can actually capture higher-order smoothness at the optima. For example, f (x) = 1 − x2 is weak Lipschitz with the dissimilarity (x, y) = c(x − y)2 for some appropriate constant c. Assumption 2. The mean-payoff function f is weakly Lipschitz. ∗ ∗ Let fh,i = supx∈Ph,i f (x) and ∆h,i = f ∗ − fh,i be the suboptimality of node (h, i). We say that def a node (h, i) is optimal (respectively, suboptimal) if ∆h,i = 0 (respectively, ∆h,i > 0). Let Xε = { x ∈ X : f (x) ≥ f ∗ − ε } be the set of ε-optimal arms. The following result follows from the definitions; a proof can be found in the appendix. 4 Lemma 1. Let Assumption 1 and 2 hold. If the suboptimality ∆h,i of a region is bounded by cν1 ρh for some c > 0, then all arms in Ph,i are max{2c, c + 1}ν1 ρh -optimal. 
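The worked example given earlier in this section (f(x) = 1 − x² being weakly Lipschitz with respect to ℓ(x, y) = c(x − y)² for an appropriate constant c) can be checked numerically. The sketch below is purely illustrative: it takes X = [−1, 1] as an assumed domain and simply tests inequality (1) on a grid of points for a few candidate values of c; the particular constants it reports are not values claimed in the paper.

```python
# Numerical check of Definition 2 (weak Lipschitzness) for the example
# f(x) = 1 - x**2 on the assumed domain X = [-1, 1], with dissimilarity
# l(x, y) = c * (x - y)**2.  We search a few candidate constants c on a grid.

def weakly_lipschitz(f, f_star, diss, points, tol=1e-12):
    # check  f* - f(y) <= f* - f(x) + max(f* - f(x), l(x, y))  for all grid pairs
    return all(
        f_star - f(y) <= f_star - f(x) + max(f_star - f(x), diss(x, y)) + tol
        for x in points for y in points
    )

f = lambda x: 1.0 - x * x
f_star = 1.0
grid = [i / 100.0 for i in range(-100, 101)]   # grid over [-1, 1]

for c in [1, 2, 4, 8, 12, 16]:
    ok = weakly_lipschitz(f, f_star, lambda x, y, c=c: c * (x - y) ** 2, grid)
    print("c =", c, "weakly Lipschitz on grid:", ok)
```

Small values of c fail the check while sufficiently large ones pass, which is exactly the sense in which the squared dissimilarity captures the quadratic (higher-order) smoothness of f at its maximum.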
The last assumption is closely related to Assumption 2 of Auer et al. [2], who observed that the regret of a continuum-armed bandit algorithm should depend on how fast the volume of the sets of ε-optimal arms shrinks as ε → 0. Here, we capture this by defining a new notion, the near-optimality dimension of the mean-payoff function. The connection between these concepts, as well as the zooming dimension defined by Kleinberg et al. [7] will be further discussed in Section 7. Define the packing number P(X , , ε) to be the size of the largest packing of X with disjoint open balls of radius ε with respect to the dissimilarity .2 We now define the near-optimality dimension, which characterizes the size of the sets Xε in terms of ε, and then state our main result. Definition 3. For c > 0 and ε0 > 0, the (c, ε0 )–near-optimality dimension of f with respect to equals inf d ∈ [0, +∞) : ∃ C s.t. ∀ε ≤ ε0 , P Xcε , , ε ≤ C ε−d (2) (with the usual convention that inf ∅ = +∞). Theorem 1 (Main result). Let Assumptions 1 and 2 hold and assume that the (4ν1 /ν2 , ν2 )–near-optimality dimension of the considered environment is d < +∞. Then, for any d > d there exists a constant C(d ) such that for all n ≥ 1, ERn ≤ C(d ) n(d +1)/(d +2) ln n 1/(d +2) . Further, if the near-optimality dimension is achieved, i.e., the infimum is achieved in (2), then the result holds also for d = d. Remark 3. We can relax the weak-Lipschitz property by requiring it to hold only locally around the maxima. In fact, at the price of increased constants, the result continues to hold if there exists ε > 0 such that (1) holds for any x, y ∈ Xε . To show this we only need to carefully adapt the steps of the proof below. We omit the details from this extended abstract. 5 Analysis of the regret and proof of the main result We first state three lemmas, whose proofs can be found in the appendix. The proofs of Lemmas 3 and 4 rely on concentration-of-measure techniques, while that of Lemma 2 follows from a simple case study. Let us fix some path (0, 1), (1, i∗ ), . . . , (h, i∗ ), . . . , of optimal nodes, starting from the root. 1 h Lemma 2. Let (h, i) be a suboptimal node. Let k be the largest depth such that (k, i∗ ) is on the path from k the root to (h, i). Then we have n E Nh,i (n) ≤ u+ P Nh,i (t) > u and Uh,i (t) > f ∗ or Us,i∗ ≤ f ∗ for some s ∈ {k+1, . . . , t−1} s t=u+1 Lemma 3. Let Assumptions 1 and 2 hold. 1, P Uh,i (n) ≤ f ∗ ≤ n−3 . Then, for all optimal nodes and for all integers n ≥ Lemma 4. Let Assumptions 1 and 2 hold. Then, for all integers t ≤ n, for all suboptimal nodes (h, i) 8 ln such that ∆h,i > ν1 ρh , and for all integers u ≥ 1 such that u ≥ (∆h,i −νnρh )2 , one has P Uh,i (t) > 1 f ∗ and Nh,i (t) > u ≤ t n−4 . 2 Note that sometimes packing numbers are defined as the largest packing with disjoint open balls of radius ε/2, or, ε-nets. 5 . Taking u as the integer part of (8 ln n)/(∆h,i − ν1 ρh )2 , and combining the results of Lemma 2, 3, and 4 with a union bound leads to the following key result. Lemma 5. Under Assumptions 1 and 2, for all suboptimal nodes (h, i) such that ∆h,i > ν1 ρh , we have, for all n ≥ 1, 8 ln n 2 E[Nh,i (n)] ≤ + . (∆h,i − ν1 ρh )2 n We are now ready to prove Theorem 1. Proof. For the sake of simplicity we assume that the infimum in the definition of near-optimality is achieved. To obtain the result in the general case one only needs to replace d below by d > d in the proof below. First step. For all h = 1, 2, . . 
., denote by Ih the nodes at depth h that are 2ν1 ρh –optimal, i.e., the nodes ∗ (h, i) such that fh,i ≥ f ∗ − 2ν1 ρh . Then, I is the union of these sets of nodes. Further, let J be the set of nodes that are not in I but whose parent is in I. We then denote by Jh the nodes in J that are located at depth h in the tree. Lemma 4 bounds the expected number of times each node (h, i) ∈ Jh is visited. Since ∆h,i > 2ν1 ρh , we get 8 ln n 2 E Nh,i (n) ≤ 2 2h + . ν1 ρ n Second step. We bound here the cardinality |Ih |, h > 0. If (h, i) ∈ Ih then since ∆h,i ≤ 2ν1 ρh , by Lemma 1 Ph,i ⊂ X4ν1 ρh . Since by Assumption 1, the sets (Ph,i ), for (h, i) ∈ Ih , contain disjoint balls of radius ν2 ρh , we have that |Ih | ≤ P ∪(h,i)∈Ih Ph,i , , ν2 ρh ≤ P X(4ν1 /ν2 ) ν2 ρh , , ν2 ρh ≤ C ν2 ρh −d , where we used the assumption that d is the (4ν1 /ν2 , ν2 )–near-optimality dimension of f (and C is the constant introduced in the definition of the near-optimality dimension). Third step. Choose η > 0 and let H be the smallest integer such that ρH ≤ η. We partition the infinite tree T into three sets of nodes, T = T1 ∪ T2 ∪ T3 . The set T1 contains nodes of IH and their descendants, T2 = ∪0≤h < 1 and then, by optimizing over ρH (the worst value being ρH ∼ ( ln n )−1/(d+2) ). 6 Minimax optimality The packing dimension of a set X is the smallest d such that there exists a constant k such that for all ε > 0, P X , , ε ≤ k ε−d . For instance, compact subsets of Rd (with non-empty interior) have a packing dimension of d whenever is a norm. If X has a packing dimension of d, then all environments have a near-optimality dimension less than d. The proof of the main theorem indicates that the constant C(d) only depends on d, k (of the definition of packing dimension), ν1 , ν2 , and ρ, but not on the environment as long as it is weakly Lipschitz. Hence, we can extract from it a distribution-free bound of the form O(n(d+1)/(d+2) ). In fact, this bound can be shown to be optimal as is illustrated by the theorem below, whose assumptions are satisfied by, e.g., compact subsets of Rd and if is some norm of Rd . The proof can be found in the appendix. Theorem 2. If X is such that there exists c > 0 with P(X , , ε) ≥ c ε−d ≥ 2 for all ε ≤ 1/4 then for all n ≥ 4d−1 c/ ln(4/3), all strategies ϕ are bound to suffer a regret of at least 2/(d+2) 1 1 c n(d+1)/(d+2) , 4 4 4 ln(4/3) where the supremum is taken over all environments with weakly Lipschitz payoff functions. sup E Rn (ϕ) ≥ 7 Discussion Several works [1; 6; 3; 2; 7] have considered continuum-armed bandits in Euclidean or metric spaces and provided upper- and lower-bounds on the regret for given classes of environments. Cope [3] derived a regret √ of O( n) for compact and convex subset of Rd and a mean-payoff function with unique minima and second order smoothness. Kleinberg [6] considered mean-payoff functions f on the real line that are H¨ lder with o degree 0 < α ≤ 1. The derived regret is Θ(n(α+1)/(α+2) ). Auer et al. [2] extended the analysis to classes of functions with only a local H¨ lder assumption around maximum (with possibly higher smoothness degree o 1+α−αβ α ∈ [0, ∞)), and derived the regret Θ(n 1+2α−αβ ), where β is such that the Lebesgue measure of ε-optimal 7 states is O(εβ ). Another setting is that of [7] who considered a metric space (X , ) and assumed that f is Lipschitz w.r.t. . 
The obtained regret is O(n(d+1)/(d+2) ) where d is the zooming dimension (defined similarly to our near-optimality dimension, but using covering numbers instead of packing numbers and the sets Xε \ Xε/2 ). When (X , ) is a metric space covering and packing numbers are equivalent and we may prove that the zooming dimension and near-optimality dimensions are equal. Our main contribution compared to [7] is that our weak-Lipschitz assumption, which is substantially weaker than the global Lipschitz assumption assumed in [7], enables our algorithm to work better in some common situations, such as when the mean-payoff function assumes a local smoothness whose order is larger than one. In order to relate all these results, let us consider a specific example: Let X = [0, 1]D and assume that the mean-reward function f is locally equivalent to a H¨ lder function with degree α ∈ [0, ∞) around any o maxima x∗ of f (the number of maxima is assumed to be finite): f (x∗ ) − f (x) = Θ(||x − x∗ ||α ) as x → x∗ . (3) This means that ∃c1 , c2 , ε0 > 0, ∀x, s.t. ||x − x∗ || ≤ ε0 , c1 ||x − x∗ ||α ≤ f (x∗ ) − f (x) ≤ c2 ||x − x∗ ||α . √ Under this assumption, the result of Auer et al. [2] shows that for D = 1, the regret is Θ( n) (since here √ β = 1/α). Our result allows us to extend the n regret rate to any dimension D. Indeed, if we choose our def dissimilarity measure to be α (x, y) = ||x − y||α , we may prove that f satisfies a locally weak-Lipschitz √ condition (as defined in Remark 3) and that the near-optimality dimension is 0. Thus our regret is O( n), i.e., the rate is independent of the dimension D. In comparison, since Kleinberg et al. [7] have to satisfy a global Lipschitz assumption, they can not use α when α > 1. Indeed a function globally Lipschitz with respect to α is essentially constant. Moreover α does not define a metric for α > 1. If one resort to the Euclidean metric to fulfill their requirement that f be Lipschitz w.r.t. the metric then the zooming dimension becomes D(α − 1)/α, while the regret becomes √ O(n(D(α−1)+α)/(D(α−1)+2α) ), which is strictly worse than O( n) and in fact becomes close to the slow rate O(n(D+1)/(D+2) ) when α is larger. Nevertheless, in the case of α ≤ 1 they get the same regret rate. In contrast, our result shows that under very weak constraints on the mean-payoff function and if the local behavior of the function around its maximum (or finite number of maxima) is known then global optimization √ suffers a regret of order O( n), independent of the space dimension. As an interesting sidenote let us also remark that our results allow different smoothness orders along different dimensions, i.e., heterogenous smoothness spaces. References [1] R. Agrawal. The continuum-armed bandit problem. SIAM J. Control and Optimization, 33:1926–1951, 1995. [2] P. Auer, R. Ortner, and Cs. Szepesv´ ri. Improved rates for the stochastic continuum-armed bandit problem. 20th a Conference on Learning Theory, pages 454–468, 2007. [3] E. Cope. Regret and convergence bounds for immediate-reward reinforcement learning with continuous action spaces. Preprint, 2004. [4] P.-A. Coquelin and R. Munos. Bandit algorithms for tree search. In Proceedings of 23rd Conference on Uncertainty in Artificial Intelligence, 2007. [5] S. Gelly, Y. Wang, R. Munos, and O. Teytaud. Modification of UCT with patterns in Monte-Carlo go. Technical Report RR-6062, INRIA, 2006. [6] R. Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In 18th Advances in Neural Information Processing Systems, 2004. 
[7] R. Kleinberg, A. Slivkins, and E. Upfal. Multi-armed bandits in metric spaces. In Proceedings of the 40th ACM Symposium on Theory of Computing, 2008. [8] L. Kocsis and Cs. Szepesvári. Bandit based Monte-Carlo planning. In Proceedings of the 15th European Conference on Machine Learning, pages 282–293, 2006.
5 0.54574597 187 nips-2008-Psychiatry: Insights into depression through normative decision-making models
Author: Quentin J. Huys, Joshua Vogelstein, Peter Dayan
Abstract: Decision making lies at the very heart of many psychiatric diseases. It is also a central theoretical concern in a wide variety of fields and has undergone detailed, in-depth, analyses. We take as an example Major Depressive Disorder (MDD), applying insights from a Bayesian reinforcement learning framework. We focus on anhedonia and helplessness. Helplessness—a core element in the conceptualizations of MDD that has lead to major advances in its treatment, pharmacological and neurobiological understanding—is formalized as a simple prior over the outcome entropy of actions in uncertain environments. Anhedonia, which is an equally fundamental aspect of the disease, is related to the effective reward size. These formulations allow for the design of specific tasks to measure anhedonia and helplessness behaviorally. We show that these behavioral measures capture explicit, questionnaire-based cognitions. We also provide evidence that these tasks may allow classification of subjects into healthy and MDD groups based purely on a behavioural measure and avoiding any verbal reports. There are strong ties between decision making and psychiatry, with maladaptive decisions and behaviors being very prominent in people with psychiatric disorders. Depression is classically seen as following life events such as divorces and job losses. Longitudinal studies, however, have revealed that a significant fraction of the stressors associated with depression do in fact follow MDD onset, and that they are likely due to maladaptive behaviors prominent in MDD (Kendler et al., 1999). Clinically effective ’talking’ therapies for MDD such as cognitive and dialectical behavior therapies (DeRubeis et al., 1999; Bortolotti et al., 2008; Gotlib and Hammen, 2002; Power, 2005) explicitly concentrate on altering patients’ maladaptive behaviors and decision making processes. Decision making is a promising avenue into psychiatry for at least two more reasons. First, it offers powerful analytical tools. Control problems related to decision making are prevalent in a huge diversity of fields, ranging from ecology to economics, computer science and engineering. These fields have produced well-founded and thoroughly characterized frameworks within which many issues in decision making can be framed. Here, we will focus on framing issues identified in psychiatric settings within a normative decision making framework. Its second major strength comes from its relationship to neurobiology, and particularly those neuromodulatory systems which are powerfully affected by all major clinically effective pharmacotherapies in psychiatry. The understanding of these systems has benefited significantly from theoretical accounts of optimal control such as reinforcement learning (Montague et al., 1996; Kapur and Remington, 1996; Smith et al., 1999; Yu and Dayan, 2005; Dayan and Yu, 2006). Such accounts may be useful to identify in more specific terms the roles of the neuromodulators in psychiatry (Smith et al., 2004; Williams and Dayan, 2005; Moutoussis et al., 2008; Dayan and Huys, 2008). ∗ qhuys@cantab.net, joshuav@jhu.edu, dayan@gatsby.ucl.ac.uk; www.gatsby.ucl.ac.uk/∼qhuys/pub.html 1 Master Yoked Control Figure 1: The learned helplessness (LH) paradigm. Three sets of rats are used in a sequence of two tasks. In the first task, rats are exposed to escapable or inescapable shocks. Shocks come on at random times. 
The master rat is given escapable shocks: it can switch off the shock by performing an action, usually turning a wheel mounted in front of it. The yoked rat is exposed to precisely the same shocks as the master rat, i.e its shocks are terminated when the master rat terminates the shock. Thus its shocks are inescapable, there is nothing it can do itself to terminate them. A third set of rats is not exposed to shocks. Then, all three sets of rats are exposed to a shuttlebox escape task. Shocks again come on at random times, and rats have to shuttle to the other side of the box to terminate the shock. Only yoked rats fail to acquire the escape response. Yoked rats generally fail to acquire a wide variety of instrumental behaviours, either determined by reward or, as here, by punishment contingencies. This paper represents an initial attempt at validating this approach experimentally. We will frame core notions of MDD in a reinforcement learning framework and use it to design behavioral decision making experiments. More specifically, we will concentrate on two concepts central to current thinking about MDD: anhedonia and learned helplessness (LH, Maier and Seligman 1976; Maier and Watkins 2005). We formulate helplessness parametrically as prior beliefs on aspects of decision trees, and anhedonia as the effective reward size. This allows us to use choice behavior to infer the degree to which subjects’ behavioral choices are characterized by either of these. For validation, we correlate the parameters inferred from subjects’ behavior with standard, questionnaire-based measures of hopelessness and anhedonia, and finally use the inferred parameters alone to attempt to recover the diagnostic classification. 1 Core concepts: helplessness and anhedonia The basic LH paradigm is explained in figure 1. Its importance is manifold: the effect of inescapable shock on subsequent learning is sensitive to most classes of clinically effective antidepressants; it has arguably been a motivation framework for the development of the main talking therapies for depression (cognitive behavioural therapy, Williams (1992), it has motivated the development of further, yet more specific animal models (Willner, 1997), and it has been the basis of very specific research into the cognitive basis of depression (Peterson et al., 1993). Behavioral control is the central concept in LH: yoked and master rat do not differ in terms of the amount of shock (stress) they have experienced, only in terms of the behavioural control over it. It is not a standard notion in reinforcement learning, and there are several ways one could translate the concept into RL terms. At a simple level, there is intuitively more behavioural control if, when repeating one action, the same outcome occurs again and again, than if this were not true. Thus, at a very first level, control might be related to the outcome entropy of actions (see Maier and Seligman 1976 for an early formulation). Of course, this is too simple. If all available actions deterministically led to the same outcome, the agent has very little control. Finally, if one were able to achieve all outcomes except for the one one cares about (in the rats’ case switching off or avoiding the shock), we would again not say that there is much control (see Huys (2007); Huys and Dayan (2007) for a more detailed discussion). Despite its obvious limitations, we will here concentrate on the simplest notion for reasons of mathematical expediency. 
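To illustrate the "control as outcome entropy" notion adopted above (and the predictive distributions described in Figure 2 below), here is a small sketch based on a standard Dirichlet-multinomial model. The exact way the paper maps its control prior γ onto outcome distributions is not recoverable from the text extracted here, so the concentration parameter α below is an assumed stand-in: small α yields low-entropy, observation-concentrated predictions (high perceived control), while large α yields near-uniform predictions (low control).

```python
import math

# Illustrative sketch only: predictive distribution of a Dirichlet-multinomial model
# over K = 5 possible rewards, after the observations used in Figure 2 (reward 2 seen
# twice, reward 4 seen three times).  The concentration alpha is an assumed proxy for
# the paper's control prior, not the paper's actual parametrisation.

def predictive(counts, alpha):
    # posterior predictive under a symmetric Dirichlet(alpha/K, ..., alpha/K) prior
    K = len(counts)
    total = sum(counts) + alpha
    return [(c + alpha / K) / total for c in counts]

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

counts = [0, 2, 0, 3, 0]   # rewards 1..5: reward 2 observed twice, reward 4 three times
for alpha, label in [(0.1, "high control (low outcome entropy)"),
                     (50.0, "low control (high outcome entropy)")]:
    p = predictive(counts, alpha)
    print(label, [round(pi, 3) for pi in p], "entropy:", round(entropy(p), 3))
```

Under the small-α setting the predictive mass sits almost entirely on the rewards already observed, which is what makes trying out a new slot machine attractive for an agent with a high-control prior: a single good outcome is then expected to repeat.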
2 0.6 0.5 Exploration vs Exploitation Predictive Distributions Q(aknown)−Q(aunknown) P(reward a known ) 0.7 2 0 1 2 3 4 5 0.4 0.3 0.2 Choose blue slot machine 0.5 0 −0.5 0.1 0 1 1 2 3 4 5 Reward −1 Choose orange slot machine 1 High control Low control 2 3 4 5 Tree depth Figure 2: Effect of γ on predictions, Q-values and exploration behaviour. Assume a slot machine (blue) has been chosen five times, with possible rewards 1-5, and that reward 2 has been obtained twice, and reward 4 three times (inset in left panel). Left: Predictive distribution for a prior with negative γ (low control) in light gray, and large γ (extensive control) in dark gray. We see that, if the agent believes he has much control (and outcome distributions have low entropy), the predictive distribution puts all mass on the observations. Right: Assume now the agent gets up to 5 more pulls (tree depth 1-5) between the blue slot machine and a new, orange slot machine. The orange slot machine’s predictive distribution is flat as it has never been tried, and its expected value is therefore 3. The plot shows the difference between the values for the two slot machines. First consider the agent only has one more pull to take. In this case, independently of the priors about control, the agent will choose the blue machine, because it is just slightly better than average. Note though that the difference is more pronounced if the agent has a high control prior. But things change if the agent has two or more choices. Now, it is worth trying out the new machine if the agent has a high-control prior. For in that case, if the new machine turns out to yield a large reward on the first try, it is likely to do so again for the second and subsequent times. Thus, the prior about control determines the exploration bonus. The second central concept in current conceptions of MDD is that of reward sensitivity. Anhedonia, an inability to enjoy previously enjoyable things, is one of two symptoms necessary for the diagnosis of depression (American Psychiatric Association, 1994). A number of tasks in the literature have attempted to measure reward sensitivity behaviourally. While these generally concur in finding decreased reward sensitivity in subjects with MDD, these results need further clarification. Some studies show interactions between reward and punishment sensitivities with respect to MDD, but important aspects of the tasks are not clearly understood. For instance, Henriques et al. (1994); Henriques and Davidson (2000) show decreased resonsiveness of MDD subjects to rewards, but equally show decreased resonsiveness of healthy subjects to punishments. Pizzagalli et al. (2005) introduced an asymmetrically rewarded perceptual discrimination task and show that the rate of change of the response bias is anticorrelated with subjects’ anhedonic symptoms. Exactly how decreased reward responsivity can account for this is at pressent not clear. Great care has to be taken to disentangle these two concepts. Anhedonia and helplessness both provide good reasons for not taking an action: either because the reinforcements associated with the action are insufficient (anhedonia), or because the outcome is not judged a likely result of taking some particular action (if actions are thought to have large outcome entropy). 2 A Bayesian formulation of control We consider a scenario where subjects have no knowledge of the outcome distributions of actions, but rather learn about them. 
This means that their prior beliefs about the outcome distributions are not overwhelmed by the likelihood of observations, and may thus have measurable effects on their action choices. In terms of RL, this means that agents do not know the decision tree of the problem they face. Control is formulated as a prior distribution on the outcome distributions, and thereby as a prior distribution on the decision trees. The concentration parameter α of a Dirichlet process can very simply parametrise entropy, and, if used as a prior, allow for very efficient updates of the predictive distributions of actions. Let us assume we have actions A which have as outcomes rewards R, and keep count Nt (r, a) = 3 k:k < 0. Here, we included a regressor for the AGE as that was a confounding variable in our subject sample. Furthermore, if it is true that anhedonia, as expressed by the questionnaire, relates to reward sensitivity specifically, we should be able to write a similar regression for the learning rate ǫ (from equation 5) ǫ(BDIa, AGE) = θǫ BDIa + cǫ AGE + ζǫ but find that θǫ is not different from zero. Figure 4 shows the ML values for the parameters of interest (emphasized in blue in the equations) and confirms that people who express higher levels of anhedonia do indeed show less reward sensitivity, but do not differ in terms of learning rate. If it were the case that subjects with higher BDIa score were just less attentive to the task, one might also expect an effect of BDIa on learning rate. 3.2 Control Validation: The control task is new, and we first need to ascertain that subjects were indeed sensitive to main features of the task. We thus fit both a RW-learning rule (as in the previous section, but adjusted for the varying number of available actions), and the full control model. Importantly, both these models have two parameters, but only the full control model has a notion of outcome entropy, and evaluations a tree. The chance probability of subjects’ actions was 0.37, meaning that, on average, there were just under three machines on the screen. The probability of the actions under the RW-learning rule was better at 0.48, and that of the full control model 0.54. These differences are highly significant as the total number of choices is 29600. Thus, we conclude that subjects were indeed sensitive to the manipulation of outcome entropy, and that they did look ahead in a tree. Prior belief about control: Applying the procedure from the previous task to the main task, we write the main parameters of equations 2 and 4 as functions of the questionnaire measures and infer linear parameters: γ1 (BDIa, BHS, age) = χγ1 BHS + θγ1 BDIa + cγ1 AGE + ζγ1 γ2 (BDIa, BHS, age) = χγ2 BHS + θγ2 BDIa + cγ2 AGE + ζγ2 β(BDIa, BHS, age) = χβ BHS + θβ BDIa + cβ AGE + ζβ Importantly, because the BDIa scores and the BHS scores are correlated in our sample (they tend to be large for the subjects with MDD), we include the cross-terms (θγ1 , θγ2 , χγ ), as we are interested in the specific effects of BDIa on β, as before, and of BHS on γ. 6 3 control γ 2 Figure 6: Classification. Controls are shown as black dots, and depressed subjects as red crosses. The blue line is a linear classifier. Thus, the patients and controls can be approximately classified purely on the basis of behaviour. 1 0 83% correct 69% sensitivity 94% specificity −1 −2 2 4 6 8 10 12 14 16 reward sensitivity β We here infer and display two separate values γ1 and γ2 . 
These correspond to the level of control in the first and the second half of the experiment. In fact, to parallel the LH experiments better, the slot machines in the first 50 rooms were actually very noisy (low true γ), which means that subjects were here exposed to low levels of control just like the yoked rats in the original experiment. In the second half of the experiment on the other hand, slot machines tended to be quite reliable (high true γ). Figure 5 shows again the ML values for the parameters of interest (emphasized in blue in the equations). Again, we find that our parameter estimate are very significantly different from zero (> three standard deviations). The effect of the BHS score on the prior beliefs about control γ is much stronger in the second half than of the experiment in the first half, i.e. the effect of BHS on the prior belief about control is particularly prominent when subjects are in a high-control environment and have previously been exposed to a low-control environment. This is an interesting parallel to the learned helplessness experiments in animals. 3.3 Classification Finally we combine the two tasks. We integrate out the learning rate ǫ, which we had found not be related to the questionnaire measures (c.f. figure 4), and use the distribution over β from the first task as a prior distribution on β for the second task. We also put weak priors on γ and infer both β and γ for the second task on a subject-by-subject basis. Figure 6 shows the posterior values for γ and β for MDD and healthy subjects and the ability of a linear classifier to classify them. 4 Discussion In this paper, we have attempted to provide a specific formulation of core psychiatric concepts in reinforcement learning terms, i.e. hopelessness as a prior belief about controllability, and anhedonia as reward sensitivity. We have briefly explained how we expect these formulations to have effect in a behavioural situation, have presented a behavioral task explicitly designed to be sensitive to our formulations, and shown that people’s verbal expression of hopelessness and anhedonia do have specific behavioral impacts. Subjects who express anhedonia display insensitivity to rewards and those expressing hopelessness behave as if they had prior beliefs that outcome distributions of actions (slot machines) are very broad. Finally, we have shown that these purely behavioural measures are also predictive of their psychiatric status, in that we were able to classify patients and healthy controls purely on the basis of performance. Several aspects of this work are novel. There have been previous attempts to map aspects of psychiatric dysfunction onto specific parametrizations (Cohen et al., 1996; Smith et al., 2004; Williams and Dayan, 2005; Moutoussis et al., 2008), but we believe that our work represents the first attempt to a) apply it to MDD; b) make formal predictions about subject behavior c) present strong evidence linking anhedonia specifically to reward insensitivity across two tasks d) combine tasks to tease helplessness and anhedonia apart and e) to use the behavioral inferences for classification. The latter point is particularly important, as it will determine any potential clinical significance (Veiel, 1997). In the future, rather than cross-validating with respect to say DSM-IV criteria, it may also be important to validate measures such as ours in their own right in longitudinal studies. 7 Several important caveats do remain. First, the populations are not fully matched for age. 
We included age as an additional regressor and found all results to be robust. Secondly, only the healthy subjects were remunerated. However, repeating the analyses presented using only the MDD subjects yields the same results (data not shown). Thirdly, we have not yet fully mirrored the LH experiments. We have so far only tested the transfer from a low-control environment to a high-control environment. To make statements like those in animal learned helplessness experiments, the transfer from high-control to low-control environments will need to be examined, too. Fourth, the notion of control we have used is very simple, and more complex notions should certainly be tested (see Dayan and Huys 2008). Fifth, and maybe most importantly, we have so far only attempted to classify MDD and healthy subjects, and can thus not yet make any statements about the specificity of these effects with respect to MDD. Finally, it will be important to replicate these results independently, and possibly in a different modality. Nevertheless, we believe these results to be very encouraging. Acknowledgments: This work would not have been possible without the help of Sarah Hollingsworth Lisanby, Kenneth Miller and Ramin V. Parsey. We would also like to thank Nathaniel Daw and Hanneke EM Den Ouden and Ren´ Hen for invaluable discussions. Support for this work was provided by the Gatsby Charitable e Foundation (PD), a UCL Bogue Fellowship and the Swartz Foundation (QH) and a Columbia University startup grant to Kenneth Miller. References American Psychiatric Association (1994). Diagnostic and Statistical Manual of Mental Disorders. American Psychiatric Association Press. Bortolotti, B., Menchetti, M., Bellini, F., Montaguti, M. B., and Berardi, D. (2008). Psychological interventions for major depression in primary care: a meta-analytic review of randomized controlled trials. Gen Hosp Psychiatry, 30(4):293–302. Cohen, J. D., Braver, T. S., and O’Reilly, R. C. (1996). A computational approach to prefrontal cortex, cognitive control and schizophrenia: recent developments and current challenges. Philos Trans R Soc Lond B Biol Sci, 351(1346):1515–1527. Daw, N. D., O’Doherty, J. P., Dayan, P., Seymour, B., and Dolan, R. J. (2006). Cortical substrates for exploratory decisions in humans. Nature, 441(7095):876–879. Dayan, P. and Huys, Q. J. M. (2008). Serotonin, inhibition, and negative mood. PLoS Comput Biol, 4(2):e4. Dayan, P. and Yu, A. J. (2006). Phasic norepinephrine: a neural interrupt signal for unexpected events. Network, 17(4):335– 350. DeRubeis, R. J., Gelfand, L. A., Tang, T. Z., and Simons, A. D. (1999). Medications versus cognitive behavior therapy for severely depressed outpatients: mega-analysis of four randomized comparisons. Am J Psychiatry, 156(7):1007–1013. First, M. B., Spitzer, R. L., Gibbon, M., and Williams, J. B. (2002a). Structured Clinical Interview for DSM-IV-TR Axis I Disorders, Research Version, Non-Patient Edition. (SCID-I/NP). Biometrics Research, New York State Psychiatric Institute. First, M. B., Spitzer, R. L., Gibbon, M., and Williams, J. B. (2002b). Structured Clinical Interview for DSM-IV-TR Axis I Disorders, Research Version, Patient Edition. (SCID-I/P). Biometrics Research, New York State Psychiatric Institute. Gotlib, I. H. and Hammen, C. L., editors (2002). Handbook of Depression. The Guilford Press. Henriques, J. B. and Davidson, R. J. (2000). Decreased responsiveness to reward in depression. Cognition and Emotion, 14(5):711–24. Henriques, J. B., Glowacki, J. M., and Davidson, R. 
J. (1994). Reward fails to alter response bias in depression. J Abnorm Psychol, 103(3):460–6. Huys, Q. J. M. (2007). Reinforcers and control. Towards a computational ætiology of depression. PhD thesis, Gatsby Computational Neuroscience Unit, UCL, University of London. Huys, Q. J. M. and Dayan, P. (2007). A bayesian formulation of behavioral control. Under Review, 0:00. Kapur, S. and Remington, G. (1996). Serotonin-dopamine interaction and its relevance to schizophrenia. Am J Psychiatry, 153(4):466–76. Kendler, K. S., Karkowski, L. M., and Prescott, C. A. (1999). Causal relationship between stressful life events and the onset of major depression. Am. J. Psychiatry, 156:837–41. Maier, S. and Seligman, M. (1976). Learned Helplessness: Theory and Evidence. Journal of Experimental Psychology: General, 105(1):3–46. Maier, S. F. and Watkins, L. R. (2005). Stressor controllability and learned helplessness: the roles of the dorsal raphe nucleus, serotonin, and corticotropin-releasing factor. Neurosci. Biobehav. Rev., 29(4-5):829–41. Montague, P. R., Dayan, P., and Sejnowski, T. J. (1996). A framework for mesencephalic dopamine systems based on predictive hebbian learning. J. Neurosci., 16(5):1936–47. Moutoussis, M., Bentall, R. P., Williams, J., and Dayan, P. (2008). A temporal difference account of avoidance learning. Network, 19(2):137–160. Peterson, C., Maier, S. F., and Seligman, M. E. P. (1993). Learned Helplessness: A theory for the age of personal control. OUP, Oxford, UK. Pizzagalli, D. A., Jahn, A. L., and O’Shea, J. P. (2005). Toward an objective characterization of an anhedonic phenotype: a signal-detection approach. Biol Psychiatry, 57(4):319–327. Power, M., editor (2005). Mood Disorders: A Handbook of Science and Practice. John Wiley and Sons, paperback edition. Smith, A., Li, M., Becker, S., and Kapur, S. (2004). A model of antipsychotic action in conditioned avoidance: a computational approach. Neuropsychopharm., 29(6):1040–9. Smith, K. A., Morris, J. S., Friston, K. J., Cowen, P. J., and Dolan, R. J. (1999). Brain mechanisms associated with depressive relapse and associated cognitive impairment following acute tryptophan depletion. Br. J. Psychiatry, 174:525–9. Veiel, H. O. F. (1997). A preliminary profile of neuropsychological deficits associated with major depression. J. Clin. Exp. Neuropsychol., 19:587–603. Williams, J. and Dayan, P. (2005). Dopamine, learning, and impulsivity: a biological account of attentiondeficit/hyperactivity disorder. J Child Adolesc Psychopharmacol, 15(2):160–79; discussion 157–9. Williams, J. M. G. (1992). The psychological treatment of depression. Routledge. Willner, P. (1997). Validity, reliability and utility of the chronic mild stress model of depression: a 10-year review and evaluation. Psychopharm, 134:319–29. Yu, A. J. and Dayan, P. (2005). Uncertainty, neuromodulation, and attention. Neuron, 46(4):681–692. 8
6 0.49264449 222 nips-2008-Stress, noradrenaline, and realistic prediction of mouse behaviour using reinforcement learning
7 0.4816587 94 nips-2008-Goal-directed decision making in prefrontal cortex: a computational framework
8 0.38364467 230 nips-2008-Temporal Difference Based Actor Critic Learning - Convergence and Neural Implementation
9 0.37188464 150 nips-2008-Near-optimal Regret Bounds for Reinforcement Learning
10 0.33868167 33 nips-2008-Bayesian Model of Behaviour in Economic Games
11 0.33725971 121 nips-2008-Learning to Use Working Memory in Partially Observable Environments through Dopaminergic Reinforcement
12 0.32726949 46 nips-2008-Characterizing response behavior in multisensory perception with conflicting cues
13 0.32409084 87 nips-2008-Fitted Q-iteration by Advantage Weighted Regression
14 0.30809647 212 nips-2008-Skill Characterization Based on Betweenness
15 0.29842693 13 nips-2008-Adapting to a Market Shock: Optimal Sequential Market-Making
16 0.28997597 141 nips-2008-Multi-Agent Filtering with Infinitely Nested Beliefs
17 0.28347757 37 nips-2008-Biasing Approximate Dynamic Programming with a Lower Discount Factor
18 0.27920434 96 nips-2008-Hebbian Learning of Bayes Optimal Decisions
20 0.26207316 144 nips-2008-Multi-resolution Exploration in Continuous Spaces
topicId topicWeight
[(4, 0.027), (6, 0.052), (7, 0.039), (12, 0.031), (15, 0.018), (18, 0.021), (21, 0.011), (28, 0.202), (51, 0.252), (57, 0.053), (59, 0.021), (63, 0.025), (71, 0.027), (77, 0.077), (78, 0.014), (83, 0.046)]
simIndex simValue paperId paperTitle
1 0.83546919 59 nips-2008-Dependent Dirichlet Process Spike Sorting
Author: Jan Gasthaus, Frank Wood, Dilan Gorur, Yee W. Teh
Abstract: In this paper we propose a new incremental spike sorting model that automatically eliminates refractory period violations, accounts for action potential waveform drift, and can handle “appearance” and “disappearance” of neurons. Our approach is to augment a known time-varying Dirichlet process that ties together a sequence of infinite Gaussian mixture models, one per action potential waveform observation, with an interspike-interval-dependent likelihood that prohibits refractory period violations. We demonstrate this model by showing results from sorting two publicly available neural data recordings for which a partial ground truth labeling is known. 1
same-paper 2 0.80965388 223 nips-2008-Structure Learning in Human Sequential Decision-Making
Author: Daniel Acuna, Paul R. Schrater
Abstract: We use graphical models and structure learning to explore how people learn policies in sequential decision making tasks. Studies of sequential decision-making in humans frequently find suboptimal performance relative to an ideal actor that knows the graph model that generates reward in the environment. We argue that the learning problem humans face also involves learning the graph structure for reward generation in the environment. We formulate the structure learning problem using mixtures of reward models, and solve the optimal action selection problem using Bayesian Reinforcement Learning. We show that structure learning in one and two armed bandit problems produces many of the qualitative behaviors deemed suboptimal in previous studies. Our argument is supported by the results of experiments that demonstrate humans rapidly learn and exploit new reward structure. 1
3 0.6780135 195 nips-2008-Regularized Policy Iteration
Author: Amir M. Farahmand, Mohammad Ghavamzadeh, Shie Mannor, Csaba Szepesvári
Abstract: In this paper we consider approximate policy-iteration-based reinforcement learning algorithms. In order to implement a flexible function approximation scheme we propose the use of non-parametric methods with regularization, providing a convenient way to control the complexity of the function approximator. We propose two novel regularized policy iteration algorithms by adding L2 -regularization to two widely-used policy evaluation methods: Bellman residual minimization (BRM) and least-squares temporal difference learning (LSTD). We derive efficient implementation for our algorithms when the approximate value-functions belong to a reproducing kernel Hilbert space. We also provide finite-sample performance bounds for our algorithms and show that they are able to achieve optimal rates of convergence under the studied conditions. 1
4 0.67232239 87 nips-2008-Fitted Q-iteration by Advantage Weighted Regression
Author: Gerhard Neumann, Jan R. Peters
Abstract: Recently, fitted Q-iteration (FQI) based methods have become more popular due to their increased sample efficiency, a more stable learning process and the higher quality of the resulting policy. However, these methods remain hard to use for continuous action spaces which frequently occur in real-world tasks, e.g., in robotics and other technical applications. The greedy action selection commonly used for the policy improvement step is particularly problematic as it is expensive for continuous actions, can cause an unstable learning process, introduces an optimization bias and results in highly non-smooth policies unsuitable for real-world systems. In this paper, we show that by using a soft-greedy action selection the policy improvement step used in FQI can be simplified to an inexpensive advantageweighted regression. With this result, we are able to derive a new, computationally efficient FQI algorithm which can even deal with high dimensional action spaces. 1
5 0.67221272 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations
Author: David Sontag, Amir Globerson, Tommi S. Jaakkola
Abstract: We propose a new class of consistency constraints for Linear Programming (LP) relaxations for finding the most probable (MAP) configuration in graphical models. Usual cluster-based LP relaxations enforce joint consistency on the beliefs of a cluster of variables, with computational cost increasing exponentially with the size of the clusters. By partitioning the state space of a cluster and enforcing consistency only across partitions, we obtain a class of constraints which, although less tight, are computationally feasible for large clusters. We show how to solve the cluster selection and partitioning problem monotonically in the dual LP, using the current beliefs to guide these choices. We obtain a dual message passing algorithm and apply it to protein design problems where the variables have large state spaces and the usual cluster-based relaxations are very costly. The resulting method solves many of these problems exactly, and significantly faster than a method that does not use partitioning. 1
6 0.67188221 231 nips-2008-Temporal Dynamics of Cognitive Control
7 0.67008764 96 nips-2008-Hebbian Learning of Bayes Optimal Decisions
8 0.66905224 150 nips-2008-Near-optimal Regret Bounds for Reinforcement Learning
9 0.66793859 40 nips-2008-Bounds on marginal probability distributions
10 0.66736251 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE
11 0.66654366 101 nips-2008-Human Active Learning
12 0.66630113 50 nips-2008-Continuously-adaptive discretization for message-passing algorithms
13 0.66418588 37 nips-2008-Biasing Approximate Dynamic Programming with a Lower Discount Factor
14 0.66347426 227 nips-2008-Supervised Exponential Family Principal Component Analysis via Convex Optimization
15 0.66298658 173 nips-2008-Optimization on a Budget: A Reinforcement Learning Approach
16 0.66295975 94 nips-2008-Goal-directed decision making in prefrontal cortex: a computational framework
17 0.66104591 48 nips-2008-Clustering via LP-based Stabilities
18 0.66087073 107 nips-2008-Influence of graph construction on graph-based clustering measures
19 0.66055542 34 nips-2008-Bayesian Network Score Approximation using a Metagraph Kernel
20 0.66007799 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost