nips nips2008 nips2008-170 nips2008-170-reference knowledge-graph by maker-knowledge-mining

170 nips-2008-Online Optimization in X-Armed Bandits

Source: pdf

Author: Sébastien Bubeck, Gilles Stoltz, Csaba Szepesvári, Rémi Munos

Abstract: We consider a generalization of stochastic bandit problems where the set of arms, X , is allowed to be a generic topological space. We constraint the mean-payoff function with a dissimilarity function over X in a way that is more general than Lipschitz. We construct an arm selection policy whose regret improves upon previous result for a large class of problems. In particular, our results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a ﬁnite number of global maxima around which the behavior of the function is locally H¨ lder √ a known exponent, then the expected o with regret is bounded up to a logarithmic factor by n, i.e., the rate of the growth of the regret is independent of the dimension of the space. Moreover, we prove the minimax optimality of our algorithm for the class of mean-payoff functions we consider. 1 Introduction and motivation Bandit problems arise in many settings, including clinical trials, scheduling, on-line parameter tuning of algorithms or optimization of controllers based on simulations. In the classical bandit problem there are a ﬁnite number of arms that the decision maker can select at discrete time steps. Selecting an arm results in a random reward, whose distribution is determined by the identity of the arm selected. The distributions associated with the arms are unknown to the decision maker whose goal is to maximize the expected sum of the rewards received. In many practical situations the arms belong to a large set. This set could be continuous [1; 6; 3; 2; 7], hybrid-continuous, or it could be the space of inﬁnite sequences over a ﬁnite alphabet [4]. In this paper we consider stochastic bandit problems where the set of arms, X , is allowed to be an arbitrary topological space. We assume that the decision maker knows a dissimilarity function deﬁned over this space that constraints the shape of the mean-payoff function. In particular, the dissimilarity function is assumed to put a lower bound on the mean-payoff function from below at each maxima. We also assume that the decision maker is able to cover the space of arms in a recursive manner, successively reﬁning the regions in the covering such that the diameters of these sets shrink at a known geometric rate when measured with the dissimilarity. ∗ Csaba Szepesv´ ri is on leave from MTA SZTAKI. He also greatly acknowledges the support received from the a Alberta Ingenuity Fund, iCore and NSERC. 1 Our work generalizes and improves previous works on continuum-armed bandit problems: Kleinberg [6] and Auer et al. [2] focussed on one-dimensional problems. Recently, Kleinberg et al. [7] considered generic metric spaces assuming that the mean-payoff function is Lipschitz with respect to the (known) metric of the space. They proposed an interesting algorithm that achieves essentially the best possible regret in a minimax sense with respect to these environments. The goal of this paper is to further these works in a number of ways: (i) we allow the set of arms to be a generic topological space; (ii) we propose a practical algorithm motivated by the recent very successful tree-based optimization algorithms [8; 5; 4] and show that the algorithm is (iii) able to exploit higher order smoothness. In particular, as we shall argue in Section 7, (i) improves upon the results of Auer et al. [2], while (i), (ii) and (iii) improve upon the work of Kleinberg et al. [7]. Compared to Kleinberg et al. [7], our work represents an improvement in the fact that just like Auer et al. [2] we make use of the local properties of the mean-payoff function around the maxima only, and not a global property, such as Lipschitzness in √ the whole space. This allows us to obtain a regret which scales as O( n) 1 when e.g. the space is the unit hypercube and the mean-payoff function is locally H¨ lder with known exponent in the neighborhood of any o maxima (which are in ﬁnite number) and bounded away from the maxima outside of these neighborhoods. Thus, we get the desirable property that the rate of growth of the regret is independent of the dimensionality of the input space. We also prove a minimax lower bound that matches our upper bound up to logarithmic factors, showing that the performance of our algorithm is essentially unimprovable in a minimax sense. Besides these theoretical advances the algorithm is anytime and easy to implement. Since it is based on ideas that have proved to be efﬁcient, we expect it to perform well in practice and to make a signiﬁcant impact on how on-line global optimization is performed. 2 Problem setup, notation We consider a topological space X , whose elements will be referred to as arms. A decision maker “pulls” the arms in X one at a time at discrete time steps. Each pull results in a reward that depends on the arm chosen and which the decision maker learns of. The goal of the decision maker is to choose the arms so as to maximize the sum of the rewards that he receives. In this paper we are concerned with stochastic environments. Such an environment M associates to each arm x ∈ X a distribution Mx on the real line. The support of these distributions is assumed to be uniformly bounded with a known bound. For the sake of simplicity, we assume this bound is 1. We denote by f (x) the expectation of Mx , which is assumed to be measurable (all measurability concepts are with respect to the Borel-algebra over X ). The function f : X → R thus deﬁned is called the mean-payoff function. When in round n the decision maker pulls arm Xn ∈ X , he receives a reward Yn drawn from MXn , independently of the past arm choices and rewards. A pulling strategy of a decision maker is determined by a sequence ϕ = (ϕn )n≥1 of measurable mappings, n−1 where each ϕn maps the history space Hn = X × [0, 1] to the space of probability measures over X . By convention, ϕ1 does not take any argument. A strategy is deterministic if for every n the range of ϕn contains only Dirac distributions. According to the process that was already informally described, a pulling strategy ϕ and an environment M jointly determine a random process (X1 , Y1 , X2 , Y2 , . . .) in the following way: In round one, the decision maker draws an arm X1 at random from ϕ1 and gets a payoff Y1 drawn from MX1 . In round n ≥ 2, ﬁrst, Xn is drawn at random according to ϕn (X1 , Y1 , . . . , Xn−1 , Yn−1 ), but otherwise independently of the past. Then the decision maker gets a rewards Yn drawn from MXn , independently of all other random variables in the past given Xn . Let f ∗ = supx∈X f (x) be the maximal expected payoff. The cumulative regret of a pulling strategy in n n environment M is Rn = n f ∗ − t=1 Yt , and the cumulative pseudo-regret is Rn = n f ∗ − t=1 f (Xt ). 1 We write un = O(vu ) when un = O(vn ) up to a logarithmic factor. 2 In the sequel, we restrict our attention to the expected regret E [Rn ], which in fact equals E[Rn ], as can be seen by the application of the tower rule. 3 3.1 The Hierarchical Optimistic Optimization (HOO) strategy Trees of coverings We ﬁrst introduce the notion of a tree of coverings. Our algorithm will require such a tree as an input. Deﬁnition 1 (Tree of coverings). A tree of coverings is a family of measurable subsets (Ph,i )1≤i≤2h , h≥0 of X such that for all ﬁxed integer h ≥ 0, the covering ∪1≤i≤2h Ph,i = X holds. Moreover, the elements of the covering are obtained recursively: each subset Ph,i is covered by the two subsets Ph+1,2i−1 and Ph+1,2i . A tree of coverings can be represented, as the name suggests, by a binary tree T . The whole domain X = P0,1 corresponds to the root of the tree and Ph,i corresponds to the i–th node of depth h, which will be referred to as node (h, i) in the sequel. The fact that each Ph,i is covered by the two subsets Ph+1,2i−1 and Ph+1,2i corresponds to the childhood relationship in the tree. Although the deﬁnition allows the childregions of a node to cover a larger part of the space, typically the size of the regions shrinks as depth h increases (cf. Assumption 1). Remark 1. Our algorithm will instantiate the nodes of the tree on an ”as needed” basis, one by one. In fact, at any round n it will only need n nodes connected to the root. 3.2 Statement of the HOO strategy The algorithm picks at each round a node in the inﬁnite tree T as follows. In the ﬁrst round, it chooses the root node (0, 1). Now, consider round n + 1 with n ≥ 1. Let us denote by Tn the set of nodes that have been picked in previous rounds and by Sn the nodes which are not in Tn but whose parent is. The algorithm picks at round n + 1 a node (Hn+1 , In+1 ) ∈ Sn according to the deterministic rule that will be described below. After selecting the node, the algorithm further chooses an arm Xn+1 ∈ PHn+1 ,In+1 . This selection can be stochastic or deterministic. We do not put any further restriction on it. The algorithm then gets a reward Yn+1 as described above and the procedure goes on: (Hn+1 , In+1 ) is added to Tn to form Tn+1 and the children of (Hn+1 , In+1 ) are added to Sn to give rise to Sn+1 . Let us now turn to how (Hn+1 , In+1 ) is selected. Along with the nodes the algorithm stores what we call B–values. The node (Hn+1 , In+1 ) ∈ Sn to expand at round n + 1 is picked by following a path from the root to a node in Sn , where at each node along the path the child with the larger B–value is selected (ties are broken arbitrarily). In order to deﬁne a node’s B–value, we need a few quantities. Let C(h, i) be the set that collects (h, i) and its descendants. We let n Nh,i (n) = I{(Ht ,It )∈C(h,i)} t=1 be the number of times the node (h, i) was visited. A given node (h, i) is always picked at most once, but since its descendants may be picked afterwards, subsequent paths in the tree can go through it. Consequently, 1 ≤ Nh,i (n) ≤ n for all nodes (h, i) ∈ Tn . Let µh,i (n) be the empirical average of the rewards received for the time-points when the path followed by the algorithm went through (h, i): n 1 µh,i (n) = Yt I{(Ht ,It )∈C(h,i)} . Nh,i (n) t=1 The corresponding upper conﬁdence bound is by deﬁnition Uh,i (n) = µh,i (n) + 3 2 ln n + ν 1 ρh , Nh,i (n) where 0 < ρ < 1 and ν1 > 0 are parameters of the algorithm (to be chosen later by the decision maker, see Assumption 1). For nodes not in Tn , by convention, Uh,i (n) = +∞. Now, for a node (h, i) in Sn , we deﬁne its B–value to be Bh,i (n) = +∞. The B–values for nodes in Tn are given by Bh,i (n) = min Uh,i (n), max Bh+1,2i−1 (n), Bh+1,2i (n) . Note that the algorithm is deterministic (apart, maybe, from the arbitrary random choice of Xt in PHt ,It ). Its total space requirement is linear in n while total running time at round n is at most quadratic in n, though we conjecture that it is O(n log n) on average. 4 Assumptions made on the model and statement of the main result We suppose that X is equipped with a dissimilarity , that is a non-negative mapping : X 2 → R satisfying (x, x) = 0. The diameter (with respect to ) of a subset A of X is given by diam A = supx,y∈A (x, y). Given the dissimilarity , the “open” ball with radius ε > 0 and center c ∈ X is B(c, ε) = { x ∈ X : (c, x) < ε } (we do not require the topology induced by to be related to the topology of X .) In what follows when we refer to an (open) ball, we refer to the ball deﬁned with respect to . The dissimilarity will be used to capture the smoothness of the mean-payoff function. The decision maker chooses and the tree of coverings. The following assumption relates this choice to the parameters ρ and ν1 of the algorithm: Assumption 1. There exist ρ < 1 and ν1 , ν2 > 0 such that for all integers h ≥ 0 and all i = 1, . . . , 2h , the diameter of Ph,i is bounded by ν1 ρh , and Ph,i contains an open ball Ph,i of radius ν2 ρh . For a given h, the Ph,i are disjoint for 1 ≤ i ≤ 2h . Remark 2. A typical choice for the coverings in a cubic domain is to let the domains be hyper-rectangles. They can be obtained, e.g., in a dyadic manner, by splitting at each step hyper-rectangles in the middle along their longest side, in an axis parallel manner; if all sides are equal, we split them along the√ axis. In ﬁrst this example, if X = [0, 1]D and (x, y) = x − y α then we can take ρ = 2−α/D , ν1 = ( D/2)α and ν2 = 1/8α . The next assumption concerns the environment. Deﬁnition 2. We say that f is weakly Lipschitz with respect to if for all x, y ∈ X , f ∗ − f (y) ≤ f ∗ − f (x) + max f ∗ − f (x), (x, y) . (1) Note that weak Lipschitzness is satisﬁed whenever f is 1–Lipschitz, i.e., for all x, y ∈ X , one has |f (x) − f (y)| ≤ (x, y). On the other hand, weak Lipschitzness implies local (one-sided) 1–Lipschitzness at any maxima. Indeed, at an optimal arm x∗ (i.e., such that f (x∗ ) = f ∗ ), (1) rewrites to f (x∗ ) − f (y) ≤ (x∗ , y). However, weak Lipschitzness does not constraint the growth of the loss in the vicinity of other points. Further, weak Lipschitzness, unlike Lipschitzness, does not constraint the local decrease of the loss at any point. Thus, weak-Lipschitzness is a property that lies somewhere between a growth condition on the loss around optimal arms and (one-sided) Lipschitzness. Note that since weak Lipschitzness is deﬁned with respect to a dissimilarity, it can actually capture higher-order smoothness at the optima. For example, f (x) = 1 − x2 is weak Lipschitz with the dissimilarity (x, y) = c(x − y)2 for some appropriate constant c. Assumption 2. The mean-payoff function f is weakly Lipschitz. ∗ ∗ Let fh,i = supx∈Ph,i f (x) and ∆h,i = f ∗ − fh,i be the suboptimality of node (h, i). We say that def a node (h, i) is optimal (respectively, suboptimal) if ∆h,i = 0 (respectively, ∆h,i > 0). Let Xε = { x ∈ X : f (x) ≥ f ∗ − ε } be the set of ε-optimal arms. The following result follows from the deﬁnitions; a proof can be found in the appendix. 4 Lemma 1. Let Assumption 1 and 2 hold. If the suboptimality ∆h,i of a region is bounded by cν1 ρh for some c > 0, then all arms in Ph,i are max{2c, c + 1}ν1 ρh -optimal. The last assumption is closely related to Assumption 2 of Auer et al. [2], who observed that the regret of a continuum-armed bandit algorithm should depend on how fast the volume of the sets of ε-optimal arms shrinks as ε → 0. Here, we capture this by deﬁning a new notion, the near-optimality dimension of the mean-payoff function. The connection between these concepts, as well as the zooming dimension deﬁned by Kleinberg et al. [7] will be further discussed in Section 7. Deﬁne the packing number P(X , , ε) to be the size of the largest packing of X with disjoint open balls of radius ε with respect to the dissimilarity .2 We now deﬁne the near-optimality dimension, which characterizes the size of the sets Xε in terms of ε, and then state our main result. Deﬁnition 3. For c > 0 and ε0 > 0, the (c, ε0 )–near-optimality dimension of f with respect to equals inf d ∈ [0, +∞) : ∃ C s.t. ∀ε ≤ ε0 , P Xcε , , ε ≤ C ε−d (2) (with the usual convention that inf ∅ = +∞). Theorem 1 (Main result). Let Assumptions 1 and 2 hold and assume that the (4ν1 /ν2 , ν2 )–near-optimality dimension of the considered environment is d < +∞. Then, for any d > d there exists a constant C(d ) such that for all n ≥ 1, ERn ≤ C(d ) n(d +1)/(d +2) ln n 1/(d +2) . Further, if the near-optimality dimension is achieved, i.e., the inﬁmum is achieved in (2), then the result holds also for d = d. Remark 3. We can relax the weak-Lipschitz property by requiring it to hold only locally around the maxima. In fact, at the price of increased constants, the result continues to hold if there exists ε > 0 such that (1) holds for any x, y ∈ Xε . To show this we only need to carefully adapt the steps of the proof below. We omit the details from this extended abstract. 5 Analysis of the regret and proof of the main result We ﬁrst state three lemmas, whose proofs can be found in the appendix. The proofs of Lemmas 3 and 4 rely on concentration-of-measure techniques, while that of Lemma 2 follows from a simple case study. Let us ﬁx some path (0, 1), (1, i∗ ), . . . , (h, i∗ ), . . . , of optimal nodes, starting from the root. 1 h Lemma 2. Let (h, i) be a suboptimal node. Let k be the largest depth such that (k, i∗ ) is on the path from k the root to (h, i). Then we have n E Nh,i (n) ≤ u+ P Nh,i (t) > u and Uh,i (t) > f ∗ or Us,i∗ ≤ f ∗ for some s ∈ {k+1, . . . , t−1} s t=u+1 Lemma 3. Let Assumptions 1 and 2 hold. 1, P Uh,i (n) ≤ f ∗ ≤ n−3 . Then, for all optimal nodes and for all integers n ≥ Lemma 4. Let Assumptions 1 and 2 hold. Then, for all integers t ≤ n, for all suboptimal nodes (h, i) 8 ln such that ∆h,i > ν1 ρh , and for all integers u ≥ 1 such that u ≥ (∆h,i −νnρh )2 , one has P Uh,i (t) > 1 f ∗ and Nh,i (t) > u ≤ t n−4 . 2 Note that sometimes packing numbers are deﬁned as the largest packing with disjoint open balls of radius ε/2, or, ε-nets. 5 . Taking u as the integer part of (8 ln n)/(∆h,i − ν1 ρh )2 , and combining the results of Lemma 2, 3, and 4 with a union bound leads to the following key result. Lemma 5. Under Assumptions 1 and 2, for all suboptimal nodes (h, i) such that ∆h,i > ν1 ρh , we have, for all n ≥ 1, 8 ln n 2 E[Nh,i (n)] ≤ + . (∆h,i − ν1 ρh )2 n We are now ready to prove Theorem 1. Proof. For the sake of simplicity we assume that the inﬁmum in the deﬁnition of near-optimality is achieved. To obtain the result in the general case one only needs to replace d below by d > d in the proof below. First step. For all h = 1, 2, . . ., denote by Ih the nodes at depth h that are 2ν1 ρh –optimal, i.e., the nodes ∗ (h, i) such that fh,i ≥ f ∗ − 2ν1 ρh . Then, I is the union of these sets of nodes. Further, let J be the set of nodes that are not in I but whose parent is in I. We then denote by Jh the nodes in J that are located at depth h in the tree. Lemma 4 bounds the expected number of times each node (h, i) ∈ Jh is visited. Since ∆h,i > 2ν1 ρh , we get 8 ln n 2 E Nh,i (n) ≤ 2 2h + . ν1 ρ n Second step. We bound here the cardinality |Ih |, h > 0. If (h, i) ∈ Ih then since ∆h,i ≤ 2ν1 ρh , by Lemma 1 Ph,i ⊂ X4ν1 ρh . Since by Assumption 1, the sets (Ph,i ), for (h, i) ∈ Ih , contain disjoint balls of radius ν2 ρh , we have that |Ih | ≤ P ∪(h,i)∈Ih Ph,i , , ν2 ρh ≤ P X(4ν1 /ν2 ) ν2 ρh , , ν2 ρh ≤ C ν2 ρh −d , where we used the assumption that d is the (4ν1 /ν2 , ν2 )–near-optimality dimension of f (and C is the constant introduced in the deﬁnition of the near-optimality dimension). Third step. Choose η > 0 and let H be the smallest integer such that ρH ≤ η. We partition the inﬁnite tree T into three sets of nodes, T = T1 ∪ T2 ∪ T3 . The set T1 contains nodes of IH and their descendants, T2 = ∪0≤h < 1 and then, by optimizing over ρH (the worst value being ρH ∼ ( ln n )−1/(d+2) ). 6 Minimax optimality The packing dimension of a set X is the smallest d such that there exists a constant k such that for all ε > 0, P X , , ε ≤ k ε−d . For instance, compact subsets of Rd (with non-empty interior) have a packing dimension of d whenever is a norm. If X has a packing dimension of d, then all environments have a near-optimality dimension less than d. The proof of the main theorem indicates that the constant C(d) only depends on d, k (of the deﬁnition of packing dimension), ν1 , ν2 , and ρ, but not on the environment as long as it is weakly Lipschitz. Hence, we can extract from it a distribution-free bound of the form O(n(d+1)/(d+2) ). In fact, this bound can be shown to be optimal as is illustrated by the theorem below, whose assumptions are satisﬁed by, e.g., compact subsets of Rd and if is some norm of Rd . The proof can be found in the appendix. Theorem 2. If X is such that there exists c > 0 with P(X , , ε) ≥ c ε−d ≥ 2 for all ε ≤ 1/4 then for all n ≥ 4d−1 c/ ln(4/3), all strategies ϕ are bound to suffer a regret of at least 2/(d+2) 1 1 c n(d+1)/(d+2) , 4 4 4 ln(4/3) where the supremum is taken over all environments with weakly Lipschitz payoff functions. sup E Rn (ϕ) ≥ 7 Discussion Several works [1; 6; 3; 2; 7] have considered continuum-armed bandits in Euclidean or metric spaces and provided upper- and lower-bounds on the regret for given classes of environments. Cope [3] derived a regret √ of O( n) for compact and convex subset of Rd and a mean-payoff function with unique minima and second order smoothness. Kleinberg [6] considered mean-payoff functions f on the real line that are H¨ lder with o degree 0 < α ≤ 1. The derived regret is Θ(n(α+1)/(α+2) ). Auer et al. [2] extended the analysis to classes of functions with only a local H¨ lder assumption around maximum (with possibly higher smoothness degree o 1+α−αβ α ∈ [0, ∞)), and derived the regret Θ(n 1+2α−αβ ), where β is such that the Lebesgue measure of ε-optimal 7 states is O(εβ ). Another setting is that of [7] who considered a metric space (X , ) and assumed that f is Lipschitz w.r.t. . The obtained regret is O(n(d+1)/(d+2) ) where d is the zooming dimension (deﬁned similarly to our near-optimality dimension, but using covering numbers instead of packing numbers and the sets Xε \ Xε/2 ). When (X , ) is a metric space covering and packing numbers are equivalent and we may prove that the zooming dimension and near-optimality dimensions are equal. Our main contribution compared to [7] is that our weak-Lipschitz assumption, which is substantially weaker than the global Lipschitz assumption assumed in [7], enables our algorithm to work better in some common situations, such as when the mean-payoff function assumes a local smoothness whose order is larger than one. In order to relate all these results, let us consider a speciﬁc example: Let X = [0, 1]D and assume that the mean-reward function f is locally equivalent to a H¨ lder function with degree α ∈ [0, ∞) around any o maxima x∗ of f (the number of maxima is assumed to be ﬁnite): f (x∗ ) − f (x) = Θ(||x − x∗ ||α ) as x → x∗ . (3) This means that ∃c1 , c2 , ε0 > 0, ∀x, s.t. ||x − x∗ || ≤ ε0 , c1 ||x − x∗ ||α ≤ f (x∗ ) − f (x) ≤ c2 ||x − x∗ ||α . √ Under this assumption, the result of Auer et al. [2] shows that for D = 1, the regret is Θ( n) (since here √ β = 1/α). Our result allows us to extend the n regret rate to any dimension D. Indeed, if we choose our def dissimilarity measure to be α (x, y) = ||x − y||α , we may prove that f satisﬁes a locally weak-Lipschitz √ condition (as deﬁned in Remark 3) and that the near-optimality dimension is 0. Thus our regret is O( n), i.e., the rate is independent of the dimension D. In comparison, since Kleinberg et al. [7] have to satisfy a global Lipschitz assumption, they can not use α when α > 1. Indeed a function globally Lipschitz with respect to α is essentially constant. Moreover α does not deﬁne a metric for α > 1. If one resort to the Euclidean metric to fulﬁll their requirement that f be Lipschitz w.r.t. the metric then the zooming dimension becomes D(α − 1)/α, while the regret becomes √ O(n(D(α−1)+α)/(D(α−1)+2α) ), which is strictly worse than O( n) and in fact becomes close to the slow rate O(n(D+1)/(D+2) ) when α is larger. Nevertheless, in the case of α ≤ 1 they get the same regret rate. In contrast, our result shows that under very weak constraints on the mean-payoff function and if the local behavior of the function around its maximum (or ﬁnite number of maxima) is known then global optimization √ suffers a regret of order O( n), independent of the space dimension. As an interesting sidenote let us also remark that our results allow different smoothness orders along different dimensions, i.e., heterogenous smoothness spaces. References [1] R. Agrawal. The continuum-armed bandit problem. SIAM J. Control and Optimization, 33:1926–1951, 1995. [2] P. Auer, R. Ortner, and Cs. Szepesv´ ri. Improved rates for the stochastic continuum-armed bandit problem. 20th a Conference on Learning Theory, pages 454–468, 2007. [3] E. Cope. Regret and convergence bounds for immediate-reward reinforcement learning with continuous action spaces. Preprint, 2004. [4] P.-A. Coquelin and R. Munos. Bandit algorithms for tree search. In Proceedings of 23rd Conference on Uncertainty in Artiﬁcial Intelligence, 2007. [5] S. Gelly, Y. Wang, R. Munos, and O. Teytaud. Modiﬁcation of UCT with patterns in Monte-Carlo go. Technical Report RR-6062, INRIA, 2006. [6] R. Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In 18th Advances in Neural Information Processing Systems, 2004. [7] R. Kleinberg, A. Slivkins, and E. Upfal. Multi-armed bandits in metric spaces. In Proceedings of the 40th ACM Symposium on Theory of Computing, 2008. [8] L. Kocsis and Cs. Szepesv´ ri. Bandit based Monte-Carlo planning. In Proceedings of the 15th European Conference a on Machine Learning, pages 282–293, 2006. 8

reference text

[1] R. Agrawal. The continuum-armed bandit problem. SIAM J. Control and Optimization, 33:1926–1951, 1995.

[2] P. Auer, R. Ortner, and Cs. Szepesv´ ri. Improved rates for the stochastic continuum-armed bandit problem. 20th a Conference on Learning Theory, pages 454–468, 2007.

[3] E. Cope. Regret and convergence bounds for immediate-reward reinforcement learning with continuous action spaces. Preprint, 2004.

[4] P.-A. Coquelin and R. Munos. Bandit algorithms for tree search. In Proceedings of 23rd Conference on Uncertainty in Artiﬁcial Intelligence, 2007.

[5] S. Gelly, Y. Wang, R. Munos, and O. Teytaud. Modiﬁcation of UCT with patterns in Monte-Carlo go. Technical Report RR-6062, INRIA, 2006.

[6] R. Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In 18th Advances in Neural Information Processing Systems, 2004.

[7] R. Kleinberg, A. Slivkins, and E. Upfal. Multi-armed bandits in metric spaces. In Proceedings of the 40th ACM Symposium on Theory of Computing, 2008.

[8] L. Kocsis and Cs. Szepesv´ ri. Bandit based Monte-Carlo planning. In Proceedings of the 15th European Conference a on Machine Learning, pages 282–293, 2006. 8