nips nips2008 nips2008-140 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Deepayan Chakrabarti, Ravi Kumar, Filip Radlinski, Eli Upfal
Abstract: We formulate and study a new variant of the k-armed bandit problem, motivated by e-commerce applications. In our model, arms have a (stochastic) lifetime after which they expire. In this setting an algorithm needs to continuously explore new arms, in contrast to the standard k-armed bandit model in which arms are available indefinitely and exploration is reduced once an optimal arm is identified with near-certainty. The main motivation for our setting is online advertising, where ads have limited lifetime due to, for example, the nature of their content and their campaign budgets. An algorithm needs to choose among a large collection of ads, more than can be fully explored within the typical ad lifetime. We present an optimal algorithm for the state-aware (deterministic reward function) case, and build on this technique to obtain an algorithm for the state-oblivious (stochastic reward function) case. Empirical studies on various reward distributions, including one derived from a real-world ad serving application, show that the proposed algorithms significantly outperform the standard multi-armed bandit approaches applied to these settings.
Reference: text
sentIndex sentText sentNum sentScore
1 In our model, arms have a (stochastic) lifetime after which they expire. [sent-9, score-0.584]
2 In this setting an algorithm needs to continuously explore new arms, in contrast to the standard k-armed bandit model in which arms are available indefinitely and exploration is reduced once an optimal arm is identified with near-certainty. [sent-10, score-1.378]
3 Empirical studies on various reward distributions, including one derived from a real-world ad serving application, show that the proposed algorithms significantly outperform the standard multi-armed bandit approaches applied to these settings. [sent-14, score-0.687]
4 The content providers and the ad brokers who match ads to content are paid only when ads are clicked; this is commonly referred to as the pay-per-click model. [sent-16, score-0.621]
5 In this setting, the goal of the ad brokers is to select ads to display from a large corpus, so as to generate the most ad clicks and revenue. [sent-17, score-0.495]
6 This is the classic exploration vs. exploitation tradeoff: balancing exploration of ads that may have better click rates against exploitation of the best ads found so far. [sent-19, score-0.52]
7 Following [17, 16], we model the ad selection task as a multi-armed bandit problem [5]. [sent-20, score-0.522]
8 A multiarmed bandit models a casino with k slot machines (one-armed bandits), where each machine (arm) has a different and unknown expected payoff. [sent-21, score-0.501]
9 The goal is to choose a sequence of actions (i.e., slot machine arms to pull) to maximize the expected total reward. [sent-24, score-0.507]
10 Considering each ad as a slot machine that may or may not provide a reward when presented to users allows any multi-armed bandit strategy to be used for the ad selection problem. [sent-25, score-0.85]
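Concretely, the mapping can be as thin as the following Python sketch (hypothetical names; a Bernoulli click model is assumed purely for illustration): an ad exposes the pull() interface that any multi-armed bandit strategy expects.

    import random

    class AdArm:
        """Treat an ad as a bandit arm: pulling it means displaying the ad once.
        The true click-through rate mu is unknown to the bandit algorithm."""
        def __init__(self, mu):
            self.mu = mu                      # hidden click probability
        def pull(self):
            # Reward is 1 if the displayed ad is clicked, 0 otherwise.
            return 1 if random.random() < self.mu else 0

    # Any bandit strategy that works with objects exposing pull() can now be
    # run on a pool of ads.
    ads = [AdArm(mu) for mu in (0.02, 0.05, 0.01)]
    rewards = [ad.pull() for ad in ads]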
11 A standard assumption in the multi-armed bandit setting, however, is that each arm exists perpetually. [sent-30, score-0.843]
12 Although the payoff function of an arm is allowed to evolve over time, the evolution is assumed to be slow. [sent-31, score-0.701]
13 The advertising problem is even more challenging as the set of available ads is often huge (in the tens of millions), while standard multi-armed bandit strategies converge only slowly and require time linear in the number of available options. [sent-34, score-0.678]
14 In this paper we initiate the study of a rapidly changing variant of the multi-armed bandit problem. [sent-35, score-0.426]
15 We call it the mortal multi-armed bandit problem since ads (or equivalently, available bandit arms) are assumed to be born and die regularly. [sent-36, score-1.312]
16 In particular, we will show that while the standard multiarmed bandit setting allows for algorithms that only deviate from the optimal total payoff by O(ln t) [21], in the mortal arm setting a regret of Ω(t) is possible. [sent-37, score-1.571]
17 Our analysis of the mortal multi-arm bandit problem considers two settings. [sent-38, score-0.627]
18 First, in the less realistic but simpler state-aware (deterministic reward) case, pulling arm i always provides a reward that equals the expected payoff of the arm. [sent-39, score-0.931]
19 Second, in the more realistic state-oblivious (stochastic reward) case, the reward from arm i is a Bernoulli (0/1) random variable whose expectation equals the true payoff of the arm. [sent-40, score-1.317]
20 This algorithm is based on characterizing the precise payoff threshold below which repeated arm pulls become suboptimal. [sent-42, score-0.813]
21 If the pool of available ads A(t) were static, or if the payoffs were only slowly changing with t, this problem could be solved using any standard multi-armed bandit approach. [sent-57, score-0.788]
22 This is equivalent to each ad being allocated a lifetime budget Li , drawn from a geometric distribution with parameter p, that is fixed when the arm is born but is never revealed to the algorithm; in this case new arms have an expected lifetime of L = 1/p. [sent-68, score-1.417]
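A quick sanity check of this lifetime model (illustrative code): draw budgets L_i from a geometric distribution with parameter p and confirm that the empirical mean lifetime is close to L = 1/p.

    import random

    def sample_lifetime_budget(p):
        """Draw a lifetime budget L_i from a geometric distribution with
        parameter p (support 1, 2, ..., mean 1/p)."""
        lifetime = 1
        while random.random() >= p:       # arm survives one more pull
            lifetime += 1
        return lifetime

    p = 0.01
    samples = [sample_lifetime_budget(p) for _ in range(20000)]
    print(sum(samples) / len(samples), "vs expected lifetime 1/p =", 1 / p)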
23 The mortal multi-armed bandit setting requires different performance measures than the ones used with static multi-armed bandits. [sent-77, score-0.665]
24 In the static setting, very little exploration is needed once an optimal arm is identified with near-certainty; therefore the quality measure is the total regret over time. [sent-78, score-0.682]
25 We define this regret as the expected payoff of the best currently alive arm minus the payoff actually obtained by the algorithm. [sent-81, score-1.184]
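In simulation this measure can be computed directly, as in the toy sketch below (the two arrays are hypothetical and would come from a simulator in practice).

    # mu_best_alive[t]: expected payoff of the best arm alive at step t
    # mu_played[t]:     expected payoff of the arm the algorithm played at step t
    mu_best_alive = [0.9, 0.9, 0.8, 0.8, 0.95]
    mu_played     = [0.5, 0.9, 0.8, 0.6, 0.70]

    regret_per_turn = sum(b - a for b, a in zip(mu_best_alive, mu_played)) / len(mu_played)
    print(regret_per_turn)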
26 3 Related work. Our work is most related to the study of dynamic versions of the multi-arm bandit (MAB) paradigm where either the set of arms or their expected reward may change over time. [sent-82, score-1.05]
27 Motivated by task scheduling, Gittins [10] proposed a policy where only the state of the active arm (the arm currently being played) can change in a given step, and proved its optimality for the Bayesian formulation with time discounting. [sent-83, score-0.921]
28 In particular, Whittle [23] introduced an extension termed restless bandits [23, 6, 15], where the states of all arms can change in each step according to a known (but arbitrary) stochastic transition function. [sent-85, score-0.691]
29 The mixture-of-experts paradigm is related [11], but assumes that data tuples are provided to each expert, instead of the tuples being picked by the algorithm, as in the bandit setting. [sent-91, score-0.428]
30 [3, 1] also considered a more general definition of regret, where the comparison is to the best policy that can change arms a limited number of times. [sent-96, score-0.445]
31 Another aspect of our model is that unexplored arms are always available. [sent-98, score-0.448]
32 First, new arms can become available over time; the optimality of Gittins’ index was shown to extend to this case [22]. [sent-100, score-0.444]
33 Finally, the bandit arms may be indexed by numbers from the real line, implying uncountably infinite bandit arms, but where “nearby” arms (in terms of distance along the real line) have similar payoffs [12, 14]. [sent-102, score-1.787]
34 However, none of these approaches allows for arms to appear then disappear, which as we show later critically affects any regret bounds. [sent-103, score-0.586]
35 4 Upper bound on mortal reward. In this section we show that in the mortal multi-armed bandit setting, the regret per time step of any algorithm can never go to zero, unlike in the standard MAB setting. [sent-104, score-1.297]
36 Specifically, we develop an upper bound on the mean reward per step of any such algorithm for the state-aware, budgeted death case. [sent-105, score-0.449]
37 We prove the bound assuming we always have new arms available. [sent-107, score-0.443]
38 The expected reward of an arm is drawn from a cumulative distribution F (µ) with support in [0, 1]. [sent-108, score-0.682]
39 We assume that the lifetime of an arm has an exponential distribution with parameter p, and denote its expectation by L = 1/p. [sent-110, score-0.609]
40 Let µ̄(t) denote the maximum mean reward that any algorithm for the state-aware mortal multi-armed bandit problem can obtain in t steps in the budgeted death case. [sent-113, score-0.986]
41 The pulls made by any algorithm can be split into two kinds: fresh pulls, i.e., pulls of arms that were not pulled before, and repeat arm pulls. [sent-118, score-1.034]
42 Assume that the optimal algorithm pulls τ (t) distinct (fresh) arms in t steps, and hence makes t − τ (t) repeat pulls. [sent-119, score-0.58]
43 The expected number of repeat pulls to an arm before it expires is (1 − p)/p. [sent-120, score-0.641]
44 Thus, using Wald’s equation [8], the expected number of different arms the algorithm must use for the repeat pulls is (t − τ (t)) · p/(1 − p). [sent-121, score-0.611]
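A small Monte Carlo check of the per-arm quantity used in this counting argument (illustrative code, assuming an arm survives each pull independently with probability 1 − p): the number of repeat pulls an arm absorbs before expiring has mean (1 − p)/p, so t − τ(t) repeat pulls spread over about (t − τ(t))·p/(1 − p) distinct arms.

    import random

    def repeat_pulls_before_expiry(p):
        """Count how many repeat pulls one arm absorbs before it expires,
        when it survives each pull independently with probability 1 - p."""
        count = 0
        while random.random() >= p:       # arm survives this pull
            count += 1
        return count

    p = 0.1
    trials = [repeat_pulls_before_expiry(p) for _ in range(100000)]
    print(sum(trials) / len(trials), "vs (1 - p)/p =", (1 - p) / p)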
45 Let ℓ(t) ≤ τ(t) be the number of distinct arms that get pulled more than once. [sent-122, score-0.466]
46 Using Chernoff bounds, we can show that for any δ > 0, for sufficiently large t, with probability ≥ 1 − 1/t² the algorithm uses at least ℓ(t) = p(t − τ(t))/(1 − p) · (1 − δ) different arms for the repeat pulls. [sent-123, score-0.471]
47 Next, we upper bound the expected reward of the best ℓ(t) arms found in τ(t) fresh probes. [sent-125, score-0.696]
48 In other words, the probability of picking an arm with expected reward greater than or equal to µ(h) is (ℓ(t)/τ(t))(1 − h). [sent-127, score-0.664]
49 Applying the Chernoff bound, for any δ, h > 0 there exists t(δ, h) such that for all t ≥ t(δ, h) the probability that the algorithm finds at least ℓ(t) arms with expected reward at least µ(δ, h) = µ(h)(1 − δ) is bounded by 1/t². [sent-128, score-0.659]
50 As δ, h → 0, Pr(E(δ, h)) → 1, and the expected reward per step when the algorithm pulls τ(t) fresh arms is given by lim sup_{t→∞} µ̄(t) ≤ (1/t) · [ τ(t)·E[X] + (t − τ(t))·E[X | X ≥ µ] ], where µ = F⁻¹(1 − ℓ(t)/τ(t)) and ℓ(t) = (t − τ(t))·p/(1 − p). [sent-132, score-0.864]
51 Assuming that new arms are always available, any algorithm for the timed death model obtains at least the same reward per timestep in the budgeted death model. [sent-138, score-1.022]
52 Although we omit the proof due to space constraints, the intuition behind this lemma is that an arm in the timed case can die no sooner than in the budgeted case (i. [sent-139, score-0.638]
53 Let µ̄det(t) and µ̄sto(t) denote the respective maximum mean expected rewards that any algorithm for the state-aware and state-oblivious mortal multi-armed bandit problems can obtain after running for t steps. [sent-143, score-0.711]
54 The first simply observes that if the time to find an optimal arm is greater than the lifetime of such an arm, the mean reward per step of any algorithm must be smaller than the best value. [sent-146, score-0.884]
55 Assume that the expected reward of a bandit arm is 1 with probability p < 1/2 and 1 − δ otherwise, for some δ ∈ (0, 1]. [sent-149, score-1.031]
56 Let the lifetime of each arm have a geometric distribution with the same parameter p. [sent-150, score-0.584]
57 The mean reward per step of any algorithm for this supply of arms is at most 1 − δ + δp, while the maximum expected reward is 1, yielding an expected regret per step of Ω(1). [sent-151, score-1.197]
58 Assume arm payoffs are drawn from a uniform distribution, F (x) = x, x ∈ [0, 1]. [sent-153, score-0.624]
59 Then the mean reward per step is bounded by (1 − √p)/(1 − p) and the expected regret per step of any algorithm is Ω(√p). [sent-155, score-0.539]
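A quick numeric consistency check of this corollary (illustrative code, uniform F assumed, with E[X] = 1/2 and E[X | X ≥ µ] = (1 + µ)/2): maximize the per-step bound from the display above over the fraction f = τ(t)/t of fresh pulls and compare to the closed form (1 − √p)/(1 − p).

    import math

    def uniform_bound(p, grid=20000):
        """Per-step reward bound for uniform F, maximized over the fraction f
        of fresh pulls (f = tau(t)/t)."""
        best = 0.0
        for i in range(1, grid):
            f = i / grid
            ell_over_tau = (1 - f) * p / ((1 - p) * f)   # ell(t)/tau(t)
            if ell_over_tau >= 1:
                continue        # infeasible split: more repeat arms needed than fresh probes
            mu = 1 - ell_over_tau                         # F^{-1}(1 - ell/tau) for uniform F
            value = f * 0.5 + (1 - f) * (1 + mu) / 2.0
            best = max(best, value)
        return best

    for p in (0.1, 0.01, 0.001):
        closed_form = (1 - math.sqrt(p)) / (1 - p)
        print(p, round(uniform_bound(p), 4), "closed form", round(closed_form, 4))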
60 5 Bandit algorithms for mortal arms. In this section we present and analyze a number of algorithms specifically designed for the mortal multi-armed bandit task. [sent-156, score-1.288]
61 We also study a subset approach that can be used in tandem with any standard multi-armed bandit algorithm to substantially improve performance in the mortal multi-armed bandit setting. [sent-158, score-1.061]
62 The expected reward of an arm is drawn from cumulative distribution F (µ). [sent-162, score-0.682]
63 Assume that the lifetime of an arm has an exponential distribution with parameter p, and denote its expectation by L = 1/p. [sent-164, score-0.609]
64 The intuition behind this modification, STOCHASTIC, is simple: instead of pulling an arm once to determine its payoff µi, the algorithm pulls each arm n times and abandons it unless it looks promising. [sent-171, score-1.306]
65 A variant, called STOCHASTIC WITH EARLY STOPPING, abandons the arm earlier if its maximum possible future reward will still not justify its retention. [sent-172, score-0.641]
66 Algorithm STOCHASTIC. Input: distribution F(µ), expected lifetime L. [sent-174, score-0.713]
    µ* ← argmax_µ Γ(µ)   [Γ is defined in (1)]
    while we keep playing
        [Play a random arm n times]
        i ← random new arm; r ← 0
        for d = 1, ..., n
            Pull arm i; r ← r + R(µi)
        end for
        if r > nµ*   [If it is good, stay with it forever]
            Pull arm i every turn until it dies
        end if
    end while
67 Algorithm STOCH. WITH EARLY STOPPING [sent-177, score-1.094]
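A minimal Python sketch of the state-oblivious STOCHASTIC algorithm and its early-stopping variant, following the pseudocode above. The simulator class, the trial length n, and the retention threshold mu_star are illustrative assumptions; in particular, mu_star stands in for argmax_µ Γ(µ), which is not rederived here since Γ is given by equation (1) of the paper.

    import random

    class BernoulliArm:
        """A mortal arm: 0/1 rewards with unknown mean mu; after every pull it
        dies with probability p, so its expected lifetime is L = 1/p."""
        def __init__(self, mu, p):
            self.mu, self.p, self._alive = mu, p, True
        def alive(self):
            return self._alive
        def pull(self):
            reward = 1 if random.random() < self.mu else 0
            if random.random() < self.p:
                self._alive = False
            return reward

    def stochastic(new_arm, n, mu_star, horizon, early_stopping=False):
        """STOCHASTIC / STOCHASTIC WITH EARLY STOPPING (state-oblivious case):
        trial each fresh arm n times, and keep pulling it until it dies only if
        its trial reward exceeds n * mu_star."""
        total_reward, t = 0, 0
        while t < horizon:
            arm, r, d = new_arm(), 0, 0
            while d < n and t < horizon and arm.alive():
                r += arm.pull(); t += 1; d += 1
                # Early stopping: even if every remaining trial pull paid 1,
                # the arm could no longer clear the retention threshold.
                if early_stopping and r + (n - d) <= n * mu_star:
                    break
            total_reward += r
            if d == n and r > n * mu_star:            # promising arm: exploit it
                while t < horizon and arm.alive():
                    total_reward += arm.pull(); t += 1
        return total_reward

    p, horizon = 0.001, 200000
    reward = stochastic(lambda: BernoulliArm(random.random(), p),
                        n=20, mu_star=0.8, horizon=horizon, early_stopping=True)
    print("mean reward per turn:", reward / horizon)

The early_stopping flag covers the second variant: an arm is abandoned mid-trial as soon as even all-1 rewards on its remaining trial pulls could not lift it above the retention threshold.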
68 Why can’t we simply use a standard multi-armed bandit (MAB) algorithm for mortal bandits as well? [sent-179, score-0.767]
69 Intuitively, MAB algorithms invest a lot of pulls on all arms (at least logarithmic in the total number of pulls) to guarantee convergence to the optimal arm. [sent-180, score-0.535]
70 However, such an investment is not justified for mortal bandits: the most gain we can get from an arm is L (if the arm has payoff 1), which reduces the importance of convergence to the best arm. [sent-182, score-1.387]
71 In fact, as shown by Corollary 4, converging to a reasonably good arm suffices. [sent-183, score-0.451]
72 However, standard MAB algorithms do identify better arms very well. [sent-184, score-0.426]
73 This suggests the following epoch-based heuristic: (a) select a subset of k/c arms uniformly at random from the total k arms at the beginning of each epoch, (b) operate a standard bandit algorithm on these until the epoch ends, and repeat. [sent-185, score-1.29]
74 Intuitively, step (a) reduces the load on the bandit algorithm, allowing it to explore less and converge faster, in return for finding an arm that is probably optimal only among the k/c subset. [sent-186, score-0.884]
75 Picking the right c and the epoch length then depends on balancing the speed of convergence of the bandit algorithm, the arm lifetimes, and the difference between the k-th and the k/c-th order statistics of the arm payoff distribution; in our experiments, c is chosen empirically. [sent-187, score-1.587]
76 In the next section, UCB1 K/C is shown to perform far better than UCB1 in the mortal arms setting. [sent-190, score-0.661]
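A sketch of this heuristic (illustrative only): live_arms() is an assumed callable returning the currently alive arms, and run_bandit can be any standard multi-armed bandit routine, restarted on a fresh subset each epoch and returning the reward it earned.

    import random

    def epoch_subset_bandit(live_arms, k, c, epoch_len, horizon, run_bandit):
        """Epoch heuristic: (a) sample k/c arms uniformly at random from the k
        currently live arms, (b) run a standard bandit algorithm on them until
        the epoch ends, then repeat with a fresh subset."""
        total, t = 0, 0
        while t < horizon:
            pool = live_arms()
            subset = random.sample(pool, min(len(pool), max(1, k // c)))
            steps = min(epoch_len, horizon - t)
            total += run_bandit(subset, steps)   # reward earned this epoch
            t += steps
        return total

The choice of c and epoch_len is left to the caller, matching the empirical tuning described above; the UCB1 sketch given below fits the run_bandit signature.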
77 Note that the εn-greedy algorithm [2] does not apply directly to mortal bandits, since the probability of random exploration decays to zero for large t, which can leave the algorithm with no good choices should the best arm expire. [sent-194, score-0.881]
78 We also compare these to the UCB1 algorithm [2], which does not consider arm mortality in its policy but is among the faster-converging standard multi-armed bandit algorithms. [sent-196, score-0.888]
79 We present the results of simulation studies using three different distributions of arm payoffs F (·). [sent-197, score-0.602]
80 Our performance analyses assume that the cumulative payoff distribution F (·) of new arms is known. [sent-199, score-0.694]
81 In particular, even with the longest lifetimes, each arm can be sampled in expectation at most 100 times. [sent-206, score-0.451]
82 An algorithm that never identifies good arms would earn only 0.5 per turn, as would an algorithm that pulls arms at random. [sent-208, score-0.618]
83 In practice, UCB1 plays the arm j, previously pulled nj times, with the highest mean historical payoff plus √(2 ln n / nj). [sent-211, score-0.741]
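For reference, a minimal sketch of the UCB1 rule just described (the toy arm class is hypothetical); it can serve as the run_bandit routine in the epoch-based sketch above to obtain a UCB1-K/C-style strategy.

    import math, random

    def ucb1(arms, steps):
        """UCB1 [2]: at step n, play the arm j with the highest empirical mean
        plus sqrt(2 ln n / n_j); arms not yet pulled are tried first."""
        counts = [0] * len(arms)
        sums = [0.0] * len(arms)
        total = 0
        for n in range(1, steps + 1):
            ucb = [float("inf") if counts[j] == 0 else
                   sums[j] / counts[j] + math.sqrt(2 * math.log(n) / counts[j])
                   for j in range(len(arms))]
            j = ucb.index(max(ucb))
            r = arms[j].pull()
            counts[j] += 1; sums[j] += r; total += r
        return total

    class _Arm:                                   # toy arm for a quick run
        def __init__(self, mu): self.mu = mu
        def pull(self): return 1 if random.random() < self.mu else 0

    print(ucb1([_Arm(0.1), _Arm(0.3), _Arm(0.5)], steps=2000))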
84 Figure 1: Comparison of the regret per turn obtained by five different algorithms, assuming that new arm payoffs come from the (a) uniform distribution and (b) beta(1, 3) distribution. [sent-225, score-2.082]
85 With k = 1000 arms, the best performance was obtained with K/C between 4 and 40, depending on the arm lifetime. [sent-226, score-0.451]
86 This demonstrates that (a) the state-oblivious versions of the optimal deterministic algorithm are effective in general, and (b) the early stopping criterion allows arms with poor payoff to be quickly weeded out. [sent-229, score-0.753]
87 While the strategies discussed perform well when arm payoffs are uniformly distributed, it is unlikely that in a real setting the payoffs would be so well distributed. [sent-231, score-0.793]
88 In particular, if there are occasional arms with substantially higher payoffs, we could expect that any algorithm that does not exhaustively search the available arms may incur very high regret per turn. [sent-232, score-1.121]
89 Figure 1(b) shows the results when the arm payoff probabilities are drawn from the beta(1, 3) distribution. [sent-233, score-0.701]
90 We chose this distribution as it has finite support yet tends to select small payoffs for most arms while selecting high payoffs occasionally. [sent-234, score-0.728]
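A two-line illustration of why beta(1, 3) is a harder payoff distribution than uniform (illustrative thresholds): its mean is 0.25 and high-payoff arms are rare.

    import random

    payoffs = [random.betavariate(1, 3) for _ in range(100000)]
    print("mean payoff:", sum(payoffs) / len(payoffs))                        # close to 0.25
    print("share above 0.8:", sum(x > 0.8 for x in payoffs) / len(payoffs))   # about 0.008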
91 Considering the application that motivated this work, we now evaluate the performance of the four new algorithms when the arm payoffs come from the empirically observed distribution of clickthrough rates on real ads served by a large ad broker. [sent-239, score-0.942]
92 Figure 2(a) shows a histogram of the payoff probabilities for a random sample of approximately 300 real ads belonging to a shopping-related category when presented on web pages classified as belonging to the same category. [sent-240, score-0.46]
93 The probabilities have been linearly scaled such that all ads have payoff between 0 and 1. [sent-241, score-0.46]
94 By sampling arm payoffs from a smoothed version of this empirical distribution, we evaluated the performance of the algorithms presented earlier. [sent-243, score-0.602]
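One simple way such sampling can be implemented (a generic kernel-smoothing sketch with hypothetical click-through rates; the paper's exact smoothing procedure is not reproduced here): resample an observed rate, perturb it with Gaussian noise, and clip to [0, 1].

    import random

    def sample_payoff(observed_ctrs, bandwidth=0.02):
        """Draw a new arm payoff from a kernel-smoothed version of the
        empirical distribution of (scaled) click-through rates."""
        x = random.choice(observed_ctrs) + random.gauss(0.0, bandwidth)
        return min(1.0, max(0.0, x))              # keep payoffs in [0, 1]

    observed_ctrs = [0.01, 0.02, 0.02, 0.05, 0.10, 0.30]   # hypothetical scaled CTRs
    print([round(sample_payoff(observed_ctrs), 3) for _ in range(5)])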
95 In particular, while the mean regret per turn is somewhat higher than that seen for the uniform distribution, it is still lower than when payoffs are from the beta distribution. [sent-245, score-0.447]
96 7 Conclusions. We have introduced a new formulation of the multi-armed bandit problem, motivated by the real-world problem of selecting ads to display on webpages. [sent-247, score-0.602]
97 In this setting the set of strategies available to a multi-armed bandit algorithm changes rapidly over time. [sent-248, score-0.487]
98 We provided a lower bound of linear regret under certain payoff distributions. [sent-249, score-0.427]
99 Further, we presented a number of algorithms that perform substantially better in this setting than previous multi-armed bandit algorithms, including one that is optimal under the state-aware setting, and one that is near-optimal under the state-oblivious setting. [sent-250, score-0.45]
100 Figure 2: (a) Distribution of real-world ad payoffs, scaled linearly such that the maximum payoff is 1, and (b) regret per turn under the real-world ad payoff distribution. [sent-267, score-1.449]
wordName wordTfidf (topN-words)
[('arm', 0.451), ('arms', 0.426), ('bandit', 0.392), ('payoff', 0.25), ('mortal', 0.235), ('ads', 0.21), ('reward', 0.165), ('regret', 0.16), ('lifetime', 0.158), ('payoffs', 0.151), ('ad', 0.13), ('tochastic', 0.124), ('bandits', 0.12), ('death', 0.109), ('pulls', 0.092), ('mab', 0.086), ('timed', 0.079), ('pull', 0.075), ('arly', 0.074), ('daptive', 0.074), ('restless', 0.074), ('topping', 0.074), ('budgeted', 0.065), ('reedy', 0.065), ('dead', 0.054), ('per', 0.049), ('expected', 0.048), ('die', 0.043), ('pt', 0.04), ('pulled', 0.04), ('fresh', 0.04), ('adaptivegreedy', 0.037), ('dies', 0.037), ('exploration', 0.035), ('beta', 0.034), ('slot', 0.033), ('playing', 0.033), ('au', 0.032), ('gittins', 0.032), ('turn', 0.031), ('auer', 0.031), ('chakrabarti', 0.03), ('argmax', 0.029), ('stochastic', 0.028), ('end', 0.028), ('multiarmed', 0.028), ('epoch', 0.026), ('repeat', 0.025), ('abandons', 0.025), ('adit', 0.025), ('alive', 0.025), ('brokers', 0.025), ('eli', 0.025), ('expires', 0.025), ('mortality', 0.025), ('pandey', 0.025), ('sleeping', 0.025), ('sto', 0.025), ('exploitation', 0.024), ('budget', 0.024), ('step', 0.024), ('allocation', 0.023), ('keep', 0.023), ('heuristic', 0.023), ('content', 0.023), ('deterministic', 0.022), ('substantially', 0.022), ('uniform', 0.022), ('born', 0.022), ('lifetimes', 0.022), ('sunnyvale', 0.022), ('unexplored', 0.022), ('yahoo', 0.021), ('strategies', 0.021), ('stay', 0.02), ('forever', 0.02), ('algorithm', 0.02), ('setting', 0.019), ('change', 0.019), ('static', 0.019), ('limt', 0.019), ('supply', 0.019), ('advertising', 0.019), ('stopping', 0.018), ('cumulative', 0.018), ('available', 0.018), ('tuples', 0.018), ('chernoff', 0.018), ('berry', 0.018), ('optimal', 0.017), ('bound', 0.017), ('rapidly', 0.017), ('balancing', 0.017), ('pulling', 0.017), ('reductions', 0.017), ('pm', 0.017), ('changing', 0.017), ('ri', 0.016), ('li', 0.016), ('rewards', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 1.000001 140 nips-2008-Mortal Multi-Armed Bandits
Author: Deepayan Chakrabarti, Ravi Kumar, Filip Radlinski, Eli Upfal
Abstract: We formulate and study a new variant of the k-armed bandit problem, motivated by e-commerce applications. In our model, arms have (stochastic) lifetime after which they expire. In this setting an algorithm needs to continuously explore new arms, in contrast to the standard k-armed bandit model in which arms are available indefinitely and exploration is reduced once an optimal arm is identified with nearcertainty. The main motivation for our setting is online-advertising, where ads have limited lifetime due to, for example, the nature of their content and their campaign budgets. An algorithm needs to choose among a large collection of ads, more than can be fully explored within the typical ad lifetime. We present an optimal algorithm for the state-aware (deterministic reward function) case, and build on this technique to obtain an algorithm for the state-oblivious (stochastic reward function) case. Empirical studies on various reward distributions, including one derived from a real-world ad serving application, show that the proposed algorithms significantly outperform the standard multi-armed bandit approaches applied to these settings. 1
2 0.62348837 17 nips-2008-Algorithms for Infinitely Many-Armed Bandits
Author: Yizao Wang, Jean-yves Audibert, Rémi Munos
Abstract: We consider multi-armed bandit problems where the number of arms is larger than the possible number of experiments. We make a stochastic assumption on the mean-reward of a new selected arm which characterizes its probability of being a near-optimal arm. Our assumption is weaker than in previous works. We describe algorithms based on upper-confidence-bounds applied to a restricted set of randomly selected arms and provide upper-bounds on the resulting expected regret. We also derive a lower-bound which matches (up to a logarithmic factor) the upper-bound in some cases. 1
3 0.50985593 223 nips-2008-Structure Learning in Human Sequential Decision-Making
Author: Daniel Acuna, Paul R. Schrater
Abstract: We use graphical models and structure learning to explore how people learn policies in sequential decision making tasks. Studies of sequential decision-making in humans frequently find suboptimal performance relative to an ideal actor that knows the graph model that generates reward in the environment. We argue that the learning problem humans face also involves learning the graph structure for reward generation in the environment. We formulate the structure learning problem using mixtures of reward models, and solve the optimal action selection problem using Bayesian Reinforcement Learning. We show that structure learning in one and two armed bandit problems produces many of the qualitative behaviors deemed suboptimal in previous studies. Our argument is supported by the results of experiments that demonstrate humans rapidly learn and exploit new reward structure. 1
4 0.36073086 170 nips-2008-Online Optimization in X-Armed Bandits
Author: Sébastien Bubeck, Gilles Stoltz, Csaba Szepesvári, Rémi Munos
Abstract: We consider a generalization of stochastic bandit problems where the set of arms, X , is allowed to be a generic topological space. We constraint the mean-payoff function with a dissimilarity function over X in a way that is more general than Lipschitz. We construct an arm selection policy whose regret improves upon previous result for a large class of problems. In particular, our results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally H¨ lder √ a known exponent, then the expected o with regret is bounded up to a logarithmic factor by n, i.e., the rate of the growth of the regret is independent of the dimension of the space. Moreover, we prove the minimax optimality of our algorithm for the class of mean-payoff functions we consider. 1 Introduction and motivation Bandit problems arise in many settings, including clinical trials, scheduling, on-line parameter tuning of algorithms or optimization of controllers based on simulations. In the classical bandit problem there are a finite number of arms that the decision maker can select at discrete time steps. Selecting an arm results in a random reward, whose distribution is determined by the identity of the arm selected. The distributions associated with the arms are unknown to the decision maker whose goal is to maximize the expected sum of the rewards received. In many practical situations the arms belong to a large set. This set could be continuous [1; 6; 3; 2; 7], hybrid-continuous, or it could be the space of infinite sequences over a finite alphabet [4]. In this paper we consider stochastic bandit problems where the set of arms, X , is allowed to be an arbitrary topological space. We assume that the decision maker knows a dissimilarity function defined over this space that constraints the shape of the mean-payoff function. In particular, the dissimilarity function is assumed to put a lower bound on the mean-payoff function from below at each maxima. We also assume that the decision maker is able to cover the space of arms in a recursive manner, successively refining the regions in the covering such that the diameters of these sets shrink at a known geometric rate when measured with the dissimilarity. ∗ Csaba Szepesv´ ri is on leave from MTA SZTAKI. He also greatly acknowledges the support received from the a Alberta Ingenuity Fund, iCore and NSERC. 1 Our work generalizes and improves previous works on continuum-armed bandit problems: Kleinberg [6] and Auer et al. [2] focussed on one-dimensional problems. Recently, Kleinberg et al. [7] considered generic metric spaces assuming that the mean-payoff function is Lipschitz with respect to the (known) metric of the space. They proposed an interesting algorithm that achieves essentially the best possible regret in a minimax sense with respect to these environments. The goal of this paper is to further these works in a number of ways: (i) we allow the set of arms to be a generic topological space; (ii) we propose a practical algorithm motivated by the recent very successful tree-based optimization algorithms [8; 5; 4] and show that the algorithm is (iii) able to exploit higher order smoothness. In particular, as we shall argue in Section 7, (i) improves upon the results of Auer et al. [2], while (i), (ii) and (iii) improve upon the work of Kleinberg et al. [7]. Compared to Kleinberg et al. 
[7], our work represents an improvement in the fact that just like Auer et al. [2] we make use of the local properties of the mean-payoff function around the maxima only, and not a global property, such as Lipschitzness in √ the whole space. This allows us to obtain a regret which scales as O( n) 1 when e.g. the space is the unit hypercube and the mean-payoff function is locally H¨ lder with known exponent in the neighborhood of any o maxima (which are in finite number) and bounded away from the maxima outside of these neighborhoods. Thus, we get the desirable property that the rate of growth of the regret is independent of the dimensionality of the input space. We also prove a minimax lower bound that matches our upper bound up to logarithmic factors, showing that the performance of our algorithm is essentially unimprovable in a minimax sense. Besides these theoretical advances the algorithm is anytime and easy to implement. Since it is based on ideas that have proved to be efficient, we expect it to perform well in practice and to make a significant impact on how on-line global optimization is performed. 2 Problem setup, notation We consider a topological space X , whose elements will be referred to as arms. A decision maker “pulls” the arms in X one at a time at discrete time steps. Each pull results in a reward that depends on the arm chosen and which the decision maker learns of. The goal of the decision maker is to choose the arms so as to maximize the sum of the rewards that he receives. In this paper we are concerned with stochastic environments. Such an environment M associates to each arm x ∈ X a distribution Mx on the real line. The support of these distributions is assumed to be uniformly bounded with a known bound. For the sake of simplicity, we assume this bound is 1. We denote by f (x) the expectation of Mx , which is assumed to be measurable (all measurability concepts are with respect to the Borel-algebra over X ). The function f : X → R thus defined is called the mean-payoff function. When in round n the decision maker pulls arm Xn ∈ X , he receives a reward Yn drawn from MXn , independently of the past arm choices and rewards. A pulling strategy of a decision maker is determined by a sequence ϕ = (ϕn )n≥1 of measurable mappings, n−1 where each ϕn maps the history space Hn = X × [0, 1] to the space of probability measures over X . By convention, ϕ1 does not take any argument. A strategy is deterministic if for every n the range of ϕn contains only Dirac distributions. According to the process that was already informally described, a pulling strategy ϕ and an environment M jointly determine a random process (X1 , Y1 , X2 , Y2 , . . .) in the following way: In round one, the decision maker draws an arm X1 at random from ϕ1 and gets a payoff Y1 drawn from MX1 . In round n ≥ 2, first, Xn is drawn at random according to ϕn (X1 , Y1 , . . . , Xn−1 , Yn−1 ), but otherwise independently of the past. Then the decision maker gets a rewards Yn drawn from MXn , independently of all other random variables in the past given Xn . Let f ∗ = supx∈X f (x) be the maximal expected payoff. The cumulative regret of a pulling strategy in n n environment M is Rn = n f ∗ − t=1 Yt , and the cumulative pseudo-regret is Rn = n f ∗ − t=1 f (Xt ). 1 We write un = O(vu ) when un = O(vn ) up to a logarithmic factor. 2 In the sequel, we restrict our attention to the expected regret E [Rn ], which in fact equals E[Rn ], as can be seen by the application of the tower rule. 
3 3.1 The Hierarchical Optimistic Optimization (HOO) strategy Trees of coverings We first introduce the notion of a tree of coverings. Our algorithm will require such a tree as an input. Definition 1 (Tree of coverings). A tree of coverings is a family of measurable subsets (Ph,i )1≤i≤2h , h≥0 of X such that for all fixed integer h ≥ 0, the covering ∪1≤i≤2h Ph,i = X holds. Moreover, the elements of the covering are obtained recursively: each subset Ph,i is covered by the two subsets Ph+1,2i−1 and Ph+1,2i . A tree of coverings can be represented, as the name suggests, by a binary tree T . The whole domain X = P0,1 corresponds to the root of the tree and Ph,i corresponds to the i–th node of depth h, which will be referred to as node (h, i) in the sequel. The fact that each Ph,i is covered by the two subsets Ph+1,2i−1 and Ph+1,2i corresponds to the childhood relationship in the tree. Although the definition allows the childregions of a node to cover a larger part of the space, typically the size of the regions shrinks as depth h increases (cf. Assumption 1). Remark 1. Our algorithm will instantiate the nodes of the tree on an ”as needed” basis, one by one. In fact, at any round n it will only need n nodes connected to the root. 3.2 Statement of the HOO strategy The algorithm picks at each round a node in the infinite tree T as follows. In the first round, it chooses the root node (0, 1). Now, consider round n + 1 with n ≥ 1. Let us denote by Tn the set of nodes that have been picked in previous rounds and by Sn the nodes which are not in Tn but whose parent is. The algorithm picks at round n + 1 a node (Hn+1 , In+1 ) ∈ Sn according to the deterministic rule that will be described below. After selecting the node, the algorithm further chooses an arm Xn+1 ∈ PHn+1 ,In+1 . This selection can be stochastic or deterministic. We do not put any further restriction on it. The algorithm then gets a reward Yn+1 as described above and the procedure goes on: (Hn+1 , In+1 ) is added to Tn to form Tn+1 and the children of (Hn+1 , In+1 ) are added to Sn to give rise to Sn+1 . Let us now turn to how (Hn+1 , In+1 ) is selected. Along with the nodes the algorithm stores what we call B–values. The node (Hn+1 , In+1 ) ∈ Sn to expand at round n + 1 is picked by following a path from the root to a node in Sn , where at each node along the path the child with the larger B–value is selected (ties are broken arbitrarily). In order to define a node’s B–value, we need a few quantities. Let C(h, i) be the set that collects (h, i) and its descendants. We let n Nh,i (n) = I{(Ht ,It )∈C(h,i)} t=1 be the number of times the node (h, i) was visited. A given node (h, i) is always picked at most once, but since its descendants may be picked afterwards, subsequent paths in the tree can go through it. Consequently, 1 ≤ Nh,i (n) ≤ n for all nodes (h, i) ∈ Tn . Let µh,i (n) be the empirical average of the rewards received for the time-points when the path followed by the algorithm went through (h, i): n 1 µh,i (n) = Yt I{(Ht ,It )∈C(h,i)} . Nh,i (n) t=1 The corresponding upper confidence bound is by definition Uh,i (n) = µh,i (n) + 3 2 ln n + ν 1 ρh , Nh,i (n) where 0 < ρ < 1 and ν1 > 0 are parameters of the algorithm (to be chosen later by the decision maker, see Assumption 1). For nodes not in Tn , by convention, Uh,i (n) = +∞. Now, for a node (h, i) in Sn , we define its B–value to be Bh,i (n) = +∞. The B–values for nodes in Tn are given by Bh,i (n) = min Uh,i (n), max Bh+1,2i−1 (n), Bh+1,2i (n) . 
Note that the algorithm is deterministic (apart, maybe, from the arbitrary random choice of Xt in PHt ,It ). Its total space requirement is linear in n while total running time at round n is at most quadratic in n, though we conjecture that it is O(n log n) on average. 4 Assumptions made on the model and statement of the main result We suppose that X is equipped with a dissimilarity , that is a non-negative mapping : X 2 → R satisfying (x, x) = 0. The diameter (with respect to ) of a subset A of X is given by diam A = supx,y∈A (x, y). Given the dissimilarity , the “open” ball with radius ε > 0 and center c ∈ X is B(c, ε) = { x ∈ X : (c, x) < ε } (we do not require the topology induced by to be related to the topology of X .) In what follows when we refer to an (open) ball, we refer to the ball defined with respect to . The dissimilarity will be used to capture the smoothness of the mean-payoff function. The decision maker chooses and the tree of coverings. The following assumption relates this choice to the parameters ρ and ν1 of the algorithm: Assumption 1. There exist ρ < 1 and ν1 , ν2 > 0 such that for all integers h ≥ 0 and all i = 1, . . . , 2h , the diameter of Ph,i is bounded by ν1 ρh , and Ph,i contains an open ball Ph,i of radius ν2 ρh . For a given h, the Ph,i are disjoint for 1 ≤ i ≤ 2h . Remark 2. A typical choice for the coverings in a cubic domain is to let the domains be hyper-rectangles. They can be obtained, e.g., in a dyadic manner, by splitting at each step hyper-rectangles in the middle along their longest side, in an axis parallel manner; if all sides are equal, we split them along the√ axis. In first this example, if X = [0, 1]D and (x, y) = x − y α then we can take ρ = 2−α/D , ν1 = ( D/2)α and ν2 = 1/8α . The next assumption concerns the environment. Definition 2. We say that f is weakly Lipschitz with respect to if for all x, y ∈ X , f ∗ − f (y) ≤ f ∗ − f (x) + max f ∗ − f (x), (x, y) . (1) Note that weak Lipschitzness is satisfied whenever f is 1–Lipschitz, i.e., for all x, y ∈ X , one has |f (x) − f (y)| ≤ (x, y). On the other hand, weak Lipschitzness implies local (one-sided) 1–Lipschitzness at any maxima. Indeed, at an optimal arm x∗ (i.e., such that f (x∗ ) = f ∗ ), (1) rewrites to f (x∗ ) − f (y) ≤ (x∗ , y). However, weak Lipschitzness does not constraint the growth of the loss in the vicinity of other points. Further, weak Lipschitzness, unlike Lipschitzness, does not constraint the local decrease of the loss at any point. Thus, weak-Lipschitzness is a property that lies somewhere between a growth condition on the loss around optimal arms and (one-sided) Lipschitzness. Note that since weak Lipschitzness is defined with respect to a dissimilarity, it can actually capture higher-order smoothness at the optima. For example, f (x) = 1 − x2 is weak Lipschitz with the dissimilarity (x, y) = c(x − y)2 for some appropriate constant c. Assumption 2. The mean-payoff function f is weakly Lipschitz. ∗ ∗ Let fh,i = supx∈Ph,i f (x) and ∆h,i = f ∗ − fh,i be the suboptimality of node (h, i). We say that def a node (h, i) is optimal (respectively, suboptimal) if ∆h,i = 0 (respectively, ∆h,i > 0). Let Xε = { x ∈ X : f (x) ≥ f ∗ − ε } be the set of ε-optimal arms. The following result follows from the definitions; a proof can be found in the appendix. 4 Lemma 1. Let Assumption 1 and 2 hold. If the suboptimality ∆h,i of a region is bounded by cν1 ρh for some c > 0, then all arms in Ph,i are max{2c, c + 1}ν1 ρh -optimal. 
The last assumption is closely related to Assumption 2 of Auer et al. [2], who observed that the regret of a continuum-armed bandit algorithm should depend on how fast the volume of the sets of ε-optimal arms shrinks as ε → 0. Here, we capture this by defining a new notion, the near-optimality dimension of the mean-payoff function. The connection between these concepts, as well as the zooming dimension defined by Kleinberg et al. [7] will be further discussed in Section 7. Define the packing number P(X , , ε) to be the size of the largest packing of X with disjoint open balls of radius ε with respect to the dissimilarity .2 We now define the near-optimality dimension, which characterizes the size of the sets Xε in terms of ε, and then state our main result. Definition 3. For c > 0 and ε0 > 0, the (c, ε0 )–near-optimality dimension of f with respect to equals inf d ∈ [0, +∞) : ∃ C s.t. ∀ε ≤ ε0 , P Xcε , , ε ≤ C ε−d (2) (with the usual convention that inf ∅ = +∞). Theorem 1 (Main result). Let Assumptions 1 and 2 hold and assume that the (4ν1 /ν2 , ν2 )–near-optimality dimension of the considered environment is d < +∞. Then, for any d > d there exists a constant C(d ) such that for all n ≥ 1, ERn ≤ C(d ) n(d +1)/(d +2) ln n 1/(d +2) . Further, if the near-optimality dimension is achieved, i.e., the infimum is achieved in (2), then the result holds also for d = d. Remark 3. We can relax the weak-Lipschitz property by requiring it to hold only locally around the maxima. In fact, at the price of increased constants, the result continues to hold if there exists ε > 0 such that (1) holds for any x, y ∈ Xε . To show this we only need to carefully adapt the steps of the proof below. We omit the details from this extended abstract. 5 Analysis of the regret and proof of the main result We first state three lemmas, whose proofs can be found in the appendix. The proofs of Lemmas 3 and 4 rely on concentration-of-measure techniques, while that of Lemma 2 follows from a simple case study. Let us fix some path (0, 1), (1, i∗ ), . . . , (h, i∗ ), . . . , of optimal nodes, starting from the root. 1 h Lemma 2. Let (h, i) be a suboptimal node. Let k be the largest depth such that (k, i∗ ) is on the path from k the root to (h, i). Then we have n E Nh,i (n) ≤ u+ P Nh,i (t) > u and Uh,i (t) > f ∗ or Us,i∗ ≤ f ∗ for some s ∈ {k+1, . . . , t−1} s t=u+1 Lemma 3. Let Assumptions 1 and 2 hold. 1, P Uh,i (n) ≤ f ∗ ≤ n−3 . Then, for all optimal nodes and for all integers n ≥ Lemma 4. Let Assumptions 1 and 2 hold. Then, for all integers t ≤ n, for all suboptimal nodes (h, i) 8 ln such that ∆h,i > ν1 ρh , and for all integers u ≥ 1 such that u ≥ (∆h,i −νnρh )2 , one has P Uh,i (t) > 1 f ∗ and Nh,i (t) > u ≤ t n−4 . 2 Note that sometimes packing numbers are defined as the largest packing with disjoint open balls of radius ε/2, or, ε-nets. 5 . Taking u as the integer part of (8 ln n)/(∆h,i − ν1 ρh )2 , and combining the results of Lemma 2, 3, and 4 with a union bound leads to the following key result. Lemma 5. Under Assumptions 1 and 2, for all suboptimal nodes (h, i) such that ∆h,i > ν1 ρh , we have, for all n ≥ 1, 8 ln n 2 E[Nh,i (n)] ≤ + . (∆h,i − ν1 ρh )2 n We are now ready to prove Theorem 1. Proof. For the sake of simplicity we assume that the infimum in the definition of near-optimality is achieved. To obtain the result in the general case one only needs to replace d below by d > d in the proof below. First step. For all h = 1, 2, . . 
., denote by Ih the nodes at depth h that are 2ν1 ρh –optimal, i.e., the nodes ∗ (h, i) such that fh,i ≥ f ∗ − 2ν1 ρh . Then, I is the union of these sets of nodes. Further, let J be the set of nodes that are not in I but whose parent is in I. We then denote by Jh the nodes in J that are located at depth h in the tree. Lemma 4 bounds the expected number of times each node (h, i) ∈ Jh is visited. Since ∆h,i > 2ν1 ρh , we get 8 ln n 2 E Nh,i (n) ≤ 2 2h + . ν1 ρ n Second step. We bound here the cardinality |Ih |, h > 0. If (h, i) ∈ Ih then since ∆h,i ≤ 2ν1 ρh , by Lemma 1 Ph,i ⊂ X4ν1 ρh . Since by Assumption 1, the sets (Ph,i ), for (h, i) ∈ Ih , contain disjoint balls of radius ν2 ρh , we have that |Ih | ≤ P ∪(h,i)∈Ih Ph,i , , ν2 ρh ≤ P X(4ν1 /ν2 ) ν2 ρh , , ν2 ρh ≤ C ν2 ρh −d , where we used the assumption that d is the (4ν1 /ν2 , ν2 )–near-optimality dimension of f (and C is the constant introduced in the definition of the near-optimality dimension). Third step. Choose η > 0 and let H be the smallest integer such that ρH ≤ η. We partition the infinite tree T into three sets of nodes, T = T1 ∪ T2 ∪ T3 . The set T1 contains nodes of IH and their descendants, T2 = ∪0≤h < 1 and then, by optimizing over ρH (the worst value being ρH ∼ ( ln n )−1/(d+2) ). 6 Minimax optimality The packing dimension of a set X is the smallest d such that there exists a constant k such that for all ε > 0, P X , , ε ≤ k ε−d . For instance, compact subsets of Rd (with non-empty interior) have a packing dimension of d whenever is a norm. If X has a packing dimension of d, then all environments have a near-optimality dimension less than d. The proof of the main theorem indicates that the constant C(d) only depends on d, k (of the definition of packing dimension), ν1 , ν2 , and ρ, but not on the environment as long as it is weakly Lipschitz. Hence, we can extract from it a distribution-free bound of the form O(n(d+1)/(d+2) ). In fact, this bound can be shown to be optimal as is illustrated by the theorem below, whose assumptions are satisfied by, e.g., compact subsets of Rd and if is some norm of Rd . The proof can be found in the appendix. Theorem 2. If X is such that there exists c > 0 with P(X , , ε) ≥ c ε−d ≥ 2 for all ε ≤ 1/4 then for all n ≥ 4d−1 c/ ln(4/3), all strategies ϕ are bound to suffer a regret of at least 2/(d+2) 1 1 c n(d+1)/(d+2) , 4 4 4 ln(4/3) where the supremum is taken over all environments with weakly Lipschitz payoff functions. sup E Rn (ϕ) ≥ 7 Discussion Several works [1; 6; 3; 2; 7] have considered continuum-armed bandits in Euclidean or metric spaces and provided upper- and lower-bounds on the regret for given classes of environments. Cope [3] derived a regret √ of O( n) for compact and convex subset of Rd and a mean-payoff function with unique minima and second order smoothness. Kleinberg [6] considered mean-payoff functions f on the real line that are H¨ lder with o degree 0 < α ≤ 1. The derived regret is Θ(n(α+1)/(α+2) ). Auer et al. [2] extended the analysis to classes of functions with only a local H¨ lder assumption around maximum (with possibly higher smoothness degree o 1+α−αβ α ∈ [0, ∞)), and derived the regret Θ(n 1+2α−αβ ), where β is such that the Lebesgue measure of ε-optimal 7 states is O(εβ ). Another setting is that of [7] who considered a metric space (X , ) and assumed that f is Lipschitz w.r.t. . 
The obtained regret is O(n(d+1)/(d+2) ) where d is the zooming dimension (defined similarly to our near-optimality dimension, but using covering numbers instead of packing numbers and the sets Xε \ Xε/2 ). When (X , ) is a metric space covering and packing numbers are equivalent and we may prove that the zooming dimension and near-optimality dimensions are equal. Our main contribution compared to [7] is that our weak-Lipschitz assumption, which is substantially weaker than the global Lipschitz assumption assumed in [7], enables our algorithm to work better in some common situations, such as when the mean-payoff function assumes a local smoothness whose order is larger than one. In order to relate all these results, let us consider a specific example: Let X = [0, 1]D and assume that the mean-reward function f is locally equivalent to a H¨ lder function with degree α ∈ [0, ∞) around any o maxima x∗ of f (the number of maxima is assumed to be finite): f (x∗ ) − f (x) = Θ(||x − x∗ ||α ) as x → x∗ . (3) This means that ∃c1 , c2 , ε0 > 0, ∀x, s.t. ||x − x∗ || ≤ ε0 , c1 ||x − x∗ ||α ≤ f (x∗ ) − f (x) ≤ c2 ||x − x∗ ||α . √ Under this assumption, the result of Auer et al. [2] shows that for D = 1, the regret is Θ( n) (since here √ β = 1/α). Our result allows us to extend the n regret rate to any dimension D. Indeed, if we choose our def dissimilarity measure to be α (x, y) = ||x − y||α , we may prove that f satisfies a locally weak-Lipschitz √ condition (as defined in Remark 3) and that the near-optimality dimension is 0. Thus our regret is O( n), i.e., the rate is independent of the dimension D. In comparison, since Kleinberg et al. [7] have to satisfy a global Lipschitz assumption, they can not use α when α > 1. Indeed a function globally Lipschitz with respect to α is essentially constant. Moreover α does not define a metric for α > 1. If one resort to the Euclidean metric to fulfill their requirement that f be Lipschitz w.r.t. the metric then the zooming dimension becomes D(α − 1)/α, while the regret becomes √ O(n(D(α−1)+α)/(D(α−1)+2α) ), which is strictly worse than O( n) and in fact becomes close to the slow rate O(n(D+1)/(D+2) ) when α is larger. Nevertheless, in the case of α ≤ 1 they get the same regret rate. In contrast, our result shows that under very weak constraints on the mean-payoff function and if the local behavior of the function around its maximum (or finite number of maxima) is known then global optimization √ suffers a regret of order O( n), independent of the space dimension. As an interesting sidenote let us also remark that our results allow different smoothness orders along different dimensions, i.e., heterogenous smoothness spaces. References [1] R. Agrawal. The continuum-armed bandit problem. SIAM J. Control and Optimization, 33:1926–1951, 1995. [2] P. Auer, R. Ortner, and Cs. Szepesv´ ri. Improved rates for the stochastic continuum-armed bandit problem. 20th a Conference on Learning Theory, pages 454–468, 2007. [3] E. Cope. Regret and convergence bounds for immediate-reward reinforcement learning with continuous action spaces. Preprint, 2004. [4] P.-A. Coquelin and R. Munos. Bandit algorithms for tree search. In Proceedings of 23rd Conference on Uncertainty in Artificial Intelligence, 2007. [5] S. Gelly, Y. Wang, R. Munos, and O. Teytaud. Modification of UCT with patterns in Monte-Carlo go. Technical Report RR-6062, INRIA, 2006. [6] R. Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In 18th Advances in Neural Information Processing Systems, 2004. 
[7] R. Kleinberg, A. Slivkins, and E. Upfal. Multi-armed bandits in metric spaces. In Proceedings of the 40th ACM Symposium on Theory of Computing, 2008. [8] L. Kocsis and Cs. Szepesv´ ri. Bandit based Monte-Carlo planning. In Proceedings of the 15th European Conference a on Machine Learning, pages 282–293, 2006. 8
5 0.095159672 150 nips-2008-Near-optimal Regret Bounds for Reinforcement Learning
Author: Peter Auer, Thomas Jaksch, Ronald Ortner
Abstract: For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s, s′ there is a policy which moves from s to s′ in at most D steps (on average). We present a reinforcement learning algorithm with total regret Õ(DS√(AT)) after T steps for any unknown MDP with S states, A actions per state, and diameter D. This bound holds with high probability. We also present a corresponding lower bound of Ω(√(DSAT)) on the total regret of any learning algorithm.
6 0.068606973 22 nips-2008-An Online Algorithm for Maximizing Submodular Functions
7 0.061911311 184 nips-2008-Predictive Indexing for Fast Search
8 0.06089573 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions
9 0.056692138 214 nips-2008-Sparse Online Learning via Truncated Gradient
10 0.055002175 125 nips-2008-Local Gaussian Process Regression for Real Time Online Model Learning
11 0.053381566 230 nips-2008-Temporal Difference Based Actor Critic Learning - Convergence and Neural Implementation
12 0.051226459 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization
13 0.049534783 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms
14 0.047698159 169 nips-2008-Online Models for Content Optimization
15 0.03977444 187 nips-2008-Psychiatry: Insights into depression through normative decision-making models
16 0.037143096 131 nips-2008-MDPs with Non-Deterministic Policies
17 0.036398727 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization
18 0.035605431 166 nips-2008-On the asymptotic equivalence between differential Hebbian and temporal difference learning using a local third factor
19 0.03558138 181 nips-2008-Policy Search for Motor Primitives in Robotics
20 0.034625731 144 nips-2008-Multi-resolution Exploration in Continuous Spaces
topicId topicWeight
[(0, -0.122), (1, 0.32), (2, -0.148), (3, -0.192), (4, -0.347), (5, -0.498), (6, -0.261), (7, 0.062), (8, 0.056), (9, 0.082), (10, 0.076), (11, -0.221), (12, -0.091), (13, 0.017), (14, -0.058), (15, -0.043), (16, -0.009), (17, 0.018), (18, 0.007), (19, -0.01), (20, 0.063), (21, 0.031), (22, -0.04), (23, -0.009), (24, 0.007), (25, 0.047), (26, -0.033), (27, 0.014), (28, -0.033), (29, 0.007), (30, 0.003), (31, -0.002), (32, -0.059), (33, 0.001), (34, 0.031), (35, -0.01), (36, -0.041), (37, -0.009), (38, -0.007), (39, -0.008), (40, -0.007), (41, 0.033), (42, 0.035), (43, -0.027), (44, -0.049), (45, 0.028), (46, -0.028), (47, 0.013), (48, -0.043), (49, -0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.97843248 140 nips-2008-Mortal Multi-Armed Bandits
2 0.93393278 17 nips-2008-Algorithms for Infinitely Many-Armed Bandits
3 0.77212507 223 nips-2008-Structure Learning in Human Sequential Decision-Making
4 0.70170826 170 nips-2008-Online Optimization in X-Armed Bandits
Author: Sébastien Bubeck, Gilles Stoltz, Csaba Szepesvári, Rémi Munos
Abstract: We consider a generalization of stochastic bandit problems where the set of arms, X , is allowed to be a generic topological space. We constraint the mean-payoff function with a dissimilarity function over X in a way that is more general than Lipschitz. We construct an arm selection policy whose regret improves upon previous result for a large class of problems. In particular, our results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally H¨ lder √ a known exponent, then the expected o with regret is bounded up to a logarithmic factor by n, i.e., the rate of the growth of the regret is independent of the dimension of the space. Moreover, we prove the minimax optimality of our algorithm for the class of mean-payoff functions we consider. 1 Introduction and motivation Bandit problems arise in many settings, including clinical trials, scheduling, on-line parameter tuning of algorithms or optimization of controllers based on simulations. In the classical bandit problem there are a finite number of arms that the decision maker can select at discrete time steps. Selecting an arm results in a random reward, whose distribution is determined by the identity of the arm selected. The distributions associated with the arms are unknown to the decision maker whose goal is to maximize the expected sum of the rewards received. In many practical situations the arms belong to a large set. This set could be continuous [1; 6; 3; 2; 7], hybrid-continuous, or it could be the space of infinite sequences over a finite alphabet [4]. In this paper we consider stochastic bandit problems where the set of arms, X , is allowed to be an arbitrary topological space. We assume that the decision maker knows a dissimilarity function defined over this space that constraints the shape of the mean-payoff function. In particular, the dissimilarity function is assumed to put a lower bound on the mean-payoff function from below at each maxima. We also assume that the decision maker is able to cover the space of arms in a recursive manner, successively refining the regions in the covering such that the diameters of these sets shrink at a known geometric rate when measured with the dissimilarity. ∗ Csaba Szepesv´ ri is on leave from MTA SZTAKI. He also greatly acknowledges the support received from the a Alberta Ingenuity Fund, iCore and NSERC. 1 Our work generalizes and improves previous works on continuum-armed bandit problems: Kleinberg [6] and Auer et al. [2] focussed on one-dimensional problems. Recently, Kleinberg et al. [7] considered generic metric spaces assuming that the mean-payoff function is Lipschitz with respect to the (known) metric of the space. They proposed an interesting algorithm that achieves essentially the best possible regret in a minimax sense with respect to these environments. The goal of this paper is to further these works in a number of ways: (i) we allow the set of arms to be a generic topological space; (ii) we propose a practical algorithm motivated by the recent very successful tree-based optimization algorithms [8; 5; 4] and show that the algorithm is (iii) able to exploit higher order smoothness. In particular, as we shall argue in Section 7, (i) improves upon the results of Auer et al. [2], while (i), (ii) and (iii) improve upon the work of Kleinberg et al. [7]. Compared to Kleinberg et al. 
[7], our work represents an improvement in the fact that, just like Auer et al. [2], we make use of the local properties of the mean-payoff function around the maxima only, and not of a global property such as Lipschitzness on the whole space. This allows us to obtain a regret which scales as Õ(√n)¹ when, e.g., the space is the unit hypercube and the mean-payoff function is locally Hölder with known exponent in the neighborhood of any maxima (which are finite in number) and bounded away from the maxima outside of these neighborhoods. Thus, we get the desirable property that the rate of growth of the regret is independent of the dimensionality of the input space. We also prove a minimax lower bound that matches our upper bound up to logarithmic factors, showing that the performance of our algorithm is essentially unimprovable in a minimax sense. Besides these theoretical advances, the algorithm is anytime and easy to implement. Since it is based on ideas that have proved to be efficient, we expect it to perform well in practice and to make a significant impact on how on-line global optimization is performed.

2 Problem setup, notation

We consider a topological space X, whose elements will be referred to as arms. A decision maker “pulls” the arms in X one at a time at discrete time steps. Each pull results in a reward that depends on the arm chosen and which the decision maker learns of. The goal of the decision maker is to choose the arms so as to maximize the sum of the rewards that he receives. In this paper we are concerned with stochastic environments. Such an environment M associates to each arm x ∈ X a distribution Mx on the real line. The support of these distributions is assumed to be uniformly bounded with a known bound; for the sake of simplicity, we assume this bound is 1. We denote by f(x) the expectation of Mx, which is assumed to be measurable (all measurability concepts are with respect to the Borel σ-algebra over X). The function f : X → R thus defined is called the mean-payoff function. When in round n the decision maker pulls arm Xn ∈ X, he receives a reward Yn drawn from M_{Xn}, independently of the past arm choices and rewards. A pulling strategy of a decision maker is determined by a sequence ϕ = (ϕn)_{n≥1} of measurable mappings, where each ϕn maps the history space Hn = (X × [0, 1])^{n−1} to the space of probability measures over X. By convention, ϕ1 does not take any argument. A strategy is deterministic if for every n the range of ϕn contains only Dirac distributions. According to the process that was already informally described, a pulling strategy ϕ and an environment M jointly determine a random process (X1, Y1, X2, Y2, . . .) in the following way: in round one, the decision maker draws an arm X1 at random from ϕ1 and gets a payoff Y1 drawn from M_{X1}. In round n ≥ 2, first Xn is drawn at random according to ϕn(X1, Y1, . . . , X_{n−1}, Y_{n−1}), but otherwise independently of the past. Then the decision maker gets a reward Yn drawn from M_{Xn}, independently of all other random variables in the past given Xn. Let f∗ = sup_{x∈X} f(x) be the maximal expected payoff. The cumulative regret of a pulling strategy in environment M is Rn = n f∗ − Σ_{t=1}^{n} Yt, and the cumulative pseudo-regret is R̄n = n f∗ − Σ_{t=1}^{n} f(Xt).

¹ We write un = Õ(vn) when un = O(vn) up to a logarithmic factor.

In the sequel, we restrict our attention to the expected regret E[Rn], which in fact equals the expected pseudo-regret E[R̄n], as can be seen by an application of the tower rule.
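To make the regret definitions concrete, the following short simulation sketch (not from the paper; the uniform-random baseline strategy, the reward function, and the noise model are illustrative assumptions) pulls arms in X = [0, 1] and reports both the cumulative regret Rn and the cumulative pseudo-regret R̄n.

import random

def simulate(strategy, mean_payoff, noise, n_rounds, f_star):
    """Run a pulling strategy; return (cumulative regret, cumulative pseudo-regret)."""
    history = []                      # the sequence (X_1, Y_1, ..., X_{n-1}, Y_{n-1})
    regret = pseudo_regret = 0.0
    for n in range(1, n_rounds + 1):
        x = strategy(history)         # phi_n maps the history to an arm
        y = mean_payoff(x) + noise()  # reward Y_n drawn from M_x
        history.append((x, y))
        regret += f_star - y                       # contribution to R_n
        pseudo_regret += f_star - mean_payoff(x)   # contribution to bar{R}_n
    return regret, pseudo_regret

# Illustrative environment: f(x) = 0.9 (1 - |x - 0.3|), so f* = 0.9, with bounded noise.
f = lambda x: 0.9 * (1.0 - abs(x - 0.3))
noise = lambda: random.uniform(-0.05, 0.05)
uniform_strategy = lambda history: random.random()   # pulls an arm uniformly at random
print(simulate(uniform_strategy, f, noise, 1000, f_star=0.9))

Averaged over many runs, the two returned quantities agree, which is the tower-rule identity E[Rn] = E[R̄n] mentioned above.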
3 The Hierarchical Optimistic Optimization (HOO) strategy

3.1 Trees of coverings

We first introduce the notion of a tree of coverings. Our algorithm will require such a tree as an input.

Definition 1 (Tree of coverings). A tree of coverings is a family of measurable subsets (P_{h,i})_{1≤i≤2^h, h≥0} of X such that for each fixed integer h ≥ 0, the covering ∪_{1≤i≤2^h} P_{h,i} = X holds. Moreover, the elements of the covering are obtained recursively: each subset P_{h,i} is covered by the two subsets P_{h+1,2i−1} and P_{h+1,2i}.

A tree of coverings can be represented, as the name suggests, by a binary tree T. The whole domain X = P_{0,1} corresponds to the root of the tree and P_{h,i} corresponds to the i-th node of depth h, which will be referred to as node (h, i) in the sequel. The fact that each P_{h,i} is covered by the two subsets P_{h+1,2i−1} and P_{h+1,2i} corresponds to the parent-child relationship in the tree. Although the definition allows the child regions of a node to cover a larger part of the space, typically the size of the regions shrinks as the depth h increases (cf. Assumption 1).

Remark 1. Our algorithm will instantiate the nodes of the tree on an “as needed” basis, one by one. In fact, at any round n it will only need n nodes connected to the root.

3.2 Statement of the HOO strategy

The algorithm picks at each round a node in the infinite tree T as follows. In the first round, it chooses the root node (0, 1). Now, consider round n + 1 with n ≥ 1. Let us denote by Tn the set of nodes that have been picked in previous rounds and by Sn the nodes which are not in Tn but whose parent is. The algorithm picks at round n + 1 a node (H_{n+1}, I_{n+1}) ∈ Sn according to the deterministic rule that will be described below. After selecting the node, the algorithm further chooses an arm X_{n+1} ∈ P_{H_{n+1},I_{n+1}}. This selection can be stochastic or deterministic; we do not put any further restriction on it. The algorithm then gets a reward Y_{n+1} as described above and the procedure goes on: (H_{n+1}, I_{n+1}) is added to Tn to form T_{n+1}, and the children of (H_{n+1}, I_{n+1}) are added to Sn to give rise to S_{n+1}.

Let us now turn to how (H_{n+1}, I_{n+1}) is selected. Along with the nodes the algorithm stores what we call B-values. The node (H_{n+1}, I_{n+1}) ∈ Sn to expand at round n + 1 is picked by following a path from the root to a node in Sn, where at each node along the path the child with the larger B-value is selected (ties are broken arbitrarily). In order to define a node’s B-value, we need a few quantities. Let C(h, i) be the set that collects (h, i) and its descendants. We let

N_{h,i}(n) = Σ_{t=1}^{n} I{(H_t, I_t) ∈ C(h, i)}

be the number of times the node (h, i) was visited. A given node (h, i) is picked at most once, but since its descendants may be picked afterwards, subsequent paths in the tree can go through it. Consequently, 1 ≤ N_{h,i}(n) ≤ n for all nodes (h, i) ∈ Tn. Let µ_{h,i}(n) be the empirical average of the rewards received at the time points when the path followed by the algorithm went through (h, i):

µ_{h,i}(n) = (1 / N_{h,i}(n)) Σ_{t=1}^{n} Y_t I{(H_t, I_t) ∈ C(h, i)}.

The corresponding upper confidence bound is, by definition,

U_{h,i}(n) = µ_{h,i}(n) + √(2 ln n / N_{h,i}(n)) + ν1 ρ^h,

where 0 < ρ < 1 and ν1 > 0 are parameters of the algorithm (to be chosen later by the decision maker, see Assumption 1). For nodes not in Tn, by convention, U_{h,i}(n) = +∞. Now, for a node (h, i) in Sn, we define its B-value to be B_{h,i}(n) = +∞. The B-values for nodes in Tn are given by

B_{h,i}(n) = min{ U_{h,i}(n), max{ B_{h+1,2i−1}(n), B_{h+1,2i}(n) } }.
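The following is a minimal, self-contained sketch of the strategy just described, specialized to X = [0, 1] with dyadic intervals as the tree of coverings. The concrete parameter values (ρ = 1/2, ν1 = 1), the choice of pulling the midpoint of the selected region, and the noisy reward function in the usage line are illustrative assumptions, not prescriptions of the paper.

import math
import random

class Node:
    def __init__(self, h, i, lo, hi):
        self.h, self.i = h, i              # depth h and index i of the region P_{h,i}
        self.lo, self.hi = lo, hi          # here P_{h,i} is the interval [lo, hi]
        self.count, self.mean = 0, 0.0     # N_{h,i}(n) and mu_{h,i}(n)
        self.children = None

    def b_value(self, n, rho, nu1):
        if self.count == 0:                # node not yet picked (outside T_n): B-value is +infinity
            return float("inf")
        u = self.mean + math.sqrt(2.0 * math.log(n) / self.count) + nu1 * rho ** self.h
        return min(u, max(c.b_value(n, rho, nu1) for c in self.children))

def hoo(reward, n_rounds, rho=0.5, nu1=1.0):
    root = Node(0, 1, 0.0, 1.0)
    for n in range(1, n_rounds + 1):
        # Follow the child with the larger B-value from the root until reaching
        # a node that has not been picked yet (a node of S_n); record the path.
        path, node = [], root
        while node.count > 0:
            path.append(node)
            node = max(node.children, key=lambda c: c.b_value(n, rho, nu1))
        # Instantiate the children of the newly picked node (they join S_{n+1}).
        mid = 0.5 * (node.lo + node.hi)
        node.children = [Node(node.h + 1, 2 * node.i - 1, node.lo, mid),
                         Node(node.h + 1, 2 * node.i, mid, node.hi)]
        path.append(node)
        # Pull an arm of the selected region (here, its midpoint) and observe a reward.
        y = reward(0.5 * (node.lo + node.hi))
        # Update N_{h,i} and mu_{h,i} for the picked node and all of its ancestors.
        for v in path:
            v.count += 1
            v.mean += (y - v.mean) / v.count
    return root

# Illustrative run on a noisy mean-payoff function whose maximum sits at x = 0.3.
tree = hoo(lambda x: 0.9 * (1.0 - abs(x - 0.3)) + random.uniform(-0.05, 0.05), 2000)

Note that the B-value recursion only ever descends into nodes of Tn, since any node outside Tn returns +∞ immediately, matching the convention above.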
Note that the algorithm is deterministic (apart, maybe, from the arbitrary random choice of Xt in P_{H_t,I_t}). Its total space requirement is linear in n, while its total running time at round n is at most quadratic in n, though we conjecture that it is O(n log n) on average.

4 Assumptions made on the model and statement of the main result

We suppose that X is equipped with a dissimilarity ℓ, that is, a non-negative mapping ℓ : X² → R satisfying ℓ(x, x) = 0. The diameter (with respect to ℓ) of a subset A of X is given by diam A = sup_{x,y∈A} ℓ(x, y). Given the dissimilarity ℓ, the “open” ball with radius ε > 0 and center c ∈ X is B(c, ε) = { x ∈ X : ℓ(c, x) < ε } (we do not require the topology induced by ℓ to be related to the topology of X). In what follows, when we refer to an (open) ball, we refer to the ball defined with respect to ℓ. The dissimilarity ℓ will be used to capture the smoothness of the mean-payoff function. The decision maker chooses ℓ and the tree of coverings. The following assumption relates this choice to the parameters ρ and ν1 of the algorithm:

Assumption 1. There exist ρ < 1 and ν1, ν2 > 0 such that for all integers h ≥ 0 and all i = 1, . . . , 2^h, the diameter of P_{h,i} is bounded by ν1 ρ^h, and P_{h,i} contains an open ball P°_{h,i} of radius ν2 ρ^h. For a given h, the P°_{h,i} are disjoint for 1 ≤ i ≤ 2^h.

Remark 2. A typical choice for the coverings in a cubic domain is to let the domains be hyper-rectangles. They can be obtained, e.g., in a dyadic manner, by splitting at each step hyper-rectangles in the middle along their longest side, in an axis-parallel manner; if all sides are equal, we split them along the first axis. In this example, if X = [0, 1]^D and ℓ(x, y) = ‖x − y‖^α, then we can take ρ = 2^{−α/D}, ν1 = (√D/2)^α and ν2 = (1/8)^α.

The next assumption concerns the environment.

Definition 2. We say that f is weakly Lipschitz with respect to ℓ if for all x, y ∈ X,

f∗ − f(y) ≤ f∗ − f(x) + max{ f∗ − f(x), ℓ(x, y) }.    (1)

Note that weak Lipschitzness is satisfied whenever f is 1-Lipschitz, i.e., for all x, y ∈ X, one has |f(x) − f(y)| ≤ ℓ(x, y). On the other hand, weak Lipschitzness implies local (one-sided) 1-Lipschitzness at any maxima. Indeed, at an optimal arm x∗ (i.e., such that f(x∗) = f∗), (1) rewrites to f(x∗) − f(y) ≤ ℓ(x∗, y). However, weak Lipschitzness does not constrain the growth of the loss in the vicinity of other points. Further, weak Lipschitzness, unlike Lipschitzness, does not constrain the local decrease of the loss at any point. Thus, weak Lipschitzness is a property that lies somewhere between a growth condition on the loss around optimal arms and (one-sided) Lipschitzness. Note that since weak Lipschitzness is defined with respect to a dissimilarity, it can actually capture higher-order smoothness at the optima. For example, f(x) = 1 − x² is weakly Lipschitz with the dissimilarity ℓ(x, y) = c(x − y)² for some appropriate constant c.

Assumption 2. The mean-payoff function f is weakly Lipschitz.

Let f∗_{h,i} = sup_{x∈P_{h,i}} f(x) and ∆_{h,i} = f∗ − f∗_{h,i} be the suboptimality of node (h, i). We say that a node (h, i) is optimal (respectively, suboptimal) if ∆_{h,i} = 0 (respectively, ∆_{h,i} > 0). Let Xε = { x ∈ X : f(x) ≥ f∗ − ε } be the set of ε-optimal arms. The following result follows from the definitions; a proof can be found in the appendix.

Lemma 1. Let Assumptions 1 and 2 hold. If the suboptimality ∆_{h,i} of a region is bounded by c ν1 ρ^h for some c > 0, then all arms in P_{h,i} are max{2c, c + 1} ν1 ρ^h-optimal.
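As a sanity check of the last example in Definition 2, the short script below verifies inequality (1) numerically for f(x) = 1 − x² on [0, 1] with the dissimilarity ℓ(x, y) = c(x − y)². The particular constant c = 2 is our own choice of an “appropriate constant”; it is not a value given in the text.

import random

def weakly_lipschitz(f, ell, xs, f_star):
    """Check inequality (1): f* - f(y) <= f* - f(x) + max(f* - f(x), ell(x, y))."""
    for x in xs:
        for y in xs:
            lhs = f_star - f(y)
            rhs = f_star - f(x) + max(f_star - f(x), ell(x, y))
            if lhs > rhs + 1e-12:          # small tolerance for floating point
                return False, (x, y)
    return True, None

f = lambda x: 1.0 - x * x                  # maximum f* = 1 attained at x* = 0
ell = lambda x, y: 2.0 * (x - y) ** 2      # dissimilarity with c = 2
grid = [i / 200.0 for i in range(201)] + [random.random() for _ in range(500)]
print(weakly_lipschitz(f, ell, grid, f_star=1.0))   # expected output: (True, None)

With c = 1 the same check already fails (for instance at x = 0.4, y = 0.8, the left-hand side is 0.48 while the right-hand side is 0.32), which is why a constant larger than one is needed here; this also illustrates that the dissimilarity, not a metric, is what captures the quadratic behavior at the optimum.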
The last assumption is closely related to Assumption 2 of Auer et al. [2], who observed that the regret of a continuum-armed bandit algorithm should depend on how fast the volume of the sets of ε-optimal arms shrinks as ε → 0. Here, we capture this by defining a new notion, the near-optimality dimension of the mean-payoff function. The connection between these concepts, as well as with the zooming dimension defined by Kleinberg et al. [7], will be further discussed in Section 7. Define the packing number P(X, ℓ, ε) to be the size of the largest packing of X with disjoint open balls of radius ε with respect to the dissimilarity ℓ.² We now define the near-optimality dimension, which characterizes the size of the sets Xε in terms of ε, and then state our main result.

Definition 3. For c > 0 and ε0 > 0, the (c, ε0)-near-optimality dimension of f with respect to ℓ equals

inf{ d ∈ [0, +∞) : ∃ C s.t. ∀ε ≤ ε0, P(X_{cε}, ℓ, ε) ≤ C ε^{−d} }    (2)

(with the usual convention that inf ∅ = +∞).

Theorem 1 (Main result). Let Assumptions 1 and 2 hold and assume that the (4ν1/ν2, ν2)-near-optimality dimension of the considered environment is d < +∞. Then, for any d′ > d there exists a constant C(d′) such that for all n ≥ 1,

E[Rn] ≤ C(d′) n^{(d′+1)/(d′+2)} (ln n)^{1/(d′+2)}.

Further, if the near-optimality dimension is achieved, i.e., the infimum is achieved in (2), then the result also holds for d′ = d.

Remark 3. We can relax the weak-Lipschitz property by requiring it to hold only locally around the maxima. In fact, at the price of increased constants, the result continues to hold if there exists ε > 0 such that (1) holds for any x, y ∈ Xε. To show this we only need to carefully adapt the steps of the proof below. We omit the details from this extended abstract.

5 Analysis of the regret and proof of the main result

We first state three lemmas, whose proofs can be found in the appendix. The proofs of Lemmas 3 and 4 rely on concentration-of-measure techniques, while that of Lemma 2 follows from a simple case study. Let us fix some path (0, 1), (1, i∗_1), . . . , (h, i∗_h), . . . of optimal nodes, starting from the root.

Lemma 2. Let (h, i) be a suboptimal node. Let k be the largest depth such that (k, i∗_k) is on the path from the root to (h, i). Then, for all integers u ≥ 1, we have

E[N_{h,i}(n)] ≤ u + Σ_{t=u+1}^{n} P{ [N_{h,i}(t) > u and U_{h,i}(t) > f∗] or [U_{s,i∗_s}(t) ≤ f∗ for some s ∈ {k+1, . . . , t−1}] }.

Lemma 3. Let Assumptions 1 and 2 hold. Then, for all optimal nodes (h, i) and for all integers n ≥ 1,

P{ U_{h,i}(n) ≤ f∗ } ≤ n^{−3}.

Lemma 4. Let Assumptions 1 and 2 hold. Then, for all integers t ≤ n, for all suboptimal nodes (h, i) such that ∆_{h,i} > ν1 ρ^h, and for all integers u ≥ 1 such that u ≥ 8 ln n / (∆_{h,i} − ν1 ρ^h)², one has

P{ U_{h,i}(t) > f∗ and N_{h,i}(t) > u } ≤ t n^{−4}.

² Note that sometimes packing numbers are defined as the largest packing with disjoint open balls of radius ε/2, or ε-nets.

Taking u as the integer part of (8 ln n)/(∆_{h,i} − ν1 ρ^h)², and combining the results of Lemmas 2, 3, and 4 with a union bound, leads to the following key result.

Lemma 5. Under Assumptions 1 and 2, for all suboptimal nodes (h, i) such that ∆_{h,i} > ν1 ρ^h, we have, for all n ≥ 1,

E[N_{h,i}(n)] ≤ 8 ln n / (∆_{h,i} − ν1 ρ^h)² + 2/n.

We are now ready to prove Theorem 1.

Proof. For the sake of simplicity we assume that the infimum in the definition of near-optimality is achieved. To obtain the result in the general case, one only needs to replace d below by d′ > d in the proof below.

First step.
For all h = 1, 2, . . ., denote by Ih the nodes at depth h that are 2ν1ρ^h-optimal, i.e., the nodes (h, i) such that f∗_{h,i} ≥ f∗ − 2ν1ρ^h. Then, I is the union of these sets of nodes. Further, let J be the set of nodes that are not in I but whose parent is in I. We then denote by Jh the nodes in J that are located at depth h in the tree. Lemma 5 bounds the expected number of times each node (h, i) ∈ Jh is visited: since ∆_{h,i} > 2ν1ρ^h for such nodes, we get

E[N_{h,i}(n)] ≤ 8 ln n / (ν1² ρ^{2h}) + 2/n.

Second step. We bound here the cardinality |Ih|, h > 0. If (h, i) ∈ Ih then, since ∆_{h,i} ≤ 2ν1ρ^h, by Lemma 1 we have P_{h,i} ⊂ X_{4ν1ρ^h}. Since by Assumption 1 the sets P°_{h,i}, for (h, i) ∈ Ih, contain disjoint balls of radius ν2ρ^h, we have that

|Ih| ≤ P( ∪_{(h,i)∈Ih} P_{h,i}, ℓ, ν2ρ^h ) ≤ P( X_{(4ν1/ν2) ν2ρ^h}, ℓ, ν2ρ^h ) ≤ C (ν2ρ^h)^{−d},

where we used the assumption that d is the (4ν1/ν2, ν2)-near-optimality dimension of f (and C is the constant introduced in the definition of the near-optimality dimension).

Third step. Choose η > 0 and let H be the smallest integer such that ρ^H ≤ η. We partition the infinite tree T into three sets of nodes, T = T1 ∪ T2 ∪ T3. The set T1 contains the nodes of IH and their descendants, T2 = ∪_{0≤h<H} Ih, and [. . .] the claimed bound follows by optimizing over ρ^H (the worst value being ρ^H ∼ (n/ln n)^{−1/(d+2)}).

6 Minimax optimality

The packing dimension of a set X is the smallest d such that there exists a constant k such that for all ε > 0, P(X, ℓ, ε) ≤ k ε^{−d}. For instance, compact subsets of R^d (with non-empty interior) have a packing dimension of d whenever ℓ is a norm. If X has a packing dimension of d, then all environments have a near-optimality dimension less than d. The proof of the main theorem indicates that the constant C(d) depends only on d, k (of the definition of the packing dimension), ν1, ν2, and ρ, but not on the environment as long as it is weakly Lipschitz. Hence, we can extract from it a distribution-free bound of the form O(n^{(d+1)/(d+2)}). In fact, this bound can be shown to be optimal, as is illustrated by the theorem below, whose assumptions are satisfied by, e.g., compact subsets of R^d when ℓ is some norm of R^d. The proof can be found in the appendix.

Theorem 2. If X is such that there exists c > 0 with P(X, ℓ, ε) ≥ c ε^{−d} ≥ 2 for all ε ≤ 1/4, then for all n ≥ 4^{d−1} c / ln(4/3), all strategies ϕ are bound to suffer a regret of at least

sup E[Rn(ϕ)] ≥ (1/4) ( (1/4) c / (4 ln(4/3)) )^{2/(d+2)} n^{(d+1)/(d+2)},

where the supremum is taken over all environments with weakly Lipschitz payoff functions.

7 Discussion

Several works [1; 6; 3; 2; 7] have considered continuum-armed bandits in Euclidean or metric spaces and provided upper and lower bounds on the regret for given classes of environments. Cope [3] derived a regret of O(√n) for compact and convex subsets of R^d and a mean-payoff function with a unique minimum and second-order smoothness. Kleinberg [6] considered mean-payoff functions f on the real line that are Hölder with degree 0 < α ≤ 1. The derived regret is Θ(n^{(α+1)/(α+2)}). Auer et al. [2] extended the analysis to classes of functions with only a local Hölder assumption around the maximum (with possibly higher smoothness degree α ∈ [0, ∞)), and derived the regret Θ(n^{(1+α−αβ)/(1+2α−αβ)}), where β is such that the Lebesgue measure of ε-optimal states is O(ε^β). Another setting is that of [7], who considered a metric space (X, ℓ) and assumed that f is Lipschitz w.r.t. ℓ.
The obtained regret is O(n^{(d+1)/(d+2)}), where d is the zooming dimension (defined similarly to our near-optimality dimension, but using covering numbers instead of packing numbers and the sets Xε \ X_{ε/2}). When (X, ℓ) is a metric space, covering and packing numbers are equivalent and we may prove that the zooming dimension and the near-optimality dimension are equal. Our main contribution compared to [7] is that our weak-Lipschitz assumption, which is substantially weaker than the global Lipschitz assumption assumed in [7], enables our algorithm to work better in some common situations, such as when the mean-payoff function has a local smoothness whose order is larger than one. In order to relate all these results, let us consider a specific example: let X = [0, 1]^D and assume that the mean-reward function f is locally equivalent to a Hölder function with degree α ∈ [0, ∞) around any maximum x∗ of f (the number of maxima is assumed to be finite):

f(x∗) − f(x) = Θ(‖x − x∗‖^α) as x → x∗.    (3)

This means that ∃ c1, c2, ε0 > 0 such that for all x with ‖x − x∗‖ ≤ ε0,

c1 ‖x − x∗‖^α ≤ f(x∗) − f(x) ≤ c2 ‖x − x∗‖^α.

Under this assumption, the result of Auer et al. [2] shows that for D = 1 the regret is Θ(√n) (since here β = 1/α). Our result allows us to extend the √n regret rate to any dimension D. Indeed, if we choose our dissimilarity measure to be ℓα(x, y) = ‖x − y‖^α, we may prove that f satisfies a locally weak-Lipschitz condition (as defined in Remark 3) and that the near-optimality dimension is 0; a short sketch of this computation is given after the references below. Thus our regret is Õ(√n), i.e., the rate is independent of the dimension D. In comparison, since Kleinberg et al. [7] have to satisfy a global Lipschitz assumption, they cannot use ℓα when α > 1. Indeed, a function that is globally Lipschitz with respect to ℓα is essentially constant. Moreover, ℓα does not define a metric for α > 1. If one resorts to the Euclidean metric to fulfill their requirement that f be Lipschitz w.r.t. the metric, then the zooming dimension becomes D(α − 1)/α, while the regret becomes O(n^{(D(α−1)+α)/(D(α−1)+2α)}), which is strictly worse than Õ(√n) and in fact becomes close to the slow rate O(n^{(D+1)/(D+2)}) when α is large. Nevertheless, in the case of α ≤ 1 they get the same regret rate. In contrast, our result shows that under very weak constraints on the mean-payoff function, and if the local behavior of the function around its maximum (or its finite number of maxima) is known, then global optimization suffers a regret of order Õ(√n), independent of the space dimension. As an interesting side note, let us also remark that our results allow different smoothness orders along different dimensions, i.e., heterogeneous smoothness spaces.

References

[1] R. Agrawal. The continuum-armed bandit problem. SIAM J. Control and Optimization, 33:1926–1951, 1995.
[2] P. Auer, R. Ortner, and Cs. Szepesvári. Improved rates for the stochastic continuum-armed bandit problem. In 20th Conference on Learning Theory, pages 454–468, 2007.
[3] E. Cope. Regret and convergence bounds for immediate-reward reinforcement learning with continuous action spaces. Preprint, 2004.
[4] P.-A. Coquelin and R. Munos. Bandit algorithms for tree search. In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence, 2007.
[5] S. Gelly, Y. Wang, R. Munos, and O. Teytaud. Modification of UCT with patterns in Monte-Carlo Go. Technical Report RR-6062, INRIA, 2006.
[6] R. Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In 18th Advances in Neural Information Processing Systems, 2004.
[7] R. Kleinberg, A. Slivkins, and E. Upfal. Multi-armed bandits in metric spaces. In Proceedings of the 40th ACM Symposium on Theory of Computing, 2008.
[8] L. Kocsis and Cs. Szepesvári. Bandit based Monte-Carlo planning. In Proceedings of the 15th European Conference on Machine Learning, pages 282–293, 2006.
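Returning to the Hölder example of Section 7 above, the following short calculation (our own sketch, using only condition (3), the definitions of Xε and of packing numbers, a single maximizer x∗, and ε small enough for (3) to apply) indicates why the near-optimality dimension is 0 when the dissimilarity ℓα(x, y) = ‖x − y‖^α is used.

% Sketch: near-optimality dimension 0 under the local Hölder condition (3).
% B_Euc denotes a Euclidean ball; the inclusion uses the lower bound in (3).
\begin{align*}
X_{c\varepsilon} &= \{\, x : f(x) \ge f^* - c\varepsilon \,\}
  \subset \{\, x : c_1 \|x - x^*\|^{\alpha} \le c\varepsilon \,\}
  = B_{\mathrm{Euc}}\bigl(x^*, (c\varepsilon/c_1)^{1/\alpha}\bigr).
\intertext{An $\ell_\alpha$-ball of radius $\varepsilon$ is a Euclidean ball of radius $\varepsilon^{1/\alpha}$, so any packing of $X_{c\varepsilon}$ by disjoint such balls lies inside a Euclidean ball of radius $(c\varepsilon/c_1)^{1/\alpha} + \varepsilon^{1/\alpha}$, and a volume comparison gives}
\mathcal{P}\bigl(X_{c\varepsilon}, \ell_\alpha, \varepsilon\bigr)
  &\le \left( \frac{(c\varepsilon/c_1)^{1/\alpha} + \varepsilon^{1/\alpha}}{\varepsilon^{1/\alpha}} \right)^{D}
  = \bigl( (c/c_1)^{1/\alpha} + 1 \bigr)^{D},
\end{align*}
% a constant independent of epsilon, so the infimum in (2) is attained at d = 0.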
5 0.26249164 150 nips-2008-Near-optimal Regret Bounds for Reinforcement Learning
Author: Peter Auer, Thomas Jaksch, Ronald Ortner
Abstract: For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: an MDP has diameter D if for any pair of states s, s′ there is a policy which moves from s to s′ in at most D steps (on average). We present a reinforcement learning algorithm with total regret Õ(DS√(AT)) after T steps for any unknown MDP with S states, A actions per state, and diameter D. This bound holds with high probability. We also present a corresponding lower bound of Ω(√(DSAT)) on the total regret of any learning algorithm. 1
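To illustrate the diameter parameter D introduced in this abstract, the sketch below computes it for a small deterministic MDP by taking, over all ordered state pairs, the shortest number of steps needed to move between them (for deterministic transitions the "on average" qualifier is vacuous). The ring-shaped MDP is an illustrative assumption, not an example from the paper.

from collections import deque

def diameter(transitions):
    """transitions[s][a] = next state; return max over (s, s') of the shortest path length."""
    worst = 0
    for s in range(len(transitions)):
        dist = {s: 0}
        queue = deque([s])
        while queue:                          # breadth-first search from s
            u = queue.popleft()
            for v in transitions[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        worst = max(worst, max(dist.values()))
    return worst

# A ring of 6 states: action 0 stays put, action 1 moves one step clockwise.
ring = [[s, (s + 1) % 6] for s in range(6)]
print(diameter(ring))   # 5: reaching the state just behind the current one takes 5 steps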
6 0.24500003 187 nips-2008-Psychiatry: Insights into depression through normative decision-making models
7 0.21970761 22 nips-2008-An Online Algorithm for Maximizing Submodular Functions
8 0.20550394 222 nips-2008-Stress, noradrenaline, and realistic prediction of mouse behaviour using reinforcement learning
9 0.18548174 169 nips-2008-Online Models for Content Optimization
10 0.1707276 94 nips-2008-Goal-directed decision making in prefrontal cortex: a computational framework
11 0.1503727 144 nips-2008-Multi-resolution Exploration in Continuous Spaces
12 0.14968689 125 nips-2008-Local Gaussian Process Regression for Real Time Online Model Learning
13 0.1493123 230 nips-2008-Temporal Difference Based Actor Critic Learning - Convergence and Neural Implementation
14 0.1366749 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions
15 0.1343179 13 nips-2008-Adapting to a Market Shock: Optimal Sequential Market-Making
16 0.11943866 87 nips-2008-Fitted Q-iteration by Advantage Weighted Regression
17 0.11781254 33 nips-2008-Bayesian Model of Behaviour in Economic Games
18 0.11754176 37 nips-2008-Biasing Approximate Dynamic Programming with a Lower Discount Factor
19 0.11226114 173 nips-2008-Optimization on a Budget: A Reinforcement Learning Approach
20 0.11031177 149 nips-2008-Near-minimax recursive density estimation on the binary hypercube
topicId topicWeight
[(6, 0.065), (7, 0.041), (12, 0.041), (28, 0.134), (57, 0.03), (59, 0.014), (63, 0.023), (71, 0.022), (77, 0.047), (78, 0.013), (83, 0.042), (92, 0.41), (94, 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.71110481 140 nips-2008-Mortal Multi-Armed Bandits
Author: Deepayan Chakrabarti, Ravi Kumar, Filip Radlinski, Eli Upfal
Abstract: We formulate and study a new variant of the k-armed bandit problem, motivated by e-commerce applications. In our model, arms have (stochastic) lifetime after which they expire. In this setting an algorithm needs to continuously explore new arms, in contrast to the standard k-armed bandit model in which arms are available indefinitely and exploration is reduced once an optimal arm is identified with nearcertainty. The main motivation for our setting is online-advertising, where ads have limited lifetime due to, for example, the nature of their content and their campaign budgets. An algorithm needs to choose among a large collection of ads, more than can be fully explored within the typical ad lifetime. We present an optimal algorithm for the state-aware (deterministic reward function) case, and build on this technique to obtain an algorithm for the state-oblivious (stochastic reward function) case. Empirical studies on various reward distributions, including one derived from a real-world ad serving application, show that the proposed algorithms significantly outperform the standard multi-armed bandit approaches applied to these settings. 1
2 0.55386466 2 nips-2008-A Convex Upper Bound on the Log-Partition Function for Binary Distributions
Author: Laurent E. Ghaoui, Assane Gueye
Abstract: We consider the problem of bounding from above the log-partition function corresponding to second-order Ising models for binary distributions. We introduce a new bound, the cardinality bound, which can be computed via convex optimization. The corresponding error on the log-partition function is bounded above by twice the distance, in model parameter space, to a class of “standard” Ising models, for which variable inter-dependence is described via a simple mean field term. In the context of maximum likelihood, using the new bound instead of the exact log-partition function, while constraining the distance to the class of standard Ising models, leads not only to a good approximation to the log-partition function, but also to a model that is parsimonious and easily interpretable. We compare our bound with the log-determinant bound introduced by Wainwright and Jordan (2006), and show that when the ℓ1-norm of the model parameter vector is small enough, the latter is outperformed by the new bound. 1
3 0.38642856 202 nips-2008-Robust Regression and Lasso
Author: Huan Xu, Constantine Caramanis, Shie Mannor
Abstract: We consider robust least-squares regression with feature-wise disturbance. We show that this formulation leads to tractable convex optimization problems, and we exhibit a particular uncertainty set for which the robust problem is equivalent to ℓ1-regularized regression (Lasso). This provides an interpretation of Lasso from a robust optimization perspective. We generalize this robust formulation to consider more general uncertainty sets, which all lead to tractable convex optimization problems. Therefore, we provide a new methodology for designing regression algorithms, which generalize known formulations. The advantage is that robustness to disturbance is a physical property that can be exploited: in addition to obtaining new formulations, we use it directly to show sparsity properties of Lasso, as well as to prove a general consistency result for robust regression problems, including Lasso, from a unified robustness perspective. 1
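As a concrete reference point for the ℓ1-regularized least-squares (Lasso) problem that this abstract relates to robust regression, here is a minimal proximal-gradient (ISTA) sketch. It solves only the standard Lasso objective, not the robust formulation of the paper, and the step size, regularization weight, and synthetic data are illustrative assumptions.

import numpy as np

def lasso_ista(X, y, lam, n_iters=500):
    """Minimize (1/2)||y - Xw||_2^2 + lam * ||w||_1 by iterative soft-thresholding."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2      # 1 / Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y)                # gradient of the smooth least-squares part
        z = w - step * grad
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft-thresholding
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]                   # sparse ground truth
y = X @ w_true + 0.1 * rng.standard_normal(100)
print(np.round(lasso_ista(X, y, lam=5.0), 2))   # most coefficients shrink to exactly zero

The exact zeros produced by the soft-thresholding step are the sparsity property that the abstract revisits from the robustness viewpoint.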
4 0.38447085 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
Author: Francis R. Bach
Abstract: For supervised and unsupervised learning, positive definite kernels allow to use large and potentially infinite dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the ℓ1 -norm or the block ℓ1 -norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to non linear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.
5 0.38396093 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
Author: Aarti Singh, Robert Nowak, Xiaojin Zhu
Abstract: Empirical evidence shows that in favorable situations semi-supervised learning (SSL) algorithms can capitalize on the abundance of unlabeled training data to improve the performance of a learning task, in the sense that fewer labeled training data are needed to achieve a target error bound. However, in other situations unlabeled data do not seem to help. Recent attempts at theoretically characterizing SSL gains only provide a partial and sometimes apparently conflicting explanations of whether, and to what extent, unlabeled data can help. In this paper, we attempt to bridge the gap between the practice and theory of semi-supervised learning. We develop a finite sample analysis that characterizes the value of unlabeled data and quantifies the performance improvement of SSL compared to supervised learning. We show that there are large classes of problems for which SSL can significantly outperform supervised learning, in finite sample regimes and sometimes also in terms of error convergence rates.
6 0.38379946 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost
8 0.38321227 196 nips-2008-Relative Margin Machines
9 0.38299179 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
10 0.38293836 96 nips-2008-Hebbian Learning of Bayes Optimal Decisions
11 0.38283193 17 nips-2008-Algorithms for Infinitely Many-Armed Bandits
12 0.38278332 195 nips-2008-Regularized Policy Iteration
13 0.38205561 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models
14 0.38066992 85 nips-2008-Fast Rates for Regularized Objectives
15 0.38038877 62 nips-2008-Differentiable Sparse Coding
16 0.37997767 88 nips-2008-From Online to Batch Learning with Cutoff-Averaging
17 0.37989473 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations
18 0.37989366 179 nips-2008-Phase transitions for high-dimensional joint support recovery
19 0.37961552 215 nips-2008-Sparse Signal Recovery Using Markov Random Fields
20 0.37912205 228 nips-2008-Support Vector Machines with a Reject Option