jmlr jmlr2013 jmlr2013-68 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Theja Tulabandhula, Cynthia Rudin
Abstract: This work proposes a way to align statistical modeling with decision making. We provide a method that propagates the uncertainty in predictive modeling to the uncertainty in operational cost, where operational cost is the amount spent by the practitioner in solving the problem. The method allows us to explore the range of operational costs associated with the set of reasonable statistical models, so as to provide a useful way for practitioners to understand uncertainty. To do this, the operational cost is cast as a regularization term in a learning algorithm’s objective function, allowing either an optimistic or pessimistic view of possible costs, depending on the regularization parameter. From another perspective, if we have prior knowledge about the operational cost, for instance that it should be low, this knowledge can help to restrict the hypothesis space, and can help with generalization. We provide a theoretical generalization bound for this scenario. We also show that learning with operational costs is related to robust optimization. Keywords: statistical learning theory, optimization, covering numbers, decision theory
Reference: text
sentIndex sentText sentNum sentScore
1 We provide a method that propagates the uncertainty in predictive modeling to the uncertainty in operational cost, where operational cost is the amount spent by the practitioner in solving the problem. [sent-4, score-1.174]
2 The method allows us to explore the range of operational costs associated with the set of reasonable statistical models, so as to provide a useful way for practitioners to understand uncertainty. [sent-5, score-0.557]
3 To do this, the operational cost is cast as a regularization term in a learning algorithm’s objective function, allowing either an optimistic or pessimistic view of possible costs, depending on the regularization parameter. [sent-6, score-0.88]
4 From another perspective, if we have prior knowledge about the operational cost, for instance that it should be low, this knowledge can help to restrict the hypothesis space, and can help with generalization. [sent-7, score-0.526]
5 We also show that learning with operational costs is related to robust optimization. [sent-9, score-0.584]
6 Introduction Machine learning algorithms are used to produce predictions, and these predictions are often used to make a policy or plan of action afterwards, where there is a cost to implement the policy. [sent-11, score-0.388]
7 The three questions above cannot be answered by standard decision theory, where the goal is to produce a single policy that minimizes expected cost. [sent-19, score-0.386]
8 robust optimization, where the goal is to produce a single policy that is robust to the uncertainty in nature. [sent-21, score-0.467]
9 Those paradigms produce a single policy decision that takes uncertainty into account, and the chosen policy might not be a best response policy to any realistic situation. [sent-22, score-0.885]
10 In order to propagate the uncertainty in modeling to the uncertainty in costs, we introduce what we call the simultaneous process, where we explore the range of predictive models and corresponding policy decisions at the same time. [sent-28, score-0.611]
11 The sequential process is commonly used in practice, even though there may actually be a whole class of models that could be relevant for the policy decision problem. [sent-30, score-0.403]
12 In the simultaneous process, the machine learning algorithm contains a regularization term encoding the policy and its associated cost, with an adjustable regularization parameter. [sent-32, score-0.473]
13 If there is some uncertainty about how much it will cost to solve the problem, the regularization parameter can be swept through an interval to find a range of possible costs, from optimistic to pessimistic. [sent-33, score-0.411]
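The sweep over the regularization parameter can be illustrated on a toy problem. The sketch below is not the paper's implementation: it assumes a one-parameter model whose training loss is the quadratic `curvature * (b - b_hat)**2` and an operational cost that is linear in the parameter (`unit_cost * b`), so that Step 1 has a closed form. All names and numbers are hypothetical. A positive `c1` plays the role of the optimistic bias and a negative `c1` the pessimistic one.

```python
def solve_step1(c1, b_hat=0.99, curvature=30.0, unit_cost=11.0):
    """Minimize curvature*(b - b_hat)**2 + c1*unit_cost*b over b.

    Setting the derivative to zero gives b = b_hat - c1*unit_cost/(2*curvature).
    Returns the chosen parameter and the resulting operational cost.
    """
    b = b_hat - c1 * unit_cost / (2.0 * curvature)
    return b, unit_cost * b


def sweep_costs(c1_values):
    """Return (c1, operational cost) pairs for each regularization strength."""
    return [(c1, solve_step1(c1)[1]) for c1 in c1_values]
```

Calling `sweep_costs([-0.5, 0.0, 0.5])` traces costs from the pessimistic end, through the sequential solution at `c1 = 0`, to the optimistic end. In this toy setting the cost decreases monotonically as `c1` grows.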
14 This is an important question, since business managers often like to know if there is some scenario/decision pair that is supported by the data, but for which the operational cost is low (or high); the simultaneous process would be able to find such scenarios directly. [sent-40, score-0.767]
15 To do this, we would look at the setting of the regularization parameter that resulted in the desired value of the cost, and then look at the solution of the simultaneous formulation, which gives the model and its corresponding policy decision. [sent-41, score-0.441]
16 The regularization parameter can be interpreted to regulate the strength of our belief in the operational cost. [sent-43, score-0.49]
17 The operational cost regularization term can be the optimal value of a complicated optimization problem, like a scheduling problem. [sent-50, score-0.696]
18 We trace out this path by changing our prior belief on the operational cost (that is, by changing the strength of our regularization term). [sent-69, score-0.614]
19 Robust optimization is a maximin approach to decision making, and the simultaneous process also differs in principle from robust optimization. [sent-76, score-0.383]
20 In robust optimization, one would generally need to allocate much more than is necessary for any single realistic situation, in order to produce a policy that is robust to almost all situations. [sent-77, score-0.452]
21 In Section 3, we give several examples of algorithms that incorporate these operational costs. [sent-80, score-0.424]
22 A fourth example is the “Machine Learning and Traveling Repairman Problem” (ML&TRP) where the policy decision is a route for a repair crew. [sent-87, score-0.407]
23 We also discuss the overlap between RO and the optimistic and pessimistic versions of the simultaneous process. [sent-92, score-0.441]
24 In particular, we aim to understand how operational costs affect prediction (generalization) ability. [sent-94, score-0.526]
25 This helps answer the third question Q3, about how intuition about operational cost can help produce a better probabilistic model. [sent-95, score-0.639]
26 Commonly in machine learning this is done by choosing f to be the solution of a minimization problem: $f^* \in \operatorname{argmin}_{f \in \mathcal{F}^{unc}} \left( \sum_{i=1}^n l(f(x_i), y_i) + C_2 R(f) \right)$ (1), for some loss function $l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$, regularizer $R : \mathcal{F}^{unc} \to \mathbb{R}$, constant $C_2$ and function class $\mathcal{F}^{unc}$. [sent-102, score-0.564]
27 Given a new collection of unlabeled instances $\{\tilde{x}_i\}_{i=1}^m$, the organization wants to create a policy $\pi^*$ that minimizes a certain operational cost $\mathrm{OpCost}(\pi, f^*, \{\tilde{x}_i\}_i)$. [sent-110, score-0.856]
28 Of course, if the organization knew the true labels for the $\{\tilde{x}_i\}_i$'s beforehand, it would choose a policy to optimize the operational cost based directly on these labels, and would not need $f^*$. [sent-111, score-0.784]
29 Since the labels are not known, the operational costs are calculated using the model's predictions, the $f^*(\tilde{x}_i)$'s. [sent-112, score-0.526]
30 The difference between the traditional sequential process and the new simultaneous process is whether $f^*$ is chosen with or without knowledge of the operational cost. [sent-113, score-0.748]
31 The traditional sequential process picks a model $f^*$, based on past failure data without the knowledge of operational cost, and afterwards computes $\pi^*$ based on an optimization problem involving the $\{f^*(\tilde{x}_i)\}_i$'s and the operational cost. [sent-117, score-0.997]
32 The new simultaneous process picks $f^*$ and $\pi^*$ at the same time, based on optimism or pessimism on the operational cost of $\pi^*$. [sent-118, score-0.767]
33 That is, $f^* \in \operatorname{argmin}_{f \in \mathcal{F}^{unc}} \left( \sum_{i=1}^n l(f(x_i), y_i) + C_2 R(f) \right)$. Step 2: Choose policy $\pi^*$ to minimize the operational cost, $\pi^* \in \operatorname{argmin}_{\pi \in \Pi} \mathrm{OpCost}(\pi, f^*, \{\tilde{x}_i\}_i)$. [sent-121, score-0.865]
34 The operational cost $\mathrm{OpCost}(\pi, f^*, \{\tilde{x}_i\}_i)$ is the amount the organization will spend if policy $\pi$ is chosen in response to the values of $\{f^*(\tilde{x}_i)\}_i$. [sent-123, score-0.784]
35 In other words, the optimistic bias lowers costs when there is uncertainty, whereas the pessimistic bias raises them. [sent-126, score-0.45]
36 Step 1: Choose a model $f^\circ$ obeying one of the following: Optimistic Bias: $f^\circ \in \operatorname{argmin}_{f \in \mathcal{F}^{unc}} \left[ \sum_{i=1}^n l(f(x_i), y_i) + C_2 R(f) + C_1 \min_{\pi \in \Pi} \mathrm{OpCost}(\pi, f, \{\tilde{x}_i\}_i) \right]$, (2) Pessimistic Bias: $f^\circ \in \operatorname{argmin}_{f \in \mathcal{F}^{unc}} \left[ \sum_{i=1}^n l(f(x_i), y_i) + C_2 R(f) - C_1 \min_{\pi \in \Pi} \mathrm{OpCost}(\pi, f, \{\tilde{x}_i\}_i) \right]$. [sent-128, score-0.488]
37 When $C_1 = 0$, the simultaneous process becomes the sequential process; the sequential process is a special case of the simultaneous process. [sent-130, score-0.556]
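The relationship between the sequential process and the two biased versions of Step 1 can be sketched on a small synthetic problem. This is a minimal illustration under stated assumptions, not the authors' code: the model is a single slope `b` chosen by grid search, the loss is squared error, the regularizer is `b**2`, and the operational cost is a hypothetical staffing cost (the smallest whole number of workers covering the total predicted demand on the new instances). The data and function names are invented for the example.

```python
import math


def training_loss(b, xs, ys):
    """Sum of squared errors of the one-parameter model f(x) = b * x."""
    return sum((y - b * x) ** 2 for x, y in zip(xs, ys))


def min_op_cost(b, new_xs):
    """Hypothetical min over policies: cheapest integer staffing level
    that covers the predicted total demand on the unlabeled instances."""
    demand = sum(b * x for x in new_xs)
    return max(0, math.ceil(demand))


def choose_model(xs, ys, new_xs, grid, c1=0.0, c2=0.0, bias="optimistic"):
    """Step 1 by grid search: loss + c2*R(f) +/- c1 * min-cost.

    bias="optimistic" adds the cost term, bias="pessimistic" subtracts it.
    With c1 == 0 this reduces to the sequential process.
    """
    sign = 1.0 if bias == "optimistic" else -1.0

    def objective(b):
        return (training_loss(b, xs, ys) + c2 * b * b
                + sign * c1 * min_op_cost(b, new_xs))

    b_best = min(grid, key=objective)
    return b_best, min_op_cost(b_best, new_xs)


# Synthetic data (hypothetical): four labeled points, two unlabeled ones.
XS, YS = [1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8]
NEW_XS = [5, 6]
GRID = [round(0.80 + 0.01 * k, 2) for k in range(41)]
```

Running `choose_model(XS, YS, NEW_XS, GRID, c1=1.0, bias="optimistic")` and the same call with `bias="pessimistic"` brackets the cost obtained with `c1=0.0`, which is the sequential answer.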
38 However, if the number of unlabeled instances is small, or if the policy decision can be broken into several smaller subproblems, then even if the training set is large, one can solve Step 1 using different types of mathematical programming solvers, including MINLP solvers (Bonami et al. [sent-132, score-0.373]
39 The simultaneous process is more intensive than the sequential process in that it requires repeated solutions of that optimization problem, rather than a single solution. [sent-136, score-0.368]
40 In addition, we can pick a value of C1 such that the resulting operational cost is a specific amount. [sent-145, score-0.548]
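One way to look for a value of C1 that yields a prescribed cost is a one-dimensional search. The sketch below assumes that the resulting operational cost is a continuous, non-increasing function of C1 on the search interval, which need not hold when the policy is discrete (the cost can then jump past the target). The function `cost_at` stands in for solving Step 1 and evaluating the cost; it and the toy cost used to exercise it are hypothetical.

```python
def find_c1_for_cost(cost_at, target, lo, hi, tol=1e-9, max_iter=200):
    """Bisection for c1 in [lo, hi] with cost_at(c1) close to target.

    Assumes cost_at is continuous and non-increasing in c1, and that
    cost_at(hi) <= target <= cost_at(lo).
    """
    if not (cost_at(hi) <= target <= cost_at(lo)):
        raise ValueError("target cost is not bracketed by [lo, hi]")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if cost_at(mid) > target:
            lo = mid  # cost still too high: need a stronger optimistic bias
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


def toy_cost(c1):
    """Hypothetical closed-form cost of a one-parameter quadratic problem."""
    return 11.0 * (0.99 - c1 * 11.0 / 60.0)
```

For example, `find_c1_for_cost(toy_cost, target=10.0, lo=0.0, hi=2.0)` returns the regularization strength at which the toy cost equals 10.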
41 One should not view the operational cost as a utility function that needs to be estimated, as in reinforcement learning, where we do not know the cost. [sent-155, score-0.548]
42 The simultaneous process also has a resemblance to transductive learning (see Zhu, 2007), whose goal is to produce the output labels on the set of unlabeled examples; in this case, we produce a function (namely the operational cost) applied to those output labels. [sent-162, score-0.744]
43 In addition, it is possible that a small change in the choice of predictive model could lead to a large change in the cost required to implement the policy recommended by the model. [sent-174, score-0.388]
44 b) A possible operational cost as a function of model complexity. [sent-182, score-0.548]
45 Assume we have a strong prior belief that the operational cost will not be above a certain fixed amount. [sent-186, score-0.582]
46 Because the hypothesis space is smaller, we may be able to produce a tighter bound on the complexity of the hypothesis space, thereby obtaining a better prediction guarantee for the simultaneous process than for the sequential process. [sent-190, score-0.425]
47 These results indicate that in some cases, the operational cost can be an important quantity for generalization. [sent-192, score-0.548]
48 The first two are small scale reproducible examples, designed to demonstrate new types of constraints due to operational costs. [sent-199, score-0.506]
49 In the first example, the operational cost subproblem involves scheduling. [sent-200, score-0.548]
50 In the first, second and fourth examples, the operational cost leads to a linear constraint, while in the third example, the cost leads to a quadratic constraint. [sent-203, score-0.672]
51 The operational goal is to minimize the total time of the clinic’s operation, from when the check-in happens at time $\pi_1$ until the check-out happens at time $\pi_5$. [sent-223, score-0.424]
52 Figure 3 shows the operational cost, training loss, and $r^2$ statistic for various values of $C_1$. [sent-242, score-0.454]
53 The resulting (mixed-integer) program for Step 1 of the simultaneous process is: $\min_{\beta \in \{\beta : \beta \in \mathbb{R}^{13}, \|\beta\|_2^2 \le C_2^*\}} \sum_{i=1}^n (y_i - \beta^T x_i)^2 + C_1 \left[ \max_{\pi \in \{0,1\}^6} \sum_{i=1}^6 (\beta^T \tilde{x}_i + c_i)\pi_i \text{ subject to } \sum_{i=1}^6 \pi_i \le 3 \right]$. [sent-283, score-0.404]
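The inner maximization in this program selects at most three of six candidate houses. For a fixed coefficient vector it is small enough to solve by enumeration, as in the sketch below. The interpretation of the per-house constant as a fixed value adjustment, the feature vectors, and all function names are assumptions made for illustration; the paper itself uses a mixed-integer solver for the joint problem.

```python
from itertools import combinations


def predicted_values(beta, houses, adjustments):
    """Predicted value beta^T x + c for each candidate house."""
    return [sum(b * x for b, x in zip(beta, feats)) + c
            for feats, c in zip(houses, adjustments)]


def best_purchase(values, max_houses=3):
    """Enumerate all subsets of size <= max_houses and keep the one with
    the largest total predicted value. Returns (total, indices)."""
    best_total, best_set = 0.0, ()
    for k in range(1, max_houses + 1):
        for subset in combinations(range(len(values)), k):
            total = sum(values[i] for i in subset)
            if total > best_total:
                best_total, best_set = total, subset
    return best_total, best_set
```

Because the selection depends on the coefficient vector only through the predicted values, re-running `best_purchase` for each candidate model shows directly how a change in the model changes the purchase decision.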
54 Figure 4 shows the operational cost, which is the predicted total value of the houses after remodeling, the training loss, and $r^2$ values for a range of $C_1$. [sent-289, score-0.578]
55 The pessimistic bias shows that even if the developer chose the best response policy to the prices, she might end up with the expected total value of the purchased properties on the order of 6. [sent-294, score-0.376]
56 The operational cost is the total number of employees hired to work that day (times a constant, which is the amount each person is paid). [sent-339, score-0.63]
57 An optimistic bias on cost is chosen, so that the (mixed-integer) program for Step 1 is: $\min_{\beta : \|\beta\|_2^2 \le C_2^*} \sum_{i=1}^n (y_i - \beta^T x_i)^2 + C_1 \min_\pi \sum_i \pi_i$ subject to $a^T \pi \ge (\beta^T x_i)^2$ for i = 1, . [sent-357, score-0.556]
58 Figure 6 shows the operational cost, the training loss, and $r^2$ values for a range of $C_1$. [sent-377, score-0.454]
59 The optimistic bias shows that the management might incur operational costs on the order of 9% less if they are lucky. [sent-383, score-0.765]
60 Further, the simultaneous process produces a reasonable model where costs are about 9% less. [sent-384, score-0.352]
61 Let us now investigate the structure of the operational cost regularization term we have in (9). [sent-386, score-0.58]
62 We might wish to find a policy that is not only supported by the historical power grid data (that ranks more vulnerable manholes above less vulnerable ones), but also would give a better route for the repair crew. [sent-415, score-0.446]
63 The simultaneous process could be used to solve this problem, where the operational cost is the price to route the repair crew along a graph, and the probabilities of failure at each node in the graph must be estimated. [sent-417, score-0.876]
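To make the routing cost concrete, the sketch below evaluates one plausible form of it on a tiny graph: each node's estimated failure probability is weighted by the time at which the crew reaches it, and the best route is found by brute force. This particular cost, the distance matrix, and the function names are assumptions for illustration only; the source text does not spell out the cost used in the ML&TRP, and enumeration is only feasible for a handful of nodes.

```python
from itertools import permutations


def weighted_latency(route, dist, fail_prob, depot=0):
    """Sum over visited nodes of (failure probability) x (arrival time)."""
    elapsed, total, here = 0.0, 0.0, depot
    for node in route:
        elapsed += dist[here][node]
        total += fail_prob[node] * elapsed
        here = node
    return total


def best_route(dist, fail_prob, depot=0):
    """Try every visiting order of the non-depot nodes (brute force)."""
    nodes = [v for v in range(len(dist)) if v != depot]
    return min(permutations(nodes),
               key=lambda r: weighted_latency(r, dist, fail_prob, depot))
```

With a depot between two nodes at equal distance, swapping which node has the higher estimated failure probability reverses the preferred route, which is the coupling between estimation and routing that the simultaneous process exploits.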
64 In general, robust optimization can be overly pessimistic, requiring us to allocate enough to handle all reasonable situations; it can be substantially more pessimistic than the pessimistic simultaneous process. [sent-430, score-0.578]
65 Using techniques of RO, we would minimize the largest possible operational cost that could arise from parameter settings in these ranges. [sent-434, score-0.548]
66 Using even a small optimization problem as the operational cost might have a large impact on the model and decision. [sent-441, score-0.592]
67 An example where the policy optimization might be broken into smaller subproblems is when the policy involves routing several different vehicles, where each vehicle must visit part of the unlabeled set; in that case there is a small subproblem for each vehicle. [sent-443, score-0.593]
68 On the other hand, even though the goals of the simultaneous process and RO are entirely different, there is a strong connection with respect to the formulations for the simultaneous process and RO, and a class of problems for which they are equivalent. [sent-444, score-0.438]
69 Let the optimization problem for the policy decision $\pi$ be defined by: $\min_{\pi \in \Pi(f; \{\tilde{x}_i\})} \mathrm{OpCost}(\pi, f; \{\tilde{x}_i\})$ (Base problem) (10), where $\Pi(f; \{\tilde{x}_i\})$ is the feasible set for the optimization problem. [sent-459, score-0.458]
70 The equivalence relationship of Proposition 1 shows that there is a problem class in which each instance can be viewed either as a RO problem or an estimation problem with an operational cost bias. [sent-505, score-0.574]
71 The least squares loss leads to ellipsoidal constraints on the uncertainty set, but it is unclear what the structure would be for uncertainty sets arising from the 0-1 loss, ramp loss, hinge loss, logistic loss and exponential loss among others. [sent-529, score-0.459]
72 establish an uncertainty set based on the loss function and f ∗ , for example, ellipsoidal constraints arising from the least squares loss (or one could use any of the new uncertainty sets discussed in the previous paragraph), 3. [sent-538, score-0.387]
73 For optimistic optimization, more uncertainty is favorable, and we find the best policy for the best possible situation. [sent-544, score-0.491]
74 In optimistic optimization, we view operational cost optimistically ($\min_{f \in \mathcal{F}_{good}} \mathrm{OpCost}$) whereas in the robust optimization counterpart (11), we view operational cost conservatively ($\max_{f \in \mathcal{F}_{good}} \mathrm{OpCost}$). [sent-547, score-1.393]
75 The policy $\pi^*$ is feasible in more situations in RO ($\min_{\pi \in \cap_{g \in \mathcal{F}_{good}} \Pi}$) since it must be feasible with respect to each $g \in \mathcal{F}_{good}$, whereas the OpCost is lower in optimistic optimization ($\min_{\pi \in \cup_{g \in \mathcal{F}_{good}} \Pi}$) since it need only be feasible with respect to at least one of the g’s. [sent-548, score-0.547]
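The contrast between minimizing over the intersection and over the union of feasible sets can be seen in a few lines. In the sketch below, each "good" model is reduced to a single predicted demand, a policy is an integer capacity that is feasible for a model when it covers that model's demand, and the cost is the capacity itself. These modeling choices and names are hypothetical and serve only to show the set operations.

```python
def feasible_policies(demand, candidate_policies):
    """Policies (capacities) that cover the demand predicted by one model."""
    return {p for p in candidate_policies if p >= demand}


def robust_and_optimistic(demands, candidate_policies, cost=lambda p: p):
    """Return (robust policy, optimistic policy).

    Robust: cheapest policy feasible for every model (intersection).
    Optimistic: cheapest policy feasible for at least one model (union).
    Raises ValueError if no candidate is feasible for all models.
    """
    sets = [feasible_policies(d, candidate_policies) for d in demands]
    intersection = set.intersection(*sets)
    union = set.union(*sets)
    return min(intersection, key=cost), min(union, key=cost)
```

With demands 8.2, 10.5 and 9.1 and integer capacities up to 20, the robust choice must cover the largest demand while the optimistic choice only needs to cover the smallest, so the robust cost is never lower than the optimistic one.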
76 Both optimistic optimization and robust optimization, considered with respect to uncertainty sets Fgood , have non-trivial overlap with the simultaneous process. [sent-552, score-0.53]
77 In particular, we showed in Proposition 1 that pessimistic bias on operational cost is equivalent to robust optimization under specific conditions on OpCost and Π. [sent-553, score-0.79]
78 Using an analogous proof, one can show that optimistic bias on operational cost is equivalent to optimistic optimization under the same set of conditions. [sent-554, score-0.968]
79 Both robust and optimistic optimization and the simultaneous process encompass large classes of problems, some of which overlap. [sent-555, score-0.489]
80 There is a class of problems that fall into the simultaneous process, but are not equivalent to robust or optimistic optimization problems. [sent-559, score-0.443]
81 These are problems where we use operational cost to assist with estimation, as in the call center example and ML&TRP discussed in Section 3. [sent-560, score-0.548]
82 There are also problems contained in either robust optimization or optimistic optimization alone and do not belong to the simultaneous process. [sent-563, score-0.487]
83 Note that the housing problem presented in Section 3 lies within the intersection of optimistic optimization and the simultaneous process; this can be deduced from (7). [sent-565, score-0.385]
84 In this section, we consider the complexity of hypothesis spaces that results from an operational cost bias. [sent-576, score-0.594]
85 , Bartlett and Mendelson, 2002), which combined with our bound, will yield a specific generalization bound for machine learning with operational costs, as we will construct in Theorem 10. [sent-588, score-0.451]
86 In Section 3, we showed that a bias on the operational cost can sometimes be transformed into linear constraints on model parameter β (see Equations (5) and (8)). [sent-589, score-0.67]
87 The bound improves as the constraints given by $c_{j\nu}$ on the operational cost become tighter. [sent-651, score-0.657]
88 |PK0| is the size of the polytope without the operational cost constraints. [sent-658, score-0.548]
89 Our main result in Theorem 6 can be used in conjunction with Theorems 8 and 9, to directly see how the true error relates to the empirical error and the constraints on the restricted function class $\mathcal{F}$ (the $\ell_q$-norm bound on β and linear constraint on β from the operational cost bias). [sent-688, score-0.69]
90 This bound implies that prior knowledge about the operational cost can be important for generalization. [sent-694, score-0.575]
91 Specifically, the operational cost term for the ML&TRP allowed us to reduce the covering number term in the bound from $\log N(\cdot, \cdot, \cdot)$ to $\log(\alpha N(\cdot, \cdot, \|\cdot\|_2))$, or equivalently $\log N(\cdot, \cdot, \|\cdot\|_2) + \log \alpha$, where α is a function of the operational cost constraint. [sent-706, score-1.21]
92 These constraints on the $\{\beta_j\}_j$’s are the ones coming from constraints on the operational cost. [sent-818, score-0.588]
93 In other words, we want to know that our (discretized) approximation of y also obeys a constraint coming from the operational cost. [sent-819, score-0.457]
94 This lemma states that as long as the discretization is fine enough, our approximation yK obeys similar operational cost constraints to y. [sent-847, score-0.63]
95 But we know $\|\tilde{h}_j\|_r = \left\| \frac{n^{1/r} X_b B_b}{\|h_j\|_r} h_j \right\|_r = \frac{n^{1/r} X_b B_b}{\|h_j\|_r} \|h_j\|_r = n^{1/r} X_b B_b$. [sent-925, score-0.39]
96 The first two were answered by exploring how optimistic and pessimistic views can influence the probabilistic models and the operational cost range. [sent-995, score-0.883]
97 The third question was comprehensively answered in Section 5 by evaluating how intuition about the operational cost can restrict the probabilistic model space and in turn lead to better sample complexity if the intuition is correct. [sent-998, score-0.615]
98 For instance, domain experts could use the simultaneous process to explore the space of probabilistic models and policies, and then simply pick the policy among these that most agrees with their intuition. [sent-1001, score-0.49]
99 This happens when the training data are scarce, or the dimensionality of the problem is large compared to the sample size, and the operational cost is not smooth. [sent-1004, score-0.578]
100 The simultaneous process can be used in cases where the optimization problem is difficult enough that sampling the posterior of Bayesian models, with computing the policy at each round, is not feasible. [sent-1008, score-0.499]
wordName wordTfidf (topN-words)
[('operational', 0.424), ('opcost', 0.252), ('policy', 0.236), ('fgood', 0.235), ('bb', 0.179), ('simultaneous', 0.173), ('xb', 0.17), ('optimistic', 0.168), ('ro', 0.168), ('ulabandhula', 0.168), ('rudin', 0.167), ('osts', 0.16), ('perational', 0.16), ('unc', 0.158), ('yk', 0.149), ('achine', 0.137), ('cost', 0.124), ('costs', 0.102), ('pessimistic', 0.1), ('tulabandhula', 0.092), ('covering', 0.087), ('uncertainty', 0.087), ('staf', 0.084), ('staff', 0.084), ('stations', 0.084), ('pc', 0.082), ('constraints', 0.082), ('hj', 0.078), ('repair', 0.076), ('xi', 0.073), ('allocate', 0.072), ('station', 0.072), ('scheduling', 0.072), ('kj', 0.072), ('arrival', 0.065), ('decision', 0.062), ('manholes', 0.059), ('sequential', 0.059), ('robust', 0.058), ('earning', 0.055), ('rademacher', 0.052), ('knapsack', 0.05), ('maurey', 0.05), ('nxb', 0.05), ('day', 0.048), ('policies', 0.047), ('yi', 0.047), ('process', 0.046), ('hypothesis', 0.046), ('unlabeled', 0.045), ('cynthia', 0.045), ('ellipsoidal', 0.045), ('optimization', 0.044), ('loss', 0.043), ('gurobi', 0.042), ('maintenance', 0.042), ('react', 0.042), ('theja', 0.042), ('historical', 0.042), ('patients', 0.041), ('bias', 0.04), ('min', 0.039), ('remp', 0.036), ('polyhedron', 0.036), ('rtrue', 0.036), ('probabilistic', 0.035), ('belief', 0.034), ('clinic', 0.034), ('employees', 0.034), ('manhole', 0.034), ('manpower', 0.034), ('repairman', 0.034), ('route', 0.033), ('constraint', 0.033), ('feasible', 0.033), ('theories', 0.032), ('routing', 0.032), ('counting', 0.032), ('ng', 0.032), ('answered', 0.032), ('regularization', 0.032), ('reasonable', 0.031), ('management', 0.031), ('training', 0.03), ('purchase', 0.03), ('ramp', 0.029), ('traveling', 0.029), ('nyc', 0.029), ('produce', 0.028), ('predictive', 0.028), ('help', 0.028), ('questions', 0.028), ('counterpart', 0.027), ('wants', 0.027), ('bound', 0.027), ('prices', 0.026), ('schedule', 0.026), ('equivalence', 0.026), ('integer', 0.025), ('electrical', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000008 68 jmlr-2013-Machine Learning with Operational Costs
2 0.19458607 87 jmlr-2013-Performance Bounds for λ Policy Iteration and Application to the Game of Tetris
Author: Bruno Scherrer
Abstract: We consider the discrete-time infinite-horizon optimal control problem formalized by Markov decision processes (Puterman, 1994; Bertsekas and Tsitsiklis, 1996). We revisit the work of Bertsekas and Ioffe (1996), that introduced λ policy iteration—a family of algorithms parametrized by a parameter λ—that generalizes the standard algorithms value and policy iteration, and has some deep connections with the temporal-difference algorithms described by Sutton and Barto (1998). We deepen the original theory developed by the authors by providing convergence rate bounds which generalize standard bounds for value iteration described for instance by Puterman (1994). Then, the main contribution of this paper is to develop the theory of this algorithm when it is used in an approximate form. We extend and unify the separate analyzes developed by Munos for approximate value iteration (Munos, 2007) and approximate policy iteration (Munos, 2003), and provide performance bounds in the discounted and the undiscounted situations. Finally, we revisit the use of this algorithm in the training of a Tetris playing controller as originally done by Bertsekas and Ioffe (1996). Our empirical results are different from those of Bertsekas and Ioffe (which were originally qualified as “paradoxical” and “intriguing”). We track down the reason to be a minor implementation error of the algorithm, which suggests that, in practice, λ policy iteration may be more stable than previously thought. Keywords: stochastic optimal control, reinforcement learning, Markov decision processes, analysis of algorithms
3 0.10862884 120 jmlr-2013-Variational Algorithms for Marginal MAP
Author: Qiang Liu, Alexander Ihler
Abstract: The marginal maximum a posteriori probability (MAP) estimation problem, which calculates the mode of the marginal posterior distribution of a subset of variables with the remaining variables marginalized, is an important inference problem in many models, such as those with hidden variables or uncertain parameters. Unfortunately, marginal MAP can be NP-hard even on trees, and has attracted less attention in the literature compared to the joint MAP (maximization) and marginalization problems. We derive a general dual representation for marginal MAP that naturally integrates the marginalization and maximization operations into a joint variational optimization problem, making it possible to easily extend most or all variational-based algorithms to marginal MAP. In particular, we derive a set of “mixed-product” message passing algorithms for marginal MAP, whose form is a hybrid of max-product, sum-product and a novel “argmax-product” message updates. We also derive a class of convergent algorithms based on proximal point methods, including one that transforms the marginal MAP problem into a sequence of standard marginalization problems. Theoretically, we provide guarantees under which our algorithms give globally or locally optimal solutions, and provide novel upper bounds on the optimal objectives. Empirically, we demonstrate that our algorithms significantly outperform the existing approaches, including a state-of-the-art algorithm based on local search methods. Keywords: graphical models, message passing, belief propagation, variational methods, maximum a posteriori, marginal-MAP, hidden variable models
4 0.085402913 65 jmlr-2013-Lower Bounds and Selectivity of Weak-Consistent Policies in Stochastic Multi-Armed Bandit Problem
Author: Antoine Salomon, Jean-Yves Audibert, Issam El Alaoui
Abstract: This paper is devoted to regret lower bounds in the classical model of stochastic multi-armed bandit. A well-known result of Lai and Robbins, which has then been extended by Burnetas and Katehakis, has established the presence of a logarithmic bound for all consistent policies. We relax the notion of consistency, and exhibit a generalisation of the bound. We also study the existence of logarithmic bounds in general and in the case of Hannan consistency. Moreover, we prove that it is impossible to design an adaptive policy that would select the best of two algorithms by taking advantage of the properties of the environment. To get these results, we study variants of popular Upper Confidence Bounds (UCB) policies. Keywords: stochastic bandits, regret lower bounds, consistency, selectivity, UCB policies 1. Introduction and Notations Multi-armed bandits are a classical way to illustrate the difficulty of decision making in the case of a dilemma between exploration and exploitation. The denomination of these models comes from an analogy with playing a slot machine with more than one arm. Each arm has a given (and unknown) reward distribution and, for a given number of rounds, the agent has to choose one of them. As the goal is to maximize the sum of rewards, each round decision consists in a trade-off between exploitation (i.e., playing the arm that has been the more lucrative so far) and exploration (i.e., testing another arm, hoping to discover an alternative that beats the current best choice). One possible application is clinical trial: a doctor wants to heal as many patients as possible, the patients arrive sequentially, and the effectiveness of each treatment is initially unknown (Thompson, 1933). 
Bandit problems were initially studied by Robbins (1952), and since then they have been applied to many fields such as economics (Lamberton et al., 2004; Bergemann and Valimaki, 2008), games (Gelly and Wang, 2006), and optimisation (Kleinberg, 2005; Coquelin and Munos, 2007; Kleinberg et al., 2008; Bubeck et al., 2009). ∗. Also at Willow, CNRS/ENS/INRIA - UMR 8548. © 2013 Antoine Salomon, Jean-Yves Audibert and Issam El Alaoui. 1.1 Setting In this paper, we consider the following model. A stochastic multi-armed bandit problem is defined by: • a number of rounds $n$, • a number of arms $K \ge 2$, • an environment $\theta = (\nu_1, \dots, \nu_K)$, where each $\nu_k$ ($k \in \{1, \dots, K\}$) is a real-valued measure that represents the reward distribution of arm $k$. The number of rounds $n$ may or may not be known by the agent, but this will not affect the present study. We assume that rewards are bounded. Thus, for simplicity, each $\nu_k$ is a probability on $[0, 1]$. Environment $\theta$ is initially unknown by the agent but lies in some known set $\Theta$. For the problem to be interesting, the agent should not have much knowledge of its environment, so that $\Theta$ should not be too small and/or only contain too trivial distributions such as Dirac measures. To make it simple, we may assume that $\Theta$ contains all environments where each reward distribution is a Dirac distribution or a Bernoulli distribution. This will be acknowledged as $\Theta$ having the Dirac/Bernoulli property. For technical reasons, we may also assume that $\Theta$ is of the form $\Theta_1 \times \dots \times \Theta_K$, meaning that $\Theta_k$ is the set of possible reward distributions of arm $k$. This will be acknowledged as $\Theta$ having the product property. The game is as follows. At each round (or time step) $t = 1, \dots, n$, the agent has to choose an arm $I_t$ in the set $\{1, \dots, K\}$. This decision is based on past actions and observations, and the agent may also randomize his choice.
Once the decision is made, the agent gets and observes a reward that is drawn from $\nu_{I_t}$ independently from the past. Thus a policy (or strategy) can be described by a sequence $(\sigma_t)_{t \ge 1}$ (or $(\sigma_t)_{1 \le t \le n}$ if the number of rounds $n$ is known) such that each $\sigma_t$ is a mapping from the set $\{1, \dots, K\}^{t-1} \times [0, 1]^{t-1}$ of past decisions and rewards into the set of arms $\{1, \dots, K\}$ (or into the set of probabilities on $\{1, \dots, K\}$, in case the agent randomizes his choices). For each arm $k$ and each time step $t$, let $T_k(t) = \sum_{s=1}^{t} \mathbb{1}_{I_s = k}$ denote the sampling time, that is, the number of times arm $k$ was pulled from round 1 to round $t$, and $X_{k,1}, X_{k,2}, \dots, X_{k,T_k(t)}$ the corresponding sequence of rewards. We denote by $P_\theta$ the distribution on the probability space such that for any $k \in \{1, \dots, K\}$, the random variables $X_{k,1}, X_{k,2}, \dots, X_{k,n}$ are i.i.d. realizations of $\nu_k$, and such that these $K$ sequences of random variables are independent. Let $E_\theta$ denote the associated expectation. Let $\mu_k = \int x \, d\nu_k(x)$ be the mean reward of arm $k$. Introduce $\mu^* = \max_{k \in \{1, \dots, K\}} \mu_k$ and fix an arm $k^* \in \operatorname{argmax}_{k \in \{1, \dots, K\}} \mu_k$, that is, $k^*$ has the best expected reward. The agent aims at minimizing its regret, defined as the difference between the cumulative reward he would have obtained by always drawing the best arm and the cumulative reward he actually received. Its regret is thus $R_n = \sum_{t=1}^{n} X_{k^*,t} - \sum_{t=1}^{n} X_{I_t, T_{I_t}(t)}$. As most of the publications on this topic, we focus on the expected regret, for which one can check that: $E_\theta R_n = \sum_{k=1}^{K} \Delta_k E_\theta[T_k(n)]$, (1) where $\Delta_k$ is the optimality gap of arm $k$, defined by $\Delta_k = \mu^* - \mu_k$. We also define $\Delta$ as the gap between the best arm and the second best arm, that is, $\Delta := \min_{k : \Delta_k > 0} \Delta_k$. Other notions of regret exist in the literature. One of them is the quantity $\max_k \sum_{t=1}^{n} X_{k,t} - X_{I_t, T_{I_t}(t)}$, which is mostly used in adversarial settings.
The results and ideas we want to convey here are better suited to expected regret, and considering other definitions of regret would only bring additional technical intricacies.

1.2 Consistency and Regret Lower Bounds

Former works have shown the existence of lower bounds on the expected regret of a large class of policies: intuitively, to perform well the agent has to explore all arms, and this requires a significant amount of suboptimal choices. In this way, Lai and Robbins (1985) proved a lower bound of order $\log n$ in a particular parametric framework, and they also exhibited optimal policies. This work was then extended by Burnetas and Katehakis (1996). Both papers deal with consistent policies, meaning that they only consider policies such that:

$$\forall a > 0, \ \forall \theta \in \Theta, \quad \mathbb{E}_\theta[R_n] = o(n^a). \tag{2}$$

Let us detail the bound of Burnetas and Katehakis, which is valid when $\Theta$ has the product property. Given an environment $\theta = (\nu_1, \dots, \nu_K)$ and an arm $k \in \{1, \dots, K\}$, define:

$$D_k(\theta) := \inf_{\tilde{\nu}_k \in \Theta_k : \, \mathbb{E}[\tilde{\nu}_k] > \mu^*} KL(\nu_k, \tilde{\nu}_k),$$

where $KL(\nu, \mu)$ denotes the Kullback-Leibler divergence of measures $\nu$ and $\mu$. Now fix a consistent policy and an environment $\theta \in \Theta$. If $k$ is a suboptimal arm (i.e., $\mu_k \ne \mu^*$) such that $0 < D_k(\theta) < \infty$, then

$$\forall \varepsilon > 0, \quad \lim_{n \to +\infty} \mathbb{P}_\theta\left[ T_k(n) \ge \frac{(1 - \varepsilon) \log n}{D_k(\theta)} \right] = 1.$$

This readily implies that:

$$\liminf_{n \to +\infty} \frac{\mathbb{E}_\theta[T_k(n)]}{\log n} \ge \frac{1}{D_k(\theta)}.$$

Thanks to Formula (1), it is then easy to deduce a lower bound of the expected regret.

One contribution of this paper is to generalize the study of regret lower bounds, by considering weaker notions of consistency: α-consistency and Hannan consistency. We will define α-consistency ($\alpha \in [0, 1)$) as a variant of Equation (2), where the equality $\mathbb{E}_\theta[R_n] = o(n^a)$ only holds for all $a > \alpha$. We show that the logarithmic bound of Burnetas and Katehakis still holds, but the coefficient $\frac{1}{D_k(\theta)}$ is turned into $\frac{1 - \alpha}{D_k(\theta)}$.
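To give a feel for the size of this coefficient, here is a minimal Python sketch under a simplifying assumption that is not made in the paper: each $\Theta_k$ is taken to be the set of all Bernoulli laws and $\mu^* < 1$. In that case the infimum defining $D_k(\theta)$ is approached by Bernoulli laws whose parameter decreases to $\mu^*$, so that $D_k(\theta) = KL(\mathrm{Ber}(\mu_k), \mathrm{Ber}(\mu^*))$. The helper names are hypothetical.

```python
import math


def kl_bernoulli(p: float, q: float) -> float:
    """KL(Ber(p), Ber(q)), with the convention 0 * log(0 / q) = 0."""
    def term(a: float, b: float) -> float:
        if a == 0.0:
            return 0.0
        if b == 0.0:
            return math.inf  # Ber(p) is not absolutely continuous w.r.t. Ber(q)
        return a * math.log(a / b)
    return term(p, q) + term(1.0 - p, 1.0 - q)


def pulls_lower_bound(mu_k: float, mu_star: float, n: int,
                      alpha: float = 0.0) -> float:
    """Asymptotic order (1 - alpha) * log(n) / D_k of the number of pulls.

    Assumption (not from the paper): Theta_k is the Bernoulli family,
    so that D_k = KL(Ber(mu_k), Ber(mu_star)).
    """
    d_k = kl_bernoulli(mu_k, mu_star)
    if not 0.0 < d_k < math.inf:
        raise ValueError("the bound requires 0 < D_k < infinity")
    return (1.0 - alpha) * math.log(n) / d_k


if __name__ == "__main__":
    # Hypothetical example: suboptimal mean 0.4, best mean 0.5, n = 10**6.
    print(pulls_lower_bound(0.4, 0.5, 10**6))             # consistent policies
    print(pulls_lower_bound(0.4, 0.5, 10**6, alpha=0.5))  # 0.5-consistent policies
```

Relaxing consistency to α-consistency simply scales the asymptotic requirement on the number of suboptimal pulls by the factor $1 - \alpha$.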
We also prove that the dependence of this new bound with respect to the term $1 - \alpha$ is asymptotically optimal when $n \to +\infty$ (up to a constant).

We will also consider the case of Hannan consistency. Indeed, any policy achieves at most an expected regret of order $n$: because of the equality $\sum_{k=1}^{K} T_k(n) = n$ and thanks to Equation (1), one can show that $\mathbb{E}_\theta R_n \le n \max_k \Delta_k$. More intuitively, this comes from the fact that the average cost of pulling an arm $k$ is a constant $\Delta_k$. As a consequence, it is natural to wonder what happens when dealing with policies whose expected regret is only required to be $o(n)$, which is equivalent to Hannan consistency. This condition is less restrictive than any of the previous notions of consistency. In this larger class of policies, we show that the lower bounds on the expected regret are no longer logarithmic, but can be much smaller.

Finally, even if no logarithmic lower bound holds on the whole set $\Theta$, we show that there necessarily exist some environments $\theta$ for which the expected regret is at least logarithmic. The latter result is actually valid without any assumptions on the considered policies, and only requires a simple property on $\Theta$.

1.3 Selectivity

As we exhibit new lower bounds, we want to know whether it is possible to provide optimal policies that achieve these lower bounds, as is the case in the classical class of consistent policies. We answer this question negatively, and for this we solve the more general problem of selectivity. Given a set of policies, we define selectivity as the ability to perform at least as well as the policy that is best suited to the current environment $\theta$. Still, such an ability cannot be implemented. As a by-product, it is not possible to design a procedure that would specifically adapt to some kinds of environments, for example by selecting a particular policy.
This contribution is linked with selectivity in on-line learning problems with perfect information, commonly addressed by prediction with expert advice (see, e.g., Cesa-Bianchi et al., 1997). In this spirit, a closely related problem is that of regret against the best strategy from a pool, studied by Auer et al. (2003). The latter designed an algorithm in the context of adversarial/nonstochastic bandits whose decisions are based on a given number of recommendations (experts), which are themselves possibly the rewards received by a set of given policies. More broadly, model selection has been intensively studied in statistics, and is commonly solved by penalization methods (Mallows, 1973; Akaike, 1973; Schwarz, 1978).

1.4 UCB Policies

Some of our results are obtained using particular Upper Confidence Bound algorithms. These algorithms were introduced by Lai and Robbins (1985): they basically consist in computing an index for each arm, and selecting the arm with the greatest index. A simple and efficient way to design such policies is as follows: choose each index as low as possible such that, conditional on past observations, it is an upper bound of the mean reward of the considered arm with high probability (or, say, with high confidence level). This idea can be traced back to Agrawal (1995), and has been popularized by Auer et al. (2002), who notably described a policy called UCB1. In this policy, each index $B_{k,s,t}$ is defined by an arm $k$, a time step $t$, and an integer $s$ that indicates the number of times arm $k$ has been pulled before round $t$, and is given by:

$$B_{k,s,t} = \hat{X}_{k,s} + \sqrt{\frac{2 \log t}{s}},$$

where $\hat{X}_{k,s}$ is the empirical mean of arm $k$ after $s$ pulls, that is, $\hat{X}_{k,s} = \frac{1}{s} \sum_{u=1}^{s} X_{k,u}$.

To summarize, the UCB1 policy first pulls each arm once and then, at each round $t > K$, selects an arm $k$ that maximizes $B_{k, T_k(t-1), t}$.
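The index and the selection rule of UCB1 described above can be sketched in a few lines of Python. This is only an illustration: the function names, the data layout (running sums and pull counts) and the tie-breaking rule (first maximizer) are choices made here, not prescriptions of the paper.

```python
import math
from typing import Sequence


def ucb1_index(empirical_mean: float, pulls: int, t: int) -> float:
    """Index B_{k,s,t} = empirical mean + sqrt(2 log t / s), with s = pulls."""
    return empirical_mean + math.sqrt(2.0 * math.log(t) / pulls)


def ucb1_choose(reward_sums: Sequence[float],
                counts: Sequence[int],
                t: int) -> int:
    """Return the (0-based) arm drawn by UCB1 at round t.

    reward_sums[k] and counts[k] are the cumulated reward and the number of
    pulls of arm k before round t. Arms never pulled are drawn first; ties
    are broken in favour of the smallest arm number (an arbitrary choice).
    """
    for k, c in enumerate(counts):
        if c == 0:
            return k
    return max(
        range(len(counts)),
        key=lambda k: ucb1_index(reward_sums[k] / counts[k], counts[k], t),
    )


if __name__ == "__main__":
    # A well-explored arm with mean 0.6 against an arm pulled only once.
    print(ucb1_choose([60.0, 0.5], [100, 1], t=102))
```

In the example, the arm that has been pulled only once is selected even though its empirical mean is lower, because its exploration term is much larger.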
Note that, by means of Hoeffding's inequality, the index $B_{k, T_k(t-1), t}$ is indeed an upper bound of $\mu_k$ with high probability (i.e., the probability is greater than $1 - 1/t^4$). Another way to understand this index is to interpret the empirical mean $\hat{X}_{k, T_k(t-1)}$ as an "exploitation" term, and the square root $\sqrt{2 \log t / s}$ as an "exploration" term (because the latter gradually increases when arm $k$ is not selected).

Policy UCB1 achieves the logarithmic bound (up to a multiplicative constant), as it was shown that:

$$\forall \theta \in \Theta, \ \forall n \ge 3, \quad \mathbb{E}_\theta[T_k(n)] \le 12 \frac{\log n}{\Delta_k^2} \quad \text{and} \quad \mathbb{E}_\theta R_n \le 12 \sum_{k=1}^{K} \frac{\log n}{\Delta_k} \le 12 K \frac{\log n}{\Delta}.$$

Audibert et al. (2009) studied some variants of the UCB1 policy. Among them, one consists in changing the $2 \log t$ in the exploration term into $\rho \log t$, where $\rho > 0$. This can be interpreted as a way to tune exploration: the smaller $\rho$ is, the better the policy will perform in simple environments where information is disclosed easily (for example when all reward distributions are Dirac measures). On the contrary, $\rho$ has to be greater to face more challenging environments (typically when reward distributions are Bernoulli laws with close parameters).

This policy, which we denote UCB(ρ), was proven by Audibert et al. to achieve the logarithmic bound when $\rho > 1$, and optimality was also obtained when $\rho > \frac{1}{2}$ for a variant of UCB(ρ). Bubeck (2010) showed in his PhD dissertation that their ideas actually make it possible to prove optimality of UCB(ρ) for $\rho > \frac{1}{2}$. Moreover, the case $\rho = \frac{1}{2}$ corresponds to a confidence level of $\frac{1}{t}$ (because of Hoeffding's inequality, as above), and several studies (Lai and Robbins, 1985; Agrawal, 1995; Burnetas and Katehakis, 1996; Audibert et al., 2009; Honda and Takemura, 2010) have shown that this level is critical.

We complete these works by a precise study of UCB(ρ) when $\rho \le \frac{1}{2}$. We prove that UCB(ρ) is $(1 - 2\rho)$-consistent and that it is not α-consistent for any $\alpha < 1 - 2\rho$ (in view of the definition above, this means that the expected regret is roughly of order $n^{1 - 2\rho}$). Thus it does not achieve the logarithmic bound, but it performs well in simple environments, for example, environments where all reward distributions are Dirac measures.

Moreover, we exhibit expected regret bounds of general UCB policies, with the $2 \log t$ in the exploration term of UCB1 replaced by an arbitrary function. We give sufficient conditions for such policies to be Hannan consistent and, as mentioned before, find that lower bounds need not be logarithmic any more.

1.5 Outline

The paper is organized as follows: in Section 2, we give bounds on the expected regret of general UCB policies and of UCB(ρ) ($\rho \le \frac{1}{2}$), as preliminary results. In Section 3, we focus on α-consistent policies. Then, in Section 4, we study the problem of selectivity, and we conclude in Section 5 with general results on the existence of logarithmic lower bounds.

Throughout the paper $\lceil x \rceil$ denotes the smallest integer not less than $x$ whereas $\lfloor x \rfloor$ denotes the largest integer not greater than $x$, $\mathbb{1}_A$ stands for the indicator function of event $A$, $\mathrm{Ber}(p)$ is the Bernoulli law with parameter $p$, and $\delta_x$ is the Dirac measure centred on $x$.

2. Preliminary Results

In this section, we estimate the expected regret of UCB policies. This will be useful for the rest of the paper.

2.1 Bounds on the Expected Regret of General UCB Policies

We first study general UCB policies, defined by:
• Draw each arm once,
• then, at each round $t$, draw an arm $I_t \in \operatorname{argmax}_{k \in \{1, \dots, K\}} B_{k, T_k(t-1), t}$,

where $B_{k,s,t}$ is defined by $B_{k,s,t} = \hat{X}_{k,s} + \sqrt{\frac{f_k(t)}{s}}$ and where the functions $f_k$ ($1 \le k \le K$) are increasing.

This definition is inspired by the popular UCB1 algorithm, for which $f_k(t) = 2 \log t$ for all $k$.
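The general UCB policy defined above can be simulated with a short Python sketch. The interface is an assumption made for this illustration: each arm is a hypothetical callable that returns a reward in $[0, 1]$ given a random generator, each $f_k$ is a Python function of $t$, arms are numbered from 0, and ties are broken in favour of the smallest arm number.

```python
import math
import random
from typing import Callable, List, Sequence


def run_ucb(arms: Sequence[Callable[[random.Random], float]],
            f: Sequence[Callable[[int], float]],
            n: int,
            seed: int = 0) -> List[int]:
    """Run a general UCB policy for n rounds and return the pull counts T_k(n)."""
    rng = random.Random(seed)
    K = len(arms)
    counts = [0] * K
    sums = [0.0] * K
    for t in range(1, n + 1):
        if t <= K:
            k = t - 1  # initialization: draw each arm once
        else:
            def index(j: int) -> float:
                # B_{j, T_j(t-1), t} = empirical mean + sqrt(f_j(t) / T_j(t-1))
                return sums[j] / counts[j] + math.sqrt(f[j](t) / counts[j])
            k = max(range(K), key=index)
        sums[k] += arms[k](rng)
        counts[k] += 1
    return counts


def ucb_rho(rho: float, K: int) -> List[Callable[[int], float]]:
    """Exploration functions f_k(t) = rho * log(t); rho = 2 gives UCB1."""
    return [lambda t, r=rho: r * math.log(t) for _ in range(K)]


if __name__ == "__main__":
    # Deterministic (Dirac) rewards 0.9 and 0.1, so the run is reproducible.
    dirac_arms = [lambda rng: 0.9, lambda rng: 0.1]
    print(run_ucb(dirac_arms, ucb_rho(0.2, 2), n=1000))
```

With Dirac rewards the run is fully deterministic, which makes this kind of environment convenient for checking by hand how often the suboptimal arm is drawn for a given choice of $f_k$.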
The following lemma estimates the performance of UCB policies in simple environments, for which reward distributions are Dirac measures.

Lemma 1 Let $0 \le b < a \le 1$ and $n \ge 1$. For $\theta = (\delta_a, \delta_b)$, the random variable $T_2(n)$ is uniformly upper bounded by $\frac{1}{\Delta^2} f_2(n) + 1$. Consequently, the expected regret of UCB is upper bounded by $\frac{1}{\Delta} f_2(n) + 1$.

Proof In environment $\theta$, the best arm is arm 1 and $\Delta = \Delta_2 = a - b$. Let us first prove the upper bound of the sampling time. The assertion is true for $n = 1$ and $n = 2$: the first two rounds consist in drawing each arm once, so that $T_2(n) \le 1 \le \frac{1}{\Delta^2} f_2(n) + 1$ for $n \in \{1, 2\}$. If, by contradiction, the assertion is false, then there exists $t \ge 3$ such that $T_2(t) > \frac{1}{\Delta^2} f_2(t) + 1$ and $T_2(t-1) \le \frac{1}{\Delta^2} f_2(t-1) + 1$. Since $f_2(t) \ge f_2(t-1)$, this leads to $T_2(t) > T_2(t-1)$, meaning that arm 2 is drawn at round $t$. Therefore, we have

$$a + \sqrt{\frac{f_1(t)}{T_1(t-1)}} \le b + \sqrt{\frac{f_2(t)}{T_2(t-1)}},$$

hence $a - b = \Delta \le \sqrt{\frac{f_2(t)}{T_2(t-1)}}$, which implies $T_2(t-1) \le \frac{1}{\Delta^2} f_2(t)$ and thus $T_2(t) \le \frac{1}{\Delta^2} f_2(t) + 1$. This contradicts the definition of $t$, and this ends the proof of the first statement. The second statement is a direct consequence of Formula (1).

Remark: throughout the paper, we will often use environments with $K = 2$ arms to provide bounds on expected regrets. However, we do not lose generality by doing so, because all corresponding proofs can be written almost identically to suit any $K \ge 2$, by simply assuming that the distribution of each arm $k \ge 3$ is $\delta_0$.

We now give an upper bound of the expected sampling time of any arm such that $\Delta_k > 0$. This bound is valid in any environment, and not only those of the form $(\delta_a, \delta_b)$.

Lemma 2 For any $\theta \in \Theta$ and any $\beta \in (0, 1)$, if $\Delta_k > 0$ the following upper bound holds:

$$\mathbb{E}_\theta[T_k(n)] \le u + \sum_{t=u+1}^{n} \left( 1 + \frac{\log t}{\log(\frac{1}{\beta})} \right) \left( e^{-2\beta f_k(t)} + e^{-2\beta f_{k^*}(t)} \right),$$

where $u = \left\lceil \frac{4 f_k(n)}{\Delta_k^2} \right\rceil$.

An upper bound of the expected regret can be deduced from this lemma thanks to Formula (1).

Proof The core of the proof is a peeling argument and the use of Hoeffding's maximal inequality (see, e.g., Cesa-Bianchi and Lugosi, 2006, section A.1.3 for details). The idea is originally taken from Audibert et al. (2009), and the following is an adaptation of the proof of an upper bound of UCB(ρ) in the case $\rho > \frac{1}{2}$, which can be found in S. Bubeck's PhD dissertation.

First, let us notice that the policy selects an arm $k$ such that $\Delta_k > 0$ at time step $t \le n$ only if at least one of the three following conditions holds:

$$B_{k^*, T_{k^*}(t-1), t} \le \mu^*, \tag{3}$$

$$\hat{X}_{k,t} \ge \mu_k + \sqrt{\frac{f_k(t)}{T_k(t-1)}}, \tag{4}$$

$$T_k(t-1) < \frac{4 f_k(n)}{\Delta_k^2}. \tag{5}$$

Indeed, if none of these conditions is true, then:

$$B_{k^*, T_{k^*}(t-1), t} > \mu^* = \mu_k + \Delta_k \ge \mu_k + 2 \sqrt{\frac{f_k(n)}{T_k(t-1)}} > \hat{X}_{k,t} + \sqrt{\frac{f_k(t)}{T_k(t-1)}} = B_{k, T_k(t-1), t},$$

which implies that arm $k$ cannot be chosen at time step $t$.

We denote respectively by $\xi_{1,t}$, $\xi_{2,t}$ and $\xi_{3,t}$ the events corresponding to Equations (3), (4) and (5). We have:

$$\mathbb{E}_\theta[T_k(n)] = \mathbb{E}_\theta\left[ \sum_{t=1}^{n} \mathbb{1}_{I_t = k} \right] = \mathbb{E}_\theta\left[ \sum_{t=1}^{n} \mathbb{1}_{\{I_t = k\} \cap \xi_{3,t}} \right] + \mathbb{E}_\theta\left[ \sum_{t=1}^{n} \mathbb{1}_{\{I_t = k\} \setminus \xi_{3,t}} \right].$$

Let us show that the sum $\sum_{t=1}^{n} \mathbb{1}_{\{I_t = k\} \cap \xi_{3,t}}$ is almost surely lower than $u := \lceil 4 f_k(n) / \Delta_k^2 \rceil$. We assume by contradiction that $\sum_{t=1}^{n} \mathbb{1}_{\{I_t = k\} \cap \xi_{3,t}} > u$. Then there exists $m < n$ such that $\sum_{t=1}^{m-1} \mathbb{1}_{\{I_t = k\} \cap \xi_{3,t}} < 4 f_k(n) / \Delta_k^2$ and $\sum_{t=1}^{m} \mathbb{1}_{\{I_t = k\} \cap \xi_{3,t}} = \lceil 4 f_k(n) / \Delta_k^2 \rceil$. Therefore, for any $s > m$, we have:

$$T_k(s-1) \ge T_k(m) = \sum_{t=1}^{m} \mathbb{1}_{\{I_t = k\}} \ge \sum_{t=1}^{m} \mathbb{1}_{\{I_t = k\} \cap \xi_{3,t}} = \left\lceil \frac{4 f_k(n)}{\Delta_k^2} \right\rceil \ge \frac{4 f_k(n)}{\Delta_k^2},$$

so that $\mathbb{1}_{\{I_s = k\} \cap \xi_{3,s}} = 0$. But then

$$\sum_{t=1}^{n} \mathbb{1}_{\{I_t = k\} \cap \xi_{3,t}} = \sum_{t=1}^{m} \mathbb{1}_{\{I_t = k\} \cap \xi_{3,t}} = \left\lceil \frac{4 f_k(n)}{\Delta_k^2} \right\rceil \le u,$$

which is the expected contradiction.

We also have $\sum_{t=1}^{n} \mathbb{1}_{\{I_t = k\} \setminus \xi_{3,t}} = \sum_{t=u+1}^{n} \mathbb{1}_{\{I_t = k\} \setminus \xi_{3,t}}$: since $T_k(t-1) \le t - 1$, event $\xi_{3,t}$ always happens at time step $t \in \{1, \dots, u\}$.
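And then, since event $\{I_t = k\}$ is included in $\xi_{1,t} \cup \xi_{2,t} \cup \xi_{3,t}$:

$$\mathbb{E}_\theta\left[ \sum_{t=u+1}^{n} \mathbb{1}_{\{I_t = k\} \setminus \xi_{3,t}} \right] \le \mathbb{E}_\theta\left[ \sum_{t=u+1}^{n} \mathbb{1}_{\xi_{1,t} \cup \xi_{2,t}} \right] \le \sum_{t=u+1}^{n} \left( \mathbb{P}_\theta(\xi_{1,t}) + \mathbb{P}_\theta(\xi_{2,t}) \right).$$

It remains to find upper bounds of $\mathbb{P}_\theta(\xi_{1,t})$ and $\mathbb{P}_\theta(\xi_{2,t})$. To this aim, we apply the peeling argument with a geometric grid over the time interval $[1, t]$:

$$\begin{aligned}
\mathbb{P}_\theta(\xi_{1,t}) &= \mathbb{P}_\theta\left( B_{k^*, T_{k^*}(t-1), t} \le \mu^* \right) = \mathbb{P}_\theta\left( \hat{X}_{k^*, T_{k^*}(t-1)} + \sqrt{\frac{f_{k^*}(t)}{T_{k^*}(t-1)}} \le \mu^* \right) \\
&\le \mathbb{P}_\theta\left( \exists s \in \{1, \dots, t\}, \ \hat{X}_{k^*,s} + \sqrt{\frac{f_{k^*}(t)}{s}} \le \mu^* \right) \\
&\le \sum_{j=0}^{\frac{\log t}{\log(1/\beta)}} \mathbb{P}_\theta\left( \exists s : \beta^{j+1} t < s \le \beta^{j} t, \ \hat{X}_{k^*,s} + \sqrt{\frac{f_{k^*}(t)}{s}} \le \mu^* \right) \\
&= \sum_{j=0}^{\frac{\log t}{\log(1/\beta)}} \mathbb{P}_\theta\left( \exists s : \beta^{j+1} t < s \le \beta^{j} t, \ \sum_{l=1}^{s} (X_{k^*,l} - \mu^*) \le -\sqrt{s f_{k^*}(t)} \right) \\
&\le \sum_{j=0}^{\frac{\log t}{\log(1/\beta)}} \mathbb{P}_\theta\left( \exists s : \beta^{j+1} t < s \le \beta^{j} t, \ \sum_{l=1}^{s} (\mu^* - X_{k^*,l}) \ge \sqrt{\beta^{j+1} t f_{k^*}(t)} \right) \\
&\le \sum_{j=0}^{\frac{\log t}{\log(1/\beta)}} \mathbb{P}_\theta\left( \max_{s \le \beta^{j} t} \sum_{l=1}^{s} (\mu^* - X_{k^*,l}) \ge \sqrt{\beta^{j+1} t f_{k^*}(t)} \right).
\end{aligned}$$

As the range of the random variables $(X_{k^*,l})_{1 \le l \le s}$ is $[0, 1]$, Hoeffding's maximal inequality gives:

$$\mathbb{P}_\theta(\xi_{1,t}) \le \sum_{j=0}^{\frac{\log t}{\log(1/\beta)}} \exp\left( -\frac{2 \left( \sqrt{\beta^{j+1} t f_{k^*}(t)} \right)^2}{\beta^{j} t} \right) \le \left( \frac{\log t}{\log(1/\beta)} + 1 \right) e^{-2\beta f_{k^*}(t)}.$$

Similarly, we have:

$$\mathbb{P}_\theta(\xi_{2,t}) \le \left( \frac{\log t}{\log(1/\beta)} + 1 \right) e^{-2\beta f_k(t)},$$

and the result follows from the combination of the previous inequalities.

2.2 Bounds on the Expected Regret of UCB(ρ), $\rho \le \frac{1}{2}$

We study the performance of the UCB(ρ) policy, with $\rho \in (0, \frac{1}{2}]$. We recall that UCB(ρ) is the UCB policy defined by $f_k(t) = \rho \log(t)$ for all $k$, that is, $B_{k,s,t} = \hat{X}_{k,s} + \sqrt{\frac{\rho \log t}{s}}$. Small values of $\rho$ can be interpreted as a low level of experimentation in the balance between exploration and exploitation. Precise regret bound orders of UCB(ρ) when $\rho \in (0, \frac{1}{2}]$ are not documented in the literature.

We first give an upper bound of the expected regret in simple environments, where the policy is supposed to perform well. As stated in the following lemma (which is a direct consequence of Lemma 1), the order of the bound is $\frac{\rho \log n}{\Delta}$.

Lemma 3 Let $0 \le b < a \le 1$ and $n \ge 1$. For $\theta = (\delta_a, \delta_b)$, the random variable $T_2(n)$ is uniformly upper bounded by $\frac{\rho}{\Delta^2} \log(n) + 1$. Consequently, the expected regret of UCB(ρ) is upper bounded by $\frac{\rho}{\Delta} \log(n) + 1$.

One can show that the expected regret of UCB(ρ) is actually equivalent to $\frac{\rho \log n}{\Delta}$ as $n$ goes to infinity. These good performances are compensated by poor results in more complex environments, as shown in the following theorem. We exhibit an expected regret upper bound which is valid for any $\theta \in \Theta$, and which is roughly of order $n^{1 - 2\rho}$. We also show that this upper bound is asymptotically optimal. Thus, with $\rho \in (0, \frac{1}{2})$, UCB(ρ) does not perform enough exploration to achieve the logarithmic bound, as opposed to UCB(ρ) with $\rho \in (\frac{1}{2}, +\infty)$.

Theorem 4 For any $\rho \in (0, \frac{1}{2}]$, any $\theta \in \Theta$ and any $\beta \in (0, 1)$, one has

$$\mathbb{E}_\theta[R_n] \le \sum_{k : \Delta_k > 0} \left( \frac{4 \rho \log n}{\Delta_k} + \Delta_k + 2 \Delta_k \left( \frac{\log n}{\log(1/\beta)} + 1 \right) \frac{n^{1 - 2\rho\beta}}{1 - 2\rho\beta} \right).$$

Moreover, if $\Theta$ has the Dirac/Bernoulli property, then for any $\varepsilon > 0$ there exists $\theta \in \Theta$ such that

$$\lim_{n \to +\infty} \frac{\mathbb{E}_\theta[R_n]}{n^{1 - 2\rho - \varepsilon}} = +\infty.$$

The value $\rho = \frac{1}{2}$ is critical, but we can deduce from the upper bound of this theorem that UCB($\frac{1}{2}$) is consistent in the classical sense of Lai and Robbins (1985) and of Burnetas and Katehakis (1996).

Proof We set $u = \left\lceil \frac{4 \rho \log n}{\Delta_k^2} \right\rceil$. By Lemma 2 we get:

$$\begin{aligned}
\mathbb{E}_\theta[T_k(n)] &\le u + 2 \sum_{t=u+1}^{n} \left( \frac{\log t}{\log(1/\beta)} + 1 \right) e^{-2\beta\rho \log(t)} \\
&= u + 2 \sum_{t=u+1}^{n} \left( \frac{\log t}{\log(1/\beta)} + 1 \right) \frac{1}{t^{2\rho\beta}} \\
&\le u + 2 \left( \frac{\log n}{\log(1/\beta)} + 1 \right) \sum_{t=1}^{n} \frac{1}{t^{2\rho\beta}} \\
&\le u + 2 \left( \frac{\log n}{\log(1/\beta)} + 1 \right) \left( 1 + \sum_{t=2}^{n} \frac{1}{t^{2\rho\beta}} \right) \\
&\le u + 2 \left( \frac{\log n}{\log(1/\beta)} + 1 \right) \left( 1 + \int_{1}^{n} \frac{1}{t^{2\rho\beta}} \, dt \right) \\
&\le u + 2 \left( \frac{\log n}{\log(1/\beta)} + 1 \right) \frac{n^{1 - 2\rho\beta}}{1 - 2\rho\beta}.
\end{aligned}$$

As usual, the upper bound of the expected regret follows from Formula (1).

Now, let us show the lower bound. The result is obtained by considering an environment $\theta$ of the form $\left( \mathrm{Ber}(\frac{1}{2}), \delta_{\frac{1}{2} - \Delta} \right)$, where $\Delta$ lies in $(0, \frac{1}{2})$ and is such that $2\rho(1 + \sqrt{\Delta})^2 < 2\rho + \varepsilon$. This notation is obviously consistent with the definition of $\Delta$ as an optimality gap. We set $T_n := \left\lceil \frac{\rho \log n}{\Delta} \right\rceil$, and define the event $\xi_n$ by:

$$\xi_n = \left\{ \hat{X}_{1,T_n} < \frac{1}{2} - \left( 1 + \frac{1}{\sqrt{\Delta}} \right) \Delta \right\}.$$

When event $\xi_n$ occurs, one has for any $t \in \{T_n, \dots, n\}$

$$\hat{X}_{1,T_n} + \sqrt{\frac{\rho \log t}{T_n}} \le \hat{X}_{1,T_n} + \sqrt{\frac{\rho \log n}{T_n}} < \frac{1}{2} - \left( 1 + \frac{1}{\sqrt{\Delta}} \right) \Delta + \sqrt{\Delta} \le \frac{1}{2} - \Delta,$$

so that arm 1 is chosen no more than $T_n$ times by the UCB(ρ) policy. Consequently:

$$\mathbb{E}_\theta[T_2(n)] \ge \mathbb{P}_\theta(\xi_n)(n - T_n).$$

We will now find a lower bound of the probability of $\xi_n$ thanks to the Berry-Esseen inequality. We denote by $C$ the corresponding constant, and by $\Phi$ the c.d.f. of the standard normal distribution. For convenience, we also define the following quantities:

$$\sigma := \sqrt{\mathbb{E}\left[ \left( X_{1,1} - \frac{1}{2} \right)^2 \right]} = \frac{1}{2}, \qquad M_3 := \mathbb{E}\left[ \left| X_{1,1} - \frac{1}{2} \right|^3 \right] = \frac{1}{8}.$$

Using the fact that $\Phi(-x) = \frac{e^{-x^2/2}}{\sqrt{2\pi}\, x} \beta(x)$ with $\beta(x) \to 1$ as $x \to +\infty$, we have:

$$\begin{aligned}
\mathbb{P}_\theta(\xi_n) &= \mathbb{P}_\theta\left( \frac{\hat{X}_{1,T_n} - \frac{1}{2}}{\sigma} \sqrt{T_n} \le -2 \left( 1 + \frac{1}{\sqrt{\Delta}} \right) \Delta \sqrt{T_n} \right) \\
&\ge \Phi\left( -2 (\Delta + \sqrt{\Delta}) \sqrt{T_n} \right) - \frac{C M_3}{\sigma^3 \sqrt{T_n}} \\
&\ge \frac{\exp\left( -2 (\Delta + \sqrt{\Delta})^2 T_n \right)}{2 \sqrt{2\pi} (\Delta + \sqrt{\Delta}) \sqrt{T_n}} \, \beta\left( 2 (\Delta + \sqrt{\Delta}) \sqrt{T_n} \right) - \frac{C M_3}{\sigma^3 \sqrt{T_n}} \\
&\ge \frac{\exp\left( -2 (\Delta + \sqrt{\Delta})^2 \left( \frac{\rho \log n}{\Delta} + 1 \right) \right)}{2 \sqrt{2\pi} (\Delta + \sqrt{\Delta}) \sqrt{T_n}} \, \beta\left( 2 (\Delta + \sqrt{\Delta}) \sqrt{T_n} \right) - \frac{C M_3}{\sigma^3 \sqrt{T_n}} \\
&\ge \frac{n^{-2\rho(1 + \sqrt{\Delta})^2} \exp\left( -2 (\Delta + \sqrt{\Delta})^2 \right)}{2 \sqrt{2\pi} (\Delta + \sqrt{\Delta}) \sqrt{T_n}} \, \beta\left( 2 (\Delta + \sqrt{\Delta}) \sqrt{T_n} \right) - \frac{C M_3}{\sigma^3 \sqrt{T_n}}.
\end{aligned}$$

Previous calculations and Formula (1) give

$$\mathbb{E}_\theta[R_n] = \Delta \, \mathbb{E}_\theta[T_2(n)] \ge \Delta \, \mathbb{P}_\theta(\xi_n)(n - T_n),$$

so that we finally obtain a lower bound of $\mathbb{E}_\theta[R_n]$ of order $\frac{n^{1 - 2\rho(1 + \sqrt{\Delta})^2}}{\sqrt{\log n}}$. Therefore, $\frac{\mathbb{E}_\theta[R_n]}{n^{1 - 2\rho - \varepsilon}}$ is at least of order $\frac{n^{2\rho + \varepsilon - 2\rho(1 + \sqrt{\Delta})^2}}{\sqrt{\log n}}$. Since $2\rho + \varepsilon - 2\rho(1 + \sqrt{\Delta})^2 > 0$, the numerator goes to infinity, faster than $\sqrt{\log n}$. This concludes the proof.

3. Bounds on the Class of α-consistent Policies

In this section, our aim is to find how the classical results of Lai and Robbins (1985) and of Burnetas and Katehakis (1996) can be generalised if we do not restrict the study to consistent policies. As a by-product, we will adapt their results to the present setting, which is simpler than their parametric frameworks.

We recall that a policy is consistent if its expected regret is $o(n^a)$ for all $a > 0$ in all environments $\theta \in \Theta$.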
A natural way to relax this definition is the following.

Definition 5 A policy is α-consistent if

$$\forall a > \alpha, \ \forall \theta \in \Theta, \quad \mathbb{E}_\theta[R_n] = o(n^a).$$

For example, we showed in the previous section that, for any $\rho \in (0, \frac{1}{2}]$, UCB(ρ) is $(1 - 2\rho)$-consistent and not α-consistent if $\alpha < 1 - 2\rho$.

Note that the relevant range of $\alpha$ in this definition is $[0, 1)$: the case $\alpha = 0$ corresponds to the standard definition of consistency (so that throughout the paper the term "consistent" also means "0-consistent"), and any value $\alpha \ge 1$ is pointless as any policy is then α-consistent. Indeed, the expected regret of any policy is at most of order $n$. This also leads us to wonder what happens if we only require the expected regret to be $o(n)$:

$$\forall \theta \in \Theta, \quad \mathbb{E}_\theta[R_n] = o(n).$$

This requirement corresponds to the definition of Hannan consistency. The class of Hannan consistent policies includes consistent policies and α-consistent policies for any $\alpha \in [0, 1)$. Some results about this class will be obtained in Section 5.

We focus on regret lower bounds on α-consistent policies. We first show that the main result of Burnetas and Katehakis can be extended in the following way.

Theorem 6 Assume that $\Theta$ has the product property. Fix an α-consistent policy and $\theta \in \Theta$. If $\Delta_k > 0$ and if $0 < D_k(\theta) < \infty$, then

$$\forall \varepsilon > 0, \quad \lim_{n \to +\infty} \mathbb{P}_\theta\left[ T_k(n) \ge (1 - \varepsilon) \frac{(1 - \alpha) \log n}{D_k(\theta)} \right] = 1.$$

Consequently

$$\liminf_{n \to +\infty} \frac{\mathbb{E}_\theta[T_k(n)]}{\log n} \ge \frac{1 - \alpha}{D_k(\theta)}.$$

Recall that the lower bound of the expected regret is then deduced from Formula (1), and that the coefficient $D_k(\theta)$ is defined by:

$$D_k(\theta) := \inf_{\tilde{\nu}_k \in \Theta_k : \, \mathbb{E}[\tilde{\nu}_k] > \mu^*} KL(\nu_k, \tilde{\nu}_k),$$

where $KL(\nu, \mu)$ denotes the Kullback-Leibler divergence of measures $\nu$ and $\mu$.

Note that, as opposed to Burnetas and Katehakis (1996), there is no optimal policy in general (i.e., a policy that would achieve the lower bound in all environments $\theta$). This can be explained intuitively as follows. If by contradiction there existed such a policy, its expected regret would be of order $\log n$ and consequently it would be (0-)consistent.
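Then the lower bounds in the case of 0-consistency would necessarily hold. This cannot happen if $\alpha > 0$ because $\frac{1 - \alpha}{D_k(\theta)} < \frac{1}{D_k(\theta)}$. Nevertheless, this argument is not rigorous because the fact that the regret would be of order $\log n$ is only valid for environments $\theta$ such that $0 < D_k(\theta) < \infty$. The non-existence of optimal policies is implied by a stronger result of the next section (yet, only if $\alpha > 0.2$).

Proof We adapt Proposition 1 in Burnetas and Katehakis (1996) and its proof. Let us denote $\theta = (\nu_1, \dots, \nu_K)$. We fix $\varepsilon > 0$, and we want to show that:

$$\lim_{n \to +\infty} \mathbb{P}_\theta\left( \frac{T_k(n)}{\log n} < \frac{(1 - \varepsilon)(1 - \alpha)}{D_k(\theta)} \right) = 0.$$

Set $\delta > 0$ and $\delta' > \alpha$ such that $\frac{1 - \delta'}{1 + \delta} > (1 - \varepsilon)(1 - \alpha)$. By definition of $D_k(\theta)$, there exists $\tilde{\nu}_k$ such that $\mathbb{E}[\tilde{\nu}_k] > \mu^*$ and

$$D_k(\theta) < KL(\nu_k, \tilde{\nu}_k) < (1 + \delta) D_k(\theta).$$

Let us set $\tilde{\theta} = (\nu_1, \dots, \nu_{k-1}, \tilde{\nu}_k, \nu_{k+1}, \dots, \nu_K)$. Environment $\tilde{\theta}$ lies in $\Theta$ by the product property and arm $k$ is its best arm. Define $I^\delta = KL(\nu_k, \tilde{\nu}_k)$ and

$$A_n^{\delta'} := \left\{ \frac{T_k(n)}{\log n} < \frac{1 - \delta'}{I^\delta} \right\}, \qquad C_n^{\delta''} := \left\{ \log L_{T_k(n)} \le (1 - \delta'') \log n \right\},$$

where $\delta''$ is such that $\alpha < \delta'' < \delta'$ and $L_t$ is defined by $\log L_t = \sum_{s=1}^{t} \log\left( \frac{d\nu_k}{d\tilde{\nu}_k}(X_{k,s}) \right)$.

Now, we show that $\mathbb{P}_\theta(A_n^{\delta'}) = \mathbb{P}_\theta(A_n^{\delta'} \cap C_n^{\delta''}) + \mathbb{P}_\theta(A_n^{\delta'} \setminus C_n^{\delta''}) \to 0$ as $n \to +\infty$.

On the one hand, one has:

$$\mathbb{P}_\theta(A_n^{\delta'} \cap C_n^{\delta''}) \le n^{1 - \delta''} \, \mathbb{P}_{\tilde{\theta}}(A_n^{\delta'} \cap C_n^{\delta''}) \tag{6}$$

$$\le n^{1 - \delta''} \, \mathbb{P}_{\tilde{\theta}}(A_n^{\delta'}) = n^{1 - \delta''} \, \mathbb{P}_{\tilde{\theta}}\left( n - T_k(n) > n - \frac{1 - \delta'}{I^\delta} \log n \right)$$

$$\le \frac{n^{1 - \delta''} \, \mathbb{E}_{\tilde{\theta}}[n - T_k(n)]}{n - \frac{1 - \delta'}{I^\delta} \log n} \tag{7}$$

$$= \frac{n^{-\delta''} \, \mathbb{E}_{\tilde{\theta}}\left[ \sum_{\ell=1}^{K} T_\ell(n) - T_k(n) \right]}{1 - \frac{1 - \delta'}{I^\delta} \frac{\log n}{n}}$$

$$\le \frac{\sum_{\ell \ne k} n^{-\delta''} \, \mathbb{E}_{\tilde{\theta}}[T_\ell(n)]}{1 - \frac{1 - \delta'}{I^\delta} \frac{\log n}{n}} \xrightarrow[n \to +\infty]{} 0. \tag{8}$$

Equation (6) results from a partition of $A_n^{\delta'}$ into events $\{T_k(n) = a\}$, $0 \le a < \frac{1 - \delta'}{I^\delta} \log n$. Each event $\{T_k(n) = a\} \cap C_n^{\delta''}$ equals $\{T_k(n) = a\} \cap \left\{ \prod_{s=1}^{a} \frac{d\nu_k}{d\tilde{\nu}_k}(X_{k,s}) \le n^{1 - \delta''} \right\}$ and is measurable with respect to $X_{k,1}, \dots, X_{k,a}$ and to $X_{\ell,1}, \dots, X_{\ell,n}$ ($\ell \ne k$). Thus, $\mathbb{1}_{\{T_k(n) = a\} \cap C_n^{\delta''}}$ can be written as a function $f$ of the latter random variables,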
and we have:

$$\begin{aligned}
\mathbb{P}_\theta\left( \{T_k(n) = a\} \cap C_n^{\delta''} \right) &= \int f\left( (x_{k,s})_{1 \le s \le a}, (x_{\ell,s})_{\ell \ne k, 1 \le s \le n} \right) \prod_{\substack{\ell \ne k \\ 1 \le s \le n}} d\nu_\ell(x_{\ell,s}) \prod_{1 \le s \le a} d\nu_k(x_{k,s}) \\
&\le \int f\left( (x_{k,s})_{1 \le s \le a}, (x_{\ell,s})_{\ell \ne k, 1 \le s \le n} \right) \prod_{\substack{\ell \ne k \\ 1 \le s \le n}} d\nu_\ell(x_{\ell,s}) \; n^{1 - \delta''} \prod_{1 \le s \le a} d\tilde{\nu}_k(x_{k,s}) \\
&= n^{1 - \delta''} \, \mathbb{P}_{\tilde{\theta}}\left( \{T_k(n) = a\} \cap C_n^{\delta''} \right).
\end{aligned}$$

Equation (7) is a consequence of Markov's inequality, and the limit in (8) is a consequence of α-consistency.

On the other hand, we set $b_n := \frac{1 - \delta'}{I^\delta} \log n$, so that

$$\mathbb{P}_\theta(A_n^{\delta'} \setminus C_n^{\delta''}) \le \mathbb{P}_\theta\left( \max_{j \le \lfloor b_n \rfloor} \log L_j > (1 - \delta'') \log n \right) \le \mathbb{P}_\theta\left( \frac{1}{b_n} \max_{j \le \lfloor b_n \rfloor} \log L_j > I^\delta \, \frac{1 - \delta''}{1 - \delta'} \right).$$

This term tends to zero, as a consequence of the law of large numbers.

Now that $\mathbb{P}_\theta(A_n^{\delta'})$ tends to zero, the conclusion results from

$$\frac{1 - \delta'}{I^\delta} > \frac{1 - \delta'}{(1 + \delta) D_k(\theta)} \ge \frac{(1 - \varepsilon)(1 - \alpha)}{D_k(\theta)}.$$

The previous lower bound is asymptotically optimal with respect to its dependence on $\alpha$, as claimed in the following proposition.

Proposition 7 Assume that $\Theta$ has the Dirac/Bernoulli property. There exist $\theta \in \Theta$ and a constant $c > 0$ such that, for any $\alpha \in [0, 1)$, there exists an α-consistent policy such that:

$$\liminf_{n \to +\infty} \frac{\mathbb{E}_\theta[T_k(n)]}{(1 - \alpha) \log n} \le c,$$

for any $k$ satisfying $\Delta_k > 0$.

Proof In any environment of the form $\theta = (\delta_a, \delta_b)$ with $a \ne b$, Lemma 3 implies the following estimate for UCB(ρ):

$$\liminf_{n \to +\infty} \frac{\mathbb{E}_\theta T_k(n)}{\log n} \le \frac{\rho}{\Delta^2},$$

where $k \ne k^*$. Because $\frac{1 - \alpha}{2} \in (0, \frac{1}{2}]$ and since UCB(ρ) is $(1 - 2\rho)$-consistent for any $\rho \in (0, \frac{1}{2}]$ (Theorem 4), we obtain the result by choosing the α-consistent policy UCB($\frac{1 - \alpha}{2}$) and by setting $c = \frac{1}{2\Delta^2}$.

4. Selectivity

In this section, we address the problem of selectivity. By selectivity, we mean the ability to adapt to the environment as and when rewards are observed. More precisely, a set of two (or more) policies is given. The one that performs the best depends on environment $\theta$. We wonder if there exists an adaptive procedure that, given any environment $\theta$, would be as good as any policy in the given set.
Two major reasons motivate this study.

On the one hand, this question was answered by Burnetas and Katehakis within the class of consistent policies. They exhibit an asymptotically optimal policy, that is, one that achieves the regret lower bounds they have proven. The fact that a policy performs as well as any other one obviously solves the problem of selectivity.

On the other hand, this problem has already been studied in the context of adversarial bandits by Auer et al. (2003). Their setting differs from ours not only because their bandits are nonstochastic, but also because their adaptive procedure only takes into account a given number of recommendations, whereas in our setting the adaptation is supposed to come from observing the rewards of the chosen arms (only one per time step). Nevertheless, one can wonder if an "exponentially weighted forecasters" procedure like EXP4 could be transposed to our context. The answer is negative, as stated in the following theorem.

To avoid confusion, we make the notations of the regret and of the sampling time more precise by adding the considered policy: under policy $\mathcal{A}$, $R_n$ and $T_k(n)$ will be respectively denoted $R_n(\mathcal{A})$ and $T_k(n, \mathcal{A})$.

Theorem 8 Let $\tilde{\mathcal{A}}$ be a consistent policy and let $\rho$ be a real number in $(0, 0.4)$. If $\Theta$ has the Dirac/Bernoulli property and the product property, there is no policy which can beat both $\tilde{\mathcal{A}}$ and UCB(ρ), that is:

$$\forall \mathcal{A}, \ \exists \theta \in \Theta, \quad \limsup_{n \to +\infty} \frac{\mathbb{E}_\theta[R_n(\mathcal{A})]}{\min\left( \mathbb{E}_\theta[R_n(\tilde{\mathcal{A}})], \mathbb{E}_\theta[R_n(\mathrm{UCB}(\rho))] \right)} > 1.$$

Thus the existence of optimal policies does not hold when we extend the notion of consistency. Precisely, as UCB(ρ) is $(1 - 2\rho)$-consistent, we have shown that there is no optimal policy within the class of α-consistent policies, with $\alpha > 0.2$. Consequently, there do not exist optimal policies in the class of Hannan consistent policies either.

Moreover, Theorem 8 shows that methods inspired by the related literature on adversarial bandits cannot apply to our framework.
As we said, this impossibility may come from the fact that we cannot observe at each time step the decisions and rewards of more than one algorithm. If we were able to observe a given set of policies from step to step, then it would be easy to beat them all: it would be sufficient to aggregate all the observations and simply pull the arm with the greatest empirical mean. The case where we only observe decisions (and not rewards) of a set of policies may be interesting, but is left outside of the scope of this paper.

Proof Assume by contradiction that

$$\exists \mathcal{A}, \ \forall \theta \in \Theta, \quad \limsup_{n \to +\infty} u_{n,\theta} \le 1,$$

where $u_{n,\theta} = \frac{\mathbb{E}_\theta[R_n(\mathcal{A})]}{\min\left( \mathbb{E}_\theta[R_n(\tilde{\mathcal{A}})], \mathbb{E}_\theta[R_n(\mathrm{UCB}(\rho))] \right)}$.

For any $\theta$, we have

$$\mathbb{E}_\theta[R_n(\mathcal{A})] = \frac{\mathbb{E}_\theta[R_n(\mathcal{A})]}{\mathbb{E}_\theta[R_n(\tilde{\mathcal{A}})]} \, \mathbb{E}_\theta[R_n(\tilde{\mathcal{A}})] \le u_{n,\theta} \, \mathbb{E}_\theta[R_n(\tilde{\mathcal{A}})], \tag{9}$$

so that the fact that $\tilde{\mathcal{A}}$ is a consistent policy implies that $\mathcal{A}$ is also consistent. Consequently the lower bound of Theorem 6 also holds for policy $\mathcal{A}$.

For the rest of the proof, we focus on environments of the form $\theta = (\delta_0, \delta_\Delta)$ with $\Delta > 0$. In this case, arm 2 is the best arm, so that we have to compute $D_1(\theta)$. On the one hand, we have:

$$D_1(\theta) = \inf_{\tilde{\nu}_1 \in \Theta_1 : \, \mathbb{E}[\tilde{\nu}_1] > \mu^*} KL(\nu_1, \tilde{\nu}_1) = \inf_{\tilde{\nu}_1 \in \Theta_1 : \, \mathbb{E}[\tilde{\nu}_1] > \Delta} KL(\delta_0, \tilde{\nu}_1) = \inf_{\tilde{\nu}_1 \in \Theta_1 : \, \mathbb{E}[\tilde{\nu}_1] > \Delta} \log\left( \frac{1}{\tilde{\nu}_1(0)} \right).$$

As $\mathbb{E}[\tilde{\nu}_1] \le 1 - \tilde{\nu}_1(0)$, we get:

$$D_1(\theta) \ge \inf_{\tilde{\nu}_1 \in \Theta_1 : \, 1 - \tilde{\nu}_1(0) \ge \Delta} \log\left( \frac{1}{\tilde{\nu}_1(0)} \right) \ge \log\left( \frac{1}{1 - \Delta} \right).$$

On the other hand, we have for any $\varepsilon > 0$:

$$D_1(\theta) \le KL(\delta_0, \mathrm{Ber}(\Delta + \varepsilon)) = \log\left( \frac{1}{1 - \Delta - \varepsilon} \right).$$

Consequently $D_1(\theta) = \log\left( \frac{1}{1 - \Delta} \right)$, and the lower bound of Theorem 6 reads:

$$\liminf_{n \to +\infty} \frac{\mathbb{E}_\theta[T_1(n, \mathcal{A})]}{\log n} \ge \frac{1}{\log\left( \frac{1}{1 - \Delta} \right)}.$$

Just like Equation (9), we have:

$$\mathbb{E}_\theta[R_n(\mathcal{A})] \le u_{n,\theta} \, \mathbb{E}_\theta[R_n(\mathrm{UCB}(\rho))].$$

Moreover, Lemma 3 provides:

$$\mathbb{E}_\theta[R_n(\mathrm{UCB}(\rho))] \le 1 + \frac{\rho \log n}{\Delta}.$$

Now, by gathering the three previous inequalities and Formula (1), we get:

$$\begin{aligned}
\frac{1}{\log\left( \frac{1}{1 - \Delta} \right)} &\le \liminf_{n \to +\infty} \frac{\mathbb{E}_\theta[T_1(n, \mathcal{A})]}{\log n} = \liminf_{n \to +\infty} \frac{\mathbb{E}_\theta[R_n(\mathcal{A})]}{\Delta \log n} \\
&\le \liminf_{n \to +\infty} \frac{u_{n,\theta} \, \mathbb{E}_\theta[R_n(\mathrm{UCB}(\rho))]}{\Delta \log n} \le \liminf_{n \to +\infty} \frac{u_{n,\theta}}{\Delta \log n} \left( 1 + \frac{\rho \log n}{\Delta} \right) \\
&\le \liminf_{n \to +\infty} \frac{u_{n,\theta}}{\Delta \log n} + \liminf_{n \to +\infty} \frac{\rho \, u_{n,\theta}}{\Delta^2} = \frac{\rho}{\Delta^2} \liminf_{n \to +\infty} u_{n,\theta} \le \frac{\rho}{\Delta^2} \limsup_{n \to +\infty} u_{n,\theta} \\
&\le \frac{\rho}{\Delta^2}.
\end{aligned}$$

This means that $\rho$ has to be lower bounded by $\frac{\Delta^2}{\log\left( \frac{1}{1 - \Delta} \right)}$, but this is greater than 0.4 if $\Delta = 0.75$ (the ratio is then $0.5625 / \log 4 \approx 0.406$), hence the contradiction.

Note that this proof gives a simple alternative to Theorem 4 to show that UCB(ρ) is not consistent (if $\rho \le 0.4$). Indeed if it were consistent, then in environment $\theta = (\delta_0, \delta_\Delta)$ the same contradiction between the lower bound of Theorem 6 and the upper bound of Lemma 3 would hold.

5. General Bounds

In this section, we study lower bounds on the expected regret with few requirements on $\Theta$ and on the class of policies. With a simple property on $\Theta$ but without any assumption on the policy, we show that there always exist logarithmic lower bounds for some environments $\theta$. Then, still with a simple property on $\Theta$, we show that there exists a Hannan consistent policy for which the expected regret is sub-logarithmic for some environment $\theta$.

Note that the policy that always pulls arm 1 has zero expected regret in environments where arm 1 has the best mean reward, and an expected regret of order $n$ in other environments. So, for this policy, the expected regret is sub-logarithmic in some environments. Nevertheless, this policy is not Hannan consistent because its expected regret is not always $o(n)$.

5.1 The Necessity of a Logarithmic Regret in Some Environments

The necessity of a logarithmic regret in some environments can be explained by a simple sketch proof.
Assume that the agent knows the number of rounds $n$, and that he balances exploration and exploitation in the following way: he first pulls each arm $s(n)$ times, and then selects the arm that has obtained the best empirical mean for the rest of the game. Denote by $p_{s(n)}$ the probability that the best arm does not have the best empirical mean after the exploration phase (i.e., after the first $K s(n)$ rounds). The expected regret is then of the form

$$c_1 (1 - p_{s(n)}) s(n) + c_2 \, p_{s(n)} \, n. \tag{10}$$

Indeed, if the agent manages to identify the best arm then he only suffers the pulls of suboptimal arms during the exploration phase. That represents an expected regret of order $s(n)$. If not, the number of pulls of suboptimal arms is of order $n$, and so is the expected regret.

Now, let us approximate $p_{s(n)}$. It has the same order as the probability that the best arm gets an empirical mean lower than the second best mean reward. Moreover, $\frac{\hat{X}_{k^*, s(n)} - \mu^*}{\sigma} \sqrt{s(n)}$ (where $\sigma$ is the standard deviation of $X_{k^*,1}$) has approximately a standard normal distribution by the central limit theorem. Therefore, we have:

$$\begin{aligned}
p_{s(n)} \approx \mathbb{P}_\theta\left( \hat{X}_{k^*, s(n)} \le \mu^* - \Delta \right) &= \mathbb{P}_\theta\left( \frac{\hat{X}_{k^*, s(n)} - \mu^*}{\sigma} \sqrt{s(n)} \le -\frac{\Delta \sqrt{s(n)}}{\sigma} \right) \\
&\approx \frac{1}{\sqrt{2\pi}} \frac{\sigma}{\Delta \sqrt{s(n)}} \exp\left( -\frac{1}{2} \left( \frac{\Delta \sqrt{s(n)}}{\sigma} \right)^2 \right) \\
&= \frac{1}{\sqrt{2\pi}} \frac{\sigma}{\Delta \sqrt{s(n)}} \exp\left( -\frac{\Delta^2 s(n)}{2 \sigma^2} \right).
\end{aligned}$$

It follows that the expected regret has to be at least logarithmic. Indeed, to ensure that the second term $c_2 \, p_{s(n)} \, n$ of Equation (10) is sub-logarithmic, $s(n)$ has to be greater than $\log n$. But then the first term $c_1 (1 - p_{s(n)}) s(n)$ is greater than $\log n$.

Actually, the necessity of a logarithmic regret can be written as a consequence of Theorem 6. Indeed, if we assume by contradiction that $\limsup_{n \to +\infty} \frac{\mathbb{E}_\theta R_n}{\log n} = 0$ for all $\theta$ (i.e., $\mathbb{E}_\theta R_n = o(\log n)$), the considered policy is consistent. Consequently, Theorem 6 implies that

$$\limsup_{n \to +\infty} \frac{\mathbb{E}_\theta R_n}{\log n} \ge \liminf_{n \to +\infty} \frac{\mathbb{E}_\theta R_n}{\log n} > 0.$$

Yet, this reasoning needs $\Theta$ to have the product property, and conditions of the form $0 < D_k(\theta) < \infty$ also have to hold.
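The trade-off behind Expression (10) can be explored numerically with the following Python sketch. It plugs the Gaussian approximation of $p_{s(n)}$ given above into Expression (10) and searches for the best exploration length by brute force. The constants $c_1 = c_2 = 1$, the gap, the standard deviation and the function names are hypothetical values chosen for the illustration; the approximation itself is only heuristic, as in the sketch proof.

```python
import math


def misidentification_probability(s: int, gap: float, sigma: float) -> float:
    """Gaussian-tail approximation of p_s, capped at 1."""
    x = gap * math.sqrt(s) / sigma
    return min(1.0, math.exp(-0.5 * x * x) / (math.sqrt(2.0 * math.pi) * x))


def explore_then_commit_regret(s: int, n: int, gap: float, sigma: float,
                               c1: float = 1.0, c2: float = 1.0) -> float:
    """Expression (10): c1 * (1 - p_s) * s + c2 * p_s * n."""
    p = misidentification_probability(s, gap, sigma)
    return c1 * (1.0 - p) * s + c2 * p * n


def best_exploration_length(n: int, gap: float, sigma: float) -> int:
    """Brute-force minimizer of Expression (10) over s in {1, ..., n}."""
    return min(range(1, n + 1),
               key=lambda s: explore_then_commit_regret(s, n, gap, sigma))


if __name__ == "__main__":
    n, gap, sigma = 10**4, 0.2, 0.5   # hypothetical values
    s_best = best_exploration_length(n, gap, sigma)
    print(s_best, explore_then_commit_regret(s_best, n, gap, sigma), math.log(n))
```

With these values, the minimizing exploration length and the resulting value of Expression (10) both exceed $\log n$, in line with the heuristic argument above.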
The following proposition is a rigorous version of our sketch, and it shows that the necessity of a logarithmic lower bound can be based on a simple property on $\Theta$.

Proposition 9 Assume that there exist two environments $\theta = (\nu_1, \dots, \nu_K) \in \Theta$, $\tilde{\theta} = (\tilde{\nu}_1, \dots, \tilde{\nu}_K) \in \Theta$, and an arm $k \in \{1, \dots, K\}$ such that
1. $k$ has the best mean reward in environment $\theta$,
2. $k$ is not the winning arm in environment $\tilde{\theta}$,
3. $\nu_k = \tilde{\nu}_k$ and there exists $\eta \in (0, 1)$ such that

$$\prod_{\ell \ne k} \frac{d\nu_\ell}{d\tilde{\nu}_\ell}(X_{\ell,1}) \ge \eta \quad P_{\tilde{\theta}}\text{-a.s.} \qquad (11)$$

Then, for any policy, there exists $\hat{\theta} \in \Theta$ such that

$$\limsup_{n\to+\infty} \frac{E_{\hat{\theta}} R_n}{\log n} > 0.$$

Let us explain the logic of the three conditions of the proposition. If $\nu_k = \tilde{\nu}_k$, and in case $\nu_k$ seems to be the reward distribution of arm $k$, then arm $k$ has to be pulled often enough for the regret to be small if the environment is $\theta$. Nevertheless, one has to explore other arms to know whether the environment is actually $\tilde{\theta}$. Moreover, Inequality (11) makes sure that the distinction between $\theta$ and $\tilde{\theta}$ is tough to make: it ensures that pulling any arm $\ell \ne k$ gives a reward which is likely in both environments. Without such an assumption the problem may be very simple, and providing a logarithmic lower bound is hopeless. Indeed, the distinction between any pair of tricky environments $(\theta, \tilde{\theta})$ may be solved in only one pull of a given arm $\ell \ne k$, that would almost surely give a reward that is possible in only one of the two environments.

The third condition can be seen as an alternate version of condition $0 < D_k(\theta) < \infty$ in Theorem 6, though there is no logical connection with it. Finally, let us remark that one can check that any set $\Theta$ that has the Dirac/Bernoulli property satisfies the conditions of Proposition 9.

Proof The proof consists in writing a proper version of Expression (10). To this aim we compute a lower bound of $E_{\tilde{\theta}} R_n$, expressed as a function of $E_\theta R_n$ and of an arbitrary function $g(n)$.
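In the following, $\tilde{\Delta}_k$ denotes the optimality gap of arm $k$ in environment $\tilde{\theta}$. As event $\{\sum_{\ell \ne k} T_\ell(n) \le g(n)\}$ is measurable with respect to $X_{\ell,1}, \dots, X_{\ell,\lfloor g(n) \rfloor}$ ($\ell \ne k$) and to $X_{k,1}, \dots, X_{k,n}$, we also introduce the function $q$ such that

$$\mathbb{1}_{\{\sum_{\ell \ne k} T_\ell(n) \le g(n)\}} = q\big((X_{\ell,s})_{\ell \ne k,\, s=1..\lfloor g(n) \rfloor},\, (X_{k,s})_{s=1..n}\big).$$

We have:

$$
\begin{aligned}
E_{\tilde{\theta}} R_n &\ge \tilde{\Delta}_k E_{\tilde{\theta}}[T_k(n)] \ge \tilde{\Delta}_k (n - g(n))\, P_{\tilde{\theta}}\big(T_k(n) \ge n - g(n)\big) && (12) \\
&= \tilde{\Delta}_k (n - g(n))\, P_{\tilde{\theta}}\Big(n - \sum_{\ell \ne k} T_\ell(n) \ge n - g(n)\Big) \\
&= \tilde{\Delta}_k (n - g(n))\, P_{\tilde{\theta}}\Big(\sum_{\ell \ne k} T_\ell(n) \le g(n)\Big) \\
&= \tilde{\Delta}_k (n - g(n)) \int q\big((x_{\ell,s})_{\ell \ne k,\, s=1..\lfloor g(n) \rfloor},\, (x_{k,s})_{s=1..n}\big) \prod_{\substack{\ell \ne k \\ s=1..\lfloor g(n) \rfloor}} d\tilde{\nu}_\ell(x_{\ell,s}) \prod_{s=1..n} d\tilde{\nu}_k(x_{k,s}) \\
&\ge \tilde{\Delta}_k (n - g(n)) \int q\big((x_{\ell,s})_{\ell \ne k,\, s=1..\lfloor g(n) \rfloor},\, (x_{k,s})_{s=1..n}\big)\, \eta^{\lfloor g(n) \rfloor} \prod_{\substack{\ell \ne k \\ s=1..\lfloor g(n) \rfloor}} d\nu_\ell(x_{\ell,s}) \prod_{s=1..n} d\nu_k(x_{k,s}) && (13) \\
&\ge \tilde{\Delta}_k (n - g(n))\, \eta^{g(n)} \int q\big((x_{\ell,s})_{\ell \ne k,\, s=1..\lfloor g(n) \rfloor},\, (x_{k,s})_{s=1..n}\big) \prod_{\substack{\ell \ne k \\ s=1..\lfloor g(n) \rfloor}} d\nu_\ell(x_{\ell,s}) \prod_{s=1..n} d\nu_k(x_{k,s}) \\
&= \tilde{\Delta}_k (n - g(n))\, \eta^{g(n)}\, P_\theta\Big(\sum_{\ell \ne k} T_\ell(n) \le g(n)\Big) \\
&= \tilde{\Delta}_k (n - g(n))\, \eta^{g(n)} \Big(1 - P_\theta\Big(\sum_{\ell \ne k} T_\ell(n) > g(n)\Big)\Big) \\
&\ge \tilde{\Delta}_k (n - g(n))\, \eta^{g(n)} \left(1 - \frac{E_\theta\big[\sum_{\ell \ne k} T_\ell(n)\big]}{g(n)}\right) && (14) \\
&\ge \tilde{\Delta}_k (n - g(n))\, \eta^{g(n)} \left(1 - \frac{E_\theta\big[\sum_{\ell \ne k} \Delta_\ell T_\ell(n)\big]}{\Delta\, g(n)}\right) && (15) \\
&\ge \tilde{\Delta}_k (n - g(n))\, \eta^{g(n)} \left(1 - \frac{E_\theta R_n}{\Delta\, g(n)}\right),
\end{aligned}
$$

where the first inequality of (12) is a consequence of Formula (1), the second inequality of (12) and inequality (14) come from Markov's inequality, Inequality (13) is a consequence of (11), and Inequality (15) results from the fact that $\Delta_\ell \ge \Delta$ for all $\ell$.

Now, let us conclude. If $\frac{E_\theta R_n}{\log n} \xrightarrow[n\to+\infty]{} 0$, we set $g(n) = \frac{2 E_\theta R_n}{\Delta}$, so that $g(n) \le \min\left(\frac{n}{2}, \frac{-\log n}{2 \log \eta}\right)$ for $n$ large enough. Then, we have:

$$E_{\tilde{\theta}} R_n \ge \tilde{\Delta}_k \frac{n - g(n)}{2}\, \eta^{g(n)} \ge \tilde{\Delta}_k \frac{n}{4}\, \eta^{\frac{-\log n}{2 \log \eta}} = \tilde{\Delta}_k \frac{\sqrt{n}}{4}.$$

In particular, $\frac{E_{\tilde{\theta}} R_n}{\log n} \xrightarrow[n\to+\infty]{} +\infty$, and the result follows.

5.2 Hannan Consistency

We will prove that there exists a Hannan consistent policy such that there can not be a logarithmic lower bound for every environment $\theta$ of $\Theta$. To this aim, we make use of general UCB policies again (cf.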
Section 2.1). Let us first give sufficient conditions on the $f_k$ for a UCB policy to be Hannan consistent.

Proposition 10 Assume that $f_k(n) = o(n)$ for all $k \in \{1, \dots, K\}$. Assume also that there exist $\gamma > \frac{1}{2}$ and $N \ge 3$ such that $f_k(n) \ge \gamma \log\log n$ for all $k \in \{1, \dots, K\}$ and for all $n \ge N$. Then UCB is Hannan consistent.

Proof Fix an arm $k$ such that $\Delta_k > 0$ and choose $\beta \in (0, 1)$ such that $2\beta\gamma > 1$. By means of Lemma 2, we have for $n$ large enough:

$$E_\theta[T_k(n)] \le u + 2 \sum_{t=u+1}^{n} \left(1 + \frac{\log t}{\log\frac{1}{\beta}}\right) e^{-2\beta\gamma \log\log t},$$

where $u = \left\lceil \frac{4 f_k(n)}{\Delta_k^2} \right\rceil$. Consequently, we have:

$$E_\theta[T_k(n)] \le u + 2 \sum_{t=2}^{n} \left(\frac{1}{(\log t)^{2\beta\gamma}} + \frac{1}{\log\frac{1}{\beta}}\,\frac{1}{(\log t)^{2\beta\gamma - 1}}\right). \qquad (16)$$

Sums of the form $\sum_{t=2}^{n} \frac{1}{(\log t)^c}$ with $c > 0$ are equivalent to $\frac{n}{(\log n)^c}$ as $n$ goes to infinity. Indeed, on the one hand we have

$$\sum_{t=3}^{n} \frac{1}{(\log t)^c} \le \int_2^n \frac{dx}{(\log x)^c} \le \sum_{t=2}^{n} \frac{1}{(\log t)^c},$$

so that $\sum_{t=2}^{n} \frac{1}{(\log t)^c} \sim \int_2^n \frac{dx}{(\log x)^c}$. On the other hand, we have

$$\int_2^n \frac{dx}{(\log x)^c} = \left[\frac{x}{(\log x)^c}\right]_2^n + c \int_2^n \frac{dx}{(\log x)^{c+1}}.$$

As both integrals are divergent we have $\int_2^n \frac{dx}{(\log x)^{c+1}} = o\left(\int_2^n \frac{dx}{(\log x)^c}\right)$, so that $\int_2^n \frac{dx}{(\log x)^c} \sim \frac{n}{(\log n)^c}$.

Combining the fact that $\sum_{t=2}^{n} \frac{1}{(\log t)^c} \sim \frac{n}{(\log n)^c}$ with Equation (16), we get the existence of a constant $C > 0$ such that

$$E_\theta[T_k(n)] \le \frac{4 f_k(n)}{\Delta^2} + \frac{C n}{(\log n)^{2\beta\gamma - 1}}.$$

Since $f_k(n) = o(n)$ and $2\beta\gamma - 1 > 0$, the latter inequality shows that $E_\theta[T_k(n)] = o(n)$. The result follows.

We are now in the position to prove the main result of this section.

Theorem 11 If $\Theta$ has the Dirac/Bernoulli property, there exist Hannan consistent policies for which the expected regret can not be lower bounded by a logarithmic function in all environments $\theta$.

Proof If $f_1(n) = f_2(n) = \log\log n$ for $n \ge 3$, UCB is Hannan consistent by Proposition 10. According to Lemma 1, the expected regret is then of order $\log\log n$ in environments of the form $(\delta_a, \delta_b)$, $a \ne b$. Hence the conclusion on the non-existence of logarithmic lower bounds.

Thus we have obtained a lower bound of order $\log\log n$.
This order is critical regarding the methods we used. Yet, we do not know if this order is optimal.

Acknowledgments

This work has been supported by the French National Research Agency (ANR) through the COSINUS program (ANR-08-COSI-004: EXPLO-RA project).

References

R. Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27:1054–1078, 1995.
H. Akaike. Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, volume 1, pages 267–281. Springer Verlag, 1973.
J.-Y. Audibert, R. Munos, and C. Szepesvári. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876–1902, 2009.
P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
P. Auer, N. Cesa-Bianchi, Y. Freund, and R.E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2003.
D. Bergemann and J. Valimaki. Bandit problems. In The New Palgrave Dictionary of Economics, 2nd ed. Macmillan Press, 2008.
S. Bubeck. Bandits Games and Clustering Foundations. PhD thesis, Université Lille 1, France, 2010.
S. Bubeck, R. Munos, G. Stoltz, and C. Szepesvari. Online optimization in X-armed bandits. In Advances in Neural Information Processing Systems 21, pages 201–208. 2009.
A.N. Burnetas and M.N. Katehakis. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17(2):122–142, 1996.
N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, 2006.
N. Cesa-Bianchi, Y. Freund, D. Haussler, D.P. Helmbold, R.E. Schapire, and M.K. Warmuth. How to use expert advice. Journal of the ACM (JACM), 44(3):427–485, 1997.
P.A. Coquelin and R. Munos. Bandit algorithms for tree search. In Uncertainty in Artificial Intelligence, 2007.
S. Gelly and Y. Wang. Exploration exploitation in go: UCT for Monte-Carlo go. In Online Trading between Exploration and Exploitation Workshop, Twentieth Annual Conference on Neural Information Processing Systems (NIPS 2006), 2006.
J. Honda and A. Takemura. An asymptotically optimal bandit algorithm for bounded support models. In Proceedings of the Twenty-Third Annual Conference on Learning Theory (COLT), 2010.
R. Kleinberg, A. Slivkins, and E. Upfal. Multi-armed bandits in metric spaces. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, pages 681–690, 2008.
R. D. Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems 17, pages 697–704. 2005.
T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985.
D. Lamberton, G. Pagès, and P. Tarrès. When can the two-armed bandit algorithm be trusted? Annals of Applied Probability, 14(3):1424–1454, 2004.
C.L. Mallows. Some comments on Cp. Technometrics, pages 661–675, 1973.
H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58:527–535, 1952.
G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
W.R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
5 0.077027813 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
Author: Takafumi Kanamori, Akiko Takeda, Taiji Suzuki
Abstract: There are two main approaches to binary classification problems: the loss function approach and the uncertainty set approach. The loss function approach is widely used in real-world data analysis. Statistical decision theory has been used to elucidate its properties such as statistical consistency. Conditional probabilities can also be estimated by using the minimum solution of the loss function. In the uncertainty set approach, an uncertainty set is defined for each binary label from training samples. The best separating hyperplane between the two uncertainty sets is used as the decision function. Although the uncertainty set approach provides an intuitive understanding of learning algorithms, its statistical properties have not been sufficiently studied. In this paper, we show that the uncertainty set is deeply connected with the convex conjugate of a loss function. On the basis of the conjugate relation, we propose a way of revising the uncertainty set approach so that it will have good statistical properties such as statistical consistency. We also introduce statistical models corresponding to uncertainty sets in order to estimate conditional probabilities. Finally, we present numerical experiments, verifying that the learning with revised uncertainty sets improves the prediction accuracy. Keywords: loss function, uncertainty set, convex conjugate, consistency
6 0.076689444 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
7 0.0620181 114 jmlr-2013-The Rate of Convergence of AdaBoost
8 0.058266461 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
9 0.050171204 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows
10 0.049620446 81 jmlr-2013-Optimal Discovery with Probabilistic Expert Advice: Finite Time Analysis and Macroscopic Optimality
11 0.048157211 84 jmlr-2013-PC Algorithm for Nonparanormal Graphical Models
12 0.047988661 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
13 0.046993632 8 jmlr-2013-A Theory of Multiclass Boosting
14 0.046060756 30 jmlr-2013-Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising
15 0.044472516 61 jmlr-2013-Learning Theory Analysis for Association Rules and Sequential Event Prediction
16 0.04211653 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
17 0.040517468 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
18 0.038511019 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting
19 0.037463967 90 jmlr-2013-Quasi-Newton Method: A New Direction
20 0.037338413 103 jmlr-2013-Sparse Robust Estimation and Kalman Smoothing with Nonsmooth Log-Concave Densities: Modeling, Computation, and Theory
topicId topicWeight
[(0, -0.219), (1, 0.073), (2, 0.008), (3, 0.207), (4, -0.157), (5, 0.085), (6, -0.141), (7, -0.012), (8, 0.017), (9, 0.008), (10, -0.029), (11, 0.054), (12, 0.116), (13, -0.004), (14, 0.072), (15, -0.112), (16, -0.023), (17, 0.146), (18, -0.25), (19, 0.134), (20, -0.031), (21, -0.19), (22, -0.034), (23, 0.121), (24, 0.067), (25, 0.1), (26, -0.195), (27, -0.107), (28, -0.104), (29, -0.085), (30, 0.001), (31, 0.01), (32, 0.046), (33, 0.036), (34, -0.036), (35, -0.059), (36, 0.099), (37, 0.051), (38, -0.076), (39, 0.032), (40, -0.06), (41, 0.052), (42, 0.024), (43, 0.034), (44, 0.011), (45, 0.089), (46, 0.019), (47, -0.106), (48, 0.059), (49, -0.106)]
simIndex simValue paperId paperTitle
same-paper 1 0.9148078 68 jmlr-2013-Machine Learning with Operational Costs
Author: Theja Tulabandhula, Cynthia Rudin
Abstract: This work proposes a way to align statistical modeling with decision making. We provide a method that propagates the uncertainty in predictive modeling to the uncertainty in operational cost, where operational cost is the amount spent by the practitioner in solving the problem. The method allows us to explore the range of operational costs associated with the set of reasonable statistical models, so as to provide a useful way for practitioners to understand uncertainty. To do this, the operational cost is cast as a regularization term in a learning algorithm’s objective function, allowing either an optimistic or pessimistic view of possible costs, depending on the regularization parameter. From another perspective, if we have prior knowledge about the operational cost, for instance that it should be low, this knowledge can help to restrict the hypothesis space, and can help with generalization. We provide a theoretical generalization bound for this scenario. We also show that learning with operational costs is related to robust optimization. Keywords: statistical learning theory, optimization, covering numbers, decision theory
2 0.7365604 87 jmlr-2013-Performance Bounds for λ Policy Iteration and Application to the Game of Tetris
Author: Bruno Scherrer
Abstract: We consider the discrete-time infinite-horizon optimal control problem formalized by Markov decision processes (Puterman, 1994; Bertsekas and Tsitsiklis, 1996). We revisit the work of Bertsekas and Ioffe (1996), that introduced λ policy iteration (a family of algorithms parametrized by a parameter λ) that generalizes the standard algorithms value and policy iteration, and has some deep connections with the temporal-difference algorithms described by Sutton and Barto (1998). We deepen the original theory developed by the authors by providing convergence rate bounds which generalize standard bounds for value iteration described for instance by Puterman (1994). Then, the main contribution of this paper is to develop the theory of this algorithm when it is used in an approximate form. We extend and unify the separate analyses developed by Munos for approximate value iteration (Munos, 2007) and approximate policy iteration (Munos, 2003), and provide performance bounds in the discounted and the undiscounted situations. Finally, we revisit the use of this algorithm in the training of a Tetris playing controller as originally done by Bertsekas and Ioffe (1996). Our empirical results are different from those of Bertsekas and Ioffe (which were originally qualified as “paradoxical” and “intriguing”). We track down the reason to be a minor implementation error of the algorithm, which suggests that, in practice, λ policy iteration may be more stable than previously thought. Keywords: stochastic optimal control, reinforcement learning, Markov decision processes, analysis of algorithms
3 0.4903588 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
Author: Takafumi Kanamori, Akiko Takeda, Taiji Suzuki
Abstract: There are two main approaches to binary classification problems: the loss function approach and the uncertainty set approach. The loss function approach is widely used in real-world data analysis. Statistical decision theory has been used to elucidate its properties such as statistical consistency. Conditional probabilities can also be estimated by using the minimum solution of the loss function. In the uncertainty set approach, an uncertainty set is defined for each binary label from training samples. The best separating hyperplane between the two uncertainty sets is used as the decision function. Although the uncertainty set approach provides an intuitive understanding of learning algorithms, its statistical properties have not been sufficiently studied. In this paper, we show that the uncertainty set is deeply connected with the convex conjugate of a loss function. On the basis of the conjugate relation, we propose a way of revising the uncertainty set approach so that it will have good statistical properties such as statistical consistency. We also introduce statistical models corresponding to uncertainty sets in order to estimate conditional probabilities. Finally, we present numerical experiments, verifying that the learning with revised uncertainty sets improves the prediction accuracy. Keywords: loss function, uncertainty set, convex conjugate, consistency
4 0.42579946 120 jmlr-2013-Variational Algorithms for Marginal MAP
Author: Qiang Liu, Alexander Ihler
Abstract: The marginal maximum a posteriori probability (MAP) estimation problem, which calculates the mode of the marginal posterior distribution of a subset of variables with the remaining variables marginalized, is an important inference problem in many models, such as those with hidden variables or uncertain parameters. Unfortunately, marginal MAP can be NP-hard even on trees, and has attracted less attention in the literature compared to the joint MAP (maximization) and marginalization problems. We derive a general dual representation for marginal MAP that naturally integrates the marginalization and maximization operations into a joint variational optimization problem, making it possible to easily extend most or all variational-based algorithms to marginal MAP. In particular, we derive a set of “mixed-product” message passing algorithms for marginal MAP, whose form is a hybrid of max-product, sum-product and a novel “argmax-product” message updates. We also derive a class of convergent algorithms based on proximal point methods, including one that transforms the marginal MAP problem into a sequence of standard marginalization problems. Theoretically, we provide guarantees under which our algorithms give globally or locally optimal solutions, and provide novel upper bounds on the optimal objectives. Empirically, we demonstrate that our algorithms significantly outperform the existing approaches, including a state-of-the-art algorithm based on local search methods. Keywords: graphical models, message passing, belief propagation, variational methods, maximum a posteriori, marginal-MAP, hidden variable models
5 0.36696801 65 jmlr-2013-Lower Bounds and Selectivity of Weak-Consistent Policies in Stochastic Multi-Armed Bandit Problem
Author: Antoine Salomon, Jean-Yves Audibert, Issam El Alaoui
Abstract: This paper is devoted to regret lower bounds in the classical model of stochastic multi-armed bandit. A well-known result of Lai and Robbins, which has then been extended by Burnetas and Katehakis, has established the presence of a logarithmic bound for all consistent policies. We relax the notion of consistency, and exhibit a generalisation of the bound. We also study the existence of logarithmic bounds in general and in the case of Hannan consistency. Moreover, we prove that it is impossible to design an adaptive policy that would select the best of two algorithms by taking advantage of the properties of the environment. To get these results, we study variants of popular Upper Confidence Bounds (UCB) policies. Keywords: stochastic bandits, regret lower bounds, consistency, selectivity, UCB policies

1. Introduction and Notations

Multi-armed bandits are a classical way to illustrate the difficulty of decision making in the case of a dilemma between exploration and exploitation. The name of these models comes from an analogy with playing a slot machine with more than one arm. Each arm has a given (and unknown) reward distribution and, at each of a given number of rounds, the agent has to choose one of them. As the goal is to maximize the sum of rewards, each decision is a trade-off between exploitation (i.e., playing the arm that has been the most lucrative so far) and exploration (i.e., testing another arm, hoping to discover an alternative that beats the current best choice). One possible application is clinical trials: a doctor wants to heal as many patients as possible, the patients arrive sequentially, and the effectiveness of each treatment is initially unknown (Thompson, 1933).
Bandit problems were first studied by Robbins (1952), and since then they have been applied to many fields such as economics (Lamberton et al., 2004; Bergemann and Valimaki, 2008), games (Gelly and Wang, 2006), and optimisation (Kleinberg, 2005; Coquelin and Munos, 2007; Kleinberg et al., 2008; Bubeck et al., 2009).

∗. Also at Willow, CNRS/ENS/INRIA - UMR 8548. © 2013 Antoine Salomon, Jean-Yves Audibert and Issam El Alaoui.

1.1 Setting

In this paper, we consider the following model. A stochastic multi-armed bandit problem is defined by:
• a number of rounds $n$,
• a number of arms $K \ge 2$,
• an environment $\theta = (\nu_1, \dots, \nu_K)$, where each $\nu_k$ ($k \in \{1, \dots, K\}$) is a real-valued measure that represents the reward distribution of arm $k$.

The number of rounds $n$ may or may not be known by the agent, but this will not affect the present study. We assume that rewards are bounded. Thus, for simplicity, each $\nu_k$ is a probability on $[0, 1]$.

Environment $\theta$ is initially unknown by the agent but lies in some known set $\Theta$. For the problem to be interesting, the agent should not have much knowledge of its environment, so $\Theta$ should not be too small or contain only trivial distributions such as Dirac measures. To make it simple, we may assume that $\Theta$ contains all environments where each reward distribution is a Dirac distribution or a Bernoulli distribution. This will be referred to as $\Theta$ having the Dirac/Bernoulli property. For technical reasons, we may also assume that $\Theta$ is of the form $\Theta_1 \times \dots \times \Theta_K$, meaning that $\Theta_k$ is the set of possible reward distributions of arm $k$. This will be referred to as $\Theta$ having the product property.

The game is as follows. At each round (or time step) $t = 1, \dots, n$, the agent has to choose an arm $I_t$ in the set $\{1, \dots, K\}$. This decision is based on past actions and observations, and the agent may also randomize his choice.
Once the decision is made, the agent gets and observes a reward that is drawn from $\nu_{I_t}$ independently from the past. Thus a policy (or strategy) can be described by a sequence $(\sigma_t)_{t \ge 1}$ (or $(\sigma_t)_{1 \le t \le n}$ if the number of rounds $n$ is known) such that each $\sigma_t$ is a mapping from the set $\{1, \dots, K\}^{t-1} \times [0, 1]^{t-1}$ of past decisions and rewards into the set of arms $\{1, \dots, K\}$ (or into the set of probabilities on $\{1, \dots, K\}$, in case the agent randomizes his choices).

For each arm $k$ and every time step $t$, let $T_k(t) = \sum_{s=1}^{t} \mathbb{1}_{I_s = k}$ denote the sampling time, that is, the number of times arm $k$ was pulled from round 1 to round $t$, and $X_{k,1}, X_{k,2}, \dots, X_{k,T_k(t)}$ the corresponding sequence of rewards. We denote by $P_\theta$ the distribution on the probability space such that for any $k \in \{1, \dots, K\}$, the random variables $X_{k,1}, X_{k,2}, \dots, X_{k,n}$ are i.i.d. realizations of $\nu_k$, and such that these $K$ sequences of random variables are independent. Let $E_\theta$ denote the associated expectation.

Let $\mu_k = \int x \, d\nu_k(x)$ be the mean reward of arm $k$. Introduce $\mu^* = \max_{k \in \{1, \dots, K\}} \mu_k$ and fix an arm $k^* \in \operatorname{argmax}_{k \in \{1, \dots, K\}} \mu_k$, that is, $k^*$ has the best expected reward. The agent aims at minimizing its regret, defined as the difference between the cumulative reward he would have obtained by always drawing the best arm and the cumulative reward he actually received. Its regret is thus

$$R_n = \sum_{t=1}^{n} X_{k^*,t} - \sum_{t=1}^{n} X_{I_t, T_{I_t}(t)}.$$

Like most publications on this topic, we focus on the expected regret, for which one can check that:

$$E_\theta R_n = \sum_{k=1}^{K} \Delta_k E_\theta[T_k(n)], \qquad (1)$$

where $\Delta_k$ is the optimality gap of arm $k$, defined by $\Delta_k = \mu^* - \mu_k$. We also define $\Delta$ as the gap between the best arm and the second best arm, that is, $\Delta := \min_{k : \Delta_k > 0} \Delta_k$.

Other notions of regret exist in the literature. One of them is the quantity

$$\max_{k} \sum_{t=1}^{n} \big(X_{k,t} - X_{I_t, T_{I_t}(t)}\big),$$

which is mostly used in adversarial settings.
Results and ideas we want to convey here are more suited to expected regret, and considering other definitions of regret would only bring some more technical intricacies.

1.2 Consistency and Regret Lower Bounds

Former works have shown the existence of lower bounds on the expected regret of a large class of policies: intuitively, to perform well the agent has to explore all arms, and this requires a significant amount of suboptimal choices. In this way, Lai and Robbins (1985) proved a lower bound of order $\log n$ in a particular parametric framework, and they also exhibited optimal policies. This work has then been extended by Burnetas and Katehakis (1996). Both papers deal with consistent policies, meaning that they only consider policies such that:

$$\forall a > 0,\ \forall \theta \in \Theta,\quad E_\theta[R_n] = o(n^a). \qquad (2)$$

Let us detail the bound of Burnetas and Katehakis, which is valid when $\Theta$ has the product property. Given an environment $\theta = (\nu_1, \dots, \nu_K)$ and an arm $k \in \{1, \dots, K\}$, define:

$$D_k(\theta) := \inf_{\tilde{\nu}_k \in \Theta_k :\, E[\tilde{\nu}_k] > \mu^*} KL(\nu_k, \tilde{\nu}_k),$$

where $KL(\nu, \mu)$ denotes the Kullback-Leibler divergence of measures $\nu$ and $\mu$. Now fix a consistent policy and an environment $\theta \in \Theta$. If $k$ is a suboptimal arm (i.e., $\mu_k \ne \mu^*$) such that $0 < D_k(\theta) < \infty$, then

$$\forall \varepsilon > 0,\quad \lim_{n\to+\infty} P_\theta\left(T_k(n) \ge \frac{(1 - \varepsilon)\log n}{D_k(\theta)}\right) = 1.$$

This readily implies that:

$$\liminf_{n\to+\infty} \frac{E_\theta[T_k(n)]}{\log n} \ge \frac{1}{D_k(\theta)}.$$

Thanks to Formula (1), it is then easy to deduce a lower bound of the expected regret.

One contribution of this paper is to generalize the study of regret lower bounds, by considering weaker notions of consistency: $\alpha$-consistency and Hannan consistency. We will define $\alpha$-consistency ($\alpha \in [0, 1)$) as a variant of Equation (2), where equality $E_\theta[R_n] = o(n^a)$ only holds for all $a > \alpha$. We show that the logarithmic bound of Burnetas and Katehakis still holds, but coefficient $\frac{1}{D_k(\theta)}$ is turned into $\frac{1 - \alpha}{D_k(\theta)}$.
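To make the coefficient above concrete, here is a minimal Python sketch, not taken from the paper, of the resulting regret constant $\sum_k \Delta_k (1 - \alpha) / D_k(\theta)$ obtained by combining the bound with Formula (1). It assumes that every $\Theta_k$ is the set of Bernoulli laws and that $\mu^* < 1$, in which case the infimum defining $D_k(\theta)$ equals $KL(\mathrm{Ber}(\mu_k), \mathrm{Ber}(\mu^*))$. The function names are hypothetical.

```python
import math


def kl_bernoulli(p, q):
    """KL(Ber(p), Ber(q)) in nats, with the convention 0 * log(0) = 0."""
    if q <= 0.0 or q >= 1.0:
        # Ber(q) is a Dirac mass: the divergence is finite only if p == q.
        return 0.0 if p == q else math.inf
    total = 0.0
    if p > 0.0:
        total += p * math.log(p / q)
    if p < 1.0:
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total


def lower_bound_constant(means, alpha=0.0):
    """Sum over suboptimal arms of Delta_k * (1 - alpha) / D_k(theta).

    Assumption: Bernoulli arms with best mean below 1, so that
    D_k(theta) = KL(Ber(mu_k), Ber(mu_star)). alpha = 0 gives the
    classical coefficient for consistent policies.
    """
    best = max(means)
    total = 0.0
    for mu in means:
        gap = best - mu
        if gap > 0.0:
            total += gap * (1.0 - alpha) / kl_bernoulli(mu, best)
    return total


if __name__ == "__main__":
    means = [0.75, 0.5, 0.25]  # illustrative values only
    for alpha in (0.0, 0.5):
        print(alpha, lower_bound_constant(means, alpha))
```

The asymptotic regret lower bound is then this constant times $\log n$; moving from consistency to $\alpha$-consistency simply rescales it by $1 - \alpha$.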
We also prove that the dependence of this new bound with respect to the term $1 - \alpha$ is asymptotically optimal when $n \to +\infty$ (up to a constant).

We will also consider the case of Hannan consistency. Indeed, any policy achieves at most an expected regret of order $n$: because of the equality $\sum_{k=1}^{K} T_k(n) = n$ and thanks to Equation (1), one can show that $E_\theta R_n \le n \max_k \Delta_k$. More intuitively, this comes from the fact that the average cost of pulling an arm $k$ is a constant $\Delta_k$. As a consequence, it is natural to wonder what happens when dealing with policies whose expected regret is only required to be $o(n)$, which is equivalent to Hannan consistency. This condition is less restrictive than any of the previous notions of consistency. In this larger class of policies, we show that the lower bounds on the expected regret are no longer logarithmic, but can be much smaller.

Finally, even if no logarithmic lower bound holds on the whole set $\Theta$, we show that there necessarily exist some environments $\theta$ for which the expected regret is at least logarithmic. The latter result is actually valid without any assumptions on the considered policies, and only requires a simple property on $\Theta$.

1.3 Selectivity

As we exhibit new lower bounds, we want to know if it is possible to provide optimal policies that achieve these lower bounds, as is the case in the classical class of consistent policies. We answer this question negatively, and for this we solve the more general problem of selectivity. Given a set of policies, we define selectivity as the ability to perform at least as well as the policy that is best suited to the current environment $\theta$. Still, such an ability can not be implemented. As a by-product it is not possible to design a procedure that would specifically adapt to some kinds of environments, for example by selecting a particular policy.
This contribution is linked with selectivity in on-line learning problems with perfect information, commonly addressed by prediction with expert advice (see, e.g., Cesa-Bianchi et al., 1997). In this spirit, a closely related problem is the one of regret against the best strategy from a pool studied by Auer et al. (2003). The latter designed an algorithm in the context of adversarial/nonstochastic bandit whose decisions are based on a given number of recommendations (experts), which are themselves possibly the rewards received by a set of given policies. To a larger extent, model selection has been intensively studied in statistics, and is commonly solved by penalization methods (Mallows, 1973; Akaike, 1973; Schwarz, 1978).

1.4 UCB Policies

Some of our results are obtained using particular Upper Confidence Bound algorithms. These algorithms were introduced by Lai and Robbins (1985): they basically consist in computing an index for each arm, and selecting the arm with the greatest index. A simple and efficient way to design such policies is as follows: choose each index as low as possible such that, conditional to past observations, it is an upper bound of the mean reward of the considered arm with high probability (or, say, with high confidence level). This idea can be traced back to Agrawal (1995), and has been popularized by Auer et al. (2002), who notably described a policy called UCB1. In this policy, each index $B_{k,s,t}$ is defined by an arm $k$, a time step $t$, an integer $s$ that indicates the number of times arm $k$ has been pulled before round $t$, and is given by:

$$B_{k,s,t} = \hat{X}_{k,s} + \sqrt{\frac{2 \log t}{s}},$$

where $\hat{X}_{k,s}$ is the empirical mean of arm $k$ after $s$ pulls, that is, $\hat{X}_{k,s} = \frac{1}{s} \sum_{u=1}^{s} X_{k,u}$.

To summarize, UCB1 policy first pulls each arm once and then, at each round $t > K$, selects an arm $k$ that maximizes $B_{k,T_k(t-1),t}$.
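The UCB1 rule summarized above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code: the arm interface (a callable taking a random generator and returning a reward in $[0, 1]$), the helper `bernoulli_arm`, and the tie-breaking by lowest arm index are all assumptions made for the example.

```python
import math
import random


def bernoulli_arm(p):
    """Hypothetical helper: an arm returning 1.0 with probability p, else 0.0."""
    return lambda rng: 1.0 if rng.random() < p else 0.0


def ucb1(arms, n, rng):
    """Run UCB1 for n rounds and return the pull counts T_k(n) of each arm."""
    num_arms = len(arms)
    counts = [0] * num_arms   # T_k(t-1)
    sums = [0.0] * num_arms   # cumulative reward of each arm
    for t in range(1, n + 1):
        if t <= num_arms:
            k = t - 1  # initialization: pull each arm once
        else:
            # Index B_{k,s,t} = empirical mean + sqrt(2 log t / s), with s = T_k(t-1).
            k = max(
                range(num_arms),
                key=lambda j: sums[j] / counts[j] + math.sqrt(2.0 * math.log(t) / counts[j]),
            )
        reward = arms[k](rng)
        counts[k] += 1
        sums[k] += reward
    return counts


if __name__ == "__main__":
    rng = random.Random(0)
    print(ucb1([bernoulli_arm(0.9), bernoulli_arm(0.1)], 2000, rng))
```

In a run such as this one, the suboptimal arm is expected to receive only a small share of the pulls, in line with the logarithmic bound on $E_\theta[T_k(n)]$.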
Note that, by means of Hoeffding's inequality, the index $B_{k,T_k(t-1),t}$ is indeed an upper bound of $\mu_k$ with high probability (i.e., the probability is greater than $1 - 1/t^4$). Another way to understand this index is to interpret the empirical mean $\hat{X}_{k,T_k(t-1)}$ as an "exploitation" term, and the square root $\sqrt{2 \log t / s}$ as an "exploration" term (because the latter gradually increases when arm $k$ is not selected).

Policy UCB1 achieves the logarithmic bound (up to a multiplicative constant), as it was shown that:
\[ \forall \theta \in \Theta,\ \forall n \ge 3, \quad \mathbb{E}_\theta[T_k(n)] \le 12 \frac{\log n}{\Delta_k^2} \quad \text{and} \quad \mathbb{E}_\theta R_n \le 12 \sum_{k=1}^{K} \frac{\log n}{\Delta_k} \le 12 K \frac{\log n}{\Delta}. \]

Audibert et al. (2009) studied some variants of the UCB1 policy. Among them, one consists in changing the $2 \log t$ in the exploration term into $\rho \log t$, where $\rho > 0$. This can be interpreted as a way to tune exploration: the smaller $\rho$ is, the better the policy will perform in simple environments where information is disclosed easily (for example when all reward distributions are Dirac measures). On the contrary, $\rho$ has to be greater to face more challenging environments (typically when reward distributions are Bernoulli laws with close parameters).

This policy, which we denote UCB($\rho$), was proven by Audibert et al. to achieve the logarithmic bound when $\rho > 1$, and the optimality was also obtained when $\rho > \frac{1}{2}$ for a variant of UCB($\rho$). Bubeck (2010) showed in his PhD dissertation that their ideas actually make it possible to prove optimality of UCB($\rho$) for $\rho > \frac{1}{2}$. Moreover, the case $\rho = \frac{1}{2}$ corresponds to a confidence level of $\frac{1}{t}$ (because of Hoeffding's inequality, as above), and several studies (Lai and Robbins, 1985; Agrawal, 1995; Burnetas and Katehakis, 1996; Audibert et al., 2009; Honda and Takemura, 2010) have shown that this level is critical.

We complete these works by a precise study of UCB($\rho$) when $\rho \le \frac{1}{2}$. We prove that UCB($\rho$) is $(1 - 2\rho)$-consistent and that it is not $\alpha$-consistent for any $\alpha < 1 - 2\rho$ (in view of the definition above, this means that the expected regret is roughly of order $n^{1-2\rho}$). Thus it does not achieve the logarithmic bound, but it performs well in simple environments, for example, environments where all reward distributions are Dirac measures.

Moreover, we exhibit expected regret bounds of general UCB policies, with the $2 \log t$ in the exploration term of UCB1 replaced by an arbitrary function. We give sufficient conditions for such policies to be Hannan consistent and, as mentioned before, find that lower bounds need not be logarithmic any more.

1.5 Outline

The paper is organized as follows: in Section 2, we give bounds on the expected regret of general UCB policies and of UCB($\rho$) ($\rho \le \frac{1}{2}$), as preliminary results. In Section 3, we focus on $\alpha$-consistent policies. Then, in Section 4, we study the problem of selectivity, and we conclude in Section 5 by general results on the existence of logarithmic lower bounds.

Throughout the paper $\lceil x \rceil$ denotes the smallest integer not less than $x$ whereas $\lfloor x \rfloor$ denotes the largest integer not greater than $x$, $\mathbb{1}_A$ stands for the indicator function of event $A$, Ber($p$) is the Bernoulli law with parameter $p$, and $\delta_x$ is the Dirac measure centred on $x$.

2. Preliminary Results

In this section, we estimate the expected regret of UCB policies. This will be useful for the rest of the paper.

2.1 Bounds on the Expected Regret of General UCB Policies

We first study general UCB policies, defined by:
• Draw each arm once,
• then, at each round $t$, draw an arm $I_t \in \operatorname{argmax}_{k \in \{1,\dots,K\}} B_{k,T_k(t-1),t}$,
where $B_{k,s,t}$ is defined by $B_{k,s,t} = \hat{X}_{k,s} + \sqrt{f_k(t)/s}$ and where the functions $f_k$ ($1 \le k \le K$) are increasing.

This definition is inspired by the popular UCB1 algorithm, for which $f_k(t) = 2 \log t$ for all $k$.
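The general UCB policy just defined can be sketched as follows for environments in which every arm is a Dirac measure, so that the empirical mean of an arm is simply its constant reward. The function name `general_ucb_dirac`, the three exploration functions and the values 0.9 and 0.5 are assumptions made for this illustration. The script compares the number of pulls of the suboptimal arm with the quantity $f_2(n)/\Delta^2 + 1$ that appears in Lemma 1 of the paper.

```python
import math


def general_ucb_dirac(values, f, n):
    """Run a general UCB policy for n rounds when arm k always pays values[k].

    f(k, t) plays the role of the exploration function f_k(t).
    Returns the list of pull counts T_k(n).
    """
    K = len(values)
    pulls = [0] * K
    for t in range(1, n + 1):
        if t <= K:
            k = t - 1  # draw each arm once
        else:
            # index B_{k,s,t} = mean + sqrt(f_k(t) / s); the mean is constant here
            k = max(range(K),
                    key=lambda j: values[j] + math.sqrt(f(j, t) / pulls[j]))
        pulls[k] += 1
    return pulls


a, b = 0.9, 0.5          # hypothetical Dirac rewards, arm 1 is the best arm
gap = a - b              # optimality gap Delta
policies = {
    "UCB1, f(t) = 2 log t": lambda k, t: 2.0 * math.log(t),
    "UCB(0.2), f(t) = 0.2 log t": lambda k, t: 0.2 * math.log(t),
    # log log t is only used for t >= 3; max() keeps it defined for smaller t
    "f(t) = log log t": lambda k, t: math.log(math.log(max(t, 3))),
}

if __name__ == "__main__":
    n = 10_000
    for name, f in policies.items():
        t2 = general_ucb_dirac([a, b], f, n)[1]
        bound = f(1, n) / gap ** 2 + 1
        print(f"{name}: T_2(n) = {t2}, f_2(n)/Delta^2 + 1 = {bound:.2f}")
```

Because the rewards are deterministic, the run involves no randomness, which makes this a convenient sanity check of how the choice of $f_k$ controls the amount of exploration.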
The following lemma estimates the performances of UCB policies in simple environments, for which reward distributions are Dirac measures.

Lemma 1 Let $0 \le b < a \le 1$ and $n \ge 1$. For $\theta = (\delta_a, \delta_b)$, the random variable $T_2(n)$ is uniformly upper bounded by $\frac{1}{\Delta^2} f_2(n) + 1$. Consequently, the expected regret of UCB is upper bounded by $\frac{1}{\Delta} f_2(n) + 1$.

Proof In environment $\theta$, the best arm is arm 1 and $\Delta = \Delta_2 = a - b$. Let us first prove the upper bound of the sampling time. The assertion is true for $n = 1$ and $n = 2$: the first two rounds consist in drawing each arm once, so that $T_2(n) \le 1 \le \frac{1}{\Delta^2} f_2(n) + 1$ for $n \in \{1, 2\}$. If, by contradiction, the assertion is false, then there exists $t \ge 3$ such that $T_2(t) > \frac{1}{\Delta^2} f_2(t) + 1$ and $T_2(t-1) \le \frac{1}{\Delta^2} f_2(t-1) + 1$. Since $f_2(t) \ge f_2(t-1)$, this leads to $T_2(t) > T_2(t-1)$, meaning that arm 2 is drawn at round $t$. Therefore, we have
\[ a + \sqrt{\frac{f_1(t)}{T_1(t-1)}} \le b + \sqrt{\frac{f_2(t)}{T_2(t-1)}}, \]
hence $a - b = \Delta \le \sqrt{f_2(t)/T_2(t-1)}$, which implies $T_2(t-1) \le \frac{1}{\Delta^2} f_2(t)$ and thus $T_2(t) \le \frac{1}{\Delta^2} f_2(t) + 1$. This contradicts the definition of $t$, and this ends the proof of the first statement. The second statement is a direct consequence of Formula (1).

Remark: throughout the paper, we will often use environments with $K = 2$ arms to provide bounds on expected regrets. However, we do not lose generality by doing so, because all corresponding proofs can be written almost identically to suit any $K \ge 2$, by simply assuming that the distribution of each arm $k \ge 3$ is $\delta_0$.

We now give an upper bound of the expected sampling time of any arm such that $\Delta_k > 0$. This bound is valid in any environment, and not only those of the form $(\delta_a, \delta_b)$.

Lemma 2 For any $\theta \in \Theta$ and any $\beta \in (0, 1)$, if $\Delta_k > 0$ the following upper bound holds:
\[ \mathbb{E}_\theta[T_k(n)] \le u + \sum_{t=u+1}^{n} \left( 1 + \frac{\log t}{\log(1/\beta)} \right) \left( e^{-2\beta f_k(t)} + e^{-2\beta f_{k^*}(t)} \right), \quad \text{where } u = \left\lceil \frac{4 f_k(n)}{\Delta_k^2} \right\rceil. \]

An upper bound of the expected regret can be deduced from this lemma thanks to Formula (1).

Proof The core of the proof is a peeling argument and the use of Hoeffding's maximal inequality (see, e.g., Cesa-Bianchi and Lugosi, 2006, section A.1.3 for details). The idea is originally taken from Audibert et al. (2009), and the following is an adaptation of the proof of an upper bound of UCB($\rho$) in the case $\rho > \frac{1}{2}$ which can be found in S. Bubeck's PhD dissertation.

First, let us notice that the policy selects an arm $k$ such that $\Delta_k > 0$ at time step $t \le n$ only if at least one of the three following equations holds:
\[ B_{k^*,T_{k^*}(t-1),t} \le \mu^*, \tag{3} \]
\[ \hat{X}_{k,t} \ge \mu_k + \sqrt{\frac{f_k(t)}{T_k(t-1)}}, \tag{4} \]
\[ T_k(t-1) < \frac{4 f_k(n)}{\Delta_k^2}. \tag{5} \]
Indeed, if none of the equations is true, then:
\[ B_{k^*,T_{k^*}(t-1),t} > \mu^* = \mu_k + \Delta_k \ge \mu_k + 2\sqrt{\frac{f_k(n)}{T_k(t-1)}} > \hat{X}_{k,t} + \sqrt{\frac{f_k(t)}{T_k(t-1)}} = B_{k,T_k(t-1),t}, \]
which implies that arm $k$ can not be chosen at time step $t$.

We denote respectively by $\xi_{1,t}$, $\xi_{2,t}$ and $\xi_{3,t}$ the events corresponding to Equations (3), (4) and (5). We have:
\[ \mathbb{E}_\theta[T_k(n)] = \mathbb{E}_\theta\left[ \sum_{t=1}^{n} \mathbb{1}_{I_t = k} \right] = \mathbb{E}_\theta\left[ \sum_{t=1}^{n} \mathbb{1}_{\{I_t = k\} \cap \xi_{3,t}} \right] + \mathbb{E}_\theta\left[ \sum_{t=1}^{n} \mathbb{1}_{\{I_t = k\} \setminus \xi_{3,t}} \right]. \]

Let us show that the sum $\sum_{t=1}^{n} \mathbb{1}_{\{I_t = k\} \cap \xi_{3,t}}$ is almost surely lower than $u := \lceil 4 f_k(n)/\Delta_k^2 \rceil$. We assume by contradiction that $\sum_{t=1}^{n} \mathbb{1}_{\{I_t = k\} \cap \xi_{3,t}} > u$. Then there exists $m < n$ such that $\sum_{t=1}^{m-1} \mathbb{1}_{\{I_t = k\} \cap \xi_{3,t}} < 4 f_k(n)/\Delta_k^2$ and $\sum_{t=1}^{m} \mathbb{1}_{\{I_t = k\} \cap \xi_{3,t}} = \lceil 4 f_k(n)/\Delta_k^2 \rceil$. Therefore, for any $s > m$, we have:
\[ T_k(s-1) \ge T_k(m) = \sum_{t=1}^{m} \mathbb{1}_{\{I_t = k\}} \ge \sum_{t=1}^{m} \mathbb{1}_{\{I_t = k\} \cap \xi_{3,t}} = \left\lceil \frac{4 f_k(n)}{\Delta_k^2} \right\rceil \ge \frac{4 f_k(n)}{\Delta_k^2}, \]
so that $\mathbb{1}_{\{I_s = k\} \cap \xi_{3,s}} = 0$. But then
\[ \sum_{t=1}^{n} \mathbb{1}_{\{I_t = k\} \cap \xi_{3,t}} = \sum_{t=1}^{m} \mathbb{1}_{\{I_t = k\} \cap \xi_{3,t}} = \left\lceil \frac{4 f_k(n)}{\Delta_k^2} \right\rceil \le u, \]
which is the contradiction expected.

We also have $\sum_{t=1}^{n} \mathbb{1}_{\{I_t = k\} \setminus \xi_{3,t}} = \sum_{t=u+1}^{n} \mathbb{1}_{\{I_t = k\} \setminus \xi_{3,t}}$: since $T_k(t-1) \le t-1$, event $\xi_{3,t}$ always happens at time step $t \in \{1, \dots, u\}$.
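And then, since event $\{I_t = k\}$ is included in $\xi_{1,t} \cup \xi_{2,t} \cup \xi_{3,t}$:
\[ \mathbb{E}_\theta\left[ \sum_{t=u+1}^{n} \mathbb{1}_{\{I_t = k\} \setminus \xi_{3,t}} \right] \le \mathbb{E}_\theta\left[ \sum_{t=u+1}^{n} \mathbb{1}_{\xi_{1,t} \cup \xi_{2,t}} \right] \le \sum_{t=u+1}^{n} \left( \mathbb{P}_\theta(\xi_{1,t}) + \mathbb{P}_\theta(\xi_{2,t}) \right). \]

It remains to find upper bounds of $\mathbb{P}_\theta(\xi_{1,t})$ and $\mathbb{P}_\theta(\xi_{2,t})$. To this aim, we apply the peeling argument with a geometric grid over the time interval $[1, t]$:
\[
\begin{aligned}
\mathbb{P}_\theta(\xi_{1,t}) &= \mathbb{P}_\theta\left( B_{k^*,T_{k^*}(t-1),t} \le \mu^* \right) = \mathbb{P}_\theta\left( \hat{X}_{k^*,T_{k^*}(t-1)} + \sqrt{\frac{f_{k^*}(t)}{T_{k^*}(t-1)}} \le \mu^* \right) \\
&\le \mathbb{P}_\theta\left( \exists s \in \{1, \dots, t\},\ \hat{X}_{k^*,s} + \sqrt{\frac{f_{k^*}(t)}{s}} \le \mu^* \right) \\
&\le \sum_{j=0}^{\frac{\log t}{\log(1/\beta)}} \mathbb{P}_\theta\left( \exists s : \beta^{j+1} t < s \le \beta^{j} t,\ \hat{X}_{k^*,s} + \sqrt{\frac{f_{k^*}(t)}{s}} \le \mu^* \right) \\
&= \sum_{j=0}^{\frac{\log t}{\log(1/\beta)}} \mathbb{P}_\theta\left( \exists s : \beta^{j+1} t < s \le \beta^{j} t,\ \sum_{l=1}^{s} (X_{k^*,l} - \mu^*) \le -\sqrt{s f_{k^*}(t)} \right) \\
&\le \sum_{j=0}^{\frac{\log t}{\log(1/\beta)}} \mathbb{P}_\theta\left( \exists s : \beta^{j+1} t < s \le \beta^{j} t,\ \sum_{l=1}^{s} (\mu^* - X_{k^*,l}) \ge \sqrt{\beta^{j+1} t f_{k^*}(t)} \right) \\
&\le \sum_{j=0}^{\frac{\log t}{\log(1/\beta)}} \mathbb{P}_\theta\left( \max_{s \le \beta^{j} t} \sum_{l=1}^{s} (\mu^* - X_{k^*,l}) \ge \sqrt{\beta^{j+1} t f_{k^*}(t)} \right).
\end{aligned}
\]
As the range of the random variables $(X_{k^*,l})_{1 \le l \le s}$ is $[0, 1]$, Hoeffding's maximal inequality gives:
\[ \mathbb{P}_\theta(\xi_{1,t}) \le \sum_{j=0}^{\frac{\log t}{\log(1/\beta)}} \exp\left( -\frac{2 \left( \sqrt{\beta^{j+1} t f_{k^*}(t)} \right)^2}{\beta^{j} t} \right) \le \left( \frac{\log t}{\log(1/\beta)} + 1 \right) e^{-2\beta f_{k^*}(t)}. \]
Similarly, we have:
\[ \mathbb{P}_\theta(\xi_{2,t}) \le \left( \frac{\log t}{\log(1/\beta)} + 1 \right) e^{-2\beta f_k(t)}, \]
and the result follows from the combination of previous inequalities.

2.2 Bounds on the Expected Regret of UCB($\rho$), $\rho \le \frac{1}{2}$

We study the performances of the UCB($\rho$) policy, with $\rho \in (0, \frac{1}{2}]$. We recall that UCB($\rho$) is the UCB policy defined by $f_k(t) = \rho \log(t)$ for all $k$, that is, $B_{k,s,t} = \hat{X}_{k,s} + \sqrt{\rho \log t / s}$. Small values of $\rho$ can be interpreted as a low level of experimentation in the balance between exploration and exploitation. Precise regret bound orders of UCB($\rho$) when $\rho \in (0, \frac{1}{2}]$ are not documented in the literature.

We first give an upper bound of expected regret in simple environments, where it is supposed to perform well. As stated in the following lemma (which is a direct consequence of Lemma 1), the order of the bound is $\frac{\rho \log n}{\Delta}$.

Lemma 3 Let $0 \le b < a \le 1$ and $n \ge 1$.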
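For $\theta = (\delta_a, \delta_b)$, the random variable $T_2(n)$ is uniformly upper bounded by $\frac{\rho}{\Delta^2} \log(n) + 1$. Consequently, the expected regret of UCB($\rho$) is upper bounded by $\frac{\rho}{\Delta} \log(n) + 1$.

One can show that the expected regret of UCB($\rho$) is actually equivalent to $\frac{\rho \log n}{\Delta}$ as $n$ goes to infinity. These good performances are compensated by poor results in more complex environments, as shown in the following theorem. We exhibit an expected regret upper bound which is valid for any $\theta \in \Theta$, and which is roughly of order $n^{1-2\rho}$. We also show that this upper bound is asymptotically optimal. Thus, with $\rho \in (0, \frac{1}{2})$, UCB($\rho$) does not perform enough exploration to achieve the logarithmic bound, as opposed to UCB($\rho$) with $\rho \in (\frac{1}{2}, +\infty)$.

Theorem 4 For any $\rho \in (0, \frac{1}{2}]$, any $\theta \in \Theta$ and any $\beta \in (0, 1)$, one has
\[ \mathbb{E}_\theta[R_n] \le \sum_{k : \Delta_k > 0} \left[ \frac{4 \rho \log n}{\Delta_k} + \Delta_k + 2 \Delta_k \left( \frac{\log n}{\log(1/\beta)} + 1 \right) \frac{n^{1-2\rho\beta}}{1 - 2\rho\beta} \right]. \]
Moreover, if $\Theta$ has the Dirac/Bernoulli property, then for any $\varepsilon > 0$ there exists $\theta \in \Theta$ such that
\[ \lim_{n \to +\infty} \frac{\mathbb{E}_\theta[R_n]}{n^{1-2\rho-\varepsilon}} = +\infty. \]

The value $\rho = \frac{1}{2}$ is critical, but we can deduce from the upper bound of this theorem that UCB($\frac{1}{2}$) is consistent in the classical sense of Lai and Robbins (1985) and of Burnetas and Katehakis (1996).

Proof We set $u = \left\lceil \frac{4 \rho \log n}{\Delta_k^2} \right\rceil$. By Lemma 2 we get:
\[
\begin{aligned}
\mathbb{E}_\theta[T_k(n)] &\le u + 2 \sum_{t=u+1}^{n} \left( \frac{\log t}{\log(1/\beta)} + 1 \right) e^{-2\beta\rho \log(t)} \\
&= u + 2 \sum_{t=u+1}^{n} \left( \frac{\log t}{\log(1/\beta)} + 1 \right) \frac{1}{t^{2\rho\beta}} \\
&\le u + 2 \left( \frac{\log n}{\log(1/\beta)} + 1 \right) \sum_{t=1}^{n} \frac{1}{t^{2\rho\beta}} \\
&\le u + 2 \left( \frac{\log n}{\log(1/\beta)} + 1 \right) \left( 1 + \sum_{t=2}^{n} \frac{1}{t^{2\rho\beta}} \right) \\
&\le u + 2 \left( \frac{\log n}{\log(1/\beta)} + 1 \right) \left( 1 + \int_{1}^{n} \frac{1}{t^{2\rho\beta}} \, dt \right) \\
&\le u + 2 \left( \frac{\log n}{\log(1/\beta)} + 1 \right) \frac{n^{1-2\rho\beta}}{1 - 2\rho\beta}.
\end{aligned}
\]
As usual, the upper bound of the expected regret follows from Formula (1).

Now, let us show the lower bound. The result is obtained by considering an environment $\theta$ of the form $\left( \mathrm{Ber}(\frac{1}{2}), \delta_{\frac{1}{2} - \Delta} \right)$, where $\Delta$ lies in $(0, \frac{1}{2})$ and is such that $2\rho(1 + \sqrt{\Delta})^2 < 2\rho + \varepsilon$. This notation is obviously consistent with the definition of $\Delta$ as an optimality gap. We set $T_n := \lceil \frac{\rho \log n}{\Delta} \rceil$, and define the event $\xi_n$ by:
\[ \xi_n = \left\{ \hat{X}_{1,T_n} < \frac{1}{2} - \left( 1 + \frac{1}{\sqrt{\Delta}} \right) \Delta \right\}. \]
When event $\xi_n$ occurs, one has for any $t \in \{T_n, \dots, n\}$
\[ \hat{X}_{1,T_n} + \sqrt{\frac{\rho \log t}{T_n}} \le \hat{X}_{1,T_n} + \sqrt{\frac{\rho \log n}{T_n}} < \frac{1}{2} - \left( 1 + \frac{1}{\sqrt{\Delta}} \right) \Delta + \sqrt{\Delta} \le \frac{1}{2} - \Delta, \]
so that arm 1 is chosen no more than $T_n$ times by the UCB($\rho$) policy. Consequently:
\[ \mathbb{E}_\theta[T_2(n)] \ge \mathbb{P}_\theta(\xi_n)(n - T_n). \]

We will now find a lower bound of the probability of $\xi_n$ thanks to the Berry-Esseen inequality. We denote by $C$ the corresponding constant, and by $\Phi$ the c.d.f. of the standard normal distribution. For convenience, we also define the following quantities:
\[ \sigma := \sqrt{\mathbb{E}\left[ \left( X_{1,1} - \tfrac{1}{2} \right)^2 \right]} = \frac{1}{2}, \qquad M_3 := \mathbb{E}\left[ \left| X_{1,1} - \tfrac{1}{2} \right|^3 \right] = \frac{1}{8}. \]
Using the fact that $\Phi(-x) = \frac{e^{-x^2/2}}{\sqrt{2\pi}\, x} \beta(x)$ with $\beta(x) \to 1$ as $x \to +\infty$, we have:
\[
\begin{aligned}
\mathbb{P}_\theta(\xi_n) &= \mathbb{P}_\theta\left( \frac{\hat{X}_{1,T_n} - \frac{1}{2}}{\sigma} \sqrt{T_n} \le -2 \left( 1 + \frac{1}{\sqrt{\Delta}} \right) \Delta \sqrt{T_n} \right) \\
&\ge \Phi\left( -2 (\Delta + \sqrt{\Delta}) \sqrt{T_n} \right) - \frac{C M_3}{\sigma^3 \sqrt{T_n}} \\
&\ge \frac{\exp\left( -2 (\Delta + \sqrt{\Delta})^2 T_n \right)}{2 \sqrt{2\pi} (\Delta + \sqrt{\Delta}) \sqrt{T_n}} \, \beta\left( 2 (\Delta + \sqrt{\Delta}) \sqrt{T_n} \right) - \frac{C M_3}{\sigma^3 \sqrt{T_n}} \\
&\ge \frac{\exp\left( -2 (\Delta + \sqrt{\Delta})^2 \left( \frac{\rho \log n}{\Delta} + 1 \right) \right)}{2 \sqrt{2\pi} (\Delta + \sqrt{\Delta}) \sqrt{T_n}} \, \beta\left( 2 (\Delta + \sqrt{\Delta}) \sqrt{T_n} \right) - \frac{C M_3}{\sigma^3 \sqrt{T_n}} \\
&\ge \frac{n^{-2\rho(1 + \sqrt{\Delta})^2} \exp\left( -2 (\Delta + \sqrt{\Delta})^2 \right)}{2 \sqrt{2\pi} (\Delta + \sqrt{\Delta}) \sqrt{T_n}} \, \beta\left( 2 (\Delta + \sqrt{\Delta}) \sqrt{T_n} \right) - \frac{C M_3}{\sigma^3 \sqrt{T_n}}.
\end{aligned}
\]
Previous calculations and Formula (1) give
\[ \mathbb{E}_\theta[R_n] = \Delta \mathbb{E}_\theta[T_2(n)] \ge \Delta \mathbb{P}_\theta(\xi_n)(n - T_n), \]
so that we finally obtain a lower bound of $\mathbb{E}_\theta[R_n]$ of order $\frac{n^{1 - 2\rho(1 + \sqrt{\Delta})^2}}{\sqrt{\log n}}$. Therefore, $\frac{\mathbb{E}_\theta[R_n]}{n^{1-2\rho-\varepsilon}}$ is at least of order $\frac{n^{2\rho + \varepsilon - 2\rho(1 + \sqrt{\Delta})^2}}{\sqrt{\log n}}$. Since $2\rho + \varepsilon - 2\rho(1 + \sqrt{\Delta})^2 > 0$, the numerator goes to infinity, faster than $\sqrt{\log n}$. This concludes the proof.

3. Bounds on the Class of α-consistent Policies

In this section, our aim is to find how the classical results of Lai and Robbins (1985) and of Burnetas and Katehakis (1996) can be generalised if we do not restrict the study to consistent policies. As a by-product, we will adapt their results to the present setting, which is simpler than their parametric frameworks.

We recall that a policy is consistent if its expected regret is $o(n^a)$ for all $a > 0$ in all environments $\theta \in \Theta$.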
A natural way to relax this definition is the following.

Definition 5 A policy is α-consistent if
\[ \forall a > \alpha,\ \forall \theta \in \Theta, \quad \mathbb{E}_\theta[R_n] = o(n^a). \]

For example, we showed in the previous section that, for any $\rho \in (0, \frac{1}{2}]$, UCB($\rho$) is $(1 - 2\rho)$-consistent and not $\alpha$-consistent if $\alpha < 1 - 2\rho$.

Note that the relevant range of $\alpha$ in this definition is $[0, 1)$: the case $\alpha = 0$ corresponds to the standard definition of consistency (so that throughout the paper the term "consistent" also means "0-consistent"), and any value $\alpha \ge 1$ is pointless as any policy is then $\alpha$-consistent. Indeed, the expected regret of any policy is at most of order $n$. This also leads us to wonder what happens if we only require the expected regret to be $o(n)$:
\[ \forall \theta \in \Theta, \quad \mathbb{E}_\theta[R_n] = o(n). \]
This requirement corresponds to the definition of Hannan consistency. The class of Hannan consistent policies includes consistent policies and $\alpha$-consistent policies for any $\alpha \in [0, 1)$. Some results about this class will be obtained in Section 5.

We focus on regret lower bounds on $\alpha$-consistent policies. We first show that the main result of Burnetas and Katehakis can be extended in the following way.

Theorem 6 Assume that $\Theta$ has the product property. Fix an $\alpha$-consistent policy and $\theta \in \Theta$. If $\Delta_k > 0$ and if $0 < D_k(\theta) < \infty$, then
\[ \forall \varepsilon > 0, \quad \lim_{n \to +\infty} \mathbb{P}_\theta\left( T_k(n) \ge (1 - \varepsilon) \frac{(1 - \alpha) \log n}{D_k(\theta)} \right) = 1. \]
Consequently
\[ \liminf_{n \to +\infty} \frac{\mathbb{E}_\theta[T_k(n)]}{\log n} \ge \frac{1 - \alpha}{D_k(\theta)}. \]

Recall that the lower bound of the expected regret is then deduced from Formula (1), and that the coefficient $D_k(\theta)$ is defined by:
\[ D_k(\theta) := \inf_{\tilde{\nu}_k \in \Theta_k : \mathbb{E}[\tilde{\nu}_k] > \mu^*} \mathrm{KL}(\nu_k, \tilde{\nu}_k), \]
where $\mathrm{KL}(\nu, \mu)$ denotes the Kullback-Leibler divergence of measures $\nu$ and $\mu$.

Note that, as opposed to Burnetas and Katehakis (1996), there is no optimal policy in general (i.e., a policy that would achieve the lower bound in all environments $\theta$). This can be explained intuitively as follows. If by contradiction there existed such a policy, its expected regret would be of order $\log n$ and consequently it would be (0-)consistent.
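Then the lower bounds in the case of 0-consistency would necessarily hold. This can not happen if $\alpha > 0$ because $\frac{1 - \alpha}{D_k(\theta)} < \frac{1}{D_k(\theta)}$. Nevertheless, this argument is not rigorous because the fact that the regret would be of order $\log n$ is only valid for environments $\theta$ such that $0 < D_k(\theta) < \infty$. The non-existence of optimal policies is implied by a stronger result of the next section (yet, only if $\alpha > 0.2$).

Proof We adapt Proposition 1 in Burnetas and Katehakis (1996) and its proof. Let us denote $\theta = (\nu_1, \dots, \nu_K)$. We fix $\varepsilon > 0$, and we want to show that:
\[ \lim_{n \to +\infty} \mathbb{P}_\theta\left( \frac{T_k(n)}{\log n} < \frac{(1 - \varepsilon)(1 - \alpha)}{D_k(\theta)} \right) = 0. \]
Set $\delta > 0$ and $\delta' > \alpha$ such that $\frac{1 - \delta'}{1 + \delta} > (1 - \varepsilon)(1 - \alpha)$. By definition of $D_k(\theta)$, there exists $\tilde{\nu}_k$ such that $\mathbb{E}[\tilde{\nu}_k] > \mu^*$ and
\[ D_k(\theta) < \mathrm{KL}(\nu_k, \tilde{\nu}_k) < (1 + \delta) D_k(\theta). \]
Let us set $\tilde{\theta} = (\nu_1, \dots, \nu_{k-1}, \tilde{\nu}_k, \nu_{k+1}, \dots, \nu_K)$. Environment $\tilde{\theta}$ lies in $\Theta$ by the product property and arm $k$ is its best arm. Define $I^{\delta} = \mathrm{KL}(\nu_k, \tilde{\nu}_k)$ and
\[ A_n^{\delta'} := \left\{ T_k(n) < \frac{1 - \delta'}{I^{\delta}} \log n \right\}, \qquad C_n^{\delta''} := \left\{ \log L_{T_k(n)} \le (1 - \delta'') \log n \right\}, \]
where $\delta''$ is such that $\alpha < \delta'' < \delta'$ and $L_t$ is defined by $\log L_t = \sum_{s=1}^{t} \log\left( \frac{d\nu_k}{d\tilde{\nu}_k}(X_{k,s}) \right)$.

Now, we show that $\mathbb{P}_\theta(A_n^{\delta'}) = \mathbb{P}_\theta(A_n^{\delta'} \cap C_n^{\delta''}) + \mathbb{P}_\theta(A_n^{\delta'} \setminus C_n^{\delta''}) \to 0$ as $n \to +\infty$.

On the one hand, one has:
\[ \mathbb{P}_\theta(A_n^{\delta'} \cap C_n^{\delta''}) \le n^{1 - \delta''} \, \mathbb{P}_{\tilde{\theta}}(A_n^{\delta'} \cap C_n^{\delta''}) \tag{6} \]
\[ \le n^{1 - \delta''} \, \mathbb{P}_{\tilde{\theta}}(A_n^{\delta'}) = n^{1 - \delta''} \, \mathbb{P}_{\tilde{\theta}}\left( n - T_k(n) > n - \frac{1 - \delta'}{I^{\delta}} \log n \right) \]
\[ \le \frac{n^{1 - \delta''} \, \mathbb{E}_{\tilde{\theta}}[n - T_k(n)]}{n - \frac{1 - \delta'}{I^{\delta}} \log n} \tag{7} \]
\[ = \frac{n^{-\delta''} \, \mathbb{E}_{\tilde{\theta}}\left[ \sum_{\ell=1}^{K} T_\ell(n) - T_k(n) \right]}{1 - \frac{1 - \delta'}{I^{\delta}} \frac{\log n}{n}} \]
\[ \le \frac{\sum_{\ell \ne k} n^{-\delta''} \, \mathbb{E}_{\tilde{\theta}}[T_\ell(n)]}{1 - \frac{1 - \delta'}{I^{\delta}} \frac{\log n}{n}} \xrightarrow[n \to +\infty]{} 0. \tag{8} \]

Equation (6) results from a partition of $A_n^{\delta'}$ into events $\{T_k(n) = a\}$, $0 \le a < \frac{1 - \delta'}{I^{\delta}} \log n$. Each event $\{T_k(n) = a\} \cap C_n^{\delta''}$ equals $\{T_k(n) = a\} \cap \left\{ \prod_{s=1}^{a} \frac{d\nu_k}{d\tilde{\nu}_k}(X_{k,s}) \le n^{1 - \delta''} \right\}$ and is measurable with respect to $X_{k,1}, \dots, X_{k,a}$ and to $X_{\ell,1}, \dots, X_{\ell,n}$ ($\ell \ne k$). Thus, $\mathbb{1}_{\{T_k(n) = a\} \cap C_n^{\delta''}}$ can be written as a function $f$ of the latter random variables and we have:
\[
\begin{aligned}
\mathbb{P}_\theta\left( \{T_k(n) = a\} \cap C_n^{\delta''} \right) &= \int f\left( (x_{k,s})_{1 \le s \le a}, (x_{\ell,s})_{\ell \ne k, 1 \le s \le n} \right) \prod_{\substack{\ell \ne k \\ 1 \le s \le n}} d\nu_\ell(x_{\ell,s}) \prod_{1 \le s \le a} d\nu_k(x_{k,s}) \\
&\le \int f\left( (x_{k,s})_{1 \le s \le a}, (x_{\ell,s})_{\ell \ne k, 1 \le s \le n} \right) \prod_{\substack{\ell \ne k \\ 1 \le s \le n}} d\nu_\ell(x_{\ell,s}) \; n^{1 - \delta''} \prod_{1 \le s \le a} d\tilde{\nu}_k(x_{k,s}) \\
&= n^{1 - \delta''} \, \mathbb{P}_{\tilde{\theta}}\left( \{T_k(n) = a\} \cap C_n^{\delta''} \right).
\end{aligned}
\]
Equation (7) is a consequence of Markov's inequality, and the limit in (8) is a consequence of $\alpha$-consistency.

On the other hand, we set $b_n := \frac{1 - \delta'}{I^{\delta}} \log n$, so that
\[
\begin{aligned}
\mathbb{P}_\theta(A_n^{\delta'} \setminus C_n^{\delta''}) &\le \mathbb{P}_\theta\left( \max_{j \le \lfloor b_n \rfloor} \log L_j > (1 - \delta'') \log n \right) \\
&\le \mathbb{P}_\theta\left( \frac{1}{b_n} \max_{j \le \lfloor b_n \rfloor} \log L_j > I^{\delta} \frac{1 - \delta''}{1 - \delta'} \right).
\end{aligned}
\]
This term tends to zero, as a consequence of the law of large numbers.

Now that $\mathbb{P}_\theta(A_n^{\delta'})$ tends to zero, the conclusion results from
\[ \frac{1 - \delta'}{I^{\delta}} > \frac{1 - \delta'}{(1 + \delta) D_k(\theta)} \ge \frac{(1 - \varepsilon)(1 - \alpha)}{D_k(\theta)}. \]

The previous lower bound is asymptotically optimal with respect to its dependence on $\alpha$, as claimed in the following proposition.

Proposition 7 Assume that $\Theta$ has the Dirac/Bernoulli property. There exist $\theta \in \Theta$ and a constant $c > 0$ such that, for any $\alpha \in [0, 1)$, there exists an $\alpha$-consistent policy such that:
\[ \liminf_{n \to +\infty} \frac{\mathbb{E}_\theta[T_k(n)]}{(1 - \alpha) \log n} \le c, \]
for any $k$ satisfying $\Delta_k > 0$.

Proof In any environment of the form $\theta = (\delta_a, \delta_b)$ with $a \ne b$, Lemma 3 implies the following estimate for UCB($\rho$):
\[ \liminf_{n \to +\infty} \frac{\mathbb{E}_\theta T_k(n)}{\log n} \le \frac{\rho}{\Delta^2}, \]
where $k \ne k^*$. Because $\frac{1 - \alpha}{2} \in (0, \frac{1}{2}]$ and since UCB($\rho$) is $(1 - 2\rho)$-consistent for any $\rho \in (0, \frac{1}{2}]$ (Theorem 4), we obtain the result by choosing the $\alpha$-consistent policy UCB($\frac{1 - \alpha}{2}$) and by setting $c = \frac{1}{2\Delta^2}$.

4. Selectivity

In this section, we address the problem of selectivity. By selectivity, we mean the ability to adapt to the environment as and when rewards are observed. More precisely, a set of two (or more) policies is given. The one that performs the best depends on environment $\theta$. We wonder if there exists an adaptive procedure that, given any environment $\theta$, would be as good as any policy in the given set.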
Two major reasons motivate this study. On the one hand this question was answered by Burnetas and Katehakis within the class of consistent policies. They exhibit an asymptotically optimal policy, that is, one that achieves the regret lower bounds they have proven. The fact that a policy performs as well as any other one obviously solves the problem of selectivity.

On the other hand, this problem has already been studied in the context of adversarial bandits by Auer et al. (2003). Their setting differs from ours not only because their bandits are nonstochastic, but also because their adaptive procedure takes only into account a given number of recommendations, whereas in our setting the adaptation is supposed to come from observing rewards of the chosen arms (only one per time step). Nevertheless, one can wonder if an "exponentially weighted forecasters" procedure like EXP4 could be transposed to our context. The answer is negative, as stated in the following theorem.

To avoid confusion, we make the notations of the regret and of sampling time more precise by adding the considered policy: under policy $\mathcal{A}$, $R_n$ and $T_k(n)$ will be respectively denoted $R_n(\mathcal{A})$ and $T_k(n, \mathcal{A})$.

Theorem 8 Let $\tilde{\mathcal{A}}$ be a consistent policy and let $\rho$ be a real number in $(0, 0.4)$. If $\Theta$ has the Dirac/Bernoulli property and the product property, there is no policy which can beat both $\tilde{\mathcal{A}}$ and UCB($\rho$), that is:
\[ \forall \mathcal{A},\ \exists \theta \in \Theta, \quad \limsup_{n \to +\infty} \frac{\mathbb{E}_\theta[R_n(\mathcal{A})]}{\min\left( \mathbb{E}_\theta[R_n(\tilde{\mathcal{A}})], \mathbb{E}_\theta[R_n(\mathrm{UCB}(\rho))] \right)} > 1. \]

Thus the existence of optimal policies does not hold when we extend the notion of consistency. Precisely, as UCB($\rho$) is $(1 - 2\rho)$-consistent, we have shown that there is no optimal policy within the class of $\alpha$-consistent policies, with $\alpha > 0.2$. Consequently, there do not exist optimal policies in the class of Hannan consistent policies either.

Moreover, Theorem 8 shows that methods that would be inspired by related literature in adversarial bandits can not apply to our framework.
As we said, this impossibility may come from the fact that we can not observe at each time step the decisions and rewards of more than one algorithm. If we were able to observe a given set of policies from step to step, then it would be easy to beat them all: it would be sufficient to aggregate all the observations and simply pull the arm with the greater empirical mean. The case where we only observe decisions (and not rewards) of a set of policies may be interesting, but is left outside of the scope of this paper.

Proof Assume by contradiction that
\[ \exists \mathcal{A},\ \forall \theta \in \Theta, \quad \limsup_{n \to +\infty} u_{n,\theta} \le 1, \]
where $u_{n,\theta} = \frac{\mathbb{E}_\theta[R_n(\mathcal{A})]}{\min\left( \mathbb{E}_\theta[R_n(\tilde{\mathcal{A}})], \mathbb{E}_\theta[R_n(\mathrm{UCB}(\rho))] \right)}$.

For any $\theta$, we have
\[ \mathbb{E}_\theta[R_n(\mathcal{A})] = \frac{\mathbb{E}_\theta[R_n(\mathcal{A})]}{\mathbb{E}_\theta[R_n(\tilde{\mathcal{A}})]} \mathbb{E}_\theta[R_n(\tilde{\mathcal{A}})] \le u_{n,\theta} \, \mathbb{E}_\theta[R_n(\tilde{\mathcal{A}})], \tag{9} \]
so that the fact that $\tilde{\mathcal{A}}$ is a consistent policy implies that $\mathcal{A}$ is also consistent. Consequently the lower bound of Theorem 6 also holds for policy $\mathcal{A}$.

For the rest of the proof, we focus on environments of the form $\theta = (\delta_0, \delta_\Delta)$ with $\Delta > 0$. In this case, arm 2 is the best arm, so that we have to compute $D_1(\theta)$. On the one hand, we have:
\[ D_1(\theta) = \inf_{\tilde{\nu}_1 \in \Theta_1 : \mathbb{E}[\tilde{\nu}_1] > \mu^*} \mathrm{KL}(\nu_1, \tilde{\nu}_1) = \inf_{\tilde{\nu}_1 \in \Theta_1 : \mathbb{E}[\tilde{\nu}_1] > \Delta} \mathrm{KL}(\delta_0, \tilde{\nu}_1) = \inf_{\tilde{\nu}_1 \in \Theta_1 : \mathbb{E}[\tilde{\nu}_1] > \Delta} \log \frac{1}{\tilde{\nu}_1(0)}. \]
As $\mathbb{E}[\tilde{\nu}_1] \le 1 - \tilde{\nu}_1(0)$, we get:
\[ D_1(\theta) \ge \inf_{\tilde{\nu}_1 \in \Theta_1 : 1 - \tilde{\nu}_1(0) \ge \Delta} \log \frac{1}{\tilde{\nu}_1(0)} \ge \log \frac{1}{1 - \Delta}. \]
On the other hand, we have for any $\varepsilon > 0$:
\[ D_1(\theta) \le \mathrm{KL}(\delta_0, \mathrm{Ber}(\Delta + \varepsilon)) = \log \frac{1}{1 - \Delta - \varepsilon}. \]
Consequently $D_1(\theta) = \log \frac{1}{1 - \Delta}$, and the lower bound of Theorem 6 reads:
\[ \liminf_{n \to +\infty} \frac{\mathbb{E}_\theta[T_1(n, \mathcal{A})]}{\log n} \ge \frac{1}{\log \frac{1}{1 - \Delta}}. \]
Just like Equation (9), we have:
\[ \mathbb{E}_\theta[R_n(\mathcal{A})] \le u_{n,\theta} \, \mathbb{E}_\theta[R_n(\mathrm{UCB}(\rho))]. \]
Moreover, Lemma 3 provides:
\[ \mathbb{E}_\theta[R_n(\mathrm{UCB}(\rho))] \le 1 + \frac{\rho \log n}{\Delta}. \]
Now, by gathering the three previous inequalities and Formula (1), we get:
\[
\begin{aligned}
\frac{1}{\log \frac{1}{1 - \Delta}} &\le \liminf_{n \to +\infty} \frac{\mathbb{E}_\theta[T_1(n, \mathcal{A})]}{\log n} = \liminf_{n \to +\infty} \frac{\mathbb{E}_\theta[R_n(\mathcal{A})]}{\Delta \log n} \\
&\le \liminf_{n \to +\infty} \frac{u_{n,\theta} \, \mathbb{E}_\theta[R_n(\mathrm{UCB}(\rho))]}{\Delta \log n} \le \liminf_{n \to +\infty} \frac{u_{n,\theta}}{\Delta \log n} \left( 1 + \frac{\rho \log n}{\Delta} \right) \\
&\le \liminf_{n \to +\infty} \frac{u_{n,\theta}}{\Delta \log n} + \liminf_{n \to +\infty} \frac{\rho \, u_{n,\theta}}{\Delta^2} = \frac{\rho}{\Delta^2} \liminf_{n \to +\infty} u_{n,\theta} \le \frac{\rho}{\Delta^2} \limsup_{n \to +\infty} u_{n,\theta} \\
&\le \frac{\rho}{\Delta^2}.
\end{aligned}
\]
This means that $\rho$ has to be lower bounded by $\frac{\Delta^2}{\log\left( \frac{1}{1 - \Delta} \right)}$, but this quantity is greater than 0.4 if $\Delta = 0.75$ (its value is then $0.5625 / \log 4 \approx 0.406$), hence the contradiction.

Note that this proof gives a simple alternative to Theorem 4 to show that UCB($\rho$) is not consistent (if $\rho \le 0.4$). Indeed if it were consistent, then in environment $\theta = (\delta_0, \delta_\Delta)$ the same contradiction between the lower bound of Theorem 6 and the upper bound of Lemma 3 would hold.

5. General Bounds

In this section, we study lower bounds on the expected regret with few requirements on $\Theta$ and on the class of policies. With a simple property on $\Theta$ but without any assumption on the policy, we show that there always exist logarithmic lower bounds for some environments $\theta$. Then, still with a simple property on $\Theta$, we show that there exists a Hannan consistent policy for which the expected regret is sub-logarithmic for some environment $\theta$.

Note that the policy that always pulls arm 1 has zero expected regret in environments where arm 1 has the best mean reward, and an expected regret of order $n$ in other environments. So, for this policy, expected regret is sub-logarithmic in some environments. Nevertheless, this policy is not Hannan consistent because its expected regret is not always $o(n)$.

5.1 The Necessity of a Logarithmic Regret in Some Environments

The necessity of a logarithmic regret in some environments can be explained by a simple sketch proof.
Assume that the agent knows the number of rounds $n$, and that he balances exploration and exploitation in the following way: he first pulls each arm $s(n)$ times, and then selects the arm that has obtained the best empirical mean for the rest of the game. Denote by $p_{s(n)}$ the probability that the best arm does not have the best empirical mean after the exploration phase (i.e., after the first $K s(n)$ rounds). The expected regret is then of the form
\[ c_1 (1 - p_{s(n)}) s(n) + c_2 \, p_{s(n)} \, n. \tag{10} \]
Indeed, if the agent manages to match the best arm then he only suffers the pulls of suboptimal arms during the exploration phase. That represents an expected regret of order $s(n)$. If not, the number of pulls of suboptimal arms is of order $n$, and so is the expected regret.

Now, let us approximate $p_{s(n)}$. It has the same order as the probability that the best arm gets an empirical mean lower than the second best mean reward. Moreover, $\frac{\hat{X}_{k^*,s(n)} - \mu^*}{\sigma} \sqrt{s(n)}$ (where $\sigma^2$ is the variance of $X_{k^*,1}$) has approximately a standard normal distribution by the central limit theorem. Therefore, we have:
\[
\begin{aligned}
p_{s(n)} &\approx \mathbb{P}_\theta\left( \hat{X}_{k^*,s(n)} \le \mu^* - \Delta \right) = \mathbb{P}_\theta\left( \frac{\hat{X}_{k^*,s(n)} - \mu^*}{\sigma} \sqrt{s(n)} \le -\frac{\Delta \sqrt{s(n)}}{\sigma} \right) \\
&\approx \frac{1}{\sqrt{2\pi}} \frac{\sigma}{\Delta \sqrt{s(n)}} \exp\left( -\frac{1}{2} \left( \frac{\Delta \sqrt{s(n)}}{\sigma} \right)^2 \right) \\
&= \frac{1}{\sqrt{2\pi}} \frac{\sigma}{\Delta \sqrt{s(n)}} \exp\left( -\frac{\Delta^2 s(n)}{2 \sigma^2} \right).
\end{aligned}
\]
It follows that the expected regret has to be at least logarithmic. Indeed, to ensure that the second term $c_2 \, p_{s(n)} \, n$ of Equation (10) is sub-logarithmic, $s(n)$ has to be greater than $\log n$. But then the first term $c_1 (1 - p_{s(n)}) s(n)$ is greater than $\log n$.

Actually, the necessity of a logarithmic regret can be written as a consequence of Theorem 6. Indeed, if we assume by contradiction that $\limsup_{n \to +\infty} \frac{\mathbb{E}_\theta R_n}{\log n} = 0$ for all $\theta$ (i.e., $\mathbb{E}_\theta R_n = o(\log n)$), the considered policy is consistent. Consequently, Theorem 6 implies that
\[ \limsup_{n \to +\infty} \frac{\mathbb{E}_\theta R_n}{\log n} \ge \liminf_{n \to +\infty} \frac{\mathbb{E}_\theta R_n}{\log n} > 0. \]
Yet, this reasoning needs $\Theta$ to have the product property, and conditions of the form $0 < D_k(\theta) < \infty$ also have to hold.
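The trade-off in Expression (10) can be explored numerically. The sketch below is a heuristic illustration only: it plugs the Gaussian-tail approximation of $p_{s(n)}$ into Expression (10) and searches for the exploration length that minimises it. The constants $c_1 = c_2 = 1$, the gap 0.2, the standard deviation 0.5 and all function names are assumptions chosen for the example, not values from the paper.

```python
import math


def failure_prob(s, gap, sigma):
    """Gaussian-tail approximation of p_s from the sketch, capped at 1."""
    x = gap * math.sqrt(s) / sigma
    return min(1.0, math.exp(-0.5 * x * x) / (math.sqrt(2.0 * math.pi) * x))


def sketch_regret(s, n, gap, sigma, c1=1.0, c2=1.0):
    """Expression (10): c1 (1 - p_s) s + c2 p_s n, with hypothetical c1, c2."""
    p = failure_prob(s, gap, sigma)
    return c1 * (1.0 - p) * s + c2 * p * n


def best_exploration_length(n, gap, sigma):
    """Integer s in [1, n] that minimises the heuristic regret."""
    return min(range(1, n + 1), key=lambda s: sketch_regret(s, n, gap, sigma))


if __name__ == "__main__":
    for n in (10**3, 10**4, 10**5):
        s_star = best_exploration_length(n, gap=0.2, sigma=0.5)
        regret = sketch_regret(s_star, n, gap=0.2, sigma=0.5)
        print(f"n = {n}: best s = {s_star}, "
              f"s / log n = {s_star / math.log(n):.2f}, regret = {regret:.1f}")
```

Under these assumptions the minimising exploration length grows with $n$ and stays above $\log n$, which is the behaviour the sketch proof relies on; the exact constants depend entirely on the illustrative parameters.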
The following proposition is a rigorous version of our sketch, and it shows that the necessity of a logarithmic lower bound can be based on a simple property on $\Theta$.

Proposition 9 Assume that there exist two environments $\theta = (\nu_1, \dots, \nu_K) \in \Theta$, $\tilde{\theta} = (\tilde{\nu}_1, \dots, \tilde{\nu}_K) \in \Theta$, and an arm $k \in \{1, \dots, K\}$ such that
1. $k$ has the best mean reward in environment $\theta$,
2. $k$ is not the winning arm in environment $\tilde{\theta}$,
3. $\nu_k = \tilde{\nu}_k$ and there exists $\eta \in (0, 1)$ such that
\[ \prod_{\ell \ne k} \frac{d\nu_\ell}{d\tilde{\nu}_\ell}(X_{\ell,1}) \ge \eta \quad \mathbb{P}_{\tilde{\theta}}\text{-a.s.} \tag{11} \]
Then, for any policy, there exists $\hat{\theta} \in \Theta$ such that
\[ \limsup_{n \to +\infty} \frac{\mathbb{E}_{\hat{\theta}} R_n}{\log n} > 0. \]

Let us explain the logic of the three conditions of the proposition. If $\nu_k = \tilde{\nu}_k$, and in case $\nu_k$ seems to be the reward distribution of arm $k$, then arm $k$ has to be pulled often enough for the regret to be small if the environment is $\theta$. Nevertheless, one has to explore other arms to know whether the environment is actually $\tilde{\theta}$. Moreover, Inequality (11) makes sure that the distinction between $\theta$ and $\tilde{\theta}$ is tough to make: it ensures that pulling any arm $\ell \ne k$ gives a reward which is likely in both environments. Without such an assumption the problem may be very simple, and providing a logarithmic lower bound is hopeless. Indeed, the distinction between any pair of tricky environments $(\theta, \tilde{\theta})$ may be solved in only one pull of a given arm $\ell \ne k$, that would almost surely give a reward that is possible in only one of the two environments.

The third condition can be seen as an alternate version of condition $0 < D_k(\theta) < \infty$ in Theorem 6, though there is no logical connection with it. Finally, let us remark that one can check that any set $\Theta$ that has the Dirac/Bernoulli property satisfies the conditions of Proposition 9.

Proof The proof consists in writing a proper version of Expression (10). To this aim we compute a lower bound of $\mathbb{E}_{\tilde{\theta}} R_n$, expressed as a function of $\mathbb{E}_\theta R_n$ and of an arbitrary function $g(n)$.
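In the following, $\tilde{\Delta}_k$ denotes the optimality gap of arm $k$ in environment $\tilde{\theta}$. As event $\left\{ \sum_{\ell \ne k} T_\ell(n) \le g(n) \right\}$ is measurable with respect to $X_{\ell,1}, \dots, X_{\ell,\lfloor g(n) \rfloor}$ ($\ell \ne k$) and to $X_{k,1}, \dots, X_{k,n}$, we also introduce the function $q$ such that
\[ \mathbb{1}_{\left\{ \sum_{\ell \ne k} T_\ell(n) \le g(n) \right\}} = q\left( (X_{\ell,s})_{\ell \ne k,\, s = 1..\lfloor g(n) \rfloor}, (X_{k,s})_{s = 1..n} \right). \]
We have:
\[ \mathbb{E}_{\tilde{\theta}} R_n \ge \tilde{\Delta}_k \, \mathbb{E}_{\tilde{\theta}}[T_k(n)] \ge \tilde{\Delta}_k (n - g(n)) \, \mathbb{P}_{\tilde{\theta}}\left( T_k(n) \ge n - g(n) \right) \tag{12} \]
\[ = \tilde{\Delta}_k (n - g(n)) \, \mathbb{P}_{\tilde{\theta}}\left( n - \sum_{\ell \ne k} T_\ell(n) \ge n - g(n) \right) \]
\[ = \tilde{\Delta}_k (n - g(n)) \, \mathbb{P}_{\tilde{\theta}}\left( \sum_{\ell \ne k} T_\ell(n) \le g(n) \right) \]
\[ = \tilde{\Delta}_k (n - g(n)) \int q\left( (x_{\ell,s})_{\ell \ne k,\, s = 1..\lfloor g(n) \rfloor}, (x_{k,s})_{s = 1..n} \right) \prod_{\substack{\ell \ne k \\ s = 1..\lfloor g(n) \rfloor}} d\tilde{\nu}_\ell(x_{\ell,s}) \prod_{s = 1..n} d\tilde{\nu}_k(x_{k,s}) \]
\[ \ge \tilde{\Delta}_k (n - g(n)) \int q\left( (x_{\ell,s})_{\ell \ne k,\, s = 1..\lfloor g(n) \rfloor}, (x_{k,s})_{s = 1..n} \right) \eta^{\lfloor g(n) \rfloor} \prod_{\substack{\ell \ne k \\ s = 1..\lfloor g(n) \rfloor}} d\nu_\ell(x_{\ell,s}) \prod_{s = 1..n} d\nu_k(x_{k,s}) \tag{13} \]
\[ \ge \tilde{\Delta}_k (n - g(n)) \, \eta^{g(n)} \int q\left( (x_{\ell,s})_{\ell \ne k,\, s = 1..\lfloor g(n) \rfloor}, (x_{k,s})_{s = 1..n} \right) \prod_{\substack{\ell \ne k \\ s = 1..\lfloor g(n) \rfloor}} d\nu_\ell(x_{\ell,s}) \prod_{s = 1..n} d\nu_k(x_{k,s}) \]
\[ = \tilde{\Delta}_k (n - g(n)) \, \eta^{g(n)} \, \mathbb{P}_\theta\left( \sum_{\ell \ne k} T_\ell(n) \le g(n) \right) \]
\[ = \tilde{\Delta}_k (n - g(n)) \, \eta^{g(n)} \left( 1 - \mathbb{P}_\theta\left( \sum_{\ell \ne k} T_\ell(n) > g(n) \right) \right) \]
\[ \ge \tilde{\Delta}_k (n - g(n)) \, \eta^{g(n)} \left( 1 - \frac{\mathbb{E}_\theta\left[ \sum_{\ell \ne k} T_\ell(n) \right]}{g(n)} \right) \tag{14} \]
\[ \ge \tilde{\Delta}_k (n - g(n)) \, \eta^{g(n)} \left( 1 - \frac{\mathbb{E}_\theta\left[ \sum_{\ell \ne k} \Delta_\ell T_\ell(n) \right]}{\Delta g(n)} \right) \tag{15} \]
\[ \ge \tilde{\Delta}_k (n - g(n)) \, \eta^{g(n)} \left( 1 - \frac{\mathbb{E}_\theta R_n}{\Delta g(n)} \right), \]
where the first inequality of (12) is a consequence of Formula (1), the second inequality of (12) and Inequality (14) come from Markov's inequality, Inequality (13) is a consequence of (11), and Inequality (15) results from the fact that $\Delta_\ell \ge \Delta$ for all $\ell$.

Now, let us conclude. If $\frac{\mathbb{E}_\theta R_n}{\log n} \to 0$ as $n \to +\infty$, we set $g(n) = \frac{2 \mathbb{E}_\theta R_n}{\Delta}$, so that
\[ g(n) \le \min\left( \frac{n}{2}, \frac{-\log n}{2 \log \eta} \right) \]
for $n$ large enough. Then, we have:
\[ \mathbb{E}_{\tilde{\theta}} R_n \ge \tilde{\Delta}_k \frac{n - g(n)}{2} \eta^{g(n)} \ge \tilde{\Delta}_k \frac{n}{4} \eta^{\frac{-\log n}{2 \log \eta}} = \tilde{\Delta}_k \frac{\sqrt{n}}{4}. \]
In particular, $\frac{\mathbb{E}_{\tilde{\theta}} R_n}{\log n} \to +\infty$ as $n \to +\infty$, and the result follows.

5.2 Hannan Consistency

We will prove that there exists a Hannan consistent policy such that there can not be a logarithmic lower bound for every environment $\theta$ of $\Theta$. To this aim, we make use of general UCB policies again (cf. Section 2.1). Let us first give sufficient conditions on the $f_k$ for the UCB policy to be Hannan consistent.

Proposition 10 Assume that $f_k(n) = o(n)$ for all $k \in \{1, \dots, K\}$. Assume also that there exist $\gamma > \frac{1}{2}$ and $N \ge 3$ such that $f_k(n) \ge \gamma \log \log n$ for all $k \in \{1, \dots, K\}$ and for all $n \ge N$. Then UCB is Hannan consistent.

Proof Fix an arm $k$ such that $\Delta_k > 0$ and choose $\beta \in (0, 1)$ such that $2\beta\gamma > 1$. By means of Lemma 2, we have for $n$ large enough:
\[ \mathbb{E}_\theta[T_k(n)] \le u + 2 \sum_{t=u+1}^{n} \left( 1 + \frac{\log t}{\log(1/\beta)} \right) e^{-2\beta\gamma \log \log t}, \quad \text{where } u = \left\lceil \frac{4 f_k(n)}{\Delta_k^2} \right\rceil. \]
Consequently, we have:
\[ \mathbb{E}_\theta[T_k(n)] \le u + 2 \sum_{t=2}^{n} \left( \frac{1}{(\log t)^{2\beta\gamma}} + \frac{1}{\log(1/\beta)} \frac{1}{(\log t)^{2\beta\gamma - 1}} \right). \tag{16} \]
Sums of the form $\sum_{t=2}^{n} \frac{1}{(\log t)^c}$ with $c > 0$ are equivalent to $\frac{n}{(\log n)^c}$ as $n$ goes to infinity. Indeed, on the one hand we have
\[ \sum_{t=3}^{n} \frac{1}{(\log t)^c} \le \int_{2}^{n} \frac{dx}{(\log x)^c} \le \sum_{t=2}^{n} \frac{1}{(\log t)^c}, \]
so that $\sum_{t=2}^{n} \frac{1}{(\log t)^c} \sim \int_{2}^{n} \frac{dx}{(\log x)^c}$. On the other hand, we have
\[ \int_{2}^{n} \frac{dx}{(\log x)^c} = \left[ \frac{x}{(\log x)^c} \right]_{2}^{n} + c \int_{2}^{n} \frac{dx}{(\log x)^{c+1}}. \]
As both integrals are divergent we have $\int_{2}^{n} \frac{dx}{(\log x)^{c+1}} = o\left( \int_{2}^{n} \frac{dx}{(\log x)^c} \right)$, so that $\int_{2}^{n} \frac{dx}{(\log x)^c} \sim \frac{n}{(\log n)^c}$. Combining the fact that $\sum_{t=2}^{n} \frac{1}{(\log t)^c} \sim \frac{n}{(\log n)^c}$ with Equation (16), we get the existence of a constant $C > 0$ such that
\[ \mathbb{E}_\theta[T_k(n)] \le \frac{4 f_k(n)}{\Delta^2} + \frac{C n}{(\log n)^{2\beta\gamma - 1}}. \]
Since $f_k(n) = o(n)$ and $2\beta\gamma - 1 > 0$, the latter inequality shows that $\mathbb{E}_\theta[T_k(n)] = o(n)$. The result follows.

We are now in the position to prove the main result of this section.

Theorem 11 If $\Theta$ has the Dirac/Bernoulli property, there exist Hannan consistent policies for which the expected regret can not be lower bounded by a logarithmic function in all environments $\theta$.

Proof If $f_1(n) = f_2(n) = \log \log n$ for $n \ge 3$, UCB is Hannan consistent by Proposition 10. According to Lemma 1, the expected regret is then of order $\log \log n$ in environments of the form $(\delta_a, \delta_b)$, $a \ne b$. Hence the conclusion on the non-existence of logarithmic lower bounds.

Thus we have obtained a lower bound of order $\log \log n$.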
This order is critical regarding the methods we used. Yet, we do not know if this order is optimal.

Acknowledgments This work has been supported by the French National Research Agency (ANR) through the COSINUS program (ANR-08-COSI-004: EXPLO-RA project).
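The behaviour described in the proof of Theorem 11 can be illustrated with a small deterministic simulation. The sketch below is not taken from the paper: it assumes that the UCB index of arm k at round t has the form (empirical mean) + sqrt(f(t) / T_k(t-1)), which is not restated in this section, and the names `run_ucb`, `loglog` and `two_log` are hypothetical. It runs the policy on a two-armed Dirac environment and compares the exploration function log log t with the more classical 2 log t.

```python
import math


def loglog(t):
    """Exploration function f(t) = log log t, clamped so it is defined for t < 3."""
    return math.log(math.log(max(t, 3)))


def two_log(t):
    """Classical exploration function f(t) = 2 log t, used here only for comparison."""
    return 2.0 * math.log(max(t, 2))


def run_ucb(f, rewards, horizon):
    """Run an index policy on deterministic (Dirac) arms.

    Assumed index form (an assumption of this sketch):
        index_k(t) = mean_k + sqrt(f(t) / T_k(t - 1)).
    Returns the list of pull counts of every arm after `horizon` rounds.
    """
    k = len(rewards)
    pulls = [1] * k  # each arm is pulled once to initialise the counts
    for t in range(k + 1, horizon + 1):
        bonus = f(t)
        # With Dirac arms the empirical mean equals the true reward exactly.
        indices = [rewards[i] + math.sqrt(bonus / pulls[i]) for i in range(k)]
        best = max(range(k), key=lambda i: indices[i])  # ties go to the lowest arm
        pulls[best] += 1
    return pulls


rewards = (0.9, 0.5)  # environment (delta_a, delta_b) with a = 0.9, b = 0.5
n = 100_000
gap = rewards[0] - rewards[1]
for name, f in (("log log t", loglog), ("2 log t", two_log)):
    pulls = run_ucb(f, rewards, n)
    print(f"f(t) = {name}: suboptimal pulls = {pulls[1]}, regret = {gap * pulls[1]:.2f}")
```

Under the assumed index, the suboptimal arm can only be selected while its pull count is below f(t) / gap^2, so its number of pulls, and hence the regret in this Dirac environment, grows no faster than the exploration function itself.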
6 0.29576042 15 jmlr-2013-Bayesian Canonical Correlation Analysis
7 0.29286823 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting
8 0.28852063 90 jmlr-2013-Quasi-Newton Method: A New Direction
9 0.28481928 61 jmlr-2013-Learning Theory Analysis for Association Rules and Sequential Event Prediction
10 0.27581427 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
11 0.27536747 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
12 0.26435238 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
13 0.2640833 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
14 0.26331335 114 jmlr-2013-The Rate of Convergence of AdaBoost
15 0.26027608 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning
16 0.25609067 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows
17 0.25447834 1 jmlr-2013-AC++Template-Based Reinforcement Learning Library: Fitting the Code to the Mathematics
19 0.24528722 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
20 0.23540366 76 jmlr-2013-Nonparametric Sparsity and Regularization
topicId topicWeight
[(0, 0.035), (5, 0.149), (6, 0.04), (10, 0.066), (14, 0.01), (20, 0.016), (23, 0.031), (53, 0.013), (68, 0.027), (70, 0.048), (75, 0.042), (85, 0.034), (87, 0.023), (88, 0.385)]
simIndex simValue paperId paperTitle
1 0.80433345 62 jmlr-2013-Learning Theory Approach to Minimum Error Entropy Criterion
Author: Ting Hu, Jun Fan, Qiang Wu, Ding-Xuan Zhou
Abstract: We consider the minimum error entropy (MEE) criterion and an empirical risk minimization learning algorithm when an approximation of Rényi’s entropy (of order 2) by Parzen windowing is minimized. This learning algorithm involves a Parzen windowing scaling parameter. We present a learning theory approach for this MEE algorithm in a regression setting when the scaling parameter is large. Consistency and explicit convergence rates are provided in terms of the approximation ability and capacity of the involved hypothesis space. Novel analysis is carried out for the generalization error associated with Rényi’s entropy and a Parzen windowing function, to overcome technical difficulties arising from the essential differences between the classical least squares problems and the MEE setting. An involved symmetrized least squares error is introduced and analyzed, which is related to some ranking algorithms. Keywords: minimum error entropy, learning theory, Rényi’s entropy, empirical risk minimization, approximation error
same-paper 2 0.7097494 68 jmlr-2013-Machine Learning with Operational Costs
Author: Theja Tulabandhula, Cynthia Rudin
Abstract: This work proposes a way to align statistical modeling with decision making. We provide a method that propagates the uncertainty in predictive modeling to the uncertainty in operational cost, where operational cost is the amount spent by the practitioner in solving the problem. The method allows us to explore the range of operational costs associated with the set of reasonable statistical models, so as to provide a useful way for practitioners to understand uncertainty. To do this, the operational cost is cast as a regularization term in a learning algorithm’s objective function, allowing either an optimistic or pessimistic view of possible costs, depending on the regularization parameter. From another perspective, if we have prior knowledge about the operational cost, for instance that it should be low, this knowledge can help to restrict the hypothesis space, and can help with generalization. We provide a theoretical generalization bound for this scenario. We also show that learning with operational costs is related to robust optimization. Keywords: statistical learning theory, optimization, covering numbers, decision theory
3 0.43567571 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright
Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves mean-squared error (MSE) that decays as O(N^{-1} + (N/m)^{-2}). Whenever m ≤ √N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O(N^{-1} + (N/m)^{-3}), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O(N^{-1} + (N/m)^{-3/2}), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 10^8 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling
4 0.42651609 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
Author: Tony Cai, Wen-Xin Zhou
Abstract: We consider in this paper the problem of noisy 1-bit matrix completion under a general non-uniform sampling distribution using the max-norm as a convex relaxation for the rank. A max-norm constrained maximum likelihood estimate is introduced and studied. The rate of convergence for the estimate is obtained. Information-theoretical methods are used to establish a minimax lower bound under the general sampling model. The minimax upper and lower bounds together yield the optimal rate of convergence for the Frobenius norm loss. Computational algorithms and numerical performance are also discussed. Keywords: 1-bit matrix completion, low-rank matrix, max-norm, trace-norm, constrained optimization, maximum likelihood estimate, optimal rate of convergence
5 0.42370602 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
Author: Sébastien Gerchinovitz
Abstract: We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T . We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW and based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design. Keywords: sparsity, online linear regression, individual sequences, adaptive regret bounds
7 0.4222897 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
8 0.42218968 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
9 0.42078692 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
10 0.41913506 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
12 0.41853017 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
13 0.41836137 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
14 0.41749918 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
15 0.41652697 73 jmlr-2013-Multicategory Large-Margin Unified Machines
16 0.41423085 114 jmlr-2013-The Rate of Convergence of AdaBoost
17 0.41421798 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
18 0.41402099 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
19 0.41377798 18 jmlr-2013-Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications
20 0.41253236 107 jmlr-2013-Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization