jmlr jmlr2012 jmlr2012-86 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Benedict C. May, Nathan Korda, Anthony Lee, David S. Leslie
Abstract: In sequential decision problems in an unknown environment, the decision maker often faces a dilemma over whether to explore to discover more about the environment, or to exploit current knowledge. We address the exploration-exploitation dilemma in a general setting encompassing both standard and contextualised bandit problems. The contextual bandit problem has recently resurfaced in attempts to maximise click-through rates in web based applications, a task with significant commercial interest. In this article we consider an approach of Thompson (1933) which makes use of samples from the posterior distributions for the instantaneous value of each action. We extend the approach by introducing a new algorithm, Optimistic Bayesian Sampling (OBS), in which the probability of playing an action increases with the uncertainty in the estimate of the action value. This results in better directed exploratory behaviour. We prove that, under unrestrictive assumptions, both approaches result in optimal behaviour with respect to the average reward criterion of Yang and Zhu (2002). We implement OBS and measure its performance in simulated Bernoulli bandit and linear regression domains, and also when tested with the task of personalised news article recommendation on a Yahoo! Front Page Today Module data set. We find that OBS performs competitively when compared to recently proposed benchmark algorithms and outperforms Thompson’s method throughout. Keywords: multi-armed bandits, contextual bandits, exploration-exploitation, sequential allocation, Thompson sampling
Reference: text
sentIndex sentText sentNum sentScore
1 We extend the approach by introducing a new algorithm, Optimistic Bayesian Sampling (OBS), in which the probability of playing an action increases with the uncertainty in the estimate of the action value. [sent-19, score-0.412]
2 In this article we consider an approach of Thompson (1933) which uses posterior distributions for the instantaneous value of each action to determine a probability distribution over the available actions. [sent-38, score-0.327]
3 Thompson considered only Bernoulli bandits, but in general the approach is to sample a value from the posterior distribution of the expected reward of each action, then select the action with the highest sample from the posterior. [sent-39, score-0.458]
4 When these posterior samples are represented as a sum of exploitative value and exploratory value, it becomes clear that LTS results in potentially negative exploratory values. [sent-43, score-0.388]
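As a rough illustration of the sampling rule just described, the following sketch implements one LTS step given per-action posterior samplers; the zero-argument sampler interface and function name are assumptions for illustration, not the authors' code.

```python
import numpy as np

def lts_step(posterior_samplers, rng=None):
    """One Local Thompson Sampling (LTS) step: draw one sample from each
    action's posterior over its expected reward and play an arg-max action."""
    if rng is None:
        rng = np.random.default_rng()
    samples = np.array([draw() for draw in posterior_samplers])
    best = np.flatnonzero(samples == samples.max())
    return int(rng.choice(best))  # ties broken uniformly at random
```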
5 At each time step, t, a regressor, xt ∈ X , is observed. [sent-57, score-0.305]
6 The contextual bandit framework considered assumes that reward can be expressed as rt = fat (xt ) + zt,at where the zt,a are zero mean random variables with unknown distributions and fa : X → R is an unknown continuous function of the regressor specific to action a. [sent-62, score-0.941]
7 The stream of regressors xt is assumed not to be influenced by the actions or the rewards, and for simplicity we assume that these 2070 O PTIMISTIC BAYESIAN S AMPLING are drawn independently from some fixed distribution on X . [sent-63, score-0.425]
8 For our actions to be comparable, we assume that ∀a ∈ A, ∀t ∈ T, ∀x ∈ X, f_a(x) + z_{t,a} is supported on the same set, S. [sent-64, score-0.297]
9 Definition 1 The optimal expected reward function, f* : X → R, is defined by f*(x) = max_{a∈A} f_a(x). [sent-68, score-0.356]
10 A minimal requirement for any sensible bandit algorithm is the average reward convergence criterion of Yang and Zhu (2002), which identifies whether a sequence of actions receives, asymptotically, rewards that achieve this optimal expected reward. [sent-69, score-0.488]
11 Definition 2 The policy, πt (·) t∈T , is a sequence of conditional probability mass functions where πt (a) = P(at = a|It , xt ). [sent-94, score-0.286]
12 At each time step t, the policy maps It and xt to a probability mass function giving the probability of each action being selected. [sent-95, score-0.55]
13 The policy is constructed in advance of the process, using only I0 , and is the function used to map It and xt to action selection probabilities for each of the actions. [sent-96, score-0.531]
14 Instead we rely on Assumptions 1–5 placed on the Bayesian regression framework, given in Section 3, that will be satisfied by standard models for the xt , rt and prior information I0 . [sent-100, score-0.359]
15 Denote the agent’s estimated value of action a at time t when regressor x is presented as fˆt,a (x). [sent-118, score-0.314]
16 Since fˆt,a is intended to be an estimate of fa , it is desirable that the evaluation procedure is consistent, that is, ∀a ∈ A , ∀x ∈ X , fˆt,a (x) − fa (x) converges in some sense to 0 as nt,a → ∞, where nt,a is the number of times action a has been selected up to time t. [sent-119, score-0.603]
17 Once action value estimates are available, the agent must use an action selection scheme to decide which action to play. [sent-123, score-0.642]
18 So that the consistency of estimation is achieved, it is necessary that the action selection ensures that every action is selected infinitely often. [sent-124, score-0.412]
19 We note that formal optimality criteria are available, such as expected cumulative discounted reward (Gittins, 1979) and rate of regret accumulation (Auer et al. [sent-134, score-0.36]
20 In particular, consider the methodology of evaluating both an exploitative value estimate and an ‘exploratory bonus’ at each time step for each action, and then acting greedily based on the sums of exploitative and exploratory values (Meuleau and Bourgine, 1999). [sent-138, score-0.439]
21 An action's exploitative value estimate corresponds to the expected immediate reward (i.e., [sent-139, score-0.346]
22 the expected reward for the current timestep) from selecting the action, given the information obtained so far, and therefore the posterior expectation of the expected immediate reward is the appropriate exploitative action value estimate. [sent-141, score-0.804]
23 Definition 3 Let p_a(· | I_t, x_t) denote the posterior distribution of f_a(x_t) given I_t and x_t, and let Q^{Th}_{t,a} be a random variable with distribution p_a(· | I_t, x_t). [sent-142, score-1.22]
24 The exploitative value, \hat{f}_{t,a}(x_t), of action a at time t is defined by \hat{f}_{t,a}(x_t) = E(Q^{Th}_{t,a} | I_t, x_t). [sent-143, score-0.69]
25 Thompson (1933) suggests selecting action at with probability equal to the probability that at is optimal, given It (there is no regressor in Thompson’s framework). [sent-144, score-0.295]
26 (2010), who implement the scheme by sampling, for each a, Q^{Th}_{t,a} from the posterior distribution p_a(· | I_t, x_t) and selecting an action that maximises Q^{Th}_{t,a}. [sent-146, score-0.621]
27 This corresponds to using an exploratory value \tilde{f}^{Th}_{t,a}(x_t) := Q^{Th}_{t,a} − \hat{f}_{t,a}(x_t), which is sampled from the posterior distribution of the error in the exploitative action value estimate at the current regressor. [sent-147, score-0.532]
28 We name this scheme Local Thompson Sampling (LTS), where ‘local’ makes reference to the fact that action selection probabilities are the probabilities that each action is optimal at the current regressor. [sent-148, score-0.412]
29 However, the exploratory value \tilde{f}^{Th}_{t,a}(x_t) under LTS has zero conditional expectation given I_t and x_t (by Definition 3) and can take negative values. [sent-150, score-0.348]
30 This exploratory value has positive conditional expectation given It and xt and cannot take negative values. [sent-154, score-0.348]
31 In undirected exploration, the action selection distribution depends only on the values of the exploitative action value estimates. [sent-161, score-0.591]
32 There is also the issue of ‘incomplete learning’; Brezzi and Lai (2000) showed that, for standard bandit problems, Gittins’ index rule samples only one action infinitely often and that this action is sub-optimal with positive probability. [sent-168, score-0.557]
33 In myopic methods, the uncertainty of action value estimates is taken into account, although the impact of action selections on future rewards is not considered directly. [sent-172, score-0.497]
34 The exploratory value at time t for action a, which we denote \tilde{f}_{t,a}, takes the simple form \tilde{f}_{t,a} = \sqrt{2 \log(t − 1) / n_{t,a}}. [sent-180, score-0.287]
35 Infinite exploration is guaranteed by the method, since the exploratory value grows in periods in which the associated action is not selected. [sent-183, score-0.347]
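A minimal sketch of this kind of index rule, assuming the bonus form quoted above (empirical means plus an exploratory term that shrinks with the play count n_{t,a}); the function and argument names are illustrative only.

```python
import numpy as np

def ucb_style_select(means, counts, t, rng=None):
    """Act greedily on exploitative estimate + exploratory bonus.
    means[a] is the empirical mean reward of action a, counts[a] = n_{t,a};
    untried actions are played first so the bonus is well defined."""
    if rng is None:
        rng = np.random.default_rng()
    counts = np.asarray(counts, dtype=float)
    untried = np.flatnonzero(counts == 0)
    if untried.size > 0:
        return int(rng.choice(untried))
    bonus = np.sqrt(2.0 * np.log(t - 1) / counts)   # grows while an action sits idle
    return int(np.argmax(np.asarray(means, dtype=float) + bonus))
```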
36 In the parametric case, a total action value corresponds to the highest posterior mean associated with a posterior distribution that has KL divergence less than a pre-defined term increasing logarithmically with time. [sent-197, score-0.376]
37 They are UCB-type methods in which actions are selected greedily based on the upper bound of a confidence interval for the exploitative value estimate at a fixed significance level. [sent-203, score-0.287]
38 The width of the confidence interval at a particular point in the regressor space is expected to decrease the more times the action is selected. [sent-205, score-0.295]
39 The former work considers only the two-armed non-contextual Bernoulli bandit and proves that Thompson sampling (the Bayesian Learning Automaton, in their terminology) converges to only pulling the optimal action with probability one. [sent-214, score-0.375]
40 At each time t, the LTS algorithm requires a mechanism that can, for each action a ∈ A , be used to sample from the posterior distribution of fa (xt ) given regressor xt and information set It . [sent-219, score-0.874]
41 Recall that the density of this distribution is denoted p_a(· | I_t, x_t) and a random variable from the distribution as Q^{Th}_{t,a}. [sent-220, score-0.33]
42 Additionally, the OBS algorithm requires a mechanism for evaluating exploitative value fˆt,a (xt ), where exploitative value is taken to be the posterior expectation of fa (xt ) given It and xt . [sent-222, score-0.918]
43 Algorithm 2 Optimistic Bayesian Sampling (OBS). Input: posterior distributions {p_a(· | I_t, x_t) : a ∈ A}. For a = 1 to A: sample Q^{Th}_{t,a} ∼ p_a(· | I_t, x_t); evaluate \hat{f}_{t,a}(x_t) = E(Q^{Th}_{t,a} | I_t, x_t); set Q_{t,a} = max(Q^{Th}_{t,a}, \hat{f}_{t,a}(x_t)). End for. Sample a_t uniformly from argmax_{a∈A} Q_{t,a}. [sent-223, score-0.902]
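A minimal runnable sketch of Algorithm 2, under the assumption that each action exposes a posterior sampler and a posterior-mean evaluator; the interface below is hypothetical, not the authors' implementation.

```python
import numpy as np

def obs_select(posterior_samplers, posterior_means, rng=None):
    """One step of Optimistic Bayesian Sampling (Algorithm 2).

    posterior_samplers[a]() draws Q^Th_{t,a} from p_a(. | I_t, x_t);
    posterior_means[a]() returns the exploitative value f_hat_{t,a}(x_t).
    Taking max(sample, posterior mean) makes the implied exploratory
    value non-negative, which is the only change relative to LTS."""
    if rng is None:
        rng = np.random.default_rng()
    q = np.array([max(draw(), mean())
                  for draw, mean in zip(posterior_samplers, posterior_means)])
    best = np.flatnonzero(q == q.max())
    return int(rng.choice(best))  # uniform tie-breaking
```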
44 1 LTS Algorithm Analysis We begin our convergence analysis by showing that the LTS algorithm explores all actions infinitely often, thus allowing a regression procedure to estimate all the functions fa . [sent-229, score-0.297]
45 It is also desirable that the posterior distributions remain fixed in periods of time in which the associated action is not selected, also a reasonable assumption if inference is independent for different actions. [sent-234, score-0.36]
46 ,t − 1}, and all xt ∈ X Th P(Qt,a > M|It , xt ) > ε and Th P(Qt,a < M|It , xt ) > ε. [sent-249, score-0.858]
47 Along with Assumption 1, we also assume that the posterior distributions concentrate on functions of the regressor bounded away from sup S as their associated actions are selected infinitely often. [sent-250, score-0.31]
48 Formally, we assume that: Assumption 2 For each action a ∈ A, there exists a function g_a : X → (inf S, sup S) such that (i) Q^{Th}_{t,a} − g_a(x_t) → 0 in probability as n_{t,a} → ∞, and (ii) sup_{x∈X} g_a(x) < sup S. [sent-251, score-0.457]
49 We do not take ga = fa since this allows us to prove infinite exploration even when our regression framework does not support the true functions (e. [sent-252, score-0.303]
50 By Assumption 2 and the infinite exploration of actions in A_inf, we have that for all actions a_inf ∈ A_inf there exists a function g_{a_inf} : X → (inf S, sup S) such that Q^{Th}_{t,a_inf} − g_{a_inf}(x_t) → 0 in probability as t → ∞. [sent-275, score-0.406]
51 Therefore, for fixed δ > 0, there exists a finite random time, T_δ, that is the earliest time in T such that for all actions a_inf ∈ A_inf we have P(|Q^{Th}_{t,a_inf} − g_{a_inf}(x_t)| < δ | I_t, x_t, t > T_δ) > 1 − δ. (4) [sent-276, score-0.484]
52 Since all actions in A_fin are selected finitely often, there exists some finite random time T_f that is the earliest time in T such that no action in A_fin is selected after T_f. [sent-278, score-0.352]
53 Then the LTS action selection rule implies that G^δ_{t,1} ∩ (∩_{a_fin ∈ A_fin\{1}} G^δ_{t,a_fin}) ∩ (∩_{a_inf ∈ A_inf} G^δ_{t,a_inf}) ⊂ {a_t = 1}, so that P(a_t = 1 | I_t, x_t, t > T) ≥ P(G^δ_{t,1} ∩ (∩_{a_fin ∈ A_fin\{1}} G^δ_{t,a_fin}) ∩ (∩_{a_inf ∈ A_inf} G^δ_{t,a_inf}) | I_t, x_t, t > T). [sent-282, score-0.832]
54 The set {G^δ_{t,a} : a = 1, . . . , A} is a conditionally independent set of events given I_t and x_t. [sent-289, score-0.286]
55 Therefore, by (3), (5) and (6), we have P(G^δ_{t,1} ∩ (∩_{a_fin ∈ A_fin\{1}} G^δ_{t,a_fin}) ∩ (∩_{a_inf ∈ A_inf} G^δ_{t,a_inf}) | I_t, x_t, t > T) > ε^{k−1}(1 − δ)^{A−k+1} (8), where ε = min_{a_fin ∈ A_fin} ε_{a_fin}. [sent-290, score-0.34]
56 Combining (7) and (8), it follows that P(a_t = 1 | I_t, x_t, t > T) > ε^{k−1}(1 − δ)^{A−k+1}, so that ∑_{t∈T} P(a_t = 1 | I_t, x_t) ≥ ∑_{t=T+1}^{∞} P(a_t = 1 | I_t, x_t) = ∑_{t=T+1}^{∞} P(a_t = 1 | I_t, x_t, t > T) > ∑_{t=T+1}^{∞} ε^{k−1}(1 − δ)^{A−k+1} = ∞, since T is almost surely finite. [sent-291, score-1.144]
57 Since action 1 was chosen arbitrarily from the set A_fin, any action in A_fin would cause a contradiction. [sent-293, score-0.412]
58 Assumption 2 only implies that the sum of exploitative and exploratory values tends to a particular function of the regressor and not that the exploratory values tend to zero. [sent-296, score-0.392]
59 Assumption 4 For all actions a ∈ A and regressors x ∈ X , ga (x) = fa (x). [sent-313, score-0.393]
60 Its proof uses the fact that, under the specified assumptions, Lemma 2 implies that, for all actions a ∈ A, Q^{Th}_{t,a} − f_a(x_t) → 0 in probability as t → ∞. [sent-315, score-0.297]
61 Proof Recall that the optimal expected reward function is defined by f ∗ (x) = maxa∈A fa (x). [sent-318, score-0.356]
62 Denote the event E^δ_t = {f*(x_t) − f_{a_t}(x_t) < 2δ}, so that E^δ_t is the event that the true expected reward for the action selected at time t is within 2δ of the optimal expected reward at time t. [sent-320, score-0.596]
63 Therefore there exists a finite random time, T_δ, that is the earliest time in T such that, for all a ∈ A, P(|Q^{Th}_{t,a} − f_a(x_t)| < δ | I_t, x_t, t > T_δ) > 1 − δ (11), so that, after T_δ, all sampled Q^{Th}_{t,a} values are within δ of the true values with high probability. [sent-325, score-0.494]
64 Then {F^δ_{t,a} : a ∈ A} is a conditionally independent set of events given I_t and x_t, so that P(∩_{a∈A} F^δ_{t,a} | I_t, x_t, t > T_δ) = ∏_{a∈A} P(F^δ_{t,a} | I_t, x_t, t > T_δ) > (1 − δ)^A (12), using inequality (11). [sent-327, score-0.858]
65 Note that, for any a*_t ∈ argmax_{a∈A} f_a(x_t), ∩_{a∈A} F^δ_{t,a} ⊂ {Q^{Th}_{t,a*_t} > f*(x_t) − δ} (13), and, for any a′ ∈ {a ∈ A : f*(x_t) − f_a(x_t) > 2δ}, ∩_{a∈A} F^δ_{t,a} ⊂ {Q^{Th}_{t,a′} < f*(x_t) − δ} (14). [sent-328, score-0.378]
66 Since argmax_{a∈A} f_a(x_t) is non-empty and the action selection rule is greedy on the Q^{Th}_{t,a}, statements (13) and (14) give ∩_{a∈A} F^δ_{t,a} ⊂ {f*(x_t) − f_{a_t}(x_t) < 2δ} = E^δ_t. [sent-329, score-0.413]
67 and so P(∩_{a∈A} F^δ_{t,a} | I_t, x_t) ≤ P(E^δ_t | I_t, x_t) (15). [sent-330, score-0.572]
68 Inequalities (12) and (15) imply that P(E^δ_t | I_t, x_t, t > T_δ) > (1 − δ)^A. [sent-331, score-0.286]
69 We have shown that the probability that the action selected at time t has a true expected reward that is within 2δ of that of the action with the highest true expected reward at time t tends to 1 as t → ∞. [sent-336, score-0.784]
70 We now face the difficulty that the strong law of large numbers cannot be used directly to establish a lower bound on lim_{t→∞} (1/t) ∑_{s=1}^{t} f_{a_s}(x_s), since the expected reward sequence f_{a_s}(x_s) is a sequence of dependent random variables. [sent-337, score-0.329]
71 We similarly define b_s based on the U_s as b_s = argmin_{a ∈ A^ε_s} f_a(x_s) if U_s < 1 − ε, and b_s = argmin_{a ∈ A} f_a(x_s) if U_s > 1 − ε, again with ties resolved using uniform sampling. [sent-347, score-0.444]
72 Note that by (16) there exists a finite random time S_ε = sup{t < ∞ : P(E^ε_t | I_t, x_t) < 1 − ε}. [sent-349, score-0.333]
73 It follows from (19), (20) and the definition of f* that {s > S_ε} ⊂ {f*(x_s) ≥ f_{a_s}(x_s) ≥ f_{b_s}(x_s)}, and so (1/t) ∑_{s=S_ε}^{t} f*(x_s) ≥ (1/t) ∑_{s=S_ε}^{t} f_{a_s}(x_s) ≥ (1/t) ∑_{s=S_ε}^{t} f_{b_s}(x_s). [sent-354, score-0.324]
74 Since (21), (23) and (24) hold, we have that E_X[f*(·)] ≥ lim_{t→∞} (1/t) ∑_{s=S_ε}^{t} f_{a_s}(x_s) ≥ lim_{t→∞} (1/t) ∑_{s=S_ε}^{t} f_{b_s}(x_s) > (1 − ε)(E_X[f*(·)] − 2ε) + ε E_X[f_{b_s}(x_s) | U_s > 1 − ε]. (25) [sent-360, score-0.367]
75 It is the case that lim_{t→∞} (1/t) ∑_{s=1}^{t} f_{a_s}(x_s) = lim_{t→∞} (1/t) ∑_{s=1}^{S_ε−1} f_{a_s}(x_s) + lim_{t→∞} (1/t) ∑_{s=S_ε}^{t} f_{a_s}(x_s) = 0 + lim_{t→∞} (1/t) ∑_{s=S_ε}^{t} f_{a_s}(x_s), since S_ε is finite and f_{a_s}(x_s) ≤ sup_{x∈X} f*(x) < ∞. [sent-362, score-0.613]
76 We assume that exploitative values are less than sup S by a constant for all regressor values during periods of time in which their associated actions are not selected. [sent-370, score-0.453]
77 Assumptions 2 and 3 imply that, for any action a ∈ A, Q_{t,a} − g_a(x_t) → 0 in probability as n_{t,a} → ∞, so that OBS samples associated with actions assumed to be selected infinitely often can be treated in the same way as LTS samples are in the proof of Lemma 2. [sent-387, score-0.379]
78 In the proof of Lemma 2, g*(x_t) := max_{a∈A} g_a(x_t) + δ is used as a target for samples associated with actions a_fin ∈ A \ {1} to fall below and the sample associated with action 1 to fall above. [sent-392, score-0.379]
79 Assumptions are only made on total action value estimates, that is, the sum of exploitative and exploratory value, and it is not necessary that the exploratory value converges to zero. [sent-398, score-0.509]
80 The total action value can be equal to the exploitative value estimate so it is important that the exploitative estimate converges to the same value as the LTS samples. [sent-401, score-0.564]
81 Under Assumptions 1–5, we have that the LTS samples, Q^{Th}_{t,a}, and the exploitative values, \hat{f}_{t,a}(x_t), are consistent estimators of the true expected rewards, f_a(x_t), and that infinite exploration is guaranteed by Lemma 4. [sent-405, score-0.417]
82 t=1 We compare the performance of OBS to that of LTS and various recently proposed action selection methods in simulated Bernoulli bandit and linear regression problem settings in §4. [sent-420, score-0.351]
83 If the agent chooses action a on any timestep then a reward of 1 is received with probability pa and 0 with probability 1 − pa . [sent-432, score-0.51]
84 The agent needs to explore in order to learn the probabilities of success for each action, so that the action yielding the highest expected reward can be identified. [sent-434, score-0.397]
85 • For each action a ∈ A , the prior distribution of fa is Beta(1, 1) (or equivalently U(0, 1)). [sent-442, score-0.395]
86 2 LTS and OBS Implementation. Let \tilde{r}_{τ,a} denote the value of the reward received on the timestep where action a was picked for the τth time. [sent-445, score-0.398]
87 ,t − 1}, the posterior distribution of fa given It will be the same as the posterior distribution of fa given IT (since no further information about fa is contained in It ). [sent-465, score-0.737]
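A compact simulation sketch of this Beta-Bernoulli setup, running LTS draws with the OBS floor at the posterior mean; the success probabilities and horizon are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = np.array([0.4, 0.55, 0.6])       # hypothetical success probabilities
alpha = np.ones_like(p_true)              # Beta(1, 1) priors for each action
beta = np.ones_like(p_true)

for t in range(10_000):
    q_th = rng.beta(alpha, beta)                      # LTS posterior samples
    q = np.maximum(q_th, alpha / (alpha + beta))      # OBS: floor at posterior mean
    a = int(np.argmax(q))
    r = float(rng.random() < p_true[a])               # Bernoulli reward
    alpha[a] += r                                     # conjugate Beta update
    beta[a] += 1.0 - r                                # only arm a's posterior changes
```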
88 2 Linear Regression In this case, we study a form of the problem in which the expected reward for each action is a linear function of an observable scalar regressor and the reward noise terms are normally distributed. [sent-569, score-0.629]
89 The learning task becomes that of estimating both the intercept and slope coefficients for each of the actions, so that the action yielding the highest expected reward given the regressor can be identified. [sent-570, score-0.462]
90 • (∀a ∈ A )(∀t ∈ T ) fa (xt ) = β1,a + β2,a xt for β1,a , β2,a ∈ R unknown. [sent-578, score-0.475]
91 We define the LTS exploratory value as \tilde{f}^{Th}_{t,a}(x_t) := \hat{σ}_{t,a} \sqrt{x_t^T (X_{t,a}^T X_{t,a})^{-1} x_t} \, U_{t,a}. [sent-623, score-0.348]
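A sketch of how the LTS sample and OBS value could be computed for one action in this linear model, assuming a standard noninformative Bayesian linear regression so that U_{t,a} is a Student-t draw with n_{t,a} − 2 degrees of freedom; this illustrates the formula above and is not the authors' code.

```python
import numpy as np

def lts_obs_linear(X, r, x_new, rng=None):
    """LTS sample and OBS value for one action in the linear-regression
    domain, given that action's past design matrix X (rows [1, x]) and
    reward vector r; assumes at least three past observations."""
    if rng is None:
        rng = np.random.default_rng()
    n = X.shape[0]
    xtx_inv = np.linalg.inv(X.T @ X)
    b_hat = xtx_inv @ X.T @ r                     # posterior mean of b_a
    resid = r - X @ b_hat
    sigma_hat = np.sqrt(resid @ resid / (n - 2))  # scale estimate
    x = np.array([1.0, x_new])
    f_hat = x @ b_hat                              # exploitative value
    u = rng.standard_t(n - 2)                      # U_{t,a}, assumed Student-t
    q_th = f_hat + sigma_hat * np.sqrt(x @ xtx_inv @ x) * u   # LTS sample
    return q_th, max(q_th, f_hat)                  # (LTS, OBS) total values
```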
92 ,t − 1}, the posterior distribution of ba and σ2 given It will be the a same as that given IT (since no further information about fa is contained in It ). [sent-642, score-0.303]
93 P Lemma 11 The exploitative value estimate fˆt,a (xt ) → fa (xt ) as nt,a → ∞. [sent-653, score-0.368]
94 Proof Let xi,a denote the value of the regressor presented on the timestep where action a was picked ˜ for the ith time. [sent-654, score-0.32]
95 Using the facts that \hat{b}_{t,a} → b_a as n_{t,a} → ∞, z_{t,a} = r_t − f_a(x_t) and E[z_{t,a}] = 0, we have that \hat{σ}^2_{t,a} := (n_{t,a} − 2)^{-1} (r_{t,a} − X_{t,a}\hat{b}_{t,a})^T (r_{t,a} − X_{t,a}\hat{b}_{t,a}) converges in probability to (n_{t,a} − 2)^{-1} (r_{t,a} − X_{t,a}b_a)^T (r_{t,a} − X_{t,a}b_a), and hence to σ^2_a, as n_{t,a} → ∞. [sent-666, score-0.349]
96 • ∀a ∈ A and ∀t ∈ T, evaluate the potential reward r_{t,a} = β_{1,a} + β_{2,a} x_t + z_{t,a} • record the regret incurred using various action selection methods. [sent-717, score-0.774]
97 Specifically, the action selection rule used is given by a_t = argmax_{a∈A} [ \hat{f}_{t,a}(x_t) + \hat{σ}_{t,a} \sqrt{x_t^T (X_{t,a}^T X_{t,a})^{-1} x_t} \, t_{1−λ/100, n_{t,a}−2} ], where t_{γ,n} denotes the quantile function of Student's t distribution with n degrees of freedom evaluated at γ. [sent-722, score-0.492]
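A sketch of this baseline rule, reusing the same least-squares quantities as in the earlier linear-regression sketch and SciPy's Student-t quantile for t_{1−λ/100, n_{t,a}−2}; the default λ and the function interface are assumptions for illustration.

```python
import numpy as np
from scipy.stats import t as student_t

def ucb_baseline_select(designs, rewards, x_new, lam=5.0):
    """Greedy selection on the upper confidence bound quoted above:
    f_hat + sigma_hat * sqrt(x' (X'X)^{-1} x) * t_{1 - lam/100, n - 2}."""
    x = np.array([1.0, x_new])
    scores = []
    for X, r in zip(designs, rewards):       # one (X, r) pair per action
        n = X.shape[0]
        xtx_inv = np.linalg.inv(X.T @ X)
        b_hat = xtx_inv @ X.T @ r
        resid = r - X @ b_hat
        sigma_hat = np.sqrt(resid @ resid / (n - 2))
        width = np.sqrt(x @ xtx_inv @ x) * student_t.ppf(1.0 - lam / 100.0, n - 2)
        scores.append(x @ b_hat + sigma_hat * width)
    return int(np.argmax(scores))
```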
98 Under this procedure, if the action selected by the algorithm does not match the action selected in the data point, the current data point, and subsequent data points, are ignored until a data point which matches user data and action selection occurs. [sent-795, score-0.618]
99 It is assumed that there is an unknown weight vector, w_a, for each article a ∈ A such that P(r_t = 1 | a_t = a, x_t = x) = (1 + exp(−w_a^T x))^{-1}. [sent-800, score-0.322]
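To make the sampling step concrete for this model, the sketch below draws w_a from an assumed Gaussian approximation N(m_a, S_a) to its posterior and returns the corresponding sampled click probability; the Gaussian approximation and the names m_a, S_a are illustrative assumptions, not a claim about the paper's exact implementation.

```python
import numpy as np

def sampled_click_probability(m_a, S_a, x, rng=None):
    """Draw a weight vector from an assumed Gaussian approximation
    N(m_a, S_a) to the posterior over w_a and return the corresponding
    sampled click probability (1 + exp(-w' x))^{-1}."""
    if rng is None:
        rng = np.random.default_rng()
    w = rng.multivariate_normal(m_a, S_a)
    return 1.0 / (1.0 + np.exp(-w @ x))
```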
100 It is worth noting that convergence criterion (2) is satisfied even when approximations to the posterior distributions and expectations for the fa (xt ) are used with the LTS and OBS algorithms, so long as the relevant assumptions are satisfied. [sent-845, score-0.322]
wordName wordTfidf (topN-words)
[('obs', 0.509), ('lts', 0.45), ('xt', 0.286), ('action', 0.206), ('fa', 0.189), ('exploitative', 0.179), ('reward', 0.167), ('th', 0.155), ('ucb', 0.148), ('bandit', 0.145), ('xs', 0.13), ('korda', 0.127), ('regret', 0.115), ('eslie', 0.11), ('actions', 0.108), ('ptimistic', 0.104), ('thompson', 0.099), ('regressor', 0.089), ('posterior', 0.085), ('moss', 0.084), ('fas', 0.081), ('fbs', 0.081), ('ay', 0.078), ('rt', 0.073), ('linucb', 0.069), ('ampling', 0.069), ('ga', 0.065), ('exploratory', 0.062), ('cumulative', 0.061), ('bernoulli', 0.055), ('contextual', 0.054), ('ie', 0.053), ('ee', 0.053), ('exploration', 0.049), ('auer', 0.045), ('pa', 0.044), ('rewards', 0.044), ('gittins', 0.041), ('myopic', 0.041), ('bandits', 0.04), ('argmaxa', 0.04), ('policy', 0.039), ('bayesian', 0.038), ('audibert', 0.036), ('article', 0.036), ('recommendation', 0.034), ('yahoo', 0.033), ('bs', 0.033), ('kl', 0.032), ('ex', 0.032), ('nitely', 0.032), ('regressors', 0.031), ('dilemma', 0.031), ('periods', 0.03), ('ctr', 0.03), ('lai', 0.03), ('xtt', 0.03), ('ba', 0.029), ('ainf', 0.029), ('bubeck', 0.029), ('capp', 0.029), ('garivier', 0.029), ('glie', 0.029), ('qth', 0.029), ('ucbv', 0.029), ('sup', 0.028), ('articles', 0.027), ('inf', 0.025), ('lim', 0.025), ('news', 0.025), ('timestep', 0.025), ('click', 0.025), ('sampling', 0.024), ('agent', 0.024), ('horizon', 0.024), ('averaged', 0.024), ('assumptions', 0.024), ('criterion', 0.024), ('li', 0.023), ('filippi', 0.023), ('pavlidis', 0.023), ('today', 0.023), ('graepel', 0.023), ('chapelle', 0.023), ('fb', 0.022), ('policies', 0.022), ('lemma', 0.021), ('assumption', 0.02), ('personalised', 0.02), ('robbins', 0.02), ('agrawal', 0.02), ('front', 0.02), ('module', 0.02), ('time', 0.019), ('fat', 0.018), ('bristol', 0.018), ('gainf', 0.017), ('optimistic', 0.017), ('discounted', 0.017), ('trials', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
Author: Benedict C. May, Nathan Korda, Anthony Lee, David S. Leslie
Abstract: In sequential decision problems in an unknown environment, the decision maker often faces a dilemma over whether to explore to discover more about the environment, or to exploit current knowledge. We address the exploration-exploitation dilemma in a general setting encompassing both standard and contextualised bandit problems. The contextual bandit problem has recently resurfaced in attempts to maximise click-through rates in web based applications, a task with significant commercial interest. In this article we consider an approach of Thompson (1933) which makes use of samples from the posterior distributions for the instantaneous value of each action. We extend the approach by introducing a new algorithm, Optimistic Bayesian Sampling (OBS), in which the probability of playing an action increases with the uncertainty in the estimate of the action value. This results in better directed exploratory behaviour. We prove that, under unrestrictive assumptions, both approaches result in optimal behaviour with respect to the average reward criterion of Yang and Zhu (2002). We implement OBS and measure its performance in simulated Bernoulli bandit and linear regression domains, and also when tested with the task of personalised news article recommendation on a Yahoo! Front Page Today Module data set. We find that OBS performs competitively when compared to recently proposed benchmark algorithms and outperforms Thompson’s method throughout. Keywords: multi-armed bandits, contextual bandits, exploration-exploitation, sequential allocation, Thompson sampling
2 0.18488498 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
Author: Mehrdad Mahdavi, Rong Jin, Tianbao Yang
Abstract: In this paper we propose efficient algorithms for solving constrained online convex optimization problems. Our motivation stems from the observation that most algorithms proposed for online convex optimization require a projection onto the convex set K from which the decisions are made. While the projection is straightforward for simple shapes (e.g., Euclidean ball), for arbitrary complex sets it is the main computational challenge and may be inefficient in practice. In this paper, we consider an alternative online convex optimization problem. Instead of requiring that decisions belong to K for all rounds, we only require that the constraints, which define the set K , be satisfied in the long run. By turning the problem into an online convex-concave optimization problem, √ we propose an efficient algorithm which achieves O( T ) regret bound and O(T 3/4 ) bound on the violation of constraints. Then, we modify the algorithm in order to guarantee that the constraints are satisfied in the long run. This gain is achieved at the price of getting O(T 3/4 ) regret bound. Our second algorithm is based on the mirror prox method (Nemirovski, 2005) to solve variational inequalities which achieves O(T 2/3 ) bound for both regret and the violation of constraints when the domain K can be described by a finite number of linear constraints. Finally, we extend the results to the setting where we only have partial access to the convex set K and propose a multipoint bandit feedback algorithm with the same bounds in expectation as our first algorithm. Keywords: online convex optimization, convex-concave optimization, bandit feedback, variational inequality
3 0.11154883 84 jmlr-2012-Online Submodular Minimization
Author: Elad Hazan, Satyen Kale
Abstract: We consider an online decision problem over a discrete space in which the loss function is submodular. We give algorithms which are computationally efficient and are Hannan-consistent in both the full information and partial feedback settings. Keywords: submodular optimization, online learning, regret minimization
4 0.10934602 58 jmlr-2012-Linear Fitted-Q Iteration with Multiple Reward Functions
Author: Daniel J. Lizotte, Michael Bowling, Susan A. Murphy
Abstract: We present a general and detailed development of an algorithm for finite-horizon fitted-Q iteration with an arbitrary number of reward signals and linear value function approximation using an arbitrary number of state features. This includes a detailed treatment of the 3-reward function case using triangulation primitives from computational geometry and a method for identifying globally dominated actions. We also present an example of how our methods can be used to construct a realworld decision aid by considering symptom reduction, weight gain, and quality of life in sequential treatments for schizophrenia. Finally, we discuss future directions in which to take this work that will further enable our methods to make a positive impact on the field of evidence-based clinical decision support. Keywords: reinforcement learning, dynamic programming, decision making, linear regression, preference elicitation
5 0.10345326 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
Author: Ofer Dekel, Claudio Gentile, Karthik Sridharan
Abstract: We present a new online learning algorithm in the selective sampling framework, where labels must be actively queried before they are revealed. We prove bounds on the regret of our algorithm and on the number of labels it queries when faced with an adaptive adversarial strategy of generating the instances. Our bounds both generalize and strictly improve over previous bounds in similar settings. Additionally, our selective sampling algorithm can be converted into an efficient statistical active learning algorithm. We extend our algorithm and analysis to the multiple-teacher setting, where the algorithm can choose which subset of teachers to query for each label. Finally, we demonstrate the effectiveness of our techniques on a real-world Internet search problem. Keywords: online learning, regret, label-efficient, crowdsourcing
6 0.085766062 41 jmlr-2012-Exploration in Relational Domains for Model-based Reinforcement Learning
7 0.078966364 50 jmlr-2012-Human Gesture Recognition on Product Manifolds
8 0.077241696 97 jmlr-2012-Regularization Techniques for Learning with Matrices
9 0.071786501 116 jmlr-2012-Transfer in Reinforcement Learning via Shared Features
10 0.068521358 34 jmlr-2012-Dynamic Policy Programming
11 0.059589159 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning
12 0.055982113 46 jmlr-2012-Finite-Sample Analysis of Least-Squares Policy Iteration
13 0.054467443 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
14 0.053343348 32 jmlr-2012-Discriminative Hierarchical Part-based Models for Human Parsing and Action Recognition
15 0.051913396 54 jmlr-2012-Large-scale Linear Support Vector Regression
16 0.050202318 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization
17 0.047010776 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets
18 0.043081094 33 jmlr-2012-Distance Metric Learning with Eigenvalue Optimization
19 0.040879849 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
20 0.038166985 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics
topicId topicWeight
[(0, -0.18), (1, -0.28), (2, 0.012), (3, -0.237), (4, 0.021), (5, -0.097), (6, -0.152), (7, -0.045), (8, -0.086), (9, 0.087), (10, -0.0), (11, -0.093), (12, 0.07), (13, -0.028), (14, 0.045), (15, -0.017), (16, -0.008), (17, 0.057), (18, 0.016), (19, -0.02), (20, 0.105), (21, 0.05), (22, 0.108), (23, 0.088), (24, 0.006), (25, 0.041), (26, -0.002), (27, 0.1), (28, 0.113), (29, 0.033), (30, 0.106), (31, 0.056), (32, 0.005), (33, -0.015), (34, -0.148), (35, -0.141), (36, 0.096), (37, 0.068), (38, 0.053), (39, 0.065), (40, -0.064), (41, 0.011), (42, 0.065), (43, 0.006), (44, -0.004), (45, 0.044), (46, -0.043), (47, 0.056), (48, 0.003), (49, 0.069)]
simIndex simValue paperId paperTitle
same-paper 1 0.95550084 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
Author: Benedict C. May, Nathan Korda, Anthony Lee, David S. Leslie
Abstract: In sequential decision problems in an unknown environment, the decision maker often faces a dilemma over whether to explore to discover more about the environment, or to exploit current knowledge. We address the exploration-exploitation dilemma in a general setting encompassing both standard and contextualised bandit problems. The contextual bandit problem has recently resurfaced in attempts to maximise click-through rates in web based applications, a task with significant commercial interest. In this article we consider an approach of Thompson (1933) which makes use of samples from the posterior distributions for the instantaneous value of each action. We extend the approach by introducing a new algorithm, Optimistic Bayesian Sampling (OBS), in which the probability of playing an action increases with the uncertainty in the estimate of the action value. This results in better directed exploratory behaviour. We prove that, under unrestrictive assumptions, both approaches result in optimal behaviour with respect to the average reward criterion of Yang and Zhu (2002). We implement OBS and measure its performance in simulated Bernoulli bandit and linear regression domains, and also when tested with the task of personalised news article recommendation on a Yahoo! Front Page Today Module data set. We find that OBS performs competitively when compared to recently proposed benchmark algorithms and outperforms Thompson’s method throughout. Keywords: multi-armed bandits, contextual bandits, exploration-exploitation, sequential allocation, Thompson sampling
2 0.55675924 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
Author: Mehrdad Mahdavi, Rong Jin, Tianbao Yang
Abstract: In this paper we propose efficient algorithms for solving constrained online convex optimization problems. Our motivation stems from the observation that most algorithms proposed for online convex optimization require a projection onto the convex set K from which the decisions are made. While the projection is straightforward for simple shapes (e.g., Euclidean ball), for arbitrary complex sets it is the main computational challenge and may be inefficient in practice. In this paper, we consider an alternative online convex optimization problem. Instead of requiring that decisions belong to K for all rounds, we only require that the constraints, which define the set K , be satisfied in the long run. By turning the problem into an online convex-concave optimization problem, √ we propose an efficient algorithm which achieves O( T ) regret bound and O(T 3/4 ) bound on the violation of constraints. Then, we modify the algorithm in order to guarantee that the constraints are satisfied in the long run. This gain is achieved at the price of getting O(T 3/4 ) regret bound. Our second algorithm is based on the mirror prox method (Nemirovski, 2005) to solve variational inequalities which achieves O(T 2/3 ) bound for both regret and the violation of constraints when the domain K can be described by a finite number of linear constraints. Finally, we extend the results to the setting where we only have partial access to the convex set K and propose a multipoint bandit feedback algorithm with the same bounds in expectation as our first algorithm. Keywords: online convex optimization, convex-concave optimization, bandit feedback, variational inequality
3 0.49655524 84 jmlr-2012-Online Submodular Minimization
Author: Elad Hazan, Satyen Kale
Abstract: We consider an online decision problem over a discrete space in which the loss function is submodular. We give algorithms which are computationally efficient and are Hannan-consistent in both the full information and partial feedback settings. Keywords: submodular optimization, online learning, regret minimization
4 0.43498302 41 jmlr-2012-Exploration in Relational Domains for Model-based Reinforcement Learning
Author: Tobias Lang, Marc Toussaint, Kristian Kersting
Abstract: A fundamental problem in reinforcement learning is balancing exploration and exploitation. We address this problem in the context of model-based reinforcement learning in large stochastic relational domains by developing relational extensions of the concepts of the E 3 and R- MAX algorithms. Efficient exploration in exponentially large state spaces needs to exploit the generalization of the learned model: what in a propositional setting would be considered a novel situation and worth exploration may in the relational setting be a well-known context in which exploitation is promising. To address this we introduce relational count functions which generalize the classical notion of state and action visitation counts. We provide guarantees on the exploration efficiency of our framework using count functions under the assumption that we had a relational KWIK learner and a near-optimal planner. We propose a concrete exploration algorithm which integrates a practically efficient probabilistic rule learner and a relational planner (for which there are no guarantees, however) and employs the contexts of learned relational rules as features to model the novelty of states and actions. Our results in noisy 3D simulated robot manipulation problems and in domains of the international planning competition demonstrate that our approach is more effective than existing propositional and factored exploration techniques. Keywords: reinforcement learning, statistical relational learning, exploration, relational transition models, robotics
5 0.42479488 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
Author: Ofer Dekel, Claudio Gentile, Karthik Sridharan
Abstract: We present a new online learning algorithm in the selective sampling framework, where labels must be actively queried before they are revealed. We prove bounds on the regret of our algorithm and on the number of labels it queries when faced with an adaptive adversarial strategy of generating the instances. Our bounds both generalize and strictly improve over previous bounds in similar settings. Additionally, our selective sampling algorithm can be converted into an efficient statistical active learning algorithm. We extend our algorithm and analysis to the multiple-teacher setting, where the algorithm can choose which subset of teachers to query for each label. Finally, we demonstrate the effectiveness of our techniques on a real-world Internet search problem. Keywords: online learning, regret, label-efficient, crowdsourcing
6 0.41241983 58 jmlr-2012-Linear Fitted-Q Iteration with Multiple Reward Functions
7 0.40632477 32 jmlr-2012-Discriminative Hierarchical Part-based Models for Human Parsing and Action Recognition
8 0.35525027 116 jmlr-2012-Transfer in Reinforcement Learning via Shared Features
9 0.34332803 50 jmlr-2012-Human Gesture Recognition on Product Manifolds
10 0.31849608 34 jmlr-2012-Dynamic Policy Programming
11 0.30864435 46 jmlr-2012-Finite-Sample Analysis of Least-Squares Policy Iteration
12 0.29653385 54 jmlr-2012-Large-scale Linear Support Vector Regression
13 0.28357163 110 jmlr-2012-Static Prediction Games for Adversarial Learning Problems
14 0.25460866 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices
15 0.24285156 97 jmlr-2012-Regularization Techniques for Learning with Matrices
16 0.22507992 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning
17 0.22428089 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics
18 0.21476024 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets
19 0.20681307 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization
20 0.19923717 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
topicId topicWeight
[(7, 0.024), (21, 0.029), (26, 0.017), (29, 0.042), (35, 0.013), (49, 0.023), (51, 0.394), (56, 0.014), (57, 0.03), (69, 0.015), (75, 0.068), (77, 0.013), (79, 0.025), (92, 0.105), (96, 0.086)]
simIndex simValue paperId paperTitle
same-paper 1 0.66589302 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
Author: Benedict C. May, Nathan Korda, Anthony Lee, David S. Leslie
Abstract: In sequential decision problems in an unknown environment, the decision maker often faces a dilemma over whether to explore to discover more about the environment, or to exploit current knowledge. We address the exploration-exploitation dilemma in a general setting encompassing both standard and contextualised bandit problems. The contextual bandit problem has recently resurfaced in attempts to maximise click-through rates in web based applications, a task with significant commercial interest. In this article we consider an approach of Thompson (1933) which makes use of samples from the posterior distributions for the instantaneous value of each action. We extend the approach by introducing a new algorithm, Optimistic Bayesian Sampling (OBS), in which the probability of playing an action increases with the uncertainty in the estimate of the action value. This results in better directed exploratory behaviour. We prove that, under unrestrictive assumptions, both approaches result in optimal behaviour with respect to the average reward criterion of Yang and Zhu (2002). We implement OBS and measure its performance in simulated Bernoulli bandit and linear regression domains, and also when tested with the task of personalised news article recommendation on a Yahoo! Front Page Today Module data set. We find that OBS performs competitively when compared to recently proposed benchmark algorithms and outperforms Thompson’s method throughout. Keywords: multi-armed bandits, contextual bandits, exploration-exploitation, sequential allocation, Thompson sampling
2 0.63032132 103 jmlr-2012-Sampling Methods for the Nyström Method
Author: Sanjiv Kumar, Mehryar Mohri, Ameet Talwalkar
Abstract: The Nystr¨ m method is an efficient technique to generate low-rank matrix approximations and is o used in several large-scale learning applications. A key aspect of this method is the procedure according to which columns are sampled from the original matrix. In this work, we explore the efficacy of a variety of fixed and adaptive sampling schemes. We also propose a family of ensemble-based sampling algorithms for the Nystr¨ m method. We report results of extensive experiments o that provide a detailed comparison of various fixed and adaptive sampling techniques, and demonstrate the performance improvement associated with the ensemble Nystr¨ m method when used in o conjunction with either fixed or adaptive sampling schemes. Corroborating these empirical findings, we present a theoretical analysis of the Nystr¨ m method, providing novel error bounds guaro anteeing a better convergence rate of the ensemble Nystr¨ m method in comparison to the standard o Nystr¨ m method. o Keywords: low-rank approximation, nystr¨ m method, ensemble methods, large-scale learning o
3 0.37184209 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
Author: Matus Telgarsky
Abstract: Boosting combines weak learners into a predictor with low empirical risk. Its dual constructs a high entropy distribution upon which weak learners and training labels are uncorrelated. This manuscript studies this primal-dual relationship under a broad family of losses, including the exponential loss of AdaBoost and the logistic loss, revealing: • Weak learnability aids the whole loss family: for any ε > 0, O (ln(1/ε)) iterations suffice to produce a predictor with empirical risk ε-close to the infimum; • The circumstances granting the existence of an empirical risk minimizer may be characterized in terms of the primal and dual problems, yielding a new proof of the known rate O (ln(1/ε)); • Arbitrary instances may be decomposed into the above two, granting rate O (1/ε), with a matching lower bound provided for the logistic loss. Keywords: boosting, convex analysis, weak learnability, coordinate descent, maximum entropy
4 0.36990726 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
Author: Sangkyun Lee, Stephen J. Wright
Abstract: Iterative methods that calculate their steps from approximate subgradient directions have proved to be useful for stochastic learning problems over large and streaming data sets. When the objective consists of a loss function plus a nonsmooth regularization term, the solution often lies on a lowdimensional manifold of parameter space along which the regularizer is smooth. (When an ℓ1 regularizer is used to induce sparsity in the solution, for example, this manifold is defined by the set of nonzero components of the parameter vector.) This paper shows that a regularized dual averaging algorithm can identify this manifold, with high probability, before reaching the solution. This observation motivates an algorithmic strategy in which, once an iterate is suspected of lying on an optimal or near-optimal manifold, we switch to a “local phase” that searches in this manifold, thus converging rapidly to a near-optimal point. Computational results are presented to verify the identification property and to illustrate the effectiveness of this approach. Keywords: regularization, dual averaging, partly smooth manifold, manifold identification
5 0.36944219 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao
Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization
6 0.36729717 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
7 0.36328423 80 jmlr-2012-On Ranking and Generalization Bounds
8 0.36268103 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
9 0.36259723 13 jmlr-2012-Active Learning via Perfect Selective Classification
10 0.36245814 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
11 0.36238563 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
12 0.361287 111 jmlr-2012-Structured Sparsity and Generalization
13 0.35945249 82 jmlr-2012-On the Necessity of Irrelevant Variables
14 0.35841757 73 jmlr-2012-Multi-task Regression using Minimal Penalties
15 0.35742861 70 jmlr-2012-Multi-Assignment Clustering for Boolean Data
16 0.35736418 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO
17 0.35688719 69 jmlr-2012-Mixability is Bayes Risk Curvature Relative to Log Loss
18 0.35628897 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality
19 0.35559827 96 jmlr-2012-Refinement of Operator-valued Reproducing Kernels
20 0.35460049 84 jmlr-2012-Online Submodular Minimization