nips nips2010 nips2010-183 knowledge-graph by maker-knowledge-mining

183 nips-2010-Non-Stochastic Bandit Slate Problems


Source: pdf

Author: Satyen Kale, Lev Reyzin, Robert E. Schapire

Abstract: We consider bandit problems, motivated by applications in online advertising and news story selection, in which the learner must repeatedly select a slate, that is, a subset of size s from K possible actions, and then receives rewards for just the selected actions. The goal is to minimize the regret with respect to total reward of the best slate computed in hindsight. We consider unordered and ordered versions of the problem, and give efficient algorithms which have regret O(√T), where the constant depends on the specific nature of the problem. We also consider versions of the problem where we have access to a number of policies which make recommendations for slates in every round, and give algorithms with O(√T) regret for competing with the best such policy as well. We make use of the technique of relative entropy projections combined with the usual multiplicative weight update algorithm to obtain our algorithms.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We consider bandit problems, motivated by applications in online advertising and news story selection, in which the learner must repeatedly select a slate, that is, a subset of size s from K possible actions, and then receives rewards for just the selected actions. [sent-9, score-0.411]

2 The goal is to minimize the regret with respect to total reward of the best slate computed in hindsight. [sent-10, score-0.849]

3 We consider unordered and ordered versions of the problem, and give efficient algorithms which have regret O(√T), where the constant depends on the specific nature of the problem. [sent-11, score-0.58]

4 We also consider versions of the problem where we have access to a number of policies which make recommendations for slates in every round, and give algorithms with O(√T) regret for competing with the best such policy as well. [sent-12, score-0.972]

5 1 Introduction. In traditional bandit models, the learner is presented with a set of K actions. [sent-14, score-0.284]

6 On each of T rounds, an adversary (or the world) first chooses rewards for each action, and afterwards the learner decides which action it wants to take. [sent-15, score-0.24]

7 The learner then receives the reward of its chosen action, but does not see the rewards of the other actions. [sent-16, score-0.195]

8 In the standard bandit setting, the learner’s goal is to compete with the best fixed arm in hindsight. [sent-17, score-0.238]

9 In the more general “experts setting,” each of N experts recommends an arm on each round, and the goal of the learner is to perform as well as the best expert in hindsight. [sent-18, score-0.199]

10 The bandit setting tackles many problems where a learner’s decisions affect not only how well it performs but also the data it learns from: a good algorithm will balance exploiting actions it already knows to be good and exploring actions for which its estimates are less certain. [sent-19, score-0.413]

11 In this setting, the actions correspond to advertisements, and choosing an action means displaying the corresponding ad. [sent-21, score-0.122]

12 The rewards correspond to the payments from the advertiser to the publisher, and these rewards depend on the probability of users clicking on the ads. [sent-22, score-0.163]

13 Unfortunately, many real-world problems, including the computational advertising problem, do not fit so nicely into the traditional bandit framework. [sent-23, score-0.259]

14 Most of the time, advertisers have the ability to display more than one ad to users, and users can click on more than one of the ads displayed to them. [sent-24, score-0.169]

15 To capture this reality, in this paper we define the slate problem. [sent-25, score-0.566]

16 This setting is similar to the traditional bandit setting, except that here the advertiser selects a slate, or subset, of s actions. [sent-26, score-0.261]

17 In this paper we first consider the unordered slate problem, where the reward to the learning algorithm is the sum of the rewards of the chosen actions in the slate. [sent-27, score-0.99]

18 While this is a realistic assumption in certain settings, we also deal with the case when different positions in a slate have different importance. [sent-35, score-0.566]

19 One may plausibly assume that for every ad and every position that it can be shown in, there is a click-through-rate associated with the (ad, position) pair, which specifies the probability that a user will click on the ad if it is displayed in that position. [sent-39, score-0.22]

20 To abstract this, we turn to the ordered slate problem, where for each action and position in the ordering, the adversary specifies a reward for using the action in that position. [sent-41, score-0.876]

21 The reward to the learner then is the sum of the rewards of the (action, position) pairs in the chosen ordered slate. [sent-42, score-0.325]

22 This setting is similar to that of György, Linder, Lugosi and Ottucsák [10] in that the costs of all actions in the chosen slate are revealed, rather than just the total cost of the slate. [sent-43, score-0.699]

23 Finally, we show how to tackle these problems in the experts setting, where instead of competing with the best slate in hindsight, the algorithm competes with the best expert, recommending different slates on different rounds. [sent-44, score-1.105]

24 One key idea appearing in our algorithms is to use a variant of the multiplicative weights expert algorithm for a restricted convex set of distributions. [sent-45, score-0.117]

25 In our case, the restricted set of distributions over actions corresponds to the one defined by the stipulation that the learner choose a slate instead of individual actions. [sent-46, score-0.747]

26 The multi-armed bandit problem, first studied by Lai and Robbins [15], is a classic problem which has had wide application. [sent-49, score-0.215]

27 Lai and Robbins [15] and Auer, Cesa-Bianchi and Fischer [2] gave regret bounds of O(K ln(T)). [sent-53, score-0.28]

28 This non-stochastic setting of the multi-armed bandit problem is exactly the specific case of our problem when the slate size is 1, and hence our results generalize those of Auer et al. [sent-56, score-0.803]

29 Our problem is a special case of the more general online linear optimization with bandit feedback problem [1, 4, 5, 11]. [sent-58, score-0.238]

30 Specializing the best result in this series to our setting, we get worse regret bounds of O(√(T log(T))). [sent-59, score-0.28]

31 For a more specific comparison of regret bounds, see Section 2. [sent-61, score-0.24]

32 Our algorithms, being specialized for the slates problem, are simpler to implement as well, avoiding the sophisticated self-concordant barrier techniques of [1]. [sent-62, score-0.469]

33 Our work is also a special case of the Combinatorial Bandits setting of Cesa-Bianchi and Lugosi [9]; however, our algorithms obtain better regret bounds and are computationally more efficient. [sent-64, score-0.302]

34 Furthermore, the expertless, unordered slate problem is studied by Uchiya, Nakamura and Kudo [17] who obtain the same asymptotic bounds as appear in this paper, though using different techniques. [sent-66, score-0.816]

35 1 The unordered slate problem is a special case of the ordered slate problem for which all positional factors are equal. [sent-71, score-1.472]

36 However, the bound on the regret that we get when we consider the unordered slate problem separately is a factor of Õ(√s) better than when we treat it as a special case of the ordered slate problem. [sent-72, score-1.743]

37 2 The difference in the regret bounds can be attributed to the definition of regret in the stochastic and nonstochastic settings. [sent-73, score-0.542]

38 In the stochastic setting, we compare the algorithm’s expected reward to that of the arm with the largest expected reward, with the expectation taken over the reward distribution. [sent-74, score-0.109]

39 On each round t = 1, 2, . . . , T, we are required to choose a slate from a base set A of K actions. [sent-84, score-0.566]

40 An unordered slate is a subset S ⊆ A of s out of the K actions. [sent-85, score-0.776]

41 An ordered slate is a slate together with an ordering over its s actions; thus, it is a one-to-one mapping π : {1, 2, . . . , s} → A. [sent-86, score-1.262]

42 Prior to the selection of the slate, the adversary chooses losses for the actions in the slates. [sent-90, score-0.165]

43 Once the slate is chosen, the cost of only the actions in the chosen slate is revealed. [sent-91, score-1.243]

44 The adversary chooses a loss vector ℓ(t) ∈ R^K which specifies a loss ℓ_j(t) ∈ [−1, 1] for every action j ∈ A. [sent-93, score-0.208]

45 For a chosen slate S, only the coordinates ℓ_j(t) for j ∈ S are revealed, and the cost incurred for choosing S is Σ_{j∈S} ℓ_j(t). [sent-94, score-0.63]

46 The adversary chooses a loss matrix L(t) ∈ R^{s×K} which specifies a loss L_{ij}(t) ∈ [−1, 1] for every action j ∈ A and every position i, 1 ≤ i ≤ s, in the ordering on the slate. [sent-96, score-0.254]

47 For a chosen slate π, the entries L_{i,π(i)}(t) for every position i are revealed, and the cost incurred for choosing π is Σ_{i=1}^{s} L_{i,π(i)}(t). [sent-97, score-0.654]

48 In the unordered slate problem, if slate S(t) is chosen in round t, for t = 1, 2, . . . , T, then the regret of the algorithm is defined to be [sent-98, score-1.414]

49 Regret_T = Σ_{t=1}^{T} Σ_{j∈S(t)} ℓ_j(t) − min_S Σ_{t=1}^{T} Σ_{j∈S} ℓ_j(t). [sent-101, score-0.24]

50 Here, the subscript S is used as a shorthand for ranging over all slates S. [sent-102, score-0.469]
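To make the regret definition concrete, here is a small brute-force sketch that computes Regret_T for the unordered case from a log of loss vectors and chosen slates. It enumerates every slate, so it is only feasible for tiny K and s, and all names are illustrative.

```python
import numpy as np
from itertools import combinations

def unordered_slate_regret(losses, chosen_slates, s):
    """losses: T x K array of loss vectors; chosen_slates: list of T index sets of size s."""
    T, K = losses.shape
    alg_loss = sum(losses[t, list(S)].sum() for t, S in enumerate(chosen_slates))
    # Best fixed slate in hindsight, by exhaustive search over all C(K, s) slates.
    best_fixed = min(losses[:, list(S)].sum() for S in combinations(range(K), s))
    return alg_loss - best_fixed

# Example: T = 3 rounds, K = 4 actions, slates of size s = 2.
losses = np.array([[0.2, -0.5, 1.0, 0.0],
                   [0.1, 0.4, -0.3, 0.6],
                   [0.9, -0.1, 0.2, -0.8]])
print(unordered_slate_regret(losses, [{0, 1}, {2, 3}, {1, 3}], s=2))
```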

51 The regret for the ordered slate problem is defined analogously. [sent-103, score-0.936]

52 Our goal is to design a randomized algorithm for online slate selection such that E[Regret_T] = o(T), where the expectation is taken over the internal randomization of the algorithm. [sent-104, score-0.611]

53 Frequently in applications we have access to N policies which are algorithms that recommend slates to use in every round. [sent-106, score-0.638]

54 These policies might leverage extra information that we have about the losses in the next round. [sent-107, score-0.146]

55 It is therefore beneficial to devise algorithms that have low regret with respect to the best policy in the pool in hindsight, where regret is defined as: Regret_T = Σ_{t=1}^{T} Σ_{j∈S(t)} ℓ_j(t) − min_ρ Σ_{t=1}^{T} Σ_{j∈S_ρ(t)} ℓ_j(t), with S_ρ(t) denoting the slate recommended by policy ρ in round t. [sent-108, score-0.544]

56 There are efficient (running in poly(s, K) time in the no-policies case, and in poly(s, K, N) time with N policies) randomized algorithms achieving O(√T) regret bounds for unordered and ordered slates, both with and without policies; for unordered slates with no policies the bound is 4√(sK ln(K/s) T). [sent-115, score-1.346]

57 To compare, the best bounds obtained for the no-policies case using the more general algorithms [1] and [9] are O(√(s³K ln(K/s) T)) in the unordered slates problem, and O(√(s²K ln(K) T)) in the ordered slates problem. [sent-123, score-1.318]

58 It is also possible, in the no-policies setting, to devise algorithms that have regret bounded by O(√T) with high probability, using the upper confidence bounds technique of [3]. [sent-124, score-0.28]

59 Compute the probability vector p(t + 1) using the following multiplicative update rule: for every expert i, p_i(t + 1) = p_i(t) exp(−η ℓ̂_i(t)) / Z(t), (1) where Z(t) = Σ_i p_i(t) exp(−η ℓ̂_i(t)) is the normalization factor. [sent-135, score-0.185]
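As a point of reference, here is a minimal Python sketch of update (1); the learning rate eta and the vector of estimated losses are placeholders, and this is only the generic Hedge step, not the full slate algorithm.

```python
import numpy as np

def hedge_update(p, est_loss, eta):
    """One multiplicative-weights step: p_i <- p_i * exp(-eta * est_loss_i) / Z(t)."""
    w = p * np.exp(-eta * est_loss)   # unnormalized new weights
    return w / w.sum()                # dividing by Z(t), the normalization factor

# Example: 4 experts, uniform start, one round of estimated losses.
p = np.full(4, 0.25)
p = hedge_update(p, np.array([0.1, 0.9, -0.2, 0.0]), eta=0.5)
print(p)
```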

60 Algorithms for the slate problems with no policies; main algorithmic ideas. Our starting point is the Hedge algorithm for learning online with expert advice. [sent-141, score-0.767]

61 In this setting, on each round t, the learner chooses a probability distribution p(t) over experts, each of which then suffers a (fully observable) loss represented by the vector ℓ(t). [sent-142, score-0.186]

62 The main idea of our approach is to apply Hedge (and ideas from bandit variants of it, especially Exp3 [3]) by associating the probability distributions that it selects with mixtures of (ordered or unordered) slates, and thus with the randomized choice of a slate. [sent-144, score-0.261]

63 The goal then is to minimize regret relative to an arbitrary distribution p ∈ P. [sent-147, score-0.24]

64 Unordered slates with no policies. To apply the approach described above, we need a way to compactly represent the set of distributions over slates. [sent-159, score-0.639]

65 We do this by embedding slates as points in some high-dimensional Euclidean space, and then giving a compact representation of the convex hull of the embedded points. [sent-160, score-0.572]

66 Specifically, we represent an unordered slate S by its indicator vector 1_S ∈ R^K, which is 1 for all coordinates j ∈ S, and 0 for all others. [sent-161, score-0.817]

67 The convex hull X of all such 1_S vectors can be succinctly described [18] as the convex polytope defined by the linear constraints Σ_{j=1}^{K} x_j = s and 0 ≤ x_j ≤ 1 for j = 1, . . . , K. [sent-162, score-0.189]

68 An algorithm is given in [18] (Algorithm 2) to decompose any vector x ∈ X into a convex combination of at most K indicator vectors 1_S. [sent-166, score-0.132]

69 We embed the convex hull X of all the 1_S vectors in the simplex of distributions over the K actions simply by scaling down all coordinates by s so that they sum to 1. [sent-167, score-0.297]
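The decomposition routine itself is not reproduced in this extract; the sketch below is one possible greedy scheme, in the spirit of the cited Algorithm 2 of [18] but not necessarily identical to it, for writing a point x with Σ_j x_j = s and 0 ≤ x_j ≤ 1 as a convex combination of slate indicator vectors. The function name is illustrative.

```python
import numpy as np

def decompose_into_slates(x, s, tol=1e-12):
    """Greedily write x (sum = s, 0 <= x_j <= 1) as sum_i q_i * 1_{S_i}, q_i > 0, sum_i q_i = 1.
    Returns a list of (q, slate) pairs; slates are sorted index lists."""
    x = np.array(x, dtype=float)
    pieces, w = [], 1.0                # w = remaining convex weight; invariant: x_j <= w, sum(x) = s*w
    for _ in range(2 * len(x)):        # generous cap; [18] shows at most K pieces are needed
        if w <= tol:
            break
        order = np.argsort(-x)         # coordinates sorted by decreasing value
        S, rest = order[:s], order[s:]
        # Largest q such that x - q*1_S still satisfies 0 <= x_j <= w - q for all j.
        q = min(x[S].min(), w - (x[rest].max() if len(rest) else 0.0), w)
        if q <= tol:
            break                      # numerical safeguard
        pieces.append((q, sorted(S.tolist())))
        x[S] -= q
        w -= q
    return pieces

# Example: K = 4, s = 2; the coordinates sum to 2.
for q, S in decompose_into_slates([0.9, 0.7, 0.3, 0.1], s=2):
    print(q, S)   # 0.7 [0, 1] / 0.2 [0, 2] / 0.1 [2, 3]
```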

70 Decompose sp′(t) as a convex combination of slate vectors 1_S corresponding to slates S, as sp′(t) = Σ_S q_S 1_S, where q_S > 0 and Σ_S q_S = 1. [sent-184, score-1.412]

71 Choose a slate S to display with probability q_S, and obtain the losses ℓ_j(t) for all j ∈ S. [sent-186, score-0.633]

72 Figure 2: The Bandit Algorithm with Unordered Slates. We now prove the regret bound of Theorem 2. [sent-191, score-0.271]
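Before the analysis, here is a schematic version of the loop in Figure 2, reconstructed from the extracted steps. The projection helper stands in for the relative entropy projection onto P mentioned in the abstract, the loss estimate ℓ̂_j(t) = ℓ_j(t)/(s p′_j(t)) follows the unbiasedness fact quoted just below, and decompose_into_slates is the hypothetical helper sketched earlier; parameter choices and other details may differ from the paper.

```python
import numpy as np

def project_to_restricted_set(p, cap):
    """Cap the largest coordinates of a (strictly positive) probability vector at `cap` and
    rescale the rest, so the result lies in P = {q : sum q = 1, 0 <= q_j <= cap}.  This is a
    stand-in for the relative entropy projection; the paper's exact step may differ."""
    p = np.asarray(p, dtype=float)
    order = np.argsort(-p)
    q = p.copy()
    for k in range(len(p) + 1):              # k = number of coordinates capped at `cap`
        rest = order[k:]
        rest_sum = p[rest].sum()
        scale = (1.0 - k * cap) / rest_sum if rest_sum > 0 else 0.0
        if len(rest) == 0 or p[rest[0]] * scale <= cap + 1e-12:
            q[order[:k]] = cap
            q[rest] = p[rest] * scale
            return q
    return q

def unordered_slate_bandit(get_losses, K, s, T, eta, gamma, rng=None):
    """Schematic Exp3-style loop for the unordered slate problem.  get_losses(t) returns the
    full loss vector in [-1, 1]^K; only the chosen coordinates are treated as observed."""
    rng = rng or np.random.default_rng()
    p = np.full(K, 1.0 / K)                  # starts inside P because 1/K <= 1/s
    total_loss = 0.0
    for t in range(T):
        p_mix = (1 - gamma) * p + gamma / K  # exploration mixing keeps p_mix inside P
        pieces = decompose_into_slates(s * p_mix, s)
        qs = np.array([q for q, _ in pieces])
        S = pieces[rng.choice(len(pieces), p=qs / qs.sum())][1]
        ell = np.asarray(get_losses(t), dtype=float)
        total_loss += ell[S].sum()
        est = np.zeros(K)
        est[S] = ell[S] / (s * p_mix[S])     # unbiased per-action loss estimates
        p = p * np.exp(-eta * est)           # multiplicative-weights step
        p = project_to_restricted_set(p / p.sum(), cap=1.0 / s)
    return total_loss
```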

73 We note the following facts: E_t[ℓ̂_j(t)] = Σ_{S∋j} q_S · ℓ_j(t)/(s p′_j(t)) = ℓ_j(t), since p′_j(t) = Σ_{S∋j} q_S · (1/s). [sent-194, score-0.147]

74 Note that if we decompose a distribution p ∈ P as a convex combination of (1/s)·1_S vectors and randomly choose a slate S according to its weight in the combination, then the expected loss, averaged over the s actions chosen, is ℓ(t) · p. [sent-196, score-0.786]

75 We can bound the difference between the expected loss (averaged over the s actions) in round t suffered by the algorithm, ℓ(t) · p′(t), and ℓ(t) · p(t) as follows: ℓ(t) · p′(t) − ℓ(t) · p(t) = Σ_j ℓ_j(t)(p′_j(t) − p_j(t)), which is O(γ) since |ℓ_j(t)| ≤ 1. Using this bound and Theorem 3. [sent-197, score-0.222]

76 We note that the leading factor of 1/s on the expected regret is due to the averaging over the s positions. [sent-200, score-0.24]

77 First, we have E_t[(ℓ̂(t))² · p(t)] = Σ_S q_S Σ_{j∈S} (ℓ_j(t))² p_j(t)/(s p′_j(t))² = Σ_j (ℓ_j(t))² p_j(t)/(s p′_j(t)) ≤ K/(s(1 − γ)), because Σ_{S∋j} q_S = s p′_j(t), p_j(t)/p′_j(t) ≤ 1/(1 − γ), and all |ℓ_j(t)| ≤ 1. [sent-202, score-0.443]

78 Decompose sp′(t) as a convex combination of M_π matrices corresponding to ordered slates π, as sp′(t) = Σ_π q_π M_π, where q_π > 0 and Σ_π q_π = 1. [sent-217, score-0.977]

79 Ordered slates with no policies. A similar approach can be used for ordered slates. [sent-227, score-0.745]

80 Here, we represent an ordered slate π by the subpermutation matrix M_π ∈ R^{s×K} which is defined as follows: for i = 1, 2, . . . , s, the entry in row i and column π(i) is 1, and all other entries are 0. [sent-228, score-0.74]

81 In [7, 16], it is shown that the convex hull M of all the M_π matrices is the convex polytope defined by the linear constraints Σ_{j=1}^{K} M_{ij} = 1 for i = 1, . . . , s, Σ_{i=1}^{s} M_{ij} ≤ 1 for j = 1, . . . , K, and M_{ij} ≥ 0. [sent-232, score-0.19]

82 To complete the characterization of the convex hull, we can show (details omitted) that given any matrix M ∈ M, we can efficiently decompose it into a convex combination of at most K² subpermutation matrices. [sent-246, score-0.205]
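For concreteness, a small sketch that builds M_π for an ordered slate and checks the hull constraints just described; the function name and the example are illustrative.

```python
import numpy as np

def subpermutation_matrix(pi, K):
    """Build M_pi in R^{s x K}: row i has a single 1 in column pi[i] (0-indexed), zeros elsewhere."""
    s = len(pi)
    M = np.zeros((s, K))
    M[np.arange(s), pi] = 1.0
    return M

# Ordered slate of size s = 3 over K = 5 actions: position 0 shows action 4, position 1 action 0, ...
M = subpermutation_matrix([4, 0, 2], K=5)
assert np.allclose(M.sum(axis=1), 1.0)   # each row (position) sums to 1
assert np.all(M.sum(axis=0) <= 1.0)      # each column (action) is used at most once
```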

83 The regret bound analysis is similar to that of Section 3. [sent-259, score-0.271]

84 We can show also that L(t) • p′(t) − L(t) • p(t) ≤ γ. [sent-262, score-0.144]

85 Obtain the distribution over policies r(t) from MW, and the recommended distribution over slates φ_ρ(t) ∈ P for each policy ρ. [sent-272, score-0.708]

86 Decompose sp′(t) as a convex combination of slate vectors 1_S corresponding to slates S, as sp′(t) = Σ_S q_S 1_S, where q_S > 0 and Σ_S q_S = 1. [sent-279, score-1.412]

87 Choose a slate S to display with probability q_S, and obtain the losses ℓ_j(t) for all j ∈ S. [sent-281, score-0.633]

88 Figure 4: Bandit Algorithm for Unordered Slates With Policies. The corresponding quadratic term is bounded as Σ_{i=1}^{s} Σ_{j=1}^{K} (L_{ij}(t))² p_{ij}(t)/(s p′_{ij}(t)) ≤ K/(1 − γ), because p_{ij}(t)/p′_{ij}(t) ≤ 1/(1 − γ) and all |L_{ij}(t)| ≤ 1. [sent-286, score-0.158]
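Reading the steps of Figure 4 together, an Exp4-style loop such as the following is one plausible schematic. The combination p(t) = Σ_ρ r_ρ(t) φ_ρ(t), the per-policy loss estimates, and the reuse of the hypothetical decompose_into_slates helper are assumptions of this sketch, not details quoted from the paper.

```python
import numpy as np

def unordered_slates_with_policies(policies, get_losses, K, s, T, eta, gamma, rng=None):
    """Schematic MW-over-policies loop.  policies[rho](t) must return a length-K vector in P
    (sums to 1, coordinates <= 1/s); get_losses(t) returns the full loss vector in [-1, 1]^K."""
    rng = rng or np.random.default_rng()
    N = len(policies)
    r = np.full(N, 1.0 / N)                           # MW distribution over the N policies
    total_loss = 0.0
    for t in range(T):
        phi = np.array([pol(t) for pol in policies])  # N x K recommended distributions
        p = r @ phi                                   # combined distribution over actions, still in P
        p_mix = (1 - gamma) * p + gamma / K
        pieces = decompose_into_slates(s * p_mix, s)
        qs = np.array([q for q, _ in pieces])
        S = pieces[rng.choice(len(pieces), p=qs / qs.sum())][1]
        ell = np.asarray(get_losses(t), dtype=float)
        total_loss += ell[S].sum()
        est = np.zeros(K)
        est[S] = ell[S] / (s * p_mix[S])              # unbiased per-action loss estimates
        policy_loss = phi @ est                       # estimated loss of each policy's recommendation
        r = r * np.exp(-eta * policy_loss)            # MW update over policies
        r /= r.sum()
    return total_loss
```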

89 Competing with a set of policies; Unordered Slates with N Policies. In each round, every policy ρ recommends a distribution over slates φ_ρ(t) ∈ P, where P is X scaled down by s as in Section 3. [sent-292, score-0.757]

90 Again the regret bound analysis is along the lines of Section 3. [sent-295, score-0.271]

91 We have, for any j, E_t[ℓ̂_j(t)] = Σ_{S∋j} q_S · ℓ_j(t)/(s p′_j(t)) = ℓ_j(t). [sent-297, score-0.144]

92 Obtain the distribution over policies r(t) from MW, and the recommended distribution over ordered slates φ_ρ(t) ∈ P for each policy ρ. [sent-310, score-0.838]

93 Decompose sp′(t) as a convex combination of M_π matrices corresponding to ordered slates π, as sp′(t) = Σ_π q_π M_π, where q_π > 0 and Σ_π q_π = 1. [sent-317, score-0.977]

94 Plugging these bounds into the bound above, we get the stated regret bound from Theorem 2. [sent-329, score-0.342]

95 The parameters are set to η = √((1 − γ) s ln(N)/(KT)) and γ = √((K/s) ln(N)/T), which satisfy the necessary technical conditions. Ordered Slates with N Policies: In each round, every policy ρ recommends a distribution over ordered slates φ_ρ(t) ∈ P, where P is M scaled down by s as in Section 3. [sent-332, score-0.741]

96 The regret bound analysis is exactly along the lines of that in Section 4. [sent-335, score-0.271]

97 The first direction is to handle other user models for the loss matrices, such as models incorporating the following sort of interaction between the chosen actions: if two very similar ads are shown, and the user clicks on one, then the user is less likely to click on the other. [sent-343, score-0.211]

98 The second direction is to derive high-probability O(√T) regret bounds for the slate problems in the presence of policies. [sent-345, score-0.846]

99 Competing in the dark: An efficient algorithm for bandit linear optimization. [sent-350, score-0.215]

100 Randomized PCA algorithms with regret bounds that are logarithmic in the dimension. [sent-453, score-0.28]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('slate', 0.566), ('slates', 0.469), ('regret', 0.24), ('bandit', 0.215), ('unordered', 0.21), ('mw', 0.177), ('ln', 0.169), ('policies', 0.146), ('sp', 0.144), ('qs', 0.136), ('ordered', 0.13), ('regrett', 0.129), ('actions', 0.088), ('kt', 0.085), ('lij', 0.077), ('pj', 0.074), ('spj', 0.073), ('learner', 0.069), ('policy', 0.064), ('rewards', 0.06), ('hedge', 0.059), ('hull', 0.055), ('auer', 0.05), ('sk', 0.049), ('round', 0.049), ('convex', 0.048), ('warmuth', 0.046), ('adversary', 0.046), ('advertising', 0.044), ('esa', 0.044), ('ianchi', 0.044), ('rsk', 0.044), ('spi', 0.044), ('spij', 0.044), ('subpermutation', 0.044), ('reward', 0.043), ('decompose', 0.043), ('coordinates', 0.041), ('experts', 0.04), ('bounds', 0.04), ('pij', 0.038), ('multiplicative', 0.037), ('loss', 0.037), ('rs', 0.036), ('recommends', 0.035), ('ad', 0.035), ('action', 0.034), ('robbins', 0.033), ('ads', 0.033), ('hindsight', 0.033), ('expert', 0.032), ('click', 0.031), ('bound', 0.031), ('chooses', 0.031), ('pi', 0.031), ('competing', 0.03), ('display', 0.03), ('bregman', 0.03), ('akhlin', 0.029), ('azan', 0.029), ('koolen', 0.029), ('ottucs', 0.029), ('reyzin', 0.029), ('ugosi', 0.029), ('re', 0.029), ('user', 0.029), ('recommended', 0.029), ('mij', 0.028), ('colt', 0.027), ('yahoo', 0.026), ('li', 0.026), ('lev', 0.026), ('schapire', 0.025), ('distributions', 0.024), ('initialization', 0.024), ('poly', 0.024), ('nakamura', 0.024), ('advertiser', 0.024), ('arm', 0.023), ('online', 0.023), ('chosen', 0.023), ('position', 0.023), ('every', 0.023), ('simplex', 0.022), ('setting', 0.022), ('randomized', 0.022), ('lai', 0.022), ('nonstochastic', 0.022), ('minp', 0.022), ('combination', 0.022), ('displayed', 0.021), ('multiarmed', 0.021), ('scaled', 0.02), ('matrices', 0.02), ('users', 0.019), ('send', 0.019), ('vectors', 0.019), ('entries', 0.019), ('polytope', 0.019), ('lugosi', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 183 nips-2010-Non-Stochastic Bandit Slate Problems

Author: Satyen Kale, Lev Reyzin, Robert E. Schapire

Abstract: We consider bandit problems, motivated by applications in online advertising and news story selection, in which the learner must repeatedly select a slate, that is, a subset of size s from K possible actions, and then receives rewards for just the selected actions. The goal is to minimize the regret with respect to total reward of the best slate computed in hindsight. We consider unordered and ordered versions of the problem, and give efficient algorithms which have regret O(√T), where the constant depends on the specific nature of the problem. We also consider versions of the problem where we have access to a number of policies which make recommendations for slates in every round, and give algorithms with O(√T) regret for competing with the best such policy as well. We make use of the technique of relative entropy projections combined with the usual multiplicative weight update algorithm to obtain our algorithms.

2 0.20751224 196 nips-2010-Online Markov Decision Processes under Bandit Feedback

Author: Gergely Neu, Andras Antos, András György, Csaba Szepesvári

Abstract: We consider online learning in finite stochastic Markovian environments where in each time step a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in terms of the total reward received. In each time step the agent observes the current state and the reward associated with the last transition, however, the agent does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state of the art result for this setting is a no-regret algorithm. In this paper we propose a new learning algorithm and, assuming that stationary policies mix uniformly fast, we show that after T time steps, the expected regret of the new algorithm is O(T^{2/3} (ln T)^{1/3}), giving the first rigorously proved regret bound for the problem.

3 0.18869936 203 nips-2010-Parametric Bandits: The Generalized Linear Case

Author: Sarah Filippi, Olivier Cappe, Aurélien Garivier, Csaba Szepesvári

Abstract: We consider structured multi-armed bandit problems based on the Generalized Linear Model (GLM) framework of statistics. For these bandits, we propose a new algorithm, called GLM-UCB. We derive finite time, high probability bounds on the regret of the algorithm, extending previous analyses developed for the linear bandits to the non-linear case. The analysis highlights a key difficulty in generalizing linear bandit algorithms to the non-linear case, which is solved in GLM-UCB by focusing on the reward space rather than on the parameter space. Moreover, as the actual effectiveness of current parameterized bandit algorithms is often poor in practice, we provide a tuning method based on asymptotic arguments, which leads to significantly better practical performance. We present two numerical experiments on real-world data that illustrate the potential of the GLM-UCB approach. Keywords: multi-armed bandit, parametric bandits, generalized linear models, UCB, regret minimization. 1

4 0.16755262 152 nips-2010-Learning from Logged Implicit Exploration Data

Author: Alex Strehl, John Langford, Lihong Li, Sham M. Kakade

Abstract: We provide a sound and consistent foundation for the use of nonrandom exploration data in “contextual bandit” or “partially labeled” settings where only the value of a chosen action is learned. The primary challenge in a variety of settings is that the exploration policy, in which “offline” data is logged, is not explicitly known. Prior solutions here require either control of the actions during the learning process, recorded random exploration, or actions chosen obliviously in a repeated manner. The techniques reported here lift these restrictions, allowing the learning of a policy for choosing actions given features from historical data where no randomization occurred or was logged. We empirically verify our solution on two reasonably sized sets of real-world data obtained from Yahoo!.

5 0.13270809 222 nips-2010-Random Walk Approach to Regret Minimization

Author: Hariharan Narayanan, Alexander Rakhlin

Abstract: We propose a computationally efficient random walk on a convex body which rapidly mixes to a time-varying Gibbs distribution. In the setting of online convex optimization and repeated games, the algorithm yields low regret and presents a novel efficient method for implementing mixture forecasting strategies. 1

6 0.12275576 192 nips-2010-Online Classification with Specificity Constraints

7 0.10741532 184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning

8 0.086800255 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning

9 0.0752462 38 nips-2010-Batch Bayesian Optimization via Simulation Matching

10 0.074968107 4 nips-2010-A Computational Decision Theory for Interactive Assistants

11 0.073295131 217 nips-2010-Probabilistic Multi-Task Feature Selection

12 0.072254404 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability

13 0.068641037 226 nips-2010-Repeated Games against Budgeted Adversaries

14 0.068469629 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models

15 0.06753619 14 nips-2010-A Reduction from Apprenticeship Learning to Classification

16 0.064132884 43 nips-2010-Bootstrapping Apprenticeship Learning

17 0.06066478 189 nips-2010-On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient

18 0.053546891 93 nips-2010-Feature Construction for Inverse Reinforcement Learning

19 0.053431164 160 nips-2010-Linear Complementarity for Regularized Policy Evaluation and Improvement

20 0.052530166 243 nips-2010-Smoothness, Low Noise and Fast Rates


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.135), (1, -0.162), (2, 0.059), (3, -0.013), (4, 0.019), (5, 0.044), (6, -0.082), (7, -0.025), (8, -0.053), (9, 0.185), (10, 0.009), (11, 0.049), (12, -0.019), (13, -0.0), (14, 0.011), (15, 0.025), (16, -0.084), (17, -0.01), (18, 0.08), (19, 0.005), (20, 0.043), (21, -0.061), (22, -0.077), (23, 0.056), (24, -0.031), (25, 0.035), (26, 0.056), (27, -0.068), (28, 0.041), (29, 0.088), (30, 0.033), (31, -0.101), (32, -0.018), (33, -0.103), (34, 0.05), (35, -0.072), (36, -0.023), (37, -0.088), (38, -0.056), (39, 0.08), (40, 0.011), (41, 0.028), (42, -0.122), (43, 0.043), (44, 0.1), (45, -0.026), (46, 0.072), (47, 0.017), (48, 0.085), (49, 0.009)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93015862 183 nips-2010-Non-Stochastic Bandit Slate Problems

Author: Satyen Kale, Lev Reyzin, Robert E. Schapire

Abstract: We consider bandit problems, motivated by applications in online advertising and news story selection, in which the learner must repeatedly select a slate, that is, a subset of size s from K possible actions, and then receives rewards for just the selected actions. The goal is to minimize the regret with respect to total reward of the best slate computed in hindsight. We consider unordered and ordered versions of the problem, and give efficient algorithms which have regret O(√T), where the constant depends on the specific nature of the problem. We also consider versions of the problem where we have access to a number of policies which make recommendations for slates in every round, and give algorithms with O(√T) regret for competing with the best such policy as well. We make use of the technique of relative entropy projections combined with the usual multiplicative weight update algorithm to obtain our algorithms.

2 0.73527306 203 nips-2010-Parametric Bandits: The Generalized Linear Case

Author: Sarah Filippi, Olivier Cappe, Aurélien Garivier, Csaba Szepesvári

Abstract: We consider structured multi-armed bandit problems based on the Generalized Linear Model (GLM) framework of statistics. For these bandits, we propose a new algorithm, called GLM-UCB. We derive finite time, high probability bounds on the regret of the algorithm, extending previous analyses developed for the linear bandits to the non-linear case. The analysis highlights a key difficulty in generalizing linear bandit algorithms to the non-linear case, which is solved in GLM-UCB by focusing on the reward space rather than on the parameter space. Moreover, as the actual effectiveness of current parameterized bandit algorithms is often poor in practice, we provide a tuning method based on asymptotic arguments, which leads to significantly better practical performance. We present two numerical experiments on real-world data that illustrate the potential of the GLM-UCB approach. Keywords: multi-armed bandit, parametric bandits, generalized linear models, UCB, regret minimization. 1

3 0.70753318 196 nips-2010-Online Markov Decision Processes under Bandit Feedback

Author: Gergely Neu, Andras Antos, András György, Csaba Szepesvári

Abstract: We consider online learning in finite stochastic Markovian environments where in each time step a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in terms of the total reward received. In each time step the agent observes the current state and the reward associated with the last transition, however, the agent does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state of the art result for this setting is a no-regret algorithm. In this paper we propose a new learning algorithm and, assuming that stationary policies mix uniformly fast, we show that after T time steps, the expected regret of the new algorithm is O(T^{2/3} (ln T)^{1/3}), giving the first rigorously proved regret bound for the problem.

4 0.62242311 222 nips-2010-Random Walk Approach to Regret Minimization

Author: Hariharan Narayanan, Alexander Rakhlin

Abstract: We propose a computationally efficient random walk on a convex body which rapidly mixes to a time-varying Gibbs distribution. In the setting of online convex optimization and repeated games, the algorithm yields low regret and presents a novel efficient method for implementing mixture forecasting strategies. 1

5 0.56211841 192 nips-2010-Online Classification with Specificity Constraints

Author: Andrey Bernstein, Shie Mannor, Nahum Shimkin

Abstract: We consider the online binary classification problem, where we are given m classifiers. At each stage, the classifiers map the input to the probability that the input belongs to the positive class. An online classification meta-algorithm is an algorithm that combines the outputs of the classifiers in order to attain a certain goal, without having prior knowledge on the form and statistics of the input, and without prior knowledge on the performance of the given classifiers. In this paper, we use sensitivity and specificity as the performance metrics of the meta-algorithm. In particular, our goal is to design an algorithm that satisfies the following two properties (asymptotically): (i) its average false positive rate (fp-rate) is under some given threshold; and (ii) its average true positive rate (tp-rate) is not worse than the tp-rate of the best convex combination of the m given classifiers that satisfies the fp-rate constraint, in hindsight. We show that this problem is in fact a special case of the regret minimization problem with constraints, and therefore the above goal is not attainable. Hence, we pose a relaxed goal and propose a corresponding practical online learning meta-algorithm that attains it. In the case of two classifiers, we show that this algorithm takes a very simple form. To our best knowledge, this is the first algorithm that addresses the problem of the average tp-rate maximization under average fp-rate constraints in the online setting. 1

6 0.52269953 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning

7 0.51906967 4 nips-2010-A Computational Decision Theory for Interactive Assistants

8 0.44203976 152 nips-2010-Learning from Logged Implicit Exploration Data

9 0.4133167 229 nips-2010-Reward Design via Online Gradient Ascent

10 0.40352008 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability

11 0.39004815 38 nips-2010-Batch Bayesian Optimization via Simulation Matching

12 0.37903157 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains

13 0.36983603 11 nips-2010-A POMDP Extension with Belief-dependent Rewards

14 0.35441276 202 nips-2010-Parallelized Stochastic Gradient Descent

15 0.34547302 66 nips-2010-Double Q-learning

16 0.32132652 74 nips-2010-Empirical Bernstein Inequalities for U-Statistics

17 0.29590362 269 nips-2010-Throttling Poisson Processes

18 0.29560447 122 nips-2010-Improving the Asymptotic Performance of Markov Chain Monte-Carlo by Inserting Vortices

19 0.29225293 226 nips-2010-Repeated Games against Budgeted Adversaries

20 0.28165659 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.072), (24, 0.337), (27, 0.048), (30, 0.067), (45, 0.138), (50, 0.032), (52, 0.022), (53, 0.023), (60, 0.018), (67, 0.013), (77, 0.039), (78, 0.02), (90, 0.066)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.7217527 183 nips-2010-Non-Stochastic Bandit Slate Problems

Author: Satyen Kale, Lev Reyzin, Robert E. Schapire

Abstract: We consider bandit problems, motivated by applications in online advertising and news story selection, in which the learner must repeatedly select a slate, that is, a subset of size s from K possible actions, and then receives rewards for just the selected actions. The goal is to minimize the regret with respect to total reward of the best slate computed in hindsight. We consider unordered and ordered versions of the problem, and give efficient algorithms which have regret O(√T), where the constant depends on the specific nature of the problem. We also consider versions of the problem where we have access to a number of policies which make recommendations for slates in every round, and give algorithms with O(√T) regret for competing with the best such policy as well. We make use of the technique of relative entropy projections combined with the usual multiplicative weight update algorithm to obtain our algorithms.

2 0.60646588 248 nips-2010-Sparse Inverse Covariance Selection via Alternating Linearization Methods

Author: Katya Scheinberg, Shiqian Ma, Donald Goldfarb

Abstract: Gaussian graphical models are of great interest in statistical learning. Because the conditional independencies between different nodes correspond to zero entries in the inverse covariance matrix of the Gaussian distribution, one can learn the structure of the graph by estimating a sparse inverse covariance matrix from sample data, by solving a convex maximum likelihood problem with an ℓ1 -regularization term. In this paper, we propose a first-order method based on an alternating linearization technique that exploits the problem’s special structure; in particular, the subproblems solved in each iteration have closed-form solutions. Moreover, our algorithm obtains an ϵ-optimal solution in O(1/ϵ) iterations. Numerical experiments on both synthetic and real data from gene association networks show that a practical version of this algorithm outperforms other competitive algorithms. 1

3 0.55747974 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning

Author: Mahdi M. Fard, Joelle Pineau

Abstract: This paper introduces the first set of PAC-Bayesian bounds for the batch reinforcement learning problem in finite state spaces. These bounds hold regardless of the correctness of the prior distribution. We demonstrate how such bounds can be used for model-selection in control problems where prior information is available either on the dynamics of the environment, or on the value of actions. Our empirical results confirm that PAC-Bayesian model-selection is able to leverage prior distributions when they are informative and, unlike standard Bayesian RL approaches, ignores them when they are misleading. 1

4 0.53445202 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average

Author: Wei Chen, Tie-yan Liu, Zhi-ming Ma

Abstract: This paper is concerned with the generalization analysis on learning to rank for information retrieval (IR). In IR, data are hierarchically organized, i.e., consisting of queries and documents. Previous generalization analysis for ranking, however, has not fully considered this structure, and cannot explain how the simultaneous change of query number and document number in the training data will affect the performance of the learned ranking model. In this paper, we propose performing generalization analysis under the assumption of two-layer sampling, i.e., the i.i.d. sampling of queries and the conditional i.i.d sampling of documents per query. Such a sampling can better describe the generation mechanism of real data, and the corresponding generalization analysis can better explain the real behaviors of learning to rank algorithms. However, it is challenging to perform such analysis, because the documents associated with different queries are not identically distributed, and the documents associated with the same query become no longer independent after represented by features extracted from query-document matching. To tackle the challenge, we decompose the expected risk according to the two layers, and make use of the new concept of two-layer Rademacher average. The generalization bounds we obtained are quite intuitive and are in accordance with previous empirical studies on the performances of ranking algorithms. 1

5 0.47870111 117 nips-2010-Identifying graph-structured activation patterns in networks

Author: James Sharpnack, Aarti Singh

Abstract: We consider the problem of identifying an activation pattern in a complex, largescale network that is embedded in very noisy measurements. This problem is relevant to several applications, such as identifying traces of a biochemical spread by a sensor network, expression levels of genes, and anomalous activity or congestion in the Internet. Extracting such patterns is a challenging task specially if the network is large (pattern is very high-dimensional) and the noise is so excessive that it masks the activity at any single node. However, typically there are statistical dependencies in the network activation process that can be leveraged to fuse the measurements of multiple nodes and enable reliable extraction of highdimensional noisy patterns. In this paper, we analyze an estimator based on the graph Laplacian eigenbasis, and establish the limits of mean square error recovery of noisy patterns arising from a probabilistic (Gaussian or Ising) model based on an arbitrary graph structure. We consider both deterministic and probabilistic network evolution models, and our results indicate that by leveraging the network interaction structure, it is possible to consistently recover high-dimensional patterns even when the noise variance increases with network size. 1

6 0.47710139 222 nips-2010-Random Walk Approach to Regret Minimization

7 0.47493774 226 nips-2010-Repeated Games against Budgeted Adversaries

8 0.47222996 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability

9 0.47183007 178 nips-2010-Multivariate Dyadic Regression Trees for Sparse Learning Problems

10 0.4699387 156 nips-2010-Learning to combine foveal glimpses with a third-order Boltzmann machine

11 0.46810931 30 nips-2010-An Inverse Power Method for Nonlinear Eigenproblems with Applications in 1-Spectral Clustering and Sparse PCA

12 0.46786946 265 nips-2010-The LASSO risk: asymptotic results and real world examples

13 0.46708131 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing

14 0.4656288 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models

15 0.4644407 280 nips-2010-Unsupervised Kernel Dimension Reduction

16 0.46417719 261 nips-2010-Supervised Clustering

17 0.46396667 221 nips-2010-Random Projections for $k$-means Clustering

18 0.46321058 148 nips-2010-Learning Networks of Stochastic Differential Equations

19 0.46308309 196 nips-2010-Online Markov Decision Processes under Bandit Feedback

20 0.46304283 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression