nips nips2007 nips2007-199 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Varsha Dani, Sham M. Kakade, Thomas P. Hayes
Abstract: In the online linear optimization problem, a learner must choose, in each round, a decision from a set D ⊂ Rn in order to minimize an (unknown and changing) linear cost function. We present sharp rates of convergence (with respect to additive regret) for both the full information setting (where the cost function is revealed at the end of each round) and the bandit setting (where only the scalar cost incurred is revealed). In particular, this paper is concerned with the price of bandit information, by which we mean the ratio of the best achievable regret in the bandit setting to that in the full-information setting. For the full information case, the upper bound on the regret is O∗(√(nT)), where n is the ambient dimension and T is the time horizon. For the bandit case, we present an algorithm which achieves O∗(n^{3/2}√T) regret — all previous (nontrivial) bounds here were O(poly(n)T^{2/3}) or worse. It is striking that the convergence rate for the bandit setting is only a factor of n worse than in the full information case — in stark contrast to the K-arm bandit setting, where the gap in the dependence on K is exponential (√(TK) vs. √(T log K)). We also present lower bounds showing that this gap is at least √n, which we conjecture to be the correct order. The bandit algorithm we present can be implemented efficiently in special cases of particular interest, such as path planning and Markov Decision Problems. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In the online linear optimization problem, a learner must choose, in each round, a decision from a set D ⊂ Rn in order to minimize an (unknown and changing) linear cost function. [sent-7, score-0.328]
2 We present sharp rates of convergence (with respect to additive regret) for both the full information setting (where the cost function is revealed at the end of each round) and the bandit setting (where only the scalar cost incurred is revealed). [sent-8, score-0.584]
3 In particular, this paper is concerned with the price of bandit information, by which we mean the ratio of the best achievable regret in the bandit setting to that in the full-information setting. [sent-9, score-1.034]
4 For the full information case, the upper bound on the regret is O∗(√(nT)), where n is the ambient dimension and T is the time horizon. [sent-10, score-0.549]
5 For the bandit case, we present an algorithm which achieves O∗(n^{3/2}√T) regret — all previous (nontrivial) bounds here were O(poly(n)T^{2/3}) or worse. [sent-11, score-0.731]
6 It is striking that the convergence rate for the bandit setting is only a factor of n worse than in the full information case — in stark contrast to the K-arm bandit setting, where the gap in the dependence on K is exponential (√(TK) vs. [sent-12, score-0.66]
7 The bandit algorithm we present can be implemented efficiently in special cases of particular interest, such as path planning and Markov Decision Problems. [sent-15, score-0.367]
8 1 Introduction In the online linear optimization problem (as in Kalai and Vempala [2005]), at each timestep the learner chooses a decision xt from a decision space D ⊂ Rn and incurs a cost Lt · xt , where the loss vector Lt is in Rn . [sent-16, score-0.895]
9 This paper considers the case where the sequence of loss vectors L1 , . [sent-17, score-0.212]
10 The goal of the learner is to minimize her regret, the difference between the incurred loss on the sequence and the loss of the best single decision in hindsight. [sent-21, score-0.551]
11 column is the stochastic setting (where the loss vectors are drawn from some fixed underlying distribution) and the results are from Dani et al. [sent-34, score-0.241]
12 The expectation column refers to the expected regret for an arbitrary sequence of loss vectors (considered in this paper). [sent-36, score-0.646]
13 [2007]; these results also hold in the adaptive adversary setting, where the loss vectors could change in response to the learner’s previous decisions. [sent-38, score-0.233]
14 This paper focuses on the fundamental regrets achievable for the online linear optimization problem in both the full and partial information feedback settings, as functions of both the dimensionality n and the time horizon T . [sent-42, score-0.401]
15 In particular, this paper is concerned with what might be termed the price of bandit information — how much worse the regret is in the partial information case as compared to the full information case. [sent-43, score-0.945]
16 In the K-arm case (where D is the set of K choices), much work has gone into obtaining sharp regret bounds. [sent-44, score-0.489]
17 For the full information case, the exponential weights algorithm, Hedge, of Freund and Schapire [1997] provides the regret listed. [sent-46, score-0.511]
18 For the partial information case, there is a long history of sharp regret bounds in various settings (particularly in statistical settings where i. [sent-47, score-0.656]
19 [1998] provides the regret listed in Table 1 for the partial information setting. [sent-51, score-0.509]
20 There are a number of issues that we must address in obtaining sharp convergence for the online linear optimization problem. [sent-53, score-0.167]
21 The first issue to address is understanding the natural quantities in which to state upper and lower bounds. [sent-54, score-0.138]
22 It is natural to consider the case where the loss is uniformly bounded (say in [0, 1]). [sent-55, score-0.136]
23 For the full information case, all previous bounds (see, e. [sent-57, score-0.148]
24 , Kalai and Vempala [2005]) also have dependencies on the diameter of the decision and cost spaces. [sent-59, score-0.14]
25 It turns out that these are extraneous quantities — with the bounded loss assumption, one need not explicitly consider diameters of the decision and cost spaces. [sent-60, score-0.318]
26 Hence, even in the full information case, we need a new argument to obtain a sharp upper bound stated only in terms of n and T (and we do this via a relatively straightforward appeal to Hedge). [sent-61, score-0.371]
27 The second (and more technically demanding) issue is to obtain a sharp bound for the partial information case. [sent-62, score-0.211]
28 Here, for the K-arm bandit case, the regret is O∗(√(KT)). [sent-63, score-0.66]
29 Trivially, we can appeal to this result in the linear optimization case to obtain a √(|D|T) regret by setting K to be the size of D. [sent-64, score-0.504]
30 However, the regret could have a very poor n dependence, as |D| could be exponential in n (or worse). [sent-65, score-0.434]
31 In contrast, note that in the full information case, we could appeal to the K-arm case to obtain O(√(T log |D|)) regret, which in many cases is acceptable (such as when D is exponential in n). [sent-66, score-0.13]
32 The goal here is to provide a regret that is O∗(poly(n)√T). [sent-71, score-0.413]
33 The current paper provides the first O∗(poly(n)√T) regret for the general online linear optimization problem with scalar feedback — in particular, our algorithm has an expected regret that is O∗(n^{3/2}√T). [sent-77, score-1.036]
34 This paper provides lower bounds for both the full and partial information cases. [sent-79, score-0.249]
35 One striking result is that the price of bandit information is relatively small — the upper bound is only a factor of n worse than in the full information case. [sent-83, score-0.513]
36 In fact, the lower bounds suggest the partial feedback case is only worse by a factor of √n. [sent-84, score-0.269]
37 As we believe that the lower bounds are sharp, we conjecture that the price of bandit information is only √n. [sent-86, score-0.408]
38 case (where the linear loss functions are sampled from a fixed, time-invariant distribution) — there, we provided an upper bound on the regret of only O∗(n√T). [sent-91, score-0.629]
39 In much of the previous work in the literature (for both the full and partial information cases), the algorithms can be implemented efficiently given access to a certain optimization oracle. [sent-98, score-0.237]
40 Then in Section 3 we present upper bounds for both the full information and bandit settings. [sent-106, score-0.436]
41 2 Preliminaries Let D ⊂ Rn denote the decision space. [sent-109, score-0.117]
42 The learner plays the following T -round game against an oblivious adversary. [sent-110, score-0.12]
43 We assume that the loss vectors are admissible, meaning they satisfy the boundedness property that for each t and for all x ∈ D, 0 ≤ Lt · x = Lt† x ≤ 1. [sent-115, score-0.189]
44 On each round t, the learner must choose a decision xt in D, which results in a loss of ℓt = Lt† xt . [sent-116, score-0.667]
45 In the full information case, Lt is revealed to the learner after time t. [sent-118, score-0.198]
46 In the partial information case, only the incurred loss ℓt (and not the vector Lt ) is revealed. [sent-119, score-0.274]
47 , xT are the decisions the learner makes in the game, then the total loss is Σ_{t=1}^T Lt† xt and the cumulative regret is defined by R = Σ_{t=1}^T Lt† xt − min_{x∈D} Σ_{t=1}^T Lt† x. [sent-123, score-0.928]
48 In other words, the learner’s loss is compared to the loss of the best single decision in hindsight. [sent-124, score-0.389]
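To make the regret definition above concrete, here is a minimal sketch (not part of the paper; the decision set, plays, and loss vectors are hypothetical stand-ins) that computes the cumulative regret of a played sequence against the best single decision in hindsight:

```python
import numpy as np

def cumulative_regret(decisions, plays, losses):
    """Regret of the played sequence versus the best fixed decision in hindsight.

    decisions: (m, n) array whose rows are the candidate decisions in D
    plays:     (T, n) array whose rows are the decisions x_t actually played
    losses:    (T, n) array whose rows are the loss vectors L_t
    """
    incurred = np.einsum('ti,ti->', losses, plays)        # sum_t L_t . x_t
    best_fixed = np.min(decisions @ losses.sum(axis=0))   # min_{x in D} sum_t L_t . x
    return incurred - best_fixed

# Hypothetical example with D = {0,1}^2 and two rounds.
D = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
L = np.array([[0.2, 0.7], [0.9, 0.1]])
X = np.array([[1.0, 0.0], [1.0, 0.0]])    # the learner played e_1 twice
print(cumulative_regret(D, X, L))          # 1.1, since the all-zeros decision has total loss 0
```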
49 The goal of the learner is to make a sequence of decisions that guarantees low regret. [sent-125, score-0.12]
50 For the partial information case, our upper bounds on the regret are only statements that hold in expectation (with respect to the learner’s randomness). [sent-126, score-0.621]
51 This paper also assumes the learner has access to a barycentric spanner (as defined by Awerbuch and Kleinberg [2004]) of the decision region — such a spanner is useful for exploration. [sent-128, score-0.511]
52 This is a subset of n linearly independent vectors of the decision space, such that every vector in the decision space can be expressed as a linear combination of elements of the spanner with coefficients in [−1, 1]. [sent-129, score-0.399]
53 Furthermore, an almost barycentric spanner (where the coefficients are in [−2, 2]) can be found efficiently (with certain oracle access). [sent-131, score-0.19]
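As a small illustration of the spanner property (a sketch under hypothetical inputs, not the construction of Awerbuch and Kleinberg), the code below expresses each point of a decision set in the coordinates of a candidate spanner and checks that the coefficients stay in [−1, 1]:

```python
import numpy as np

def is_barycentric_spanner(spanner, decision_set, bound=1.0, tol=1e-9):
    """True if every point of the decision set is a combination of the spanner
    vectors (rows of `spanner`, assumed linearly independent) with coefficients
    in [-bound, bound]."""
    B = np.asarray(spanner, dtype=float).T          # columns are the spanner vectors
    for x in decision_set:
        coeffs = np.linalg.solve(B, x)              # solve x = B @ coeffs
        if np.max(np.abs(coeffs)) > bound + tol:
            return False
    return True

# Hypothetical example: D = {e_1, e_2, e_1 + e_2}, spanner = {e_1, e_2}.
D = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
spanner = D[:2]
print(is_barycentric_spanner(spanner, D))           # True: (1,1) = 1*e_1 + 1*e_2
```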
54 In view of these remarks, we assume without loss of generality, that D contains the standard basis vectors e1 . [sent-132, score-0.189]
55 Then there is a set D′ ⊂ D of size at most (4nT)^{n/2} such that for every sequence of admissible loss vectors, the optimal loss for D′ is within an additive √(nT) of the optimal loss for D. [sent-146, score-0.474]
56 Since an admissible loss vector has norm at most √n, summing over the T rounds of the game, we see that the optimal loss for D′ is within an additive √(nT) of the optimal loss for D. [sent-151, score-0.472]
57 For implementation purposes, it may be impractical to store the decision set (or the covering net of the decision set) explicitly as a list of points. [sent-152, score-0.234]
58 Furthermore, in many cases of interest the full decision set is finite and exponential in n, so we can directly work with D (rather than a cover of D). [sent-154, score-0.194]
59 1 With Full Information In the full information setting, the algorithm Hedge of Freund and Schapire [1997] guarantees a regret of at most O(√(T log |D|)). [sent-157, score-0.49]
60 Since we may modify D so that log |D| is O(n log n log T ), this gives us regret O∗(√(nT)). [sent-158, score-0.413]
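For concreteness, a minimal exponential-weights (Hedge-style) sketch over a finite decision set is shown below; the learning rate and the loss sequence are illustrative assumptions rather than the paper's exact tuning:

```python
import numpy as np

def hedge(decision_set, loss_vectors, eta, seed=0):
    """Exponential weights over a finite D; assumes the losses L_t . x lie in [0, 1]."""
    D = np.asarray(decision_set, dtype=float)
    weights = np.ones(len(D))
    rng = np.random.default_rng(seed)
    total_loss = 0.0
    for L in loss_vectors:
        p = weights / weights.sum()
        x = D[rng.choice(len(D), p=p)]            # sample a decision from p_t
        total_loss += float(L @ x)
        weights *= np.exp(-eta * (D @ L))         # multiplicative update for every x in D
    return total_loss

# Illustrative run on D = {0,1}^2 with admissible random losses.
T = 1000
D = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
losses = np.random.default_rng(1).uniform(0, 0.5, size=(T, 2))   # ensures L_t . x <= 1 on D
print(hedge(D, losses, eta=np.sqrt(np.log(len(D)) / T)))
```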
61 Note that we are only concerned with the regret here. [sent-159, score-0.436]
62 However, it appears that their regret is O(n√T) rather than O(√(nT)) — their regret bounds are stated in terms of diameters of the decision and cost spaces, but we can bound these in terms of n, which leads to the O(n√T) regret for their algorithm. [sent-163, score-1.557]
63 1) that achieves low expected regret for the setting where only the observed loss, ℓt = Lt · xt , is received as feedback. [sent-166, score-0.598]
64 On each round t, L̂t is an unbiased estimator for the true loss vector Lt . [sent-175, score-0.171]
65 L̂t = ℓt Ct^{-1} xt = (Lt · xt) Ct^{-1} xt = Ct^{-1} xt (xt† Lt). [sent-177, score-0.564]
66 Therefore E[L̂t] = E[Ct^{-1} xt (xt† Lt)] = Ct^{-1} E[xt xt†] Lt = Ct^{-1} Ct Lt = Lt , where all the expectations are over the random choice of xt drawn from p̂t . [sent-178, score-0.537]
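This construction admits a quick numerical sanity check. The sketch below (with a hypothetical decision set, sampling distribution, and loss vector) forms the one-point estimate from scalar feedback only and verifies by Monte Carlo that its mean approaches the true Lt:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical decision set, sampling distribution, and true loss vector.
D = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
p = np.array([0.3, 0.3, 0.4])
L_true = np.array([0.2, 0.5])

C = sum(pi * np.outer(x, x) for pi, x in zip(p, D))   # C_t = E[x x^T] under p
C_inv = np.linalg.inv(C)

total = np.zeros(2)
trials = 100_000
for _ in range(trials):
    x = D[rng.choice(len(D), p=p)]
    ell = float(L_true @ x)                            # only the scalar loss is observed
    total += ell * (C_inv @ x)                         # L_hat = ell * C^{-1} x
print(total / trials)                                  # close to L_true = [0.2, 0.5]
```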
67 Note that if |D| is exponential in the dimension n then in general, maintaining and sampling from the distributions pt and p̂t is very expensive in terms of running time. [sent-185, score-0.531]
68 , LT of admissible loss vectors, let R denote the regret of Algorithm 3. [sent-195, score-0.592]
69 Then ER ≤ √(ln |D|) · √(nT) + 2n^{3/2}√T. As before, since we may replace D with a set of size O((nT)^{n/2}) for an additional regret of only √(nT), the regret is O∗(n^{3/2}√T). [sent-197, score-0.879]
70 We start by providing the following bound on the sizes of the estimated loss vectors used by Algorithm 3. [sent-203, score-0.228]
71 For each x ∈ D and 1 ≤ t ≤ T , the estimated loss vector L̂t satisfies |L̂t · x| ≤ n²/γ. Proof. [sent-207, score-0.136]
72 Since Ct := E_{p̂t}[xx†] and λi = vi† Ct vi , we have λi = vi† E_{p̂t}[xx†] vi = Σ_{x∈D} p̂t(x)(x · vi)²  (1), and so λi = Σ_{x∈D} p̂t(x)(x · vi)² ≥ Σ_{x∈spanner} p̂t(x)(x · vi)² ≥ (γ/n) Σ_{j=1}^n (ej · vi)² = (γ/n)‖vi‖² = γ/n. [sent-216, score-1.555]
73 It follows that the eigenvalues λ1^{-1}, . . . , λn^{-1} of Ct^{-1} are each at most n/γ. [sent-219, score-0.17]
74 Hence, for each x ∈ D, |L̂t · x| = |ℓt Ct^{-1} xt · x| ≤ (n/γ)|ℓt| ‖xt‖ ‖x‖ ≤ n²/γ, where we have used the upper bound on the eigenvalues and the upper bound of √n on the norms for x ∈ D. [sent-220, score-0.326]
75 The following proposition is Theorem 3. [sent-221, score-0.141]
76 [1998]) For every x∗ ∈ D, the sequence of estimated loss vectors L̂1 , . [sent-229, score-0.212]
77 pT satisfy Σ_{t=1}^T Σ_{x∈D} pt(x) L̂t · x ≤ Σ_{t=1}^T L̂t · x∗ + ln |D|/η + (ΦM(η)/η) Σ_{t=1}^T Σ_{x∈D} pt(x)(L̂t · x)², where M = n²/γ is an upper bound on |L̂t · x|. [sent-235, score-0.643]
78 For each x ∈ D and 1 ≤ t ≤ T , E_{xt∼p̂t}[(L̂t · x)²] ≤ x† Ct^{-1} x. Proof. [sent-239, score-0.141]
79 Using that E[(L̂t · x)²] = x† E[L̂t L̂t†] x, we have x† E[L̂t L̂t†] x = x† E[ℓt² Ct^{-1} xt xt† Ct^{-1}] x ≤ x† Ct^{-1} E[xt xt†] Ct^{-1} x = x† Ct^{-1} x. Lemma 3. [sent-240, score-0.423]
80 For each 1 ≤ t ≤ T , Σ_{x∈D} p̂t(x) x† Ct^{-1} x = n. Proof. [sent-242, score-0.255]
81 Using Equation 1, it follows that Σ_{x∈D} p̂t(x) x† Ct^{-1} x = Σ_{x∈D} p̂t(x) Σ_{i=1}^n λi^{-1}(x · vi)² = Σ_{i=1}^n λi^{-1} Σ_{x∈D} p̂t(x)(x · vi)² = Σ_{i=1}^n 1 = n. We are now ready to complete the proof of Theorem 3. [sent-245, score-0.956]
82 With the above bounds (including the bound on ΦM(η)), we have E[Σ_{t=1}^T Σ_{x∈D} pt(x) L̂t · x] ≤ Σ_{t=1}^T Lt · x∗ + √(ln |D|) · √(nT) + 2n^{3/2}√T. The proof is completed by noting that E[Σ_{t,x} pt(x) L̂t · x] = E[Σ_{t,x} pt(x) E[L̂t | Ht] · x] = E[Σ_{t,x} pt(x) Lt · x] = E[Σ_t Lt · xt], which is the expected total loss of the algorithm. [sent-253, score-1.392]
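The lemma used in this argument, Σ_{x} p̂t(x) x†Ct^{-1}x = n, is essentially the trace identity tr(Ct^{-1} E[xx†]) = tr(I) = n; a small numerical sketch (with a hypothetical distribution over a spanning decision set) confirms it:

```python
import numpy as np

# Hypothetical distribution over a decision set that spans R^3.
D = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.], [1., 1., 1.]])
p = np.array([0.25, 0.25, 0.25, 0.25])

C = sum(pi * np.outer(x, x) for pi, x in zip(p, D))       # C = E_p[x x^T]
C_inv = np.linalg.inv(C)
value = sum(pi * x @ C_inv @ x for pi, x in zip(p, D))    # sum_x p(x) x^T C^{-1} x
print(value, np.trace(C_inv @ C))                          # both equal n = 3
```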
83 , n} and 0 < ε < 1, we define a random loss vector L as follows. [sent-266, score-0.136]
84 Suppose the decision set D is the unit hypercube {−1, 1}n . [sent-275, score-0.145]
85 according to DS,√(n/T) , the expected regret is Ω(√(nT)). [sent-285, score-0.434]
86 Clearly, for each S and ε, the optimal decision vector for loss vectors sampled i. [sent-287, score-0.306]
87 , xn ) where for each i, the sign of xi is the same as the minority of past occurrences of loss vectors ±ei (in case of a tie, the value of xi doesn’t matter). [sent-298, score-0.217]
88 Note that at every time step when the empirical minority incorrectly predicts the bias for coordinate i, the optimal algorithm incurs expected regret Ω(ε/n). [sent-299, score-0.52]
89 Summing over the n arms, the overall expected regret is thus at least Ω(n(ε/n) min{T, n/ε²}) = Ω(min{εT, n/ε}). [sent-302, score-0.434]
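Balancing the two terms in this Ω(min{εT, n/ε}) bound makes the parameter choice ε = √(n/T) in the theorem statement transparent; as a worked step:

```latex
\varepsilon T = n/\varepsilon
\;\Longrightarrow\;
\varepsilon = \sqrt{n/T},
\qquad\text{so}\qquad
\min\{\varepsilon T,\, n/\varepsilon\} = \sqrt{n/T}\cdot T = \sqrt{nT},
\quad\text{giving the } \Omega(\sqrt{nT}) \text{ lower bound.}
```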
90 2 With Bandit Information Next we prove that the same decision set {0, 1}n and family of distributions DS,ε can be used to establish an Ω(n√T) lower bound in the bandit setting. [sent-305, score-0.429]
91 Suppose the decision set D is the unit hypercube {0, 1}n . [sent-308, score-0.145]
92 For any bandit linear optimization algorithm A, and for any positive integer T , there exists S ⊆ {1, . [sent-309, score-0.283]
93 according to DS,n/√T , the expected regret is Ω(n√T). [sent-318, score-0.434]
94 Again, for each S and ε, the optimal decision vector for loss vectors sampled i. [sent-320, score-0.306]
95 Thus we have shown that the expected contribution of coordinate i to the total regret is Ω(min{T, (n/ε)2 }ε/n). [sent-338, score-0.46]
96 Summing over the n arms gives an overall expected regret of Ω(min{εT, n²/ε}). [sent-339, score-0.456]
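The same balancing argument explains the choice ε = n/√T used above for the bandit lower bound:

```latex
\varepsilon T = n^2/\varepsilon
\;\Longrightarrow\;
\varepsilon = n/\sqrt{T},
\qquad\text{so}\qquad
\min\{\varepsilon T,\, n^2/\varepsilon\} = \frac{n}{\sqrt{T}}\cdot T = n\sqrt{T},
\quad\text{giving the } \Omega(n\sqrt{T}) \text{ lower bound.}
```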
97 Gambling in a rigged casino: the adversarial multiarmed bandit problem. [sent-347, score-0.267]
98 High probability regret bounds for online optimization (working title). [sent-367, score-0.575]
99 Robbing the bandit: Less regret in online geometric optimization against an adaptive adversary. [sent-379, score-0.553]
100 Online geometric optimization in the bandit setting against an adaptive adversary. [sent-419, score-0.355]
wordName wordTfidf (topN-words)
[('lt', 0.569), ('regret', 0.413), ('pt', 0.255), ('bandit', 0.247), ('ct', 0.237), ('nt', 0.161), ('xt', 0.141), ('loss', 0.136), ('dani', 0.128), ('hedge', 0.128), ('decision', 0.117), ('spanner', 0.112), ('learner', 0.097), ('awerbuch', 0.096), ('vi', 0.085), ('vempala', 0.08), ('sharp', 0.076), ('partial', 0.075), ('feedback', 0.072), ('bounds', 0.071), ('kleinberg', 0.07), ('hayes', 0.07), ('auer', 0.069), ('kalai', 0.067), ('ej', 0.064), ('full', 0.056), ('online', 0.055), ('poly', 0.053), ('ln', 0.053), ('vectors', 0.053), ('planning', 0.05), ('barycentric', 0.048), ('gyorgy', 0.048), ('regrets', 0.048), ('varsha', 0.048), ('freund', 0.047), ('path', 0.046), ('price', 0.043), ('admissible', 0.043), ('diameters', 0.042), ('incurred', 0.042), ('upper', 0.041), ('chicago', 0.041), ('schapire', 0.041), ('losses', 0.039), ('bound', 0.039), ('achievable', 0.038), ('ds', 0.037), ('optimization', 0.036), ('round', 0.035), ('routing', 0.034), ('rn', 0.033), ('lemma', 0.032), ('extant', 0.032), ('incurs', 0.032), ('sham', 0.032), ('takimoto', 0.032), ('technological', 0.032), ('appeal', 0.032), ('il', 0.031), ('oracle', 0.03), ('et', 0.029), ('geometric', 0.029), ('hypercube', 0.028), ('toyota', 0.028), ('yes', 0.028), ('issn', 0.028), ('minority', 0.028), ('mcmahan', 0.028), ('ept', 0.028), ('implementations', 0.028), ('scalar', 0.026), ('lower', 0.026), ('stated', 0.026), ('coordinate', 0.026), ('algo', 0.026), ('lai', 0.026), ('access', 0.025), ('xx', 0.025), ('worse', 0.025), ('eigenvalues', 0.025), ('implemented', 0.024), ('revealed', 0.024), ('adversary', 0.024), ('ei', 0.023), ('game', 0.023), ('setting', 0.023), ('cost', 0.023), ('sequence', 0.023), ('concerned', 0.023), ('kakade', 0.022), ('arms', 0.022), ('expected', 0.021), ('proof', 0.021), ('summing', 0.021), ('information', 0.021), ('exponential', 0.021), ('adversarial', 0.02), ('striking', 0.02), ('adaptive', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 199 nips-2007-The Price of Bandit Information for Online Optimization
Author: Varsha Dani, Sham M. Kakade, Thomas P. Hayes
Abstract: In the online linear optimization problem, a learner must choose, in each round, a decision from a set D ⊂ Rn in order to minimize an (unknown and changing) linear cost function. We present sharp rates of convergence (with respect to additive regret) for both the full information setting (where the cost function is revealed at the end of each round) and the bandit setting (where only the scalar cost incurred is revealed). In particular, this paper is concerned with the price of bandit information, by which we mean the ratio of the best achievable regret in the bandit setting to that in the full-information setting. For the full information case, the upper bound on the regret is O∗(√(nT)), where n is the ambient dimension and T is the time horizon. For the bandit case, we present an algorithm which achieves O∗(n^{3/2}√T) regret — all previous (nontrivial) bounds here were O(poly(n)T^{2/3}) or worse. It is striking that the convergence rate for the bandit setting is only a factor of n worse than in the full information case — in stark contrast to the K-arm bandit setting, where the gap in the dependence on K is exponential (√(TK) vs. √(T log K)). We also present lower bounds showing that this gap is at least √n, which we conjecture to be the correct order. The bandit algorithm we present can be implemented efficiently in special cases of particular interest, such as path planning and Markov Decision Problems. 1
2 0.31088647 54 nips-2007-Computational Equivalence of Fixed Points and No Regret Algorithms, and Convergence to Equilibria
Author: Elad Hazan, Satyen Kale
Abstract: We study the relation between notions of game-theoretic equilibria which are based on stability under a set of deviations, and empirical equilibria which are reached by rational players. Rational players are modeled by players using no regret algorithms, which guarantee that their payoff in the long run is close to the maximum they could hope to achieve by consistently deviating from the algorithm’s suggested action. We show that for a given set of deviations over the strategy set of a player, it is possible to efficiently approximate fixed points of a given deviation if and only if there exist efficient no regret algorithms resistant to the deviations. Further, we show that if all players use a no regret algorithm, then the empirical distribution of their plays converges to an equilibrium. 1
3 0.28367138 194 nips-2007-The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information
Author: John Langford, Tong Zhang
Abstract: We present Epoch-Greedy, an algorithm for contextual multi-armed bandits (also known as bandits with side information). Epoch-Greedy has the following properties: 1. No knowledge of a time horizon T is necessary. 2. The regret incurred by Epoch-Greedy is controlled by a sample complexity bound for a hypothesis class. 3. The regret scales as O(T^{2/3} S^{1/3}) or better (sometimes, much better). Here S is the complexity term in a sample complexity bound for standard supervised learning. 1
4 0.2708852 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs
Author: Ambuj Tewari, Peter L. Bartlett
Abstract: We present an algorithm called Optimistic Linear Programming (OLP) for learning to optimize average reward in an irreducible but otherwise unknown Markov decision process (MDP). OLP uses its experience so far to estimate the MDP. It chooses actions by optimistically maximizing estimated future rewards over a set of next-state transition probabilities that are close to the estimates, a computation that corresponds to solving linear programs. We show that the total expected reward obtained by OLP up to time T is within C(P ) log T of the reward obtained by the optimal policy, where C(P ) is an explicit, MDP-dependent constant. OLP is closely related to an algorithm proposed by Burnetas and Katehakis with four key differences: OLP is simpler, it does not require knowledge of the supports of transition probabilities, the proof of the regret bound is simpler, but our regret bound is a constant factor larger than the regret of their algorithm. OLP is also similar in flavor to an algorithm recently proposed by Auer and Ortner. But OLP is simpler and its regret bound has a better dependence on the size of the MDP. 1
5 0.25476146 21 nips-2007-Adaptive Online Gradient Descent
Author: Elad Hazan, Alexander Rakhlin, Peter L. Bartlett
Abstract: We study the rates of growth of the regret in online convex optimization. First, we show that a simple extension of the algorithm of Hazan et al eliminates the need for a priori knowledge of the lower bound on the second derivatives of the observed functions. We then provide an algorithm, Adaptive Online Gradient Descent, which interpolates between the results of Zinkevich for linear functions and of Hazan et al for strongly convex functions, achieving intermediate rates between √T and log T . Furthermore, we show strong optimality of the algorithm. Finally, we provide an extension of our results to general norms. 1
6 0.23194358 176 nips-2007-Sequential Hypothesis Testing under Stochastic Deadlines
7 0.21389547 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs
8 0.20130266 165 nips-2007-Regret Minimization in Games with Incomplete Information
9 0.13515013 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding
10 0.098126628 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
11 0.0883549 69 nips-2007-Discriminative Batch Mode Active Learning
12 0.087325618 206 nips-2007-Topmoumoute Online Natural Gradient Algorithm
13 0.077085517 90 nips-2007-FilterBoost: Regression and Classification on Large Datasets
14 0.07673642 70 nips-2007-Discriminative K-means for Clustering
15 0.076709047 48 nips-2007-Collective Inference on Markov Models for Modeling Bird Migration
16 0.075541534 146 nips-2007-On higher-order perceptron algorithms
17 0.070807241 12 nips-2007-A Spectral Regularization Framework for Multi-Task Structure Learning
18 0.05926403 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs
19 0.050383076 208 nips-2007-TrueSkill Through Time: Revisiting the History of Chess
20 0.050139405 213 nips-2007-Variational Inference for Diffusion Processes
topicId topicWeight
[(0, -0.188), (1, -0.285), (2, 0.042), (3, 0.112), (4, 0.46), (5, -0.092), (6, -0.095), (7, 0.039), (8, 0.005), (9, -0.051), (10, 0.053), (11, 0.132), (12, -0.057), (13, 0.049), (14, -0.053), (15, -0.003), (16, 0.042), (17, -0.143), (18, 0.067), (19, -0.047), (20, -0.105), (21, -0.101), (22, 0.027), (23, -0.13), (24, 0.019), (25, -0.117), (26, -0.113), (27, -0.135), (28, -0.058), (29, 0.128), (30, -0.025), (31, -0.061), (32, 0.145), (33, -0.006), (34, 0.044), (35, -0.042), (36, 0.084), (37, 0.008), (38, 0.043), (39, 0.095), (40, 0.046), (41, 0.047), (42, -0.017), (43, -0.03), (44, -0.021), (45, -0.061), (46, 0.041), (47, -0.01), (48, 0.014), (49, -0.044)]
simIndex simValue paperId paperTitle
same-paper 1 0.96826613 199 nips-2007-The Price of Bandit Information for Online Optimization
Author: Varsha Dani, Sham M. Kakade, Thomas P. Hayes
Abstract: In the online linear optimization problem, a learner must choose, in each round, a decision from a set D ⊂ Rn in order to minimize an (unknown and changing) linear cost function. We present sharp rates of convergence (with respect to additive regret) for both the full information setting (where the cost function is revealed at the end of each round) and the bandit setting (where only the scalar cost incurred is revealed). In particular, this paper is concerned with the price of bandit information, by which we mean the ratio of the best achievable regret in the bandit setting to that in the full-information setting. For the full information case, the upper bound on the regret is O∗(√(nT)), where n is the ambient dimension and T is the time horizon. For the bandit case, we present an algorithm which achieves O∗(n^{3/2}√T) regret — all previous (nontrivial) bounds here were O(poly(n)T^{2/3}) or worse. It is striking that the convergence rate for the bandit setting is only a factor of n worse than in the full information case — in stark contrast to the K-arm bandit setting, where the gap in the dependence on K is exponential (√(TK) vs. √(T log K)). We also present lower bounds showing that this gap is at least √n, which we conjecture to be the correct order. The bandit algorithm we present can be implemented efficiently in special cases of particular interest, such as path planning and Markov Decision Problems. 1
2 0.75385916 194 nips-2007-The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information
Author: John Langford, Tong Zhang
Abstract: We present Epoch-Greedy, an algorithm for contextual multi-armed bandits (also known as bandits with side information). Epoch-Greedy has the following properties: 1. No knowledge of a time horizon T is necessary. 2. The regret incurred by Epoch-Greedy is controlled by a sample complexity bound for a hypothesis class. 3. The regret scales as O(T^{2/3} S^{1/3}) or better (sometimes, much better). Here S is the complexity term in a sample complexity bound for standard supervised learning. 1
3 0.68361962 54 nips-2007-Computational Equivalence of Fixed Points and No Regret Algorithms, and Convergence to Equilibria
Author: Elad Hazan, Satyen Kale
Abstract: We study the relation between notions of game-theoretic equilibria which are based on stability under a set of deviations, and empirical equilibria which are reached by rational players. Rational players are modeled by players using no regret algorithms, which guarantee that their payoff in the long run is close to the maximum they could hope to achieve by consistently deviating from the algorithm’s suggested action. We show that for a given set of deviations over the strategy set of a player, it is possible to efficiently approximate fixed points of a given deviation if and only if there exist efficient no regret algorithms resistant to the deviations. Further, we show that if all players use a no regret algorithm, then the empirical distribution of their plays converges to an equilibrium. 1
4 0.6558044 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs
Author: Ambuj Tewari, Peter L. Bartlett
Abstract: We present an algorithm called Optimistic Linear Programming (OLP) for learning to optimize average reward in an irreducible but otherwise unknown Markov decision process (MDP). OLP uses its experience so far to estimate the MDP. It chooses actions by optimistically maximizing estimated future rewards over a set of next-state transition probabilities that are close to the estimates, a computation that corresponds to solving linear programs. We show that the total expected reward obtained by OLP up to time T is within C(P ) log T of the reward obtained by the optimal policy, where C(P ) is an explicit, MDP-dependent constant. OLP is closely related to an algorithm proposed by Burnetas and Katehakis with four key differences: OLP is simpler, it does not require knowledge of the supports of transition probabilities, the proof of the regret bound is simpler, but our regret bound is a constant factor larger than the regret of their algorithm. OLP is also similar in flavor to an algorithm recently proposed by Auer and Ortner. But OLP is simpler and its regret bound has a better dependence on the size of the MDP. 1
5 0.63111866 176 nips-2007-Sequential Hypothesis Testing under Stochastic Deadlines
Author: Peter Frazier, Angela J. Yu
Abstract: Most models of decision-making in neuroscience assume an infinite horizon, which yields an optimal solution that integrates evidence up to a fixed decision threshold; however, under most experimental as well as naturalistic behavioral settings, the decision has to be made before some finite deadline, which is often experienced as a stochastic quantity, either due to variable external constraints or internal timing uncertainty. In this work, we formulate this problem as sequential hypothesis testing under a stochastic horizon. We use dynamic programming tools to show that, for a large class of deadline distributions, the Bayes-optimal solution requires integrating evidence up to a threshold that declines monotonically over time. We use numerical simulations to illustrate the optimal policy in the special cases of a fixed deadline and one that is drawn from a gamma distribution.
6 0.58516514 21 nips-2007-Adaptive Online Gradient Descent
7 0.37972891 165 nips-2007-Regret Minimization in Games with Incomplete Information
8 0.32715443 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs
9 0.29781348 48 nips-2007-Collective Inference on Markov Models for Modeling Bird Migration
10 0.29402444 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding
11 0.27397302 90 nips-2007-FilterBoost: Regression and Classification on Large Datasets
12 0.27152887 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
13 0.24111716 70 nips-2007-Discriminative K-means for Clustering
14 0.23389073 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs
15 0.21252207 206 nips-2007-Topmoumoute Online Natural Gradient Algorithm
16 0.21141841 12 nips-2007-A Spectral Regularization Framework for Multi-Task Structure Learning
17 0.20132846 15 nips-2007-A general agnostic active learning algorithm
18 0.2001155 41 nips-2007-COFI RANK - Maximum Margin Matrix Factorization for Collaborative Ranking
19 0.19973108 62 nips-2007-Convex Learning with Invariances
20 0.19035982 214 nips-2007-Variational inference for Markov jump processes
topicId topicWeight
[(5, 0.032), (13, 0.023), (16, 0.015), (18, 0.012), (19, 0.019), (21, 0.045), (31, 0.015), (34, 0.015), (35, 0.027), (46, 0.426), (47, 0.069), (49, 0.024), (83, 0.131), (90, 0.042)]
simIndex simValue paperId paperTitle
1 0.76395047 81 nips-2007-Estimating disparity with confidence from energy neurons
Author: Eric K. Tsang, Bertram E. Shi
Abstract: The peak location in a population of phase-tuned neurons has been shown to be a more reliable estimator for disparity than the peak location in a population of position-tuned neurons. Unfortunately, the disparity range covered by a phasetuned population is limited by phase wraparound. Thus, a single population cannot cover the large range of disparities encountered in natural scenes unless the scale of the receptive fields is chosen to be very large, which results in very low resolution depth estimates. Here we describe a biologically plausible measure of the confidence that the stimulus disparity is inside the range covered by a population of phase-tuned neurons. Based upon this confidence measure, we propose an algorithm for disparity estimation that uses many populations of high-resolution phase-tuned neurons that are biased to different disparity ranges via position shifts between the left and right eye receptive fields. The population with the highest confidence is used to estimate the stimulus disparity. We show that this algorithm outperforms a previously proposed coarse-to-fine algorithm for disparity estimation, which uses disparity estimates from coarse scales to select the populations used at finer scales and can effectively detect occlusions.
same-paper 2 0.73797888 199 nips-2007-The Price of Bandit Information for Online Optimization
Author: Varsha Dani, Sham M. Kakade, Thomas P. Hayes
Abstract: In the online linear optimization problem, a learner must choose, in each round, a decision from a set D ⊂ Rn in order to minimize an (unknown and changing) linear cost function. We present sharp rates of convergence (with respect to additive regret) for both the full information setting (where the cost function is revealed at the end of each round) and the bandit setting (where only the scalar cost incurred is revealed). In particular, this paper is concerned with the price of bandit information, by which we mean the ratio of the best achievable regret in the bandit setting to that in the full-information setting. For the full information case, the upper bound on the regret is O∗(√(nT)), where n is the ambient dimension and T is the time horizon. For the bandit case, we present an algorithm which achieves O∗(n^{3/2}√T) regret — all previous (nontrivial) bounds here were O(poly(n)T^{2/3}) or worse. It is striking that the convergence rate for the bandit setting is only a factor of n worse than in the full information case — in stark contrast to the K-arm bandit setting, where the gap in the dependence on K is exponential (√(TK) vs. √(T log K)). We also present lower bounds showing that this gap is at least √n, which we conjecture to be the correct order. The bandit algorithm we present can be implemented efficiently in special cases of particular interest, such as path planning and Markov Decision Problems. 1
3 0.72502792 108 nips-2007-Kernel Measures of Conditional Dependence
Author: Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, Bernhard Schölkopf
Abstract: We propose a new measure of conditional dependence of random variables, based on normalized cross-covariance operators on reproducing kernel Hilbert spaces. Unlike previous kernel dependence measures, the proposed criterion does not depend on the choice of kernel in the limit of infinite data, for a wide class of kernels. At the same time, it has a straightforward empirical estimate with good convergence behaviour. We discuss the theoretical properties of the measure, and demonstrate its application in experiments. 1
4 0.50627923 140 nips-2007-Neural characterization in partially observed populations of spiking neurons
Author: Jonathan W. Pillow, Peter E. Latham
Abstract: Point process encoding models provide powerful statistical methods for understanding the responses of neurons to sensory stimuli. Although these models have been successfully applied to neurons in the early sensory pathway, they have fared less well capturing the response properties of neurons in deeper brain areas, owing in part to the fact that they do not take into account multiple stages of processing. Here we introduce a new twist on the point-process modeling approach: we include unobserved as well as observed spiking neurons in a joint encoding model. The resulting model exhibits richer dynamics and more highly nonlinear response properties, making it more powerful and more flexible for fitting neural data. More importantly, it allows us to estimate connectivity patterns among neurons (both observed and unobserved), and may provide insight into how networks process sensory input. We formulate the estimation procedure using variational EM and the wake-sleep algorithm, and illustrate the model’s performance using a simulated example network consisting of two coupled neurons.
5 0.48624173 189 nips-2007-Supervised Topic Models
Author: Jon D. Mcauliffe, David M. Blei
Abstract: We introduce supervised latent Dirichlet allocation (sLDA), a statistical model of labelled documents. The model accommodates a variety of response types. We derive a maximum-likelihood procedure for parameter estimation, which relies on variational approximations to handle intractable posterior expectations. Prediction problems motivate this research: we use the fitted model to predict response values for new documents. We test sLDA on two real-world problems: movie ratings predicted from reviews, and web page popularity predicted from text descriptions. We illustrate the benefits of sLDA versus modern regularized regression, as well as versus an unsupervised LDA analysis followed by a separate regression. 1
6 0.39816341 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning
7 0.39604187 90 nips-2007-FilterBoost: Regression and Classification on Large Datasets
8 0.39478964 7 nips-2007-A Kernel Statistical Test of Independence
9 0.38763985 194 nips-2007-The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information
10 0.38300323 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs
11 0.37876162 70 nips-2007-Discriminative K-means for Clustering
12 0.36360961 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)
13 0.36274308 63 nips-2007-Convex Relaxations of Latent Variable Training
14 0.36273646 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
15 0.36271632 43 nips-2007-Catching Change-points with Lasso
16 0.36246723 186 nips-2007-Statistical Analysis of Semi-Supervised Regression
17 0.36092463 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs
18 0.36080867 180 nips-2007-Sparse Feature Learning for Deep Belief Networks
19 0.36019966 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons
20 0.36009398 187 nips-2007-Structured Learning with Approximate Inference