nips nips2011 nips2011-128 knowledge-graph by maker-knowledge-mining

128 nips-2011-Improved Algorithms for Linear Stochastic Bandits


Source: pdf

Author: Yasin Abbasi-yadkori, Csaba Szepesvári, David Tax

Abstract: We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem. In particular, we show that a simple modification of Auer’s UCB algorithm (Auer, 2002) achieves with high probability constant regret. More importantly, we modify and, consequently, improve the analysis of the algorithm for the for linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), Li et al. (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. In both cases, the improvement stems from the construction of smaller confidence sets. For their construction we use a novel tail inequality for vector-valued martingales. 1

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 of Computing Science University of Alberta Abstract We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem. [sent-7, score-0.559]

2 More importantly, we modify and, consequently, improve the analysis of the algorithm for the for linear stochastic bandit problem studied by Auer (2002), Dani et al. [sent-9, score-0.494]

3 Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. [sent-12, score-0.542]

4 For their construction we use a novel tail inequality for vector-valued martingales. [sent-14, score-0.201]

5 1 Introduction Linear stochastic bandit problem is a sequential decision-making problem where in each time step we have to choose an action, and as a response we receive a stochastic reward, expected value of which is an unknown linear function of the action. [sent-15, score-0.359]

6 The goal is to collect as much reward as possible over the course of n time steps. [sent-16, score-0.134]

7 For example, the standard stochastic d-armed bandit problem, introduced by Robbins (1952) and then studied by Lai and Robbins (1985), is a special case of linear stochastic bandit problem where the set of available actions in each round is the standard orthonormal basis of Rd . [sent-20, score-0.688]

8 Another variant, studied by Auer (2002) under the name “linear reinforcement learning”, and later in the context of web advertisement by Li et al. [sent-21, score-0.126]

9 Another variant dubbed “sleeping bandits”, studied by Kleinberg et al. [sent-24, score-0.156]

10 In every round, the algorithm chooses an estimate from the confidence set and an action so that the predicted reward is maximized, i. [sent-37, score-0.28]

11 This is not an easy problem, because the future actions are not independent of the actions taken in the past (since the algorithm’s choices of future actions depend on the random confidence set constructed from past data). [sent-42, score-0.219]

12 The smaller confidence sets one is able to construct, the better regret bounds one obtains for the resulting algorithm, and, more importantly, the better the algorithm performs empirically. [sent-47, score-0.565]

13 With our new technique, we vastly reduce the size of the confidence sets of Dani et al. [sent-48, score-0.133]

14 Second, our confidence sets are “more empirical” in the sense that some worst-case quantities from the old bounds are replaced by empirical quantities that are always smaller, sometimes substantially. [sent-51, score-0.141]

15 As a result, our experiments show an order-of-magnitude improvement over the C ONFIDENCE BALL algorithm of Dani et al. [sent-52, score-0.122]

16 To construct our confidence sets, we prove a new martingale tail inequality. [sent-54, score-0.168]

17 The new inequality is derived using techniques from the theory of self-normalized processes (de la Pe˜ a et al. [sent-55, score-0.203]

18 n Using our confidence sets, we modify the UCB algorithm (Auer, 2002) for the d-armed bandit problem and show that with probability 1 , the regret of the modified algorithm is O(d log(1/ )/ ) where is the difference between the expected rewards of the best and the second best action. [sent-57, score-0.866]

19 In particular, note that the regret does not depend on n. [sent-58, score-0.45]

20 This seemingly contradicts the result of Lai and Robbins (1985) who showed that the expected regret of any algorithm is at least P ( i6=i⇤ 1/D(pj | pi⇤ ) o(1)) log n where pi⇤ and pi are the reward distributions of the optimal arm and arm i respectively and D is the Kullback-Leibler divergence. [sent-59, score-0.937]

21 However, our algorithm receives as an input, and thus its expected regret depends on . [sent-60, score-0.505]

22 With = 1/n our algorithm has the same expected regret bound, O((d log n)/ ), as Auer (2002) has shown for UCB. [sent-61, score-0.607]

23 For the general linear stochastic bandit problem, we improve regret of the C ONFIDENCE BALL p algorithm of Dani et al. [sent-62, score-0.866]

24 They showed that its regret is at most O(d log(n) n log(n/ )) with probability at least 1 . [sent-64, score-0.512]

25 We modify their algorithm so that it uses our new confidence p p sets and we show that its regret is p most O(d log(n) n + dn log(n/ )) which is roughly at improvement a multiplicative factor log(n). [sent-65, score-0.561]

26 (2008) prove also a problem dependent 2 regret bound. [sent-67, score-0.45]

27 Namely, they show that the regret of their algorithm is O( d log(n/ ) log2 (n)) where is the “gap” as defined in (Dani et al. [sent-68, score-0.572]

28 For our modified algorithm we prove an improved O( log(1/ ) (log(n) + d log log n)2 ) bound. [sent-70, score-0.23]

29 2 The Learning Model In each round t, the learner is given a decision set Dt ✓ Rd from which he has to choose an action Xt . [sent-81, score-0.179]

30 Subsequently he observes reward Yt = hXt , ✓⇤ i + ⌘t where ✓⇤ 2 Rd is an unknown parameter and ⌘t is a random noise satisfying E[⌘t | X1:t , ⌘1:t 1 ] = 0 and some tail-constraints, to be specified soon. [sent-82, score-0.168]

31 Pn The goal of the learner is to maximize his total reward t=1 hXt , ✓⇤ i accumulated over the course of n rounds. [sent-83, score-0.171]

32 This strategy would accumulate total t Pn reward t=1 hx⇤ , ✓⇤ i. [sent-85, score-0.134]

33 The t difference of the learner’s total reward and the total reward of the optimal strategy is called the 2 for t := 1, 2, . [sent-87, score-0.268]

34 do e (Xt , ✓t ) = argmax(x,✓)2Dt ⇥Ct Play Xt and observe reward Yt Update Ct end for 1 hx, ✓i Figure 1: OFUL ALGORITHM pseudo-regret (Audibert et al. [sent-90, score-0.23]

35 ) In what follows, for simplicity we use the word regret instead of the more precise pseudo-regret in connection to Rn . [sent-96, score-0.45]

36 The goal of the algorithm is to keep the regret Rn as low as possible. [sent-97, score-0.476]

37 2 Optimism in the Face of Uncertainty A natural and successful way to design an algorithm is the optimism in the face of uncertainty principle (OFU). [sent-111, score-0.109]

38 The algorithm chooses an optimistic D E e e estimate ✓t = argmax✓2Ct 1 (maxx2Dt hx, ✓i) and then chooses action Xt = argmaxx2Dt x, ✓t e which maximizes the reward according to the estimate ✓t . [sent-120, score-0.313]

39 Equivalently, and more compactly, the algorithm chooses the pair e (Xt , ✓t ) = argmax hx, ✓i , (x,✓)2Dt ⇥Ct 1 which jointly maximizes the reward. [sent-121, score-0.104]

40 We call the resulting algorithm the OFUL ALGORITHM for “optimism in the face of uncertainty linear bandit algorithm”. [sent-122, score-0.252]

41 3 Self-Normalized Tail Inequality for Vector-Valued Martingales Since the decision sets {Dt }1 can be arbitrary, the sequence of actions Xt 2 Dt is arbitrary as t=1 well. [sent-126, score-0.11]

42 Pt The sequence {St }1 , St = s=1 ⌘t Xt , is a martingale with respect {Ft }1 which happens to t=0 t=0 be crucial for the construction of the confidence sets for ✓⇤ . [sent-138, score-0.17]

43 The following theorem shows that with high probability the martingale stays close to zero. [sent-139, score-0.167]

44 t X > 0, with probability at least 1 , for all t 0, ✓ det(V t )1/2 det(V ) 2 kSt kV 1  2R2 log 2 Note that the deviation of the martingale kSt kV Vt 4 ⌘s Xs . [sent-147, score-0.249]

45 The b following theorem shows that ✓⇤ lies with high probability in an ellipsoid with center at ✓t . [sent-157, score-0.157]

46 Then, for any > 0, with probability at least 1 , for all t 0, ✓⇤ lies in the set 8 9 s ✓ ◆ < = 1/2 det( I) 1/2 det(V t ) b C t = ✓ 2 Rd : ✓ t ✓  R 2 log + 1/2 S . [sent-161, score-0.203]

47 : ; Vt Furthermore, if for all t in the set ( 0 Ct = 1, kXt k2  L then with probability at least 1 ✓ 2 Rd : b ✓t ✓ Vt R s 4 d log ✓ 1 + tL2 / ◆ + , for all t 1/2 S ) . [sent-162, score-0.164]

48 0, ✓⇤ lies The above bound could be compared with a similar bound of Dani et al. [sent-163, score-0.319]

49 (2008) whose bound, under identical conditions, states that (with appropriate initialization) with probability 1 , (s ✓ 2◆ ✓ 2 ◆) t 8 t b for all t large enough ✓t ✓⇤  R max 128 d log(t) log , log , (2) 3 Vt p where large enough means that t satisfies 0 < < t2 e 1/16 . [sent-164, score-0.236]

50 The restriction on t comes from the fact that t ( ) 2d(1 + 2 log(t)) is needed in the proof of the last inequality of their Theorem 5. [sent-166, score-0.104]

51 On the other hand, Rusmevichientong and Tsitsiklis (2010) proved that for any fixed t 2, for any 0 < < 1, with probability at least 1 , p p 2 b ✓t ✓⇤  2  R log t d log(t) + log(1/ ) + 1/2 S , Vt p 2 + trace(V ))/ . [sent-167, score-0.164]

52 To get a uniform bound one can use a union bound where  = 3 + 2 log((L P1 2 with t = /t2 . [sent-168, score-0.227]

53 This thus gives that for any 0 < < 1, with 6 probability at least 1 , p p b 8t 2, ✓t ✓⇤  22 R log t d log(t) + log(t2 / ) + 1/2 S , Vt This is tighter than (2), but is still lagging behind the result of Theorem 2. [sent-170, score-0.164]

54 However, one can speed up the computation by using the matrix determinant lemma, exploiting that the matrix whose determinant is needed is obtained via a rank-one update (cf. [sent-172, score-0.106]

55 5 Regret Analysis of the OFUL ALGORITHM We now give a bound on the regret of the OFUL algorithm when run with confidence sets Cn constructed in Theorem 2 in the previous section. [sent-175, score-0.605]

56 We can view this as a bound on ✓⇤ and the bound on the decision sets Dt . [sent-177, score-0.221]

57 The next theorem states a bound on the regret of the algorithm. [sent-178, score-0.592]

58 Then, with probability at least 1 , the regret of the OFUL algorithm satisfies ⇣ ⌘ p p 1/2 8n 0, Rn  4 nd log( + nL/d) S + R 2 log(1/ ) + d log(1 + nL/( d)) . [sent-182, score-0.538]

59 The regret of OFUL is significantly better compared to the regret of C ONFIDENCE BALL of Dani et al. [sent-184, score-0.996]

60 The figure also shows a version of the algorithm that has a similar regret to the algorithm with the new bound, but spends about 350 times less computation in this experiment. [sent-186, score-0.502]

61 As the next theorem shows its regret bound is essentially the same as the regret for OFUL. [sent-192, score-1.042]

62 Under the same assumptions as in Theorem 3, with probability at least 1 , for all n 0, the regret of the RARELY SWITCHING OFUL ALGORITHM satisfies s s ) r ✓ ◆( ✓ ◆ p nL nL 1 n Rn  4 (1 + C)nd log + S + R d log 1 + + 2 log + 4 d log . [sent-194, score-0.92]

63 5 3000 New bound Old bound New bound with rare switching 2500 Regret 2000 1500 1000 500 0 0 2000 4000 Time 6000 8000 10000 Figure 2: The application of the new bound to a linear bandit problem. [sent-198, score-0.624]

64 The regret of OFUL is significantly better compared to the regret of C ONFIDENCE BALL of Dani et al. [sent-200, score-0.996]

65 end for Figure 3: The RARELY SWITCHING OFUL ALGORITHM Average regret 0. [sent-214, score-0.45]

66 We fixed the number of times the algorithm is allowed to update its action in OFUL. [sent-223, score-0.113]

67 For larger values of C, the algorithm changes action less frequently, hence, will play for a longer time period. [sent-224, score-0.139]

68 The figure shows the average regret obtained during the given time periods for the different values of C. [sent-225, score-0.45]

69 Thus, we see that by increasing C, one can actually lower the average regret per time step for a given fixed computation budget. [sent-226, score-0.45]

70 (Intuitively, t is the difference between the rewards of the best and the “second best” action in the decision set Dt . [sent-232, score-0.142]

71 The regret of OFUL can be upper bounded in terms of ( ¯ n )n as follows. [sent-235, score-0.45]

72 Assume that 1 and k✓⇤ k2  S where S all n 1, the regret of the OFUL satisfies 16R2 S 2 ⇣ 64R2 S 2 L Rn  log(Ln) + (d 1) log ¯ ¯2 n + 2(d 1. [sent-237, score-0.552]

73 With probability at least 1 , for n ✓ ◆2 ⌘ d + nL2 1) log d log + 2 log(1/ ) + 2 log(1/ ) . [sent-238, score-0.266]

74 , 2008) scales like O( d log3 n), while our bound scales like O( 1 (log2 n + d log n + d2 log log n)), where = inf n ¯ n . [sent-241, score-0.398]

75 Let µi be the expected reward of action i = 1, 2, . [sent-243, score-0.25]

76 Let µ⇤ = max1id µi be the expected reward of the best arm, and let i = µ⇤ µi , i = 1, 2, . [sent-247, score-0.163]

77 We assume that if we choose action It in round t we obtain reward µIt + ⌘t . [sent-251, score-0.276]

78 Let Ni,t denote the number of times that we have played action i up to time t, and X i,t denote the average of the rewards received by action i up to time t. [sent-252, score-0.229]

79 We construct confidence intervals for the expected rewards µi based on X i,t in the following lemma. [sent-253, score-0.11]

80 (3) Using these confidence intervals, we modify the UCB algorithm of Auer et al. [sent-261, score-0.17]

81 Hence, at time t, we choose the action (4) It = argmax X i,t + ci,t . [sent-263, score-0.132]

82 This allows us to prove the following result that the regret of UCB( ) is constant. [sent-266, score-0.45]

83 Assume that the noise ⌘t is conditionally 1-sub-Gaussian, with probability at least 1 , the total regret of the UCB( ) is bounded as ◆ X ✓ 16 2d Rn  3 i+ log . [sent-269, score-0.686]

84 i: i i >0 i Lai and Robbins (1985) prove that for any suboptimal arm j, E Ni,t log t , D(pj , p⇤ ) where, p⇤ and pj are the reward density of the optimal arm and arm j respectively, and D is the KL-divergence. [sent-270, score-0.527]

85 The results are shown for a 10-armed bandit problem, where the mean value of each arm is fixed to some values in [0, 1]. [sent-272, score-0.283]

86 The regret of UCB( ) is improved with the new bound. [sent-273, score-0.45]

87 Because with probability , the regret in time t can be t, on expectation, the algorithm might have a regret of t . [sent-281, score-0.958]

88 Now if we select = 1/t, then we get O(log t) upper bound on the expected regret. [sent-282, score-0.121]

89 If one is interested in an average regret result, then, with slight modification of the proof technique one can obtain an identical result to what Auer et al. [sent-283, score-0.58]

90 Figure 5 shows the regret of UCB( ) when it uses either the confidence bound based on Hoeffding’s inequality, or the bound in (3). [sent-285, score-0.634]

91 As can be seen, the regret of UCB( ) is improved with the new bound. [sent-286, score-0.45]

92 (2009) prove similar high-probability constant regret bounds for variations of the UCB algorithm. [sent-288, score-0.502]

93 Compared to their bounds, our bound is tighter thanks to that with the new self-normalized tail inequality we can avoid one union bound. [sent-289, score-0.288]

94 7 Conclusions In this paper, we showed how a novel tail inequality for vector-valued martingales allows one to improve both the theoretical analysis and empirical performance of algorithms for various stochastic bandit problems. [sent-291, score-0.512]

95 Further, we modify and improve the analysis of the algorithm for the for linear stochastic bandit problem studied by Auer (2002), Dani et al. [sent-293, score-0.494]

96 Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement, stemming from the construction of smaller confidence sets. [sent-296, score-0.59]

97 We expect that the novel tail inequality will also be useful in a number of other situations thanks to its self-normalized form and that it holds for stopped martingales and thus can be used to derive bounds that hold uniformly in time. [sent-299, score-0.27]

98 In general, the new inequality can be used to improve deviation bounds which use a union bound (over time). [sent-300, score-0.286]

99 Just to mention a few examples, the new inequality could be used to improve the computational complexity of the HOO algorithm Bubeck et al. [sent-302, score-0.221]

100 (2008) (when it is used with a fixed , by avoiding union bounds, or the need to know the horizon, or the doubling trick) or to improve the bounds derived by Garivier and Moulines (2008) for UCB for changing environments, or the stopping rules and racing algorithms of Mnih et al. [sent-303, score-0.264]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('regret', 0.45), ('oful', 0.354), ('ucb', 0.247), ('dani', 0.218), ('dence', 0.215), ('bandit', 0.2), ('ft', 0.178), ('auer', 0.146), ('hx', 0.146), ('xt', 0.136), ('reward', 0.134), ('hxt', 0.118), ('rusmevichientong', 0.107), ('lai', 0.105), ('log', 0.102), ('con', 0.099), ('et', 0.096), ('robbins', 0.095), ('onfidence', 0.094), ('dt', 0.094), ('bound', 0.092), ('ct', 0.09), ('det', 0.089), ('vt', 0.088), ('action', 0.087), ('martingale', 0.085), ('arm', 0.083), ('tail', 0.083), ('tsitsiklis', 0.081), ('yt', 0.075), ('actions', 0.073), ('szepesv', 0.072), ('ofu', 0.071), ('inequality', 0.07), ('appendix', 0.068), ('stochastic', 0.065), ('martingales', 0.065), ('rd', 0.064), ('optimism', 0.057), ('walsh', 0.057), ('coquelin', 0.057), ('switching', 0.056), ('rewards', 0.055), ('round', 0.055), ('determinant', 0.053), ('bounds', 0.052), ('old', 0.052), ('theorem', 0.05), ('munos', 0.049), ('alberta', 0.049), ('modify', 0.048), ('construction', 0.048), ('garivier', 0.047), ('argmax', 0.045), ('rarely', 0.044), ('doubling', 0.044), ('union', 0.043), ('rn', 0.043), ('pj', 0.042), ('kst', 0.042), ('sleeping', 0.042), ('chu', 0.041), ('pe', 0.04), ('audibert', 0.039), ('lies', 0.039), ('conditionally', 0.038), ('kv', 0.038), ('bandits', 0.037), ('sets', 0.037), ('learner', 0.037), ('la', 0.037), ('recompute', 0.036), ('abe', 0.036), ('ellipsoid', 0.036), ('modi', 0.036), ('xs', 0.035), ('noise', 0.034), ('proof', 0.034), ('antos', 0.034), ('li', 0.034), ('chooses', 0.033), ('ball', 0.033), ('nl', 0.032), ('csaba', 0.032), ('mnih', 0.032), ('probability', 0.032), ('saving', 0.031), ('variant', 0.03), ('dekel', 0.03), ('least', 0.03), ('studied', 0.03), ('expected', 0.029), ('bubeck', 0.029), ('improve', 0.029), ('hoeffding', 0.028), ('kleinberg', 0.027), ('algorithm', 0.026), ('play', 0.026), ('intervals', 0.026), ('face', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits

Author: Yasin Abbasi-yadkori, Csaba Szepesvári, David Tax

Abstract: We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem. In particular, we show that a simple modification of Auer’s UCB algorithm (Auer, 2002) achieves with high probability constant regret. More importantly, we modify and, consequently, improve the analysis of the algorithm for the for linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), Li et al. (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. In both cases, the improvement stems from the construction of smaller confidence sets. For their construction we use a novel tail inequality for vector-valued martingales. 1

2 0.35031316 220 nips-2011-Prediction strategies without loss

Author: Michael Kapralov, Rina Panigrahy

Abstract: Consider a sequence of bits where we are trying to predict the next bit from the previous bits. Assume we are allowed to say ‘predict 0’ or ‘predict 1’, and our payoff is +1 if the prediction is correct and −1 otherwise. We will say that at each point in time the loss of an algorithm is the number of wrong predictions minus the number of right predictions so far. In this paper we are interested in algorithms that have essentially zero (expected) loss over any string at any point in time and yet have small regret with respect to always predicting 0 or always predicting 1. For a sequence of length T our algorithm has regret 14 T and loss √ 2 2 T e− T in expectation for all strings. We show that the tradeoff between loss and regret is optimal up to constant factors. Our techniques extend to the general setting of N experts, where the related problem of trading off regret to the best expert for regret to the ’special’ expert has been studied by Even-Dar et al. (COLT’07). We obtain essentially zero loss with respect to the special expert and optimal loss/regret tradeoff, improving upon the results of Even-Dar et al and settling the main question left open in their paper. The strong loss bounds of the algorithm have some surprising consequences. First, we obtain a parameter free algorithm for the experts problem that has optimal regret bounds with respect to k-shifting optima, i.e. bounds with respect to the optimum that is allowed to change arms multiple times. Moreover, for any window of size n the regret of our algorithm to any expert never exceeds O( n(log N + log T )), where N is the number of experts and T is the time horizon, while maintaining the essentially zero loss property. 1

3 0.29409346 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits

Author: Yevgeny Seldin, Peter Auer, John S. Shawe-taylor, Ronald Ortner, François Laviolette

Abstract: We derive an instantaneous (per-round) data-dependent regret bound for stochastic multiarmed bandits with side information (also known as contextual bandits). The p scaling of our regret bound with the number of states (contexts) N goes as N I⇢t (S; A), where I⇢t (S; A) is the mutual information between states and actions (the side information) used by the algorithm at round t. If the algorithm p uses all the side information, the regret bound scales as N ln K, where K is the number of actions (arms). However, if the side information I⇢t (S; A) is not fully used, the regret bound is significantly tighter. In the extreme case, when I⇢t (S; A) = 0, the dependence on the number of states reduces from linear to logarithmic. Our analysis allows to provide the algorithm large amount of side information, let the algorithm to decide which side information is relevant for the task, and penalize the algorithm only for the side information that it is using de facto. We also present an algorithm for multiarmed bandits with side information with O(K) computational complexity per game round. 1

4 0.29095763 245 nips-2011-Selecting the State-Representation in Reinforcement Learning

Author: Odalric-ambrym Maillard, Daniil Ryabko, Rémi Munos

Abstract: The problem of selecting the right state-representation in a reinforcement learning problem is considered. Several models (functions mapping past observations to a finite set) of the observations are given, and it is known that for at least one of these models the resulting state dynamics are indeed Markovian. Without knowing neither which of the models is the correct one, nor what are the probabilistic characteristics of the resulting MDP, it is required to obtain as much reward as the optimal policy for the correct model (or for the best of the correct models, if there are several). We propose an algorithm that achieves that, with a regret of order T 2/3 where T is the horizon time. 1

5 0.25858468 61 nips-2011-Contextual Gaussian Process Bandit Optimization

Author: Andreas Krause, Cheng S. Ong

Abstract: How should we design experiments to maximize performance of a complex system, taking into account uncontrollable environmental conditions? How should we select relevant documents (ads) to display, given information about the user? These tasks can be formalized as contextual bandit problems, where at each round, we receive context (about the experimental conditions, the query), and have to choose an action (parameters, documents). The key challenge is to trade off exploration by gathering data for estimating the mean payoff function over the context-action space, and to exploit by choosing an action deemed optimal based on the gathered data. We model the payoff function as a sample from a Gaussian process defined over the joint context-action space, and develop CGP-UCB, an intuitive upper-confidence style algorithm. We show that by mixing and matching kernels for contexts and actions, CGP-UCB can handle a variety of practical applications. We further provide generic tools for deriving regret bounds when using such composite kernel functions. Lastly, we evaluate our algorithm on two case studies, in the context of automated vaccine design and sensor management. We show that context-sensitive optimization outperforms no or naive use of context. 1

6 0.25058243 32 nips-2011-An Empirical Evaluation of Thompson Sampling

7 0.24276848 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations

8 0.23865575 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction

9 0.21821743 56 nips-2011-Committing Bandits

10 0.20507455 80 nips-2011-Efficient Online Learning via Randomized Rounding

11 0.19534433 272 nips-2011-Stochastic convex optimization with bandit feedback

12 0.19098271 145 nips-2011-Learning Eigenvectors for Free

13 0.18684818 97 nips-2011-Finite Time Analysis of Stratified Sampling for Monte Carlo

14 0.17599438 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries

15 0.15549728 25 nips-2011-Adaptive Hedge

16 0.15216275 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time

17 0.14446883 177 nips-2011-Multi-armed bandits on implicit metric spaces

18 0.13966581 202 nips-2011-On the Universality of Online Mirror Descent

19 0.11635502 175 nips-2011-Multi-Bandit Best Arm Identification

20 0.11341536 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.248), (1, -0.476), (2, 0.036), (3, 0.035), (4, 0.302), (5, -0.029), (6, 0.084), (7, 0.102), (8, 0.131), (9, 0.015), (10, -0.055), (11, 0.047), (12, -0.058), (13, 0.012), (14, -0.001), (15, 0.005), (16, -0.029), (17, -0.049), (18, -0.017), (19, -0.012), (20, -0.016), (21, 0.054), (22, -0.038), (23, -0.025), (24, 0.014), (25, -0.043), (26, -0.005), (27, -0.018), (28, 0.011), (29, 0.035), (30, 0.083), (31, 0.024), (32, -0.021), (33, -0.028), (34, -0.083), (35, -0.001), (36, -0.014), (37, 0.008), (38, -0.015), (39, -0.007), (40, 0.013), (41, -0.03), (42, 0.012), (43, -0.002), (44, 0.011), (45, 0.003), (46, 0.016), (47, -0.026), (48, 0.075), (49, -0.038)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95969892 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits

Author: Yasin Abbasi-yadkori, Csaba Szepesvári, David Tax

Abstract: We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem. In particular, we show that a simple modification of Auer’s UCB algorithm (Auer, 2002) achieves with high probability constant regret. More importantly, we modify and, consequently, improve the analysis of the algorithm for the for linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), Li et al. (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. In both cases, the improvement stems from the construction of smaller confidence sets. For their construction we use a novel tail inequality for vector-valued martingales. 1

2 0.84210008 32 nips-2011-An Empirical Evaluation of Thompson Sampling

Author: Olivier Chapelle, Lihong Li

Abstract: Thompson sampling is one of oldest heuristic to address the exploration / exploitation trade-off, but it is surprisingly unpopular in the literature. We present here some empirical results using Thompson sampling on simulated and real data, and show that it is highly competitive. And since this heuristic is very easy to implement, we argue that it should be part of the standard baselines to compare against. 1

3 0.82028037 245 nips-2011-Selecting the State-Representation in Reinforcement Learning

Author: Odalric-ambrym Maillard, Daniil Ryabko, Rémi Munos

Abstract: The problem of selecting the right state-representation in a reinforcement learning problem is considered. Several models (functions mapping past observations to a finite set) of the observations are given, and it is known that for at least one of these models the resulting state dynamics are indeed Markovian. Without knowing neither which of the models is the correct one, nor what are the probabilistic characteristics of the resulting MDP, it is required to obtain as much reward as the optimal policy for the correct model (or for the best of the correct models, if there are several). We propose an algorithm that achieves that, with a regret of order T 2/3 where T is the horizon time. 1

4 0.81592268 220 nips-2011-Prediction strategies without loss

Author: Michael Kapralov, Rina Panigrahy

Abstract: Consider a sequence of bits where we are trying to predict the next bit from the previous bits. Assume we are allowed to say ‘predict 0’ or ‘predict 1’, and our payoff is +1 if the prediction is correct and −1 otherwise. We will say that at each point in time the loss of an algorithm is the number of wrong predictions minus the number of right predictions so far. In this paper we are interested in algorithms that have essentially zero (expected) loss over any string at any point in time and yet have small regret with respect to always predicting 0 or always predicting 1. For a sequence of length T our algorithm has regret 14 T and loss √ 2 2 T e− T in expectation for all strings. We show that the tradeoff between loss and regret is optimal up to constant factors. Our techniques extend to the general setting of N experts, where the related problem of trading off regret to the best expert for regret to the ’special’ expert has been studied by Even-Dar et al. (COLT’07). We obtain essentially zero loss with respect to the special expert and optimal loss/regret tradeoff, improving upon the results of Even-Dar et al and settling the main question left open in their paper. The strong loss bounds of the algorithm have some surprising consequences. First, we obtain a parameter free algorithm for the experts problem that has optimal regret bounds with respect to k-shifting optima, i.e. bounds with respect to the optimum that is allowed to change arms multiple times. Moreover, for any window of size n the regret of our algorithm to any expert never exceeds O( n(log N + log T )), where N is the number of experts and T is the time horizon, while maintaining the essentially zero loss property. 1

5 0.80625892 272 nips-2011-Stochastic convex optimization with bandit feedback

Author: Alekh Agarwal, Dean P. Foster, Daniel J. Hsu, Sham M. Kakade, Alexander Rakhlin

Abstract: This paper addresses the problem of minimizing a convex, Lipschitz function f over a convex, compact set X under a stochastic bandit feedback model. In this model, the algorithm is allowed to observe noisy realizations of the function value f (x) at any query point x ∈ X . We demonstrate √ a generalization of the ellipsoid algorithm that √ incurs O(poly(d) T ) regret. Since any algorithm has regret at least Ω( T ) on this problem, our algorithm is optimal in terms of the scaling with T . 1

6 0.77457941 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations

7 0.76823753 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction

8 0.75201917 56 nips-2011-Committing Bandits

9 0.7498138 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits

10 0.7157954 61 nips-2011-Contextual Gaussian Process Bandit Optimization

11 0.67146766 80 nips-2011-Efficient Online Learning via Randomized Rounding

12 0.64786524 97 nips-2011-Finite Time Analysis of Stratified Sampling for Monte Carlo

13 0.62558484 145 nips-2011-Learning Eigenvectors for Free

14 0.58977515 25 nips-2011-Adaptive Hedge

15 0.57188177 177 nips-2011-Multi-armed bandits on implicit metric spaces

16 0.52802521 175 nips-2011-Multi-Bandit Best Arm Identification

17 0.50731528 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries

18 0.4611761 208 nips-2011-Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness

19 0.45528784 160 nips-2011-Linear Submodular Bandits and their Application to Diversified Retrieval

20 0.43772084 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.03), (4, 0.047), (14, 0.017), (20, 0.023), (26, 0.035), (31, 0.094), (33, 0.011), (43, 0.061), (45, 0.149), (57, 0.045), (74, 0.045), (79, 0.261), (83, 0.051), (99, 0.05)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.92293006 175 nips-2011-Multi-Bandit Best Arm Identification

Author: Victor Gabillon, Mohammad Ghavamzadeh, Alessandro Lazaric, Sébastien Bubeck

Abstract: We study the problem of identifying the best arm in each of the bandits in a multibandit multi-armed setting. We first propose an algorithm called Gap-based Exploration (GapE) that focuses on the arms whose mean is close to the mean of the best arm in the same bandit (i.e., small gap). We then introduce an algorithm, called GapE-V, which takes into account the variance of the arms in addition to their gap. We prove an upper-bound on the probability of error for both algorithms. Since GapE and GapE-V need to tune an exploration parameter that depends on the complexity of the problem, which is often unknown in advance, we also introduce variations of these algorithms that estimate this complexity online. Finally, we evaluate the performance of these algorithms and compare them to other allocation strategies on a number of synthetic problems.

2 0.88273758 85 nips-2011-Emergence of Multiplication in a Biophysical Model of a Wide-Field Visual Neuron for Computing Object Approaches: Dynamics, Peaks, & Fits

Author: Matthias S. Keil

Abstract: Many species show avoidance reactions in response to looming object approaches. In locusts, the corresponding escape behavior correlates with the activity of the lobula giant movement detector (LGMD) neuron. During an object approach, its firing rate was reported to gradually increase until a peak is reached, and then it declines quickly. The η-function predicts that the LGMD activity is a product ˙ between an exponential function of angular size exp(−Θ) and angular velocity Θ, and that peak activity is reached before time-to-contact (ttc). The η-function has become the prevailing LGMD model because it reproduces many experimental observations, and even experimental evidence for the multiplicative operation was reported. Several inconsistencies remain unresolved, though. Here we address ˙ these issues with a new model (ψ-model), which explicitly connects Θ and Θ to biophysical quantities. The ψ-model avoids biophysical problems associated with implementing exp(·), implements the multiplicative operation of η via divisive inhibition, and explains why activity peaks could occur after ttc. It consistently predicts response features of the LGMD, and provides excellent fits to published experimental data, with goodness of fit measures comparable to corresponding fits with the η-function. 1 Introduction: τ and η Collision sensitive neurons were reported in species such different as monkeys [5, 4], pigeons [36, 34], frogs [16, 20], and insects [33, 26, 27, 10, 38]. This indicates a high ecological relevance, and raises the question about how neurons compute a signal that eventually triggers corresponding movement patterns (e.g. escape behavior or interceptive actions). Here, we will focus on visual stimulation. Consider, for simplicity, a circular object (diameter 2l), which approaches the eye at a collision course with constant velocity v. If we do not have any a priori knowledge about the object in question (e.g. its typical size or speed), then we will be able to access only two information sources. These information sources can be measured at the retina and are called optical variables (OVs). The first is the visual angle Θ, which can be derived from the number of stimulated photore˙ ˙ ceptors (spatial contrast). The second is its rate of change dΘ(t)/dt ≡ Θ(t). Angular velocity Θ is related to temporal contrast. ˙ How should we combine Θ and Θ in order to track an imminent collision? The perhaps simplest ˙ combination is τ (t) ≡ Θ(t)/Θ(t) [13, 18]. If the object hit us at time tc , then τ (t) ≈ tc − t will ∗ Also: www.ir3c.ub.edu, Research Institute for Brain, Cognition, and Behaviour (IR3C) Edifici de Ponent, Campus Mundet, Universitat de Barcelona, Passeig Vall d’Hebron, 171. E-08035 Barcelona 1 give us a running estimation of the time that is left until contact1 . Moreover, we do not need to know anything about the approaching object: The ttc estimation computed by τ is practically independent of object size and velocity. Neurons with τ -like responses were indeed identified in the nucleus retundus of the pigeon brain [34]. In humans, only fast interceptive actions seem to rely exclusively on τ [37, 35]. Accurate ttc estimation, however, seems to involve further mechanisms (rate of disparity change [31]). ˙ Another function of OVs with biological relevance is η ≡ Θ exp(−αΘ), with α = const. [10]. While η-type neurons were found again in pigeons [34] and bullfrogs [20], most data were gathered from the LGMD2 in locusts (e.g. [10, 9, 7, 23]). The η-function is a phenomenological model for the LGMD, and implies three principal hypothesis: (i) An implementation of an exponential function exp(·). Exponentation is thought to take place in the LGMD axon, via active membrane conductances [8]. Experimental data, though, seem to favor a third-power law rather than exp(·). (ii) The LGMD carries out biophysical computations for implementing the multiplicative operation. It has been suggested that multiplication is done within the LGMD itself, by subtracting the loga˙ rithmically encoded variables log Θ − αΘ [10, 8]. (iii) The peak of the η-function occurs before ˆ ttc, at visual angle Θ(t) = 2 arctan(1/α) [9]. It follows ttc for certain stimulus configurations (e.g. ˆ l/|v| 5ms). In principle, t > tc can be accounted for by η(t + δ) with a fixed delay δ < 0 (e.g. −27ms). But other researchers observed that LGMD activity continuous to rise after ttc even for l/|v| 5ms [28]. These discrepancies remain unexplained so far [29], but stimulation dynamics perhaps plays a role. We we will address these three issues by comparing the novel function “ψ” with the η-function. LGMD computations with the ψ-function: No multiplication, no exponentiation 2 A circular object which starts its approach at distance x0 and with speed v projects a visual angle Θ(t) = 2 arctan[l/(x0 − vt)] on the retina [34, 9]. The kinematics is hence entirely specified by the ˙ half-size-to-velocity ratio l/|v|, and x0 . Furthermore, Θ(t) = 2lv/((x0 − vt)2 + l2 ). In order to define ψ, we consider at first the LGMD neuron as an RC-circuit with membrane potential3 V [17] dV Cm = β (Vrest − V ) + gexc (Vexc − V ) + ginh (Vinh − V ) (1) dt 4 Cm = membrane capacity ; β ≡ 1/Rm denotes leakage conductance across the cell membrane (Rm : membrane resistance); gexc and ginh are excitatory and inhibitory inputs. Each conductance gi (i = exc, inh ) can drive the membrane potential to its associated reversal potential Vi (usually Vinh ≤ Vexc ). Shunting inhibition means Vinh = Vrest . Shunting inhibition lurks “silently” because it gets effective only if the neuron is driven away from its resting potential. With synaptic input, the neuron decays into its equilibrium state Vrest β + Vexc gexc + Vinh ginh V∞ ≡ (2) β + gexc + ginh according to V (t) = V∞ (1 − exp(−t/τm )). Without external input, V (t 1) → Vrest . The time scale is set by τm . Without synaptic input τm ≡ Cm /β. Slowly varying inputs gexc , ginh > 0 modify the time scale to approximately τm /(1 + (gexc + ginh )/β). For highly dynamic inputs, such as in late phase of the object approach, the time scale gets dynamical as well. The ψ-model assigns synaptic inputs5 ˙ ˙ ˙ ˙ gexc (t) = ϑ(t), ϑ(t) = ζ1 ϑ(t − ∆tstim ) + (1 − ζ1 )Θ(t) (3a) e ginh (t) = [γϑ(t)] , ϑ(t) = ζ0 ϑ(t − ∆tstim ) + (1 − ζ0 )Θ(t) 1 (3b) This linear approximation gets worse with increasing Θ, but turns out to work well until short before ttc (τ adopts a minimum at tc − 0.428978 · l/|v|). 2 LGMD activity is usually monitored via its postsynaptic neuron, the Descending Contralateral Movement Detector (DCMD) neuron. This represents no problem as LGMD spikes follow DCMD spikes 1:1 under visual stimulation [22] from 300Hz [21] to at least 400Hz [24]. 3 Here we assume that the membrane potential serves as a predictor for the LGMD’s mean firing rate. 4 Set to unity for all simulations. 5 LGMD receives also inhibition from a laterally acting network [21]. The η-function considers only direct feedforward inhibition [22, 6], and so do we. 2 Θ ∈ [7.63°, 180.00°[ temporal resolution ∆ tstim=1.0ms l/|v|=20.00ms, β=1.00, γ=7.50, e=3.00, ζ0=0.90, ζ1=0.99, nrelax=25 0.04 scaled dΘ/dt continuous discretized 0.035 0.03 Θ(t) (input) ϑ(t) (filtered) voltage V(t) (output) t = 56ms max t =300ms c 0.025 0 10 2 η(t): α=3.29, R =1.00 n =10 → t =37ms log Θ(t) amplitude relax max 0.02 0.015 0.01 0.005 0 −0.005 0 50 100 150 200 250 300 −0.01 0 350 time [ms] 50 100 150 200 250 300 350 time [ms] (b) ψ versus η (a) discretized optical variables Figure 1: (a) The continuous visual angle of an approaching object is shown along with its discretized version. Discretization transforms angular velocity from a continuous variable into a series of “spikes” (rescaled). (b) The ψ function with the inputs shown in a, with nrelax = 25 relaxation time steps. Its peak occurs tmax = 56ms before ttc (tc = 300ms). An η function (α = 3.29) that was fitted to ψ shows good agreement. For continuous optical variables, the peak would occur 4ms earlier, and η would have α = 4.44 with R2 = 1. For nrelax = 10, ψ is farther away from its equilibrium at V∞ , and its peak moves 19ms closer to ttc. t =500ms, dia=12.0cm, ∆t c =1.00ms, dt=10.00µs, discrete=1 stim 250 n relax = 50 2 200 α=4.66, R =0.99 [normal] n = 25 relax 2 α=3.91, R =1.00 [normal] n =0 relax tmax [ms] 150 2 α=1.15, R =0.99 [normal] 100 50 0 β=1.00, γ=7.50, e=3.00, V =−0.001, ζ =0.90, ζ =0.99 inh −50 5 10 15 20 25 30 0 35 1 40 45 50 l/|v| [ms] (a) different nrelax (b) different ∆tstim ˆ ˆ Figure 2: The figures plot the relative time tmax ≡ tc − t of the response peak of ψ, V (t), as a function of half-size-to-velocity ratio (points). Line fits with slope α and intercept δ were added (lines). The predicted linear relationship in all cases is consistent with experimental evidence [9]. (a) The stimulus time scale is held constant at ∆tstim = 1ms, and several LGMD time scales are defined by nrelax (= number of intercalated relaxation steps for each integration time step). Bigger values of nrelax move V (t) closer to its equilibrium V∞ (t), implying higher slopes α in turn. (b) LGMD time scale is fixed at nrelax = 25, and ∆tstim is manipulated. Because of the discretization of optical variables (OVs) in our simulation, increasing ∆tstim translates to an overall smaller number of jumps in OVs, but each with higher amplitude. Thus, we say ψ(t) ≡ V (t) if and only if gexc and ginh are defined with the last equation. The time ˙ scale of stimulation is defined by ∆tstim (by default 1ms). The variables ϑ and ϑ are lowpass filtered angular size and rate of expansion, respectively. The amount of filtering is defined by memory constants ζ0 and ζ1 (no filtering if zero). The idea is to continue with generating synaptic input ˙ after ttc, where Θ(t > tc ) = const and thus Θ(t > tc ) = 0. Inhibition is first weighted by γ, and then potentiated by the exponent e. Hodgkin-Huxley potentiates gating variables n, m ∈ [0, 1] instead (potassium ∝ n4 , sodium ∝ m3 , [12]) and multiplies them with conductances. Gabbiani and co-workers found that the function which transforms membrane potential to firing rate is better described by a power function with e = 3 than by exp(·) (Figure 4d in [8]). 3 Dynamics of the ψ-function 3 Discretization. In a typical experiment, a monitor is placed a short distance away from the insect’s eye, and an approaching object is displayed. Computer screens have a fixed spatial resolution, and as a consequence size increments of the displayed object proceed in discrete jumps. The locust retina is furthermore composed of a discrete array of ommatidia units. We therefore can expect a corresponding step-wise increment of Θ with time, although optical and neuronal filtering may ˙ smooth Θ to some extent again, resulting in ϑ (figure 1). Discretization renders Θ discontinuous, ˙ For simulating the dynamics of ψ, we discretized angular size what again will be alleviated in ϑ. ˙ with floor(Θ), and Θ(t) ≈ [Θ(t + ∆tstim ) − Θ(t)]/∆tstim . Discretized optical variables (OVs) were re-normalized to match the range of original (i.e. continuous) OVs. To peak, or not to peak? Rind & Simmons reject the hypothesis that the activity peak signals impending collision on grounds of two arguments [28]: (i) If Θ(t + ∆tstim ) − Θ(t) 3o in consecutively displayed stimulus frames, the illusion of an object approach would be lost. Such stimulation would rather be perceived as a sequence of rapidly appearing (but static) objects, causing reduced responses. (ii) After the last stimulation frame has been displayed (that is Θ = const), LGMD responses keep on building up beyond ttc. This behavior clearly depends on l/|v|, also according to their own data (e.g. Figure 4 in [26]): Response build up after ttc is typically observed for suffi˙ ciently small values of l/|v|. Input into ψ in situations where Θ = const and Θ = 0, respectively, ˙ is accommodated by ϑ and ϑ, respectively. We simulated (i) by setting ∆tstim = 5ms, thus producing larger and more infrequent jumps in discrete OVs than with ∆tstim = 1ms (default). As a consequence, ϑ(t) grows more slowly (deˆ layed build up of inhibition), and the peak occurs later (tmax ≡ tc − t = 10ms with everything else ˆ ˆ identical with figure 1b). The peak amplitude V = V (t) decreases nearly sixfold with respect to default. Our model thus predicts the reduced responses observed by Rind & Simmons [28]. Linearity. Time of peak firing rate is linearly related to l/|v| [10, 9]. The η-function is consistent ˆ with this experimental evidence: t = tc − αl/|v| + δ (e.g. α = 4.7, δ = −27ms). The ψ-function reproduces this relationship as well (figure 2), where α depends critically on the time scale of biophysical processes in the LGMD. We studied the impact of this time scale by choosing 10µs for the numerical integration of equation 1 (algorithm: 4th order Runge-Kutta). Apart from improving the numerical stability of the integration algorithm, ψ is far from its equilibrium V∞ (t) in every moment ˙ t, given the stimulation time scale ∆tstim = 1ms 6 . Now, at each value of Θ(t) and Θ(t), respectively, we intercalated nrelax iterations for integrating ψ. Each iteration takes V (t) asymptotically closer to V∞ (t), and limnrelax 1 V (t) = V∞ (t). If the internal processes in the LGMD cannot keep up with stimulation (nrelax = 0), we obtain slopes values that underestimate experimentally found values (figure 2a). In contrast, for nrelax 25 we get an excellent agreement with the experimentally determined α. This means that – under the reported experimental stimulation conditions (e.g. [9]) – the LGMD would operate relatively close to its steady state7 . Now we fix nrelax at 25 and manipulate ∆tstim instead (figure 2b). The default value ∆tstim = 1ms corresponds to α = 3.91. Slightly bigger values of ∆tstim (2.5ms and 5ms) underestimate the experimental α. In addition, the line fits also return smaller intercept values then. We see tmax < 0 up to l/|v| ≈ 13.5ms – LGMD activity peaks after ttc! Or, in other words, LGMD activity continues to increase after ttc. In the limit, where stimulus dynamics is extremely fast, and LGMD processes are kept far from equilibrium at each instant of the approach, α gets very small. As a consequence, tmax gets largely independent of l/|v|: The activity peak would cling to tmax although we varied l/|v|. 4 Freeze! Experimental data versus steady state of “psi” In the previous section, experimentally plausible values for α were obtained if ψ is close to equilibrium at each instant of time during stimulation. In this section we will thus introduce a steady-state 6 Assuming one ∆tstim for each integration time step. This means that by default stimulation and biophysical dynamics will proceed at identical time scales. 7 Notice that in this moment we can only make relative statements - we do not have data at hand for defining absolute time scales 4 tc=500ms, v=2.00m/s ψ∞ → (β varies), γ=3.50, e=3.00, Vinh=−0.001 tc=500ms, v=2.00m/s ψ∞ → β=2.50, γ=3.50, (e varies), Vinh=−0.001 300 tc=500ms, v=2.00m/s ψ∞ → β=2.50, (γ varies), e=3.00, Vinh=−0.001 350 300 β=10.00 β=5.00 norm. rmse = 0.058...0.153 correlation (β,α)=−0.90 (n=4) ∞ β=1.00 e=4.00 norm. |η−ψ | = 0.009...0.114 e=3.00 300 norm. rmse = 0.014...0.160 correlation (e,α)=0.98 (n=4) ∞ e=2.50 250 250 norm. |η−ψ | = 0.043...0.241 ∞ norm. rmse = 0.085...0.315 correlation (γ,α)=1.00 (n=5) 150 tmax [ms] 200 tmax [ms] 200 tmax [ms] γ=5.00 γ=2.50 γ=1.00 γ=0.50 γ=0.25 e=5.00 norm. |η−ψ | = 0.020...0.128 β=2.50 250 200 150 100 150 100 100 50 50 50 0 5 10 15 20 25 30 35 40 45 0 5 50 10 15 20 l/|v| [ms] 25 30 35 40 45 0 5 50 10 15 20 l/|v| [ms] (a) β varies 25 30 35 40 45 50 l/|v| [ms] (b) e varies (c) γ varies ˆ ˆ Figure 3: Each curve shows how the peak ψ∞ ≡ ψ∞ (t) depends on the half-size-to-velocity ratio. In each display, one parameter of ψ∞ is varied (legend), while the others are held constant (figure title). Line slopes vary according to parameter values. Symbol sizes are scaled according to rmse (see also figure 4). Rmse was calculated between normalized ψ∞ (t) & normalized η(t) (i.e. both functions ∈ [0, 1] with original minimum and maximum indicated by the textbox). To this end, the ˆ peak of the η-function was placed at tc , by choosing, at each parameter value, α = |v| · (tc − t)/l (for determining correlation, the mean value of α was taken across l/|v|). tc=500ms, v=2.00m/s ψ∞ → (β varies), γ=3.50, e=3.00, Vinh=−0.001 tc=500ms, v=2.00m/s ψ∞ → β=2.50, γ=3.50, (e varies), Vinh=−0.001 tc=500ms, v=2.00m/s ψ∞ → β=2.50, (γ varies), e=3.00, Vinh=−0.001 0.25 β=5.00 0.12 β=2.50 β=1.00 0.1 0.08 (normalized η, ψ∞) 0.12 β=10.00 (normalized η, ψ∞) (normalized η, ψ∞) 0.14 0.1 0.08 γ=5.00 γ=2.50 0.2 γ=1.00 γ=0.50 γ=0.25 0.15 0.06 0.04 0.02 0 5 10 15 20 25 30 35 40 45 50 meant |η(t)−ψ∞(t)| meant |η(t)−ψ∞(t)| meant |η(t)−ψ∞(t)| 0.06 0.04 e=5.00 e=4.00 e=3.00 0.02 e=2.50 10 l/|v| [ms] 15 20 25 30 35 40 45 50 l/|v| [ms] (a) β varies (b) e varies 0.1 0.05 0 5 10 15 20 25 30 35 40 45 50 l/|v| [ms] (c) γ varies Figure 4: This figure complements figure 3. It visualizes the time averaged absolute difference between normalized ψ∞ (t) & normalized η(t). For η, its value of α was chosen such that the maxima of both functions coincide. Although not being a fit, it gives a rough estimate on how the shape of both curves deviate from each other. The maximum possible difference would be one. version of ψ (i.e. equation 2 with Vrest = 0, Vexc = 1, and equations 3 plugged in), ψ∞ (t) ≡ e ˙ Θ(t) + Vinh [γΘ(t)] e ˙ β + Θ(t) + [γΘ(t)] (4) (Here we use continuous versions of angular size and rate of expansion). The ψ∞ -function makes life easier when it comes to fitting experimental data. However, it has its limitations, because we brushed the whole dynamic of ψ under the carpet. Figure 3 illustrates how the linˆ ear relationship (=“linearity”) between tmax ≡ tc − t and l/|v| is influenced by changes in parameter values. Changing any of the values of e, β, γ predominantly causes variation in line slopes. The smallest slope changes are obtained by varying Vinh (data not shown; we checked Vinh = 0, −0.001, −0.01, −0.1). For Vinh −0.01, linearity is getting slightly compromised, as slope increases with l/|v| (e.g. Vinh = −1 α ∈ [4.2, 4.7]). In order to get a notion about how well the shape of ψ∞ (t) matches η(t), we computed timeaveraged difference measures between normalized versions of both functions (details: figure 3 & 4). Bigger values of β match η better at smaller, but worse at bigger values of l/|v| (figure 4a). Smaller β cause less variation across l/|v|. As to variation of e, overall curve shapes seem to be best aligned with e = 3 to e = 4 (figure 4b). Furthermore, better matches between ψ∞ (t) and η(t) correspond to bigger values of γ (figure 4c). And finally, Vinh marches again to a different tune (data not shown). Vinh = −0.1 leads to the best agreement (≈ 0.04 across l/|v|) of all Vinh , quite different from the other considered values. For the rest, ψ∞ (t) and η(t) align the same (all have maximum 0.094), 5 ˙ (a) Θ = 126o /s ˙ (b) Θ = 63o /s Figure 5: The original data (legend label “HaGaLa95”) were resampled from ref. [10] and show ˙ DCMD responses to an object approach with Θ = const. Thus, Θ increases linearly with time. The η-function (fitting function: Aη(t+δ)+o) and ψ∞ (fitting function: Aψ∞ (t)+o) were fitted to these data: (a) (Figure 3 Di in [10]) Good fits for ψ∞ are obtained with e = 5 or higher (e = 3 R2 = 0.35 and rmse = 0.644; e = 4 R2 = 0.45 and rmse = 0.592). “Psi” adopts a sigmoid-like curve form which (subjectively) appears to fit the original data better than η. (b) (Figure 3 Dii in [10]) “Psi” yields an excellent fit for e = 3. RoHaTo10 gregarious locust LV=0.03s Θ(t), lv=30ms e011pos014 sgolay with 100 t =107ms max ttc=5.00s ψ adj.R2 0.95 (LM:3) ∞ η(t) adj.R2 1 (TR::1) 2 ψ : R =0.95, rmse=0.004, 3 coefficients ∞ → β=2.22, γ=0.70, e=3.00, V =−0.001, A=0.07, o=0.02, δ=0.00ms inh η: R2=1.00, rmse=0.001 → α=3.30, A=0.08, o=0.0, δ=−10.5ms 3.4 3.6 3.8 4 4.2 4.4 4.6 4.8 5 5.2 time [s] (b) α versus β (a) spike trace Figure 6: (a) DCMD activity in response to a black square (l/|v| = 30ms, legend label “e011pos14”, ref. [30]) approaching to the eye center of a gregarious locust (final visual angle 50o ). Data show the first stimulation so habituation is minimal. The spike trace (sampled at 104 Hz) was full wave rectified, lowpass filtered, and sub-sampled to 1ms resolution. Firing rate was estimated with Savitzky-Golay filtering (“sgolay”). The fits of the η-function (Aη(t + δ) + o; 4 coefficients) and ψ∞ -function (Aψ∞ (t) with fixed e, o, δ, Vinh ; 3 coefficients) provide both excellent fits to firing rate. (b) Fitting coefficient α (→ η-function) inversely correlates with β (→ ψ∞ ) when fitting firing rates of another 5 trials as just described (continuous line = line fit to the data points). Similar correlation values would be obtained if e is fixed at values e = 2.5, 4, 5 c = −0.95, −0.96, −0.91. If o was determined by the fitting algorithm, then c = −0.70. No clear correlations with α were obtained for γ. despite of covering different orders of magnitude with Vinh = 0, −0.001, −0.01. Decelerating approach. Hatsopoulos et al. [10] recorded DCMD activity in response to an ap˙ proaching object which projected image edges on the retina moving at constant velocity: Θ = const. ˙ This “linear approach” is perceived as if the object is getting increasingly implies Θ(t) = Θ0 + Θt. slower. But what appears a relatively unnatural movement pattern serves as a test for the functions η & ψ∞ . Figure 5 illustrates that ψ∞ passes the test, and consistently predicts that activity sharply rises in the initial approach phase, and subsequently declines (η passed this test already in the year 1995). 6 Spike traces. We re-sampled about 30 curves obtained from LGMD recordings from a variety of publications, and fitted η & ψ∞ -functions. We cannot show the results here, but in terms of goodness of fit measures, both functions are in the same ballbark. Rather, figure 6a shows a representative example [30]. When α and β are plotted against each other for five trials, we see a strong inverse correlation (figure 6b). Although five data points are by no means a firm statistical sample, the strong correlation could indicate that β and α play similar roles in both functions. Biophysically, β is the leakage conductance, which determines the (passive) membrane time constant τm ∝ 1/β of the neuron. Voltage drops within τm to exp(−1) times its initial value. Bigger values of β mean shorter τm (i.e., “faster neurons”). Getting back to η, this would suggest α ∝ τm , such that higher (absolute) values for α would possibly indicate a slower dynamic of the underlying processes. 5 Discussion (“The Good, the Bad, and the Ugly”) Up to now, mainly two classes of LGMD models existed: The phenomenological η-function on the one hand, and computational models with neuronal layers presynaptic to the LGMD on the other (e.g. [25, 15]; real-world video sequences & robotics: e.g. [3, 14, 32, 2]). Computational models predict that LGMD response features originate from excitatory and inhibitory interactions in – and between – presynaptic neuronal layers. Put differently, non-linear operations are generated in the presynaptic network, and can be a function of many (model) parameters (e.g. synaptic weights, time constants, etc.). In contrast, the η-function assigns concrete nonlinear operations to the LGMD [7]. The η-function is accessible to mathematical analysis, whereas computational models have to be probed with videos or artificial stimulus sequences. The η-function is vague about biophysical parameters, whereas (good) computational models need to be precise at each (model) parameter value. The η-function establishes a clear link between physical stimulus attributes and LGMD activity: It postulates what is to be computed from the optical variables (OVs). But in computational models, such a clear understanding of LGMD inputs cannot always be expected: Presynaptic processing may strongly transform OVs. The ψ function thus represents an intermediate model class: It takes OVs as input, and connects them with biophysical parameters of the LGMD. For the neurophysiologist, the situation could hardly be any better. Psi implements the multiplicative operation of the η-function by shunting inhibition (equation 1: Vexc ≈ Vrest and Vinh ≈ Vrest ). The η-function fits ψ very well according to our dynamical simulations (figure 1), and satisfactory by the approximate criterion of figure 4. We can conclude that ψ implements the η-function in a biophysically plausible way. However, ψ does neither explicitly specify η’s multiplicative operation, nor its exponential function exp(·). Instead we have an interaction between shunting inhibition and a power law (·)e , with e ≈ 3. So what about power laws in neurons? Because of e > 1, we have an expansive nonlinearity. Expansive power-law nonlinearities are well established in phenomenological models of simple cells of the primate visual cortex [1, 11]. Such models approximate a simple cell’s instantaneous firing rate r from linear filtering of a stimulus (say Y ) by r ∝ ([Y ]+ )e , where [·]+ sets all negative values to zero and lets all positive pass. Although experimental evidence favors linear thresholding operations like r ∝ [Y − Ythres ]+ , neuronal responses can behave according to power law functions if Y includes stimulus-independent noise [19]. Given this evidence, the power-law function of the inhibitory input into ψ could possibly be interpreted as a phenomenological description of presynaptic processes. The power law would also be the critical feature by means of which the neurophysiologist could distinguish between the η function and ψ. A study of Gabbiani et al. aimed to provide direct evidence for a neuronal implementation of the η-function [8]. Consequently, the study would be an evidence ˙ for a biophysical implementation of “direct” multiplication via log Θ − αΘ. Their experimental evidence fell somewhat short in the last part, where “exponentation through active membrane conductances” should invert logarithmic encoding. Specifically, the authors observed that “In 7 out of 10 neurons, a third-order power law best described the data” (sixth-order in one animal). Alea iacta est. Acknowledgments MSK likes to thank Stephen M. Rogers for kindly providing the recording data for compiling figure 6. MSK furthermore acknowledges support from the Spanish Government, by the Ramon and Cajal program and the research grant DPI2010-21513. 7 References [1] D.G. Albrecht and D.B. Hamilton, Striate cortex of monkey and cat: contrast response function, Journal of Neurophysiology 48 (1982), 217–237. [2] S. Bermudez i Badia, U. Bernardet, and P.F.M.J. Verschure, Non-linear neuronal responses as an emergent property of afferent networks: A case study of the locust lobula giant movemement detector, PLoS Computational Biology 6 (2010), no. 3, e1000701. [3] M. Blanchard, F.C. Rind, and F.M.J. Verschure, Collision avoidance using a model of locust LGMD neuron, Robotics and Autonomous Systems 30 (2000), 17–38. [4] D.F. Cooke and M.S.A. Graziano, Super-flinchers and nerves of steel: Defensive movements altered by chemical manipulation of a cortical motor area, Neuron 43 (2004), no. 4, 585–593. [5] L. Fogassi, V. Gallese, L. Fadiga, G. Luppino, M. Matelli, and G. Rizzolatti, Coding of peripersonal space in inferior premotor cortex (area f4), Journal of Neurophysiology 76 (1996), 141–157. [6] F. Gabbiani, I. Cohen, and G. Laurent, Time-dependent activation of feed-forward inhibition in a looming sensitive neuron, Journal of Neurophysiology 94 (2005), 2150–2161. [7] F. Gabbiani, H.G. Krapp, N. Hatsopolous, C.H. Mo, C. Koch, and G. Laurent, Multiplication and stimulus invariance in a looming-sensitive neuron, Journal of Physiology - Paris 98 (2004), 19–34. [8] F. Gabbiani, H.G. Krapp, C. Koch, and G. Laurent, Multiplicative computation in a visual neuron sensitive to looming, Nature 420 (2002), 320–324. [9] F. Gabbiani, H.G. Krapp, and G. Laurent, Computation of object approach by a wide-field, motionsensitive neuron, Journal of Neuroscience 19 (1999), no. 3, 1122–1141. [10] N. Hatsopoulos, F. Gabbiani, and G. Laurent, Elementary computation of object approach by a wide-field visual neuron, Science 270 (1995), 1000–1003. [11] D.J. Heeger, Modeling simple-cell direction selectivity with normalized, half-squared, linear operators, Journal of Neurophysiology 70 (1993), 1885–1898. [12] A.L. Hodkin and A.F. Huxley, A quantitative description of membrane current and its application to conduction and excitation in nerve, Journal of Physiology 117 (1952), 500–544. [13] F. Hoyle, The black cloud, Pinguin Books, London, 1957. [14] M.S. Keil, E. Roca-Morena, and A. Rodr´guez-V´ zquez, A neural model of the locust visual system for ı a detection of object approaches with real-world scenes, Proceedings of the Fourth IASTED International Conference (Marbella, Spain), vol. 5119, 6-8 September 2004, pp. 340–345. [15] M.S. Keil and A. Rodr´guez-V´ zquez, Towards a computational approach for collision avoidance with ı a real-world scenes, Proceedings of SPIE: Bioengineered and Bioinspired Systems (Maspalomas, Gran Canaria, Canary Islands, Spain) (A. Rodr´guez-V´ zquez, D. Abbot, and R. Carmona, eds.), vol. 5119, ı a SPIE - The International Society for Optical Engineering, 19-21 May 2003, pp. 285–296. [16] J.G. King, J.Y. Lettvin, and E.R. Gruberg, Selective, unilateral, reversible loss of behavioral responses to looming stimuli after injection of tetrodotoxin or cadmium chloride into the frog optic nerve, Brain Research 841 (1999), no. 1-2, 20–26. [17] C. Koch, Biophysics of computation: information processing in single neurons, Oxford University Press, New York, 1999. [18] D.N. Lee, A theory of visual control of braking based on information about time-to-collision, Perception 5 (1976), 437–459. [19] K.D. Miller and T.W. Troyer, Neural noise can explain expansive, power-law nonlinearities in neuronal response functions, Journal of Neurophysiology 87 (2002), 653–659. [20] Hideki Nakagawa and Kang Hongjian, Collision-sensitive neurons in the optic tectum of the bullfrog, rana catesbeiana, Journal of Neurophysiology 104 (2010), no. 5, 2487–2499. [21] M. O’Shea and C.H.F. Rowell, Projection from habituation by lateral inhibition, Nature 254 (1975), 53– 55. [22] M. O’Shea and J.L.D. Williams, The anatomy and output connection of a locust visual interneurone: the lobula giant movement detector (lgmd) neurone, Journal of Comparative Physiology 91 (1974), 257–266. [23] S. Peron and F. Gabbiani, Spike frequency adaptation mediates looming stimulus selectivity, Nature Neuroscience 12 (2009), no. 3, 318–326. [24] F.C. Rind, A chemical synapse between two motion detecting neurones in the locust brain, Journal of Experimental Biology 110 (1984), 143–167. [25] F.C. Rind and D.I. Bramwell, Neural network based on the input organization of an identified neuron signaling implending collision, Journal of Neurophysiology 75 (1996), no. 3, 967–985. 8 [26] F.C. Rind and P.J. Simmons, Orthopteran DCMD neuron: a reevaluation of responses to moving objects. I. Selective responses to approaching objects, Journal of Neurophysiology 68 (1992), no. 5, 1654–1666. [27] , Orthopteran DCMD neuron: a reevaluation of responses to moving objects. II. Critical cues for detecting approaching objects, Journal of Neurophysiology 68 (1992), no. 5, 1667–1682. [28] , Signaling of object approach by the dcmd neuron of the locust, Journal of Neurophysiology 77 (1997), 1029–1033. [29] , Reply, Trends in Neuroscience 22 (1999), no. 5, 438. [30] S.M. Roger, G.W.J. Harston, F. Kilburn-Toppin, T. Matheson, M. Burrows, F. Gabbiani, and H.G. Krapp, Spatiotemporal receptive field properties of a looming-sensitive neuron in solitarious and gregarious phases of desert locust, Journal of Neurophysiology 103 (2010), 779–792. [31] S.K. Rushton and J.P. Wann, Weighted combination of size and disparity: a computational model for timing ball catch, Nature Neuroscience 2 (1999), no. 2, 186–190. [32] Yue. S., Rind. F.C., M.S. Keil, J. Cuadri, and R. Stafford, A bio-inspired visual collision detection mechanism for cars: Optimisation of a model of a locust neuron to a novel environment, Neurocomputing 69 (2006), 1591–1598. [33] G.R. Schlotterer, Response of the locust descending movement detector neuron to rapidly approaching and withdrawing visual stimuli, Canadian Journal of Zoology 55 (1977), 1372–1376. [34] H. Sun and B.J. Frost, Computation of different optical variables of looming objects in pigeon nucleus rotundus neurons, Nature Neuroscience 1 (1998), no. 4, 296–303. [35] J.R. Tresilian, Visually timed action: time-out for ’tau’?, Trends in Cognitive Sciences 3 (1999), no. 8, 1999. [36] Y. Wang and B.J. Frost, Time to collision is signalled by neurons in the nucleus rotundus of pigeons, Nature 356 (1992), 236–238. [37] J.P. Wann, Anticipating arrival: is the tau-margin a specious theory?, Journal of Experimental Psychology and Human Perceptual Performance 22 (1979), 1031–1048. [38] M. Wicklein and N.J. Strausfeld, Organization and significance of neurons that detect change of visual depth in the hawk moth manduca sexta, The Journal of Comparative Neurology 424 (2000), no. 2, 356– 376. 9

same-paper 3 0.80735797 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits

Author: Yasin Abbasi-yadkori, Csaba Szepesvári, David Tax

Abstract: We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem. In particular, we show that a simple modification of Auer’s UCB algorithm (Auer, 2002) achieves with high probability constant regret. More importantly, we modify and, consequently, improve the analysis of the algorithm for the for linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), Li et al. (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. In both cases, the improvement stems from the construction of smaller confidence sets. For their construction we use a novel tail inequality for vector-valued martingales. 1

4 0.74253422 56 nips-2011-Committing Bandits

Author: Loc X. Bui, Ramesh Johari, Shie Mannor

Abstract: We consider a multi-armed bandit problem where there are two phases. The first phase is an experimentation phase where the decision maker is free to explore multiple options. In the second phase the decision maker has to commit to one of the arms and stick with it. Cost is incurred during both phases with a higher cost during the experimentation phase. We analyze the regret in this setup, and both propose algorithms and provide upper and lower bounds that depend on the ratio of the duration of the experimentation phase to the duration of the commitment phase. Our analysis reveals that if given the choice, it is optimal to experiment Θ(ln T ) steps and then commit, where T is the time horizon.

5 0.70643705 97 nips-2011-Finite Time Analysis of Stratified Sampling for Monte Carlo

Author: Alexandra Carpentier, Rémi Munos

Abstract: We consider the problem of stratified sampling for Monte-Carlo integration. We model this problem in a multi-armed bandit setting, where the arms represent the strata, and the goal is to estimate a weighted average of the mean values of the arms. We propose a strategy that samples the arms according to an upper bound on their standard deviations and compare its estimation quality to an ideal allocation that would know the standard deviations of the strata. We provide two regret analyses: a distribution� dependent bound O(n−3/2 ) that depends on a measure of the disparity of � the strata, and a distribution-free bound O(n−4/3 ) that does not. 1

6 0.69022864 32 nips-2011-An Empirical Evaluation of Thompson Sampling

7 0.68413782 24 nips-2011-Active learning of neural response functions with Gaussian processes

8 0.67410666 177 nips-2011-Multi-armed bandits on implicit metric spaces

9 0.65620345 61 nips-2011-Contextual Gaussian Process Bandit Optimization

10 0.64796686 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction

11 0.63794798 208 nips-2011-Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness

12 0.63647252 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions

13 0.63540965 283 nips-2011-The Fixed Points of Off-Policy TD

14 0.63469529 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries

15 0.63414824 271 nips-2011-Statistical Tests for Optimization Efficiency

16 0.63295901 245 nips-2011-Selecting the State-Representation in Reinforcement Learning

17 0.6287095 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes

18 0.62730855 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

19 0.62713981 172 nips-2011-Minimax Localization of Structural Information in Large Noisy Matrices

20 0.62677979 231 nips-2011-Randomized Algorithms for Comparison-based Search