nips nips2011 nips2011-32 knowledge-graph by maker-knowledge-mining

32 nips-2011-An Empirical Evaluation of Thompson Sampling


Source: pdf

Author: Olivier Chapelle, Lihong Li

Abstract: Thompson sampling is one of oldest heuristic to address the exploration / exploitation trade-off, but it is surprisingly unpopular in the literature. We present here some empirical results using Thompson sampling on simulated and real data, and show that it is highly competitive. And since this heuristic is very easy to implement, we argue that it should be part of the standard baselines to compare against. 1

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 com Abstract Thompson sampling is one of oldest heuristic to address the exploration / exploitation trade-off, but it is surprisingly unpopular in the literature. [sent-5, score-0.343]

2 We present here some empirical results using Thompson sampling on simulated and real data, and show that it is highly competitive. [sent-6, score-0.126]

3 1 Introduction Various algorithms have been proposed to solve exploration / exploitation or bandit problems. [sent-8, score-0.314]

4 One of the most popular is Upper Confidence Bound or UCB [7, 3], for which strong theoretical guarantees on the regret can be proved. [sent-9, score-0.225]

5 The idea of Thompson sampling is to randomly draw each arm according to its probability of being optimal. [sent-14, score-0.24]

6 In contrast to a full Bayesian method like Gittins index, one can often implement Thompson sampling efficiently. [sent-15, score-0.102]

7 Recent results using Thompson sampling seem promising [5, 6, 14, 12]. [sent-16, score-0.102]

8 In this work, we present some empirical results, first on a simulated problem and then on two realworld ones: display advertisement selection and news article recommendation. [sent-19, score-0.285]

9 In all cases, despite its simplicity, Thompson sampling achieves state-of-the-art results, and in some cases significantly outperforms other alternatives like UCB. [sent-20, score-0.102]

10 The findings suggest the necessity to include Thompson sampling as part of the standard baselines to compare against, and to develop finite-time regret bound for this empirically successful algorithm. [sent-21, score-0.357]

11 2 Algorithm The contextual bandit setting is as follows. [sent-22, score-0.114]

12 After choosing an action a ∈ A, we observe a reward r. [sent-24, score-0.144]

13 The goal is to find a policy that selects actions such that the cumulative reward is as large as possible. [sent-25, score-0.189]

14 Thompson sampling is best understood in a Bayesian setting as follows. [sent-26, score-0.102]

15 1 In the realizable case, the reward is a stochastic function of the action, context and the unknown, true parameter θ∗ . [sent-29, score-0.118]

16 If we are just interested in maximizing the immediate reward (exploitation), then one should choose the action that maximizes E(r|a, x) = E(r|a, x, θ)P (θ|D)dθ. [sent-32, score-0.144]

17 But in an exploration / exploitation setting, the probability matching heuristic consists in randomly selecting an action a according to its probability of being optimal. [sent-33, score-0.288]

18 , T do Receive context xt Draw θt according to P (θ|D) Select at = arg maxa Er (r|xt , a, θt ) Observe reward rt D = D ∪ (xt , at , rt ) end for In the standard K-armed Bernoulli bandit, each action corresponds to the choice of an arm. [sent-40, score-0.217]

19 The ∗ reward of the i-th arm follows a Bernoulli distribution with mean θi . [sent-41, score-0.198]

20 It is standard to model the mean reward of each arm using a Beta distribution since it is the conjugate distribution of the binomial distribution. [sent-42, score-0.198]

21 The instantiation of Thompson sampling for the Bernoulli bandit is given in algorithm 2. [sent-43, score-0.216]

22 Algorithm 2 Thompson sampling for the Bernoulli bandit Require: α, β prior parameters of a Beta distribution Si = 0, Fi = 0, ∀i. [sent-45, score-0.244]

23 end for Draw arm ˆ = arg maxi θi and observe reward r ı if r = 1 then Sˆ = Sˆ + 1 ı ı else Fˆ = Fˆ + 1 ı ı end if end for 3 Simulations We present some simulation results with Thompson sampling for the Bernoulli bandit problem and compare them to the UCB algorithm. [sent-53, score-0.434]

24 The reward probability of each of the K arms is modeled by a Beta distribution which is updated after an arm is selected (see algorithm 2). [sent-54, score-0.262]

25 Specifically, we chose the arm for which the following upper 2 confidence bound [8, page 278] is maximum: k 2 m log m k + m 1 δ + 2 log 1 δ , m 1 , t δ= (1) where m is the number of times the arm has been selected and k its total reward. [sent-57, score-0.262]

26 In this simulation, the best arm has a reward probability of 0. [sent-59, score-0.198]

27 The regret as a function of T for various settings is plotted in figure 1. [sent-63, score-0.225]

28 An asymptotic lower bound has been established in [7] for the regret of a bandit algorithm: K R(T ) ≥ log(T ) i=1 p∗ − pi + o(1) , D(pi ||p∗ ) (2) where pi is the reward probability of the i-th arm, p∗ = max pi and D is the Kullback-Leibler divergence. [sent-64, score-0.636]

29 The plots in figure 1 show that the regrets are indeed logarithmic in T (the linear trend on the right hand side) and it turns out that the observed constants (slope of the lines) are close to the optimal constants given by the lower bound (2). [sent-66, score-0.116]

30 1 10000 900 800 700 Thompson UCB Asymptotic lower bound Thompson UCB Asymptotic lower bound 8000 Regret Regret 600 500 400 6000 4000 300 200 2000 100 0 2 10 3 4 10 5 10 T 0 2 10 6 10 10 K=10, ε=0. [sent-71, score-0.122]

31 As with any Bayesian algorithm, one can wonder about the robustness of Thompson sampling to prior mismatch. [sent-78, score-0.13]

32 In particular, when the reward probability of the best arm is 0. [sent-83, score-0.198]

33 We can thus conclude that in these simulations, Thompson sampling is asymptotically optimal and achieves a smaller regret than the popular UCB algorithm. [sent-86, score-0.351]

34 Optimistic Thompson sampling The intuition behind UCB and Thompson sampling is that, for the purpose of exploration, it is beneficial to boost the predictions of actions for which we are uncertain. [sent-88, score-0.227]

35 But Thompson sampling modifies the predictions in both directions and there is apparently no benefit in decreasing a prediction. [sent-89, score-0.102]

36 This observation led to a recently proposed algorithm called Optimistic Bayesian sampling [11] in which the modified score is never smaller than the mean. [sent-90, score-0.102]

37 Simulations in [12] showed some gains using this optimistic version of Thompson sampling. [sent-92, score-0.112]

38 A possible explanation is that when the number of arms is large, it is likely that, in standard Thompson sampling, the selected arm has a already a boosted score. [sent-96, score-0.165]

39 Posterior reshaping Thompson sampling is a heuristic advocating to draw samples from the posterior, but one might consider changing that heuristic to draw samples from a modified distribution. [sent-97, score-0.258]

40 In particular, sharpening the posterior would have the effect of increasing exploitation while widening it would favor exploration. [sent-98, score-0.155]

41 Figure 3 shows the average and distribution of regret for different values of α. [sent-101, score-0.225]

42 Values of α smaller than 1 decrease the amount of exploration and often result in lower regret. [sent-102, score-0.13]

43 But the price to pay is a higher variance: in some runs, the regret is very large. [sent-103, score-0.225]

44 The average regret is asymptotically not as good as with α = 1, but tends to be better in the non-asymptotic regime. [sent-104, score-0.249]

45 Impact of delay In a real world system, the feedback is typically not processed immediately because of various runtime constraints. [sent-105, score-0.12]

46 We now try to quantify the impact of this delay by doing some simulations that mimic the problem of news articles recommendation [9] that will be described in section 5. [sent-107, score-0.302]

47 25 T Figure 3: Thompson sampling where the parameters of the Beta posterior distribution have been divided by α. [sent-112, score-0.156]

48 Table 1: Influence of the delay: regret when the feedback is provided every δ steps. [sent-117, score-0.255]

49 Table 1 shows the average regret (over 100 repetitions) of Thompson sampling and UCB at T = 106 . [sent-129, score-0.327]

50 An interesting quantity in this simulation is the relative regret of UCB and Thompson sampling. [sent-130, score-0.245]

51 It appears that Thompson sampling is more robust than UCB when the delay is long. [sent-131, score-0.192]

52 Thompson sampling alleviates the influence of delayed feedback by randomizing over actions; on the other hand, UCB is deterministic and suffers a larger regret in case of a sub-optimal choice. [sent-132, score-0.377]

53 Given a user visiting a publisher page, the problem is to select the best advertisement for that user. [sent-134, score-0.145]

54 A key element in this matching problem is the click-through rate (CTR) estimation: what is the probability that a given ad will be clicked given some context (user, page visited)? [sent-135, score-0.19]

55 Indeed, in a cost-per-click (CPC) campaign, the advertiser only pays when his ad gets clicked. [sent-136, score-0.158]

56 There is of course a fundamental exploration / exploitation dilemma here: in order to learn the CTR of an ad, it needs to be displayed, leading to a potential loss of short-term revenue. [sent-138, score-0.2]

57 {Each weight wi has an independent prior N (mi , qi )} for t = 1, . [sent-151, score-0.107]

58 Find w as the minimizer of: mi = wi 1 2 d n qi (wi − mi )2 + i=1 log(1 + exp(−yj w xj )). [sent-158, score-0.139]

59 j=1 n x2 pj (1 − pj ), pj = (1 + exp(−w xj ))−1 {Laplace approximation} ij qi = qi + j=1 end for Evaluating an explore / exploit policy is difficult because we typically do not know the reward of an action that was not chosen. [sent-159, score-0.427]

60 A possible solution, as we shall see in section 5, is to use a replayer in which previous, randomized exploration data can be used to produce an unbiased offline estimator of the new policy [10]. [sent-160, score-0.207]

61 [15] studies another promising approach using the idea of importance weighting, but the method applies only when the policy is static, which is not the case for online bandit algorithms that constantly adapt to its history. [sent-162, score-0.162]

62 More precisely, the context and the ads are real, but the clicks are simulated using a weight vector w∗ . [sent-164, score-0.191]

63 About 13,000 contexts, representing a small random subset of the total traffic, are presented every hour to the policy which has to choose an ad among a set of eligible ads. [sent-167, score-0.276]

64 The number of eligible ads for each context depends on numerous constraints set by the advertiser and the publisher. [sent-168, score-0.184]

65 Note that in this experiment, the number of eligible ads is smaller than what we would observe in live traffic because we restricted the set of advertisers. [sent-170, score-0.115]

66 A feature vector is constructed for every (context, ad) pair and the policy decides which ad to show. [sent-172, score-0.158]

67 A click for that ad is then generated with probability (1 + exp(−w∗ x))−1 . [sent-173, score-0.148]

68 The total number of clicks received during this one hour period is the reward. [sent-175, score-0.147]

69 These strategies are: Thompson sampling This is algorithm 1 where each weight is drawn independently according to −1 its Gaussian posterior approximation N (mi , qi ) (see algorithm 3). [sent-178, score-0.235]

70 It selects the ad based on mean and standard deviation. [sent-184, score-0.11]

71 It also has a factor α to control the exploration / exploitation trade-off. [sent-185, score-0.2]

72 More precisely, LinUCB selects the ad for which d i=1 d −1 2 mi xi + α i=1 qi xi is maximum. [sent-186, score-0.207]

73 45 Table 2: CTR regrets on the display advertising data. [sent-191, score-0.145]

74 01 0 0 10 20 30 40 50 Hour 60 70 80 90 Figure 4: CTR regret over the 4 days test period for 3 algorithms: Thompson sampling with α = 0. [sent-217, score-0.371]

75 ε-greedy Mix between exploitation and random: with ε probability, select a random ad; otherwise, select the one with the highest mean. [sent-221, score-0.155]

76 Thompson sampling achieves the best regret and interestingly the modified version with α = 0. [sent-227, score-0.327]

77 Also, the fact that exploit-only is so much better than random might explain why ε-greedy does not beat it: whenever this strategy chooses a random action, it suffers a large regret in average which is not compensated by its exploration benefit. [sent-233, score-0.324]

78 Finally figure 4 shows the regret of three algorithms across time. [sent-234, score-0.225]

79 As expected, the regret has a decreasing trend over time. [sent-235, score-0.225]

80 5 News Article Recommendation In this section, we consider another application of Thompson sampling in personalized news article recommendation on Yahoo! [sent-236, score-0.38]

81 Each time a user visits the portal, a news article out of a small pool of hand-picked candidates is recommended. [sent-238, score-0.32]

82 The candidate pool is dynamic: old articles may retire and new articles may be added in. [sent-239, score-0.165]

83 The goal is to choose the most interesting article to users, or formally, maximize the total number of clicks on the recommended articles. [sent-241, score-0.213]

84 In this case, we treat articles as arms, and define the payoff to be 1 if the article is clicked on and 0 otherwise. [sent-242, score-0.241]

85 2 1 10 30 60 Delay (min) Figure 5: Normalized CTRs of various algorithm on the news article recommendation data with different update delays: {10, 30, 60} minutes. [sent-252, score-0.254]

86 Each user was associated with a binary raw feature vector of over 1000 dimension, which indicates information of the user like age, gender, geographical location, behavioral targeting, etc. [sent-254, score-0.116]

87 Here, we adopted the simpler principal component analysis (PCA), which did not appear to affect the bandit algorithms much in our experience. [sent-257, score-0.114]

88 We use logistic regression, as in Algorithm 3, to model article CTRs: given a user feature vector x ∈ 21 , the probability of click on an article a is (1 + exp(−x wa ))−1 for some weight vector wa ∈ 21 to be learned. [sent-261, score-0.432]

89 Furthermore, given the size of data, we have not found article features to be helpful. [sent-264, score-0.135]

90 Indeed, it is shown in our previous work [9, Figure 5] that article features are helpful in this domain only when data are highly sparse. [sent-265, score-0.135]

91 Given the small size of candidate pool, we adopt the unbiased offline evaluation method of [10] to compare various bandit algorithms. [sent-266, score-0.139]

92 In particular, we collected randomized serving events for a random fraction of user visits; in other words, these random users were recommended an article chosen uniformly from the candidate pool. [sent-267, score-0.307]

93 From 7 days in June 2009, over 34M randomized serving events were obtained. [sent-268, score-0.113]

94 In contrast, randomized algorithms are more robust to delay, and when there is a one-hour delay, (optimistic) Thompson sampling is significant better than others (given the size of our data). [sent-274, score-0.137]

95 6 Conclusion The extensive experimental evaluation carried out in this paper reveals that Thompson sampling is a very effective heuristic for addressing the exploration / exploitation trade-off. [sent-275, score-0.343]

96 In its simplest form, it does not have any parameter to tune, but our results show that tweaking the posterior to reduce exploration can be beneficial. [sent-276, score-0.185]

97 First, it would hopefully contribute to make Thompson sampling as popular as other algorithms for which regret bounds exist. [sent-281, score-0.327]

98 Solving two-armed bernoulli bandit problems using a bayesian learning automaton. [sent-310, score-0.181]

99 Unbiased offline evaluation of contextual-banditbased news article recommendation algorithms. [sent-335, score-0.254]

100 Simulation studies in optimistic Bayesian sampling in contextual-bandit problems. [sent-346, score-0.214]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('thompson', 0.77), ('regret', 0.225), ('ucb', 0.223), ('ctr', 0.144), ('article', 0.135), ('bandit', 0.114), ('optimistic', 0.112), ('ad', 0.11), ('sampling', 0.102), ('arm', 0.101), ('exploitation', 0.101), ('exploration', 0.099), ('reward', 0.097), ('delay', 0.09), ('beta', 0.087), ('linucb', 0.074), ('hour', 0.07), ('ads', 0.067), ('news', 0.066), ('arms', 0.064), ('advertising', 0.058), ('user', 0.058), ('articles', 0.056), ('clicks', 0.055), ('regrets', 0.055), ('qi', 0.055), ('ctrs', 0.055), ('posterior', 0.054), ('recommendation', 0.053), ('policy', 0.048), ('advertiser', 0.048), ('eligible', 0.048), ('action', 0.047), ('mi', 0.042), ('gure', 0.042), ('bernoulli', 0.041), ('heuristic', 0.041), ('asymptotic', 0.04), ('lihong', 0.039), ('click', 0.038), ('yahoo', 0.038), ('draw', 0.037), ('simulations', 0.037), ('gittins', 0.036), ('serving', 0.036), ('dence', 0.035), ('randomized', 0.035), ('tried', 0.034), ('pool', 0.033), ('pi', 0.033), ('pj', 0.033), ('display', 0.032), ('ots', 0.032), ('benedict', 0.032), ('bristol', 0.032), ('publisher', 0.032), ('tweaking', 0.032), ('langford', 0.031), ('lower', 0.031), ('bound', 0.03), ('page', 0.03), ('feedback', 0.03), ('xt', 0.03), ('clicked', 0.029), ('visits', 0.028), ('advertisement', 0.028), ('clara', 0.028), ('prior', 0.028), ('select', 0.027), ('exploit', 0.026), ('bayesian', 0.026), ('eg', 0.026), ('traf', 0.026), ('ine', 0.026), ('unbiased', 0.025), ('asymptotically', 0.024), ('weight', 0.024), ('personalized', 0.024), ('simulated', 0.024), ('recommended', 0.023), ('santa', 0.023), ('variance', 0.023), ('actions', 0.023), ('maxa', 0.022), ('period', 0.022), ('days', 0.022), ('mismatch', 0.022), ('context', 0.021), ('payoff', 0.021), ('shifted', 0.021), ('chu', 0.021), ('cumulative', 0.021), ('wa', 0.021), ('events', 0.02), ('simulation', 0.02), ('old', 0.02), ('delayed', 0.02), ('visited', 0.02), ('repetitions', 0.019), ('con', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 32 nips-2011-An Empirical Evaluation of Thompson Sampling

Author: Olivier Chapelle, Lihong Li

Abstract: Thompson sampling is one of oldest heuristic to address the exploration / exploitation trade-off, but it is surprisingly unpopular in the literature. We present here some empirical results using Thompson sampling on simulated and real data, and show that it is highly competitive. And since this heuristic is very easy to implement, we argue that it should be part of the standard baselines to compare against. 1

2 0.25058243 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits

Author: Yasin Abbasi-yadkori, Csaba Szepesvári, David Tax

Abstract: We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem. In particular, we show that a simple modification of Auer’s UCB algorithm (Auer, 2002) achieves with high probability constant regret. More importantly, we modify and, consequently, improve the analysis of the algorithm for the for linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), Li et al. (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. In both cases, the improvement stems from the construction of smaller confidence sets. For their construction we use a novel tail inequality for vector-valued martingales. 1

3 0.20986629 245 nips-2011-Selecting the State-Representation in Reinforcement Learning

Author: Odalric-ambrym Maillard, Daniil Ryabko, Rémi Munos

Abstract: The problem of selecting the right state-representation in a reinforcement learning problem is considered. Several models (functions mapping past observations to a finite set) of the observations are given, and it is known that for at least one of these models the resulting state dynamics are indeed Markovian. Without knowing neither which of the models is the correct one, nor what are the probabilistic characteristics of the resulting MDP, it is required to obtain as much reward as the optimal policy for the correct model (or for the best of the correct models, if there are several). We propose an algorithm that achieves that, with a regret of order T 2/3 where T is the horizon time. 1

4 0.17577206 56 nips-2011-Committing Bandits

Author: Loc X. Bui, Ramesh Johari, Shie Mannor

Abstract: We consider a multi-armed bandit problem where there are two phases. The first phase is an experimentation phase where the decision maker is free to explore multiple options. In the second phase the decision maker has to commit to one of the arms and stick with it. Cost is incurred during both phases with a higher cost during the experimentation phase. We analyze the regret in this setup, and both propose algorithms and provide upper and lower bounds that depend on the ratio of the duration of the experimentation phase to the duration of the commitment phase. Our analysis reveals that if given the choice, it is optimal to experiment Θ(ln T ) steps and then commit, where T is the time horizon.

5 0.16420226 61 nips-2011-Contextual Gaussian Process Bandit Optimization

Author: Andreas Krause, Cheng S. Ong

Abstract: How should we design experiments to maximize performance of a complex system, taking into account uncontrollable environmental conditions? How should we select relevant documents (ads) to display, given information about the user? These tasks can be formalized as contextual bandit problems, where at each round, we receive context (about the experimental conditions, the query), and have to choose an action (parameters, documents). The key challenge is to trade off exploration by gathering data for estimating the mean payoff function over the context-action space, and to exploit by choosing an action deemed optimal based on the gathered data. We model the payoff function as a sample from a Gaussian process defined over the joint context-action space, and develop CGP-UCB, an intuitive upper-confidence style algorithm. We show that by mixing and matching kernels for contexts and actions, CGP-UCB can handle a variety of practical applications. We further provide generic tools for deriving regret bounds when using such composite kernel functions. Lastly, we evaluate our algorithm on two case studies, in the context of automated vaccine design and sensor management. We show that context-sensitive optimization outperforms no or naive use of context. 1

6 0.15035516 220 nips-2011-Prediction strategies without loss

7 0.14399269 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits

8 0.13536413 97 nips-2011-Finite Time Analysis of Stratified Sampling for Monte Carlo

9 0.13359478 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations

10 0.10730389 160 nips-2011-Linear Submodular Bandits and their Application to Diversified Retrieval

11 0.10567774 175 nips-2011-Multi-Bandit Best Arm Identification

12 0.10407271 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning

13 0.093443751 177 nips-2011-Multi-armed bandits on implicit metric spaces

14 0.090963863 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction

15 0.089149065 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

16 0.086339183 272 nips-2011-Stochastic convex optimization with bandit feedback

17 0.08463496 145 nips-2011-Learning Eigenvectors for Free

18 0.082262807 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning

19 0.074620605 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans

20 0.074362956 80 nips-2011-Efficient Online Learning via Randomized Rounding


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.169), (1, -0.271), (2, 0.05), (3, 0.096), (4, 0.128), (5, -0.054), (6, 0.041), (7, 0.126), (8, 0.066), (9, 0.071), (10, -0.147), (11, 0.026), (12, -0.108), (13, -0.019), (14, 0.014), (15, 0.047), (16, -0.07), (17, -0.07), (18, 0.007), (19, -0.009), (20, -0.005), (21, 0.084), (22, 0.007), (23, -0.023), (24, 0.009), (25, -0.019), (26, -0.055), (27, -0.054), (28, 0.027), (29, 0.068), (30, 0.097), (31, -0.006), (32, -0.018), (33, -0.05), (34, -0.15), (35, 0.058), (36, -0.024), (37, -0.004), (38, -0.052), (39, -0.022), (40, -0.005), (41, 0.011), (42, -0.021), (43, 0.012), (44, -0.009), (45, 0.077), (46, 0.013), (47, -0.052), (48, 0.019), (49, -0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94350618 32 nips-2011-An Empirical Evaluation of Thompson Sampling

Author: Olivier Chapelle, Lihong Li

Abstract: Thompson sampling is one of oldest heuristic to address the exploration / exploitation trade-off, but it is surprisingly unpopular in the literature. We present here some empirical results using Thompson sampling on simulated and real data, and show that it is highly competitive. And since this heuristic is very easy to implement, we argue that it should be part of the standard baselines to compare against. 1

2 0.80749899 245 nips-2011-Selecting the State-Representation in Reinforcement Learning

Author: Odalric-ambrym Maillard, Daniil Ryabko, Rémi Munos

Abstract: The problem of selecting the right state-representation in a reinforcement learning problem is considered. Several models (functions mapping past observations to a finite set) of the observations are given, and it is known that for at least one of these models the resulting state dynamics are indeed Markovian. Without knowing neither which of the models is the correct one, nor what are the probabilistic characteristics of the resulting MDP, it is required to obtain as much reward as the optimal policy for the correct model (or for the best of the correct models, if there are several). We propose an algorithm that achieves that, with a regret of order T 2/3 where T is the horizon time. 1

3 0.77773905 56 nips-2011-Committing Bandits

Author: Loc X. Bui, Ramesh Johari, Shie Mannor

Abstract: We consider a multi-armed bandit problem where there are two phases. The first phase is an experimentation phase where the decision maker is free to explore multiple options. In the second phase the decision maker has to commit to one of the arms and stick with it. Cost is incurred during both phases with a higher cost during the experimentation phase. We analyze the regret in this setup, and both propose algorithms and provide upper and lower bounds that depend on the ratio of the duration of the experimentation phase to the duration of the commitment phase. Our analysis reveals that if given the choice, it is optimal to experiment Θ(ln T ) steps and then commit, where T is the time horizon.

4 0.76597267 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits

Author: Yasin Abbasi-yadkori, Csaba Szepesvári, David Tax

Abstract: We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem. In particular, we show that a simple modification of Auer’s UCB algorithm (Auer, 2002) achieves with high probability constant regret. More importantly, we modify and, consequently, improve the analysis of the algorithm for the for linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), Li et al. (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. In both cases, the improvement stems from the construction of smaller confidence sets. For their construction we use a novel tail inequality for vector-valued martingales. 1

5 0.69378436 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations

Author: Shie Mannor, Ohad Shamir

Abstract: We consider an adversarial online learning setting where a decision maker can choose an action in every stage of the game. In addition to observing the reward of the chosen action, the decision maker gets side observations on the reward he would have obtained had he chosen some of the other actions. The observation structure is encoded as a graph, where node i is linked to node j if sampling i provides information on the reward of j. This setting naturally interpolates between the well-known “experts” setting, where the decision maker can view all rewards, and the multi-armed bandits setting, where the decision maker can only view the reward of the chosen action. We develop practical algorithms with provable regret guarantees, which depend on non-trivial graph-theoretic properties of the information feedback structure. We also provide partially-matching lower bounds. 1

6 0.68628973 97 nips-2011-Finite Time Analysis of Stratified Sampling for Monte Carlo

7 0.66552591 175 nips-2011-Multi-Bandit Best Arm Identification

8 0.65980184 272 nips-2011-Stochastic convex optimization with bandit feedback

9 0.63328689 61 nips-2011-Contextual Gaussian Process Bandit Optimization

10 0.59784222 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits

11 0.53792715 160 nips-2011-Linear Submodular Bandits and their Application to Diversified Retrieval

12 0.50741565 220 nips-2011-Prediction strategies without loss

13 0.46640211 177 nips-2011-Multi-armed bandits on implicit metric spaces

14 0.43958482 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction

15 0.4066875 25 nips-2011-Adaptive Hedge

16 0.39573222 50 nips-2011-Budgeted Optimization with Concurrent Stochastic-Duration Experiments

17 0.34695563 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning

18 0.34495559 208 nips-2011-Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness

19 0.33470613 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

20 0.32853389 80 nips-2011-Efficient Online Learning via Randomized Rounding


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.021), (4, 0.027), (7, 0.13), (10, 0.03), (20, 0.032), (26, 0.025), (31, 0.094), (33, 0.012), (43, 0.047), (45, 0.136), (47, 0.037), (57, 0.067), (74, 0.046), (78, 0.016), (79, 0.055), (83, 0.047), (84, 0.012), (99, 0.062)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.88094956 32 nips-2011-An Empirical Evaluation of Thompson Sampling

Author: Olivier Chapelle, Lihong Li

Abstract: Thompson sampling is one of oldest heuristic to address the exploration / exploitation trade-off, but it is surprisingly unpopular in the literature. We present here some empirical results using Thompson sampling on simulated and real data, and show that it is highly competitive. And since this heuristic is very easy to implement, we argue that it should be part of the standard baselines to compare against. 1

2 0.82054472 86 nips-2011-Empirical models of spiking in neural populations

Author: Jakob H. Macke, Lars Buesing, John P. Cunningham, Byron M. Yu, Krishna V. Shenoy, Maneesh Sahani

Abstract: Neurons in the neocortex code and compute as part of a locally interconnected population. Large-scale multi-electrode recording makes it possible to access these population processes empirically by fitting statistical models to unaveraged data. What statistical structure best describes the concurrent spiking of cells within a local network? We argue that in the cortex, where firing exhibits extensive correlations in both time and space and where a typical sample of neurons still reflects only a very small fraction of the local population, the most appropriate model captures shared variability by a low-dimensional latent process evolving with smooth dynamics, rather than by putative direct coupling. We test this claim by comparing a latent dynamical model with realistic spiking observations to coupled generalised linear spike-response models (GLMs) using cortical recordings. We find that the latent dynamical approach outperforms the GLM in terms of goodness-offit, and reproduces the temporal correlations in the data more accurately. We also compare models whose observations models are either derived from a Gaussian or point-process models, finding that the non-Gaussian model provides slightly better goodness-of-fit and more realistic population spike counts. 1

3 0.81198335 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits

Author: Yasin Abbasi-yadkori, Csaba Szepesvári, David Tax

Abstract: We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem. In particular, we show that a simple modification of Auer’s UCB algorithm (Auer, 2002) achieves with high probability constant regret. More importantly, we modify and, consequently, improve the analysis of the algorithm for the for linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), Li et al. (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. In both cases, the improvement stems from the construction of smaller confidence sets. For their construction we use a novel tail inequality for vector-valued martingales. 1

4 0.8086009 152 nips-2011-Learning in Hilbert vs. Banach Spaces: A Measure Embedding Viewpoint

Author: Kenji Fukumizu, Gert R. Lanckriet, Bharath K. Sriperumbudur

Abstract: The goal of this paper is to investigate the advantages and disadvantages of learning in Banach spaces over Hilbert spaces. While many works have been carried out in generalizing Hilbert methods to Banach spaces, in this paper, we consider the simple problem of learning a Parzen window classifier in a reproducing kernel Banach space (RKBS)—which is closely related to the notion of embedding probability measures into an RKBS—in order to carefully understand its pros and cons over the Hilbert space classifier. We show that while this generalization yields richer distance measures on probabilities compared to its Hilbert space counterpart, it however suffers from serious computational drawback limiting its practical applicability, which therefore demonstrates the need for developing efficient learning algorithms in Banach spaces.

5 0.80640101 85 nips-2011-Emergence of Multiplication in a Biophysical Model of a Wide-Field Visual Neuron for Computing Object Approaches: Dynamics, Peaks, & Fits

Author: Matthias S. Keil

Abstract: Many species show avoidance reactions in response to looming object approaches. In locusts, the corresponding escape behavior correlates with the activity of the lobula giant movement detector (LGMD) neuron. During an object approach, its firing rate was reported to gradually increase until a peak is reached, and then it declines quickly. The η-function predicts that the LGMD activity is a product ˙ between an exponential function of angular size exp(−Θ) and angular velocity Θ, and that peak activity is reached before time-to-contact (ttc). The η-function has become the prevailing LGMD model because it reproduces many experimental observations, and even experimental evidence for the multiplicative operation was reported. Several inconsistencies remain unresolved, though. Here we address ˙ these issues with a new model (ψ-model), which explicitly connects Θ and Θ to biophysical quantities. The ψ-model avoids biophysical problems associated with implementing exp(·), implements the multiplicative operation of η via divisive inhibition, and explains why activity peaks could occur after ttc. It consistently predicts response features of the LGMD, and provides excellent fits to published experimental data, with goodness of fit measures comparable to corresponding fits with the η-function. 1 Introduction: τ and η Collision sensitive neurons were reported in species such different as monkeys [5, 4], pigeons [36, 34], frogs [16, 20], and insects [33, 26, 27, 10, 38]. This indicates a high ecological relevance, and raises the question about how neurons compute a signal that eventually triggers corresponding movement patterns (e.g. escape behavior or interceptive actions). Here, we will focus on visual stimulation. Consider, for simplicity, a circular object (diameter 2l), which approaches the eye at a collision course with constant velocity v. If we do not have any a priori knowledge about the object in question (e.g. its typical size or speed), then we will be able to access only two information sources. These information sources can be measured at the retina and are called optical variables (OVs). The first is the visual angle Θ, which can be derived from the number of stimulated photore˙ ˙ ceptors (spatial contrast). The second is its rate of change dΘ(t)/dt ≡ Θ(t). Angular velocity Θ is related to temporal contrast. ˙ How should we combine Θ and Θ in order to track an imminent collision? The perhaps simplest ˙ combination is τ (t) ≡ Θ(t)/Θ(t) [13, 18]. If the object hit us at time tc , then τ (t) ≈ tc − t will ∗ Also: www.ir3c.ub.edu, Research Institute for Brain, Cognition, and Behaviour (IR3C) Edifici de Ponent, Campus Mundet, Universitat de Barcelona, Passeig Vall d’Hebron, 171. E-08035 Barcelona 1 give us a running estimation of the time that is left until contact1 . Moreover, we do not need to know anything about the approaching object: The ttc estimation computed by τ is practically independent of object size and velocity. Neurons with τ -like responses were indeed identified in the nucleus retundus of the pigeon brain [34]. In humans, only fast interceptive actions seem to rely exclusively on τ [37, 35]. Accurate ttc estimation, however, seems to involve further mechanisms (rate of disparity change [31]). ˙ Another function of OVs with biological relevance is η ≡ Θ exp(−αΘ), with α = const. [10]. While η-type neurons were found again in pigeons [34] and bullfrogs [20], most data were gathered from the LGMD2 in locusts (e.g. [10, 9, 7, 23]). The η-function is a phenomenological model for the LGMD, and implies three principal hypothesis: (i) An implementation of an exponential function exp(·). Exponentation is thought to take place in the LGMD axon, via active membrane conductances [8]. Experimental data, though, seem to favor a third-power law rather than exp(·). (ii) The LGMD carries out biophysical computations for implementing the multiplicative operation. It has been suggested that multiplication is done within the LGMD itself, by subtracting the loga˙ rithmically encoded variables log Θ − αΘ [10, 8]. (iii) The peak of the η-function occurs before ˆ ttc, at visual angle Θ(t) = 2 arctan(1/α) [9]. It follows ttc for certain stimulus configurations (e.g. ˆ l/|v| 5ms). In principle, t > tc can be accounted for by η(t + δ) with a fixed delay δ < 0 (e.g. −27ms). But other researchers observed that LGMD activity continuous to rise after ttc even for l/|v| 5ms [28]. These discrepancies remain unexplained so far [29], but stimulation dynamics perhaps plays a role. We we will address these three issues by comparing the novel function “ψ” with the η-function. LGMD computations with the ψ-function: No multiplication, no exponentiation 2 A circular object which starts its approach at distance x0 and with speed v projects a visual angle Θ(t) = 2 arctan[l/(x0 − vt)] on the retina [34, 9]. The kinematics is hence entirely specified by the ˙ half-size-to-velocity ratio l/|v|, and x0 . Furthermore, Θ(t) = 2lv/((x0 − vt)2 + l2 ). In order to define ψ, we consider at first the LGMD neuron as an RC-circuit with membrane potential3 V [17] dV Cm = β (Vrest − V ) + gexc (Vexc − V ) + ginh (Vinh − V ) (1) dt 4 Cm = membrane capacity ; β ≡ 1/Rm denotes leakage conductance across the cell membrane (Rm : membrane resistance); gexc and ginh are excitatory and inhibitory inputs. Each conductance gi (i = exc, inh ) can drive the membrane potential to its associated reversal potential Vi (usually Vinh ≤ Vexc ). Shunting inhibition means Vinh = Vrest . Shunting inhibition lurks “silently” because it gets effective only if the neuron is driven away from its resting potential. With synaptic input, the neuron decays into its equilibrium state Vrest β + Vexc gexc + Vinh ginh V∞ ≡ (2) β + gexc + ginh according to V (t) = V∞ (1 − exp(−t/τm )). Without external input, V (t 1) → Vrest . The time scale is set by τm . Without synaptic input τm ≡ Cm /β. Slowly varying inputs gexc , ginh > 0 modify the time scale to approximately τm /(1 + (gexc + ginh )/β). For highly dynamic inputs, such as in late phase of the object approach, the time scale gets dynamical as well. The ψ-model assigns synaptic inputs5 ˙ ˙ ˙ ˙ gexc (t) = ϑ(t), ϑ(t) = ζ1 ϑ(t − ∆tstim ) + (1 − ζ1 )Θ(t) (3a) e ginh (t) = [γϑ(t)] , ϑ(t) = ζ0 ϑ(t − ∆tstim ) + (1 − ζ0 )Θ(t) 1 (3b) This linear approximation gets worse with increasing Θ, but turns out to work well until short before ttc (τ adopts a minimum at tc − 0.428978 · l/|v|). 2 LGMD activity is usually monitored via its postsynaptic neuron, the Descending Contralateral Movement Detector (DCMD) neuron. This represents no problem as LGMD spikes follow DCMD spikes 1:1 under visual stimulation [22] from 300Hz [21] to at least 400Hz [24]. 3 Here we assume that the membrane potential serves as a predictor for the LGMD’s mean firing rate. 4 Set to unity for all simulations. 5 LGMD receives also inhibition from a laterally acting network [21]. The η-function considers only direct feedforward inhibition [22, 6], and so do we. 2 Θ ∈ [7.63°, 180.00°[ temporal resolution ∆ tstim=1.0ms l/|v|=20.00ms, β=1.00, γ=7.50, e=3.00, ζ0=0.90, ζ1=0.99, nrelax=25 0.04 scaled dΘ/dt continuous discretized 0.035 0.03 Θ(t) (input) ϑ(t) (filtered) voltage V(t) (output) t = 56ms max t =300ms c 0.025 0 10 2 η(t): α=3.29, R =1.00 n =10 → t =37ms log Θ(t) amplitude relax max 0.02 0.015 0.01 0.005 0 −0.005 0 50 100 150 200 250 300 −0.01 0 350 time [ms] 50 100 150 200 250 300 350 time [ms] (b) ψ versus η (a) discretized optical variables Figure 1: (a) The continuous visual angle of an approaching object is shown along with its discretized version. Discretization transforms angular velocity from a continuous variable into a series of “spikes” (rescaled). (b) The ψ function with the inputs shown in a, with nrelax = 25 relaxation time steps. Its peak occurs tmax = 56ms before ttc (tc = 300ms). An η function (α = 3.29) that was fitted to ψ shows good agreement. For continuous optical variables, the peak would occur 4ms earlier, and η would have α = 4.44 with R2 = 1. For nrelax = 10, ψ is farther away from its equilibrium at V∞ , and its peak moves 19ms closer to ttc. t =500ms, dia=12.0cm, ∆t c =1.00ms, dt=10.00µs, discrete=1 stim 250 n relax = 50 2 200 α=4.66, R =0.99 [normal] n = 25 relax 2 α=3.91, R =1.00 [normal] n =0 relax tmax [ms] 150 2 α=1.15, R =0.99 [normal] 100 50 0 β=1.00, γ=7.50, e=3.00, V =−0.001, ζ =0.90, ζ =0.99 inh −50 5 10 15 20 25 30 0 35 1 40 45 50 l/|v| [ms] (a) different nrelax (b) different ∆tstim ˆ ˆ Figure 2: The figures plot the relative time tmax ≡ tc − t of the response peak of ψ, V (t), as a function of half-size-to-velocity ratio (points). Line fits with slope α and intercept δ were added (lines). The predicted linear relationship in all cases is consistent with experimental evidence [9]. (a) The stimulus time scale is held constant at ∆tstim = 1ms, and several LGMD time scales are defined by nrelax (= number of intercalated relaxation steps for each integration time step). Bigger values of nrelax move V (t) closer to its equilibrium V∞ (t), implying higher slopes α in turn. (b) LGMD time scale is fixed at nrelax = 25, and ∆tstim is manipulated. Because of the discretization of optical variables (OVs) in our simulation, increasing ∆tstim translates to an overall smaller number of jumps in OVs, but each with higher amplitude. Thus, we say ψ(t) ≡ V (t) if and only if gexc and ginh are defined with the last equation. The time ˙ scale of stimulation is defined by ∆tstim (by default 1ms). The variables ϑ and ϑ are lowpass filtered angular size and rate of expansion, respectively. The amount of filtering is defined by memory constants ζ0 and ζ1 (no filtering if zero). The idea is to continue with generating synaptic input ˙ after ttc, where Θ(t > tc ) = const and thus Θ(t > tc ) = 0. Inhibition is first weighted by γ, and then potentiated by the exponent e. Hodgkin-Huxley potentiates gating variables n, m ∈ [0, 1] instead (potassium ∝ n4 , sodium ∝ m3 , [12]) and multiplies them with conductances. Gabbiani and co-workers found that the function which transforms membrane potential to firing rate is better described by a power function with e = 3 than by exp(·) (Figure 4d in [8]). 3 Dynamics of the ψ-function 3 Discretization. In a typical experiment, a monitor is placed a short distance away from the insect’s eye, and an approaching object is displayed. Computer screens have a fixed spatial resolution, and as a consequence size increments of the displayed object proceed in discrete jumps. The locust retina is furthermore composed of a discrete array of ommatidia units. We therefore can expect a corresponding step-wise increment of Θ with time, although optical and neuronal filtering may ˙ smooth Θ to some extent again, resulting in ϑ (figure 1). Discretization renders Θ discontinuous, ˙ For simulating the dynamics of ψ, we discretized angular size what again will be alleviated in ϑ. ˙ with floor(Θ), and Θ(t) ≈ [Θ(t + ∆tstim ) − Θ(t)]/∆tstim . Discretized optical variables (OVs) were re-normalized to match the range of original (i.e. continuous) OVs. To peak, or not to peak? Rind & Simmons reject the hypothesis that the activity peak signals impending collision on grounds of two arguments [28]: (i) If Θ(t + ∆tstim ) − Θ(t) 3o in consecutively displayed stimulus frames, the illusion of an object approach would be lost. Such stimulation would rather be perceived as a sequence of rapidly appearing (but static) objects, causing reduced responses. (ii) After the last stimulation frame has been displayed (that is Θ = const), LGMD responses keep on building up beyond ttc. This behavior clearly depends on l/|v|, also according to their own data (e.g. Figure 4 in [26]): Response build up after ttc is typically observed for suffi˙ ciently small values of l/|v|. Input into ψ in situations where Θ = const and Θ = 0, respectively, ˙ is accommodated by ϑ and ϑ, respectively. We simulated (i) by setting ∆tstim = 5ms, thus producing larger and more infrequent jumps in discrete OVs than with ∆tstim = 1ms (default). As a consequence, ϑ(t) grows more slowly (deˆ layed build up of inhibition), and the peak occurs later (tmax ≡ tc − t = 10ms with everything else ˆ ˆ identical with figure 1b). The peak amplitude V = V (t) decreases nearly sixfold with respect to default. Our model thus predicts the reduced responses observed by Rind & Simmons [28]. Linearity. Time of peak firing rate is linearly related to l/|v| [10, 9]. The η-function is consistent ˆ with this experimental evidence: t = tc − αl/|v| + δ (e.g. α = 4.7, δ = −27ms). The ψ-function reproduces this relationship as well (figure 2), where α depends critically on the time scale of biophysical processes in the LGMD. We studied the impact of this time scale by choosing 10µs for the numerical integration of equation 1 (algorithm: 4th order Runge-Kutta). Apart from improving the numerical stability of the integration algorithm, ψ is far from its equilibrium V∞ (t) in every moment ˙ t, given the stimulation time scale ∆tstim = 1ms 6 . Now, at each value of Θ(t) and Θ(t), respectively, we intercalated nrelax iterations for integrating ψ. Each iteration takes V (t) asymptotically closer to V∞ (t), and limnrelax 1 V (t) = V∞ (t). If the internal processes in the LGMD cannot keep up with stimulation (nrelax = 0), we obtain slopes values that underestimate experimentally found values (figure 2a). In contrast, for nrelax 25 we get an excellent agreement with the experimentally determined α. This means that – under the reported experimental stimulation conditions (e.g. [9]) – the LGMD would operate relatively close to its steady state7 . Now we fix nrelax at 25 and manipulate ∆tstim instead (figure 2b). The default value ∆tstim = 1ms corresponds to α = 3.91. Slightly bigger values of ∆tstim (2.5ms and 5ms) underestimate the experimental α. In addition, the line fits also return smaller intercept values then. We see tmax < 0 up to l/|v| ≈ 13.5ms – LGMD activity peaks after ttc! Or, in other words, LGMD activity continues to increase after ttc. In the limit, where stimulus dynamics is extremely fast, and LGMD processes are kept far from equilibrium at each instant of the approach, α gets very small. As a consequence, tmax gets largely independent of l/|v|: The activity peak would cling to tmax although we varied l/|v|. 4 Freeze! Experimental data versus steady state of “psi” In the previous section, experimentally plausible values for α were obtained if ψ is close to equilibrium at each instant of time during stimulation. In this section we will thus introduce a steady-state 6 Assuming one ∆tstim for each integration time step. This means that by default stimulation and biophysical dynamics will proceed at identical time scales. 7 Notice that in this moment we can only make relative statements - we do not have data at hand for defining absolute time scales 4 tc=500ms, v=2.00m/s ψ∞ → (β varies), γ=3.50, e=3.00, Vinh=−0.001 tc=500ms, v=2.00m/s ψ∞ → β=2.50, γ=3.50, (e varies), Vinh=−0.001 300 tc=500ms, v=2.00m/s ψ∞ → β=2.50, (γ varies), e=3.00, Vinh=−0.001 350 300 β=10.00 β=5.00 norm. rmse = 0.058...0.153 correlation (β,α)=−0.90 (n=4) ∞ β=1.00 e=4.00 norm. |η−ψ | = 0.009...0.114 e=3.00 300 norm. rmse = 0.014...0.160 correlation (e,α)=0.98 (n=4) ∞ e=2.50 250 250 norm. |η−ψ | = 0.043...0.241 ∞ norm. rmse = 0.085...0.315 correlation (γ,α)=1.00 (n=5) 150 tmax [ms] 200 tmax [ms] 200 tmax [ms] γ=5.00 γ=2.50 γ=1.00 γ=0.50 γ=0.25 e=5.00 norm. |η−ψ | = 0.020...0.128 β=2.50 250 200 150 100 150 100 100 50 50 50 0 5 10 15 20 25 30 35 40 45 0 5 50 10 15 20 l/|v| [ms] 25 30 35 40 45 0 5 50 10 15 20 l/|v| [ms] (a) β varies 25 30 35 40 45 50 l/|v| [ms] (b) e varies (c) γ varies ˆ ˆ Figure 3: Each curve shows how the peak ψ∞ ≡ ψ∞ (t) depends on the half-size-to-velocity ratio. In each display, one parameter of ψ∞ is varied (legend), while the others are held constant (figure title). Line slopes vary according to parameter values. Symbol sizes are scaled according to rmse (see also figure 4). Rmse was calculated between normalized ψ∞ (t) & normalized η(t) (i.e. both functions ∈ [0, 1] with original minimum and maximum indicated by the textbox). To this end, the ˆ peak of the η-function was placed at tc , by choosing, at each parameter value, α = |v| · (tc − t)/l (for determining correlation, the mean value of α was taken across l/|v|). tc=500ms, v=2.00m/s ψ∞ → (β varies), γ=3.50, e=3.00, Vinh=−0.001 tc=500ms, v=2.00m/s ψ∞ → β=2.50, γ=3.50, (e varies), Vinh=−0.001 tc=500ms, v=2.00m/s ψ∞ → β=2.50, (γ varies), e=3.00, Vinh=−0.001 0.25 β=5.00 0.12 β=2.50 β=1.00 0.1 0.08 (normalized η, ψ∞) 0.12 β=10.00 (normalized η, ψ∞) (normalized η, ψ∞) 0.14 0.1 0.08 γ=5.00 γ=2.50 0.2 γ=1.00 γ=0.50 γ=0.25 0.15 0.06 0.04 0.02 0 5 10 15 20 25 30 35 40 45 50 meant |η(t)−ψ∞(t)| meant |η(t)−ψ∞(t)| meant |η(t)−ψ∞(t)| 0.06 0.04 e=5.00 e=4.00 e=3.00 0.02 e=2.50 10 l/|v| [ms] 15 20 25 30 35 40 45 50 l/|v| [ms] (a) β varies (b) e varies 0.1 0.05 0 5 10 15 20 25 30 35 40 45 50 l/|v| [ms] (c) γ varies Figure 4: This figure complements figure 3. It visualizes the time averaged absolute difference between normalized ψ∞ (t) & normalized η(t). For η, its value of α was chosen such that the maxima of both functions coincide. Although not being a fit, it gives a rough estimate on how the shape of both curves deviate from each other. The maximum possible difference would be one. version of ψ (i.e. equation 2 with Vrest = 0, Vexc = 1, and equations 3 plugged in), ψ∞ (t) ≡ e ˙ Θ(t) + Vinh [γΘ(t)] e ˙ β + Θ(t) + [γΘ(t)] (4) (Here we use continuous versions of angular size and rate of expansion). The ψ∞ -function makes life easier when it comes to fitting experimental data. However, it has its limitations, because we brushed the whole dynamic of ψ under the carpet. Figure 3 illustrates how the linˆ ear relationship (=“linearity”) between tmax ≡ tc − t and l/|v| is influenced by changes in parameter values. Changing any of the values of e, β, γ predominantly causes variation in line slopes. The smallest slope changes are obtained by varying Vinh (data not shown; we checked Vinh = 0, −0.001, −0.01, −0.1). For Vinh −0.01, linearity is getting slightly compromised, as slope increases with l/|v| (e.g. Vinh = −1 α ∈ [4.2, 4.7]). In order to get a notion about how well the shape of ψ∞ (t) matches η(t), we computed timeaveraged difference measures between normalized versions of both functions (details: figure 3 & 4). Bigger values of β match η better at smaller, but worse at bigger values of l/|v| (figure 4a). Smaller β cause less variation across l/|v|. As to variation of e, overall curve shapes seem to be best aligned with e = 3 to e = 4 (figure 4b). Furthermore, better matches between ψ∞ (t) and η(t) correspond to bigger values of γ (figure 4c). And finally, Vinh marches again to a different tune (data not shown). Vinh = −0.1 leads to the best agreement (≈ 0.04 across l/|v|) of all Vinh , quite different from the other considered values. For the rest, ψ∞ (t) and η(t) align the same (all have maximum 0.094), 5 ˙ (a) Θ = 126o /s ˙ (b) Θ = 63o /s Figure 5: The original data (legend label “HaGaLa95”) were resampled from ref. [10] and show ˙ DCMD responses to an object approach with Θ = const. Thus, Θ increases linearly with time. The η-function (fitting function: Aη(t+δ)+o) and ψ∞ (fitting function: Aψ∞ (t)+o) were fitted to these data: (a) (Figure 3 Di in [10]) Good fits for ψ∞ are obtained with e = 5 or higher (e = 3 R2 = 0.35 and rmse = 0.644; e = 4 R2 = 0.45 and rmse = 0.592). “Psi” adopts a sigmoid-like curve form which (subjectively) appears to fit the original data better than η. (b) (Figure 3 Dii in [10]) “Psi” yields an excellent fit for e = 3. RoHaTo10 gregarious locust LV=0.03s Θ(t), lv=30ms e011pos014 sgolay with 100 t =107ms max ttc=5.00s ψ adj.R2 0.95 (LM:3) ∞ η(t) adj.R2 1 (TR::1) 2 ψ : R =0.95, rmse=0.004, 3 coefficients ∞ → β=2.22, γ=0.70, e=3.00, V =−0.001, A=0.07, o=0.02, δ=0.00ms inh η: R2=1.00, rmse=0.001 → α=3.30, A=0.08, o=0.0, δ=−10.5ms 3.4 3.6 3.8 4 4.2 4.4 4.6 4.8 5 5.2 time [s] (b) α versus β (a) spike trace Figure 6: (a) DCMD activity in response to a black square (l/|v| = 30ms, legend label “e011pos14”, ref. [30]) approaching to the eye center of a gregarious locust (final visual angle 50o ). Data show the first stimulation so habituation is minimal. The spike trace (sampled at 104 Hz) was full wave rectified, lowpass filtered, and sub-sampled to 1ms resolution. Firing rate was estimated with Savitzky-Golay filtering (“sgolay”). The fits of the η-function (Aη(t + δ) + o; 4 coefficients) and ψ∞ -function (Aψ∞ (t) with fixed e, o, δ, Vinh ; 3 coefficients) provide both excellent fits to firing rate. (b) Fitting coefficient α (→ η-function) inversely correlates with β (→ ψ∞ ) when fitting firing rates of another 5 trials as just described (continuous line = line fit to the data points). Similar correlation values would be obtained if e is fixed at values e = 2.5, 4, 5 c = −0.95, −0.96, −0.91. If o was determined by the fitting algorithm, then c = −0.70. No clear correlations with α were obtained for γ. despite of covering different orders of magnitude with Vinh = 0, −0.001, −0.01. Decelerating approach. Hatsopoulos et al. [10] recorded DCMD activity in response to an ap˙ proaching object which projected image edges on the retina moving at constant velocity: Θ = const. ˙ This “linear approach” is perceived as if the object is getting increasingly implies Θ(t) = Θ0 + Θt. slower. But what appears a relatively unnatural movement pattern serves as a test for the functions η & ψ∞ . Figure 5 illustrates that ψ∞ passes the test, and consistently predicts that activity sharply rises in the initial approach phase, and subsequently declines (η passed this test already in the year 1995). 6 Spike traces. We re-sampled about 30 curves obtained from LGMD recordings from a variety of publications, and fitted η & ψ∞ -functions. We cannot show the results here, but in terms of goodness of fit measures, both functions are in the same ballbark. Rather, figure 6a shows a representative example [30]. When α and β are plotted against each other for five trials, we see a strong inverse correlation (figure 6b). Although five data points are by no means a firm statistical sample, the strong correlation could indicate that β and α play similar roles in both functions. Biophysically, β is the leakage conductance, which determines the (passive) membrane time constant τm ∝ 1/β of the neuron. Voltage drops within τm to exp(−1) times its initial value. Bigger values of β mean shorter τm (i.e., “faster neurons”). Getting back to η, this would suggest α ∝ τm , such that higher (absolute) values for α would possibly indicate a slower dynamic of the underlying processes. 5 Discussion (“The Good, the Bad, and the Ugly”) Up to now, mainly two classes of LGMD models existed: The phenomenological η-function on the one hand, and computational models with neuronal layers presynaptic to the LGMD on the other (e.g. [25, 15]; real-world video sequences & robotics: e.g. [3, 14, 32, 2]). Computational models predict that LGMD response features originate from excitatory and inhibitory interactions in – and between – presynaptic neuronal layers. Put differently, non-linear operations are generated in the presynaptic network, and can be a function of many (model) parameters (e.g. synaptic weights, time constants, etc.). In contrast, the η-function assigns concrete nonlinear operations to the LGMD [7]. The η-function is accessible to mathematical analysis, whereas computational models have to be probed with videos or artificial stimulus sequences. The η-function is vague about biophysical parameters, whereas (good) computational models need to be precise at each (model) parameter value. The η-function establishes a clear link between physical stimulus attributes and LGMD activity: It postulates what is to be computed from the optical variables (OVs). But in computational models, such a clear understanding of LGMD inputs cannot always be expected: Presynaptic processing may strongly transform OVs. The ψ function thus represents an intermediate model class: It takes OVs as input, and connects them with biophysical parameters of the LGMD. For the neurophysiologist, the situation could hardly be any better. Psi implements the multiplicative operation of the η-function by shunting inhibition (equation 1: Vexc ≈ Vrest and Vinh ≈ Vrest ). The η-function fits ψ very well according to our dynamical simulations (figure 1), and satisfactory by the approximate criterion of figure 4. We can conclude that ψ implements the η-function in a biophysically plausible way. However, ψ does neither explicitly specify η’s multiplicative operation, nor its exponential function exp(·). Instead we have an interaction between shunting inhibition and a power law (·)e , with e ≈ 3. So what about power laws in neurons? Because of e > 1, we have an expansive nonlinearity. Expansive power-law nonlinearities are well established in phenomenological models of simple cells of the primate visual cortex [1, 11]. Such models approximate a simple cell’s instantaneous firing rate r from linear filtering of a stimulus (say Y ) by r ∝ ([Y ]+ )e , where [·]+ sets all negative values to zero and lets all positive pass. Although experimental evidence favors linear thresholding operations like r ∝ [Y − Ythres ]+ , neuronal responses can behave according to power law functions if Y includes stimulus-independent noise [19]. Given this evidence, the power-law function of the inhibitory input into ψ could possibly be interpreted as a phenomenological description of presynaptic processes. The power law would also be the critical feature by means of which the neurophysiologist could distinguish between the η function and ψ. A study of Gabbiani et al. aimed to provide direct evidence for a neuronal implementation of the η-function [8]. Consequently, the study would be an evidence ˙ for a biophysical implementation of “direct” multiplication via log Θ − αΘ. Their experimental evidence fell somewhat short in the last part, where “exponentation through active membrane conductances” should invert logarithmic encoding. Specifically, the authors observed that “In 7 out of 10 neurons, a third-order power law best described the data” (sixth-order in one animal). Alea iacta est. Acknowledgments MSK likes to thank Stephen M. Rogers for kindly providing the recording data for compiling figure 6. MSK furthermore acknowledges support from the Spanish Government, by the Ramon and Cajal program and the research grant DPI2010-21513. 7 References [1] D.G. Albrecht and D.B. Hamilton, Striate cortex of monkey and cat: contrast response function, Journal of Neurophysiology 48 (1982), 217–237. [2] S. Bermudez i Badia, U. Bernardet, and P.F.M.J. Verschure, Non-linear neuronal responses as an emergent property of afferent networks: A case study of the locust lobula giant movemement detector, PLoS Computational Biology 6 (2010), no. 3, e1000701. [3] M. Blanchard, F.C. Rind, and F.M.J. Verschure, Collision avoidance using a model of locust LGMD neuron, Robotics and Autonomous Systems 30 (2000), 17–38. [4] D.F. Cooke and M.S.A. Graziano, Super-flinchers and nerves of steel: Defensive movements altered by chemical manipulation of a cortical motor area, Neuron 43 (2004), no. 4, 585–593. [5] L. Fogassi, V. Gallese, L. Fadiga, G. Luppino, M. Matelli, and G. Rizzolatti, Coding of peripersonal space in inferior premotor cortex (area f4), Journal of Neurophysiology 76 (1996), 141–157. [6] F. Gabbiani, I. Cohen, and G. Laurent, Time-dependent activation of feed-forward inhibition in a looming sensitive neuron, Journal of Neurophysiology 94 (2005), 2150–2161. [7] F. Gabbiani, H.G. Krapp, N. Hatsopolous, C.H. Mo, C. Koch, and G. Laurent, Multiplication and stimulus invariance in a looming-sensitive neuron, Journal of Physiology - Paris 98 (2004), 19–34. [8] F. Gabbiani, H.G. Krapp, C. Koch, and G. Laurent, Multiplicative computation in a visual neuron sensitive to looming, Nature 420 (2002), 320–324. [9] F. Gabbiani, H.G. Krapp, and G. Laurent, Computation of object approach by a wide-field, motionsensitive neuron, Journal of Neuroscience 19 (1999), no. 3, 1122–1141. [10] N. Hatsopoulos, F. Gabbiani, and G. Laurent, Elementary computation of object approach by a wide-field visual neuron, Science 270 (1995), 1000–1003. [11] D.J. Heeger, Modeling simple-cell direction selectivity with normalized, half-squared, linear operators, Journal of Neurophysiology 70 (1993), 1885–1898. [12] A.L. Hodkin and A.F. Huxley, A quantitative description of membrane current and its application to conduction and excitation in nerve, Journal of Physiology 117 (1952), 500–544. [13] F. Hoyle, The black cloud, Pinguin Books, London, 1957. [14] M.S. Keil, E. Roca-Morena, and A. Rodr´guez-V´ zquez, A neural model of the locust visual system for ı a detection of object approaches with real-world scenes, Proceedings of the Fourth IASTED International Conference (Marbella, Spain), vol. 5119, 6-8 September 2004, pp. 340–345. [15] M.S. Keil and A. Rodr´guez-V´ zquez, Towards a computational approach for collision avoidance with ı a real-world scenes, Proceedings of SPIE: Bioengineered and Bioinspired Systems (Maspalomas, Gran Canaria, Canary Islands, Spain) (A. Rodr´guez-V´ zquez, D. Abbot, and R. Carmona, eds.), vol. 5119, ı a SPIE - The International Society for Optical Engineering, 19-21 May 2003, pp. 285–296. [16] J.G. King, J.Y. Lettvin, and E.R. Gruberg, Selective, unilateral, reversible loss of behavioral responses to looming stimuli after injection of tetrodotoxin or cadmium chloride into the frog optic nerve, Brain Research 841 (1999), no. 1-2, 20–26. [17] C. Koch, Biophysics of computation: information processing in single neurons, Oxford University Press, New York, 1999. [18] D.N. Lee, A theory of visual control of braking based on information about time-to-collision, Perception 5 (1976), 437–459. [19] K.D. Miller and T.W. Troyer, Neural noise can explain expansive, power-law nonlinearities in neuronal response functions, Journal of Neurophysiology 87 (2002), 653–659. [20] Hideki Nakagawa and Kang Hongjian, Collision-sensitive neurons in the optic tectum of the bullfrog, rana catesbeiana, Journal of Neurophysiology 104 (2010), no. 5, 2487–2499. [21] M. O’Shea and C.H.F. Rowell, Projection from habituation by lateral inhibition, Nature 254 (1975), 53– 55. [22] M. O’Shea and J.L.D. Williams, The anatomy and output connection of a locust visual interneurone: the lobula giant movement detector (lgmd) neurone, Journal of Comparative Physiology 91 (1974), 257–266. [23] S. Peron and F. Gabbiani, Spike frequency adaptation mediates looming stimulus selectivity, Nature Neuroscience 12 (2009), no. 3, 318–326. [24] F.C. Rind, A chemical synapse between two motion detecting neurones in the locust brain, Journal of Experimental Biology 110 (1984), 143–167. [25] F.C. Rind and D.I. Bramwell, Neural network based on the input organization of an identified neuron signaling implending collision, Journal of Neurophysiology 75 (1996), no. 3, 967–985. 8 [26] F.C. Rind and P.J. Simmons, Orthopteran DCMD neuron: a reevaluation of responses to moving objects. I. Selective responses to approaching objects, Journal of Neurophysiology 68 (1992), no. 5, 1654–1666. [27] , Orthopteran DCMD neuron: a reevaluation of responses to moving objects. II. Critical cues for detecting approaching objects, Journal of Neurophysiology 68 (1992), no. 5, 1667–1682. [28] , Signaling of object approach by the dcmd neuron of the locust, Journal of Neurophysiology 77 (1997), 1029–1033. [29] , Reply, Trends in Neuroscience 22 (1999), no. 5, 438. [30] S.M. Roger, G.W.J. Harston, F. Kilburn-Toppin, T. Matheson, M. Burrows, F. Gabbiani, and H.G. Krapp, Spatiotemporal receptive field properties of a looming-sensitive neuron in solitarious and gregarious phases of desert locust, Journal of Neurophysiology 103 (2010), 779–792. [31] S.K. Rushton and J.P. Wann, Weighted combination of size and disparity: a computational model for timing ball catch, Nature Neuroscience 2 (1999), no. 2, 186–190. [32] Yue. S., Rind. F.C., M.S. Keil, J. Cuadri, and R. Stafford, A bio-inspired visual collision detection mechanism for cars: Optimisation of a model of a locust neuron to a novel environment, Neurocomputing 69 (2006), 1591–1598. [33] G.R. Schlotterer, Response of the locust descending movement detector neuron to rapidly approaching and withdrawing visual stimuli, Canadian Journal of Zoology 55 (1977), 1372–1376. [34] H. Sun and B.J. Frost, Computation of different optical variables of looming objects in pigeon nucleus rotundus neurons, Nature Neuroscience 1 (1998), no. 4, 296–303. [35] J.R. Tresilian, Visually timed action: time-out for ’tau’?, Trends in Cognitive Sciences 3 (1999), no. 8, 1999. [36] Y. Wang and B.J. Frost, Time to collision is signalled by neurons in the nucleus rotundus of pigeons, Nature 356 (1992), 236–238. [37] J.P. Wann, Anticipating arrival: is the tau-margin a specious theory?, Journal of Experimental Psychology and Human Perceptual Performance 22 (1979), 1031–1048. [38] M. Wicklein and N.J. Strausfeld, Organization and significance of neurons that detect change of visual depth in the hawk moth manduca sexta, The Journal of Comparative Neurology 424 (2000), no. 2, 356– 376. 9

6 0.80031615 160 nips-2011-Linear Submodular Bandits and their Application to Diversified Retrieval

7 0.79185671 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes

8 0.78764802 271 nips-2011-Statistical Tests for Optimization Efficiency

9 0.78752649 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries

10 0.78733027 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions

11 0.78706348 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

12 0.78592569 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration

13 0.78219831 263 nips-2011-Sparse Manifold Clustering and Embedding

14 0.7811079 56 nips-2011-Committing Bandits

15 0.78082335 24 nips-2011-Active learning of neural response functions with Gaussian processes

16 0.77997601 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning

17 0.77886033 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning

18 0.77865398 175 nips-2011-Multi-Bandit Best Arm Identification

19 0.77787727 283 nips-2011-The Fixed Points of Off-Policy TD

20 0.77692538 164 nips-2011-Manifold Precis: An Annealing Technique for Diverse Sampling of Manifolds