nips nips2011 nips2011-56 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Loc X. Bui, Ramesh Johari, Shie Mannor
Abstract: We consider a multi-armed bandit problem where there are two phases. The first phase is an experimentation phase where the decision maker is free to explore multiple options. In the second phase the decision maker has to commit to one of the arms and stick with it. Cost is incurred during both phases with a higher cost during the experimentation phase. We analyze the regret in this setup, and both propose algorithms and provide upper and lower bounds that depend on the ratio of the duration of the experimentation phase to the duration of the commitment phase. Our analysis reveals that if given the choice, it is optimal to experiment Θ(ln T ) steps and then commit, where T is the time horizon.
Reference: text
sentIndex sentText sentNum sentScore
1 The first phase is an experimentation phase where the decision maker is free to explore multiple options. [sent-2, score-0.931]
2 In the second phase the decision maker has to commit to one of the arms and stick with it. [sent-3, score-0.611]
3 Cost is incurred during both phases with a higher cost during the experimentation phase. [sent-4, score-0.664]
4 We analyze the regret in this setup, and both propose algorithms and provide upper and lower bounds that depend on the ratio of the duration of the experimentation phase to the duration of the commitment phase. [sent-5, score-1.152]
5 1 Introduction In a range of applications, a dynamic decision making problem exhibits two distinctly different kinds of phases: experimentation and commitment. [sent-7, score-0.638]
6 However, eventually the decision maker must commit to a choice, and use that decision for the duration of the problem horizon. [sent-9, score-0.496]
7 A notable feature of these phases in the models we study is that costs are incurred during both phases; that is, experimentation is not carried out “offline,” but rather is run “live” in the actual system. [sent-10, score-0.683]
8 However, such testing is not carried out without consequences; the retailer might lose potential rewards if experimentation leads to suboptimal recommendations. [sent-13, score-0.64]
9 Eventually, the recommendation engine must be stabilized (both from a software development standpoint and a customer expectation standpoint), and when this happens the retailer has effectively committed to one strategy moving forward. [sent-14, score-0.281]
10 The process of experimentation during design entails costs to the producer, but eventually the experimentation must stop and the design must be committed. [sent-18, score-1.257]
11 If the problem consists of only experimentation, then we have the classical multi-armed bandit problem, where the decision maker is interested in minimizing the expected total regret against the best arm [9, 2]. [sent-22, score-0.875]
12 il † 1 exploration or budgeted learning problem, where the goal is to output the best arm at the end of an experimentation phase [13, 6, 4]; no costs are incurred for experimentation, but after finite time a single decision must be chosen (see [12] for a review). [sent-29, score-1.206]
13 Formally, in a committing bandit problem, the decision maker can experiment without constraints for the first N of T periods, but must commit to a single decision for the last T − N periods, where T is the problem horizon. [sent-30, score-0.823]
14 We first consider the soft deadline setting where the experimentation deadline N can be chosen by the decision maker, but there is a cost incurred per experimentation period. [sent-31, score-1.771]
15 We divide this setting into two regimes depending on how N is chosen: the non-adaptive regime (Section 3) in which the decision maker has to choose N before the algorithm begins running, and the adaptive regime (Section 4) in which N can be chosen adaptively as the algorithm runs. [sent-32, score-0.692]
16 We obtain two main results for the soft deadline setting. [sent-33, score-0.272]
17 Finally, we demonstrate that if the algorithm has no initial distributional information, adaptivity is beneficial: we demonstrate an adaptive algorithm that achieves Θ(ln T /T ) regret in this case. [sent-36, score-0.31]
18 We then study the hard deadline regime where the value of N is given to the decision maker in advance (Section 5). [sent-37, score-0.589]
19 This is a sensible assumption for problems where the decision maker cannot control how long the experimentation period is; for example, in the product design example above, the release date is often fixed well in advance, and the engineers are not generally free to alter it. [sent-38, score-0.853]
20 We propose the UCB-poly(δ) algorithm for this setting, where the parameter δ ∈ (0, 1) reflects the tradeoff between experimentation and commitment. [sent-39, score-0.613]
21 During the first N periods the tradeoff between exploration and exploitation exists bearing in mind that the last T − N periods will be used solely for exploitation. [sent-42, score-0.247]
22 2 The committing bandit problem We first describe the setup of the classical stochastic multi-armed bandit problem, as it will serve as background for the committing bandit problem. [sent-44, score-0.944]
23 In a stochastic multi-armed bandit problem, there are K independent arms; each arm i, when pulled, returns a reward which is independently and identically drawn from a fixed Bernoulli distribution1 with unknown parameter θi ∈ [0, 1]. [sent-45, score-0.615]
24 Let It denote the index of the arm pulled at time t (It ∈ {1, 2, . [sent-46, score-0.389]
25 i:∆i >0 An allocation policy is an algorithm that chooses the next arm to pull based on the sequence of past pulled arms and obtained rewards. [sent-52, score-0.881]
26 The cumulative regret of an allocation policy A after time n is: n ∗ (Xt − Xt ) , Rn = t=1 ∗ Xt where is the reward that the algorithm would have received at time t if it had pulled the optimal arm i∗ . [sent-53, score-0.997]
27 In other words, Rn is the cumulative loss due to the fact that the allocation policy does not always pull the optimal arm. [sent-54, score-0.355]
28 Let Ti (n) be the number of times that arm i is pulled up to time n. [sent-55, score-0.389]
29 A recommendation policy is an algorithm that tries to recommend the “best” arm based on the sequence of past pulled arms and obtained rewards. [sent-62, score-0.808]
30 Suppose that after time n, a recommendation policy R recommends the arm Jn as the “best” arm. [sent-63, score-0.544]
31 Then the regret of recommendation policy R after time n, called the simple regret in [4], is defined as rn = θ∗ − θJn = ∆Jn . [sent-64, score-0.64]
32 The committing bandit problem considered in this paper is a version of the stochastic multi-armed bandit problem in which the algorithm is forced to commit to only one arm after some period of time. [sent-68, score-1.078]
33 From time 1 to some time N (N < T ), the algorithm can pull any arm in {1, 2, . [sent-71, score-0.401]
34 Then, from time N + 1 to the end of the horizon (time T ), it must commit to pull only one arm. [sent-75, score-0.275]
35 The first phase (time 1 to N ) is called the experimentation phase, and the second phase (time N + 1 to T ) is called the commitment phase. [sent-76, score-0.93]
36 We refer to time N as the experimentation deadline. [sent-77, score-0.571]
37 An algorithm for the committing bandit problem is a combination of an allocation and a recommendation policy. [sent-78, score-0.679]
38 That is, the algorithm has to decide which arm to pull during the first N slots, and then choose an arm to commit to during the remaining T − N slots. [sent-79, score-0.89]
39 Because we consider settings where the algorithm designer can choose the experimentation deadline, we also assume a cost is imposed during the experimentation phase; otherwise, it is never optimal to be forced to commit. [sent-80, score-1.238]
40 In particular, we assume that the reward earned during the experimentation phase is reduced by a constant factor γ ∈ [0, 1). [sent-81, score-0.767]
41 Thus the expected regret E[Reg] of such an algorithm is the average regret across both phases, i. [sent-82, score-0.357]
42 1 Committing bandit regimes We focus on three distinct regimes, that differ in the level of control given to the algorithm designer in choosing the experimentation deadline. [sent-86, score-0.866]
43 That is, the algorithm cannot control the experimentation deadline N . [sent-94, score-0.825]
44 2 Known lower-bounds As mentioned in the Introduction section, the experimentation and commitment phases have each been extensively studied in isolation. [sent-97, score-0.813]
45 In this subsection, we only summarize briefly the known lower bounds on cumulative regret and simple regret that will be used in the paper. [sent-98, score-0.47]
46 Result 1 (Distribution-dependent lower bound on cumulative regret [9]). [sent-99, score-0.311]
47 For any allocation policy, and for any set of reward distributions such that their parameters θi are not all equal, there exists 3 an ordering of (θ1 , . [sent-100, score-0.363]
48 , θK ) such that E[Rn ] ≥ pi p∗ p∗ pi ∗ i=i∗ ∆i + o(1) ln n, D(pi p∗ ) where D(pi p∗ ) = pi log + p∗ log is the Kullback-Leibler divergence between two Bernoulli reward distributions pi (of arm i) and p (of the optimal arm), and o(1) → 0 as n → ∞. [sent-103, score-0.853]
49 Result 2 (Distribution-free lower bound on cumulative regret [13]). [sent-104, score-0.311]
50 There exist positive constants c and N0 such that for any allocation policy, there exists a set of Bernoulli reward distributions such that E[Rn ] ≥ cK(ln n − ln K), ∀n ≥ N0 . [sent-105, score-0.523]
51 The difference between Result 1 and Result 2 is that the lower bound in the former depends on the parameters of reward distributions (hence, called distribution-dependent), while the lower bound in the latter does not (hence, called distribution-free). [sent-106, score-0.356]
52 For any pair of allocation and recommendation policies, if the allocation policy can achieve an upper bound such that for all (Bernoulli) reward distributions θ1 , . [sent-110, score-0.749]
53 , θK , there exists a constant C ≥ 0 with E[Rn ] ≤ Cf (n), then for all sets of K ≥ 3 Bernoulli reward distributions with parameters θi that are all distinct and all different from 1, there exists an ordering (θ1 , . [sent-113, score-0.263]
54 Result 4 (Distribution-free lower bound on simple regret [4]). [sent-124, score-0.253]
55 mendation policies, there exists a set of Bernoulli reward distributions such that E[rn ] ≥ 20 n In the subsequent sections we analyze each of the committing bandit regimes in detail; in particular, we provide constructive upper bounds and matching lower bounds on the regret in each regime. [sent-126, score-0.934]
56 3 Regime 1: Soft experimentation deadline, non-adaptive In this regime, for a given value of T , the value of N can be chosen freely between 1 and T − 1, but only before the algorithm begins pulling arms. [sent-128, score-0.651]
57 (1) Distribution-dependent lower bound: In Regime 1, for any algorithm, and any set of K ≥ 3 Bernoulli reward distributions such that θi are all distinct and all different from 1, there exists an ordering (θ1 , . [sent-131, score-0.272]
58 (2) Distribution-free lower bound: Also, for any algorithm in Regime 1, there exists a set of Bernoulli reward distributions such that ln K ln T E[Reg] ≥ cK 1 − , ln T T where c is the constant in Result 2. [sent-135, score-0.811]
59 Commit to the arm with maximum empirical average reward for the remaining periods. [sent-145, score-0.444]
60 E[Reg] ≤ 2 (1 − γ)θ∗ + ∆ K ln T T ∗ i=i This matches the lower bounds in Theorem 1 to the correct order in T . [sent-148, score-0.286]
61 Observe that in this regime, both distribution-dependent and distribution-free lower bounds have the same asymptotic order of ln T /T. [sent-149, score-0.284]
62 If ∆ is unknown, a low regret algorithm that matches the lower bound does not seem to be possible in this regime, because of the relative nature of the regret. [sent-151, score-0.297]
63 An algorithm may be unable to choose an N that explores sufficiently long when arms are difficult to distinguish, and yet commits quickly when arms are easy to distinguish. [sent-152, score-0.389]
64 4 Regime 2: Soft experimentation deadline, adaptive The setting in this regime is the same as the previous one, except that the algorithm is not required to choose N before it runs, i. [sent-153, score-0.775]
65 We first present the lower bounds on regret for any algorithm in this regime. [sent-157, score-0.265]
66 (1) Distribution-dependent lower bound: In Regime 2, for any algorithm, and any set of K ≥ 3 Bernoulli reward distribution such that θi are all distinct and all different from 1, there exists an ordering (θ1 , . [sent-159, score-0.242]
67 (2) Distribution-free lower bound: Also, for any algorithm in Regime 2, there exists a set of Bernoulli reward distributions such that ln K ln T E[Reg] ≥ cK 1 − , ln T T where c is the constant in Result 2. [sent-163, score-0.811]
68 For the SEC1 algorithm (Algorithm 2), K γ ln T E[Reg] ≤ 2 (1 − γ)θ∗ + , ∆i + b ∆ K T ∗ i=i where b = 2 + ∆2 (K+2) (1−e−∆2 /2 )2 1 ln T → 0 as T → ∞. [sent-167, score-0.395]
69 Let Sm be the total reward obtained from arm i so far. [sent-176, score-0.444]
70 for i ∈ Bm do i if m ≤ α ln T and |mθ∗ − Sm | > 1 ln T then Delete arm i from Bm . [sent-178, score-0.694]
71 end if i if m > α ln T and |mθ∗ − Sm | > 2 m then Delete arm i from Bm . [sent-179, score-0.524]
72 end if end for until there is only one arm in Bm , then commit to that arm or the horizon T is reached. [sent-180, score-0.854]
73 We note that when N can be chosen adaptively, both distribution-dependent and distribution-free lower bounds have the same asymptotic order of ln T /T as the ones in the non-adaptive regime. [sent-182, score-0.307]
74 The reason is rather intuitive: due to its adaptive nature, SEC1 is able to eliminate poor arms much earlier than the ln T /∆2 threshold, while Non-adaptive Unif-EBA has to wait until that point to make decisions. [sent-188, score-0.384]
75 , log2 (T /e)/2 do if |Bm | > 1 then ˜ ˜ Sample each arm in Bm until each arm has been chosen nm = 2 ln(T ∆2 )/∆2 times. [sent-204, score-0.682]
76 m m i Let Sm be the total reward obtained from arm i so far. [sent-205, score-0.444]
77 5 Regime 3: Hard experimentation deadline We now investigate the third regime where, in contrast to the previous two, the experimentation deadline N is fixed exogenously together with T . [sent-213, score-1.748]
78 Note that since in this case the experimentation deadline is outside the algorithm designer’s control, we set the cost of experimentation γ = 1 for this section. [sent-215, score-1.396]
79 We know from Result 3 that for any pair of allocation and recommendation policies, if E[RN ] ≤ C1 f (N ), then E[rN ] ≥ (∆/2)e−Df (N ) . [sent-218, score-0.285]
80 In other words, given an allocation policy A that has a cumulative regret bound C1 f (N ) (for some constant C1 ), the best (distribution-dependent) upper bound that any recommendation policy can achieve is C2 e−C3 f (N ) (for some constants C2 and C3 ). [sent-219, score-0.803]
81 Assuming that there exists a recommendation policy RA that achieves such an upper bound, we have the following upper bound on regret when applying [A, RA ] to the committing bandit problem: f (N ) T − N E[Reg] ≤ C1 + C2 e−C3 f (N ) . [sent-220, score-0.901]
82 (1) T T One can clearly see the trade-off between experimentation and commitment in (1): the smaller the first term, the larger the second term, and vice versa. [sent-221, score-0.754]
83 In particular, we focus on finding a pair of allocation and recommendation policies which can simultaneously achieve the allocation bound C1 N δ and the recommendation δ bound C2 e−C3 N where 0 < δ < 1. [sent-226, score-0.695]
84 Let us consider a modification of the UCB allocation policy called UCB-poly(δ) (for 0 < δ < 1), ˆ where for t > K, with θi,Ti (t−1) be the empirical average of rewards from arm i so far, It = arg max 1≤i≤K ˆ θi,Ti (t−1) + 2(t − 1)δ Ti (t − 1) . [sent-227, score-0.592]
85 In particular, if T (N ) is super-exponential in N we get an optimal δ of 1 representing pure exploration in the experimentation phase. [sent-233, score-0.657]
86 If T (N ) is sub-exponential we get an optimal δ of 0 representing a standard UCB during the experimentation phase. [sent-234, score-0.571]
87 The simulation setting includes K arms with Bernoulli reward distributions, the time horizon T , and the values of γ and ∆. [sent-240, score-0.329]
88 5, 1] interval, and the second best arm reward is set as θ2 = θ∗ − ∆. [sent-243, score-0.444]
89 These two values are then assigned to two randomly chosen arms, and the rest of arm rewards are generated ∗ independently and uniformly in [0, θ2 ]. [sent-244, score-0.362]
90 Particularly, the performance of Non-adaptive Unif-EBA is quite poor when the experimentation deadline is roughly equal to T , since the algorithm does not commit before the experimentation deadline. [sent-250, score-1.545]
91 7 Extensions and future directions Our work is a first step in the study of the committing bandit setup. [sent-253, score-0.373]
92 First, an extension of the basic committing bandits setup to the case of contextual bandits [10, 11] is natural. [sent-255, score-0.315]
93 In this setup before choosing an arm an additional “context” is provided to the decision maker. [sent-256, score-0.414]
94 The problem is to choose a decision rule from a given class that prescribes what arm to choose for every context. [sent-257, score-0.427]
95 This setup is more realistic when the decision maker has to commit to such a rule after some exploration time. [sent-258, score-0.457]
96 Second, models with many arms (structured as in [8, 5]) or even infinitely arms (as in [1, 7, 14]) are of interest here as they may lead to different regimes and results here. [sent-259, score-0.397]
97 Third, our models assumed that the commitment time is either predetermined or according to the decision maker’s will. [sent-260, score-0.25]
98 Finally, a situation where the exploration and commitment phases alternate (randomly or according to a given schedule or at a cost) is of practical interest. [sent-262, score-0.307]
99 UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. [sent-277, score-0.373]
100 The sample complexity of exploration in the multi-armed bandit problem. [sent-335, score-0.236]
wordName wordTfidf (topN-words)
[('experimentation', 0.571), ('arm', 0.32), ('deadline', 0.233), ('committing', 0.202), ('ln', 0.187), ('commitment', 0.183), ('arms', 0.174), ('bandit', 0.171), ('bm', 0.17), ('regret', 0.168), ('reg', 0.167), ('maker', 0.149), ('commit', 0.149), ('allocation', 0.149), ('regime', 0.14), ('recommendation', 0.136), ('reward', 0.124), ('policy', 0.088), ('ucb', 0.077), ('bernoulli', 0.076), ('phase', 0.072), ('pulled', 0.069), ('decision', 0.067), ('exploration', 0.065), ('periods', 0.064), ('rn', 0.064), ('sm', 0.064), ('pull', 0.06), ('phases', 0.059), ('adaptivity', 0.059), ('cumulative', 0.058), ('retailer', 0.05), ('regimes', 0.049), ('pi', 0.048), ('jn', 0.046), ('bound', 0.043), ('lower', 0.042), ('email', 0.04), ('policies', 0.039), ('soft', 0.039), ('designer', 0.038), ('bandits', 0.035), ('incurred', 0.034), ('delete', 0.034), ('bounds', 0.034), ('eba', 0.033), ('mannor', 0.033), ('exists', 0.033), ('horizon', 0.031), ('distributions', 0.03), ('elimination', 0.03), ('upper', 0.03), ('releases', 0.029), ('committed', 0.029), ('remark', 0.028), ('maxj', 0.028), ('ordering', 0.027), ('period', 0.027), ('setup', 0.027), ('duration', 0.026), ('ti', 0.026), ('standpoint', 0.025), ('sequential', 0.025), ('shie', 0.024), ('unif', 0.024), ('adaptive', 0.023), ('chosen', 0.023), ('adaptively', 0.023), ('engine', 0.023), ('matches', 0.023), ('xt', 0.023), ('theorem', 0.022), ('ck', 0.022), ('df', 0.021), ('pure', 0.021), ('tradeoff', 0.021), ('algorithm', 0.021), ('asymptotic', 0.021), ('eventually', 0.02), ('freely', 0.02), ('ra', 0.02), ('supplementary', 0.02), ('design', 0.02), ('choose', 0.02), ('rewards', 0.019), ('costs', 0.019), ('release', 0.019), ('nm', 0.019), ('distributional', 0.018), ('must', 0.018), ('end', 0.017), ('forced', 0.017), ('simulations', 0.017), ('matching', 0.017), ('tune', 0.016), ('distinct', 0.016), ('contextual', 0.016), ('auer', 0.016), ('begins', 0.016), ('called', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 56 nips-2011-Committing Bandits
Author: Loc X. Bui, Ramesh Johari, Shie Mannor
Abstract: We consider a multi-armed bandit problem where there are two phases. The first phase is an experimentation phase where the decision maker is free to explore multiple options. In the second phase the decision maker has to commit to one of the arms and stick with it. Cost is incurred during both phases with a higher cost during the experimentation phase. We analyze the regret in this setup, and both propose algorithms and provide upper and lower bounds that depend on the ratio of the duration of the experimentation phase to the duration of the commitment phase. Our analysis reveals that if given the choice, it is optimal to experiment Θ(ln T ) steps and then commit, where T is the time horizon.
2 0.27304944 97 nips-2011-Finite Time Analysis of Stratified Sampling for Monte Carlo
Author: Alexandra Carpentier, Rémi Munos
Abstract: We consider the problem of stratified sampling for Monte-Carlo integration. We model this problem in a multi-armed bandit setting, where the arms represent the strata, and the goal is to estimate a weighted average of the mean values of the arms. We propose a strategy that samples the arms according to an upper bound on their standard deviations and compare its estimation quality to an ideal allocation that would know the standard deviations of the strata. We provide two regret analyses: a distribution� dependent bound O(n−3/2 ) that depends on a measure of the disparity of � the strata, and a distribution-free bound O(n−4/3 ) that does not. 1
3 0.23025452 175 nips-2011-Multi-Bandit Best Arm Identification
Author: Victor Gabillon, Mohammad Ghavamzadeh, Alessandro Lazaric, Sébastien Bubeck
Abstract: We study the problem of identifying the best arm in each of the bandits in a multibandit multi-armed setting. We first propose an algorithm called Gap-based Exploration (GapE) that focuses on the arms whose mean is close to the mean of the best arm in the same bandit (i.e., small gap). We then introduce an algorithm, called GapE-V, which takes into account the variance of the arms in addition to their gap. We prove an upper-bound on the probability of error for both algorithms. Since GapE and GapE-V need to tune an exploration parameter that depends on the complexity of the problem, which is often unknown in advance, we also introduce variations of these algorithms that estimate this complexity online. Finally, we evaluate the performance of these algorithms and compare them to other allocation strategies on a number of synthetic problems.
4 0.21821743 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits
Author: Yasin Abbasi-yadkori, Csaba Szepesvári, David Tax
Abstract: We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem. In particular, we show that a simple modification of Auer’s UCB algorithm (Auer, 2002) achieves with high probability constant regret. More importantly, we modify and, consequently, improve the analysis of the algorithm for the for linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), Li et al. (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. In both cases, the improvement stems from the construction of smaller confidence sets. For their construction we use a novel tail inequality for vector-valued martingales. 1
5 0.18987462 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
Author: Odalric-ambrym Maillard, Daniil Ryabko, Rémi Munos
Abstract: The problem of selecting the right state-representation in a reinforcement learning problem is considered. Several models (functions mapping past observations to a finite set) of the observations are given, and it is known that for at least one of these models the resulting state dynamics are indeed Markovian. Without knowing neither which of the models is the correct one, nor what are the probabilistic characteristics of the resulting MDP, it is required to obtain as much reward as the optimal policy for the correct model (or for the best of the correct models, if there are several). We propose an algorithm that achieves that, with a regret of order T 2/3 where T is the horizon time. 1
6 0.18431653 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
7 0.17577206 32 nips-2011-An Empirical Evaluation of Thompson Sampling
8 0.13917005 177 nips-2011-Multi-armed bandits on implicit metric spaces
9 0.12929882 220 nips-2011-Prediction strategies without loss
10 0.12160888 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations
11 0.10857867 61 nips-2011-Contextual Gaussian Process Bandit Optimization
12 0.10610399 25 nips-2011-Adaptive Hedge
13 0.10341699 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning
14 0.08675988 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
15 0.084909633 109 nips-2011-Greedy Model Averaging
16 0.080408819 272 nips-2011-Stochastic convex optimization with bandit feedback
17 0.080177017 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning
18 0.079085805 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
19 0.076747298 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
20 0.067444928 145 nips-2011-Learning Eigenvectors for Free
topicId topicWeight
[(0, 0.15), (1, -0.307), (2, 0.045), (3, 0.105), (4, 0.108), (5, -0.024), (6, 0.033), (7, 0.144), (8, 0.074), (9, 0.068), (10, -0.188), (11, 0.058), (12, -0.164), (13, -0.064), (14, 0.026), (15, -0.033), (16, -0.114), (17, -0.131), (18, 0.004), (19, 0.071), (20, 0.023), (21, 0.014), (22, 0.032), (23, 0.05), (24, -0.044), (25, -0.037), (26, -0.072), (27, -0.038), (28, 0.062), (29, 0.039), (30, -0.028), (31, 0.009), (32, 0.072), (33, 0.046), (34, -0.098), (35, 0.089), (36, -0.075), (37, 0.136), (38, -0.05), (39, -0.031), (40, 0.053), (41, -0.024), (42, -0.037), (43, 0.025), (44, -0.014), (45, -0.059), (46, 0.012), (47, 0.076), (48, -0.083), (49, 0.06)]
simIndex simValue paperId paperTitle
same-paper 1 0.95827079 56 nips-2011-Committing Bandits
Author: Loc X. Bui, Ramesh Johari, Shie Mannor
Abstract: We consider a multi-armed bandit problem where there are two phases. The first phase is an experimentation phase where the decision maker is free to explore multiple options. In the second phase the decision maker has to commit to one of the arms and stick with it. Cost is incurred during both phases with a higher cost during the experimentation phase. We analyze the regret in this setup, and both propose algorithms and provide upper and lower bounds that depend on the ratio of the duration of the experimentation phase to the duration of the commitment phase. Our analysis reveals that if given the choice, it is optimal to experiment Θ(ln T ) steps and then commit, where T is the time horizon.
2 0.88733608 175 nips-2011-Multi-Bandit Best Arm Identification
Author: Victor Gabillon, Mohammad Ghavamzadeh, Alessandro Lazaric, Sébastien Bubeck
Abstract: We study the problem of identifying the best arm in each of the bandits in a multibandit multi-armed setting. We first propose an algorithm called Gap-based Exploration (GapE) that focuses on the arms whose mean is close to the mean of the best arm in the same bandit (i.e., small gap). We then introduce an algorithm, called GapE-V, which takes into account the variance of the arms in addition to their gap. We prove an upper-bound on the probability of error for both algorithms. Since GapE and GapE-V need to tune an exploration parameter that depends on the complexity of the problem, which is often unknown in advance, we also introduce variations of these algorithms that estimate this complexity online. Finally, we evaluate the performance of these algorithms and compare them to other allocation strategies on a number of synthetic problems.
3 0.86506003 97 nips-2011-Finite Time Analysis of Stratified Sampling for Monte Carlo
Author: Alexandra Carpentier, Rémi Munos
Abstract: We consider the problem of stratified sampling for Monte-Carlo integration. We model this problem in a multi-armed bandit setting, where the arms represent the strata, and the goal is to estimate a weighted average of the mean values of the arms. We propose a strategy that samples the arms according to an upper bound on their standard deviations and compare its estimation quality to an ideal allocation that would know the standard deviations of the strata. We provide two regret analyses: a distribution� dependent bound O(n−3/2 ) that depends on a measure of the disparity of � the strata, and a distribution-free bound O(n−4/3 ) that does not. 1
4 0.72242516 32 nips-2011-An Empirical Evaluation of Thompson Sampling
Author: Olivier Chapelle, Lihong Li
Abstract: Thompson sampling is one of oldest heuristic to address the exploration / exploitation trade-off, but it is surprisingly unpopular in the literature. We present here some empirical results using Thompson sampling on simulated and real data, and show that it is highly competitive. And since this heuristic is very easy to implement, we argue that it should be part of the standard baselines to compare against. 1
5 0.67717487 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
Author: Odalric-ambrym Maillard, Daniil Ryabko, Rémi Munos
Abstract: The problem of selecting the right state-representation in a reinforcement learning problem is considered. Several models (functions mapping past observations to a finite set) of the observations are given, and it is known that for at least one of these models the resulting state dynamics are indeed Markovian. Without knowing neither which of the models is the correct one, nor what are the probabilistic characteristics of the resulting MDP, it is required to obtain as much reward as the optimal policy for the correct model (or for the best of the correct models, if there are several). We propose an algorithm that achieves that, with a regret of order T 2/3 where T is the horizon time. 1
6 0.62428778 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits
7 0.5745036 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
8 0.54493892 177 nips-2011-Multi-armed bandits on implicit metric spaces
9 0.48961002 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations
10 0.45888159 109 nips-2011-Greedy Model Averaging
11 0.44327345 272 nips-2011-Stochastic convex optimization with bandit feedback
12 0.41235146 61 nips-2011-Contextual Gaussian Process Bandit Optimization
13 0.40477213 25 nips-2011-Adaptive Hedge
14 0.38084456 208 nips-2011-Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness
15 0.35021895 220 nips-2011-Prediction strategies without loss
16 0.33957306 50 nips-2011-Budgeted Optimization with Concurrent Stochastic-Duration Experiments
17 0.3102302 237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization
18 0.30681968 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
19 0.30620772 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation
20 0.29617003 160 nips-2011-Linear Submodular Bandits and their Application to Diversified Retrieval
topicId topicWeight
[(0, 0.024), (4, 0.018), (20, 0.045), (26, 0.03), (31, 0.096), (43, 0.066), (45, 0.103), (57, 0.024), (67, 0.244), (74, 0.025), (79, 0.115), (83, 0.024), (99, 0.076)]
simIndex simValue paperId paperTitle
1 0.82708478 147 nips-2011-Learning Patient-Specific Cancer Survival Distributions as a Sequence of Dependent Regressors
Author: Hsiu-chin Lin, Vickie Baracos, Russell Greiner, Chun-nam J. Yu
Abstract: An accurate model of patient survival time can help in the treatment and care of cancer patients. The common practice of providing survival time estimates based only on population averages for the site and stage of cancer ignores many important individual differences among patients. In this paper, we propose a local regression method for learning patient-specific survival time distribution based on patient attributes such as blood tests and clinical assessments. When tested on a cohort of more than 2000 cancer patients, our method gives survival time predictions that are much more accurate than popular survival analysis models such as the Cox and Aalen regression models. Our results also show that using patient-specific attributes can reduce the prediction error on survival time by as much as 20% when compared to using cancer site and stage only. 1
same-paper 2 0.77818209 56 nips-2011-Committing Bandits
Author: Loc X. Bui, Ramesh Johari, Shie Mannor
Abstract: We consider a multi-armed bandit problem where there are two phases. The first phase is an experimentation phase where the decision maker is free to explore multiple options. In the second phase the decision maker has to commit to one of the arms and stick with it. Cost is incurred during both phases with a higher cost during the experimentation phase. We analyze the regret in this setup, and both propose algorithms and provide upper and lower bounds that depend on the ratio of the duration of the experimentation phase to the duration of the commitment phase. Our analysis reveals that if given the choice, it is optimal to experiment Θ(ln T ) steps and then commit, where T is the time horizon.
3 0.73846602 129 nips-2011-Improving Topic Coherence with Regularized Topic Models
Author: David Newman, Edwin V. Bonilla, Wray Buntine
Abstract: Topic models have the potential to improve search and browsing by extracting useful semantic themes from web pages and other text documents. When learned topics are coherent and interpretable, they can be valuable for faceted browsing, results set diversity analysis, and document retrieval. However, when dealing with small collections or noisy text (e.g. web search result snippets or blog posts), learned topics can be less coherent, less interpretable, and less useful. To overcome this, we propose two methods to regularize the learning of topic models. Our regularizers work by creating a structured prior over words that reflect broad patterns in the external data. Using thirteen datasets we show that both regularizers improve topic coherence and interpretability while learning a faithful representation of the collection of interest. Overall, this work makes topic models more useful across a broader range of text data. 1
4 0.63861984 227 nips-2011-Pylon Model for Semantic Segmentation
Author: Victor Lempitsky, Andrea Vedaldi, Andrew Zisserman
Abstract: Graph cut optimization is one of the standard workhorses of image segmentation since for binary random field representations of the image, it gives globally optimal results and there are efficient polynomial time implementations. Often, the random field is applied over a flat partitioning of the image into non-intersecting elements, such as pixels or super-pixels. In the paper we show that if, instead of a flat partitioning, the image is represented by a hierarchical segmentation tree, then the resulting energy combining unary and boundary terms can still be optimized using graph cut (with all the corresponding benefits of global optimality and efficiency). As a result of such inference, the image gets partitioned into a set of segments that may come from different layers of the tree. We apply this formulation, which we call the pylon model, to the task of semantic segmentation where the goal is to separate an image into areas belonging to different semantic classes. The experiments highlight the advantage of inference on a segmentation tree (over a flat partitioning) and demonstrate that the optimization in the pylon model is able to flexibly choose the level of segmentation across the image. Overall, the proposed system has superior segmentation accuracy on several datasets (Graz-02, Stanford background) compared to previously suggested approaches. 1
Author: Matthias S. Keil
Abstract: Many species show avoidance reactions in response to looming object approaches. In locusts, the corresponding escape behavior correlates with the activity of the lobula giant movement detector (LGMD) neuron. During an object approach, its firing rate was reported to gradually increase until a peak is reached, and then it declines quickly. The η-function predicts that the LGMD activity is a product ˙ between an exponential function of angular size exp(−Θ) and angular velocity Θ, and that peak activity is reached before time-to-contact (ttc). The η-function has become the prevailing LGMD model because it reproduces many experimental observations, and even experimental evidence for the multiplicative operation was reported. Several inconsistencies remain unresolved, though. Here we address ˙ these issues with a new model (ψ-model), which explicitly connects Θ and Θ to biophysical quantities. The ψ-model avoids biophysical problems associated with implementing exp(·), implements the multiplicative operation of η via divisive inhibition, and explains why activity peaks could occur after ttc. It consistently predicts response features of the LGMD, and provides excellent fits to published experimental data, with goodness of fit measures comparable to corresponding fits with the η-function. 1 Introduction: τ and η Collision sensitive neurons were reported in species such different as monkeys [5, 4], pigeons [36, 34], frogs [16, 20], and insects [33, 26, 27, 10, 38]. This indicates a high ecological relevance, and raises the question about how neurons compute a signal that eventually triggers corresponding movement patterns (e.g. escape behavior or interceptive actions). Here, we will focus on visual stimulation. Consider, for simplicity, a circular object (diameter 2l), which approaches the eye at a collision course with constant velocity v. If we do not have any a priori knowledge about the object in question (e.g. its typical size or speed), then we will be able to access only two information sources. These information sources can be measured at the retina and are called optical variables (OVs). The first is the visual angle Θ, which can be derived from the number of stimulated photore˙ ˙ ceptors (spatial contrast). The second is its rate of change dΘ(t)/dt ≡ Θ(t). Angular velocity Θ is related to temporal contrast. ˙ How should we combine Θ and Θ in order to track an imminent collision? The perhaps simplest ˙ combination is τ (t) ≡ Θ(t)/Θ(t) [13, 18]. If the object hit us at time tc , then τ (t) ≈ tc − t will ∗ Also: www.ir3c.ub.edu, Research Institute for Brain, Cognition, and Behaviour (IR3C) Edifici de Ponent, Campus Mundet, Universitat de Barcelona, Passeig Vall d’Hebron, 171. E-08035 Barcelona 1 give us a running estimation of the time that is left until contact1 . Moreover, we do not need to know anything about the approaching object: The ttc estimation computed by τ is practically independent of object size and velocity. Neurons with τ -like responses were indeed identified in the nucleus retundus of the pigeon brain [34]. In humans, only fast interceptive actions seem to rely exclusively on τ [37, 35]. Accurate ttc estimation, however, seems to involve further mechanisms (rate of disparity change [31]). ˙ Another function of OVs with biological relevance is η ≡ Θ exp(−αΘ), with α = const. [10]. While η-type neurons were found again in pigeons [34] and bullfrogs [20], most data were gathered from the LGMD2 in locusts (e.g. [10, 9, 7, 23]). The η-function is a phenomenological model for the LGMD, and implies three principal hypothesis: (i) An implementation of an exponential function exp(·). Exponentation is thought to take place in the LGMD axon, via active membrane conductances [8]. Experimental data, though, seem to favor a third-power law rather than exp(·). (ii) The LGMD carries out biophysical computations for implementing the multiplicative operation. It has been suggested that multiplication is done within the LGMD itself, by subtracting the loga˙ rithmically encoded variables log Θ − αΘ [10, 8]. (iii) The peak of the η-function occurs before ˆ ttc, at visual angle Θ(t) = 2 arctan(1/α) [9]. It follows ttc for certain stimulus configurations (e.g. ˆ l/|v| 5ms). In principle, t > tc can be accounted for by η(t + δ) with a fixed delay δ < 0 (e.g. −27ms). But other researchers observed that LGMD activity continuous to rise after ttc even for l/|v| 5ms [28]. These discrepancies remain unexplained so far [29], but stimulation dynamics perhaps plays a role. We we will address these three issues by comparing the novel function “ψ” with the η-function. LGMD computations with the ψ-function: No multiplication, no exponentiation 2 A circular object which starts its approach at distance x0 and with speed v projects a visual angle Θ(t) = 2 arctan[l/(x0 − vt)] on the retina [34, 9]. The kinematics is hence entirely specified by the ˙ half-size-to-velocity ratio l/|v|, and x0 . Furthermore, Θ(t) = 2lv/((x0 − vt)2 + l2 ). In order to define ψ, we consider at first the LGMD neuron as an RC-circuit with membrane potential3 V [17] dV Cm = β (Vrest − V ) + gexc (Vexc − V ) + ginh (Vinh − V ) (1) dt 4 Cm = membrane capacity ; β ≡ 1/Rm denotes leakage conductance across the cell membrane (Rm : membrane resistance); gexc and ginh are excitatory and inhibitory inputs. Each conductance gi (i = exc, inh ) can drive the membrane potential to its associated reversal potential Vi (usually Vinh ≤ Vexc ). Shunting inhibition means Vinh = Vrest . Shunting inhibition lurks “silently” because it gets effective only if the neuron is driven away from its resting potential. With synaptic input, the neuron decays into its equilibrium state Vrest β + Vexc gexc + Vinh ginh V∞ ≡ (2) β + gexc + ginh according to V (t) = V∞ (1 − exp(−t/τm )). Without external input, V (t 1) → Vrest . The time scale is set by τm . Without synaptic input τm ≡ Cm /β. Slowly varying inputs gexc , ginh > 0 modify the time scale to approximately τm /(1 + (gexc + ginh )/β). For highly dynamic inputs, such as in late phase of the object approach, the time scale gets dynamical as well. The ψ-model assigns synaptic inputs5 ˙ ˙ ˙ ˙ gexc (t) = ϑ(t), ϑ(t) = ζ1 ϑ(t − ∆tstim ) + (1 − ζ1 )Θ(t) (3a) e ginh (t) = [γϑ(t)] , ϑ(t) = ζ0 ϑ(t − ∆tstim ) + (1 − ζ0 )Θ(t) 1 (3b) This linear approximation gets worse with increasing Θ, but turns out to work well until short before ttc (τ adopts a minimum at tc − 0.428978 · l/|v|). 2 LGMD activity is usually monitored via its postsynaptic neuron, the Descending Contralateral Movement Detector (DCMD) neuron. This represents no problem as LGMD spikes follow DCMD spikes 1:1 under visual stimulation [22] from 300Hz [21] to at least 400Hz [24]. 3 Here we assume that the membrane potential serves as a predictor for the LGMD’s mean firing rate. 4 Set to unity for all simulations. 5 LGMD receives also inhibition from a laterally acting network [21]. The η-function considers only direct feedforward inhibition [22, 6], and so do we. 2 Θ ∈ [7.63°, 180.00°[ temporal resolution ∆ tstim=1.0ms l/|v|=20.00ms, β=1.00, γ=7.50, e=3.00, ζ0=0.90, ζ1=0.99, nrelax=25 0.04 scaled dΘ/dt continuous discretized 0.035 0.03 Θ(t) (input) ϑ(t) (filtered) voltage V(t) (output) t = 56ms max t =300ms c 0.025 0 10 2 η(t): α=3.29, R =1.00 n =10 → t =37ms log Θ(t) amplitude relax max 0.02 0.015 0.01 0.005 0 −0.005 0 50 100 150 200 250 300 −0.01 0 350 time [ms] 50 100 150 200 250 300 350 time [ms] (b) ψ versus η (a) discretized optical variables Figure 1: (a) The continuous visual angle of an approaching object is shown along with its discretized version. Discretization transforms angular velocity from a continuous variable into a series of “spikes” (rescaled). (b) The ψ function with the inputs shown in a, with nrelax = 25 relaxation time steps. Its peak occurs tmax = 56ms before ttc (tc = 300ms). An η function (α = 3.29) that was fitted to ψ shows good agreement. For continuous optical variables, the peak would occur 4ms earlier, and η would have α = 4.44 with R2 = 1. For nrelax = 10, ψ is farther away from its equilibrium at V∞ , and its peak moves 19ms closer to ttc. t =500ms, dia=12.0cm, ∆t c =1.00ms, dt=10.00µs, discrete=1 stim 250 n relax = 50 2 200 α=4.66, R =0.99 [normal] n = 25 relax 2 α=3.91, R =1.00 [normal] n =0 relax tmax [ms] 150 2 α=1.15, R =0.99 [normal] 100 50 0 β=1.00, γ=7.50, e=3.00, V =−0.001, ζ =0.90, ζ =0.99 inh −50 5 10 15 20 25 30 0 35 1 40 45 50 l/|v| [ms] (a) different nrelax (b) different ∆tstim ˆ ˆ Figure 2: The figures plot the relative time tmax ≡ tc − t of the response peak of ψ, V (t), as a function of half-size-to-velocity ratio (points). Line fits with slope α and intercept δ were added (lines). The predicted linear relationship in all cases is consistent with experimental evidence [9]. (a) The stimulus time scale is held constant at ∆tstim = 1ms, and several LGMD time scales are defined by nrelax (= number of intercalated relaxation steps for each integration time step). Bigger values of nrelax move V (t) closer to its equilibrium V∞ (t), implying higher slopes α in turn. (b) LGMD time scale is fixed at nrelax = 25, and ∆tstim is manipulated. Because of the discretization of optical variables (OVs) in our simulation, increasing ∆tstim translates to an overall smaller number of jumps in OVs, but each with higher amplitude. Thus, we say ψ(t) ≡ V (t) if and only if gexc and ginh are defined with the last equation. The time ˙ scale of stimulation is defined by ∆tstim (by default 1ms). The variables ϑ and ϑ are lowpass filtered angular size and rate of expansion, respectively. The amount of filtering is defined by memory constants ζ0 and ζ1 (no filtering if zero). The idea is to continue with generating synaptic input ˙ after ttc, where Θ(t > tc ) = const and thus Θ(t > tc ) = 0. Inhibition is first weighted by γ, and then potentiated by the exponent e. Hodgkin-Huxley potentiates gating variables n, m ∈ [0, 1] instead (potassium ∝ n4 , sodium ∝ m3 , [12]) and multiplies them with conductances. Gabbiani and co-workers found that the function which transforms membrane potential to firing rate is better described by a power function with e = 3 than by exp(·) (Figure 4d in [8]). 3 Dynamics of the ψ-function 3 Discretization. In a typical experiment, a monitor is placed a short distance away from the insect’s eye, and an approaching object is displayed. Computer screens have a fixed spatial resolution, and as a consequence size increments of the displayed object proceed in discrete jumps. The locust retina is furthermore composed of a discrete array of ommatidia units. We therefore can expect a corresponding step-wise increment of Θ with time, although optical and neuronal filtering may ˙ smooth Θ to some extent again, resulting in ϑ (figure 1). Discretization renders Θ discontinuous, ˙ For simulating the dynamics of ψ, we discretized angular size what again will be alleviated in ϑ. ˙ with floor(Θ), and Θ(t) ≈ [Θ(t + ∆tstim ) − Θ(t)]/∆tstim . Discretized optical variables (OVs) were re-normalized to match the range of original (i.e. continuous) OVs. To peak, or not to peak? Rind & Simmons reject the hypothesis that the activity peak signals impending collision on grounds of two arguments [28]: (i) If Θ(t + ∆tstim ) − Θ(t) 3o in consecutively displayed stimulus frames, the illusion of an object approach would be lost. Such stimulation would rather be perceived as a sequence of rapidly appearing (but static) objects, causing reduced responses. (ii) After the last stimulation frame has been displayed (that is Θ = const), LGMD responses keep on building up beyond ttc. This behavior clearly depends on l/|v|, also according to their own data (e.g. Figure 4 in [26]): Response build up after ttc is typically observed for suffi˙ ciently small values of l/|v|. Input into ψ in situations where Θ = const and Θ = 0, respectively, ˙ is accommodated by ϑ and ϑ, respectively. We simulated (i) by setting ∆tstim = 5ms, thus producing larger and more infrequent jumps in discrete OVs than with ∆tstim = 1ms (default). As a consequence, ϑ(t) grows more slowly (deˆ layed build up of inhibition), and the peak occurs later (tmax ≡ tc − t = 10ms with everything else ˆ ˆ identical with figure 1b). The peak amplitude V = V (t) decreases nearly sixfold with respect to default. Our model thus predicts the reduced responses observed by Rind & Simmons [28]. Linearity. Time of peak firing rate is linearly related to l/|v| [10, 9]. The η-function is consistent ˆ with this experimental evidence: t = tc − αl/|v| + δ (e.g. α = 4.7, δ = −27ms). The ψ-function reproduces this relationship as well (figure 2), where α depends critically on the time scale of biophysical processes in the LGMD. We studied the impact of this time scale by choosing 10µs for the numerical integration of equation 1 (algorithm: 4th order Runge-Kutta). Apart from improving the numerical stability of the integration algorithm, ψ is far from its equilibrium V∞ (t) in every moment ˙ t, given the stimulation time scale ∆tstim = 1ms 6 . Now, at each value of Θ(t) and Θ(t), respectively, we intercalated nrelax iterations for integrating ψ. Each iteration takes V (t) asymptotically closer to V∞ (t), and limnrelax 1 V (t) = V∞ (t). If the internal processes in the LGMD cannot keep up with stimulation (nrelax = 0), we obtain slopes values that underestimate experimentally found values (figure 2a). In contrast, for nrelax 25 we get an excellent agreement with the experimentally determined α. This means that – under the reported experimental stimulation conditions (e.g. [9]) – the LGMD would operate relatively close to its steady state7 . Now we fix nrelax at 25 and manipulate ∆tstim instead (figure 2b). The default value ∆tstim = 1ms corresponds to α = 3.91. Slightly bigger values of ∆tstim (2.5ms and 5ms) underestimate the experimental α. In addition, the line fits also return smaller intercept values then. We see tmax < 0 up to l/|v| ≈ 13.5ms – LGMD activity peaks after ttc! Or, in other words, LGMD activity continues to increase after ttc. In the limit, where stimulus dynamics is extremely fast, and LGMD processes are kept far from equilibrium at each instant of the approach, α gets very small. As a consequence, tmax gets largely independent of l/|v|: The activity peak would cling to tmax although we varied l/|v|. 4 Freeze! Experimental data versus steady state of “psi” In the previous section, experimentally plausible values for α were obtained if ψ is close to equilibrium at each instant of time during stimulation. In this section we will thus introduce a steady-state 6 Assuming one ∆tstim for each integration time step. This means that by default stimulation and biophysical dynamics will proceed at identical time scales. 7 Notice that in this moment we can only make relative statements - we do not have data at hand for defining absolute time scales 4 tc=500ms, v=2.00m/s ψ∞ → (β varies), γ=3.50, e=3.00, Vinh=−0.001 tc=500ms, v=2.00m/s ψ∞ → β=2.50, γ=3.50, (e varies), Vinh=−0.001 300 tc=500ms, v=2.00m/s ψ∞ → β=2.50, (γ varies), e=3.00, Vinh=−0.001 350 300 β=10.00 β=5.00 norm. rmse = 0.058...0.153 correlation (β,α)=−0.90 (n=4) ∞ β=1.00 e=4.00 norm. |η−ψ | = 0.009...0.114 e=3.00 300 norm. rmse = 0.014...0.160 correlation (e,α)=0.98 (n=4) ∞ e=2.50 250 250 norm. |η−ψ | = 0.043...0.241 ∞ norm. rmse = 0.085...0.315 correlation (γ,α)=1.00 (n=5) 150 tmax [ms] 200 tmax [ms] 200 tmax [ms] γ=5.00 γ=2.50 γ=1.00 γ=0.50 γ=0.25 e=5.00 norm. |η−ψ | = 0.020...0.128 β=2.50 250 200 150 100 150 100 100 50 50 50 0 5 10 15 20 25 30 35 40 45 0 5 50 10 15 20 l/|v| [ms] 25 30 35 40 45 0 5 50 10 15 20 l/|v| [ms] (a) β varies 25 30 35 40 45 50 l/|v| [ms] (b) e varies (c) γ varies ˆ ˆ Figure 3: Each curve shows how the peak ψ∞ ≡ ψ∞ (t) depends on the half-size-to-velocity ratio. In each display, one parameter of ψ∞ is varied (legend), while the others are held constant (figure title). Line slopes vary according to parameter values. Symbol sizes are scaled according to rmse (see also figure 4). Rmse was calculated between normalized ψ∞ (t) & normalized η(t) (i.e. both functions ∈ [0, 1] with original minimum and maximum indicated by the textbox). To this end, the ˆ peak of the η-function was placed at tc , by choosing, at each parameter value, α = |v| · (tc − t)/l (for determining correlation, the mean value of α was taken across l/|v|). tc=500ms, v=2.00m/s ψ∞ → (β varies), γ=3.50, e=3.00, Vinh=−0.001 tc=500ms, v=2.00m/s ψ∞ → β=2.50, γ=3.50, (e varies), Vinh=−0.001 tc=500ms, v=2.00m/s ψ∞ → β=2.50, (γ varies), e=3.00, Vinh=−0.001 0.25 β=5.00 0.12 β=2.50 β=1.00 0.1 0.08 (normalized η, ψ∞) 0.12 β=10.00 (normalized η, ψ∞) (normalized η, ψ∞) 0.14 0.1 0.08 γ=5.00 γ=2.50 0.2 γ=1.00 γ=0.50 γ=0.25 0.15 0.06 0.04 0.02 0 5 10 15 20 25 30 35 40 45 50 meant |η(t)−ψ∞(t)| meant |η(t)−ψ∞(t)| meant |η(t)−ψ∞(t)| 0.06 0.04 e=5.00 e=4.00 e=3.00 0.02 e=2.50 10 l/|v| [ms] 15 20 25 30 35 40 45 50 l/|v| [ms] (a) β varies (b) e varies 0.1 0.05 0 5 10 15 20 25 30 35 40 45 50 l/|v| [ms] (c) γ varies Figure 4: This figure complements figure 3. It visualizes the time averaged absolute difference between normalized ψ∞ (t) & normalized η(t). For η, its value of α was chosen such that the maxima of both functions coincide. Although not being a fit, it gives a rough estimate on how the shape of both curves deviate from each other. The maximum possible difference would be one. version of ψ (i.e. equation 2 with Vrest = 0, Vexc = 1, and equations 3 plugged in), ψ∞ (t) ≡ e ˙ Θ(t) + Vinh [γΘ(t)] e ˙ β + Θ(t) + [γΘ(t)] (4) (Here we use continuous versions of angular size and rate of expansion). The ψ∞ -function makes life easier when it comes to fitting experimental data. However, it has its limitations, because we brushed the whole dynamic of ψ under the carpet. Figure 3 illustrates how the linˆ ear relationship (=“linearity”) between tmax ≡ tc − t and l/|v| is influenced by changes in parameter values. Changing any of the values of e, β, γ predominantly causes variation in line slopes. The smallest slope changes are obtained by varying Vinh (data not shown; we checked Vinh = 0, −0.001, −0.01, −0.1). For Vinh −0.01, linearity is getting slightly compromised, as slope increases with l/|v| (e.g. Vinh = −1 α ∈ [4.2, 4.7]). In order to get a notion about how well the shape of ψ∞ (t) matches η(t), we computed timeaveraged difference measures between normalized versions of both functions (details: figure 3 & 4). Bigger values of β match η better at smaller, but worse at bigger values of l/|v| (figure 4a). Smaller β cause less variation across l/|v|. As to variation of e, overall curve shapes seem to be best aligned with e = 3 to e = 4 (figure 4b). Furthermore, better matches between ψ∞ (t) and η(t) correspond to bigger values of γ (figure 4c). And finally, Vinh marches again to a different tune (data not shown). Vinh = −0.1 leads to the best agreement (≈ 0.04 across l/|v|) of all Vinh , quite different from the other considered values. For the rest, ψ∞ (t) and η(t) align the same (all have maximum 0.094), 5 ˙ (a) Θ = 126o /s ˙ (b) Θ = 63o /s Figure 5: The original data (legend label “HaGaLa95”) were resampled from ref. [10] and show ˙ DCMD responses to an object approach with Θ = const. Thus, Θ increases linearly with time. The η-function (fitting function: Aη(t+δ)+o) and ψ∞ (fitting function: Aψ∞ (t)+o) were fitted to these data: (a) (Figure 3 Di in [10]) Good fits for ψ∞ are obtained with e = 5 or higher (e = 3 R2 = 0.35 and rmse = 0.644; e = 4 R2 = 0.45 and rmse = 0.592). “Psi” adopts a sigmoid-like curve form which (subjectively) appears to fit the original data better than η. (b) (Figure 3 Dii in [10]) “Psi” yields an excellent fit for e = 3. RoHaTo10 gregarious locust LV=0.03s Θ(t), lv=30ms e011pos014 sgolay with 100 t =107ms max ttc=5.00s ψ adj.R2 0.95 (LM:3) ∞ η(t) adj.R2 1 (TR::1) 2 ψ : R =0.95, rmse=0.004, 3 coefficients ∞ → β=2.22, γ=0.70, e=3.00, V =−0.001, A=0.07, o=0.02, δ=0.00ms inh η: R2=1.00, rmse=0.001 → α=3.30, A=0.08, o=0.0, δ=−10.5ms 3.4 3.6 3.8 4 4.2 4.4 4.6 4.8 5 5.2 time [s] (b) α versus β (a) spike trace Figure 6: (a) DCMD activity in response to a black square (l/|v| = 30ms, legend label “e011pos14”, ref. [30]) approaching to the eye center of a gregarious locust (final visual angle 50o ). Data show the first stimulation so habituation is minimal. The spike trace (sampled at 104 Hz) was full wave rectified, lowpass filtered, and sub-sampled to 1ms resolution. Firing rate was estimated with Savitzky-Golay filtering (“sgolay”). The fits of the η-function (Aη(t + δ) + o; 4 coefficients) and ψ∞ -function (Aψ∞ (t) with fixed e, o, δ, Vinh ; 3 coefficients) provide both excellent fits to firing rate. (b) Fitting coefficient α (→ η-function) inversely correlates with β (→ ψ∞ ) when fitting firing rates of another 5 trials as just described (continuous line = line fit to the data points). Similar correlation values would be obtained if e is fixed at values e = 2.5, 4, 5 c = −0.95, −0.96, −0.91. If o was determined by the fitting algorithm, then c = −0.70. No clear correlations with α were obtained for γ. despite of covering different orders of magnitude with Vinh = 0, −0.001, −0.01. Decelerating approach. Hatsopoulos et al. [10] recorded DCMD activity in response to an ap˙ proaching object which projected image edges on the retina moving at constant velocity: Θ = const. ˙ This “linear approach” is perceived as if the object is getting increasingly implies Θ(t) = Θ0 + Θt. slower. But what appears a relatively unnatural movement pattern serves as a test for the functions η & ψ∞ . Figure 5 illustrates that ψ∞ passes the test, and consistently predicts that activity sharply rises in the initial approach phase, and subsequently declines (η passed this test already in the year 1995). 6 Spike traces. We re-sampled about 30 curves obtained from LGMD recordings from a variety of publications, and fitted η & ψ∞ -functions. We cannot show the results here, but in terms of goodness of fit measures, both functions are in the same ballbark. Rather, figure 6a shows a representative example [30]. When α and β are plotted against each other for five trials, we see a strong inverse correlation (figure 6b). Although five data points are by no means a firm statistical sample, the strong correlation could indicate that β and α play similar roles in both functions. Biophysically, β is the leakage conductance, which determines the (passive) membrane time constant τm ∝ 1/β of the neuron. Voltage drops within τm to exp(−1) times its initial value. Bigger values of β mean shorter τm (i.e., “faster neurons”). Getting back to η, this would suggest α ∝ τm , such that higher (absolute) values for α would possibly indicate a slower dynamic of the underlying processes. 5 Discussion (“The Good, the Bad, and the Ugly”) Up to now, mainly two classes of LGMD models existed: The phenomenological η-function on the one hand, and computational models with neuronal layers presynaptic to the LGMD on the other (e.g. [25, 15]; real-world video sequences & robotics: e.g. [3, 14, 32, 2]). Computational models predict that LGMD response features originate from excitatory and inhibitory interactions in – and between – presynaptic neuronal layers. Put differently, non-linear operations are generated in the presynaptic network, and can be a function of many (model) parameters (e.g. synaptic weights, time constants, etc.). In contrast, the η-function assigns concrete nonlinear operations to the LGMD [7]. The η-function is accessible to mathematical analysis, whereas computational models have to be probed with videos or artificial stimulus sequences. The η-function is vague about biophysical parameters, whereas (good) computational models need to be precise at each (model) parameter value. The η-function establishes a clear link between physical stimulus attributes and LGMD activity: It postulates what is to be computed from the optical variables (OVs). But in computational models, such a clear understanding of LGMD inputs cannot always be expected: Presynaptic processing may strongly transform OVs. The ψ function thus represents an intermediate model class: It takes OVs as input, and connects them with biophysical parameters of the LGMD. For the neurophysiologist, the situation could hardly be any better. Psi implements the multiplicative operation of the η-function by shunting inhibition (equation 1: Vexc ≈ Vrest and Vinh ≈ Vrest ). The η-function fits ψ very well according to our dynamical simulations (figure 1), and satisfactory by the approximate criterion of figure 4. We can conclude that ψ implements the η-function in a biophysically plausible way. However, ψ does neither explicitly specify η’s multiplicative operation, nor its exponential function exp(·). Instead we have an interaction between shunting inhibition and a power law (·)e , with e ≈ 3. So what about power laws in neurons? Because of e > 1, we have an expansive nonlinearity. Expansive power-law nonlinearities are well established in phenomenological models of simple cells of the primate visual cortex [1, 11]. Such models approximate a simple cell’s instantaneous firing rate r from linear filtering of a stimulus (say Y ) by r ∝ ([Y ]+ )e , where [·]+ sets all negative values to zero and lets all positive pass. Although experimental evidence favors linear thresholding operations like r ∝ [Y − Ythres ]+ , neuronal responses can behave according to power law functions if Y includes stimulus-independent noise [19]. Given this evidence, the power-law function of the inhibitory input into ψ could possibly be interpreted as a phenomenological description of presynaptic processes. The power law would also be the critical feature by means of which the neurophysiologist could distinguish between the η function and ψ. A study of Gabbiani et al. aimed to provide direct evidence for a neuronal implementation of the η-function [8]. Consequently, the study would be an evidence ˙ for a biophysical implementation of “direct” multiplication via log Θ − αΘ. Their experimental evidence fell somewhat short in the last part, where “exponentation through active membrane conductances” should invert logarithmic encoding. Specifically, the authors observed that “In 7 out of 10 neurons, a third-order power law best described the data” (sixth-order in one animal). Alea iacta est. Acknowledgments MSK likes to thank Stephen M. Rogers for kindly providing the recording data for compiling figure 6. MSK furthermore acknowledges support from the Spanish Government, by the Ramon and Cajal program and the research grant DPI2010-21513. 7 References [1] D.G. Albrecht and D.B. Hamilton, Striate cortex of monkey and cat: contrast response function, Journal of Neurophysiology 48 (1982), 217–237. [2] S. Bermudez i Badia, U. Bernardet, and P.F.M.J. Verschure, Non-linear neuronal responses as an emergent property of afferent networks: A case study of the locust lobula giant movemement detector, PLoS Computational Biology 6 (2010), no. 3, e1000701. [3] M. Blanchard, F.C. Rind, and F.M.J. Verschure, Collision avoidance using a model of locust LGMD neuron, Robotics and Autonomous Systems 30 (2000), 17–38. [4] D.F. Cooke and M.S.A. Graziano, Super-flinchers and nerves of steel: Defensive movements altered by chemical manipulation of a cortical motor area, Neuron 43 (2004), no. 4, 585–593. [5] L. Fogassi, V. Gallese, L. Fadiga, G. Luppino, M. Matelli, and G. Rizzolatti, Coding of peripersonal space in inferior premotor cortex (area f4), Journal of Neurophysiology 76 (1996), 141–157. [6] F. Gabbiani, I. Cohen, and G. Laurent, Time-dependent activation of feed-forward inhibition in a looming sensitive neuron, Journal of Neurophysiology 94 (2005), 2150–2161. [7] F. Gabbiani, H.G. Krapp, N. Hatsopolous, C.H. Mo, C. Koch, and G. Laurent, Multiplication and stimulus invariance in a looming-sensitive neuron, Journal of Physiology - Paris 98 (2004), 19–34. [8] F. Gabbiani, H.G. Krapp, C. Koch, and G. Laurent, Multiplicative computation in a visual neuron sensitive to looming, Nature 420 (2002), 320–324. [9] F. Gabbiani, H.G. Krapp, and G. Laurent, Computation of object approach by a wide-field, motionsensitive neuron, Journal of Neuroscience 19 (1999), no. 3, 1122–1141. [10] N. Hatsopoulos, F. Gabbiani, and G. Laurent, Elementary computation of object approach by a wide-field visual neuron, Science 270 (1995), 1000–1003. [11] D.J. Heeger, Modeling simple-cell direction selectivity with normalized, half-squared, linear operators, Journal of Neurophysiology 70 (1993), 1885–1898. [12] A.L. Hodkin and A.F. Huxley, A quantitative description of membrane current and its application to conduction and excitation in nerve, Journal of Physiology 117 (1952), 500–544. [13] F. Hoyle, The black cloud, Pinguin Books, London, 1957. [14] M.S. Keil, E. Roca-Morena, and A. Rodr´guez-V´ zquez, A neural model of the locust visual system for ı a detection of object approaches with real-world scenes, Proceedings of the Fourth IASTED International Conference (Marbella, Spain), vol. 5119, 6-8 September 2004, pp. 340–345. [15] M.S. Keil and A. Rodr´guez-V´ zquez, Towards a computational approach for collision avoidance with ı a real-world scenes, Proceedings of SPIE: Bioengineered and Bioinspired Systems (Maspalomas, Gran Canaria, Canary Islands, Spain) (A. Rodr´guez-V´ zquez, D. Abbot, and R. Carmona, eds.), vol. 5119, ı a SPIE - The International Society for Optical Engineering, 19-21 May 2003, pp. 285–296. [16] J.G. King, J.Y. Lettvin, and E.R. Gruberg, Selective, unilateral, reversible loss of behavioral responses to looming stimuli after injection of tetrodotoxin or cadmium chloride into the frog optic nerve, Brain Research 841 (1999), no. 1-2, 20–26. [17] C. Koch, Biophysics of computation: information processing in single neurons, Oxford University Press, New York, 1999. [18] D.N. Lee, A theory of visual control of braking based on information about time-to-collision, Perception 5 (1976), 437–459. [19] K.D. Miller and T.W. Troyer, Neural noise can explain expansive, power-law nonlinearities in neuronal response functions, Journal of Neurophysiology 87 (2002), 653–659. [20] Hideki Nakagawa and Kang Hongjian, Collision-sensitive neurons in the optic tectum of the bullfrog, rana catesbeiana, Journal of Neurophysiology 104 (2010), no. 5, 2487–2499. [21] M. O’Shea and C.H.F. Rowell, Projection from habituation by lateral inhibition, Nature 254 (1975), 53– 55. [22] M. O’Shea and J.L.D. Williams, The anatomy and output connection of a locust visual interneurone: the lobula giant movement detector (lgmd) neurone, Journal of Comparative Physiology 91 (1974), 257–266. [23] S. Peron and F. Gabbiani, Spike frequency adaptation mediates looming stimulus selectivity, Nature Neuroscience 12 (2009), no. 3, 318–326. [24] F.C. Rind, A chemical synapse between two motion detecting neurones in the locust brain, Journal of Experimental Biology 110 (1984), 143–167. [25] F.C. Rind and D.I. Bramwell, Neural network based on the input organization of an identified neuron signaling implending collision, Journal of Neurophysiology 75 (1996), no. 3, 967–985. 8 [26] F.C. Rind and P.J. Simmons, Orthopteran DCMD neuron: a reevaluation of responses to moving objects. I. Selective responses to approaching objects, Journal of Neurophysiology 68 (1992), no. 5, 1654–1666. [27] , Orthopteran DCMD neuron: a reevaluation of responses to moving objects. II. Critical cues for detecting approaching objects, Journal of Neurophysiology 68 (1992), no. 5, 1667–1682. [28] , Signaling of object approach by the dcmd neuron of the locust, Journal of Neurophysiology 77 (1997), 1029–1033. [29] , Reply, Trends in Neuroscience 22 (1999), no. 5, 438. [30] S.M. Roger, G.W.J. Harston, F. Kilburn-Toppin, T. Matheson, M. Burrows, F. Gabbiani, and H.G. Krapp, Spatiotemporal receptive field properties of a looming-sensitive neuron in solitarious and gregarious phases of desert locust, Journal of Neurophysiology 103 (2010), 779–792. [31] S.K. Rushton and J.P. Wann, Weighted combination of size and disparity: a computational model for timing ball catch, Nature Neuroscience 2 (1999), no. 2, 186–190. [32] Yue. S., Rind. F.C., M.S. Keil, J. Cuadri, and R. Stafford, A bio-inspired visual collision detection mechanism for cars: Optimisation of a model of a locust neuron to a novel environment, Neurocomputing 69 (2006), 1591–1598. [33] G.R. Schlotterer, Response of the locust descending movement detector neuron to rapidly approaching and withdrawing visual stimuli, Canadian Journal of Zoology 55 (1977), 1372–1376. [34] H. Sun and B.J. Frost, Computation of different optical variables of looming objects in pigeon nucleus rotundus neurons, Nature Neuroscience 1 (1998), no. 4, 296–303. [35] J.R. Tresilian, Visually timed action: time-out for ’tau’?, Trends in Cognitive Sciences 3 (1999), no. 8, 1999. [36] Y. Wang and B.J. Frost, Time to collision is signalled by neurons in the nucleus rotundus of pigeons, Nature 356 (1992), 236–238. [37] J.P. Wann, Anticipating arrival: is the tau-margin a specious theory?, Journal of Experimental Psychology and Human Perceptual Performance 22 (1979), 1031–1048. [38] M. Wicklein and N.J. Strausfeld, Organization and significance of neurons that detect change of visual depth in the hawk moth manduca sexta, The Journal of Comparative Neurology 424 (2000), no. 2, 356– 376. 9
6 0.62951332 175 nips-2011-Multi-Bandit Best Arm Identification
7 0.61397469 186 nips-2011-Noise Thresholds for Spectral Clustering
8 0.60653955 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits
9 0.56344557 32 nips-2011-An Empirical Evaluation of Thompson Sampling
10 0.56038684 24 nips-2011-Active learning of neural response functions with Gaussian processes
11 0.5534026 97 nips-2011-Finite Time Analysis of Stratified Sampling for Monte Carlo
12 0.54956883 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning
13 0.54758793 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
14 0.5459227 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
15 0.54468614 180 nips-2011-Multiple Instance Filtering
16 0.54439181 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
17 0.54028761 29 nips-2011-Algorithms and hardness results for parallel large margin learning
18 0.53890568 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines
19 0.53777671 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
20 0.53776109 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning