nips nips2011 nips2011-175 knowledge-graph by maker-knowledge-mining

175 nips-2011-Multi-Bandit Best Arm Identification


Source: pdf

Author: Victor Gabillon, Mohammad Ghavamzadeh, Alessandro Lazaric, Sébastien Bubeck

Abstract: We study the problem of identifying the best arm in each of the bandits in a multibandit multi-armed setting. We first propose an algorithm called Gap-based Exploration (GapE) that focuses on the arms whose mean is close to the mean of the best arm in the same bandit (i.e., small gap). We then introduce an algorithm, called GapE-V, which takes into account the variance of the arms in addition to their gap. We prove an upper-bound on the probability of error for both algorithms. Since GapE and GapE-V need to tune an exploration parameter that depends on the complexity of the problem, which is often unknown in advance, we also introduce variations of these algorithms that estimate this complexity online. Finally, we evaluate the performance of these algorithms and compare them to other allocation strategies on a number of synthetic problems.

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract We study the problem of identifying the best arm in each of the bandits in a multibandit multi-armed setting. [sent-6, score-0.387]

2 We first propose an algorithm called Gap-based Exploration (GapE) that focuses on the arms whose mean is close to the mean of the best arm in the same bandit (i. [sent-7, score-0.796]

3 We then introduce an algorithm, called GapE-V, which takes into account the variance of the arms in addition to their gap. [sent-10, score-0.303]

4 Since GapE and GapE-V need to tune an exploration parameter that depends on the complexity of the problem, which is often unknown in advance, we also introduce variations of these algorithms that estimate this complexity online. [sent-12, score-0.194]

5 Finally, we evaluate the performance of these algorithms and compare them to other allocation strategies on a number of synthetic problems. [sent-13, score-0.183]

6 Since it may take significantly more resources to find the best treatment for one subpopulation than for the others, the common strategy of enrolling patients as they arrive may not yield an overall good performance. [sent-19, score-0.178]

7 This problem can be formulated as the best arm identification over M multi-armed bandits [1], which itself can be seen as the problem of pure exploration [4] over multiple bandits. [sent-21, score-0.502]

8 In this formulation, each subpopulation is considered as a multi-armed bandit, each treatment as an arm, trying a medication on a patient as a pull, and we are asked to recommend an arm for each bandit after a given number of pulls (budget). [sent-22, score-0.716]

9 The evaluation can be based on 1) the average over the bandits of the reward of the recommended arms, or 2) the average probability of error (not selecting the best arm), or 3) the maximum probability of error. [sent-23, score-0.184]

10 Note that this setting is different from the standard multi-armed bandit problem in which the goal is to maximize the cumulative sum of rewards (see e. [sent-24, score-0.246]

11 The pure exploration problem is about designing strategies that make the best use of the limited budget (e. [sent-27, score-0.212]

12 They showed that both algorithms are nearly optimal since their probability of returning the wrong arm decreases exponentially at a rate. [sent-32, score-0.303]

13 , [10, 12]) 1 and action-elimination algorithms [7] address this problem under a constraint on the accuracy in identifying the best arm and they minimize the budget needed to achieve that accuracy. [sent-35, score-0.31]

14 However, UCB-E and Successive Rejects are designed for a single bandit problem, and as we will discuss later, cannot be easily extended to the multi-bandit case studied in this paper. [sent-36, score-0.246]

15 have recently proposed an active learning algorithm for resource allocation over multiple bandits [5]. [sent-38, score-0.243]

16 Moreover, the target of their proposed algorithm is to minimize the maximum uncertainty in estimating the value of the arms for each bandit. [sent-40, score-0.266]

17 Note that this is different than our target, which is to maximize the quality of the arms recommended for each bandit. [sent-41, score-0.279]

18 The allocation strategy implemented by GapE focuses on the gap of the arms, i. [sent-43, score-0.238]

19 , the difference between the mean of the arm and the mean of the best arm (in that bandit). [sent-45, score-0.514]

20 Since both GapE and GapE-V need to tune an exploration parameter that depends on the complexity of the problem, which is rarely known in advance, we also introduce their adaptive version. [sent-48, score-0.174]

21 2 Problem Setup In this section, we introduce the notation used throughout the paper and formalize the multi-bandit best arm identification problem. [sent-51, score-0.267]

22 Let M be the number of bandits and K be the number of arms for each bandit (we use indices m, p, q for the bandits and k, i, j for the arms). [sent-52, score-0.71]

23 Each arm k of a bandit 2 m is characterized by a distribution νmk bounded in [0, b] with mean µmk and variance σmk . [sent-53, score-0.519]

24 In the ∗ ∗ following, we assume that each bandit has a unique best arm. [sent-54, score-0.266]

25 We denote by µm and km the mean and ∗ the index of the best arm of bandit m (i. [sent-55, score-0.596]

26 In m each bandit m, we define the gap for each arm as ∆mk = | maxj=k µmj − µmk |. [sent-58, score-0.525]

27 , n, the forecaster pulls a bandit-arm pair I(t) = (m, k) and observes a sample drawn from the distribution νI(t) independent from the past. [sent-64, score-0.204]

28 The forecaster estimates the expected value of each arm by computing the average of the samples observed over time. [sent-65, score-0.289]

29 Let Tmk (t) be the number of times that arm k of bandit m has been pulled by the end of round t, 1 then the mean of this arm is estimated as µmk (t) = Tmk (t) Tmk (t) Xmk (s), where Xmk (s) is the s=1 s-th sample observed from νmk . [sent-66, score-0.807]

30 At the end of round n, the forecaster returns for each bandit m the arm with the highest estimated mean, i. [sent-68, score-0.566]

31 In some applications, returning the wrong arm is considered as an error independently from its regret, and thus, the objective is to minimize the average probability of error e(n) = 1 M M em (n) = m=1 1 M M ∗ P Jm (n) = km . [sent-72, score-0.395]

32 , n do Compute Bmk (t) = −∆mk (t − 1) + b Tmka for all bandit-arm pairs (m, k) (t−1) Draw I(t) ∈ arg maxm,k Bmk (t) Observe XI(t) TI(t) (t − 1) + 1 ∼ νI(t) Update TI(t) (t) = TI(t) (t − 1) + 1 and ∆mk (t) ∀k of the selected bandit end for Return Jm (n) ∈ arg maxk∈{1,. [sent-84, score-0.27]

33 The first term is the negative of the estimated gap for arm k in bandit m. [sent-96, score-0.54]

34 Similar to other upper-confidence bound (UCB) methods [3], the second part is an exploration term which forces the algorithm to pull arms that have been less explored. [sent-97, score-0.431]

35 As a result, the algorithm tends to pull arms with small estimated gap and small number of pulls. [sent-98, score-0.394]

36 The exploration parameter a tunes the level of exploration of the algorithm. [sent-99, score-0.21]

37 1, if the time 4 horizon n is known, a should be set to a = 9 n−K , where H = m,k b2 /∆2 is the complexity of mk H the problem (see Sec. [sent-102, score-0.302]

38 Note that GapE differs from most standard bandit strategies in the sense that the B-index for an arm depends explicitly on the statistics of the other arms. [sent-105, score-0.519]

39 ∗ In the single-bandit problem, since the best and second best arms have the same gap (∆mkm = ∗ mink=km ∆mk ), GapE considers them equivalent and tends to pull them the same amount of time, while UCB-E tends to pull the best arm more often than the second best one. [sent-110, score-0.787]

40 Despite this difference, the performance of both algorithms in predicting the best arm after n pulls would be the same. [sent-111, score-0.422]

41 In this case, if we run UCB-E on all the M K arms, it tends to pull more the arm with the highest mean over all the bandits, i. [sent-117, score-0.328]

42 As a result, it would be accurate in predicting the best arm k ∗ over bandits, but may have an arbitrarily bad performance in predicting the best arm for each bandit, and thus, may incur a large error ℓ(n). [sent-120, score-0.552]

43 On the other hand, GapE focuses on the arms with the smallest gaps. [sent-121, score-0.283]

44 This way, it assigns more pulls to bandits whose optimal arms are difficult to identify (i. [sent-122, score-0.507]

45 , bandits with arms with small gaps), and as shown in the next section, it achieves a high probability in identifying the best arm in each bandit. [sent-124, score-0.649]

46 H If we denote by Hmk = b2 /∆2 , the complexity of arm k in bandit m, it is clear from the definition mk of H that each arm has an additive impact on the overall complexity of the multi-bandit problem. [sent-134, score-1.065]

47 2 2 Moreover, if we define the complexity of each bandit m as Hm = k b /∆mk (similar to the definition of complexity for UCB-E in [1]), the GapE complexity may be rewritten as H = m Hm . [sent-135, score-0.36]

48 Remark 2 (Comparison with the static allocation strategy). [sent-137, score-0.178]

49 The main objective of GapE is to tradeoff between allocating pulls according to the gaps (more precisely, according to the complexities Hmk ) and the exploration needed to improve the accuracy of their estimates. [sent-138, score-0.381]

50 If the gaps were known in advance, a nearly-optimal static allocation strategy assigns to each bandit-arm pair a number of pulls proportional to its complexity. [sent-139, score-0.451]

51 Let us consider a strategy that pulls each arm a fixed number of times over the horizon n. [sent-140, score-0.449]

52 The probability of error for this strategy may be bounded as M M ∗ P Jm (n) = km ≤ ∗ ℓStatic (n) ≤ P ∃m : Jm (n) = km ≤ M P µmkm (n) ≤ µmk (n) ˆ ∗ ˆ ∗ m=1 k=km m=1 M exp − Tmk (n) ≤ ∗ m=1 k=km ∆2 −1 mk exp − Tmk (n)Hmk . [sent-141, score-0.539]

53 = b2 m=1 k=k∗ (1) m Given the constraint mk Tmk (n) = n, the allocation minimizing the last term in Eq. [sent-142, score-0.393]

54 Although this is not neces∗ sarily the optimal static strategy (Tmk (n) minimizes an upper-bound), this allocation guarantees a probability of error smaller than M K exp(−n/H). [sent-145, score-0.258]

55 Theorem 1 shows that, for n large enough, GapE achieves the same performance as the static allocation StaticGap. [sent-146, score-0.178]

56 In the uniform allocation strategy, the total budget n is uniformly split over all the bandits and arms. [sent-151, score-0.3]

57 Using the same derivation as in Remark 2, the probability of error ℓ(n) for this strategy may be bounded as M exp − ℓUnif (n) ≤ ∗ m=1 k=km n ∆2 mk ≤ M K exp M K b2 − n . [sent-153, score-0.373]

58 , a two-level algorithm that first selects a bandit uniformly and then pulls arms within each bandit using UCB-E, the total number of pulls for each bandit m is k Tmk (n) = n/M , while the number of pulls Tmk (n) over the arms in bandit m is determined by UCB-E. [sent-156, score-1.942]

59 4, shows that GapE is able to adapt to the complexity H of the overall multi-bandit problem better than the other two allocation strategies. [sent-163, score-0.201]

60 In fact, while the performance of the Uniform strategy depends on the most complex arm over the bandits and the strategy Unif+UCB-E is affected by the most complex bandit, the performance of GapE depends on the sum of the complexities of all the arms involved in the pure exploration problem. [sent-164, score-0.89]

61 Now we would like to prove that on the event E, we find the best arm for all the bandits, i. [sent-178, score-0.287]

62 Since Jm (n) is the empirical best arm of bandit m, we should prove that for ∗ any k ∈ {1, . [sent-184, score-0.533]

63 In this step, we show that in GapE, for any bandits (m, q) and arms (k, j), and for any t ≥ M K, the following dependence between the number of pulls of the arms holds a ≥ −∆qj + (1 − d)b max Tmk (t) − 1, 1 −∆mk + (1 + d)b a , Tqj (t) (2) where d ∈ [0, 1]. [sent-190, score-0.773]

64 We know that after the first M K rounds of the GapE algorithm, all the arms have been pulled once, i. [sent-193, score-0.316]

65 Let us assume that (2) holds at time t − 1 and we pull arm i of bandit p at time t, i. [sent-197, score-0.553]

66 Tqj (t) (3) Since arm i of bandit p has been pulled at time t, we have that for any bandit-arm pair (q, j) −∆pi (t − 1) + b a ≥ −∆qj (t − 1) + b Tpi (t − 1) a . [sent-208, score-0.549]

67 In order to prove the condition of Tmk (n) in step 1, we need to find a lower-bound on the number of pulls of all the arms at time t = n (at the end). [sent-215, score-0.428]

68 Let us assume that arm k of bandit m has 2 (1−d)2 a been pulled less than ab ∆2 , which indicates that −∆mk + (1 − d)b Tmk (n) > 0. [sent-216, score-0.529]

69 From this mk result and (2), we have −∆qj + (1 + d)b a Tqj (n)−1 > 0, or equivalently Tqj (n) < ab2 (1+d)2 ∆2 qj +1 for any pair (q, j). [sent-217, score-0.332]

70 So, if we select a such that n − M K ≥ ab2 (1 + d)2 q,j ∆1 , we contradict 2 2 qj 2 qj 2 2 2 (1−d) , which means that Tmk (n) ≥ 4ab c for any pair the first assumption that Tmk (n) < ab ∆2 ∆2 mk mk (m, k), when 1 − d ≥ 2c. [sent-220, score-0.644]

71 The allocation strategy implemented by GapE focuses only on the arms with small gap and does not take into consideration their variance. [sent-226, score-0.504]

72 However, it is clear that the arms with small variance, even if their gap is small, just need a few pulls to be correctly estimated. [sent-227, score-0.44]

73 Let σmk (t) = Tmk (t)−1 s=1 Xmk (s) − µ2 (t) be the estimated variance mk for arm k of bandit m at the end of round t. [sent-229, score-0.799]

74 As a result, arms with low variance will be explored much less than in the GapE algorithm. [sent-232, score-0.292]

75 In Theorem 2, H σ is the complexity of the GapE-V algorithm and is defined as K M Hσ = 2 σmk + (16/3)b∆mk ∆2 mk σmk + m=1 k=1 2 . [sent-239, score-0.287]

76 Although the variance-complexity H σ could be larger than the complexity H used in GapE, whenever the variances of the arms are small compared to the range b of the distribution, we expect H σ to be smaller than H. [sent-240, score-0.321]

77 Furthermore, if the arms have very different variances, then GapE-V is expected to better capture the complexity of each arm and allocate the pulls accordingly. [sent-241, score-0.708]

78 For instance, in the case where all the gaps are the same, GapE tends to allocate pulls proportionally to the complexity Hmk and it would perform an almost uniform allocation over bandits and arms. [sent-242, score-0.552]

79 On the other hand, the variances of the arms could be very heterogeneous and GapE-V would adapt the allocation strategy by pulling more often the arms whose values are more uncertain. [sent-243, score-0.757]

80 By using a lower-bound on the true H and H σ , the algorithms tend to explore arms more uniformly and this allows them to increase the accuracy of their estimated complexities. [sent-250, score-0.294]

81 The arms have Bernoulli distribution with parameters: bandit 1 = (0. [sent-277, score-0.512]

82 The arms have Rademacher distribution with parameters (x, y): bandit 1 = {(0, 1. [sent-288, score-0.512]

83 The arms have Rademacher distribution with parameters (x, y): bandit 1 = {(0, 1. [sent-305, score-0.512]

84 , the number of samples to find the best arms with high probability), η should simply correct the inaccuracy of the constants in the analysis, and thus, the range of its nearly-optimal values should be constant across different problems. [sent-339, score-0.286]

85 Finally, we set n ≃ H σ , since we expect H σ to roughly capture the number of pulls necessary to solve the pure exploration problem with high probability. [sent-341, score-0.278]

86 the probability to identify the best arm in all the bandits after n rounds, of the gap-based algorithms as well as Unif and Unif+UCB-E strategies. [sent-345, score-0.396]

87 4% probability of error), since it adapts the allocation within each bandit so as to pull more often the nearly-optimal arms. [sent-352, score-0.484]

88 However, the two bandit problems are not equally difficult. [sent-353, score-0.246]

89 In fact, their complexities are very different (H1 ≃ 925 and H2 ≃ 67), and thus, much less samples are needed to identify the best arm in the second bandit than in the first one. [sent-354, score-0.565]

90 05), thus all the arms (and bandits) have the same complexity Hmk = 400. [sent-361, score-0.304]

91 The reason why GapE is still able to improve over Unif may be explained by the difference between static and dynamic allocation strategies and it is further investigated in App. [sent-363, score-0.204]

92 Unlike the gaps, the variance of the arms is extremely heterogeneous. [sent-365, score-0.292]

93 In fact, the variance of the arms of bandit 1 is bigger than in bandit 2, thus making it harder to solve. [sent-366, score-0.784]

94 the variance of each bandit and within each bandit is strongly heterogeneous. [sent-385, score-0.518]

95 5 Conclusion In this paper, we studied the problem of best arm identification in a multi-bandit multi-armed setting. [sent-388, score-0.267]

96 We extended the basic algorithm to also consider the variance of the arms and proved an upper-bound for its probability of error. [sent-390, score-0.322]

97 The numerical simulations confirmed the theoretical findings that GapE and GapE-V outperform other allocation strategies, and that their adaptive counterparts are able to estimate the complexity without worsening the global performance. [sent-392, score-0.213]

98 Although GapE does not know the gaps, the experimental results reported in [8] indicate that it might outperform a static allocation strategy, which knows the gaps in advance, thus suggesting that an adaptive strategy could perform better than a static one. [sent-393, score-0.354]

99 Moreover, we plan to apply the algorithms introduced in this paper to the problem of rollout allocation for classification-based policy iteration in reinforcement learning [9, 6], where the goal is to identify the greedy action (arm) in each of the states (bandit) in a training set. [sent-395, score-0.189]

100 Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. [sent-435, score-0.261]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('gape', 0.698), ('arms', 0.266), ('tmk', 0.25), ('mk', 0.249), ('arm', 0.247), ('bandit', 0.246), ('unif', 0.157), ('allocation', 0.144), ('pulls', 0.142), ('hmk', 0.125), ('exploration', 0.105), ('bandits', 0.099), ('km', 0.083), ('tqj', 0.073), ('gaps', 0.066), ('jm', 0.066), ('qj', 0.063), ('ucbe', 0.063), ('pull', 0.06), ('bmk', 0.052), ('subpopulation', 0.052), ('complexities', 0.052), ('strategy', 0.045), ('forecaster', 0.042), ('complexity', 0.038), ('ucb', 0.036), ('pulled', 0.036), ('hm', 0.034), ('static', 0.034), ('gap', 0.032), ('lcb', 0.031), ('mkm', 0.031), ('tpi', 0.031), ('xmk', 0.031), ('adaptive', 0.031), ('pure', 0.031), ('budget', 0.03), ('treatment', 0.029), ('uniform', 0.027), ('clinical', 0.027), ('strategies', 0.026), ('variance', 0.026), ('pi', 0.023), ('exp', 0.022), ('trial', 0.021), ('gabillon', 0.021), ('multibandit', 0.021), ('tends', 0.021), ('prove', 0.02), ('pair', 0.02), ('inequality', 0.02), ('advance', 0.02), ('best', 0.02), ('inductive', 0.02), ('bubeck', 0.019), ('adapt', 0.019), ('bernstein', 0.019), ('error', 0.018), ('maxm', 0.018), ('twentieth', 0.018), ('remark', 0.018), ('probability', 0.017), ('patients', 0.017), ('audibert', 0.017), ('variances', 0.017), ('adapts', 0.017), ('rollout', 0.017), ('focuses', 0.017), ('round', 0.016), ('options', 0.016), ('kn', 0.016), ('ti', 0.016), ('allocating', 0.016), ('rademacher', 0.016), ('identi', 0.016), ('regret', 0.016), ('resources', 0.015), ('reinforcement', 0.015), ('lazaric', 0.015), ('allocate', 0.015), ('estimated', 0.015), ('horizon', 0.015), ('rounds', 0.014), ('decreases', 0.014), ('ghavamzadeh', 0.014), ('allocated', 0.014), ('maxk', 0.014), ('algorithms', 0.013), ('recommended', 0.013), ('bc', 0.013), ('mj', 0.013), ('proved', 0.013), ('hoeffding', 0.012), ('deng', 0.012), ('arg', 0.012), ('tuned', 0.012), ('returning', 0.012), ('maxj', 0.012), ('inequalities', 0.011), ('account', 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 175 nips-2011-Multi-Bandit Best Arm Identification

Author: Victor Gabillon, Mohammad Ghavamzadeh, Alessandro Lazaric, Sébastien Bubeck

Abstract: We study the problem of identifying the best arm in each of the bandits in a multibandit multi-armed setting. We first propose an algorithm called Gap-based Exploration (GapE) that focuses on the arms whose mean is close to the mean of the best arm in the same bandit (i.e., small gap). We then introduce an algorithm, called GapE-V, which takes into account the variance of the arms in addition to their gap. We prove an upper-bound on the probability of error for both algorithms. Since GapE and GapE-V need to tune an exploration parameter that depends on the complexity of the problem, which is often unknown in advance, we also introduce variations of these algorithms that estimate this complexity online. Finally, we evaluate the performance of these algorithms and compare them to other allocation strategies on a number of synthetic problems.

2 0.24923582 97 nips-2011-Finite Time Analysis of Stratified Sampling for Monte Carlo

Author: Alexandra Carpentier, Rémi Munos

Abstract: We consider the problem of stratified sampling for Monte-Carlo integration. We model this problem in a multi-armed bandit setting, where the arms represent the strata, and the goal is to estimate a weighted average of the mean values of the arms. We propose a strategy that samples the arms according to an upper bound on their standard deviations and compare its estimation quality to an ideal allocation that would know the standard deviations of the strata. We provide two regret analyses: a distribution� dependent bound O(n−3/2 ) that depends on a measure of the disparity of � the strata, and a distribution-free bound O(n−4/3 ) that does not. 1

3 0.23025452 56 nips-2011-Committing Bandits

Author: Loc X. Bui, Ramesh Johari, Shie Mannor

Abstract: We consider a multi-armed bandit problem where there are two phases. The first phase is an experimentation phase where the decision maker is free to explore multiple options. In the second phase the decision maker has to commit to one of the arms and stick with it. Cost is incurred during both phases with a higher cost during the experimentation phase. We analyze the regret in this setup, and both propose algorithms and provide upper and lower bounds that depend on the ratio of the duration of the experimentation phase to the duration of the commitment phase. Our analysis reveals that if given the choice, it is optimal to experiment Θ(ln T ) steps and then commit, where T is the time horizon.

4 0.14255308 177 nips-2011-Multi-armed bandits on implicit metric spaces

Author: Aleksandrs Slivkins

Abstract: The multi-armed bandit (MAB) setting is a useful abstraction of many online learning tasks which focuses on the trade-off between exploration and exploitation. In this setting, an online algorithm has a fixed set of alternatives (“arms”), and in each round it selects one arm and then observes the corresponding reward. While the case of small number of arms is by now well-understood, a lot of recent work has focused on multi-armed bandits with (infinitely) many arms, where one needs to assume extra structure in order to make the problem tractable. In particular, in the Lipschitz MAB problem there is an underlying similarity metric space, known to the algorithm, such that any two arms that are close in this metric space have similar payoffs. In this paper we consider the more realistic scenario in which the metric space is implicit – it is defined by the available structure but not revealed to the algorithm directly. Specifically, we assume that an algorithm is given a tree-based classification of arms. For any given problem instance such a classification implicitly defines a similarity metric space, but the numerical similarity information is not available to the algorithm. We provide an algorithm for this setting, whose performance guarantees (almost) match the best known guarantees for the corresponding instance of the Lipschitz MAB problem. 1

5 0.11635502 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits

Author: Yasin Abbasi-yadkori, Csaba Szepesvári, David Tax

Abstract: We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem. In particular, we show that a simple modification of Auer’s UCB algorithm (Auer, 2002) achieves with high probability constant regret. More importantly, we modify and, consequently, improve the analysis of the algorithm for the for linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), Li et al. (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. In both cases, the improvement stems from the construction of smaller confidence sets. For their construction we use a novel tail inequality for vector-valued martingales. 1

6 0.10567774 32 nips-2011-An Empirical Evaluation of Thompson Sampling

7 0.10081046 245 nips-2011-Selecting the State-Representation in Reinforcement Learning

8 0.075931184 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits

9 0.065856911 168 nips-2011-Maximum Margin Multi-Instance Learning

10 0.063678116 61 nips-2011-Contextual Gaussian Process Bandit Optimization

11 0.059215602 137 nips-2011-Iterative Learning for Reliable Crowdsourcing Systems

12 0.057665203 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations

13 0.053999491 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries

14 0.052231736 49 nips-2011-Boosting with Maximum Adaptive Sampling

15 0.047720496 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

16 0.046033427 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning

17 0.044786017 272 nips-2011-Stochastic convex optimization with bandit feedback

18 0.043113988 208 nips-2011-Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness

19 0.042904701 220 nips-2011-Prediction strategies without loss

20 0.041713547 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.094), (1, -0.157), (2, 0.005), (3, 0.043), (4, 0.098), (5, -0.034), (6, 0.033), (7, 0.095), (8, 0.042), (9, 0.047), (10, -0.187), (11, 0.063), (12, -0.132), (13, -0.033), (14, 0.047), (15, -0.009), (16, -0.128), (17, -0.139), (18, -0.017), (19, 0.03), (20, 0.064), (21, -0.013), (22, 0.001), (23, -0.06), (24, -0.033), (25, -0.08), (26, -0.122), (27, -0.095), (28, 0.067), (29, 0.04), (30, -0.03), (31, -0.002), (32, 0.054), (33, 0.113), (34, -0.112), (35, 0.125), (36, -0.069), (37, 0.2), (38, -0.083), (39, -0.061), (40, 0.017), (41, 0.01), (42, -0.047), (43, 0.099), (44, -0.022), (45, -0.047), (46, 0.038), (47, 0.032), (48, -0.213), (49, 0.091)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96316642 175 nips-2011-Multi-Bandit Best Arm Identification

Author: Victor Gabillon, Mohammad Ghavamzadeh, Alessandro Lazaric, Sébastien Bubeck

Abstract: We study the problem of identifying the best arm in each of the bandits in a multibandit multi-armed setting. We first propose an algorithm called Gap-based Exploration (GapE) that focuses on the arms whose mean is close to the mean of the best arm in the same bandit (i.e., small gap). We then introduce an algorithm, called GapE-V, which takes into account the variance of the arms in addition to their gap. We prove an upper-bound on the probability of error for both algorithms. Since GapE and GapE-V need to tune an exploration parameter that depends on the complexity of the problem, which is often unknown in advance, we also introduce variations of these algorithms that estimate this complexity online. Finally, we evaluate the performance of these algorithms and compare them to other allocation strategies on a number of synthetic problems.

2 0.80098081 97 nips-2011-Finite Time Analysis of Stratified Sampling for Monte Carlo

Author: Alexandra Carpentier, Rémi Munos

Abstract: We consider the problem of stratified sampling for Monte-Carlo integration. We model this problem in a multi-armed bandit setting, where the arms represent the strata, and the goal is to estimate a weighted average of the mean values of the arms. We propose a strategy that samples the arms according to an upper bound on their standard deviations and compare its estimation quality to an ideal allocation that would know the standard deviations of the strata. We provide two regret analyses: a distribution� dependent bound O(n−3/2 ) that depends on a measure of the disparity of � the strata, and a distribution-free bound O(n−4/3 ) that does not. 1

3 0.75831825 56 nips-2011-Committing Bandits

Author: Loc X. Bui, Ramesh Johari, Shie Mannor

Abstract: We consider a multi-armed bandit problem where there are two phases. The first phase is an experimentation phase where the decision maker is free to explore multiple options. In the second phase the decision maker has to commit to one of the arms and stick with it. Cost is incurred during both phases with a higher cost during the experimentation phase. We analyze the regret in this setup, and both propose algorithms and provide upper and lower bounds that depend on the ratio of the duration of the experimentation phase to the duration of the commitment phase. Our analysis reveals that if given the choice, it is optimal to experiment Θ(ln T ) steps and then commit, where T is the time horizon.

4 0.53004038 32 nips-2011-An Empirical Evaluation of Thompson Sampling

Author: Olivier Chapelle, Lihong Li

Abstract: Thompson sampling is one of oldest heuristic to address the exploration / exploitation trade-off, but it is surprisingly unpopular in the literature. We present here some empirical results using Thompson sampling on simulated and real data, and show that it is highly competitive. And since this heuristic is very easy to implement, we argue that it should be part of the standard baselines to compare against. 1

5 0.48867407 177 nips-2011-Multi-armed bandits on implicit metric spaces

Author: Aleksandrs Slivkins

Abstract: The multi-armed bandit (MAB) setting is a useful abstraction of many online learning tasks which focuses on the trade-off between exploration and exploitation. In this setting, an online algorithm has a fixed set of alternatives (“arms”), and in each round it selects one arm and then observes the corresponding reward. While the case of small number of arms is by now well-understood, a lot of recent work has focused on multi-armed bandits with (infinitely) many arms, where one needs to assume extra structure in order to make the problem tractable. In particular, in the Lipschitz MAB problem there is an underlying similarity metric space, known to the algorithm, such that any two arms that are close in this metric space have similar payoffs. In this paper we consider the more realistic scenario in which the metric space is implicit – it is defined by the available structure but not revealed to the algorithm directly. Specifically, we assume that an algorithm is given a tree-based classification of arms. For any given problem instance such a classification implicitly defines a similarity metric space, but the numerical similarity information is not available to the algorithm. We provide an algorithm for this setting, whose performance guarantees (almost) match the best known guarantees for the corresponding instance of the Lipschitz MAB problem. 1

6 0.42554528 245 nips-2011-Selecting the State-Representation in Reinforcement Learning

7 0.37640047 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits

8 0.31069118 208 nips-2011-Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness

9 0.30378699 62 nips-2011-Continuous-Time Regression Models for Longitudinal Networks

10 0.26174861 60 nips-2011-Confidence Sets for Network Structure

11 0.24127211 61 nips-2011-Contextual Gaussian Process Bandit Optimization

12 0.23589191 49 nips-2011-Boosting with Maximum Adaptive Sampling

13 0.21788338 160 nips-2011-Linear Submodular Bandits and their Application to Diversified Retrieval

14 0.21300848 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

15 0.21036984 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits

16 0.20581934 299 nips-2011-Variance Penalizing AdaBoost

17 0.2034089 109 nips-2011-Greedy Model Averaging

18 0.20190495 147 nips-2011-Learning Patient-Specific Cancer Survival Distributions as a Sequence of Dependent Regressors

19 0.19479662 237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization

20 0.19355606 272 nips-2011-Stochastic convex optimization with bandit feedback


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.024), (4, 0.021), (11, 0.01), (20, 0.018), (26, 0.039), (31, 0.075), (33, 0.017), (43, 0.051), (45, 0.091), (57, 0.03), (67, 0.012), (74, 0.031), (79, 0.408), (83, 0.023), (99, 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.75597411 175 nips-2011-Multi-Bandit Best Arm Identification

Author: Victor Gabillon, Mohammad Ghavamzadeh, Alessandro Lazaric, Sébastien Bubeck

Abstract: We study the problem of identifying the best arm in each of the bandits in a multibandit multi-armed setting. We first propose an algorithm called Gap-based Exploration (GapE) that focuses on the arms whose mean is close to the mean of the best arm in the same bandit (i.e., small gap). We then introduce an algorithm, called GapE-V, which takes into account the variance of the arms in addition to their gap. We prove an upper-bound on the probability of error for both algorithms. Since GapE and GapE-V need to tune an exploration parameter that depends on the complexity of the problem, which is often unknown in advance, we also introduce variations of these algorithms that estimate this complexity online. Finally, we evaluate the performance of these algorithms and compare them to other allocation strategies on a number of synthetic problems.

2 0.69369048 85 nips-2011-Emergence of Multiplication in a Biophysical Model of a Wide-Field Visual Neuron for Computing Object Approaches: Dynamics, Peaks, & Fits

Author: Matthias S. Keil

Abstract: Many species show avoidance reactions in response to looming object approaches. In locusts, the corresponding escape behavior correlates with the activity of the lobula giant movement detector (LGMD) neuron. During an object approach, its firing rate was reported to gradually increase until a peak is reached, and then it declines quickly. The η-function predicts that the LGMD activity is a product ˙ between an exponential function of angular size exp(−Θ) and angular velocity Θ, and that peak activity is reached before time-to-contact (ttc). The η-function has become the prevailing LGMD model because it reproduces many experimental observations, and even experimental evidence for the multiplicative operation was reported. Several inconsistencies remain unresolved, though. Here we address ˙ these issues with a new model (ψ-model), which explicitly connects Θ and Θ to biophysical quantities. The ψ-model avoids biophysical problems associated with implementing exp(·), implements the multiplicative operation of η via divisive inhibition, and explains why activity peaks could occur after ttc. It consistently predicts response features of the LGMD, and provides excellent fits to published experimental data, with goodness of fit measures comparable to corresponding fits with the η-function. 1 Introduction: τ and η Collision sensitive neurons were reported in species such different as monkeys [5, 4], pigeons [36, 34], frogs [16, 20], and insects [33, 26, 27, 10, 38]. This indicates a high ecological relevance, and raises the question about how neurons compute a signal that eventually triggers corresponding movement patterns (e.g. escape behavior or interceptive actions). Here, we will focus on visual stimulation. Consider, for simplicity, a circular object (diameter 2l), which approaches the eye at a collision course with constant velocity v. If we do not have any a priori knowledge about the object in question (e.g. its typical size or speed), then we will be able to access only two information sources. These information sources can be measured at the retina and are called optical variables (OVs). The first is the visual angle Θ, which can be derived from the number of stimulated photore˙ ˙ ceptors (spatial contrast). The second is its rate of change dΘ(t)/dt ≡ Θ(t). Angular velocity Θ is related to temporal contrast. ˙ How should we combine Θ and Θ in order to track an imminent collision? The perhaps simplest ˙ combination is τ (t) ≡ Θ(t)/Θ(t) [13, 18]. If the object hit us at time tc , then τ (t) ≈ tc − t will ∗ Also: www.ir3c.ub.edu, Research Institute for Brain, Cognition, and Behaviour (IR3C) Edifici de Ponent, Campus Mundet, Universitat de Barcelona, Passeig Vall d’Hebron, 171. E-08035 Barcelona 1 give us a running estimation of the time that is left until contact1 . Moreover, we do not need to know anything about the approaching object: The ttc estimation computed by τ is practically independent of object size and velocity. Neurons with τ -like responses were indeed identified in the nucleus retundus of the pigeon brain [34]. In humans, only fast interceptive actions seem to rely exclusively on τ [37, 35]. Accurate ttc estimation, however, seems to involve further mechanisms (rate of disparity change [31]). ˙ Another function of OVs with biological relevance is η ≡ Θ exp(−αΘ), with α = const. [10]. While η-type neurons were found again in pigeons [34] and bullfrogs [20], most data were gathered from the LGMD2 in locusts (e.g. [10, 9, 7, 23]). The η-function is a phenomenological model for the LGMD, and implies three principal hypothesis: (i) An implementation of an exponential function exp(·). Exponentation is thought to take place in the LGMD axon, via active membrane conductances [8]. Experimental data, though, seem to favor a third-power law rather than exp(·). (ii) The LGMD carries out biophysical computations for implementing the multiplicative operation. It has been suggested that multiplication is done within the LGMD itself, by subtracting the loga˙ rithmically encoded variables log Θ − αΘ [10, 8]. (iii) The peak of the η-function occurs before ˆ ttc, at visual angle Θ(t) = 2 arctan(1/α) [9]. It follows ttc for certain stimulus configurations (e.g. ˆ l/|v| 5ms). In principle, t > tc can be accounted for by η(t + δ) with a fixed delay δ < 0 (e.g. −27ms). But other researchers observed that LGMD activity continuous to rise after ttc even for l/|v| 5ms [28]. These discrepancies remain unexplained so far [29], but stimulation dynamics perhaps plays a role. We we will address these three issues by comparing the novel function “ψ” with the η-function. LGMD computations with the ψ-function: No multiplication, no exponentiation 2 A circular object which starts its approach at distance x0 and with speed v projects a visual angle Θ(t) = 2 arctan[l/(x0 − vt)] on the retina [34, 9]. The kinematics is hence entirely specified by the ˙ half-size-to-velocity ratio l/|v|, and x0 . Furthermore, Θ(t) = 2lv/((x0 − vt)2 + l2 ). In order to define ψ, we consider at first the LGMD neuron as an RC-circuit with membrane potential3 V [17] dV Cm = β (Vrest − V ) + gexc (Vexc − V ) + ginh (Vinh − V ) (1) dt 4 Cm = membrane capacity ; β ≡ 1/Rm denotes leakage conductance across the cell membrane (Rm : membrane resistance); gexc and ginh are excitatory and inhibitory inputs. Each conductance gi (i = exc, inh ) can drive the membrane potential to its associated reversal potential Vi (usually Vinh ≤ Vexc ). Shunting inhibition means Vinh = Vrest . Shunting inhibition lurks “silently” because it gets effective only if the neuron is driven away from its resting potential. With synaptic input, the neuron decays into its equilibrium state Vrest β + Vexc gexc + Vinh ginh V∞ ≡ (2) β + gexc + ginh according to V (t) = V∞ (1 − exp(−t/τm )). Without external input, V (t 1) → Vrest . The time scale is set by τm . Without synaptic input τm ≡ Cm /β. Slowly varying inputs gexc , ginh > 0 modify the time scale to approximately τm /(1 + (gexc + ginh )/β). For highly dynamic inputs, such as in late phase of the object approach, the time scale gets dynamical as well. The ψ-model assigns synaptic inputs5 ˙ ˙ ˙ ˙ gexc (t) = ϑ(t), ϑ(t) = ζ1 ϑ(t − ∆tstim ) + (1 − ζ1 )Θ(t) (3a) e ginh (t) = [γϑ(t)] , ϑ(t) = ζ0 ϑ(t − ∆tstim ) + (1 − ζ0 )Θ(t) 1 (3b) This linear approximation gets worse with increasing Θ, but turns out to work well until short before ttc (τ adopts a minimum at tc − 0.428978 · l/|v|). 2 LGMD activity is usually monitored via its postsynaptic neuron, the Descending Contralateral Movement Detector (DCMD) neuron. This represents no problem as LGMD spikes follow DCMD spikes 1:1 under visual stimulation [22] from 300Hz [21] to at least 400Hz [24]. 3 Here we assume that the membrane potential serves as a predictor for the LGMD’s mean firing rate. 4 Set to unity for all simulations. 5 LGMD receives also inhibition from a laterally acting network [21]. The η-function considers only direct feedforward inhibition [22, 6], and so do we. 2 Θ ∈ [7.63°, 180.00°[ temporal resolution ∆ tstim=1.0ms l/|v|=20.00ms, β=1.00, γ=7.50, e=3.00, ζ0=0.90, ζ1=0.99, nrelax=25 0.04 scaled dΘ/dt continuous discretized 0.035 0.03 Θ(t) (input) ϑ(t) (filtered) voltage V(t) (output) t = 56ms max t =300ms c 0.025 0 10 2 η(t): α=3.29, R =1.00 n =10 → t =37ms log Θ(t) amplitude relax max 0.02 0.015 0.01 0.005 0 −0.005 0 50 100 150 200 250 300 −0.01 0 350 time [ms] 50 100 150 200 250 300 350 time [ms] (b) ψ versus η (a) discretized optical variables Figure 1: (a) The continuous visual angle of an approaching object is shown along with its discretized version. Discretization transforms angular velocity from a continuous variable into a series of “spikes” (rescaled). (b) The ψ function with the inputs shown in a, with nrelax = 25 relaxation time steps. Its peak occurs tmax = 56ms before ttc (tc = 300ms). An η function (α = 3.29) that was fitted to ψ shows good agreement. For continuous optical variables, the peak would occur 4ms earlier, and η would have α = 4.44 with R2 = 1. For nrelax = 10, ψ is farther away from its equilibrium at V∞ , and its peak moves 19ms closer to ttc. t =500ms, dia=12.0cm, ∆t c =1.00ms, dt=10.00µs, discrete=1 stim 250 n relax = 50 2 200 α=4.66, R =0.99 [normal] n = 25 relax 2 α=3.91, R =1.00 [normal] n =0 relax tmax [ms] 150 2 α=1.15, R =0.99 [normal] 100 50 0 β=1.00, γ=7.50, e=3.00, V =−0.001, ζ =0.90, ζ =0.99 inh −50 5 10 15 20 25 30 0 35 1 40 45 50 l/|v| [ms] (a) different nrelax (b) different ∆tstim ˆ ˆ Figure 2: The figures plot the relative time tmax ≡ tc − t of the response peak of ψ, V (t), as a function of half-size-to-velocity ratio (points). Line fits with slope α and intercept δ were added (lines). The predicted linear relationship in all cases is consistent with experimental evidence [9]. (a) The stimulus time scale is held constant at ∆tstim = 1ms, and several LGMD time scales are defined by nrelax (= number of intercalated relaxation steps for each integration time step). Bigger values of nrelax move V (t) closer to its equilibrium V∞ (t), implying higher slopes α in turn. (b) LGMD time scale is fixed at nrelax = 25, and ∆tstim is manipulated. Because of the discretization of optical variables (OVs) in our simulation, increasing ∆tstim translates to an overall smaller number of jumps in OVs, but each with higher amplitude. Thus, we say ψ(t) ≡ V (t) if and only if gexc and ginh are defined with the last equation. The time ˙ scale of stimulation is defined by ∆tstim (by default 1ms). The variables ϑ and ϑ are lowpass filtered angular size and rate of expansion, respectively. The amount of filtering is defined by memory constants ζ0 and ζ1 (no filtering if zero). The idea is to continue with generating synaptic input ˙ after ttc, where Θ(t > tc ) = const and thus Θ(t > tc ) = 0. Inhibition is first weighted by γ, and then potentiated by the exponent e. Hodgkin-Huxley potentiates gating variables n, m ∈ [0, 1] instead (potassium ∝ n4 , sodium ∝ m3 , [12]) and multiplies them with conductances. Gabbiani and co-workers found that the function which transforms membrane potential to firing rate is better described by a power function with e = 3 than by exp(·) (Figure 4d in [8]). 3 Dynamics of the ψ-function 3 Discretization. In a typical experiment, a monitor is placed a short distance away from the insect’s eye, and an approaching object is displayed. Computer screens have a fixed spatial resolution, and as a consequence size increments of the displayed object proceed in discrete jumps. The locust retina is furthermore composed of a discrete array of ommatidia units. We therefore can expect a corresponding step-wise increment of Θ with time, although optical and neuronal filtering may ˙ smooth Θ to some extent again, resulting in ϑ (figure 1). Discretization renders Θ discontinuous, ˙ For simulating the dynamics of ψ, we discretized angular size what again will be alleviated in ϑ. ˙ with floor(Θ), and Θ(t) ≈ [Θ(t + ∆tstim ) − Θ(t)]/∆tstim . Discretized optical variables (OVs) were re-normalized to match the range of original (i.e. continuous) OVs. To peak, or not to peak? Rind & Simmons reject the hypothesis that the activity peak signals impending collision on grounds of two arguments [28]: (i) If Θ(t + ∆tstim ) − Θ(t) 3o in consecutively displayed stimulus frames, the illusion of an object approach would be lost. Such stimulation would rather be perceived as a sequence of rapidly appearing (but static) objects, causing reduced responses. (ii) After the last stimulation frame has been displayed (that is Θ = const), LGMD responses keep on building up beyond ttc. This behavior clearly depends on l/|v|, also according to their own data (e.g. Figure 4 in [26]): Response build up after ttc is typically observed for suffi˙ ciently small values of l/|v|. Input into ψ in situations where Θ = const and Θ = 0, respectively, ˙ is accommodated by ϑ and ϑ, respectively. We simulated (i) by setting ∆tstim = 5ms, thus producing larger and more infrequent jumps in discrete OVs than with ∆tstim = 1ms (default). As a consequence, ϑ(t) grows more slowly (deˆ layed build up of inhibition), and the peak occurs later (tmax ≡ tc − t = 10ms with everything else ˆ ˆ identical with figure 1b). The peak amplitude V = V (t) decreases nearly sixfold with respect to default. Our model thus predicts the reduced responses observed by Rind & Simmons [28]. Linearity. Time of peak firing rate is linearly related to l/|v| [10, 9]. The η-function is consistent ˆ with this experimental evidence: t = tc − αl/|v| + δ (e.g. α = 4.7, δ = −27ms). The ψ-function reproduces this relationship as well (figure 2), where α depends critically on the time scale of biophysical processes in the LGMD. We studied the impact of this time scale by choosing 10µs for the numerical integration of equation 1 (algorithm: 4th order Runge-Kutta). Apart from improving the numerical stability of the integration algorithm, ψ is far from its equilibrium V∞ (t) in every moment ˙ t, given the stimulation time scale ∆tstim = 1ms 6 . Now, at each value of Θ(t) and Θ(t), respectively, we intercalated nrelax iterations for integrating ψ. Each iteration takes V (t) asymptotically closer to V∞ (t), and limnrelax 1 V (t) = V∞ (t). If the internal processes in the LGMD cannot keep up with stimulation (nrelax = 0), we obtain slopes values that underestimate experimentally found values (figure 2a). In contrast, for nrelax 25 we get an excellent agreement with the experimentally determined α. This means that – under the reported experimental stimulation conditions (e.g. [9]) – the LGMD would operate relatively close to its steady state7 . Now we fix nrelax at 25 and manipulate ∆tstim instead (figure 2b). The default value ∆tstim = 1ms corresponds to α = 3.91. Slightly bigger values of ∆tstim (2.5ms and 5ms) underestimate the experimental α. In addition, the line fits also return smaller intercept values then. We see tmax < 0 up to l/|v| ≈ 13.5ms – LGMD activity peaks after ttc! Or, in other words, LGMD activity continues to increase after ttc. In the limit, where stimulus dynamics is extremely fast, and LGMD processes are kept far from equilibrium at each instant of the approach, α gets very small. As a consequence, tmax gets largely independent of l/|v|: The activity peak would cling to tmax although we varied l/|v|. 4 Freeze! Experimental data versus steady state of “psi” In the previous section, experimentally plausible values for α were obtained if ψ is close to equilibrium at each instant of time during stimulation. In this section we will thus introduce a steady-state 6 Assuming one ∆tstim for each integration time step. This means that by default stimulation and biophysical dynamics will proceed at identical time scales. 7 Notice that in this moment we can only make relative statements - we do not have data at hand for defining absolute time scales 4 tc=500ms, v=2.00m/s ψ∞ → (β varies), γ=3.50, e=3.00, Vinh=−0.001 tc=500ms, v=2.00m/s ψ∞ → β=2.50, γ=3.50, (e varies), Vinh=−0.001 300 tc=500ms, v=2.00m/s ψ∞ → β=2.50, (γ varies), e=3.00, Vinh=−0.001 350 300 β=10.00 β=5.00 norm. rmse = 0.058...0.153 correlation (β,α)=−0.90 (n=4) ∞ β=1.00 e=4.00 norm. |η−ψ | = 0.009...0.114 e=3.00 300 norm. rmse = 0.014...0.160 correlation (e,α)=0.98 (n=4) ∞ e=2.50 250 250 norm. |η−ψ | = 0.043...0.241 ∞ norm. rmse = 0.085...0.315 correlation (γ,α)=1.00 (n=5) 150 tmax [ms] 200 tmax [ms] 200 tmax [ms] γ=5.00 γ=2.50 γ=1.00 γ=0.50 γ=0.25 e=5.00 norm. |η−ψ | = 0.020...0.128 β=2.50 250 200 150 100 150 100 100 50 50 50 0 5 10 15 20 25 30 35 40 45 0 5 50 10 15 20 l/|v| [ms] 25 30 35 40 45 0 5 50 10 15 20 l/|v| [ms] (a) β varies 25 30 35 40 45 50 l/|v| [ms] (b) e varies (c) γ varies ˆ ˆ Figure 3: Each curve shows how the peak ψ∞ ≡ ψ∞ (t) depends on the half-size-to-velocity ratio. In each display, one parameter of ψ∞ is varied (legend), while the others are held constant (figure title). Line slopes vary according to parameter values. Symbol sizes are scaled according to rmse (see also figure 4). Rmse was calculated between normalized ψ∞ (t) & normalized η(t) (i.e. both functions ∈ [0, 1] with original minimum and maximum indicated by the textbox). To this end, the ˆ peak of the η-function was placed at tc , by choosing, at each parameter value, α = |v| · (tc − t)/l (for determining correlation, the mean value of α was taken across l/|v|). tc=500ms, v=2.00m/s ψ∞ → (β varies), γ=3.50, e=3.00, Vinh=−0.001 tc=500ms, v=2.00m/s ψ∞ → β=2.50, γ=3.50, (e varies), Vinh=−0.001 tc=500ms, v=2.00m/s ψ∞ → β=2.50, (γ varies), e=3.00, Vinh=−0.001 0.25 β=5.00 0.12 β=2.50 β=1.00 0.1 0.08 (normalized η, ψ∞) 0.12 β=10.00 (normalized η, ψ∞) (normalized η, ψ∞) 0.14 0.1 0.08 γ=5.00 γ=2.50 0.2 γ=1.00 γ=0.50 γ=0.25 0.15 0.06 0.04 0.02 0 5 10 15 20 25 30 35 40 45 50 meant |η(t)−ψ∞(t)| meant |η(t)−ψ∞(t)| meant |η(t)−ψ∞(t)| 0.06 0.04 e=5.00 e=4.00 e=3.00 0.02 e=2.50 10 l/|v| [ms] 15 20 25 30 35 40 45 50 l/|v| [ms] (a) β varies (b) e varies 0.1 0.05 0 5 10 15 20 25 30 35 40 45 50 l/|v| [ms] (c) γ varies Figure 4: This figure complements figure 3. It visualizes the time averaged absolute difference between normalized ψ∞ (t) & normalized η(t). For η, its value of α was chosen such that the maxima of both functions coincide. Although not being a fit, it gives a rough estimate on how the shape of both curves deviate from each other. The maximum possible difference would be one. version of ψ (i.e. equation 2 with Vrest = 0, Vexc = 1, and equations 3 plugged in), ψ∞ (t) ≡ e ˙ Θ(t) + Vinh [γΘ(t)] e ˙ β + Θ(t) + [γΘ(t)] (4) (Here we use continuous versions of angular size and rate of expansion). The ψ∞ -function makes life easier when it comes to fitting experimental data. However, it has its limitations, because we brushed the whole dynamic of ψ under the carpet. Figure 3 illustrates how the linˆ ear relationship (=“linearity”) between tmax ≡ tc − t and l/|v| is influenced by changes in parameter values. Changing any of the values of e, β, γ predominantly causes variation in line slopes. The smallest slope changes are obtained by varying Vinh (data not shown; we checked Vinh = 0, −0.001, −0.01, −0.1). For Vinh −0.01, linearity is getting slightly compromised, as slope increases with l/|v| (e.g. Vinh = −1 α ∈ [4.2, 4.7]). In order to get a notion about how well the shape of ψ∞ (t) matches η(t), we computed timeaveraged difference measures between normalized versions of both functions (details: figure 3 & 4). Bigger values of β match η better at smaller, but worse at bigger values of l/|v| (figure 4a). Smaller β cause less variation across l/|v|. As to variation of e, overall curve shapes seem to be best aligned with e = 3 to e = 4 (figure 4b). Furthermore, better matches between ψ∞ (t) and η(t) correspond to bigger values of γ (figure 4c). And finally, Vinh marches again to a different tune (data not shown). Vinh = −0.1 leads to the best agreement (≈ 0.04 across l/|v|) of all Vinh , quite different from the other considered values. For the rest, ψ∞ (t) and η(t) align the same (all have maximum 0.094), 5 ˙ (a) Θ = 126o /s ˙ (b) Θ = 63o /s Figure 5: The original data (legend label “HaGaLa95”) were resampled from ref. [10] and show ˙ DCMD responses to an object approach with Θ = const. Thus, Θ increases linearly with time. The η-function (fitting function: Aη(t+δ)+o) and ψ∞ (fitting function: Aψ∞ (t)+o) were fitted to these data: (a) (Figure 3 Di in [10]) Good fits for ψ∞ are obtained with e = 5 or higher (e = 3 R2 = 0.35 and rmse = 0.644; e = 4 R2 = 0.45 and rmse = 0.592). “Psi” adopts a sigmoid-like curve form which (subjectively) appears to fit the original data better than η. (b) (Figure 3 Dii in [10]) “Psi” yields an excellent fit for e = 3. RoHaTo10 gregarious locust LV=0.03s Θ(t), lv=30ms e011pos014 sgolay with 100 t =107ms max ttc=5.00s ψ adj.R2 0.95 (LM:3) ∞ η(t) adj.R2 1 (TR::1) 2 ψ : R =0.95, rmse=0.004, 3 coefficients ∞ → β=2.22, γ=0.70, e=3.00, V =−0.001, A=0.07, o=0.02, δ=0.00ms inh η: R2=1.00, rmse=0.001 → α=3.30, A=0.08, o=0.0, δ=−10.5ms 3.4 3.6 3.8 4 4.2 4.4 4.6 4.8 5 5.2 time [s] (b) α versus β (a) spike trace Figure 6: (a) DCMD activity in response to a black square (l/|v| = 30ms, legend label “e011pos14”, ref. [30]) approaching to the eye center of a gregarious locust (final visual angle 50o ). Data show the first stimulation so habituation is minimal. The spike trace (sampled at 104 Hz) was full wave rectified, lowpass filtered, and sub-sampled to 1ms resolution. Firing rate was estimated with Savitzky-Golay filtering (“sgolay”). The fits of the η-function (Aη(t + δ) + o; 4 coefficients) and ψ∞ -function (Aψ∞ (t) with fixed e, o, δ, Vinh ; 3 coefficients) provide both excellent fits to firing rate. (b) Fitting coefficient α (→ η-function) inversely correlates with β (→ ψ∞ ) when fitting firing rates of another 5 trials as just described (continuous line = line fit to the data points). Similar correlation values would be obtained if e is fixed at values e = 2.5, 4, 5 c = −0.95, −0.96, −0.91. If o was determined by the fitting algorithm, then c = −0.70. No clear correlations with α were obtained for γ. despite of covering different orders of magnitude with Vinh = 0, −0.001, −0.01. Decelerating approach. Hatsopoulos et al. [10] recorded DCMD activity in response to an ap˙ proaching object which projected image edges on the retina moving at constant velocity: Θ = const. ˙ This “linear approach” is perceived as if the object is getting increasingly implies Θ(t) = Θ0 + Θt. slower. But what appears a relatively unnatural movement pattern serves as a test for the functions η & ψ∞ . Figure 5 illustrates that ψ∞ passes the test, and consistently predicts that activity sharply rises in the initial approach phase, and subsequently declines (η passed this test already in the year 1995). 6 Spike traces. We re-sampled about 30 curves obtained from LGMD recordings from a variety of publications, and fitted η & ψ∞ -functions. We cannot show the results here, but in terms of goodness of fit measures, both functions are in the same ballbark. Rather, figure 6a shows a representative example [30]. When α and β are plotted against each other for five trials, we see a strong inverse correlation (figure 6b). Although five data points are by no means a firm statistical sample, the strong correlation could indicate that β and α play similar roles in both functions. Biophysically, β is the leakage conductance, which determines the (passive) membrane time constant τm ∝ 1/β of the neuron. Voltage drops within τm to exp(−1) times its initial value. Bigger values of β mean shorter τm (i.e., “faster neurons”). Getting back to η, this would suggest α ∝ τm , such that higher (absolute) values for α would possibly indicate a slower dynamic of the underlying processes. 5 Discussion (“The Good, the Bad, and the Ugly”) Up to now, mainly two classes of LGMD models existed: The phenomenological η-function on the one hand, and computational models with neuronal layers presynaptic to the LGMD on the other (e.g. [25, 15]; real-world video sequences & robotics: e.g. [3, 14, 32, 2]). Computational models predict that LGMD response features originate from excitatory and inhibitory interactions in – and between – presynaptic neuronal layers. Put differently, non-linear operations are generated in the presynaptic network, and can be a function of many (model) parameters (e.g. synaptic weights, time constants, etc.). In contrast, the η-function assigns concrete nonlinear operations to the LGMD [7]. The η-function is accessible to mathematical analysis, whereas computational models have to be probed with videos or artificial stimulus sequences. The η-function is vague about biophysical parameters, whereas (good) computational models need to be precise at each (model) parameter value. The η-function establishes a clear link between physical stimulus attributes and LGMD activity: It postulates what is to be computed from the optical variables (OVs). But in computational models, such a clear understanding of LGMD inputs cannot always be expected: Presynaptic processing may strongly transform OVs. The ψ function thus represents an intermediate model class: It takes OVs as input, and connects them with biophysical parameters of the LGMD. For the neurophysiologist, the situation could hardly be any better. Psi implements the multiplicative operation of the η-function by shunting inhibition (equation 1: Vexc ≈ Vrest and Vinh ≈ Vrest ). The η-function fits ψ very well according to our dynamical simulations (figure 1), and satisfactory by the approximate criterion of figure 4. We can conclude that ψ implements the η-function in a biophysically plausible way. However, ψ does neither explicitly specify η’s multiplicative operation, nor its exponential function exp(·). Instead we have an interaction between shunting inhibition and a power law (·)e , with e ≈ 3. So what about power laws in neurons? Because of e > 1, we have an expansive nonlinearity. Expansive power-law nonlinearities are well established in phenomenological models of simple cells of the primate visual cortex [1, 11]. Such models approximate a simple cell’s instantaneous firing rate r from linear filtering of a stimulus (say Y ) by r ∝ ([Y ]+ )e , where [·]+ sets all negative values to zero and lets all positive pass. Although experimental evidence favors linear thresholding operations like r ∝ [Y − Ythres ]+ , neuronal responses can behave according to power law functions if Y includes stimulus-independent noise [19]. Given this evidence, the power-law function of the inhibitory input into ψ could possibly be interpreted as a phenomenological description of presynaptic processes. The power law would also be the critical feature by means of which the neurophysiologist could distinguish between the η function and ψ. A study of Gabbiani et al. aimed to provide direct evidence for a neuronal implementation of the η-function [8]. Consequently, the study would be an evidence ˙ for a biophysical implementation of “direct” multiplication via log Θ − αΘ. Their experimental evidence fell somewhat short in the last part, where “exponentation through active membrane conductances” should invert logarithmic encoding. Specifically, the authors observed that “In 7 out of 10 neurons, a third-order power law best described the data” (sixth-order in one animal). Alea iacta est. Acknowledgments MSK likes to thank Stephen M. Rogers for kindly providing the recording data for compiling figure 6. MSK furthermore acknowledges support from the Spanish Government, by the Ramon and Cajal program and the research grant DPI2010-21513. 7 References [1] D.G. Albrecht and D.B. Hamilton, Striate cortex of monkey and cat: contrast response function, Journal of Neurophysiology 48 (1982), 217–237. [2] S. Bermudez i Badia, U. Bernardet, and P.F.M.J. Verschure, Non-linear neuronal responses as an emergent property of afferent networks: A case study of the locust lobula giant movemement detector, PLoS Computational Biology 6 (2010), no. 3, e1000701. [3] M. Blanchard, F.C. Rind, and F.M.J. Verschure, Collision avoidance using a model of locust LGMD neuron, Robotics and Autonomous Systems 30 (2000), 17–38. [4] D.F. Cooke and M.S.A. Graziano, Super-flinchers and nerves of steel: Defensive movements altered by chemical manipulation of a cortical motor area, Neuron 43 (2004), no. 4, 585–593. [5] L. Fogassi, V. Gallese, L. Fadiga, G. Luppino, M. Matelli, and G. Rizzolatti, Coding of peripersonal space in inferior premotor cortex (area f4), Journal of Neurophysiology 76 (1996), 141–157. [6] F. Gabbiani, I. Cohen, and G. Laurent, Time-dependent activation of feed-forward inhibition in a looming sensitive neuron, Journal of Neurophysiology 94 (2005), 2150–2161. [7] F. Gabbiani, H.G. Krapp, N. Hatsopolous, C.H. Mo, C. Koch, and G. Laurent, Multiplication and stimulus invariance in a looming-sensitive neuron, Journal of Physiology - Paris 98 (2004), 19–34. [8] F. Gabbiani, H.G. Krapp, C. Koch, and G. Laurent, Multiplicative computation in a visual neuron sensitive to looming, Nature 420 (2002), 320–324. [9] F. Gabbiani, H.G. Krapp, and G. Laurent, Computation of object approach by a wide-field, motionsensitive neuron, Journal of Neuroscience 19 (1999), no. 3, 1122–1141. [10] N. Hatsopoulos, F. Gabbiani, and G. Laurent, Elementary computation of object approach by a wide-field visual neuron, Science 270 (1995), 1000–1003. [11] D.J. Heeger, Modeling simple-cell direction selectivity with normalized, half-squared, linear operators, Journal of Neurophysiology 70 (1993), 1885–1898. [12] A.L. Hodkin and A.F. Huxley, A quantitative description of membrane current and its application to conduction and excitation in nerve, Journal of Physiology 117 (1952), 500–544. [13] F. Hoyle, The black cloud, Pinguin Books, London, 1957. [14] M.S. Keil, E. Roca-Morena, and A. Rodr´guez-V´ zquez, A neural model of the locust visual system for ı a detection of object approaches with real-world scenes, Proceedings of the Fourth IASTED International Conference (Marbella, Spain), vol. 5119, 6-8 September 2004, pp. 340–345. [15] M.S. Keil and A. Rodr´guez-V´ zquez, Towards a computational approach for collision avoidance with ı a real-world scenes, Proceedings of SPIE: Bioengineered and Bioinspired Systems (Maspalomas, Gran Canaria, Canary Islands, Spain) (A. Rodr´guez-V´ zquez, D. Abbot, and R. Carmona, eds.), vol. 5119, ı a SPIE - The International Society for Optical Engineering, 19-21 May 2003, pp. 285–296. [16] J.G. King, J.Y. Lettvin, and E.R. Gruberg, Selective, unilateral, reversible loss of behavioral responses to looming stimuli after injection of tetrodotoxin or cadmium chloride into the frog optic nerve, Brain Research 841 (1999), no. 1-2, 20–26. [17] C. Koch, Biophysics of computation: information processing in single neurons, Oxford University Press, New York, 1999. [18] D.N. Lee, A theory of visual control of braking based on information about time-to-collision, Perception 5 (1976), 437–459. [19] K.D. Miller and T.W. Troyer, Neural noise can explain expansive, power-law nonlinearities in neuronal response functions, Journal of Neurophysiology 87 (2002), 653–659. [20] Hideki Nakagawa and Kang Hongjian, Collision-sensitive neurons in the optic tectum of the bullfrog, rana catesbeiana, Journal of Neurophysiology 104 (2010), no. 5, 2487–2499. [21] M. O’Shea and C.H.F. Rowell, Projection from habituation by lateral inhibition, Nature 254 (1975), 53– 55. [22] M. O’Shea and J.L.D. Williams, The anatomy and output connection of a locust visual interneurone: the lobula giant movement detector (lgmd) neurone, Journal of Comparative Physiology 91 (1974), 257–266. [23] S. Peron and F. Gabbiani, Spike frequency adaptation mediates looming stimulus selectivity, Nature Neuroscience 12 (2009), no. 3, 318–326. [24] F.C. Rind, A chemical synapse between two motion detecting neurones in the locust brain, Journal of Experimental Biology 110 (1984), 143–167. [25] F.C. Rind and D.I. Bramwell, Neural network based on the input organization of an identified neuron signaling implending collision, Journal of Neurophysiology 75 (1996), no. 3, 967–985. 8 [26] F.C. Rind and P.J. Simmons, Orthopteran DCMD neuron: a reevaluation of responses to moving objects. I. Selective responses to approaching objects, Journal of Neurophysiology 68 (1992), no. 5, 1654–1666. [27] , Orthopteran DCMD neuron: a reevaluation of responses to moving objects. II. Critical cues for detecting approaching objects, Journal of Neurophysiology 68 (1992), no. 5, 1667–1682. [28] , Signaling of object approach by the dcmd neuron of the locust, Journal of Neurophysiology 77 (1997), 1029–1033. [29] , Reply, Trends in Neuroscience 22 (1999), no. 5, 438. [30] S.M. Roger, G.W.J. Harston, F. Kilburn-Toppin, T. Matheson, M. Burrows, F. Gabbiani, and H.G. Krapp, Spatiotemporal receptive field properties of a looming-sensitive neuron in solitarious and gregarious phases of desert locust, Journal of Neurophysiology 103 (2010), 779–792. [31] S.K. Rushton and J.P. Wann, Weighted combination of size and disparity: a computational model for timing ball catch, Nature Neuroscience 2 (1999), no. 2, 186–190. [32] Yue. S., Rind. F.C., M.S. Keil, J. Cuadri, and R. Stafford, A bio-inspired visual collision detection mechanism for cars: Optimisation of a model of a locust neuron to a novel environment, Neurocomputing 69 (2006), 1591–1598. [33] G.R. Schlotterer, Response of the locust descending movement detector neuron to rapidly approaching and withdrawing visual stimuli, Canadian Journal of Zoology 55 (1977), 1372–1376. [34] H. Sun and B.J. Frost, Computation of different optical variables of looming objects in pigeon nucleus rotundus neurons, Nature Neuroscience 1 (1998), no. 4, 296–303. [35] J.R. Tresilian, Visually timed action: time-out for ’tau’?, Trends in Cognitive Sciences 3 (1999), no. 8, 1999. [36] Y. Wang and B.J. Frost, Time to collision is signalled by neurons in the nucleus rotundus of pigeons, Nature 356 (1992), 236–238. [37] J.P. Wann, Anticipating arrival: is the tau-margin a specious theory?, Journal of Experimental Psychology and Human Perceptual Performance 22 (1979), 1031–1048. [38] M. Wicklein and N.J. Strausfeld, Organization and significance of neurons that detect change of visual depth in the hawk moth manduca sexta, The Journal of Comparative Neurology 424 (2000), no. 2, 356– 376. 9

3 0.56102973 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits

Author: Yasin Abbasi-yadkori, Csaba Szepesvári, David Tax

Abstract: We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem. In particular, we show that a simple modification of Auer’s UCB algorithm (Auer, 2002) achieves with high probability constant regret. More importantly, we modify and, consequently, improve the analysis of the algorithm for the for linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), Li et al. (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. In both cases, the improvement stems from the construction of smaller confidence sets. For their construction we use a novel tail inequality for vector-valued martingales. 1

4 0.50842679 56 nips-2011-Committing Bandits

Author: Loc X. Bui, Ramesh Johari, Shie Mannor

Abstract: We consider a multi-armed bandit problem where there are two phases. The first phase is an experimentation phase where the decision maker is free to explore multiple options. In the second phase the decision maker has to commit to one of the arms and stick with it. Cost is incurred during both phases with a higher cost during the experimentation phase. We analyze the regret in this setup, and both propose algorithms and provide upper and lower bounds that depend on the ratio of the duration of the experimentation phase to the duration of the commitment phase. Our analysis reveals that if given the choice, it is optimal to experiment Θ(ln T ) steps and then commit, where T is the time horizon.

5 0.4585298 97 nips-2011-Finite Time Analysis of Stratified Sampling for Monte Carlo

Author: Alexandra Carpentier, Rémi Munos

Abstract: We consider the problem of stratified sampling for Monte-Carlo integration. We model this problem in a multi-armed bandit setting, where the arms represent the strata, and the goal is to estimate a weighted average of the mean values of the arms. We propose a strategy that samples the arms according to an upper bound on their standard deviations and compare its estimation quality to an ideal allocation that would know the standard deviations of the strata. We provide two regret analyses: a distribution� dependent bound O(n−3/2 ) that depends on a measure of the disparity of � the strata, and a distribution-free bound O(n−4/3 ) that does not. 1

6 0.42404857 24 nips-2011-Active learning of neural response functions with Gaussian processes

7 0.41250703 177 nips-2011-Multi-armed bandits on implicit metric spaces

8 0.40803632 32 nips-2011-An Empirical Evaluation of Thompson Sampling

9 0.36865473 61 nips-2011-Contextual Gaussian Process Bandit Optimization

10 0.36769533 208 nips-2011-Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness

11 0.35840079 245 nips-2011-Selecting the State-Representation in Reinforcement Learning

12 0.35411686 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction

13 0.34022814 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning

14 0.33826801 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions

15 0.33685037 231 nips-2011-Randomized Algorithms for Comparison-based Search

16 0.33668849 197 nips-2011-On Tracking The Partition Function

17 0.33629403 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

18 0.33612692 258 nips-2011-Sparse Bayesian Multi-Task Learning

19 0.33577999 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines

20 0.3352437 22 nips-2011-Active Ranking using Pairwise Comparisons