nips nips2012 nips2012-295 knowledge-graph by maker-knowledge-mining

295 nips-2012-Risk-Aversion in Multi-armed Bandits


Source: pdf

Author: Amir Sani, Alessandro Lazaric, Rémi Munos

Abstract: Stochastic multi–armed bandits solve the Exploration–Exploitation dilemma and ultimately maximize the expected reward. Nonetheless, in many practical problems, maximizing the expected reward is not the most desirable objective. In this paper, we introduce a novel setting based on the principle of risk–aversion where the objective is to compete against the arm with the best risk–return trade–off. This setting proves to be more difficult than the standard multi-arm bandit setting due in part to an exploration risk which introduces a regret associated to the variability of an algorithm. Using variance as a measure of risk, we define two algorithms, investigate their theoretical guarantees, and report preliminary empirical results. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 In this paper, we introduce a novel setting based on the principle of risk–aversion where the objective is to compete against the arm with the best risk–return trade–off. [sent-7, score-0.593]

2 This setting proves to be more difficult than the standard multi-arm bandit setting due in part to an exploration risk which introduces a regret associated to the variability of an algorithm. [sent-8, score-1.053]

3 1 Introduction The multi–armed bandit [13] elegantly formalizes the problem of on–line learning with partial feedback, which encompasses a large number of real–world applications, such as clinical trials, online advertisements, adaptive routing, and cognitive radio. [sent-10, score-0.263]

4 In the stochastic multi–armed bandit model, a learner chooses among several arms (e. [sent-11, score-0.617]

5 At each point in time, the learner selects one arm and receives a noisy reward observation from that arm (e. [sent-16, score-1.265]

6 , patients involved in the clinical trial), the learner faces a dilemma between repeatedly exploring all arms and collecting reward information versus exploiting current reward estimates by selecting the arm with the highest estimated reward. [sent-21, score-1.346]

7 Multi–arm bandit literature typically focuses on the problem of finding a learning algorithm capable of maximizing the expected cumulative reward (i. [sent-23, score-0.441]

8 , the reward collected over n rounds averaged over all possible observation realizations), thus implying that the best arm returns the highest expected reward. [sent-25, score-0.853]

9 More generally, some applications require an effective trade–off between risk and reward. [sent-29, score-0.252]

10 Two foundational risk modeling paradigms are Expected Utility theory [12] and the historically popular and accessible Mean-Variance paradigm [10]. [sent-33, score-0.295]

11 A large part of decision–making theory focuses on defining and managing risk (see e. [sent-34, score-0.273]

12 , [9] for an introduction to risk from an expected utility theory perspective). [sent-36, score-0.309]

13 In particular, [8] showed that in general, although it is possible to achieve a small regret w. [sent-40, score-0.429]

14 [16] studied the case of pure risk minimization (notably variance minimization) in an on-line setting where at each step the learner is given a covariance matrix and must choose a weight vector that minimizes the variance. [sent-45, score-0.488]

15 The regret is then computed over the horizon and compared to the fixed weights minimizing the variance in hindsight. [sent-46, score-0.623]

16 In the multi–arm bandit domain, the most interesting results are by [5] and [14]. [sent-47, score-0.235]

17 [5] introduced an analysis of the expected regret and its distribution, revealing that an anytime version of UCB [6] and UCB-V might have large regret with some nonnegligible probability. [sent-48, score-0.946]

18 This analysis is further extended by [14], who derived negative results showing that no anytime algorithm can achieve both a small expected regret and exponential tails. [sent-49, score-0.919]

19 Although these results represent an important step towards the analysis of risk within bandit algorithms, they are limited to the case where an algorithm’s cumulative reward is compared to the reward obtained by pulling the arm with the highest expectation. [sent-50, score-1.471]

20 In this paper, we focus on the problem of competing against the arm with the best risk–return trade– off. [sent-51, score-0.573]

21 2 we introduce notation and define the mean–variance bandit problem. [sent-54, score-0.235]

22 2 Mean–Variance Multi–arm Bandit In this section we introduce the notation and define the mean–variance multi–arm bandit problem. [sent-62, score-0.235]

23 We consider the standard multi–arm bandit setting with K arms, each characterized by a distribution νi bounded in the interval [0, 1]. [sent-63, score-0.252]

24 The bandit problem is defined over a finite horizon of n rounds. [sent-65, score-0.269]

25 We denote by Xi,s ∼ νi the s-th random sample drawn from the distribution of arm i. [sent-66, score-0.553]

26 In the multi–arm bandit protocol, at each round t, an algorithm selects arm I_t and observes sample X_{I_t, T_{I_t,t}}, where T_{i,t} is the number of samples observed from arm i up to time t (i. [sent-68, score-1.912]

27 While in the standard bandit literature the objective is to select the arm leading to the highest reward in expectation (the arm with the largest expected value µi ), here we focus on the problem of finding the arm which effectively trades off between its expected reward (i. [sent-71, score-2.258]

28 Although a large number of models for risk–return trade–off have been proposed, here we focus on the most historically popular and simple model: the mean–variance model proposed by [10], where the return of an arm is measured by the expected reward and its risk by its variance. [sent-76, score-1.061]

29 The mean–variance of an arm i with mean µ_i, variance σ_i^2 and coefficient of absolute risk tolerance ρ is defined as MV_i = σ_i^2 − ρ µ_i. [sent-78, score-1.038]

30 Thus the optimal arm is the arm with the smallest mean-variance, that is i∗ = arg min_i MV_i. [sent-79, score-1.128]

31 We notice that we can obtain two extreme settings depending on the value of risk tolerance ρ. [sent-80, score-0.311]

32 As ρ → ∞, the mean–variance of arm i tends to the opposite of its expected value µi and the problem reduces to the standard expected reward maximization traditionally considered in multi–arm bandit problems. [sent-81, score-1.002]
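
As a quick illustration of how the criterion behaves across these two regimes, the short Python sketch below evaluates MV_i = σ_i^2 − ρµ_i for a few arms and picks i∗ = arg min_i MV_i. The means and variances are made up for the example and are not taken from the paper.

```python
import numpy as np

def mean_variance(mu, sigma2, rho):
    """Mean-variance criterion MV_i = sigma_i^2 - rho * mu_i for each arm."""
    return sigma2 - rho * mu

# Illustrative arm parameters (not from the paper).
mu = np.array([0.5, 0.7, 0.9])         # expected rewards
sigma2 = np.array([0.05, 0.10, 0.25])  # variances

for rho in [0.0, 0.5, 10.0]:
    mv = mean_variance(mu, sigma2, rho)
    print(f"rho={rho}: MV={np.round(mv, 3)}, optimal arm i*={int(np.argmin(mv))}")
```

With ρ = 0 the lowest-variance arm is optimal (pure variance minimization), while for large ρ the criterion is dominated by −ρµ_i and the highest-mean arm wins, matching the two extremes described above.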

33 samples from the distribution ν_i, we define the empirical mean–variance of an arm i with t samples as $\hat{\mathrm{MV}}_{i,t} = \hat{\sigma}_{i,t}^2 - \rho\,\hat{\mu}_{i,t}$, where $\hat{\mu}_{i,t} = \frac{1}{t}\sum_{s=1}^{t} X_{i,s}$ and $\hat{\sigma}_{i,t}^2 = \frac{1}{t}\sum_{s=1}^{t} \big(X_{i,s} - \hat{\mu}_{i,t}\big)^2$. [sent-86, score-0.589]

34 Similar to a single arm i, we define its empirical mean–variance as $\hat{\mathrm{MV}}_n(A) = \hat{\sigma}_n^2(A) - \rho\,\hat{\mu}_n(A)$, where $\hat{\mu}_n(A) = \frac{1}{n}\sum_{t=1}^{n} Z_t$ (2) and $\hat{\sigma}_n^2(A) = \frac{1}{n}\sum_{t=1}^{n} \big(Z_t - \hat{\mu}_n(A)\big)^2$ (3). The analysis is for the pseudo–regret but it can be extended to the true regret (see Remark 2 at p. [sent-88, score-0.982]

35 The coefficient of risk tolerance is the inverse of the more popular coefficient of risk aversion A = 1/ρ. [sent-90, score-0.62]

36 This leads to a natural definition of the (random) regret at each single run of the algorithm as the difference in the mean–variance performance of the algorithm compared to the best arm. [sent-92, score-0.637]

37 The regret for a learning algorithm A over n rounds is defined as $R_n(A) = \hat{\mathrm{MV}}_n(A) - \hat{\mathrm{MV}}_{i^*,n}$. (4) [sent-94, score-0.527]

38 Given this definition, the objective is to design an algorithm whose regret decreases as the number of rounds increases (in high probability or in expectation). [sent-95, score-0.553]
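
To make these definitions concrete, here is a small, hedged Python sketch (the function names are mine, not the paper's) computing the empirical mean–variance of a reward sequence with the 1/t-normalized variance estimator used above, and the per-run regret of eq. (4) given n samples from the optimal arm:

```python
import numpy as np

def empirical_mean_variance(x, rho):
    """hat_MV = hat_sigma^2 - rho * hat_mu, using the biased (1/t) variance estimator."""
    x = np.asarray(x, dtype=float)
    mu_hat = x.mean()
    sigma2_hat = np.mean((x - mu_hat) ** 2)
    return sigma2_hat - rho * mu_hat

def run_regret(algo_rewards, best_arm_samples, rho):
    """(Random) regret of one run, as in eq. (4): hat_MV_n(A) - hat_MV_{i*,n}.

    `algo_rewards` is the sequence Z_1..Z_n actually received by the algorithm;
    `best_arm_samples` are n samples drawn from the optimal arm i*."""
    return (empirical_mean_variance(algo_rewards, rho)
            - empirical_mean_variance(best_arm_samples, rho))
```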

39 This matches the definition of true regret in standard bandits (see e. [sent-98, score-0.466]

40 According to the definition of MV-LCB, the index B_{i,s} would simply reduce to B_{i,s} = √(log(1/δ)/s), thus forcing the algorithm to pull both arms uniformly (i. [sent-103, score-0.385]

41 Since the arms have the same variance, there is no direct regret in pulling either one or the other. [sent-106, score-0.911]

42 In this case, the algorithm suffers a constant (true) regret $R_n(\text{MV-LCB}) = 0 + \frac{T_{1,n}\,T_{2,n}}{n^2}\,\Gamma^2 = \frac{1}{4}\Gamma^2$, independent from the number of rounds n. [sent-108, score-0.558]

43 This argument can be generalized to multiple arms and ρ = 0, since it is always possible to design an environment (i. [sent-109, score-0.41]

44 In fact, two arms with the same mean– variance are likely to produce similar observations, thus leading MV-LCB to pull the two arms repeatedly over time, since the algorithm is designed to try to discriminate between similar arms. [sent-113, score-0.928]

45 Although this behavior does not suffer from any regret in pulling the “suboptimal” arm (the two arms are equivalent), it does introduce an additional variance, due to the difference in the means of the arms (Γ ≠ 0), which finally leads to a regret the algorithm is not “aware” of. [sent-114, score-2.237]
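
The source of this extra term can be made explicit with a standard mixture computation (a sketch of the intuition, not a quotation from the paper): if a fraction p = T_{1,n}/n of the rewards in a run comes from arm 1 and the remaining fraction 1 − p from arm 2, the empirical variance of the pooled sequence concentrates around

```latex
p\,\sigma_1^{2} + (1-p)\,\sigma_2^{2} + p\,(1-p)\,\Gamma^{2},
\qquad \Gamma = \mu_1 - \mu_2 .
```

Even when σ_1^2 = σ_2^2 and neither arm is suboptimal, the cross term p(1−p)Γ^2 survives; with p ≈ 1/2 it equals Γ^2/4, which is exactly the constant regret computed above.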

46 This is particularly interesting since it reveals a huge gap between the mean–variance and the standard expected regret minimization problem and will be further investigated in the numerical simulations in Sect. [sent-116, score-0.532]

47 In fact, UCB is known to have a worst–case regret of Ω(1/√n) [3], while in the worst case, MV-LCB suffers a constant regret. [sent-118, score-0.511]

48 During the first phase all the arms are explored uniformly, thus collecting τ/K samples each. [sent-121, score-0.407]

49 Once the exploration phase is over, the mean–variance of each arm is computed and the arm with the smallest estimated mean–variance MV_{i,τ/K} is repeatedly pulled until the end. [sent-122, score-1.288]

50 The MV-LCB is specifically designed to minimize the probability of pulling the wrong arms, so whenever there are two equivalent arms (i. [sent-123, score-0.482]

51 , arms with the same mean–variance), the algorithm tends to pull them the same number of times, at the cost of potentially introducing an additional variance which might result in a constant regret. [sent-125, score-0.591]
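
For concreteness, here is a minimal Python sketch of an MV-LCB-style strategy: pull the arm whose lower confidence bound on the empirical mean–variance is smallest. The confidence width c·√(log(1/δ)/s) and the constant c are generic placeholders; the exact index used in the paper (which also depends on ρ) is not reproduced here.

```python
import numpy as np

def mv_lcb(pull, K, n, rho, delta=0.01, c=1.0):
    """Sketch of an MV-LCB-style strategy; `pull(i)` returns one sample from arm i.

    The width c * sqrt(log(1/delta) / s) is a placeholder, not the paper's exact index."""
    samples = [[] for _ in range(K)]
    rewards = []
    for t in range(n):
        if t < K:
            i = t  # initialization: pull each arm once
        else:
            lcb = []
            for x in samples:
                x = np.asarray(x, dtype=float)
                mv_hat = np.mean((x - x.mean()) ** 2) - rho * x.mean()
                lcb.append(mv_hat - c * np.sqrt(np.log(1.0 / delta) / len(x)))
            i = int(np.argmin(lcb))
        r = pull(i)
        samples[i].append(r)
        rewards.append(r)
    return rewards
```

On two arms with (nearly) equal mean–variance the two lower bounds stay close, so the sketch keeps alternating between them, which is the behavior that produces the exploration-risk regret discussed above.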

52 On the other hand, ExpExp stops exploring the arms after τ rounds and then elicits one arm as the best and keeps pulling it for the remaining n − τ rounds. [sent-126, score-1.172]

53 On the other hand, the second part of the regret (i. [sent-131, score-0.429]

54 , the variance of pulling arms with different means) is minimized by taking τ as small as possible (e. [sent-133, score-0.642]

55 Let ExpExp be run with τ = K(n/14)^{2/3}, then for any choice of distributions {νi} the expected regret is E[R_n(A)] ≤ 2K/n^{1/3}. [sent-138, score-0.494]
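
A corresponding sketch of ExpExp, using the exploration budget τ = K(n/14)^{2/3} from the statement above; the helper name, rounding, and edge-case handling are mine and only illustrative.

```python
import numpy as np

def exp_exp(pull, K, n, rho):
    """Explore-then-commit sketch: pull each arm tau/K times, then commit to the
    arm with the smallest empirical mean-variance for the remaining rounds."""
    def emp_mv(x):
        x = np.asarray(x, dtype=float)
        return np.mean((x - x.mean()) ** 2) - rho * x.mean()

    tau = min(n, int(K * (n / 14.0) ** (2.0 / 3.0)))
    per_arm = max(1, tau // K)
    samples = [[pull(i) for _ in range(per_arm)] for i in range(K)]
    best = int(np.argmin([emp_mv(x) for x in samples]))
    rewards = [r for x in samples for r in x]
    rewards += [pull(best) for _ in range(max(0, n - len(rewards)))]
    return rewards
```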

56 1 demonstrates that MV-LCB has a regret decreasing as O(K log(n)/n) whenever the gaps ∆ are not small compared to n, while in the remarks of Thm. [sent-142, score-0.449]

57 On the other hand, the previous bound for ExpExp is distribution independent and indicates the regret is still a decreasing function of n even in the worst case. Note that in this case (i. [sent-144, score-0.48]

58 1 does not hold, since the optimal arm is not unique. [sent-147, score-0.553]

59 Instead of a pure uniform exploration of all the arms, we could adopt a best–arm identification algorithm such as Successive Reject or UCB-E, which maximizes the probability of returning the best arm given a fixed budget of rounds τ (see e. [sent-166, score-0.76]

60 In the following graphs we study the true regret Rn (A) averaged over 500 runs. [sent-170, score-0.429]

61 We first consider the variance minimization problem (ρ = 0) with K = 2 Gaussian arms set to µ1 = 1. [sent-171, score-0.523]

62 In Figure 2 we report the true regret Rn (as in the original definition in eq. [sent-176, score-0.429]

63 1), the regret is characterized by the regret realized from pulling suboptimal arms and arms with different means (Exploration Risk) and tends to zero as n increases. [sent-182, score-1.751]

64 Indeed, if we considered two distributions with equal means (µ1 = µ2 ), the average regret coincides with R∆ . [sent-183, score-0.429]

65 1 the two regret terms decrease with the same rate O(log n/n). [sent-185, score-0.429]

66 In order to have a fair comparison, for any value of n and for each of the two algorithms, we select the pair ∆w , Γw which corresponds to the largest regret (we search in a grid of values with µ1 = 1. [sent-189, score-0.429]

67 4, while the worst–case regret of ExpExp keeps decreasing over n, it is always possible to find a problem for which the regret of MV-LCB stabilizes to a constant. [sent-201, score-0.896]

68 6 Discussion In this paper we evaluate the risk of an algorithm in terms of the variability of the sequences of samples that it actually generates. [sent-204, score-0.386]

69 Although this notion might resemble other analyses of bandit algorithms (see e. [sent-205, score-0.262]

70 Whenever a bandit algorithm is run over n rounds, its behavior, combined with the arms’ distributions, generates a probability distribution over sequences of n rewards. [sent-208, score-0.319]

71 The variance of the sequence does not coincide with the variance of the algorithm over multiple runs. [sent-210, score-0.337]

72 Let us consider a simple case with two arms that deterministically generate 0s and 1s respectively, and two different algorithms. [sent-211, score-0.344]

73 Algorithm A1 pulls the arms in a fixed sequence at each run (e. [sent-212, score-0.416]

74 , arm 1, arm 2, arm 1, arm 2, and so on), so that each arm is always pulled n/2 times. [sent-214, score-2.811]

75 Algorithm A2 chooses one arm uniformly at random at the beginning of the run and repeatedly pulls this arm for n rounds. [sent-215, score-1.2]

76 , if ρ = 0), but has no variance over multiple runs because it always generates the same sequence. [sent-222, score-0.203]

77 On the other hand, A2 has no variability in each run, since it generates sequences with only 0s or only 1s, suffers no regret in the case of variance minimization, but has high variance over multiple runs since the two completely different sequences are generated with equal probability. [sent-223, score-0.935]
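
This thought experiment is easy to reproduce numerically; in the sketch below the horizon, the number of runs, and the seed are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_runs = 100, 2000

def run_A1():
    # Alternate arm 1 (always 0) and arm 2 (always 1): the same sequence every run.
    return np.array([t % 2 for t in range(n)], dtype=float)

def run_A2():
    # Commit to one arm, chosen uniformly at random, for the whole run.
    return np.full(n, float(rng.integers(2)))

for name, run in [("A1", run_A1), ("A2", run_A2)]:
    seqs = np.array([run() for _ in range(n_runs)])
    within = np.mean(seqs.var(axis=1))   # average empirical variance within a run
    across = seqs.mean(axis=1).var()     # variance of the mean reward across runs
    print(f"{name}: within-run variance ~ {within:.3f}, across-run variance ~ {across:.3f}")
```

As expected, A1 shows within-run variability of about 1/4 but none across runs, whereas A2 shows the opposite, which is exactly the distinction between the two notions of risk drawn above.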

78 This simple example shows that an algorithm with small standard regret (e. [sent-224, score-0.429]

79 , A1 ), might generate at each run sequences with high variability, while an algorithm with small mean-variance regret (e. [sent-226, score-0.516]

80 7 Conclusions The majority of multi–armed bandit literature focuses on the problem of minimizing the regret w. [sent-229, score-0.685]

81 In this paper, we introduced a novel multi–armed bandit setting where the objective is to perform as well as the arm with the best risk–return trade–off. [sent-233, score-0.808]

82 In particular, we relied on the mean–variance model introduced in [10] to measure the performance of the arms and define the regret of a learning algorithm. [sent-234, score-0.773]

83 We show that defining the risk of a learning algorithm as the variability (i. [sent-235, score-0.319]

84 , empirical variance) of the sequence of rewards generated at each run, leads to an interesting effect on the regret where an additional algorithm variance appears. [sent-237, score-0.631]

85 We proposed two novel algorithms to solve the mean–variance bandit problem and we reported their corresponding theoretical analysis. [sent-238, score-0.235]

86 To the best of our knowledge this is the first work introducing risk–aversion in the multi–armed bandit setting and it opens a series of interesting questions. [sent-239, score-0.276]

87 2, MV-LCB has a regret of order O(√(K/n)) on easy problems and O(1) on difficult problems, while ExpExp achieves the same regret O(K/n^{1/3}) over all problems. [sent-243, score-0.858]

88 Considering alternative notions of risk is a natural extension to the previous setting. [sent-247, score-0.252]

89 From the point of view of expected utility theory, the mean–variance model is only justified under a Gaussianity assumption on the arm distributions. [sent-249, score-0.61]

90 It also violates the monotonicity condition due to the different orders of the mean and variance and is not a coherent measure of risk [2]. [sent-250, score-0.479]

91 Furthermore, the variance is a symmetric measure of risk, while it is often the case that only one–sided deviations from the mean are undesirable (e. [sent-251, score-0.224]

92 , the quantile) or Conditional Value at Risk (otherwise known as average value at risk, tail value at risk, expected shortfall and lower tail risk) or other coherent measures of risk [2]. [sent-259, score-0.374]

93 Another issue in moving from variance to other measures of risk is whether single-period or multi-period risk evaluation should be used. [sent-261, score-0.688]

94 While the single-period risk of an arm is simply the risk of its distribution, in a multi-period evaluation we consider the risk of the sum of rewards obtained by repeatedly pulling the same arm over n rounds. [sent-262, score-2.064]

95 samples is simply n times their variance, for other measures of risk (e. [sent-266, score-0.294]
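
For the variance this additivity is a one-line fact (assuming, as above, that the samples are i.i.d.), whereas no analogous identity holds in general for quantile-based or other coherent risk measures:

```latex
\mathrm{Var}\!\Big(\sum_{s=1}^{n} X_{i,s}\Big)
  \;=\; \sum_{s=1}^{n} \mathrm{Var}(X_{i,s})
  \;=\; n\,\sigma_i^{2},
  \qquad X_{i,1},\dots,X_{i,n} \ \text{i.i.d.} \sim \nu_i .
```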

96 As a result, an arm with the smallest single-period risk might not be the optimal choice over a horizon of n rounds. [sent-269, score-0.888]

97 Therefore, the performance of an algorithm should be compared to the smallest risk that can be achieved by any sequence of arms over n rounds, thus requiring a new definition of regret. [sent-270, score-0.635]

98 Finally, an interesting related problem is the simple regret setting, where the learner is allowed to explore over n rounds and only suffers a regret defined on the solution returned at the end. [sent-272, score-1.025]

99 It is known that it is possible to design algorithms able to effectively estimate the mean of the arms and finally return the best arm with high probability. [sent-273, score-1.038]

100 In the risk-return setting, the objective would be to return the arm with the best risk-return tradeoff. [sent-274, score-0.628]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('arm', 0.553), ('regret', 0.429), ('arms', 0.344), ('risk', 0.252), ('bandit', 0.235), ('expexp', 0.207), ('multi', 0.161), ('variance', 0.16), ('pulling', 0.138), ('reward', 0.121), ('armed', 0.107), ('rounds', 0.098), ('mvi', 0.094), ('exploration', 0.07), ('variability', 0.067), ('aversion', 0.066), ('return', 0.055), ('trade', 0.054), ('worst', 0.051), ('pull', 0.041), ('mean', 0.04), ('repeatedly', 0.039), ('audibert', 0.039), ('learner', 0.038), ('mvn', 0.038), ('sani', 0.038), ('bandits', 0.037), ('dilemma', 0.037), ('expected', 0.037), ('horizon', 0.034), ('tolerance', 0.033), ('sequences', 0.032), ('treatment', 0.032), ('exploitation', 0.032), ('suffers', 0.031), ('suboptimal', 0.031), ('xit', 0.029), ('bastien', 0.029), ('run', 0.028), ('clinical', 0.028), ('pulled', 0.027), ('pulls', 0.027), ('coherent', 0.027), ('cumulative', 0.027), ('might', 0.027), ('nition', 0.027), ('simulations', 0.026), ('historically', 0.026), ('ucb', 0.026), ('design', 0.026), ('notice', 0.026), ('expert', 0.026), ('mi', 0.026), ('nonetheless', 0.025), ('quantile', 0.025), ('amir', 0.025), ('nance', 0.025), ('rewards', 0.025), ('generates', 0.024), ('alt', 0.024), ('anytime', 0.024), ('phase', 0.024), ('deviations', 0.024), ('measures', 0.024), ('trades', 0.024), ('highest', 0.024), ('zt', 0.023), ('validating', 0.023), ('lazaric', 0.023), ('alessandro', 0.022), ('csaba', 0.022), ('smallest', 0.022), ('opens', 0.021), ('munos', 0.021), ('environment', 0.021), ('focuses', 0.021), ('numerical', 0.021), ('rn', 0.021), ('collecting', 0.021), ('utility', 0.02), ('rounding', 0.02), ('patients', 0.02), ('best', 0.02), ('compete', 0.02), ('remarks', 0.02), ('minimization', 0.019), ('inria', 0.019), ('keeps', 0.019), ('always', 0.019), ('tends', 0.019), ('pure', 0.019), ('remark', 0.018), ('szepesv', 0.018), ('samples', 0.018), ('characterized', 0.017), ('tail', 0.017), ('actually', 0.017), ('sequence', 0.017), ('popular', 0.017), ('antoine', 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 295 nips-2012-Risk-Aversion in Multi-armed Bandits

Author: Amir Sani, Alessandro Lazaric, Rémi Munos

Abstract: Stochastic multi–armed bandits solve the Exploration–Exploitation dilemma and ultimately maximize the expected reward. Nonetheless, in many practical problems, maximizing the expected reward is not the most desirable objective. In this paper, we introduce a novel setting based on the principle of risk–aversion where the objective is to compete against the arm with the best risk–return trade–off. This setting proves to be more difficult than the standard multi-arm bandit setting due in part to an exploration risk which introduces a regret associated to the variability of an algorithm. Using variance as a measure of risk, we define two algorithms, investigate their theoretical guarantees, and report preliminary empirical results. 1

2 0.55925465 61 nips-2012-Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence

Author: Victor Gabillon, Mohammad Ghavamzadeh, Alessandro Lazaric

Abstract: We study the problem of identifying the best arm(s) in the stochastic multi-armed bandit setting. This problem has been studied in the literature from two different perspectives: fixed budget and fixed confidence. We propose a unifying approach that leads to a meta-algorithm called unified gap-based exploration (UGapE), with a common structure and similar theoretical analysis for these two settings. We prove a performance bound for the two versions of the algorithm showing that the two problems are characterized by the same notion of complexity. We also show how the UGapE algorithm as well as its theoretical analysis can be extended to take into account the variance of the arms and to multiple bandits. Finally, we evaluate the performance of UGapE and compare it with a number of existing fixed budget and fixed confidence algorithms. 1

3 0.20439611 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization

Author: Brendan Mcmahan, Matthew Streeter

Abstract: Some of the most compelling applications of online convex optimization, including online prediction and classification, are unconstrained: the natural feasible set is Rn. Existing algorithms fail to achieve sub-linear regret in this setting unless constraints on the comparator point x̊ are known in advance. We present algorithms that, without such prior knowledge, offer near-optimal regret bounds with respect to any choice of x̊. In particular, regret with respect to x̊ = 0 is constant. We then prove lower bounds showing that our guarantees are near-optimal in this setting. 1

4 0.19861935 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

Author: Ronald Ortner, Daniil Ryabko

Abstract: We derive sublinear regret bounds for undiscounted reinforcement learning in continuous state space. The proposed algorithm combines state aggregation with the use of upper confidence bounds for implementing optimism in the face of uncertainty. Beside the existence of an optimal policy which satisfies the Poisson equation, the only assumptions made are Hölder continuity of rewards and transition probabilities. 1

5 0.1435857 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity

Author: Odalric-ambrym Maillard

Abstract: This paper aims to take a step forwards making the term “intrinsic motivation” from reinforcement learning theoretically well founded, focusing on curiosity-driven learning. To that end, we consider the setting where, a fixed partition P of a continuous space X being given, and a process ν defined on X being unknown, we are asked to sequentially decide which cell of the partition to select as well as where to sample ν in that cell, in order to minimize a loss function that is inspired from previous work on curiosity-driven learning. The loss on each cell consists of one term measuring a simple worst case quadratic sampling error, and a penalty term proportional to the range of the variance in that cell. The corresponding problem formulation extends the setting known as active learning for multi-armed bandits to the case when each arm is a continuous region, and we show how an adaptation of recent algorithms for that problem and of hierarchical optimistic sampling algorithms for optimization can be used in order to solve this problem. The resulting procedure, called Hierarchical Optimistic Region SElection driven by Curiosity (HORSE.C) is provided together with a finite-time regret analysis. 1

6 0.13199832 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

7 0.12365769 266 nips-2012-Patient Risk Stratification for Hospital-Associated C. diff as a Time-Series Classification Task

8 0.11601436 102 nips-2012-Distributed Non-Stochastic Experts

9 0.11362346 110 nips-2012-Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems

10 0.099594399 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

11 0.098694146 109 nips-2012-Efficient Monte Carlo Counterfactual Regret Minimization in Games with Many Player Actions

12 0.095391907 50 nips-2012-Bandit Algorithms boost Brain Computer Interfaces for motor-task selection of a brain-controlled button

13 0.092492126 296 nips-2012-Risk Aversion in Markov Decision Processes via Near Optimal Chernoff Bounds

14 0.08857768 261 nips-2012-Online allocation and homogeneous partitioning for piecewise constant mean-approximation

15 0.080071926 245 nips-2012-Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions

16 0.07917618 293 nips-2012-Relax and Randomize : From Value to Algorithms

17 0.077654332 348 nips-2012-Tractable Objectives for Robust Policy Optimization

18 0.067896083 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

19 0.065793283 160 nips-2012-Imitation Learning by Coaching

20 0.065507896 162 nips-2012-Inverse Reinforcement Learning through Structured Classification


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.131), (1, -0.17), (2, 0.082), (3, 0.053), (4, 0.068), (5, 0.021), (6, -0.009), (7, 0.091), (8, -0.102), (9, 0.027), (10, 0.166), (11, 0.146), (12, -0.214), (13, -0.234), (14, -0.23), (15, 0.078), (16, -0.14), (17, 0.032), (18, -0.214), (19, -0.13), (20, 0.158), (21, -0.158), (22, -0.073), (23, 0.055), (24, 0.173), (25, -0.096), (26, -0.022), (27, -0.158), (28, 0.021), (29, 0.095), (30, -0.134), (31, -0.105), (32, -0.038), (33, 0.112), (34, 0.052), (35, -0.08), (36, -0.039), (37, 0.015), (38, 0.049), (39, -0.093), (40, 0.072), (41, 0.05), (42, 0.076), (43, -0.033), (44, 0.087), (45, 0.063), (46, -0.051), (47, 0.048), (48, 0.029), (49, 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97427899 295 nips-2012-Risk-Aversion in Multi-armed Bandits

Author: Amir Sani, Alessandro Lazaric, Rémi Munos

Abstract: Stochastic multi–armed bandits solve the Exploration–Exploitation dilemma and ultimately maximize the expected reward. Nonetheless, in many practical problems, maximizing the expected reward is not the most desirable objective. In this paper, we introduce a novel setting based on the principle of risk–aversion where the objective is to compete against the arm with the best risk–return trade–off. This setting proves to be more difficult than the standard multi-arm bandit setting due in part to an exploration risk which introduces a regret associated to the variability of an algorithm. Using variance as a measure of risk, we define two algorithms, investigate their theoretical guarantees, and report preliminary empirical results. 1

2 0.92073536 61 nips-2012-Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence

Author: Victor Gabillon, Mohammad Ghavamzadeh, Alessandro Lazaric

Abstract: We study the problem of identifying the best arm(s) in the stochastic multi-armed bandit setting. This problem has been studied in the literature from two different perspectives: fixed budget and fixed confidence. We propose a unifying approach that leads to a meta-algorithm called unified gap-based exploration (UGapE), with a common structure and similar theoretical analysis for these two settings. We prove a performance bound for the two versions of the algorithm showing that the two problems are characterized by the same notion of complexity. We also show how the UGapE algorithm as well as its theoretical analysis can be extended to take into account the variance of the arms and to multiple bandits. Finally, we evaluate the performance of UGapE and compare it with a number of existing fixed budget and fixed confidence algorithms. 1

3 0.56442165 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

Author: Ronald Ortner, Daniil Ryabko

Abstract: We derive sublinear regret bounds for undiscounted reinforcement learning in continuous state space. The proposed algorithm combines state aggregation with the use of upper confidence bounds for implementing optimism in the face of uncertainty. Beside the existence of an optimal policy which satisfies the Poisson equation, the only assumptions made are Hölder continuity of rewards and transition probabilities. 1

4 0.52999383 102 nips-2012-Distributed Non-Stochastic Experts

Author: Varun Kanade, Zhenming Liu, Bozidar Radunovic

Abstract: We consider the online distributed non-stochastic experts problem, where the distributed system consists of one coordinator node that is connected to k sites, and the sites are required to communicate with each other via the coordinator. At each time-step t, one of the k site nodes has to pick an expert from the set {1, . . . , n}, and the same site receives information about payoffs of all experts for that round. The goal of the distributed system is to minimize regret at time horizon T, while simultaneously keeping communication to a minimum. The two extreme solutions to this problem are: (i) Full communication: This essentially simulates the non-distributed setting to obtain the optimal O(√(log(n)T)) regret bound at the cost of T communication. (ii) No communication: Each site runs an independent copy – the regret is O(√(log(n)kT)) and the communication is 0. This paper shows the difficulty of simultaneously achieving regret asymptotically better than √(kT) and communication better than T. We give a novel algorithm that for an oblivious adversary achieves a non-trivial trade-off: regret O(√(k^{5(1+ε)/6} T)) and communication O(T/k^ε), for any value of ε ∈ (0, 1/5). We also consider a variant of the model, where the coordinator picks the expert. In this model, we show that the label-efficient forecaster of Cesa-Bianchi et al. (2005) already gives us strategy that is near optimal in regret vs communication trade-off. 1

5 0.49724582 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization

Author: Brendan Mcmahan, Matthew Streeter

Abstract: Some of the most compelling applications of online convex optimization, including online prediction and classification, are unconstrained: the natural feasible set is Rn. Existing algorithms fail to achieve sub-linear regret in this setting unless constraints on the comparator point x̊ are known in advance. We present algorithms that, without such prior knowledge, offer near-optimal regret bounds with respect to any choice of x̊. In particular, regret with respect to x̊ = 0 is constant. We then prove lower bounds showing that our guarantees are near-optimal in this setting. 1

6 0.49181321 110 nips-2012-Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems

7 0.48382807 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

8 0.44788527 109 nips-2012-Efficient Monte Carlo Counterfactual Regret Minimization in Games with Many Player Actions

9 0.43152785 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity

10 0.34836152 261 nips-2012-Online allocation and homogeneous partitioning for piecewise constant mean-approximation

11 0.32585904 266 nips-2012-Patient Risk Stratification for Hospital-Associated C. diff as a Time-Series Classification Task

12 0.29426455 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

13 0.2923193 296 nips-2012-Risk Aversion in Markov Decision Processes via Near Optimal Chernoff Bounds

14 0.2918753 50 nips-2012-Bandit Algorithms boost Brain Computer Interfaces for motor-task selection of a brain-controlled button

15 0.25353393 348 nips-2012-Tractable Objectives for Robust Policy Optimization

16 0.24859487 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

17 0.2373925 297 nips-2012-Robustness and risk-sensitivity in Markov decision processes

18 0.23649955 212 nips-2012-Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL

19 0.2195836 222 nips-2012-Multi-Task Averaging

20 0.20183112 32 nips-2012-Active Comparison of Prediction Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.022), (1, 0.018), (17, 0.016), (21, 0.023), (36, 0.378), (38, 0.092), (42, 0.02), (53, 0.013), (54, 0.04), (55, 0.016), (74, 0.03), (76, 0.131), (80, 0.061), (92, 0.046)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.83835387 35 nips-2012-Adaptive Learning of Smoothing Functions: Application to Electricity Load Forecasting

Author: Amadou Ba, Mathieu Sinn, Yannig Goude, Pascal Pompey

Abstract: This paper proposes an efficient online learning algorithm to track the smoothing functions of Additive Models. The key idea is to combine the linear representation of Additive Models with a Recursive Least Squares (RLS) filter. In order to quickly track changes in the model and put more weight on recent data, the RLS filter uses a forgetting factor which exponentially weights down observations by the order of their arrival. The tracking behaviour is further enhanced by using an adaptive forgetting factor which is updated based on the gradient of the a priori errors. Using results from Lyapunov stability theory, upper bounds for the learning rate are analyzed. The proposed algorithm is applied to 5 years of electricity load data provided by the French utility company Electricité de France (EDF). Compared to state-of-the-art methods, it achieves a superior performance in terms of model tracking and prediction accuracy. 1

same-paper 2 0.72607726 295 nips-2012-Risk-Aversion in Multi-armed Bandits

Author: Amir Sani, Alessandro Lazaric, Rémi Munos

Abstract: Stochastic multi–armed bandits solve the Exploration–Exploitation dilemma and ultimately maximize the expected reward. Nonetheless, in many practical problems, maximizing the expected reward is not the most desirable objective. In this paper, we introduce a novel setting based on the principle of risk–aversion where the objective is to compete against the arm with the best risk–return trade–off. This setting proves to be more difficult than the standard multi-arm bandit setting due in part to an exploration risk which introduces a regret associated to the variability of an algorithm. Using variance as a measure of risk, we define two algorithms, investigate their theoretical guarantees, and report preliminary empirical results. 1

3 0.70623791 288 nips-2012-Rational inference of relative preferences

Author: Nisheeth Srivastava, Paul R. Schrater

Abstract: Statistical decision theory axiomatically assumes that the relative desirability of different options that humans perceive is well described by assigning them option-specific scalar utility functions. However, this assumption is refuted by observed human behavior, including studies wherein preferences have been shown to change systematically simply through variation in the set of choice options presented. In this paper, we show that interpreting desirability as a relative comparison between available options at any particular decision instance results in a rational theory of value-inference that explains heretofore intractable violations of rational choice behavior in human subjects. Complementarily, we also characterize the conditions under which a rational agent selecting optimal options indicated by dynamic value inference in our framework will behave identically to one whose preferences are encoded using a static ordinal utility function. 1

4 0.57555556 61 nips-2012-Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence

Author: Victor Gabillon, Mohammad Ghavamzadeh, Alessandro Lazaric

Abstract: We study the problem of identifying the best arm(s) in the stochastic multi-armed bandit setting. This problem has been studied in the literature from two different perspectives: fixed budget and fixed confidence. We propose a unifying approach that leads to a meta-algorithm called unified gap-based exploration (UGapE), with a common structure and similar theoretical analysis for these two settings. We prove a performance bound for the two versions of the algorithm showing that the two problems are characterized by the same notion of complexity. We also show how the UGapE algorithm as well as its theoretical analysis can be extended to take into account the variance of the arms and to multiple bandits. Finally, we evaluate the performance of UGapE and compare it with a number of existing fixed budget and fixed confidence algorithms. 1

5 0.551566 326 nips-2012-Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses

Author: Po-ling Loh, Martin J. Wainwright

Abstract: We investigate a curious relationship between the structure of a discrete graphical model and the support of the inverse of a generalized covariance matrix. We show that for certain graph structures, the support of the inverse covariance matrix of indicator variables on the vertices of a graph reflects the conditional independence structure of the graph. Our work extends results that have previously been established only in the context of multivariate Gaussian graphical models, thereby addressing an open question about the significance of the inverse covariance matrix of a non-Gaussian distribution. Based on our population-level results, we show how the graphical Lasso may be used to recover the edge structure of certain classes of discrete graphical models, and present simulations to verify our theoretical results. 1

6 0.54991889 69 nips-2012-Clustering Sparse Graphs

7 0.48461482 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

8 0.45715994 50 nips-2012-Bandit Algorithms boost Brain Computer Interfaces for motor-task selection of a brain-controlled button

9 0.45601687 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

10 0.45114756 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity

11 0.44731161 261 nips-2012-Online allocation and homogeneous partitioning for piecewise constant mean-approximation

12 0.44481093 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

13 0.43987077 132 nips-2012-Fiedler Random Fields: A Large-Scale Spectral Approach to Statistical Network Modeling

14 0.43904275 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization

15 0.43789446 358 nips-2012-Value Pursuit Iteration

16 0.4355652 188 nips-2012-Learning from Distributions via Support Measure Machines

17 0.43502882 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

18 0.43464148 318 nips-2012-Sparse Approximate Manifolds for Differential Geometric MCMC

19 0.43425965 43 nips-2012-Approximate Message Passing with Consistent Parameter Estimation and Applications to Sparse Learning

20 0.43403998 38 nips-2012-Algorithms for Learning Markov Field Policies