nips nips2010 nips2010-66 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Hado V. Hasselt
Abstract: In some stochastic environments the well-known reinforcement learning algorithm Q-learning performs very poorly. This poor performance is caused by large overestimations of action values. These overestimations result from a positive bias that is introduced because Q-learning uses the maximum action value as an approximation for the maximum expected action value. We introduce an alternative way to approximate the maximum expected value for any set of random variables. The obtained double estimator method is shown to sometimes underestimate rather than overestimate the maximum expected value. We apply the double estimator to Q-learning to construct Double Q-learning, a new off-policy reinforcement learning algorithm. We show the new algorithm converges to the optimal policy and that it performs well in some settings in which Q-learning performs poorly due to its overestimation. 1
Reference: text
sentIndex sentText sentNum sentScore
1 This poor performance is caused by large overestimations of action values. [sent-2, score-0.27]
2 These overestimations result from a positive bias that is introduced because Q-learning uses the maximum action value as an approximation for the maximum expected action value. [sent-3, score-0.596]
3 We introduce an alternative way to approximate the maximum expected value for any set of random variables. [sent-4, score-0.101]
4 The obtained double estimator method is shown to sometimes underestimate rather than overestimate the maximum expected value. [sent-5, score-0.681]
5 We apply the double estimator to Q-learning to construct Double Q-learning, a new off-policy reinforcement learning algorithm. [sent-6, score-0.624]
6 We show that Q-learning’s performance can be poor in stochastic MDPs because of large overestimations of the action values. [sent-9, score-0.297]
7 Qt+1(st, at) = Qt(st, at) + αt(st, at)(rt + γ maxa Qt(st+1, a) − Qt(st, at)) (1) In this equation, Qt(s, a) gives the value of the action a in state s at time t. [sent-12, score-0.232]
8 The next state st+1 is determined by a fixed state transition distribution P : S × A × S → [0, 1], where Psas′ gives the probability of ending up in state s′ after performing a in s, and Σs′ Psas′ = 1. [sent-14, score-0.132]
9 The learning rate αt (s, a) ∈ [0, 1] ensures that the update averages over possible randomness in the rewards and transitions in order to converge in the limit to the optimal action value function. [sent-15, score-0.335]
10 Second, in non-episodic tasks, the discount factor makes sure that every action value is finite and therefore well-defined. [sent-19, score-0.217]
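As a concrete illustration of update (1), one Q-learning step on a tabular value function might look as follows; this is a minimal sketch rather than the paper's code, and the dictionary-based table, learning rate and discount factor are illustrative assumptions:

    from collections import defaultdict

    Q = defaultdict(float)  # tabular action values Q(s, a), initialized to 0

    def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
        """One step of update (1): Q(s,a) += alpha * (r + gamma * max_b Q(s',b) - Q(s,a))."""
        max_next = max(Q[(s_next, b)] for b in actions)  # single-estimator value of the next state
        Q[(s, a)] += alpha * (r + gamma * max_next - Q[(s, a)])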
11 It has been proven that Q-learning reaches the optimal value function Q∗ with probability one in the limit under some mild conditions on the learning rates and exploration policy [4–6]. [sent-20, score-0.139]
12 The convergence rate of Q-learning can be exponential in the number of experiences [13], although this depends on the learning rates; with a proper choice of learning rates, convergence in polynomial time can be obtained [14]. [sent-23, score-0.236]
13 Contributions An important aspect of the Q-learning algorithm has been overlooked in previous work: the use of the max operator to determine the value of the next state can cause large overestimations of the action values. [sent-25, score-0.359]
14 We show that Q-learning can suffer a large performance penalty because of a positive bias that results from using the maximum value as an approximation for the maximum expected value. [sent-26, score-0.193]
15 We propose an alternative double estimator method to find an estimate for the maximum value of a set of stochastic values and we show that this sometimes underestimates rather than overestimates the maximum expected value. [sent-27, score-0.786]
16 In the second section, we analyze two methods to approximate the maximum expected value of a set of random variables. [sent-30, score-0.101]
17 2 Estimating the Maximum Expected Value In this section, we analyze two methods to find an approximation for the maximum expected value of a set of random variables. [sent-35, score-0.101]
18 The single estimator method uses the maximum of a set of estimators as an approximation. [sent-36, score-0.24]
19 This approach to approximating the maximum expected value is positively biased, as discussed in previous work in economics [15] and decision making [16]. [sent-37, score-0.142]
20 The double estimator method uses two estimates for each variable and uncouples the selection of an estimator and its value. [sent-39, score-0.723]
21 In many problems, one is interested in the maximum expected value of the variables in such a set: maxi E{Xi}. [sent-46, score-0.122]
22 Unbiased estimates for the expected values can be obtained by computing the sample average for each variable: E{Xi} = E{µi} ≈ µi(S) def= (1/|Si|) Σ_{s∈Si} s, where µi is an estimator for variable Xi. [sent-51, score-0.309]
23 This approximation is unbiased since every sample s ∈ Si is an unbiased estimate for the value of E{Xi}. [sent-52, score-0.245]
24 The error in the approximation thus consists solely of the variance in the estimator and decreases when we obtain more samples. [sent-53, score-0.151]
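In code, this sample-average estimator is a one-liner; a minimal sketch, where the list S_i of observed samples for variable Xi is an assumed representation:

    def sample_average(S_i):
        """mu_i(S) = (1/|S_i|) * sum over s in S_i, an unbiased estimate of E{X_i}."""
        return sum(S_i) / len(S_i)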
25 We use the following notations: fi denotes the probability density function (PDF) of the ith variable Xi and Fi(x) = ∫_{−∞}^{x} fi(x) dx is the cumulative distribution function (CDF) of this PDF. [sent-54, score-0.165]
26 Similarly, the PDF and CDF of the ith estimator are denoted fiµ and Fiµ . [sent-55, score-0.151]
27 The maximum expected value can be expressed in terms of the underlying PDFs as maxi E{Xi} = maxi ∫_{−∞}^{∞} x fi(x) dx. [sent-56, score-0.481]
28 2.1 The Single Estimator An obvious way to approximate the value in (3) is to use the value of the maximal estimator: maxi E{Xi} = maxi E{µi} ≈ maxi µi(S). [sent-58, score-0.148]
29 Q-learning uses this method to approximate the value of the next state by maximizing over the estimated action values in that state. [sent-60, score-0.232]
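The positive bias of this single-estimator approach is easy to reproduce numerically; the sketch below assumes Gaussian variables with identical means, so the true maximum expected value is 0, yet the maximum of the sample averages is clearly positive on average:

    import random

    def single_estimator(sample_sets):
        """Approximate max_i E{X_i} by the maximum of the sample averages, max_i mu_i(S)."""
        return max(sum(S) / len(S) for S in sample_sets)

    random.seed(0)
    M, n, runs = 10, 20, 2000   # 10 variables, 20 samples each, averaged over 2000 repetitions
    bias = sum(single_estimator([[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(M)])
               for _ in range(runs)) / runs
    print(bias)                 # noticeably above 0: the overestimation discussed in the text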
30 The maximal estimator maxi µi is distributed according to some PDF fµmax that is dependent on the PDFs of the estimators fiµ. [sent-61, score-0.443]
31 This probability is equal to the probability that all the estimates are lower or equal to x: Fµmax(x) def= P(maxi µi ≤ x) = Π_{i=1}^{M} P(µi ≤ x) = Π_{i=1}^{M} Fiµ(x). [sent-63, score-0.204]
32 The value maxi µi(S) is an unbiased estimate for E{maxj µj} = ∫_{−∞}^{∞} x fµmax(x) dx, which can thus be given by E{maxj µj} = ∫_{−∞}^{∞} x (d/dx) Π_{i=1}^{M} Fiµ(x) dx = Σ_{j=1}^{M} ∫_{−∞}^{∞} x fjµ(x) Π_{i≠j} Fiµ(x) dx. [sent-64, score-0.544]
33 This makes the maximal estimator maxi µi (S) a biased estimate for maxi E{Xi }. [sent-66, score-0.503]
34 2.2 The Double Estimator The overestimation that results from the single estimator approach can have a large negative impact on algorithms that use this method, such as Q-learning. [sent-70, score-0.244]
35 Therefore, we look at an alternative method to approximate maxi E{Xi }. [sent-71, score-0.136]
36 We refer to this method as the double estimator, since it uses two sets of estimators: µA = {µA1, . . . , µAM} and µB = {µB1, . . . , µBM}. [sent-72, score-0.403]
37 Like the single estimator µi, both µAi and µBi are unbiased if we assume that the samples are split in a proper manner, for instance randomly, over the two sets of estimators. [sent-80, score-0.374]
38 Let MaxA(S) def= {j | µAj(S) = maxi µAi(S)} be the set of maximal estimates in µA(S). [sent-81, score-0.191]
39 Since µB is an independent, unbiased set of estimators, we have E{µBj} = E{Xj} for all j, including all j ∈ MaxA. [sent-82, score-0.101]
40 Let a∗ be an estimator that maximizes µA: µAa∗(S) def= maxi µAi(S). [sent-83, score-0.38]
41 Then we can use µBa∗ as an estimate for maxi E{µBi} and therefore also for maxi E{Xi}, and we obtain the approximation maxi E{Xi} = maxi E{µBi} ≈ µBa∗(S). [sent-85, score-0.333]
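A minimal sketch of this double estimator: the samples of each variable are split at random into two sets, set A selects the arg-max a∗ and set B supplies the value µBa∗; the splitting scheme and names are illustrative assumptions:

    import random

    def double_estimator(sample_sets, rng=random.Random(0)):
        """Approximate max_i E{X_i} by mu^B_{a*}, where a* maximizes the mu^A sample averages."""
        mu_A, mu_B = [], []
        for S in sample_sets:
            S = list(S)
            rng.shuffle(S)              # a 'proper', random split over the two sets
            half = len(S) // 2
            A, B = S[:half], S[half:]
            mu_A.append(sum(A) / len(A))
            mu_B.append(sum(B) / len(B))
        a_star = max(range(len(mu_A)), key=lambda i: mu_A[i])
        return mu_B[a_star]

On the equal-means example used for the single estimator this estimate is unbiased, and with unequal means its expectation never exceeds maxi E{Xi}, in line with the analysis that follows.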
42 Integrating out x gives P(j = a∗) def= ∫_{−∞}^{∞} P(µAj = x) Π_{i≠j} P(µAi < x) dx = ∫_{−∞}^{∞} fjA(x) Π_{i≠j} FiA(x) dx, where fiA and FiA are the PDF and CDF of µAi. [sent-91, score-0.218]
43 The expected value of the approximation by the double estimator can thus be given by Σ_{j=1}^{M} P(j = a∗) E{µBj} = Σ_{j=1}^{M} E{µBj} ∫_{−∞}^{∞} fjA(x) Π_{i≠j} FiA(x) dx. [sent-92, score-0.676]
44 Comparing (7) to (5), we see the difference is that the double estimator uses E{µBj} in place of x. [sent-95, score-0.554]
45 The single estimator overestimates, because x is within the integral and therefore correlates with the monotonically increasing product Π_{i≠j} Fiµ(x). [sent-96, score-0.151]
46 The double estimator underestimates because the probabilities P(j = a∗) sum to one and therefore the approximation is a weighted estimate of unbiased expected values, which must be lower than or equal to the maximum expected value. [sent-97, score-0.83]
47 In the following lemma, which holds in both the discrete and the continuous case, we prove in general that the estimate E{µB∗ } is not an unbiased estimate of maxi E{Xi }. [sent-98, score-0.275]
48 Let µA = {µA1, . . . , µAM} and µB = {µB1, . . . , µBM} be two sets of unbiased estimators such that E{µAi} = E{µBi} = E{Xi} for all i. [sent-109, score-0.253]
49 Let M = {j | E{Xj } = maxi E{Xi }} be the set of elements that maximize the expected values. [sent-110, score-0.183]
50 Let a∗ be an element that maximizes µA: µAa∗ = maxi µAi. [sent-111, score-0.136]
51 In contrast with the single estimator, the double estimator is unbiased when the variables are iid, since then all expected values are equal and P (a∗ ∈ M) = 1. [sent-121, score-0.702]
52 3 Double Q-learning We can interpret Q-learning as using the single estimator to estimate the value of the next state: maxa Qt (st+1 , a) is an estimate for E{maxa Qt (st+1 , a)}, which in turn approximates maxa E{Qt (st+1 , a)}. [sent-122, score-0.435]
53 Therefore, maxa Qt (st+1 , a) is an unbiased sample, drawn from an iid distribution with mean E{maxa Qt (st+1 , a)}. [sent-124, score-0.234]
54 The action a∗ in line 6 is the maximally valued action in state s′, according to the value function QA. [sent-130, score-0.464]
55 However, instead of using the value QA (s′ , a∗ ) = maxa QA (s′ , a) to update QA , as Q-learning would do, we use the value QB (s′ , a∗ ). [sent-131, score-0.19]
56 Since QB was updated on the same problem, but with a different set of experience samples, this can be considered an unbiased estimate for the value of this action. [sent-132, score-0.182]
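A sketch of this core update, in the spirit of Algorithm 1: with equal probability either QA or QB is updated, and the arg-max action of the table being updated is evaluated with the other table; the tabular layout, learning rate and discount are illustrative assumptions:

    import random
    from collections import defaultdict

    QA, QB = defaultdict(float), defaultdict(float)   # both tables start at zero

    def double_q_update(QA, QB, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
        """One Double Q-learning step: pick the maximizing action with one table, value it with the other."""
        if random.random() < 0.5:                      # update QA
            a_star = max(actions, key=lambda b: QA[(s_next, b)])
            QA[(s, a)] += alpha * (r + gamma * QB[(s_next, a_star)] - QA[(s, a)])
        else:                                          # symmetric update of QB
            b_star = max(actions, key=lambda b: QB[(s_next, b)])
            QB[(s, a)] += alpha * (r + gamma * QA[(s_next, b_star)] - QB[(s, a)])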
57 It is important that both Q functions learn from separate sets of experiences, but to select an action to perform one can use both value functions. [sent-134, score-0.188]
58 In our experiments, we calculated the average of the two Q values for each action and then performed ε-greedy exploration with the resulting average Q values. [sent-136, score-0.199]
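Action selection can then be sketched as ε-greedy over the average of the two tables; QA and QB are the two tabular value functions from the sketch above and the ε value is supplied by the exploration schedule:

    import random

    def select_action(QA, QB, s, actions, epsilon):
        """epsilon-greedy exploration using the average of the two Q values for each action."""
        if random.random() < epsilon:
            return random.choice(list(actions))
        return max(actions, key=lambda a: 0.5 * (QA[(s, a)] + QB[(s, a)]))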
59 Similar to the double estimator in Section 2, action a∗ may not be the action that maximizes the expected Q function maxa E{QA (s′ , a)}. [sent-138, score-1.04]
60 In general E{QB(s′, a∗)} ≤ maxa E{QA(s′, a)}, and underestimations of the action values can occur. [sent-139, score-0.275]
61 Intuitively, this is what one would expect: Q-learning is based on the single estimator and Double Q-learning is based on the double estimator, and in Section 2 we argued that the estimates by the single and double estimator both converge to the same answer in the limit. [sent-142, score-1.277]
62 However, this argument does not transfer immediately to bootstrapping action values, so we prove this result making use of the following lemma which was also used to prove convergence of Sarsa [20]. [sent-143, score-0.229]
63 A ‘proper’ learning policy ensures that each state-action pair is visited an infinite number of times. [sent-179, score-0.23]
64 , st, at}, X = S × A, ∆t = QAt − Q∗, ζ = α and Ft(st, at) = rt + γQBt(st+1, a∗) − Q∗(st, at), where a∗ = arg maxa QAt(st+1, a). [sent-187, score-0.528]
65 The fourth condition of the lemma holds as a consequence of the boundedness condition on the variance of the rewards in the theorem. [sent-189, score-0.102]
66 By definition of a∗ as given in line 6 of Algorithm 1 we have QAt(st+1, a∗) = maxa QAt(st+1, a) ≥ QAt(st+1, b∗) and therefore E{FtBA(st, at)|Pt} = γE{QAt(st+1, b∗) − QBt(st+1, a∗)|Pt} ≤ γE{QAt(st+1, a∗) − QBt(st+1, a∗)|Pt} ≤ γ‖∆tBA‖. [sent-200, score-0.111]
67 The settings are the gambling game of roulette and a small grid world. [sent-208, score-0.133]
68 For Double Q-learning nt(s, a) = nAt(s, a) if QA is updated and nt(s, a) = nBt(s, a) if QB is updated, where nA and nB store the number of updates for each action for the corresponding value function. [sent-215, score-0.26]
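The per state-action learning rates can be written as a function of these update counts; a sketch, where the polynomial exponent 0.8 is an assumption (a common choice in the polynomial learning-rate literature) and not a value stated in this extract:

    def learning_rate(n_sa, polynomial=True, exponent=0.8):
        """alpha_t(s,a) from the update count n_t(s,a): 1/n for the linear schedule,
        1/n**exponent (assumed exponent) for the polynomial schedule."""
        n_sa = max(n_sa, 1)   # guard against a zero count before the first update
        return 1.0 / (n_sa ** exponent if polynomial else n_sa)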
69 4.1 Roulette In roulette, a player chooses between 170 betting actions, including betting on a number, on either of the colors black or red, and so on. [sent-218, score-0.134]
70 The payoff for each of these bets is chosen such that almost all bets have an expected payout of (1/38) · $36 ≈ $0.947. [sent-219, score-0.167]
71 We assume all betting actions transition back to the same state and there is one action that stops playing, yielding $0. [sent-222, score-0.101]
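This roulette setup can be written as a one-state environment; the sketch below models every bet as a $1 stake that returns $36 with probability 1/38, which matches the stated expected payout, while the concrete set of 170 bet types and the exact reward convention are assumptions:

    import random

    N_BETS = 170        # betting actions 0..169
    STOP = N_BETS       # one extra action that stops playing and yields $0

    def roulette_step(action):
        """Return (reward, episode_done) for one action in the single-state roulette sketch."""
        if action == STOP:
            return 0.0, True
        payout = 36.0 if random.random() < 1.0 / 38.0 else 0.0
        return payout - 1.0, False   # net gain of a $1 bet; expected value about -$0.053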
72 Figure 1 shows the mean action values over all actions, as found by Q-learning and Double Q-learning. [sent-224, score-0.164]
73 After 100,000 trials, Q-learning with a linear learning rate values all betting actions at more than $20 and there is little progress. [sent-226, score-0.122]
74 With polynomial learning rates the performance improves, but Double Q-learning converges much more quickly. [sent-227, score-0.103]
75 1 Only the so-called ‘top line’, which pays $6 per dollar when 00, 0, 1, 2 or 3 is hit, has a slightly lower expected value of -$0.079. [sent-235, score-0.101]
76 Figure 1: The average action values according to Q-learning and Double Q-learning when playing roulette. [sent-237, score-0.164]
77 The second row shows the maximal action value in the starting state S. [sent-242, score-0.269]
78 Each time the agent selects an action that walks off the grid, the agent stays in the same state. [sent-248, score-0.216]
79 In the goal state every action yields +5 and ends an episode. [sent-250, score-0.208]
80 The exploration was ε-greedy with ε(s) = 1/√n(s), where n(s) is the number of times state s has been visited, assuring infinite exploration in the limit, which is a theoretical requirement for the convergence of both Q-learning and Double Q-learning. [sent-253, score-0.166]
81 Such an ε-greedy setting is beneficial for Q-learning, since this implies that actions with large overestimations are selected more often than realistically valued actions. [sent-254, score-0.206]
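The count-based exploration schedule used in this grid world is simple to write down; a minimal sketch, assuming the agent maintains a visit counter n(s):

    import math

    def epsilon(n_s):
        """Exploration probability for a state visited n_s times: epsilon(s) = 1 / sqrt(n(s))."""
        return 1.0 / math.sqrt(max(n_s, 1))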
82 Figure 2 shows the average rewards in the first row and the maximum action value in the starting state in the second row. [sent-256, score-0.33]
83 Double Q-learning performs much better in terms of its average rewards, but this does not imply that the estimations of the action values are accurate. [sent-257, score-0.186]
84 The optimal value of the maximally valued action in the starting state is 5γ^4 − Σ_{k=0}^{3} γ^k ≈ 0.36. [sent-258, score-0.263]
85 However, even if the error of the action values is comparable, the policies found by Double Q-learning are clearly much better. [sent-261, score-0.164]
86 5 Discussion We note an important difference between the well-known heuristic exploration technique of optimism in the face of uncertainty [21, 22] and the overestimation bias. [sent-262, score-0.174]
87 Optimism about uncertain events can be beneficial, but Q-learning can overestimate actions that have been tried often and the estimations can be higher than any realistic optimistic estimate. [sent-263, score-0.134]
88 For instance, in roulette our initial action value estimate of $0 can be considered optimistic, since no action has an actual expected value higher than this. [sent-264, score-0.517]
89 However, even after trying 100,000 actions Q-learning on average estimated each gambling action to be worth almost $10. [sent-265, score-0.263]
90 In contrast, although Double Q-learning can underestimate the values of some actions, it is easy to set the initial action values high enough to ensure optimism for actions that have experienced limited updates. [sent-266, score-0.303]
91 The value of such a cluster in itself is an estimation task and the proposed methods included taking the mean value, which would result in an underestimation of the actual value, and taking the maximum value, which is a case of the single estimator and results in an overestimation. [sent-270, score-0.231]
92 It would be interesting to see how the double estimator approach fares in such a setting. [sent-271, score-0.554]
93 When the rewards are deterministic but the state transitions are stochastic, the same pattern of overestimations due to this noise can occur and the same conclusions continue to hold. [sent-273, score-0.245]
94 6 Conclusion We have presented a new algorithm called Double Q-learning that uses a double estimator approach to determine the value of the next state. [sent-274, score-0.578]
95 To our knowledge, this is the first off-policy value-based reinforcement learning algorithm that does not have a positive bias in estimating the action values in stochastic environments. [sent-275, score-0.316]
96 According to our analysis, Double Q-learning sometimes underestimates the action values, but does not suffer from the overestimation bias that Q-learning does. [sent-276, score-0.351]
97 Additionally, the fact that we can construct positively biased and negatively biased off-policy algorithms raises the question whether it is also possible to construct an unbiased off-policy reinforcement-learning algorithm, without the high variance of unbiased on-policy Monte-Carlo methods [24]. [sent-280, score-0.267]
98 Unfortunately, the size of the overestimation is dependent on the number of actions and the unknown distributions of the rewards and transitions, making this a non-trivial extension. [sent-282, score-0.23]
99 For instance, Delayed Q-learning can suffer from similar overestimations, although it does have polynomial convergence guarantees. [sent-284, score-0.101]
100 This is similar to the polynomial learning rates: although performance is improved from an exponential to a polynomial rate [14], the algorithm still suffers from the inherent overestimation bias due to the single estimator approach. [sent-285, score-0.353]
wordName wordTfidf (topN-words)
[('qb', 0.437), ('qa', 0.413), ('double', 0.403), ('st', 0.375), ('ba', 0.173), ('action', 0.164), ('estimator', 0.151), ('pt', 0.148), ('maxi', 0.136), ('maxa', 0.111), ('overestimations', 0.106), ('unbiased', 0.101), ('def', 0.093), ('overestimation', 0.093), ('qt', 0.079), ('ftba', 0.075), ('roulette', 0.075), ('ft', 0.072), ('reinforcement', 0.07), ('actions', 0.069), ('rewards', 0.068), ('bets', 0.06), ('fmax', 0.06), ('estimators', 0.059), ('fi', 0.057), ('delayed', 0.057), ('betting', 0.053), ('fia', 0.053), ('pdfs', 0.053), ('dx', 0.051), ('xt', 0.048), ('expected', 0.047), ('pdf', 0.046), ('optimism', 0.046), ('fta', 0.045), ('ftb', 0.045), ('ftq', 0.045), ('psa', 0.045), ('rsa', 0.045), ('state', 0.044), ('rt', 0.042), ('polynomial', 0.039), ('updated', 0.038), ('cdf', 0.037), ('maximal', 0.037), ('rates', 0.037), ('xi', 0.035), ('exploration', 0.035), ('lemma', 0.034), ('experiences', 0.032), ('underestimates', 0.032), ('update', 0.031), ('suffer', 0.031), ('fitted', 0.031), ('bias', 0.031), ('valued', 0.031), ('convergence', 0.031), ('dollar', 0.03), ('gambling', 0.03), ('wiering', 0.03), ('mdp', 0.03), ('maximum', 0.03), ('proper', 0.029), ('si', 0.029), ('discount', 0.029), ('reward', 0.028), ('grid', 0.028), ('player', 0.028), ('transitions', 0.027), ('converges', 0.027), ('stochastic', 0.027), ('axa', 0.026), ('overestimate', 0.026), ('underestimation', 0.026), ('watkins', 0.026), ('agent', 0.026), ('littman', 0.025), ('szepesv', 0.025), ('ct', 0.025), ('underestimate', 0.024), ('winner', 0.024), ('xa', 0.024), ('value', 0.024), ('biased', 0.024), ('les', 0.024), ('overestimates', 0.023), ('fj', 0.023), ('policy', 0.022), ('iid', 0.022), ('estimations', 0.022), ('nb', 0.022), ('mdps', 0.021), ('max', 0.021), ('limit', 0.021), ('estimate', 0.019), ('bandit', 0.018), ('estimates', 0.018), ('positively', 0.017), ('optimistic', 0.017), ('nt', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 66 nips-2010-Double Q-learning
Author: Hado V. Hasselt
Abstract: In some stochastic environments the well-known reinforcement learning algorithm Q-learning performs very poorly. This poor performance is caused by large overestimations of action values. These overestimations result from a positive bias that is introduced because Q-learning uses the maximum action value as an approximation for the maximum expected action value. We introduce an alternative way to approximate the maximum expected value for any set of random variables. The obtained double estimator method is shown to sometimes underestimate rather than overestimate the maximum expected value. We apply the double estimator to Q-learning to construct Double Q-learning, a new off-policy reinforcement learning algorithm. We show the new algorithm converges to the optimal policy and that it performs well in some settings in which Q-learning performs poorly due to its overestimation. 1
2 0.14085256 254 nips-2010-Stability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models
Author: Han Liu, Kathryn Roeder, Larry Wasserman
Abstract: A challenging problem in estimating high-dimensional graphical models is to choose the regularization parameter in a data-dependent way. The standard techniques include K-fold cross-validation (K-CV), Akaike information criterion (AIC), and Bayesian information criterion (BIC). Though these methods work well for low-dimensional problems, they are not suitable in high dimensional settings. In this paper, we present StARS: a new stability-based method for choosing the regularization parameter in high dimensional inference for undirected graphs. The method has a clear interpretation: we use the least amount of regularization that simultaneously makes a graph sparse and replicable under random sampling. This interpretation requires essentially no conditions. Under mild conditions, we show that StARS is partially sparsistent in terms of graph estimation: i.e. with high probability, all the true edges will be included in the selected model even when the graph size diverges with the sample size. Empirically, the performance of StARS is compared with the state-of-the-art model selection procedures, including K-CV, AIC, and BIC, on both synthetic data and a real microarray dataset. StARS outperforms all these competing procedures.
3 0.13217074 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains
Author: Martha White, Adam White
Abstract: The reinforcement learning community has explored many approaches to obtaining value estimates and models to guide decision making; these approaches, however, do not usually provide a measure of confidence in the estimate. Accurate estimates of an agent’s confidence are useful for many applications, such as biasing exploration and automatically adjusting parameters to reduce dependence on parameter-tuning. Computing confidence intervals on reinforcement learning value estimates, however, is challenging because data generated by the agentenvironment interaction rarely satisfies traditional assumptions. Samples of valueestimates are dependent, likely non-normally distributed and often limited, particularly in early learning when confidence estimates are pivotal. In this work, we investigate how to compute robust confidences for value estimates in continuous Markov decision processes. We illustrate how to use bootstrapping to compute confidence intervals online under a changing policy (previously not possible) and prove validity under a few reasonable assumptions. We demonstrate the applicability of our confidence estimation algorithms with experiments on exploration, parameter estimation and tracking. 1
4 0.12840444 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
Author: Gergely Neu, Andras Antos, András György, Csaba Szepesvári
Abstract: We consider online learning in finite stochastic Markovian environments where in each time step a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in terms of the total reward received. In each time step the agent observes the current state and the reward associated with the last transition, however, the agent does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state of the art result for this setting is a no-regret algorithm. In this paper we propose a new learning algorithm and, assuming that stationary policies mix uniformly fast, we show that after T time steps, the expected regret of the new algorithm is O T 2/3 (ln T )1/3 , giving the first rigorously proved regret bound for the problem. 1
5 0.11884424 122 nips-2010-Improving the Asymptotic Performance of Markov Chain Monte-Carlo by Inserting Vortices
Author: Yi Sun, Jürgen Schmidhuber, Faustino J. Gomez
Abstract: We present a new way of converting a reversible finite Markov chain into a nonreversible one, with a theoretical guarantee that the asymptotic variance of the MCMC estimator based on the non-reversible chain is reduced. The method is applicable to any reversible chain whose states are not connected through a tree, and can be interpreted graphically as inserting vortices into the state transition graph. Our result confirms that non-reversible chains are fundamentally better than reversible ones in terms of asymptotic performance, and suggests interesting directions for further improving MCMC. 1
6 0.11090344 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks
7 0.10274957 152 nips-2010-Learning from Logged Implicit Exploration Data
8 0.097974822 121 nips-2010-Improving Human Judgments by Decontaminating Sequential Dependencies
9 0.096168816 171 nips-2010-Movement extraction by detecting dynamics switches and repetitions
10 0.075958185 184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning
11 0.073366947 189 nips-2010-On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient
12 0.071831927 14 nips-2010-A Reduction from Apprenticeship Learning to Classification
13 0.069017939 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
14 0.06891682 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning
15 0.068532132 168 nips-2010-Monte-Carlo Planning in Large POMDPs
16 0.068431996 222 nips-2010-Random Walk Approach to Regret Minimization
17 0.066571563 203 nips-2010-Parametric Bandits: The Generalized Linear Case
18 0.065934986 229 nips-2010-Reward Design via Online Gradient Ascent
19 0.065540656 39 nips-2010-Bayesian Action-Graph Games
20 0.06508863 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
topicId topicWeight
[(0, 0.152), (1, -0.139), (2, 0.022), (3, 0.037), (4, -0.004), (5, 0.029), (6, -0.06), (7, -0.023), (8, -0.009), (9, 0.145), (10, -0.075), (11, 0.022), (12, -0.031), (13, -0.014), (14, 0.036), (15, -0.025), (16, -0.062), (17, -0.046), (18, -0.04), (19, 0.115), (20, -0.09), (21, 0.102), (22, -0.037), (23, 0.109), (24, 0.037), (25, 0.041), (26, 0.043), (27, 0.014), (28, -0.02), (29, 0.104), (30, 0.04), (31, 0.017), (32, -0.045), (33, 0.091), (34, 0.017), (35, 0.211), (36, 0.014), (37, -0.072), (38, -0.066), (39, -0.009), (40, 0.095), (41, -0.101), (42, 0.017), (43, 0.135), (44, 0.014), (45, -0.011), (46, -0.098), (47, 0.04), (48, -0.004), (49, 0.068)]
simIndex simValue paperId paperTitle
same-paper 1 0.9594062 66 nips-2010-Double Q-learning
Author: Hado V. Hasselt
Abstract: In some stochastic environments the well-known reinforcement learning algorithm Q-learning performs very poorly. This poor performance is caused by large overestimations of action values. These overestimations result from a positive bias that is introduced because Q-learning uses the maximum action value as an approximation for the maximum expected action value. We introduce an alternative way to approximate the maximum expected value for any set of random variables. The obtained double estimator method is shown to sometimes underestimate rather than overestimate the maximum expected value. We apply the double estimator to Q-learning to construct Double Q-learning, a new off-policy reinforcement learning algorithm. We show the new algorithm converges to the optimal policy and that it performs well in some settings in which Q-learning performs poorly due to its overestimation. 1
2 0.63175124 254 nips-2010-Stability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models
Author: Han Liu, Kathryn Roeder, Larry Wasserman
Abstract: A challenging problem in estimating high-dimensional graphical models is to choose the regularization parameter in a data-dependent way. The standard techniques include K-fold cross-validation (K-CV), Akaike information criterion (AIC), and Bayesian information criterion (BIC). Though these methods work well for low-dimensional problems, they are not suitable in high dimensional settings. In this paper, we present StARS: a new stability-based method for choosing the regularization parameter in high dimensional inference for undirected graphs. The method has a clear interpretation: we use the least amount of regularization that simultaneously makes a graph sparse and replicable under random sampling. This interpretation requires essentially no conditions. Under mild conditions, we show that StARS is partially sparsistent in terms of graph estimation: i.e. with high probability, all the true edges will be included in the selected model even when the graph size diverges with the sample size. Empirically, the performance of StARS is compared with the state-of-the-art model selection procedures, including K-CV, AIC, and BIC, on both synthetic data and a real microarray dataset. StARS outperforms all these competing procedures.
3 0.56590402 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains
Author: Martha White, Adam White
Abstract: The reinforcement learning community has explored many approaches to obtaining value estimates and models to guide decision making; these approaches, however, do not usually provide a measure of confidence in the estimate. Accurate estimates of an agent’s confidence are useful for many applications, such as biasing exploration and automatically adjusting parameters to reduce dependence on parameter-tuning. Computing confidence intervals on reinforcement learning value estimates, however, is challenging because data generated by the agentenvironment interaction rarely satisfies traditional assumptions. Samples of valueestimates are dependent, likely non-normally distributed and often limited, particularly in early learning when confidence estimates are pivotal. In this work, we investigate how to compute robust confidences for value estimates in continuous Markov decision processes. We illustrate how to use bootstrapping to compute confidence intervals online under a changing policy (previously not possible) and prove validity under a few reasonable assumptions. We demonstrate the applicability of our confidence estimation algorithms with experiments on exploration, parameter estimation and tracking. 1
4 0.55566686 121 nips-2010-Improving Human Judgments by Decontaminating Sequential Dependencies
Author: Harold Pashler, Matthew Wilder, Robert Lindsey, Matt Jones, Michael C. Mozer, Michael P. Holmes
Abstract: For over half a century, psychologists have been struck by how poor people are at expressing their internal sensations, impressions, and evaluations via rating scales. When individuals make judgments, they are incapable of using an absolute rating scale, and instead rely on reference points from recent experience. This relativity of judgment limits the usefulness of responses provided by individuals to surveys, questionnaires, and evaluation forms. Fortunately, the cognitive processes that transform internal states to responses are not simply noisy, but rather are influenced by recent experience in a lawful manner. We explore techniques to remove sequential dependencies, and thereby decontaminate a series of ratings to obtain more meaningful human judgments. In our formulation, decontamination is fundamentally a problem of inferring latent states (internal sensations) which, because of the relativity of judgment, have temporal dependencies. We propose a decontamination solution using a conditional random field with constraints motivated by psychological theories of relative judgment. Our exploration of decontamination models is supported by two experiments we conducted to obtain ground-truth rating data on a simple length estimation task. Our decontamination techniques yield an over 20% reduction in the error of human judgments. 1
5 0.54545844 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
Author: Rina Foygel, Mathias Drton
Abstract: Gaussian graphical models with sparsity in the inverse covariance matrix are of significant interest in many modern applications. For the problem of recovering the graphical structure, information criteria provide useful optimization objectives for algorithms searching through sets of graphs or for selection of tuning parameters of other methods such as the graphical lasso, which is a likelihood penalization technique. In this paper we establish the consistency of an extended Bayesian information criterion for Gaussian graphical models in a scenario where both the number of variables p and the sample size n grow. Compared to earlier work on the regression case, our treatment allows for growth in the number of non-zero parameters in the true model, which is necessary in order to cover connected graphs. We demonstrate the performance of this criterion on simulated data when used in conjunction with the graphical lasso, and verify that the criterion indeed performs better than either cross-validation or the ordinary Bayesian information criterion when p and the number of non-zero parameters q both scale with n. 1
6 0.53760159 122 nips-2010-Improving the Asymptotic Performance of Markov Chain Monte-Carlo by Inserting Vortices
7 0.44500753 19 nips-2010-A rational decision making framework for inhibitory control
8 0.43625179 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
10 0.43033203 171 nips-2010-Movement extraction by detecting dynamics switches and repetitions
11 0.41565174 203 nips-2010-Parametric Bandits: The Generalized Linear Case
12 0.4132784 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks
13 0.40959495 168 nips-2010-Monte-Carlo Planning in Large POMDPs
14 0.40557152 212 nips-2010-Predictive State Temporal Difference Learning
15 0.40303683 4 nips-2010-A Computational Decision Theory for Interactive Assistants
16 0.40096158 39 nips-2010-Bayesian Action-Graph Games
17 0.39048186 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning
18 0.365049 152 nips-2010-Learning from Logged Implicit Exploration Data
19 0.3623977 108 nips-2010-Graph-Valued Regression
20 0.35679573 178 nips-2010-Multivariate Dyadic Regression Trees for Sparse Learning Problems
topicId topicWeight
[(13, 0.033), (17, 0.012), (27, 0.097), (30, 0.038), (35, 0.01), (39, 0.019), (45, 0.149), (50, 0.047), (52, 0.033), (60, 0.051), (76, 0.297), (77, 0.065), (78, 0.012), (90, 0.039)]
simIndex simValue paperId paperTitle
1 0.72622311 18 nips-2010-A novel family of non-parametric cumulative based divergences for point processes
Author: Sohan Seth, Park Il, Austin Brockmeier, Mulugeta Semework, John Choi, Joseph Francis, Jose Principe
Abstract: Hypothesis testing on point processes has several applications such as model fitting, plasticity detection, and non-stationarity detection. Standard tools for hypothesis testing include tests on mean firing rate and time varying rate function. However, these statistics do not fully describe a point process, and therefore, the conclusions drawn by these tests can be misleading. In this paper, we introduce a family of non-parametric divergence measures for hypothesis testing. A divergence measure compares the full probability structure and, therefore, leads to a more robust test of hypothesis. We extend the traditional Kolmogorov–Smirnov and Cramér–von-Mises tests to the space of spike trains via stratification, and show that these statistics can be consistently estimated from data without any free parameter. We demonstrate an application of the proposed divergences as a cost function to find optimally matched point processes. 1
same-paper 2 0.71493089 66 nips-2010-Double Q-learning
Author: Hado V. Hasselt
Abstract: In some stochastic environments the well-known reinforcement learning algorithm Q-learning performs very poorly. This poor performance is caused by large overestimations of action values. These overestimations result from a positive bias that is introduced because Q-learning uses the maximum action value as an approximation for the maximum expected action value. We introduce an alternative way to approximate the maximum expected value for any set of random variables. The obtained double estimator method is shown to sometimes underestimate rather than overestimate the maximum expected value. We apply the double estimator to Q-learning to construct Double Q-learning, a new off-policy reinforcement learning algorithm. We show the new algorithm converges to the optimal policy and that it performs well in some settings in which Q-learning performs poorly due to its overestimation. 1
3 0.56169963 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
Author: Surya Ganguli, Haim Sompolinsky
Abstract: Recent proposals suggest that large, generic neuronal networks could store memory traces of past input sequences in their instantaneous state. Such a proposal raises important theoretical questions about the duration of these memory traces and their dependence on network size, connectivity and signal statistics. Prior work, in the case of gaussian input sequences and linear neuronal networks, shows that the duration of memory traces in a network cannot exceed the number of neurons (in units of the neuronal time constant), and that no network can out-perform an equivalent feedforward network. However a more ethologically relevant scenario is that of sparse input sequences. In this scenario, we show how linear neural networks can essentially perform compressed sensing (CS) of past inputs, thereby attaining a memory capacity that exceeds the number of neurons. This enhanced capacity is achieved by a class of “orthogonal” recurrent networks and not by feedforward networks or generic recurrent networks. We exploit techniques from the statistical physics of disordered systems to analytically compute the decay of memory traces in such networks as a function of network size, signal sparsity and integration time. Alternately, viewed purely from the perspective of CS, this work introduces a new ensemble of measurement matrices derived from dynamical systems, and provides a theoretical analysis of their asymptotic performance. 1
4 0.56136644 21 nips-2010-Accounting for network effects in neuronal responses using L1 regularized point process models
Author: Ryan Kelly, Matthew Smith, Robert Kass, Tai S. Lee
Abstract: Activity of a neuron, even in the early sensory areas, is not simply a function of its local receptive field or tuning properties, but depends on global context of the stimulus, as well as the neural context. This suggests the activity of the surrounding neurons and global brain states can exert considerable influence on the activity of a neuron. In this paper we implemented an L1 regularized point process model to assess the contribution of multiple factors to the firing rate of many individual units recorded simultaneously from V1 with a 96-electrode “Utah” array. We found that the spikes of surrounding neurons indeed provide strong predictions of a neuron’s response, in addition to the neuron’s receptive field transfer function. We also found that the same spikes could be accounted for with the local field potentials, a surrogate measure of global network states. This work shows that accounting for network fluctuations can improve estimates of single trial firing rate and stimulus-response transfer functions. 1
5 0.56085062 161 nips-2010-Linear readout from a neural population with partial correlation data
Author: Adrien Wohrer, Ranulfo Romo, Christian K. Machens
Abstract: How much information does a neural population convey about a stimulus? Answers to this question are known to strongly depend on the correlation of response variability in neural populations. These noise correlations, however, are essentially immeasurable as the number of parameters in a noise correlation matrix grows quadratically with population size. Here, we suggest to bypass this problem by imposing a parametric model on a noise correlation matrix. Our basic assumption is that noise correlations arise due to common inputs between neurons. On average, noise correlations will therefore reflect signal correlations, which can be measured in neural populations. We suggest an explicit parametric dependency between signal and noise correlations. We show how this dependency can be used to ”fill the gaps” in noise correlations matrices using an iterative application of the Wishart distribution over positive definitive matrices. We apply our method to data from the primary somatosensory cortex of monkeys performing a two-alternativeforced choice task. We compare the discrimination thresholds read out from the population of recorded neurons with the discrimination threshold of the monkey and show that our method predicts different results than simpler, average schemes of noise correlations. 1
6 0.55471689 17 nips-2010-A biologically plausible network for the computation of orientation dominance
7 0.55340213 200 nips-2010-Over-complete representations on recurrent neural networks can support persistent percepts
8 0.5527758 98 nips-2010-Functional form of motion priors in human motion perception
9 0.55085719 268 nips-2010-The Neural Costs of Optimal Control
10 0.54929227 81 nips-2010-Evaluating neuronal codes for inference using Fisher information
11 0.54785722 117 nips-2010-Identifying graph-structured activation patterns in networks
12 0.5472452 127 nips-2010-Inferring Stimulus Selectivity from the Spatial Structure of Neural Network Dynamics
13 0.54640722 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
14 0.54494083 194 nips-2010-Online Learning for Latent Dirichlet Allocation
15 0.54440236 96 nips-2010-Fractionally Predictive Spiking Neurons
16 0.54310179 44 nips-2010-Brain covariance selection: better individual functional connectivity models using population prior
17 0.54113448 10 nips-2010-A Novel Kernel for Learning a Neuron Model from Spike Train Data
18 0.5407204 19 nips-2010-A rational decision making framework for inhibitory control
19 0.53963721 56 nips-2010-Deciphering subsampled data: adaptive compressive sampling as a principle of brain communication
20 0.53949416 155 nips-2010-Learning the context of a category