nips nips2011 nips2011-10 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Oliver B. Kroemer, Jan R. Peters
Abstract: In this paper, we consider the problem of policy evaluation for continuousstate systems. We present a non-parametric approach to policy evaluation, which uses kernel density estimation to represent the system. The true form of the value function for this model can be determined, and can be computed using Galerkin’s method. Furthermore, we also present a unified view of several well-known policy evaluation methods. In particular, we show that the same Galerkin method can be used to derive Least-Squares Temporal Difference learning, Kernelized Temporal Difference learning, and a discrete-state Dynamic Programming solution, as well as our proposed method. In a numerical evaluation of these algorithms, the proposed approach performed better than the other methods. 1
Reference: text
sentIndex sentText sentNum sentScore
1 de 2 Abstract In this paper, we consider the problem of policy evaluation for continuousstate systems. [sent-4, score-0.274]
2 We present a non-parametric approach to policy evaluation, which uses kernel density estimation to represent the system. [sent-5, score-0.382]
3 Furthermore, we also present a unified view of several well-known policy evaluation methods. [sent-7, score-0.29]
4 1 Introduction Value functions are an essential concept for determining optimal policies in both optimal control [1] and reinforcement learning [2, 3]. [sent-10, score-0.124]
5 Given the value function of a policy, an improved policy is straightforward to compute. [sent-11, score-0.273]
6 The improved policy can subsequently be evaluated to obtain a new value function. [sent-12, score-0.325]
7 This loop of computing value functions and determining better policies is known as policy iteration. [sent-13, score-0.327]
8 However, the main bottleneck in policy iteration is the computation of the value function for a given policy. [sent-14, score-0.29]
9 Using the Bellman equation, only two classes of systems have been solved exactly: tabular discrete state and action problems [4] as well as linear-quadratic regulation problems [5]. [sent-15, score-0.108]
10 The exact computation of the value function remains an open problem for most systems with continuous state spaces [6]. [sent-16, score-0.108]
11 As an alternative to exact solutions, approximate policy evaluation methods have been developed in reinforcement learning. [sent-18, score-0.344]
12 These approaches include Monte Carlo methods, temporal difference learning, and residual gradient methods. [sent-19, score-0.107]
13 Residual gradient approaches are biased unless multiple samples are taken from the same states [9], which is often not possible for real continuous systems. [sent-22, score-0.1]
14 In this paper, we propose a non-parametric method for continuous-state policy evaluation. [sent-23, score-0.227]
15 The proposed method uses a kernel density estimate to represent the system in a flexible manner. [sent-24, score-0.196]
16 We subsequently show that the true value function for this model has a Nadaraya-Watson kernel regression form [12, 13]. [sent-26, score-0.193]
17 The 1 resulting method is called Non-Parametric Dynamic Programming (NPDP), and is a stable as well as consistent approach to policy evaluation. [sent-28, score-0.227]
18 The second contribution of this paper is to provide a unified view of several sample-based algorithms for policy evaluation, including the NPDP algorithm. [sent-29, score-0.258]
19 In reinforcement learning, the uncontrolled system is usually represented by a Markov Decision Process (MDP). [sent-32, score-0.111]
20 Actions a are selected according to the stochastic policy π(a|s). [sent-34, score-0.227]
21 The term system will refer jointly to the agent’s policy and the MDP. [sent-38, score-0.268]
22 The value of a state V (s), for a specific policy π, is defined as the expected discounted sum of rewards that an agent will receive after visiting state s and executing policy π; i. [sent-39, score-0.634]
23 (1) can be rewritten as the Bellman equation ´ ´ V (s) = A S π (a|s) p (s |s, a) [r (s, a) + γV π (s )] ds da. [sent-43, score-0.29]
24 (1) (2) The advantage of using the Bellman equation is that it describes the relationship between the value function at one state s and its immediate follow-up states s ∼ p(s |s, a). [sent-44, score-0.203]
25 2 Non-Parametric Model-based Dynamic Programming We begin describing the NPDP approach by introducing the kernel density estimation framework used to represent the system. [sent-47, score-0.155]
26 The true value function for this model has a kernel regression form, which can be computed by using Galerkin’s projection method. [sent-48, score-0.223]
27 Using Bayes rule and marginalization, one can compute the transition probabilities p(s |s, a) and the current policy π(a|s) from this joint distribution; e. [sent-52, score-0.294]
28 The ith sample includes the current state si ∈ S, the selected action ai ∈ A, and the follow-up state si ∈ S, as well as the immediate reward ri ∈ R. [sent-57, score-0.632]
29 We propose using kernel density estimation to represent the joint distribution [17, 18] in a non-parametric manner. [sent-59, score-0.179]
30 The system’s joint distribution is therefore modeled as p(s, a, s ) = n n−1 i=1 ψi (s ) ϕi (a) φi (s), where ψi (s ) = ψ (s , si ), ϕi (a) = ϕ (a, ai ), and φi (s) = φ (s, si ) are symmetric kernel functions. [sent-61, score-0.453]
31 In practice, the kernel functions ψ and φ will often be the same. [sent-62, score-0.109]
32 2 The reward function r(s, a) must also be represented. [sent-72, score-0.097]
33 Given the kernel density estimate representation, the expected reward for a state-action pair is denoted as [12] n k=1 rk ϕk (a) φk (s) . [sent-73, score-0.232]
34 n i=1 ϕi (a) φi (s) r(s, a) = E[r|s, a] = Having specified the model of the system dynamics and rewards, the next step is to derive the corresponding value function. [sent-74, score-0.105]
35 Every policy has a unique value function, which fulfills the Bellman equation, Eq. [sent-78, score-0.273]
36 Hence, the goal is to solve the Bellman equation for the entire state space, and not just at the sampled states. [sent-80, score-0.11]
37 The Galerkin method involves first projecting the integral equation into the space spanned by a set of basis functions. [sent-82, score-0.155]
38 The integral equation is then solved in this projected space. [sent-83, score-0.093]
39 (2), is rearranged as ´ ´ ´ ´ V (s) = π (a|s) r (s, a) p (s |s, a) ds da + S A p (s |s, a) γV (s ) π (a|s) dads , A S ˆ ˆ p (s) V (s) = p (a, s) r (s, a) da + γ p (s , s) V (s ) ds . [sent-85, score-0.57]
40 Expanding the reward function and joint distributions, as defined in Section 2. [sent-87, score-0.121]
41 Given that p(s) = n n−1 j=1 φj (s), the true value function of the kernel density estimate system has a Nadaraya-Watson kernel regression [12, 13] form −1 n i=1 θi φi (s) . [sent-89, score-0.346]
42 n j=1 φj (s) V (s) = (4) Having computed the true form of the value function, the Galerkin projection method can be used to compute the value weights θ. [sent-90, score-0.172]
43 The projection is performed by taking the expectation of the integral equation with respect to each of the n basis function φi . [sent-91, score-0.192]
44 The resulting n simultaneous equations can be written as the vector equation ˆ ˆ ˆ ˆ T T φ (s) n−1 φ (s) ψ (s ) V (s )ds ds, φ (s) n−1 φ (s) rds+γ φ (s) p(s)V (s)ds = S S S S where the ith elements of the vectors are given by [r]i = ri , [φ (s)]i = φi (s), and [ψ (s )]i = ψi (s ). [sent-92, score-0.102]
45 Expanding the value functions gives ˆ ˆ T S ˆ ˆ T φ (s) φ (s) θds = S T T φ (s) φ (s) rds + γ φ (s) φ (s) ψ (s ) S S φ (s ) θ ds ds, n i=1 φi (s ) Cθ = Cr + γCλθ, ´ T n where C = S φ (s) φ (s) ds, and λ = S ( i=1 φi (s ))−1 ψ (s ) φ (s ) ds is a stochastic matrix; i. [sent-93, score-0.592]
46 1 has a Nadaraya-Watson kernel regression form, as shown in Eq. [sent-101, score-0.097]
47 The non-parametric dynamic programming algorithm is summarized in Alg. [sent-104, score-0.129]
48 Precision refers to how close the predicted value function is to the true value function of the model, while accuracy refers to how close the model is to the true system. [sent-110, score-0.146]
49 One of the key contributions of this paper is providing the true form of the value function for policy evaluation with the non-parametric model described in Section 2. [sent-111, score-0.347]
50 It is important that the model, and thus the value function, converges to that of the true system as the number of samples increases; i. [sent-117, score-0.148]
51 In fact, kernel density estimation can be proven to have almost sure convergence to the true distribution for a wide range of kernels [22]. [sent-120, score-0.187]
52 Similar to other kernel-based policy evaluation methods [23, 24], NPDP has a computational complexity of O(n3 ) when performed naively. [sent-125, score-0.274]
53 1 Least Squares Temporal Difference Learning The LSTD algorithm allows the value function V (s) to be represented by a set of m arbitrary m ˆ ˆ T ˆ ˆ ˆ ˆ basis functions φi (s), see [14]. [sent-132, score-0.142]
54 In order to re-derive ˆ of coefficients learned during policy evaluation, and [φ the LSTD policy evaluation, the joint distribution is represented as a set of delta functions n p (s, a, s ) = n−1 i=1 δi (s, a, s ), where δi (s, a, s ) is a Dirac delta function centered on (si , ai , si ). [sent-134, score-0.788]
55 Using Galerkin’s method, the integral equation is projected into the space of ˆ the basis functions φ (s). [sent-135, score-0.189]
56 This equation is also solved by LSTD, including the incremental updates of A and b as new samples are acquired [14]. [sent-139, score-0.101]
57 Therefore, LSTD can be seen as computing the transitions between the basis functions using a Monte Carlo approach. [sent-140, score-0.117]
58 The computed value function will always be a projection of the true value function into the space of these basis functions [8]. [sent-143, score-0.268]
59 If the true value function does not lie within the space of these basis functions, the resulting approximation may be arbitrarily inaccurate, regardless of the number of acquired samples. [sent-144, score-0.153]
60 However, using predefined basis functions only requires inverting an m × m matrix, which results in a lower computational complexity than NPDP. [sent-145, score-0.115]
61 The LSTD may also need to be regularized, as the inversion of A becomes ill-posed if the basis functions are too densely spaced. [sent-146, score-0.125]
62 2 Kernelized Temporal Difference Learning Methods The proposed approach is of course not the first to use kernels for policy evaluation. [sent-149, score-0.252]
63 Methods such as kernelized least-squares temporal difference learning [24] and Gaussian process temporal difference learning [23] have also employed kernels in policy evaluation. [sent-150, score-0.533]
64 The KTD approach assumes that the reward and value functions can be represented by ˆ kernelized linear least-squares regression; i. [sent-153, score-0.286]
65 , r(s) = k(s)T K −1 r and V (s) = k(s)T θ, ˆ where [k(s)]i = k(s, si ), [K]ij = k(si , sj ), [r]i = ri , and θ is a weight vector. [sent-155, score-0.33]
66 The Galerkin method projects the integral equation ˇ ˇ ˇ into the space of the Kronecker delta functions [δ(s)]i = δi (s, ai , si ), where δi (s, a, s ) = 1 ˇi (s, a, s ) = 0. [sent-157, score-0.377]
67 In this manner, the ˆ KTD approach computes a weighting θ such that the difference in the value at si and the discounted value at si equals the observed empirical reward ri . [sent-161, score-0.593]
68 Thus, only the finite set of sampled states are regarded for policy evaluation. [sent-162, score-0.309]
69 Gaussian process temporal difference learning [23], require that the samples are obtained from a single trajectory to ensure that si = si+1 . [sent-165, score-0.284]
70 In the original paper [15], the KTD algorithm represents the transitions ˆ ˆ by using linear kernelized regression k(s ) = k(s)T K −1 K , where [k(s )]i = E[k(s , si )]. [sent-168, score-0.316]
71 For clarity, we use a box-cart kernel with a width of one k(si , sj ) = 1 iff si − sj ≤ 0. [sent-174, score-0.465]
72 ˇ Given a system with q discrete states, the value function has the form V (s) = δ(s)T v, ˇ where δ(s) is a vector of q Kronecker delta functions centered on the discrete states. [sent-189, score-0.198]
73 The ˇ corresponding reward function is r(s) = δ(s)T r . [sent-190, score-0.097]
74 Galerkin’s method projects the integral equation into the space of the i=1 ˇ states δ(s). [sent-192, score-0.156]
75 (3) becomes ˆ ˆ ˆ ˇ ˇ ˇ ˇ ˇ ˇ δ (s) p (s) δ(s)T vds = δ (s) p (s) δ(s)T r ds + γ δ (s) p (s, s ) δ(s )T vds ds, ¯ S S S ˆ ˇ ˇ Iv = I r + γ δ (s) δ(s)T P δ(s )δ(s )T vds ds, ¯ S v = r + γP v, ¯ v = (I − γP )−1 r , ¯ (6) which is the same computation used by DSDP [16]. [sent-194, score-0.394]
76 While NPDP uses a kernel density estimation, the DSDP algorithm uses a histogram representation. [sent-196, score-0.135]
77 The DSDP algorithm has also been the basis for continuous-state policy evaluation algorithms [26, 27]. [sent-198, score-0.336]
78 These algorithms first use the sampled states as the discrete states of an MDP and compute the corresponding values. [sent-199, score-0.128]
79 Unlike these methods, NPDP explicitly performs policy evaluation for a continuous set of states. [sent-201, score-0.294]
80 6 4 Numerical Evaluation In this section, we compare the different policy evaluation methods discussed in the previous section, with the proposed NPDP method, on an illustrative benchmark system. [sent-202, score-0.291]
81 The reward functions consist of three Gaussians, as shown by the black line in Fig. [sent-210, score-0.131]
82 The KTD method was implemented using a Gaussian kernel function and regularization. [sent-212, score-0.091]
83 The hyper-parameters of all four methods, including the number of basis functions for LSTD and DSDP, were carefully tuned to achieve the best performance. [sent-216, score-0.096]
84 The policy evaluations performed by the tested methods were always based on only 500 samples; i. [sent-218, score-0.227]
85 Two of the performance measures are based on the weighted Mean Squared Error (MSE) ´1 2 [2] E(V ) = 0 W (s) (V (s) − V (s)) ds where V is the true value function and W (s) ≥ 0, ´1 for all states, is a weighting distribution 0 W (s)ds = 1. [sent-225, score-0.314]
86 Thus, Esamp is an indicator of the accuracy in the space of the samples, while Eunif is an indicator of how well the computed value function generalizes to the entire state space. [sent-228, score-0.104]
87 This performance measure is the basis of a bound on the overall value function approximation [20]. [sent-230, score-0.108]
88 0449 Table 1: Each row corresponds to one of the four tested algorithms for policy evaluation. [sent-258, score-0.227]
89 The LSTD, KTD, DSDP, and NPDP methods evaluated the policy using only 500 points. [sent-284, score-0.256]
90 3 Discussion The LSTD algorithm achieved a relatively low Eunif value, which indicates that the tuned basis functions could accurately represent the true value function. [sent-286, score-0.189]
91 However, the performance of LSTD is sensitive to the choice of basis functions and the number of samples per basis function. [sent-287, score-0.192]
92 Using 20 basis functions instead of 15 reduces the performance of LSTD to Eunif = 2. [sent-288, score-0.096]
93 The performance of NPDP could be further improved by using adaptive kernel density estimation [28] to locally adapt the kernels’ bandwidths according to the sampling density. [sent-295, score-0.135]
94 5 Conclusion This paper presents two key contributions to continuous-state policy evaluation. [sent-297, score-0.227]
95 The first contribution is the Non-Parametric Dynamic Programming algorithm for policy evaluation. [sent-298, score-0.242]
96 The proposed method uses a kernel density estimate to generate a consistent representation of the system. [sent-299, score-0.135]
97 It was shown that the true form of the value function for this model is given by a Nadaraya-Watson kernel regression. [sent-300, score-0.148]
98 As a kernel-based approach, NPDP simultaneously addresses the problems of function approximation and policy evaluation. [sent-302, score-0.227]
99 These four approaches were also evaluated and compared on an empirical problem with a continuous state space and non-linear reward function, wherein the NPDP algorithm out-performed the other methods. [sent-305, score-0.188]
100 Three connectionist implementations of dynamic programming for optimal control: A preliminary comparative analysis. [sent-356, score-0.129]
wordName wordTfidf (topN-words)
[('npdp', 0.479), ('ktd', 0.325), ('dsdp', 0.308), ('lstd', 0.301), ('galerkin', 0.274), ('erence', 0.259), ('ds', 0.241), ('policy', 0.227), ('si', 0.164), ('di', 0.15), ('sj', 0.113), ('kernelized', 0.109), ('eunif', 0.103), ('reward', 0.097), ('temporal', 0.086), ('esamp', 0.086), ('kernel', 0.075), ('bellman', 0.07), ('reinforcement', 0.07), ('dynamic', 0.065), ('programming', 0.064), ('basis', 0.062), ('density', 0.06), ('ri', 0.053), ('vds', 0.051), ('equation', 0.049), ('evaluation', 0.047), ('states', 0.046), ('value', 0.046), ('integral', 0.044), ('da', 0.044), ('delta', 0.043), ('transition', 0.043), ('state', 0.042), ('emax', 0.042), ('system', 0.041), ('projection', 0.037), ('samples', 0.034), ('functions', 0.034), ('rds', 0.03), ('uni', 0.03), ('sk', 0.029), ('inversion', 0.029), ('carlo', 0.029), ('evaluated', 0.029), ('monte', 0.028), ('true', 0.027), ('rewards', 0.027), ('ai', 0.026), ('aj', 0.026), ('mse', 0.026), ('kernels', 0.025), ('dimitri', 0.025), ('parr', 0.025), ('tabular', 0.025), ('action', 0.024), ('joint', 0.024), ('mdp', 0.024), ('subsequently', 0.023), ('discounted', 0.023), ('regression', 0.022), ('residual', 0.021), ('transitions', 0.021), ('athena', 0.02), ('erent', 0.02), ('policies', 0.02), ('continuous', 0.02), ('represent', 0.02), ('immediate', 0.02), ('inverting', 0.019), ('sampled', 0.019), ('expanding', 0.019), ('kronecker', 0.018), ('acquired', 0.018), ('derive', 0.018), ('clarity', 0.018), ('numerical', 0.017), ('projects', 0.017), ('icml', 0.017), ('discrete', 0.017), ('regarded', 0.017), ('bottleneck', 0.017), ('benchmark', 0.017), ('inaccurate', 0.016), ('computed', 0.016), ('disadvantage', 0.016), ('actions', 0.016), ('implemented', 0.016), ('discount', 0.016), ('view', 0.016), ('contribution', 0.015), ('precision', 0.015), ('rafael', 0.015), ('bersini', 0.015), ('dirk', 0.015), ('dominik', 0.015), ('excessively', 0.015), ('gavin', 0.015), ('kimeldorf', 0.015), ('neumann', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 10 nips-2011-A Non-Parametric Approach to Dynamic Programming
Author: Oliver B. Kroemer, Jan R. Peters
Abstract: In this paper, we consider the problem of policy evaluation for continuousstate systems. We present a non-parametric approach to policy evaluation, which uses kernel density estimation to represent the system. The true form of the value function for this model can be determined, and can be computed using Galerkin’s method. Furthermore, we also present a unified view of several well-known policy evaluation methods. In particular, we show that the same Galerkin method can be used to derive Least-Squares Temporal Difference learning, Kernelized Temporal Difference learning, and a discrete-state Dynamic Programming solution, as well as our proposed method. In a numerical evaluation of these algorithms, the proposed approach performed better than the other methods. 1
2 0.20465659 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
Author: Paul Wagner
Abstract: A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view to this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm that can interpolate between the aforementioned approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance in the Tetris benchmark problem of all attempted approximate dynamic programming methods. We report empirical evidence against such a connection and in favor of an alternative explanation. Finally, we report scores in the Tetris problem that improve on existing dynamic programming based results. 1
3 0.14979292 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning
Author: Jaedeug Choi, Kee-eung Kim
Abstract: The difficulty in inverse reinforcement learning (IRL) arises in choosing the best reward function since there are typically an infinite number of reward functions that yield the given behaviour data as optimal. Using a Bayesian framework, we address this challenge by using the maximum a posteriori (MAP) estimation for the reward function, and show that most of the previous IRL algorithms can be modeled into our framework. We also present a gradient method for the MAP estimation based on the (sub)differentiability of the posterior distribution. We show the effectiveness of our approach by comparing the performance of the proposed method to those of the previous algorithms. 1
4 0.14915675 215 nips-2011-Policy Gradient Coagent Networks
Author: Philip S. Thomas
Abstract: We present a novel class of actor-critic algorithms for actors consisting of sets of interacting modules. We present, analyze theoretically, and empirically evaluate an update rule for each module, which requires only local information: the module’s input, output, and the TD error broadcast by a critic. Such updates are necessary when computation of compatible features becomes prohibitively difficult and are also desirable to increase the biological plausibility of reinforcement learning methods. 1
5 0.13189618 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning
Author: Amir-massoud Farahmand
Abstract: Many practitioners of reinforcement learning problems have observed that oftentimes the performance of the agent reaches very close to the optimal performance even though the estimated (action-)value function is still far from the optimal one. The goal of this paper is to explain and formalize this phenomenon by introducing the concept of the action-gap regularity. As a typical result, we prove that for an ˆ agent following the greedy policy π with respect to an action-value function Q, the ˆ ˆ ˆ performance loss E V ∗ (X) − V π (X) is upper bounded by O( Q − Q∗ 1+ζ ), ∞ in which ζ ≥ 0 is the parameter quantifying the action-gap regularity. For ζ > 0, our results indicate smaller performance loss compared to what previous analyses had suggested. Finally, we show how this regularity affects the performance of the family of approximate value iteration algorithms. 1
6 0.13160369 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions
7 0.12664285 283 nips-2011-The Fixed Points of Off-Policy TD
8 0.11732192 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
9 0.094621144 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
10 0.094115235 190 nips-2011-Nonlinear Inverse Reinforcement Learning with Gaussian Processes
11 0.090289444 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
12 0.081895843 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning
13 0.074264728 50 nips-2011-Budgeted Optimization with Concurrent Stochastic-Duration Experiments
14 0.071679138 280 nips-2011-Testing a Bayesian Measure of Representativeness Using a Large Image Database
15 0.067752898 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search
16 0.065647453 237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization
17 0.057503305 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
18 0.05695485 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning
19 0.053822331 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
20 0.052341875 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation
topicId topicWeight
[(0, 0.145), (1, -0.137), (2, 0.085), (3, 0.171), (4, -0.218), (5, 0.047), (6, -0.014), (7, 0.027), (8, -0.001), (9, 0.028), (10, -0.012), (11, 0.035), (12, 0.026), (13, 0.026), (14, -0.055), (15, -0.064), (16, 0.022), (17, 0.082), (18, 0.012), (19, 0.028), (20, -0.005), (21, -0.049), (22, 0.075), (23, -0.013), (24, -0.03), (25, -0.023), (26, 0.021), (27, -0.018), (28, 0.015), (29, 0.01), (30, -0.019), (31, 0.023), (32, 0.069), (33, -0.058), (34, -0.016), (35, 0.049), (36, -0.005), (37, 0.064), (38, 0.033), (39, 0.064), (40, -0.04), (41, -0.05), (42, -0.003), (43, 0.011), (44, 0.005), (45, 0.032), (46, -0.13), (47, 0.046), (48, -0.03), (49, -0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.93588132 10 nips-2011-A Non-Parametric Approach to Dynamic Programming
Author: Oliver B. Kroemer, Jan R. Peters
Abstract: In this paper, we consider the problem of policy evaluation for continuousstate systems. We present a non-parametric approach to policy evaluation, which uses kernel density estimation to represent the system. The true form of the value function for this model can be determined, and can be computed using Galerkin’s method. Furthermore, we also present a unified view of several well-known policy evaluation methods. In particular, we show that the same Galerkin method can be used to derive Least-Squares Temporal Difference learning, Kernelized Temporal Difference learning, and a discrete-state Dynamic Programming solution, as well as our proposed method. In a numerical evaluation of these algorithms, the proposed approach performed better than the other methods. 1
2 0.82845098 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
Author: Paul Wagner
Abstract: A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view to this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm that can interpolate between the aforementioned approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance in the Tetris benchmark problem of all attempted approximate dynamic programming methods. We report empirical evidence against such a connection and in favor of an alternative explanation. Finally, we report scores in the Tetris problem that improve on existing dynamic programming based results. 1
3 0.80826271 215 nips-2011-Policy Gradient Coagent Networks
Author: Philip S. Thomas
Abstract: We present a novel class of actor-critic algorithms for actors consisting of sets of interacting modules. We present, analyze theoretically, and empirically evaluate an update rule for each module, which requires only local information: the module’s input, output, and the TD error broadcast by a critic. Such updates are necessary when computation of compatible features becomes prohibitively difficult and are also desirable to increase the biological plausibility of reinforcement learning methods. 1
4 0.73866159 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation
Author: Tingting Zhao, Hirotaka Hachiya, Gang Niu, Masashi Sugiyama
Abstract: Policy gradient is a useful model-free reinforcement learning approach, but it tends to suffer from instability of gradient estimates. In this paper, we analyze and improve the stability of policy gradient methods. We first prove that the variance of gradient estimates in the PGPE (policy gradients with parameter-based exploration) method is smaller than that of the classical REINFORCE method under a mild assumption. We then derive the optimal baseline for PGPE, which contributes to further reducing the variance. We also theoretically show that PGPE with the optimal baseline is more preferable than REINFORCE with the optimal baseline in terms of the variance of gradient estimates. Finally, we demonstrate the usefulness of the improved PGPE method through experiments. 1
5 0.72808462 50 nips-2011-Budgeted Optimization with Concurrent Stochastic-Duration Experiments
Author: Javad Azimi, Alan Fern, Xiaoli Z. Fern
Abstract: Budgeted optimization involves optimizing an unknown function that is costly to evaluate by requesting a limited number of function evaluations at intelligently selected inputs. Typical problem formulations assume that experiments are selected one at a time with a limited total number of experiments, which fail to capture important aspects of many real-world problems. This paper defines a novel problem formulation with the following important extensions: 1) allowing for concurrent experiments; 2) allowing for stochastic experiment durations; and 3) placing constraints on both the total number of experiments and the total experimental time. We develop both offline and online algorithms for selecting concurrent experiments in this new setting and provide experimental results on a number of optimization benchmarks. The results show that our algorithms produce highly effective schedules compared to natural baselines. 1
6 0.71946466 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions
7 0.70954102 237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization
8 0.70118546 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning
9 0.62211376 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning
10 0.61647397 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning
11 0.57946736 65 nips-2011-Convergent Fitted Value Iteration with Linear Function Approximation
12 0.5496704 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search
13 0.53381658 190 nips-2011-Nonlinear Inverse Reinforcement Learning with Gaussian Processes
14 0.48420024 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
15 0.44718993 11 nips-2011-A Reinforcement Learning Theory for Homeostatic Regulation
16 0.4467811 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
17 0.4351007 283 nips-2011-The Fixed Points of Off-Policy TD
18 0.41813204 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
19 0.38656765 256 nips-2011-Solving Decision Problems with Limited Information
20 0.32615644 38 nips-2011-Anatomically Constrained Decoding of Finger Flexion from Electrocorticographic Signals
topicId topicWeight
[(0, 0.015), (4, 0.041), (11, 0.028), (20, 0.016), (26, 0.019), (31, 0.122), (33, 0.02), (43, 0.052), (45, 0.094), (57, 0.038), (65, 0.01), (74, 0.041), (83, 0.031), (97, 0.282), (99, 0.079)]
simIndex simValue paperId paperTitle
1 0.80045825 184 nips-2011-Neuronal Adaptation for Sampling-Based Probabilistic Inference in Perceptual Bistability
Author: David P. Reichert, Peggy Series, Amos J. Storkey
Abstract: It has been argued that perceptual multistability reflects probabilistic inference performed by the brain when sensory input is ambiguous. Alternatively, more traditional explanations of multistability refer to low-level mechanisms such as neuronal adaptation. We employ a Deep Boltzmann Machine (DBM) model of cortical processing to demonstrate that these two different approaches can be combined in the same framework. Based on recent developments in machine learning, we show how neuronal adaptation can be understood as a mechanism that improves probabilistic, sampling-based inference. Using the ambiguous Necker cube image, we analyze the perceptual switching exhibited by the model. We also examine the influence of spatial attention, and explore how binocular rivalry can be modeled with the same approach. Our work joins earlier studies in demonstrating how the principles underlying DBMs relate to cortical processing, and offers novel perspectives on the neural implementation of approximate probabilistic inference in the brain. 1
2 0.76764715 90 nips-2011-Evaluating the inverse decision-making approach to preference learning
Author: Alan Jern, Christopher G. Lucas, Charles Kemp
Abstract: Psychologists have recently begun to develop computational accounts of how people infer others’ preferences from their behavior. The inverse decision-making approach proposes that people infer preferences by inverting a generative model of decision-making. Existing data sets, however, do not provide sufficient resolution to thoroughly evaluate this approach. We introduce a new preference learning task that provides a benchmark for evaluating computational accounts and use it to compare the inverse decision-making approach to a feature-based approach, which relies on a discriminative combination of decision features. Our data support the inverse decision-making approach to preference learning. A basic principle of decision-making is that knowing people’s preferences allows us to predict how they will behave: if you know your friend likes comedies and hates horror films, you can probably guess which of these options she will choose when she goes to the theater. Often, however, we do not know what other people like and we can only infer their preferences from their behavior. If you know that a different friend saw a comedy today, does that mean that he likes comedies in general? The conclusion you draw will likely depend on what else was playing and what movie choices he has made in the past. A goal for social cognition research is to develop a computational account of people’s ability to infer others’ preferences. One computational approach is based on inverse decision-making. This approach begins with a model of how someone’s preferences lead to a decision. Then, this model is inverted to determine the most likely preferences that motivated an observed decision. An alternative approach might simply learn a functional mapping between features of an observed decision and the preferences that motivated it. For instance, in your friend’s decision to see a comedy, perhaps the more movie options he turned down, the more likely it is that he has a true preference for comedies. The difference between the inverse decision-making approach and the feature-based approach maps onto the standard dichotomy between generative and discriminative models. Economists have developed an instance of the inverse decision-making approach known as the multinomial logit model [1] that has been widely used to infer consumer’s preferences from their choices. This model has recently been explored as a psychological model [2, 3, 4], but there are few behavioral data sets for evaluating it as a model of how people learn others’ preferences. Additionally, the data sets that do exist tend to be drawn from the developmental literature, which focuses on simple tasks that collect only one or two judgments from children [5, 6, 7]. The limitations of these data sets make it difficult to evaluate the multinomial logit model with respect to alternative accounts of preference learning like the feature-based approach. In this paper, we use data from a new experimental task that elicits a detailed set of preference judgments from a single participant in order to evaluate the predictions of several preference learning models from both the inverse decision-making and feature-based classes. Our task requires each participant to sort a large number of observed decisions on the basis of how strongly they indicate 1 (a) (b) (c) d c c (d) b b a a x d x d c b a x 1. Number of chosen effects (−/+) 2. Number of forgone effects (+/+) 3. Number of forgone options (+/+) 4. Number of forgone options containing x (−/−) 5. Max/min number of effects in a forgone option (+/−) 6. Is x in every option? (−/−) 7. Chose only option with x? (+/+) 8. Is x the only difference between options? (+/+) 9. Do all options have same number of effects? (+/+) 10. Chose option with max/min number of effects? (−/−) Figure 1: (a)–(c) Examples of the decisions used in the experiments. Each column represents one option and the boxes represent different effects. The chosen option is indicated by the black rectangle. (d) Features used by the weighted feature and ranked feature models. Features 5 and 10 involved maxima in Experiment 1, which focused on all positive effects, and minima in Experiment 2, which focused on all negative effects. The signs in parentheses indicate the direction of the feature that suggests a stronger preference in Experiment 1 / Experiment 2. a preference for a chosen item. Because the number of decisions is large and these decisions vary on multiple dimensions, predicting how people will order them offers a challenging benchmark on which to compare computational models of preference learning. Data sets from these sorts of detailed tasks have proved fruitful in other domains. For example, data reported by Shepard, Hovland, and Jenkins [8]; Osherson, Smith, Wilkie, L´ pez, and Shafir [9]; and Wasserman, Elek, Chatlosh, o and Baker [10] have motivated much subsequent research on category learning, inductive reasoning, and causal reasoning, respectively. We first describe our preference learning task in detail. We then present several inverse decisionmaking and feature-based models of preference learning and compare these models’ predictions to people’s judgments in two experiments. The data are well predicted by models that follow the inverse decision-making approach, suggesting that this computational approach may help explain how people learn others’ preferences. 1 Multi-attribute decisions and revealed preferences We designed a task that can be used to elicit a large number of preference judgments from a single participant. The task involves a set of observed multi-attribute decisions, some examples of which are represented visually in Figure 1. Each decision is among a set of options and each option produces a set of effects. Figure 1 shows several decisions involving a total of five effects distributed among up to five options. The differently colored boxes represent different effects and the chosen option is marked by a black rectangle. For example, 1a shows a choice between an option with four effects and an option with a single effect; here, the decision maker chose the second option. In our task, people are asked to rank a large number of these decisions by how strongly they suggest that the decision maker had a preference for a particular effect (e.g., effect x in Figure 1). By imposing some minimal constraints, the space of unique multi-attribute decisions is finite and we can obtain rankings for every decision in the space. For example, Figure 2c shows a complete list of 47 unique decisions involving up to five effects, subject to several constraints described later. Three of these decisions are shown in Figure 1. If all the effects are positive—pieces of candy, for example—the first decision (1a) suggests a strong preference for candy x, because the decision maker turned down four pieces in favor of one. The second decision (1b), however, offers much weaker evidence because nearly everyone would choose four pieces of candy over one, even without a specific preference for x. The third decision (1c) provides evidence that is strong but perhaps not quite as strong as the first decision. When all effects are negative—like electric shocks at different body locations—decision makers may still find some effects more tolerable than others, but different inferences are sometimes supported. For example, for negative effects, 1a provides weak evidence that x is relatively tolerable because nearly everyone would choose one shock over four. 2 A computational account of preference learning We now describe a simple computational model for learning a person’s preferences after observing that person make a decision like the ones in Figure 1. We assume that there are n available options 2 {o1 , . . . , on }, each of which produces one or more effects from the set {f1 , f2 , ..., fm }. For simplicity, we assume that effects are binary. Let ui denote the utility the decision maker assigns to effect fi . We begin by specifying a model of decision-making that makes the standard assumptions that decision makers tend to choose things with greater utility and that utilities are additive. That is, if fj is a binary vector indicating the effects produced by option oj and u is a vector of utilities assigned to each of the m effects, then the total utility associated with option oj can be expressed as Uj = fj T u. We complete the specification of the model by applying the Luce choice rule [11], a common psychological model of choice behavior, as the function that chooses among the options: p(c = oj |u, f ) = exp(Uj ) = exp(Uk ) n k=1 exp(fj T u) n T k=1 exp(fk u) (1) where c denotes the choice made. This model can predict the choice someone will make among a specified set of options, given the utilities that person assigns to the effects in each option. To obtain estimates of someone’s utilities, we invert this model by applying Bayes’ rule: p(u|c, F) = p(c|u, F)p(u) p(c|F) (2) where F = {f1 , . . . , fn } specifies the available options and their corresponding effects. This is the multinomial logit model [1], a standard econometric model. In order to apply Equation 2 we must specify a prior p(u) on the utilities. We adopt a standard approach that places independent Gaussian priors on the utilities: ui ∼ N (µ, σ 2 ). For decisions where effects are positive—like candies—we set µ = 2σ, which corresponds to a prior distribution that places approximately 2% of the probability mass below zero. Similarly, for negative effects—like electric shocks—we set µ = −2σ. 2.1 Ordering a set of observed decisions Equation 2 specifies a posterior probability distribution over utilities for a single observed decision but does not provide a way to compare the inferences drawn from multiple decisions for the purposes of ordering them. Suppose we are interested in a decision maker’s preference for effect x and we wish to order a set of decisions by how strongly they support this preference. Two criteria for ordering the decisions are as follows: Absolute utility Relative utility p(c|ux , F)p(ux ) p(c|F) p(c|∀j ux ≥ uj , F)p(∀j ux ≥ uj ) p(∀j ux ≥ uj |c, F) = p(c|F) E(ux |c, F) = Eux The absolute utility model orders decisions by the mean posterior utility for effect x. This criterion is perhaps the most natural way to assess how much a decision indicates a preference for x, but it requires an inference about the utility of x in isolation, and research suggests that people often think about the utility of an effect only in relation to other salient possibilities [12]. The relative utility model applies this idea to preference learning by ordering decisions based on how strongly they suggest that x has a greater utility than all other effects. The decisions in Figures 1b and 1c are cases where the two models lead to different predictions. If the effects are all negative (e.g., electric shocks), the absolute utility model predicts that 1b provides stronger evidence for a tolerance for x because the decision maker chose to receive four shocks instead of just one. The relative utility model predicts that 1c provides stronger evidence because 1b offers no way to determine the relative tolerance of the four chosen effects with respect to one another. Like all generative models, the absolute and relative models incorporate three qualitatively different components: the likelihood term p(c|u, F), the prior p(u), and the reciprocal of the marginal likelihood 1/p(c|F). We assume that the total number of effects is fixed in advance and, as a result, the prior term will be the same for all decisions that we consider. The two other components, however, will vary across decisions. The inverse decision-making approach predicts that both components should influence preference judgments, and we will test this prediction by comparing our 3 two inverse decision-making models to two alternatives that rely only one of these components as an ordering criterion: p(c|∀j ux ≥ uj , F) 1/p(c|F) Representativeness Surprise The representativeness model captures how likely the observed decision would be if the utility for x were high, and previous research has shown that people sometimes rely on a representativeness computation of this kind [13]. The surprise model captures how unexpected the observed decision is overall; surprising decisions may be best explained in terms of a strong preference for x, but unsurprising decisions provide little information about x in particular. 2.2 Feature-based models We also consider a class of feature-based models that use surface features to order decisions. The ten features that we consider are shown in Figure 1d, where x is the effect of interest. As an example, the first feature specifies the number of effects chosen; because x is always among the chosen effects, decisions where few or no other effects belong to the chosen option suggest the strongest preference for x (when all effects are positive). This and the second feature were previously identified by Newtson [14]; we included the eight additional features shown in Figure 1d in an attempt to include all possible features that seemed both simple and relevant. We consider two methods for combining this set of features to order a set of decisions by how strongly they suggest a preference for x. The first model is a standard linear regression model, which we refer to as the weighted feature model. The model learns a weight for each feature, and the rank of a given decision is determined by a weighted sum of its features. The second model is a ranked feature model that sorts the observed decisions with respect to a strict ranking of the features. The top-ranked feature corresponds to the primary sort key, the second-ranked feature to the secondary sort key, and so on. For example, suppose that the top-ranked feature is the number of chosen effects and the second-ranked feature is the number of forgone options. Sorting the three decisions in Figure 1 according to this criterion produces the following ordering: 1a,1c,1b. This notion of sorting items on the basis of ranked features has been applied before to decision-making [15, 16] and other domains of psychology [17], but we are not aware of any previous applications to preference learning. Although our inverse decision-making and feature-based models represent two very different approaches, both may turn out to be valuable. An inverse decision-making approach may be the appropriate account of preference learning at Marr’s [18] computational level, and a feature-based approach may capture the psychological processes by which the computational-level account is implemented. Our goal, therefore, is not necessarily to accept one of these approaches and dismiss the other. Instead, we entertain three distinct possibilities. First, both approaches may account well for the data, which would support the idea that they are valid accounts operating at different levels of analysis. Second, the inverse decision-making approach may offer a better account, suggesting that process-level accounts other than the feature-based approach should be explored. Finally, the feature-based approach may offer a better account, suggesting that inverse decision-making does not constitute an appropriate computational-level account of preference learning. 3 Experiment 1: Positive effects Our first experiment focuses on decisions involving only positive effects. The full set of 47 decisions we used is shown in Figure 2c. This set includes every possible unique decision with up to five different effects, subject to the following constraints: (1) one of the effects (effect x) must always appear in the chosen option, (2) there are no repeated options, (3) each effect may appear in an option at most once, (4) only effects in the chosen option may be repeated in other options, and (5) when effects appear in multiple options, the number of effects is held constant across options. The first constraint is necessary for the sorting task, the second two constraints create a finite space of decisions, and the final two constraints limit attention to what we deemed the most interesting cases. Method 43 Carnegie Mellon undergraduates participated for course credit. Each participant was given a set of cards, with one decision printed on each card. The decisions were represented visually 4 (a) (c) Decisions 42 40 45 Mean human rankings 38 30 23 20 22 17 13 12 11 10 9 8 7 6 19 18 31 34 28 21 26 36 35 33 37 27 29 32 25 24 16 15 14 5 4 3 2 1 Absolute utility model rankings (b) Mean human rankings (Experiment 1) 47 43 44 46 45 38 37 36 34 35 30 32 33 31 29 28 24 26 27 25 21 19 22 20 18 16 17 12 13 7 6 11 5 9 4 10 8 1 2 3 42 40 41 39 47 46 44 41 43 39 23 15 14 Mean human rankings (Experiment 2) 1. dcbax 2. cbax 3. bax 4. ax 5. x 6. dcax | bcax 7. dx | cx | bx | ax 8. cax | bax 9. bdx | bcx | bax 10. dcx | bax 11. bx | ax 12. bdx | cax | bax 13. cx | bx | ax 14. d | cbax 15. c | bax 16. b | ax 17. d | c | bax 18. dc | bax 19. c | b | ax 20. dc | bx | ax 21. bdc | bax 22. ad | cx | bx | ax 23. d | c | b | ax 24. bad | bcx | bax 25. ac | bx | ax 26. cb | ax 27. cbad | cbax 28. dc | b | ax 29. ad | ac | bx | ax 30. ab | ax 31. bad | bax 32. dc | ab | ax 33. dcb | ax 34. a | x 35. bad | bac | bax 36. ac | ab | ax 37. ad | ac | ab | ax 38. b | a | x 39. ba | x 40. c | b | a | x 41. cb | a | x 42. d | c | b | a | x 43. cba | x 44. dc | ba | x 45. dc | b | a | x 46. dcb | a | x 47. dcba | x Figure 2: (a) Comparison between the absolute utility model rankings and the mean human rankings for Experiment 1. Each point represents one decision, numbered with respect to the list in panel c. (b) Comparison between the mean human rankings in Experiments 1 and 2. In both scatter plots, the solid diagonal lines indicate a perfect correspondence between the two sets of rankings. (c) The complete set of decisions, ordered by the mean human rankings from Experiment 1. Options are separated by vertical bars and the chosen option is always at the far right. Participants were always asked about a preference for effect x. as in Figure 1 but without the letter labels. Participants were told that the effects were different types of candy and each option was a bag containing one or more pieces of candy. They were asked to sort the cards by how strongly each decision suggested that the decision maker liked a particular target candy, labeled x in Figure 2c. They sorted the cards freely on a table but reported their final rankings by writing them on a sheet of paper, from weakest to strongest evidence. They were instructed to order the cards as completely as possible, but were told that they could assign the same ranking to a set of cards if they believed those cards provided equal evidence. 3.1 Results Two participants were excluded as outliers based on the criterion that their rankings for at least five decisions were at least three standard deviations from the mean rankings. We performed a hierarchical clustering analysis of the remaining 41 participants’ rankings using rank correlation as a similarity metric. Participants’ rankings were highly correlated: cutting the resulting dendrogram at 0.2 resulted in one cluster that included 33 participants and the second largest cluster included 5 Surprise MAE = 17.8 MAE = 7.0 MAE = 4.3 MAE = 17.3 MAE = 9.5 Human rankings Experiment 2 Negative effects Representativeness MAE = 2.3 MAE = 6.7 Experiment 1 Positive effects Relative utility MAE = 2.3 Human rankings Absolute utility Model rankings Model rankings Model rankings Model rankings Figure 3: Comparison between human rankings in both experiments and predicted rankings from four models. The solid diagonal lines indicate a perfect correspondence between human and model rankings. only 3 participants. Thus, we grouped all participants together and analyzed their mean rankings. The 0.2 threshold was chosen because it produced the most informative clustering in Experiment 2. Inverse decision-making models We implemented the inverse decision-making models using importance sampling with 5 million samples drawn from the prior distribution p(u). Because all the effects were positive, we used a prior on utilities that placed nearly all probability mass above zero (µ = 4, σ = 2). The mean human rankings are compared with the absolute utility model rankings in Figure 2a, and the mean human rankings are listed in order in 2c. Fractional rankings were used for both the human data and the model predictions. The human rankings in the figure are the means of participants’ fractional rankings. The first row of Figure 3 contains similar plots that allow comparison of the four models we considered. In these plots, the solid diagonal lines indicate a perfect correspondence between model and human rankings. Thus, the largest deviations from this line represent the largest deviations in the data from the model’s predictions. Figure 3 shows that the absolute and relative utility models make virtually identical predictions and both models provide a strong account of the human rankings as measured by mean absolute error (MAE = 2.3 in both cases). Moreover, both models correctly predict the highest ranked decision and the set of lowest ranked decisions. The only clear discrepancy between the model predictions and the data is the cluster of points at the lower left, labeled as Decisions 6–13 in Figure 2a. These are all cases in which effect x appears in all options and therefore these decisions provide no information about a decision maker’s preference for x. Consequently, the models assign the same ranking to this group as to the group of decisions in which there is only a single option (Decisions 1–5). Although people appeared to treat these groups somewhat differently, the models still correctly predict that the entire group of decisions 1–13 is ranked lower than all other decisions. The surprise and representativeness models do not perform nearly as well (MAE = 7.0 and 17.8, respectively). Although the surprise model captures some of the general trends in the human rankings, it makes several major errors. For example, consider Decision 7: dx|cx|bx|ax. This decision provides no information about a preference for x because it appears in every option. The decision is surprising, however, because a decision maker choosing at random from these options would make the observed choice only 1/4 of the time. The representativeness model performs even worse, primarily because it does not take into account alternative explanations for why an option was chosen, such as the fact that no other options were available (e.g., Decision 1 in Figure 2c). The failure of these models to adequately account for the data suggests that both the likelihood p(c|u, F) and marginal likelihood p(c|F) are important components of the absolute and relative utility models. Feature-based models We compared the performance of the absolute and relative utility models to our two feature-based models: the weighted feature and ranked feature models. For each participant, 6 (b) Ranked feature 10 10 5 Figure 4: Results of the feature-based model analysis from Experiment 1 for (a) the weighted feature models and (b) the ranked feature models. The histograms show the minimum number of features needed to match the accuracy (measured by MAE) of the absolute utility model for each participant. 15 5 1 2 3 4 5 6 >6 15 1 2 3 4 5 6 7 8 9 10 >10 Number of participants (a) Weighted feature Number of features needed we considered every subset of features1 in Figure 1d in order to determine the minimum number of features needed by the two models to achieve the same level of accuracy as the absolute utility model, as measured by mean absolute error. The results of these analyses are shown in Figure 4. For the majority of participants, at least four features were needed by both models to match the accuracy of the absolute utility model. For the weighted feature model, 14 participants could not be fit as well as the absolute utility model even when all ten features were considered. These results indicate that a feature-based account of people’s inferences in our task must be supplied with a relatively large number of features. By contrast, the inverse decision-making approach provides a relatively parsimonious account of the data. 4 Experiment 2: Negative effects Experiment 2 focused on a setting in which all effects are negative, motivated by the fact that the inverse decision-making models predict several major differences in orderings when effects are negative rather than positive. For instance, the absolute utility model’s relative rankings of the decisions in Figures 1a and 1b are reversed when all effects are negative rather than positive. Method 42 Carnegie Mellon undergraduates participated for course credit. The experimental design was identical to Experiment 1 except that participants were told that the effects were electric shocks at different body locations. They were asked to sort the cards on the basis of how strongly each decision suggested that the decision maker finds shocks at the target location relatively tolerable. The model predictions were derived in the same way as for Experiment 1, but with a prior distribution on utilities that placed nearly all probability mass below zero (µ = −4, σ = 2) to reflect the fact that effects were all negative. 4.1 Results Three participants were excluded as outliers by the same criterion applied in Experiment 1. The resulting mean rankings are compared with the corresponding rankings from Experiment 1 in Figure 2b. The figure shows that responses based on positive and negative effects were substantially different in a number of cases. Figure 3 shows how the mean rankings compare to the predictions of the four models we considered. Although the relative utility model is fairly accurate, no model achieves the same level of accuracy as the absolute and relative utility models in Experiment 1. In addition, the relative utility model provides a poor account of the responses of many individual participants. To better understand responses at the individual level, we repeated the hierarchical clustering analysis described in Experiment 1, which revealed that 29 participants could be grouped into one of four clusters, with the remaining participants each in their own clusters. We analyzed these four clusters independently, excluding the 10 participants that could not be naturally grouped. We compared the mean rankings of each cluster to the absolute and relative utility models, as well as all one- and two-feature weighted feature and ranked feature models. Figure 5 shows that the mean rankings of participants in Cluster 1 (N = 8) were best fit by the absolute utility model, the mean rankings of participants in Cluster 2 (N = 12) were best fit by the relative utility model, and the mean rankings of participants in Clusters 3 (N = 3) and 4 (N = 6) were better fit by feature-based models than by either the absolute or relative utility models. 1 A maximum of six features was considered for the ranked feature model because considering more features was computationally intractable. 7 Cluster 4 N =6 MAE = 4.9 MAE = 14.0 MAE = 7.9 MAE = 5.3 MAE = 2.6 MAE = 13.0 MAE = 6.2 Human rankings Relative utility Cluster 3 N =3 MAE = 2.6 Absolute utility Cluster 2 N = 12 Human rankings Cluster 1 N =8 Factors: 1,3 Factors: 1,8 MAE = 2.3 MAE = 5.2 Model rankings Best−fitting weighted feature Factors: 6,7 MAE = 4.0 Model rankings Model rankings Model rankings Human rankings Factors: 3,8 MAE = 4.8 Figure 5: Comparison between human rankings for four clusters of participants identified in Experiment 2 and predicted rankings from three models. Each point in the plots corresponds to one decision and the solid diagonal lines indicate a perfect correspondence between human and model rankings. The third row shows the predictions of the best-fitting two-factor weighted feature model for each cluster. The two factors listed refer to Figure 1d. To examine how well the models accounted for individuals’ rankings within each cluster, we compared the predictions of the inverse decision-making models to the best-fitting two-factor featurebased model for each participant. In Cluster 1, 7 out of 8 participants were best fit by the absolute utility model; in Cluster 2, 8 out of 12 participants were best fit by the relative utility model; in Clusters 3 and 4, all participants were better fit by feature-based models. No single feature-based model provided the best fit for more than two participants, suggesting that participants not fit well by the inverse decision-making models were not using a single alternative strategy. Applying the feature-based model analysis from Experiment 1 to the current results revealed that the weighted feature model required an average of 6.0 features to match the performance of the absolute utility model for participants in Cluster 1, and an average of 3.9 features to match the performance of the relative utility model for participants in Cluster 2. Thus, although a single model did not fit all participants well in the current experiment, many participants were fit well by one of the two inverse decision-making models, suggesting that this general approach is useful for explaining how people reason about negative effects as well as positive effects. 5 Conclusion In two experiments, we found that an inverse decision-making approach offered a good computational account of how people make judgments about others’ preferences. Although this approach is conceptually simple, our analyses indicated that it captures the influence of a fairly large number of relevant decision features. Indeed, the feature-based models that we considered as potential process models of preference learning could only match the performance of the inverse decision-making approach when supplied with a relatively large number of features. We feel that this result rules out the feature-based approach as psychologically implausible, meaning that alternative process-level accounts will need to be explored. One possibility is sampling, which has been proposed as a psychological mechanism for approximating probabilistic inferences [19, 20]. However, even if process models that use large numbers of features are considered plausible, the inverse decision-making approach provides a valuable computational-level account that helps to explain which decision features are informative. Acknowledgments This work was supported in part by the Pittsburgh Life Sciences Greenhouse Opportunity Fund and by NSF grant CDI-0835797. 8 References [1] D. McFadden. Conditional logit analysis of qualitative choice behavior. In P. Zarembka, editor, Frontiers in Econometrics. Amademic Press, New York, 1973. [2] C. G. Lucas, T. L. Griffiths, F. Xu, and C. Fawcett. A rational model of preference learning and choice prediction by children. In Proceedings of Neural Information Processing Systems 21, 2009. [3] L. Bergen, O. R. Evans, and J. B. Tenenbaum. Learning structured preferences. In Proceedings of the 32nd Annual Conference of the Cognitive Science Society, 2010. [4] A. Jern and C. Kemp. Decision factors that support preference learning. In Proceedings of the 33rd Annual Conference of the Cognitive Science Society, 2011. [5] T. Kushnir, F. Xu, and H. M. Wellman. Young children use statistical sampling to infer the preferences of other people. Psychological Science, 21(8):1134–1140, 2010. [6] L. Ma and F. Xu. Young children’s use of statistical sampling evidence to infer the subjectivity of preferences. Cognition, in press. [7] M. J. Doherty. Theory of Mind: How Children Understand Others’ Thoughts and Feelings. Psychology Press, New York, 2009. [8] R. N. Shepard, C. I. Hovland, and H. M. Jenkins. Learning and memorization of classifications. Psychological Monographs, 75, Whole No. 517, 1961. [9] D. N. Osherson, E. E. Smith, O. Wilkie, A. L´ pez, and E. Shafir. Category-based induction. Psychological o Review, 97(2):185–200, 1990. [10] E. A. Wasserman, S. M. Elek, D. L. Chatlosh, and A. G. Baker. Rating causal relations: Role of probability in judgments of response-outcome contingency. Journal of Experimental Psychology: Learning, Memory, and Cognition, 19(1):174–188, 1993. [11] R. D. Luce. Individual choice behavior. John Wiley, 1959. [12] D. Ariely, G. Loewenstein, and D. Prelec. Tom Sawyer and the construction of value. Journal of Economic Behavior & Organization, 60:1–10, 2006. [13] D. Kahneman and A. Tversky. Subjective probability: A judgment of representativeness. Cognitive Psychology, 3(3):430–454, 1972. [14] D. Newtson. Dispositional inference from effects of actions: Effects chosen and effects forgone. Journal of Experimental Social Psychology, 10:489–496, 1974. [15] P. C. Fishburn. Lexicographic orders, utilities and decision rules: A survey. Management Science, 20(11):1442–1471, 1974. [16] G. Gigerenzer and P. M. Todd. Fast and frugal heuristics: The adaptive toolbox. Oxford University Press, New York, 1999. [17] A. Prince and P. Smolensky. Optimality Theory: Constraint Interaction in Generative Grammar. WileyBlackwell, 2004. [18] D. Marr. Vision. W. H. Freeman, San Francisco, 1982. [19] A. N. Sanborn, T. L. Griffiths, and D. J. Navarro. Rational approximations to rational models: Alternative algorithms for category learning. Psychological Review, 117:1144–1167, 2010. [20] L. Shi and T. L. Griffiths. Neural implementation of Bayesian inference by importance sampling. In Proceedings of Neural Information Processing Systems 22, 2009. 9
same-paper 3 0.68684947 10 nips-2011-A Non-Parametric Approach to Dynamic Programming
Author: Oliver B. Kroemer, Jan R. Peters
Abstract: In this paper, we consider the problem of policy evaluation for continuousstate systems. We present a non-parametric approach to policy evaluation, which uses kernel density estimation to represent the system. The true form of the value function for this model can be determined, and can be computed using Galerkin’s method. Furthermore, we also present a unified view of several well-known policy evaluation methods. In particular, we show that the same Galerkin method can be used to derive Least-Squares Temporal Difference learning, Kernelized Temporal Difference learning, and a discrete-state Dynamic Programming solution, as well as our proposed method. In a numerical evaluation of these algorithms, the proposed approach performed better than the other methods. 1
4 0.5624823 66 nips-2011-Crowdclustering
Author: Ryan G. Gomes, Peter Welinder, Andreas Krause, Pietro Perona
Abstract: Is it possible to crowdsource categorization? Amongst the challenges: (a) each worker has only a partial view of the data, (b) different workers may have different clustering criteria and may produce different numbers of categories, (c) the underlying category structure may be hierarchical. We propose a Bayesian model of how workers may approach clustering and show how one may infer clusters / categories, as well as worker parameters, using this model. Our experiments, carried out on large collections of images, suggest that Bayesian crowdclustering works well and may be superior to single-expert annotations. 1
5 0.5540269 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning
Author: George Konidaris, Scott Niekum, Philip S. Thomas
Abstract: We show that the λ-return target used in the TD(λ) family of algorithms is the maximum likelihood estimator for a specific model of how the variance of an nstep return estimate increases with n. We introduce the γ-return estimator, an alternative target based on a more accurate model of variance, which defines the TDγ family of complex-backup temporal difference learning algorithms. We derive TDγ , the γ-return equivalent of the original TD(λ) algorithm, which eliminates the λ parameter but can only perform updates at the end of an episode and requires time and space proportional to the episode length. We then derive a second algorithm, TDγ (C), with a capacity parameter C. TDγ (C) requires C times more time and memory than TD(λ) and is incremental and online. We show that TDγ outperforms TD(λ) for any setting of λ on 4 out of 5 benchmark domains, and that TDγ (C) performs as well as or better than TDγ for intermediate settings of C. 1
6 0.55357593 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
7 0.54783142 180 nips-2011-Multiple Instance Filtering
8 0.54670602 241 nips-2011-Scalable Training of Mixture Models via Coresets
9 0.54488671 229 nips-2011-Query-Aware MCMC
10 0.54422605 75 nips-2011-Dynamical segmentation of single trials from population neural data
11 0.54373169 102 nips-2011-Generalised Coupled Tensor Factorisation
12 0.54183817 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
13 0.54123533 249 nips-2011-Sequence learning with hidden units in spiking neural networks
14 0.53832293 301 nips-2011-Variational Gaussian Process Dynamical Systems
15 0.53788561 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search
16 0.53760999 258 nips-2011-Sparse Bayesian Multi-Task Learning
17 0.53736383 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning
18 0.53570461 219 nips-2011-Predicting response time and error rates in visual search
19 0.53464156 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
20 0.53422189 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models