nips nips2012 nips2012-364 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Tsuyoshi Ueno, Kohei Hayashi, Takashi Washio, Yoshinobu Kawahara
Abstract: Reinforcement learning (RL) methods based on direct policy search (DPS) have been actively discussed to achieve an efficient approach to complicated Markov decision processes (MDPs). Although they have brought much progress in practical applications of RL, there still remains an unsolved problem in DPS related to model selection for the policy. In this paper, we propose a novel DPS method, weighted likelihood policy search (WLPS), where a policy is efficiently learned through the weighted likelihood estimation. WLPS naturally connects DPS to the statistical inference problem and thus various sophisticated techniques in statistics can be applied to DPS problems directly. Hence, by following the idea of the information criterion, we develop a new measurement for model comparison in DPS based on the weighted log-likelihood.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Reinforcement learning (RL) methods based on direct policy search (DPS) have been actively discussed to achieve an efficient approach to complicated Markov decision processes (MDPs). [sent-15, score-0.393]
2 In this paper, we propose a novel DPS method, weighted likelihood policy search (WLPS), where a policy is efficiently learned through the weighted likelihood estimation. [sent-17, score-1.108]
3 Hence, by following the idea of the information criterion, we develop a new measurement for model comparison in DPS based on the weighted log-likelihood. [sent-19, score-0.168]
4 1 Introduction In the last decade, several direct policy search (DPS) methods have been developed in the field of reinforcement learning (RL) [1, 2, 3, 4, 5, 6, 7, 8, 9] and have been successfully applied to practical decision making applications [5, 7, 9]. [sent-20, score-0.437]
5 Unlike classical approaches [10], DPS characterizes a policy as a parametric model and explores parameters such that the expected reward is maximized in a given model space. [sent-21, score-0.567]
6 Hence, if one employs a model with a reasonable number of DoF (degrees of freedom), DPS could find a reasonable policy efficiently even when the target decision making task has a huge number of DoF. [sent-22, score-0.356]
7 Therefore, the development of an efficient model selection methodology for the policy is crucial in RL research. [sent-23, score-0.431]
8 In this paper, we propose weighted likelihood policy search (WLPS): an efficient iterative policy search algorithm that allows us to select an appropriate model automatically from a set of candidate models. [sent-24, score-0.982]
9 To this end, we first introduce a log-likelihood function weighted by the discounted sum of future rewards as the cost function for DPS. [sent-25, score-0.174]
10 In WLPS, the policy parameters are updated by iteratively maximizing the weighted log-likelihood for the obtained sample sequence. [sent-26, score-0.499]
11 A key property of WLPS is that the maximization of the weighted log-likelihood corresponds to that of the lower bound of the expected reward, and thus WLPS is guaranteed to increase the expected reward monotonically at each iteration. [sent-27, score-0.51]
12 This can be shown to converge to the same solution as the expectation-maximization policy search (EMPS) [1, 4, 9]. [sent-28, score-0.362]
13 One benefit of this approach is that we can directly apply the information criterion scheme [11, 12], which is a familiar theory in statistics, to the weighted log-likelihood. [sent-30, score-0.211]
14 This enables us to construct a model selection strategy for the policy by comparing the information criterion based on the weighted log-likelihood. [sent-31, score-0.659]
15 We prove that each update to the policy parameters resulting from the maximization of the weighted log-likelihood has consistency and asymptotic normality, which have not yet been elucidated in DPS, and converges to the same solution as EMPS. [sent-35, score-0.631]
16 We introduce a prior distribution on the policy parameter, and analyze the asymptotic behavior of the marginal weighted likelihood given by marginalizing out the policy parameter. [sent-37, score-0.942]
17 We then propose a measure of the goodness of the policy model based on the posterior probability of the model, in a similar way to the Bayesian information criterion [12]. [sent-38, score-0.485]
18 In addition, we construct the model selection strategy for the policy (Section 4). [sent-42, score-0.448]
19 Although these studies allow us to select a good model for a state-value function with theoretical support, they cannot be applied to model selection for DPS directly. [sent-47, score-0.137]
20 [15] developed a model selection strategy for DPS by reusing previously observed sample sequences through importance-weighted cross-validation (IWCV). [sent-48, score-0.305]
21 Let πθ(u|x) := p(u|x, θ) be the stochastic parametrized policy followed by the agent, where θ ∈ Θ, with Θ ⊆ R^m, is an m-dimensional adjustable parameter vector. [sent-53, score-0.334]
22 The general goal of DPS is to find an optimal policy parameter θ* ∈ Θ that maximizes the expected reward defined by η(θ) := lim_{n→∞} ∫∫ p_θ(x_{2:n}, u_{1:n}|x_1) R_n dx_{2:n} du_{1:n}, (2) where R_n := R_n(x_{1:n}, u_{1:n}) = (1/n) Σ_{i=1}^n r(x_i, u_i). [sent-64, score-0.673]
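As a concrete reading of objective (2), the following sketch estimates η(θ) by averaging the empirical mean reward R_n over a few finite-horizon rollouts. This is our own illustration rather than anything from the paper; sample_action, step, and reward are hypothetical interfaces standing in for the policy πθ, the transition p(x′|x, u), and the reward function r(x, u).

```python
import numpy as np

def estimate_expected_reward(sample_action, step, reward, x0, n=1000, n_rollouts=50):
    """Monte Carlo estimate of the expected reward eta(theta) in Eq. (2).

    sample_action(x) -> u  : draws u ~ pi_theta(u | x)   (hypothetical interface)
    step(x, u)       -> x' : draws x' ~ p(x' | x, u)     (hypothetical interface)
    reward(x, u)     -> r  : immediate reward r(x, u)
    """
    returns = []
    for _ in range(n_rollouts):
        x, total = x0, 0.0
        for _ in range(n):
            u = sample_action(x)
            total += reward(x, u)
            x = step(x, u)
        returns.append(total / n)      # R_n = (1/n) * sum_i r(x_i, u_i)
    return float(np.mean(returns))     # approximates eta(theta) when n is large
```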
23 [1, 4, 9] The following inequality holds for any distribution q(x_{2:n}, u_{1:n}|x_1): ln η_n(θ) ≥ F_n(q, θ) := ∫∫ q(x_{2:n}, u_{1:n}|x_1) ln{ p_θ(x_{2:n}, u_{1:n}|x_1) R_n / q(x_{2:n}, u_{1:n}|x_1) } dx_{2:n} du_{1:n}, ∀n, (3) where η_n(θ) = ∫∫ p_θ(x_{2:n}, u_{1:n}|x_1) R_n dx_{2:n} du_{1:n}. [sent-71, score-0.13]
24 This EMPS cycle is guaranteed to increase the value of the expected reward unless the parameters already correspond to a local maximum [1, 4, 9]. [sent-77, score-0.177]
25 Although many variants of model-free EMPS algorithms [4, 6, 9, 15] have hitherto been developed, their fundamental theoretical properties such as consistency and asymptotic normality at each iteration have not yet been elucidated. [sent-86, score-0.142]
26 3 Proposed framework: WLPS In this section, we introduce a weighted likelihood as the objective function for DPS (Definition 1), and derive the WLPS algorithm, executed by iterating two steps: evaluation and maximization of the weighted log-likelihood function (Algorithm 1). [sent-88, score-0.39]
27 In Section 3.2, we show that WLPS is guaranteed to increase the expected reward at each iteration, and to converge to the same solution as EMPS (Theorem 1). [sent-90, score-0.162]
28 3.1 Overview of WLPS In this study, we consider the following weighted likelihood function. [sent-92, score-0.206]
29 The weighted log-likelihood (6) weights each log-policy term ln πθ(u_i|x_i) by the discounted sum of future rewards Q_i^β := Σ_{j=i}^n β^{j−i} r(x_j, u_j), together with the state transition terms ln p(x_i|x_{i−1}, u_{i−1}), where β is a discount factor. Now, let us consider the maximization of the weighted log-likelihood function (6). [sent-96, score-0.328]
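To make the weights concrete, here is a minimal sketch (ours, not the paper's code) that computes the reward-to-go weights Q_i^β = Σ_{j≥i} β^{j−i} r(x_j, u_j) for one sample sequence with a single backward pass.

```python
import numpy as np

def discounted_future_rewards(rewards, beta):
    """Return Q_i^beta = sum_{j >= i} beta**(j - i) * r_j for each step i.

    rewards : immediate rewards r(x_i, u_i) along one sample sequence
    beta    : discount factor in [0, 1)
    """
    rewards = np.asarray(rewards, dtype=float)
    q = np.zeros(len(rewards))
    acc = 0.0
    for i in range(len(rewards) - 1, -1, -1):
        acc = rewards[i] + beta * acc   # backward recursion Q_i = r_i + beta * Q_{i+1}
        q[i] = acc
    return q
```

For example, discounted_future_rewards([1.0, 0.0, 2.0], 0.9) returns [2.62, 1.8, 2.0].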
30 The estimator θ̂_n is given as the solution of estimating equation (7), Σ_{i=1}^n Q_i^β ψ_θ(x_i, u_i) = 0, where ψ_θ(x, u) := ∂_θ ln πθ(u|x). Note that when policy πθ is modeled by an exponential family, estimating equation (7) can easily be solved analytically or numerically using convex optimization techniques. [sent-98, score-0.442]
31 In WLPS, the update of the policy parameter is performed by evaluating estimating equation (7) and finding the estimator θ̂_n iteratively from this equation. [sent-99, score-0.528]
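As an illustration of this step, suppose the policy is Gaussian with a linear-in-features mean, πθ(u|x) = N(u | θ^⊤φ(x), σ²), an exponential-family case. Then ψ_θ(x, u) ∝ φ(x)(u − θ^⊤φ(x)) and estimating equation (7) reduces to a weighted least-squares problem. The sketch below solves that special case; it is a hedged example of ours, not the paper's implementation, and assumes the weights Q_i^β are nonnegative (as when r ≥ 0).

```python
import numpy as np

def solve_estimating_equation_gaussian(Phi, U, Q, ridge=1e-8):
    """Solve Eq. (7) for a Gaussian policy with linear mean theta^T phi(x).

    Phi : (n, m) matrix whose rows are the features phi(x_i)
    U   : (n,)   observed controls u_i
    Q   : (n,)   weights Q_i^beta

    Setting sum_i Q_i * phi(x_i) * (u_i - theta^T phi(x_i)) = 0 gives the
    weighted least-squares solution below; a small ridge keeps it well posed.
    """
    Phi = np.asarray(Phi, dtype=float)
    U = np.asarray(U, dtype=float)
    Q = np.asarray(Q, dtype=float)
    WPhi = Q[:, None] * Phi
    A = Phi.T @ WPhi + ridge * np.eye(Phi.shape[1])
    b = WPhi.T @ U
    return np.linalg.solve(A, b)       # updated policy parameter theta_hat_n
```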
32 Generate a sample sequence {x1:n , u1:n } by employing the current policy parameter θ, and evaluate estimating equation (7). [sent-103, score-0.508]
33 Find a new estimator by solving estimating equation (7) and check for convergence. [sent-105, score-0.175]
34 It should be noted that WLPS is guaranteed to monotonically increase the expected reward η(θ), and to converge asymptotically under certain conditions to the same solution as EMPS, given by Eq. [sent-107, score-0.181]
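Combining the two helpers sketched above (discounted_future_rewards and solve_estimating_equation_gaussian), a minimal version of the Algorithm 1 loop could look as follows. The rollout and features arguments are hypothetical interfaces, and the update step assumes the Gaussian-policy special case; this is a sketch, not the authors' implementation.

```python
import numpy as np

def wlps(theta0, rollout, features, beta=0.95, n=2000, max_iter=50, tol=1e-4):
    """Sketch of the WLPS iteration (Algorithm 1).

    rollout(theta, n) -> (X, U, R): states, controls, and rewards sampled with
    the current policy parameter theta (hypothetical interface).
    features(x)       -> phi(x)   : feature vector used by the Gaussian policy mean.
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        X, U, R = rollout(theta, n)                    # step 1: sample with current policy
        Q = discounted_future_rewards(R, beta)         # weights Q_i^beta
        Phi = np.stack([features(x) for x in X])
        theta_new = solve_estimating_equation_gaussian(Phi, U, Q)  # step 2: solve Eq. (7)
        if np.linalg.norm(theta_new - theta) < tol:    # convergence check
            return theta_new
        theta = theta_new
    return theta
```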
35 3.2 Convergence of WLPS To begin with, we show consistency and asymptotic normality of the estimator θ̂_n given by Eq. [sent-111, score-0.208]
36 For any θ ∈ Θ, there exists a parameter value θ̄ ∈ Θ such that E^{πθ}_{x_1∼μθ}[ ψ_{θ̄}(x_1, u_1) Σ_{j=1}^∞ β^{j−1} r(x_j, u_j) ] = 0, (8) where E^{πθ}_{x_1∼μθ}[·] denotes the expectation over {x_{2:∞}, u_{1:∞}} with respect to the distribution lim_{n→∞} μθ(x_1) πθ(u_1|x_1) Π_{i=2}^n p(x_i|x_{i−1}, u_{i−1}) πθ(u_i|x_i). [sent-128, score-0.273]
37 For any θ ∈ Θ, the matrix A := A(θ) = E^{πθ}_{x_1∼μθ}[ K_{θ̄}(x_1, u_1) Σ_{j=1}^∞ β^{j−1} r(x_j, u_j) ] is invertible, where K_θ(x, u) := ∂_θ ψ_θ(x, u) = ∂²/(∂θ∂θ^⊤) ln πθ(u|x). [sent-132, score-0.144]
38 Under the conditions given in Assumptions 1-7, the estimator θ̂_n converges to θ̄ in probability, as shown in the following lemma. [sent-133, score-0.125]
39 If Assumptions 1-7 are satisfied, then the estimator θ̂_n given by estimating equation (7) converges to the parameter θ̄ in probability. [sent-136, score-0.144]
40 The proof is given in Section 2 of the supporting material. [sent-139, score-0.125]
41 Note that if the policy is characterized as an exponential family, we can replace Assumption 7 with Assumption 8 to prove the result in Lemma 3. [sent-140, score-0.334]
42 Next, we show the asymptotic convergence rate of the estimator given a consistent estimator. [sent-141, score-0.13]
43 If the estimator θ̂_n given by estimating equation (7) converges to θ̄ in probability, then we have √n(θ̂_n − θ̄) = −(1/√n) A^{−1} Σ_{i=1}^n Σ_{j=i}^n β^{j−i} ψ_{θ̄}(x_i, u_i) r(x_j, u_j) + o_p(1). [sent-145, score-0.433]
44 Consequently, √n(θ̂_n − θ̄) converges in distribution to a Gaussian distribution with covariance A^{−1} Σ (A^{−1})^⊤. Here, Γ_i(θ̄) := E^{πθ′}_{x_1∼μθ′}[ (Σ_{j=1}^∞ β^{j−1} r(x_j, u_j)) (Σ_{j′=1+i}^∞ β^{j′−1} r(x_{j′}, u_{j′})) ψ_{θ̄}(x_1, u_1) ψ_{θ̄}(x_i, u_i)^⊤ ]. [sent-148, score-0.312]
45 Then, the partial derivative of the lower bound with q*_{θ′} satisfies lim_{n→∞} (∂/∂θ) F_n(q*_{θ′}, θ) = lim_{β→1−} E^{πθ′}_{x_1∼μθ′}[ ψ_θ(x_1, u_1) Σ_{j=1}^∞ β^{j−1} r(x_j, u_j) ], where β → 1− denotes that β converges to 1 from below. [sent-161, score-0.211]
46 In the next section, we discuss model selection for policy πθ (u|x). [sent-170, score-0.431]
47 4 Model selection with WLPS Common model selection strategies are carried out by comparing candidate models, which are specified in advance, based on a criterion that evaluates the goodness of fit of the model estimated from the obtained samples. [sent-171, score-0.307]
48 Since the motivation for RL is to maximize the expected reward given in (2), it would be natural to seek an appropriate model for the policy through the computation of some reasonable measure to evaluate the expected reward from the sample sequences. [sent-172, score-0.665]
49 However, since different policy models give different generative models for sample sequences, we need to obtain new sample sequences to evaluate the measure each time the model is changed. [sent-173, score-0.42]
50 Therefore, employing a strategy of model selection based directly on the expected reward would be hopelessly inefficient. [sent-174, score-0.281]
51 If we can analyze the finite sample behavior of the expected reward with the WLPS estimator, we may obtain a better estimator by finding an optimal β in the sense of the maximization of the expected reward. [sent-176, score-0.314]
52 Instead, to develop a tractable model selection strategy, we focus on the weighted likelihood given by Eq. [sent-179, score-0.228]
53 As mentioned before, the policy that maximizes the weighted log-likelihood asymptotically attains the maximum of the lower bound of the expected reward. [sent-181, score-0.655]
54 Moreover, since the weighted likelihood is defined under a certain fixed generative process for the sample sequences, unlike the expected reward case, the weighted likelihood can be evaluated on the same sample sequences even when the model is changed. [sent-182, score-0.643]
55 In this study, we develop a criterion for choosing a suitable model by analogy with the Bayesian information criterion (BIC) [12], which is designed through asymptotic analysis of the posterior probability of the models given the data. [sent-184, score-0.222]
56 Let M1 , M2 , · · · , Mk be k candidate policy models, and assume that each model Mj is characterized by a parametric policy πθj (u|x) and the prior distribution p(θj |Mj ) of the policy parameter. [sent-185, score-1.098]
57 Also, define the marginal weighted likelihood of the j-th candidate model p̂_{θ′,j}(x_{2:n}, u_{1:n}|x_1) as p̂_{θ′,j}(x_{2:n}, u_{1:n}|x_1) := ∫ πθ_j(u_1|x_1)^{Q_1^β} Π_{i=2}^n πθ_j(u_i|x_i)^{Q_i^β} p(x_i|x_{i−1}, u_{i−1}) p(θ_j|M_j) dθ_j. (10) [sent-186, score-0.28]
58 The posterior probability of model M_j is given by p(M_j|x_{1:n}, u_{1:n}) := p̂_{θ′,j}(x_{2:n}, u_{1:n}|x_1) p(M_j) / Σ_{j′=1}^k p̂_{θ′,j′}(x_{2:n}, u_{1:n}|x_1) p(M_{j′}), (11) and in our model selection strategy, we adopt the model with the largest posterior probability. [sent-189, score-0.143]
59 Assuming that the prior probability is uniform over all models, the model with the maximum posterior probability corresponds to the one with the maximum marginal weighted likelihood. [sent-191, score-0.25]
60 The behavior of the marginal weighted likelihood can be evaluated when the integrand of marginal weighted likelihood (10) is concentrated in a neighborhood of the weighted log-likelihood estimator given by estimating equation (7), as described in the following theorem. [sent-192, score-0.777]
61 If the required assumptions are satisfied, the log marginal weighted likelihood can be calculated as ln p̂_{θ′}(x_{2:n}, u_{1:n}|x_1) = L^{θ′}(θ̂_n) − (m/2) ln n + O_p(1), where, recall, m denotes the dimensionality of the model (policy parameter). [sent-201, score-0.38]
62 Therefore, when evaluating the posterior probability of the model, it is sufficient to compute the following model selection criterion: IC = Σ_{i=1}^n Q_i^β ln π_{θ̂_n}(u_i|x_i) − (m/2) ln n. (12) [sent-204, score-0.251]
63 As can be seen, this model selection criterion consists of two terms, where the first term is the weighted log-likelihood of the policy and the second is a penalty term that penalizes highly complex models. [sent-205, score-0.642]
64 Also, since the first term grows faster than the second, this criterion asymptotically selects the model with the maximum weighted log-likelihood. [sent-206, score-0.248]
65 Generate a sample sequence {x1:n , u1:n } by employing the current policy parameter θ. [sent-210, score-0.417]
66 For all models, find the estimator θ̂_n by solving estimating equation (7) and evaluate model selection criterion (12). [sent-212, score-0.337]
67 Choose the best model based on model selection criterion (12) and check for convergence. [sent-214, score-0.184]
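A hedged sketch of this model selection step is given below: criterion (12) is computed for each candidate policy model on the same sample sequence, and the model with the largest value is kept. The candidate interface (.fit, .log_pi, .m) is hypothetical and only meant to make the loop explicit; the weights Q are the Q_i^β computed from the sampled rewards as in the earlier sketch.

```python
import numpy as np

def information_criterion(log_pi, Q, m, n):
    """Criterion (12): weighted log-likelihood minus the complexity penalty (m/2) ln n.

    log_pi : (n,) values of ln pi_{theta_hat}(u_i | x_i) at the fitted parameter
    Q      : (n,) weights Q_i^beta
    m      : number of parameters of the candidate policy model
    n      : length of the sample sequence
    """
    return float(np.dot(Q, log_pi) - 0.5 * m * np.log(n))

def select_model(candidates, X, U, Q):
    """Pick the candidate policy model with the largest criterion (12).

    candidates : list of objects with fit(X, U, Q) -> theta_hat,
                 log_pi(theta_hat, X, U) -> per-step log-likelihoods, and m
                 (hypothetical interface, for illustration only).
    """
    scores = []
    for model in candidates:
        theta_hat = model.fit(X, U, Q)                              # solve Eq. (7) for this model
        ic = information_criterion(model.log_pi(theta_hat, X, U), Q, model.m, len(U))
        scores.append(ic)
    return int(np.argmax(scores))                                   # index of the selected model
```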
68 In this problem, we characterized the state transition distribution p(x_i|x_{i−1}, u_{i−1}) as a Gaussian distribution N(x_i | x̄_i, σ) with mean x̄_i = x_{i−1} + u_{i−1} and variance σ = 0. [sent-218, score-0.148]
69 The reward function was set to a quadratic function r(x_i, u_i) = −x_i² − u_i² + c, where c is a positive scalar value preventing the reward r(x, u) from being negative. [sent-220, score-0.388]
70 The control signal u_i was generated from a Gaussian distribution N(u_i | ū_i, σ′) with mean ū_i and variance σ′ = 0. [sent-221, score-0.368]
71 We used a linear model with polynomial basis functions defined as ū_i = Σ_{j=1}^k θ_j x_i^j + θ_0, where k is the order of the polynomial. [sent-223, score-0.215]
72 That is, the optimal policy can be obtained when the order of the polynomial is selected as k = 1. [sent-226, score-0.334]
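To make the benchmark concrete, here is one possible simulation of the setup just described. It is a sketch under stated assumptions: the noise scales, the offset c, and the initial-state distribution are placeholders we chose ourselves, since the exact values are truncated in the text above.

```python
import numpy as np

def lqg_rollout(theta, n, sigma=0.1, sigma_u=0.1, c=10.0, rng=None):
    """Simulate the linear-Gaussian control task with a polynomial Gaussian policy.

    theta : [theta_0, theta_1, ..., theta_k] polynomial coefficients, so that the
            mean control is u_bar_i = theta_0 + sum_j theta_j * x_i**j.
    Returns the state, control, and reward sequences of length n.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    k = len(theta) - 1
    x = rng.normal()                                  # placeholder initial-state draw
    X, U, R = [], [], []
    for _ in range(n):
        u_bar = theta[0] + sum(theta[j] * x**j for j in range(1, k + 1))
        u = rng.normal(u_bar, sigma_u)                # u_i ~ N(u_bar_i, sigma')
        r = -x**2 - u**2 + c                          # quadratic reward, offset by c
        X.append(x); U.append(u); R.append(r)
        x = rng.normal(x + u, sigma)                  # x_{i+1} ~ N(x_i + u_i, sigma)
    return np.array(X), np.array(U), np.array(R)
```

Running such rollouts for candidate orders k = 1, ..., 4 and scoring each with criterion (12) is how one would reproduce, in spirit, the comparison discussed next.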
73 In this experiment, we validated whether the proposed model selection method can detect the true order of the polynomial. [sent-227, score-0.194]
74 To illustrate how our proposed model selection criterion works, we compared the performance of the proposed model selection method with a naïve method based on the weighted log-likelihood (6). [sent-228, score-0.405]
75 On the other hand, in the weighted log-likelihood method, the distribution of the orders converged to one with two peaks, at k = 1 and k = 4. [sent-237, score-0.146]
76 This result seems to show that the penalty term in our model selection criterion worked well. [sent-238, score-0.162]
77 5 Discussion In this study, we have discussed a DPS problem in the framework of weighted likelihood estimation. [sent-239, score-0.206]
78 We introduced a weighted likelihood function as the objective function of DPS, and proposed an incremental algorithm, WLPS, based on the iteration of maximum weighted log-likelihood estimation. [sent-240, score-0.367]
79 WLPS shows desirable theoretical properties, namely, consistency, asymptotic normality, and a monotonic increase in the expected reward at each iteration. [sent-241, score-0.27]
80 Furthermore, we have constructed a model selection strategy based on the posterior probability of the model given a sample sequence through asymptotic analysis of the marginal weighted likelihood. [sent-242, score-0.416]
81 Let us specify the state transition distribution p(x′ |x, u) as a parametric model pκ (x′ |x, u) := p(x′ |x, u, κ), where κ is an m′ -dimensional parameter vector. [sent-250, score-0.144]
82 As can be seen, estimating equation (13) corresponds to the likelihood equation, i. [sent-252, score-0.151]
83 This observation indicates that the weighted likelihood integrates two different objective functions: one for learning the policy πθ(u|x), and the other for the state predictor, pκ(x′|x, u). [sent-255, score-0.569]
84 Model-based WLPS fully specifies the weighted likelihood by using the parametric policy and parametric state transition models, and estimates all the parameters that appear in the parametric weighted likelihood. [sent-261, score-0.877]
85 Meanwhile, model-free WLPS partially specifies the weighted likelihood by only using the parametric policy model. [sent-263, score-0.584]
86 This can be seen as a semiparametric statistical model [22, 23], which includes not only parameters of interest, but also additional nuisance parameters with possibly infinite DoF, where only the policy is modeled parametrically and the other unspecified part corresponds to the nuisance parameters. [sent-264, score-0.486]
87 Hence, the difference between model-based and model-free WLPS methods can be interpreted as the difference between parametric and semiparametric statistical inference. [sent-266, score-0.132]
88 The theoretical aspects of both parametric and semiparametric inference have been actively investigated and several approaches for combining their estimators have been proposed [23, 24, 25]. [sent-267, score-0.192]
89 5.2 Variance reduction technique for WLPS In order to perform fast learning of the policy, it is necessary to find estimators that can reduce the estimation variance of the policy parameters in DPS. [sent-270, score-0.444]
90 That is, instead of considering the estimation variance of the policy parameters, they reduce the estimation variance of the moments necessary to learn the policy parameter. [sent-273, score-0.794]
91 Unfortunately, these variance reduction techniques do not guarantee a decrease in the estimation variance of the policy parameters, and thus it is desirable to develop a direct approach that can evaluate and reduce the estimation variance of the policy parameters. [sent-274, score-0.892]
92 The estimating function method is a powerful tool for the design of consistent estimators and the evaluation of the estimation variance of parameters in a semiparametric inference problem. [sent-277, score-0.232]
93 The advantage of considering the estimating function is the ability 1) to characterize an entire set of consistent estimators, and 2) to find the optimal estimator with the minimum parameter estimation variance from the set of estimators [23, 29]. [sent-278, score-0.25]
94 Therefore, by applying this to WLPS, we can characterize an entire set of estimators that maximize the expected reward without identifying the state transition distribution, and find the optimal estimator with the minimum estimation variance. [sent-279, score-0.314]
95 Altün, "Relative entropy policy search," in Proceedings of the 24th National Conference on Artificial Intelligence, 2010. [sent-326, score-0.334]
96 Szepesvári, "Model selection in reinforcement learning," Machine Learning, pp. [sent-351, score-0.135]
97 Pineau, “PAC-Bayesian model selection for reinforcement learning,” in Advances in Neural Information Processing Systems 22, 2010. [sent-356, score-0.157]
98 Sugiyama, “Reward-weighted regression with sample reuse for direct policy search in reinforcement learning,” Neural Computation, vol. [sent-360, score-0.456]
99 Munos, “Finite-sample analysis of least-squares policy iteration,” Journal of Machine Learning Research, vol. [sent-395, score-0.334]
100 Sugiyama, “Analysis and improvement of policy gradient estimation,” Neural Networks, 2011. [sent-437, score-0.334]
wordName wordTfidf (topN-words)
[('wlps', 0.753), ('policy', 0.334), ('dps', 0.291), ('emps', 0.184), ('ui', 0.154), ('weighted', 0.146), ('reward', 0.117), ('estimator', 0.084), ('uj', 0.079), ('selection', 0.075), ('semiparametric', 0.068), ('criterion', 0.065), ('ln', 0.065), ('likelihood', 0.06), ('rl', 0.06), ('reinforcement', 0.06), ('estimating', 0.055), ('mj', 0.055), ('xi', 0.052), ('normality', 0.052), ('ueno', 0.046), ('asymptotic', 0.046), ('peters', 0.045), ('fn', 0.045), ('parametric', 0.044), ('converges', 0.041), ('supporting', 0.041), ('xj', 0.039), ('op', 0.039), ('maximization', 0.038), ('mixing', 0.037), ('variance', 0.037), ('equation', 0.036), ('geometrically', 0.032), ('hachiya', 0.031), ('iwcv', 0.031), ('kawanabe', 0.031), ('lqg', 0.031), ('candidate', 0.03), ('transition', 0.03), ('estimators', 0.029), ('state', 0.029), ('expected', 0.028), ('discounted', 0.028), ('satis', 0.028), ('desirable', 0.028), ('rn', 0.028), ('search', 0.028), ('dof', 0.027), ('schaal', 0.027), ('washio', 0.027), ('osaka', 0.027), ('mdp', 0.027), ('derivative', 0.027), ('sequences', 0.026), ('munos', 0.026), ('consistency', 0.026), ('estimation', 0.026), ('kawahara', 0.025), ('posterior', 0.024), ('kappen', 0.024), ('baxter', 0.024), ('sequence', 0.023), ('control', 0.023), ('lemma', 0.023), ('model', 0.022), ('employing', 0.022), ('partial', 0.022), ('suppose', 0.022), ('marginal', 0.022), ('toussaint', 0.021), ('nuisance', 0.021), ('sugiyama', 0.021), ('lim', 0.021), ('uniform', 0.021), ('assumptions', 0.02), ('stationary', 0.02), ('framed', 0.02), ('statistical', 0.02), ('qi', 0.02), ('parameter', 0.019), ('bic', 0.019), ('sample', 0.019), ('monotonically', 0.019), ('clarify', 0.019), ('ergodic', 0.018), ('goodness', 0.018), ('reduction', 0.018), ('theoretical', 0.018), ('bar', 0.017), ('insight', 0.017), ('inference', 0.017), ('strategy', 0.017), ('assumption', 0.017), ('increase', 0.017), ('analytically', 0.017), ('actively', 0.016), ('monotonic', 0.016), ('direct', 0.015), ('maximum', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 364 nips-2012-Weighted Likelihood Policy Search with Model Selection
Author: Tsuyoshi Ueno, Kohei Hayashi, Takashi Washio, Yoshinobu Kawahara
Abstract: Reinforcement learning (RL) methods based on direct policy search (DPS) have been actively discussed to achieve an efficient approach to complicated Markov decision processes (MDPs). Although they have brought much progress in practical applications of RL, there still remains an unsolved problem in DPS related to model selection for the policy. In this paper, we propose a novel DPS method, weighted likelihood policy search (WLPS), where a policy is efficiently learned through the weighted likelihood estimation. WLPS naturally connects DPS to the statistical inference problem and thus various sophisticated techniques in statistics can be applied to DPS problems directly. Hence, by following the idea of the information criterion, we develop a new measurement for model comparison in DPS based on the weighted log-likelihood.
2 0.19827212 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes
Author: Bruno Scherrer, Boris Lesner
Abstract: We consider infinite-horizon stationary γ-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error at each iteration, it is well-known that one 2γ can compute stationary policies that are (1−γ)2 -optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for com2γ puting non-stationary policies that can be up to 1−γ -optimal, which constitutes a significant improvement in the usual situation when γ is close to 1. Surprisingly, this shows that the problem of “computing near-optimal non-stationary policies” is much simpler than that of “computing near-optimal stationary policies”. 1
3 0.19400989 162 nips-2012-Inverse Reinforcement Learning through Structured Classification
Author: Edouard Klein, Matthieu Geist, Bilal Piot, Olivier Pietquin
Abstract: This paper adresses the inverse reinforcement learning (IRL) problem, that is inferring a reward for which a demonstrated expert behavior is optimal. We introduce a new algorithm, SCIRL, whose principle is to use the so-called feature expectation of the expert as the parameterization of the score function of a multiclass classifier. This approach produces a reward function for which the expert policy is provably near-optimal. Contrary to most of existing IRL algorithms, SCIRL does not require solving the direct RL problem. Moreover, with an appropriate heuristic, it can succeed with only trajectories sampled according to the expert behavior. This is illustrated on a car driving simulator. 1
4 0.18514845 38 nips-2012-Algorithms for Learning Markov Field Policies
Author: Abdeslam Boularias, Jan R. Peters, Oliver B. Kroemer
Abstract: We use a graphical model for representing policies in Markov Decision Processes. This new representation can easily incorporate domain knowledge in the form of a state similarity graph that loosely indicates which states are supposed to have similar optimal actions. A bias is then introduced into the policy search process by sampling policies from a distribution that assigns high probabilities to policies that agree with the provided state similarity graph, i.e. smoother policies. This distribution corresponds to a Markov Random Field. We also present forward and inverse reinforcement learning algorithms for learning such policy distributions. We illustrate the advantage of the proposed approach on two problems: cart-balancing with swing-up, and teaching a robot to grasp unknown objects. 1
5 0.17527935 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed
Author: Jiarong Jiang, Adam Teichert, Jason Eisner, Hal Daume
Abstract: Users want inference to be both fast and accurate, but quality often comes at the cost of speed. The field has experimented with approximate inference algorithms that make different speed-accuracy tradeoffs (for particular problems and datasets). We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing [12]. Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is simply too large to explore naively. An attempt to counteract this by applying imitation learning algorithms also fails: the “teacher” follows a far better policy than anything in our learner’s policy space, free of the speed-accuracy tradeoff that arises when oracle information is unavailable, and thus largely insensitive to the known reward functfion. We propose a hybrid reinforcement/apprenticeship learning algorithm that learns to speed up an initial policy, trading off accuracy for speed according to various settings of a speed term in the loss function. 1
6 0.16788839 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries
7 0.15915753 348 nips-2012-Tractable Objectives for Robust Policy Optimization
8 0.1495972 160 nips-2012-Imitation Learning by Coaching
9 0.14514104 89 nips-2012-Coupling Nonparametric Mixtures via Latent Dirichlet Processes
10 0.12819546 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress
11 0.1257523 344 nips-2012-Timely Object Recognition
12 0.12306009 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes
13 0.11083715 245 nips-2012-Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions
14 0.094500683 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search
15 0.094197921 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning
16 0.092668466 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method
17 0.089125939 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning
18 0.087546453 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning
19 0.084917516 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation
20 0.082900167 153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model
topicId topicWeight
[(0, 0.172), (1, -0.31), (2, 0.016), (3, -0.052), (4, -0.058), (5, 0.03), (6, 0.006), (7, 0.003), (8, 0.04), (9, -0.058), (10, -0.004), (11, -0.011), (12, 0.009), (13, -0.007), (14, 0.057), (15, -0.048), (16, 0.088), (17, -0.0), (18, 0.033), (19, 0.002), (20, -0.008), (21, -0.041), (22, 0.103), (23, 0.013), (24, -0.049), (25, 0.042), (26, -0.028), (27, 0.006), (28, 0.036), (29, -0.013), (30, 0.012), (31, 0.004), (32, 0.024), (33, -0.012), (34, 0.048), (35, -0.005), (36, 0.04), (37, -0.015), (38, 0.012), (39, 0.009), (40, 0.006), (41, -0.031), (42, -0.078), (43, 0.005), (44, 0.123), (45, -0.058), (46, -0.07), (47, 0.047), (48, 0.041), (49, 0.04)]
simIndex simValue paperId paperTitle
same-paper 1 0.91462511 364 nips-2012-Weighted Likelihood Policy Search with Model Selection
Author: Tsuyoshi Ueno, Kohei Hayashi, Takashi Washio, Yoshinobu Kawahara
Abstract: Reinforcement learning (RL) methods based on direct policy search (DPS) have been actively discussed to achieve an efficient approach to complicated Markov decision processes (MDPs). Although they have brought much progress in practical applications of RL, there still remains an unsolved problem in DPS related to model selection for the policy. In this paper, we propose a novel DPS method, weighted likelihood policy search (WLPS), where a policy is efficiently learned through the weighted likelihood estimation. WLPS naturally connects DPS to the statistical inference problem and thus various sophisticated techniques in statistics can be applied to DPS problems directly. Hence, by following the idea of the information criterion, we develop a new measurement for model comparison in DPS based on the weighted log-likelihood.
2 0.81725204 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes
Author: Bruno Scherrer, Boris Lesner
Abstract: We consider infinite-horizon stationary γ-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error at each iteration, it is well-known that one 2γ can compute stationary policies that are (1−γ)2 -optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for com2γ puting non-stationary policies that can be up to 1−γ -optimal, which constitutes a significant improvement in the usual situation when γ is close to 1. Surprisingly, this shows that the problem of “computing near-optimal non-stationary policies” is much simpler than that of “computing near-optimal stationary policies”. 1
3 0.79004997 38 nips-2012-Algorithms for Learning Markov Field Policies
Author: Abdeslam Boularias, Jan R. Peters, Oliver B. Kroemer
Abstract: We use a graphical model for representing policies in Markov Decision Processes. This new representation can easily incorporate domain knowledge in the form of a state similarity graph that loosely indicates which states are supposed to have similar optimal actions. A bias is then introduced into the policy search process by sampling policies from a distribution that assigns high probabilities to policies that agree with the provided state similarity graph, i.e. smoother policies. This distribution corresponds to a Markov Random Field. We also present forward and inverse reinforcement learning algorithms for learning such policy distributions. We illustrate the advantage of the proposed approach on two problems: cart-balancing with swing-up, and teaching a robot to grasp unknown objects. 1
4 0.76675838 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries
Author: Aaron Wilson, Alan Fern, Prasad Tadepalli
Abstract: We consider the problem of learning control policies via trajectory preference queries to an expert. In particular, the agent presents an expert with short runs of a pair of policies originating from the same state and the expert indicates which trajectory is preferred. The agent’s goal is to elicit a latent target policy from the expert with as few queries as possible. To tackle this problem we propose a novel Bayesian model of the querying process and introduce two methods that exploit this model to actively select expert queries. Experimental results on four benchmark problems indicate that our model can effectively learn policies from trajectory preference queries and that active query selection can be substantially more efficient than random selection. 1
5 0.76610994 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed
Author: Jiarong Jiang, Adam Teichert, Jason Eisner, Hal Daume
Abstract: Users want inference to be both fast and accurate, but quality often comes at the cost of speed. The field has experimented with approximate inference algorithms that make different speed-accuracy tradeoffs (for particular problems and datasets). We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing [12]. Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is simply too large to explore naively. An attempt to counteract this by applying imitation learning algorithms also fails: the “teacher” follows a far better policy than anything in our learner’s policy space, free of the speed-accuracy tradeoff that arises when oracle information is unavailable, and thus largely insensitive to the known reward functfion. We propose a hybrid reinforcement/apprenticeship learning algorithm that learns to speed up an initial policy, trading off accuracy for speed according to various settings of a speed term in the loss function. 1
6 0.75152934 162 nips-2012-Inverse Reinforcement Learning through Structured Classification
7 0.7267893 160 nips-2012-Imitation Learning by Coaching
8 0.67019045 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress
9 0.65706426 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes
10 0.6500088 350 nips-2012-Trajectory-Based Short-Sighted Probabilistic Planning
11 0.63450861 348 nips-2012-Tractable Objectives for Robust Policy Optimization
12 0.61242151 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method
13 0.60629779 245 nips-2012-Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions
14 0.58941787 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning
15 0.57696664 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning
16 0.56464052 51 nips-2012-Bayesian Hierarchical Reinforcement Learning
17 0.55093497 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search
18 0.5099237 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning
19 0.50854796 358 nips-2012-Value Pursuit Iteration
20 0.49844164 296 nips-2012-Risk Aversion in Markov Decision Processes via Near Optimal Chernoff Bounds
topicId topicWeight
[(0, 0.019), (21, 0.016), (38, 0.129), (39, 0.013), (42, 0.028), (53, 0.021), (54, 0.106), (55, 0.028), (74, 0.044), (76, 0.143), (80, 0.099), (92, 0.044), (94, 0.192)]
simIndex simValue paperId paperTitle
1 0.95129043 362 nips-2012-Waveform Driven Plasticity in BiFeO3 Memristive Devices: Model and Implementation
Author: Christian Mayr, Paul Stärke, Johannes Partzsch, Love Cederstroem, Rene Schüffny, Yao Shuai, Nan Du, Heidemarie Schmidt
Abstract: Memristive devices have recently been proposed as efficient implementations of plastic synapses in neuromorphic systems. The plasticity in these memristive devices, i.e. their resistance change, is defined by the applied waveforms. This behavior resembles biological synapses, whose plasticity is also triggered by mechanisms that are determined by local waveforms. However, learning in memristive devices has so far been approached mostly on a pragmatic technological level. The focus seems to be on finding any waveform that achieves spike-timing-dependent plasticity (STDP), without regard to the biological veracity of said waveforms or to further important forms of plasticity. Bridging this gap, we make use of a plasticity model driven by neuron waveforms that explains a large number of experimental observations and adapt it to the characteristics of the recently introduced BiFeO3 memristive material. Based on this approach, we show STDP for the first time for this material, with learning window replication superior to previous memristor-based STDP implementations. We also demonstrate in measurements that it is possible to overlay short and long term plasticity at a memristive device in the form of the well-known triplet plasticity. To the best of our knowledge, this is the first implementations of triplet plasticity on any physical memristive device. 1
2 0.87862128 276 nips-2012-Probabilistic Event Cascades for Alzheimer's disease
Author: Jonathan Huang, Daniel Alexander
Abstract: Accurate and detailed models of neurodegenerative disease progression are crucially important for reliable early diagnosis and the determination of effective treatments. We introduce the ALPACA (Alzheimer’s disease Probabilistic Cascades) model, a generative model linking latent Alzheimer’s progression dynamics to observable biomarker data. In contrast with previous works which model disease progression as a fixed event ordering, we explicitly model the variability over such orderings among patients which is more realistic, particularly for highly detailed progression models. We describe efficient learning algorithms for ALPACA and discuss promising experimental results on a real cohort of Alzheimer’s patients from the Alzheimer’s Disease Neuroimaging Initiative. 1
3 0.84321231 79 nips-2012-Compressive neural representation of sparse, high-dimensional probabilities
Author: Xaq Pitkow
Abstract: This paper shows how sparse, high-dimensional probability distributions could be represented by neurons with exponential compression. The representation is a novel application of compressive sensing to sparse probability distributions rather than to the usual sparse signals. The compressive measurements correspond to expected values of nonlinear functions of the probabilistically distributed variables. When these expected values are estimated by sampling, the quality of the compressed representation is limited only by the quality of sampling. Since the compression preserves the geometric structure of the space of sparse probability distributions, probabilistic computation can be performed in the compressed domain. Interestingly, functions satisfying the requirements of compressive sensing can be implemented as simple perceptrons. If we use perceptrons as a simple model of feedforward computation by neurons, these results show that the mean activity of a relatively small number of neurons can accurately represent a highdimensional joint distribution implicitly, even without accounting for any noise correlations. This comprises a novel hypothesis for how neurons could encode probabilities in the brain. 1
same-paper 4 0.81258321 364 nips-2012-Weighted Likelihood Policy Search with Model Selection
Author: Tsuyoshi Ueno, Kohei Hayashi, Takashi Washio, Yoshinobu Kawahara
Abstract: Reinforcement learning (RL) methods based on direct policy search (DPS) have been actively discussed to achieve an efficient approach to complicated Markov decision processes (MDPs). Although they have brought much progress in practical applications of RL, there still remains an unsolved problem in DPS related to model selection for the policy. In this paper, we propose a novel DPS method, weighted likelihood policy search (WLPS), where a policy is efficiently learned through the weighted likelihood estimation. WLPS naturally connects DPS to the statistical inference problem and thus various sophisticated techniques in statistics can be applied to DPS problems directly. Hence, by following the idea of the information criterion, we develop a new measurement for model comparison in DPS based on the weighted log-likelihood.
5 0.79380614 357 nips-2012-Unsupervised Template Learning for Fine-Grained Object Recognition
Author: Shulin Yang, Liefeng Bo, Jue Wang, Linda G. Shapiro
Abstract: Fine-grained recognition refers to a subordinate level of recognition, such as recognizing different species of animals and plants. It differs from recognition of basic categories, such as humans, tables, and computers, in that there are global similarities in shape and structure shared cross different categories, and the differences are in the details of object parts. We suggest that the key to identifying the fine-grained differences lies in finding the right alignment of image regions that contain the same object parts. We propose a template model for the purpose, which captures common shape patterns of object parts, as well as the cooccurrence relation of the shape patterns. Once the image regions are aligned, extracted features are used for classification. Learning of the template model is efficient, and the recognition results we achieve significantly outperform the stateof-the-art algorithms. 1
6 0.78891551 287 nips-2012-Random function priors for exchangeable arrays with applications to graphs and relational data
7 0.78889918 68 nips-2012-Clustering Aggregation as Maximum-Weight Independent Set
8 0.78834641 177 nips-2012-Learning Invariant Representations of Molecules for Atomization Energy Prediction
9 0.7854386 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions
10 0.77573144 344 nips-2012-Timely Object Recognition
11 0.76389962 70 nips-2012-Clustering by Nonnegative Matrix Factorization Using Graph Random Walk
12 0.76203072 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed
13 0.75916797 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search
14 0.75700355 162 nips-2012-Inverse Reinforcement Learning through Structured Classification
15 0.7562114 38 nips-2012-Algorithms for Learning Markov Field Policies
16 0.75518548 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning
17 0.75426853 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning
18 0.75391001 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning
19 0.75182569 160 nips-2012-Imitation Learning by Coaching
20 0.75094116 120 nips-2012-Exact and Stable Recovery of Sequences of Signals with Sparse Increments via Differential 1-Minimization