nips nips2004 nips2004-102 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Pieter Abbeel, Andrew Y. Ng
Abstract: First-order Markov models have been successfully applied to many problems, for example in modeling sequential data using Markov chains, and modeling control problems using the Markov decision processes (MDP) formalism. If a first-order Markov model’s parameters are estimated from data, the standard maximum likelihood estimator considers only the first-order (single-step) transitions. But for many problems, the first-order conditional independence assumptions are not satisfied, and as a result the higher order transition probabilities may be poorly approximated. Motivated by the problem of learning an MDP’s parameters for control, we propose an algorithm for learning a first-order Markov model that explicitly takes into account higher order interactions during training. Our algorithm uses an optimization criterion different from maximum likelihood, and allows us to learn models that capture longer range effects, but without giving up the benefits of using first-order Markov models. Our experimental results also show the new algorithm outperforming conventional maximum likelihood estimation in a number of control problems where the MDP’s parameters are estimated from data.
Reference: text
sentIndex sentText sentNum sentScore
1 Pieter Abbeel and Andrew Y. Ng, Computer Science Department, Stanford University, Stanford, CA 94305. Abstract: First-order Markov models have been successfully applied to many problems, for example in modeling sequential data using Markov chains, and modeling control problems using the Markov decision processes (MDP) formalism. [sent-2, score-0.152]
2 If a first-order Markov model’s parameters are estimated from data, the standard maximum likelihood estimator considers only the first-order (single-step) transitions. [sent-3, score-0.128]
3 But for many problems, the first-order conditional independence assumptions are not satisfied, and as a result the higher order transition probabilities may be poorly approximated. [sent-4, score-0.12]
4 Our experimental results also show the new algorithm outperforming conventional maximum likelihood estimation in a number of control problems where the MDP’s parameters are estimated from data. [sent-7, score-0.144]
5 1 Introduction First-order Markov models have enjoyed numerous successes in many sequence modeling and control tasks, and are now a workhorse of machine learning. [sent-8, score-0.096]
6 [5] When the parameters of a Markov model are not known a priori, they are often estimated from data using maximum likelihood (ML) (and perhaps smoothing). [sent-10, score-0.09]
7 If the dynamics are truly governed by a first-order system, then the longer-range interactions would also be well modeled. [sent-13, score-0.082]
8 But if the system is not first-order, then interactions on longer time scales are often poorly approximated by a model fit using maximum likelihood. [sent-14, score-0.181]
9 In reinforcement learning and control tasks where the goal is to maximize our long-term expected rewards, the predictive accuracy of a model on long time scales can have a significant impact on the attained performance. [sent-15, score-0.119]
10 Letting St denote the state at time t, we initialize the system to S0 = 0, and let St = St−1 + εt , where the increments εt ∈ {−1, +1} are equally likely to be −1 or +1. [sent-19, score-0.145]
11 Further, regardless of the true correlation in the data, using maximum likelihood (ML) to estimate the model parameters from training data would return the same model, with $E[|S_T|] = O(\sqrt{T})$. [sent-24, score-0.152]
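The following sketch (not from the paper; all names and values are illustrative) simulates the random-walk example above with increments generated by a two-state Markov chain over {-1, +1}. Marginally each increment is still ±1 with probability 1/2, so a first-order ML fit of P(S_t | S_{t-1}) is the same in both regimes, yet E[|S_T|] differs sharply.

```python
# Illustrative sketch (not from the paper): random walk S_t = S_{t-1} + eps_t
# with Markov increments. stay_prob = 0.5 gives i.i.d. increments; stay_prob
# near 1 gives highly correlated increments, but the one-step statistics match.
import numpy as np

def mean_abs_final_state(T, n_runs, stay_prob, rng):
    finals = np.empty(n_runs)
    for r in range(n_runs):
        eps = rng.choice([-1, 1])
        s = 0
        for _ in range(T):
            if rng.random() > stay_prob:  # flip the increment's sign
                eps = -eps
            s += eps
        finals[r] = abs(s)
    return finals.mean()

rng = np.random.default_rng(0)
T = 400
print(mean_abs_final_state(T, 2000, stay_prob=0.5, rng=rng))   # roughly sqrt(T)
print(mean_abs_final_state(T, 2000, stay_prob=0.99, rng=rng))  # closer to O(T)
```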
12 To see how these effects can lead to poor performance on a control task, consider learning to control a vehicle (such as a car or a helicopter) under disturbances εt due to very strong winds. [sent-25, score-0.18]
13 The influence of the disturbances on the vehicle’s position over one time step may be small, but if the disturbances εt are highly correlated, their cumulative effect over time can be substantial. [sent-26, score-0.124]
14 If our model completely ignores these correlations, we may overestimate our ability to control the vehicle (thinking our variance in position is $O(T)$ rather than $O(T^2)$), and try to follow overly narrow/dangerous paths. [sent-27, score-0.087]
15 There, the consensus (assuming there is ample training data) seems to be that it is usually better to directly minimize the loss with respect to the ultimate performance measure, rather than an intermediate loss function such as the likelihood of the training data. [sent-30, score-0.087]
16 By analogy, when modeling a dynamical system for a control task, we are interested in having a model that accurately predicts the performance of different control policies—so that it can be used to select a good policy—and not in maximizing the likelihood of the observed sequence data. [sent-35, score-0.215]
17 , [2, 15, 11]), but most of this work has focused on speeding up exploration and planning rather than on accurately modeling non-Markovian dynamics. [sent-44, score-0.109]
18 We define our notation in Section 2, then formulate the model learning problem ignoring actions in Section 3, and propose a learning algorithm in Section 4. [sent-46, score-0.096]
19 A policy π is a mapping from states to probability distributions over actions. [sent-65, score-0.095]
20 Then the utility of π is $U(\pi) = E_{s_0 \sim D}[V^\pi(s_0)] = E[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid \pi] = \sum_{t=0}^{\infty} \gamma^t \sum_{s_t} P(S_t = s_t \mid \pi) R(s_t)$. [sent-67, score-1.776]
21 The second expectation above is with respect to the random state sequence s0 , s1 , . [sent-68, score-0.076]
22 drawn by starting from s0 ∼ D, picking actions according to π and transitioning according to P . [sent-71, score-0.078]
23 We denote by $\hat{U}(\pi)$ the utility of the policy π in an MDP whose first-order transition probabilities are given by $\hat{P}_\theta$ (and similarly by $\hat{V}^\pi$ the value function in the same MDP). [sent-73, score-0.225]
24 Thus, we have² $\hat{U}(\pi) = E_{s_0 \sim D}[\hat{V}^\pi(s_0)] = \hat{E}[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid \pi] = \sum_{t=0}^{\infty} \gamma^t \sum_{s_t} \hat{P}_\theta(S_t = s_t \mid \pi) R(s_t)$. [sent-74, score-1.746]
25 Note that if $|\hat{U}(\pi) - U(\pi)| \le \varepsilon$ for all π, then finding the optimal policy in the estimated MDP that uses parameters $\hat{P}_\theta$ (using value iteration or any other algorithm) will give a policy whose utility is within 2ε of the optimal utility. [sent-75, score-0.243]
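The 2ε claim follows from the standard simulation-lemma style argument; a brief sketch, where $\pi^*$ maximizes $U$ and $\hat{\pi}^*$ maximizes $\hat{U}$:

```latex
% Sketch of the standard argument, assuming |\hat{U}(\pi) - U(\pi)| \le \varepsilon for all \pi:
U(\hat{\pi}^*) \;\ge\; \hat{U}(\hat{\pi}^*) - \varepsilon
               \;\ge\; \hat{U}(\pi^*) - \varepsilon
               \;\ge\; U(\pi^*) - 2\varepsilon .
```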
26 Often we will also abbreviate P (St = st ) by P (st ). [sent-77, score-0.873]
27 Section 5 will discuss how actions can be incorporated into the model. [sent-79, score-0.078]
28 We have $|\hat{V}(s_0) - V(s_0)| = \left| \sum_{t=0}^{\infty} \gamma^t \sum_{s_t} \hat{P}_\theta(s_t \mid s_0) R(s_t) - \sum_{t=0}^{\infty} \gamma^t \sum_{s_t} P(s_t \mid s_0) R(s_t) \right| \le R_{\max} \sum_{t=0}^{\infty} \gamma^t \sum_{s_t} \left| \hat{P}_\theta(s_t \mid s_0) - P(s_t \mid s_0) \right|. \quad (1)$ [sent-81, score-1.746]
29 So, to ensure that $\hat{V}(s_0)$ is an accurate estimate of $V(s_0)$, we would like the parameters θ of the model to minimize the right-hand side of (1). [sent-82, score-0.873]
30 The term $\sum_{s_t} |\hat{P}_\theta(s_t \mid s_0) - P(s_t \mid s_0)|$ is exactly (twice) the variational distance between the two conditional distributions $\hat{P}_\theta(\cdot \mid s_0)$ and $P(\cdot \mid s_0)$. [sent-83, score-0.873]
31 So, given a training sequence $s_{0:T}$ sampled from P, we propose to estimate the transition probabilities $\hat{P}_\theta$ by $\hat{\theta} = \arg\max_\theta \sum_{t=0}^{T-1} \sum_{k=1}^{T-t} \gamma^k \log \hat{P}_\theta(s_{t+k} \mid s_t)$. (3) [sent-93, score-0.181]
32 (Footnote 2.) Since $\hat{P}_\theta$ is a first-order model, it explicitly parameterizes only $\hat{P}_\theta(S_{t+1} = s_{t+1} \mid S_t = s_t, A_t = a_t)$. [sent-104, score-0.873]
33 We use $\hat{P}_\theta(S_t = s_t \mid \pi)$ to denote the probability that $S_t = s_t$ in an MDP with one-step transition probabilities $\hat{P}_\theta(S_{t+1} = s_{t+1} \mid S_t = s_t, A_t = a_t)$ and initial state distribution D when acting according to the policy π. [sent-105, score-2.916]
34 Maximum likelihood (ML) estimation maximizes $f_{ML}(\theta) = \log \hat{P}_\theta(s_1 \mid s_0) + \log \hat{P}_\theta(s_2 \mid s_1) + \log \hat{P}_\theta(s_3 \mid s_2)$. [sent-115, score-0.16]
35 The objective (Eqn. 2), up to the discount weights, is $f_{ML}(\theta) + \log \hat{P}_\theta(s_2 \mid s_0) + \log \hat{P}_\theta(s_3 \mid s_1) + \log \hat{P}_\theta(s_3 \mid s_0)$. [sent-118, score-0.117]
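A minimal sketch (hypothetical function names) contrasting the two objectives on a short sequence. It assumes only the standard fact that, under a first-order model with transition matrix P, the k-step probability $\hat{P}_\theta(s_{t+k} \mid s_t)$ is the $(s_t, s_{t+k})$ entry of the k-th matrix power of P.

```python
# Sketch: ML objective vs. the multi-step objective of Eqn. (2)/(3) on s_0..s_3.
import numpy as np

def ml_objective(P, seq):
    return sum(np.log(P[seq[t], seq[t + 1]]) for t in range(len(seq) - 1))

def multistep_objective(P, seq, gamma):
    T = len(seq) - 1
    total = 0.0
    for t in range(T):
        Pk = np.eye(P.shape[0])
        for k in range(1, T - t + 1):
            Pk = Pk @ P                                  # Pk is now P^k
            total += gamma ** k * np.log(Pk[seq[t], seq[t + k]])
    return total

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
seq = [0, 0, 1, 1]                                       # s_0, s_1, s_2, s_3
print(ml_objective(P, seq))                              # one-step terms only
print(multistep_objective(P, seq, gamma=0.9))            # adds the k > 1 terms
```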
36 As in standard forward-backward inference (see, e.g., [12, 10]), the pairwise marginals can be computed by a forward propagation (computing the forward messages), a backward propagation (computing the backward messages), and then combining the forward and backward messages. [sent-131, score-0.458]
37 In this case we simply have $\hat{P}_\theta(S_{t+1} = j, S_t = i \mid S_t = s_t, S_{t+1} = s_{t+1}) = 1\{i = s_t\}\, 1\{j = s_{t+1}\}$. [sent-136, score-1.746]
38 where we initialize $m_{\to t}(i) = 1\{i = s_t\}$, and $m_{t+k \leftarrow}(i) = 1\{i = s_{t+k}\}$. [sent-137, score-0.873]
39 The pairwise marginals can be computed by combining the forward and backward messages: $\hat{P}_\theta(S_{t+l+1} = j, S_{t+l} = i \mid S_t = s_t, S_{t+k} = s_{t+k}) = m_{\to t+l}(i)\, \hat{P}_\theta(j \mid i)\, m_{t+l+1 \leftarrow}(j)$. (6) [sent-138, score-1.079]
40 For the term $\log \hat{P}_\theta(s_{t+k} \mid s_t)$, we end up performing $2(k-1)$ message computations, and combining messages into pairwise marginals $k-1$ times. [sent-139, score-0.232]
41 Doing this for all terms in the objective results in $O(T^3)$ message computations and $O(T^3)$ computations of pairwise marginals from these messages. [sent-140, score-0.143]
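A simplified sketch of this forward-backward computation for a plain Markov chain (actions omitted; the helper name and interface are assumptions, not the paper's code). Conditioning on $S_t = i_0$ and $S_{t+k} = i_k$, the pairwise marginal of $(S_{t+l}, S_{t+l+1})$ combines a forward message, one transition, and a backward message, normalized by the k-step probability $(P^k)[i_0, i_k]$.

```python
# Sketch (assumed interface): pairwise marginal of a chain segment given its endpoints.
import numpy as np

def pairwise_marginal(P, i0, ik, k, l):
    """Return M with M[i, j] = P(S_{t+l}=i, S_{t+l+1}=j | S_t=i0, S_{t+k}=ik)."""
    forward = np.linalg.matrix_power(P, l)[i0, :]            # m_{->t+l}(i)
    backward = np.linalg.matrix_power(P, k - l - 1)[:, ik]   # m_{t+l+1<-}(j)
    joint = forward[:, None] * P * backward[None, :]
    return joint / np.linalg.matrix_power(P, k)[i0, ik]      # normalize by P^k[i0, ik]

P = np.array([[0.9, 0.1],
              [0.3, 0.7]])
M = pairwise_marginal(P, i0=0, ik=1, k=4, l=1)
print(M, M.sum())   # entries sum to 1
```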
42 In practice, the objective (2) can be approximated by considering only the terms in the summation with k ≤ H, where H is some time horizon. [sent-141, score-0.099]
43 The forward messages computed for the term log Pθ (st+k |st ) depend only on the value of st . [sent-145, score-1.059]
44 So the forward messages computed for the terms $\{\log \hat{P}_\theta(s_{t+k} \mid s_t)\}_{k=1}^{H}$ are the same as the forward messages computed just for the term $\log \hat{P}_\theta(s_{t+H} \mid s_t)$. [sent-146, score-0.333]
45 As a result, we need to compute only $O(TH)$ messages (as opposed to $O(TH^2)$ in the naive algorithm). [sent-148, score-0.087]
46 Consider two terms in the objective, $\log \hat{P}_\theta(s_{t_1+k} \mid s_{t_1})$ and $\log \hat{P}_\theta(s_{t_2+k} \mid s_{t_2})$. [sent-150, score-0.115]
47 If $s_{t_1} = s_{t_2}$ and $s_{t_1+k} = s_{t_2+k}$, then both terms will have exactly the same pairwise marginals and contribution to the expected counts. [sent-151, score-0.08]
48 As a consequence, the running time for each iteration (once we have made an initial pass over the data to count the number of occurrences of the triples) is only $O(|S|^2 H^2)$, which is independent of the size of the training data. [sent-153, score-0.086]
49 5 Incorporating actions In decision processes, actions influence the state transition probabilities. [sent-154, score-0.304]
50 To generate training data, suppose we choose an exploration policy and take actions in the DP using this policy. [sent-155, score-0.26]
51 (7) The EM algorithm is straightforwardly extended to this setting, by conditioning on the actions during the E-step, and updating state-action transition probabilities Pθ (j|i, a) in the M-step. [sent-158, score-0.195]
52 As before, forward messages need to be computed only once for each value of t, and backward messages only once for each value of t + k. [sent-159, score-0.3]
53 In particular, now the contribution of a triple i, j, k (one for which (St = i, St+k = j) occurs in the training data) depends on the action sequence at:t+k−1 . [sent-162, score-0.089]
54 The number of possible sequences of actions at:t+k−1 grows exponentially with k. [sent-163, score-0.078]
55 However, a single deterministic exploration policy, by definition, cannot explore all state-action pairs. [sent-165, score-0.093]
56 Thus, we will instead use a combination of several deterministic exploration policies, which jointly can explore all state-action pairs. [sent-166, score-0.093]
57 In this case, the running time for the E-step becomes $O(|S|^2 H^2 |\Pi|)$, where $|\Pi|$ is the number of different deterministic exploration policies used. [sent-167, score-0.177]
58 Because of the discount term $\gamma^k$ in the objective (2), one can safely truncate the summation over k after about $O(1/(1-\gamma))$ terms without incurring too much error. [sent-170, score-0.082]
59 (b) Grid-world experimental results, showing the utilities of policies obtained from the MDP estimated using ML (dash-dot line), and utilities of policies obtained from the MDP estimated using our objective (solid line). [sent-182, score-0.345]
60 Consider an agent acting for 100 time steps in the grid-world in Figure 2(a). [sent-192, score-0.103]
61 The initial state is marked by S, and the absorbing goal state by G. [sent-193, score-0.131]
62 This DP has four actions that (try to) move in each of the four compass directions, and succeed with probability 1 − p. [sent-195, score-0.104]
63 In this problem, if there is no noise (p = 0), the optimal policy is to follow one of the shortest paths to the goal that do not pass through gray squares, such as path A. [sent-198, score-0.16]
64 For higher noise levels, the optimal policy is to stay as far away as possible from the gray squares, and try to follow a longer path such as B to the goal. [sent-199, score-0.182]
65 7 At intermediate noise levels, the optimal policy is strongly dependent on how correlated the noise is between successive time steps. [sent-200, score-0.22]
66 8 Figure 2(b) shows the utilities obtained by the two different models, under different degrees of correlation in the noise. [sent-203, score-0.133]
67 Empirically, when the noise correlation is high, our algorithm seems to be fitting a first-order model with a larger “effective” noise level. [sent-205, score-0.145]
68 In contrast, the ML estimate performs poorly in this problem because it tends to underestimate how far sideways the agent tends to move due to the noise (cf. [sent-207, score-0.09]
69 Experimental details: The noise is governed by an (unobserved) Markov chain with four states corresponding to the four compass directions. [sent-213, score-0.094]
70 If an action at time t is not successful, the agent moves in the direction corresponding to the state of this Markov chain. [sent-214, score-0.155]
71 A 200,000 length state-action sequence for the grid-world, generated using a random exploration policy, was used for model fitting, and a constant noise level p = 0. [sent-218, score-0.126]
72 Given a learned MDP model, value iteration was used to find the optimal policy for it. [sent-220, score-0.095]
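A minimal value-iteration sketch for the estimated MDP (interface and names are illustrative assumptions): P_hat[a, i, j] is the learned probability of moving from state i to state j under action a, and R[i] is the state reward, as in the experiments described here.

```python
# Sketch: value iteration on the learned first-order MDP model.
import numpy as np

def value_iteration(P_hat, R, gamma, n_iters=1000):
    n_actions, n_states, _ = P_hat.shape
    V = np.zeros(n_states)
    for _ in range(n_iters):
        # Q[a, s] = R(s) + gamma * sum_j P_hat(j | s, a) * V(j)
        Q = R[None, :] + gamma * (P_hat @ V)
        V = Q.max(axis=0)
    policy = Q.argmax(axis=0)          # greedy policy: best action in each state
    return policy, V
```

Only the transition model P_hat changes between the two methods being compared; whether it was fit by ML counts or by the multi-step objective, the same value-iteration step produces the policy whose utility is then evaluated.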
73 2 Queue We consider a service queue in which the average arrival rate is p. [sent-224, score-0.35]
74 Also, for each action i, let qi denote the service rate under that action (thus, qi = P (a customer is served in one time step|action = i)). [sent-226, score-0.243]
75 In our problem, there are three service rates q0 < q1 < q2 with respective rewards 0, −1, −10. [sent-227, score-0.106]
76 The maximum queue size is 20, and the reward for any state of the queue is 0, except when the queue becomes full, which results in a reward of -1000. [sent-228, score-0.463]
77 So the inexpensive service rate q1 is sufficient to keep up with arrivals on average. [sent-231, score-0.155]
78 However, even though the average arrival rate is p, the arrivals come in “bursts,” and even the high service rate q2 is insufficient to keep the queue small during the bursts of many consecutive arrivals. [sent-232, score-0.449]
79 9 Experimental results on the queue are shown in Figure 2(c). [sent-233, score-0.109]
80 We plot the utilities obtained using each of the two algorithms for high arrival correlations. [sent-234, score-0.207]
81 We see that the policies obtained with our algorithm consistently outperform those obtained using maximum likelihood to fit the model parameters. [sent-236, score-0.128]
82 For learning the model parameters, we used three deterministic exploration policies, each corresponding to always taking one of the three actions. [sent-240, score-0.093]
83 A single EM iteration for the experiments on the queue took 6 minutes for the original version of the algorithm, but took only 3 seconds for the more efficient version; this represents more than a 100-fold speedup. [sent-243, score-0.109]
84 7 Conclusions We proposed a method for learning a first-order Markov model that captures the system’s dynamics on longer time scales than a single time step. [sent-244, score-0.109]
85 The first mode has a very low arrival rate, and the second mode has a very high arrival rate. [sent-273, score-0.406]
86 We denote the steady state distribution over the two modes by (φ1 , φ2 ). [sent-274, score-0.098]
87 (I.e., the system spends a fraction $\phi_1$ of the time in the low arrival rate mode, and a fraction $\phi_2 = 1 - \phi_1$ of the time in the high arrival rate mode.) [sent-277, score-0.386]
88 Given the steady state distribution, the state transition matrix $[a, 1-a; 1-b, b]$ has only one remaining degree of freedom, which (essentially) controls how often the system switches between the two modes. [sent-278, score-0.243]
89 The parameter a [resp. b] is the probability, when in the slow [resp. fast] mode, of staying in the same mode the next time step. [sent-281, score-0.089]
90 In our experiments, q0 = 0, and q1 was equal to the average arrival rate p = φ1 P (arrival|mode 1) + φ2 P (arrival|mode 2). [sent-291, score-0.159]
91 Training data was generated using 8000 simulations of 25 time steps each, in which the queue length is initialized randomly, and the same (randomly chosen) action is taken on all 25 time steps. [sent-294, score-0.202]
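A sketch of this data-generation process, assuming the Markov-modulated arrival model described above; any numeric values not stated in the text (service rates, per-mode arrival probabilities, mode persistence) are illustrative guesses, not the paper's settings.

```python
# Sketch: bursty-arrival queue episodes, each run under one fixed action.
import numpy as np

rng = np.random.default_rng(0)
MAX_QUEUE = 20
q_rates = [0.0, 0.45, 0.9]        # q0 < q1 < q2 (illustrative values)
p_arrival = [0.05, 0.95]          # P(arrival | mode); mode 1 is the burst mode
stay = [0.95, 0.9]                # a, b: P(stay in current mode)

def episode(T=25):
    mode = int(rng.integers(2))
    queue = int(rng.integers(MAX_QUEUE + 1))   # random initial queue length
    action = int(rng.integers(3))              # same action on all T steps
    states = [queue]
    for _ in range(T):
        if rng.random() > stay[mode]:          # possibly switch arrival mode
            mode = 1 - mode
        if rng.random() < p_arrival[mode]:     # bursty arrivals
            queue = min(queue + 1, MAX_QUEUE)
        if queue > 0 and rng.random() < q_rates[action]:
            queue -= 1                         # one customer served
        states.append(queue)
    return states, action

training_data = [episode() for _ in range(8000)]
```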
92 We used a discount factor of 0.95 and approximated utilities by truncating at a finite horizon of 100. [sent-298, score-0.088]
93 Note that because of discounting, the objective is slightly different from the standard setting of learning the parameters of a Markov chain with unobserved variables in the training data. [sent-388, score-0.083]
94 Here, the summation is over all possible state sequences St+1:t+k−1 . [sent-393, score-0.077]
95 So we have $\sum_{t=0}^{T-1} \sum_{k=1}^{T-t} \gamma^k \log \hat{P}_\theta(s_{t+k} \mid s_t, a_{t:t+k-1}) \ge \sum_{t=0}^{T-1} \gamma \log \hat{P}_\theta(s_{t+1} \mid s_t, a_t) + \sum_{t=0}^{T-1} \sum_{k=2}^{T-t} \gamma^k \sum_{S_{t+1:t+k-1}} Q_{t,k}(S_{t+1:t+k-1}) \log \frac{\hat{P}_\theta(s_{t+k} \mid S_{t+k-1}, a_{t+k-1})\, \hat{P}_\theta(S_{t+k-1} \mid S_{t+k-2}, a_{t+k-2}) \cdots}{Q_{t,k}(S_{t+1:t+k-1})}$. (8) [sent-394, score-0.117]
96 We maximize the lower bound (8) by alternately optimizing with respect to the distributions $Q_{t,k}$ (E-step), and the transition probabilities $\hat{P}_\theta(\cdot \mid \cdot, \cdot)$ (M-step). [sent-403, score-0.129]
97 Optimizing with respect to $Q_{t,k}$ for $\hat{P}_\theta$ fixed (E-step) gives $Q_{t,k}(S_{t+1:t+k-1}) = \hat{P}_\theta(S_{t+1}, \ldots, S_{t+k-1} \mid S_t = s_t, S_{t+k} = s_{t+k}, A_{t:t+k-1} = a_{t:t+k-1})$. (9) [sent-407, score-0.873]
98 Optimizing with respect to the transition probabilities $\hat{P}_\theta(\cdot \mid \cdot, \cdot)$ (M-step) for $Q_{t,k}$ fixed as in Eqn. [sent-408, score-0.1]
99 (9) is done by updating θ to $\theta_{\mathrm{new}}$ such that for all $i, j \in S$ and all $a \in A$ we have $\hat{P}_{\theta_{\mathrm{new}}}(j \mid i, a) = \mathrm{stats}(j, i, a) / \sum_{k \in S} \mathrm{stats}(k, i, a)$, where $\mathrm{stats}(j, i, a) = \sum_{t=0}^{T-1} \sum_{k=1}^{T-t} \sum_{l=0}^{k-1} \gamma^k\, \hat{P}_\theta(S_{t+l+1} = j, S_{t+l} = i \mid S_t = s_t, S_{t+k} = s_{t+k}, A_{t:t+k-1} = a_{t:t+k-1})\, 1\{a_{t+l} = a\}$. [sent-409, score-0.873]
100 Note that only the pairwise marginals $\hat{P}_\theta(S_{t+l+1}, S_{t+l} \mid S_t, S_{t+k}, A_{t:t+k-1})$ are needed in the M-step, and so it is sufficient to compute only these when optimizing with respect to the $Q_{t,k}$ variables in the E-step. [sent-410, score-0.109]
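A sketch of the bookkeeping implied by this update (the data layout and function names are assumptions, not the paper's code): the E-step supplies discounted pairwise marginals, which are accumulated into stats(j, i, a) and then normalized row-wise in the M-step.

```python
# Sketch: accumulate expected pairwise counts and normalize them (M-step).
import numpy as np

def accumulate_stats(pairwise_terms, gamma, n_actions, n_states):
    """pairwise_terms: iterable of (k, a, M) tuples, one per (t, k, l) term, where
    M[i, j] is the pairwise marginal of (S_{t+l}=i, S_{t+l+1}=j) given the endpoints
    and actions, and a is the action a_{t+l}; each term is weighted by gamma^k."""
    stats = np.zeros((n_actions, n_states, n_states))
    for k, a, M in pairwise_terms:
        stats[a] += gamma ** k * M
    return stats

def m_step(stats):
    """Normalize expected counts into P_new[a, i, j] = P(j | i, a)."""
    totals = stats.sum(axis=2, keepdims=True)
    totals[totals == 0] = 1.0          # leave never-visited (i, a) rows at zero
    return stats / totals
```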
wordName wordTfidf (topN-words)
[('st', 0.873), ('arrival', 0.137), ('ml', 0.121), ('queue', 0.109), ('mdp', 0.108), ('markov', 0.101), ('policy', 0.095), ('messages', 0.087), ('stats', 0.082), ('service', 0.082), ('actions', 0.078), ('transition', 0.071), ('utilities', 0.07), ('mode', 0.066), ('backward', 0.066), ('exploration', 0.065), ('correlation', 0.063), ('policies', 0.061), ('forward', 0.06), ('state', 0.056), ('control', 0.054), ('marginals', 0.052), ('arrivals', 0.051), ('action', 0.047), ('increments', 0.044), ('likelihood', 0.043), ('noise', 0.041), ('log', 0.039), ('disturbances', 0.039), ('objective', 0.037), ('em', 0.035), ('mt', 0.035), ('interactions', 0.033), ('processes', 0.033), ('vehicle', 0.033), ('utility', 0.03), ('probabilities', 0.029), ('agent', 0.029), ('optimizing', 0.029), ('dp', 0.028), ('reward', 0.028), ('pairwise', 0.028), ('deterministic', 0.028), ('savings', 0.028), ('acting', 0.027), ('governed', 0.027), ('levels', 0.027), ('message', 0.026), ('compass', 0.026), ('bursts', 0.026), ('stanford', 0.025), ('discount', 0.024), ('unobserved', 0.024), ('modes', 0.024), ('rewards', 0.024), ('path', 0.024), ('maximum', 0.024), ('paragraph', 0.023), ('triples', 0.023), ('reinforcement', 0.023), ('counts', 0.023), ('estimated', 0.023), ('time', 0.023), ('system', 0.022), ('optimizes', 0.022), ('planning', 0.022), ('dynamics', 0.022), ('longer', 0.022), ('rmax', 0.022), ('customer', 0.022), ('exposition', 0.022), ('occurrences', 0.022), ('pomdps', 0.022), ('appendix', 0.022), ('modeling', 0.022), ('training', 0.022), ('rate', 0.022), ('summation', 0.021), ('decision', 0.021), ('fm', 0.021), ('poorly', 0.02), ('sequence', 0.02), ('correlated', 0.02), ('switches', 0.02), ('considers', 0.019), ('mixed', 0.019), ('scales', 0.019), ('initial', 0.019), ('estimator', 0.019), ('transitions', 0.019), ('structured', 0.019), ('ignoring', 0.018), ('steady', 0.018), ('squares', 0.018), ('substantial', 0.018), ('approximated', 0.018), ('tting', 0.018), ('stochastic', 0.017), ('conditioning', 0.017), ('var', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999934 102 nips-2004-Learning first-order Markov models for control
Author: Pieter Abbeel, Andrew Y. Ng
Abstract: First-order Markov models have been successfully applied to many problems, for example in modeling sequential data using Markov chains, and modeling control problems using the Markov decision processes (MDP) formalism. If a first-order Markov model’s parameters are estimated from data, the standard maximum likelihood estimator considers only the first-order (single-step) transitions. But for many problems, the firstorder conditional independence assumptions are not satisfied, and as a result the higher order transition probabilities may be poorly approximated. Motivated by the problem of learning an MDP’s parameters for control, we propose an algorithm for learning a first-order Markov model that explicitly takes into account higher order interactions during training. Our algorithm uses an optimization criterion different from maximum likelihood, and allows us to learn models that capture longer range effects, but without giving up the benefits of using first-order Markov models. Our experimental results also show the new algorithm outperforming conventional maximum likelihood estimation in a number of control problems where the MDP’s parameters are estimated from data. 1
2 0.2718614 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications
Author: Ruei-sung Lin, David A. Ross, Jongwoo Lim, Ming-Hsuan Yang
Abstract: This paper presents an adaptive discriminative generative model that generalizes the conventional Fisher Linear Discriminant algorithm and renders a proper probabilistic interpretation. Within the context of object tracking, we aim to find a discriminative generative model that best separates the target from the background. We present a computationally efficient algorithm to constantly update this discriminative model as time progresses. While most tracking algorithms operate on the premise that the object appearance or ambient lighting condition does not significantly change as time progresses, our method adapts a discriminative generative model to reflect appearance variation of the target and background, thereby facilitating the tracking task in ever-changing environments. Numerous experiments show that our method is able to learn a discriminative generative model for tracking target objects undergoing large pose and lighting changes.
3 0.2376049 88 nips-2004-Intrinsically Motivated Reinforcement Learning
Author: Nuttapong Chentanez, Andrew G. Barto, Satinder P. Singh
Abstract: Psychologists call behavior intrinsically motivated when it is engaged in for its own sake rather than as a step toward solving a specific problem of clear practical value. But what we learn during intrinsically motivated behavior is essential for our development as competent autonomous entities able to efficiently solve a wide range of practical problems as they arise. In this paper we present initial results from a computational study of intrinsically motivated reinforcement learning aimed at allowing artificial agents to construct and extend hierarchies of reusable skills that are needed for competent autonomy. 1
4 0.17692205 64 nips-2004-Experts in a Markov Decision Process
Author: Eyal Even-dar, Sham M. Kakade, Yishay Mansour
Abstract: We consider an MDP setting in which the reward function is allowed to change during each time step of play (possibly in an adversarial manner), yet the dynamics remain fixed. Similar to the experts setting, we address the question of how well can an agent do when compared to the reward achieved under the best stationary policy over time. We provide efficient algorithms, which have regret bounds with no dependence on the size of state space. Instead, these bounds depend only on a certain horizon time of the process and logarithmically on the number of actions. We also show that in the case that the dynamics change over time, the problem becomes computationally hard. 1
5 0.13403997 5 nips-2004-A Harmonic Excitation State-Space Approach to Blind Separation of Speech
Author: Rasmus K. Olsson, Lars K. Hansen
Abstract: We discuss an identification framework for noisy speech mixtures. A block-based generative model is formulated that explicitly incorporates the time-varying harmonic plus noise (H+N) model for a number of latent sources observed through noisy convolutive mixtures. All parameters including the pitches of the source signals, the amplitudes and phases of the sources, the mixing filters and the noise statistics are estimated by maximum likelihood, using an EM-algorithm. Exact averaging over the hidden sources is obtained using the Kalman smoother. We show that pitch estimation and source separation can be performed simultaneously. The pitch estimates are compared to laryngograph (EGG) measurements. Artificial and real room mixtures are used to demonstrate the viability of the approach. Intelligible speech signals are re-synthesized from the estimated H+N models.
6 0.11879866 138 nips-2004-Online Bounds for Bayesian Algorithms
7 0.097753398 39 nips-2004-Coarticulation in Markov Decision Processes
8 0.089931853 59 nips-2004-Efficient Kernel Discriminant Analysis via QR Decomposition
9 0.089677162 119 nips-2004-Mistake Bounds for Maximum Entropy Discrimination
10 0.085201003 24 nips-2004-Approximately Efficient Online Mechanism Design
11 0.079504885 28 nips-2004-Bayesian inference in spiking neurons
12 0.078622438 202 nips-2004-VDCBPI: an Approximate Scalable Algorithm for Large POMDPs
13 0.078268938 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification
14 0.069727413 27 nips-2004-Bayesian Regularization and Nonnegative Deconvolution for Time Delay Estimation
15 0.069107294 1 nips-2004-A Cost-Shaping LP for Bellman Error Minimization with Performance Guarantees
16 0.065958031 147 nips-2004-Planning for Markov Decision Processes with Sparse Stochasticity
17 0.056566827 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods
18 0.055086702 116 nips-2004-Message Errors in Belief Propagation
19 0.054535564 50 nips-2004-Dependent Gaussian Processes
20 0.053075645 154 nips-2004-Resolving Perceptual Aliasing In The Presence Of Noisy Sensors
topicId topicWeight
[(0, -0.163), (1, -0.046), (2, 0.239), (3, -0.136), (4, -0.117), (5, 0.048), (6, -0.016), (7, -0.079), (8, -0.029), (9, 0.13), (10, 0.149), (11, 0.031), (12, 0.188), (13, 0.154), (14, 0.005), (15, -0.066), (16, -0.083), (17, -0.182), (18, 0.035), (19, 0.136), (20, 0.004), (21, -0.032), (22, 0.037), (23, -0.006), (24, -0.2), (25, -0.029), (26, -0.06), (27, 0.054), (28, -0.001), (29, -0.074), (30, -0.032), (31, 0.126), (32, 0.063), (33, -0.057), (34, -0.048), (35, -0.185), (36, -0.094), (37, -0.096), (38, 0.045), (39, 0.13), (40, -0.148), (41, 0.09), (42, -0.104), (43, 0.024), (44, 0.087), (45, -0.044), (46, -0.064), (47, -0.031), (48, -0.103), (49, 0.064)]
simIndex simValue paperId paperTitle
same-paper 1 0.98539472 102 nips-2004-Learning first-order Markov models for control
Author: Pieter Abbeel, Andrew Y. Ng
Abstract: First-order Markov models have been successfully applied to many problems, for example in modeling sequential data using Markov chains, and modeling control problems using the Markov decision processes (MDP) formalism. If a first-order Markov model’s parameters are estimated from data, the standard maximum likelihood estimator considers only the first-order (single-step) transitions. But for many problems, the firstorder conditional independence assumptions are not satisfied, and as a result the higher order transition probabilities may be poorly approximated. Motivated by the problem of learning an MDP’s parameters for control, we propose an algorithm for learning a first-order Markov model that explicitly takes into account higher order interactions during training. Our algorithm uses an optimization criterion different from maximum likelihood, and allows us to learn models that capture longer range effects, but without giving up the benefits of using first-order Markov models. Our experimental results also show the new algorithm outperforming conventional maximum likelihood estimation in a number of control problems where the MDP’s parameters are estimated from data. 1
2 0.73592561 88 nips-2004-Intrinsically Motivated Reinforcement Learning
Author: Nuttapong Chentanez, Andrew G. Barto, Satinder P. Singh
Abstract: Psychologists call behavior intrinsically motivated when it is engaged in for its own sake rather than as a step toward solving a specific problem of clear practical value. But what we learn during intrinsically motivated behavior is essential for our development as competent autonomous entities able to efficiently solve a wide range of practical problems as they arise. In this paper we present initial results from a computational study of intrinsically motivated reinforcement learning aimed at allowing artificial agents to construct and extend hierarchies of reusable skills that are needed for competent autonomy. 1
3 0.5596928 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications
Author: Ruei-sung Lin, David A. Ross, Jongwoo Lim, Ming-Hsuan Yang
Abstract: This paper presents an adaptive discriminative generative model that generalizes the conventional Fisher Linear Discriminant algorithm and renders a proper probabilistic interpretation. Within the context of object tracking, we aim to find a discriminative generative model that best separates the target from the background. We present a computationally efficient algorithm to constantly update this discriminative model as time progresses. While most tracking algorithms operate on the premise that the object appearance or ambient lighting condition does not significantly change as time progresses, our method adapts a discriminative generative model to reflect appearance variation of the target and background, thereby facilitating the tracking task in ever-changing environments. Numerous experiments show that our method is able to learn a discriminative generative model for tracking target objects undergoing large pose and lighting changes.
4 0.54298216 64 nips-2004-Experts in a Markov Decision Process
Author: Eyal Even-dar, Sham M. Kakade, Yishay Mansour
Abstract: We consider an MDP setting in which the reward function is allowed to change during each time step of play (possibly in an adversarial manner), yet the dynamics remain fixed. Similar to the experts setting, we address the question of how well can an agent do when compared to the reward achieved under the best stationary policy over time. We provide efficient algorithms, which have regret bounds with no dependence on the size of state space. Instead, these bounds depend only on a certain horizon time of the process and logarithmically on the number of actions. We also show that in the case that the dynamics change over time, the problem becomes computationally hard. 1
5 0.46510628 138 nips-2004-Online Bounds for Bayesian Algorithms
Author: Sham M. Kakade, Andrew Y. Ng
Abstract: We present a competitive analysis of Bayesian learning algorithms in the online learning setting and show that many simple Bayesian algorithms (such as Gaussian linear regression and Bayesian logistic regression) perform favorably when compared, in retrospect, to the single best model in the model class. The analysis does not assume that the Bayesian algorithms’ modeling assumptions are “correct,” and our bounds hold even if the data is adversarially chosen. For Gaussian linear regression (using logloss), our error bounds are comparable to the best bounds in the online learning literature, and we also provide a lower bound showing that Gaussian linear regression is optimal in a certain worst case sense. We also give bounds for some widely used maximum a posteriori (MAP) estimation algorithms, including regularized logistic regression. 1
6 0.37872475 39 nips-2004-Coarticulation in Markov Decision Processes
7 0.3435536 65 nips-2004-Exploration-Exploitation Tradeoffs for Experts Algorithms in Reactive Environments
8 0.32956362 5 nips-2004-A Harmonic Excitation State-Space Approach to Blind Separation of Speech
9 0.30543607 27 nips-2004-Bayesian Regularization and Nonnegative Deconvolution for Time Delay Estimation
10 0.28513923 154 nips-2004-Resolving Perceptual Aliasing In The Presence Of Noisy Sensors
11 0.28412426 147 nips-2004-Planning for Markov Decision Processes with Sparse Stochasticity
12 0.27281436 24 nips-2004-Approximately Efficient Online Mechanism Design
13 0.27163452 28 nips-2004-Bayesian inference in spiking neurons
14 0.26517895 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill
15 0.26264468 76 nips-2004-Hierarchical Bayesian Inference in Networks of Spiking Neurons
16 0.25965077 1 nips-2004-A Cost-Shaping LP for Bellman Error Minimization with Performance Guarantees
17 0.25714096 119 nips-2004-Mistake Bounds for Maximum Entropy Discrimination
18 0.25335893 56 nips-2004-Dynamic Bayesian Networks for Brain-Computer Interfaces
19 0.23863314 190 nips-2004-The Rescorla-Wagner Algorithm and Maximum Likelihood Estimation of Causal Parameters
20 0.23610139 59 nips-2004-Efficient Kernel Discriminant Analysis via QR Decomposition
topicId topicWeight
[(13, 0.177), (15, 0.141), (26, 0.047), (31, 0.035), (33, 0.216), (35, 0.03), (39, 0.035), (50, 0.028), (71, 0.013), (82, 0.016), (86, 0.144)]
simIndex simValue paperId paperTitle
1 0.95593941 156 nips-2004-Result Analysis of the NIPS 2003 Feature Selection Challenge
Author: Isabelle Guyon, Steve Gunn, Asa Ben-Hur, Gideon Dror
Abstract: The NIPS 2003 workshops included a feature selection competition organized by the authors. We provided participants with five datasets from different application domains and called for classification results using a minimal number of features. The competition took place over a period of 13 weeks and attracted 78 research groups. Participants were asked to make on-line submissions on the validation and test sets, with performance on the validation set being presented immediately to the participant and performance on the test set presented to the participants at the workshop. In total 1863 entries were made on the validation sets during the development period and 135 entries on all test sets for the final competition. The winners used a combination of Bayesian neural networks with ARD priors and Dirichlet diffusion trees. Other top entries used a variety of methods for feature selection, which combined filters and/or wrapper or embedded methods using Random Forests, kernel methods, or neural networks as a classification engine. The results of the benchmark (including the predictions made by the participants and the features they selected) and the scoring software are publicly available. The benchmark is available at www.nipsfsc.ecs.soton.ac.uk for post-challenge submissions to stimulate further research. 1
same-paper 2 0.92408937 102 nips-2004-Learning first-order Markov models for control
Author: Pieter Abbeel, Andrew Y. Ng
Abstract: First-order Markov models have been successfully applied to many problems, for example in modeling sequential data using Markov chains, and modeling control problems using the Markov decision processes (MDP) formalism. If a first-order Markov model’s parameters are estimated from data, the standard maximum likelihood estimator considers only the first-order (single-step) transitions. But for many problems, the firstorder conditional independence assumptions are not satisfied, and as a result the higher order transition probabilities may be poorly approximated. Motivated by the problem of learning an MDP’s parameters for control, we propose an algorithm for learning a first-order Markov model that explicitly takes into account higher order interactions during training. Our algorithm uses an optimization criterion different from maximum likelihood, and allows us to learn models that capture longer range effects, but without giving up the benefits of using first-order Markov models. Our experimental results also show the new algorithm outperforming conventional maximum likelihood estimation in a number of control problems where the MDP’s parameters are estimated from data. 1
3 0.92041528 101 nips-2004-Learning Syntactic Patterns for Automatic Hypernym Discovery
Author: Rion Snow, Daniel Jurafsky, Andrew Y. Ng
Abstract: Semantic taxonomies such as WordNet provide a rich source of knowledge for natural language processing applications, but are expensive to build, maintain, and extend. Motivated by the problem of automatically constructing and extending such taxonomies, in this paper we present a new algorithm for automatically learning hypernym (is-a) relations from text. Our method generalizes earlier work that had relied on using small numbers of hand-crafted regular expression patterns to identify hypernym pairs. Using “dependency path” features extracted from parse trees, we introduce a general-purpose formalization and generalization of these patterns. Given a training set of text containing known hypernym pairs, our algorithm automatically extracts useful dependency paths and applies them to new corpora to identify novel pairs. On our evaluation task (determining whether two nouns in a news article participate in a hypernym relationship), our automatically extracted database of hypernyms attains both higher precision and higher recall than WordNet. 1
4 0.9023279 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons
Author: Jochen Triesch
Abstract: This paper explores the computational consequences of simultaneous intrinsic and synaptic plasticity in individual model neurons. It proposes a new intrinsic plasticity mechanism for a continuous activation model neuron based on low order moments of the neuron’s firing rate distribution. The goal of the intrinsic plasticity mechanism is to enforce a sparse distribution of the neuron’s activity level. In conjunction with Hebbian learning at the neuron’s synapses, the neuron is shown to discover sparse directions in the input. 1
5 0.89652371 131 nips-2004-Non-Local Manifold Tangent Learning
Author: Yoshua Bengio, Martin Monperrus
Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation suggests to explore non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where a local non-parametric method fails. 1
6 0.89621347 163 nips-2004-Semi-parametric Exponential Family PCA
7 0.88810879 70 nips-2004-Following Curved Regularized Optimization Solution Paths
8 0.88676888 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
9 0.88510036 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss
10 0.88370883 124 nips-2004-Multiple Alignment of Continuous Time Series
11 0.8831712 116 nips-2004-Message Errors in Belief Propagation
12 0.88234216 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms
13 0.88226587 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
14 0.88111979 5 nips-2004-A Harmonic Excitation State-Space Approach to Blind Separation of Speech
15 0.87792701 178 nips-2004-Support Vector Classification with Input Data Uncertainty
16 0.87682277 58 nips-2004-Edge of Chaos Computation in Mixed-Mode VLSI - A Hard Liquid
17 0.87648928 28 nips-2004-Bayesian inference in spiking neurons
18 0.87589079 127 nips-2004-Neighbourhood Components Analysis
19 0.87531167 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images
20 0.872379 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data