nips nips2006 nips2006-143 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Silvia Richter, Douglas Aberdeen, Jin Yu
Abstract: Current road-traffic optimisation practice around the world is a combination of hand-tuned policies with a small degree of automatic adaptation. Even state-of-the-art research controllers need good models of the road traffic, which cannot be obtained directly from existing sensors. We use a policy-gradient reinforcement learning approach to directly optimise the traffic signals, mapping currently deployed sensor observations to control signals. Our trained controllers are (theoretically) compatible with the traffic system used in Sydney and many other cities around the world. We apply two policy-gradient methods: (1) the recent natural actor-critic algorithm, and (2) a vanilla policy-gradient algorithm for comparison. Along the way we extend natural actor-critic approaches to work for distributed and online infinite-horizon problems. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Current road-traffic optimisation practice around the world is a combination of hand-tuned policies with a small degree of automatic adaptation. [sent-10, score-0.158]
2 Even state-of-the-art research controllers need good models of the road traffic, which cannot be obtained directly from existing sensors. [sent-11, score-0.264]
3 We use a policy-gradient reinforcement learning approach to directly optimise the traffic signals, mapping currently deployed sensor observations to control signals. [sent-12, score-0.19]
4 Our trained controllers are (theoretically) compatible with the traffic system used in Sydney and many other cities around the world. [sent-13, score-0.156]
5 1 Introduction Optimising the performance of existing road networks is a cheap way to reduce the environmental, social, and financial impact of ever increasing volumes of traffic. [sent-16, score-0.176]
6 Road traffic optimisation can be naturally cast as a reinforcement learning (RL) problem. [sent-17, score-0.15]
7 One contribution of this work is an efficient online version of NAC that avoids large gradient steps, which cannot be guaranteed to be an improvement in the presence of stochastic gradients [2]. [sent-21, score-0.146]
8 We compare this online NAC to a simple online policy-gradient (PG) approach, demonstrating that NAC converges in orders of magnitude fewer environment interactions. [sent-22, score-0.16]
9 2 Traffic Control The optimisation problem consists of finding signalling schedules for all intersections in the system that minimise the average travel time, or similar objectives. [sent-28, score-0.466]
10 A stream is a sequence of cars making the same turn (or going straight) through an intersection. [sent-31, score-0.274]
11 A phase is an interval where a subset of the lights at an intersection are green such that a set of streams that will not collide have right of way. [sent-32, score-0.407]
12 A signal cycle is completed when each phase has been on once, the cycle time being the sum of phase times. [sent-33, score-0.312]
13 Traditionally, control algorithms optimise traffic flow via the phase scheme, the split, and the offset. [sent-34, score-0.173]
14 The phase scheme groups the signal lights into phases and determines their order. [sent-35, score-0.182]
15 Offsets can be introduced to coordinate neighbouring intersections, creating a “green wave” for the vehicles travelling along a main road. [sent-37, score-0.159]
16 There is an additional layer of online adaptation based on a saturation-balancing algorithm, i.e., [sent-54, score-0.144]
17 SCATS calculates phase timings so that all phases are utilised by vehicles for a fixed percentage of the phase time. [sent-56, score-0.345]
18 The system is controlled by stochastic actions at ∈ A drawn from a random variable (RV) conditioned on the current policy parameters θt , and an observation of the current state o(st ), according to Pr(at |o(st ), θt ). [sent-61, score-0.253]
19 In the case of road traffic these distributions are continuous and complex, making even approximate methods for model based planning in POMDPs difficult to apply. [sent-67, score-0.176]
20 The loop sensors only count cars that pass over them, or are stationary on them. [sent-68, score-0.244]
21 The controller needs access to the history to attempt to resolve some hidden state. [sent-69, score-0.167]
22 We later describe observations constructed from the sensor readings that incorporate some important elements of history for intersection control. [sent-70, score-0.327]
23 1 Local Rewards A common objective function in traffic control is the average vehicle travel time. [sent-72, score-0.215]
24 However, it is impossible to identify a particular car in the system, let alone what its travel time was. [sent-73, score-0.217]
25 For the majority of our experiments we treat each intersection as a local MDP. [sent-76, score-0.278]
26 It is trivial to count all cars that enter the intersection with loop detectors. [sent-77, score-0.497]
27 Therefore, we chose the instant reward rt,i to be the number of cars that entered intersection i over the last time step. [sent-78, score-0.52]
28 The objective for each intersection i is to maximise the normalised discounted throughput: $R_i(\theta) = \mathbb{E}\big[(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\, r_{t,i} \,\big|\, \theta\big]$. [sent-79, score-0.294]
29 Discounting is important because it ensures that the controller prefers to pass cars through as early as possible. [sent-80, score-0.384]
30 While suboptimal policies (in terms of travel time) may achieve the optimal average throughput over a time window, the discounted throughput criterion effectively minimises the total waiting time at an intersection in the finite-horizon case [7]. [sent-81, score-0.63]
31 Ignoring effects such as road saturation and driver adaptation (which we explore in our experiments), this would result in minimisation of the system-wide travel time. [sent-82, score-0.553]
32 The use of local rewards speeds up learning, especially as the number of intersections grows. [sent-83, score-0.221]
33 Thus changes to the policy of neighbouring intersections can adversely impact intersection i, by influencing the distribution of si . [sent-86, score-0.645]
34 We may fail to find the globally optimal cooperative policy without some communication of rewards, but it has proven very effective empirically. [sent-88, score-0.179]
35 $o(s)^{\top} v + \big(\nabla_{\theta} \log \Pr(a \mid o(s), \theta_t)\big)^{\top} w = r(s, a_t) + \gamma \int_{S} \Pr(s' \mid s, a)\, o(s')^{\top} v \, ds' \quad (2)$ The surprising choice of $\nabla_{\theta} \log \Pr(a \mid o(s), \theta)$ as features for $\hat{A}(s, a)$ has the nice property that the parameters $w$ turn out to be the naturalised gradient of the long-term average reward (and it is compatible with the policy parameterisation [9]). [sent-99, score-0.323]
36 Again, we substitute the linear approximation $\hat{A}(s, a)$ for $A(s, a)$ and make use of the fact that our policy is actually a function of $o := o(s)$ and $\theta$: $\nabla_{\theta} R(\theta) = \int_{S} \Pr(s \mid \theta) \int_{A} \nabla_{\theta} \Pr(a \mid o, \theta)\, \big(\nabla_{\theta} \log \Pr(a \mid o, \theta)\big)^{\top} w \, da \, ds$. [sent-102, score-0.206]
37 Rewriting as a linear system yields $(o_t - \gamma o_{t+1})^{\top} v_t + \sigma(o_t, s_t, s_{t+1}) + \big(\nabla_{\theta} \log \Pr(a_t \mid o_t, \theta_t)\big)^{\top} w_t = r(s_t, a_t)$. [sent-108, score-0.234]
38 By analogy with other second-order gradient methods we can view A as containing curvature information about the optimisation manifold, accelerating learning. [sent-111, score-0.146]
39 For example, if $w_t$ is not accurate, a line search can step a long way toward a suboptimal policy and get stuck, because the soft-max function used to generate action distributions has minima for all large parameter values. [sent-118, score-0.275]
40 Thus, we propose the online version of NAC in Algorithm 1, making a small parameter update at every time step which potentially accelerates convergence because the policy can improve at every step. [sent-120, score-0.259]
41 The OLPOMDP, or vanilla, algorithm [12] produces per-step gradient estimates from the discounted sum of $\nabla_{\theta} \log \Pr(a_t \mid o_t, \theta_t)$, multiplied by the instant reward $r_t$. [sent-128, score-0.158]
42 The local search, Monte-Carlo gradient estimates, and local rewards improve scalability. [sent-133, score-0.144]
43 The direct mapping of raw sensor information to controls means that we avoid modelling, and also creates a controller that can react immediately to fluctuations in traffic density. [sent-141, score-0.213]
44 We emphasise that, while SCATS has to adapt a single behaviour slowly, our policy is a rich mapping of sensor data to many different behaviours. [sent-142, score-0.225]
45 We modelled the phase control protocol and sensor system based on information from the Sydney traffic authority. [sent-147, score-0.228]
46 Given that the learning algorithm does not depend directly on a model of the system, just the ability to interact with it, our controller can be plugged into a more accurate simulation without modification. [sent-148, score-0.167]
47 Our simplifying assumptions include: all vehicles move at uniform speed; road length is a multiple of the distance cars travel in one step; we ignore interactions between cars or the relative positions of cars within one road segment except in intersection queues. [sent-149, score-1.494]
48 Every intersection has two queues per incoming road: one queue for right turns and one for going straight or left. [sent-151, score-0.349]
49 Every vehicle has a destination intersection that it navigates to via a shortest path. [sent-153, score-0.253]
50 If a driver can select either intersection queue without changing route distance, they choose the one that is currently green, or has the fewest cars in the queue. [sent-154, score-0.588]
51 To account for the gap between a phase ending and the next starting (inter-green time), and the fact that cars start up slowly, we restrict the number of cars that pass through an intersection in the first green step. [sent-156, score-0.814]
52 We represent saturated traffic conditions through a road capacity parameter. [sent-158, score-0.176]
53 Cars are only allowed to move into the next road segment if the number of cars there does not exceed 20. [sent-159, score-0.393]
54 At each time step (corresponding to about 5 seconds real-time) the controller decides which phase to activate in the next step. [sent-163, score-0.268]
55 We do not restrict the order of phases, but to ensure a reasonable policy we enforce the constraint that all phases must be activated at least once within 16 time steps. [sent-164, score-0.26]
56 The controller input for intersection i is $o_{t,i}$, constructed as follows: Cycle duration: 16 bits, where the n-th bit is on in the n-th step of the cycle, supporting time-based decisions like offsets. [sent-165, score-0.42]
57 Phase durations: the current duration, counting the total time spent in each phase in the current cycle. Fig. 3: The intersection model, showing 2 phases and detectors. [sent-169, score-0.463]
58 The controller maps observations $o_{t,i}$ for intersection i to a probability distribution $\tilde{a}_{t,i}$ over the P phases using a linear approximator with outputs $x_{t,i}$ and the soft-max function. [sent-174, score-0.529]
59 Let $\theta_i$ be the $P \times |o_{t,i}|$ matrix of parameters for intersection i. [sent-175, score-0.253]
60 We implemented two baseline controllers for comparison: (1) a uniform controller giving equal length to all phases; (2) a SCATS-inspired adaptive controller called SAT that tries to achieve a saturation of 90% (thought to be used by SCATS) for all phases. [sent-181, score-0.483]
61 It updates the policy once per cycle depending on the current flows [7]. [sent-184, score-0.234]
62 This scenario focuses on an intersection in the centre of a crossroads. [sent-189, score-0.362]
63 Thus the optimal policy at the centre intersection also oscillates with the traffic volume. [sent-191, score-0.484]
64 This scenario is realistic because upstream intersections release periodic bursts of traffic, which then disperse as they travel along the road. [sent-192, score-0.402]
65 The road length is 3 units, and vehicles travel through 3 intersections and along 4 roads, leading to an optimal travel time of 12. [sent-194, score-0.732]
66 On average 3 cars enter the system per time step from each direction. [sent-195, score-0.259]
67 1 shows that NAC and OLPOMDP both improve upon the uniform controller and SAT. [sent-198, score-0.167]
68 We also ran the original NAC algorithm, using batch estimates of the direction in a line search to test whether an online NAC is advantageous. [sent-206, score-0.144]
69 We were able to construct a line search that converged faster than online NAC, but always to a significantly worse policy (TT 23 instead of 14. [sent-207, score-0.284]
70 Removing multiple observations caused a smooth degradation of the policy [7]. [sent-211, score-0.207]
71 In this scenario, we make use of only the neighbours feature in the observations, so the controller must use the detector counts of its neighbours to anticipate traffic. [sent-214, score-0.167]
72 The road network is the same as in the Fluctuating scenario. [sent-215, score-0.176]
73 A steady stream of 1 car per step travels EW, so that it is usually good for the centre controller to give maximum time to the EW-straight phase. [sent-216, score-0.35]
74 With probability 0.02 per step, we input a group of 15 cars from the north, travelling south. [sent-218, score-0.243]
75 When this happens, the controller should interrupt its normal policy and give more time to the NS-straight phase. [sent-219, score-0.346]
76 Table 1 shows that both algorithms learn a good policy in terms of travel time. [sent-220, score-0.355]
77 When examining the learned policies, we noted that the centre intersection controller had indeed learned to switch to the NS-straight phase just in time before the cars arrive, something that SAT and SCATS cannot do. [sent-221, score-0.79]
78 This scenario demonstrates learning an offset between neighbouring intersections, a feature that needs to be hand-tuned in SCATS. [sent-224, score-0.167]
79 The road length is two units for 4 roads, resulting in an optimal travel time of 8. [sent-226, score-0.352]
80 We restricted the observations to the cycle duration, meaning that our controllers learned from time information only. [sent-227, score-0.171]
81 We discovered, however, that learning an optimal policy is difficult, e.g., [sent-230, score-0.179]
82 we failed for a road length of 3 (given limited time). [sent-232, score-0.176]
83 Learning is hard because intersection n + 1 can only begin to learn when intersection n has already converged to an optimal policy. [sent-233, score-0.506]
84 In this scenario, local reward optimisation fails to find the global optimum, so we use the number of cars in the system as the global reward (which minimises the average travel time assuming constant input of cars). [sent-235, score-0.696]
85 This reward is hard to estimate in the real world, but we want to demonstrate the ability of the system to learn cooperative policies using a global reward. [sent-236, score-0.144]
86 Like the previous crossroads scenarios, we have NS and EW streams that interact only at a central intersection (D, in Fig. [sent-237, score-0.28]
87 The streams have the same volume, so the optimal policy splits time uniformly between the NS-straight and EW-straight phases. [sent-239, score-0.206]
88 An additional stream of cars is generated in the south-west corner, at Intersection H, and travels diagonally east to Intersection E. [sent-240, score-0.33]
89 However, cars that turn east to join the northbound traffic flow must then turn east again at Intersection D, forcing the controller of that intersection to devote time to a third NS-right phase and forcing the main volume of traffic to pause. [sent-242, score-0.881]
90 The optimal strategy is actually to route all cars north from Intersection F, so they join the main eastbound traffic flow. [sent-243, score-0.284]
91 This scenario relies on a driver model that prefers routes with shorter waiting times at intersections among routes of equal distance. [sent-244, score-0.333]
92 Our observations for this scenario consist only of the phase durations, informing the controller how much time it has spent in each phase during the current cycle. [sent-245, score-0.482]
93 SAT routed cars equally north and east at the critical intersection F, whilst the PG algorithms routed cars north. [sent-247, score-0.837]
94 In this scenario PG even beat our hand-coded optimal deterministic policy because it used stochasticity to find a “mixed” policy, giving phases a fractional number of steps on average. [sent-249, score-0.317]
95 In a 10 × 10 intersection network, perhaps modelling a central business district, each node potentially produces two cars at each time step according to a randomly initialised probability between 0 and 0. [sent-251, score-0.47]
96 The two destinations for the cars from each source are initialised randomly, but stay fixed during the experiment to generate some consistent patterns within the network. [sent-253, score-0.217]
97 OLPOMDP gave an average travel time improvement of 20% over SAT even though this scenario was not tailored for our controller. [sent-256, score-0.233]
98 Table 1: Comparison of best travel times (TT) for all methods and all scenarios. [sent-261, score-0.176]
99 We used it in a distributed road traffic problem to demonstrate where machine learning could improve upon existing proprietary traffic controllers. [sent-340, score-0.176]
100 Learning traffic control - towards practical traffic control using policy gradients. [sent-392, score-0.257]
wordName wordTfidf (topN-words)
[('traf', 0.469), ('nac', 0.309), ('ot', 0.269), ('intersection', 0.253), ('cars', 0.217), ('pr', 0.198), ('pg', 0.186), ('policy', 0.179), ('travel', 0.176), ('road', 0.176), ('controller', 0.167), ('scats', 0.148), ('intersections', 0.142), ('optimisation', 0.106), ('lpomdp', 0.104), ('phase', 0.101), ('st', 0.09), ('controllers', 0.088), ('phases', 0.081), ('online', 0.08), ('sydney', 0.074), ('neighbouring', 0.071), ('tt', 0.071), ('adaption', 0.064), ('driver', 0.064), ('vehicles', 0.062), ('zt', 0.061), ('critic', 0.059), ('east', 0.059), ('scenario', 0.057), ('cycle', 0.055), ('rewards', 0.054), ('bits', 0.052), ('policies', 0.052), ('ew', 0.052), ('centre', 0.052), ('reward', 0.05), ('sensor', 0.046), ('roads', 0.044), ('secs', 0.044), ('reinforcement', 0.044), ('fisher', 0.044), ('system', 0.042), ('straight', 0.041), ('discounted', 0.041), ('car', 0.041), ('gradient', 0.04), ('australia', 0.039), ('offset', 0.039), ('gt', 0.039), ('batch', 0.039), ('control', 0.039), ('north', 0.039), ('wt', 0.039), ('throughput', 0.039), ('vanilla', 0.039), ('vt', 0.036), ('steady', 0.036), ('routes', 0.035), ('transportation', 0.035), ('ns', 0.035), ('pomdps', 0.033), ('optimise', 0.033), ('action', 0.032), ('state', 0.032), ('saturation', 0.031), ('rl', 0.031), ('adaptive', 0.03), ('actor', 0.03), ('drivers', 0.03), ('freiburg', 0.03), ('minimises', 0.03), ('going', 0.029), ('duration', 0.028), ('observations', 0.028), ('route', 0.028), ('bt', 0.028), ('spent', 0.028), ('stream', 0.028), ('ict', 0.028), ('realistic', 0.027), ('streams', 0.027), ('loop', 0.027), ('log', 0.027), ('observable', 0.026), ('green', 0.026), ('gradients', 0.026), ('rv', 0.026), ('travelling', 0.026), ('routed', 0.026), ('quote', 0.026), ('cities', 0.026), ('travels', 0.026), ('queue', 0.026), ('cooperate', 0.026), ('coordinated', 0.026), ('zy', 0.026), ('search', 0.025), ('volume', 0.025), ('local', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.000001 143 nips-2006-Natural Actor-Critic for Road Traffic Optimisation
Author: Silvia Richter, Douglas Aberdeen, Jin Yu
Abstract: Current road-traffic optimisation practice around the world is a combination of hand-tuned policies with a small degree of automatic adaptation. Even state-of-the-art research controllers need good models of the road traffic, which cannot be obtained directly from existing sensors. We use a policy-gradient reinforcement learning approach to directly optimise the traffic signals, mapping currently deployed sensor observations to control signals. Our trained controllers are (theoretically) compatible with the traffic system used in Sydney and many other cities around the world. We apply two policy-gradient methods: (1) the recent natural actor-critic algorithm, and (2) a vanilla policy-gradient algorithm for comparison. Along the way we extend natural actor-critic approaches to work for distributed and online infinite-horizon problems. 1
2 0.24096598 38 nips-2006-Automated Hierarchy Discovery for Planning in Partially Observable Environments
Author: Laurent Charlin, Pascal Poupart, Romy Shioda
Abstract: Planning in partially observable domains is a notoriously difficult problem. However, in many real-world scenarios, planning can be simplified by decomposing the task into a hierarchy of smaller planning problems. Several approaches have been proposed to optimize a policy that decomposes according to a hierarchy specified a priori. In this paper, we investigate the problem of automatically discovering the hierarchy. More precisely, we frame the optimization of a hierarchical policy as a non-convex optimization problem that can be solved with general non-linear solvers, a mixed-integer non-linear approximation or a form of bounded hierarchical policy iteration. By encoding the hierarchical structure as variables of the optimization problem, we can automatically discover a hierarchy. Our method is flexible enough to allow any parts of the hierarchy to be specified based on prior knowledge while letting the optimization discover the unknown parts. It can also discover hierarchical policies, including recursive policies, that are more compact (potentially infinitely fewer parameters) and often easier to understand given the decomposition induced by the hierarchy. 1
3 0.15881056 62 nips-2006-Correcting Sample Selection Bias by Unlabeled Data
Author: Jiayuan Huang, Arthur Gretton, Karsten M. Borgwardt, Bernhard Schölkopf, Alex J. Smola
Abstract: We consider the scenario where training and test data are drawn from different distributions, commonly referred to as sample selection bias. Most algorithms for this setting try to first recover sampling distributions and then make appropriate corrections based on the distribution estimate. We present a nonparametric method which directly produces resampling weights without distribution estimation. Our method works by matching distributions between training and testing sets in feature space. Experimental results demonstrate that our method works well in practice.
4 0.15273121 44 nips-2006-Bayesian Policy Gradient Algorithms
Author: Mohammad Ghavamzadeh, Yaakov Engel
Abstract: Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. Since Monte Carlo methods tend to have high variance, a large number of samples is required, resulting in slow convergence. In this paper, we propose a Bayesian framework that models the policy gradient as a Gaussian process. This reduces the number of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient as well as a measure of the uncertainty in the gradient estimates are provided at little extra cost. 1
5 0.14282867 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics
Author: Peter L. Bartlett, Ambuj Tewari
Abstract: We consider methods that try to find a good policy for a Markov decision process by choosing one from a given class. The policy is chosen based on its empirical performance in simulations. We are interested in conditions on the complexity of the policy class that ensure the success of such simulation based policy search methods. We show that under bounds on the amount of computation involved in computing policies, transition dynamics and rewards, uniform convergence of empirical estimates to true value functions occurs. Previously, such results were derived by assuming boundedness of pseudodimension and Lipschitz continuity. These assumptions and ours are both stronger than the usual combinatorial complexity measures. We show, via minimax inequalities, that this is essential: boundedness of pseudodimension or fat-shattering dimension alone is not sufficient.
6 0.1401052 125 nips-2006-Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning
7 0.11405683 96 nips-2006-In-Network PCA and Anomaly Detection
8 0.088365003 154 nips-2006-Optimal Change-Detection and Spiking Neurons
9 0.083102867 10 nips-2006-A Novel Gaussian Sum Smoother for Approximate Inference in Switching Linear Dynamical Systems
10 0.079295583 25 nips-2006-An Application of Reinforcement Learning to Aerobatic Helicopter Flight
11 0.069802634 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation
12 0.06772352 197 nips-2006-Uncertainty, phase and oscillatory hippocampal recall
13 0.066704385 191 nips-2006-The Robustness-Performance Tradeoff in Markov Decision Processes
14 0.0647365 13 nips-2006-A Scalable Machine Learning Approach to Go
15 0.060889646 61 nips-2006-Convex Repeated Games and Fenchel Duality
16 0.060468853 137 nips-2006-Multi-Robot Negotiation: Approximating the Set of Subgame Perfect Equilibria in General-Sum Stochastic Games
17 0.051917303 71 nips-2006-Effects of Stress and Genotype on Meta-parameter Dynamics in Reinforcement Learning
18 0.051893577 146 nips-2006-No-regret Algorithms for Online Convex Programs
19 0.048003439 114 nips-2006-Learning Time-Intensity Profiles of Human Activity using Non-Parametric Bayesian Models
20 0.046052214 67 nips-2006-Differential Entropic Clustering of Multivariate Gaussians
topicId topicWeight
[(0, -0.155), (1, -0.047), (2, -0.212), (3, -0.16), (4, 0.027), (5, -0.099), (6, 0.059), (7, -0.048), (8, 0.289), (9, 0.032), (10, -0.038), (11, -0.048), (12, -0.029), (13, -0.048), (14, 0.015), (15, -0.0), (16, -0.001), (17, -0.026), (18, 0.042), (19, 0.122), (20, 0.024), (21, -0.182), (22, -0.093), (23, -0.0), (24, -0.01), (25, -0.148), (26, 0.115), (27, 0.118), (28, 0.046), (29, -0.102), (30, 0.014), (31, -0.028), (32, 0.119), (33, 0.057), (34, 0.008), (35, -0.059), (36, -0.012), (37, -0.041), (38, -0.014), (39, -0.02), (40, 0.056), (41, -0.036), (42, 0.018), (43, 0.085), (44, -0.105), (45, 0.026), (46, 0.031), (47, -0.084), (48, -0.025), (49, 0.05)]
simIndex simValue paperId paperTitle
same-paper 1 0.96810174 143 nips-2006-Natural Actor-Critic for Road Traffic Optimisation
Author: Silvia Richter, Douglas Aberdeen, Jin Yu
Abstract: Current road-traffic optimisation practice around the world is a combination of hand-tuned policies with a small degree of automatic adaptation. Even state-of-the-art research controllers need good models of the road traffic, which cannot be obtained directly from existing sensors. We use a policy-gradient reinforcement learning approach to directly optimise the traffic signals, mapping currently deployed sensor observations to control signals. Our trained controllers are (theoretically) compatible with the traffic system used in Sydney and many other cities around the world. We apply two policy-gradient methods: (1) the recent natural actor-critic algorithm, and (2) a vanilla policy-gradient algorithm for comparison. Along the way we extend natural actor-critic approaches to work for distributed and online infinite-horizon problems. 1
2 0.86399072 38 nips-2006-Automated Hierarchy Discovery for Planning in Partially Observable Environments
Author: Laurent Charlin, Pascal Poupart, Romy Shioda
Abstract: Planning in partially observable domains is a notoriously difficult problem. However, in many real-world scenarios, planning can be simplified by decomposing the task into a hierarchy of smaller planning problems. Several approaches have been proposed to optimize a policy that decomposes according to a hierarchy specified a priori. In this paper, we investigate the problem of automatically discovering the hierarchy. More precisely, we frame the optimization of a hierarchical policy as a non-convex optimization problem that can be solved with general non-linear solvers, a mixed-integer non-linear approximation or a form of bounded hierarchical policy iteration. By encoding the hierarchical structure as variables of the optimization problem, we can automatically discover a hierarchy. Our method is flexible enough to allow any parts of the hierarchy to be specified based on prior knowledge while letting the optimization discover the unknown parts. It can also discover hierarchical policies, including recursive policies, that are more compact (potentially infinitely fewer parameters) and often easier to understand given the decomposition induced by the hierarchy. 1
3 0.64080656 62 nips-2006-Correcting Sample Selection Bias by Unlabeled Data
Author: Jiayuan Huang, Arthur Gretton, Karsten M. Borgwardt, Bernhard Schölkopf, Alex J. Smola
Abstract: We consider the scenario where training and test data are drawn from different distributions, commonly referred to as sample selection bias. Most algorithms for this setting try to first recover sampling distributions and then make appropriate corrections based on the distribution estimate. We present a nonparametric method which directly produces resampling weights without distribution estimation. Our method works by matching distributions between training and testing sets in feature space. Experimental results demonstrate that our method works well in practice.
4 0.62439638 25 nips-2006-An Application of Reinforcement Learning to Aerobatic Helicopter Flight
Author: Pieter Abbeel, Adam Coates, Morgan Quigley, Andrew Y. Ng
Abstract: Autonomous helicopter flight is widely regarded to be a highly challenging control problem. This paper presents the first successful autonomous completion on a real RC helicopter of the following four aerobatic maneuvers: forward flip and sideways roll at low speed, tail-in funnel, and nose-in funnel. Our experimental results significantly extend the state of the art in autonomous helicopter flight. We used the following approach: First we had a pilot fly the helicopter to help us find a helicopter dynamics model and a reward (cost) function. Then we used a reinforcement learning (optimal control) algorithm to find a controller that is optimized for the resulting model and reward function. More specifically, we used differential dynamic programming (DDP), an extension of the linear quadratic regulator (LQR). 1
5 0.60348403 44 nips-2006-Bayesian Policy Gradient Algorithms
Author: Mohammad Ghavamzadeh, Yaakov Engel
Abstract: Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. Since Monte Carlo methods tend to have high variance, a large number of samples is required, resulting in slow convergence. In this paper, we propose a Bayesian framework that models the policy gradient as a Gaussian process. This reduces the number of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient as well as a measure of the uncertainty in the gradient estimates are provided at little extra cost. 1
6 0.37563825 125 nips-2006-Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning
7 0.37024814 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics
8 0.33491555 96 nips-2006-In-Network PCA and Anomaly Detection
9 0.3090446 142 nips-2006-Mutagenetic tree Fisher kernel improves prediction of HIV drug resistance from viral genotype
10 0.30814952 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation
11 0.30719444 202 nips-2006-iLSTD: Eligibility Traces and Convergence Analysis
12 0.30347177 154 nips-2006-Optimal Change-Detection and Spiking Neurons
13 0.27352306 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure
14 0.26908678 144 nips-2006-Near-Uniform Sampling of Combinatorial Spaces Using XOR Constraints
15 0.2631222 191 nips-2006-The Robustness-Performance Tradeoff in Markov Decision Processes
16 0.25857639 137 nips-2006-Multi-Robot Negotiation: Approximating the Set of Subgame Perfect Equilibria in General-Sum Stochastic Games
17 0.25254518 89 nips-2006-Handling Advertisements of Unknown Quality in Search Advertising
18 0.24758874 107 nips-2006-Large Margin Multi-channel Analog-to-Digital Conversion with Applications to Neural Prosthesis
19 0.24649228 71 nips-2006-Effects of Stress and Genotype on Meta-parameter Dynamics in Reinforcement Learning
20 0.23458898 129 nips-2006-Map-Reduce for Machine Learning on Multicore
topicId topicWeight
[(1, 0.068), (2, 0.031), (3, 0.028), (7, 0.063), (8, 0.014), (9, 0.044), (20, 0.013), (22, 0.057), (44, 0.062), (51, 0.33), (57, 0.061), (59, 0.013), (65, 0.032), (69, 0.031), (71, 0.033), (83, 0.012), (90, 0.01), (93, 0.011)]
simIndex simValue paperId paperTitle
1 0.84021974 10 nips-2006-A Novel Gaussian Sum Smoother for Approximate Inference in Switching Linear Dynamical Systems
Author: David Barber, Bertrand Mesot
Abstract: We introduce a method for approximate smoothed inference in a class of switching linear dynamical systems, based on a novel form of Gaussian Sum smoother. This class includes the switching Kalman Filter and the more general case of switch transitions dependent on the continuous latent state. The method improves on the standard Kim smoothing approach by dispensing with one of the key approximations, thus making fuller use of the available future information. Whilst the only central assumption required is projection to a mixture of Gaussians, we show that an additional conditional independence assumption results in a simpler but stable and accurate alternative. Unlike the alternative unstable Expectation Propagation procedure, our method consists only of a single forward and backward pass and is reminiscent of the standard smoothing ‘correction’ recursions in the simpler linear dynamical system. The algorithm performs well on both toy experiments and in a large scale application to noise robust speech recognition. 1 Switching Linear Dynamical System The Linear Dynamical System (LDS) [1] is a key temporal model in which a latent linear process generates the observed series. For complex time-series which are not well described globally by a single LDS, we may break the time-series into segments, each modeled by a potentially different LDS. This is the basis for the Switching LDS (SLDS) [2, 3, 4, 5] where, for each time t, a switch variable st ∈ 1, . . . , S describes which of the LDSs is to be used. The observation (or ‘visible’) vt ∈ RV is linearly related to the hidden state ht ∈ RH with additive noise η by vt = B(st )ht + η v (st ) p(vt |ht , st ) = N (B(st )ht , Σv (st )) ≡ (1) where N (µ, Σ) denotes a Gaussian distribution with mean µ and covariance Σ. The transition dynamics of the continuous hidden state ht is linear, ht = A(st )ht−1 + η h (st ), ≡ p(ht |ht−1 , st ) = N A(st )ht−1 , Σh (st ) (2) The switch st may depend on both the previous st−1 and ht−1 . This is an augmented SLDS (aSLDS), and defines the model T p(vt |ht , st )p(ht |ht−1 , st )p(st |ht−1 , st−1 ) p(v1:T , h1:T , s1:T ) = t=1 The standard SLDS[4] considers only switch transitions p(st |st−1 ). At time t = 1, p(s1 |h0 , s0 ) simply denotes the prior p(s1 ), and p(h1 |h0 , s1 ) denotes p(h1 |s1 ). The aim of this article is to address how to perform inference in the aSLDS. In particular we desire the filtered estimate p(ht , st |v1:t ) and the smoothed estimate p(ht , st |v1:T ), for any 1 ≤ t ≤ T . Both filtered and smoothed inference in the SLDS is intractable, scaling exponentially with time [4]. s1 s2 s3 s4 h1 h2 h3 h4 v1 v2 v3 v4 Figure 1: The independence structure of the aSLDS. Square nodes denote discrete variables, round nodes continuous variables. In the SLDS links from h to s are not normally considered. 2 Expectation Correction Our approach to approximate p(ht , st |v1:T ) mirrors the Rauch-Tung-Striebel ‘correction’ smoother for the simpler LDS [1].The method consists of a single forward pass to recursively find the filtered posterior p(ht , st |v1:t ), followed by a single backward pass to correct this into a smoothed posterior p(ht , st |v1:T ). The forward pass we use is equivalent to standard Assumed Density Filtering (ADF) [6]. The main contribution of this paper is a novel form of backward pass, based only on collapsing the smoothed posterior to a mixture of Gaussians. 
Together with the ADF forward pass, we call the method Expectation Correction, since it corrects the moments found from the forward pass. A more detailed description of the method, including pseudocode, is given in [7]. 2.1 Forward Pass (Filtering) Readers familiar with ADF may wish to continue directly to Section (2.2). Our aim is to form a recursion for p(st , ht |v1:t ), based on a Gaussian mixture approximation of p(ht |st , v1:t ). Without loss of generality, we may decompose the filtered posterior as p(ht , st |v1:t ) = p(ht |st , v1:t )p(st |v1:t ) (3) The exact representation of p(ht |st , v1:t ) is a mixture with O(S t ) components. We therefore approximate this with a smaller I-component mixture I p(ht |st , v1:t ) ≈ p(ht |it , st , v1:t )p(it |st , v1:t ) it =1 where p(ht |it , st , v1:t ) is a Gaussian parameterized with mean f (it , st ) and covariance F (it , st ). To find a recursion for these parameters, consider p(ht+1 |st+1 , v1:t+1 ) = p(ht+1 |st , it , st+1 , v1:t+1 )p(st , it |st+1 , v1:t+1 ) (4) st ,it Evaluating p(ht+1 |st , it , st+1 , v1:t+1 ) We find p(ht+1 |st , it , st+1 , v1:t+1 ) by first computing the joint distribution p(ht+1 , vt+1 |st , it , st+1 , v1:t ), which is a Gaussian with covariance and mean elements, Σhh = A(st+1 )F (it , st )AT (st+1 ) + Σh (st+1 ), Σvv = B(st+1 )Σhh B T (st+1 ) + Σv (st+1 ) Σvh = B(st+1 )F (it , st ), µv = B(st+1 )A(st+1 )f (it , st ), µh = A(st+1 )f (it , st ) (5) and then conditioning on vt+1 1 . For the case S = 1, this forms the usual Kalman Filter recursions[1]. Evaluating p(st , it |st+1 , v1:t+1 ) The mixture weight in (4) can be found from the decomposition p(st , it |st+1 , v1:t+1 ) ∝ p(vt+1 |it , st , st+1 , v1:t )p(st+1 |it , st , v1:t )p(it |st , v1:t )p(st |v1:t ) (6) 1 p(x|y) is a Gaussian with mean µx + Σxy Σ−1 (y − µy ) and covariance Σxx − Σxy Σ−1 Σyx . yy yy The first factor in (6), p(vt+1 |it , st , st+1 , v1:t ) is a Gaussian with mean µv and covariance Σvv , as given in (5). The last two factors p(it |st , v1:t ) and p(st |v1:t ) are given from the previous iteration. Finally, p(st+1 |it , st , v1:t ) is found from p(st+1 |it , st , v1:t ) = p(st+1 |ht , st ) p(ht |it ,st ,v1:t ) (7) where · p denotes expectation with respect to p. In the SLDS, (7) is replaced by the Markov transition p(st+1 |st ). In the aSLDS, however, (7) will generally need to be computed numerically. Closing the recursion We are now in a position to calculate (4). For each setting of the variable st+1 , we have a mixture of I × S Gaussians which we numerically collapse back to I Gaussians to form I p(ht+1 |st+1 , v1:t+1 ) ≈ p(ht+1 |it+1 , st+1 , v1:t+1 )p(it+1 |st+1 , v1:t+1 ) it+1 =1 Any method of choice may be supplied to collapse a mixture to a smaller mixture; our code simply repeatedly merges low-weight components. In this way the new mixture coefficients p(it+1 |st+1 , v1:t+1 ), it+1 ∈ 1, . . . , I are defined, completing the description of how to form a recursion for p(ht+1 |st+1 , v1:t+1 ) in (3). A recursion for the switch variable is given by p(st+1 |v1:t+1 ) ∝ p(vt+1 |st+1 , it , st , v1:t )p(st+1 |it , st , v1:t )p(it |st , v1:t )p(st |v1:t ) st ,it where all terms have been computed during the recursion for p(ht+1 |st+1 , v1:t+1 ). 
The likelihood p(v1:T ) may be found by recursing p(v1:t+1 ) = p(vt+1 |v1:t )p(v1:t ), where p(vt+1 |vt ) = p(vt+1 |it , st , st+1 , v1:t )p(st+1 |it , st , v1:t )p(it |st , v1:t )p(st |v1:t ) it ,st ,st+1 2.2 Backward Pass (Smoothing) The main contribution of this paper is to find a suitable way to ‘correct’ the filtered posterior p(st , ht |v1:t ) obtained from the forward pass into a smoothed posterior p(st , ht |v1:T ). We derive this for the case of a single Gaussian representation. The extension to the mixture case is straightforward and presented in [7]. We approximate the smoothed posterior p(ht |st , v1:T ) by a Gaussian with mean g(st ) and covariance G(st ) and our aim is to find a recursion for these parameters. A useful starting point for a recursion is: p(st+1 |v1:T )p(ht |st , st+1 , v1:T )p(st |st+1 , v1:T ) p(ht , st |v1:T ) = st+1 The term p(ht |st , st+1 , v1:T ) may be computed as p(ht |st , st+1 , v1:T ) = p(ht |ht+1 , st , st+1 , v1:t )p(ht+1 |st , st+1 , v1:T ) (8) ht+1 The recursion therefore requires p(ht+1 |st , st+1 , v1:T ), which we can write as p(ht+1 |st , st+1 , v1:T ) ∝ p(ht+1 |st+1 , v1:T )p(st |st+1 , ht+1 , v1:t ) (9) The difficulty here is that the functional form of p(st |st+1 , ht+1 , v1:t ) is not squared exponential in ht+1 , so that p(ht+1 |st , st+1 , v1:T ) will not be Gaussian2 . One possibility would be to approximate the non-Gaussian p(ht+1 |st , st+1 , v1:T ) by a Gaussian (or mixture thereof) by minimizing the Kullback-Leilbler divergence between the two, or performing moment matching in the case of a single Gaussian. A simpler alternative (which forms ‘standard’ EC) is to make the assumption p(ht+1 |st , st+1 , v1:T ) ≈ p(ht+1 |st+1 , v1:T ), where p(ht+1 |st+1 , v1:T ) is already known from the previous backward recursion. Under this assumption, the recursion becomes p(ht , st |v1:T ) ≈ p(st+1 |v1:T )p(st |st+1 , v1:T ) p(ht |ht+1 , st , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) (10) st+1 2 In the exact calculation, p(ht+1 |st , st+1 , v1:T ) is a mixture of Gaussians, see [7]. However, since in (9) the two terms p(ht+1 |st+1 , v1:T ) will only be approximately computed during the recursion, our approximation to p(ht+1 |st , st+1 , v1:T ) will not be a mixture of Gaussians. Evaluating p(ht |ht+1 , st , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) p(ht |ht+1 , st , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) is a Gaussian in ht , whose statistics we will now compute. First we find p(ht |ht+1 , st , st+1 , v1:t ) which may be obtained from the joint distribution p(ht , ht+1 |st , st+1 , v1:t ) = p(ht+1 |ht , st+1 )p(ht |st , v1:t ) (11) which itself can be found from a forward dynamics from the filtered estimate p(ht |st , v1:t ). The statistics for the marginal p(ht |st , st+1 , v1:t ) are simply those of p(ht |st , v1:t ), since st+1 carries no extra information about ht . The remaining statistics are the mean of ht+1 , the covariance of ht+1 and cross-variance between ht and ht+1 , which are given by ht+1 = A(st+1 )ft (st ), Σt+1,t+1 = A(st+1 )Ft (st )AT (st+1 )+Σh (st+1 ), Σt+1,t = A(st+1 )Ft (st ) Given the statistics of (11), we may now condition on ht+1 to find p(ht |ht+1 , st , st+1 , v1:t ). Doing so effectively constitutes a reversal of the dynamics, ← − − ht = A (st , st+1 )ht+1 + ←(st , st+1 ) η ← − ← − − − where A (st , st+1 ) and ←(st , st+1 ) ∼ N (← t , st+1 ), Σ (st , st+1 )) are easily found using η m(s conditioning. 
Averaging the above reversed dynamics over p(ht+1 |st+1 , v1:T ), we find that p(ht |ht+1 , st , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) is a Gaussian with statistics ← − ← − ← − ← − − µt = A (st , st+1 )g(st+1 )+← t , st+1 ), Σt,t = A (st , st+1 )G(st+1 ) A T (st , st+1 )+ Σ (st , st+1 ) m(s These equations directly mirror the standard RTS backward pass[1]. Evaluating p(st |st+1 , v1:T ) The main departure of EC from previous methods is in treating the term p(st |st+1 , v1:T ) = p(st |ht+1 , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) (12) The term p(st |ht+1 , st+1 , v1:t ) is given by p(st |ht+1 , st+1 , v1:t ) = p(ht+1 |st+1 , st , v1:t )p(st , st+1 |v1:t ) ′ ′ s′ p(ht+1 |st+1 , st , v1:t )p(st , st+1 |v1:t ) (13) t Here p(st , st+1 |v1:t ) = p(st+1 |st , v1:t )p(st |v1:t ), where p(st+1 |st , v1:t ) occurs in the forward pass, (7). In (13), p(ht+1 |st+1 , st , v1:t ) is found by marginalizing (11). Computing the average of (13) with respect to p(ht+1 |st+1 , v1:T ) may be achieved by any numerical integration method desired. A simple approximation is to evaluate the integrand at the mean value of the averaging distribution p(ht+1 |st+1 , v1:T ). More sophisticated methods (see [7]) such as sampling from the Gaussian p(ht+1 |st+1 , v1:T ) have the advantage that covariance information is used3 . Closing the Recursion We have now computed both the continuous and discrete factors in (8), which we wish to use to write the smoothed estimate in the form p(ht , st |v1:T ) = p(st |v1:T )p(ht |st , v1:T ). The distribution p(ht |st , v1:T ) is readily obtained from the joint (8) by conditioning on st to form the mixture p(ht |st , v1:T ) = p(st+1 |st , v1:T )p(ht |st , st+1 , v1:T ) st+1 which may then be collapsed to a single Gaussian (the mixture case is discussed in [7]). The smoothed posterior p(st |v1:T ) is given by p(st |v1:T ) = p(st+1 |v1:T ) p(st |ht+1 , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) . (14) st+1 3 This is a form of exact sampling since drawing samples from a Gaussian is easy. This should not be confused with meaning that this use of sampling renders EC a sequential Monte-Carlo scheme. 2.3 Relation to other methods The EC Backward pass is closely related to Kim’s method [8]. In both EC and Kim’s method, the approximation p(ht+1 |st , st+1 , v1:T ) ≈ p(ht+1 |st+1 , v1:T ), is used to form a numerically simple backward pass. The other ‘approximation’ in EC is to numerically compute the average in (14). In Kim’s method, however, an update for the discrete variables is formed by replacing the required term in (14) by p(st |ht+1 , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) ≈ p(st |st+1 , v1:t ) (15) Since p(st |st+1 , v1:t ) ∝ p(st+1 |st )p(st |v1:t )/p(st+1 |v1:t ), this can be computed simply from the filtered results alone. The fundamental difference therefore between EC and Kim’s method is that the approximation, (15), is not required by EC. The EC backward pass therefore makes fuller use of the future information, resulting in a recursion which intimately couples the continuous and discrete variables. The resulting effect on the quality of the approximation can be profound, as we will see in the experiments. The Expectation Propagation (EP) algorithm makes the central assumption of collapsing the posteriors to a Gaussian family [5]; the collapse is defined by a consistency criterion on overlapping marginals. In our experiments, we take the approach in [9] of collapsing to a single Gaussian. 
Ensuring consistency requires frequent translations between moment and canonical parameterizations, which is the origin of potentially severe numerical instability [10]. In contrast, EC works largely with moment parameterizations of Gaussians, for which relatively few numerical difficulties arise. Unlike EP, EC is not based on a consistency criterion and a subtle issue arises about possible inconsistencies in the Forward and Backward approximations for EC. For example, under the conditional independence assumption in the Backward Pass, p(hT |sT −1 , sT , v1:T ) ≈ p(hT |sT , v1:T ), which is in contradiction to (5) which states that the approximation to p(hT |sT −1 , sT , v1:T ) will depend on sT −1 . Such potential inconsistencies arise because of the approximations made, and should not be considered as separate approximations in themselves. Rather than using a global (consistency) objective, EC attempts to faithfully approximate the exact Forward and Backward propagation routines. For this reason, as in the exact computation, only a single Forward and Backward pass are required in EC. In [11] a related dynamics reversed is proposed. However, the singularities resulting from incorrectly treating p(vt+1:T |ht , st ) as a density are heuristically finessed. In [12] a variational method approximates the joint distribution p(h1:T , s1:T |v1:T ) rather than the marginal inference p(ht , st |v1:T ). This is a disadvantage when compared to other methods that directly approximate the marginal. Sequential Monte Carlo methods (Particle Filters)[13], are essentially mixture of delta-function approximations. Whilst potentially powerful, these typically suffer in high-dimensional hidden spaces, unless techniques such as Rao-Blackwellization are performed. ADF is generally preferential to Particle Filtering since in ADF the approximation is a mixture of non-trivial distributions, and is therefore more able to represent the posterior. 3 Demonstration Testing EC in a problem with a reasonably long temporal sequence, T , is important since numerical instabilities may not be apparent in timeseries of just a few points. To do this, we sequentially generate hidden and visible states from a given model, here with H = 3, S = 2, V = 1 – see Figure(2) for full details of the experimental setup. Then, given only the parameters of the model and the visible observations (but not any of the hidden states h1:T , s1:T ), the task is to infer p(ht |st , v1:T ) and p(st |v1:T ). Since the exact computation is exponential in T , a simple alternative is to assume that the original sample states s1:T are the ‘correct’ inferences, and compare how our most probable posterior smoothed estimates arg maxst p(st |v1:T ) compare with the assumed correct sample st . We chose conditions that, from the viewpoint of classical signal processing, are difficult, with changes in the switches occurring at a much higher rate than the typical frequencies in the signal vt . For EC we use the mean approximation for the numerical integration of (12). We included the Particle Filter merely for a point of comparison with ADF, since they are not designed to approximate PF RBPF EP ADFS KimS ECS ADFM KimM ECM 1000 800 600 400 200 0 0 10 20 0 10 20 0 10 20 0 10 20 0 10 20 0 10 20 0 10 20 0 10 20 0 10 20 Figure 2: The number of errors in estimating p(st |v1:T ) for a binary switch (S = 2) over a time series of length T = 100. Hence 50 errors corresponds to random guessing. Plotted are histograms of the errors are over 1000 experiments. 
The x-axes are cut off at 20 errors to improve visualization of the results. (PF) Particle Filter. (RBPF) Rao-Blackwellized PF. (EP) Expectation Propagation. (ADFS) Assumed Density Filtering using a Single Gaussian. (KimS) Kim’s smoother using the results from ADFS. (ECS) Expectation Correction using a Single Gaussian (I = J = 1). (ADFM) ADF using a multiple of I = 4 Gaussians. (KimM) Kim’s smoother using the results from ADFM. (ECM) Expectation Correction using a mixture with I = J = 4 components. S = 2, V = 1 (scalar observations), T = 100, with zero output bias. A(s) = 0.9999 ∗ orth(randn(H, H)), B(s) = randn(V, H). H = 3, Σh (s) = IH , Σv (s) = 0.1IV , p(st+1 |st ) ∝ 1S×S + IS . At time t = 1, the priors are p1 = uniform, with h1 drawn from N (10 ∗ randn(H, 1), IH ). the smoothed estimate, for which 1000 particles were used, with Kitagawa resampling. For the RaoBlackwellized Particle Filter [13], 500 particles were used, with Kitagawa resampling. We found that EP4 was numerically unstable and often struggled to converge. To encourage convergence, we used the damping method in [9], performing 20 iterations with a damping factor of 0.5. Nevertheless, the disappointing performance of EP is most likely due to conflicts resulting from numerical instabilities introduced by the frequent conversions between moment and canonical representations. The best filtered results are given using ADF, since this is better able to represent the variance in the filtered posterior than the sampling methods. Unlike Kim’s method, EC makes good use of the future information to clean up the filtered results considerably. One should bear in mind that both EC and Kim’s method use the same ADF filtered results. This demonstrates that EC may dramatically improve on Kim’s method, so that the small amount of extra work in making a numerical approximation of p(st |st+1 , v1:T ), (12), may bring significant benefits. We found similar conclusions for experiments with an aSLDS[7]. 4 Application to Noise Robust ASR Here we briefly present an application of the SLDS to robust Automatic Speech Recognition (ASR), for which the intractable inference is performed by EC, and serves to demonstrate how EC scales well to a large-scale application. Fuller details are given in [14]. The standard approach to noise robust ASR is to provide a set of noise-robust features to a standard Hidden Markov Model (HMM) classifier, which is based on modeling the acoustic feature vector. For example, the method of Unsupervised Spectral Subtraction (USS) [15] provides state-of-the-art performance in this respect. Incorporating noise models directly into such feature-based HMM systems is difficult, mainly because the explicit influence of the noise on the features is poorly understood. An alternative is to model the raw speech signal directly, such as the SAR-HMM model [16] for which, under clean conditions, isolated spoken digit recognition performs well. However, the SAR-HMM performs poorly under noisy conditions, since no explicit noise processes are taken into account by the model. The approach we take here is to extend the SAR-HMM to include an explicit noise process, so that h the observed signal vt is modeled as a noise corrupted version of a clean hidden signal vt : h vt = vt + ηt ˜ 4 with ηt ∼ N (0, σ 2 ) ˜ ˜ Generalized EP [5], which groups variables together improves on the results, but is still far inferior to the EC results presented here – Onno Zoeter personal communication. 
Noise Variance 0 10−7 10−6 10−5 10−4 10−3 SNR (dB) 26.5 26.3 25.1 19.7 10.6 0.7 HMM 100.0% 100.0% 90.9% 86.4% 59.1% 9.1% SAR-HMM 97.0% 79.8% 56.7% 22.2% 9.7% 9.1% AR-SLDS 96.8% 96.8% 96.4% 94.8% 84.0% 61.2% Table 1: Comparison of the recognition accuracy of three models when the test utterances are corrupted by various levels of Gaussian noise. The dynamics of the clean signal is modeled by a switching AR process R h vt = h h cr (st )vt−r + ηt (st ), h ηt (st ) ∼ N (0, σ 2 (st )) r=1 where st ∈ {1, . . . , S} denotes which of a set of AR coefficients cr (st ) are to be used at time t, h and ηt (st ) is the so-called innovation noise. When σ 2 (st ) ≡ 0, this model reproduces the SARHMM of [16], a specially constrained HMM. Hence inference and learning for the SAR-HMM are tractable and straightforward. For the case σ 2 (st ) > 0 the model can be recast as an SLDS. To do this we define ht as a vector which contains the R most recent clean hidden samples ht = h vt h . . . vt−r+1 T (16) and we set A(st ) to be an R × R matrix where the first row contains the AR coefficients −cr (st ) and the rest is a shifted down identity matrix. For example, for a third order (R = 3) AR process, A(st ) = −c1 (st ) −c2 (st ) −c3 (st ) 1 0 0 0 1 0 . (17) The hidden covariance matrix Σh (s) has all elements zero, except the top-left most which is set to the innovation variance. To extract the first component of ht we use the (switch independent) 1 × R projection matrix B = [ 1 0 . . . 0 ]. The (switch independent) visible scalar noise 2 variance is given by Σv ≡ σv . A well-known issue with raw speech signal models is that the energy of a signal may vary from one speaker to another or because of a change in recording conditions. For this reason the innovation Σh is adjusted by maximizing the likelihood of an observed sequence with respect to the innovation covariance, a process called Gain Adaptation [16]. 4.1 Training & Evaluation Following [16], we trained a separate SAR-HMM for each of the eleven digits (0–9 and ‘oh’) from the TI-DIGITS database [17]. The training set for each digit was composed of 110 single digit utterances down-sampled to 8 kHz, each one pronounced by a male speaker. Each SAR-HMM was composed of ten states with a left-right transition matrix. Each state was associated with a 10thorder AR process and the model was constrained to stay an integer multiple of K = 140 time steps (0.0175 seconds) in the same state. We refer the reader to [16] for a detailed explanation of the training procedure used with the SAR-HMM. An AR-SLDS was built for each of the eleven digits by copying the parameters of the corresponding trained SAR-HMM, i.e., the AR coefficients cr (s) are copied into the first row of the hidden transition matrix A(s) and the same discrete transition distribution p(st | st−1 ) is used. The models were then evaluated on a test set composed of 112 corrupted utterances of each of the eleven digits, each pronounced by different male speakers than those used in the training set. The recognition accuracy obtained by the models on the corrupted test sets is presented in Table 1. As expected, the performance of the SAR-HMM rapidly decreases with noise. The feature-based HMM with USS has high accuracy only for high SNR levels. In contrast, the AR-SLDS achieves a recognition accuracy of 61.2% at a SNR close to 0 dB, while the performance of the two other methods is equivalent to random guessing (9.1%). 
Whilst other inference methods may also perform well in this case, we found that EC performs admirably, without numerical instabilities, even for time-series with several thousand time-steps.

5 Discussion

We presented a method for approximate smoothed inference in an augmented class of switching linear dynamical systems. Our approximation is based on the idea that, due to the forgetting which commonly occurs in Markovian models, a finite number of mixture components may provide a reasonable approximation. Clearly, in systems with very long correlation times our method may require too many mixture components to produce a satisfactory result, although we are unaware of other techniques that would be able to cope well in that case. The main benefit of EC over Kim smoothing is that future information is more accurately dealt with. Whilst EC is not as general as EP, EC carefully exploits the properties of singly-connected distributions, such as the aSLDS, to provide a numerically stable procedure. We hope that the ideas presented here may therefore help facilitate the practical application of dynamic hybrid networks.

Acknowledgements

This work is supported by the EU Project FP6-0027787. This paper only reflects the authors' views, and funding agencies are not liable for any use that may be made of the information contained herein.

References

[1] Y. Bar-Shalom and X.-R. Li. Estimation and Tracking: Principles, Techniques and Software. Artech House, Norwood, MA, 1998.
[2] V. Pavlovic, J. M. Rehg, and J. MacCormick. Learning switching linear models of human motion. In Advances in Neural Information Processing Systems (NIPS 13), pages 981-987, 2001.
[3] A. T. Cemgil, B. Kappen, and D. Barber. A generative model for music transcription. IEEE Transactions on Audio, Speech and Language Processing, 14(2):679-694, 2006.
[4] U. N. Lerner. Hybrid Bayesian Networks for Reasoning about Complex Systems. PhD thesis, Stanford University, 2002.
[5] O. Zoeter. Monitoring Non-Linear and Switching Dynamical Systems. PhD thesis, Radboud University Nijmegen, 2005.
[6] T. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, MIT Media Lab, 2001.
[7] D. Barber. Expectation correction for smoothed inference in switching linear dynamical systems. Journal of Machine Learning Research, 7:2515-2540, 2006.
[8] C.-J. Kim. Dynamic linear models with Markov-switching. Journal of Econometrics, 60:1-22, 1994.
[9] T. Heskes and O. Zoeter. Expectation Propagation for approximate inference in dynamic Bayesian networks. In A. Darwiche and N. Friedman, editors, Uncertainty in Artificial Intelligence, pages 216-223, 2002.
[10] S. Lauritzen and F. Jensen. Stable local computation with conditional Gaussian distributions. Statistics and Computing, 11:191-203, 2001.
[11] G. Kitagawa. The two-filter formula for smoothing and an implementation of the Gaussian-sum smoother. Annals of the Institute of Statistical Mathematics, 46(4):605-623, 1994.
[12] Z. Ghahramani and G. E. Hinton. Variational learning for switching state-space models. Neural Computation, 12(4):963-996, 1998.
[13] A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo Methods in Practice. Springer, 2001.
[14] B. Mesot and D. Barber. Switching linear dynamical systems for noise robust speech recognition. IDIAP-RR 08, 2006.
[15] G. Lathoud, M. Magimai-Doss, B. Mesot, and H. Bourlard. Unsupervised spectral subtraction for noise-robust ASR. In Proceedings of ASRU 2005, pages 189-194, November 2005.
[16] Y. Ephraim and W. J. J. Roberts. Revisiting autoregressive hidden Markov modeling of speech signals. IEEE Signal Processing Letters, 12(2):166-169, February 2005.
[17] R. G. Leonard. A database for speaker independent digit recognition. In Proceedings of ICASSP 84, volume 3, 1984.
same-paper 2 0.80814511 143 nips-2006-Natural Actor-Critic for Road Traffic Optimisation
Author: Silvia Richter, Douglas Aberdeen, Jin Yu
Abstract: Current road-traffic optimisation practice around the world is a combination of hand tuned policies with a small degree of automatic adaption. Even state-ofthe-art research controllers need good models of the road traffic, which cannot be obtained directly from existing sensors. We use a policy-gradient reinforcement learning approach to directly optimise the traffic signals, mapping currently deployed sensor observations to control signals. Our trained controllers are (theoretically) compatible with the traffic system used in Sydney and many other cities around the world. We apply two policy-gradient methods: (1) the recent natural actor-critic algorithm, and (2) a vanilla policy-gradient algorithm for comparison. Along the way we extend natural-actor critic approaches to work for distributed and online infinite-horizon problems. 1
3 0.41029716 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics
Author: Peter L. Bartlett, Ambuj Tewari
Abstract: We consider methods that try to find a good policy for a Markov decision process by choosing one from a given class. The policy is chosen based on its empirical performance in simulations. We are interested in conditions on the complexity of the policy class that ensure the success of such simulation based policy search methods. We show that under bounds on the amount of computation involved in computing policies, transition dynamics and rewards, uniform convergence of empirical estimates to true value functions occurs. Previously, such results were derived by assuming boundedness of pseudodimension and Lipschitz continuity. These assumptions and ours are both stronger than the usual combinatorial complexity measures. We show, via minimax inequalities, that this is essential: boundedness of pseudodimension or fat-shattering dimension alone is not sufficient.
4 0.41012433 174 nips-2006-Similarity by Composition
Author: Oren Boiman, Michal Irani
Abstract: We propose a new approach for measuring similarity between two signals, which is applicable to many machine learning tasks, and to many signal types. We say that a signal S1 is “similar” to a signal S2 if it is “easy” to compose S1 from few large contiguous chunks of S2 . Obviously, if we use small enough pieces, then any signal can be composed of any other. Therefore, the larger those pieces are, the more similar S1 is to S2 . This induces a local similarity score at every point in the signal, based on the size of its supported surrounding region. These local scores can in turn be accumulated in a principled information-theoretic way into a global similarity score of the entire S1 to S2 . “Similarity by Composition” can be applied between pairs of signals, between groups of signals, and also between different portions of the same signal. It can therefore be employed in a wide variety of machine learning problems (clustering, classification, retrieval, segmentation, attention, saliency, labelling, etc.), and can be applied to a wide range of signal types (images, video, audio, biological data, etc.) We show a few such examples. 1
5 0.41005614 175 nips-2006-Simplifying Mixture Models through Function Approximation
Author: Kai Zhang, James T. Kwok
Abstract: Finite mixture model is a powerful tool in many statistical learning problems. In this paper, we propose a general, structure-preserving approach to reduce its model complexity, which can bring significant computational benefits in many applications. The basic idea is to group the original mixture components into compact clusters, and then minimize an upper bound on the approximation error between the original and simplified models. By adopting the L2 norm as the distance measure between mixture models, we can derive closed-form solutions that are more robust and reliable than using the KL-based distance measure. Moreover, the complexity of our algorithm is only linear in the sample size and dimensionality. Experiments on density estimation and clustering-based image segmentation demonstrate its outstanding performance in terms of both speed and accuracy.
6 0.40893215 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation
7 0.40777355 65 nips-2006-Denoising and Dimension Reduction in Feature Space
8 0.4077552 167 nips-2006-Recursive ICA
9 0.40773571 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
10 0.407134 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
11 0.40704492 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis
12 0.40660846 85 nips-2006-Geometric entropy minimization (GEM) for anomaly detection and localization
13 0.40610492 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
14 0.40586716 125 nips-2006-Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning
15 0.40543681 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments
16 0.40536436 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
17 0.40524411 98 nips-2006-Inferring Network Structure from Co-Occurrences
18 0.40505004 158 nips-2006-PG-means: learning the number of clusters in data
19 0.40370044 20 nips-2006-Active learning for misspecified generalized linear models
20 0.40365174 187 nips-2006-Temporal Coding using the Response Properties of Spiking Neurons