nips nips2006 nips2006-38 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Laurent Charlin, Pascal Poupart, Romy Shioda
Abstract: Planning in partially observable domains is a notoriously difficult problem. However, in many real-world scenarios, planning can be simplified by decomposing the task into a hierarchy of smaller planning problems. Several approaches have been proposed to optimize a policy that decomposes according to a hierarchy specified a priori. In this paper, we investigate the problem of automatically discovering the hierarchy. More precisely, we frame the optimization of a hierarchical policy as a non-convex optimization problem that can be solved with general non-linear solvers, a mixed-integer non-linear approximation or a form of bounded hierarchical policy iteration. By encoding the hierarchical structure as variables of the optimization problem, we can automatically discover a hierarchy. Our method is flexible enough to allow any parts of the hierarchy to be specified based on prior knowledge while letting the optimization discover the unknown parts. It can also discover hierarchical policies, including recursive policies, that are more compact (potentially infinitely fewer parameters) and often easier to understand given the decomposition induced by the hierarchy. 1
Reference: text
sentIndex sentText sentNum sentScore
1 However, in many real-world scenarios, planning can be simplified by decomposing the task into a hierarchy of smaller planning problems. [sent-7, score-0.429]
2 Several approaches have been proposed to optimize a policy that decomposes according to a hierarchy specified a priori. [sent-8, score-0.521]
3 More precisely, we frame the optimization of a hierarchical policy as a non-convex optimization problem that can be solved with general non-linear solvers, a mixed-integer non-linear approximation or a form of bounded hierarchical policy iteration. [sent-10, score-1.003]
4 Our method is flexible enough to allow any parts of the hierarchy to be specified based on prior knowledge while letting the optimization discover the unknown parts. [sent-12, score-0.323]
5 It can also discover hierarchical policies, including recursive policies, that are more compact (potentially infinitely fewer parameters) and often easier to understand given the decomposition induced by the hierarchy. [sent-13, score-0.326]
6 However, in many real-world scenarios, planning can be simplified by decomposing the task into a hierarchy of smaller planning problems. [sent-15, score-0.429]
7 Prior work such as [22] and Hansen and Zhou [10] proposed various algorithms that speed up planning in partially observable domains by exploiting the decompositions induced by a hierarchy. [sent-22, score-0.32]
8 However these approaches assume that a policy hierarchy is specified by the user, so an important question arises: how can we automate the discovery of a policy hierarchy? [sent-23, score-0.852]
9 In fully observable domains, there exists a large body of work on hierarchical Markov decision processes and reinforcement learning [6, 21, 7, 15] and several hierarchy discovery techniques have been proposed [23, 13, 11, 20]. [sent-24, score-0.614]
10 However those techniques rely on the assumption that states are fully observable to detect abstractions and subgoals, which prevents their use in partially observable domains. [sent-25, score-0.327]
11 We propose to frame hierarchy and policy discovery as an optimization problem with variables corresponding to the hierarchy and policy parameters. [sent-26, score-1.147]
12 We present an approach that searches in the space of hierarchical controllers [10] for a good hierarchical policy. [sent-27, score-0.661]
13 The search leads to a difficult non-convex optimization problem that we tackle using three approaches: generic non-linear solvers, a mixed-integer non-linear programming approximation, or an alternating optimization technique that can be thought of as a form of hierarchical bounded policy iteration. [sent-28, score-0.622]
14 We also generalize Hansen and Zhou’s hierarchical controllers [10] to allow recursive controllers. [sent-29, score-0.675]
15 These are controllers that may recursively call themselves, giving them the ability to represent with finitely many parameters policies that would otherwise require infinitely many parameters. [sent-30, score-0.492]
16 Recursive policies are likely to arise in language processing tasks such as dialogue management and text generation due to the recursive nature of language models. [sent-31, score-0.469]
17 We first review POMDPs (Sect. 2.1), which is the framework used throughout the paper for planning in partially observable domains. [sent-34, score-0.282]
18 Then we review how to represent POMDP policies as finite state controllers (Sect. 2.2). [sent-35, score-0.515]
19 We also review some algorithms to optimize controllers of a fixed size (Sect. 2.3). [sent-37, score-0.377]
20 2.1 POMDPs: POMDPs have emerged as a popular framework for planning in partially observable domains [12]. [sent-41, score-0.32]
21 Thus we define a policy π as a mapping from histories of past actions and observations to actions. [sent-47, score-0.409]
22 The value $V^\pi$ of policy $\pi$ when starting with belief $b$ is measured by the expected sum of the future rewards: $V^\pi(b) = \sum_t R(b_t, \pi(b_t))$, where $R(b, a) = \sum_s b(s) R(s, a)$. [sent-54, score-0.317]
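As a small illustration of the belief-based reward above, the following sketch (our own toy example with made-up reward values, not code from the paper) computes $R(b, a) = \sum_s b(s) R(s, a)$ for every action with plain NumPy:

```python
# Illustrative sketch (not from the paper): the belief-based reward R(b, a) = sum_s b(s) R(s, a)
# for a toy POMDP with 4 states and 3 actions; the reward table and belief are made up.
import numpy as np

rng = np.random.default_rng(0)
R = rng.normal(size=(4, 3))          # hypothetical reward table R[s, a]
b = np.full(4, 0.25)                 # uniform belief b(s) over the 4 states

belief_reward = b @ R                # belief_reward[a] = R(b, a) = sum_s b(s) R(s, a)
print(belief_reward)
```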
23 An optimal policy $\pi^*$ is a policy with the highest value $V^*$ for all beliefs (i.e., $V^{\pi^*}(b) \ge V^{\pi}(b)$ for all $b$ and $\pi$). [sent-55, score-0.568]
24 2.2 Policy Representation: A convenient representation for an important class of policies is that of finite state controllers [9]. [sent-60, score-0.515]
25 A finite state controller consists of a finite state automaton (N, E) with a set N of nodes and a set E of directed edges. Each node n has one outgoing edge per observation. [sent-61, score-0.621]
26 A controller encodes a policy π = (α, β) by mapping each node to an action (i.e., α(n) = a) and each edge (referred to by its observation label o and its parent node n) to a successor node (i.e., β(n, o) = n'). [sent-62, score-0.728] [sent-64, score-0.34]
28 At runtime, the policy encoded by a controller is executed by doing the action $a_t = \alpha(n_t)$ associated with the node $n_t$ traversed at time step $t$ and following the edge labelled with observation $o_t$ to reach the next node $n_{t+1} = \beta(n_t, o_t)$. [sent-67, score-0.937]
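To make this execution model concrete, here is a minimal sketch (our own hypothetical interface, not the paper's code) of running a deterministic finite state controller against a simulator; `alpha`, `beta`, and `env.step` are assumptions:

```python
# Minimal sketch of executing a deterministic finite state controller (alpha, beta).
# alpha: dict node -> action; beta: dict (node, observation) -> successor node.
# env.step(action) is a hypothetical simulator call returning (observation, reward).

def run_controller(alpha, beta, env, start_node, horizon=100):
    node, total_reward = start_node, 0.0
    for _ in range(horizon):
        action = alpha[node]                  # a_t = alpha(n_t)
        observation, reward = env.step(action)
        total_reward += reward
        node = beta[(node, observation)]      # n_{t+1} = beta(n_t, o_t)
    return total_reward
```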
29 Stochastic controllers [18] can also be used to represent stochastic policies by redefining α and β as distributions over actions and successor nodes. [sent-68, score-0.666]
30 More precisely, let $\Pr_\alpha(a \mid n)$ be the distribution from which an action $a$ is sampled in node $n$ and let $\Pr_\beta(n' \mid n, a, o)$ be the distribution from which the successor node $n'$ is sampled after executing $a$ and receiving $o$ in node $n$. [sent-69, score-0.565]
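A stochastic controller only changes the two lookups into draws from distributions; the sketch below (again an assumption about the representation, with `pr_action` and `pr_next` as plain dictionaries) samples one step:

```python
# Hedged sketch of one step of a stochastic controller.
# pr_action[n] maps actions to probabilities, i.e. Pr_alpha(a|n);
# pr_next[(n, a, o)] maps successor nodes to probabilities, i.e. Pr_beta(n'|n, a, o).
import random

def sample_step(pr_action, pr_next, node, observe):
    actions, probs = zip(*pr_action[node].items())
    a = random.choices(actions, weights=probs)[0]        # a ~ Pr_alpha(.|n)
    o = observe(a)                                       # observation returned by the environment
    succs, sprobs = zip(*pr_next[(node, a, o)].items())
    n_next = random.choices(succs, weights=sprobs)[0]    # n' ~ Pr_beta(.|n, a, o)
    return a, o, n_next
```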
31 Given time and memory constraints, it is common practice to search for the best controller with a bounded number of nodes [18]. [sent-71, score-0.385]
32 However, when the number of nodes is fixed, the best controller is not necessarily deterministic. [sent-72, score-0.35]
33 This explains why searching in the space of stochastic controllers may be advantageous. [sent-73, score-0.413]
34 Table 1: Quadratically constrained optimization program for bounded stochastic controllers [1]. [sent-74, score-0.506]
35 2.3 Optimization of Stochastic Controllers: The optimization of a stochastic controller with a fixed number of nodes can be formulated as a quadratically constrained optimization problem (QCOP) [1]. [sent-78, score-0.532]
36 Several techniques have been tried, including gradient ascent [14], stochastic local search [3], bounded policy iteration (BPI) [18], and a general non-linear solver called SNOPT (based on sequential quadratic programming) [1, 8]. [sent-85, score-0.387]
37 Given a policy with fixed parameters $\Pr(a, n' \mid n, o)$, policy evaluation solves the linear system in Eq. 1 to find $V_n(s)$ for all $n, s$. [sent-90, score-0.568]
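For concreteness, one common way to set up this evaluation step is sketched below; the tensor layout and the Bellman-style recursion (including an explicit discount factor gamma) are our assumptions, not a transcription of Eq. 1:

```python
# Hedged sketch of policy evaluation for a fixed stochastic controller: solve the linear system
#   V_n(s) = sum_a Pr(a|n) [ R(s,a) + gamma * sum_{s',o,n'} T(s'|s,a) O(o|s',a) Pr(n'|n,a,o) V_n'(s') ]
# Tensors (our layout): pr_a[n,a]=Pr(a|n), pr_n[n,a,o,m]=Pr(n'|n,a,o), T[s,a,t], O[t,a,o], R[s,a].
import numpy as np

def evaluate_controller(pr_a, pr_n, T, O, R, gamma=0.95):
    N, S = pr_a.shape[0], R.shape[0]
    r = np.einsum('na,sa->ns', pr_a, R)                      # expected immediate reward per (n, s)
    # M[(n,s),(m,t)] = sum_{a,o} Pr(a|n) T(t|s,a) O(o|t,a) Pr(m|n,a,o)
    M = np.einsum('na,sat,tao,naom->nsmt', pr_a, T, O, pr_n).reshape(N * S, N * S)
    v = np.linalg.solve(np.eye(N * S) - gamma * M, r.reshape(N * S))   # solves V = r + gamma*M*V
    return v.reshape(N, S)                                   # V_n(s)
```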
38 3 Hierarchical controllers: Hansen and Zhou [10] recently proposed hierarchical finite-state controllers as a simple and intuitive way of encoding hierarchical policies. [sent-93, score-1.038]
39 A hierarchical controller consists of a set of nodes and edges as in a flat controller; however, some nodes may be abstract, corresponding to sub-controllers themselves. [sent-94, score-0.65]
40 As with flat controllers, concrete nodes are parameterized with an action mapping α, and edges outgoing from concrete nodes are parameterized by a successor node mapping β. [sent-95, score-0.787]
41 In contrast, abstract nodes are parameterized by a child node mapping indicating in which child node the subcontroller should start. [sent-96, score-0.772]
42 In fully observable domains, it is customary to stop the subcontroller once a goal state (from a predefined set of terminal states) is reached. [sent-99, score-0.448]
43 This strategy cannot work in partially observable domains, so Hansen and Zhou propose to terminate a subcontroller when an end node (from a predefined set of terminal nodes) is reached. [sent-100, score-0.632]
44 Since the decision to reach a terminal node is made according to the successor node mapping β, the timing for returning control is implicitly optimized. [sent-101, score-0.474]
45 Terminal nodes do not have any outgoing edges nor any action mapping since they already have an action assigned. [sent-104, score-0.349]
46 The hierarchy of the controller is assumed to be finite and specified by the programmer. [sent-105, score-0.46]
47 Subcontrollers at the bottom level are made up only of concrete nodes and therefore can be optimized as usual using any controller optimization technique. [sent-107, score-0.482]
48 Hence, the immediate reward of an abstract node $\bar n$ corresponds to the value $V_{\alpha(\bar n)}(s)$ of its child node $\alpha(\bar n)$. [sent-110, score-0.384]
49 Similarly, the probability of reaching state $s_{end}$ after executing the subcontroller of an abstract node $\bar n$ corresponds to the probability $\Pr(s_{end} \mid s, \alpha(\bar n))$ of terminating the subcontroller in $s_{end}$ when starting in $s$ at child node $\alpha(\bar n)$. [sent-111, score-1.032]
50 Hansen’s hierarchical controllers have two limitations: the hierarchy must have a finite number of levels and it must be specified by hand. [sent-122, score-0.783]
51 In the next section we describe recursive controllers which may have infinitely many levels. [sent-123, score-0.533]
52 We also describe an algorithm to discover a suitable hierarchy by simultaneously optimizing the controller parameters and hierarchy. [sent-124, score-0.488]
53 Recursive Controllers: In some domains, policies are naturally recursive in the sense that they decompose into subpolicies that may call themselves. [sent-126, score-0.298]
54 Assuming POMDP dialogue management eventually handles decisions at the sentence level, recursive policies will naturally arise. [sent-129, score-0.397]
55 Similarly, language generation with POMDPs would naturally lead to recursive policies that reflect the recursive nature of language models. [sent-130, score-0.499]
56 We now propose several modifications to Hansen and Zhou’s hierarchical controllers that simplify things while allowing recursive controllers. [sent-131, score-0.675]
57 First, the subcontrollers of abstract nodes may be composed of any node (including the parent node itself) and transitions can be made to any node anywhere (whether concrete or abstract). [sent-132, score-0.752]
58 This allows recursive controllers and smaller controllers since nodes may be shared across levels. [sent-133, score-1.037]
59 Second, we use a single terminal node that has neither an action nor any outer edge. [sent-134, score-0.289]
60 Hence, the actions that were associated with the terminal nodes in Hansen and Zhou’s proposal are associated with the abstract nodes in our proposal. [sent-138, score-0.424]
61 This allows a uniform parameterization of actions for all nodes while reducing the number of terminal nodes to 1. [sent-139, score-0.424]
62 Fourth, the outer edges of abstract nodes are labelled with regular observations since an observation will be made following the execution of the action of an abstract node. [sent-140, score-0.293]
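The following dataclass sketch is one possible way (our own naming and structure, not the paper's) to represent nodes under these modifications: every non-terminal node carries an action, abstract nodes additionally name the node where their subcontroller starts (which may be any node, allowing recursion), and a single terminal node has neither an action nor outgoing edges.

```python
# Illustrative data structure for the modified hierarchical/recursive controllers described
# above; the class and field names are assumptions made for the sake of the example.
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Node:
    name: str
    action: Optional[str] = None            # carried by concrete and abstract nodes
    child: Optional[str] = None             # abstract nodes: name of the node where the subcontroller starts
    edges: Dict[str, str] = field(default_factory=dict)   # observation -> successor node name
    is_terminal: bool = False               # the single terminal node: no action, no outgoing edges

    @property
    def is_abstract(self) -> bool:
        return self.child is not None

# Example: an abstract node whose subcontroller starts at node "n1" and that executes action
# "ask" on return; using node names (strings) lets nodes be shared and even call themselves.
root = Node(name="n0", action="ask", child="n1", edges={"yes": "n0", "no": "end"})
```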
63 Hierarchy and Policy Optimization: We formulate the search for a good stochastic recursive controller, including the automated hierarchy discovery, as an optimization problem (see Table 2). [sent-147, score-0.51]
64 The global maximum of this optimization problem corresponds to the optimal policy (and hierarchy) for a fixed set $N$ of concrete nodes $n$ and a fixed set $\bar N$ of abstract nodes $\bar n$. [sent-148, score-0.67]
65 The variables consist of the value function $V_n(s)$, the policy parameters $\Pr(n', a \mid n, o)$, the (stochastic) child node mapping $\Pr(n' \mid \bar n)$ for each abstract node $\bar n$, and the occupancy frequency $oc(n, s \mid n_0, s_0)$ of each $(n, s)$-pair when starting in $(n_0, s_0)$. [sent-149, score-0.651]
66 The objective (Eq. 3) is the expected value $\sum_s b_0(s) V_{n_0}(s)$ of starting the controller in node $n_0$ with initial belief $b_0$. [sent-151, score-0.397]
67 The expected value of an abstract node corresponds to the sum of three terms: the expected value $V_{n_{beg}}(s)$ of its subcontroller given by its child node $n_{beg}$, the reward $R(s_{end}, a_{\bar n})$ received immediately after the termination of the subcontroller, and the future rewards $V_{n'}(s')$. [sent-153, score-1.062]
68 Note that the reward $R(s_{end}, a_{\bar n})$ depends on the state $s_{end}$ in which the subcontroller terminates. [sent-158, score-0.464]
69 Hence we need to compute the probability that the last state visited in the subcontroller is $s_{end}$. [sent-159, score-0.425]
70 This probability is given by the occupancy frequency $oc(s_{end}, n_{end} \mid s, n_{beg})$, which is recursively defined in Eq. [sent-160, score-0.296]
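Putting the three terms together, one plausible reading of the abstract-node value is sketched below; this is our reconstruction, not a verbatim equation from the paper, and discounting inside the subcontroller is presumably absorbed into $V_{n_{beg}}$ and $oc$, so it is omitted here.

```latex
% Hedged reconstruction of the value of an abstract node \bar{n} with action a_{\bar n}
% and child (start) node n_{beg}; notation follows the surrounding text.
V_{\bar n}(s) \;=\; V_{n_{beg}}(s)
 \;+\; \sum_{s_{end}} oc(s_{end}, n_{end} \mid s, n_{beg})
 \Big[\, R(s_{end}, a_{\bar n})
 \;+\; \sum_{s'} \Pr(s' \mid s_{end}, a_{\bar n})
       \sum_{o} \Pr(o \mid s', a_{\bar n})
       \sum_{n'} \Pr(n' \mid \bar n, o)\, V_{n'}(s') \Big]
```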
71 Only the nodes labelled with numbers larger than the label of an abstract node can be children of that abstract node. [sent-167, score-0.324]
72 This constraint ensures that chains of child node mappings have a finite length, eventually reaching a concrete node where an action is executed. [sent-168, score-0.498]
73 Constraints, like the ones in Table 1, are also needed to guarantee that the policy parameters and the child node mappings are proper distributions. [sent-169, score-0.509]
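In notation consistent with the rest of the section, these distribution constraints are presumably of the usual simplex form; the sketch below uses our notation and is not copied from Table 1 or 2.

```latex
% Normalization and non-negativity constraints keeping the parameters proper distributions.
\sum_{n', a} \Pr(n', a \mid n, o) = 1, \quad \Pr(n', a \mid n, o) \ge 0 \qquad \forall\, n, o;
\qquad
\sum_{n'} \Pr(n' \mid \bar n) = 1, \quad \Pr(n' \mid \bar n) \ge 0 \qquad \forall\, \bar n.
```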
74 Table 2: Non-convex quartically constrained optimization problem for hierarchy and policy discovery in bounded stochastic recursive controllers. [sent-190, score-0.853]
75 Note that fixing $V_{n'}(s')$ and $oc(s, n \mid s_0, n_0)$ while varying the policy parameters is reminiscent of policy iteration, hence the name bounded hierarchical policy iteration. [sent-199, score-1.029]
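A hedged sketch of this alternation is shown below; `evaluate`, `compute_occupancy`, and `improve_policy` are hypothetical callables for the three subproblems (the last one the linear or mixed-integer improvement step), not functions from the paper.

```python
# Sketch of a bounded hierarchical policy iteration loop: alternate between evaluating the
# controller with its parameters fixed and improving the parameters with values/occupancies fixed.
import numpy as np

def bhpi(params, evaluate, compute_occupancy, improve_policy, max_iters=50, tol=1e-6):
    value = evaluate(params)                      # array V[n, s], controller parameters held fixed
    for _ in range(max_iters):
        occupancy = compute_occupancy(params)     # oc(s, n | s0, n0), parameters held fixed
        params = improve_policy(params, value, occupancy)   # LP / MILP improvement step
        new_value = evaluate(params)
        if np.max(np.abs(new_value - value)) < tol:         # stop once the values stabilize
            return params, new_value
        value = new_value
    return params, value
```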
76 4 Discussion: Discovering a hierarchy offers many advantages over previous methods that assume the hierarchy is already known. [sent-201, score-0.474]
77 Note however that discovering the hierarchy while optimizing the policy is a much more difficult problem than simply optimizing the policy parameters. [sent-204, score-0.844]
78 Our approach can also be used when the hierarchy and the policy are partly known. [sent-208, score-0.521]
79 It is also interesting to note that hierarchical policies may be encoded with exponentially fewer nodes in a hierarchical controller than in a flat controller. [sent-211, score-0.749]
80 Intuitively, when a subcontroller is called by k abstract nodes, this subcontroller is shared by all its abstract parents. [sent-212, score-0.43]
81 If a hierarchical controller has $l$ levels with subcontrollers shared by $k$ parents in each level, then the equivalent flat controller will need $O(k^l)$ copies. [sent-214, score-0.723]
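For example, with $k = 2$ parents per level and $l = 10$ levels, the flat equivalent would need on the order of $2^{10} = 1024$ copies of each shared subcontroller.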
82 By allowing recursive controllers, policies may be represented even more compactly. [sent-215, score-0.271]
83 Recursive controllers allow abstract nodes to call subcontrollers that may contain themselves. [sent-216, score-0.612]
84 An equivalent non-hierarchical controller would have to unroll the recursion by creating a separate copy of the subcontroller each time it is called. [sent-240, score-0.438]
85 Since recursive controllers essentially call themselves infinitely many times, they can represent infinitely large non-recursive controllers with finitely many nodes. [sent-241, score-0.91]
86 As a comparison, recursive controllers are to non-recursive hierarchical controllers what context-free grammars are to regular expressions. [sent-242, score-1.052]
87 Since the leading approaches for controller optimization fix the number of nodes [18, 1], one may be able to find a much better policy by considering hierarchical recursive controllers. [sent-243, score-0.99]
88 In addition, hierarchical controllers may be easier to understand and interpret than flat controllers given their natural decomposition into subcontrollers and their possibly smaller size. [sent-244, score-1.004]
89 We used the SNOPT package to directly solve the non-convex optimization problem in Table 2 and bounded hierarchical policy iteration (BHPI) to solve it iteratively. [sent-246, score-0.519]
90 We optimized hierarchical controllers of two levels with a fixed number of nodes reported in the column labelled “Num. [sent-249, score-0.729]
91 In particular, the hierarchy discovered for the paint problem matches the one hand coded by Pineau in her PhD thesis [16]. [sent-256, score-0.305]
92 5 Conclusion & Future Work: This paper proposes the first approach for hierarchy discovery in partially observable planning problems. [sent-259, score-0.566]
93 We model the search for a good hierarchical policy as a non-convex optimization problem with variables corresponding to the hierarchy and policy parameters. [sent-260, score-1.005]
94 We propose to tackle the optimization problem using non-linear solvers such as SNOPT or by reformulating the problem as an approximate MINLP or as a sequence of MILPs that can be thought of as a form of hierarchical bounded policy iteration. [sent-261, score-0.567]
95 We also generalize Hansen and Zhou’s hierarchical controllers to recursive controllers. [sent-265, score-0.675]
96 Recursive controllers can encode policies with finitely many nodes that would otherwise require infinitely large non-recursive controllers. [sent-266, score-0.619]
97 Further details about recursive controllers and our other contributions can be found in [5]. [sent-271, score-0.533]
98 We plan to further investigate the use of recursive controllers in dialogue management and text generation where recursive policies are expected to naturally capture the recursive nature of language models. [sent-272, score-1.122]
99 Automated hierarchy discovery for planning in partially observable domains. [sent-296, score-0.566]
100 An improved policy iteration algorithm for partially observable MDPs. [sent-318, score-0.47]
wordName wordTfidf (topN-words)
[('pr', 0.393), ('controllers', 0.377), ('policy', 0.284), ('hierarchy', 0.237), ('controller', 0.223), ('subcontroller', 0.215), ('nbeg', 0.188), ('send', 0.187), ('oc', 0.175), ('recursive', 0.156), ('vn', 0.143), ('hierarchical', 0.142), ('node', 0.141), ('nodes', 0.127), ('observable', 0.12), ('hansen', 0.119), ('policies', 0.115), ('nend', 0.108), ('snopt', 0.108), ('subcontrollers', 0.108), ('planning', 0.096), ('terminal', 0.09), ('pomdp', 0.085), ('dialogue', 0.081), ('actions', 0.08), ('pomdps', 0.079), ('concrete', 0.074), ('zhou', 0.072), ('bhpi', 0.067), ('minlp', 0.067), ('partially', 0.066), ('child', 0.063), ('successor', 0.058), ('action', 0.058), ('optimization', 0.058), ('labelled', 0.056), ('bpi', 0.054), ('ok', 0.054), ('outgoing', 0.053), ('nitely', 0.051), ('waterloo', 0.047), ('discovery', 0.047), ('reinforcement', 0.046), ('management', 0.045), ('absorb', 0.04), ('disjunctive', 0.04), ('lbx', 0.04), ('paint', 0.04), ('pineau', 0.04), ('ubx', 0.04), ('termination', 0.039), ('discovering', 0.039), ('transition', 0.039), ('reward', 0.039), ('domains', 0.038), ('stochastic', 0.036), ('language', 0.036), ('bounded', 0.035), ('nt', 0.034), ('belief', 0.033), ('ontario', 0.032), ('xh', 0.032), ('solver', 0.032), ('edges', 0.031), ('quadratically', 0.03), ('discount', 0.029), ('aaai', 0.028), ('discover', 0.028), ('thesis', 0.028), ('posmdps', 0.027), ('poupart', 0.027), ('subgoals', 0.027), ('subpolicies', 0.027), ('theocharous', 0.027), ('vnbeg', 0.027), ('discounting', 0.027), ('levels', 0.027), ('bb', 0.026), ('executing', 0.026), ('phd', 0.025), ('tackle', 0.025), ('table', 0.024), ('histories', 0.023), ('maze', 0.023), ('shuttle', 0.023), ('state', 0.023), ('automated', 0.023), ('solvers', 0.023), ('decision', 0.022), ('user', 0.022), ('mapping', 0.022), ('execution', 0.021), ('abstractions', 0.021), ('terminating', 0.021), ('rewards', 0.021), ('mappings', 0.021), ('constraints', 0.02), ('transitions', 0.02), ('programming', 0.02), ('notoriously', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 38 nips-2006-Automated Hierarchy Discovery for Planning in Partially Observable Environments
Author: Laurent Charlin, Pascal Poupart, Romy Shioda
Abstract: Planning in partially observable domains is a notoriously difficult problem. However, in many real-world scenarios, planning can be simplified by decomposing the task into a hierarchy of smaller planning problems. Several approaches have been proposed to optimize a policy that decomposes according to a hierarchy specified a priori. In this paper, we investigate the problem of automatically discovering the hierarchy. More precisely, we frame the optimization of a hierarchical policy as a non-convex optimization problem that can be solved with general non-linear solvers, a mixed-integer non-linear approximation or a form of bounded hierarchical policy iteration. By encoding the hierarchical structure as variables of the optimization problem, we can automatically discover a hierarchy. Our method is flexible enough to allow any parts of the hierarchy to be specified based on prior knowledge while letting the optimization discover the unknown parts. It can also discover hierarchical policies, including recursive policies, that are more compact (potentially infinitely fewer parameters) and often easier to understand given the decomposition induced by the hierarchy. 1
2 0.29242697 62 nips-2006-Correcting Sample Selection Bias by Unlabeled Data
Author: Jiayuan Huang, Arthur Gretton, Karsten M. Borgwardt, Bernhard Schölkopf, Alex J. Smola
Abstract: We consider the scenario where training and test data are drawn from different distributions, commonly referred to as sample selection bias. Most algorithms for this setting try to first recover sampling distributions and then make appropriate corrections based on the distribution estimate. We present a nonparametric method which directly produces resampling weights without distribution estimation. Our method works by matching distributions between training and testing sets in feature space. Experimental results demonstrate that our method works well in practice.
3 0.24096598 143 nips-2006-Natural Actor-Critic for Road Traffic Optimisation
Author: Silvia Richter, Douglas Aberdeen, Jin Yu
Abstract: Current road-traffic optimisation practice around the world is a combination of hand tuned policies with a small degree of automatic adaption. Even state-ofthe-art research controllers need good models of the road traffic, which cannot be obtained directly from existing sensors. We use a policy-gradient reinforcement learning approach to directly optimise the traffic signals, mapping currently deployed sensor observations to control signals. Our trained controllers are (theoretically) compatible with the traffic system used in Sydney and many other cities around the world. We apply two policy-gradient methods: (1) the recent natural actor-critic algorithm, and (2) a vanilla policy-gradient algorithm for comparison. Along the way we extend natural-actor critic approaches to work for distributed and online infinite-horizon problems. 1
4 0.20319398 44 nips-2006-Bayesian Policy Gradient Algorithms
Author: Mohammad Ghavamzadeh, Yaakov Engel
Abstract: Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. Since Monte Carlo methods tend to have high variance, a large number of samples is required, resulting in slow convergence. In this paper, we propose a Bayesian framework that models the policy gradient as a Gaussian process. This reduces the number of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient as well as a measure of the uncertainty in the gradient estimates are provided at little extra cost. 1
5 0.1858681 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics
Author: Peter L. Bartlett, Ambuj Tewari
Abstract: We consider methods that try to find a good policy for a Markov decision process by choosing one from a given class. The policy is chosen based on its empirical performance in simulations. We are interested in conditions on the complexity of the policy class that ensure the success of such simulation based policy search methods. We show that under bounds on the amount of computation involved in computing policies, transition dynamics and rewards, uniform convergence of empirical estimates to true value functions occurs. Previously, such results were derived by assuming boundedness of pseudodimension and Lipschitz continuity. These assumptions and ours are both stronger than the usual combinatorial complexity measures. We show, via minimax inequalities, that this is essential: boundedness of pseudodimension or fat-shattering dimension alone is not sufficient.
6 0.1726782 125 nips-2006-Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning
7 0.12628786 154 nips-2006-Optimal Change-Detection and Spiking Neurons
8 0.11553057 25 nips-2006-An Application of Reinforcement Learning to Aerobatic Helicopter Flight
10 0.10028908 115 nips-2006-Learning annotated hierarchies from relational data
11 0.089036748 69 nips-2006-Distributed Inference in Dynamical Systems
12 0.077382669 191 nips-2006-The Robustness-Performance Tradeoff in Markov Decision Processes
13 0.062263448 89 nips-2006-Handling Advertisements of Unknown Quality in Search Advertising
14 0.062206235 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation
15 0.060394678 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods
16 0.052785188 124 nips-2006-Linearly-solvable Markov decision problems
17 0.052526191 47 nips-2006-Boosting Structured Prediction for Imitation Learning
18 0.049815562 142 nips-2006-Mutagenetic tree Fisher kernel improves prediction of HIV drug resistance from viral genotype
19 0.048923619 103 nips-2006-Kernels on Structured Objects Through Nested Histograms
20 0.046942592 153 nips-2006-Online Clustering of Moving Hyperplanes
topicId topicWeight
[(0, -0.156), (1, -0.015), (2, -0.219), (3, -0.236), (4, 0.104), (5, -0.055), (6, 0.052), (7, -0.022), (8, 0.398), (9, -0.023), (10, -0.03), (11, -0.086), (12, 0.019), (13, -0.114), (14, -0.082), (15, -0.014), (16, -0.027), (17, -0.034), (18, 0.08), (19, 0.103), (20, -0.087), (21, -0.168), (22, -0.182), (23, -0.017), (24, 0.023), (25, -0.132), (26, 0.158), (27, 0.016), (28, 0.044), (29, -0.056), (30, 0.07), (31, 0.029), (32, 0.105), (33, 0.076), (34, 0.007), (35, -0.046), (36, -0.023), (37, 0.019), (38, -0.004), (39, -0.037), (40, 0.075), (41, -0.004), (42, 0.047), (43, 0.14), (44, -0.054), (45, 0.062), (46, 0.065), (47, 0.034), (48, -0.081), (49, 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.9824307 38 nips-2006-Automated Hierarchy Discovery for Planning in Partially Observable Environments
Author: Laurent Charlin, Pascal Poupart, Romy Shioda
Abstract: Planning in partially observable domains is a notoriously difficult problem. However, in many real-world scenarios, planning can be simplified by decomposing the task into a hierarchy of smaller planning problems. Several approaches have been proposed to optimize a policy that decomposes according to a hierarchy specified a priori. In this paper, we investigate the problem of automatically discovering the hierarchy. More precisely, we frame the optimization of a hierarchical policy as a non-convex optimization problem that can be solved with general non-linear solvers, a mixed-integer non-linear approximation or a form of bounded hierarchical policy iteration. By encoding the hierarchical structure as variables of the optimization problem, we can automatically discover a hierarchy. Our method is flexible enough to allow any parts of the hierarchy to be specified based on prior knowledge while letting the optimization discover the unknown parts. It can also discover hierarchical policies, including recursive policies, that are more compact (potentially infinitely fewer parameters) and often easier to understand given the decomposition induced by the hierarchy. 1
2 0.84797418 143 nips-2006-Natural Actor-Critic for Road Traffic Optimisation
Author: Silvia Richter, Douglas Aberdeen, Jin Yu
Abstract: Current road-traffic optimisation practice around the world is a combination of hand tuned policies with a small degree of automatic adaption. Even state-ofthe-art research controllers need good models of the road traffic, which cannot be obtained directly from existing sensors. We use a policy-gradient reinforcement learning approach to directly optimise the traffic signals, mapping currently deployed sensor observations to control signals. Our trained controllers are (theoretically) compatible with the traffic system used in Sydney and many other cities around the world. We apply two policy-gradient methods: (1) the recent natural actor-critic algorithm, and (2) a vanilla policy-gradient algorithm for comparison. Along the way we extend natural-actor critic approaches to work for distributed and online infinite-horizon problems. 1
3 0.6498735 44 nips-2006-Bayesian Policy Gradient Algorithms
Author: Mohammad Ghavamzadeh, Yaakov Engel
Abstract: Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. Since Monte Carlo methods tend to have high variance, a large number of samples is required, resulting in slow convergence. In this paper, we propose a Bayesian framework that models the policy gradient as a Gaussian process. This reduces the number of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient as well as a measure of the uncertainty in the gradient estimates are provided at little extra cost. 1
4 0.64436954 62 nips-2006-Correcting Sample Selection Bias by Unlabeled Data
Author: Jiayuan Huang, Arthur Gretton, Karsten M. Borgwardt, Bernhard Schölkopf, Alex J. Smola
Abstract: We consider the scenario where training and test data are drawn from different distributions, commonly referred to as sample selection bias. Most algorithms for this setting try to first recover sampling distributions and then make appropriate corrections based on the distribution estimate. We present a nonparametric method which directly produces resampling weights without distribution estimation. Our method works by matching distributions between training and testing sets in feature space. Experimental results demonstrate that our method works well in practice.
5 0.55672681 25 nips-2006-An Application of Reinforcement Learning to Aerobatic Helicopter Flight
Author: Pieter Abbeel, Adam Coates, Morgan Quigley, Andrew Y. Ng
Abstract: Autonomous helicopter flight is widely regarded to be a highly challenging control problem. This paper presents the first successful autonomous completion on a real RC helicopter of the following four aerobatic maneuvers: forward flip and sideways roll at low speed, tail-in funnel, and nose-in funnel. Our experimental results significantly extend the state of the art in autonomous helicopter flight. We used the following approach: First we had a pilot fly the helicopter to help us find a helicopter dynamics model and a reward (cost) function. Then we used a reinforcement learning (optimal control) algorithm to find a controller that is optimized for the resulting model and reward function. More specifically, we used differential dynamic programming (DDP), an extension of the linear quadratic regulator (LQR). 1
6 0.41760692 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics
7 0.39900419 125 nips-2006-Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning
8 0.36667284 142 nips-2006-Mutagenetic tree Fisher kernel improves prediction of HIV drug resistance from viral genotype
9 0.30951798 144 nips-2006-Near-Uniform Sampling of Combinatorial Spaces Using XOR Constraints
10 0.28328472 89 nips-2006-Handling Advertisements of Unknown Quality in Search Advertising
11 0.28155178 137 nips-2006-Multi-Robot Negotiation: Approximating the Set of Subgame Perfect Equilibria in General-Sum Stochastic Games
12 0.26941016 69 nips-2006-Distributed Inference in Dynamical Systems
13 0.25912201 115 nips-2006-Learning annotated hierarchies from relational data
14 0.25146395 202 nips-2006-iLSTD: Eligibility Traces and Convergence Analysis
15 0.2339143 154 nips-2006-Optimal Change-Detection and Spiking Neurons
16 0.21725358 191 nips-2006-The Robustness-Performance Tradeoff in Markov Decision Processes
17 0.19709639 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation
18 0.19423378 124 nips-2006-Linearly-solvable Markov decision problems
19 0.19037533 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure
20 0.18052295 153 nips-2006-Online Clustering of Moving Hyperplanes
topicId topicWeight
[(1, 0.081), (3, 0.017), (7, 0.049), (9, 0.049), (20, 0.017), (22, 0.032), (44, 0.073), (57, 0.079), (59, 0.388), (65, 0.042), (69, 0.028), (71, 0.027), (90, 0.016)]
simIndex simValue paperId paperTitle
1 0.72021621 53 nips-2006-Combining causal and similarity-based reasoning
Author: Charles Kemp, Patrick Shafto, Allison Berke, Joshua B. Tenenbaum
Abstract: Everyday inductive reasoning draws on many kinds of knowledge, including knowledge about relationships between properties and knowledge about relationships between objects. Previous accounts of inductive reasoning generally focus on just one kind of knowledge: models of causal reasoning often focus on relationships between properties, and models of similarity-based reasoning often focus on similarity relationships between objects. We present a Bayesian model of inductive reasoning that incorporates both kinds of knowledge, and show that it accounts well for human inferences about the properties of biological species. 1
same-paper 2 0.71038997 38 nips-2006-Automated Hierarchy Discovery for Planning in Partially Observable Environments
Author: Laurent Charlin, Pascal Poupart, Romy Shioda
Abstract: Planning in partially observable domains is a notoriously difficult problem. However, in many real-world scenarios, planning can be simplified by decomposing the task into a hierarchy of smaller planning problems. Several approaches have been proposed to optimize a policy that decomposes according to a hierarchy specified a priori. In this paper, we investigate the problem of automatically discovering the hierarchy. More precisely, we frame the optimization of a hierarchical policy as a non-convex optimization problem that can be solved with general non-linear solvers, a mixed-integer non-linear approximation or a form of bounded hierarchical policy iteration. By encoding the hierarchical structure as variables of the optimization problem, we can automatically discover a hierarchy. Our method is flexible enough to allow any parts of the hierarchy to be specified based on prior knowledge while letting the optimization discover the unknown parts. It can also discover hierarchical policies, including recursive policies, that are more compact (potentially infinitely fewer parameters) and often easier to understand given the decomposition induced by the hierarchy. 1
3 0.37650055 167 nips-2006-Recursive ICA
Author: Honghao Shan, Lingyun Zhang, Garrison W. Cottrell
Abstract: Independent Component Analysis (ICA) is a popular method for extracting independent features from visual data. However, as a fundamentally linear technique, there is always nonlinear residual redundancy that is not captured by ICA. Hence there have been many attempts to try to create a hierarchical version of ICA, but so far none of the approaches have a natural way to apply them more than once. Here we show that there is a relatively simple technique that transforms the absolute values of the outputs of a previous application of ICA into a normal distribution, to which ICA maybe applied again. This results in a recursive ICA algorithm that may be applied any number of times in order to extract higher order structure from previous layers. 1
4 0.37403202 65 nips-2006-Denoising and Dimension Reduction in Feature Space
Author: Mikio L. Braun, Klaus-Robert Müller, Joachim M. Buhmann
Abstract: We show that the relevant information about a classification problem in feature space is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem. Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data to perform classification. Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and to (3) de-noise in feature space in order to yield better classification results. 1
5 0.37360761 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
Author: Rey Ramírez, Jason Palmer, Scott Makeig, Bhaskar D. Rao, David P. Wipf
Abstract: The ill-posed nature of the MEG/EEG source localization problem requires the incorporation of prior assumptions when choosing an appropriate solution out of an infinite set of candidates. Bayesian methods are useful in this capacity because they allow these assumptions to be explicitly quantified. Recently, a number of empirical Bayesian approaches have been proposed that attempt a form of model selection by using the data to guide the search for an appropriate prior. While seemingly quite different in many respects, we apply a unifying framework based on automatic relevance determination (ARD) that elucidates various attributes of these methods and suggests directions for improvement. We also derive theoretical properties of this methodology related to convergence, local minima, and localization bias and explore connections with established algorithms. 1
6 0.3733277 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
7 0.37283877 175 nips-2006-Simplifying Mixture Models through Function Approximation
8 0.37226596 154 nips-2006-Optimal Change-Detection and Spiking Neurons
9 0.37194809 8 nips-2006-A Nonparametric Approach to Bottom-Up Visual Saliency
10 0.37063435 34 nips-2006-Approximate Correspondences in High Dimensions
11 0.37010702 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation
12 0.36940989 118 nips-2006-Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields
13 0.36835572 143 nips-2006-Natural Actor-Critic for Road Traffic Optimisation
14 0.3675285 158 nips-2006-PG-means: learning the number of clusters in data
15 0.36709338 117 nips-2006-Learning on Graph with Laplacian Regularization
16 0.36488032 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis
17 0.36480838 160 nips-2006-Part-based Probabilistic Point Matching using Equivalence Constraints
18 0.36408526 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
19 0.36401799 72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model
20 0.36327133 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure