nips nips2004 nips2004-39 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Khashayar Rohanimanesh, Robert Platt, Sridhar Mahadevan, Roderic Grupen
Abstract: We investigate an approach for simultaneously committing to multiple activities, each modeled as a temporally extended action in a semi-Markov decision process (SMDP). For each activity we define a set of admissible solutions consisting of the redundant set of optimal policies, and those policies that ascend the optimal state-value function associated with them. A plan is then generated by merging them in such a way that the solutions to the subordinate activities are realized in the set of admissible solutions satisfying the superior activities. We present our theoretical results and empirically evaluate our approach in a simulated domain. 1
Reference: text
sentIndex sentText sentNum sentScore
1 For each activity we define a set of admissible solutions consisting of the redundant set of optimal policies, and those policies that ascend the optimal state-value function associated with them. [sent-10, score-0.765]
2 A plan is then generated by merging them in such a way that the solutions to the subordinate activities are realized in the set of admissible solutions satisfying the superior activities. [sent-11, score-0.479]
3 1 Introduction Many real-world planning problems involve concurrent optimization of a set of prioritized subgoals of the problem by dynamically merging a set of (previously learned) policies optimizing the subgoals. [sent-13, score-1.094]
4 A familiar example of this type of problem would be a driving task which may involve subgoals such as safely navigating the car, talking on the cell phone, and drinking coffee, with the first subgoal taking precedence over the others. [sent-14, score-0.665]
5 We assume that the agent has access to a set of learned activities modeled by a set of SMDP controllers ζ = {C1 , C2 , . [sent-18, score-0.543]
6 , Cn } each achieving a subgoal ωi from a set of subgoals Ω = {ω1 , ω2 , . [sent-21, score-0.621]
7 We further assume that the agent-environment interaction is an episodic task where at the beginning of each episode a subset of subgoals ω ⊆ Ω is introduced to the agent, and the subgoals are ranked according to some priority ranking system. [sent-25, score-0.852]
8 The agent is to merge the policies associated with the controllers into a global policy that simultaneously commits to them according to their degree of significance. [sent-26, score-1.617]
9 In general, the optimal policies of the controllers do not offer the flexibility required for the merging process. [sent-27, score-0.749]
10 Thus for every controller we also compute a set of admissible suboptimal policies that reflect the degree of flexibility we can afford in it. [sent-28, score-0.747]
11 Given a controller, an admissible policy is either an optimal policy, or it is a policy that ascends the optimal state-value function associated with the controller (i. [sent-29, score-1.313]
12 The first three actions lead the agent to states with higher values, in other words they ascend the state-value function, while action d descends it. [sent-38, score-0.545]
13 Figure 1(b) shows how introducing admissible policies enables simultaneously solving multiple subgoals. [sent-39, score-0.37]
14 In this figure, actions a and c are optimal in controllers C1 and C2 respectively, but each descends the state-value function of the other controller. [sent-40, score-0.462]
15 In this paper we formally introduce the framework of redundant controllers in terms of the set of admissible policies associated with them and present an algorithm for merging such policies given a coarticulation task. [sent-47, score-1.292]
16 We also present a set of theoretical results analyzing various properties of such controllers, and also the performance of the policy merging algorithm. [sent-48, score-0.505]
17 The theoretical results are complemented by an experimental study that illustrates the trade-offs between the degree of flexibility of controllers and the performance of the policy generated by the merging process. [sent-49, score-0.846]
18 2 Redundant Controllers In this section we introduce the framework of redundant controllers and formally define the set of admissible policies in them. [sent-50, score-0.813]
19 A subgoal option can be viewed as a closed loop controller that achieves a subgoal of some kind. [sent-52, score-0.976]
20 Formally, a subgoal option of an MDP M = ⟨S, A, P, R⟩ is defined by a tuple C = ⟨MC , I, β⟩. [sent-53, score-0.344]
21 The MDP MC = ⟨SC , AC , PC , RC ⟩ is the option MDP induced by the option C, in which SC ⊆ S, AC ⊆ A, PC is the transition probability function induced by P, and RC is chosen to reflect the subgoal of the option. [sent-54, score-0.407]
22 The policy components of such options are the solutions to the option MDPs MC associated with them. [sent-55, score-0.465]
23 For theoretical reasons, in this paper we assume that each controller optimizes a minimum cost-to-goal problem. [sent-57, score-0.464]
24 Every action in a non-goal state incurs some negative reward and the agent receives a reward of zero in goal states. [sent-60, score-0.382]
25 A controller C is a minimum cost-to-goal controller, if MC optimizes a minimum cost-to-goal problem. [sent-61, score-0.504]
26 The controller also terminates with probability one in every goal state. [sent-62, score-0.4]
27 We are now ready to formally introduce the concept of ascending policies in an MDP: Definition 1: Given an MDP M = ⟨S, A, P, R⟩, a function L : S → IR, and a deterministic policy π : S → A, let ρπ (s) = Es′∼P π(s) {L(s′)} − L(s), where Es′∼P π(s) {. [sent-63, score-0.738]
28 } is the expectation with respect to the distribution over next states s′ given the current state s and the policy π. [sent-64, score-0.44]
29 Then π is ascending on L, if for every state s (except for the goal states if the MDP models a minimum cost-to-goal problem) we have ρπ (s) > 0. [sent-65, score-0.359]
30 For an ascending policy π on a function L, the function ρπ : S → IR+ gives a strictly positive value that measures how much the policy π ascends on L in state s. [sent-66, score-0.94]
31 A deterministic policy π is descending on L, if for some state s, ρπ (s) < 0. [sent-67, score-0.438]
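To make Definition 1 concrete, the following minimal Python sketch computes ρπ(s) for every non-goal state and checks whether a policy is ascending or descending on a function L. The MDP, the function L, and the policy below are hypothetical values chosen purely for illustration; they are not taken from the paper.

```python
import numpy as np

# Hypothetical 3-state MDP with 2 actions; P[a, s, s2] is the transition probability.
P = np.array([
    [[0.1, 0.9, 0.0], [0.0, 0.2, 0.8], [0.0, 0.0, 1.0]],   # action 0
    [[0.8, 0.2, 0.0], [0.7, 0.3, 0.0], [0.0, 0.0, 1.0]],   # action 1
])
L = np.array([0.0, 1.0, 2.0])   # the function L : S -> R (e.g. an optimal state-value function)
pi = np.array([0, 0, 0])        # a deterministic policy pi : S -> A
goal_states = {2}               # goal states are exempt in a minimum cost-to-goal problem

def rho(s: int) -> float:
    """rho^pi(s) = E_{s' ~ P^{pi(s)}}[L(s')] - L(s)."""
    return float(P[pi[s], s] @ L - L[s])

non_goal = [s for s in range(len(L)) if s not in goal_states]
rhos = {s: rho(s) for s in non_goal}
if all(r > 0 for r in rhos.values()):
    print("pi is ascending on L:", rhos)        # Definition 1: rho^pi(s) > 0 everywhere
elif any(r < 0 for r in rhos.values()):
    print("pi is descending on L:", rhos)       # rho^pi(s) < 0 for some state
else:
    print("pi is neither ascending nor descending:", rhos)
```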
32 In general we would like to study how a given policy behaves with respect to the optimal value function in a problem. [sent-68, score-0.358]
33 The above condition can be interpreted as follows: we are interested in policies that on average lead to states with higher values, in other words policies that ascend the state-value function surface. [sent-72, score-0.482]
34 The minimum and maximum rates at which an ascending policy on average ascends V ∗ are given by: Definition 2: Assume that the policy π is ascending on the optimal state-value function V ∗ . [sent-74, score-1.189]
35 One problem with ascending policies is that Definition 1 ignores the immediate reward that the agent receives. [sent-78, score-0.561]
36 For example it could be the case that as a result of executing an ascending policy, the agent transitions to some state with a higher value, but receives a huge negative reward. [sent-79, score-0.42]
37 Here, ε measures how close the ascending policy π is to the optimal policy. [sent-81, score-0.5]
38 In section 3 we derive a lower bound on ε such that no policy with an ε value smaller than this bound is ascending on V ∗ (in other words, ε cannot be arbitrarily small). [sent-84, score-0.524]
39 Similarly, a deterministic policy π is called ε-ascending on C if π is ε-ascending on MC . [sent-85, score-0.354]
40 Next, we introduce the framework of redundant controllers: Definition 4: A minimum cost-to-goal controller C is an ε-redundant controller if there exist multiple deterministic policies that are either optimal, or ε-ascending on C. [sent-86, score-1.09]
41 We represent the set of such admissible policies by χC . [sent-87, score-0.37]
42 Also, the minimum ascend rate of C is defined as κ̄ = minπ∈χC κπ , where κπ is the ascend rate of a policy π ∈ χC (see Definition 2). [sent-88, score-0.772]
43 We can compute the ε-redundant set of policies for a controller C as follows. [sent-89, score-0.574]
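The construction itself is not reproduced in the extracted sentences above, so the sketch below is only one plausible reading of it: after solving the controller's MDP for V ∗, keep in every state each action whose expected one-step ascent of V ∗ is positive and within a factor ε of the best action's ascent. The function name and the exact thresholding rule are our assumptions, not the paper's definition.

```python
import numpy as np

def admissible_action_sets(P, V_star, epsilon, goal_states):
    """Per-state admissible ('epsilon-redundant') action sets -- an illustrative construction.

    P[a, s, s2] : transition probabilities of the controller's MDP.
    V_star      : optimal state-value function of that MDP.
    epsilon     : flexibility parameter in (0, 1]; epsilon = 1 keeps only greedy actions.
    """
    n_actions, n_states, _ = P.shape
    A_eps = {}
    for s in range(n_states):
        if s in goal_states:
            A_eps[s] = set(range(n_actions))   # any action is fine once the subgoal is achieved
            continue
        ascent = np.array([P[a, s] @ V_star - V_star[s] for a in range(n_actions)])
        best = ascent.max()
        A_eps[s] = {a for a in range(n_actions)
                    if ascent[a] > 0 and ascent[a] >= epsilon * best}
    return A_eps
```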
44 Next, we present an algorithm for merging policies associated with a set of prioritized redundant controllers that run in parallel. [sent-91, score-0.933]
45 For specifying the order of priority among the controllers we use the expression Cj ≺ Ci , where the relation "≺" expresses the subject-to relation (taken from [3]). [sent-92, score-0.455]
46 This expression should read: controller Cj is subject to controller Ci . [sent-93, score-0.702]
47 Without loss of generality we assume that the controllers are prioritized based on the following ranking system: {Cj ≺ Ci | i < j}. [sent-95, score-0.477]
48 , Cm ) 1: Input: current state s; the set of controllers Ci ; the redundant-sets A^{Ci}_{εi}(s) for every controller Ci . [sent-100, score-0.782]
49 In the above algorithm, Λi (s) represents the ordered intersection of the redundant-sets A^{Cj}_{εj} up to the controller Ci (i. [sent-107, score-0.351]
50 In other words, each set Λi (s) contains a set of actions in state s that are all εi-ascending with respect to the superior controllers C1 , C2 , . [sent-110, score-0.527]
51 This happens when none of the actions with respect to some controller Cj (i. [sent-115, score-0.435]
52 In this case the algorithm skips the controller Cj and continues the search in the redundant-sets of the remaining subordinate controllers. [sent-118, score-0.43]
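The merging step described in the surrounding sentences can be sketched as follows (a minimal Python reconstruction; the data structures and the final tie-breaking rule are our assumptions):

```python
def merge_step(s, admissible_sets):
    """Compute the ordered intersection Lambda(s) over prioritized controllers.

    admissible_sets : list ordered by priority (superior controller first); each entry
                      maps a state s to that controller's set of admissible actions.
    A controller whose admissible set would make the running intersection empty is
    skipped, and the search continues with the remaining subordinate controllers.
    """
    lam = set(admissible_sets[0][s])          # Lambda_1(s): the superior controller
    for A_j in admissible_sets[1:]:
        narrowed = lam & set(A_j[s])
        if narrowed:
            lam = narrowed                    # coarticulate with this controller as well
        # else: skip this controller and keep searching in the remaining redundant-sets
    return lam

# Usage sketch: pick any action from merge_step(s, sets), e.g. the one the superior
# controller ranks highest -- the tie-breaking rule here is an assumption on our part.
```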
53 In the next section, we theoretically analyze redundant controllers and the performance of the policy merging algorithm in various situations. [sent-120, score-0.911]
54 3 Theoretical Results In this section we present some of our theoretical results characterizing ε-redundant controllers, in terms of the bounds on the number of time steps it takes for a controller to complete its task, and the performance of the policy merging algorithm. [sent-121, score-0.891]
55 In section 2 we stated that there is a lower bound on ε such that there exists no ε-ascending policy for values of ε smaller than this bound. [sent-123, score-0.356]
56 In the first theorem we compute this lower bound: Theorem 1 Let M = ⟨S, A, P, R⟩ be a minimum cost-to-goal MDP and let π be an ε-ascending policy defined on M; then ε admits a lower bound expressed in terms of |V ∗ |. [sent-124, score-0.417]
57 Such a lower bound characterizes the maximum flexibility we can afford in a redundant controller and gives us insight into the range of values that we can choose for ε. [sent-126, score-0.46]
58 In the second theorem we derive an upper bound on the expected number of steps that a minimum cost-to-goal controller takes to complete when executing an ε-ascending policy: Theorem 2 Let C be an ε-ascending minimum cost-to-goal controller and let s denote the current state of the controller. [sent-127, score-1.026]
59 Then any ε-ascending policy π on C will terminate the controller in some goal state with probability one. [sent-128, score-0.792]
60 Furthermore, termination occurs on average in at most −V ∗ (s)/κπ steps, where κπ is the guaranteed expected ascend rate of the policy π. [sent-129, score-0.543]
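As a purely illustrative numeric reading of this bound (the numbers are made up): if the optimal value of the current state is V ∗ (s) = −12 under unit step costs and the policy's guaranteed ascend rate is κπ = 0.5, then Theorem 2 bounds the expected number of steps to termination by −V ∗ (s)/κπ = 24.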
61 This result assures that the controller arrives in a goal state and will achieve its goal in a bounded number of steps. [sent-130, score-0.487]
62 We use this result when studying performance of running multiple redundant controllers in parallel. [sent-131, score-0.424]
63 Moreover, if ∀s ∈ S, A^{C1}_{ε1}(s) ∩ A^{C2}_{ε2}(s) ≠ ∅, policy π will be ascending on both controllers with an ascend rate of at least κπ = min{κπ1 , κπ2 }. [sent-133, score-0.664]
64 This theorem states that merging the policies of two controllers using Algorithm MergeController generates a policy that remains ε1 -ascending on the superior controller. [sent-134, score-1.164]
65 In the next theorem, we establish bounds on the expected number of steps that it takes for the policy obtained by Algorithm MergeController to achieve a set of prioritized subgoals ω = {ω1 , . [sent-136, score-0.809]
66 , ωm } by concurrently executing the associated controllers {C1 , . [sent-139, score-0.443]
67 Let π denote the policy obtained by Algorithm MergeController based on the ranking system {Cj ≺ Ci | i < j}. [sent-149, score-0.692]
68 Let µζ (s) denote the expected number of steps for the policy π to achieve all the subgoals {ω1 , ω2 , . [sent-150, score-0.705]
69 Then the following expression holds: max_i [ −V_i^∗(s) / η_i^π ] ≤ µζ (s) ≤ Σ_{h∈H} P(h) Σ_{i=1}^{m} [ −V_i^∗(h(i)) / κ̄_i ]  (1), where η_i^π is the maximum achievable expected ascend rate for the controller Ci (see Definition 2), and H is the set of sequences h = s, g1 , g2 , . [sent-154, score-0.543]
70 , gm in which gi is a goal state in controller Ci (i. [sent-157, score-0.532]
71 The probability distribution P(h) = P^{C1}_{s g1} ∏_{i=2}^{m} P^{Ci}_{g_{i−1} g_i} over sequences h ∈ H gives the probability of executing the set of controllers in sequence based on the order of priority, starting in state s and observing the goal-state sequence g1 , . [sent-160, score-0.724]
72 Based on Theorem 3, when Algorithm MergeController always finds a policy π that optimizes all controllers (i. [sent-164, score-0.708]
73 , ∀s ∈ S, ∩_{i=1}^{m} A^{Ci}_{εi}(s) ≠ ∅), policy π will ascend on all controllers. [sent-166, score-0.522]
74 Thus on average the total time for all controllers to terminate equals the time required for the controller that takes the most time to complete, which has the lower bound max_i [ −V_i^∗(s) / η_i^π ]. [sent-167, score-0.763]
75 The worst case happens when the policy π generated by Algorithm MergeController cannot optimize more than one controller at a time. [sent-168, score-0.681]
76 In this case π always optimizes the controller with the highest priority until its termination, then optimizes the second highest priority controller and continues this process to the end in a sequential manner. [sent-169, score-1.062]
77 The right hand side of the inequality given by Equation 1 gives an upper bound for the expected time required for all controllers to complete when they are executed sequentially. [sent-170, score-0.385]
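A minimal sketch of how the two sides of Equation 1 could be evaluated numerically is given below. All inputs are hypothetical, and each controller is assumed to terminate in a single, fixed goal state, so the set H collapses to one sequence with P(h) = 1; this is a simplifying assumption, not the general case.

```python
def concurrent_lower_bound(V_star_at_s, eta):
    """max_i -V_i^*(s) / eta_i^pi : best case, all controllers ascend together."""
    return max(-v / e for v, e in zip(V_star_at_s, eta))

def sequential_upper_bound(V_star_along_h, kappa_bar):
    """sum_i -V_i^*(h(i)) / kappa_bar_i : worst case, controllers optimized one at a time."""
    return sum(-v / k for v, k in zip(V_star_along_h, kappa_bar))

# Hypothetical numbers for four controllers with unit step costs:
print(concurrent_lower_bound([-10, -6, -8, -4], [1.0, 1.0, 1.0, 1.0]))    # 10.0
print(sequential_upper_bound([-10, -7, -9, -5], [0.8, 0.8, 0.8, 0.8]))    # 38.75
```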
78 4 Experiments In this section we present our experimental results analyzing redundant controllers and the policy merging algorithm described in section 2. [sent-172, score-0.911]
79 Figure 2(a) shows a 10 × 10 grid world where an agent is to visit a set of prioritized locations marked by G1 , . [sent-173, score-0.341]
80 The agent’s goal is to achieve all of the subgoals by focusing on superior subgoals and coarticulating with the subordinate ones. [sent-177, score-0.8]
81 Intuitively, when the agent is navigating to some subgoal Gi of higher priority, if some subgoal of lower priority Gj is en route to Gi , or not too far off the optimal path to Gi , the agent may choose to visit Gj . [sent-178, score-1.133]
82 Figure 2: (a) a 10 × 10 grid world where an agent is to visit a set of prioritized subgoal locations; (b) the optimal policy associated with the subgoal G1 ; (c) the ε-ascending policy for ε = 0. We model this problem by an MDP [sent-179, score-1.583]
83 M = ⟨S, A, R, P⟩, where S is the set of states consisting of 100 locations in the room, and A is the set of actions consisting of eight stochastic navigation actions (four actions in the compass directions, and four diagonal actions). [sent-182, score-0.373]
84 We assume that the agent has access to a set of controllers C1 , . [sent-188, score-0.341]
85 A controller Ci is a minimum cost-to-goal subgoal option Ci = ⟨MCi , I, β⟩, where MCi = M, the initiation set I includes every location except for the subgoal location, and β forces the option to terminate only in the subgoal location. [sent-195, score-1.435]
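A subgoal controller of this kind can be reconstructed along the following lines (our sketch, not the paper's code: the action model is simplified to four deterministic compass moves with unit cost, and the goal cell is arbitrary):

```python
N = 10                                        # 10 x 10 grid; states are (row, col) cells
GOAL = (0, 9)                                 # hypothetical subgoal location G1
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # compass moves only, deterministic here

def step(s, a):
    r, c = s
    dr, dc = MOVES[a]
    return (min(max(r + dr, 0), N - 1), min(max(c + dc, 0), N - 1))

def value_iteration(goal, n_sweeps=200):
    """Minimum cost-to-goal values: each move costs -1, the goal cell is absorbing at 0."""
    V = {(r, c): 0.0 for r in range(N) for c in range(N)}
    for _ in range(n_sweeps):
        for s in V:
            if s != goal:
                V[s] = max(-1.0 + V[step(s, a)] for a in range(len(MOVES)))
    return V

V1 = value_iteration(GOAL)

def ascending_actions(s, V, goal):
    """Actions that strictly increase V -- the flavor of admissible actions used here."""
    return [] if s == goal else [a for a in range(len(MOVES)) if V[step(s, a)] > V[s]]

print(V1[(9, 0)], ascending_actions((9, 0), V1, GOAL))   # -18.0 and the two useful moves
```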
86 Figures 2(b)-(d) show examples of admissible policies for subgoal G1 : Figure 2(b) shows the optimal policy of the controller C1 (navigating the agent to the location G1 ). [sent-196, score-1.52]
87 Note that by reducing ε, we obtain a larger, though less optimal, set of admissible policies. [sent-200, score-0.37]
88 At the beginning of each episode the agent is placed in a random location, and a fixed number of subgoals (in our experiments m = 4) are randomly selected. [sent-203, score-0.526]
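The evaluation loop implied by this setup might look like the following (a reconstruction; the planner argument is a placeholder that would wrap either the merging algorithm or a one-controller-at-a-time sequential baseline):

```python
import random

def run_episode(planner, all_goal_cells, m=4, grid=10):
    """One episode: random start, m randomly chosen prioritized subgoals."""
    start = (random.randrange(grid), random.randrange(grid))
    goals = random.sample(all_goal_cells, m)      # list order encodes priority
    return planner(start, goals)                  # planner returns a step count

def average_steps(planner, all_goal_cells, n_episodes=100):
    return sum(run_episode(planner, all_goal_cells) for _ in range(n_episodes)) / n_episodes
```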
89 The concurrent planning method consistently outperforms sequential planning in all starting locations. [sent-207, score-0.434]
90 Figure 3: (a) Performance of both planning methods in terms of the average number of steps in every starting state; (b) Performance of the concurrent method for different values of ε. [sent-218, score-0.373]
91 For the same task, we measure how the performance of the concurrent method varies as we vary ε when computing the set of ε-ascending policies for every subgoal. [sent-219, score-0.414]
92 Figure 3(b) shows the performance of the concurrent method and Figure 4(a) shows the average number of subgoals coarticulated by the agent (averaged over all states) for different values of ε. [sent-220, score-0.762]
93 Note that for ε = 1, the only admissible policy is the optimal policy and thus it does not offer much flexibility with respect to the other subgoals. [sent-226, score-0.835]
94 This can be seen in Figure 3(b), in which the policy generated by the merging algorithm for ε = 1. [sent-227, score-0.487]
95 However, the more we reduce ε, the less optimal the admissible policies we obtain. [sent-230, score-0.398]
96 Figure 4(a) also shows that by relaxing optimality (reducing ε), the policy generated by the merging algorithm commits to more subgoals simultaneously. [sent-233, score-0.878]
97 We set ε = 0.9 and varied the number of subgoals from m = 2 to m = 50 (all of these results are averaged over 100 episodes, each consisting of 10 trials). [sent-235, score-0.361]
98 It can be observed that the concurrent method consistently outperforms the sequential method as the number of subgoals increases (the top curve shows the performance of the sequential method and the bottom curve that of the concurrent method). [sent-237, score-0.746]
99 The concurrent method is able to visit multiple subgoals of lower priority en route to the primary subgoals, and thus it can save more time. [sent-258, score-0.519]
100 We are currently investigating compact representations of the set of admissible policies by exploiting the structure of actions. [sent-262, score-0.37]
wordName wordTfidf (topN-words)
[('controller', 0.351), ('controllers', 0.341), ('subgoals', 0.34), ('policy', 0.33), ('subgoal', 0.281), ('policies', 0.223), ('ascend', 0.192), ('mergecontroller', 0.178), ('concurrent', 0.165), ('agent', 0.16), ('merging', 0.157), ('admissible', 0.147), ('ascending', 0.142), ('sg', 0.118), ('mdp', 0.117), ('priority', 0.114), ('planning', 0.105), ('prioritized', 0.104), ('actions', 0.084), ('redundant', 0.083), ('ci', 0.082), ('ascends', 0.074), ('coarticulation', 0.074), ('amherst', 0.07), ('state', 0.064), ('action', 0.063), ('option', 0.063), ('acii', 0.059), ('subordinate', 0.059), ('minimum', 0.058), ('massachusetts', 0.055), ('executing', 0.054), ('commits', 0.051), ('gm', 0.051), ('states', 0.046), ('erent', 0.046), ('navigating', 0.044), ('visit', 0.044), ('cj', 0.043), ('gi', 0.043), ('activities', 0.042), ('exibility', 0.04), ('mc', 0.039), ('sequential', 0.038), ('superior', 0.038), ('optimizes', 0.037), ('reward', 0.036), ('steps', 0.035), ('cm', 0.035), ('locations', 0.033), ('ranking', 0.032), ('ac', 0.031), ('acjj', 0.03), ('coarticulated', 0.03), ('descend', 0.03), ('epsilon', 0.03), ('grupen', 0.03), ('mahadevan', 0.03), ('mci', 0.03), ('mins', 0.03), ('vmax', 0.03), ('options', 0.029), ('theorem', 0.029), ('di', 0.029), ('platt', 0.028), ('optimal', 0.028), ('nition', 0.027), ('every', 0.026), ('redundancy', 0.026), ('episode', 0.026), ('episodes', 0.026), ('assures', 0.026), ('ord', 0.026), ('rohanimanesh', 0.026), ('smdp', 0.026), ('subtasks', 0.026), ('bound', 0.026), ('associated', 0.025), ('terminate', 0.024), ('deterministic', 0.024), ('concurrently', 0.023), ('maxs', 0.023), ('goal', 0.023), ('consisting', 0.021), ('route', 0.021), ('average', 0.021), ('starting', 0.021), ('descending', 0.02), ('abstraction', 0.02), ('continues', 0.02), ('formally', 0.019), ('robotics', 0.018), ('sc', 0.018), ('objectives', 0.018), ('executed', 0.018), ('solutions', 0.018), ('decision', 0.018), ('theoretical', 0.018), ('gj', 0.017), ('ir', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 39 nips-2004-Coarticulation in Markov Decision Processes
Author: Khashayar Rohanimanesh, Robert Platt, Sridhar Mahadevan, Roderic Grupen
Abstract: We investigate an approach for simultaneously committing to multiple activities, each modeled as a temporally extended action in a semi-Markov decision process (SMDP). For each activity we define a set of admissible solutions consisting of the redundant set of optimal policies, and those policies that ascend the optimal state-value function associated with them. A plan is then generated by merging them in such a way that the solutions to the subordinate activities are realized in the set of admissible solutions satisfying the superior activities. We present our theoretical results and empirically evaluate our approach in a simulated domain. 1
2 0.21436593 64 nips-2004-Experts in a Markov Decision Process
Author: Eyal Even-dar, Sham M. Kakade, Yishay Mansour
Abstract: We consider an MDP setting in which the reward function is allowed to change during each time step of play (possibly in an adversarial manner), yet the dynamics remain fixed. Similar to the experts setting, we address the question of how well can an agent do when compared to the reward achieved under the best stationary policy over time. We provide efficient algorithms, which have regret bounds with no dependence on the size of state space. Instead, these bounds depend only on a certain horizon time of the process and logarithmically on the number of actions. We also show that in the case that the dynamics change over time, the problem becomes computationally hard. 1
3 0.18508604 202 nips-2004-VDCBPI: an Approximate Scalable Algorithm for Large POMDPs
Author: Pascal Poupart, Craig Boutilier
Abstract: Existing algorithms for discrete partially observable Markov decision processes can at best solve problems of a few thousand states due to two important sources of intractability: the curse of dimensionality and the policy space complexity. This paper describes a new algorithm (VDCBPI) that mitigates both sources of intractability by combining the Value Directed Compression (VDC) technique [13] with Bounded Policy Iteration (BPI) [14]. The scalability of VDCBPI is demonstrated on synthetic network management problems with up to 33 million states.
4 0.16437405 24 nips-2004-Approximately Efficient Online Mechanism Design
Author: David C. Parkes, Dimah Yanovsky, Satinder P. Singh
Abstract: Online mechanism design (OMD) addresses the problem of sequential decision making in a stochastic environment with multiple self-interested agents. The goal in OMD is to make value-maximizing decisions despite this self-interest. In previous work we presented a Markov decision process (MDP)-based approach to OMD in large-scale problem domains. In practice the underlying MDP needed to solve OMD is too large and hence the mechanism must consider approximations. This raises the possibility that agents may be able to exploit the approximation for selfish gain. We adopt sparse-sampling-based MDP algorithms to implement efficient policies, and retain truth-revelation as an approximate BayesianNash equilibrium. Our approach is empirically illustrated in the context of the dynamic allocation of WiFi connectivity to users in a coffeehouse. 1
5 0.1564883 88 nips-2004-Intrinsically Motivated Reinforcement Learning
Author: Nuttapong Chentanez, Andrew G. Barto, Satinder P. Singh
Abstract: Psychologists call behavior intrinsically motivated when it is engaged in for its own sake rather than as a step toward solving a specific problem of clear practical value. But what we learn during intrinsically motivated behavior is essential for our development as competent autonomous entities able to efficiently solve a wide range of practical problems as they arise. In this paper we present initial results from a computational study of intrinsically motivated reinforcement learning aimed at allowing artificial agents to construct and extend hierarchies of reusable skills that are needed for competent autonomy. 1
6 0.1292827 1 nips-2004-A Cost-Shaping LP for Bellman Error Minimization with Performance Guarantees
7 0.12804098 123 nips-2004-Multi-agent Cooperation in Diverse Population Games
8 0.12715937 175 nips-2004-Stable adaptive control with online learning
9 0.10724667 147 nips-2004-Planning for Markov Decision Processes with Sparse Stochasticity
10 0.1030542 154 nips-2004-Resolving Perceptual Aliasing In The Presence Of Noisy Sensors
11 0.097753398 102 nips-2004-Learning first-order Markov models for control
12 0.070405081 155 nips-2004-Responding to Modalities with Different Latencies
13 0.063104272 129 nips-2004-New Criteria and a New Algorithm for Learning in Multi-Agent Systems
14 0.060688544 65 nips-2004-Exploration-Exploitation Tradeoffs for Experts Algorithms in Reactive Environments
15 0.048887085 33 nips-2004-Brain Inspired Reinforcement Learning
16 0.046756893 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection
17 0.044855375 171 nips-2004-Solitaire: Man Versus Machine
18 0.04304748 183 nips-2004-Temporal-Difference Networks
19 0.04184166 62 nips-2004-Euclidean Embedding of Co-Occurrence Data
20 0.034421414 87 nips-2004-Integrating Topics and Syntax
topicId topicWeight
[(0, -0.114), (1, -0.036), (2, 0.301), (3, -0.072), (4, -0.2), (5, 0.236), (6, -0.046), (7, -0.064), (8, -0.072), (9, 0.025), (10, 0.015), (11, -0.023), (12, 0.004), (13, 0.049), (14, -0.013), (15, 0.024), (16, -0.061), (17, -0.026), (18, -0.032), (19, 0.009), (20, 0.025), (21, -0.028), (22, -0.076), (23, 0.138), (24, -0.061), (25, -0.007), (26, 0.113), (27, -0.112), (28, 0.082), (29, -0.007), (30, 0.116), (31, -0.068), (32, -0.043), (33, -0.063), (34, 0.018), (35, -0.023), (36, -0.007), (37, 0.036), (38, 0.012), (39, 0.029), (40, 0.111), (41, 0.021), (42, 0.052), (43, 0.015), (44, -0.02), (45, -0.002), (46, 0.014), (47, 0.027), (48, 0.003), (49, 0.007)]
simIndex simValue paperId paperTitle
same-paper 1 0.97279 39 nips-2004-Coarticulation in Markov Decision Processes
Author: Khashayar Rohanimanesh, Robert Platt, Sridhar Mahadevan, Roderic Grupen
Abstract: We investigate an approach for simultaneously committing to multiple activities, each modeled as a temporally extended action in a semi-Markov decision process (SMDP). For each activity we define a set of admissible solutions consisting of the redundant set of optimal policies, and those policies that ascend the optimal state-value function associated with them. A plan is then generated by merging them in such a way that the solutions to the subordinate activities are realized in the set of admissible solutions satisfying the superior activities. We present our theoretical results and empirically evaluate our approach in a simulated domain. 1
2 0.77791846 24 nips-2004-Approximately Efficient Online Mechanism Design
Author: David C. Parkes, Dimah Yanovsky, Satinder P. Singh
Abstract: Online mechanism design (OMD) addresses the problem of sequential decision making in a stochastic environment with multiple self-interested agents. The goal in OMD is to make value-maximizing decisions despite this self-interest. In previous work we presented a Markov decision process (MDP)-based approach to OMD in large-scale problem domains. In practice the underlying MDP needed to solve OMD is too large and hence the mechanism must consider approximations. This raises the possibility that agents may be able to exploit the approximation for selfish gain. We adopt sparse-sampling-based MDP algorithms to implement efficient policies, and retain truth-revelation as an approximate BayesianNash equilibrium. Our approach is empirically illustrated in the context of the dynamic allocation of WiFi connectivity to users in a coffeehouse. 1
3 0.73946178 147 nips-2004-Planning for Markov Decision Processes with Sparse Stochasticity
Author: Maxim Likhachev, Sebastian Thrun, Geoffrey J. Gordon
Abstract: Planning algorithms designed for deterministic worlds, such as A* search, usually run much faster than algorithms designed for worlds with uncertain action outcomes, such as value iteration. Real-world planning problems often exhibit uncertainty, which forces us to use the slower algorithms to solve them. Many real-world planning problems exhibit sparse uncertainty: there are long sequences of deterministic actions which accomplish tasks like moving sensor platforms into place, interspersed with a small number of sensing actions which have uncertain outcomes. In this paper we describe a new planning algorithm, called MCP (short for MDP Compression Planning), which combines A* search with value iteration for solving Stochastic Shortest Path problem in MDPs with sparse stochasticity. We present experiments which show that MCP can run substantially faster than competing planners in domains with sparse uncertainty; these experiments are based on a simulation of a ground robot cooperating with a helicopter to fill in a partial map and move to a goal location. In deterministic planning problems, optimal paths are acyclic: no state is visited more than once. Because of this property, algorithms like A* search can guarantee that they visit each state in the state space no more than once. By visiting the states in an appropriate order, it is possible to ensure that we know the exact value of all of a state’s possible successors before we visit that state; so, the first time we visit a state we can compute its correct value. By contrast, if actions have uncertain outcomes, optimal paths may contain cycles: some states will be visited two or more times with positive probability. Because of these cycles, there is no way to order states so that we determine the values of a state’s successors before we visit the state itself. Instead, the only way to compute state values is to solve a set of simultaneous equations. In problems with sparse stochasticity, only a small fraction of all states have uncertain outcomes. It is these few states that cause all of the cycles: while a deterministic state s may participate in a cycle, the only way it can do so is if one of its successors has an action with a stochastic outcome (and only if this stochastic action can lead to a predecessor of s). In such problems, we would like to build a smaller MDP which contains only states which are related to stochastic actions. We will call such an MDP a compressed MDP, and we will call its states distinguished states. We could then run fast algorithms like A* search to plan paths between distinguished states, and reserve slower algorithms like value iteration for deciding how to deal with stochastic outcomes. (a) Segbot (d) Planning map (b) Robotic helicopter (e) Execution simulation (c) 3D Map Figure 1: Robot-Helicopter Coordination There are two problems with such a strategy. First, there can be a large number of states which are related to stochastic actions, and so it may be impractical to enumerate all of them and make them all distinguished states; we would prefer instead to distinguish only states which are likely to be encountered while executing some policy which we are considering. Second, there can be a large number of ways to get from one distinguished state to another: edges in the compressed MDP correspond to sequences of actions in the original MDP. 
If we knew the values of all of the distinguished states exactly, then we could use A* search to generate optimal paths between them, but since we do not we cannot. In this paper, we will describe an algorithm which incrementally builds a compressed MDP using a sequence of deterministic searches. It adds states and edges to the compressed MDP only by encountering them along trajectories; so, it never adds irrelevant states or edges to the compressed MDP. Trajectories are generated by deterministic search, and so undistinguished states are treated only with fast algorithms. Bellman errors in the values for distinguished states show us where to try additional trajectories, and help us build the relevant parts of the compressed MDP as quickly as possible. 1 Robot-Helicopter Coordination Problem The motivation for our research was the problem of coordinating a ground robot and a helicopter. The ground robot needs to plan a path from its current location to a goal, but has only partial knowledge of the surrounding terrain. The helicopter can aid the ground robot by flying to and sensing places in the map. Figure 1(a) shows our ground robot, a converted Segway with a SICK laser rangefinder. Figure 1(b) shows the helicopter, also with a SICK. Figure 1(c) shows a 3D map of the environment in which the robot operates. The 3D map is post-processed to produce a discretized 2D environment (Figure 1(d)). Several places in the map are unknown, either because the robot has not visited them or because their status may have changed (e.g, a car may occupy a driveway). Such places are shown in Figure 1(d) as white squares. The elevation of each white square is proportional to the probability that there is an obstacle there; we assume independence between unknown squares. The robot must take the unknown locations into account when planning for its route. It may plan a path through these locations, but it risks having to turn back if its way is blocked. Alternately, the robot can ask the helicopter to fly to any of these places and sense them. We assign a cost to running the robot, and a somewhat higher cost to running the helicopter. The planning task is to minimize the expected overall cost of running the robot and the helicopter while getting the robot to its destination and the helicopter back to its home base. Figure 1(e) shows a snapshot of the robot and helicopter executing a policy. Designing a good policy for the robot and helicopter is a POMDP planning problem; unfortunately POMDPs are in general difficult to solve (PSPACE-complete [7]). In the POMDP representation, a state is the position of the robot, the current location of the helicopter (a point on a line segment from one of the unknown places to another unknown place or the home base), and the true status of each unknown location. The positions of the robot and the helicopter are observable, so that the only hidden variables are whether each unknown place is occupied. The number of states is (# of robot locations)×(# of helicopter locations)×2# of unknown places . So, the number of states is exponential in the number of unknown places and therefore quickly becomes very large. We approach the problem by planning in the belief state space, that is, the space of probability distributions over world states. This problem is a continuous-state MDP; in this belief MDP, our state consists of the ground robot’s location, the helicopter’s location, and a probability of occupancy for each unknown location. 
We will discretize the continuous probability variables by breaking the interval [0, 1] into several chunks; so, the number of belief states is exponential in the number of unknown places, and classical algorithms such as value iteration are infeasible even on small problems. If sensors are perfect, this domain is acyclic: after we sense a square we know its true state forever after. On the other hand, imperfect sensors can lead to cycles: new sensor data can contradict older sensor data and lead to increased uncertainty. With or without sensor noise, our belief state MDP differs from general MDPs because its stochastic transitions are sparse: large portions of the policy (while the robot and helicopter are traveling between unknown locations) are deterministic. The algorithm we propose in this paper takes advantage of this property of the problem, as we explain in the next section. 2 The Algorithm Our algorithm can be broken into two levels. At a high level, it constructs a compressed MDP, denoted M c , which contains only the start, the goal, and some states which are outcomes of stochastic actions. At a lower level, it repeatedly runs deterministic searches to find new information to put into M c . This information includes newly-discovered stochastic actions and their outcomes; better deterministic paths from one place to another; and more accurate value estimates similar to Bellman backups. The deterministic searches can use an admissible heuristic h to focus their effort, so we can often avoid putting many irrelevant actions into M c . Because M c will often be much smaller than M , we can afford to run stochastic planning algorithms like value iteration on it. On the other hand, the information we get by planning in M c will improve the heuristic values that we use in our deterministic searches; so, the deterministic searches will tend to visit only relevant portions of the state space. 2.1 Constructing and Solving a Compressed MDP Each action in the compressed MDP represents several consecutive actions in M : if we see a sequence of states and actions s1 , a1 , s2 , a2 , . . . , sk , ak where a1 through ak−1 are deterministic but ak is stochastic, then we can represent it in M c with a single action a, available at s1 , whose outcome distribution is P (s | sk , ak ) and whose cost is k−1 c(si , ai , si+1 ) + c(sk , ak , s ) c(s1 , a, s ) = i=1 (See Figure 2(a) for an example of such a compressed action.) In addition, if we see a sequence of deterministic actions ending in sgoal , say s1 , a1 , s2 , a2 , . . . , sk , ak , sk+1 = sgoal , k we can define a compressed action which goes from s1 to sgoal at cost i=1 c(si , ai , si+1 ). We can label each compressed action that starts at s with (s, s , a) (where a = null if s = sgoal ). Among all compressed actions starting at s and ending at (s , a) there is (at least) one with lowest expected cost; we will call such an action an optimal compression of (s, s , a). Write Astoch for the set of all pairs (s, a) such that action a when taken from state s has more than one possible outcome, and include as well (sgoal , null). Write Sstoch for the states which are possible outcomes of the actions in Astoch , and include sstart as well. If we include in our compressed MDP an optimal compression of (s, s , a) for every s ∈ Sstoch and every (s , a) ∈ Astoch , the result is what we call the full compressed MDP; an example is in Figure 2(b). 
If we solve the full compressed MDP, the value of each state will be the same as the value of the corresponding state in M . However, we do not need to do that much work: (a) action compression (b) full MDP compression (c) incremental MDP compression Figure 2: MDP compression Main() 01 initialize M c with sstart and sgoal and set their v-values to 0; 02 while (∃s ∈ M c s.t. RHS(s) − v(s) > δ and s belongs to the current greedy policy) 03 select spivot to be any such state s; 04 [v; vlim ] = Search(spivot ); 05 v(spivot ) = v; 06 set the cost c(spivot , a, sgoal ) of the limit action a from spivot to vlim ; ¯ ¯ 07 optionally run some algorithm satisfying req. A for a bounded amount of time to improve the value function in M c ; Figure 3: MCP main loop many states and actions in the full compressed MDP are irrelevant since they do not appear in the optimal policy from sstart to sgoal . So, the goal of the MCP algorithm will be to construct only the relevant part of the compressed MDP by building M c incrementally. Figure 2(c) shows the incremental construction of a compressed MDP which contains all of the stochastic states and actions along an optimal policy in M . The pseudocode for MCP is given in Figure 3. It begins by initializing M c to contain only sstart and sgoal , and it sets v(sstart ) = v(sgoal ) = 0. It maintains the invariant that 0 ≤ v(s) ≤ v ∗ (s) for all s. On each iteration, MCP looks at the Bellman error of each of the states in M c . The Bellman error is v(s) − RHS(s), where RHS(s) = min RHS(s, a) a∈A(s) RHS(s, a) = Es ∈succ(s,a) (c(s, a, s ) + v(s )) By convention the min of an empty set is ∞, so an s which does not have any compressed actions yet is considered to have infinite RHS. MCP selects a state with negative Bellman error, spivot , and starts a search at that state. (We note that there exist many possible ways to select spivot ; for example, we can choose the state with the largest negative Bellman error, or the largest error when weighted by state visitation probabilities in the best policy in M c .) The goal of this search is to find a new compressed action a such that its RHS-value can provide a new lower bound on v ∗ (spivot ). This action can either decrease the current RHS(spivot ) (if a seems to be a better action in terms of the current v-values of action outcomes) or prove that the current RHS(spivot ) is valid. Since v(s ) ≤ v ∗ (s ), one way to guarantee that RHS(spivot , a) ≤ v ∗ (spivot ) is to compute an optimal compression of (spivot , s, a) for all s, a, then choose the one with the smallest RHS. A more sophisticated strategy is to use an A∗ search with appropriate safeguards to make sure we never overestimate the value of a stochastic action. MCP, however, uses a modified A∗ search which we will describe in the next section. As the search finds new compressed actions, it adds them and their outcomes to M c . It is allowed to initialize newly-added states to any admissible values. When the search returns, MCP sets v(spivot ) to the returned value. This value is at least as large as RHS(spivot ). Consequently, Bellman error for spivot becomes non-negative. In addition to the compressed action and the updated value, the search algorithm returns a “limit value” vlim (spivot ). These limit values allow MCP to run a standard MDP planning algorithm on M c to improve its v(s) estimates. 
MCP can use any planning algorithm which guarantees that, for any s, it will not lower v(s) and will not increase v(s) beyond the smaller of vlim (s) and RHS(s) (Requirement A). For example, we could insert a fake “limit action” into M c which goes directly from spivot to sgoal at cost vlim (spivot ) (as we do on line 06 in Figure 3), then run value iteration for a fixed amount of time, selecting for each backup a state with negative Bellman error. After updating M c from the result of the search and any optional planning, MCP begins again by looking for another state with a negative Bellman error. It repeats this process until there are no negative Bellman errors larger than δ. For small enough δ, this property guarantees that we will be able to find a good policy (see section 2.3). 2.2 Searching the MDP Efficiently The top level algorithm (Figure 3) repeatedly invokes a search method for finding trajectories from spivot to sgoal . In order for the overall algorithm to work correctly, there are several properties that the search must satisfy. First, the estimate v that search returns for the expected cost of spivot should always be admissible. That is, 0 ≤ v ≤ v ∗ (spivot ) (Property 1). Second, the estimate v should be no less than the one-step lookahead value of spivot in M c . That is, v ≥ RHS(spivot ) (Property 2). This property ensures that search either increases the value of spivot or finds additional (or improved) compressed actions. The third and final property is for the vlim value, and it is only important if MCP uses its optional planning step (line 07). The property is that v ≤ vlim ≤ v ∗ (spivot ) (Property 3). Here v ∗ (spivot ) denotes the minimum expected cost of starting at spivot , picking a compressed action not in M c , and acting optimally from then on. (Note that v ∗ can be larger than v ∗ if the optimal compressed action is already part of M c .) Property 3 uses v ∗ rather than v ∗ since the latter is not known while it is possible to compute a lower bound on the former efficiently (see below). One could adapt A* search to satisfy at least Properties 1 and 2 by assuming that we can control the outcome of stochastic actions. However, this sort of search is highly optimistic and can bias the search towards improbable trajectories. Also, it can only use heuristics which are even more optimistic than it is: that is, h must be admissible with respect to the optimistic assumption of controlled outcomes. We therefore present a version of A*, called MCP-search (Figure 4), that is more efficient for our purposes. MCP-search finds the correct expected value for the first stochastic action it encounters on any given trajectory, and is therefore far less optimistic. And, MCP-search only requires heuristic values to be admissible with respect to v ∗ values, h(s) ≤ v ∗ (s). Finally, MCP-search speeds up repetitive searches by improving heuristic values based on previous searches. A* maintains a priority queue, OPEN, of states which it plans to expand. The OPEN queue is sorted by f (s) = g(s)+h(s), so that A* always expands next a state which appears to be on the shortest path from start to goal. During each expansion a state s is removed from OPEN and all the g-values of s’s successors are updated; if g(s ) is decreased for some state s , A* inserts s into OPEN. A* terminates as soon as the goal state is expanded. We use the variant of A* with pathmax [5] to use efficiently heuristics that do not satisfy the triangle inequality. 
MCP is similar to A∗ , but the OPEN list can also contain state-action pairs {s, a} where a is a stochastic action (line 31). Plain states are represented in OPEN as {s, null}. Just ImproveHeuristic(s) 01 if s ∈ M c then h(s) = max(h(s), v(s)); 02 improve heuristic h(s) further if possible using f best and g(s) from previous iterations; procedure fvalue({s, a}) 03 if s = null return ∞; 04 else if a = null return g(s) + h(s); 05 else return g(s) + max(h(s), Es ∈Succ(s,a) {c(s, a, s ) + h(s )}); CheckInitialize(s) 06 if s was accessed last in some previous search iteration 07 ImproveHeuristic(s); 08 if s was not yet initialized in the current search iteration 09 g(s) = ∞; InsertUpdateCompAction(spivot , s, a) 10 reconstruct the path from spivot to s; 11 insert compressed action (spivot , s, a) into A(spivot ) (or update the cost if a cheaper path was found) 12 for each outcome u of a that was not in M c previously 13 set v(u) to h(u) or any other value less than or equal to v ∗ (u); 14 set the cost c(u, a, sgoal ) of the limit action a from u to v(u); ¯ ¯ procedure Search(spivot ) 15 CheckInitialize(sgoal ), CheckInitialize(spivot ); 16 g(spivot ) = 0; 17 OPEN = {{spivot , null}}; 18 {sbest , abest } = {null, null}, f best = ∞; 19 while(g(sgoal ) > min{s,a}∈OPEN (fvalue({s, a})) AND f best + θ > min{s,a}∈OPEN (fvalue({s, a}))) 20 remove {s, a} with the smallest fvalue({s, a}) from OPEN breaking ties towards the pairs with a = null; 21 if a = null //expand state s 22 for each s ∈ Succ(s) 23 CheckInitialize(s ); 24 for each deterministic a ∈ A(s) 25 s = Succ(s, a ); 26 h(s ) = max(h(s ), h(s) − c(s, a , s )); 27 if g(s ) > g(s) + c(s, a , s ) 28 g(s ) = g(s) + c(s, a , s ); 29 insert/update {s , null} into OPEN with fvalue({s , null}); 30 for each stochastic a ∈ A(s) 31 insert/update {s, a } into OPEN with fvalue({s, a }); 32 else //encode stochastic action a from state s as a compressed action from spivot 33 InsertUpdateCompAction(spivot , s, a); 34 if f best > fvalue({s, a}) then {sbest , abest } = {s, a}, f best = fvalue({s, a}); 35 if (g(sgoal ) ≤ min{s,a}∈OPEN (fvalue({s, a})) AND OPEN = ∅) 36 reconstruct the path from spivot to sgoal ; 37 update/insert into A(spivot ) a deterministic action a leading to sgoal ; 38 if f best ≥ g(sgoal ) then {sbest , abest } = {sgoal , null}, f best = g(sgoal ); 39 return [f best; min{s,a}∈OPEN (fvalue({s, a}))]; Figure 4: MCP-search Algorithm like A*, MCP-search expands elements in the order of increasing f -values, but it breaks ties towards elements encoding plain states (line 20). The f -value of {s, a} is defined as g(s) + max[h(s), Es ∈Succ(s,a) (c(s, a, s ) + h(s ))] (line 05). This f -value is a lower bound on the cost of a policy that goes from sstart to sgoal by first executing a series of deterministic actions until action a is executed from state s. This bound is as tight as possible given our heuristic values. State expansion (lines 21-31) is very similar to A∗ . When the search removes from OPEN a state-action pair {s, a} with a = null, it adds a compressed action to M c (line 33). It also adds a compressed action if there is an optimal deterministic path to sgoal (line 37). f best tracks the minimum f -value of all the compressed actions found. As a result, f best ≤ v ∗ (spivot ) and is used as a new estimate for v(spivot ). The limit value vlim (spivot ) is obtained by continuing the search until the minimum f -value of elements in OPEN approaches f best + θ for some θ ≥ 0 (line 19). 
This minimum f -value then provides a lower bound on v ∗ (spivot ). To speed up repetitive searches, MCP-search improves the heuristic of every state that it encounters for the first time in the current search iteration (lines 01 and 02). Line 01 uses the fact that v(s) from M c is a lower bound on v ∗ (s). Line 02 uses the fact that f best − g(s) is a lower bound on v ∗ (s) at the end of each previous call to Search; for more details see [4]. 2.3 Theoretical Properties of the Algorithm We now present several theorems about our algorithm. The proofs of these and other theorems can be found in [4]. The first theorem states the main properties of MCP-search. Theorem 1 The search function terminates and the following holds for the values it returns: (a) if sbest = null then v ∗ (spivot ) ≥ f best ≥ E{c(spivot , abest , s ) + v(s )} (b) if sbest = null then v ∗ (spivot ) = f best = ∞ (c) f best ≤ min{s,a}∈OPEN (fvalue({s, a})) ≤ v ∗ (spivot ). If neither sgoal nor any state-action pairs were expanded, then sbest = null and (b) says that there is no policy from spivot that has a finite expected cost. Using the above theorem it is easy to show that MCP-search satisfies Properties 1, 2 and 3, considering that f best is returned as variable v and min{s,a}∈OPEN (fvalue({s, a})) is returned as variable vlim in the main loop of the MCP algorithm (Figure 3). Property 1 follows directly from (a) and (b) and the fact that costs are strictly positive and v-values are non-negative. Property 2 also follows trivially from (a) and (b). Property 3 follows from (c). Given these properties c the next theorem states the correctness of the outer MCP algorithm (in the theorem πgreedy denotes a greedy policy that always chooses an action that looks best based on its cost and the v-values of its immediate successors). Theorem 2 Given a deterministic search algorithm which satisfies Properties 1–3, the c MCP algorithm will terminate. Upon termination, for every state s ∈ M c ∩ πgreedy we ∗ have RHS(s) − δ ≤ v(s) ≤ v (s). Given the above theorem one can show that for 0 ≤ δ < cmin (where cmin is the c smallest expected action cost in our MDP) the expected cost of executing πgreedy from cmin ∗ sstart is at most cmin −δ v (sstart ). Picking δ ≥ cmin is not guaranteed to result in a proper policy, even though Theorem 2 continues to hold. 3 Experimental Study We have evaluated the MCP algorithm on the robot-helicopter coordination problem described in section 1. To obtain an admissible heuristic, we first compute a value function for every possible configuration of obstacles. Then we weight the value functions by the probabilities of their obstacle configurations, sum them, and add the cost of moving the helicopter back to its base if it is not already there. This procedure results in optimistic cost estimates because it pretends that the robot will find out the obstacle locations immediately instead of having to wait to observe them. The results of our experiments are shown in Figure 5. We have compared MCP against three algorithms: RTDP [1], LAO* [2] and value iteration on reachable states (VI). RTDP can cope with large size MDPs by focussing its planning efforts along simulated execution trajectories. LAO* uses heuristics to prune away irrelevant states, then repeatedly performs dynamic programming on the states in its current partial policy. We have implemented LAO* so that it reduces to AO* [6] when environments are acyclic (e.g., the robot-helicopter problem with perfect sensing). 
VI was only able to run on the problems with perfect sensing since the number of reachable states was too large for the others. The results support the claim that MCP can solve large problems with sparse stochasticity. For the problem with perfect sensing, on average MCP was able to plan 9.5 times faster than LAO*, 7.5 times faster than RTDP, and 8.5 times faster than VI. On average for these problems, MCP computed values for 58633 states while M c grew to 396 states, and MCP encountered 3740 stochastic transitions (to give a sense of the degree of stochasticity). The main cost of MCP was in its deterministic search subroutine; this fact suggests that we might benefit from anytime search techniques such as ARA* [3]. The results for the problems with imperfect sensing show that, as the number and density of uncertain outcomes increases, the advantage of MCP decreases. For these problems MCP was able to solve environments 10.2 times faster than LAO* but only 2.2 times faster than RTDP. On average MCP computed values for 127,442 states, while the size of M c was 3,713 states, and 24,052 stochastic transitions were encountered. Figure 5: Experimental results. The top row: the robot-helicopter coordination problem with perfect sensors. The bottom row: the robot-helicopter coordination problem with sensor noise. Left column: running times (in secs) for each algorithm grouped by environments. Middle column: the number of backups for each algorithm grouped by environments. Right column: an estimate of the expected cost of an optimal policy (v(sstart )) vs. running time (in secs) for experiment (k) in the top row and experiment (e) in the bottom row. Algorithms in the bar plots (left to right): MCP, LAO*, RTDP and VI (VI is only shown for problems with perfect sensing). The characteristics of the environments are given in the second and third rows under each of the bar plot. The second row indicates how many cells the 2D plane is discretized into, and the third row indicates the number of initially unknown cells in the environment. 4 Discussion The MCP algorithm incrementally builds a compressed MDP using a sequence of deterministic searches. Our experimental results suggest that MCP is advantageous for problems with sparse stochasticity. In particular, MCP has allowed us to scale to larger environments than were otherwise possible for the robot-helicopter coordination problem. Acknowledgements This research was supported by DARPA’s MARS program. All conclusions are our own. References [1] S. Bradtke A. Barto and S. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72:81–138, 1995. [2] E. Hansen and S. Zilberstein. LAO*: A heuristic search algorithm that finds solutions with loops. Artificial Intelligence, 129:35–62, 2001. [3] M. Likhachev, G. Gordon, and S. Thrun. ARA*: Anytime A* with provable bounds on sub-optimality. In Advances in Neural Information Processing Systems (NIPS) 16. Cambridge, MA: MIT Press, 2003. [4] M. Likhachev, G. Gordon, and S. Thrun. MCP: Formal analysis. Technical report, Carnegie Mellon University, Pittsburgh, PA, 2004. [5] L. Mero. A heuristic search algorithm with modifiable estimate. Artificial Intelligence, 23:13–27, 1984. [6] N. Nilsson. Principles of Artificial Intelligence. Palo Alto, CA: Tioga Publishing, 1980. [7] C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of Markov decision processses. Mathematics of Operations Research, 12(3):441–450, 1987.
4 0.68478441 202 nips-2004-VDCBPI: an Approximate Scalable Algorithm for Large POMDPs
Author: Pascal Poupart, Craig Boutilier
Abstract: Existing algorithms for discrete partially observable Markov decision processes can at best solve problems of a few thousand states due to two important sources of intractability: the curse of dimensionality and the policy space complexity. This paper describes a new algorithm (VDCBPI) that mitigates both sources of intractability by combining the Value Directed Compression (VDC) technique [13] with Bounded Policy Iteration (BPI) [14]. The scalability of VDCBPI is demonstrated on synthetic network management problems with up to 33 million states.
5 0.6601935 1 nips-2004-A Cost-Shaping LP for Bellman Error Minimization with Performance Guarantees
Author: Daniela D. Farias, Benjamin V. Roy
Abstract: We introduce a new algorithm based on linear programming that approximates the differential value function of an average-cost Markov decision process via a linear combination of pre-selected basis functions. The algorithm carries out a form of cost shaping and minimizes a version of Bellman error. We establish an error bound that scales gracefully with the number of states without imposing the (strong) Lyapunov condition required by its counterpart in [6]. We propose a path-following method that automates selection of important algorithm parameters which represent counterparts to the “state-relevance weights” studied in [6]. 1
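For intuition, here is a small hedged sketch of the basis-function LP idea in its simpler discounted-cost form (the paper's average-cost formulation, its cost shaping, and the path-following parameter selection are not reproduced): the approximate value function is constrained to lie below the Bellman operator applied to itself, and the LP maximizes its weighted sum. The MDP, basis functions, and state-relevance weights below are arbitrary illustrations.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)

# Toy discounted-cost MDP (the discounted cousin of the LP approach, not the
# paper's average-cost version).
S, A, gamma = 20, 3, 0.95
P = rng.random((A, S, S))
P /= P.sum(axis=2, keepdims=True)                # P[a, s, s']
c = rng.random((A, S))                           # cost c(s, a)

# Pre-selected basis functions (columns of Phi) and state-relevance weights w.
Phi = np.column_stack([np.ones(S), np.arange(S) / S, (np.arange(S) / S) ** 2])
w = np.ones(S) / S

# Approximate LP: maximize w^T Phi r  subject to  (Phi r)(s) <= c(s,a) + gamma (P_a Phi r)(s)
A_ub = np.vstack([Phi - gamma * P[a] @ Phi for a in range(A)])
b_ub = np.concatenate([c[a] for a in range(A)])
res = linprog(-Phi.T @ w, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * Phi.shape[1])
r = res.x
print("basis weights:", np.round(r, 3))
print("approximate values for the first 5 states:", np.round(Phi @ r, 3)[:5])
```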
6 0.60869074 123 nips-2004-Multi-agent Cooperation in Diverse Population Games
7 0.60864961 64 nips-2004-Experts in a Markov Decision Process
8 0.55585116 88 nips-2004-Intrinsically Motivated Reinforcement Learning
9 0.54233122 154 nips-2004-Resolving Perceptual Aliasing In The Presence Of Noisy Sensors
10 0.44134003 159 nips-2004-Schema Learning: Experience-Based Construction of Predictive Action Models
11 0.36986226 175 nips-2004-Stable adaptive control with online learning
12 0.35144556 65 nips-2004-Exploration-Exploitation Tradeoffs for Experts Algorithms in Reactive Environments
13 0.34974658 155 nips-2004-Responding to Modalities with Different Latencies
14 0.32814598 102 nips-2004-Learning first-order Markov models for control
15 0.30719 171 nips-2004-Solitaire: Man Versus Machine
16 0.25542647 129 nips-2004-New Criteria and a New Algorithm for Learning in Multi-Agent Systems
17 0.24394785 183 nips-2004-Temporal-Difference Networks
18 0.23181817 56 nips-2004-Dynamic Bayesian Networks for Brain-Computer Interfaces
19 0.22376221 124 nips-2004-Multiple Alignment of Continuous Time Series
20 0.22354424 184 nips-2004-The Cerebellum Chip: an Analog VLSI Implementation of a Cerebellar Model of Classical Conditioning
topicId topicWeight
[(13, 0.037), (15, 0.064), (26, 0.037), (31, 0.017), (33, 0.68), (39, 0.011), (50, 0.014), (58, 0.014), (87, 0.015)]
simIndex simValue paperId paperTitle
1 0.99429309 63 nips-2004-Expectation Consistent Free Energies for Approximate Inference
Author: Manfred Opper, Ole Winther
Abstract: We propose a novel framework for deriving approximations for intractable probabilistic models. This framework is based on a free energy (negative log marginal likelihood) and can be seen as a generalization of adaptive TAP [1, 2, 3] and expectation propagation (EP) [4, 5]. The free energy is constructed from two approximating distributions which encode different aspects of the intractable model, such as single-node constraints and couplings, and are by construction consistent on a chosen set of moments. We test the framework on a difficult benchmark problem with binary variables on fully connected graphs and 2D grid graphs. We find good performance using sets of moments which either specify factorized nodes or a spanning tree on the nodes (structured approximation). Surprisingly, the Bethe approximation gives very inferior results even on grids. 1
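The notion of keeping approximations consistent on a chosen set of moments can be illustrated with a tiny, hedged example (this is only the moment-matching ingredient, not the EC free-energy construction or its updates): an intractable 1D density given by a Gaussian factor times a logistic factor is summarized by the Gaussian that matches its first two moments, computed here by brute-force quadrature.

```python
import numpy as np

# Toy 1D "intractable" model: a Gaussian factor times a logistic factor.  We
# compute its normalizer and first two moments on a fine grid and report the
# Gaussian approximation that is consistent with those moments.
x = np.linspace(-10.0, 10.0, 20001)
dx = x[1] - x[0]
unnorm = np.exp(-0.5 * x ** 2) / (1.0 + np.exp(-3.0 * x))   # Gaussian factor * sigmoid(3x)
Z = unnorm.sum() * dx                                       # -log Z plays the role of a free energy
p = unnorm / Z
mean = (x * p).sum() * dx
var = ((x - mean) ** 2 * p).sum() * dx
print(f"moment-matched Gaussian: mean={mean:.4f}, variance={var:.4f}, log Z={np.log(Z):.4f}")
```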
2 0.99245989 139 nips-2004-Optimal Aggregation of Classifiers and Boosting Maps in Functional Magnetic Resonance Imaging
Author: Vladimir Koltchinskii, Manel Martínez-Ramón, Stefan Posse
Abstract: We study a method of optimal data-driven aggregation of classifiers in a convex combination and establish tight upper bounds on its excess risk with respect to a convex loss function, under the assumption that the solution of the optimal aggregation problem is sparse. We use a boosting-type algorithm of optimal aggregation to develop aggregate classifiers of activation patterns in fMRI based on locally trained SVM classifiers. The aggregation coefficients are then used to design a “boosting map” of the brain needed to identify the regions with the most significant impact on classification. 1
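To give a flavor of the aggregation step (only the generic convex-combination learning, not the authors' excess-risk analysis, their fMRI pipeline, or locally trained SVM base classifiers), the sketch below combines a few weak base classifiers on synthetic data by minimizing the logistic loss over the probability simplex with exponentiated-gradient updates; the resulting weights play the role of the aggregation coefficients.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data and a handful of weak base classifiers (sign of a single
# feature).  We learn a convex combination of their outputs by minimizing the
# logistic loss with exponentiated-gradient updates, which keep the weights
# on the probability simplex.
n, d = 400, 6
X = rng.normal(size=(n, d))
y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=n))   # labels in {-1, +1}

base_preds = np.sign(X)          # base classifier j predicts sign(x_j); shape (n, d)
lam = np.ones(d) / d             # aggregation weights, start uniform on the simplex
eta = 0.5
for _ in range(300):
    margin = y * (base_preds @ lam)
    grad = -(base_preds * (y * (1.0 / (1.0 + np.exp(margin))))[:, None]).mean(axis=0)
    lam *= np.exp(-eta * grad)   # exponentiated-gradient step
    lam /= lam.sum()

agg = np.sign(base_preds @ lam)
print("weights:", np.round(lam, 3), " training accuracy:", (agg == y).mean())
```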
same-paper 3 0.99225277 39 nips-2004-Coarticulation in Markov Decision Processes
Author: Khashayar Rohanimanesh, Robert Platt, Sridhar Mahadevan, Roderic Grupen
Abstract: We investigate an approach for simultaneously committing to multiple activities, each modeled as a temporally extended action in a semi-Markov decision process (SMDP). For each activity we define a set of admissible solutions consisting of the redundant set of optimal policies, and those policies that ascend the optimal statevalue function associated with them. A plan is then generated by merging them in such a way that the solutions to the subordinate activities are realized in the set of admissible solutions satisfying the superior activities. We present our theoretical results and empirically evaluate our approach in a simulated domain. 1
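The admissible-policy idea lends itself to a very small illustration (a deterministic grid toy, not the paper's SMDP controllers or its merging algorithm): for the superior subgoal we keep the greedy action together with every action that strictly ascends its value function, and among those we choose whichever action the subordinate subgoal's value function prefers. The grid, value functions, and tie-breaking below are illustrative assumptions.

```python
import numpy as np

SIZE = 5
goal1, goal2 = (4, 4), (4, 0)            # superior and subordinate subgoals
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def value(goal):
    """Negative Manhattan distance as a stand-in optimal state-value function."""
    g = np.array(goal)
    return -np.abs(np.indices((SIZE, SIZE)) - g[:, None, None]).sum(axis=0)

V1, V2 = value(goal1), value(goal2)

def step(cell, a):
    r, c = cell[0] + ACTIONS[a][0], cell[1] + ACTIONS[a][1]
    return (min(max(r, 0), SIZE - 1), min(max(c, 0), SIZE - 1))

def merged_action(cell):
    succ = {a: step(cell, a) for a in ACTIONS}
    best1 = max(succ, key=lambda a: V1[succ[a]])
    # admissible for the superior goal: strictly value-ascending, or tied with its greedy action
    admissible = [a for a in ACTIONS
                  if V1[succ[a]] > V1[cell] or V1[succ[a]] == V1[succ[best1]]]
    return max(admissible, key=lambda a: V2[succ[a]])   # prefer the subordinate goal among those

cell = (0, 2)
path = [cell]
while cell != goal1:
    cell = step(cell, merged_action(cell))
    path.append(cell)
print(path)
```

Starting from the middle of the top row, the merged behavior moves toward the subordinate goal for as long as doing so still ascends the superior goal's value function, and only then commits to the superior goal.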
4 0.99026167 120 nips-2004-Modeling Conversational Dynamics as a Mixed-Memory Markov Process
Author: Tanzeem Choudhury, Sumit Basu
Abstract: In this work, we quantitatively investigate the ways in which a given person influences the joint turn-taking behavior in a conversation. After collecting an auditory database of social interactions among a group of twenty-three people via wearable sensors (66 hours of data each over two weeks), we apply speech and conversation detection methods to the auditory streams. These methods automatically locate the conversations, determine their participants, and mark which participant was speaking when. We then model the joint turn-taking behavior as a Mixed-Memory Markov Model [1] that combines the statistics of the individual subjects' self-transitions and the partners' cross-transitions. The mixture parameters in this model describe how much each person's individual behavior contributes to the joint turn-taking behavior of the pair. By estimating these parameters, we thus estimate how much influence each participant has in determining the joint turn-taking behavior. We show how this measure correlates significantly with betweenness centrality [2], an independent measure of an individual's importance in a social network. This result suggests that our estimate of conversational influence is predictive of social influence. 1
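A stripped-down sketch of the model for a single pair (synthetic binary speaking/not-speaking data, transition tables assumed known, Y treated as exogenous, and a grid search in place of the EM-style estimation one would use in practice) shows how the mixture weight that blends self- and cross-transitions can be recovered from data; that weight is the quantity interpreted as influence.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two-person, binary version of the mixed-memory idea: person X's next state is
# drawn from a mixture of a self-transition table and a cross-transition table
# driven by person Y.  Everything here is synthetic and simplified.
T = 20000
alpha_true = 0.3                 # how much X's behavior depends on X itself
A_self = np.array([[0.8, 0.2], [0.3, 0.7]])    # rows: previous X state
A_cross = np.array([[0.6, 0.4], [0.9, 0.1]])   # rows: previous Y state
y = rng.integers(0, 2, size=T)                  # Y's speaking states (kept exogenous here)
x = np.zeros(T, dtype=int)
for t in range(1, T):
    p1 = alpha_true * A_self[x[t-1], 1] + (1 - alpha_true) * A_cross[y[t-1], 1]
    x[t] = rng.random() < p1

def loglik(alpha):
    p1 = alpha * A_self[x[:-1], 1] + (1 - alpha) * A_cross[y[:-1], 1]
    p = np.where(x[1:] == 1, p1, 1 - p1)
    return np.log(p).sum()

grid = np.linspace(0.01, 0.99, 99)
alpha_hat = grid[np.argmax([loglik(a) for a in grid])]
print(f"true alpha={alpha_true}, estimated alpha={alpha_hat:.2f}")
```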
5 0.9888199 126 nips-2004-Nearly Tight Bounds for the Continuum-Armed Bandit Problem
Author: Robert D. Kleinberg
Abstract: In the multi-armed bandit problem, an online algorithm must choose from a set of strategies in a sequence of n trials so as to minimize the total cost of the chosen strategies. While nearly tight upper and lower bounds are known in the case when the strategy set is finite, much less is known when there is an infinite strategy set. Here we consider the case when the set of strategies is a subset of R^d, and the cost functions are continuous. In the d = 1 case, we improve on the best-known upper and lower bounds, closing the gap to a sublogarithmic factor. We also consider the case where d > 1 and the cost functions are convex, adapting a recent online convex optimization algorithm of Zinkevich to the sparser feedback model of the multi-armed bandit problem. 1
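One standard simplification of the continuum-armed setting (not the paper's algorithm or its horizon-dependent tuning of the discretization) is to discretize the interval into finitely many arms and run a finite-armed index policy on them. The sketch below does exactly that, using a UCB-style optimistic rule for cost minimization on a synthetic smooth cost function; the cost function and the choice K = 20 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Continuum-armed bandit on [0, 1] with a smooth mean-cost function unknown to
# the learner.  Discretize into K arms and run UCB1 (lower-confidence form,
# since we minimize cost) on the discretization.
def mean_cost(u):
    return 0.1 + 0.8 * (u - 0.7) ** 2        # minimum near u = 0.7, values stay in [0, 1]

K, horizon = 20, 20000
arms = (np.arange(K) + 0.5) / K              # arm j plays the point (j + 0.5) / K
counts = np.zeros(K)
totals = np.zeros(K)

for t in range(1, horizon + 1):
    if t <= K:
        j = t - 1                            # play each arm once to initialize
    else:
        lcb = totals / counts - np.sqrt(2 * np.log(t) / counts)   # optimistic bound on mean cost
        j = int(np.argmin(lcb))
    cost = float(rng.random() < mean_cost(arms[j]))               # Bernoulli cost in {0, 1}
    counts[j] += 1
    totals[j] += cost

best = int(np.argmax(counts))
print(f"most-played point: {arms[best]:.3f}, observed mean cost there: {totals[best] / counts[best]:.3f}")
```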
6 0.98726612 149 nips-2004-Probabilistic Inference of Alternative Splicing Events in Microarray Data
7 0.98607028 186 nips-2004-The Correlated Correspondence Algorithm for Unsupervised Registration of Nonrigid Surfaces
8 0.98572785 115 nips-2004-Maximum Margin Clustering
9 0.90289938 80 nips-2004-Identifying Protein-Protein Interaction Sites on a Genome-Wide Scale
10 0.90200162 143 nips-2004-PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data
11 0.89504272 56 nips-2004-Dynamic Bayesian Networks for Brain-Computer Interfaces
12 0.88949597 32 nips-2004-Boosting on Manifolds: Adaptive Regularization of Base Classifiers
13 0.8875953 72 nips-2004-Generalization Error and Algorithmic Convergence of Median Boosting
14 0.88385737 44 nips-2004-Conditional Random Fields for Object Recognition
15 0.88365871 147 nips-2004-Planning for Markov Decision Processes with Sparse Stochasticity
16 0.87940824 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound
17 0.87171382 166 nips-2004-Semi-supervised Learning via Gaussian Processes
18 0.87099254 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms
19 0.87007076 48 nips-2004-Convergence and No-Regret in Multiagent Learning
20 0.86758429 64 nips-2004-Experts in a Markov Decision Process