nips nips2004 nips2004-147 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Maxim Likhachev, Sebastian Thrun, Geoffrey J. Gordon
Abstract: Planning algorithms designed for deterministic worlds, such as A* search, usually run much faster than algorithms designed for worlds with uncertain action outcomes, such as value iteration. Real-world planning problems often exhibit uncertainty, which forces us to use the slower algorithms to solve them. Many real-world planning problems exhibit sparse uncertainty: there are long sequences of deterministic actions which accomplish tasks like moving sensor platforms into place, interspersed with a small number of sensing actions which have uncertain outcomes. In this paper we describe a new planning algorithm, called MCP (short for MDP Compression Planning), which combines A* search with value iteration for solving Stochastic Shortest Path problem in MDPs with sparse stochasticity. We present experiments which show that MCP can run substantially faster than competing planners in domains with sparse uncertainty; these experiments are based on a simulation of a ground robot cooperating with a helicopter to fill in a partial map and move to a goal location. In deterministic planning problems, optimal paths are acyclic: no state is visited more than once. Because of this property, algorithms like A* search can guarantee that they visit each state in the state space no more than once. By visiting the states in an appropriate order, it is possible to ensure that we know the exact value of all of a state’s possible successors before we visit that state; so, the first time we visit a state we can compute its correct value. By contrast, if actions have uncertain outcomes, optimal paths may contain cycles: some states will be visited two or more times with positive probability. Because of these cycles, there is no way to order states so that we determine the values of a state’s successors before we visit the state itself. Instead, the only way to compute state values is to solve a set of simultaneous equations. In problems with sparse stochasticity, only a small fraction of all states have uncertain outcomes. It is these few states that cause all of the cycles: while a deterministic state s may participate in a cycle, the only way it can do so is if one of its successors has an action with a stochastic outcome (and only if this stochastic action can lead to a predecessor of s). In such problems, we would like to build a smaller MDP which contains only states which are related to stochastic actions. We will call such an MDP a compressed MDP, and we will call its states distinguished states. We could then run fast algorithms like A* search to plan paths between distinguished states, and reserve slower algorithms like value iteration for deciding how to deal with stochastic outcomes. (a) Segbot (d) Planning map (b) Robotic helicopter (e) Execution simulation (c) 3D Map Figure 1: Robot-Helicopter Coordination There are two problems with such a strategy. First, there can be a large number of states which are related to stochastic actions, and so it may be impractical to enumerate all of them and make them all distinguished states; we would prefer instead to distinguish only states which are likely to be encountered while executing some policy which we are considering. Second, there can be a large number of ways to get from one distinguished state to another: edges in the compressed MDP correspond to sequences of actions in the original MDP. If we knew the values of all of the distinguished states exactly, then we could use A* search to generate optimal paths between them, but since we do not we cannot. In this paper, we will describe an algorithm which incrementally builds a compressed MDP using a sequence of deterministic searches. It adds states and edges to the compressed MDP only by encountering them along trajectories; so, it never adds irrelevant states or edges to the compressed MDP. Trajectories are generated by deterministic search, and so undistinguished states are treated only with fast algorithms. Bellman errors in the values for distinguished states show us where to try additional trajectories, and help us build the relevant parts of the compressed MDP as quickly as possible. 1 Robot-Helicopter Coordination Problem The motivation for our research was the problem of coordinating a ground robot and a helicopter. The ground robot needs to plan a path from its current location to a goal, but has only partial knowledge of the surrounding terrain. The helicopter can aid the ground robot by flying to and sensing places in the map. Figure 1(a) shows our ground robot, a converted Segway with a SICK laser rangefinder. Figure 1(b) shows the helicopter, also with a SICK. Figure 1(c) shows a 3D map of the environment in which the robot operates. The 3D map is post-processed to produce a discretized 2D environment (Figure 1(d)). Several places in the map are unknown, either because the robot has not visited them or because their status may have changed (e.g, a car may occupy a driveway). Such places are shown in Figure 1(d) as white squares. The elevation of each white square is proportional to the probability that there is an obstacle there; we assume independence between unknown squares. The robot must take the unknown locations into account when planning for its route. It may plan a path through these locations, but it risks having to turn back if its way is blocked. Alternately, the robot can ask the helicopter to fly to any of these places and sense them. We assign a cost to running the robot, and a somewhat higher cost to running the helicopter. The planning task is to minimize the expected overall cost of running the robot and the helicopter while getting the robot to its destination and the helicopter back to its home base. Figure 1(e) shows a snapshot of the robot and helicopter executing a policy. Designing a good policy for the robot and helicopter is a POMDP planning problem; unfortunately POMDPs are in general difficult to solve (PSPACE-complete [7]). In the POMDP representation, a state is the position of the robot, the current location of the helicopter (a point on a line segment from one of the unknown places to another unknown place or the home base), and the true status of each unknown location. The positions of the robot and the helicopter are observable, so that the only hidden variables are whether each unknown place is occupied. The number of states is (# of robot locations)×(# of helicopter locations)×2# of unknown places . So, the number of states is exponential in the number of unknown places and therefore quickly becomes very large. We approach the problem by planning in the belief state space, that is, the space of probability distributions over world states. This problem is a continuous-state MDP; in this belief MDP, our state consists of the ground robot’s location, the helicopter’s location, and a probability of occupancy for each unknown location. We will discretize the continuous probability variables by breaking the interval [0, 1] into several chunks; so, the number of belief states is exponential in the number of unknown places, and classical algorithms such as value iteration are infeasible even on small problems. If sensors are perfect, this domain is acyclic: after we sense a square we know its true state forever after. On the other hand, imperfect sensors can lead to cycles: new sensor data can contradict older sensor data and lead to increased uncertainty. With or without sensor noise, our belief state MDP differs from general MDPs because its stochastic transitions are sparse: large portions of the policy (while the robot and helicopter are traveling between unknown locations) are deterministic. The algorithm we propose in this paper takes advantage of this property of the problem, as we explain in the next section. 2 The Algorithm Our algorithm can be broken into two levels. At a high level, it constructs a compressed MDP, denoted M c , which contains only the start, the goal, and some states which are outcomes of stochastic actions. At a lower level, it repeatedly runs deterministic searches to find new information to put into M c . This information includes newly-discovered stochastic actions and their outcomes; better deterministic paths from one place to another; and more accurate value estimates similar to Bellman backups. The deterministic searches can use an admissible heuristic h to focus their effort, so we can often avoid putting many irrelevant actions into M c . Because M c will often be much smaller than M , we can afford to run stochastic planning algorithms like value iteration on it. On the other hand, the information we get by planning in M c will improve the heuristic values that we use in our deterministic searches; so, the deterministic searches will tend to visit only relevant portions of the state space. 2.1 Constructing and Solving a Compressed MDP Each action in the compressed MDP represents several consecutive actions in M : if we see a sequence of states and actions s1 , a1 , s2 , a2 , . . . , sk , ak where a1 through ak−1 are deterministic but ak is stochastic, then we can represent it in M c with a single action a, available at s1 , whose outcome distribution is P (s | sk , ak ) and whose cost is k−1 c(si , ai , si+1 ) + c(sk , ak , s ) c(s1 , a, s ) = i=1 (See Figure 2(a) for an example of such a compressed action.) In addition, if we see a sequence of deterministic actions ending in sgoal , say s1 , a1 , s2 , a2 , . . . , sk , ak , sk+1 = sgoal , k we can define a compressed action which goes from s1 to sgoal at cost i=1 c(si , ai , si+1 ). We can label each compressed action that starts at s with (s, s , a) (where a = null if s = sgoal ). Among all compressed actions starting at s and ending at (s , a) there is (at least) one with lowest expected cost; we will call such an action an optimal compression of (s, s , a). Write Astoch for the set of all pairs (s, a) such that action a when taken from state s has more than one possible outcome, and include as well (sgoal , null). Write Sstoch for the states which are possible outcomes of the actions in Astoch , and include sstart as well. If we include in our compressed MDP an optimal compression of (s, s , a) for every s ∈ Sstoch and every (s , a) ∈ Astoch , the result is what we call the full compressed MDP; an example is in Figure 2(b). If we solve the full compressed MDP, the value of each state will be the same as the value of the corresponding state in M . However, we do not need to do that much work: (a) action compression (b) full MDP compression (c) incremental MDP compression Figure 2: MDP compression Main() 01 initialize M c with sstart and sgoal and set their v-values to 0; 02 while (∃s ∈ M c s.t. RHS(s) − v(s) > δ and s belongs to the current greedy policy) 03 select spivot to be any such state s; 04 [v; vlim ] = Search(spivot ); 05 v(spivot ) = v; 06 set the cost c(spivot , a, sgoal ) of the limit action a from spivot to vlim ; ¯ ¯ 07 optionally run some algorithm satisfying req. A for a bounded amount of time to improve the value function in M c ; Figure 3: MCP main loop many states and actions in the full compressed MDP are irrelevant since they do not appear in the optimal policy from sstart to sgoal . So, the goal of the MCP algorithm will be to construct only the relevant part of the compressed MDP by building M c incrementally. Figure 2(c) shows the incremental construction of a compressed MDP which contains all of the stochastic states and actions along an optimal policy in M . The pseudocode for MCP is given in Figure 3. It begins by initializing M c to contain only sstart and sgoal , and it sets v(sstart ) = v(sgoal ) = 0. It maintains the invariant that 0 ≤ v(s) ≤ v ∗ (s) for all s. On each iteration, MCP looks at the Bellman error of each of the states in M c . The Bellman error is v(s) − RHS(s), where RHS(s) = min RHS(s, a) a∈A(s) RHS(s, a) = Es ∈succ(s,a) (c(s, a, s ) + v(s )) By convention the min of an empty set is ∞, so an s which does not have any compressed actions yet is considered to have infinite RHS. MCP selects a state with negative Bellman error, spivot , and starts a search at that state. (We note that there exist many possible ways to select spivot ; for example, we can choose the state with the largest negative Bellman error, or the largest error when weighted by state visitation probabilities in the best policy in M c .) The goal of this search is to find a new compressed action a such that its RHS-value can provide a new lower bound on v ∗ (spivot ). This action can either decrease the current RHS(spivot ) (if a seems to be a better action in terms of the current v-values of action outcomes) or prove that the current RHS(spivot ) is valid. Since v(s ) ≤ v ∗ (s ), one way to guarantee that RHS(spivot , a) ≤ v ∗ (spivot ) is to compute an optimal compression of (spivot , s, a) for all s, a, then choose the one with the smallest RHS. A more sophisticated strategy is to use an A∗ search with appropriate safeguards to make sure we never overestimate the value of a stochastic action. MCP, however, uses a modified A∗ search which we will describe in the next section. As the search finds new compressed actions, it adds them and their outcomes to M c . It is allowed to initialize newly-added states to any admissible values. When the search returns, MCP sets v(spivot ) to the returned value. This value is at least as large as RHS(spivot ). Consequently, Bellman error for spivot becomes non-negative. In addition to the compressed action and the updated value, the search algorithm returns a “limit value” vlim (spivot ). These limit values allow MCP to run a standard MDP planning algorithm on M c to improve its v(s) estimates. MCP can use any planning algorithm which guarantees that, for any s, it will not lower v(s) and will not increase v(s) beyond the smaller of vlim (s) and RHS(s) (Requirement A). For example, we could insert a fake “limit action” into M c which goes directly from spivot to sgoal at cost vlim (spivot ) (as we do on line 06 in Figure 3), then run value iteration for a fixed amount of time, selecting for each backup a state with negative Bellman error. After updating M c from the result of the search and any optional planning, MCP begins again by looking for another state with a negative Bellman error. It repeats this process until there are no negative Bellman errors larger than δ. For small enough δ, this property guarantees that we will be able to find a good policy (see section 2.3). 2.2 Searching the MDP Efficiently The top level algorithm (Figure 3) repeatedly invokes a search method for finding trajectories from spivot to sgoal . In order for the overall algorithm to work correctly, there are several properties that the search must satisfy. First, the estimate v that search returns for the expected cost of spivot should always be admissible. That is, 0 ≤ v ≤ v ∗ (spivot ) (Property 1). Second, the estimate v should be no less than the one-step lookahead value of spivot in M c . That is, v ≥ RHS(spivot ) (Property 2). This property ensures that search either increases the value of spivot or finds additional (or improved) compressed actions. The third and final property is for the vlim value, and it is only important if MCP uses its optional planning step (line 07). The property is that v ≤ vlim ≤ v ∗ (spivot ) (Property 3). Here v ∗ (spivot ) denotes the minimum expected cost of starting at spivot , picking a compressed action not in M c , and acting optimally from then on. (Note that v ∗ can be larger than v ∗ if the optimal compressed action is already part of M c .) Property 3 uses v ∗ rather than v ∗ since the latter is not known while it is possible to compute a lower bound on the former efficiently (see below). One could adapt A* search to satisfy at least Properties 1 and 2 by assuming that we can control the outcome of stochastic actions. However, this sort of search is highly optimistic and can bias the search towards improbable trajectories. Also, it can only use heuristics which are even more optimistic than it is: that is, h must be admissible with respect to the optimistic assumption of controlled outcomes. We therefore present a version of A*, called MCP-search (Figure 4), that is more efficient for our purposes. MCP-search finds the correct expected value for the first stochastic action it encounters on any given trajectory, and is therefore far less optimistic. And, MCP-search only requires heuristic values to be admissible with respect to v ∗ values, h(s) ≤ v ∗ (s). Finally, MCP-search speeds up repetitive searches by improving heuristic values based on previous searches. A* maintains a priority queue, OPEN, of states which it plans to expand. The OPEN queue is sorted by f (s) = g(s)+h(s), so that A* always expands next a state which appears to be on the shortest path from start to goal. During each expansion a state s is removed from OPEN and all the g-values of s’s successors are updated; if g(s ) is decreased for some state s , A* inserts s into OPEN. A* terminates as soon as the goal state is expanded. We use the variant of A* with pathmax [5] to use efficiently heuristics that do not satisfy the triangle inequality. MCP is similar to A∗ , but the OPEN list can also contain state-action pairs {s, a} where a is a stochastic action (line 31). Plain states are represented in OPEN as {s, null}. Just ImproveHeuristic(s) 01 if s ∈ M c then h(s) = max(h(s), v(s)); 02 improve heuristic h(s) further if possible using f best and g(s) from previous iterations; procedure fvalue({s, a}) 03 if s = null return ∞; 04 else if a = null return g(s) + h(s); 05 else return g(s) + max(h(s), Es ∈Succ(s,a) {c(s, a, s ) + h(s )}); CheckInitialize(s) 06 if s was accessed last in some previous search iteration 07 ImproveHeuristic(s); 08 if s was not yet initialized in the current search iteration 09 g(s) = ∞; InsertUpdateCompAction(spivot , s, a) 10 reconstruct the path from spivot to s; 11 insert compressed action (spivot , s, a) into A(spivot ) (or update the cost if a cheaper path was found) 12 for each outcome u of a that was not in M c previously 13 set v(u) to h(u) or any other value less than or equal to v ∗ (u); 14 set the cost c(u, a, sgoal ) of the limit action a from u to v(u); ¯ ¯ procedure Search(spivot ) 15 CheckInitialize(sgoal ), CheckInitialize(spivot ); 16 g(spivot ) = 0; 17 OPEN = {{spivot , null}}; 18 {sbest , abest } = {null, null}, f best = ∞; 19 while(g(sgoal ) > min{s,a}∈OPEN (fvalue({s, a})) AND f best + θ > min{s,a}∈OPEN (fvalue({s, a}))) 20 remove {s, a} with the smallest fvalue({s, a}) from OPEN breaking ties towards the pairs with a = null; 21 if a = null //expand state s 22 for each s ∈ Succ(s) 23 CheckInitialize(s ); 24 for each deterministic a ∈ A(s) 25 s = Succ(s, a ); 26 h(s ) = max(h(s ), h(s) − c(s, a , s )); 27 if g(s ) > g(s) + c(s, a , s ) 28 g(s ) = g(s) + c(s, a , s ); 29 insert/update {s , null} into OPEN with fvalue({s , null}); 30 for each stochastic a ∈ A(s) 31 insert/update {s, a } into OPEN with fvalue({s, a }); 32 else //encode stochastic action a from state s as a compressed action from spivot 33 InsertUpdateCompAction(spivot , s, a); 34 if f best > fvalue({s, a}) then {sbest , abest } = {s, a}, f best = fvalue({s, a}); 35 if (g(sgoal ) ≤ min{s,a}∈OPEN (fvalue({s, a})) AND OPEN = ∅) 36 reconstruct the path from spivot to sgoal ; 37 update/insert into A(spivot ) a deterministic action a leading to sgoal ; 38 if f best ≥ g(sgoal ) then {sbest , abest } = {sgoal , null}, f best = g(sgoal ); 39 return [f best; min{s,a}∈OPEN (fvalue({s, a}))]; Figure 4: MCP-search Algorithm like A*, MCP-search expands elements in the order of increasing f -values, but it breaks ties towards elements encoding plain states (line 20). The f -value of {s, a} is defined as g(s) + max[h(s), Es ∈Succ(s,a) (c(s, a, s ) + h(s ))] (line 05). This f -value is a lower bound on the cost of a policy that goes from sstart to sgoal by first executing a series of deterministic actions until action a is executed from state s. This bound is as tight as possible given our heuristic values. State expansion (lines 21-31) is very similar to A∗ . When the search removes from OPEN a state-action pair {s, a} with a = null, it adds a compressed action to M c (line 33). It also adds a compressed action if there is an optimal deterministic path to sgoal (line 37). f best tracks the minimum f -value of all the compressed actions found. As a result, f best ≤ v ∗ (spivot ) and is used as a new estimate for v(spivot ). The limit value vlim (spivot ) is obtained by continuing the search until the minimum f -value of elements in OPEN approaches f best + θ for some θ ≥ 0 (line 19). This minimum f -value then provides a lower bound on v ∗ (spivot ). To speed up repetitive searches, MCP-search improves the heuristic of every state that it encounters for the first time in the current search iteration (lines 01 and 02). Line 01 uses the fact that v(s) from M c is a lower bound on v ∗ (s). Line 02 uses the fact that f best − g(s) is a lower bound on v ∗ (s) at the end of each previous call to Search; for more details see [4]. 2.3 Theoretical Properties of the Algorithm We now present several theorems about our algorithm. The proofs of these and other theorems can be found in [4]. The first theorem states the main properties of MCP-search. Theorem 1 The search function terminates and the following holds for the values it returns: (a) if sbest = null then v ∗ (spivot ) ≥ f best ≥ E{c(spivot , abest , s ) + v(s )} (b) if sbest = null then v ∗ (spivot ) = f best = ∞ (c) f best ≤ min{s,a}∈OPEN (fvalue({s, a})) ≤ v ∗ (spivot ). If neither sgoal nor any state-action pairs were expanded, then sbest = null and (b) says that there is no policy from spivot that has a finite expected cost. Using the above theorem it is easy to show that MCP-search satisfies Properties 1, 2 and 3, considering that f best is returned as variable v and min{s,a}∈OPEN (fvalue({s, a})) is returned as variable vlim in the main loop of the MCP algorithm (Figure 3). Property 1 follows directly from (a) and (b) and the fact that costs are strictly positive and v-values are non-negative. Property 2 also follows trivially from (a) and (b). Property 3 follows from (c). Given these properties c the next theorem states the correctness of the outer MCP algorithm (in the theorem πgreedy denotes a greedy policy that always chooses an action that looks best based on its cost and the v-values of its immediate successors). Theorem 2 Given a deterministic search algorithm which satisfies Properties 1–3, the c MCP algorithm will terminate. Upon termination, for every state s ∈ M c ∩ πgreedy we ∗ have RHS(s) − δ ≤ v(s) ≤ v (s). Given the above theorem one can show that for 0 ≤ δ < cmin (where cmin is the c smallest expected action cost in our MDP) the expected cost of executing πgreedy from cmin ∗ sstart is at most cmin −δ v (sstart ). Picking δ ≥ cmin is not guaranteed to result in a proper policy, even though Theorem 2 continues to hold. 3 Experimental Study We have evaluated the MCP algorithm on the robot-helicopter coordination problem described in section 1. To obtain an admissible heuristic, we first compute a value function for every possible configuration of obstacles. Then we weight the value functions by the probabilities of their obstacle configurations, sum them, and add the cost of moving the helicopter back to its base if it is not already there. This procedure results in optimistic cost estimates because it pretends that the robot will find out the obstacle locations immediately instead of having to wait to observe them. The results of our experiments are shown in Figure 5. We have compared MCP against three algorithms: RTDP [1], LAO* [2] and value iteration on reachable states (VI). RTDP can cope with large size MDPs by focussing its planning efforts along simulated execution trajectories. LAO* uses heuristics to prune away irrelevant states, then repeatedly performs dynamic programming on the states in its current partial policy. We have implemented LAO* so that it reduces to AO* [6] when environments are acyclic (e.g., the robot-helicopter problem with perfect sensing). VI was only able to run on the problems with perfect sensing since the number of reachable states was too large for the others. The results support the claim that MCP can solve large problems with sparse stochasticity. For the problem with perfect sensing, on average MCP was able to plan 9.5 times faster than LAO*, 7.5 times faster than RTDP, and 8.5 times faster than VI. On average for these problems, MCP computed values for 58633 states while M c grew to 396 states, and MCP encountered 3740 stochastic transitions (to give a sense of the degree of stochasticity). The main cost of MCP was in its deterministic search subroutine; this fact suggests that we might benefit from anytime search techniques such as ARA* [3]. The results for the problems with imperfect sensing show that, as the number and density of uncertain outcomes increases, the advantage of MCP decreases. For these problems MCP was able to solve environments 10.2 times faster than LAO* but only 2.2 times faster than RTDP. On average MCP computed values for 127,442 states, while the size of M c was 3,713 states, and 24,052 stochastic transitions were encountered. Figure 5: Experimental results. The top row: the robot-helicopter coordination problem with perfect sensors. The bottom row: the robot-helicopter coordination problem with sensor noise. Left column: running times (in secs) for each algorithm grouped by environments. Middle column: the number of backups for each algorithm grouped by environments. Right column: an estimate of the expected cost of an optimal policy (v(sstart )) vs. running time (in secs) for experiment (k) in the top row and experiment (e) in the bottom row. Algorithms in the bar plots (left to right): MCP, LAO*, RTDP and VI (VI is only shown for problems with perfect sensing). The characteristics of the environments are given in the second and third rows under each of the bar plot. The second row indicates how many cells the 2D plane is discretized into, and the third row indicates the number of initially unknown cells in the environment. 4 Discussion The MCP algorithm incrementally builds a compressed MDP using a sequence of deterministic searches. Our experimental results suggest that MCP is advantageous for problems with sparse stochasticity. In particular, MCP has allowed us to scale to larger environments than were otherwise possible for the robot-helicopter coordination problem. Acknowledgements This research was supported by DARPA’s MARS program. All conclusions are our own. References [1] S. Bradtke A. Barto and S. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72:81–138, 1995. [2] E. Hansen and S. Zilberstein. LAO*: A heuristic search algorithm that finds solutions with loops. Artificial Intelligence, 129:35–62, 2001. [3] M. Likhachev, G. Gordon, and S. Thrun. ARA*: Anytime A* with provable bounds on sub-optimality. In Advances in Neural Information Processing Systems (NIPS) 16. Cambridge, MA: MIT Press, 2003. [4] M. Likhachev, G. Gordon, and S. Thrun. MCP: Formal analysis. Technical report, Carnegie Mellon University, Pittsburgh, PA, 2004. [5] L. Mero. A heuristic search algorithm with modifiable estimate. Artificial Intelligence, 23:13–27, 1984. [6] N. Nilsson. Principles of Artificial Intelligence. Palo Alto, CA: Tioga Publishing, 1980. [7] C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of Markov decision processses. Mathematics of Operations Research, 12(3):441–450, 1987.
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract Planning algorithms designed for deterministic worlds, such as A* search, usually run much faster than algorithms designed for worlds with uncertain action outcomes, such as value iteration. [sent-7, score-0.293]
2 Many real-world planning problems exhibit sparse uncertainty: there are long sequences of deterministic actions which accomplish tasks like moving sensor platforms into place, interspersed with a small number of sensing actions which have uncertain outcomes. [sent-9, score-0.513]
3 In this paper we describe a new planning algorithm, called MCP (short for MDP Compression Planning), which combines A* search with value iteration for solving Stochastic Shortest Path problem in MDPs with sparse stochasticity. [sent-10, score-0.265]
4 We present experiments which show that MCP can run substantially faster than competing planners in domains with sparse uncertainty; these experiments are based on a simulation of a ground robot cooperating with a helicopter to fill in a partial map and move to a goal location. [sent-11, score-0.398]
5 In deterministic planning problems, optimal paths are acyclic: no state is visited more than once. [sent-12, score-0.325]
6 Because of this property, algorithms like A* search can guarantee that they visit each state in the state space no more than once. [sent-13, score-0.297]
7 By visiting the states in an appropriate order, it is possible to ensure that we know the exact value of all of a state’s possible successors before we visit that state; so, the first time we visit a state we can compute its correct value. [sent-14, score-0.322]
8 By contrast, if actions have uncertain outcomes, optimal paths may contain cycles: some states will be visited two or more times with positive probability. [sent-15, score-0.281]
9 Because of these cycles, there is no way to order states so that we determine the values of a state’s successors before we visit the state itself. [sent-16, score-0.278]
10 In problems with sparse stochasticity, only a small fraction of all states have uncertain outcomes. [sent-18, score-0.161]
11 It is these few states that cause all of the cycles: while a deterministic state s may participate in a cycle, the only way it can do so is if one of its successors has an action with a stochastic outcome (and only if this stochastic action can lead to a predecessor of s). [sent-19, score-0.714]
12 In such problems, we would like to build a smaller MDP which contains only states which are related to stochastic actions. [sent-20, score-0.16]
13 We will call such an MDP a compressed MDP, and we will call its states distinguished states. [sent-21, score-0.406]
14 We could then run fast algorithms like A* search to plan paths between distinguished states, and reserve slower algorithms like value iteration for deciding how to deal with stochastic outcomes. [sent-22, score-0.303]
15 Second, there can be a large number of ways to get from one distinguished state to another: edges in the compressed MDP correspond to sequences of actions in the original MDP. [sent-25, score-0.467]
16 If we knew the values of all of the distinguished states exactly, then we could use A* search to generate optimal paths between them, but since we do not we cannot. [sent-26, score-0.276]
17 In this paper, we will describe an algorithm which incrementally builds a compressed MDP using a sequence of deterministic searches. [sent-27, score-0.33]
18 It adds states and edges to the compressed MDP only by encountering them along trajectories; so, it never adds irrelevant states or edges to the compressed MDP. [sent-28, score-0.802]
19 Trajectories are generated by deterministic search, and so undistinguished states are treated only with fast algorithms. [sent-29, score-0.18]
20 Bellman errors in the values for distinguished states show us where to try additional trajectories, and help us build the relevant parts of the compressed MDP as quickly as possible. [sent-30, score-0.406]
21 1 Robot-Helicopter Coordination Problem The motivation for our research was the problem of coordinating a ground robot and a helicopter. [sent-31, score-0.181]
22 The ground robot needs to plan a path from its current location to a goal, but has only partial knowledge of the surrounding terrain. [sent-32, score-0.233]
23 The helicopter can aid the ground robot by flying to and sensing places in the map. [sent-33, score-0.434]
24 Several places in the map are unknown, either because the robot has not visited them or because their status may have changed (e. [sent-38, score-0.223]
25 The robot must take the unknown locations into account when planning for its route. [sent-42, score-0.326]
26 Alternately, the robot can ask the helicopter to fly to any of these places and sense them. [sent-44, score-0.337]
27 The planning task is to minimize the expected overall cost of running the robot and the helicopter while getting the robot to its destination and the helicopter back to its home base. [sent-46, score-0.74]
28 Figure 1(e) shows a snapshot of the robot and helicopter executing a policy. [sent-47, score-0.318]
29 Designing a good policy for the robot and helicopter is a POMDP planning problem; unfortunately POMDPs are in general difficult to solve (PSPACE-complete [7]). [sent-48, score-0.471]
30 In the POMDP representation, a state is the position of the robot, the current location of the helicopter (a point on a line segment from one of the unknown places to another unknown place or the home base), and the true status of each unknown location. [sent-49, score-0.408]
31 The positions of the robot and the helicopter are observable, so that the only hidden variables are whether each unknown place is occupied. [sent-50, score-0.327]
32 The number of states is (# of robot locations)×(# of helicopter locations)×2# of unknown places . [sent-51, score-0.479]
33 So, the number of states is exponential in the number of unknown places and therefore quickly becomes very large. [sent-52, score-0.19]
34 We approach the problem by planning in the belief state space, that is, the space of probability distributions over world states. [sent-53, score-0.193]
35 We will discretize the continuous probability variables by breaking the interval [0, 1] into several chunks; so, the number of belief states is exponential in the number of unknown places, and classical algorithms such as value iteration are infeasible even on small problems. [sent-55, score-0.172]
36 With or without sensor noise, our belief state MDP differs from general MDPs because its stochastic transitions are sparse: large portions of the policy (while the robot and helicopter are traveling between unknown locations) are deterministic. [sent-58, score-0.557]
37 At a high level, it constructs a compressed MDP, denoted M c , which contains only the start, the goal, and some states which are outcomes of stochastic actions. [sent-61, score-0.472]
38 This information includes newly-discovered stochastic actions and their outcomes; better deterministic paths from one place to another; and more accurate value estimates similar to Bellman backups. [sent-63, score-0.246]
39 The deterministic searches can use an admissible heuristic h to focus their effort, so we can often avoid putting many irrelevant actions into M c . [sent-64, score-0.308]
40 Because M c will often be much smaller than M , we can afford to run stochastic planning algorithms like value iteration on it. [sent-65, score-0.225]
41 On the other hand, the information we get by planning in M c will improve the heuristic values that we use in our deterministic searches; so, the deterministic searches will tend to visit only relevant portions of the state space. [sent-66, score-0.472]
42 1 Constructing and Solving a Compressed MDP Each action in the compressed MDP represents several consecutive actions in M : if we see a sequence of states and actions s1 , a1 , s2 , a2 , . [sent-68, score-0.664]
43 ) In addition, if we see a sequence of deterministic actions ending in sgoal , say s1 , a1 , s2 , a2 , . [sent-72, score-0.45]
44 , sk , ak , sk+1 = sgoal , k we can define a compressed action which goes from s1 to sgoal at cost i=1 c(si , ai , si+1 ). [sent-75, score-1.073]
45 We can label each compressed action that starts at s with (s, s , a) (where a = null if s = sgoal ). [sent-76, score-0.779]
46 Among all compressed actions starting at s and ending at (s , a) there is (at least) one with lowest expected cost; we will call such an action an optimal compression of (s, s , a). [sent-77, score-0.527]
47 Write Astoch for the set of all pairs (s, a) such that action a when taken from state s has more than one possible outcome, and include as well (sgoal , null). [sent-78, score-0.21]
48 Write Sstoch for the states which are possible outcomes of the actions in Astoch , and include sstart as well. [sent-79, score-0.357]
49 If we include in our compressed MDP an optimal compression of (s, s , a) for every s ∈ Sstoch and every (s , a) ∈ Astoch , the result is what we call the full compressed MDP; an example is in Figure 2(b). [sent-80, score-0.562]
50 If we solve the full compressed MDP, the value of each state will be the same as the value of the corresponding state in M . [sent-81, score-0.41]
51 However, we do not need to do that much work: (a) action compression (b) full MDP compression (c) incremental MDP compression Figure 2: MDP compression Main() 01 initialize M c with sstart and sgoal and set their v-values to 0; 02 while (∃s ∈ M c s. [sent-82, score-0.743]
52 RHS(s) − v(s) > δ and s belongs to the current greedy policy) 03 select spivot to be any such state s; 04 [v; vlim ] = Search(spivot ); 05 v(spivot ) = v; 06 set the cost c(spivot , a, sgoal ) of the limit action a from spivot to vlim ; ¯ ¯ 07 optionally run some algorithm satisfying req. [sent-84, score-2.003]
53 A for a bounded amount of time to improve the value function in M c ; Figure 3: MCP main loop many states and actions in the full compressed MDP are irrelevant since they do not appear in the optimal policy from sstart to sgoal . [sent-85, score-0.929]
54 So, the goal of the MCP algorithm will be to construct only the relevant part of the compressed MDP by building M c incrementally. [sent-86, score-0.254]
55 Figure 2(c) shows the incremental construction of a compressed MDP which contains all of the stochastic states and actions along an optimal policy in M . [sent-87, score-0.568]
56 It begins by initializing M c to contain only sstart and sgoal , and it sets v(sstart ) = v(sgoal ) = 0. [sent-89, score-0.395]
57 The Bellman error is v(s) − RHS(s), where RHS(s) = min RHS(s, a) a∈A(s) RHS(s, a) = Es ∈succ(s,a) (c(s, a, s ) + v(s )) By convention the min of an empty set is ∞, so an s which does not have any compressed actions yet is considered to have infinite RHS. [sent-92, score-0.379]
58 MCP selects a state with negative Bellman error, spivot , and starts a search at that state. [sent-93, score-0.762]
59 (We note that there exist many possible ways to select spivot ; for example, we can choose the state with the largest negative Bellman error, or the largest error when weighted by state visitation probabilities in the best policy in M c . [sent-94, score-0.837]
60 ) The goal of this search is to find a new compressed action a such that its RHS-value can provide a new lower bound on v ∗ (spivot ). [sent-95, score-0.501]
61 This action can either decrease the current RHS(spivot ) (if a seems to be a better action in terms of the current v-values of action outcomes) or prove that the current RHS(spivot ) is valid. [sent-96, score-0.396]
62 As the search finds new compressed actions, it adds them and their outcomes to M c . [sent-100, score-0.441]
63 In addition to the compressed action and the updated value, the search algorithm returns a “limit value” vlim (spivot ). [sent-105, score-0.618]
64 These limit values allow MCP to run a standard MDP planning algorithm on M c to improve its v(s) estimates. [sent-106, score-0.159]
65 MCP can use any planning algorithm which guarantees that, for any s, it will not lower v(s) and will not increase v(s) beyond the smaller of vlim (s) and RHS(s) (Requirement A). [sent-107, score-0.223]
66 For example, we could insert a fake “limit action” into M c which goes directly from spivot to sgoal at cost vlim (spivot ) (as we do on line 06 in Figure 3), then run value iteration for a fixed amount of time, selecting for each backup a state with negative Bellman error. [sent-108, score-1.186]
67 After updating M c from the result of the search and any optional planning, MCP begins again by looking for another state with a negative Bellman error. [sent-109, score-0.175]
68 2 Searching the MDP Efficiently The top level algorithm (Figure 3) repeatedly invokes a search method for finding trajectories from spivot to sgoal . [sent-114, score-0.997]
69 First, the estimate v that search returns for the expected cost of spivot should always be admissible. [sent-116, score-0.758]
70 Second, the estimate v should be no less than the one-step lookahead value of spivot in M c . [sent-118, score-0.587]
71 This property ensures that search either increases the value of spivot or finds additional (or improved) compressed actions. [sent-120, score-0.973]
72 The third and final property is for the vlim value, and it is only important if MCP uses its optional planning step (line 07). [sent-121, score-0.258]
73 Here v ∗ (spivot ) denotes the minimum expected cost of starting at spivot , picking a compressed action not in M c , and acting optimally from then on. [sent-123, score-1.02]
74 (Note that v ∗ can be larger than v ∗ if the optimal compressed action is already part of M c . [sent-124, score-0.386]
75 One could adapt A* search to satisfy at least Properties 1 and 2 by assuming that we can control the outcome of stochastic actions. [sent-126, score-0.181]
76 However, this sort of search is highly optimistic and can bias the search towards improbable trajectories. [sent-127, score-0.242]
77 Also, it can only use heuristics which are even more optimistic than it is: that is, h must be admissible with respect to the optimistic assumption of controlled outcomes. [sent-128, score-0.155]
78 MCP-search finds the correct expected value for the first stochastic action it encounters on any given trajectory, and is therefore far less optimistic. [sent-130, score-0.207]
79 During each expansion a state s is removed from OPEN and all the g-values of s’s successors are updated; if g(s ) is decreased for some state s , A* inserts s into OPEN. [sent-135, score-0.208]
80 MCP is similar to A∗ , but the OPEN list can also contain state-action pairs {s, a} where a is a stochastic action (line 31). [sent-138, score-0.188]
81 This f -value is a lower bound on the cost of a policy that goes from sstart to sgoal by first executing a series of deterministic actions until action a is executed from state s. [sent-142, score-0.929]
82 When the search removes from OPEN a state-action pair {s, a} with a = null, it adds a compressed action to M c (line 33). [sent-145, score-0.515]
83 It also adds a compressed action if there is an optimal deterministic path to sgoal (line 37). [sent-146, score-0.812]
84 f best tracks the minimum f -value of all the compressed actions found. [sent-147, score-0.368]
85 The limit value vlim (spivot ) is obtained by continuing the search until the minimum f -value of elements in OPEN approaches f best + θ for some θ ≥ 0 (line 19). [sent-149, score-0.252]
86 To speed up repetitive searches, MCP-search improves the heuristic of every state that it encounters for the first time in the current search iteration (lines 01 and 02). [sent-151, score-0.291]
87 Theorem 1 The search function terminates and the following holds for the values it returns: (a) if sbest = null then v ∗ (spivot ) ≥ f best ≥ E{c(spivot , abest , s ) + v(s )} (b) if sbest = null then v ∗ (spivot ) = f best = ∞ (c) f best ≤ min{s,a}∈OPEN (fvalue({s, a})) ≤ v ∗ (spivot ). [sent-158, score-0.582]
88 If neither sgoal nor any state-action pairs were expanded, then sbest = null and (b) says that there is no policy from spivot that has a finite expected cost. [sent-159, score-1.119]
89 Using the above theorem it is easy to show that MCP-search satisfies Properties 1, 2 and 3, considering that f best is returned as variable v and min{s,a}∈OPEN (fvalue({s, a})) is returned as variable vlim in the main loop of the MCP algorithm (Figure 3). [sent-160, score-0.198]
90 Given these properties c the next theorem states the correctness of the outer MCP algorithm (in the theorem πgreedy denotes a greedy policy that always chooses an action that looks best based on its cost and the v-values of its immediate successors). [sent-164, score-0.44]
91 Theorem 2 Given a deterministic search algorithm which satisfies Properties 1–3, the c MCP algorithm will terminate. [sent-165, score-0.173]
92 Given the above theorem one can show that for 0 ≤ δ < cmin (where cmin is the c smallest expected action cost in our MDP) the expected cost of executing πgreedy from cmin ∗ sstart is at most cmin −δ v (sstart ). [sent-167, score-0.622]
93 Then we weight the value functions by the probabilities of their obstacle configurations, sum them, and add the cost of moving the helicopter back to its base if it is not already there. [sent-171, score-0.221]
94 This procedure results in optimistic cost estimates because it pretends that the robot will find out the obstacle locations immediately instead of having to wait to observe them. [sent-172, score-0.299]
95 We have compared MCP against three algorithms: RTDP [1], LAO* [2] and value iteration on reachable states (VI). [sent-174, score-0.155]
96 VI was only able to run on the problems with perfect sensing since the number of reachable states was too large for the others. [sent-180, score-0.242]
97 On average for these problems, MCP computed values for 58633 states while M c grew to 396 states, and MCP encountered 3740 stochastic transitions (to give a sense of the degree of stochasticity). [sent-186, score-0.16]
98 The main cost of MCP was in its deterministic search subroutine; this fact suggests that we might benefit from anytime search techniques such as ARA* [3]. [sent-187, score-0.338]
99 The results for the problems with imperfect sensing show that, as the number and density of uncertain outcomes increases, the advantage of MCP decreases. [sent-188, score-0.154]
100 4 Discussion The MCP algorithm incrementally builds a compressed MDP using a sequence of deterministic searches. [sent-203, score-0.33]
wordName wordTfidf (topN-words)
[('spivot', 0.587), ('mcp', 0.383), ('sgoal', 0.287), ('compressed', 0.254), ('mdp', 0.197), ('robot', 0.146), ('fvalue', 0.144), ('helicopter', 0.143), ('action', 0.132), ('planning', 0.115), ('sstart', 0.108), ('vlim', 0.108), ('null', 0.106), ('states', 0.104), ('search', 0.097), ('rhs', 0.095), ('actions', 0.087), ('open', 0.085), ('lao', 0.084), ('bellman', 0.084), ('state', 0.078), ('deterministic', 0.076), ('sbest', 0.072), ('policy', 0.067), ('sensing', 0.062), ('cmin', 0.06), ('succ', 0.06), ('outcomes', 0.058), ('stochastic', 0.056), ('compression', 0.054), ('successors', 0.052), ('places', 0.048), ('abest', 0.048), ('checkinitialize', 0.048), ('optimistic', 0.048), ('rtdp', 0.048), ('distinguished', 0.048), ('cost', 0.047), ('coordination', 0.046), ('visit', 0.044), ('heuristic', 0.043), ('admissible', 0.04), ('searches', 0.04), ('unknown', 0.038), ('astoch', 0.036), ('likhachev', 0.036), ('ground', 0.035), ('property', 0.035), ('ak', 0.034), ('uncertain', 0.034), ('sk', 0.032), ('adds', 0.032), ('perfect', 0.031), ('path', 0.031), ('obstacle', 0.031), ('stochasticity', 0.031), ('iteration', 0.03), ('executing', 0.029), ('sensor', 0.029), ('visited', 0.029), ('environments', 0.028), ('outcome', 0.028), ('faster', 0.027), ('best', 0.027), ('paths', 0.027), ('returns', 0.027), ('cycles', 0.027), ('locations', 0.027), ('acyclic', 0.027), ('trajectories', 0.026), ('line', 0.025), ('greedy', 0.025), ('ara', 0.024), ('improveheuristic', 0.024), ('insertupdatecompaction', 0.024), ('maxim', 0.024), ('repetitive', 0.024), ('sstoch', 0.024), ('run', 0.024), ('sparse', 0.023), ('irrelevant', 0.022), ('returned', 0.022), ('gordon', 0.021), ('mdps', 0.021), ('else', 0.021), ('plan', 0.021), ('anytime', 0.021), ('reachable', 0.021), ('limit', 0.02), ('min', 0.019), ('heuristics', 0.019), ('secs', 0.019), ('encounters', 0.019), ('nds', 0.019), ('theorem', 0.019), ('return', 0.019), ('bound', 0.018), ('mellon', 0.018), ('carnegie', 0.018), ('ties', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 147 nips-2004-Planning for Markov Decision Processes with Sparse Stochasticity
Author: Maxim Likhachev, Sebastian Thrun, Geoffrey J. Gordon
Abstract: Planning algorithms designed for deterministic worlds, such as A* search, usually run much faster than algorithms designed for worlds with uncertain action outcomes, such as value iteration. Real-world planning problems often exhibit uncertainty, which forces us to use the slower algorithms to solve them. Many real-world planning problems exhibit sparse uncertainty: there are long sequences of deterministic actions which accomplish tasks like moving sensor platforms into place, interspersed with a small number of sensing actions which have uncertain outcomes. In this paper we describe a new planning algorithm, called MCP (short for MDP Compression Planning), which combines A* search with value iteration for solving Stochastic Shortest Path problem in MDPs with sparse stochasticity. We present experiments which show that MCP can run substantially faster than competing planners in domains with sparse uncertainty; these experiments are based on a simulation of a ground robot cooperating with a helicopter to fill in a partial map and move to a goal location. In deterministic planning problems, optimal paths are acyclic: no state is visited more than once. Because of this property, algorithms like A* search can guarantee that they visit each state in the state space no more than once. By visiting the states in an appropriate order, it is possible to ensure that we know the exact value of all of a state’s possible successors before we visit that state; so, the first time we visit a state we can compute its correct value. By contrast, if actions have uncertain outcomes, optimal paths may contain cycles: some states will be visited two or more times with positive probability. Because of these cycles, there is no way to order states so that we determine the values of a state’s successors before we visit the state itself. Instead, the only way to compute state values is to solve a set of simultaneous equations. In problems with sparse stochasticity, only a small fraction of all states have uncertain outcomes. It is these few states that cause all of the cycles: while a deterministic state s may participate in a cycle, the only way it can do so is if one of its successors has an action with a stochastic outcome (and only if this stochastic action can lead to a predecessor of s). In such problems, we would like to build a smaller MDP which contains only states which are related to stochastic actions. We will call such an MDP a compressed MDP, and we will call its states distinguished states. We could then run fast algorithms like A* search to plan paths between distinguished states, and reserve slower algorithms like value iteration for deciding how to deal with stochastic outcomes. (a) Segbot (d) Planning map (b) Robotic helicopter (e) Execution simulation (c) 3D Map Figure 1: Robot-Helicopter Coordination There are two problems with such a strategy. First, there can be a large number of states which are related to stochastic actions, and so it may be impractical to enumerate all of them and make them all distinguished states; we would prefer instead to distinguish only states which are likely to be encountered while executing some policy which we are considering. Second, there can be a large number of ways to get from one distinguished state to another: edges in the compressed MDP correspond to sequences of actions in the original MDP. If we knew the values of all of the distinguished states exactly, then we could use A* search to generate optimal paths between them, but since we do not we cannot. In this paper, we will describe an algorithm which incrementally builds a compressed MDP using a sequence of deterministic searches. It adds states and edges to the compressed MDP only by encountering them along trajectories; so, it never adds irrelevant states or edges to the compressed MDP. Trajectories are generated by deterministic search, and so undistinguished states are treated only with fast algorithms. Bellman errors in the values for distinguished states show us where to try additional trajectories, and help us build the relevant parts of the compressed MDP as quickly as possible. 1 Robot-Helicopter Coordination Problem The motivation for our research was the problem of coordinating a ground robot and a helicopter. The ground robot needs to plan a path from its current location to a goal, but has only partial knowledge of the surrounding terrain. The helicopter can aid the ground robot by flying to and sensing places in the map. Figure 1(a) shows our ground robot, a converted Segway with a SICK laser rangefinder. Figure 1(b) shows the helicopter, also with a SICK. Figure 1(c) shows a 3D map of the environment in which the robot operates. The 3D map is post-processed to produce a discretized 2D environment (Figure 1(d)). Several places in the map are unknown, either because the robot has not visited them or because their status may have changed (e.g, a car may occupy a driveway). Such places are shown in Figure 1(d) as white squares. The elevation of each white square is proportional to the probability that there is an obstacle there; we assume independence between unknown squares. The robot must take the unknown locations into account when planning for its route. It may plan a path through these locations, but it risks having to turn back if its way is blocked. Alternately, the robot can ask the helicopter to fly to any of these places and sense them. We assign a cost to running the robot, and a somewhat higher cost to running the helicopter. The planning task is to minimize the expected overall cost of running the robot and the helicopter while getting the robot to its destination and the helicopter back to its home base. Figure 1(e) shows a snapshot of the robot and helicopter executing a policy. Designing a good policy for the robot and helicopter is a POMDP planning problem; unfortunately POMDPs are in general difficult to solve (PSPACE-complete [7]). In the POMDP representation, a state is the position of the robot, the current location of the helicopter (a point on a line segment from one of the unknown places to another unknown place or the home base), and the true status of each unknown location. The positions of the robot and the helicopter are observable, so that the only hidden variables are whether each unknown place is occupied. The number of states is (# of robot locations)×(# of helicopter locations)×2# of unknown places . So, the number of states is exponential in the number of unknown places and therefore quickly becomes very large. We approach the problem by planning in the belief state space, that is, the space of probability distributions over world states. This problem is a continuous-state MDP; in this belief MDP, our state consists of the ground robot’s location, the helicopter’s location, and a probability of occupancy for each unknown location. We will discretize the continuous probability variables by breaking the interval [0, 1] into several chunks; so, the number of belief states is exponential in the number of unknown places, and classical algorithms such as value iteration are infeasible even on small problems. If sensors are perfect, this domain is acyclic: after we sense a square we know its true state forever after. On the other hand, imperfect sensors can lead to cycles: new sensor data can contradict older sensor data and lead to increased uncertainty. With or without sensor noise, our belief state MDP differs from general MDPs because its stochastic transitions are sparse: large portions of the policy (while the robot and helicopter are traveling between unknown locations) are deterministic. The algorithm we propose in this paper takes advantage of this property of the problem, as we explain in the next section. 2 The Algorithm Our algorithm can be broken into two levels. At a high level, it constructs a compressed MDP, denoted M c , which contains only the start, the goal, and some states which are outcomes of stochastic actions. At a lower level, it repeatedly runs deterministic searches to find new information to put into M c . This information includes newly-discovered stochastic actions and their outcomes; better deterministic paths from one place to another; and more accurate value estimates similar to Bellman backups. The deterministic searches can use an admissible heuristic h to focus their effort, so we can often avoid putting many irrelevant actions into M c . Because M c will often be much smaller than M , we can afford to run stochastic planning algorithms like value iteration on it. On the other hand, the information we get by planning in M c will improve the heuristic values that we use in our deterministic searches; so, the deterministic searches will tend to visit only relevant portions of the state space. 2.1 Constructing and Solving a Compressed MDP Each action in the compressed MDP represents several consecutive actions in M : if we see a sequence of states and actions s1 , a1 , s2 , a2 , . . . , sk , ak where a1 through ak−1 are deterministic but ak is stochastic, then we can represent it in M c with a single action a, available at s1 , whose outcome distribution is P (s | sk , ak ) and whose cost is k−1 c(si , ai , si+1 ) + c(sk , ak , s ) c(s1 , a, s ) = i=1 (See Figure 2(a) for an example of such a compressed action.) In addition, if we see a sequence of deterministic actions ending in sgoal , say s1 , a1 , s2 , a2 , . . . , sk , ak , sk+1 = sgoal , k we can define a compressed action which goes from s1 to sgoal at cost i=1 c(si , ai , si+1 ). We can label each compressed action that starts at s with (s, s , a) (where a = null if s = sgoal ). Among all compressed actions starting at s and ending at (s , a) there is (at least) one with lowest expected cost; we will call such an action an optimal compression of (s, s , a). Write Astoch for the set of all pairs (s, a) such that action a when taken from state s has more than one possible outcome, and include as well (sgoal , null). Write Sstoch for the states which are possible outcomes of the actions in Astoch , and include sstart as well. If we include in our compressed MDP an optimal compression of (s, s , a) for every s ∈ Sstoch and every (s , a) ∈ Astoch , the result is what we call the full compressed MDP; an example is in Figure 2(b). If we solve the full compressed MDP, the value of each state will be the same as the value of the corresponding state in M . However, we do not need to do that much work: (a) action compression (b) full MDP compression (c) incremental MDP compression Figure 2: MDP compression Main() 01 initialize M c with sstart and sgoal and set their v-values to 0; 02 while (∃s ∈ M c s.t. RHS(s) − v(s) > δ and s belongs to the current greedy policy) 03 select spivot to be any such state s; 04 [v; vlim ] = Search(spivot ); 05 v(spivot ) = v; 06 set the cost c(spivot , a, sgoal ) of the limit action a from spivot to vlim ; ¯ ¯ 07 optionally run some algorithm satisfying req. A for a bounded amount of time to improve the value function in M c ; Figure 3: MCP main loop many states and actions in the full compressed MDP are irrelevant since they do not appear in the optimal policy from sstart to sgoal . So, the goal of the MCP algorithm will be to construct only the relevant part of the compressed MDP by building M c incrementally. Figure 2(c) shows the incremental construction of a compressed MDP which contains all of the stochastic states and actions along an optimal policy in M . The pseudocode for MCP is given in Figure 3. It begins by initializing M c to contain only sstart and sgoal , and it sets v(sstart ) = v(sgoal ) = 0. It maintains the invariant that 0 ≤ v(s) ≤ v ∗ (s) for all s. On each iteration, MCP looks at the Bellman error of each of the states in M c . The Bellman error is v(s) − RHS(s), where RHS(s) = min RHS(s, a) a∈A(s) RHS(s, a) = Es ∈succ(s,a) (c(s, a, s ) + v(s )) By convention the min of an empty set is ∞, so an s which does not have any compressed actions yet is considered to have infinite RHS. MCP selects a state with negative Bellman error, spivot , and starts a search at that state. (We note that there exist many possible ways to select spivot ; for example, we can choose the state with the largest negative Bellman error, or the largest error when weighted by state visitation probabilities in the best policy in M c .) The goal of this search is to find a new compressed action a such that its RHS-value can provide a new lower bound on v ∗ (spivot ). This action can either decrease the current RHS(spivot ) (if a seems to be a better action in terms of the current v-values of action outcomes) or prove that the current RHS(spivot ) is valid. Since v(s ) ≤ v ∗ (s ), one way to guarantee that RHS(spivot , a) ≤ v ∗ (spivot ) is to compute an optimal compression of (spivot , s, a) for all s, a, then choose the one with the smallest RHS. A more sophisticated strategy is to use an A∗ search with appropriate safeguards to make sure we never overestimate the value of a stochastic action. MCP, however, uses a modified A∗ search which we will describe in the next section. As the search finds new compressed actions, it adds them and their outcomes to M c . It is allowed to initialize newly-added states to any admissible values. When the search returns, MCP sets v(spivot ) to the returned value. This value is at least as large as RHS(spivot ). Consequently, Bellman error for spivot becomes non-negative. In addition to the compressed action and the updated value, the search algorithm returns a “limit value” vlim (spivot ). These limit values allow MCP to run a standard MDP planning algorithm on M c to improve its v(s) estimates. MCP can use any planning algorithm which guarantees that, for any s, it will not lower v(s) and will not increase v(s) beyond the smaller of vlim (s) and RHS(s) (Requirement A). For example, we could insert a fake “limit action” into M c which goes directly from spivot to sgoal at cost vlim (spivot ) (as we do on line 06 in Figure 3), then run value iteration for a fixed amount of time, selecting for each backup a state with negative Bellman error. After updating M c from the result of the search and any optional planning, MCP begins again by looking for another state with a negative Bellman error. It repeats this process until there are no negative Bellman errors larger than δ. For small enough δ, this property guarantees that we will be able to find a good policy (see section 2.3). 2.2 Searching the MDP Efficiently The top level algorithm (Figure 3) repeatedly invokes a search method for finding trajectories from spivot to sgoal . In order for the overall algorithm to work correctly, there are several properties that the search must satisfy. First, the estimate v that search returns for the expected cost of spivot should always be admissible. That is, 0 ≤ v ≤ v ∗ (spivot ) (Property 1). Second, the estimate v should be no less than the one-step lookahead value of spivot in M c . That is, v ≥ RHS(spivot ) (Property 2). This property ensures that search either increases the value of spivot or finds additional (or improved) compressed actions. The third and final property is for the vlim value, and it is only important if MCP uses its optional planning step (line 07). The property is that v ≤ vlim ≤ v ∗ (spivot ) (Property 3). Here v ∗ (spivot ) denotes the minimum expected cost of starting at spivot , picking a compressed action not in M c , and acting optimally from then on. (Note that v ∗ can be larger than v ∗ if the optimal compressed action is already part of M c .) Property 3 uses v ∗ rather than v ∗ since the latter is not known while it is possible to compute a lower bound on the former efficiently (see below). One could adapt A* search to satisfy at least Properties 1 and 2 by assuming that we can control the outcome of stochastic actions. However, this sort of search is highly optimistic and can bias the search towards improbable trajectories. Also, it can only use heuristics which are even more optimistic than it is: that is, h must be admissible with respect to the optimistic assumption of controlled outcomes. We therefore present a version of A*, called MCP-search (Figure 4), that is more efficient for our purposes. MCP-search finds the correct expected value for the first stochastic action it encounters on any given trajectory, and is therefore far less optimistic. And, MCP-search only requires heuristic values to be admissible with respect to v ∗ values, h(s) ≤ v ∗ (s). Finally, MCP-search speeds up repetitive searches by improving heuristic values based on previous searches. A* maintains a priority queue, OPEN, of states which it plans to expand. The OPEN queue is sorted by f (s) = g(s)+h(s), so that A* always expands next a state which appears to be on the shortest path from start to goal. During each expansion a state s is removed from OPEN and all the g-values of s’s successors are updated; if g(s ) is decreased for some state s , A* inserts s into OPEN. A* terminates as soon as the goal state is expanded. We use the variant of A* with pathmax [5] to use efficiently heuristics that do not satisfy the triangle inequality. MCP is similar to A∗ , but the OPEN list can also contain state-action pairs {s, a} where a is a stochastic action (line 31). Plain states are represented in OPEN as {s, null}. Just ImproveHeuristic(s) 01 if s ∈ M c then h(s) = max(h(s), v(s)); 02 improve heuristic h(s) further if possible using f best and g(s) from previous iterations; procedure fvalue({s, a}) 03 if s = null return ∞; 04 else if a = null return g(s) + h(s); 05 else return g(s) + max(h(s), Es ∈Succ(s,a) {c(s, a, s ) + h(s )}); CheckInitialize(s) 06 if s was accessed last in some previous search iteration 07 ImproveHeuristic(s); 08 if s was not yet initialized in the current search iteration 09 g(s) = ∞; InsertUpdateCompAction(spivot , s, a) 10 reconstruct the path from spivot to s; 11 insert compressed action (spivot , s, a) into A(spivot ) (or update the cost if a cheaper path was found) 12 for each outcome u of a that was not in M c previously 13 set v(u) to h(u) or any other value less than or equal to v ∗ (u); 14 set the cost c(u, a, sgoal ) of the limit action a from u to v(u); ¯ ¯ procedure Search(spivot ) 15 CheckInitialize(sgoal ), CheckInitialize(spivot ); 16 g(spivot ) = 0; 17 OPEN = {{spivot , null}}; 18 {sbest , abest } = {null, null}, f best = ∞; 19 while(g(sgoal ) > min{s,a}∈OPEN (fvalue({s, a})) AND f best + θ > min{s,a}∈OPEN (fvalue({s, a}))) 20 remove {s, a} with the smallest fvalue({s, a}) from OPEN breaking ties towards the pairs with a = null; 21 if a = null //expand state s 22 for each s ∈ Succ(s) 23 CheckInitialize(s ); 24 for each deterministic a ∈ A(s) 25 s = Succ(s, a ); 26 h(s ) = max(h(s ), h(s) − c(s, a , s )); 27 if g(s ) > g(s) + c(s, a , s ) 28 g(s ) = g(s) + c(s, a , s ); 29 insert/update {s , null} into OPEN with fvalue({s , null}); 30 for each stochastic a ∈ A(s) 31 insert/update {s, a } into OPEN with fvalue({s, a }); 32 else //encode stochastic action a from state s as a compressed action from spivot 33 InsertUpdateCompAction(spivot , s, a); 34 if f best > fvalue({s, a}) then {sbest , abest } = {s, a}, f best = fvalue({s, a}); 35 if (g(sgoal ) ≤ min{s,a}∈OPEN (fvalue({s, a})) AND OPEN = ∅) 36 reconstruct the path from spivot to sgoal ; 37 update/insert into A(spivot ) a deterministic action a leading to sgoal ; 38 if f best ≥ g(sgoal ) then {sbest , abest } = {sgoal , null}, f best = g(sgoal ); 39 return [f best; min{s,a}∈OPEN (fvalue({s, a}))]; Figure 4: MCP-search Algorithm like A*, MCP-search expands elements in the order of increasing f -values, but it breaks ties towards elements encoding plain states (line 20). The f -value of {s, a} is defined as g(s) + max[h(s), Es ∈Succ(s,a) (c(s, a, s ) + h(s ))] (line 05). This f -value is a lower bound on the cost of a policy that goes from sstart to sgoal by first executing a series of deterministic actions until action a is executed from state s. This bound is as tight as possible given our heuristic values. State expansion (lines 21-31) is very similar to A∗ . When the search removes from OPEN a state-action pair {s, a} with a = null, it adds a compressed action to M c (line 33). It also adds a compressed action if there is an optimal deterministic path to sgoal (line 37). f best tracks the minimum f -value of all the compressed actions found. As a result, f best ≤ v ∗ (spivot ) and is used as a new estimate for v(spivot ). The limit value vlim (spivot ) is obtained by continuing the search until the minimum f -value of elements in OPEN approaches f best + θ for some θ ≥ 0 (line 19). This minimum f -value then provides a lower bound on v ∗ (spivot ). To speed up repetitive searches, MCP-search improves the heuristic of every state that it encounters for the first time in the current search iteration (lines 01 and 02). Line 01 uses the fact that v(s) from M c is a lower bound on v ∗ (s). Line 02 uses the fact that f best − g(s) is a lower bound on v ∗ (s) at the end of each previous call to Search; for more details see [4]. 2.3 Theoretical Properties of the Algorithm We now present several theorems about our algorithm. The proofs of these and other theorems can be found in [4]. The first theorem states the main properties of MCP-search. Theorem 1 The search function terminates and the following holds for the values it returns: (a) if sbest = null then v ∗ (spivot ) ≥ f best ≥ E{c(spivot , abest , s ) + v(s )} (b) if sbest = null then v ∗ (spivot ) = f best = ∞ (c) f best ≤ min{s,a}∈OPEN (fvalue({s, a})) ≤ v ∗ (spivot ). If neither sgoal nor any state-action pairs were expanded, then sbest = null and (b) says that there is no policy from spivot that has a finite expected cost. Using the above theorem it is easy to show that MCP-search satisfies Properties 1, 2 and 3, considering that f best is returned as variable v and min{s,a}∈OPEN (fvalue({s, a})) is returned as variable vlim in the main loop of the MCP algorithm (Figure 3). Property 1 follows directly from (a) and (b) and the fact that costs are strictly positive and v-values are non-negative. Property 2 also follows trivially from (a) and (b). Property 3 follows from (c). Given these properties c the next theorem states the correctness of the outer MCP algorithm (in the theorem πgreedy denotes a greedy policy that always chooses an action that looks best based on its cost and the v-values of its immediate successors). Theorem 2 Given a deterministic search algorithm which satisfies Properties 1–3, the c MCP algorithm will terminate. Upon termination, for every state s ∈ M c ∩ πgreedy we ∗ have RHS(s) − δ ≤ v(s) ≤ v (s). Given the above theorem one can show that for 0 ≤ δ < cmin (where cmin is the c smallest expected action cost in our MDP) the expected cost of executing πgreedy from cmin ∗ sstart is at most cmin −δ v (sstart ). Picking δ ≥ cmin is not guaranteed to result in a proper policy, even though Theorem 2 continues to hold. 3 Experimental Study We have evaluated the MCP algorithm on the robot-helicopter coordination problem described in section 1. To obtain an admissible heuristic, we first compute a value function for every possible configuration of obstacles. Then we weight the value functions by the probabilities of their obstacle configurations, sum them, and add the cost of moving the helicopter back to its base if it is not already there. This procedure results in optimistic cost estimates because it pretends that the robot will find out the obstacle locations immediately instead of having to wait to observe them. The results of our experiments are shown in Figure 5. We have compared MCP against three algorithms: RTDP [1], LAO* [2] and value iteration on reachable states (VI). RTDP can cope with large size MDPs by focussing its planning efforts along simulated execution trajectories. LAO* uses heuristics to prune away irrelevant states, then repeatedly performs dynamic programming on the states in its current partial policy. We have implemented LAO* so that it reduces to AO* [6] when environments are acyclic (e.g., the robot-helicopter problem with perfect sensing). VI was only able to run on the problems with perfect sensing since the number of reachable states was too large for the others. The results support the claim that MCP can solve large problems with sparse stochasticity. For the problem with perfect sensing, on average MCP was able to plan 9.5 times faster than LAO*, 7.5 times faster than RTDP, and 8.5 times faster than VI. On average for these problems, MCP computed values for 58633 states while M c grew to 396 states, and MCP encountered 3740 stochastic transitions (to give a sense of the degree of stochasticity). The main cost of MCP was in its deterministic search subroutine; this fact suggests that we might benefit from anytime search techniques such as ARA* [3]. The results for the problems with imperfect sensing show that, as the number and density of uncertain outcomes increases, the advantage of MCP decreases. For these problems MCP was able to solve environments 10.2 times faster than LAO* but only 2.2 times faster than RTDP. On average MCP computed values for 127,442 states, while the size of M c was 3,713 states, and 24,052 stochastic transitions were encountered. Figure 5: Experimental results. The top row: the robot-helicopter coordination problem with perfect sensors. The bottom row: the robot-helicopter coordination problem with sensor noise. Left column: running times (in secs) for each algorithm grouped by environments. Middle column: the number of backups for each algorithm grouped by environments. Right column: an estimate of the expected cost of an optimal policy (v(sstart )) vs. running time (in secs) for experiment (k) in the top row and experiment (e) in the bottom row. Algorithms in the bar plots (left to right): MCP, LAO*, RTDP and VI (VI is only shown for problems with perfect sensing). The characteristics of the environments are given in the second and third rows under each of the bar plot. The second row indicates how many cells the 2D plane is discretized into, and the third row indicates the number of initially unknown cells in the environment. 4 Discussion The MCP algorithm incrementally builds a compressed MDP using a sequence of deterministic searches. Our experimental results suggest that MCP is advantageous for problems with sparse stochasticity. In particular, MCP has allowed us to scale to larger environments than were otherwise possible for the robot-helicopter coordination problem. Acknowledgements This research was supported by DARPA’s MARS program. All conclusions are our own. References [1] S. Bradtke A. Barto and S. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72:81–138, 1995. [2] E. Hansen and S. Zilberstein. LAO*: A heuristic search algorithm that finds solutions with loops. Artificial Intelligence, 129:35–62, 2001. [3] M. Likhachev, G. Gordon, and S. Thrun. ARA*: Anytime A* with provable bounds on sub-optimality. In Advances in Neural Information Processing Systems (NIPS) 16. Cambridge, MA: MIT Press, 2003. [4] M. Likhachev, G. Gordon, and S. Thrun. MCP: Formal analysis. Technical report, Carnegie Mellon University, Pittsburgh, PA, 2004. [5] L. Mero. A heuristic search algorithm with modifiable estimate. Artificial Intelligence, 23:13–27, 1984. [6] N. Nilsson. Principles of Artificial Intelligence. Palo Alto, CA: Tioga Publishing, 1980. [7] C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of Markov decision processses. Mathematics of Operations Research, 12(3):441–450, 1987.
2 0.1563922 202 nips-2004-VDCBPI: an Approximate Scalable Algorithm for Large POMDPs
Author: Pascal Poupart, Craig Boutilier
Abstract: Existing algorithms for discrete partially observable Markov decision processes can at best solve problems of a few thousand states due to two important sources of intractability: the curse of dimensionality and the policy space complexity. This paper describes a new algorithm (VDCBPI) that mitigates both sources of intractability by combining the Value Directed Compression (VDC) technique [13] with Bounded Policy Iteration (BPI) [14]. The scalability of VDCBPI is demonstrated on synthetic network management problems with up to 33 million states.
3 0.11509651 64 nips-2004-Experts in a Markov Decision Process
Author: Eyal Even-dar, Sham M. Kakade, Yishay Mansour
Abstract: We consider an MDP setting in which the reward function is allowed to change during each time step of play (possibly in an adversarial manner), yet the dynamics remain fixed. Similar to the experts setting, we address the question of how well can an agent do when compared to the reward achieved under the best stationary policy over time. We provide efficient algorithms, which have regret bounds with no dependence on the size of state space. Instead, these bounds depend only on a certain horizon time of the process and logarithmically on the number of actions. We also show that in the case that the dynamics change over time, the problem becomes computationally hard. 1
4 0.10724667 39 nips-2004-Coarticulation in Markov Decision Processes
Author: Khashayar Rohanimanesh, Robert Platt, Sridhar Mahadevan, Roderic Grupen
Abstract: We investigate an approach for simultaneously committing to multiple activities, each modeled as a temporally extended action in a semi-Markov decision process (SMDP). For each activity we define a set of admissible solutions consisting of the redundant set of optimal policies, and those policies that ascend the optimal statevalue function associated with them. A plan is then generated by merging them in such a way that the solutions to the subordinate activities are realized in the set of admissible solutions satisfying the superior activities. We present our theoretical results and empirically evaluate our approach in a simulated domain. 1
5 0.092144698 24 nips-2004-Approximately Efficient Online Mechanism Design
Author: David C. Parkes, Dimah Yanovsky, Satinder P. Singh
Abstract: Online mechanism design (OMD) addresses the problem of sequential decision making in a stochastic environment with multiple self-interested agents. The goal in OMD is to make value-maximizing decisions despite this self-interest. In previous work we presented a Markov decision process (MDP)-based approach to OMD in large-scale problem domains. In practice the underlying MDP needed to solve OMD is too large and hence the mechanism must consider approximations. This raises the possibility that agents may be able to exploit the approximation for selfish gain. We adopt sparse-sampling-based MDP algorithms to implement efficient policies, and retain truth-revelation as an approximate BayesianNash equilibrium. Our approach is empirically illustrated in the context of the dynamic allocation of WiFi connectivity to users in a coffeehouse. 1
6 0.08584585 1 nips-2004-A Cost-Shaping LP for Bellman Error Minimization with Performance Guarantees
7 0.065958031 102 nips-2004-Learning first-order Markov models for control
8 0.054911006 154 nips-2004-Resolving Perceptual Aliasing In The Presence Of Noisy Sensors
9 0.051948007 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition
10 0.047371428 124 nips-2004-Multiple Alignment of Continuous Time Series
11 0.045301095 159 nips-2004-Schema Learning: Experience-Based Construction of Predictive Action Models
12 0.045244116 56 nips-2004-Dynamic Bayesian Networks for Brain-Computer Interfaces
13 0.041659366 88 nips-2004-Intrinsically Motivated Reinforcement Learning
14 0.039332185 183 nips-2004-Temporal-Difference Networks
15 0.039160863 107 nips-2004-Making Latin Manuscripts Searchable using gHMM's
16 0.038184915 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms
17 0.03767715 184 nips-2004-The Cerebellum Chip: an Analog VLSI Implementation of a Cerebellar Model of Classical Conditioning
18 0.035559628 166 nips-2004-Semi-supervised Learning via Gaussian Processes
19 0.03364937 126 nips-2004-Nearly Tight Bounds for the Continuum-Armed Bandit Problem
20 0.032661993 70 nips-2004-Following Curved Regularized Optimization Solution Paths
topicId topicWeight
[(0, -0.098), (1, -0.015), (2, 0.165), (3, -0.05), (4, -0.112), (5, 0.143), (6, -0.004), (7, -0.033), (8, -0.068), (9, -0.027), (10, 0.028), (11, -0.026), (12, -0.007), (13, 0.064), (14, 0.009), (15, -0.034), (16, 0.008), (17, -0.044), (18, -0.054), (19, -0.03), (20, -0.026), (21, 0.033), (22, -0.077), (23, 0.066), (24, -0.013), (25, 0.035), (26, -0.008), (27, -0.049), (28, 0.109), (29, 0.074), (30, 0.115), (31, -0.07), (32, -0.166), (33, -0.063), (34, -0.041), (35, -0.016), (36, -0.019), (37, -0.013), (38, -0.03), (39, 0.149), (40, 0.073), (41, 0.082), (42, 0.026), (43, 0.05), (44, -0.113), (45, 0.068), (46, -0.01), (47, 0.022), (48, -0.027), (49, -0.128)]
simIndex simValue paperId paperTitle
same-paper 1 0.95816988 147 nips-2004-Planning for Markov Decision Processes with Sparse Stochasticity
Author: Maxim Likhachev, Sebastian Thrun, Geoffrey J. Gordon
Abstract: Planning algorithms designed for deterministic worlds, such as A* search, usually run much faster than algorithms designed for worlds with uncertain action outcomes, such as value iteration. Real-world planning problems often exhibit uncertainty, which forces us to use the slower algorithms to solve them. Many real-world planning problems exhibit sparse uncertainty: there are long sequences of deterministic actions which accomplish tasks like moving sensor platforms into place, interspersed with a small number of sensing actions which have uncertain outcomes. In this paper we describe a new planning algorithm, called MCP (short for MDP Compression Planning), which combines A* search with value iteration for solving Stochastic Shortest Path problem in MDPs with sparse stochasticity. We present experiments which show that MCP can run substantially faster than competing planners in domains with sparse uncertainty; these experiments are based on a simulation of a ground robot cooperating with a helicopter to fill in a partial map and move to a goal location. In deterministic planning problems, optimal paths are acyclic: no state is visited more than once. Because of this property, algorithms like A* search can guarantee that they visit each state in the state space no more than once. By visiting the states in an appropriate order, it is possible to ensure that we know the exact value of all of a state’s possible successors before we visit that state; so, the first time we visit a state we can compute its correct value. By contrast, if actions have uncertain outcomes, optimal paths may contain cycles: some states will be visited two or more times with positive probability. Because of these cycles, there is no way to order states so that we determine the values of a state’s successors before we visit the state itself. Instead, the only way to compute state values is to solve a set of simultaneous equations. In problems with sparse stochasticity, only a small fraction of all states have uncertain outcomes. It is these few states that cause all of the cycles: while a deterministic state s may participate in a cycle, the only way it can do so is if one of its successors has an action with a stochastic outcome (and only if this stochastic action can lead to a predecessor of s). In such problems, we would like to build a smaller MDP which contains only states which are related to stochastic actions. We will call such an MDP a compressed MDP, and we will call its states distinguished states. We could then run fast algorithms like A* search to plan paths between distinguished states, and reserve slower algorithms like value iteration for deciding how to deal with stochastic outcomes. (a) Segbot (d) Planning map (b) Robotic helicopter (e) Execution simulation (c) 3D Map Figure 1: Robot-Helicopter Coordination There are two problems with such a strategy. First, there can be a large number of states which are related to stochastic actions, and so it may be impractical to enumerate all of them and make them all distinguished states; we would prefer instead to distinguish only states which are likely to be encountered while executing some policy which we are considering. Second, there can be a large number of ways to get from one distinguished state to another: edges in the compressed MDP correspond to sequences of actions in the original MDP. If we knew the values of all of the distinguished states exactly, then we could use A* search to generate optimal paths between them, but since we do not we cannot. In this paper, we will describe an algorithm which incrementally builds a compressed MDP using a sequence of deterministic searches. It adds states and edges to the compressed MDP only by encountering them along trajectories; so, it never adds irrelevant states or edges to the compressed MDP. Trajectories are generated by deterministic search, and so undistinguished states are treated only with fast algorithms. Bellman errors in the values for distinguished states show us where to try additional trajectories, and help us build the relevant parts of the compressed MDP as quickly as possible. 1 Robot-Helicopter Coordination Problem The motivation for our research was the problem of coordinating a ground robot and a helicopter. The ground robot needs to plan a path from its current location to a goal, but has only partial knowledge of the surrounding terrain. The helicopter can aid the ground robot by flying to and sensing places in the map. Figure 1(a) shows our ground robot, a converted Segway with a SICK laser rangefinder. Figure 1(b) shows the helicopter, also with a SICK. Figure 1(c) shows a 3D map of the environment in which the robot operates. The 3D map is post-processed to produce a discretized 2D environment (Figure 1(d)). Several places in the map are unknown, either because the robot has not visited them or because their status may have changed (e.g, a car may occupy a driveway). Such places are shown in Figure 1(d) as white squares. The elevation of each white square is proportional to the probability that there is an obstacle there; we assume independence between unknown squares. The robot must take the unknown locations into account when planning for its route. It may plan a path through these locations, but it risks having to turn back if its way is blocked. Alternately, the robot can ask the helicopter to fly to any of these places and sense them. We assign a cost to running the robot, and a somewhat higher cost to running the helicopter. The planning task is to minimize the expected overall cost of running the robot and the helicopter while getting the robot to its destination and the helicopter back to its home base. Figure 1(e) shows a snapshot of the robot and helicopter executing a policy. Designing a good policy for the robot and helicopter is a POMDP planning problem; unfortunately POMDPs are in general difficult to solve (PSPACE-complete [7]). In the POMDP representation, a state is the position of the robot, the current location of the helicopter (a point on a line segment from one of the unknown places to another unknown place or the home base), and the true status of each unknown location. The positions of the robot and the helicopter are observable, so that the only hidden variables are whether each unknown place is occupied. The number of states is (# of robot locations)×(# of helicopter locations)×2# of unknown places . So, the number of states is exponential in the number of unknown places and therefore quickly becomes very large. We approach the problem by planning in the belief state space, that is, the space of probability distributions over world states. This problem is a continuous-state MDP; in this belief MDP, our state consists of the ground robot’s location, the helicopter’s location, and a probability of occupancy for each unknown location. We will discretize the continuous probability variables by breaking the interval [0, 1] into several chunks; so, the number of belief states is exponential in the number of unknown places, and classical algorithms such as value iteration are infeasible even on small problems. If sensors are perfect, this domain is acyclic: after we sense a square we know its true state forever after. On the other hand, imperfect sensors can lead to cycles: new sensor data can contradict older sensor data and lead to increased uncertainty. With or without sensor noise, our belief state MDP differs from general MDPs because its stochastic transitions are sparse: large portions of the policy (while the robot and helicopter are traveling between unknown locations) are deterministic. The algorithm we propose in this paper takes advantage of this property of the problem, as we explain in the next section. 2 The Algorithm Our algorithm can be broken into two levels. At a high level, it constructs a compressed MDP, denoted M c , which contains only the start, the goal, and some states which are outcomes of stochastic actions. At a lower level, it repeatedly runs deterministic searches to find new information to put into M c . This information includes newly-discovered stochastic actions and their outcomes; better deterministic paths from one place to another; and more accurate value estimates similar to Bellman backups. The deterministic searches can use an admissible heuristic h to focus their effort, so we can often avoid putting many irrelevant actions into M c . Because M c will often be much smaller than M , we can afford to run stochastic planning algorithms like value iteration on it. On the other hand, the information we get by planning in M c will improve the heuristic values that we use in our deterministic searches; so, the deterministic searches will tend to visit only relevant portions of the state space. 2.1 Constructing and Solving a Compressed MDP Each action in the compressed MDP represents several consecutive actions in M : if we see a sequence of states and actions s1 , a1 , s2 , a2 , . . . , sk , ak where a1 through ak−1 are deterministic but ak is stochastic, then we can represent it in M c with a single action a, available at s1 , whose outcome distribution is P (s | sk , ak ) and whose cost is k−1 c(si , ai , si+1 ) + c(sk , ak , s ) c(s1 , a, s ) = i=1 (See Figure 2(a) for an example of such a compressed action.) In addition, if we see a sequence of deterministic actions ending in sgoal , say s1 , a1 , s2 , a2 , . . . , sk , ak , sk+1 = sgoal , k we can define a compressed action which goes from s1 to sgoal at cost i=1 c(si , ai , si+1 ). We can label each compressed action that starts at s with (s, s , a) (where a = null if s = sgoal ). Among all compressed actions starting at s and ending at (s , a) there is (at least) one with lowest expected cost; we will call such an action an optimal compression of (s, s , a). Write Astoch for the set of all pairs (s, a) such that action a when taken from state s has more than one possible outcome, and include as well (sgoal , null). Write Sstoch for the states which are possible outcomes of the actions in Astoch , and include sstart as well. If we include in our compressed MDP an optimal compression of (s, s , a) for every s ∈ Sstoch and every (s , a) ∈ Astoch , the result is what we call the full compressed MDP; an example is in Figure 2(b). If we solve the full compressed MDP, the value of each state will be the same as the value of the corresponding state in M . However, we do not need to do that much work: (a) action compression (b) full MDP compression (c) incremental MDP compression Figure 2: MDP compression Main() 01 initialize M c with sstart and sgoal and set their v-values to 0; 02 while (∃s ∈ M c s.t. RHS(s) − v(s) > δ and s belongs to the current greedy policy) 03 select spivot to be any such state s; 04 [v; vlim ] = Search(spivot ); 05 v(spivot ) = v; 06 set the cost c(spivot , a, sgoal ) of the limit action a from spivot to vlim ; ¯ ¯ 07 optionally run some algorithm satisfying req. A for a bounded amount of time to improve the value function in M c ; Figure 3: MCP main loop many states and actions in the full compressed MDP are irrelevant since they do not appear in the optimal policy from sstart to sgoal . So, the goal of the MCP algorithm will be to construct only the relevant part of the compressed MDP by building M c incrementally. Figure 2(c) shows the incremental construction of a compressed MDP which contains all of the stochastic states and actions along an optimal policy in M . The pseudocode for MCP is given in Figure 3. It begins by initializing M c to contain only sstart and sgoal , and it sets v(sstart ) = v(sgoal ) = 0. It maintains the invariant that 0 ≤ v(s) ≤ v ∗ (s) for all s. On each iteration, MCP looks at the Bellman error of each of the states in M c . The Bellman error is v(s) − RHS(s), where RHS(s) = min RHS(s, a) a∈A(s) RHS(s, a) = Es ∈succ(s,a) (c(s, a, s ) + v(s )) By convention the min of an empty set is ∞, so an s which does not have any compressed actions yet is considered to have infinite RHS. MCP selects a state with negative Bellman error, spivot , and starts a search at that state. (We note that there exist many possible ways to select spivot ; for example, we can choose the state with the largest negative Bellman error, or the largest error when weighted by state visitation probabilities in the best policy in M c .) The goal of this search is to find a new compressed action a such that its RHS-value can provide a new lower bound on v ∗ (spivot ). This action can either decrease the current RHS(spivot ) (if a seems to be a better action in terms of the current v-values of action outcomes) or prove that the current RHS(spivot ) is valid. Since v(s ) ≤ v ∗ (s ), one way to guarantee that RHS(spivot , a) ≤ v ∗ (spivot ) is to compute an optimal compression of (spivot , s, a) for all s, a, then choose the one with the smallest RHS. A more sophisticated strategy is to use an A∗ search with appropriate safeguards to make sure we never overestimate the value of a stochastic action. MCP, however, uses a modified A∗ search which we will describe in the next section. As the search finds new compressed actions, it adds them and their outcomes to M c . It is allowed to initialize newly-added states to any admissible values. When the search returns, MCP sets v(spivot ) to the returned value. This value is at least as large as RHS(spivot ). Consequently, Bellman error for spivot becomes non-negative. In addition to the compressed action and the updated value, the search algorithm returns a “limit value” vlim (spivot ). These limit values allow MCP to run a standard MDP planning algorithm on M c to improve its v(s) estimates. MCP can use any planning algorithm which guarantees that, for any s, it will not lower v(s) and will not increase v(s) beyond the smaller of vlim (s) and RHS(s) (Requirement A). For example, we could insert a fake “limit action” into M c which goes directly from spivot to sgoal at cost vlim (spivot ) (as we do on line 06 in Figure 3), then run value iteration for a fixed amount of time, selecting for each backup a state with negative Bellman error. After updating M c from the result of the search and any optional planning, MCP begins again by looking for another state with a negative Bellman error. It repeats this process until there are no negative Bellman errors larger than δ. For small enough δ, this property guarantees that we will be able to find a good policy (see section 2.3). 2.2 Searching the MDP Efficiently The top level algorithm (Figure 3) repeatedly invokes a search method for finding trajectories from spivot to sgoal . In order for the overall algorithm to work correctly, there are several properties that the search must satisfy. First, the estimate v that search returns for the expected cost of spivot should always be admissible. That is, 0 ≤ v ≤ v ∗ (spivot ) (Property 1). Second, the estimate v should be no less than the one-step lookahead value of spivot in M c . That is, v ≥ RHS(spivot ) (Property 2). This property ensures that search either increases the value of spivot or finds additional (or improved) compressed actions. The third and final property is for the vlim value, and it is only important if MCP uses its optional planning step (line 07). The property is that v ≤ vlim ≤ v ∗ (spivot ) (Property 3). Here v ∗ (spivot ) denotes the minimum expected cost of starting at spivot , picking a compressed action not in M c , and acting optimally from then on. (Note that v ∗ can be larger than v ∗ if the optimal compressed action is already part of M c .) Property 3 uses v ∗ rather than v ∗ since the latter is not known while it is possible to compute a lower bound on the former efficiently (see below). One could adapt A* search to satisfy at least Properties 1 and 2 by assuming that we can control the outcome of stochastic actions. However, this sort of search is highly optimistic and can bias the search towards improbable trajectories. Also, it can only use heuristics which are even more optimistic than it is: that is, h must be admissible with respect to the optimistic assumption of controlled outcomes. We therefore present a version of A*, called MCP-search (Figure 4), that is more efficient for our purposes. MCP-search finds the correct expected value for the first stochastic action it encounters on any given trajectory, and is therefore far less optimistic. And, MCP-search only requires heuristic values to be admissible with respect to v ∗ values, h(s) ≤ v ∗ (s). Finally, MCP-search speeds up repetitive searches by improving heuristic values based on previous searches. A* maintains a priority queue, OPEN, of states which it plans to expand. The OPEN queue is sorted by f (s) = g(s)+h(s), so that A* always expands next a state which appears to be on the shortest path from start to goal. During each expansion a state s is removed from OPEN and all the g-values of s’s successors are updated; if g(s ) is decreased for some state s , A* inserts s into OPEN. A* terminates as soon as the goal state is expanded. We use the variant of A* with pathmax [5] to use efficiently heuristics that do not satisfy the triangle inequality. MCP is similar to A∗ , but the OPEN list can also contain state-action pairs {s, a} where a is a stochastic action (line 31). Plain states are represented in OPEN as {s, null}. Just ImproveHeuristic(s) 01 if s ∈ M c then h(s) = max(h(s), v(s)); 02 improve heuristic h(s) further if possible using f best and g(s) from previous iterations; procedure fvalue({s, a}) 03 if s = null return ∞; 04 else if a = null return g(s) + h(s); 05 else return g(s) + max(h(s), Es ∈Succ(s,a) {c(s, a, s ) + h(s )}); CheckInitialize(s) 06 if s was accessed last in some previous search iteration 07 ImproveHeuristic(s); 08 if s was not yet initialized in the current search iteration 09 g(s) = ∞; InsertUpdateCompAction(spivot , s, a) 10 reconstruct the path from spivot to s; 11 insert compressed action (spivot , s, a) into A(spivot ) (or update the cost if a cheaper path was found) 12 for each outcome u of a that was not in M c previously 13 set v(u) to h(u) or any other value less than or equal to v ∗ (u); 14 set the cost c(u, a, sgoal ) of the limit action a from u to v(u); ¯ ¯ procedure Search(spivot ) 15 CheckInitialize(sgoal ), CheckInitialize(spivot ); 16 g(spivot ) = 0; 17 OPEN = {{spivot , null}}; 18 {sbest , abest } = {null, null}, f best = ∞; 19 while(g(sgoal ) > min{s,a}∈OPEN (fvalue({s, a})) AND f best + θ > min{s,a}∈OPEN (fvalue({s, a}))) 20 remove {s, a} with the smallest fvalue({s, a}) from OPEN breaking ties towards the pairs with a = null; 21 if a = null //expand state s 22 for each s ∈ Succ(s) 23 CheckInitialize(s ); 24 for each deterministic a ∈ A(s) 25 s = Succ(s, a ); 26 h(s ) = max(h(s ), h(s) − c(s, a , s )); 27 if g(s ) > g(s) + c(s, a , s ) 28 g(s ) = g(s) + c(s, a , s ); 29 insert/update {s , null} into OPEN with fvalue({s , null}); 30 for each stochastic a ∈ A(s) 31 insert/update {s, a } into OPEN with fvalue({s, a }); 32 else //encode stochastic action a from state s as a compressed action from spivot 33 InsertUpdateCompAction(spivot , s, a); 34 if f best > fvalue({s, a}) then {sbest , abest } = {s, a}, f best = fvalue({s, a}); 35 if (g(sgoal ) ≤ min{s,a}∈OPEN (fvalue({s, a})) AND OPEN = ∅) 36 reconstruct the path from spivot to sgoal ; 37 update/insert into A(spivot ) a deterministic action a leading to sgoal ; 38 if f best ≥ g(sgoal ) then {sbest , abest } = {sgoal , null}, f best = g(sgoal ); 39 return [f best; min{s,a}∈OPEN (fvalue({s, a}))]; Figure 4: MCP-search Algorithm like A*, MCP-search expands elements in the order of increasing f -values, but it breaks ties towards elements encoding plain states (line 20). The f -value of {s, a} is defined as g(s) + max[h(s), Es ∈Succ(s,a) (c(s, a, s ) + h(s ))] (line 05). This f -value is a lower bound on the cost of a policy that goes from sstart to sgoal by first executing a series of deterministic actions until action a is executed from state s. This bound is as tight as possible given our heuristic values. State expansion (lines 21-31) is very similar to A∗ . When the search removes from OPEN a state-action pair {s, a} with a = null, it adds a compressed action to M c (line 33). It also adds a compressed action if there is an optimal deterministic path to sgoal (line 37). f best tracks the minimum f -value of all the compressed actions found. As a result, f best ≤ v ∗ (spivot ) and is used as a new estimate for v(spivot ). The limit value vlim (spivot ) is obtained by continuing the search until the minimum f -value of elements in OPEN approaches f best + θ for some θ ≥ 0 (line 19). This minimum f -value then provides a lower bound on v ∗ (spivot ). To speed up repetitive searches, MCP-search improves the heuristic of every state that it encounters for the first time in the current search iteration (lines 01 and 02). Line 01 uses the fact that v(s) from M c is a lower bound on v ∗ (s). Line 02 uses the fact that f best − g(s) is a lower bound on v ∗ (s) at the end of each previous call to Search; for more details see [4]. 2.3 Theoretical Properties of the Algorithm We now present several theorems about our algorithm. The proofs of these and other theorems can be found in [4]. The first theorem states the main properties of MCP-search. Theorem 1 The search function terminates and the following holds for the values it returns: (a) if sbest = null then v ∗ (spivot ) ≥ f best ≥ E{c(spivot , abest , s ) + v(s )} (b) if sbest = null then v ∗ (spivot ) = f best = ∞ (c) f best ≤ min{s,a}∈OPEN (fvalue({s, a})) ≤ v ∗ (spivot ). If neither sgoal nor any state-action pairs were expanded, then sbest = null and (b) says that there is no policy from spivot that has a finite expected cost. Using the above theorem it is easy to show that MCP-search satisfies Properties 1, 2 and 3, considering that f best is returned as variable v and min{s,a}∈OPEN (fvalue({s, a})) is returned as variable vlim in the main loop of the MCP algorithm (Figure 3). Property 1 follows directly from (a) and (b) and the fact that costs are strictly positive and v-values are non-negative. Property 2 also follows trivially from (a) and (b). Property 3 follows from (c). Given these properties c the next theorem states the correctness of the outer MCP algorithm (in the theorem πgreedy denotes a greedy policy that always chooses an action that looks best based on its cost and the v-values of its immediate successors). Theorem 2 Given a deterministic search algorithm which satisfies Properties 1–3, the c MCP algorithm will terminate. Upon termination, for every state s ∈ M c ∩ πgreedy we ∗ have RHS(s) − δ ≤ v(s) ≤ v (s). Given the above theorem one can show that for 0 ≤ δ < cmin (where cmin is the c smallest expected action cost in our MDP) the expected cost of executing πgreedy from cmin ∗ sstart is at most cmin −δ v (sstart ). Picking δ ≥ cmin is not guaranteed to result in a proper policy, even though Theorem 2 continues to hold. 3 Experimental Study We have evaluated the MCP algorithm on the robot-helicopter coordination problem described in section 1. To obtain an admissible heuristic, we first compute a value function for every possible configuration of obstacles. Then we weight the value functions by the probabilities of their obstacle configurations, sum them, and add the cost of moving the helicopter back to its base if it is not already there. This procedure results in optimistic cost estimates because it pretends that the robot will find out the obstacle locations immediately instead of having to wait to observe them. The results of our experiments are shown in Figure 5. We have compared MCP against three algorithms: RTDP [1], LAO* [2] and value iteration on reachable states (VI). RTDP can cope with large size MDPs by focussing its planning efforts along simulated execution trajectories. LAO* uses heuristics to prune away irrelevant states, then repeatedly performs dynamic programming on the states in its current partial policy. We have implemented LAO* so that it reduces to AO* [6] when environments are acyclic (e.g., the robot-helicopter problem with perfect sensing). VI was only able to run on the problems with perfect sensing since the number of reachable states was too large for the others. The results support the claim that MCP can solve large problems with sparse stochasticity. For the problem with perfect sensing, on average MCP was able to plan 9.5 times faster than LAO*, 7.5 times faster than RTDP, and 8.5 times faster than VI. On average for these problems, MCP computed values for 58633 states while M c grew to 396 states, and MCP encountered 3740 stochastic transitions (to give a sense of the degree of stochasticity). The main cost of MCP was in its deterministic search subroutine; this fact suggests that we might benefit from anytime search techniques such as ARA* [3]. The results for the problems with imperfect sensing show that, as the number and density of uncertain outcomes increases, the advantage of MCP decreases. For these problems MCP was able to solve environments 10.2 times faster than LAO* but only 2.2 times faster than RTDP. On average MCP computed values for 127,442 states, while the size of M c was 3,713 states, and 24,052 stochastic transitions were encountered. Figure 5: Experimental results. The top row: the robot-helicopter coordination problem with perfect sensors. The bottom row: the robot-helicopter coordination problem with sensor noise. Left column: running times (in secs) for each algorithm grouped by environments. Middle column: the number of backups for each algorithm grouped by environments. Right column: an estimate of the expected cost of an optimal policy (v(sstart )) vs. running time (in secs) for experiment (k) in the top row and experiment (e) in the bottom row. Algorithms in the bar plots (left to right): MCP, LAO*, RTDP and VI (VI is only shown for problems with perfect sensing). The characteristics of the environments are given in the second and third rows under each of the bar plot. The second row indicates how many cells the 2D plane is discretized into, and the third row indicates the number of initially unknown cells in the environment. 4 Discussion The MCP algorithm incrementally builds a compressed MDP using a sequence of deterministic searches. Our experimental results suggest that MCP is advantageous for problems with sparse stochasticity. In particular, MCP has allowed us to scale to larger environments than were otherwise possible for the robot-helicopter coordination problem. Acknowledgements This research was supported by DARPA’s MARS program. All conclusions are our own. References [1] S. Bradtke A. Barto and S. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72:81–138, 1995. [2] E. Hansen and S. Zilberstein. LAO*: A heuristic search algorithm that finds solutions with loops. Artificial Intelligence, 129:35–62, 2001. [3] M. Likhachev, G. Gordon, and S. Thrun. ARA*: Anytime A* with provable bounds on sub-optimality. In Advances in Neural Information Processing Systems (NIPS) 16. Cambridge, MA: MIT Press, 2003. [4] M. Likhachev, G. Gordon, and S. Thrun. MCP: Formal analysis. Technical report, Carnegie Mellon University, Pittsburgh, PA, 2004. [5] L. Mero. A heuristic search algorithm with modifiable estimate. Artificial Intelligence, 23:13–27, 1984. [6] N. Nilsson. Principles of Artificial Intelligence. Palo Alto, CA: Tioga Publishing, 1980. [7] C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of Markov decision processses. Mathematics of Operations Research, 12(3):441–450, 1987.
2 0.73796529 202 nips-2004-VDCBPI: an Approximate Scalable Algorithm for Large POMDPs
Author: Pascal Poupart, Craig Boutilier
Abstract: Existing algorithms for discrete partially observable Markov decision processes can at best solve problems of a few thousand states due to two important sources of intractability: the curse of dimensionality and the policy space complexity. This paper describes a new algorithm (VDCBPI) that mitigates both sources of intractability by combining the Value Directed Compression (VDC) technique [13] with Bounded Policy Iteration (BPI) [14]. The scalability of VDCBPI is demonstrated on synthetic network management problems with up to 33 million states.
3 0.73423153 39 nips-2004-Coarticulation in Markov Decision Processes
Author: Khashayar Rohanimanesh, Robert Platt, Sridhar Mahadevan, Roderic Grupen
Abstract: We investigate an approach for simultaneously committing to multiple activities, each modeled as a temporally extended action in a semi-Markov decision process (SMDP). For each activity we define a set of admissible solutions consisting of the redundant set of optimal policies, and those policies that ascend the optimal statevalue function associated with them. A plan is then generated by merging them in such a way that the solutions to the subordinate activities are realized in the set of admissible solutions satisfying the superior activities. We present our theoretical results and empirically evaluate our approach in a simulated domain. 1
4 0.73027951 1 nips-2004-A Cost-Shaping LP for Bellman Error Minimization with Performance Guarantees
Author: Daniela D. Farias, Benjamin V. Roy
Abstract: We introduce a new algorithm based on linear programming that approximates the differential value function of an average-cost Markov decision process via a linear combination of pre-selected basis functions. The algorithm carries out a form of cost shaping and minimizes a version of Bellman error. We establish an error bound that scales gracefully with the number of states without imposing the (strong) Lyapunov condition required by its counterpart in [6]. We propose a path-following method that automates selection of important algorithm parameters which represent counterparts to the “state-relevance weights” studied in [6]. 1
5 0.5893513 159 nips-2004-Schema Learning: Experience-Based Construction of Predictive Action Models
Author: Michael P. Holmes, Charles Jr.
Abstract: Schema learning is a way to discover probabilistic, constructivist, predictive action models (schemas) from experience. It includes methods for finding and using hidden state to make predictions more accurate. We extend the original schema mechanism [1] to handle arbitrary discrete-valued sensors, improve the original learning criteria to handle POMDP domains, and better maintain hidden state by using schema predictions. These extensions show large improvement over the original schema mechanism in several rewardless POMDPs, and achieve very low prediction error in a difficult speech modeling task. Further, we compare extended schema learning to the recently introduced predictive state representations [2], and find their predictions of next-step action effects to be approximately equal in accuracy. This work lays the foundation for a schema-based system of integrated learning and planning. 1
6 0.47215995 64 nips-2004-Experts in a Markov Decision Process
7 0.43034318 154 nips-2004-Resolving Perceptual Aliasing In The Presence Of Noisy Sensors
8 0.38474888 74 nips-2004-Harmonising Chorales by Probabilistic Inference
9 0.36635694 24 nips-2004-Approximately Efficient Online Mechanism Design
10 0.34815022 171 nips-2004-Solitaire: Man Versus Machine
11 0.34785098 184 nips-2004-The Cerebellum Chip: an Analog VLSI Implementation of a Cerebellar Model of Classical Conditioning
12 0.33204442 29 nips-2004-Beat Tracking the Graphical Model Way
13 0.33127597 124 nips-2004-Multiple Alignment of Continuous Time Series
14 0.33003473 56 nips-2004-Dynamic Bayesian Networks for Brain-Computer Interfaces
15 0.32658875 6 nips-2004-A Hidden Markov Model for de Novo Peptide Sequencing
16 0.27947846 185 nips-2004-The Convergence of Contrastive Divergences
17 0.27191746 183 nips-2004-Temporal-Difference Networks
18 0.27033928 107 nips-2004-Making Latin Manuscripts Searchable using gHMM's
19 0.23240258 102 nips-2004-Learning first-order Markov models for control
20 0.2277441 160 nips-2004-Seeing through water
topicId topicWeight
[(6, 0.304), (13, 0.049), (15, 0.08), (17, 0.012), (26, 0.067), (31, 0.015), (33, 0.215), (35, 0.028), (39, 0.015), (50, 0.027), (51, 0.01), (65, 0.016), (81, 0.024), (82, 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.81887281 147 nips-2004-Planning for Markov Decision Processes with Sparse Stochasticity
Author: Maxim Likhachev, Sebastian Thrun, Geoffrey J. Gordon
Abstract: Planning algorithms designed for deterministic worlds, such as A* search, usually run much faster than algorithms designed for worlds with uncertain action outcomes, such as value iteration. Real-world planning problems often exhibit uncertainty, which forces us to use the slower algorithms to solve them. Many real-world planning problems exhibit sparse uncertainty: there are long sequences of deterministic actions which accomplish tasks like moving sensor platforms into place, interspersed with a small number of sensing actions which have uncertain outcomes. In this paper we describe a new planning algorithm, called MCP (short for MDP Compression Planning), which combines A* search with value iteration for solving Stochastic Shortest Path problem in MDPs with sparse stochasticity. We present experiments which show that MCP can run substantially faster than competing planners in domains with sparse uncertainty; these experiments are based on a simulation of a ground robot cooperating with a helicopter to fill in a partial map and move to a goal location. In deterministic planning problems, optimal paths are acyclic: no state is visited more than once. Because of this property, algorithms like A* search can guarantee that they visit each state in the state space no more than once. By visiting the states in an appropriate order, it is possible to ensure that we know the exact value of all of a state’s possible successors before we visit that state; so, the first time we visit a state we can compute its correct value. By contrast, if actions have uncertain outcomes, optimal paths may contain cycles: some states will be visited two or more times with positive probability. Because of these cycles, there is no way to order states so that we determine the values of a state’s successors before we visit the state itself. Instead, the only way to compute state values is to solve a set of simultaneous equations. In problems with sparse stochasticity, only a small fraction of all states have uncertain outcomes. It is these few states that cause all of the cycles: while a deterministic state s may participate in a cycle, the only way it can do so is if one of its successors has an action with a stochastic outcome (and only if this stochastic action can lead to a predecessor of s). In such problems, we would like to build a smaller MDP which contains only states which are related to stochastic actions. We will call such an MDP a compressed MDP, and we will call its states distinguished states. We could then run fast algorithms like A* search to plan paths between distinguished states, and reserve slower algorithms like value iteration for deciding how to deal with stochastic outcomes. (a) Segbot (d) Planning map (b) Robotic helicopter (e) Execution simulation (c) 3D Map Figure 1: Robot-Helicopter Coordination There are two problems with such a strategy. First, there can be a large number of states which are related to stochastic actions, and so it may be impractical to enumerate all of them and make them all distinguished states; we would prefer instead to distinguish only states which are likely to be encountered while executing some policy which we are considering. Second, there can be a large number of ways to get from one distinguished state to another: edges in the compressed MDP correspond to sequences of actions in the original MDP. If we knew the values of all of the distinguished states exactly, then we could use A* search to generate optimal paths between them, but since we do not we cannot. In this paper, we will describe an algorithm which incrementally builds a compressed MDP using a sequence of deterministic searches. It adds states and edges to the compressed MDP only by encountering them along trajectories; so, it never adds irrelevant states or edges to the compressed MDP. Trajectories are generated by deterministic search, and so undistinguished states are treated only with fast algorithms. Bellman errors in the values for distinguished states show us where to try additional trajectories, and help us build the relevant parts of the compressed MDP as quickly as possible. 1 Robot-Helicopter Coordination Problem The motivation for our research was the problem of coordinating a ground robot and a helicopter. The ground robot needs to plan a path from its current location to a goal, but has only partial knowledge of the surrounding terrain. The helicopter can aid the ground robot by flying to and sensing places in the map. Figure 1(a) shows our ground robot, a converted Segway with a SICK laser rangefinder. Figure 1(b) shows the helicopter, also with a SICK. Figure 1(c) shows a 3D map of the environment in which the robot operates. The 3D map is post-processed to produce a discretized 2D environment (Figure 1(d)). Several places in the map are unknown, either because the robot has not visited them or because their status may have changed (e.g, a car may occupy a driveway). Such places are shown in Figure 1(d) as white squares. The elevation of each white square is proportional to the probability that there is an obstacle there; we assume independence between unknown squares. The robot must take the unknown locations into account when planning for its route. It may plan a path through these locations, but it risks having to turn back if its way is blocked. Alternately, the robot can ask the helicopter to fly to any of these places and sense them. We assign a cost to running the robot, and a somewhat higher cost to running the helicopter. The planning task is to minimize the expected overall cost of running the robot and the helicopter while getting the robot to its destination and the helicopter back to its home base. Figure 1(e) shows a snapshot of the robot and helicopter executing a policy. Designing a good policy for the robot and helicopter is a POMDP planning problem; unfortunately POMDPs are in general difficult to solve (PSPACE-complete [7]). In the POMDP representation, a state is the position of the robot, the current location of the helicopter (a point on a line segment from one of the unknown places to another unknown place or the home base), and the true status of each unknown location. The positions of the robot and the helicopter are observable, so that the only hidden variables are whether each unknown place is occupied. The number of states is (# of robot locations)×(# of helicopter locations)×2# of unknown places . So, the number of states is exponential in the number of unknown places and therefore quickly becomes very large. We approach the problem by planning in the belief state space, that is, the space of probability distributions over world states. This problem is a continuous-state MDP; in this belief MDP, our state consists of the ground robot’s location, the helicopter’s location, and a probability of occupancy for each unknown location. We will discretize the continuous probability variables by breaking the interval [0, 1] into several chunks; so, the number of belief states is exponential in the number of unknown places, and classical algorithms such as value iteration are infeasible even on small problems. If sensors are perfect, this domain is acyclic: after we sense a square we know its true state forever after. On the other hand, imperfect sensors can lead to cycles: new sensor data can contradict older sensor data and lead to increased uncertainty. With or without sensor noise, our belief state MDP differs from general MDPs because its stochastic transitions are sparse: large portions of the policy (while the robot and helicopter are traveling between unknown locations) are deterministic. The algorithm we propose in this paper takes advantage of this property of the problem, as we explain in the next section. 2 The Algorithm Our algorithm can be broken into two levels. At a high level, it constructs a compressed MDP, denoted M c , which contains only the start, the goal, and some states which are outcomes of stochastic actions. At a lower level, it repeatedly runs deterministic searches to find new information to put into M c . This information includes newly-discovered stochastic actions and their outcomes; better deterministic paths from one place to another; and more accurate value estimates similar to Bellman backups. The deterministic searches can use an admissible heuristic h to focus their effort, so we can often avoid putting many irrelevant actions into M c . Because M c will often be much smaller than M , we can afford to run stochastic planning algorithms like value iteration on it. On the other hand, the information we get by planning in M c will improve the heuristic values that we use in our deterministic searches; so, the deterministic searches will tend to visit only relevant portions of the state space. 2.1 Constructing and Solving a Compressed MDP Each action in the compressed MDP represents several consecutive actions in M : if we see a sequence of states and actions s1 , a1 , s2 , a2 , . . . , sk , ak where a1 through ak−1 are deterministic but ak is stochastic, then we can represent it in M c with a single action a, available at s1 , whose outcome distribution is P (s | sk , ak ) and whose cost is k−1 c(si , ai , si+1 ) + c(sk , ak , s ) c(s1 , a, s ) = i=1 (See Figure 2(a) for an example of such a compressed action.) In addition, if we see a sequence of deterministic actions ending in sgoal , say s1 , a1 , s2 , a2 , . . . , sk , ak , sk+1 = sgoal , k we can define a compressed action which goes from s1 to sgoal at cost i=1 c(si , ai , si+1 ). We can label each compressed action that starts at s with (s, s , a) (where a = null if s = sgoal ). Among all compressed actions starting at s and ending at (s , a) there is (at least) one with lowest expected cost; we will call such an action an optimal compression of (s, s , a). Write Astoch for the set of all pairs (s, a) such that action a when taken from state s has more than one possible outcome, and include as well (sgoal , null). Write Sstoch for the states which are possible outcomes of the actions in Astoch , and include sstart as well. If we include in our compressed MDP an optimal compression of (s, s , a) for every s ∈ Sstoch and every (s , a) ∈ Astoch , the result is what we call the full compressed MDP; an example is in Figure 2(b). If we solve the full compressed MDP, the value of each state will be the same as the value of the corresponding state in M . However, we do not need to do that much work: (a) action compression (b) full MDP compression (c) incremental MDP compression Figure 2: MDP compression Main() 01 initialize M c with sstart and sgoal and set their v-values to 0; 02 while (∃s ∈ M c s.t. RHS(s) − v(s) > δ and s belongs to the current greedy policy) 03 select spivot to be any such state s; 04 [v; vlim ] = Search(spivot ); 05 v(spivot ) = v; 06 set the cost c(spivot , a, sgoal ) of the limit action a from spivot to vlim ; ¯ ¯ 07 optionally run some algorithm satisfying req. A for a bounded amount of time to improve the value function in M c ; Figure 3: MCP main loop many states and actions in the full compressed MDP are irrelevant since they do not appear in the optimal policy from sstart to sgoal . So, the goal of the MCP algorithm will be to construct only the relevant part of the compressed MDP by building M c incrementally. Figure 2(c) shows the incremental construction of a compressed MDP which contains all of the stochastic states and actions along an optimal policy in M . The pseudocode for MCP is given in Figure 3. It begins by initializing M c to contain only sstart and sgoal , and it sets v(sstart ) = v(sgoal ) = 0. It maintains the invariant that 0 ≤ v(s) ≤ v ∗ (s) for all s. On each iteration, MCP looks at the Bellman error of each of the states in M c . The Bellman error is v(s) − RHS(s), where RHS(s) = min RHS(s, a) a∈A(s) RHS(s, a) = Es ∈succ(s,a) (c(s, a, s ) + v(s )) By convention the min of an empty set is ∞, so an s which does not have any compressed actions yet is considered to have infinite RHS. MCP selects a state with negative Bellman error, spivot , and starts a search at that state. (We note that there exist many possible ways to select spivot ; for example, we can choose the state with the largest negative Bellman error, or the largest error when weighted by state visitation probabilities in the best policy in M c .) The goal of this search is to find a new compressed action a such that its RHS-value can provide a new lower bound on v ∗ (spivot ). This action can either decrease the current RHS(spivot ) (if a seems to be a better action in terms of the current v-values of action outcomes) or prove that the current RHS(spivot ) is valid. Since v(s ) ≤ v ∗ (s ), one way to guarantee that RHS(spivot , a) ≤ v ∗ (spivot ) is to compute an optimal compression of (spivot , s, a) for all s, a, then choose the one with the smallest RHS. A more sophisticated strategy is to use an A∗ search with appropriate safeguards to make sure we never overestimate the value of a stochastic action. MCP, however, uses a modified A∗ search which we will describe in the next section. As the search finds new compressed actions, it adds them and their outcomes to M c . It is allowed to initialize newly-added states to any admissible values. When the search returns, MCP sets v(spivot ) to the returned value. This value is at least as large as RHS(spivot ). Consequently, Bellman error for spivot becomes non-negative. In addition to the compressed action and the updated value, the search algorithm returns a “limit value” vlim (spivot ). These limit values allow MCP to run a standard MDP planning algorithm on M c to improve its v(s) estimates. MCP can use any planning algorithm which guarantees that, for any s, it will not lower v(s) and will not increase v(s) beyond the smaller of vlim (s) and RHS(s) (Requirement A). For example, we could insert a fake “limit action” into M c which goes directly from spivot to sgoal at cost vlim (spivot ) (as we do on line 06 in Figure 3), then run value iteration for a fixed amount of time, selecting for each backup a state with negative Bellman error. After updating M c from the result of the search and any optional planning, MCP begins again by looking for another state with a negative Bellman error. It repeats this process until there are no negative Bellman errors larger than δ. For small enough δ, this property guarantees that we will be able to find a good policy (see section 2.3). 2.2 Searching the MDP Efficiently The top level algorithm (Figure 3) repeatedly invokes a search method for finding trajectories from spivot to sgoal . In order for the overall algorithm to work correctly, there are several properties that the search must satisfy. First, the estimate v that search returns for the expected cost of spivot should always be admissible. That is, 0 ≤ v ≤ v ∗ (spivot ) (Property 1). Second, the estimate v should be no less than the one-step lookahead value of spivot in M c . That is, v ≥ RHS(spivot ) (Property 2). This property ensures that search either increases the value of spivot or finds additional (or improved) compressed actions. The third and final property is for the vlim value, and it is only important if MCP uses its optional planning step (line 07). The property is that v ≤ vlim ≤ v ∗ (spivot ) (Property 3). Here v ∗ (spivot ) denotes the minimum expected cost of starting at spivot , picking a compressed action not in M c , and acting optimally from then on. (Note that v ∗ can be larger than v ∗ if the optimal compressed action is already part of M c .) Property 3 uses v ∗ rather than v ∗ since the latter is not known while it is possible to compute a lower bound on the former efficiently (see below). One could adapt A* search to satisfy at least Properties 1 and 2 by assuming that we can control the outcome of stochastic actions. However, this sort of search is highly optimistic and can bias the search towards improbable trajectories. Also, it can only use heuristics which are even more optimistic than it is: that is, h must be admissible with respect to the optimistic assumption of controlled outcomes. We therefore present a version of A*, called MCP-search (Figure 4), that is more efficient for our purposes. MCP-search finds the correct expected value for the first stochastic action it encounters on any given trajectory, and is therefore far less optimistic. And, MCP-search only requires heuristic values to be admissible with respect to v ∗ values, h(s) ≤ v ∗ (s). Finally, MCP-search speeds up repetitive searches by improving heuristic values based on previous searches. A* maintains a priority queue, OPEN, of states which it plans to expand. The OPEN queue is sorted by f (s) = g(s)+h(s), so that A* always expands next a state which appears to be on the shortest path from start to goal. During each expansion a state s is removed from OPEN and all the g-values of s’s successors are updated; if g(s ) is decreased for some state s , A* inserts s into OPEN. A* terminates as soon as the goal state is expanded. We use the variant of A* with pathmax [5] to use efficiently heuristics that do not satisfy the triangle inequality. MCP is similar to A∗ , but the OPEN list can also contain state-action pairs {s, a} where a is a stochastic action (line 31). Plain states are represented in OPEN as {s, null}. Just ImproveHeuristic(s) 01 if s ∈ M c then h(s) = max(h(s), v(s)); 02 improve heuristic h(s) further if possible using f best and g(s) from previous iterations; procedure fvalue({s, a}) 03 if s = null return ∞; 04 else if a = null return g(s) + h(s); 05 else return g(s) + max(h(s), Es ∈Succ(s,a) {c(s, a, s ) + h(s )}); CheckInitialize(s) 06 if s was accessed last in some previous search iteration 07 ImproveHeuristic(s); 08 if s was not yet initialized in the current search iteration 09 g(s) = ∞; InsertUpdateCompAction(spivot , s, a) 10 reconstruct the path from spivot to s; 11 insert compressed action (spivot , s, a) into A(spivot ) (or update the cost if a cheaper path was found) 12 for each outcome u of a that was not in M c previously 13 set v(u) to h(u) or any other value less than or equal to v ∗ (u); 14 set the cost c(u, a, sgoal ) of the limit action a from u to v(u); ¯ ¯ procedure Search(spivot ) 15 CheckInitialize(sgoal ), CheckInitialize(spivot ); 16 g(spivot ) = 0; 17 OPEN = {{spivot , null}}; 18 {sbest , abest } = {null, null}, f best = ∞; 19 while(g(sgoal ) > min{s,a}∈OPEN (fvalue({s, a})) AND f best + θ > min{s,a}∈OPEN (fvalue({s, a}))) 20 remove {s, a} with the smallest fvalue({s, a}) from OPEN breaking ties towards the pairs with a = null; 21 if a = null //expand state s 22 for each s ∈ Succ(s) 23 CheckInitialize(s ); 24 for each deterministic a ∈ A(s) 25 s = Succ(s, a ); 26 h(s ) = max(h(s ), h(s) − c(s, a , s )); 27 if g(s ) > g(s) + c(s, a , s ) 28 g(s ) = g(s) + c(s, a , s ); 29 insert/update {s , null} into OPEN with fvalue({s , null}); 30 for each stochastic a ∈ A(s) 31 insert/update {s, a } into OPEN with fvalue({s, a }); 32 else //encode stochastic action a from state s as a compressed action from spivot 33 InsertUpdateCompAction(spivot , s, a); 34 if f best > fvalue({s, a}) then {sbest , abest } = {s, a}, f best = fvalue({s, a}); 35 if (g(sgoal ) ≤ min{s,a}∈OPEN (fvalue({s, a})) AND OPEN = ∅) 36 reconstruct the path from spivot to sgoal ; 37 update/insert into A(spivot ) a deterministic action a leading to sgoal ; 38 if f best ≥ g(sgoal ) then {sbest , abest } = {sgoal , null}, f best = g(sgoal ); 39 return [f best; min{s,a}∈OPEN (fvalue({s, a}))]; Figure 4: MCP-search Algorithm like A*, MCP-search expands elements in the order of increasing f -values, but it breaks ties towards elements encoding plain states (line 20). The f -value of {s, a} is defined as g(s) + max[h(s), Es ∈Succ(s,a) (c(s, a, s ) + h(s ))] (line 05). This f -value is a lower bound on the cost of a policy that goes from sstart to sgoal by first executing a series of deterministic actions until action a is executed from state s. This bound is as tight as possible given our heuristic values. State expansion (lines 21-31) is very similar to A∗ . When the search removes from OPEN a state-action pair {s, a} with a = null, it adds a compressed action to M c (line 33). It also adds a compressed action if there is an optimal deterministic path to sgoal (line 37). f best tracks the minimum f -value of all the compressed actions found. As a result, f best ≤ v ∗ (spivot ) and is used as a new estimate for v(spivot ). The limit value vlim (spivot ) is obtained by continuing the search until the minimum f -value of elements in OPEN approaches f best + θ for some θ ≥ 0 (line 19). This minimum f -value then provides a lower bound on v ∗ (spivot ). To speed up repetitive searches, MCP-search improves the heuristic of every state that it encounters for the first time in the current search iteration (lines 01 and 02). Line 01 uses the fact that v(s) from M c is a lower bound on v ∗ (s). Line 02 uses the fact that f best − g(s) is a lower bound on v ∗ (s) at the end of each previous call to Search; for more details see [4]. 2.3 Theoretical Properties of the Algorithm We now present several theorems about our algorithm. The proofs of these and other theorems can be found in [4]. The first theorem states the main properties of MCP-search. Theorem 1 The search function terminates and the following holds for the values it returns: (a) if sbest = null then v ∗ (spivot ) ≥ f best ≥ E{c(spivot , abest , s ) + v(s )} (b) if sbest = null then v ∗ (spivot ) = f best = ∞ (c) f best ≤ min{s,a}∈OPEN (fvalue({s, a})) ≤ v ∗ (spivot ). If neither sgoal nor any state-action pairs were expanded, then sbest = null and (b) says that there is no policy from spivot that has a finite expected cost. Using the above theorem it is easy to show that MCP-search satisfies Properties 1, 2 and 3, considering that f best is returned as variable v and min{s,a}∈OPEN (fvalue({s, a})) is returned as variable vlim in the main loop of the MCP algorithm (Figure 3). Property 1 follows directly from (a) and (b) and the fact that costs are strictly positive and v-values are non-negative. Property 2 also follows trivially from (a) and (b). Property 3 follows from (c). Given these properties c the next theorem states the correctness of the outer MCP algorithm (in the theorem πgreedy denotes a greedy policy that always chooses an action that looks best based on its cost and the v-values of its immediate successors). Theorem 2 Given a deterministic search algorithm which satisfies Properties 1–3, the c MCP algorithm will terminate. Upon termination, for every state s ∈ M c ∩ πgreedy we ∗ have RHS(s) − δ ≤ v(s) ≤ v (s). Given the above theorem one can show that for 0 ≤ δ < cmin (where cmin is the c smallest expected action cost in our MDP) the expected cost of executing πgreedy from cmin ∗ sstart is at most cmin −δ v (sstart ). Picking δ ≥ cmin is not guaranteed to result in a proper policy, even though Theorem 2 continues to hold. 3 Experimental Study We have evaluated the MCP algorithm on the robot-helicopter coordination problem described in section 1. To obtain an admissible heuristic, we first compute a value function for every possible configuration of obstacles. Then we weight the value functions by the probabilities of their obstacle configurations, sum them, and add the cost of moving the helicopter back to its base if it is not already there. This procedure results in optimistic cost estimates because it pretends that the robot will find out the obstacle locations immediately instead of having to wait to observe them. The results of our experiments are shown in Figure 5. We have compared MCP against three algorithms: RTDP [1], LAO* [2] and value iteration on reachable states (VI). RTDP can cope with large size MDPs by focussing its planning efforts along simulated execution trajectories. LAO* uses heuristics to prune away irrelevant states, then repeatedly performs dynamic programming on the states in its current partial policy. We have implemented LAO* so that it reduces to AO* [6] when environments are acyclic (e.g., the robot-helicopter problem with perfect sensing). VI was only able to run on the problems with perfect sensing since the number of reachable states was too large for the others. The results support the claim that MCP can solve large problems with sparse stochasticity. For the problem with perfect sensing, on average MCP was able to plan 9.5 times faster than LAO*, 7.5 times faster than RTDP, and 8.5 times faster than VI. On average for these problems, MCP computed values for 58633 states while M c grew to 396 states, and MCP encountered 3740 stochastic transitions (to give a sense of the degree of stochasticity). The main cost of MCP was in its deterministic search subroutine; this fact suggests that we might benefit from anytime search techniques such as ARA* [3]. The results for the problems with imperfect sensing show that, as the number and density of uncertain outcomes increases, the advantage of MCP decreases. For these problems MCP was able to solve environments 10.2 times faster than LAO* but only 2.2 times faster than RTDP. On average MCP computed values for 127,442 states, while the size of M c was 3,713 states, and 24,052 stochastic transitions were encountered. Figure 5: Experimental results. The top row: the robot-helicopter coordination problem with perfect sensors. The bottom row: the robot-helicopter coordination problem with sensor noise. Left column: running times (in secs) for each algorithm grouped by environments. Middle column: the number of backups for each algorithm grouped by environments. Right column: an estimate of the expected cost of an optimal policy (v(sstart )) vs. running time (in secs) for experiment (k) in the top row and experiment (e) in the bottom row. Algorithms in the bar plots (left to right): MCP, LAO*, RTDP and VI (VI is only shown for problems with perfect sensing). The characteristics of the environments are given in the second and third rows under each of the bar plot. The second row indicates how many cells the 2D plane is discretized into, and the third row indicates the number of initially unknown cells in the environment. 4 Discussion The MCP algorithm incrementally builds a compressed MDP using a sequence of deterministic searches. Our experimental results suggest that MCP is advantageous for problems with sparse stochasticity. In particular, MCP has allowed us to scale to larger environments than were otherwise possible for the robot-helicopter coordination problem. Acknowledgements This research was supported by DARPA’s MARS program. All conclusions are our own. References [1] S. Bradtke A. Barto and S. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72:81–138, 1995. [2] E. Hansen and S. Zilberstein. LAO*: A heuristic search algorithm that finds solutions with loops. Artificial Intelligence, 129:35–62, 2001. [3] M. Likhachev, G. Gordon, and S. Thrun. ARA*: Anytime A* with provable bounds on sub-optimality. In Advances in Neural Information Processing Systems (NIPS) 16. Cambridge, MA: MIT Press, 2003. [4] M. Likhachev, G. Gordon, and S. Thrun. MCP: Formal analysis. Technical report, Carnegie Mellon University, Pittsburgh, PA, 2004. [5] L. Mero. A heuristic search algorithm with modifiable estimate. Artificial Intelligence, 23:13–27, 1984. [6] N. Nilsson. Principles of Artificial Intelligence. Palo Alto, CA: Tioga Publishing, 1980. [7] C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of Markov decision processses. Mathematics of Operations Research, 12(3):441–450, 1987.
2 0.61458313 72 nips-2004-Generalization Error and Algorithmic Convergence of Median Boosting
Author: Balázs Kégl
Abstract: We have recently proposed an extension of A DA B OOST to regression that uses the median of the base regressors as the final regressor. In this paper we extend theoretical results obtained for A DA B OOST to median boosting and to its localized variant. First, we extend recent results on efficient margin maximizing to show that the algorithm can converge to the maximum achievable margin within a preset precision in a finite number of steps. Then we provide confidence-interval-type bounds on the generalization error. 1
3 0.61187679 161 nips-2004-Self-Tuning Spectral Clustering
Author: Lihi Zelnik-manor, Pietro Perona
Abstract: We study a number of open issues in spectral clustering: (i) Selecting the appropriate scale of analysis, (ii) Handling multi-scale data, (iii) Clustering with irregular background clutter, and, (iv) Finding automatically the number of groups. We first propose that a ‘local’ scale should be used to compute the affinity between each pair of points. This local scaling leads to better clustering especially when the data includes multiple scales and when the clusters are placed within a cluttered background. We further suggest exploiting the structure of the eigenvectors to infer automatically the number of groups. This leads to a new algorithm in which the final randomly initialized k-means stage is eliminated. 1
4 0.61128277 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound
Author: Dori Peleg, Ron Meir
Abstract: A novel linear feature selection algorithm is presented based on the global minimization of a data-dependent generalization error bound. Feature selection and scaling algorithms often lead to non-convex optimization problems, which in many previous approaches were addressed through gradient descent procedures that can only guarantee convergence to a local minimum. We propose an alternative approach, whereby the global solution of the non-convex optimization problem is derived via an equivalent optimization problem. Moreover, the convex optimization task is reduced to a conic quadratic programming problem for which efficient solvers are available. Highly competitive numerical results on both artificial and real-world data sets are reported. 1
5 0.61120909 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve
Author: Corinna Cortes, Mehryar Mohri
Abstract: In many applications, good ranking is a highly desirable performance for a classifier. The criterion commonly used to measure the ranking quality of a classification algorithm is the area under the ROC curve (AUC). To report it properly, it is crucial to determine an interval of confidence for its value. This paper provides confidence intervals for the AUC based on a statistical and combinatorial analysis using only simple parameters such as the error rate and the number of positive and negative examples. The analysis is distribution-independent, it makes no assumption about the distribution of the scores of negative or positive examples. The results are of practical use and can be viewed as the equivalent for AUC of the standard confidence intervals given in the case of the error rate. They are compared with previous approaches in several standard classification tasks demonstrating the benefits of our analysis. 1 Motivation In many machine learning applications, the ranking quality of a classifier is critical. For example, the ordering of the list of relevant documents returned by a search engine or a document classification system is essential. The criterion widely used to measure the ranking quality of a classification algorithm is the area under an ROC curve (AUC). But, to measure and report the AUC properly, it is crucial to determine an interval of confidence for its value as it is customary for the error rate and other measures. It is also important to make the computation of the confidence interval practical by relying only on a small and simple number of parameters. In the case of the error rate, such intervals are often derived from just the sample size N . We present an extensive theoretical analysis of the AUC and show that a similar confidence interval can be derived for its value using only simple parameters such as the error rate k/N , the number of positive examples m, and the number of negative examples n = N − m. Thus, our results extend to AUC the computation of confidence intervals from a small number of readily available parameters. Our analysis is distribution-independent in the sense that it makes no assumption about the distribution of the scores of negative or positive examples. The use of the error rate helps determine tight confidence intervals. This contrasts with existing approaches presented in the statistical literature [11, 5, 2] which are based either on weak distribution-independent assumptions resulting in too loose confidence intervals, or strong distribution-dependent assumptions leading to tight but unsafe confidence intervals. We show that our results are of practical use. We also compare them with previous approaches in several standard classification tasks demonstrating the benefits of our analysis. Our results are also useful for testing the statistical significance of the difference of the AUC values of two classifiers. The paper is organized as follows. We first introduce the definition of the AUC, its connection with the Wilcoxon-Mann-Whitney statistic (Section 2), and briefly review some essential aspects of the existing literature related to the computation of confidence intervals for the AUC. Our computation of the expected value and variance of the AUC for a fixed error rate requires establishing several combinatorial identities. Section 4 presents some existing identities and gives the proof of novel ones useful for the computation of the variance. Section 5 gives the reduced expressions for the expected value and variance of the AUC for a fixed error rate. These can be efficiently computed and used to determine our confidence intervals for the AUC (Section 6). Section 7 reports the result of the comparison of our method with previous approaches, including empirical results for several standard tasks. 2 Definition and Properties of the AUC The Receiver Operating Characteristics (ROC) curves were originally introduced in signal detection theory [6] in connection with the study of radio signals, and have been used since then in many other applications, in particular for medical decision-making. Over the last few years, they have found increased interest in the machine learning and data mining communities for model evaluation and selection [14, 13, 7, 12, 16, 3]. The ROC curve for a binary classification problem plots the true positive rate as a function of the false positive rate. The points of the curve are obtained by sweeping the classification threshold from the most positive classification value to the most negative. For a fully random classification, the ROC curve is a straight line connecting the origin to (1, 1). Any improvement over random classification results in an ROC curve at least partially above this straight line. The AUC is defined as the area under the ROC curve. Consider a binary classification task with m positive examples and n negative examples. Let C be a fixed classifier that outputs a strictly ordered list for these examples. Let x1 , . . . , xm be the output of C on the positive examples and y1 , . . . , yn its output on the negative examples and denote by 1X the indicator function of a set X. Then, the AUC, A, associated to C is given by: A= m i=1 n j=1 1xi >yj (1) mn which is the value of the Wilcoxon-Mann-Whitney statistic [10]. Thus, the AUC is closely related to the ranking quality of the classification. It can be viewed as a measure based on pairwise comparisons between classifications of the two classes. It is an estimate of the probability Pxy that the classifier ranks a randomly chosen positive example higher than a negative example. With a perfect ranking, all positive examples are ranked higher than the negative ones and A = 1. Any deviation from this ranking decreases the AUC, and the expected AUC value for a random ranking is 0.5. 3 Overview of Related Work This section briefly describes some previous distribution-dependent approaches presented in the statistical literature to derive confidence intervals for the AUC and compares them to our method. The starting point for these analyses is a formula giving the variance of the AUC, A, for a fixed distribution of the scores Px of the positive examples and Py of the negative examples [10, 1]: 2 σA = A(1 − A) + (m − 1)(Pxxy − A2 ) + (n − 1)(Pxyy − A2 ) mn (2) where Pxxy is the probability that the classifier ranks two randomly chosen positive examples higher than a negative one, and Pxyy the probability that it ranks two randomly chosen negative examples lower than a positive one. To compute the variance exactly using Equation 2, the distributions Px and Py must be known. Hanley and McNeil [10] argue in favor of exponential distributions, loosely claiming that this upper-bounds the variance of normal distributions with various means and ratios of A 2A2 variances. They show that for exponential distributions Pxxy = 2−A and Pxyy = 1+A . The resulting confidence intervals are of course relatively tight, but their validity is questionable since they are based on a strong assumption about the distributions of the positive and negative scores that may not hold in many cases. An alternative considered by several authors to the exact computation of the variance is to determine instead the maximum of the variance over all possible continuous distributions with the same expected value of the AUC. For all such distributions, one can fix m and n and compute the expected AUC and its variance. The maximum variance is denoted by 2 σmax and is given by [5, 2]: 2 σmax = A(1 − A) 1 ≤ min {m, n} 4 min {m, n} (3) Unfortunately, this often yields loose confidence intervals of limited practical use. Our approach for computing the mean and variance of the AUC is distribution-independent and inspired by the machine learning literature where analyses typically center on the error rate. We require only that the error rate be measured and compute the mean and variance of the AUC over all distributions Px and Py that maintain the same error rate. Our approach is in line with that of [5, 2] but it crucially avoids considering the maximum of the variance. We show that it is possible to compute directly the mean and variance of the AUC assigning equal weight to all the possible distributions. Of course, one could argue that not all distributions Px and Py are equally probable, but since these distributions are highly problem-dependent, we find it risky to make any general assumption on the distributions and thereby limit the validity of our results. Our approach is further justified empirically by the experiments reported in the last section. 4 Combinatorial Analysis The analysis of the statistical properties of the AUC given a fixed error rate requires various combinatorial calculations. This section describes several of the combinatorial identities that are used in our computation of the confidence intervals. For all q ≥ 0, let Xq (k, m, n) be defined by: k M M Xq (k, m, n) = xq (4) x x x=0 where M = m − (k − x) + x, M = n + (k − x) − x, and x = k − x. In previous work, we derived the following two identities which we used to compute the expected value of the AUC [4]: k X0 (k, m, n) = x=0 n+m+1 x k X1 (k, m, n) = (k − x)(m − n) + k n + m + 1 2 x x=0 To simplify the expression of the variance of the AUC, we need to compute X2 (k, m, n). Proposition 1 Let k, m, n be non-negative integers such that k ≤ min{m, n}, then: k X2 (k, m, n) = P2 (k, m, n, x) x=0 m+n+1 x (5) where P2 is the following 4th-degree polynomial: P2 (k, m, n, x) = (k − x)/12(−2x3 + 2x2 (2m − n + 2k − 4) + x(−3m2 + 3nm + 3m − 5km − 2k 2 + 2 + k + nk + 6n) + (3(k − 1)m2 − 3nm(k − 1) + 6km + 5m + k 2 m + 8n + 8 − 9nk + 3k + k 2 + k 2 n)). Proof. 5 The proof of the proposition is left to a longer version of this paper. Expectation and Variance of the AUC This section presents the expression of the expectation and variance of the AUC for a fixed error rate k/N assuming that all classifications or rankings with k errors are equiprobable. For a given classification, there may be x, 0 ≤ x ≤ k, false positive examples. Since the number of errors is fixed, there are x = k − x false negative examples. The expression Xq discussed in the previous section represents the q-th moment of x over all classifications with exactly k errors. In previous work, we gave the exact expression of the expectation of the AUC for a fixed number of errors k: Proposition 2 ([4]) Assume that a binary classification task with m positive examples and n negative examples is given. Then, the expected value of the AUC, A, over all classifications with k errors is given by: E[A] = 1 − k (n − m)2 (m + n + 1) − m+n 4mn k − m+n k−1 m+n x=0 x k m+n+1 x=0 x . Note that the two sums in this expression cannot be further simplified since they are known not to admit a closed form [9]. We also gave the expression of the variance of the AUC in terms of the function F defined for all Y by: F (Y ) = k M M x=0 x x k M M x=0 x x Y . (6) The following proposition reproduces that result: Proposition 3 ([4]) Assume that a binary classification task with m positive examples and n negative examples is given. Then, the variance of the AUC A over all classifications with k errors is given by: σ 2 (A) = F ((1 − k−x x n+ m )2 ) 2 2 2 F ( mx +n(k−x) +(m(m+1)x+n(n+1)(k−x))−2x(k−x)(m+n+1) ). 12m2 n2 − F ((1 − k−x x n+ m 2 ))2 + Because of the products of binomial terms, the computation of the variance using this expression is inefficient even for relatively small values of m and n. This expression can however be reduced using the identities presented in the previous section which leads to significantly more efficient computations that we have been using in all our experiments. Corollary 1 ([4]) Assume that a binary classification task with m positive examples and n negative examples is given. Then, the variance of the AUC A over all classifications with ((m+n−2)Z4 −(2m−n+3k−10)Z3 ) k errors is given by: σ 2 (A) = (m+n+1)(m+n)(m+n−1)T72m2 n2 + (m+n+1)(m+n)T (m2 −nm+3km−5m+2k2 −nk+12−9k)Z2 48m2 n2 (m+n+1)Q1 Z1 kQ0 + 144m2 n2 with: 2 n2 72m Pk−i m+n+1−i Zi = x=0 Pk ( x=0 x (m+n+1) x ) − 2 (m+n+1)2 (m−n)4 Z1 16m2 n2 − , T = 3((m − n)2 + m + n) + 2, and: Q0 = (m + n + 1)T k2 + ((−3n2 + 3mn + 3m + 1)T − 12(3mn + m + n) − 8)k + (−3m2 + 7m + 10n + 3nm + 10)T − 4(3mn + m + n + 1) Q1 = T k3 + 3(m − 1)T k2 + ((−3n2 + 3mn − 3m + 8)T − 6(6mn + m + n))k + (−3m2 + 7(m + n) + 3mn)T − 2(6mn + m + n) Proof. The expression of the variance given in Proposition 3 requires the computation of Xq (k, m, n), q = 0, 1, 2. Using the identities giving the expressions of X0 and X1 and Proposition 1, which provides the expression of X2 , σ 2 (A) can be reduced to the expression given by the corollary. 6 Theory and Analysis Our estimate of the confidence interval for the AUC is based on a simple and natural assumption. The main idea for its computation is the following. Assume that a confidence interval E = [e1 , e2 ] is given for the error rate of a classifier C over a sample S, with the confidence level 1 − . This interval may have have been derived from a binomial model of C, which is a standard assumption for determining a confidence interval for the error rate, or from any other model used to compute that interval. For a given error rate e ∈ E, or equivalently for a given number of misclassifications, we can use the expectation and variance computed in the previous section and Chebyshev’s inequality to predict a confidence interval Ae for the AUC at the confidence level 1 − . Since our equiprobable model for the classifications is independent of the model used to compute the interval of confidence for the error rate, we can use E and Ae , e ∈ E, to compute a confidence interval of the AUC at the level (1 − )(1 − ). Theorem 1 Let C be a binary classifier and let S be a data sample of size N with m positive examples and n negative examples, N = m + n. Let E = [e1 , e2 ] be a confidence interval for the error rate of C over S at the confidence level 1 − . Then, for any , 0 ≤ ≤ 1, we can compute a confidence interval for the AUC value of the classifier C at the confidence level (1 − )(1 − ) that depends only on , , m, n, and the interval E. Proof. Let k1 = N e1 and k2 = N e2 be the number of errors associated to the error rates e1 and e2 , and let IK be the interval IK = [k1 , k2 ]. For a fixed k ∈ IK , by Propositions 2 and Corollary 1, we can compute the exact value of the expectation E[Ak ] and variance σ 2 (Ak ) of the AUC Ak . Using Chebyshev’s inequality, for any k ∈ IK and any k > 0, σ(Ak ) P |Ak − E[Ak ]| ≥ √ k ≤ (7) k where E[Ak ] and σ(Ak ) are the expressions given in Propositions 2 and Corollary 1, which depend only on k, m, and n. Let α1 and α2 be defined by: α1 = min k∈IK σ(Ak ) E[Ak ] − √ σ(Ak ) α2 = max E[Ak ] + √ k∈IK k (8) k α1 and α2 only depend on IK (i.e., on e1 and e2 ), and on k, m, and n. Let IA be the confidence interval defined by IA = [α1 , α2 ] and let k = for all k ∈ IK . Using the fact that the confidence interval E is independent of our equiprobability model for fixed-k AUC values and the Bayes’ rule: P(A ∈ IA ) = k∈R+ ≥ k∈IK P (A ∈ IA | K = k)P (K = k) (9) P (A ∈ IA | K = k)P (K = k) (10) ≥ (1 − ) k∈IK P (K = k) ≥ (1 − )(1 − ) (11) where we used the property of Eq. 7 and the definitions of the intervals IK and IA . Thus, IA constitutes a confidence interval for the AUC value of C at the confidence level (1 − )(1 − ). In practice, the confidence interval E is often determined as a result of the assumption that C follows a binomial law. This leads to the following theorem. .020 .035 Standard deviation Standard deviation .030 .015 .010 Max Distribution−dependent Distribution−independent .005 .025 .020 .015 Max Distribution−dependent Distribution−independent .010 .005 0.75 0.80 0.85 0.90 0.95 1.00 0.6 0.7 0.8 AUC (a) 0.9 1.0 AUC (b) Figure 1: Comparison of the standard deviations for three different methods with: (a) m = n = 500; (b) m = 400 and n = 200. The curves are obtained by computing the expected AUC and its standard deviations for different values of the error rate using the maximum-variance approach (Eq. 3), our distribution-independent method, and the distribution-dependent approach of Hanley [10]. Theorem 2 Let C be a binary classifier, let S be a data sample of size N with m positive examples and n negative examples, N = m + n, and let k0 be the number of misclassifications of C on S. Assume that C follows a binomial law, then, for any , 0 ≤ ≤ 1, we can compute a confidence interval of the AUC value of the classifier C at the confidence level 1 − that depends only on , k0 , m, and n. Proof. Assume that C follows a binomial law with coefficient p. Then, Chebyshev’s inequality yields: 1 p(1 − p) ≤ (12) P(|C − E[C]| ≥ η) ≤ 2 Nη 4N η 2 1 1 Thus, E = [ k0 − √ √ , k0 + √ √ ] forms a confidence interval for the N 2 (1− 1− )N N 2 √ (1− 1− )N error rate of C at the confidence level 1 − . By Theorem 1, we can√ compute for the √ AUC value a confidence interval at the level (1 − (1 − 1 − ))(1 − (1 − 1 − )) = 1 − depending only on , m, n, and the interval E, i.e., k0 , N = m + n, and . For large N , we can use the normal approximation of the binomial law to determine a finer interval E. Indeed, for large N , √ (13) P(|C − E[C]| ≥ η) ≤ 2Φ(2 N η) with Φ(u) = ∞ e−x2 /2 √ dx. u 2π Thus, E = [ k0 − N √ Φ−1 ( 1− 21− ) k0 √ ,N 2 N √ confidence interval for the error rate at the confidence level √ + Φ−1 ( 1− 21− ) √ ] 2 N is the 1− . For simplicity, in the proof of Theorem 2, k was chosen to be a constant ( k = ) but, in general, it can be another function of k leading to tighter confidence intervals. The results presented in the next section were obtained with k = a0 exp((k − k0 )2 /2a2 ), where a0 1 and a1 are constants selected so that the inequality 11 be verified. 7 Experiments and Comparisons The analysis in the previous section provides a principled method for computing a confidence interval of the AUC value of a classier C at the confidence level 1 − that depends only on k, n and m. As already discussed, other expressions found in the statistical literature lead to either too loose or unsafely narrow confidence intervals based on questionable assumptions on the probability functions Px and Py [10, 15]. Figure 1 shows a comparison of the standard deviations obtained using the maximum-approach (Eq. 3), the distribution-dependent expression from [10], and our distribution-independent method for NAME m+n n m+n AUC k m+n σindep σA σdep σmax 368 700 303 1159 2473 201 0.63 0.67 0.54 0.17 0.10 0.37 0.70 0.63 0.87 0.85 0.84 0.85 0.24 0.26 0.13 0.05 0.03 0.13 0.0297 0.0277 0.0176 0.0177 0.0164 0.0271 0.0440 0.0330 0.0309 0.0161 0.0088 0.0463 0.0269 0.0215 0.0202 0.0176 0.0161 0.0306 0.0392 0.0317 0.0281 0.0253 0.0234 0.0417 pima yeast credit internet-ads page-blocks ionosphere Table 1: Accuracy and AUC values for AdaBoost [8] and estimated standard deviations for several datasets from the UC Irvine repository. σindep is a distribution-independent standard deviation obtained using our method (Theorem 2). σA is given by Eq. (2) with the values of A, Pxxy , and Pxyy derived from data. σdep is the distribution-dependent standard deviation of Hanley [10], which is based on assumptions that may not always hold. σmax is defined by Eq. (3). All results were obtained on a randomly selected test set of size m + n. various error rates. For m = n = 500, our distribution-independent method consistently leads to tighter confidence intervals (Fig. 1 (a)). It also leads to tighter confidence intervals for AUC values more than .75 for the uneven distribution m = 400 and n = 200 (Fig. 1 (b)). For lower AUC values, the distribution-dependent approach produces tighter intervals, but its underlying assumptions may not hold. A different comparison was made using several datasets available from the UC Irvine repository (Table 1). The table shows that our estimates of the standard deviations (σindep ) are in general close to or tighter than the distribution-dependent standard deviation σdep of Hanley [10]. This is despite we do not make any assumption about the distributions of positive and negative examples. In contrast, Hanley’s method is based on specific assumptions about these distributions. Plots of the actual ranking distribution demonstrate that these assumptions are often violated however. Thus, the relatively good performance of Hanley’s approach on several data sets can be viewed as fortuitous and is not general. Our distribution-independent method provides tight confidence intervals, in some cases tighter than those derived from σA , in particular because it exploits the information provided by the error rate. Our analysis can also be used to determine if the AUC values produced by two classifiers are statistically significant by checking if the AUC value of one falls within the confidence interval of the other. 8 Conclusion We presented principled techniques for computing useful confidence intervals for the AUC from simple parameters: the error rate, and the negative and positive sample sizes. We demonstrated the practicality of these confidence intervals by comparing them to previous approaches in several tasks. We also derived the exact expression of the variance of the AUC for a fixed k, which can be of interest in other analyses related to the AUC. The Wilcoxon-Mann-Whitney statistic is a general measure of the quality of a ranking that is an estimate of the probability that the classifier ranks a randomly chosen positive example higher than a negative example. One could argue that accuracy at the top or the bottom of the ranking is of higher importance. This, however, contrarily to some belief, is already captured to a certain degree by the definition of the Wilcoxon-Mann-Whitney statistic which penalizes more errors at the top or the bottom of the ranking. It is however an interesting research problem to determine how to incorporate this bias in a stricter way in the form of a score-specific weight in the ranking measure, a weighted WilcoxonMann-Whitney statistic, or how to compute the corresponding expected value and standard deviation in a general way and design machine learning algorithms to optimize such a mea- sure. A preliminary analysis suggests, however, that the calculation of the expectation and the variance are likely to be extremely complex in that case. Finally, it could also be interesting but difficult to adapt our results to the distribution-dependent case and compare them to those of [10]. Acknowledgments We thank Rob Schapire for pointing out to us the problem of the statistical significance of the AUC, Daryl Pregibon for the reference to [11], and Saharon Rosset for various discussions about the topic of this paper. References [1] D. Bamber. The Area above the Ordinal Dominance Graph and the Area below the Receiver Operating Characteristic Graph. Journal of Math. Psychology, 12, 1975. [2] Z. W. Birnbaum and O. M. Klose. Bounds for the Variance of the Mann-Whitney Statistic. Annals of Mathematical Statistics, 38, 1957. [3] J-H. Chauchat, R. Rakotomalala, M. Carloz, and C. Pelletier. Targeting Customer Groups using Gain and Cost Matrix; a Marketing Application. Technical report, ERIC Laboratory - University of Lyon 2, 2001. [4] Corinna Cortes and Mehryar Mohri. AUC Optimization vs. Error Rate Minimization. In Advances in Neural Information Processing Systems (NIPS 2003), volume 16, Vancouver, Canada, 2004. MIT Press. [5] D. Van Dantzig. On the Consistency and Power of Wilcoxon’s Two Sample Test. In Koninklijke Nederlandse Akademie van Weterschappen, Series A, volume 54, 1915. [6] J. P. Egan. Signal Detection Theory and ROC Analysis. Academic Press, 1975. [7] C. Ferri, P. Flach, and J. Hern´ ndez-Orallo. Learning Decision Trees Using the Area a Under the ROC Curve. In Proceedings of the 19th International Conference on Machine Learning. Morgan Kaufmann, 2002. [8] Yoav Freund and Robert E. Schapire. A Decision Theoretical Generalization of OnLine Learning and an Application to Boosting. In Proceedings of the Second European Conference on Computational Learning Theory, volume 2, 1995. [9] Ronald L. Graham, Donald E. Knuth, and Oren Patashnik. Concrete Mathematics. Addison-Wesley, Reading, Massachusetts, 1994. [10] J. A. Hanley and B. J. McNeil. The Meaning and Use of the Area under a Receiver Operating Characteristic (ROC) Curve. Radiology, 1982. [11] E. L. Lehmann. Nonparametrics: Statistical Methods Based on Ranks. Holden-Day, San Francisco, California, 1975. [12] M. C. Mozer, R. Dodier, M. D. Colagrosso, C. Guerra-Salcedo, and R. Wolniewicz. Prodding the ROC Curve: Constrained Optimization of Classifier Performance. In Neural Information Processing Systems (NIPS 2002). MIT Press, 2002. [13] C. Perlich, F. Provost, and J. Simonoff. Tree Induction vs. Logistic Regression: A Learning Curve Analysis. Journal of Machine Learning Research, 2003. [14] F. Provost and T. Fawcett. Analysis and Visualization of Classifier Performance: Comparison under Imprecise Class and Cost Distribution. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining. AAAI, 1997. [15] Saharon Rosset. Ranking-Methods for Flexible Evaluation and Efficient Comparison of 2-Class Models. Master’s thesis, Tel-Aviv University, 1999. [16] L. Yan, R. Dodier, M. C. Mozer, and R. Wolniewicz. Optimizing Classifier Performance via the Wilcoxon-Mann-Whitney Statistics. In Proceedings of the International Conference on Machine Learning, 2003.
6 0.61032242 44 nips-2004-Conditional Random Fields for Object Recognition
7 0.6085012 48 nips-2004-Convergence and No-Regret in Multiagent Learning
8 0.60778695 207 nips-2004-ℓ₀-norm Minimization for Basis Selection
9 0.60737717 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification
10 0.60641903 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss
11 0.60633105 149 nips-2004-Probabilistic Inference of Alternative Splicing Events in Microarray Data
12 0.60614008 179 nips-2004-Surface Reconstruction using Learned Shape Models
13 0.60581398 115 nips-2004-Maximum Margin Clustering
14 0.60519111 166 nips-2004-Semi-supervised Learning via Gaussian Processes
15 0.605012 77 nips-2004-Hierarchical Clustering of a Mixture Model
16 0.60497773 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms
17 0.60495347 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach
18 0.60458905 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering
19 0.60410815 82 nips-2004-Incremental Algorithms for Hierarchical Classification
20 0.60360473 2 nips-2004-A Direct Formulation for Sparse PCA Using Semidefinite Programming