nips nips2004 nips2004-202 knowledge-graph by maker-knowledge-mining

202 nips-2004-VDCBPI: an Approximate Scalable Algorithm for Large POMDPs


Source: pdf

Author: Pascal Poupart, Craig Boutilier

Abstract: Existing algorithms for discrete partially observable Markov decision processes can at best solve problems of a few thousand states due to two important sources of intractability: the curse of dimensionality and the policy space complexity. This paper describes a new algorithm (VDCBPI) that mitigates both sources of intractability by combining the Value Directed Compression (VDC) technique [13] with Bounded Policy Iteration (BPI) [14]. The scalability of VDCBPI is demonstrated on synthetic network management problems with up to 33 million states.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Existing algorithms for discrete partially observable Markov decision processes can at best solve problems of a few thousand states due to two important sources of intractability: the curse of dimensionality and the policy space complexity. [sent-5, score-0.609]

2 This paper describes a new algorithm (VDCBPI) that mitigates both sources of intractability by combining the Value Directed Compression (VDC) technique [13] with Bounded Policy Iteration (BPI) [14]. [sent-6, score-0.124]

3 The scalability of VDCBPI is demonstrated on synthetic network management problems with up to 33 million states. [sent-7, score-0.184]

4 1 Introduction Partially observable Markov decision processes (POMDPs) provide a natural and expressive framework for decision making, but their use in practice has been limited by the lack of scalable solution algorithms. [sent-8, score-0.152]

5 Two important sources of intractability plague discrete model-based POMDPs: high dimensionality of belief space, and the complexity of policy or value function (VF) space. [sent-9, score-0.742]

6 Classic solution algorithms [4, 10, 7], for example, compute value functions represented by exponentially many value vectors, each of exponential size. [sent-10, score-0.089]

7 The complexity of policy/VF space has been addressed by observing that there are often very good policies whose value functions are representable by a small number of vectors. [sent-13, score-0.133]

8 Various algorithms such as approximate vector pruning [9], point-based value iteration (PBVI) [12, 16], bounded policy iteration (BPI) [14], gradient ascent (GA) [11, 1] and stochastic local search (SLS) [3] exploit this fact to produce (often near-optimal) policies of low complexity (i. [sent-14, score-0.731]

9 Conversely, it has been observed that belief states often carry more information than necessary. [sent-18, score-0.312]

10 However, since none of these approaches address the exponential complexity of policy/VF space, they can only solve slightly larger POMDPs (up to 8250 states [15]). [sent-21, score-0.127]

11 Scalable POMDP algorithms can only be realized when both sources of intractability are tackled simultaneously. [sent-22, score-0.093]

12 While Hansen and Feng [9] implemented such an algorithm by combining approximate state abstraction with approximate vector pruning, they didn’t demonstrate the scalability of the approach on large problems. [sent-23, score-0.173]

13 In this paper, we describe how to combine value directed compression (VDC) with bounded policy iteration (BPI) and demonstrate the scalability of the resulting algorithm (VDCBPI) on synthetic network management problems of up to 33 million states. [sent-24, score-0.845]

14 Among the techniques that deal with the curse of dimensionality, VDC offers the advantage that the compressed POMDP can be directly fed into existing POMDP algorithms with no (or only slight) adjustments. [sent-25, score-0.314]

15 Among algorithms that mitigate policy space complexity, BPI distinguishes itself by its ability to avoid local optima (cf. [sent-27, score-0.332]

16 SLS) and the fact that belief state monitoring is not required (cf. [sent-29, score-0.33]

17 We propose a new simple heuristic to compute good lossy value directed compressions. [sent-32, score-0.203]

18 We also augment BPI with the ability to bias its policy search to reachable belief states. [sent-33, score-0.713]

19 As a result, BPI can often find a much smaller policy of similar quality for a given initial belief state. [sent-34, score-0.572]

20 Policies and value functions for POMDPs are typically defined over belief space B, where a belief state b is a distribution over S capturing an agent’s knowledge about the current state of the world. [sent-37, score-0.64]

21 We denote the (unnormalized) belief update mapping by T^{a,z}, where T^{a,z}_{ij} = Pr(s_j | a, s_i) Pr(z | s_j). [sent-39, score-0.24]
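
The following is a minimal numpy sketch of this unnormalized update, b' = b T^{a,z} (treating beliefs as row vectors); the function and variable names and the tiny 3-state example are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def belief_update(b, trans_a, obs_az):
    """Unnormalized belief update for action a and observation z.

    b       : (S,)   current belief over states
    trans_a : (S, S) trans_a[i, j] = Pr(s_j | a, s_i)
    obs_az  : (S,)   obs_az[j]     = Pr(z | s_j) after taking a
    Returns b @ T^{a,z}; its sum is Pr(z | b, a), and dividing by that sum
    gives the normalized posterior belief.
    """
    T_az = trans_a * obs_az[np.newaxis, :]   # T^{a,z}_{ij} = Pr(s_j|a,s_i) Pr(z|s_j)
    return b @ T_az

# Tiny made-up example.
b = np.array([0.5, 0.3, 0.2])
trans_a = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.2, 0.2, 0.6]])
obs_az = np.array([0.9, 0.5, 0.1])
b_next = belief_update(b, trans_a, obs_az)
print(b_next / b_next.sum())                 # normalized posterior
```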

22 A factored POMDP, with exponentially many states, thus gives rise to a belief space of exponential dimensionality. [sent-40, score-0.338]

23 Policies represented by finite state controllers (FSCs) are defined by a (possibly cyclic) directed graph π = ⟨N, E⟩, where nodes n ∈ N correspond to stochastic action choices and edges e ∈ E to stochastic transitions. [sent-41, score-0.233]

24 The value function V^π of FSC π is given by: V^π(n, s) = Σ_a Pr(a|n) [ R(s, a) + γ Σ_{s'} Pr(s'|s, a) Σ_z Pr(z|s', a) Σ_{n'} Pr(n'|n, z) V^π(n', s') ] (1) The value V(n, b) of each node n is thus linear w.r.t. [sent-43, score-0.099]

25 the belief state; hence the value function of the controller is piecewise-linear and convex. [sent-45, score-0.308]
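
To make Eq. 1 concrete, here is a rough successive-approximation evaluator for a stochastic finite state controller; the array layout (P, O, R, act, succ) and the fixed iteration count are my own illustrative choices, and a faithful implementation of the paper's policy evaluation step would instead solve the linear system in Eq. 1 directly.

```python
import numpy as np

def evaluate_fsc(P, O, R, act, succ, gamma=0.95, iters=300):
    """Successive-approximation evaluation of a stochastic FSC (cf. Eq. 1).

    P    : (A, S, S)  P[a, s, s2]   = Pr(s2 | s, a)
    O    : (A, S, Z)  O[a, s2, z]   = Pr(z | s2, a)
    R    : (S, A)     immediate reward R(s, a)
    act  : (N, A)     act[n, a]     = Pr(a | n)     (action strategy of node n)
    succ : (N, Z, N)  succ[n, z, m] = Pr(m | n, z)  (observation strategy)
    Returns V with V[n, s] = value of running the controller from node n, state s.
    """
    N, S = act.shape[0], R.shape[0]
    V = np.zeros((N, S))
    for _ in range(iters):
        W = np.einsum('nzm,ms->nzs', succ, V)      # expected next-node value per (n, z, s')
        Y = np.einsum('apz,nzp->anp', O, W)        # sum out observations z
        X = np.einsum('asp,anp->ans', P, Y)        # sum out next states s'
        V = act @ R.T + gamma * np.einsum('na,ans->ns', act, X)
    return V
```

Each row V[n] is exactly the value vector whose inner product with a belief b gives the node value V(n, b) mentioned above.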

26 3 Bounded Policy Iteration We briefly review the bounded policy iteration (BPI) algorithm (see [14] for details) and describe a simple extension to bias its search toward reachable belief states. [sent-53, score-0.825]

27 BPI incrementally constructs an FSC by alternating policy improvement and policy evaluation. [sent-54, score-0.664]

28 Unlike policy iteration [7], this is done by slowly increasing the number of nodes (and value vectors). [sent-55, score-0.511]

29 The policy improvement step greedily improves each node n by optimizing its action and observation strategies by solving the linear program (LP) in Table 1. [sent-56, score-0.462]

30 The policy evaluation step computes the value function of the current controller by solving Eq. [sent-58, score-0.436]

31 The algorithm monotonically improves the policy until convergence to a local optimum, at which point new nodes are introduced to escape the local optimum. [sent-60, score-0.409]

32 BPI is guaranteed to converge to a policy that is optimal at the “tangent” belief states while slowly growing the size of the controller [14]. [sent-61, score-0.682]

33 In practice, we often wish to find a policy suitable for a given initial belief state. [sent-62, score-0.572]

34 Since only a small subset of belief space is often reachable, it is generally possible to construct much smaller policies tailored to the reachable region. [sent-63, score-0.484]

35 We now describe a simple way to bias BPI’s efforts toward the reachable region. [sent-64, score-0.141]

36 Recall that the LP in Table 1 optimizes the parameters of a node to uniformly improve its value at all belief states. [sent-65, score-0.331]

37 We propose a new LP (Table 2) that weighs the improvement by the (unnormalized) discounted occupancy distribution induced by the current policy. [sent-66, score-0.075]

38 This accounts for belief states reachable for the node by aggregating them together. [sent-67, score-0.492]

39 When using the modified LP, BPI naturally tries to improve the policy at the reachable belief states before the others. [sent-69, score-0.807]

40 Since the modification ensures that the value function doesn’t decrease at any belief state, focusing the efforts on reachable belief states won’t decrease policy value at other belief states. [sent-70, score-1.325]

41 Furthermore, though the policy is initially biased toward reachable states, BPI will eventually improve the policy for all belief states. [sent-71, score-1.067]
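
A minimal sketch of how the (unnormalized) discounted occupancy weights used in the modified LP could be computed, assuming the current controller induces a Markov chain over joint (node, state) pairs; the dense linear solve below is an illustrative assumption and would be replaced by an approximate or factored computation on large problems. This is not the authors' LP itself, only the weighting term.

```python
import numpy as np

def discounted_occupancy(P_pi, mu0, gamma=0.95):
    """Unnormalized discounted occupancy of the chain induced by the controller.

    P_pi : (K, K) transition matrix over joint (node, state) pairs,
           P_pi[x, y] = Pr(y | x) under the current policy.
    mu0  : (K,)   initial distribution (initial belief paired with the start node).
    Returns occ with occ[x] = sum_t gamma^t Pr(X_t = x), i.e. mu0'(I - gamma P_pi)^{-1}.
    """
    K = P_pi.shape[0]
    return np.linalg.solve((np.eye(K) - gamma * P_pi).T, mu0)
```

These weights can then multiply the per-(node, state) improvement terms in the LP objective of Table 2, so that improvements at reachable belief states dominate.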

42 Figure 1: Functional flow of a POMDP (dotted arrows) and a compressed POMDP (solid arrows). [sent-72, score-0.263]

43 4 Value-Directed Compression We briefly review the sufficient conditions for a lossless compression of POMDPs [13] and describe a simple new algorithm to obtain good lossy compressions. [sent-73, score-0.337]

44 Belief states constitute a sufficient statistic summarizing all information available to the decision maker (i. [sent-74, score-0.105]

45 Since belief states often contain information irrelevant to the estimation of future rewards, one can often compress belief states into some lower-dimensional representation. [sent-78, score-0.653]

46 Let f be a compression function that maps each belief state b into some lower-dimensional compressed belief state b̃ (see Figure 1). [sent-79, score-1.026]

47 We desire a compression f such that b̃ corresponds to the smallest statistic sufficient for accurately predicting the current reward r as well as the next compressed belief state b̃' (since it captures all the information in b necessary to accurately predict subsequent rewards). [sent-81, score-0.79]

48 Such a compression f exists if we can also find compressed transition dynamics T̃^{a,z} and a compressed reward function R̃ such that: R = R̃ ∘ f and f ∘ T^{a,z} = T̃^{a,z} ∘ f, ∀a ∈ A, z ∈ Z (3). [sent-82, score-0.748]

49 Given an f, R̃ and T̃^{a,z} satisfying Eq. 3, we can evaluate any policy π using the compressed POMDP dynamics to obtain Ṽ^π. [sent-83, score-0.595]

50 Since V^π = Ṽ^π ∘ f, the compressed POMDP is equivalent to the original. [sent-84, score-0.263]

51 Hence, the columns of the best linear lossless compression mapping F form a basis for the smallest invariant subspace (w.r.t. each T^{a,z}) containing R. [sent-90, score-0.328]

52 We can find the columns of F by Krylov iteration: multiplying R by each T^{a,z} until the newly generated vectors are linear combinations of previous ones. [sent-96, score-0.096]

53 The dimensionality of the compressed space is equal to the number of columns of F, which is necessarily smaller than or equal to the dimensionality of the original belief space. [sent-97, score-0.633]

54 Since linear lossless compressions are not always possible, we can extend the technique of [13] to find good lossy compressions with early stopping of the Krylov iteration. [sent-100, score-0.328]

55 Given a threshold ε or some upper bound k on the desired number of columns in F, we run Krylov iteration, retaining only the vectors with an error greater than ε, or the k vectors with largest error. [sent-111, score-0.102]
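
A rough numpy sketch of this truncated Krylov iteration follows: start from the reward vector(s), repeatedly multiply by each T^{a,z}, and keep a candidate only if it is not (nearly) a linear combination of the vectors kept so far. The residual test, the tolerance eps, the cap k_max, and the choice of multiplying by T rather than its transpose are my own assumptions for illustration (the correct orientation depends on the row/column convention for beliefs).

```python
import numpy as np

def krylov_basis(R_vecs, T_list, eps=1e-6, k_max=250):
    """Truncated Krylov iteration for a (lossy) value-directed compression basis F.

    R_vecs : list of (S,) reward vectors
    T_list : list of (S, S) matrices, one T^{a,z} per action/observation pair
    eps    : keep a candidate only if its residual w.r.t. the current span exceeds eps
    k_max  : upper bound on the number of columns of F
    Returns F as an (S, k) array; with eps -> 0 and no cap this yields a lossless basis.
    """
    basis, frontier = [], [np.asarray(r, dtype=float) for r in R_vecs]
    while frontier and len(basis) < k_max:
        v = frontier.pop(0)
        if basis:
            B = np.column_stack(basis)
            coeff, *_ = np.linalg.lstsq(B, v, rcond=None)
            v = v - B @ coeff                       # component outside the current span
        if np.linalg.norm(v) > eps:
            basis.append(v)                         # enlarge the basis ...
            frontier.extend(T @ v for T in T_list)  # ... and expand it by each T^{a,z}
    return np.column_stack(basis)
```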

56 In general no R̃ and T̃^{a,z} will satisfy Eq. 4 exactly: due to the lossy nature of the compression, the system is overconstrained. [sent-114, score-0.094]

57 But we can find suitable R̃ and T̃^{a,z} by computing a least-squares approximation, solving: F′F R̃ = F′R and F′F T̃^{a,z} = F′T^{a,z}F, ∀a ∈ A, z ∈ Z (where F′ denotes the transpose of F). While compression is required when the dimensionality of belief space is too large, unfortunately, the columns of F have the same dimensionality. [sent-115, score-0.476]
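
Given a basis F, the compressed reward and dynamics can be fit by least squares as sketched below; np.linalg.lstsq solves exactly the normal equations written above. The matrix forms of the conditions (R ≈ F R̃ and T^{a,z} F ≈ F T̃^{a,z}) and the variable names are my reading of the text, offered as an assumption rather than the authors' code.

```python
import numpy as np

def compress_model(F, R, T_list):
    """Least-squares fit of the compressed reward R~ and dynamics T~^{a,z}.

    F      : (S, k) compression basis (e.g. from Krylov iteration)
    R      : (S,) or (S, A) reward vector(s)
    T_list : list of (S, S) matrices T^{a,z}
    Minimizes ||F R~ - R|| and ||F T~^{a,z} - T^{a,z} F|| in the least-squares sense.
    """
    R_tilde = np.linalg.lstsq(F, R, rcond=None)[0]
    T_tilde_list = [np.linalg.lstsq(F, T @ F, rcond=None)[0] for T in T_list]
    return R_tilde, T_tilde_list
```

When the compression is lossless the residuals are zero and the fit recovers the exact R̃ and T̃^{a,z}.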

58 Factored POMDPs of exponential dimension can, however, admit practical Krylov iteration if carried out using a compact representation (e.g. [sent-116, score-0.125]

59 5 Bounded Policy Iteration with Value-Directed Compression In principle, any POMDP algorithm can be used to solve the compressed POMDPs produced by VDC. [sent-119, score-0.289]

60 If the compression is lossless and the POMDP algorithm exact, the computed policy will be optimal for the original POMDP. [sent-120, score-0.575]

61 In practice, POMDP algorithms are usually approximate and lossless compressions are not always possible, so care must be taken to ensure numerical stability and a policy of high quality for the original POMDP. [sent-121, score-0.523]

62 Since V = F Ṽ, maximizing the compressed value vector Ṽ of some node n automatically maximizes the value V of n w.r.t. the original belief space. [sent-123, score-0.362]

63 Otherwise, the optimal policy of the compressed POMDP may not be optimal for the original POMDP. [sent-127, score-0.595]

64 Fortunately, when R is nonnegative, F is guaranteed to be nonnegative by the nature of Krylov iteration. [sent-128, score-0.07]

65 If some rewards are negative, we can add a sufficiently large constant to R to make it nonnegative without changing the decision problem. [sent-129, score-0.177]

66 Since most algorithms, including BPI, compute approximately optimal policies, it is also critical to normalize the columns of F. [sent-130, score-0.139]
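
A short sketch of the two numerical safeguards just described, shifting the reward to be nonnegative and normalizing the columns of F; the particular unit max-norm scaling is an assumption (any consistent column scaling serves the same purpose), and if F is rescaled after R̃ and T̃^{a,z} have been fit, those quantities should be recomputed or rescaled to match.

```python
import numpy as np

def shift_rewards_nonnegative(R):
    """Add a constant so that R >= 0; with a fixed discount this shifts all values
    uniformly and leaves the optimal policy unchanged."""
    shift = max(0.0, -float(np.min(R)))
    return R + shift, shift

def normalize_columns(F):
    """Rescale each column of F to unit max-norm so that no compressed value
    component dominates the search purely because of its scale."""
    scale = np.abs(F).max(axis=0)
    scale[scale == 0.0] = 1.0
    return F / scale, scale
```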

67 Such a difference in sensitivity may bias the search for a good policy to an undesirable region of the belief space, or may even cause the algorithm to return a policy that is far from optimal for the original POMDP despite the fact that it is ε-optimal for the compressed POMDP. [sent-133, score-1.167]

68 We note that it is “safer” to evaluate policies iteratively by successive approximation rather than solving the system in Eq. [sent-134, score-0.139]

69 In contrast, lossy compressed transition matrices T̃^{a,z} are not guaranteed to have this property. [sent-137, score-0.357]

70 It is thus safer to evaluate policies by successive approximation for lossy compressions. [sent-140, score-0.233]

71 Finally, several algorithms, including BPI, compute witness belief states to verify the dominance of a value vector. [sent-141, score-0.414]

72 Since the compressed belief space B̃ is different from the original belief space B, this must be approached with care. [sent-142, score-0.743]

73 B is a simplex corresponding to the convex hull of the state points. [sent-143, score-0.089]

74 In contrast, since each row vector of F is the compressed version of some state point, B̃ corresponds to the convex hull of the row vectors of F. [sent-144, score-0.385]

75 For instance, when verifying the dominance of a value vector, if there is a compressed witness b̃, there is always an uncompressed witness b, but not vice versa. [sent-146, score-0.437]

76 In practice, this doesn’t impact the correctness of algorithms such as policy iteration, bounded policy iteration, incremental pruning, witness algorithm, etc. [sent-148, score-0.776]

77 The bottom right graph shows the running time of BPI on compressed versions of a cycle network of 25 machines. [sent-151, score-0.35]

78 Table 3: Comparison of the best policies achieved by VDCBPI to the doNothing and heuristic policies. [sent-176, score-0.148]

79 6 Experiments We report on experiments with VDCBPI on some synthetic network management problems similar to those introduced in [5]. [sent-179, score-0.103]

80 The SA receives a reward of 1 per working machine and 2 per working server. [sent-184, score-0.069]

81 We experimented with networks of 16, 19, 22 and 25 machines organized in two configurations: cycle (a ring) and 3legs (a tree of 3 branches joined at the root). [sent-192, score-0.081]

82 Figure 2 shows the average expected reward earned by policies computed by BPI after the POMDP has been compressed by VDC. [sent-193, score-0.461]

83 Results are averaged over 500 runs of 60 steps, starting with a belief state where all machines are working. [sent-194, score-0.33]
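
For context, the sketch below shows the kind of Monte Carlo evaluation loop these averages suggest (500 runs of 60 steps each); simulate_step and select_action are hypothetical interfaces introduced only for illustration, not code or names from the paper.

```python
import numpy as np

def average_reward(simulate_step, select_action, s0, runs=500, horizon=60, seed=0):
    """Average total reward of a policy over Monte Carlo rollouts.

    simulate_step(rng, s, a) -> (s_next, z, r)  # hypothetical environment model
    select_action(history)   -> a               # hypothetical policy/controller
    """
    rng = np.random.default_rng(seed)
    totals = []
    for _ in range(runs):
        s, history, total = s0, [], 0.0
        for _ in range(horizon):
            a = select_action(history)
            s, z, r = simulate_step(rng, s, a)
            history.append((a, z))
            total += r
        totals.append(total)
    return float(np.mean(totals))
```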

84 As expected, decision quality increases as we increase the number of nodes used in BPI and basis functions used in VDC. [sent-195, score-0.159]

85 Figure 2 shows the time taken by BPI on a cycle network of 25 machines (other problems exhibit similar behavior). [sent-201, score-0.112]

86 In Table 3 we compare the value of the best policy with less than 120 nodes found by VDCBPI to two other simple policies. [sent-205, score-0.439]

87 The doNothing policy lets the network evolve without any rebooting or pinging. [sent-206, score-0.399]

88 The heuristic policy estimates at each stage the probability of failure of each machine and reboots the machine most likely to be down if its failure probability is greater than threshold p1, or pings it if greater than threshold p2. [sent-207, score-0.449]
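
A minimal sketch of this baseline heuristic as described: reboot the machine most likely to be down when its failure probability exceeds p1, otherwise ping it when the probability exceeds p2. The threshold defaults and the action encoding are placeholders, and the belief monitoring that produces fail_prob is assumed to happen elsewhere (the paper uses approximate factored monitoring).

```python
import numpy as np

def heuristic_action(fail_prob, p1=0.6, p2=0.3):
    """Choose an action from per-machine failure probabilities.

    fail_prob : (M,) estimated probability that each machine is down
    Returns ('reboot', i), ('ping', i) or ('doNothing', None).
    """
    i = int(np.argmax(fail_prob))
    if fail_prob[i] > p1:
        return ('reboot', i)
    if fail_prob[i] > p2:
        return ('ping', i)
    return ('doNothing', None)
```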

89 This heuristic policy performs very well and therefore offers a strong competitor to VDCBPI. [sent-211, score-0.405]

90 But it is possible to do better than the heuristic policy by optimizing the choice of the machine that the SA may reboot or ping. [sent-212, score-0.431]

91 With a sufficient number of nodes and basis functions, VDCBPI outperforms the heuristic policy on the 3legs networks and matches it on the cycle networks. [sent-215, score-0.559]

92 This is quite remarkable given the fact that belief states were compressed to 250 dimensions or less compared to the original dimensionality ranging from 65,536 to 33,554,432. [sent-216, score-0.622]

93 7 Conclusion We have described a new POMDP algorithm that mitigates both high belief space dimensionality and policy/VF complexity. [sent-217, score-0.318]

94 By integrating value-directed compression with bounded policy iteration, we can solve synthetic network management POMDPs of 33 million states (3 orders of magnitude larger than previously solved discrete POMDPs). [sent-218, score-0.222]

95 (Footnote) Due to the large state space, approximate monitoring was performed by factoring the joint. [sent-220, score-0.119]

96 (Footnote) The ruggedness of the graphs is mainly due to the variance in the reward samples. [sent-223, score-0.604]

97 Note that the scalability of VDCBPI is problem dependent; however, we hope that new, scalable, approximate POMDP algorithms such as VDCBPI will allow POMDPs to be used to model real-world problems, with the expectation that they can be solved effectively. [sent-224, score-0.079]

98 Beyond policy space complexity and high dimensional belief spaces, further research will be necessary to deal with exponentially large action and observation spaces. [sent-227, score-0.627]

99 Computing optimal policies for partially observable decision processes using compact representations. [sent-240, score-0.223]

100 Dynamic programming for POMDPs using a factored state representation. [sent-287, score-0.134]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('bpi', 0.378), ('policy', 0.332), ('pomdp', 0.275), ('pr', 0.272), ('pomdps', 0.266), ('compressed', 0.263), ('belief', 0.24), ('vdc', 0.198), ('vdcbpi', 0.198), ('fns', 0.162), ('compression', 0.153), ('reachable', 0.141), ('krylov', 0.126), ('rewards', 0.109), ('policies', 0.103), ('lossy', 0.094), ('lossless', 0.09), ('lp', 0.088), ('nodes', 0.077), ('states', 0.072), ('compressions', 0.072), ('witness', 0.072), ('iteration', 0.072), ('factored', 0.069), ('reward', 0.069), ('state', 0.065), ('dts', 0.063), ('intractability', 0.057), ('cycle', 0.056), ('donothing', 0.054), ('fsc', 0.054), ('pbvi', 0.054), ('poupart', 0.054), ('reboot', 0.054), ('pruning', 0.053), ('scalability', 0.05), ('basis', 0.049), ('dimensionality', 0.047), ('scalable', 0.046), ('heuristic', 0.045), ('occupancy', 0.043), ('sa', 0.042), ('management', 0.041), ('observable', 0.04), ('bounded', 0.04), ('node', 0.039), ('adds', 0.038), ('hansen', 0.038), ('controller', 0.038), ('columns', 0.036), ('lcbfs', 0.036), ('pings', 0.036), ('rebooting', 0.036), ('reboots', 0.036), ('safer', 0.036), ('sls', 0.036), ('sources', 0.036), ('solving', 0.036), ('nonnegative', 0.035), ('directed', 0.034), ('vectors', 0.033), ('unnormalized', 0.033), ('decision', 0.033), ('toronto', 0.032), ('action', 0.032), ('discounted', 0.032), ('mitigates', 0.031), ('synthetic', 0.031), ('network', 0.031), ('million', 0.031), ('value', 0.03), ('doesn', 0.03), ('vancouver', 0.03), ('actions', 0.029), ('approximate', 0.029), ('exponential', 0.029), ('compress', 0.029), ('boutilier', 0.029), ('offers', 0.028), ('planning', 0.027), ('won', 0.027), ('combinations', 0.027), ('solve', 0.026), ('expected', 0.026), ('monitoring', 0.025), ('controllers', 0.025), ('seattle', 0.025), ('machines', 0.025), ('table', 0.024), ('compact', 0.024), ('hull', 0.024), ('associates', 0.024), ('observation', 0.023), ('existing', 0.023), ('guestrin', 0.023), ('partially', 0.023), ('improve', 0.022), ('executing', 0.022), ('ga', 0.022), ('littman', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 202 nips-2004-VDCBPI: an Approximate Scalable Algorithm for Large POMDPs

Author: Pascal Poupart, Craig Boutilier

Abstract: Existing algorithms for discrete partially observable Markov decision processes can at best solve problems of a few thousand states due to two important sources of intractability: the curse of dimensionality and the policy space complexity. This paper describes a new algorithm (VDCBPI) that mitigates both sources of intractability by combining the Value Directed Compression (VDC) technique [13] with Bounded Policy Iteration (BPI) [14]. The scalability of VDCBPI is demonstrated on synthetic network management problems with up to 33 million states.

2 0.23390251 37 nips-2004-Co-Training and Expansion: Towards Bridging Theory and Practice

Author: Maria-florina Balcan, Avrim Blum, Ke Yang

Abstract: Co-training is a method for combining labeled and unlabeled data when examples can be thought of as containing two distinct sets of features. It has had a number of practical successes, yet previous theoretical analyses have needed very strong assumptions on the data that are unlikely to be satisfied in practice. In this paper, we propose a much weaker “expansion” assumption on the underlying data distribution, that we prove is sufficient for iterative cotraining to succeed given appropriately strong PAC-learning algorithms on each feature set, and that to some extent is necessary as well. This expansion assumption in fact motivates the iterative nature of the original co-training algorithm, unlike stronger assumptions (such as independence given the label) that allow a simpler one-shot co-training to succeed. We also heuristically analyze the effect on performance of noise in the data. Predicted behavior is qualitatively matched in synthetic experiments on expander graphs. 1

3 0.18508604 39 nips-2004-Coarticulation in Markov Decision Processes

Author: Khashayar Rohanimanesh, Robert Platt, Sridhar Mahadevan, Roderic Grupen

Abstract: We investigate an approach for simultaneously committing to multiple activities, each modeled as a temporally extended action in a semi-Markov decision process (SMDP). For each activity we define a set of admissible solutions consisting of the redundant set of optimal policies, and those policies that ascend the optimal statevalue function associated with them. A plan is then generated by merging them in such a way that the solutions to the subordinate activities are realized in the set of admissible solutions satisfying the superior activities. We present our theoretical results and empirically evaluate our approach in a simulated domain. 1

4 0.174992 64 nips-2004-Experts in a Markov Decision Process

Author: Eyal Even-dar, Sham M. Kakade, Yishay Mansour

Abstract: We consider an MDP setting in which the reward function is allowed to change during each time step of play (possibly in an adversarial manner), yet the dynamics remain fixed. Similar to the experts setting, we address the question of how well can an agent do when compared to the reward achieved under the best stationary policy over time. We provide efficient algorithms, which have regret bounds with no dependence on the size of state space. Instead, these bounds depend only on a certain horizon time of the process and logarithmically on the number of actions. We also show that in the case that the dynamics change over time, the problem becomes computationally hard. 1

5 0.15799046 1 nips-2004-A Cost-Shaping LP for Bellman Error Minimization with Performance Guarantees

Author: Daniela D. Farias, Benjamin V. Roy

Abstract: We introduce a new algorithm based on linear programming that approximates the differential value function of an average-cost Markov decision process via a linear combination of pre-selected basis functions. The algorithm carries out a form of cost shaping and minimizes a version of Bellman error. We establish an error bound that scales gracefully with the number of states without imposing the (strong) Lyapunov condition required by its counterpart in [6]. We propose a path-following method that automates selection of important algorithm parameters which represent counterparts to the “state-relevance weights” studied in [6]. 1

6 0.1563922 147 nips-2004-Planning for Markov Decision Processes with Sparse Stochasticity

7 0.10152087 154 nips-2004-Resolving Perceptual Aliasing In The Presence Of Noisy Sensors

8 0.095775507 24 nips-2004-Approximately Efficient Online Mechanism Design

9 0.082770698 123 nips-2004-Multi-agent Cooperation in Diverse Population Games

10 0.078870609 65 nips-2004-Exploration-Exploitation Tradeoffs for Experts Algorithms in Reactive Environments

11 0.078622438 102 nips-2004-Learning first-order Markov models for control

12 0.071138166 88 nips-2004-Intrinsically Motivated Reinforcement Learning

13 0.060325213 159 nips-2004-Schema Learning: Experience-Based Construction of Predictive Action Models

14 0.053621542 116 nips-2004-Message Errors in Belief Propagation

15 0.050933555 203 nips-2004-Validity Estimates for Loopy Belief Propagation on Binary Real-world Networks

16 0.050439075 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling

17 0.049798694 162 nips-2004-Semi-Markov Conditional Random Fields for Information Extraction

18 0.049147822 76 nips-2004-Hierarchical Bayesian Inference in Networks of Spiking Neurons

19 0.048112832 124 nips-2004-Multiple Alignment of Continuous Time Series

20 0.047639225 108 nips-2004-Markov Networks for Detecting Overalpping Elements in Sequence Data


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.146), (1, -0.015), (2, 0.248), (3, -0.07), (4, -0.161), (5, 0.22), (6, -0.035), (7, 0.017), (8, -0.105), (9, -0.077), (10, 0.068), (11, -0.069), (12, 0.007), (13, 0.129), (14, -0.014), (15, -0.018), (16, 0.0), (17, 0.077), (18, 0.016), (19, 0.023), (20, -0.019), (21, 0.125), (22, -0.118), (23, 0.077), (24, -0.146), (25, 0.095), (26, -0.032), (27, -0.007), (28, 0.068), (29, 0.248), (30, 0.18), (31, -0.119), (32, -0.063), (33, -0.073), (34, -0.183), (35, 0.113), (36, -0.02), (37, 0.002), (38, 0.006), (39, 0.042), (40, 0.108), (41, -0.005), (42, 0.055), (43, -0.042), (44, -0.062), (45, 0.071), (46, -0.043), (47, 0.05), (48, -0.021), (49, -0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97448283 202 nips-2004-VDCBPI: an Approximate Scalable Algorithm for Large POMDPs

Author: Pascal Poupart, Craig Boutilier

Abstract: Existing algorithms for discrete partially observable Markov decision processes can at best solve problems of a few thousand states due to two important sources of intractability: the curse of dimensionality and the policy space complexity. This paper describes a new algorithm (VDCBPI) that mitigates both sources of intractability by combining the Value Directed Compression (VDC) technique [13] with Bounded Policy Iteration (BPI) [14]. The scalability of VDCBPI is demonstrated on synthetic network management problems with up to 33 million states.

2 0.73327994 147 nips-2004-Planning for Markov Decision Processes with Sparse Stochasticity

Author: Maxim Likhachev, Sebastian Thrun, Geoffrey J. Gordon

Abstract: Planning algorithms designed for deterministic worlds, such as A* search, usually run much faster than algorithms designed for worlds with uncertain action outcomes, such as value iteration. Real-world planning problems often exhibit uncertainty, which forces us to use the slower algorithms to solve them. Many real-world planning problems exhibit sparse uncertainty: there are long sequences of deterministic actions which accomplish tasks like moving sensor platforms into place, interspersed with a small number of sensing actions which have uncertain outcomes. In this paper we describe a new planning algorithm, called MCP (short for MDP Compression Planning), which combines A* search with value iteration for solving Stochastic Shortest Path problem in MDPs with sparse stochasticity. We present experiments which show that MCP can run substantially faster than competing planners in domains with sparse uncertainty; these experiments are based on a simulation of a ground robot cooperating with a helicopter to fill in a partial map and move to a goal location. In deterministic planning problems, optimal paths are acyclic: no state is visited more than once. Because of this property, algorithms like A* search can guarantee that they visit each state in the state space no more than once. By visiting the states in an appropriate order, it is possible to ensure that we know the exact value of all of a state’s possible successors before we visit that state; so, the first time we visit a state we can compute its correct value. By contrast, if actions have uncertain outcomes, optimal paths may contain cycles: some states will be visited two or more times with positive probability. Because of these cycles, there is no way to order states so that we determine the values of a state’s successors before we visit the state itself. Instead, the only way to compute state values is to solve a set of simultaneous equations. In problems with sparse stochasticity, only a small fraction of all states have uncertain outcomes. It is these few states that cause all of the cycles: while a deterministic state s may participate in a cycle, the only way it can do so is if one of its successors has an action with a stochastic outcome (and only if this stochastic action can lead to a predecessor of s). In such problems, we would like to build a smaller MDP which contains only states which are related to stochastic actions. We will call such an MDP a compressed MDP, and we will call its states distinguished states. We could then run fast algorithms like A* search to plan paths between distinguished states, and reserve slower algorithms like value iteration for deciding how to deal with stochastic outcomes. (a) Segbot (d) Planning map (b) Robotic helicopter (e) Execution simulation (c) 3D Map Figure 1: Robot-Helicopter Coordination There are two problems with such a strategy. First, there can be a large number of states which are related to stochastic actions, and so it may be impractical to enumerate all of them and make them all distinguished states; we would prefer instead to distinguish only states which are likely to be encountered while executing some policy which we are considering. Second, there can be a large number of ways to get from one distinguished state to another: edges in the compressed MDP correspond to sequences of actions in the original MDP. 
If we knew the values of all of the distinguished states exactly, then we could use A* search to generate optimal paths between them, but since we do not we cannot. In this paper, we will describe an algorithm which incrementally builds a compressed MDP using a sequence of deterministic searches. It adds states and edges to the compressed MDP only by encountering them along trajectories; so, it never adds irrelevant states or edges to the compressed MDP. Trajectories are generated by deterministic search, and so undistinguished states are treated only with fast algorithms. Bellman errors in the values for distinguished states show us where to try additional trajectories, and help us build the relevant parts of the compressed MDP as quickly as possible. 1 Robot-Helicopter Coordination Problem The motivation for our research was the problem of coordinating a ground robot and a helicopter. The ground robot needs to plan a path from its current location to a goal, but has only partial knowledge of the surrounding terrain. The helicopter can aid the ground robot by flying to and sensing places in the map. Figure 1(a) shows our ground robot, a converted Segway with a SICK laser rangefinder. Figure 1(b) shows the helicopter, also with a SICK. Figure 1(c) shows a 3D map of the environment in which the robot operates. The 3D map is post-processed to produce a discretized 2D environment (Figure 1(d)). Several places in the map are unknown, either because the robot has not visited them or because their status may have changed (e.g, a car may occupy a driveway). Such places are shown in Figure 1(d) as white squares. The elevation of each white square is proportional to the probability that there is an obstacle there; we assume independence between unknown squares. The robot must take the unknown locations into account when planning for its route. It may plan a path through these locations, but it risks having to turn back if its way is blocked. Alternately, the robot can ask the helicopter to fly to any of these places and sense them. We assign a cost to running the robot, and a somewhat higher cost to running the helicopter. The planning task is to minimize the expected overall cost of running the robot and the helicopter while getting the robot to its destination and the helicopter back to its home base. Figure 1(e) shows a snapshot of the robot and helicopter executing a policy. Designing a good policy for the robot and helicopter is a POMDP planning problem; unfortunately POMDPs are in general difficult to solve (PSPACE-complete [7]). In the POMDP representation, a state is the position of the robot, the current location of the helicopter (a point on a line segment from one of the unknown places to another unknown place or the home base), and the true status of each unknown location. The positions of the robot and the helicopter are observable, so that the only hidden variables are whether each unknown place is occupied. The number of states is (# of robot locations)×(# of helicopter locations)×2# of unknown places . So, the number of states is exponential in the number of unknown places and therefore quickly becomes very large. We approach the problem by planning in the belief state space, that is, the space of probability distributions over world states. This problem is a continuous-state MDP; in this belief MDP, our state consists of the ground robot’s location, the helicopter’s location, and a probability of occupancy for each unknown location. 
We will discretize the continuous probability variables by breaking the interval [0, 1] into several chunks; so, the number of belief states is exponential in the number of unknown places, and classical algorithms such as value iteration are infeasible even on small problems. If sensors are perfect, this domain is acyclic: after we sense a square we know its true state forever after. On the other hand, imperfect sensors can lead to cycles: new sensor data can contradict older sensor data and lead to increased uncertainty. With or without sensor noise, our belief state MDP differs from general MDPs because its stochastic transitions are sparse: large portions of the policy (while the robot and helicopter are traveling between unknown locations) are deterministic. The algorithm we propose in this paper takes advantage of this property of the problem, as we explain in the next section. 2 The Algorithm Our algorithm can be broken into two levels. At a high level, it constructs a compressed MDP, denoted M c , which contains only the start, the goal, and some states which are outcomes of stochastic actions. At a lower level, it repeatedly runs deterministic searches to find new information to put into M c . This information includes newly-discovered stochastic actions and their outcomes; better deterministic paths from one place to another; and more accurate value estimates similar to Bellman backups. The deterministic searches can use an admissible heuristic h to focus their effort, so we can often avoid putting many irrelevant actions into M c . Because M c will often be much smaller than M , we can afford to run stochastic planning algorithms like value iteration on it. On the other hand, the information we get by planning in M c will improve the heuristic values that we use in our deterministic searches; so, the deterministic searches will tend to visit only relevant portions of the state space. 2.1 Constructing and Solving a Compressed MDP Each action in the compressed MDP represents several consecutive actions in M : if we see a sequence of states and actions s1 , a1 , s2 , a2 , . . . , sk , ak where a1 through ak−1 are deterministic but ak is stochastic, then we can represent it in M c with a single action a, available at s1 , whose outcome distribution is P (s | sk , ak ) and whose cost is k−1 c(si , ai , si+1 ) + c(sk , ak , s ) c(s1 , a, s ) = i=1 (See Figure 2(a) for an example of such a compressed action.) In addition, if we see a sequence of deterministic actions ending in sgoal , say s1 , a1 , s2 , a2 , . . . , sk , ak , sk+1 = sgoal , k we can define a compressed action which goes from s1 to sgoal at cost i=1 c(si , ai , si+1 ). We can label each compressed action that starts at s with (s, s , a) (where a = null if s = sgoal ). Among all compressed actions starting at s and ending at (s , a) there is (at least) one with lowest expected cost; we will call such an action an optimal compression of (s, s , a). Write Astoch for the set of all pairs (s, a) such that action a when taken from state s has more than one possible outcome, and include as well (sgoal , null). Write Sstoch for the states which are possible outcomes of the actions in Astoch , and include sstart as well. If we include in our compressed MDP an optimal compression of (s, s , a) for every s ∈ Sstoch and every (s , a) ∈ Astoch , the result is what we call the full compressed MDP; an example is in Figure 2(b). 
If we solve the full compressed MDP, the value of each state will be the same as the value of the corresponding state in M . However, we do not need to do that much work: (a) action compression (b) full MDP compression (c) incremental MDP compression Figure 2: MDP compression Main() 01 initialize M c with sstart and sgoal and set their v-values to 0; 02 while (∃s ∈ M c s.t. RHS(s) − v(s) > δ and s belongs to the current greedy policy) 03 select spivot to be any such state s; 04 [v; vlim ] = Search(spivot ); 05 v(spivot ) = v; 06 set the cost c(spivot , a, sgoal ) of the limit action a from spivot to vlim ; ¯ ¯ 07 optionally run some algorithm satisfying req. A for a bounded amount of time to improve the value function in M c ; Figure 3: MCP main loop many states and actions in the full compressed MDP are irrelevant since they do not appear in the optimal policy from sstart to sgoal . So, the goal of the MCP algorithm will be to construct only the relevant part of the compressed MDP by building M c incrementally. Figure 2(c) shows the incremental construction of a compressed MDP which contains all of the stochastic states and actions along an optimal policy in M . The pseudocode for MCP is given in Figure 3. It begins by initializing M c to contain only sstart and sgoal , and it sets v(sstart ) = v(sgoal ) = 0. It maintains the invariant that 0 ≤ v(s) ≤ v ∗ (s) for all s. On each iteration, MCP looks at the Bellman error of each of the states in M c . The Bellman error is v(s) − RHS(s), where RHS(s) = min RHS(s, a) a∈A(s) RHS(s, a) = Es ∈succ(s,a) (c(s, a, s ) + v(s )) By convention the min of an empty set is ∞, so an s which does not have any compressed actions yet is considered to have infinite RHS. MCP selects a state with negative Bellman error, spivot , and starts a search at that state. (We note that there exist many possible ways to select spivot ; for example, we can choose the state with the largest negative Bellman error, or the largest error when weighted by state visitation probabilities in the best policy in M c .) The goal of this search is to find a new compressed action a such that its RHS-value can provide a new lower bound on v ∗ (spivot ). This action can either decrease the current RHS(spivot ) (if a seems to be a better action in terms of the current v-values of action outcomes) or prove that the current RHS(spivot ) is valid. Since v(s ) ≤ v ∗ (s ), one way to guarantee that RHS(spivot , a) ≤ v ∗ (spivot ) is to compute an optimal compression of (spivot , s, a) for all s, a, then choose the one with the smallest RHS. A more sophisticated strategy is to use an A∗ search with appropriate safeguards to make sure we never overestimate the value of a stochastic action. MCP, however, uses a modified A∗ search which we will describe in the next section. As the search finds new compressed actions, it adds them and their outcomes to M c . It is allowed to initialize newly-added states to any admissible values. When the search returns, MCP sets v(spivot ) to the returned value. This value is at least as large as RHS(spivot ). Consequently, Bellman error for spivot becomes non-negative. In addition to the compressed action and the updated value, the search algorithm returns a “limit value” vlim (spivot ). These limit values allow MCP to run a standard MDP planning algorithm on M c to improve its v(s) estimates. 
MCP can use any planning algorithm which guarantees that, for any s, it will not lower v(s) and will not increase v(s) beyond the smaller of vlim (s) and RHS(s) (Requirement A). For example, we could insert a fake “limit action” into M c which goes directly from spivot to sgoal at cost vlim (spivot ) (as we do on line 06 in Figure 3), then run value iteration for a fixed amount of time, selecting for each backup a state with negative Bellman error. After updating M c from the result of the search and any optional planning, MCP begins again by looking for another state with a negative Bellman error. It repeats this process until there are no negative Bellman errors larger than δ. For small enough δ, this property guarantees that we will be able to find a good policy (see section 2.3). 2.2 Searching the MDP Efficiently The top level algorithm (Figure 3) repeatedly invokes a search method for finding trajectories from spivot to sgoal . In order for the overall algorithm to work correctly, there are several properties that the search must satisfy. First, the estimate v that search returns for the expected cost of spivot should always be admissible. That is, 0 ≤ v ≤ v ∗ (spivot ) (Property 1). Second, the estimate v should be no less than the one-step lookahead value of spivot in M c . That is, v ≥ RHS(spivot ) (Property 2). This property ensures that search either increases the value of spivot or finds additional (or improved) compressed actions. The third and final property is for the vlim value, and it is only important if MCP uses its optional planning step (line 07). The property is that v ≤ vlim ≤ v ∗ (spivot ) (Property 3). Here v ∗ (spivot ) denotes the minimum expected cost of starting at spivot , picking a compressed action not in M c , and acting optimally from then on. (Note that v ∗ can be larger than v ∗ if the optimal compressed action is already part of M c .) Property 3 uses v ∗ rather than v ∗ since the latter is not known while it is possible to compute a lower bound on the former efficiently (see below). One could adapt A* search to satisfy at least Properties 1 and 2 by assuming that we can control the outcome of stochastic actions. However, this sort of search is highly optimistic and can bias the search towards improbable trajectories. Also, it can only use heuristics which are even more optimistic than it is: that is, h must be admissible with respect to the optimistic assumption of controlled outcomes. We therefore present a version of A*, called MCP-search (Figure 4), that is more efficient for our purposes. MCP-search finds the correct expected value for the first stochastic action it encounters on any given trajectory, and is therefore far less optimistic. And, MCP-search only requires heuristic values to be admissible with respect to v ∗ values, h(s) ≤ v ∗ (s). Finally, MCP-search speeds up repetitive searches by improving heuristic values based on previous searches. A* maintains a priority queue, OPEN, of states which it plans to expand. The OPEN queue is sorted by f (s) = g(s)+h(s), so that A* always expands next a state which appears to be on the shortest path from start to goal. During each expansion a state s is removed from OPEN and all the g-values of s’s successors are updated; if g(s ) is decreased for some state s , A* inserts s into OPEN. A* terminates as soon as the goal state is expanded. We use the variant of A* with pathmax [5] to use efficiently heuristics that do not satisfy the triangle inequality. 
MCP is similar to A∗ , but the OPEN list can also contain state-action pairs {s, a} where a is a stochastic action (line 31). Plain states are represented in OPEN as {s, null}. Just ImproveHeuristic(s) 01 if s ∈ M c then h(s) = max(h(s), v(s)); 02 improve heuristic h(s) further if possible using f best and g(s) from previous iterations; procedure fvalue({s, a}) 03 if s = null return ∞; 04 else if a = null return g(s) + h(s); 05 else return g(s) + max(h(s), Es ∈Succ(s,a) {c(s, a, s ) + h(s )}); CheckInitialize(s) 06 if s was accessed last in some previous search iteration 07 ImproveHeuristic(s); 08 if s was not yet initialized in the current search iteration 09 g(s) = ∞; InsertUpdateCompAction(spivot , s, a) 10 reconstruct the path from spivot to s; 11 insert compressed action (spivot , s, a) into A(spivot ) (or update the cost if a cheaper path was found) 12 for each outcome u of a that was not in M c previously 13 set v(u) to h(u) or any other value less than or equal to v ∗ (u); 14 set the cost c(u, a, sgoal ) of the limit action a from u to v(u); ¯ ¯ procedure Search(spivot ) 15 CheckInitialize(sgoal ), CheckInitialize(spivot ); 16 g(spivot ) = 0; 17 OPEN = {{spivot , null}}; 18 {sbest , abest } = {null, null}, f best = ∞; 19 while(g(sgoal ) > min{s,a}∈OPEN (fvalue({s, a})) AND f best + θ > min{s,a}∈OPEN (fvalue({s, a}))) 20 remove {s, a} with the smallest fvalue({s, a}) from OPEN breaking ties towards the pairs with a = null; 21 if a = null //expand state s 22 for each s ∈ Succ(s) 23 CheckInitialize(s ); 24 for each deterministic a ∈ A(s) 25 s = Succ(s, a ); 26 h(s ) = max(h(s ), h(s) − c(s, a , s )); 27 if g(s ) > g(s) + c(s, a , s ) 28 g(s ) = g(s) + c(s, a , s ); 29 insert/update {s , null} into OPEN with fvalue({s , null}); 30 for each stochastic a ∈ A(s) 31 insert/update {s, a } into OPEN with fvalue({s, a }); 32 else //encode stochastic action a from state s as a compressed action from spivot 33 InsertUpdateCompAction(spivot , s, a); 34 if f best > fvalue({s, a}) then {sbest , abest } = {s, a}, f best = fvalue({s, a}); 35 if (g(sgoal ) ≤ min{s,a}∈OPEN (fvalue({s, a})) AND OPEN = ∅) 36 reconstruct the path from spivot to sgoal ; 37 update/insert into A(spivot ) a deterministic action a leading to sgoal ; 38 if f best ≥ g(sgoal ) then {sbest , abest } = {sgoal , null}, f best = g(sgoal ); 39 return [f best; min{s,a}∈OPEN (fvalue({s, a}))]; Figure 4: MCP-search Algorithm like A*, MCP-search expands elements in the order of increasing f -values, but it breaks ties towards elements encoding plain states (line 20). The f -value of {s, a} is defined as g(s) + max[h(s), Es ∈Succ(s,a) (c(s, a, s ) + h(s ))] (line 05). This f -value is a lower bound on the cost of a policy that goes from sstart to sgoal by first executing a series of deterministic actions until action a is executed from state s. This bound is as tight as possible given our heuristic values. State expansion (lines 21-31) is very similar to A∗ . When the search removes from OPEN a state-action pair {s, a} with a = null, it adds a compressed action to M c (line 33). It also adds a compressed action if there is an optimal deterministic path to sgoal (line 37). f best tracks the minimum f -value of all the compressed actions found. As a result, f best ≤ v ∗ (spivot ) and is used as a new estimate for v(spivot ). The limit value vlim (spivot ) is obtained by continuing the search until the minimum f -value of elements in OPEN approaches f best + θ for some θ ≥ 0 (line 19). 
This minimum f -value then provides a lower bound on v ∗ (spivot ). To speed up repetitive searches, MCP-search improves the heuristic of every state that it encounters for the first time in the current search iteration (lines 01 and 02). Line 01 uses the fact that v(s) from M c is a lower bound on v ∗ (s). Line 02 uses the fact that f best − g(s) is a lower bound on v ∗ (s) at the end of each previous call to Search; for more details see [4]. 2.3 Theoretical Properties of the Algorithm We now present several theorems about our algorithm. The proofs of these and other theorems can be found in [4]. The first theorem states the main properties of MCP-search. Theorem 1 The search function terminates and the following holds for the values it returns: (a) if sbest = null then v ∗ (spivot ) ≥ f best ≥ E{c(spivot , abest , s ) + v(s )} (b) if sbest = null then v ∗ (spivot ) = f best = ∞ (c) f best ≤ min{s,a}∈OPEN (fvalue({s, a})) ≤ v ∗ (spivot ). If neither sgoal nor any state-action pairs were expanded, then sbest = null and (b) says that there is no policy from spivot that has a finite expected cost. Using the above theorem it is easy to show that MCP-search satisfies Properties 1, 2 and 3, considering that f best is returned as variable v and min{s,a}∈OPEN (fvalue({s, a})) is returned as variable vlim in the main loop of the MCP algorithm (Figure 3). Property 1 follows directly from (a) and (b) and the fact that costs are strictly positive and v-values are non-negative. Property 2 also follows trivially from (a) and (b). Property 3 follows from (c). Given these properties c the next theorem states the correctness of the outer MCP algorithm (in the theorem πgreedy denotes a greedy policy that always chooses an action that looks best based on its cost and the v-values of its immediate successors). Theorem 2 Given a deterministic search algorithm which satisfies Properties 1–3, the c MCP algorithm will terminate. Upon termination, for every state s ∈ M c ∩ πgreedy we ∗ have RHS(s) − δ ≤ v(s) ≤ v (s). Given the above theorem one can show that for 0 ≤ δ < cmin (where cmin is the c smallest expected action cost in our MDP) the expected cost of executing πgreedy from cmin ∗ sstart is at most cmin −δ v (sstart ). Picking δ ≥ cmin is not guaranteed to result in a proper policy, even though Theorem 2 continues to hold. 3 Experimental Study We have evaluated the MCP algorithm on the robot-helicopter coordination problem described in section 1. To obtain an admissible heuristic, we first compute a value function for every possible configuration of obstacles. Then we weight the value functions by the probabilities of their obstacle configurations, sum them, and add the cost of moving the helicopter back to its base if it is not already there. This procedure results in optimistic cost estimates because it pretends that the robot will find out the obstacle locations immediately instead of having to wait to observe them. The results of our experiments are shown in Figure 5. We have compared MCP against three algorithms: RTDP [1], LAO* [2] and value iteration on reachable states (VI). RTDP can cope with large size MDPs by focussing its planning efforts along simulated execution trajectories. LAO* uses heuristics to prune away irrelevant states, then repeatedly performs dynamic programming on the states in its current partial policy. We have implemented LAO* so that it reduces to AO* [6] when environments are acyclic (e.g., the robot-helicopter problem with perfect sensing). 
VI was only able to run on the problems with perfect sensing since the number of reachable states was too large for the others. The results support the claim that MCP can solve large problems with sparse stochasticity. For the problem with perfect sensing, on average MCP was able to plan 9.5 times faster than LAO*, 7.5 times faster than RTDP, and 8.5 times faster than VI. On average for these problems, MCP computed values for 58633 states while M c grew to 396 states, and MCP encountered 3740 stochastic transitions (to give a sense of the degree of stochasticity). The main cost of MCP was in its deterministic search subroutine; this fact suggests that we might benefit from anytime search techniques such as ARA* [3]. The results for the problems with imperfect sensing show that, as the number and density of uncertain outcomes increases, the advantage of MCP decreases. For these problems MCP was able to solve environments 10.2 times faster than LAO* but only 2.2 times faster than RTDP. On average MCP computed values for 127,442 states, while the size of M c was 3,713 states, and 24,052 stochastic transitions were encountered. Figure 5: Experimental results. The top row: the robot-helicopter coordination problem with perfect sensors. The bottom row: the robot-helicopter coordination problem with sensor noise. Left column: running times (in secs) for each algorithm grouped by environments. Middle column: the number of backups for each algorithm grouped by environments. Right column: an estimate of the expected cost of an optimal policy (v(sstart )) vs. running time (in secs) for experiment (k) in the top row and experiment (e) in the bottom row. Algorithms in the bar plots (left to right): MCP, LAO*, RTDP and VI (VI is only shown for problems with perfect sensing). The characteristics of the environments are given in the second and third rows under each of the bar plot. The second row indicates how many cells the 2D plane is discretized into, and the third row indicates the number of initially unknown cells in the environment. 4 Discussion The MCP algorithm incrementally builds a compressed MDP using a sequence of deterministic searches. Our experimental results suggest that MCP is advantageous for problems with sparse stochasticity. In particular, MCP has allowed us to scale to larger environments than were otherwise possible for the robot-helicopter coordination problem. Acknowledgements This research was supported by DARPA’s MARS program. All conclusions are our own. References [1] S. Bradtke A. Barto and S. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72:81–138, 1995. [2] E. Hansen and S. Zilberstein. LAO*: A heuristic search algorithm that finds solutions with loops. Artificial Intelligence, 129:35–62, 2001. [3] M. Likhachev, G. Gordon, and S. Thrun. ARA*: Anytime A* with provable bounds on sub-optimality. In Advances in Neural Information Processing Systems (NIPS) 16. Cambridge, MA: MIT Press, 2003. [4] M. Likhachev, G. Gordon, and S. Thrun. MCP: Formal analysis. Technical report, Carnegie Mellon University, Pittsburgh, PA, 2004. [5] L. Mero. A heuristic search algorithm with modifiable estimate. Artificial Intelligence, 23:13–27, 1984. [6] N. Nilsson. Principles of Artificial Intelligence. Palo Alto, CA: Tioga Publishing, 1980. [7] C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of Markov decision processses. Mathematics of Operations Research, 12(3):441–450, 1987.

3 0.68290126 39 nips-2004-Coarticulation in Markov Decision Processes

Author: Khashayar Rohanimanesh, Robert Platt, Sridhar Mahadevan, Roderic Grupen

Abstract: We investigate an approach for simultaneously committing to multiple activities, each modeled as a temporally extended action in a semi-Markov decision process (SMDP). For each activity we define a set of admissible solutions consisting of the redundant set of optimal policies, and those policies that ascend the optimal statevalue function associated with them. A plan is then generated by merging them in such a way that the solutions to the subordinate activities are realized in the set of admissible solutions satisfying the superior activities. We present our theoretical results and empirically evaluate our approach in a simulated domain. 1

4 0.6633473 37 nips-2004-Co-Training and Expansion: Towards Bridging Theory and Practice

Author: Maria-florina Balcan, Avrim Blum, Ke Yang

Abstract: Co-training is a method for combining labeled and unlabeled data when examples can be thought of as containing two distinct sets of features. It has had a number of practical successes, yet previous theoretical analyses have needed very strong assumptions on the data that are unlikely to be satisfied in practice. In this paper, we propose a much weaker “expansion” assumption on the underlying data distribution, that we prove is sufficient for iterative cotraining to succeed given appropriately strong PAC-learning algorithms on each feature set, and that to some extent is necessary as well. This expansion assumption in fact motivates the iterative nature of the original co-training algorithm, unlike stronger assumptions (such as independence given the label) that allow a simpler one-shot co-training to succeed. We also heuristically analyze the effect on performance of noise in the data. Predicted behavior is qualitatively matched in synthetic experiments on expander graphs. 1

5 0.56748903 1 nips-2004-A Cost-Shaping LP for Bellman Error Minimization with Performance Guarantees

Author: Daniela D. Farias, Benjamin V. Roy

Abstract: We introduce a new algorithm based on linear programming that approximates the differential value function of an average-cost Markov decision process via a linear combination of pre-selected basis functions. The algorithm carries out a form of cost shaping and minimizes a version of Bellman error. We establish an error bound that scales gracefully with the number of states without imposing the (strong) Lyapunov condition required by its counterpart in [6]. We propose a path-following method that automates selection of important algorithm parameters which represent counterparts to the “state-relevance weights” studied in [6].
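To give a feel for the LP flavor involved, the sketch below minimizes the sup-norm Bellman residual over basis-function weights for a fixed policy in a discounted-cost MDP using scipy's linprog. The paper's algorithm instead targets the average-cost differential value function and adds cost shaping and state-relevance weights, so this is only an analogy, and all variable names are assumptions.

import numpy as np
from scipy.optimize import linprog

def min_bellman_residual_weights(P, c, Phi, gamma=0.95):
    """Choose basis weights r minimizing ||Phi r - (c + gamma * P Phi r)||_inf
    for a fixed policy with transition matrix P (n x n), cost vector c (n,),
    and basis matrix Phi (n x K). Decision variables: [r (K,), eps]."""
    n, K = Phi.shape
    M = Phi - gamma * (P @ Phi)            # residual = M @ r - c
    # |M r - c| <= eps  <=>  M r - eps <= c  and  -M r - eps <= -c
    A_ub = np.block([[ M, -np.ones((n, 1))],
                     [-M, -np.ones((n, 1))]])
    b_ub = np.concatenate([c, -c])
    obj = np.zeros(K + 1)
    obj[-1] = 1.0                          # minimize the residual bound eps
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * K + [(0, None)])
    return res.x[:K], res.x[-1]            # weights and achieved residual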

6 0.5220685 159 nips-2004-Schema Learning: Experience-Based Construction of Predictive Action Models

7 0.4704932 64 nips-2004-Experts in a Markov Decision Process

8 0.45579198 154 nips-2004-Resolving Perceptual Aliasing In The Presence Of Noisy Sensors

9 0.45285285 65 nips-2004-Exploration-Exploitation Tradeoffs for Experts Algorithms in Reactive Environments

10 0.4179832 108 nips-2004-Markov Networks for Detecting Overlapping Elements in Sequence Data

11 0.3812685 24 nips-2004-Approximately Efficient Online Mechanism Design

12 0.30762053 162 nips-2004-Semi-Markov Conditional Random Fields for Information Extraction

13 0.27770272 130 nips-2004-Newscast EM

14 0.26239049 123 nips-2004-Multi-agent Cooperation in Diverse Population Games

15 0.2389448 171 nips-2004-Solitaire: Man Versus Machine

16 0.23709705 183 nips-2004-Temporal-Difference Networks

17 0.23100796 122 nips-2004-Modelling Uncertainty in the Game of Go

18 0.22975843 185 nips-2004-The Convergence of Contrastive Divergences

19 0.22932589 88 nips-2004-Intrinsically Motivated Reinforcement Learning

20 0.22706489 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.065), (15, 0.138), (17, 0.014), (26, 0.078), (31, 0.013), (33, 0.164), (35, 0.026), (39, 0.022), (50, 0.025), (65, 0.324), (71, 0.014), (76, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.80545443 202 nips-2004-VDCBPI: an Approximate Scalable Algorithm for Large POMDPs

Author: Pascal Poupart, Craig Boutilier

Abstract: Existing algorithms for discrete partially observable Markov decision processes can at best solve problems of a few thousand states due to two important sources of intractability: the curse of dimensionality and the policy space complexity. This paper describes a new algorithm (VDCBPI) that mitigates both sources of intractability by combining the Value Directed Compression (VDC) technique [13] with Bounded Policy Iteration (BPI) [14]. The scalability of VDCBPI is demonstrated on synthetic network management problems with up to 33 million states.

2 0.65020043 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach

Author: Francis R. Bach, Michael I. Jordan

Abstract: We present an algorithm to perform blind, one-microphone speech separation. Our algorithm separates mixtures of speech without modeling individual speakers. Instead, we formulate the problem of speech separation as a problem in segmenting the spectrogram of the signal into two or more disjoint sets. We build feature sets for our segmenter using classical cues from speech psychophysics. We then combine these features into parameterized affinity matrices. We also take advantage of the fact that we can generate training examples for segmentation by artificially superposing separately-recorded signals. Thus the parameters of the affinity matrices can be tuned using recent work on learning spectral clustering [1]. This yields an adaptive, speech-specific segmentation algorithm that can successfully separate one-microphone speech mixtures.
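As context for the segmentation step, here is a plain normalized-cut style spectral clustering sketch over spectrogram cells with a single Gaussian affinity. The paper instead builds parameterized affinity matrices from speech-specific cues and learns their parameters, so the feature matrix and the sigma below are placeholders.

import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_segment(features, n_segments=2, sigma=1.0):
    """features: (n_cells, d) array, one row per time-frequency cell of the
    spectrogram. Returns one segment label per cell."""
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))                     # affinity matrix
    deg = W.sum(1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L_sym = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt   # normalized Laplacian
    _, vecs = eigh(L_sym)                                  # ascending eigenvalues
    emb = vecs[:, :n_segments]                             # smallest eigenvectors
    emb = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-12)
    return KMeans(n_clusters=n_segments, n_init=10).fit_predict(emb)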

3 0.59013891 68 nips-2004-Face Detection --- Efficient and Rank Deficient

Author: Wolf Kienzle, Matthias O. Franz, Bernhard Schölkopf, Gökhan H. Bakir

Abstract: This paper proposes a method for computing fast approximations to support vector decision functions in the field of object detection. In the present approach we are building on an existing algorithm where the set of support vectors is replaced by a smaller, so-called reduced set of synthesized input space points. In contrast to the existing method that finds the reduced set via unconstrained optimization, we impose a structural constraint on the synthetic points such that the resulting approximations can be evaluated via separable filters. For applications that require scanning large images, this decreases the computational complexity by a significant amount. Experimental results show that in face detection, rank deficient approximations are 4 to 6 times faster than unconstrained reduced set systems.
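To see why rank-deficient approximations pay off, the sketch below approximates a 2D filter by a few separable (rank-one) terms via the SVD and applies each term as two 1D convolutions, so each term costs O(h + w) per pixel instead of O(h * w). This only illustrates the separability argument; it is not the paper's constrained reduced-set optimization.

import numpy as np
from scipy.ndimage import convolve1d

def separable_approx(kernel, rank=1):
    """Decompose a 2D filter into `rank` outer-product (separable) terms."""
    U, s, Vt = np.linalg.svd(kernel)
    cols = [np.sqrt(s[i]) * U[:, i] for i in range(rank)]
    rows = [np.sqrt(s[i]) * Vt[i, :] for i in range(rank)]
    return cols, rows

def apply_separable(image, cols, rows):
    """Filter the image with each separable term: rows first, then columns."""
    out = np.zeros_like(image, dtype=float)
    for col, row in zip(cols, rows):
        tmp = convolve1d(image.astype(float), row, axis=1)
        out += convolve1d(tmp, col, axis=0)
    return out

# Rough usage on a random image with a nearly rank-one 5x5 kernel.
rng = np.random.default_rng(0)
img = rng.random((64, 64))
k = np.outer(rng.random(5), rng.random(5)) + 0.05 * rng.random((5, 5))
cols, rows = separable_approx(k, rank=2)
approx = apply_separable(img, cols, rows)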

4 0.58389485 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications

Author: Ruei-sung Lin, David A. Ross, Jongwoo Lim, Ming-Hsuan Yang

Abstract: This paper presents an adaptive discriminative generative model that generalizes the conventional Fisher Linear Discriminant algorithm and renders a proper probabilistic interpretation. Within the context of object tracking, we aim to find a discriminative generative model that best separates the target from the background. We present a computationally efficient algorithm to constantly update this discriminative model as time progresses. While most tracking algorithms operate on the premise that the object appearance or ambient lighting condition does not significantly change as time progresses, our method adapts a discriminative generative model to reflect appearance variation of the target and background, thereby facilitating the tracking task in ever-changing environments. Numerous experiments show that our method is able to learn a discriminative generative model for tracking target objects undergoing large pose and lighting changes.
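For reference, a batch Fisher Linear Discriminant between target and background samples looks roughly as follows; the paper's contribution is a probabilistic generalization with efficient incremental updates over time, which this sketch does not attempt, and the regularizer reg is an assumption added for numerical stability.

import numpy as np
from scipy.linalg import eigh

def fisher_discriminant(X_target, X_background, n_dims=1, reg=1e-6):
    """Directions maximizing between-class over within-class scatter for
    target patches X_target (n_t, d) and background patches X_background
    (n_b, d), d >= 2. Returns a (d, n_dims) projection matrix."""
    mu_t, mu_b = X_target.mean(0), X_background.mean(0)
    mu = np.concatenate([X_target, X_background]).mean(0)
    Sw = (np.cov(X_target.T, bias=True) * len(X_target)
          + np.cov(X_background.T, bias=True) * len(X_background))
    diff_t, diff_b = (mu_t - mu)[:, None], (mu_b - mu)[:, None]
    Sb = (len(X_target) * diff_t @ diff_t.T
          + len(X_background) * diff_b @ diff_b.T)
    Sw = Sw + reg * np.eye(Sw.shape[0])        # regularize for stability
    vals, vecs = eigh(Sb, Sw)                  # generalized eigenproblem
    return vecs[:, -n_dims:]                   # largest-ratio directions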

5 0.58298689 161 nips-2004-Self-Tuning Spectral Clustering

Author: Lihi Zelnik-manor, Pietro Perona

Abstract: We study a number of open issues in spectral clustering: (i) Selecting the appropriate scale of analysis, (ii) Handling multi-scale data, (iii) Clustering with irregular background clutter, and, (iv) Finding automatically the number of groups. We first propose that a ‘local’ scale should be used to compute the affinity between each pair of points. This local scaling leads to better clustering especially when the data includes multiple scales and when the clusters are placed within a cluttered background. We further suggest exploiting the structure of the eigenvectors to infer automatically the number of groups. This leads to a new algorithm in which the final randomly initialized k-means stage is eliminated.
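The local-scaling affinity proposed in the abstract can be sketched in a few lines of Python; the neighbour index k=7 and the small additive constant are assumptions made here for illustration, and the eigenvector-structure step for choosing the number of groups is not shown.

import numpy as np

def local_scaling_affinity(X, k=7):
    """A_ij = exp(-d_ij^2 / (sigma_i * sigma_j)), where sigma_i is the distance
    from point i to its k-th nearest neighbour (a per-point 'local' scale)."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    sigma = np.sort(d, axis=1)[:, k]        # k-th neighbour distance (0th is self)
    A = np.exp(-d ** 2 / (np.outer(sigma, sigma) + 1e-12))
    np.fill_diagonal(A, 0.0)
    return A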

6 0.58142996 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill

7 0.57999563 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform

8 0.57991886 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

9 0.57968301 178 nips-2004-Support Vector Classification with Input Data Uncertainty

10 0.57923406 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection

11 0.57875878 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

12 0.57853174 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

13 0.57819051 69 nips-2004-Fast Rates to Bayes for Kernel Machines

14 0.57778513 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering

15 0.57770365 187 nips-2004-The Entire Regularization Path for the Support Vector Machine

16 0.57719064 131 nips-2004-Non-Local Manifold Tangent Learning

17 0.57691538 103 nips-2004-Limits of Spectral Clustering

18 0.57576531 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

19 0.57553029 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods

20 0.57551503 172 nips-2004-Sparse Coding of Natural Images Using an Overcomplete Set of Limited Capacity Units