nips nips2007 nips2007-204 knowledge-graph by maker-knowledge-mining
Title: Theoretical Analysis of Heuristic Search Methods for Online POMDPs
Source: pdf
Author: Stephane Ross, Joelle Pineau, Brahim Chaib-draa
Abstract: Planning in partially observable environments remains a challenging problem, despite significant recent advances in offline approximation techniques. A few online methods have also been proposed recently, and proven to be remarkably scalable, but without the theoretical guarantees of their offline counterparts. Thus it seems natural to try to unify offline and online techniques, preserving the theoretical properties of the former, and exploiting the scalability of the latter. In this paper, we provide theoretical guarantees on an anytime algorithm for POMDPs which aims to reduce the error made by approximate offline value iteration algorithms through the use of an efficient online searching procedure. The algorithm uses search heuristics based on an error analysis of lookahead search, to guide the online search towards reachable beliefs with the most potential to reduce error. We provide a general theorem showing that these search heuristics are admissible, and lead to complete and ε-optimal algorithms. This is, to the best of our knowledge, the strongest theoretical result available for online POMDP solution methods. We also provide empirical evidence showing that our approach is also practical, and can find (provably) near-optimal solutions in reasonable time.
Reference: text
sentIndex sentText
1 A few online methods have also been proposed recently, and proven to be remarkably scalable, but without the theoretical guarantees of their offline counterparts.
2 Thus it seems natural to try to unify offline and online techniques, preserving the theoretical properties of the former, and exploiting the scalability of the latter.
3 In this paper, we provide theoretical guarantees on an anytime algorithm for POMDPs which aims to reduce the error made by approximate offline value iteration algorithms through the use of an efficient online searching procedure.
4 The algorithm uses search heuristics based on an error analysis of lookahead search, to guide the online search towards reachable beliefs with the most potential to reduce error.
5 We provide a general theorem showing that these search heuristics are admissible, and lead to complete and ε-optimal algorithms.
6 This is, to the best of our knowledge, the strongest theoretical result available for online POMDP solution methods.
7 Most of these methods compute a policy over the entire belief space.
8 These achieve scalability by planning only at execution time, thus allowing the agent to only consider belief states that can be reached over some (small) finite planning horizon.
9 This paper suggests otherwise, arguing that by combining offline and online techniques, we can preserve the theoretical properties of the former, while exploiting the scalability of the latter.
10 In previous work [11], we introduced an anytime algorithm for POMDPs which aims to reduce the error made by approximate offline value iteration algorithms through the use of an efficient online searching procedure.
11 The algorithm uses search heuristics based on an error analysis of lookahead search, to guide the online search towards reachable beliefs with the most potential to reduce error.
12 In this paper, we formally derive the heuristics from our error minimization point of view and provide theoretical results showing that these search heuristics are admissible, and lead to complete and ε-optimal algorithms.
13 This is, to the best of our knowledge, the strongest theoretical result available for online POMDP solution methods.
14 To deal with this uncertainty, the agent maintains a belief state b(s), which expresses the probability that the agent is in each state at a given time step.
15 After each step, the belief state b is updated using Bayes' rule.
16 We denote the belief update function b′ = τ(b, a, o), defined as b′(s′) = η O(o, a, s′) Σ_{s∈S} T(s, a, s′) b(s), where η is a normalization constant ensuring Σ_{s∈S} b′(s) = 1.
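To make the update τ(b, a, o) concrete, here is a minimal Python sketch; the representation of beliefs as dictionaries and the callables T and O are assumptions made for this illustration, not notation from the paper.

def belief_update(b, a, o, states, T, O):
    """Compute b' = tau(b, a, o) for a discrete POMDP.

    b : dict mapping each state s to its probability b(s)
    T(s, a, s_next) : transition probability, O(o, a, s_next) : observation probability
    """
    b_next = {}
    for s_next in states:
        # unnormalized: O(o, a, s') * sum_s T(s, a, s') * b(s)
        b_next[s_next] = O(o, a, s_next) * sum(T(s, a, s_next) * b[s] for s in states)
    eta = sum(b_next.values())  # normalization constant eta
    if eta == 0.0:
        raise ValueError("observation o has zero probability under belief b and action a")
    return {s: p / eta for s, p in b_next.items()}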
17 Solving a POMDP consists of finding an optimal policy, π* : ΔS → A, which specifies the best action a to take in every belief state b so as to maximize the expected return (i.e., the expected sum of discounted rewards).
18 We can find the optimal policy by computing the optimal value of a belief state over the planning horizon.
19 We also denote the value Q*(b, a) of a particular action a in belief state b as the return we will obtain if we perform a in b and then follow the optimal policy: Q*(b, a) = R(b, a) + γ Σ_{o∈Ω} P(o|b, a) V*(τ(b, a, o)).
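For reference, the companion quantities used in this recursion (and referenced again later when the algorithm recomputes its bounds) are the standard ones; they are written out below in LaTeX since the extracted text omits them.

V^*(b) = \max_{a \in A} Q^*(b,a), \qquad
R(b,a) = \sum_{s \in S} R(s,a)\, b(s), \qquad
P(o \mid b,a) = \sum_{s' \in S} O(o,a,s') \sum_{s \in S} T(s,a,s')\, b(s).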
20 While any POMDP problem has infinitely many belief states, it has been shown that the optimal value function of a finite-horizon POMDP is piecewise linear and convex.
21 Thus we can define the optimal value function and policy of a finite-horizon POMDP using a finite set of |S|-dimensional hyperplanes, called α-vectors, over the belief state space.
22 Most approximate offline value iteration algorithms achieve computational tractability by selecting a small subset of belief states, and keeping only those α-vectors which are maximal at the selected belief states [1, 3, 4].
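Such a set Γ of α-vectors can be evaluated at any belief; a small numpy sketch of the resulting estimate is given below. The particular Γ and belief are made-up numbers, and the result is a lower bound only when every α-vector is the value of some executable policy, as is the case for the offline methods cited above.

import numpy as np

def alpha_vector_bound(b, Gamma):
    """Evaluate max_{alpha in Gamma} alpha . b, the offline value estimate at belief b."""
    return float(np.max(Gamma @ b))

# Example with two states and three alpha-vectors:
Gamma = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.6, 0.6]])
b = np.array([0.3, 0.7])
print(alpha_vector_bound(b, Gamma))  # prints 0.7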
23 The precision of these algorithms depends on the number of belief points and their location in the space of beliefs.
24 3 Online Search in POMDPs. Contrary to offline approaches, which compute a complete policy determining an action for every belief state, an online algorithm takes as input the current belief state and returns the single action which is best for this particular belief state.
25 The advantage of such an approach is that it only needs to consider belief states that are reachable from the current belief state.
26 But in addition, since online planning is done at every step (and thus generalization between beliefs is not required), it is sufficient to calculate only the maximal value for the current belief state, not the full optimal α-vector.
27 A lookahead search algorithm can compute this value in two simple steps.
28 First, we build a tree of reachable belief states from the current belief state.
29 Subsequent belief states (as calculated by the τ(b, a, o) function) are represented using OR-nodes (at which we must choose an action), and actions are included in between each layer of belief nodes using AND-nodes (at which we must consider all possible observations).
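A minimal way to represent this AND-OR tree in code is sketched below; the field names are hypothetical (the paper does not prescribe a data structure), and the bound fields correspond to the L_T and U_T quantities introduced in the following paragraphs.

from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class OrNode:  # belief node: the agent chooses an action here
    belief: dict                              # state -> probability
    depth: int = 0
    lower: float = float("-inf")              # L_T(b)
    upper: float = float("inf")               # U_T(b)
    parent: Optional["AndNode"] = None
    children: Dict[object, "AndNode"] = field(default_factory=dict)  # action -> AND-node

@dataclass
class AndNode:  # action node: every possible observation is considered below it
    action: object
    lower: float = float("-inf")              # L_T(b, a)
    upper: float = float("inf")               # U_T(b, a)
    parent: Optional[OrNode] = None
    prob_obs: Dict[object, float] = field(default_factory=dict)      # o -> P(o | b, a)
    children: Dict[object, OrNode] = field(default_factory=dict)     # observation -> OR-node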
30 Note that in general the belief MDP could have a graph structure with cycles.
31 Hence, if we reach a belief that is already elsewhere in the tree, it will be duplicated.
32 Second, we estimate the value of the current belief state by propagating value estimates up from the fringe nodes, to their ancestors, all the way to the root.
33 An approximate value function is generally used at the fringe of the tree to approximate the infinite-horizon value.
34 We are particularly interested in the case where a lower bound and an upper bound on the value of the fringe belief states are available, as this allows us to get a bound on the error at any specific node.
35 Performing a complete k-step lookahead search multiplies the error bound on the approximate value function used at the fringe by γ^k [13], and thus ensures better value estimates.
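A recursive sketch of such a bounded k-step lookahead is given below; it reuses the belief_update sketch above, and the offline bound functions L_offline and U_offline (as well as the explicit model access R, T, O) are assumptions of this illustration.

def lookahead_bounds(b, k, gamma, actions, observations, states, R, T, O,
                     L_offline, U_offline):
    """Return (L, U) bounds on V*(b) from a complete k-step lookahead.

    At the fringe (k == 0) the offline bounds are used; each extra level of
    lookahead multiplies their error by gamma, as stated in the text.
    """
    if k == 0:
        return L_offline(b), U_offline(b)
    best_L, best_U = float("-inf"), float("-inf")
    for a in actions:
        r = sum(R(s, a) * b[s] for s in states)        # expected immediate reward R(b, a)
        L_a, U_a = r, r
        for o in observations:
            # P(o | b, a) = sum_{s'} O(o, a, s') sum_s T(s, a, s') b(s)
            p_o = sum(O(o, a, s2) * sum(T(s, a, s2) * b[s] for s in states)
                      for s2 in states)
            if p_o <= 0.0:
                continue
            b_next = belief_update(b, a, o, states, T, O)
            L_o, U_o = lookahead_bounds(b_next, k - 1, gamma, actions, observations,
                                        states, R, T, O, L_offline, U_offline)
            L_a += gamma * p_o * L_o
            U_a += gamma * p_o * U_o
        best_L, best_U = max(best_L, L_a), max(best_U, U_a)
    return best_L, best_U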
36 However, it has complexity exponential in k, and may explore belief states that have very small probabilities of occurring (and an equally small impact on the value function) as well as exploring suboptimal actions (which have no impact on the value function).
37 We would evidently prefer to have a more efficient online algorithm, which can guarantee equivalent or better error bounds.
38 In particular, we believe that the best way to achieve this is to have a search algorithm which uses estimates of error reduction as a criterion to guide the search over the reachable beliefs.
39 Our approach uses a best-first search of the belief reachability tree, where error minimization (at the root node) is used as the search criterion to select which fringe nodes to expand next.
40 Thus we need a way to express the error on the current belief (i.e., the root node) as a function of the error at the fringe nodes.
42 Here π*(b, a) denotes the probability that the optimal policy chooses action a in belief b.
43 By abuse of notation, we will use b to represent both a belief node in the tree and its associated belief.
44 The sum over b ∈ F(T) should be interpreted as a sum over all fringe nodes in the tree, while e(b) is the error associated with the belief in fringe node b.
45 (Theorem 1.) In any tree T, e_T(b0) ≤ Σ_{b∈F(T)} γ^{d(h_T^{b0,b})} P(h_T^{b0,b} | b0, π*) e(b).
46 Consider an arbitrary parent node b in tree T, and let us denote a_T = argmax_{a∈A} L_T(b, a).
47 Search Heuristics. From Theorem 1, we see that the contribution of each fringe node b to the error in b0 is simply the term γ^{d(h_T^{b0,b})} P(h_T^{b0,b} | b0, π*) e(b).
48 Consequently, if we want to minimize e_T(b0) as quickly as possible, we should expand fringe nodes reached by the optimal policy π* that maximize the term γ^{d(h_T^{b0,b})} P(h_T^{b0,b} | b0, π*) e(b), as they offer the greatest potential to reduce e_T(b0).
49 This suggests a sound heuristic to explore the tree in a best-first-search way.
50 We define ê(b) = U(b) − L(b) as an estimate of the error introduced by our bounds at fringe node b.
51 To approximate P(h_T^{b0,b} | b0, π*), we can view the term π*(b, a) as the probability that action a is optimal in belief b.
52 Thus, we consider an approximate policy π̂_T that represents the probability that action a is optimal in belief state b, given the bounds L_T(b, a) and U_T(b, a) that we have on Q*(b, a) in tree T.
53 One possible approximation is to simply compute the probability that the Q-value of a given action is higher than its parent belief state value (instead of all other actions' Q-values).
54 In this case, we get π̂_T(b, a) = ∫_{−∞}^{∞} f_T^{b,a}(x) F_T^{b}(x) dx, where F_T^{b} is the cumulative distribution function for V*(b), given bounds L_T(b) and U_T(b) in tree T.
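Under the assumption (made explicit later in the experiments) that both quantities are uniformly distributed between their bounds, this integral can be evaluated numerically; the sketch below uses a simple midpoint rule rather than the closed form and is purely illustrative.

def uniform_cdf(x, lo, hi):
    """CDF of the uniform distribution on [lo, hi]."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)

def action_optimality_prob(L_ba, U_ba, L_b, U_b, n=10000):
    """Estimate pi_hat_T(b, a) = P(Q*(b, a) >= V*(b)) with Q*(b, a) uniform on
    [L_ba, U_ba] and V*(b) uniform on [L_b, U_b]."""
    if U_ba <= L_b:       # action dominated by the parent's lower bound:
        return 0.0        # it is implicitly pruned, as remarked just below
    if L_ba >= U_b:
        return 1.0
    width = (U_ba - L_ba) / n
    total = 0.0
    for i in range(n):
        x = L_ba + (i + 0.5) * width       # midpoint of the i-th slice of f_T^{b,a}
        total += uniform_cdf(x, L_b, U_b)  # F_T^{b}(x)
    return total / n                        # approximates the integral of f * F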
55 Notice that if the density function is 0 outside the interval between the lower and upper bounds, then π̂_T(b, a) = 0 for dominated actions; thus they are implicitly pruned from the search tree by this method.
56 The second approximation (Equation 6) simply selects the action that maximizes the upper bound, i.e., π̂_T(b, a) = 1 if a = argmax_{a′∈A} U_T(b, a′) and 0 otherwise.
57 This restricts exploration of the search tree to those fringe nodes that are reached by a sequence of actions that maximize the upper bound of their parent belief state, as done in the AO* algorithm [14].
58 The nice property of this approximation is that these fringe nodes are the only nodes that can potentially reduce the upper bound in b0.
59 Using either of these two approximations for π̂_T, we can estimate the error contribution ê_T(b0, b) of a fringe node b on the value of the root belief b0 in tree T as ê_T(b0, b) = γ^{d(h_T^{b0,b})} P(h_T^{b0,b} | b0, π̂_T) ê(b).
60 Using this as a heuristic, the next fringe node b(T) to expand in tree T is defined as b(T) = argmax_{b∈F(T)} γ^{d(h_T^{b0,b})} P(h_T^{b0,b} | b0, π̂_T) ê(b).
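Using the node sketch given earlier, the second (upper-bound maximizing) variant of this heuristic can be written roughly as follows; this is an illustrative sketch with a naive scan over fringe nodes, not the reference implementation, which keeps best-node references as described a few sentences later.

def aems2_error_contribution(fringe_node, gamma):
    """gamma^d * P(h | b0, pi_hat_T) * e_hat(b) for a fringe OR-node, where pi_hat_T
    gives probability 1 to the action maximizing the parent's upper bound U_T(b, a)
    and 0 to every other action (the AEMS2 choice)."""
    contribution = fringe_node.upper - fringe_node.lower   # e_hat(b) = U(b) - L(b)
    node = fringe_node
    while node.parent is not None:
        and_node = node.parent            # AND-node for (b, a) above `node`
        or_parent = and_node.parent       # OR-node for the belief b
        # action term: 1 only if this action maximizes the parent's upper bound
        best_upper = max(child.upper for child in or_parent.children.values())
        if and_node.upper < best_upper:
            return 0.0
        # observation term: P(o | b, a) for the branch leading to `node`
        obs = next(o for o, child in and_node.children.items() if child is node)
        contribution *= gamma * and_node.prob_obs[obs]
        node = or_parent
    return contribution

def select_fringe_node(fringe_nodes, gamma):
    """b(T): the fringe node with the largest approximate error contribution."""
    return max(fringe_nodes, key=lambda node: aems2_error_contribution(node, gamma))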
61 We use AEMS1 to denote the heuristic that uses π̂_T as defined in Equation 5, and AEMS2 to denote the heuristic that uses π̂_T as defined in Equation 6.
62 Since the objective is to provide a near-optimal action within a finite allowed online planning time, the algorithm accepts two input parameters: t, the online search time allowed per action, and ε, the desired precision on the value function.
63 Algorithm 1 (AEMS: Anytime Error Minimization Search), Function SEARCH(t, ε). Static: T, an AND-OR tree representing the current search tree.
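The overall anytime loop can be pictured as below. Here expand, update_ancestors and fringe_nodes stand for the corresponding operations of Algorithm 1 and are passed in as callables only to keep the sketch self-contained; the real algorithm avoids the full fringe scan by storing best-node references, as explained next.

import time

def aems_search(root, t_allowed, epsilon, gamma, expand, update_ancestors, fringe_nodes):
    """Illustrative anytime loop: expand the most promising fringe node until the
    allowed time t is exhausted or the root error bound falls below epsilon."""
    start = time.monotonic()
    while (time.monotonic() - start) < t_allowed and (root.upper - root.lower) > epsilon:
        b_star = select_fringe_node(fringe_nodes(root), gamma)  # heuristic choice b(T)
        expand(b_star)
        update_ancestors(b_star)
    # one common choice: return the action whose lower bound at the root is largest
    return max(root.children, key=lambda a: root.children[a].lower)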
64 After a node is expanded, the UPDATEANCESTORS function simply recomputes the bounds of its ancestors according to the equations determining b′(s′), V*(b), P(o|b, a) and Q*(b, a), as outlined in Section 2.
65 To quickly find the node that maximizes the heuristic in the whole tree, each node in the tree contains a reference to the best node to expand in its subtree.
66 These references are updated by the UPDATEANCESTORS function without adding more complexity, such that when this function terminates, we always know immediately which node to expand next, as its reference is stored in the root node.
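A sketch of the bound recomputation performed by UPDATEANCESTORS, following the equations of Section 2; the expected immediate reward R_b(belief, action) is assumed to be available, and the best-node references mentioned above would be refreshed in the same upward pass (omitted here for brevity).

def update_ancestors(or_node, gamma, R_b):
    """Recompute L_T and U_T from `or_node` up to the root after an expansion."""
    node = or_node
    while node.parent is not None:
        and_node = node.parent       # AND-node for (b, a)
        parent = and_node.parent     # OR-node for belief b
        r = R_b(parent.belief, and_node.action)
        # Q-style backup: R(b, a) + gamma * sum_o P(o | b, a) * bound(tau(b, a, o))
        and_node.lower = r + gamma * sum(p * and_node.children[o].lower
                                         for o, p in and_node.prob_obs.items())
        and_node.upper = r + gamma * sum(p * and_node.children[o].upper
                                         for o, p in and_node.prob_obs.items())
        # V-style backup: max over the actions available at b
        parent.lower = max(child.lower for child in parent.children.values())
        parent.upper = max(child.upper for child in parent.children.values())
        node = parent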
67 After an action is executed in the environment, the tree T is updated such that our new current belief state becomes the root of T; all nodes under this new root can be reused at the next time step.
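Re-rooting the tree after acting can be as simple as the following sketch (the argument names are hypothetical):

def reuse_subtree(root, action_taken, observation_received):
    """After executing an action and receiving an observation, the matching
    grandchild OR-node becomes the new root and its whole subtree is reused."""
    and_node = root.children[action_taken]
    new_root = and_node.children[observation_received]
    new_root.parent = None   # detach from the discarded part of the old tree
    return new_root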
68 Completeness and Optimality. We now provide some sufficient conditions under which our heuristic search is guaranteed to converge to an ε-optimal policy after a finite number of expansions.
69 (Lemma 1.) In any tree T, the approximate error contribution ê_T(b0, b_d) of a belief node b_d at depth d is bounded by ê_T(b0, b_d) ≤ γ^d sup_b ê(b).
70 (Footnote 3.) This heuristic is slightly different from the AEMS1 heuristic we had introduced in [11].
71 (Footnote 4.) The set of fringe nodes reached by a sequence of actions in which each action maximizes U_T(b, a) in its respective belief state.
72 For any tree T, ε > 0, and D such that γ^D sup_b ê(b) ≤ ε, if for all b ∈ F(T), either d(h_T^{b0,b}) ≥ D or there exists an ancestor b′ of b such that ê_T(b′) ≤ ε, then e_T(b0) ≤ ε.
73 For any tree T and parent belief b ∈ T, let us denote â_T^b = argmax_{a∈A} U_T(b, a).
74 For any tree T and ε > 0, if π̂_T is defined such that inf_{b,T : ê_T(b) > ε} π̂_T(b, â_T^b) > 0, where â_T^b = argmax_{a∈A} U_T(b, a), then Algorithm 1 using b(T) is complete and ε-optimal.
75 Let us denote by A_T^b the set of ancestor belief states of b in the tree T, and given a finite set A of belief nodes, let us define ê_min(A) = min_{b∈A} ê_T(b).
76 Now let us define T_b = {T | T finite, b ∈ F(T), ê_min(A_T^b) > ε} and B = {b | ê(b) · inf_{T∈T_b} P(h_T^{b0,b} | b0, π̂_T) > 0, d(h^{b0,b}) ≤ D}.
77 Clearly, by the assumption that inf_{b,T : ê_T(b) > ε} π̂_T(b, â_T^b) > 0, B contains all belief states b within depth D such that ê(b) > 0, P(h_{T,o}^{b0,b} | b0, h_{T,a}^{b0,b}) > 0, and for which there exists a finite tree T where b ∈ F(T) and all ancestors b′ of b have ê_T(b′) > ε.
78 Furthermore, B is finite since there are only finitely many belief states within depth D.
79 Clearly, Ê_min > 0, and we know that for any tree T, all beliefs b in B ∩ F(T) have an approximate error contribution ê_T(b0, b) ≥ Ê_min.
80 Hence, by Lemma 1, this means that Algorithm 1 cannot expand any node at depth D′ or beyond before reaching a tree T where B ∩ F(T) = ∅.
81 Because there are only finitely many nodes within depth D′, it is clear that Algorithm 1 will reach such a tree T after a finite number of expansions.
82 Furthermore, for this tree T, since B ∩ F(T) = ∅, we have that for all beliefs b ∈ F(T), either d(h^{b0,b}) ≥ D or ê_min(A_T^b) ≤ ε.
83 From this last theorem, we notice that we can potentially develop many different admissible heuristics for Algorithm 1; the main sufficient condition being that π̂_T(b, a) > 0 for a = argmax_{a′∈A} U_T(b, a′).
84 5 Experiments. In this section we present a brief experimental evaluation of Algorithm 1, showing that in addition to its useful theoretical properties, the empirical performance matches, and in some cases exceeds, that of other online approaches.
85 We then compare the performance of Algorithm 1 with both heuristics (AEMS1 and AEMS2) to the performance achieved by other online approaches (Satia [7], BI-POMDP [8], RTBSS [10]).
86 Satia, BI-POMDP, AEMS1 and AEMS2 were all implemented using the same algorithm, since they differ only in their choice of search heuristic used to guide the search.
87 RTBSS served as a baseline for a complete k-step lookahead search using branch & bound pruning.
88 We can also observe that AEMS2 has the best average reuse percentage, which indicates that AEMS2 is able to guide the search toward the most probable nodes and allows it to generally maintain a higher number of belief nodes in the tree.
89 This could be explained by the fact that our assumption that the values of the actions are uniformly distributed between the lower and upper bounds is not valid in the considered environments.
90 Finally, we also examined how fast the lower and upper bounds converge if we let the algorithm run for up to 1000 seconds on the initial belief state.
91 This gives an indication of which heuristic would be best if we extended the online planning time past 1 second.
92 6 Conclusion. In this paper we examined theoretical properties of online heuristic search algorithms for POMDPs.
93 To this end, we described a general online search framework, and examined two admissible heuristics to guide the search.
94 Our experimental work supports the theoretical analysis, showing that AEMS2 is able to outperform other online approaches.
95 This highlights the fact that not all admissible heuristics are equally useful.
96 (Footnote 5.) The policy obtained by taking the combination of the |A| α-vectors, each of which represents the value of a policy performing the same action in every belief state.
97 Figure 1: Comparison of different online search algorithms in different environments.
98 Figure 2: Evolution of the upper and lower bounds on the initial belief state in RockSample[7,8] (value V(b0) versus time in seconds, log scale; curves for AEMS2, AEMS1, BI-POMDP and Satia).
99 AEMS: An anytime online search algorithm for approximate policy refinement in large POMDPs.
100 LAO*: A heuristic search algorithm that finds solutions with loops.
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs
2 0.31984788 215 nips-2007-What makes some POMDP problems easy to approximate?
Author: Wee S. Lee, Nan Rong, Daniel J. Hsu
Abstract: Point-based algorithms have been surprisingly successful in computing approximately optimal solutions for partially observable Markov decision processes (POMDPs) in high dimensional belief spaces. In this work, we seek to understand the belief-space properties that allow some POMDP problems to be approximated efficiently and thus help to explain the point-based algorithms’ success often observed in the experiments. We show that an approximately optimal POMDP solution can be computed in time polynomial in the covering number of a reachable belief space, which is the subset of the belief space reachable from a given belief point. We also show that under the weaker condition of having a small covering number for an optimal reachable space, which is the subset of the belief space reachable under an optimal policy, computing an approximately optimal solution is NP-hard. However, given a suitable set of points that “cover” an optimal reachable space well, an approximate solution can be computed in polynomial time. The covering number highlights several interesting properties that reduce the complexity of POMDP planning in practice, e.g., fully observed state variables, beliefs with sparse support, smooth beliefs, and circulant state-transition matrices. 1
3 0.27887988 30 nips-2007-Bayes-Adaptive POMDPs
Author: Stephane Ross, Brahim Chaib-draa, Joelle Pineau
Abstract: Bayesian Reinforcement Learning has generated substantial interest recently, as it provides an elegant solution to the exploration-exploitation trade-off in reinforcement learning. However most investigations of Bayesian reinforcement learning to date focus on the standard Markov Decision Processes (MDPs). Our goal is to extend these ideas to the more general Partially Observable MDP (POMDP) framework, where the state is a hidden variable. To address this problem, we introduce a new mathematical model, the Bayes-Adaptive POMDP. This new model allows us to (1) improve knowledge of the POMDP domain through interaction with the environment, and (2) plan optimal sequences of actions which can tradeoff between improving the model, identifying the state, and gathering reward. We show how the model can be finitely approximated while preserving the value function. We describe approximations for belief tracking and planning in this model. Empirical results on two domains show that the model estimate and agent’s return improve over time, as the agent learns better model estimates. 1
4 0.21389547 199 nips-2007-The Price of Bandit Information for Online Optimization
Author: Varsha Dani, Sham M. Kakade, Thomas P. Hayes
Abstract: In the online linear optimization problem, a learner must choose, in each round, a decision from a set D ⊂ Rn in order to minimize an (unknown and changing) linear cost function. We present sharp rates of convergence (with respect to additive regret) for both the full information setting (where the cost function is revealed at the end of each round) and the bandit setting (where only the scalar cost incurred is revealed). In particular, this paper is concerned with the price of bandit information, by which we mean the ratio of the best achievable regret in the bandit setting to that in the full-information setting. For the full informa√ tion case, the upper bound on the regret is O∗ ( nT ), where n is the ambient dimension and T is the time horizon. For the bandit case, we present an algorithm √ which achieves O∗ (n3/2 T ) regret — all previous (nontrivial) bounds here were O(poly(n)T 2/3 ) or worse. It is striking that the convergence rate for the bandit setting is only a factor of n worse than in the full information case — in stark contrast to the K-arm bandit setting, where the gap in the dependence on K is √ √ exponential ( T K√ vs. T log K). We also present lower bounds showing that this gap is at least n, which we conjecture to be the correct order. The bandit algorithm we present can be implemented efficiently in special cases of particular interest, such as path planning and Markov Decision Problems. 1
5 0.16104543 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods
Author: Alessandro Lazaric, Marcello Restelli, Andrea Bonarini
Abstract: Learning in real-world domains often requires to deal with continuous state and action spaces. Although many solutions have been proposed to apply Reinforcement Learning algorithms to continuous state problems, the same techniques can be hardly extended to continuous action spaces, where, besides the computation of a good approximation of the value function, a fast method for the identification of the highest-valued action is needed. In this paper, we propose a novel actor-critic approach in which the policy of the actor is estimated through sequential Monte Carlo methods. The importance sampling step is performed on the basis of the values learned by the critic, while the resampling step modifies the actor’s policy. The proposed approach has been empirically compared to other learning algorithms into several domains; in this paper, we report results obtained in a control problem consisting of steering a boat across a river. 1
6 0.14591487 86 nips-2007-Exponential Family Predictive Representations of State
7 0.12550761 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
8 0.11373898 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs
9 0.11296435 162 nips-2007-Random Sampling of States in Dynamic Programming
10 0.11060454 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs
11 0.109279 68 nips-2007-Discovering Weakly-Interacting Factors in a Complex Stochastic Process
12 0.10693339 21 nips-2007-Adaptive Online Gradient Descent
13 0.10375308 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs
14 0.096306212 197 nips-2007-The Infinite Markov Model
15 0.09095075 102 nips-2007-Incremental Natural Actor-Critic Algorithms
16 0.084642 116 nips-2007-Learning the structure of manifolds using random projections
17 0.081790239 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC
18 0.079495192 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning
19 0.077992573 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons
20 0.074049711 20 nips-2007-Adaptive Embedded Subgraph Algorithms using Walk-Sum Analysis
simIndex simValue paperId paperTitle
same-paper 1 0.97591394 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs
2 0.8827672 215 nips-2007-What makes some POMDP problems easy to approximate?
3 0.8511374 30 nips-2007-Bayes-Adaptive POMDPs
4 0.50220442 68 nips-2007-Discovering Weakly-Interacting Factors in a Complex Stochastic Process
Author: Charlie Frogner, Avi Pfeffer
Abstract: Dynamic Bayesian networks are structured representations of stochastic processes. Despite their structure, exact inference in DBNs is generally intractable. One approach to approximate inference involves grouping the variables in the process into smaller factors and keeping independent beliefs over these factors. In this paper we present several techniques for decomposing a dynamic Bayesian network automatically to enable factored inference. We examine a number of features of a DBN that capture different types of dependencies that will cause error in factored inference. An empirical comparison shows that the most useful of these is a heuristic that estimates the mutual information introduced between factors by one step of belief propagation. In addition to features computed over entire factors, for efficiency we explored scores computed over pairs of variables. We present search methods that use these features, pairwise and not, to find a factorization, and we compare their results on several datasets. Automatic factorization extends the applicability of factored inference to large, complex models that are undesirable to factor by hand. Moreover, tests on real DBNs show that automatic factorization can achieve significantly lower error in some cases. 1
5 0.44069555 86 nips-2007-Exponential Family Predictive Representations of State
Author: David Wingate, Satinder S. Baveja
Abstract: In order to represent state in controlled, partially observable, stochastic dynamical systems, some sort of sufficient statistic for history is necessary. Predictive representations of state (PSRs) capture state as statistics of the future. We introduce a new model of such systems called the “Exponential family PSR,” which defines as state the time-varying parameters of an exponential family distribution which models n sequential observations in the future. This choice of state representation explicitly connects PSRs to state-of-the-art probabilistic modeling, which allows us to take advantage of current efforts in high-dimensional density estimation, and in particular, graphical models and maximum entropy models. We present a parameter learning algorithm based on maximum likelihood, and we show how a variety of current approximate inference methods apply. We evaluate the quality of our model with reinforcement learning by directly evaluating the control performance of the model. 1
6 0.41683757 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs
7 0.40680277 78 nips-2007-Efficient Principled Learning of Thin Junction Trees
8 0.40617362 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods
9 0.40117061 199 nips-2007-The Price of Bandit Information for Online Optimization
10 0.39969051 52 nips-2007-Competition Adds Complexity
11 0.38801432 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
12 0.3684248 162 nips-2007-Random Sampling of States in Dynamic Programming
13 0.34225011 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs
14 0.33509517 171 nips-2007-Scan Strategies for Meteorological Radars
15 0.33301154 27 nips-2007-Anytime Induction of Cost-sensitive Trees
16 0.31716177 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs
17 0.31555191 197 nips-2007-The Infinite Markov Model
18 0.30678603 20 nips-2007-Adaptive Embedded Subgraph Algorithms using Walk-Sum Analysis
19 0.29542145 116 nips-2007-Learning the structure of manifolds using random projections
20 0.28377509 120 nips-2007-Linear programming analysis of loopy belief propagation for weighted matching
simIndex simValue paperId paperTitle
1 0.90149713 150 nips-2007-Optimal models of sound localization by barn owls
Author: Brian J. Fischer
Abstract: Sound localization by barn owls is commonly modeled as a matching procedure where localization cues derived from auditory inputs are compared to stored templates. While the matching models can explain properties of neural responses, no model explains how the owl resolves spatial ambiguity in the localization cues to produce accurate localization for sources near the center of gaze. Here, I examine two models for the barn owl’s sound localization behavior. First, I consider a maximum likelihood estimator in order to further evaluate the cue matching model. Second, I consider a maximum a posteriori estimator to test whether a Bayesian model with a prior that emphasizes directions near the center of gaze can reproduce the owl’s localization behavior. I show that the maximum likelihood estimator can not reproduce the owl’s behavior, while the maximum a posteriori estimator is able to match the behavior. This result suggests that the standard cue matching model will not be sufficient to explain sound localization behavior in the barn owl. The Bayesian model provides a new framework for analyzing sound localization in the barn owl and leads to predictions about the owl’s localization behavior.
same-paper 2 0.76299113 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs
3 0.70095873 96 nips-2007-Heterogeneous Component Analysis
Author: Shigeyuki Oba, Motoaki Kawanabe, Klaus-Robert Müller, Shin Ishii
Abstract: In bioinformatics it is often desirable to combine data from various measurement sources and thus structured feature vectors are to be analyzed that possess different intrinsic blocking characteristics (e.g., different patterns of missing values, observation noise levels, effective intrinsic dimensionalities). We propose a new machine learning tool, heterogeneous component analysis (HCA), for feature extraction in order to better understand the factors that underlie such complex structured heterogeneous data. HCA is a linear block-wise sparse Bayesian PCA based not only on a probabilistic model with block-wise residual variance terms but also on a Bayesian treatment of a block-wise sparse factor-loading matrix. We study various algorithms that implement our HCA concept extracting sparse heterogeneous structure by obtaining common components for the blocks and specific components within each block. Simulations on toy and bioinformatics data underline the usefulness of the proposed structured matrix factorization concept. 1
4 0.69963759 30 nips-2007-Bayes-Adaptive POMDPs
5 0.65579152 215 nips-2007-What makes some POMDP problems easy to approximate?
6 0.64403182 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations
7 0.63806975 63 nips-2007-Convex Relaxations of Latent Variable Training
8 0.62840807 158 nips-2007-Probabilistic Matrix Factorization
9 0.62774229 115 nips-2007-Learning the 2-D Topology of Images
10 0.62668782 134 nips-2007-Multi-Task Learning via Conic Programming
11 0.62522805 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images
12 0.62459844 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
13 0.62458771 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes
14 0.62339753 185 nips-2007-Stable Dual Dynamic Programming
15 0.62171084 7 nips-2007-A Kernel Statistical Test of Independence
16 0.62113512 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)
17 0.62109792 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
18 0.62076944 187 nips-2007-Structured Learning with Approximate Inference
19 0.6186282 102 nips-2007-Incremental Natural Actor-Critic Algorithms
20 0.61830443 41 nips-2007-COFI RANK - Maximum Margin Matrix Factorization for Collaborative Ranking