nips nips2000 nips2000-1 knowledge-graph by maker-knowledge-mining

1 nips-2000-APRICODD: Approximate Policy Construction Using Decision Diagrams


Source: pdf

Author: Robert St-Aubin, Jesse Hoey, Craig Boutilier

Abstract: We propose a method of approximate dynamic programming for Markov decision processes (MDPs) using algebraic decision diagrams (ADDs). We produce near-optimal value functions and policies with much lower time and space requirements than exact dynamic programming. Our method reduces the sizes of the intermediate value functions generated during value iteration by replacing the values at the terminals of the ADD with ranges of values. Our method is demonstrated on a class of large MDPs (with up to 34 billion states), and we compare the results with the optimal value functions.

Reference: text


Summary: the most important sentences generated by the tf-idf model

sentIndex sentText sentNum sentScore

1 edu Abstract We propose a method of approximate dynamic programming for Markov decision processes (MDPs) using algebraic decision diagrams (ADDs). [sent-10, score-0.762]

2 We produce near-optimal value functions and policies with much lower time and space requirements than exact dynamic programming. [sent-11, score-0.307]

3 Our method reduces the sizes of the intermediate value functions generated during value iteration by replacing the values at the terminals of the ADD with ranges of values. [sent-12, score-0.767]

4 Our method is demonstrated on a class of large MDPs (with up to 34 billion states), and we compare the results with the optimal value functions. [sent-13, score-0.247]

5 1 Introduction The last decade has seen much interest in structured approaches to solving planning problems under uncertainty formulated as Markov decision processes (MDPs). [sent-14, score-0.259]

6 Structured approaches using decision trees have been applied to classical dynamic programming (DP) algorithms such as value iteration and policy iteration [7, 3]. [sent-16, score-0.85]

7 This can be accomplished by reducing the "level of detail" in the representation and aggregating states with similar (rather than identical) value. [sent-21, score-0.121]

8 Approximations of this kind have been examined in the context of tree structured approaches [2]; this paper extends this research by applying them to ADDs. [sent-22, score-0.142]

9 We develop two approximation methods for ADD-structured value functions, and apply them to the value diagrams generated during dynamic programming. [sent-25, score-0.748]

10 The result is a nearoptimal value function and policy. [sent-26, score-0.162]

11 We examine the tradeoff between computation time and decision quality, and consider several variable reordering strategies that facilitate approximate aggregation. [sent-27, score-0.685]

12 2 Solving MDPs using Algebraic Decision Diagrams We assume a fully-observable MDP [10] with finite sets of states S and actions A, transition function Pr(s, a, t), reward function R, and a discounted infinite-horizon optimality criterion with discount factor β. [sent-28, score-0.222]

13 Value iteration can be used to compute an optimal stationary policy π : S → A by constructing a series of n-stage-to-go value functions, where: V^{n+1}(s) = R(s) + max_{a ∈ A} { β · Σ_{t ∈ S} Pr(s, a, t) · V^n(t) }   (1) [sent-29, score-0.508]

14 The sequence of value functions V^n produced by value iteration converges linearly to the optimal value function V*. [sent-30, score-0.927]
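As a concrete reference for Eq. 1, the following sketch runs ordinary explicit-state value iteration on a toy problem; the transition table, reward vector, discount and tolerance are illustrative assumptions, and this flat enumeration is exactly the baseline the ADD-based method is meant to improve on, not the authors' implementation.

    # Explicit-state value iteration (Eq. 1).  P[(s, a)] maps successor states t
    # to Pr(s, a, t); R[s] is the immediate reward; beta is the discount factor.
    def value_iteration(states, actions, P, R, beta=0.9, tol=1e-6):
        V = {s: 0.0 for s in states}
        while True:
            V_new = {}
            for s in states:
                V_new[s] = R[s] + beta * max(
                    sum(p * V[t] for t, p in P[(s, a)].items()) for a in actions
                )
            if max(abs(V_new[s] - V[s]) for s in states) < tol:
                return V_new
            V = V_new

    # Toy two-state, two-action MDP (all numbers illustrative).
    states, actions = [0, 1], ["a", "b"]
    P = {(0, "a"): {0: 0.8, 1: 0.2}, (0, "b"): {1: 1.0},
         (1, "a"): {1: 1.0},         (1, "b"): {0: 0.5, 1: 0.5}}
    R = {0: 0.0, 1: 1.0}
    print(value_iteration(states, actions, P, R))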

15 For some finite n, the actions that maximize Equation 1 form an optimal policy, and V^n approximates its value. [sent-31, score-0.231]

16 ADDs [1, 4, 5] are a compact, efficiently manipulable data structure for representing real-valued functions over boolean variables, B^n → R. [sent-32, score-0.2]

17 They generalize a tree-structured representation by allowing nodes to have multiple parents, leading to the recombination of isomorphic subgraphs and hence to a possible reduction in the representation size. [sent-33, score-0.157]
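A minimal sketch of the node sharing that makes this possible, using a hash-consed unique table so that structurally identical subgraphs are stored once; this is a generic illustration of the reduction idea, not the CUDD-based ADD package used in the paper, and all names are hypothetical.

    # Hash-consed decision-diagram nodes: isomorphic subgraphs are recombined,
    # which is why an ADD can be much smaller than the equivalent tree.
    _unique = {}  # (var, id(low), id(high)) or ("leaf", value) -> node

    class Node:
        __slots__ = ("var", "low", "high", "value")
        def __init__(self, var=None, low=None, high=None, value=None):
            self.var, self.low, self.high, self.value = var, low, high, value

    def leaf(value):
        return _unique.setdefault(("leaf", value), Node(value=value))

    def node(var, low, high):
        if low is high:          # both branches identical: drop the test
            return low
        return _unique.setdefault((var, id(low), id(high)), Node(var, low, high))

    # Two separate constructions of the same function share every node.
    f = node("x1", leaf(0.0), node("x2", leaf(1.0), leaf(2.0)))
    g = node("x1", leaf(0.0), node("x2", leaf(1.0), leaf(2.0)))
    print(f is g, len(_unique))  # True 5  (3 leaves + 2 internal nodes)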

18 Specifically, a DBN for action a requires two sets of variables, one set X = {X_1, ... [sent-44, score-0.153]

19 ..., X_n} referring to the state of the system before action a has been executed, and X' = {X'_1, ... [sent-47, score-0.190]

20 Directed arcs from variables in X to variables in X' indicate direct causal influence. [sent-51, score-0.122]

21 The conditional probability table (CPT) for each post-action variable X'_i defines a conditional distribution P^a_{X'_i} over X'_i. [sent-52, score-0.123]

22 ..., X_n), but where the function value (distribution) depends only on those X_j that are parents of X'_i. We represent this function using an ADD. [sent-58, score-0.193]

23 Figure 1(a) shows a simple example of a single action represented as a DBN as well as a reward function. [sent-60, score-0.234]

24 The ADD representations of the CPTs for each action, P^a_{X'_i}(X), are referred to as action diagrams, as shown in Figure 1(b), where X represents the set of pre-action variables, {X_1, ... [sent-64, score-0.219]

25 These action diagrams can be combined into a complete action diagram (Figure 1(c)): P^a(X', X) = Π_{i=1}^{n} [ X'_i · P^a_{X'_i}(X) + (1 − X'_i) · (1 − P^a_{X'_i}(X)) ]   (2) [sent-68, score-0.881]

26 The complete action diagram represents all the effects of pre-action variables on post-action variables for a given action. [sent-70, score-0.504]
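Read explicitly rather than diagrammatically, Eq. 2 says that the transition probability factors into the per-variable CPT entries of the DBN. The sketch below computes Pr(s, a, t) in that factored way; the CPT data layout is an illustrative assumption standing in for the action diagrams the paper stores as ADDs.

    # Factored transition probability: the product over post-action variables X'_i
    # of Pr(X'_i = x'_i | values of its parents in the pre-action state x).
    def transition_prob(x, x_prime, cpts):
        # cpts[i] = (parent_indices, table) with table[parent_values] = Pr(X'_i = 1)
        prob = 1.0
        for i, (parents, table) in enumerate(cpts):
            p_true = table[tuple(x[j] for j in parents)]
            prob *= p_true if x_prime[i] else (1.0 - p_true)
        return prob

    # Two boolean variables: X'_0 depends on X_0; X'_1 depends on X_0 and X_1.
    cpts = [
        ((0,),   {(0,): 0.1, (1,): 0.9}),
        ((0, 1), {(0, 0): 0.2, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.95}),
    ]
    print(transition_prob((1, 0), (1, 1), cpts))  # 0.9 * 0.6 = 0.54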

27 The immediate reward function R(X') is also represented as an ADD, as are the n-stage-to-go value functions V^n(X). [sent-71, score-0.324]

28 Given the complete action diagrams for each action, and the immediate reward function, value iteration can be performed by setting V^0 = R and applying Eq. 1. [sent-72, score-0.941]

29 Figure 2: Approximation of the original value diagram (a) at two error tolerances, shown in (b) and (c). [sent-83, score-0.36]

30 The value iteration loop is continued until some stopping criterion is met. [sent-88, score-0.334]

31 3 Approximating Value Functions While structured solution techniques offer many advantages, the exact solution of MDPs in this way can only work if there are "few" distinct values in a value function. [sent-90, score-0.287]

32 Even if a DBN representation shows little dependence among variables from one stage to another, the influence of variables tends to "bleed" through a DBN over time, and many variables become relevant to predicting value. [sent-91, score-0.218]

33 Thus, even using structured methods, we must often relax the optimality constraint and generate only approximate value functions, from which near-optimal policies will hopefully arise. [sent-92, score-0.391]

34 Replacing such values with a single approximate value leads to size reduction, while not significantly affecting the precision of the value diagrams. [sent-94, score-0.411]

35 3.1 Decision Diagrams and Approximation Consider the value diagram shown in Figure 2(a), which has eight distinct values as shown. [sent-96, score-0.406]

36 The value of each state s is represented as a pair [l, u], where the lower, l, and upper, u, bounds on the values are both represented. [sent-97, score-0.245]

37 Now suppose that the diagram in Figure 2(a) exceeds resource limits, and a reduction in size is necessary to continue the value iteration process. [sent-100, score-0.609]

38 Merging all leaves whose values lie within each of the two error tolerances yields the diagrams in Figure 2(b) and (c), respectively. [sent-103, score-0.306]

39 The states which had proximal values have been merged, where merging a set of states s_1, s_2, ..., with ranged values [l_1, u_1], ..., [l_n, u_n], ... [sent-104, score-0.169]

40 ..., results in an aggregate state, t, with a ranged value [min(l_1, ..., l_n), max(u_1, ..., u_n)]. [sent-110, score-0.284]
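The merge operation itself is simple; a minimal sketch, assuming leaves are kept as (lower, upper) pairs (the function name is mine):

    # Merge ranged leaves [l_1, u_1], ..., [l_n, u_n] into one aggregate leaf
    # [min(l_i), max(u_i)], as in the approximation step of Figure 2.
    def merge_leaves(leaves):
        return (min(l for l, _ in leaves), max(u for _, u in leaves))

    # Exact values are degenerate ranges [v, v].
    print(merge_leaves([(9.0, 9.0), (9.3, 9.3), (9.5, 9.5)]))  # (9.0, 9.5)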

41 The midpoint of the range estimates the true value of the states with minimal error, namely, span(t)/2. [sent-117, score-0.206]

42 The span of V is the maximum of all spans in the value diagram, and therefore the maximum error in V is simply span(V)/2 [2]. [sent-118, score-0.527]

43 The combined span of a set of states is the span of the pair that would result from merging them all. [sent-119, score-0.449]

44 The extent of a value diagram V is the combined span of the portion of the state space which it represents. [sent-120, score-0.664]
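These error quantities follow directly from the definitions above; a small sketch with hypothetical helper names:

    # Span of a ranged leaf, worst-case error of its midpoint estimate (span/2),
    # and the combined span of a set of leaves (the extent, when the set covers
    # the whole value diagram).
    def span(leaf):
        lo, hi = leaf
        return hi - lo

    def midpoint_error(leaf):
        return span(leaf) / 2.0

    def combined_span(leaves):
        return max(u for _, u in leaves) - min(l for l, _ in leaves)

    leaves = [(9.0, 9.5), (2.0, 2.4), (0.0, 0.0)]
    print(span((9.0, 9.5)), midpoint_error((9.0, 9.5)), combined_span(leaves))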

45 ADD-structured value functions can be leveraged by approximation techniques because approximations can always be performed directly without pre-processing techniques such as variable reordering. [sent-124, score-0.339]

46 Of course, variable reordering can still play an important computational role in ADD-structured methods, but it is not needed for discovering approximations. [sent-125, score-0.478]

47 3.2 Value Iteration with Approximate Value Functions Approximate value iteration simply means applying an approximation technique to the n-stage-to-go value function generated at each iteration of Eq. 1. [sent-127, score-0.723]

48 In contrast, decision quality might require errors below some fixed value, referred to as the pruning strength, δ. [sent-130, score-0.437]

49 Thus, the objective of a single approximation step is a reduction in the size of a ranged value ADD by replacing all leaves which have combined spans less than the specified error bound by a single leaf. [sent-132, score-0.63]

50 Given a leaf [l, u] in V, the set of all leaves [l_i, u_i] such that the combined span of [l_i, u_i] with [l, u] is less than the specified error are merged. [sent-133, score-0.352]

51 We have also examined a quicker, but less exact, method for approximation, which exploits the fact that simply reducing the precision of the values at the leaves of an ADD merges the similar values. [sent-135, score-0.224]
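One way to read this quicker variant is as snapping leaf values onto a coarser grid so that nearby values become identical and their leaves merge automatically; the sketch below follows that reading, with the grid width as an illustrative parameter rather than the paper's exact scheme.

    # Reduce the precision of leaf values so that similar values collide.
    def coarsen(values, grid=0.5):
        return sorted({round(v / grid) * grid for v in values})

    print(coarsen([9.0, 9.2, 9.3, 2.0, 2.1, 0.05]))  # [0.0, 2.0, 9.0, 9.5]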

52 The sequence of ranged value functions, V^n, converges after n' iterations to an approximate (non-ranged) value function, V, by taking the mid-points of each ranged terminal node in V^{n'}. [sent-137, score-0.743]

53 The pruning strength, δ, then gives the percentage difference between V and the optimal n'-stage-to-go value function V^{n'}. [sent-138, score-0.488]

54 The value function V induces a policy, π, the value of which is V^π. In general, however, V^π ≠ V [11]. [sent-139, score-0.324]

55 3.3 Variable Reordering As previously mentioned, variable reordering can have a significant effect on the size of an ADD, but finding the variable ordering which gives rise to the smallest ADD for a boolean function is co-NP-complete [4]. [sent-141, score-0.815]

56 The first two are standard for reordering variables in BDDs: Rudell's sifting algorithm and random reordering [12]. [sent-143, score-0.936]

57 The last reordering method we consider arises in the decision tree induction literature, and is related to the information gain criterion. [sent-144, score-0.524]

58 Given a value diagram V with extent δ, each variable x is considered in turn. [sent-145, score-0.499]

59 The value diagram is restricted first with x = true, and the extent δ_t and the number of leaves n_t are calculated for the restricted ADD. [sent-146, score-0.488]

60 If we collapsed the entire ADD into a single node, assuming a uniform distribution over values in the resulting range, its entropy could be computed. (In fact, the equality V^π = V arises if and only if V = V*, where V* is the optimal value function.) [sent-148, score-0.28]

61 Splitting the values with the variable x results in two new value diagrams, for each of which the entropy is calculated. [sent-150, score-0.285]

62 This method will be referred to as the minimum span method. [sent-152, score-0.196]
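To make the information-gain flavour of this heuristic concrete, the sketch below scores a candidate split variable by the weighted entropy of the leaf values on its true and false restrictions; the exact scoring is one reading of the description above, not code from the paper, and the explicit state table stands in for restricting an ADD.

    import math
    from collections import Counter

    def entropy(values):
        counts, n = Counter(values), len(values)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    def split_score(table, var):
        # table maps boolean-tuple states to values; a lower score means the
        # variable separates the values more cleanly and should sit higher.
        left  = [v for s, v in table.items() if not s[var]]
        right = [v for s, v in table.items() if s[var]]
        n = len(table)
        return (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)

    table = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 5.0, (1, 1): 9.0}
    print(split_score(table, 0), split_score(table, 1))  # 0.5 1.0: split on x0 first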

63 Our approximation methods were tested on various adaptations of a process planning problem taken from [7, 8]. [sent-156, score-0.112]

64 4.1 Approximation All experiments in this section were performed on problem domains where the variable ordering was the one selected implicitly by the constructors of the domains. [sent-158, score-0.163]

65 [Residue of Table 1: per-pruning-strength iteration counts, internal node counts (from 22170 down to 0), leaf counts (from 527 down to 1), and the policy error |V^π − V*| (%).] [sent-171, score-0.114]

66 Table 1: Comparing optimal with approximate value iteration on a domain with 28 boolean variables. [sent-180, score-0.046]

67 In Table 1 we compare optimal value iteration using ADDs (SPUDD as presented in [8]) with approximate value iteration using different pruning strengths δ. [sent-181, score-1.204]

68 In order to avoid overly aggressive pruning in the early stage of the value iterations, we need to take into account the size of the value function at every iteration. [sent-182, score-0.645]

69 Therefore, we use a sliding pruning strength specified as δ · Σ_{i=0}^{n} β^i · extent(R), where R is the initial reward diagram, β is the discount factor introduced earlier, and n is the iteration number. [sent-183, score-0.643]
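The sliding threshold is just this geometric sum scaled by the extent of the reward diagram; a small sketch (all parameter values illustrative):

    # Sliding pruning strength at iteration n: delta * sum_{i=0..n} beta^i * extent(R),
    # so the absolute tolerance grows with the magnitude the value function can
    # reach after n backups while delta stays a fixed relative strength (e.g. 3%).
    def sliding_threshold(delta, beta, extent_R, n):
        return delta * sum(beta ** i for i in range(n + 1)) * extent_R

    for n in (0, 5, 50):
        print(n, round(sliding_threshold(0.03, 0.9, 10.0, n), 3))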

70 We illustrate running time, value function size (internal nodes and leaf nodes), number of iterations, and the average sum of squared difference between the optimal value function, V*, and the value of the approximate policy, V^π. [sent-184, score-0.797]

71 It is important to note that the pruning strength is an upper bound on the approximation error. [sent-185, score-0.384]

72 That is, the optimal values are guaranteed to lie within the ranges of the approximate value function (see [9] for details). [sent-186, score-0.249]

73 However, as noted earlier, this bound does not hold for the value of an induced policy, as can be seen at 3% pruning in the last column of Table 1. [sent-189, score-0.445]

74 The effects of approximation on the performance of the value iteration algorithm are threefold. [sent-190, score-0.389]

75 First, the approximation itself introduces an overhead which depends on the size of the value function being approximated. [sent-191, score-0.255]

76 This effect can be seen in Table 1 at low pruning strengths (1 - 2%), where the running time is increased from that taken by optimal value iteration. [sent-192, score-0.701]

77 Second, the ranges in the value function reduce the number of iterations needed to attain convergence, as can be seen in Table 1 for pruning strengths greater than 2%. [sent-193, score-0.696]

78 However, for the lower pruning strengths, this effect is not observed. [sent-194, score-0.325]

79 This can be explained by the fact that a small number of states with values much greater (or much lower) than that of the rest of the state space may never be approximated. [sent-195, score-0.127]

80 Therefore, to converge, this portion of the state space requires the same number of iterations as in the optimal case. [sent-196, score-0.129]

81 The third effect of approximation is to reduce the size of the value functions, thus reducing the per iteration computation time during value iteration. [sent-197, score-0.631]

82 This effect is clearly seen at pruning strengths greater than 2%, where it overtakes the cost of approximation, and generates significant time and space savings. [sent-198, score-0.451]

83 Speed-ups of 2- and 4-fold are obtained for pruning strengths of 3% and 4%, respectively. [sent-199, score-0.409]

84 Furthermore, fewer than 60 leaf nodes represent the entire state space, while value errors in the policy do not exceed 6%. [sent-200, score-0.46]

85 This confirms our initial hypothesis that many values within a given domain are very similar and thus, replacing such values with ranges drastically reduces the size of the resulting diagram without significantly affecting the quality of the resulting policy. [sent-201, score-0.52]

86 Pruning strengths of more than 40% generate policies which are close to trivial, where a single action is always taken. [sent-203, score-0.316]

87 Results in the previous section were all generated using the "intuitive" variable ordering for the problem at hand. [sent-206, score-0.163]

88 It is probable that such an ordering is close to optimal, but such orderings may not always be obvious, and the effects of a poor ordering on the resources required for policy generation can be extreme. [sent-207, score-0.332]

89 Therefore, to characterize the reordering methods discussed in Section 3.3, [sent-208, score-0.401]

90 we start with initially randomly shuffled orders and compare the sizes of the final value diagrams with those found using the intuitive order. [sent-209, score-0.843]

91 We are currently looking into alleviating this effect in order to increase convergence speed for low pruning strengths. In Figure 3 we present results obtained from approximate value iteration with a pruning strength of 3% applied to a range of problem domain sizes. [sent-210, score-1.234]

92 In the absence of any reordering, diagrams produced with randomly shuffled variable orders are up to 3 times larger than those produced with the intuitive (unshuffled) order. [sent-211, score-0.755]

93 The minimum span reordering method, starting from a randomly shuffled order, finds orders which are equivalent to the intuitive one, producing value diagrams with nearly identical size. [sent-212, score-1.35]

94 The sifting and random reordering methods find orders which reduce the sizes further by up to a factor of 7. [sent-213, score-0.602]

95 Value iteration with the sifting reordering method (starting with shuffled orders) was found to run in time similar to that of value iteration with the intuitive ordering, while the other reordering methods took slightly longer. [sent-215, score-1.628]

96 All reordering methods, however, reduced running times and diagram sizes from that using no reordering, by factors of 3 to 5. [sent-216, score-0.703]

97 5 Concluding Remarks We examined a method for approximate dynamic programming for MDPs using ADDs. [sent-217, score-0.21]

98 Investigations into the use of variable reordering during value iteration have also proved fruitful, and yield large improvements in the sizes of value diagrams. [sent-220, score-1.033]

99 Results show that our policy generator is robust to the variable order, and so this is no longer a constraint for problem specification. [sent-221, score-0.208]

100 Spectral transforms for large boolean functions with applications to technology mapping. [sent-249, score-0.139]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('reordering', 0.401), ('diagrams', 0.306), ('pruning', 0.283), ('diagram', 0.198), ('iteration', 0.172), ('shuffled', 0.171), ('span', 0.165), ('value', 0.162), ('action', 0.153), ('vn', 0.153), ('adds', 0.143), ('mdps', 0.135), ('policy', 0.131), ('strengths', 0.126), ('decision', 0.123), ('hoey', 0.122), ('pjc', 0.122), ('ranged', 0.122), ('add', 0.119), ('craig', 0.115), ('boolean', 0.094), ('ordering', 0.086), ('dbn', 0.084), ('approximate', 0.084), ('reward', 0.081), ('structured', 0.079), ('variable', 0.077), ('intuitive', 0.076), ('ranges', 0.076), ('reorder', 0.076), ('jesse', 0.073), ('sifting', 0.073), ('orders', 0.069), ('leaves', 0.066), ('boutilier', 0.063), ('algebraic', 0.063), ('examined', 0.063), ('dynamic', 0.063), ('extent', 0.062), ('variables', 0.061), ('richard', 0.06), ('sizes', 0.059), ('planning', 0.057), ('approximation', 0.055), ('leaf', 0.053), ('iterations', 0.049), ('cpts', 0.049), ('cudd', 0.049), ('dearden', 0.049), ('fabio', 0.049), ('lza', 0.049), ('merges', 0.049), ('spudd', 0.049), ('unshuffled', 0.049), ('vancouver', 0.049), ('nodes', 0.048), ('values', 0.046), ('table', 0.046), ('strength', 0.046), ('functions', 0.045), ('replacing', 0.045), ('running', 0.045), ('mdp', 0.045), ('states', 0.044), ('optimal', 0.043), ('aggregating', 0.042), ('alan', 0.042), ('billion', 0.042), ('british', 0.042), ('columbia', 0.042), ('terminal', 0.042), ('effect', 0.042), ('dp', 0.041), ('combined', 0.04), ('reduction', 0.039), ('robert', 0.038), ('bc', 0.038), ('hu', 0.038), ('size', 0.038), ('state', 0.037), ('policies', 0.037), ('immediate', 0.036), ('domain', 0.036), ('affecting', 0.035), ('merging', 0.035), ('spans', 0.035), ('representation', 0.035), ('actions', 0.035), ('advantages', 0.035), ('discount', 0.033), ('parents', 0.031), ('complete', 0.031), ('referred', 0.031), ('un', 0.03), ('entire', 0.029), ('optimality', 0.029), ('resources', 0.029), ('produced', 0.028), ('specified', 0.028), ('trees', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 1 nips-2000-APRICODD: Approximate Policy Construction Using Decision Diagrams

Author: Robert St-Aubin, Jesse Hoey, Craig Boutilier

Abstract: We propose a method of approximate dynamic programming for Markov decision processes (MDPs) using algebraic decision diagrams (ADDs). We produce near-optimal value functions and policies with much lower time and space requirements than exact dynamic programming. Our method reduces the sizes of the intermediate value functions generated during value iteration by replacing the values at the terminals of the ADD with ranges of values. Our method is demonstrated on a class of large MDPs (with up to 34 billion states), and we compare the results with the optimal value functions.

2 0.15311366 142 nips-2000-Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task

Author: Brian Sallans, Geoffrey E. Hinton

Abstract: The problem of reinforcement learning in large factored Markov decision processes is explored. The Q-value of a state-action pair is approximated by the free energy of a product of experts network. Network parameters are learned on-line using a modified SARSA algorithm which minimizes the inconsistency of the Q-values of consecutive state-action pairs. Actions are chosen based on the current value estimates by fixing the current state and sampling actions from the network using Gibbs sampling. The algorithm is tested on a co-operative multi-agent task. The product of experts model is found to perform comparably to table-based Q-Iearning for small instances of the task, and continues to perform well when the problem becomes too large for a table-based representation.

3 0.13410461 48 nips-2000-Exact Solutions to Time-Dependent MDPs

Author: Justin A. Boyan, Michael L. Littman

Abstract: We describe an extension of the Markov decision process model in which a continuous time dimension is included in the state space. This allows for the representation and exact solution of a wide range of problems in which transitions or rewards vary over time. We examine problems based on route planning with public transportation and telescope observation scheduling. 1

4 0.1094934 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region

Author: Geoffrey J. Gordon

Abstract: Many algorithms for approximate reinforcement learning are not known to converge. In fact, there are counterexamples showing that the adjustable weights in some algorithms may oscillate within a region rather than converging to a point. This paper shows that, for two popular algorithms, such oscillation is the worst that can happen: the weights cannot diverge, but instead must converge to a bounded region. The algorithms are SARSA(O) and V(O); the latter algorithm was used in the well-known TD-Gammon program. 1

5 0.10588822 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning

Author: Christian R. Shelton

Abstract: For many problems which would be natural for reinforcement learning, the reward signal is not a single scalar value but has multiple scalar components. Examples of such problems include agents with multiple goals and agents with multiple users. Creating a single reward value by combining the multiple components can throwaway vital information and can lead to incorrect solutions. We describe the multiple reward source problem and discuss the problems with applying traditional reinforcement learning. We then present an new algorithm for finding a solution and results on simulated environments.

6 0.10238975 26 nips-2000-Automated State Abstraction for Options using the U-Tree Algorithm

7 0.097817235 73 nips-2000-Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice

8 0.088462383 148 nips-2000-`N-Body' Problems in Statistical Learning

9 0.087954149 105 nips-2000-Programmable Reinforcement Learning Agents

10 0.080540158 63 nips-2000-Hierarchical Memory-Based Reinforcement Learning

11 0.075574771 140 nips-2000-Tree-Based Modeling and Estimation of Gaussian Processes on Graphs with Cycles

12 0.070149653 41 nips-2000-Discovering Hidden Variables: A Structure-Based Approach

13 0.062933266 39 nips-2000-Decomposition of Reinforcement Learning for Admission Control of Self-Similar Call Arrival Processes

14 0.059521563 43 nips-2000-Dopamine Bonuses

15 0.057586547 114 nips-2000-Second Order Approximations for Probability Models

16 0.057499986 120 nips-2000-Sparse Greedy Gaussian Process Regression

17 0.053994577 113 nips-2000-Robust Reinforcement Learning

18 0.053436905 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting

19 0.052667748 13 nips-2000-A Tighter Bound for Graphical Models

20 0.051585849 104 nips-2000-Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.192), (1, -0.026), (2, 0.103), (3, -0.253), (4, -0.121), (5, 0.017), (6, -0.022), (7, 0.027), (8, -0.05), (9, 0.051), (10, -0.039), (11, 0.045), (12, -0.038), (13, 0.014), (14, -0.02), (15, 0.023), (16, -0.0), (17, 0.11), (18, -0.085), (19, 0.064), (20, -0.029), (21, 0.023), (22, -0.054), (23, 0.017), (24, -0.014), (25, 0.012), (26, -0.026), (27, 0.047), (28, -0.001), (29, -0.033), (30, 0.077), (31, 0.062), (32, -0.14), (33, 0.005), (34, 0.078), (35, -0.108), (36, -0.159), (37, 0.018), (38, 0.075), (39, 0.069), (40, -0.043), (41, -0.059), (42, -0.146), (43, 0.075), (44, 0.18), (45, -0.077), (46, -0.056), (47, -0.023), (48, -0.055), (49, -0.101)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96606147 1 nips-2000-APRICODD: Approximate Policy Construction Using Decision Diagrams

Author: Robert St-Aubin, Jesse Hoey, Craig Boutilier

Abstract: We propose a method of approximate dynamic programming for Markov decision processes (MDPs) using algebraic decision diagrams (ADDs). We produce near-optimal value functions and policies with much lower time and space requirements than exact dynamic programming. Our method reduces the sizes of the intermediate value functions generated during value iteration by replacing the values at the terminals of the ADD with ranges of values. Our method is demonstrated on a class of large MDPs (with up to 34 billion states), and we compare the results with the optimal value functions.

2 0.72572833 48 nips-2000-Exact Solutions to Time-Dependent MDPs

Author: Justin A. Boyan, Michael L. Littman

Abstract: We describe an extension of the Markov decision process model in which a continuous time dimension is included in the state space. This allows for the representation and exact solution of a wide range of problems in which transitions or rewards vary over time. We examine problems based on route planning with public transportation and telescope observation scheduling. 1

3 0.57147282 142 nips-2000-Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task

Author: Brian Sallans, Geoffrey E. Hinton

Abstract: The problem of reinforcement learning in large factored Markov decision processes is explored. The Q-value of a state-action pair is approximated by the free energy of a product of experts network. Network parameters are learned on-line using a modified SARSA algorithm which minimizes the inconsistency of the Q-values of consecutive state-action pairs. Actions are chosen based on the current value estimates by fixing the current state and sampling actions from the network using Gibbs sampling. The algorithm is tested on a co-operative multi-agent task. The product of experts model is found to perform comparably to table-based Q-Iearning for small instances of the task, and continues to perform well when the problem becomes too large for a table-based representation.

4 0.52803791 148 nips-2000-`N-Body' Problems in Statistical Learning

Author: Alexander G. Gray, Andrew W. Moore

Abstract: We present efficient algorithms for all-point-pairs problems , or 'Nbody '-like problems , which are ubiquitous in statistical learning. We focus on six examples, including nearest-neighbor classification, kernel density estimation, outlier detection , and the two-point correlation. These include any problem which abstractly requires a comparison of each of the N points in a dataset with each other point and would naively be solved using N 2 distance computations. In practice N is often large enough to make this infeasible. We present a suite of new geometric t echniques which are applicable in principle to any 'N-body ' computation including large-scale mixtures of Gaussians, RBF neural networks, and HMM 's. Our algorithms exhibit favorable asymptotic scaling and are empirically several orders of magnitude faster than the naive computation, even for small datasets. We are aware of no exact algorithms for these problems which are more efficient either empirically or theoretically. In addition, our framework yields simple and elegant algorithms. It also permits two important generalizations beyond the standard all-point-pairs problems, which are more difficult. These are represented by our final examples, the multiple two-point correlation and the notorious n-point correlation. 1

5 0.51588327 73 nips-2000-Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice

Author: Dirk Ormoneit, Peter W. Glynn

Abstract: Many approaches to reinforcement learning combine neural networks or other parametric function approximators with a form of temporal-difference learning to estimate the value function of a Markov Decision Process. A significant disadvantage of those procedures is that the resulting learning algorithms are frequently unstable. In this work, we present a new, kernel-based approach to reinforcement learning which overcomes this difficulty and provably converges to a unique solution. By contrast to existing algorithms, our method can also be shown to be consistent in the sense that its costs converge to the optimal costs asymptotically. Our focus is on learning in an average-cost framework and on a practical application to the optimal portfolio choice problem. 1

6 0.5098604 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region

7 0.49427092 105 nips-2000-Programmable Reinforcement Learning Agents

8 0.44839463 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning

9 0.41445589 63 nips-2000-Hierarchical Memory-Based Reinforcement Learning

10 0.41221708 140 nips-2000-Tree-Based Modeling and Estimation of Gaussian Processes on Graphs with Cycles

11 0.4065634 39 nips-2000-Decomposition of Reinforcement Learning for Admission Control of Self-Similar Call Arrival Processes

12 0.37382662 113 nips-2000-Robust Reinforcement Learning

13 0.37310463 93 nips-2000-On Iterative Krylov-Dogleg Trust-Region Steps for Solving Neural Networks Nonlinear Least Squares Problems

14 0.35569426 60 nips-2000-Gaussianization

15 0.34758738 26 nips-2000-Automated State Abstraction for Options using the U-Tree Algorithm

16 0.34265301 120 nips-2000-Sparse Greedy Gaussian Process Regression

17 0.31534284 115 nips-2000-Sequentially Fitting ``Inclusive'' Trees for Inference in Noisy-OR Networks

18 0.30778223 52 nips-2000-Fast Training of Support Vector Classifiers

19 0.27421704 114 nips-2000-Second Order Approximations for Probability Models

20 0.26586202 23 nips-2000-An Adaptive Metric Machine for Pattern Classification


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.049), (17, 0.096), (32, 0.021), (33, 0.04), (54, 0.012), (55, 0.043), (62, 0.115), (65, 0.019), (67, 0.047), (72, 0.347), (76, 0.03), (79, 0.019), (81, 0.025), (90, 0.028), (91, 0.015), (97, 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.94825053 132 nips-2000-The Interplay of Symbolic and Subsymbolic Processes in Anagram Problem Solving

Author: David B. Grimes, Michael Mozer

Abstract: Although connectionist models have provided insights into the nature of perception and motor control, connectionist accounts of higher cognition seldom go beyond an implementation of traditional symbol-processing theories. We describe a connectionist constraint satisfaction model of how people solve anagram problems. The model exploits statistics of English orthography, but also addresses the interplay of sub symbolic and symbolic computation by a mechanism that extracts approximate symbolic representations (partial orderings of letters) from sub symbolic structures and injects the extracted representation back into the model to assist in the solution of the anagram. We show the computational benefit of this extraction-injection process and discuss its relationship to conscious mental processes and working memory. We also account for experimental data concerning the difficulty of anagram solution based on the orthographic structure of the anagram string and the target word. Historically, the mind has been viewed from two opposing computational perspectives. The symbolic perspective views the mind as a symbolic information processing engine. According to this perspective, cognition operates on representations that encode logical relationships among discrete symbolic elements, such as stacks and structured trees, and cognition involves basic operations such as means-ends analysis and best-first search. In contrast, the subsymbolic perspective views the mind as performing statistical inference, and involves basic operations such as constraint-satisfaction search. The data structures on which these operations take place are numerical vectors. In some domains of cognition, significant progress has been made through analysis from one computational perspective or the other. The thesis of our work is that many of these domains might be understood more completely by focusing on the interplay of subsymbolic and symbolic information processing. Consider the higher-cognitive domain of problem solving. At an abstract level of description, problem solving tasks can readily be formalized in terms of symbolic representations and operations. However, the neurobiological hardware that underlies human cognition appears to be subsymbolic-representations are noisy and graded, and the brain operates and adapts in a continuous fashion that is difficult to characterize in discrete symbolic terms. At some level-between the computational level of the task description and the implementation level of human neurobiology-the symbolic and subsymbolic accounts must come into contact with one another. We focus on this point of contact by proposing mechanisms by which symbolic representations can modulate sub symbolic processing, and mechanisms by which subsymbolic representations are made symbolic. We conjecture that these mechanisms can not only provide an account for the interplay of symbolic and sub symbolic processes in cognition, but that they form a sensible computational strategy that outperforms purely subsymbolic computation, and hence, symbolic reasoning makes sense from an evolutionary perspective. In this paper, we apply our approach to a high-level cognitive task, anagram problem solving. An anagram is a nonsense string of letters whose letters can be rearranged to form a word. For example, the solution to the anagram puzzle RYTEHO is THEORY. 
Anagram solving is a interesting task because it taps higher cognitive abilities and issues of awareness, it has a tractable state space, and interesting psychological data is available to model. 1 A Sub symbolic Computational Model We start by presenting a purely subsymbolic model of anagram processing. By subsymbolic, we mean that the model utilizes only English orthographic statistics and does not have access to an English lexicon. We will argue that this model proves insufficient to explain human performance on anagram problem solving. However, it is a key component of a hybrid symbolic-subsymbolic model we propose, and is thus described in detail. 1.1 Problem Representation A computational model of anagram processing must represent letter orderings. For example, the model must be capable of representing a solution such as

same-paper 2 0.80488676 1 nips-2000-APRICODD: Approximate Policy Construction Using Decision Diagrams

Author: Robert St-Aubin, Jesse Hoey, Craig Boutilier

Abstract: We propose a method of approximate dynamic programming for Markov decision processes (MDPs) using algebraic decision diagrams (ADDs). We produce near-optimal value functions and policies with much lower time and space requirements than exact dynamic programming. Our method reduces the sizes of the intermediate value functions generated during value iteration by replacing the values at the terminals of the ADD with ranges of values. Our method is demonstrated on a class of large MDPs (with up to 34 billion states), and we compare the results with the optimal value functions.

3 0.43288037 26 nips-2000-Automated State Abstraction for Options using the U-Tree Algorithm

Author: Anders Jonsson, Andrew G. Barto

Abstract: Learning a complex task can be significantly facilitated by defining a hierarchy of subtasks. An agent can learn to choose between various temporally abstract actions, each solving an assigned subtask, to accomplish the overall task. In this paper, we study hierarchical learning using the framework of options. We argue that to take full advantage of hierarchical structure, one should perform option-specific state abstraction, and that if this is to scale to larger tasks, state abstraction should be automated. We adapt McCallum's U-Tree algorithm to automatically build option-specific representations of the state feature space, and we illustrate the resulting algorithm using a simple hierarchical task. Results suggest that automated option-specific state abstraction is an attractive approach to making hierarchical learning systems more effective.

4 0.42575157 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning

Author: Christian R. Shelton

Abstract: For many problems which would be natural for reinforcement learning, the reward signal is not a single scalar value but has multiple scalar components. Examples of such problems include agents with multiple goals and agents with multiple users. Creating a single reward value by combining the multiple components can throwaway vital information and can lead to incorrect solutions. We describe the multiple reward source problem and discuss the problems with applying traditional reinforcement learning. We then present an new algorithm for finding a solution and results on simulated environments.

5 0.42403105 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns

Author: Igor V. Cadez, Padhraic Smyth

Abstract: We investigate a general characteristic of the trade-off in learning problems between goodness-of-fit and model complexity. Specifically we characterize a general class of learning problems where the goodness-of-fit function can be shown to be convex within firstorder as a function of model complexity. This general property of

6 0.42115593 142 nips-2000-Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task

7 0.41966832 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

8 0.41721782 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks

9 0.41501272 80 nips-2000-Learning Switching Linear Models of Human Motion

10 0.414839 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region

11 0.40456083 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing

12 0.39914969 92 nips-2000-Occam's Razor

13 0.39464706 105 nips-2000-Programmable Reinforcement Learning Agents

14 0.39448401 139 nips-2000-The Use of MDL to Select among Computational Models of Cognition

15 0.39417619 138 nips-2000-The Use of Classifiers in Sequential Inference

16 0.39411652 71 nips-2000-Interactive Parts Model: An Application to Recognition of On-line Cursive Script

17 0.39316487 104 nips-2000-Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics

18 0.38860536 22 nips-2000-Algorithms for Non-negative Matrix Factorization

19 0.38816211 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

20 0.38766265 127 nips-2000-Structure Learning in Human Causal Induction