jmlr jmlr2008 jmlr2008-49 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sungwook Yoon, Alan Fern, Robert Givan
Abstract: A number of today’s state-of-the-art planners are based on forward state-space search. The impressive performance can be attributed to progress in computing domain independent heuristics that perform well across many domains. However, it is easy to find domains where such heuristics provide poor guidance, leading to planning failure. Motivated by such failures, the focus of this paper is to investigate mechanisms for learning domain-specific knowledge to better control forward search in a given domain. While there has been a large body of work on inductive learning of control knowledge for AI planning, there is a void of work aimed at forward-state-space search. One reason for this may be that it is challenging to specify a knowledge representation for compactly representing important concepts across a wide range of domains. One of the main contributions of this work is to introduce a novel feature space for representing such control knowledge. The key idea is to define features in terms of information computed via relaxed plan extraction, which has been a major source of success for non-learning planners. This gives a new way of leveraging relaxed planning techniques in the context of learning. Using this feature space, we describe three forms of control knowledge—reactive policies (decision list rules and measures of progress) and linear heuristics—and show how to learn them and incorporate them into forward state-space search. Our empirical results show that our approaches are able to surpass state-of-the-art nonlearning planners across a wide range of planning competition domains. Keywords: planning, machine learning, knowledge representation, search
Reference: text
sentIndex sentText sentNum sentScore
1 However, it is easy to find domains where such heuristics provide poor guidance, leading to planning failure. [sent-8, score-0.576]
2 This gives a new way of leveraging relaxed planning techniques in the context of learning. [sent-14, score-0.624]
3 Our empirical results show that our approaches are able to surpass state-of-the-art nonlearning planners across a wide range of planning competition domains. [sent-16, score-0.633]
4 Introduction: In recent years, forward state-space search has become a popular approach in AI planning, leading to state-of-the-art performance in a variety of settings, including classical STRIPS planning (Hoffmann and Nebel, 2001; Vidal, 2004), optimal planning (Helmert et al. [sent-18, score-1.013]
5 , 2007), temporal-metric planning (Do and Kambhampati, 2003), nondeterministic planning (Bryce and Kambhampati, 2006), and oversubscribed planning (Benton et al. [sent-19, score-1.233]
6 Nevertheless, it is not hard to find domains where these heuristics do not work well, resulting in planning failure. [sent-24, score-0.576]
7 Motivated by these failures, in this study we investigate machine learning techniques that find domain-specific control knowledge to improve or speed up forward state-space search in a non-optimal, or satisficing, planning setting. [sent-25, score-0.723]
8 However, despite the significant effort, none of these approaches has been demonstrated to be competitive with state-of-the-art non-learning planners across a wide range of planning domains. [sent-27, score-0.573]
9 Even these approaches have not demonstrated the ability to outperform the best non-learning planners as measured on planning competition domains. [sent-32, score-0.603]
10 Second, we propose a novel hypothesis space for representing useful heuristic features of planning states. [sent-38, score-0.68]
11 The result is a learning-based planner that learns control knowledge for a planning domain from a small number of solved problems and is competitive with and often better than state-of-the-art non-learning planners across a substantial set of benchmark domains and problems. [sent-40, score-0.958]
12 Learning reactive policies for planning domains has been studied by several researchers (Khardon, 1999; Martin and Geffner, 2000; Yoon et al. [sent-56, score-0.798]
13 Nevertheless, such policies capture substantial information about the planning domain, which we would like to exploit in a more robust way. [sent-60, score-0.607]
14 In our experiments, we learned and evaluated both forms of control knowledge on benchmark problems from recent planning competitions. [sent-69, score-0.579]
15 1 Planning Domains: A deterministic planning domain D defines a set of possible actions A and a set of states S in terms of a set of predicate symbols P, action types Y, and objects O. [sent-85, score-0.894]
16 Each action a ∈ A consists of: 1) an action name, which is an action type applied to the appropriate number of objects, 2) a set of precondition state facts Pre(a), 3) two sets of state facts Add(a) and Del(a) representing the add and delete effects respectively. [sent-88, score-0.719]
17 Given a planning domain, a planning problem P from the domain is a tuple (s, A, g), where A ⊆ A is a set of applicable actions, s ∈ S is the initial state, and g is a set of state facts representing the goal. [sent-90, score-0.99]
18 A solution plan for a planning problem is a sequence of actions (a1, . . . , an) whose sequential application starting in s results in a state that contains the goal g. [sent-91, score-0.679]
19 We say that a planning problem (s, A, g) is reachable from problem (s0 , A, g) iff there is some action sequence in A∗ that leads from s0 to s. [sent-96, score-0.566]
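To make the preceding definitions concrete, here is a minimal Python sketch of a STRIPS-style action, state transition, and plan check. The names Action, apply, and is_solution are our own illustrations, not code from the paper; state facts are assumed to be hashable tuples such as ("on", ("A", "B")).

from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: tuple        # (action type, objects), e.g. ("unstack", ("A", "B"))
    pre: frozenset     # precondition state facts Pre(a)
    add: frozenset     # add effects Add(a)
    delete: frozenset  # delete effects Del(a)

def apply(action, state):
    """Return the successor state a(s); the preconditions must hold in s."""
    assert action.pre <= state, "action is not applicable in this state"
    return frozenset((state - action.delete) | action.add)

def is_solution(plan, state, goal):
    """Check that sequentially applying the plan from `state` reaches the goal."""
    for a in plan:
        if not a.pre <= state:
            return False
        state = apply(a, state)
    return goal <= state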
20 Typically the competition is organized around a set of planning domains, with each domain providing a sequence of planning problems, often in increasing order of difficulty. [sent-100, score-0.945]
21 The input to our learner will be a set of problems for a particular planning domain along with a solution plan to each problem. [sent-109, score-0.644]
22 Some representative examples of such systems include learning for partial-order planning (Estlin and Mooney, 1996), learning for planning as satisfiability (Huang et al. [sent-135, score-0.822]
23 More recently there have been several learning-to-plan systems based on the idea of learning reactive policies for planning domains (Khardon, 1999; Martin and Geffner, 2000; Yoon et al. [sent-140, score-0.798]
24 Ideas from reinforcement learning have also been applied to learn control policies in AI planning domains. [sent-146, score-0.71]
25 The Blocksworld problems they considered were complex from a traditional RL perspective due to the large state and action spaces; however, they were relatively simple from an AI planning perspective. [sent-149, score-0.612]
26 A related approach used a more powerful form of reinforcement learning, known as approximate policy iteration, and demonstrated good results in a number of planning competition domains (Fern et al. [sent-151, score-0.697]
27 For each, we describe how we incorporate it into forward state-space search in order to improve planning performance. [sent-164, score-0.602]
28 A heuristic function H(s, A, g) is simply a function of a state s, action set A, and goal g that estimates the cost of achieving the goal from s using actions in A. [sent-168, score-0.558]
29 However, when a heuristic is less accurate, it must be used in the context of a search procedure such as best-first search, where the accuracy of the heuristic impacts the search efficiency. [sent-170, score-0.71]
30 Note that by best-first search, here we mean a search that is guided by only the heuristic value, rather than the path-cost plus heuristic value. [sent-172, score-0.586]
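A minimal sketch of the greedy best-first search described above, ordered by heuristic value alone rather than path cost plus heuristic. The successors and heuristic arguments are assumed to be supplied by the planner; nothing here is specific to the paper's implementation.

import heapq
import itertools

def greedy_best_first(s0, goal, successors, heuristic):
    """Greedy best-first search: the frontier is ordered only by heuristic value.
    successors(s) yields (action, next_state) pairs; heuristic(s, goal) estimates
    the cost of reaching the goal from s."""
    tie = itertools.count()                       # breaks ties without comparing states
    frontier = [(heuristic(s0, goal), next(tie), s0, [])]
    closed = set()
    while frontier:
        _, _, s, plan = heapq.heappop(frontier)
        if goal <= s:
            return plan                           # solution plan found
        if s in closed:
            continue
        closed.add(s)
        for a, nxt in successors(s):
            if nxt not in closed:
                heapq.heappush(frontier, (heuristic(nxt, goal), next(tie), nxt, plan + [a]))
    return None                                   # search space exhausted without reaching the goal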
31 Recent progress in the development of domain-independent heuristic functions for planning has led to a new generation of state-of-the-art planners based on forward state-space heuristic search (Bonet and Geffner, 2001; Hoffmann and Nebel, 2001; Nguyen et al. [sent-175, score-1.263]
32 A reactive policy is a computationally efficient function π(s, A, g), possibly stochastic, that maps a planning problem (s, A, g) to an action in A. [sent-186, score-0.813]
33 Ideally, given an optimal or near-optimal policy for a planning domain, the trajectories represent high-quality solution plans. [sent-191, score-0.552]
34 , 2006) that consider using learned policies in this way in an AI planning context. [sent-198, score-0.654]
35 Although these policies may select good actions in many states, the lack of search prevents them from overcoming the potentially numerous bad action choices. [sent-201, score-0.601]
36 Unfortunately, our initial investigation showed that in many planning competition domains these techniques were not powerful enough to overcome the flaws in our learned policies. [sent-204, score-0.603]
37 The main idea is to use reactive policies during the node expansion process of a heuristic search, which in our work is greedy best-first search. [sent-206, score-0.64]
38 While this technique for incorporating policies into search is simple, our empirical results show that it is very effective, achieving better performance than either pure heuristic search or search-free policy execution. [sent-217, score-0.816]
39 Note that throughout, for notational convenience we will describe each search node by its implicit planning problem (s, A, g), where s is the current state of the node, g is the goal, and A is the action set. [sent-222, score-0.783]
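The node-expansion idea above can be sketched as follows: besides the ordinary children of a node (s, A, g), we also add the states visited by running the learned reactive policy from s for a fixed horizon (the experiments later use a horizon of 50). This reuses the illustrative Action/apply sketch from earlier; the exact bookkeeping in the paper's planner may differ.

def expand_with_policy(node, successors, policy, horizon=50):
    """Return the ordinary children of the node plus the states visited by
    executing the (possibly imperfect) reactive policy from it."""
    s, A, g = node
    children = [nxt for _, nxt in successors(s)]
    state = s
    for _ in range(horizon):
        a = policy(state, A, g)                   # reactive policy picks one action
        if a is None or not a.pre <= state:
            break
        state = apply(a, state)                   # apply() as in the earlier sketch
        children.append(state)
        if g <= state:                            # stop early if the policy reaches the goal
            break
    return children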
40 1 Relaxed Plans: Given a planning problem (s, A, g), we define the corresponding relaxed planning problem to be the problem (s, A+, g), where the new action set A+ is created by copying A and then removing the delete list from each of the actions. [sent-240, score-1.297]
41 Thus, a relaxed planning problem is a version of the original planning problem where it is not necessary to worry about delete effects of actions. [sent-241, score-1.079]
42 A relaxed plan for a planning problem (s, A, g) is simply a plan that solves the relaxed planning problem. [sent-242, score-1.532]
43 First, although a relaxed plan may not necessarily solve the original planning problem, the length of the shortest relaxed plan serves as an admissible heuristic for the original planning problem. [sent-244, score-1.802]
44 Second, in general, it is computationally easier to find relaxed plans compared to solving general planning problems. [sent-246, score-0.73]
45 This is apparent from worst-case complexity: plan existence can be decided in polynomial time for relaxed planning problems but is PSPACE-complete for general problems. [sent-247, score-0.81]
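A sketch of both points: constructing the relaxed action set A+ by dropping delete lists, and deciding delete-free plan existence by a simple polynomial-time fixpoint computation. The Action class is the illustrative one from the earlier sketch.

def relax(actions):
    """Build the relaxed action set A+ by removing every delete list."""
    return [Action(a.name, a.pre, a.add, frozenset()) for a in actions]

def relaxed_plan_exists(state, actions, goal):
    """Plan existence for the relaxed problem reduces to fact reachability,
    computable by iterating to a fixpoint in polynomial time."""
    reached = set(state)
    changed = True
    while changed:
        changed = False
        for a in relax(actions):
            if a.pre <= reached and not a.add <= reached:
                reached |= a.add
                changed = True
    return goal <= reached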
46 HSP (Bonet and Geffner, 2001) uses forward state-space search guided by an admissible heuristic that estimates the length of the optimal relaxed plan. [sent-251, score-0.674]
47 FF’s style of relaxed plan computation is linear in the length of the relaxed plan, and thus fast, but the resulting heuristics can be inadmissible. [sent-253, score-0.687]
48 FF computes relaxed plans using a relaxed plan graph (RPG). [sent-255, score-0.674]
49 An RPG is simply the usual plan graph created by Graphplan (Blum and Furst, 1995), but for the relaxed planning problem rather than the original problem. [sent-256, score-0.766]
50 After constructing the RPG for a planning problem, FF starts at the last RPG level and uses a backtrack-free procedure that extracts a sequence of actions that correspond to a successful relaxed plan. [sent-262, score-0.75]
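A rough sketch of the extraction just described: grow the relaxed planning graph level by level (deletes are ignored, so facts only accumulate), then walk backwards from the goal, selecting for each unsupported fact one action that first added it. This is a simplification of FF's backtrack-free procedure, not a faithful reimplementation.

def ff_style_relaxed_plan(state, actions, goal):
    """Return a list of actions forming a relaxed plan, or None if the goal is
    unreachable even in the relaxed problem."""
    levels, achiever = [set(state)], {}
    while not goal <= levels[-1]:
        new = set(levels[-1])
        for a in actions:
            if a.pre <= levels[-1]:
                for f in a.add:
                    if f not in levels[-1]:
                        new.add(f)
                        achiever.setdefault(f, a)   # remember one earliest achiever of f
        if new == levels[-1]:
            return None                             # fixpoint reached without the goal
        levels.append(new)
    plan, needed = [], set(goal)
    while needed:
        f = needed.pop()
        if f in levels[0] or f not in achiever:
            continue                                # fact is already true in the initial state
        a = achiever[f]
        if a not in plan:
            plan.append(a)
            needed |= set(a.pre)                    # the achiever's preconditions must be supported too
    return list(reversed(plan))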
51 While the length of FF’s relaxed plan often serves as an effective heuristic, for a number of planning domains, ignoring delete effects leads to severe underestimates of the distance to a goal. [sent-264, score-0.849]
52 For example, Vidal (2004) uses relaxed plans to construct macro actions, which help the planner overcome regions of the state space where the relaxed-plan length heuristic is flat. [sent-269, score-0.707]
53 In this work, we give a novel approach to leveraging relaxed planning; in particular, we use relaxed plans as a source of information from which we can compute complex features that will be used to learn heuristic functions and policies. [sent-271, score-0.828]
54 Interestingly, as we will see, this approach will allow for features that are sensitive to delete lists of actions in relaxed plans, which can be used to help correct for the fact that relaxed plans ignore delete effects. [sent-272, score-0.822]
55 Notice that in the relaxed plan for S2 , on(A, B) is in the delete list of the action unstack(A, B) and at the same time it is a goal fact. [sent-279, score-0.617]
56 In particular, suppose that we had a feature that computed the number of such “on” facts that were both in the delete list of some relaxed plan action and in the goal, giving a value of 0 for S1 and 1 for S2 . [sent-281, score-0.676]
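The example feature can be computed directly from the relaxed plan, provided each relaxed-plan action keeps a reference to its original delete list (as in the Action sketch above). The function below illustrates that single feature only, not the paper's general feature language.

def count_deleted_goal_facts(relaxed_plan, goal):
    """Count goal facts that appear in the delete list of some relaxed-plan action
    (0 for state S1 and 1 for state S2 in the Blocksworld example above)."""
    deleted = set()
    for a in relaxed_plan:
        deleted |= a.delete
    return len(deleted & goal)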
57 Our experiments show that these features are useful across a range of domains used in planning competitions. [sent-287, score-0.564]
58 3 Constructing Databases from Search Nodes: Recall that each feature in our feature space corresponds to a taxonomic syntax expression (see next section) built from the predicate symbols in databases of facts constructed for each search node encountered. [sent-289, score-0.595]
59 We will denote the database for search node (s, A, g) as D(s, A, g), which will simply contain a set of ground facts over some set of predicate symbols and objects derived from the search node. [sent-290, score-0.555]
60 Note that in this work we use the relaxed plan computed by FF’s heuristic calculation. [sent-295, score-0.586]
61 • For each state fact in the add list of some action ai in the relaxed plan, add a fact to the database that is the result of prepending an 'a' to the fact’s predicate symbol. [sent-299, score-0.637]
62 Note that taxonomic syntax class expressions will be constructed from the predicates in this database, which include the set of planning domain predicates and action types, along with a variant of each planning domain predicate prepended with an ’a’, ’d’, ’g’, or ’c’. [sent-309, score-1.787]
63 It captures information about the relaxed plan using the action type predicates and the ’a’ and ’d’ predicates. [sent-311, score-0.615]
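A sketch of building such a database for one search node, assuming each fact is a (predicate, arguments) pair with a string predicate name and each action name is an (action type, objects) pair as in the earlier sketches. The prefix mnemonics follow the text: 'c' for current-state facts, 'g' for goal facts, and 'a'/'d' for facts in the add and delete lists of relaxed-plan actions; the exact encoding in the paper's system may differ.

def build_node_database(state, goal, relaxed_plan):
    """Return the set of ground facts D(s, A, g) over prefixed predicate symbols."""
    db = set()
    db |= {("c" + pred, args) for pred, args in state}        # current-state facts
    db |= {("g" + pred, args) for pred, args in goal}         # goal facts
    for a in relaxed_plan:
        atype, objs = a.name
        db.add((atype, objs))                                 # action-type fact from the relaxed plan
        db |= {("a" + pred, args) for pred, args in a.add}    # add-list facts
        db |= {("d" + pred, args) for pred, args in a.delete} # delete-list facts
    return db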
64 Learning Heuristic Functions: Given the relaxed plan feature space, we will now describe how to use that space to represent and learn heuristic functions for use as control knowledge in forward state-space search. [sent-386, score-0.801]
65 1 Heuristic Function Representation: Recall that a heuristic function H(s, A, g) is simply a function of a state s, action set A, and goal g that estimates the cost of achieving the goal from s using actions in A. [sent-393, score-0.558]
66 In particular, for each planning domain we would like to learn a distinct set of functions f i and their corresponding weights that lead to good planning performance in that domain when guided by the resulting linear heuristic function. [sent-395, score-1.206]
67 2 Heuristic Function Learning: The input to our learning algorithm is a set of planning problems, each paired with an example solution plan, taken from a target planning domain. [sent-399, score-0.822]
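One simple way to realize this, sketched below, is to walk along each training plan, record the feature values of every visited state together with the number of remaining plan steps, and fit the weights w_i by least squares. This is only an illustration of learning H = sum_i w_i * f_i from solved problems; the paper's actual training procedure may differ.

import numpy as np

def learn_linear_heuristic(training_data, features):
    """training_data: list of ((s, A, g), plan) pairs; features: list of functions f(s, A, g).
    Returns a heuristic H(s, A, g) = sum_i w_i * f_i(s, A, g)."""
    X, y = [], []
    for (s, A, g), plan in training_data:
        states = [s]
        for a in plan:
            s = apply(a, s)                       # apply() as in the earlier sketch
            states.append(s)
        for t, st in enumerate(states):
            X.append([f(st, A, g) for f in features])
            y.append(len(plan) - t)               # remaining distance to the goal along the plan
    w, *_ = np.linalg.lstsq(np.array(X, dtype=float), np.array(y, dtype=float), rcond=None)
    return lambda s, A, g: float(np.dot(w, [f(s, A, g) for f in features]))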
68 1 Representation: A taxonomic decision list policy is a list of taxonomic action-selection rules. [sent-482, score-0.586]
69 If L suggests no action for node (s, A, g), then π[L](s, A, g) is the lexicographically least action whose preconditions are satisfied in s; otherwise, π[L](s, A, g) is the least action suggested by L. [sent-511, score-0.547]
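The selection semantics just described can be sketched as follows, with each rule represented as a function mapping (s, A, g) to the set of actions it suggests; the lexicographic order here is simply the order on action names, and rule evaluation itself (matching taxonomic class expressions) is assumed to be given elsewhere.

def decision_list_policy(rules):
    """Return a policy pi[L] for a list L of action-selection rules."""
    def policy(s, A, g):
        for rule in rules:                               # rules are tried in list order
            suggested = [a for a in rule(s, A, g) if a.pre <= s]
            if suggested:
                return min(suggested, key=lambda a: a.name)
        applicable = [a for a in A if a.pre <= s]        # fall back: least applicable action
        return min(applicable, key=lambda a: a.name) if applicable else None
    return policy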
70 In this way the heuristic will assign small heuristic values to rules that cover only a small number of examples and rules that cover many examples but suggest many actions outside of the training set. [sent-573, score-0.681]
71 Monotonic properties of planning domains were previously identified by Parmar (2002) as “measures of progress”; we inherit the term and extend the idea to ensembles of measures, where monotonicity is provided via a prioritized list of functions. [sent-587, score-0.628]
72 F is a strong measure of progress for planning domain D iff for any reachable problem (s, A, g) of D, either g ⊆ s or there exists an action a such that F(a(s), A, g) ≻ F(s, A, g), where ≻ compares the prioritized list of measure values lexicographically. [sent-593, score-0.696]
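A sketch of using such an ensemble greedily: compare states by the prioritized tuple of measure values (Python tuple comparison is lexicographic) and, at each step, take an applicable action that strictly improves it. The measure functions themselves would be the learned ones; this loop is only illustrative.

def execute_measures_of_progress(s, A, g, measures, horizon=50):
    """Greedily follow a prioritized list of measure functions for a fixed horizon."""
    def value(state):
        return tuple(m(state, A, g) for m in measures)   # compared lexicographically
    plan = []
    for _ in range(horizon):
        if g <= s:
            return plan                                  # goal reached
        current = value(s)
        candidates = [(value(apply(a, s)), a) for a in A if a.pre <= s]
        improving = [(v, a) for v, a in candidates if v > current]
        if not improving:
            return None                                  # no strictly improving action exists
        _, best = max(improving, key=lambda va: va[0])
        plan.append(best)
        s = apply(best, s)
    return None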
73 The search over this space of class expressions is conducted using a beam search of user-specified width b, which is initialized to a beam that contains only the universal class expression athing. [sent-629, score-0.661]
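The beam search over class expressions can be sketched generically as below. The refine operator (syntactic refinements of an expression) and the score function (value of the induced feature or rule on the training data) are assumptions standing in for the paper's taxonomic-syntax machinery; expressions are assumed hashable.

def beam_search(initial, refine, score, width, steps):
    """Keep the `width` best candidate expressions at each step, starting from
    the universal class expression (e.g. "athing")."""
    beam, best = [initial], initial
    for _ in range(steps):
        candidates = set(beam)
        for expr in beam:
            candidates.update(refine(expr))      # expand every beam member
        beam = sorted(candidates, key=score, reverse=True)[:width]
        if score(beam[0]) > score(best):
            best = beam[0]
    return best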
74 Experiments: We evaluated our learning techniques on the traditional benchmark domain Blocksworld and then on a subset of the STRIPS/ADL domains from two recent international planning competitions (IPC3 and IPC4). [sent-647, score-0.559]
75 We included all of the IPC3 and IPC4 domains where FF’s RPL heuristic was sufficiently inaccurate on the training data, so as to afford our heuristic learner the opportunity to learn. [sent-648, score-0.606]
76 For the case of the policy representations (taxonomic decision lists and measures of progress), we used FF’s RPL heuristic as the heuristic function and used a fixed policy-execution horizon of 50. [sent-654, score-0.7]
77 Each row corresponds to a distinct planning technique, some using learned control knowledge and some not. [sent-663, score-0.579]
78 FF adds goal ordering, enforced hill-climbing, and helpful-action pruning on top of relaxed-plan heuristic search. [sent-667, score-0.741]
79 The noRP variants do not include any facts related to the relaxed plan in the search node databases. [sent-676, score-0.585]
80 • Greedy: number of problems solved using greedy execution of the decision list or measures-of-progress policies (only applies to systems that learn decision list rules and measures of progress). [sent-685, score-0.572]
81 Later in our discussion of the Depots domain we will give empirical evidence that one reason for this reduction in the amount of search is that the policies are able to quickly move through plateaus to find states with low heuristic values. [sent-722, score-0.614]
82 Again, as for the decision-list policies, we see that incorporating the measures of progress into the search significantly reduces planning time compared to RPL. [sent-724, score-0.828]
83 That is, all actions not in the training plans are treated as negative examples, while in fact many of those actions are just as good as the selected action, since there are often many good action choices in a state. [sent-726, score-0.544]
84 This is likely because it is possible to learn very good decision list rules and measures of progress in the Blocksworld, which guide the heuristic search to good solutions very quickly. [sent-735, score-0.572]
85 Our feature comparison results indicate that when the relaxed plan features are removed, decision-list and measures-of-progress learning still solve all of the problems, but the heuristic learner HnoRP only solves 12 problems. [sent-738, score-0.652]
86 This indicates that the relaxed plan information is important to the heuristic learner in this domain. [sent-739, score-0.614]
87 Aggregating the number of problems solved across the three domains in IPC3, the learning and planning systems DL, MoP and H all solved more problems than FF and RPL. [sent-777, score-0.614]
88 Interestingly, greedy action selection according to both the learned decision-list policies and the measures of progress solves no testing problems and only very few training problems. [sent-784, score-0.678]
89 These domains have more predicates, more actions, and larger action arities than Blocksworld, leading to a bigger policy search space. [sent-789, score-0.631]
90 Still, the benefit of the policy cannot be ignored: once it is learned, executing a policy is much faster than evaluating measures of progress or heuristic functions. [sent-790, score-0.61]
91 Our approach to incorporating policies into search appears to help speed up the search process by adding nodes that are far from the current node being expanded, helping to overcome local minima of the heuristic function. [sent-794, score-0.722]
92 However, the policy can also quickly lead the search to some local minimum for both the heuristic and the policy, which would not necessarily be visited with the heuristic alone, causing poor performance here for the policy-based approach. [sent-805, score-0.586]
93 FF’s heuristic is very accurate for the first two domains, where for all of the solved problems the solution length and FF’s heuristic are almost identical, leaving little room for learning. [sent-808, score-0.545]
94 Second, we showed how to learn and use control knowledge over this feature space for forward state-space heuristic search planning, a planning framework for which little work has been done in the direction of learning. [sent-849, score-0.981]
95 A key problem in applying learned control knowledge in planning is to robustly deal with imperfect knowledge resulting from the statistical nature of the learning process. [sent-863, score-0.658]
96 However, there are many other planning settings and forms of control knowledge for which we are interested in developing robust application mechanisms. [sent-865, score-0.608]
97 For example, stochastic planning domains and planning with richer cost functions are of primary interest. [sent-866, score-0.907]
98 In summary, we have demonstrated that it is possible to use machine learning to improve the performance of forward state-space search planners across a range of planning domains. [sent-878, score-0.764]
99 The FF planning system: Fast plan generation through heuristic search. [sent-970, score-0.784]
100 Learning generalized policies in planning domains using concept languages. [sent-979, score-0.692]
wordName wordTfidf (topN-words)
[('planning', 0.411), ('rpl', 0.244), ('heuristic', 0.231), ('yoon', 0.217), ('relaxed', 0.213), ('policies', 0.196), ('ff', 0.183), ('blocksworld', 0.178), ('action', 0.155), ('dl', 0.149), ('taxonomic', 0.145), ('beam', 0.145), ('plan', 0.142), ('policy', 0.141), ('mop', 0.132), ('planners', 0.132), ('actions', 0.126), ('search', 0.124), ('ivan', 0.119), ('nowledge', 0.112), ('reactive', 0.106), ('plans', 0.106), ('predicates', 0.105), ('ern', 0.101), ('depth', 0.099), ('predicate', 0.098), ('ontrol', 0.095), ('expressions', 0.085), ('domains', 0.085), ('syntax', 0.084), ('heuristics', 0.08), ('control', 0.076), ('depots', 0.072), ('planner', 0.072), ('forward', 0.067), ('progress', 0.067), ('ltime', 0.066), ('list', 0.063), ('domain', 0.063), ('database', 0.062), ('fern', 0.061), ('competition', 0.06), ('greedy', 0.06), ('facts', 0.059), ('pickup', 0.059), ('freecell', 0.053), ('geffner', 0.053), ('learned', 0.047), ('node', 0.047), ('state', 0.046), ('ebl', 0.046), ('nebel', 0.046), ('sungwook', 0.046), ('knowledge', 0.045), ('hoffmann', 0.045), ('solved', 0.044), ('delete', 0.044), ('earning', 0.042), ('objects', 0.041), ('mnemonics', 0.04), ('subbarao', 0.04), ('length', 0.039), ('holding', 0.039), ('prioritized', 0.039), ('expression', 0.038), ('features', 0.038), ('lists', 0.038), ('covers', 0.037), ('trace', 0.036), ('block', 0.036), ('preconditions', 0.035), ('arity', 0.035), ('rule', 0.035), ('imperfect', 0.034), ('rpg', 0.034), ('hurt', 0.034), ('stack', 0.034), ('blocks', 0.033), ('driverlog', 0.033), ('gon', 0.033), ('kambhampati', 0.033), ('pipesworld', 0.033), ('seed', 0.033), ('strips', 0.033), ('legal', 0.032), ('training', 0.031), ('rules', 0.031), ('across', 0.03), ('measures', 0.03), ('literal', 0.03), ('expanded', 0.03), ('priority', 0.03), ('symbol', 0.029), ('decision', 0.029), ('martin', 0.029), ('positively', 0.028), ('minton', 0.028), ('literals', 0.028), ('learner', 0.028), ('learn', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999851 49 jmlr-2008-Learning Control Knowledge for Forward Search Planning
Author: Sungwook Yoon, Alan Fern, Robert Givan
Abstract: A number of today’s state-of-the-art planners are based on forward state-space search. The impressive performance can be attributed to progress in computing domain independent heuristics that perform well across many domains. However, it is easy to find domains where such heuristics provide poor guidance, leading to planning failure. Motivated by such failures, the focus of this paper is to investigate mechanisms for learning domain-specific knowledge to better control forward search in a given domain. While there has been a large body of work on inductive learning of control knowledge for AI planning, there is a void of work aimed at forward-state-space search. One reason for this may be that it is challenging to specify a knowledge representation for compactly representing important concepts across a wide range of domains. One of the main contributions of this work is to introduce a novel feature space for representing such control knowledge. The key idea is to define features in terms of information computed via relaxed plan extraction, which has been a major source of success for non-learning planners. This gives a new way of leveraging relaxed planning techniques in the context of learning. Using this feature space, we describe three forms of control knowledge—reactive policies (decision list rules and measures of progress) and linear heuristics—and show how to learn them and incorporate them into forward state-space search. Our empirical results show that our approaches are able to surpass state-of-the-art nonlearning planners across a wide range of planning competition domains. Keywords: planning, machine learning, knowledge representation, search
2 0.089740746 95 jmlr-2008-Value Function Based Reinforcement Learning in Changing Markovian Environments
Author: Balázs Csanád Csáji, László Monostori
Abstract: The paper investigates the possibility of applying value function based reinforcement learning (RL) methods in cases when the environment may change over time. First, theorems are presented which show that the optimal value function of a discounted Markov decision process (MDP) Lipschitz continuously depends on the immediate-cost function and the transition-probability function. Dependence on the discount factor is also analyzed and shown to be non-Lipschitz. Afterwards, the concept of (ε, δ)-MDPs is introduced, which is a generalization of MDPs and ε-MDPs. In this model the environment may change over time, more precisely, the transition function and the cost function may vary from time to time, but the changes must be bounded in the limit. Then, learning algorithms in changing environments are analyzed. A general relaxed convergence theorem for stochastic iterative algorithms is presented. We also demonstrate the results through three classical RL methods: asynchronous value iteration, Q-learning and temporal difference learning. Finally, some numerical experiments concerning changing environments are presented. Keywords: Markov decision processes, reinforcement learning, changing environments, (ε, δ)MDPs, value function bounds, stochastic iterative algorithms
3 0.083526887 90 jmlr-2008-Theoretical Advantages of Lenient Learners: An Evolutionary Game Theoretic Perspective
Author: Liviu Panait, Karl Tuyls, Sean Luke
Abstract: This paper presents the dynamics of multiple learning agents from an evolutionary game theoretic perspective. We provide replicator dynamics models for cooperative coevolutionary algorithms and for traditional multiagent Q-learning, and we extend these differential equations to account for lenient learners: agents that forgive possible mismatched teammate actions that resulted in low rewards. We use these extended formal models to study the convergence guarantees for these algorithms, and also to visualize the basins of attraction to optimal and suboptimal solutions in two benchmark coordination problems. The paper demonstrates that lenience provides learners with more accurate information about the benefits of performing their actions, resulting in higher likelihood of convergence to the globally optimal solution. In addition, the analysis indicates that the choice of learning algorithm has an insignificant impact on the overall performance of multiagent learning algorithms; rather, the performance of these algorithms depends primarily on the level of lenience that the agents exhibit to one another. Finally, the research herein supports the strength and generality of evolutionary game theory as a backbone for multiagent learning. Keywords: multiagent learning, reinforcement learning, cooperative coevolution, evolutionary game theory, formal models, visualization, basins of attraction
4 0.079214245 65 jmlr-2008-Multi-Agent Reinforcement Learning in Common Interest and Fixed Sum Stochastic Games: An Experimental Study
Author: Avraham Bab, Ronen I. Brafman
Abstract: Multi Agent Reinforcement Learning (MARL) has received continually growing attention in the past decade. Many algorithms that vary in their approaches to the different subtasks of MARL have been developed. However, the theoretical convergence results for these algorithms do not give a clue as to their practical performance nor supply insights to the dynamics of the learning process itself. This work is a comprehensive empirical study conducted on MGS, a simulation system developed for this purpose. It surveys the important algorithms in the field, demonstrates the strengths and weaknesses of the different approaches to MARL through application of FriendQ, OAL, WoLF, FoeQ, Rmax, and other algorithms to a variety of fully cooperative and fully competitive domains in self and heterogeneous play, and supplies an informal analysis of the resulting learning processes. The results can aid in the design of new learning algorithms, in matching existing algorithms to specific tasks, and may guide further research and formal analysis of the learning processes. Keywords: reinforcement learning, multi-agent reinforcement learning, stochastic games
5 0.064632691 8 jmlr-2008-Accelerated Neural Evolution through Cooperatively Coevolved Synapses
Author: Faustino Gomez, Jürgen Schmidhuber, Risto Miikkulainen
Abstract: Many complex control problems require sophisticated solutions that are not amenable to traditional controller design. Not only is it difficult to model real world systems, but often it is unclear what kind of behavior is required to solve the task. Reinforcement learning (RL) approaches have made progress by using direct interaction with the task environment, but have so far not scaled well to large state spaces and environments that are not fully observable. In recent years, neuroevolution, the artificial evolution of neural networks, has had remarkable success in tasks that exhibit these two properties. In this paper, we compare a neuroevolution method called Cooperative Synapse Neuroevolution (CoSyNE), that uses cooperative coevolution at the level of individual synaptic weights, to a broad range of reinforcement learning algorithms on very difficult versions of the pole balancing problem that involve large (continuous) state spaces and hidden state. CoSyNE is shown to be significantly more efficient and powerful than the other methods on these tasks. Keywords: coevolution, recurrent neural networks, non-linear control, genetic algorithms, experimental comparison
6 0.064475693 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
7 0.043554682 54 jmlr-2008-Learning to Select Features using their Properties
8 0.041789565 83 jmlr-2008-Robust Submodular Observation Selection
9 0.041494343 35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure
10 0.03464235 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
11 0.034505468 58 jmlr-2008-Max-margin Classification of Data with Absent Features
12 0.034353599 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
13 0.033834733 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
14 0.030948639 67 jmlr-2008-Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies
15 0.029320266 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
16 0.028968874 53 jmlr-2008-Learning to Combine Motor Primitives Via Greedy Additive Regression
17 0.028356552 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
18 0.028229954 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
19 0.027875688 94 jmlr-2008-Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management
20 0.026882222 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
topicId topicWeight
[(0, 0.149), (1, -0.014), (2, -0.053), (3, -0.055), (4, -0.268), (5, -0.151), (6, -0.128), (7, 0.041), (8, 0.086), (9, 0.076), (10, -0.036), (11, 0.012), (12, 0.022), (13, 0.029), (14, 0.046), (15, -0.023), (16, 0.074), (17, 0.085), (18, 0.009), (19, -0.011), (20, 0.108), (21, 0.026), (22, 0.013), (23, 0.053), (24, -0.047), (25, 0.026), (26, 0.017), (27, -0.091), (28, 0.021), (29, -0.035), (30, -0.047), (31, -0.062), (32, 0.218), (33, 0.058), (34, 0.145), (35, -0.199), (36, -0.055), (37, -0.005), (38, 0.064), (39, -0.194), (40, 0.006), (41, -0.02), (42, -0.034), (43, -0.166), (44, -0.023), (45, 0.114), (46, 0.154), (47, -0.02), (48, 0.266), (49, -0.091)]
simIndex simValue paperId paperTitle
same-paper 1 0.96660596 49 jmlr-2008-Learning Control Knowledge for Forward Search Planning
Author: Sungwook Yoon, Alan Fern, Robert Givan
Abstract: A number of today’s state-of-the-art planners are based on forward state-space search. The impressive performance can be attributed to progress in computing domain independent heuristics that perform well across many domains. However, it is easy to find domains where such heuristics provide poor guidance, leading to planning failure. Motivated by such failures, the focus of this paper is to investigate mechanisms for learning domain-specific knowledge to better control forward search in a given domain. While there has been a large body of work on inductive learning of control knowledge for AI planning, there is a void of work aimed at forward-state-space search. One reason for this may be that it is challenging to specify a knowledge representation for compactly representing important concepts across a wide range of domains. One of the main contributions of this work is to introduce a novel feature space for representing such control knowledge. The key idea is to define features in terms of information computed via relaxed plan extraction, which has been a major source of success for non-learning planners. This gives a new way of leveraging relaxed planning techniques in the context of learning. Using this feature space, we describe three forms of control knowledge—reactive policies (decision list rules and measures of progress) and linear heuristics—and show how to learn them and incorporate them into forward state-space search. Our empirical results show that our approaches are able to surpass state-of-the-art nonlearning planners across a wide range of planning competition domains. Keywords: planning, machine learning, knowledge representation, search
Author: Avraham Bab, Ronen I. Brafman
Abstract: Multi Agent Reinforcement Learning (MARL) has received continually growing attention in the past decade. Many algorithms that vary in their approaches to the different subtasks of MARL have been developed. However, the theoretical convergence results for these algorithms do not give a clue as to their practical performance nor supply insights to the dynamics of the learning process itself. This work is a comprehensive empirical study conducted on MGS, a simulation system developed for this purpose. It surveys the important algorithms in the field, demonstrates the strengths and weaknesses of the different approaches to MARL through application of FriendQ, OAL, WoLF, FoeQ, Rmax, and other algorithms to a variety of fully cooperative and fully competitive domains in self and heterogeneous play, and supplies an informal analysis of the resulting learning processes. The results can aid in the design of new learning algorithms, in matching existing algorithms to specific tasks, and may guide further research and formal analysis of the learning processes. Keywords: reinforcement learning, multi-agent reinforcement learning, stochastic games
3 0.38384089 95 jmlr-2008-Value Function Based Reinforcement Learning in Changing Markovian Environments
Author: Balázs Csanád Csáji, László Monostori
Abstract: The paper investigates the possibility of applying value function based reinforcement learning (RL) methods in cases when the environment may change over time. First, theorems are presented which show that the optimal value function of a discounted Markov decision process (MDP) Lipschitz continuously depends on the immediate-cost function and the transition-probability function. Dependence on the discount factor is also analyzed and shown to be non-Lipschitz. Afterwards, the concept of (ε, δ)-MDPs is introduced, which is a generalization of MDPs and ε-MDPs. In this model the environment may change over time, more precisely, the transition function and the cost function may vary from time to time, but the changes must be bounded in the limit. Then, learning algorithms in changing environments are analyzed. A general relaxed convergence theorem for stochastic iterative algorithms is presented. We also demonstrate the results through three classical RL methods: asynchronous value iteration, Q-learning and temporal difference learning. Finally, some numerical experiments concerning changing environments are presented. Keywords: Markov decision processes, reinforcement learning, changing environments, (ε, δ)MDPs, value function bounds, stochastic iterative algorithms
4 0.34796232 8 jmlr-2008-Accelerated Neural Evolution through Cooperatively Coevolved Synapses
Author: Faustino Gomez, Jürgen Schmidhuber, Risto Miikkulainen
Abstract: Many complex control problems require sophisticated solutions that are not amenable to traditional controller design. Not only is it difficult to model real world systems, but often it is unclear what kind of behavior is required to solve the task. Reinforcement learning (RL) approaches have made progress by using direct interaction with the task environment, but have so far not scaled well to large state spaces and environments that are not fully observable. In recent years, neuroevolution, the artificial evolution of neural networks, has had remarkable success in tasks that exhibit these two properties. In this paper, we compare a neuroevolution method called Cooperative Synapse Neuroevolution (CoSyNE), that uses cooperative coevolution at the level of individual synaptic weights, to a broad range of reinforcement learning algorithms on very difficult versions of the pole balancing problem that involve large (continuous) state spaces and hidden state. CoSyNE is shown to be significantly more efficient and powerful than the other methods on these tasks. Keywords: coevolution, recurrent neural networks, non-linear control, genetic algorithms, experimental comparison
5 0.33322635 90 jmlr-2008-Theoretical Advantages of Lenient Learners: An Evolutionary Game Theoretic Perspective
Author: Liviu Panait, Karl Tuyls, Sean Luke
Abstract: This paper presents the dynamics of multiple learning agents from an evolutionary game theoretic perspective. We provide replicator dynamics models for cooperative coevolutionary algorithms and for traditional multiagent Q-learning, and we extend these differential equations to account for lenient learners: agents that forgive possible mismatched teammate actions that resulted in low rewards. We use these extended formal models to study the convergence guarantees for these algorithms, and also to visualize the basins of attraction to optimal and suboptimal solutions in two benchmark coordination problems. The paper demonstrates that lenience provides learners with more accurate information about the benefits of performing their actions, resulting in higher likelihood of convergence to the globally optimal solution. In addition, the analysis indicates that the choice of learning algorithm has an insignificant impact on the overall performance of multiagent learning algorithms; rather, the performance of these algorithms depends primarily on the level of lenience that the agents exhibit to one another. Finally, the research herein supports the strength and generality of evolutionary game theory as a backbone for multiagent learning. Keywords: multiagent learning, reinforcement learning, cooperative coevolution, evolutionary game theory, formal models, visualization, basins of attraction
6 0.32109505 54 jmlr-2008-Learning to Select Features using their Properties
7 0.30838764 35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure
8 0.28345236 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
9 0.27205402 77 jmlr-2008-Probabilistic Characterization of Random Decision Trees
10 0.26339832 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
11 0.22133739 93 jmlr-2008-Using Markov Blankets for Causal Structure Learning (Special Topic on Causality)
12 0.21672449 73 jmlr-2008-On the Suitable Domain for SVM Training in Image Coding
13 0.20833598 58 jmlr-2008-Max-margin Classification of Data with Absent Features
14 0.20298493 47 jmlr-2008-Learning Balls of Strings from Edit Corrections
15 0.18196402 56 jmlr-2008-Magic Moments for Structured Output Prediction
16 0.17966986 83 jmlr-2008-Robust Submodular Observation Selection
17 0.17400752 14 jmlr-2008-An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons
18 0.17395934 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
19 0.1673532 53 jmlr-2008-Learning to Combine Motor Primitives Via Greedy Additive Regression
20 0.16725251 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
topicId topicWeight
[(0, 0.012), (5, 0.015), (40, 0.035), (54, 0.039), (58, 0.025), (66, 0.038), (76, 0.014), (88, 0.649), (92, 0.031), (94, 0.033), (99, 0.017)]
simIndex simValue paperId paperTitle
1 0.99486059 25 jmlr-2008-Consistency of Random Forests and Other Averaging Classifiers
Author: Gérard Biau, Luc Devroye, Gábor Lugosi
Abstract: In the last years of his life, Leo Breiman promoted random forests for use in classification. He suggested using averaging as a means of obtaining good discrimination rules. The base classifiers used for averaging are simple and randomized, often based on random samples from the data. He left a few questions unanswered regarding the consistency of such rules. In this paper, we give a number of theorems that establish the universal consistency of averaging rules. We also show that some popular classifiers, including one suggested by Breiman, are not universally consistent. Keywords: random forests, classification trees, consistency, bagging This paper is dedicated to the memory of Leo Breiman.
same-paper 2 0.98390937 49 jmlr-2008-Learning Control Knowledge for Forward Search Planning
Author: Sungwook Yoon, Alan Fern, Robert Givan
Abstract: A number of today’s state-of-the-art planners are based on forward state-space search. The impressive performance can be attributed to progress in computing domain independent heuristics that perform well across many domains. However, it is easy to find domains where such heuristics provide poor guidance, leading to planning failure. Motivated by such failures, the focus of this paper is to investigate mechanisms for learning domain-specific knowledge to better control forward search in a given domain. While there has been a large body of work on inductive learning of control knowledge for AI planning, there is a void of work aimed at forward-state-space search. One reason for this may be that it is challenging to specify a knowledge representation for compactly representing important concepts across a wide range of domains. One of the main contributions of this work is to introduce a novel feature space for representing such control knowledge. The key idea is to define features in terms of information computed via relaxed plan extraction, which has been a major source of success for non-learning planners. This gives a new way of leveraging relaxed planning techniques in the context of learning. Using this feature space, we describe three forms of control knowledge—reactive policies (decision list rules and measures of progress) and linear heuristics—and show how to learn them and incorporate them into forward state-space search. Our empirical results show that our approaches are able to surpass state-of-the-art nonlearning planners across a wide range of planning competition domains. Keywords: planning, machine learning, knowledge representation, search
3 0.95896298 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine
Author: Ja-Yong Koo, Yoonkyung Lee, Yuwon Kim, Changyi Park
Abstract: The support vector machine has been successful in a variety of applications. Also on the theoretical front, statistical properties of the support vector machine have been studied quite extensively with a particular attention to its Bayes risk consistency under some conditions. In this paper, we study somewhat basic statistical properties of the support vector machine yet to be investigated, namely the asymptotic behavior of the coefficients of the linear support vector machine. A Bahadur type representation of the coefficients is established under appropriate conditions, and their asymptotic normality and statistical variability are derived on the basis of the representation. These asymptotic results do not only help further our understanding of the support vector machine, but also they can be useful for related statistical inferences. Keywords: asymptotic normality, Bahadur representation, classification, convexity lemma, Radon transform
4 0.93029565 96 jmlr-2008-Visualizing Data using t-SNE
Author: Laurens van der Maaten, Geoffrey Hinton
Abstract: We present a new technique called “t-SNE” that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map. The technique is a variation of Stochastic Neighbor Embedding (Hinton and Roweis, 2002) that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map. t-SNE is better than existing techniques at creating a single map that reveals structure at many different scales. This is particularly important for high-dimensional data that lie on several different, but related, low-dimensional manifolds, such as images of objects from multiple classes seen from multiple viewpoints. For visualizing the structure of very large data sets, we show how t-SNE can use random walks on neighborhood graphs to allow the implicit structure of all of the data to influence the way in which a subset of the data is displayed. We illustrate the performance of t-SNE on a wide variety of data sets and compare it with many other non-parametric visualization techniques, including Sammon mapping, Isomap, and Locally Linear Embedding. The visualizations produced by t-SNE are significantly better than those produced by the other techniques on almost all of the data sets. Keywords: visualization, dimensionality reduction, manifold learning, embedding algorithms, multidimensional scaling
5 0.7132929 53 jmlr-2008-Learning to Combine Motor Primitives Via Greedy Additive Regression
Author: Manu Chhabra, Robert A. Jacobs
Abstract: The computational complexities arising in motor control can be ameliorated through the use of a library of motor synergies. We present a new model, referred to as the Greedy Additive Regression (GAR) model, for learning a library of torque sequences, and for learning the coefficients of a linear combination of sequences minimizing a cost function. From the perspective of numerical optimization, the GAR model is interesting because it creates a library of “local features”—each sequence in the library is a solution to a single training task—and learns to combine these sequences using a local optimization procedure, namely, additive regression. We speculate that learners with local representational primitives and local optimization procedures will show good performance on nonlinear tasks. The GAR model is also interesting from the perspective of motor control because it outperforms several competing models. Results using a simulated two-joint arm suggest that the GAR model consistently shows excellent performance in the sense that it rapidly learns to perform novel, complex motor tasks. Moreover, its library is overcomplete and sparse, meaning that only a small fraction of the stored torque sequences are used when learning a new movement. The library is also robust in the sense that, after an initial training period, nearly all novel movements can be learned as additive combinations of sequences in the library, and in the sense that it shows good generalization when an arm’s dynamics are altered between training and test conditions, such as when a payload is added to the arm. Lastly, the GAR model works well regardless of whether motor tasks are specified in joint space or Cartesian space. We conclude that learning techniques using local primitives and optimization procedures are viable and potentially important methods for motor control and possibly other domains, and that these techniques deserve further examination by the artificial intelligence and cognitive science
6 0.67757881 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
7 0.65805846 9 jmlr-2008-Active Learning by Spherical Subdivision
8 0.64635116 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection
9 0.64071059 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
10 0.64025128 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
11 0.636383 57 jmlr-2008-Manifold Learning: The Price of Normalization
12 0.61174494 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds
13 0.60693264 19 jmlr-2008-Bouligand Derivatives and Robustness of Support Vector Machines for Regression
14 0.60674417 8 jmlr-2008-Accelerated Neural Evolution through Cooperatively Coevolved Synapses
15 0.59659046 50 jmlr-2008-Learning Reliable Classifiers From Small or Incomplete Data Sets: The Naive Credal Classifier 2
16 0.59071332 71 jmlr-2008-On the Equivalence of Linear Dimensionality-Reducing Transformations
17 0.59040034 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
18 0.58641469 14 jmlr-2008-An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons
20 0.58178377 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers