jmlr jmlr2012 jmlr2012-41 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Tobias Lang, Marc Toussaint, Kristian Kersting
Abstract: A fundamental problem in reinforcement learning is balancing exploration and exploitation. We address this problem in the context of model-based reinforcement learning in large stochastic relational domains by developing relational extensions of the concepts of the E 3 and R- MAX algorithms. Efficient exploration in exponentially large state spaces needs to exploit the generalization of the learned model: what in a propositional setting would be considered a novel situation and worth exploration may in the relational setting be a well-known context in which exploitation is promising. To address this we introduce relational count functions which generalize the classical notion of state and action visitation counts. We provide guarantees on the exploration efficiency of our framework using count functions under the assumption that we had a relational KWIK learner and a near-optimal planner. We propose a concrete exploration algorithm which integrates a practically efficient probabilistic rule learner and a relational planner (for which there are no guarantees, however) and employs the contexts of learned relational rules as features to model the novelty of states and actions. Our results in noisy 3D simulated robot manipulation problems and in domains of the international planning competition demonstrate that our approach is more effective than existing propositional and factored exploration techniques. Keywords: reinforcement learning, statistical relational learning, exploration, relational transition models, robotics
Reference: text
sentIndex sentText sentNum sentScore
1 We address this problem in the context of model-based reinforcement learning in large stochastic relational domains by developing relational extensions of the concepts of the E 3 and R- MAX algorithms. [sent-9, score-1.405]
2 Efficient exploration in exponentially large state spaces needs to exploit the generalization of the learned model: what in a propositional setting would be considered a novel situation and worth exploration may in the relational setting be a well-known context in which exploitation is promising. [sent-10, score-1.338]
3 To address this we introduce relational count functions which generalize the classical notion of state and action visitation counts. [sent-11, score-0.971]
4 We provide guarantees on the exploration efficiency of our framework using count functions under the assumption that we had a relational KWIK learner and a near-optimal planner. [sent-12, score-0.986]
5 We propose a concrete exploration algorithm which integrates a practically efficient probabilistic rule learner and a relational planner (for which there are no guarantees, however) and employs the contexts of learned relational rules as features to model the novelty of states and actions. [sent-13, score-1.924]
6 Our results in noisy 3D simulated robot manipulation problems and in domains of the international planning competition demonstrate that our approach is more effective than existing propositional and factored exploration techniques. [sent-14, score-0.812]
7 Keywords: reinforcement learning, statistical relational learning, exploration, relational transition models, robotics 1. [sent-15, score-1.426]
8 The problem of exploration in stochastic relational worlds has so far received little attention. [sent-37, score-0.884]
9 , 2006) are mostly model-free and use ε-greedy exploration which does not make use of relational knowledge. [sent-40, score-0.823]
10 Exploiting the relational knowledge for exploration is the problem we address in the current paper. [sent-41, score-0.823]
11 Applying existing, propositional exploration techniques is likely to fail: what in a propositional setting would be considered a novel situation and worth exploration may in the relational setting be an instance of a well-known abstract context in which exploitation is promising. [sent-42, score-1.409]
12 In other words, the key idea underlying our approach is: the inherent generalization of learned knowledge in the relational representation has profound implications also for the exploration strategy. [sent-43, score-0.864]
13 We introduce a concrete instantiation in this setting where the learner and planner components are based on previous work: the learner is a relational learning algorithm for noisy indeterministic deictic (NID) rules (Pasula et al. [sent-48, score-0.933]
14 As a planner we employ PRADA (Lang and Toussaint, 2010) which translates the learned relational model to a grounded dynamic Bayesian network (DBN) and uses approximate inference (a factored frontier) to estimate the expected return of sampled action sequences. [sent-50, score-0.964]
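The sketch below gives a rough, assumed simplification of this planning step: it scores sampled action sequences by their average simulated return. PRADA itself evaluates sequences with approximate DBN inference (a factored frontier) rather than Monte-Carlo rollouts, and the simulate(state, action, rng) -> (next_state, reward) callback is a hypothetical stand-in for the learned rule model, not the paper's API.

import random

def estimate_return(state, actions, simulate, gamma=0.95, rollouts=30, rng=None):
    # Average discounted return of one action sequence over several rollouts.
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(rollouts):
        s, ret, discount = state, 0.0, 1.0
        for a in actions:
            s, r = simulate(s, a, rng)
            ret += discount * r
            discount *= gamma
        total += ret
    return total / rollouts

def plan(state, candidate_actions, simulate, horizon=4, n_sequences=100, rng=None):
    # Sample action sequences and keep the one with the best estimated return.
    rng = rng or random.Random(1)
    best_seq, best_val = None, float("-inf")
    for _ in range(n_sequences):
        seq = [rng.choice(candidate_actions) for _ in range(horizon)]
        val = estimate_return(state, seq, simulate, rng=rng)
        if val > best_val:
            best_seq, best_val = seq, val
    return best_seq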
15 Given a learner and a planner, the relational exploration-exploitation strategy needs to realize an estimation of novelty in the relational setting. [sent-51, score-1.304]
16 We generalize the notion of state counts to relational count functions such that visitation of a single state increases this measure of knownness also for “related” states. [sent-53, score-0.926]
17 We propose several possible choices of such features in a relational setting, including one that exploits the specific relational context features that are implicitly learned by the relational rule learner. [sent-56, score-1.967]
18 This will allow us to draw clear connections (i) to the basic R- MAX and E 3 framework for exploration in reinforcement learning and (ii) to the pioneering work of Walsh (2010) on KWIK learning in a relational reinforcement learning setting. [sent-61, score-1.027]
19 Walsh’s work proved the existence of a KWIK learning algorithm in a relational RL setting by nesting several KWIK algorithms for learning different parts of relational transition models. [sent-62, score-1.324]
20 Most work in this context has focused on model-free approaches (estimating a value function) and has not developed relational exploration strategies. [sent-96, score-0.823]
21 Essentially, a number of relational regression algorithms have been developed for use in these relational RL systems such as relational regression trees (Džeroski et al. [sent-97, score-1.866]
22 Driessens and Džeroski (2004) propose the use of “reasonable policies” in model-free relational RL to provide guidance, that is, to increase the chance to discover sparse rewards in large relational state spaces, also known as reward shaping. [sent-102, score-1.392]
23 Sanner (2005, 2006) combines feature discovery with model-free relational reinforcement learning but does not discuss count function estimation for known states in the exploration-exploitation problem in general terms. [sent-103, score-0.921]
24 (2007) present an incremental relational regression tree algorithm that is capable of dealing with concept drift and show that it enables a relational Q-learner to transfer knowledge from one task to another. [sent-105, score-1.244]
25 They do not learn a model of the domain and again, relational exploration strategies were not developed. [sent-106, score-0.823]
26 (2003) calculate approximate value functions for relational MDPs from sampled (grounded) environments and provide guarantees for accurate planning in terms of the number of samples; they do not consider an agent which explores its environment step by step to learn a transition model. [sent-138, score-0.883]
27 The goal of this work is to propose a practically feasible online relational RL system that integrates efficient exploration in relational domains with fully unknown transition dynamics and learning complete action operators (including contexts, effects, and effect distributions). [sent-150, score-1.741]
28 More precisely, our contributions are the following: • We introduce the problem of learning relational count functions which generalize the classical notion of state (action) visitation counts. [sent-152, score-0.838]
29 • We provide guarantees on the exploration efficiency of the general R EX framework under the assumption that we had a relational KWIK learner and were capable of near-optimal planning in our domain. [sent-154, score-0.963]
30 • As a concrete instance of our R EX framework, we integrate the state-of-the-art relational planner PRADA (Lang and Toussaint, 2010) and a learner for probabilistic relational rules (Pasula et al. [sent-155, score-1.457]
31 Our work has interesting parallels in cognitive science: Windridge and Kittler (2010) employ ideas of relational exploration for cognitive bootstrapping, that is, to progressively learn more abstract representations of an agent’s environment on the basis of its action capabilities. [sent-161, score-1.013]
32 We then assume relational rule learning and PRADA as concrete learner and planner components for R EX. [sent-165, score-0.83]
33 Background on MDPs, Representations, Exploration and Transition Models. In this section, we set up the theoretical background for the relational exploration framework and algorithms presented later. [sent-169, score-0.823]
34 Three different representation types dominate AI research on discrete representations: (i) unstructured enumerated representations, (ii) factored propositional representations, and (iii) relational representations. [sent-187, score-0.916]
35 The state space S is described by means of a relational vocabulary consisting of predicates P and functions F , which yield the set of ground atoms with arguments taken from the set of domain objects O . [sent-204, score-0.829]
36 In MDPs based on relational representations, called relational MDPs, the commonalities of state structures, actions and objects can be expressed. [sent-207, score-1.581]
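To make the relational state representation concrete, a ground state can be viewed simply as the set of ground atoms that currently hold. The predicate names and objects in this sketch are hypothetical examples chosen to mirror the manipulation domains discussed later; they are not taken from the paper's implementation.

from itertools import permutations

objects = ["a", "b", "c"]                            # domain objects O
predicates = {"on": 2, "inhand": 1, "clear": 1}      # predicates P with their arities

# A ground relational state is the set of ground atoms that are currently true.
state = frozenset({
    ("on", "a", "b"),
    ("on", "b", "c"),
    ("clear", "a"),
})

def all_ground_atoms(predicates, objects):
    # Enumerate every ground atom the vocabulary can express over these objects.
    atoms = []
    for pred, arity in predicates.items():
        for args in permutations(objects, arity):
            atoms.append((pred,) + args)
    return atoms

print(len(all_ground_atoms(predicates, objects)), "possible ground atoms;",
      len(state), "are true in this state")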
37 In practice, however, the number of exploratory actions becomes huge, so that in the large state spaces of relational worlds it is unrealistic to meet the theoretical thresholds on state visits. [sent-245, score-0.947]
38 Our evaluations will include variants of factored exploration strategies where the factorization is based on grounded relational formulas. [sent-248, score-0.93]
39 In this paper, we are investigating exploration strategies for relational representations and lift E 3 and R- MAX to relational domains. [sent-250, score-1.481]
40 Table 2: The reinforcement learning agent collects a series E of relational state transitions consisting of an action (on the left), a predecessor state (first line) and a successor state (second line after the arrow). [sent-295, score-1.144]
41 In this sense, we view a relational transition model T as a (noisy) compressor of the experiences E whose compactness enables generalization. [sent-328, score-0.809]
42 Transition models which generalize over objects and states play a crucial role in our relational exploration algorithms. [sent-329, score-1.018]
43 The semantics of NID rules allow one to find a “satisficing” action sequence in relational domains that will lead with high probability to states with large rewards. [sent-399, score-0.973]
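The sketch below gives one assumed, much-simplified reading of this idea: a rule is applicable when its context holds in the current state, applying it samples one of its outcomes, and a "satisficing" plan is any sampled action sequence that reaches the goal. Real NID rules are abstract (with variables and deictic references) and carry an explicit noise outcome; the ground rules and the crude noise handling here are illustrative only.

import random
from dataclasses import dataclass

@dataclass
class Rule:
    action: tuple        # e.g., ("grab", "a"); illustrative, already ground
    context: frozenset   # atoms that must hold for the rule to be applicable
    outcomes: list       # list of (prob, add_atoms, del_atoms); leftover mass = noise

def applicable(rule, state):
    return rule.context <= state

def sample_successor(rule, state, rng):
    r, acc = rng.random(), 0.0
    for prob, add, delete in rule.outcomes:
        acc += prob
        if r <= acc:
            return (state - delete) | add
    return state             # noise outcome, crudely modelled here as "no change"

def satisficing_plan(state, rules, goal_atom, horizon=4, samples=200, seed=0):
    # Sample rollouts under the rules and return the first action sequence
    # that reaches a state containing goal_atom (None if none is found).
    rng = random.Random(seed)
    for _ in range(samples):
        s, seq = state, []
        for _ in range(horizon):
            candidates = [ru for ru in rules if applicable(ru, s)]
            if not candidates:
                break
            ru = rng.choice(candidates)
            seq.append(ru.action)
            s = sample_successor(ru, s, rng)
            if goal_atom in s:
                return seq
    return None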
44 First, we discuss the implications of a relational knowledge representation for exploration on a conceptual level (Section 3. [sent-415, score-0.823]
45 We show how to quantify the knowledge of states and actions by means of a generalized, relational notion of state-action counts, namely a relational count function. [sent-417, score-1.646]
46 • The key benefit of relational learning is the ability to generalize over yet unobserved instances of the world based on relational abstractions. [sent-434, score-1.273]
47 We propose to generalize the notion of counts to a relational count function that quantifies the degree to which states and actions are known. [sent-436, score-1.081]
48 An example for a relational feature are binary tests that have value 1 if some relational query is true for s; otherwise they have value 0. [sent-445, score-1.244]
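As an illustration of such binary features, the sketch below defines two hypothetical relational queries and uses them to build one simple generalized count: the number of previously visited states that share the query state's feature signature. The specific features and states are made up for this example; the paper's count functions are defined more generally.

def exists_on_table(state):
    return int(any(p == "on" and args[-1] == "table" for (p, *args) in state))

def something_in_hand(state):
    return int(any(p == "inhand" for (p, *args) in state))

features = [exists_on_table, something_in_hand]

def signature(state, features):
    return tuple(f(state) for f in features)

def relational_count(state, visited_states, features):
    # Count past states sharing the query state's feature signature: visiting one
    # state therefore also raises the count of all "related" states with the
    # same signature, generalizing plain visitation counts.
    sig = signature(state, features)
    return sum(1 for s in visited_states if signature(s, features) == sig)

visited = [frozenset({("on", "a", "table"), ("inhand", "b")}),
           frozenset({("on", "c", "table")})]
query = frozenset({("on", "b", "table"), ("inhand", "a")})
print(relational_count(query, visited, features))   # prints 1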
49 These imply different approaches to quantify known states and actions in a relational RL setting. [sent-467, score-0.921]
50 While many relational knowledge representations have some notion of context or rule precondition, in our running example of NID rules these may correspond to the set of NID rule contexts {φr }. [sent-508, score-0.914]
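A minimal sketch of this idea, under the simplifying assumption that rules and contexts are already ground: a state-action pair is considered as well known as the number of stored experiences covered by the same rule context. The helper names below are hypothetical, not the paper's API.

def covering_context(state, action, rules):
    # rules: list of (rule_action, context) pairs; return the context of the first
    # rule whose action matches and whose context holds in the state.
    for rule_action, context in rules:
        if rule_action == action and context <= state:
            return context
    return None

def context_count(state, action, rules, experiences):
    # experiences: list of past (state, action) pairs. The count of (state, action)
    # is the number of experiences covered by the same rule context, so novelty is
    # judged at the abstraction level of the learned contexts.
    ctx = covering_context(state, action, rules)
    if ctx is None:
        return 0        # no covering rule: treated as completely novel
    return sum(1 for (s, a) in experiences if a == action and ctx <= s)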
51 That is, the description of novelty which drives exploration is lifted to the level of abstraction of these relational contexts. [sent-515, score-0.847]
52 (2006) and Halbritter and Geibel (2007) present relational reinforcement learning approaches which use relational graph kernels to estimate the similarity of relational states. [sent-526, score-1.968]
53 These can be used to estimate relational counts which, when applied in our context, would readily lead to alternative notions of novelty and thereby exploration strategies. [sent-527, score-0.851]
54 2 Relational Exploration Framework. The approaches to estimate relational state and action counts which have been discussed above open a large variety of possibilities for concrete exploration strategies. [sent-552, score-1.071]
55 In the following, we derive a relational model-based reinforcement learning framework we call R EX (short for relational explorer) in which these strategies can be applied. [sent-553, score-1.346]
56 It adapts its estimated relational transition model T with the set of experiences E . [sent-557, score-0.809]
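A bare-bones sketch of such a model-based loop is given below. The names env, learn_rules, plan_action and count are assumed callbacks, not the paper's code: the model T is re-estimated from the experiences E after every step, and the count relative to the threshold ζ decides whether to plan for exploitation or exploration.

def rex_episode(env, learn_rules, plan_action, count, zeta, max_steps=100):
    experiences = []                                  # the collected transitions E
    state = env.reset()
    for _ in range(max_steps):
        rules = learn_rules(experiences)              # adapt the transition model T to E
        known = count(state, rules, experiences) >= zeta
        mode = "exploit" if known else "explore"      # E3 / R-MAX style switch
        action = plan_action(state, rules, mode)
        next_state = env.step(action)
        experiences.append((state, action, next_state))
        state = next_state
    return experiences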
57 To remove this assumption, the relational explorer can instead attempt planned exploration first along the lines described by Kearns and Singh for the original E 3 . [sent-568, score-0.903]
58 In the case of relational R- MAX, a single model is built in which all unknown states according to the estimated counts lead to the absorbing special state s with reward Rmax . [sent-569, score-0.858]
59 The parameter ζ in our relational exploration framework R EX defines the threshold to decide whether states and actions are known or unknown. [sent-573, score-1.122]
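The sketch below shows how such a count threshold ζ could drive both mechanisms just described: R-MAX style reward substitution for unknown state-actions, and an E3 style switch between exploitation and planned exploration. Here count, true_reward, plan_exploit and plan_explore are assumed callbacks and R_MAX is a placeholder value.

R_MAX = 10.0   # placeholder optimistic reward

def is_known(state, action, count, zeta):
    return count(state, action) >= zeta

def rmax_reward(state, action, count, zeta, true_reward):
    # Relational R-MAX idea: insufficiently known state-actions receive the
    # optimistic reward R_MAX, drawing the planner towards them.
    if is_known(state, action, count, zeta):
        return true_reward(state, action)
    return R_MAX

def choose_plan(state, actions, count, zeta, plan_exploit, plan_explore):
    # Relational E3 idea: exploit if everything relevant is known, otherwise
    # plan to reach an unknown state-action as quickly as possible.
    if all(is_known(state, a, count, zeta) for a in actions):
        return plan_exploit(state)
    return plan_explore(state)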
60 In the following, we briefly establish conditions for which similar guarantees can be made in our relational exploration framework R EX using count functions: we state simple conditions for a KWIK learner to “match” with the used count function such that R EX using this learner is PAC-MDP. [sent-582, score-1.209]
61 Motivated by Walsh’s concrete relational KWIK learner and our concrete R EX instance described below, we mention another, more special case condition for a KWIK learner to match with a count function. [sent-597, score-0.899]
62 4 A Model-Based Reinforcement Learner for Relational Domains with Fully Unknown Transition Dynamics. Our relational exploration framework R EX is independent of the concrete choices for the transition model representation and learner, the planning algorithm and the relational count function. [sent-622, score-1.735]
63 To our knowledge, this instance of R EX is the first empirically evaluated relational model-based reinforcement learning algorithm which learns and exploits full-fledged models of transition dynamics. [sent-624, score-0.804]
64 In the beginning, the robot is given a relational vocabulary to describe the world on a symbolic level. [sent-664, score-0.852]
65 The robot will apply relational exploration to learn as much as possible about the dynamics of its world in as little time as possible. [sent-675, score-1.034]
66 4 leads to a practical exploration system for relational RL which outperforms established non-relational techniques on a large number of relevant problems. [sent-760, score-0.823]
67 Our results show that R EX efficiently explores worlds with many objects and transfers learned knowledge to new situations, objects and, in contrast to model-free relational RL techniques, even to new tasks. [sent-761, score-0.868]
68 To our knowledge, this is the first evaluation of a model-based reinforcement learning agent which learns both the complete structure as well as the parameters of relational transition models. [sent-762, score-0.884]
69 We present results for both relational E 3 and relational R- MAX to demonstrate that R EX is a practical, effective approach with either strategy. [sent-765, score-1.244]
70 Flat E 3 uses pseudo-propositional NID rules where the rule context describes a complete ground relational state; hence, a rule is only applicable in a specific state and this approach corresponds to the original E 3 . [sent-773, score-1.066]
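The toy comparison below (with made-up states and a made-up abstract feature) illustrates why such flat, whole-state contexts generalize poorly: states that differ only in object identities never share a flat count, whereas an abstract relational context pools them.

from collections import Counter

s1 = frozenset({("inhand", "cube1"), ("on", "cube2", "table")})
s2 = frozenset({("inhand", "cube7"), ("on", "cube9", "table")})

flat_counts = Counter([s1, s2])
print(flat_counts[s1], flat_counts[s2])      # 1 1 -> nothing ever becomes "known"

def some_cube_in_hand(state):
    return any(p == "inhand" for (p, *args) in state)

abstract_counts = Counter(some_cube_in_hand(s) for s in [s1, s2])
print(abstract_counts[True])                 # 2 -> the abstract context is well known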
71 4 as a concrete instance of relational E 3 and R- MAX and a factored propositional variant of R EX as factored propositional E 3 . [sent-799, score-1.127]
72 1 We compare our approach R EX to relational ε-greedy which is the established exploration technique in relational RL approaches (see the discussion in Section 1. [sent-812, score-1.445]
73 We emphasize that we are not aware of any other relational exploration approach for learning full transition models apart from ε-greedy which we could use as a baseline in our evaluation. [sent-818, score-0.903]
74 1 Series 1 – Robot Manipulation Domain. In this first series of experiments, we compare our relational exploration approach R EX in both variants (E 3 and R- MAX) to relational ε-greedy and to flat and factored E 3 approaches in successively more complex problem settings. [sent-893, score-1.512]
75 Figure 2 shows that the relational explorers have superior success rates, require significantly fewer actions and reuse their learned knowledge effectively in subsequent rounds. [sent-913, score-0.955]
76 In the larger worlds, our R EX approaches, that is, relational E 3 and R- MAX, require fewer rounds than ε-greedy to solve the tasks with few actions: the learned count functions permit effective exploration. [sent-916, score-0.804]
77 Their learned rules and count functions for known states and actions enable a stable performance of relational E 3 and R- MAX at a near-optimal level after only 2 rounds, while relational ε-greedy requires 3-4 rounds. [sent-930, score-1.752]
78 This is shown in the bottom row of Figure 5, where the results of relational E 3 are compared to restarting the learning procedure at the beginning of each new task (that is, in rounds 4 and 7); the corresponding graphs for relational R- MAX and relational ε-greedy are similar. [sent-964, score-1.904]
79 Furthermore, both relational E 3 and R- MAX benefit from their learned count functions for active exploration and outperform relational ε-greedy clearly. [sent-965, score-1.589]
80 5 Summary: The experiments in the robot manipulation domain show that our relational exploration approach R EX is a practical and effective full-system approach for exploration in challenging realistic problem settings. [sent-968, score-1.257]
81 Our results confirm the intuitive expectation that relational knowledge improves learning transition models and learning count functions for known states and actions and thus exploration. [sent-969, score-1.104]
82 Experiment 1 shows that relational explorers scale better with the number of objects than propositional explorers. [sent-970, score-0.926]
83 The more challenging Experiments 2 and 3 demonstrate that the principled relational E 3 and R- MAX exploration of R EX outperforms the established ε-greedy exploration: the learned count functions permit effective active exploration. [sent-971, score-0.967]
84 The top row presents the results of different relational exploration approaches. [sent-985, score-0.823]
85 The bottom row compares the full curriculum relational E 3 which transfers learned knowledge to new tasks (same as in the top row) with restarting relational E 3 with each new task. [sent-986, score-1.285]
86 This is hazardous if one cannot generalize one’s experiences over objects, resulting in barely useful estimated count functions for known states and actions: the propositional explorers spend too much time in each state exploring actions without effects. [sent-1018, score-0.83]
87 Both relational E 3 and R- MAX clearly outperform relational ε-greedy in both the success rate as well as the number of required actions, indicating the usefulness of the learned count functions for active exploration. [sent-1020, score-1.415]
88 In the bottom row of Figure 6, the results of the relational approaches are compared to the scenario where the contexts of rules are known a priori and only the outcomes of rules and their probabilities have to be learned. [sent-1021, score-0.845]
89 In contrast, the smaller action numbers of relational E 3 and R- MAX indicate the advantage of learning and using expressive count functions for active exploration. [sent-1035, score-0.858]
90 Overall, relational R- MAX performs best with respect to all measures; relational E 3 has similar success rates and action numbers, but collects less reward. [sent-1044, score-1.404]
91 All our experiments show that learning relational count functions of known states and actions and exploiting them in a principled exploration strategy outperforms both principled propositional exploration methods and the established technique for relational domains, relational ε-greedy. [sent-1058, score-2.842]
92 Conclusion. Efficient exploration in relational worlds is an interesting problem that is fundamental to many real-life decision-theoretic planning problems, but has received little attention so far. [sent-1060, score-0.964]
93 We have approached this problem by proposing relational exploration strategies that borrow ideas from efficient techniques for propositional representations. [sent-1061, score-0.995]
94 The key step in going from propositional to relational representations is a new definition of the concept of the novelty of states and actions. [sent-1062, score-0.924]
95 We have introduced a framework of relational count functions to estimate empirical counts of relational data. [sent-1063, score-1.375]
96 We have introduced a relational exploration framework called R EX where such count functions are learned from experience and drive exploration in relational domains. [sent-1065, score-1.834]
97 One should start to explore statistical relational reasoning and learning techniques for the relational count function estimation problem implicit in exploring relational worlds. [sent-1076, score-1.969]
98 The resulting algorithms might in turn provide new relational exploration strategies. [sent-1083, score-0.823]
99 Future work should also explore the connection between relational exploration and transfer learning. [sent-1084, score-0.823]
100 Non–parametric policy gradients: A unified treatment of propositional and relational domains. [sent-1227, score-0.822]
wordName wordTfidf (topN-words)
[('relational', 0.622), ('actions', 0.205), ('exploration', 0.201), ('grab', 0.2), ('nid', 0.196), ('robot', 0.187), ('propositional', 0.172), ('puton', 0.172), ('inhand', 0.16), ('action', 0.133), ('kwik', 0.116), ('experiences', 0.107), ('prada', 0.104), ('count', 0.103), ('reinforcement', 0.102), ('ex', 0.099), ('states', 0.094), ('ersting', 0.088), ('oussaint', 0.088), ('walsh', 0.085), ('elational', 0.084), ('omains', 0.084), ('xploration', 0.084), ('agent', 0.08), ('planned', 0.08), ('planning', 0.08), ('transition', 0.08), ('cube', 0.073), ('objects', 0.072), ('contexts', 0.071), ('pasula', 0.068), ('factored', 0.067), ('rules', 0.065), ('ang', 0.062), ('planner', 0.061), ('worlds', 0.061), ('learner', 0.06), ('rule', 0.06), ('explorers', 0.06), ('state', 0.06), ('domains', 0.059), ('enumerated', 0.055), ('reward', 0.054), ('mdps', 0.054), ('literals', 0.052), ('rl', 0.049), ('ippc', 0.048), ('lang', 0.046), ('manipulation', 0.046), ('toussaint', 0.044), ('experience', 0.044), ('symbolic', 0.043), ('learned', 0.041), ('exploitation', 0.041), ('round', 0.04), ('grounded', 0.04), ('kersting', 0.04), ('plan', 0.04), ('rounds', 0.038), ('plans', 0.038), ('deictic', 0.038), ('kristian', 0.036), ('representations', 0.036), ('rewards', 0.034), ('queries', 0.034), ('st', 0.033), ('ball', 0.033), ('se', 0.032), ('rmax', 0.032), ('ce', 0.031), ('driessens', 0.031), ('generalize', 0.029), ('policy', 0.028), ('counts', 0.028), ('diuk', 0.028), ('strehl', 0.028), ('successor', 0.027), ('concrete', 0.027), ('success', 0.027), ('ground', 0.027), ('cubes', 0.026), ('boxes', 0.025), ('dynamics', 0.024), ('object', 0.024), ('jair', 0.024), ('lid', 0.024), ('mexploit', 0.024), ('predicates', 0.024), ('visitation', 0.024), ('atoms', 0.024), ('kearns', 0.024), ('abstraction', 0.024), ('mdp', 0.024), ('covered', 0.023), ('seeds', 0.023), ('outcomes', 0.022), ('effects', 0.022), ('uncertain', 0.022), ('xperiment', 0.022), ('environment', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999887 41 jmlr-2012-Exploration in Relational Domains for Model-based Reinforcement Learning
Author: Tobias Lang, Marc Toussaint, Kristian Kersting
Abstract: A fundamental problem in reinforcement learning is balancing exploration and exploitation. We address this problem in the context of model-based reinforcement learning in large stochastic relational domains by developing relational extensions of the concepts of the E 3 and R- MAX algorithms. Efficient exploration in exponentially large state spaces needs to exploit the generalization of the learned model: what in a propositional setting would be considered a novel situation and worth exploration may in the relational setting be a well-known context in which exploitation is promising. To address this we introduce relational count functions which generalize the classical notion of state and action visitation counts. We provide guarantees on the exploration efficiency of our framework using count functions under the assumption that we had a relational KWIK learner and a near-optimal planner. We propose a concrete exploration algorithm which integrates a practically efficient probabilistic rule learner and a relational planner (for which there are no guarantees, however) and employs the contexts of learned relational rules as features to model the novelty of states and actions. Our results in noisy 3D simulated robot manipulation problems and in domains of the international planning competition demonstrate that our approach is more effective than existing propositional and factored exploration techniques. Keywords: reinforcement learning, statistical relational learning, exploration, relational transition models, robotics
2 0.13834541 116 jmlr-2012-Transfer in Reinforcement Learning via Shared Features
Author: George Konidaris, Ilya Scheidwasser, Andrew Barto
Abstract: We present a framework for transfer in reinforcement learning based on the idea that related tasks share some common features, and that transfer can be achieved via those shared features. The framework attempts to capture the notion of tasks that are related but distinct, and provides some insight into when transfer can be usefully applied to a problem sequence and when it cannot. We apply the framework to the knowledge transfer problem, and show that an agent can learn a portable shaping function from experience in a sequence of tasks to significantly improve performance in a later related task, even given a very brief training period. We also apply the framework to skill transfer, to show that agents can learn portable skills across a sequence of tasks that significantly improve performance on later related tasks, approaching the performance of agents given perfectly learned problem-specific skills. Keywords: reinforcement learning, transfer, shaping, skills
3 0.11486877 58 jmlr-2012-Linear Fitted-Q Iteration with Multiple Reward Functions
Author: Daniel J. Lizotte, Michael Bowling, Susan A. Murphy
Abstract: We present a general and detailed development of an algorithm for finite-horizon fitted-Q iteration with an arbitrary number of reward signals and linear value function approximation using an arbitrary number of state features. This includes a detailed treatment of the 3-reward function case using triangulation primitives from computational geometry and a method for identifying globally dominated actions. We also present an example of how our methods can be used to construct a realworld decision aid by considering symptom reduction, weight gain, and quality of life in sequential treatments for schizophrenia. Finally, we discuss future directions in which to take this work that will further enable our methods to make a positive impact on the field of evidence-based clinical decision support. Keywords: reinforcement learning, dynamic programming, decision making, linear regression, preference elicitation
4 0.092006877 45 jmlr-2012-Finding Recurrent Patterns from Continuous Sign Language Sentences for Automated Extraction of Signs
Author: Sunita Nayak, Kester Duncan, Sudeep Sarkar, Barbara Loeding
Abstract: We present a probabilistic framework to automatically learn models of recurring signs from multiple sign language video sequences containing the vocabulary of interest. We extract the parts of the signs that are present in most occurrences of the sign in context and are robust to the variations produced by adjacent signs. Each sentence video is first transformed into a multidimensional time series representation, capturing the motion and shape aspects of the sign. Skin color blobs are extracted from frames of color video sequences, and a probabilistic relational distribution is formed for each frame using the contour and edge pixels from the skin blobs. Each sentence is represented as a trajectory in a low dimensional space called the space of relational distributions. Given these time series trajectories, we extract signemes from multiple sentences concurrently using iterated conditional modes (ICM). We show results by learning single signs from a collection of sentences with one common pervading sign, multiple signs from a collection of sentences with more than one common sign, and single signs from a mixed collection of sentences. The extracted signemes demonstrate that our approach is robust to some extent to the variations produced within a sign due to different contexts. We also show results whereby these learned sign models are used for spotting signs in test sequences. Keywords: pattern extraction, sign language recognition, signeme extraction, sign modeling, iterated conditional modes
5 0.085766062 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
Author: Benedict C. May, Nathan Korda, Anthony Lee, David S. Leslie
Abstract: In sequential decision problems in an unknown environment, the decision maker often faces a dilemma over whether to explore to discover more about the environment, or to exploit current knowledge. We address the exploration-exploitation dilemma in a general setting encompassing both standard and contextualised bandit problems. The contextual bandit problem has recently resurfaced in attempts to maximise click-through rates in web based applications, a task with significant commercial interest. In this article we consider an approach of Thompson (1933) which makes use of samples from the posterior distributions for the instantaneous value of each action. We extend the approach by introducing a new algorithm, Optimistic Bayesian Sampling (OBS), in which the probability of playing an action increases with the uncertainty in the estimate of the action value. This results in better directed exploratory behaviour. We prove that, under unrestrictive assumptions, both approaches result in optimal behaviour with respect to the average reward criterion of Yang and Zhu (2002). We implement OBS and measure its performance in simulated Bernoulli bandit and linear regression domains, and also when tested with the task of personalised news article recommendation on a Yahoo! Front Page Today Module data set. We find that OBS performs competitively when compared to recently proposed benchmark algorithms and outperforms Thompson’s method throughout. Keywords: multi-armed bandits, contextual bandits, exploration-exploitation, sequential allocation, Thompson sampling
6 0.067721561 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning
7 0.066916704 34 jmlr-2012-Dynamic Policy Programming
8 0.063448288 50 jmlr-2012-Human Gesture Recognition on Product Manifolds
9 0.052903023 57 jmlr-2012-Learning Symbolic Representations of Hybrid Dynamical Systems
10 0.052295871 32 jmlr-2012-Discriminative Hierarchical Part-based Models for Human Parsing and Action Recognition
11 0.051349428 9 jmlr-2012-A Topic Modeling Toolbox Using Belief Propagation
12 0.034470376 110 jmlr-2012-Static Prediction Games for Adversarial Learning Problems
13 0.034242071 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization
14 0.033557147 72 jmlr-2012-Multi-Target Regression with Rule Ensembles
15 0.027166815 4 jmlr-2012-A Kernel Two-Sample Test
16 0.026679115 94 jmlr-2012-Query Strategies for Evading Convex-Inducing Classifiers
17 0.026215008 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
18 0.02537936 46 jmlr-2012-Finite-Sample Analysis of Least-Squares Policy Iteration
19 0.024994155 80 jmlr-2012-On Ranking and Generalization Bounds
20 0.024745382 114 jmlr-2012-Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies
topicId topicWeight
[(0, -0.134), (1, -0.056), (2, 0.154), (3, -0.18), (4, 0.026), (5, -0.219), (6, -0.106), (7, -0.152), (8, 0.055), (9, 0.24), (10, -0.049), (11, -0.027), (12, 0.058), (13, -0.092), (14, 0.003), (15, -0.098), (16, -0.019), (17, 0.027), (18, 0.06), (19, -0.018), (20, 0.115), (21, 0.069), (22, -0.068), (23, -0.06), (24, 0.074), (25, 0.059), (26, -0.035), (27, 0.231), (28, -0.22), (29, 0.002), (30, -0.095), (31, 0.06), (32, 0.055), (33, 0.011), (34, 0.006), (35, -0.015), (36, 0.098), (37, -0.087), (38, 0.14), (39, -0.092), (40, -0.019), (41, -0.013), (42, -0.085), (43, 0.016), (44, 0.019), (45, 0.074), (46, -0.095), (47, 0.095), (48, -0.009), (49, 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.97566694 41 jmlr-2012-Exploration in Relational Domains for Model-based Reinforcement Learning
Author: Tobias Lang, Marc Toussaint, Kristian Kersting
Abstract: A fundamental problem in reinforcement learning is balancing exploration and exploitation. We address this problem in the context of model-based reinforcement learning in large stochastic relational domains by developing relational extensions of the concepts of the E 3 and R- MAX algorithms. Efficient exploration in exponentially large state spaces needs to exploit the generalization of the learned model: what in a propositional setting would be considered a novel situation and worth exploration may in the relational setting be a well-known context in which exploitation is promising. To address this we introduce relational count functions which generalize the classical notion of state and action visitation counts. We provide guarantees on the exploration efficiency of our framework using count functions under the assumption that we had a relational KWIK learner and a near-optimal planner. We propose a concrete exploration algorithm which integrates a practically efficient probabilistic rule learner and a relational planner (for which there are no guarantees, however) and employs the contexts of learned relational rules as features to model the novelty of states and actions. Our results in noisy 3D simulated robot manipulation problems and in domains of the international planning competition demonstrate that our approach is more effective than existing propositional and factored exploration techniques. Keywords: reinforcement learning, statistical relational learning, exploration, relational transition models, robotics
2 0.73887265 116 jmlr-2012-Transfer in Reinforcement Learning via Shared Features
Author: George Konidaris, Ilya Scheidwasser, Andrew Barto
Abstract: We present a framework for transfer in reinforcement learning based on the idea that related tasks share some common features, and that transfer can be achieved via those shared features. The framework attempts to capture the notion of tasks that are related but distinct, and provides some insight into when transfer can be usefully applied to a problem sequence and when it cannot. We apply the framework to the knowledge transfer problem, and show that an agent can learn a portable shaping function from experience in a sequence of tasks to significantly improve performance in a later related task, even given a very brief training period. We also apply the framework to skill transfer, to show that agents can learn portable skills across a sequence of tasks that significantly improve performance on later related tasks, approaching the performance of agents given perfectly learned problem-specific skills. Keywords: reinforcement learning, transfer, shaping, skills
3 0.51557219 58 jmlr-2012-Linear Fitted-Q Iteration with Multiple Reward Functions
Author: Daniel J. Lizotte, Michael Bowling, Susan A. Murphy
Abstract: We present a general and detailed development of an algorithm for finite-horizon fitted-Q iteration with an arbitrary number of reward signals and linear value function approximation using an arbitrary number of state features. This includes a detailed treatment of the 3-reward function case using triangulation primitives from computational geometry and a method for identifying globally dominated actions. We also present an example of how our methods can be used to construct a realworld decision aid by considering symptom reduction, weight gain, and quality of life in sequential treatments for schizophrenia. Finally, we discuss future directions in which to take this work that will further enable our methods to make a positive impact on the field of evidence-based clinical decision support. Keywords: reinforcement learning, dynamic programming, decision making, linear regression, preference elicitation
4 0.3543092 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
Author: Benedict C. May, Nathan Korda, Anthony Lee, David S. Leslie
Abstract: In sequential decision problems in an unknown environment, the decision maker often faces a dilemma over whether to explore to discover more about the environment, or to exploit current knowledge. We address the exploration-exploitation dilemma in a general setting encompassing both standard and contextualised bandit problems. The contextual bandit problem has recently resurfaced in attempts to maximise click-through rates in web based applications, a task with significant commercial interest. In this article we consider an approach of Thompson (1933) which makes use of samples from the posterior distributions for the instantaneous value of each action. We extend the approach by introducing a new algorithm, Optimistic Bayesian Sampling (OBS), in which the probability of playing an action increases with the uncertainty in the estimate of the action value. This results in better directed exploratory behaviour. We prove that, under unrestrictive assumptions, both approaches result in optimal behaviour with respect to the average reward criterion of Yang and Zhu (2002). We implement OBS and measure its performance in simulated Bernoulli bandit and linear regression domains, and also when tested with the task of personalised news article recommendation on a Yahoo! Front Page Today Module data set. We find that OBS performs competitively when compared to recently proposed benchmark algorithms and outperforms Thompson’s method throughout. Keywords: multi-armed bandits, contextual bandits, exploration-exploitation, sequential allocation, Thompson sampling
5 0.31623465 45 jmlr-2012-Finding Recurrent Patterns from Continuous Sign Language Sentences for Automated Extraction of Signs
Author: Sunita Nayak, Kester Duncan, Sudeep Sarkar, Barbara Loeding
Abstract: We present a probabilistic framework to automatically learn models of recurring signs from multiple sign language video sequences containing the vocabulary of interest. We extract the parts of the signs that are present in most occurrences of the sign in context and are robust to the variations produced by adjacent signs. Each sentence video is first transformed into a multidimensional time series representation, capturing the motion and shape aspects of the sign. Skin color blobs are extracted from frames of color video sequences, and a probabilistic relational distribution is formed for each frame using the contour and edge pixels from the skin blobs. Each sentence is represented as a trajectory in a low dimensional space called the space of relational distributions. Given these time series trajectories, we extract signemes from multiple sentences concurrently using iterated conditional modes (ICM). We show results by learning single signs from a collection of sentences with one common pervading sign, multiple signs from a collection of sentences with more than one common sign, and single signs from a mixed collection of sentences. The extracted signemes demonstrate that our approach is robust to some extent to the variations produced within a sign due to different contexts. We also show results whereby these learned sign models are used for spotting signs in test sequences. Keywords: pattern extraction, sign language recognition, signeme extraction, sign modeling, iterated conditional modes
6 0.28617221 57 jmlr-2012-Learning Symbolic Representations of Hybrid Dynamical Systems
7 0.25990871 34 jmlr-2012-Dynamic Policy Programming
8 0.20483139 110 jmlr-2012-Static Prediction Games for Adversarial Learning Problems
9 0.20193623 50 jmlr-2012-Human Gesture Recognition on Product Manifolds
10 0.19885556 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization
11 0.19554777 9 jmlr-2012-A Topic Modeling Toolbox Using Belief Propagation
12 0.19332288 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning
13 0.19017696 72 jmlr-2012-Multi-Target Regression with Rule Ensembles
14 0.17279276 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
15 0.1668981 32 jmlr-2012-Discriminative Hierarchical Part-based Models for Human Parsing and Action Recognition
16 0.15237349 4 jmlr-2012-A Kernel Two-Sample Test
17 0.14452352 114 jmlr-2012-Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies
18 0.14413467 93 jmlr-2012-Quantum Set Intersection and its Application to Associative Memory
19 0.13916779 78 jmlr-2012-Nonparametric Guidance of Autoencoder Representations using Label Information
20 0.13398124 70 jmlr-2012-Multi-Assignment Clustering for Boolean Data
topicId topicWeight
[(7, 0.012), (21, 0.026), (26, 0.027), (27, 0.012), (29, 0.041), (35, 0.022), (49, 0.012), (56, 0.015), (57, 0.491), (64, 0.011), (69, 0.016), (75, 0.038), (77, 0.012), (79, 0.02), (92, 0.084), (96, 0.076)]
simIndex simValue paperId paperTitle
1 0.8673234 113 jmlr-2012-The huge Package for High-dimensional Undirected Graph Estimation in R
Author: Tuo Zhao, Han Liu, Kathryn Roeder, John Lafferty, Larry Wasserman
Abstract: We describe an R package named huge which provides easy-to-use functions for estimating high dimensional undirected graphs from data. This package implements recent results in the literature, including Friedman et al. (2007), Liu et al. (2009, 2012) and Liu et al. (2010). Compared with the existing graph estimation package glasso, the huge package provides extra features: (1) instead of using Fortan, it is written in C, which makes the code more portable and easier to modify; (2) besides fitting Gaussian graphical models, it also provides functions for fitting high dimensional semiparametric Gaussian copula models; (3) more functions like data-dependent model selection, data generation and graph visualization; (4) a minor convergence problem of the graphical lasso algorithm is corrected; (5) the package allows the user to apply both lossless and lossy screening rules to scale up large-scale problems, making a tradeoff between computational and statistical efficiency. Keywords: high-dimensional undirected graph estimation, glasso, huge, semiparametric graph estimation, data-dependent model selection, lossless screening, lossy screening 1. Overview Undirected graphs is a natural approach to describe the conditional independence among many variables. Each node of the graph represents a single variable and no edge between two variables implies that they are conditional independent given all other variables. In the past decade, significant progress has been made on designing efficient algorithms to learn undirected graphs from high-dimensional observational data sets. Most of these methods are based on either the penalized maximum-likelihood estimation (Friedman et al., 2007) or penalized regression methods (Meinshausen and B¨ hlmann, 2006). Existing packages include glasso, Covpath and CLIME. In particuu ∗. Also in the Department of Biostatistics. †. Also in the Department of Machine Learning. c 2012 Zhao, Liu, Roeder, Lafferty and Wasserman. Z HAO , L IU , ROEDER , L AFFERTY AND WASSERMAN lar, the glasso package has been widely adopted by statisticians and computer scientists due to its friendly user-inference and efficiency. In this paper1 we describe a newly developed R package named huge (High-dimensional Undirected Graph Estimation) coded in C. The package includes a wide range of functional modules and addresses some drawbacks of the graphical lasso algorithm. To gain more scalability, the package supports two modes of screening, lossless (Witten et al., 2011) and lossy screening. When using lossy screening, the user can select the desired screening level to scale up for high-dimensional problems, but this introduces some estimation bias. 2. Software Design and Implementation The package huge aims to provide a general framework for high-dimensional undirected graph estimation. The package includes Six functional modules (M1-M6) facilitate a flexible pipeline for analysis (Figure 1). M1. Data Generator: The function huge.generator() can simulate multivariate Gaussian data with different undirected graphs, including hub, cluster, band, scale-free, and Erd¨ s-R´ nyi o e random graphs. The sparsity level of the obtained graph and signal-to-noise ratio can also be set up by users. M2. Semiparametric Transformation: The function huge.npn() implements the nonparanormal method (Liu et al., 2009, 2012) for estimating a semiparametric Gaussian copula model.The nonparanormal family extends the Gaussian distribution by marginally transforming the variables. 
Computationally, the nonparanormal transformation only requires one pass through the data matrix. M3. Graph Screening: The scr argument in the main function huge() controls the use of largescale correlation screening before graph estimation. The function supports the lossless screening (Witten et al., 2011) and the lossy screening. Such screening procedures can greatly reduce the computational cost and achieve equal or even better estimation by reducing the variance at the expense of increased bias. Figure 1: The graph estimation pipeline. M4. Graph Estimation: Similar to the glasso package, the method argument in the huge() function supports two estimation methods: (i) the neighborhood pursuit algorithm (Meinshausen and B¨ hlmann, 2006) and (ii) the graphical lasso algorithm (Friedman et al., 2007). We apply u the coordinate descent with active set and covariance update, as well as other tricks suggested in Friedman et al. (2010). We modified the warm start trick to address the potential divergence problem of the graphical lasso algorithm (Mazumder and Hastie, 2011). The code is also memory-optimized using the sparse matrix data structure when estimating and storing full regularization paths for large 1. This paper is only a summary of the package huge. For more details please refer to the online vignette. 1060 H IGH - DIMENSIONAL U NDIRECTED G RAPH E STIMATION data sets. we also provide a complementary graph estimation method based on thresholding the sample correlation matrix, which is computationally efficient and widely applied in biomedical research. M5. Model Selection: The function huge.select() provides two regularization parameter selection methods: the stability approach for regularization selection (StARS) (Liu et al., 2010); and rotation information criterion (RIC). We also provide a likelihood-based extended Bayesian information criterion. M6. Graph Visualization: The plotting functions huge.plot() and plot() provide visualizations of the simulated data sets, estimated graphs and paths. The implementation is based on the igraph package. 3. User Interface by Example We illustrate the user interface by analyzing a stock market data which we contribute to the huge package. We acquired closing prices from all stocks in the S&P; 500 for all the days that the market was open between Jan 1, 2003 and Jan 1, 2008. This gave us 1258 samples for the 452 stocks that remained in the S&P; 500 during the entire time period. > > > > > library(huge) data(stockdata) # Load the data x = log(stockdata$data[2:1258,]/stockdata$data[1:1257,]) # Preprocessing x.npn = huge.npn(x, npn.func=
same-paper 2 0.85438448 41 jmlr-2012-Exploration in Relational Domains for Model-based Reinforcement Learning
Author: Tobias Lang, Marc Toussaint, Kristian Kersting
Abstract: A fundamental problem in reinforcement learning is balancing exploration and exploitation. We address this problem in the context of model-based reinforcement learning in large stochastic relational domains by developing relational extensions of the concepts of the E 3 and R- MAX algorithms. Efficient exploration in exponentially large state spaces needs to exploit the generalization of the learned model: what in a propositional setting would be considered a novel situation and worth exploration may in the relational setting be a well-known context in which exploitation is promising. To address this we introduce relational count functions which generalize the classical notion of state and action visitation counts. We provide guarantees on the exploration efficiency of our framework using count functions under the assumption that we had a relational KWIK learner and a near-optimal planner. We propose a concrete exploration algorithm which integrates a practically efficient probabilistic rule learner and a relational planner (for which there are no guarantees, however) and employs the contexts of learned relational rules as features to model the novelty of states and actions. Our results in noisy 3D simulated robot manipulation problems and in domains of the international planning competition demonstrate that our approach is more effective than existing propositional and factored exploration techniques. Keywords: reinforcement learning, statistical relational learning, exploration, relational transition models, robotics
3 0.75936639 70 jmlr-2012-Multi-Assignment Clustering for Boolean Data
Author: Mario Frank, Andreas P. Streich, David Basin, Joachim M. Buhmann
Abstract: We propose a probabilistic model for clustering Boolean data where an object can be simultaneously assigned to multiple clusters. By explicitly modeling the underlying generative process that combines the individual source emissions, highly structured data are expressed with substantially fewer clusters compared to single-assignment clustering. As a consequence, such a model provides robust parameter estimators even when the number of samples is low. We extend the model with different noise processes and demonstrate that maximum-likelihood estimation with multiple assignments consistently infers source parameters more accurately than single-assignment clustering. Our model is primarily motivated by the task of role mining for role-based access control, where users of a system are assigned one or more roles. In experiments with real-world access-control data, our model exhibits better generalization performance than state-of-the-art approaches. Keywords: clustering, multi-assignments, overlapping clusters, Boolean data, role mining, latent feature models
4 0.35369107 72 jmlr-2012-Multi-Target Regression with Rule Ensembles
Author: Timo Aho, Bernard Ženko, Sašo Džeroski, Tapio Elomaa
Abstract: Methods for learning decision rules are being successfully applied to many problem domains, in particular when understanding and interpretation of the learned model is necessary. In many real life problems, we would like to predict multiple related (nominal or numeric) target attributes simultaneously. While several methods for learning rules that predict multiple targets at once exist, they are all based on the covering algorithm, which does not work well for regression problems. A better solution for regression is the rule ensemble approach that transcribes an ensemble of decision trees into a large collection of rules. An optimization procedure is then used to select the best (and much smaller) subset of these rules and to determine their respective weights. We introduce the F IRE algorithm for solving multi-target regression problems, which employs the rule ensembles approach. We improve the accuracy of the algorithm by adding simple linear functions to the ensemble. We also extensively evaluate the algorithm with and without linear functions. The results show that the accuracy of multi-target regression rule ensembles is high. They are more accurate than, for instance, multi-target regression trees, but not quite as accurate as multi-target random forests. The rule ensembles are significantly more concise than random forests, and it is also possible to create compact rule sets that are smaller than a single regression tree but still comparable in accuracy. Keywords: multi-target prediction, rule learning, rule ensembles, regression ∗. Also in Microtask, Tampere, Finland. †. Also in the Centre of Excellence for Integrated Approaches in Chemistry and Biology of Proteins, Ljubljana, Slovenia. ‡. Also in the Centre of Excellence for Integrated Approaches in Chemistry and Biology of Proteins, Ljubljana, Slovenia and the Joˇ ef Stefan International Postgraduate School, Ljubljana, Slovenia. z ˇ c 2012 Timo Aho, Bernard Zenko, Saˇo Dˇ eroski and Tapio Elomaa. s z ˇ ˇ
5 0.33731285 116 jmlr-2012-Transfer in Reinforcement Learning via Shared Features
Author: George Konidaris, Ilya Scheidwasser, Andrew Barto
Abstract: We present a framework for transfer in reinforcement learning based on the idea that related tasks share some common features, and that transfer can be achieved via those shared features. The framework attempts to capture the notion of tasks that are related but distinct, and provides some insight into when transfer can be usefully applied to a problem sequence and when it cannot. We apply the framework to the knowledge transfer problem, and show that an agent can learn a portable shaping function from experience in a sequence of tasks to significantly improve performance in a later related task, even given a very brief training period. We also apply the framework to skill transfer, to show that agents can learn portable skills across a sequence of tasks that significantly improve performance on later related tasks, approaching the performance of agents given perfectly learned problem-specific skills. Keywords: reinforcement learning, transfer, shaping, skills
6 0.30177164 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
7 0.30022424 42 jmlr-2012-Facilitating Score and Causal Inference Trees for Large Observational Studies
8 0.2942456 58 jmlr-2012-Linear Fitted-Q Iteration with Multiple Reward Functions
9 0.29379812 34 jmlr-2012-Dynamic Policy Programming
10 0.29311728 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
11 0.28993943 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
12 0.28729287 80 jmlr-2012-On Ranking and Generalization Bounds
13 0.28713807 57 jmlr-2012-Learning Symbolic Representations of Hybrid Dynamical Systems
14 0.28260157 114 jmlr-2012-Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies
15 0.28136837 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
16 0.28007847 93 jmlr-2012-Quantum Set Intersection and its Application to Associative Memory
17 0.27780142 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection
18 0.27690247 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality
19 0.27668771 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
20 0.27407104 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach