nips nips2009 nips2009-250 knowledge-graph by maker-knowledge-mining

250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference


Source: pdf

Author: Khashayar Rohanimanesh, Sameer Singh, Andrew McCallum, Michael J. Black

Abstract: Large, relational factor graphs with structure defined by first-order logic or other languages give rise to notoriously difficult inference problems. Because unrolling the structure necessary to represent distributions over all hypotheses has exponential blow-up, solutions are often derived from MCMC. However, because of limitations in the design and parameterization of the jump function, these sampling-based methods suffer from local minima—the system must transition through lower-scoring configurations before arriving at a better MAP solution. This paper presents a new method of explicitly selecting fruitful downward jumps by leveraging reinforcement learning (RL). Rather than setting parameters to maximize the likelihood of the training data, parameters of the factor graph are treated as a log-linear function approximator and learned with methods of temporal difference (TD); MAP inference is performed by executing the resulting policy on held out test data. Our method allows efficient gradient updates since only factors in the neighborhood of variables affected by an action need to be computed—we bypass the need to compute marginals entirely. Our method yields dramatic empirical success, producing new state-of-the-art results on a complex joint model of ontology alignment, with a 48% reduction in error over state-of-the-art in that domain. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: Large, relational factor graphs with structure defined by first-order logic or other languages give rise to notoriously difficult inference problems. [sent-3, score-0.41]

2 This paper presents a new method of explicitly selecting fruitful downward jumps by leveraging reinforcement learning (RL). [sent-6, score-0.382]

3 Rather than setting parameters to maximize the likelihood of the training data, parameters of the factor graph are treated as a log-linear function approximator and learned with methods of temporal difference (TD); MAP inference is performed by executing the resulting policy on held out test data. [sent-7, score-0.575]

4 Our method allows efficient gradient updates since only factors in the neighborhood of variables affected by an action need to be computed—we bypass the need to compute marginals entirely. [sent-8, score-0.352]

5 Our method yields dramatic empirical success, producing new state-of-the-art results on a complex joint model of ontology alignment, with a 48% reduction in error over state-of-the-art in that domain. [sent-9, score-0.151]

6 Introduction: Factor graphs are a widely used representation for modeling complex dependencies amongst hidden variables in structured prediction problems. [sent-10, score-0.286]

7 MAP inference is the problem of finding the most probable setting to the graph’s hidden variables conditioned on some observed variables. [sent-12, score-0.237]

8 Sampling-based methods (e.g., Metropolis-Hastings) have been applied to problems such as MAP inference in these graphs [8, 9, 10, 11, 6]. [sent-17, score-0.221]

9 For example, consider the structured prediction task of clustering where the MAP inference problem is to group data points into equivalence classes according to some model. [sent-19, score-0.158]

10 Figure 1: The figure on the left shows the sequence of states along an optimal path beginning at a single-cluster configuration and ending at the MAP configuration (F1 scores for each state are shown). [sent-40, score-0.206]

11 For example, Figure 1 shows the F1 scores of each state along the optimal path to the MAP clustering (assuming each MCMC jump can reposition one data point at a time). [sent-44, score-0.164]

12 The above discussion, with its emphasis on the delayed-feedback nature of the MAP inference problem, immediately inspires us to employ reinforcement learning (RL) [12]. [sent-46, score-0.541]

13 RL is a framework for solving sequential decision-making problems with delayed reward. [sent-47, score-0.122]

14 Our approach is to directly learn the parameters of the log-linear factor graph with reinforcement learning during a training phase; MAP inference is performed by executing the policy. [sent-49, score-0.615]

15 In §3 we describe the details of our algorithm and discuss a number of ideas for coping with the combinatorial complexity in both state and action spaces. [sent-52, score-0.416]

16 Preliminaries (Factor Graphs): A factor graph is an undirected bipartite graphical representation of a probability distribution with random variables and factors as nodes. [sent-56, score-0.263]

17 Let X be a set of observed variables and Y be a set of hidden variables. [sent-57, score-0.127]

18 Reinforcement learning (RL) refers to a class of problems in which an agent interacts with the environment and the objective is to learn a course of actions that optimizes a long-term measure of a delayed reward signal. [sent-64, score-0.477]

19 An MDP is the tuple M = ⟨S, A, R, P⟩, where S is the set of states, A is the set of actions, R : S × A × S → ℝ is the reward function, i.e. [sent-66, score-0.190]

20 R(s, a, s′) is the expected reward when action a is taken in state s and the system transitions to state s′, and P : S × A × S → [0, 1] is the transition probability function, i.e. [sent-68, score-0.763]

21 P^a(s, s′) is the probability of reaching state s′ if action a is taken in state s. [sent-70, score-0.467]
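
As a purely illustrative reading of this tuple (not code from the paper), a minimal Python sketch might bundle the four components as follows; all names here are our own.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Iterable

State = Hashable
Action = Hashable

@dataclass
class MDP:
    """Minimal container for the tuple M = <S, A, R, P> described above."""
    states: Iterable[State]
    actions: Iterable[Action]
    reward: Callable[[State, Action, State], float]      # R(s, a, s')
    transition: Callable[[State, Action, State], float]  # P^a(s, s') in [0, 1]

# Toy two-state example: the single action "flip" moves deterministically between 0 and 1.
toy = MDP(
    states={0, 1},
    actions={"flip"},
    reward=lambda s, a, s2: 1.0 if s2 == 1 else 0.0,
    transition=lambda s, a, s2: 1.0 if s2 == 1 - s else 0.0,
)
```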

22 A stochastic policy π is defined as π : S × A → [0, 1] such that Σ_a π(a|s) = 1, where π(s, a) is the probability of choosing action a (as the next action) when in state s. [sent-71, score-0.489]

23 Following a policy π on an MDP results in an expected discounted reward R^π_t accumulated over the course of the run, where R^π_t = Σ_{k=0}^{T} γ^k r_{t+k+1}. [sent-72, score-0.38]
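
The discounted return above can be computed directly; this is a small sketch of ours, assuming `rewards` holds r_{t+1}, ..., r_{t+T+1} for a finite episode.

```python
def discounted_return(rewards, gamma=0.9):
    """Compute R_t = sum_k gamma^k * r_{t+k+1} for a finite list of rewards."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Example with gamma = 0.9: 0 + 0.9*1 + 0.81*2 = 2.52
print(discounted_return([0.0, 1.0, 2.0], gamma=0.9))
```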

24 An optimal policy π is a policy that maximizes this reward. [sent-73, score-0.274]

25 Given a Q-function (Q : S × A → ℝ) that represents the expected discounted reward for taking action a in state s, the optimal policy π can be found by locally maximizing Q at each step. [sent-74, score-0.679]

26 Methods of temporal difference (TD) [13] can be used to learn the optimal policy in MDPs, and even have convergence guarantees when the Q-function is in tabular form. [sent-75, score-0.216]

27 Our Approach: In our RL treatment of learning factor graphs, each state in the system represents a complete assignment to the hidden variables Y = y. [sent-82, score-0.442]

28 Given a particular state, an action modifies the setting of a subset of the hidden variables; therefore, an action can also be defined as a setting of all the hidden variables Y = y′. [sent-83, score-0.677]

29 However, in order to cope with the complexity of the action space, we introduce a proposer (as in Metropolis-Hastings) B : Y → Y that constrains the space by limiting the number of possible actions from each state. [sent-84, score-0.564]

30 The reward function R can be defined as the residual performance improvement when the system transitions from the current state y to a neighboring state y′ on the manifold induced by B. [sent-85, score-0.516]

31 Note that unless the hidden variables are highly constrained, the feasible region will be combinatorial in |Y|; we discuss how to cope with this in the following sections. [sent-91, score-0.226]

32 Given a state (i.e., an assignment of the Y variables), an action may be defined as a constrained set of modifications to a subset of the hidden variable assignments. [sent-94, score-0.358]

33 We constrain the action space to a manageable size by using a proposer, or a behavior policy from which actions are sampled. [sent-95, score-0.486]

34 A proposer defines the set of reachable states by describing the distribution over neighboring states s′ given a state s. [sent-96, score-0.379]

35 In the context of the action space of an MDP, the proposer can be viewed in two ways. [sent-97, score-0.417]

36 First, each possible neighbor state s′ can be considered the result of an action a, leading to a large number of deterministic actions. [sent-98, score-0.352]
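
For concreteness, one possible single-point-move proposer for the clustering setting is sketched below; this is our own illustration (the paper does not specify B at this level of detail), and the function and argument names are made up.

```python
import random

def propose_moves(clustering, num_samples=10, rng=random):
    """Sample neighbor configurations by moving one data point to another (possibly new) cluster.

    `clustering` maps each data point to an integer cluster id; each returned neighbor is a
    complete new assignment, matching the view of an action as a full setting of Y.
    """
    points = list(clustering)
    cluster_ids = set(clustering.values())
    neighbors = []
    for _ in range(num_samples):
        p = rng.choice(points)
        # Candidate targets: any existing cluster other than p's current one, or a fresh singleton.
        targets = sorted((cluster_ids - {clustering[p]}) | {max(cluster_ids) + 1})
        neighbor = dict(clustering)
        neighbor[p] = rng.choice(targets)
        neighbors.append(neighbor)
    return neighbors

# Example: three sampled neighbors of a three-point clustering.
print(propose_moves({"a": 0, "b": 0, "c": 1}, num_samples=3, rng=random.Random(0)))
```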

37 • Reward Function: The reward function is designed so that the policy learned through delayed reward reaches the MAP configuration. [sent-101, score-0.679]

38 Alternatively, we could define a Euclidean reward as F(s′) − F(s*), where s* is the ground truth. [sent-105, score-0.229]
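
A minimal sketch of a residual, score-based reward of the kind described here, assuming a training-time scorer F (e.g., pairwise F1 against the ground truth) is available; the function name and signature are ours.

```python
def residual_reward(score_fn, state, next_state):
    """Residual reward for the transition state -> next_state: the change in the score F.

    `score_fn` stands in for a training-time performance metric (e.g., pairwise F1 against
    the ground-truth clustering); the reward is positive exactly when the move improves it.
    """
    return score_fn(next_state) - score_fn(state)
```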

39 • Transition Probability Function: Recall that the actions in our system are samples generated from a proposer B, and that each action uniquely identifies a next state in the system. [sent-107, score-0.701]

40 Thus, given the state s and the action a, the next state s′ has probability P^a(s, s′) = 1 if s′ = simulate(s, a), and 0 otherwise. [sent-109, score-0.467]

41 We show below how Q values can be derived from the factor graph (Equation 1) in a manner that enables efficient computations. [sent-113, score-0.148]

42 As mentioned previously, a state is an assignment to the hidden variables Y = y and an action is another assignment to the hidden variables Y = y′ (that results from changing the values of a subset of the variables ∆Y ⊆ Y). [sent-114, score-0.747]
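
To make ∆Y concrete, a tiny sketch follows, representing assignments as plain dictionaries (an assumption of ours, not the paper's data structure).

```python
def changed_variables(y, y_prime):
    """Return the set of hidden variables whose values differ between assignments y and y'."""
    assert y.keys() == y_prime.keys(), "both assignments must cover the same variables"
    return {v for v in y if y[v] != y_prime[v]}

# Example: only variable 'c2' changes, so only factors neighboring 'c2' need to be rescored.
print(changed_variables({"c1": 0, "c2": 1}, {"c1": 0, "c2": 3}))  # {'c2'}
```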

43 For each assignment, the factor graph can compute the conditional probability p(y | x). [sent-116, score-0.148]

44 Then, the residual log-probability resulting from taking action a in state y and reaching y′ is log(p(y′ | x)) − log(p(y | x)). [sent-117, score-0.389]

45 Plugging in the model from Equation 1 and performing some algebraic manipulation so redundant factors cancel yields θ · (Σ_{y_i ∈ δ_{y′}} φ(x, y_i) − Σ_{y_i ∈ δ_y} φ(x, y_i)) (Equation 5), where the partition function Z_X and factors outside the neighborhood of ∆y cancel. [sent-118, score-0.128]

46 In practice an action will modify a small subset of the variables so this computation is extremely efficient. [sent-119, score-0.288]
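
A hedged sketch of the efficient score difference this passage describes: only factors adjacent to changed variables are rescored, and the helper functions `phi` and `factors_touching` are assumed to be provided by the surrounding factor-graph code (their names are illustrative).

```python
import numpy as np

def score_delta(theta, phi, factors_touching, y, y_prime):
    """Compute log p(y'|x) - log p(y|x) using only factors near the changed variables.

    `phi(factor, assignment)` returns that factor's feature vector and
    `factors_touching(var)` returns the factors adjacent to a hidden variable.
    The partition function cancels, as in Equation 5.
    """
    changed = {v for v in y if y[v] != y_prime[v]}
    affected = {f for v in changed for f in factors_touching(v)}
    delta_phi = np.zeros_like(theta, dtype=float)
    for f in affected:
        delta_phi += phi(f, y_prime) - phi(f, y)
    return float(theta @ delta_phi)
```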

47 Algorithm: Now that we have defined MAP inference in a factor graph as an MDP, we can apply a wide variety of RL algorithms to learn the model’s parameters. [sent-122, score-0.258]

48 Our RL learning method for factor graphs is shown in Algorithm 1. [sent-124, score-0.209]

49 Algorithm 1 updates the parameters as θ ← θ + αδe (where e is the vector of eligibility traces and δ the TD error) and sets s ← s′ at each step, repeating until the end of each episode and until the end of training. At the beginning of each episode, the factor graph is initialized to a random initial state s (by assigning Y = y0). [sent-126, score-0.344]

50 Then, during each step of the episode, the maximum action is obtained by repeatedly sampling from the proposal distribution (s′ = simulate(s, a)). [sent-127, score-0.313]

51 The system transitions to the greedy state s′ with high probability (1 − ε), or transitions to a random state instead. [sent-128, score-0.467]
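
The following is a schematic rendering of the training loop described here, not the authors' Algorithm 1: the proposer, the feature-difference function, and the training-time reward are assumed to be supplied, Q(s, a) is modeled as the linear score of the proposed move, and the update is a basic linear TD(λ)/Q(λ)-style rule whose trace handling may differ from the paper's.

```python
import random

import numpy as np

def train_episode(theta, initial_state, propose, features_delta, reward,
                  alpha=0.1, gamma=0.9, lam=0.8, epsilon=0.1, steps=200, rng=random):
    """One schematic training episode for a log-linear factor graph.

    Q(s, a) is modeled as theta . (phi(x, y') - phi(x, y)) for the proposed move, so the
    gradient of Q with respect to theta is simply that feature difference. `propose`,
    `features_delta`, and `reward` are assumed to be supplied by the caller.
    """
    s = initial_state
    e = np.zeros_like(theta)                                   # eligibility traces

    def q(state, action):
        return float(theta @ features_delta(state, action))   # linear Q estimate

    for _ in range(steps):
        candidates = propose(s)                                # sampled neighbor assignments
        greedy = max(candidates, key=lambda a: q(s, a))
        # Epsilon-greedy behavior: mostly take the greedy move, occasionally explore.
        a = greedy if rng.random() > epsilon else rng.choice(candidates)
        s_next = a                                             # an action *is* the next assignment
        next_q = max((q(s_next, a2) for a2 in propose(s_next)), default=0.0)
        delta = reward(s, s_next) + gamma * next_q - q(s, a)   # TD error
        e = gamma * lam * e + features_delta(s, a)             # accumulate traces
        theta = theta + alpha * delta * e                      # theta <- theta + alpha * delta * e
        s = s_next
    return theta
```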

52 Once learning has completed on a training set, MAP inference can be evaluated on test data by executing the resulting policy. [sent-130, score-0.158]

53 Because Q-values encode both the reward and value together, policy execution can be performed by choosing the action that maximizes the Q-function at each state. [sent-131, score-0.564]
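
A matching sketch of greedy policy execution for MAP inference at test time, reusing the assumed helpers from the training sketch; the stopping rule is an illustrative choice, not taken from the paper.

```python
def map_inference(theta, initial_state, propose, features_delta, max_steps=1000):
    """Greedy policy execution at test time: repeatedly take the argmax-Q proposal.

    Stops early once no sampled proposal has a positive Q value, i.e. no move is
    predicted to improve the current configuration.
    """
    s = initial_state
    for _ in range(max_steps):
        candidates = propose(s)
        if not candidates:
            break
        best = max(candidates, key=lambda a: float(theta @ features_delta(s, a)))
        if float(theta @ features_delta(s, best)) <= 0.0:
            break  # the policy sees no improving move from this configuration
        s = best
    return s
```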

54 Experiments: We evaluate our approach by training a factor graph for solving the ontology alignment problem. [sent-132, score-0.351]

55 Ontology alignment is the problem of mapping concepts from one ontology to semantically equivalent concepts from another ontology; our treatment of the problem involves learning a first-order probabilistic model that clusters concepts into semantically equivalent sets. [sent-133, score-0.557]

56 There are two ontology mappings: one between two course catalog hierarchies, and another between two company profile hierarchies. [sent-135, score-0.424]

57 The course catalog contains 104 concepts and 4360 data records while the company profile domain contains 219 concepts and 23139 records. [sent-137, score-0.465]

58 The conditional random field we use to model the problem factors into binary decisions over sets of concepts, where the binary variable is one if all concepts in the set map to each other, and zero otherwise. [sent-139, score-0.317]

59 Each of these hidden variables neighbors a factor that also examines the observed concept data. [sent-140, score-0.225]

60 Since there are variables and factors for each hypothetical cluster, the size of the CRF is combinatorial in the number of concepts in the ontology, and it cannot be fully instantiated even for small amounts of data. [sent-141, score-0.275]

61 Features: The features used to represent the ontology alignment problem are described here. [sent-144, score-0.238]

62 We choose to encode our features in first-order logic, aggregating and quantifying pairwise comparisons of concepts over entire sets. [sent-145, score-0.166]
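
As an illustration only (the paper's actual feature set is not reproduced here), aggregated and quantified pairwise features over a candidate set of concept names might look like the following; the string-similarity measure and threshold are our assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations

def aggregated_features(concepts, threshold=0.8):
    """Aggregate pairwise comparisons over a candidate set of concept names.

    Returns universally / existentially quantified and averaged versions of a simple
    string-similarity test; the real model's features are richer than this sketch.
    """
    sims = [SequenceMatcher(None, a.lower(), b.lower()).ratio()
            for a, b in combinations(concepts, 2)] or [0.0]
    return {
        "forall_similar": float(all(s >= threshold for s in sims)),  # universal quantifier
        "exists_similar": float(any(s >= threshold for s in sims)),  # existential quantifier
        "mean_similarity": sum(sims) / len(sims),                    # aggregate
    }

# Example on a hypothetical candidate cluster of course-catalog concepts.
print(aggregated_features(["Computer Science", "Computer Sciences", "CS"]))
```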

63 Table 1: pairwise-matching precision, recall and F1 on the course catalog dataset. [sent-173, score-0.221]

64 Systems: In this section we evaluate the performance of our reinforcement learning approach to MAP inference and compare it to current stochastic and greedy alternatives. [sent-174, score-0.493]

65 In particular, we compare piecewise [20], contrastive divergence [21], and SampleRank [22, 11, 23]; these are described in more detail below. [sent-175, score-0.211]

66 • Contrastive Divergence (MH-CD1) with Metropolis-Hastings: the system is trained with contrastive divergence and allowed to wander one step from the ground-truth configuration. [sent-181, score-0.22]

67 Once the parameters are learned, MAP inference is performed using Metropolis-Hastings (with a proposal distribution that modifies a single variable at a time). [sent-182, score-0.186]

68 MAP is performed with Metropolis-Hastings using a proposal distribution that modifies a single variable at a time (same proposer as in MH-CD1). [sent-184, score-0.256]

69 • Reinforcement Learning (RL): this is the system introduced in the paper that trains the CRF with delayed reward using Q(λ) to learn state-action returns. [sent-185, score-0.369]

70 The actions are derived from the same proposal distribution as used by our Metropolis-Hastings (MH-CD1, MH-SR) systems (modifying a single variable at a time); however, it is exhaustively applied to find the maximum action. [sent-186, score-0.188]

71 In these experiments contrastive divergence and SampleRank were run for 10,000 samples each, while reinforcement learning was run for twenty episodes and 200 steps per episode. [sent-192, score-0.544]

72 CD1 and SampleRank were run for more steps to compensate for only observing a single action at each step (recall RL computes the action with the maximum value at each step by observing a large number of samples). [sent-193, score-0.474]

73 Results: In Table 1 we compare F1 (pairwise-matching) scores of the various systems on the course catalog and company profile datasets. [sent-195, score-0.322]

74 SampleRank (MH-SR), contrastive divergence (MH-CD1) and reinforcement learning (RL) underwent ten training episodes initialized from random configurations; during MAP inference we initialized the systems to the state predicted by greedy agglomerative clustering. [sent-197, score-0.868]

75 Both SampleRank and reinforcement learning were able to achieve higher scores than greedy; however, reinforcement learning outperformed all systems, with an error reduction of 75.3% over contrastive divergence, 28% over SampleRank, 71% over GLUE and 48% over the previous state of the art (greedy agglomerative inference on a conditional random field). [sent-198, score-0.667] [sent-199, score-0.404]

77 Reinforcement learning also reduces error over each system on the company profile dataset. [sent-200, score-0.148]

78 After observing the improvements obtained by reinforcement learning, we wished to test how robust the method was at recovering from the local optima problem described in the introduction. [sent-201, score-0.409]

79 To gain more insight, we designed a separate experiment to compare Metropolis-Hastings inference (trained with SampleRank) and reinforcement learning more carefully. [sent-202, score-0.419]

80 In particular, the MAP inference procedures are initialized to random clusterings (in regions riddled with the type of local optima discussed in the introduction). [sent-204, score-0.21]

81 We then compare greedy MAP inference on a model whose parameters were learned with RL, to Metropolis-Hastings on a model with parameters learned from SampleRank. [sent-205, score-0.264]

82 Even though reinforcement learning’s policy requires it to be greedy with respect to the Q-function, we observe that it is able to better escape the random initial configuration than the Metropolis-Hastings method. [sent-208, score-0.52]

83 Although both systems perform worse under these conditions than in the previous experiment, reinforcement learning does much better in this situation, indicating that the learned Q-function is fairly robust and capable of generalizing to random regions of the space. [sent-210, score-0.349]

84 After observing Metropolis-Hastings’ tendency to get stuck in regions of lower score than reinforcement learning, we test RL to see if it would fall victim to these same optima. [sent-211, score-0.309]

85 In the last two rows of Table 2 we record the results of re-running both reinforcement learning and Metropolis-Hastings (on the SampleRank model) from the configurations in which Metropolis-Hastings became stuck. [sent-212, score-0.309]

86 Table 2: Average pairwise-matching precision, recall and F1 over ten random initialization points, and on the output of MH-SR after 10,000 inference steps. [sent-227, score-0.149]

87 Our approach is similar in spirit to Zhang and Dietterich, who propose a reinforcement learning framework for solving combinatorial optimization problems [25]. [sent-229, score-0.373]

88 Similar to this approach, we also rely on generalization techniques in RL in order to directly approximate a policy over unseen test domains. [sent-230, score-0.137]

89 However, our formulation provides a framework that explicitly targets the MAP problem in large factor graphs and takes advantage of the log-linear representation of such models in order to employ a well studied class of generalization techniques in RL known as linear function approximation. [sent-231, score-0.209]

90 As shown in [28] these frameworks have connections to learning policies in reinforcement learning. [sent-236, score-0.309]

91 In contrast, we formulate parameter learning in factor graphs as an MDP over the space of complete configurations from which a variety of RL methods can be used to set the parameters. [sent-238, score-0.209]

92 Another approach that targets the problem of MAP inference is SampleRank [11, 23], which computes atomic gradient updates from jumps in the local search space. [sent-239, score-0.183]

93 Conclusions and Future Work: We proposed an approach for solving the MAP inference problem in large factor graphs by using reinforcement learning to train model parameters. [sent-241, score-0.628]

94 Hence – unlike most search optimization approaches – the system is able to move out of local optima while aiming for the MAP configuration. [sent-243, score-0.157]

95 Benefiting from the log-linear nature of factor graphs such as CRFs, we are also able to employ well-studied RL linear function approximation techniques for learning generalizable value functions that provide value estimates on the test set. [sent-244, score-0.248]

96 Learning and inference in weighted logic with application to natural language processing. [sent-283, score-0.201]

97 Solving combinatorial optimization tasks by reinforcement learning: A general methodology applied to resource-constrained scheduling. [sent-302, score-0.373]

98 Reinforcement learning for map inference in large factor graphs. [sent-309, score-0.365]

99 Inference and learning in large factor graphs with adaptive proposal distributions. [sent-330, score-0.285]

100 Learning to map between ontologies on the semantic web. [sent-334, score-0.157]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('rl', 0.31), ('reinforcement', 0.309), ('samplerank', 0.283), ('action', 0.237), ('reward', 0.19), ('proposer', 0.18), ('map', 0.157), ('ontology', 0.151), ('policy', 0.137), ('catalog', 0.129), ('delayed', 0.122), ('contrastive', 0.117), ('state', 0.115), ('actions', 0.112), ('graphs', 0.111), ('inference', 0.11), ('glue', 0.103), ('guration', 0.101), ('optima', 0.1), ('factor', 0.098), ('andrew', 0.097), ('concepts', 0.096), ('logic', 0.091), ('company', 0.091), ('khashayar', 0.09), ('rohanimanesh', 0.09), ('wick', 0.09), ('simulate', 0.083), ('mdp', 0.083), ('gurations', 0.083), ('zx', 0.083), ('episode', 0.081), ('con', 0.08), ('proposal', 0.076), ('hidden', 0.076), ('greedy', 0.074), ('jumps', 0.073), ('amherst', 0.064), ('factors', 0.064), ('combinatorial', 0.064), ('sameer', 0.062), ('agglomerative', 0.062), ('tfidf', 0.062), ('mccallum', 0.061), ('cj', 0.06), ('transitions', 0.059), ('approximator', 0.058), ('pro', 0.057), ('system', 0.057), ('crf', 0.055), ('course', 0.053), ('alignment', 0.052), ('bhaskara', 0.051), ('watkin', 0.051), ('variables', 0.051), ('graph', 0.05), ('scores', 0.049), ('ab', 0.048), ('piecewise', 0.048), ('executing', 0.048), ('structured', 0.048), ('rt', 0.048), ('transition', 0.047), ('divergence', 0.046), ('ci', 0.046), ('assignment', 0.045), ('traces', 0.045), ('dom', 0.045), ('marthi', 0.045), ('milch', 0.045), ('downhill', 0.045), ('stuart', 0.045), ('aggregators', 0.045), ('tabular', 0.045), ('upenn', 0.045), ('le', 0.045), ('fernando', 0.043), ('massachusetts', 0.043), ('states', 0.042), ('learned', 0.04), ('ground', 0.039), ('generalizable', 0.039), ('daum', 0.039), ('brian', 0.039), ('recall', 0.039), ('residual', 0.037), ('eligibility', 0.037), ('pedro', 0.037), ('twenty', 0.037), ('maxa', 0.037), ('equation', 0.036), ('features', 0.035), ('cope', 0.035), ('episodes', 0.035), ('pairwise', 0.035), ('quanti', 0.034), ('precision', 0.034), ('temporal', 0.034), ('hal', 0.033), ('semantically', 0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

Author: Khashayar Rohanimanesh, Sameer Singh, Andrew McCallum, Michael J. Black

Abstract: Large, relational factor graphs with structure defined by first-order logic or other languages give rise to notoriously difficult inference problems. Because unrolling the structure necessary to represent distributions over all hypotheses has exponential blow-up, solutions are often derived from MCMC. However, because of limitations in the design and parameterization of the jump function, these sampling-based methods suffer from local minima—the system must transition through lower-scoring configurations before arriving at a better MAP solution. This paper presents a new method of explicitly selecting fruitful downward jumps by leveraging reinforcement learning (RL). Rather than setting parameters to maximize the likelihood of the training data, parameters of the factor graph are treated as a log-linear function approximator and learned with methods of temporal difference (TD); MAP inference is performed by executing the resulting policy on held out test data. Our method allows efficient gradient updates since only factors in the neighborhood of variables affected by an action need to be computed—we bypass the need to compute marginals entirely. Our method yields dramatic empirical success, producing new state-of-the-art results on a complex joint model of ontology alignment, with a 48% reduction in error over state-of-the-art in that domain. 1

2 0.21018937 242 nips-2009-The Infinite Partially Observable Markov Decision Process

Author: Finale Doshi-velez

Abstract: The Partially Observable Markov Decision Process (POMDP) framework has proven useful in planning domains where agents must balance actions that provide knowledge and actions that provide reward. Unfortunately, most POMDPs are complex structures with a large number of parameters. In many real-world problems, both the structure and the parameters are difficult to specify from domain knowledge alone. Recent work in Bayesian reinforcement learning has made headway in learning POMDP models; however, this work has largely focused on learning the parameters of the POMDP model. We define an infinite POMDP (iPOMDP) model that does not require knowledge of the size of the state space; instead, it assumes that the number of visited states will grow as the agent explores its world and only models visited states explicitly. We demonstrate the iPOMDP on several standard problems. 1

3 0.19403619 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability

Author: Keith Bush, Joelle Pineau

Abstract: Interesting real-world datasets often exhibit nonlinear, noisy, continuous-valued states that are unexplorable, are poorly described by first principles, and are only partially observable. If partial observability can be overcome, these constraints suggest the use of model-based reinforcement learning. We experiment with manifold embeddings to reconstruct the observable state-space in the context of offline, model-based reinforcement learning. We demonstrate that the embedding of a system can change as a result of learning, and we argue that the best performing embeddings well-represent the dynamics of both the uncontrolled and adaptively controlled system. We apply this approach to learn a neurostimulation policy that suppresses epileptic seizures on animal brain slices. 1

4 0.17750449 113 nips-2009-Improving Existing Fault Recovery Policies

Author: Guy Shani, Christopher Meek

Abstract: An automated recovery system is a key component in a large data center. Such a system typically employs a hand-made controller created by an expert. While such controllers capture many important aspects of the recovery process, they are often not systematically optimized to reduce costs such as server downtime. In this paper we describe a passive policy learning approach for improving existing recovery policies without exploration. We explain how to use data gathered from the interactions of the hand-made controller with the system, to create an improved controller. We suggest learning an indefinite horizon Partially Observable Markov Decision Process, a model for decision making under uncertainty, and solve it using a point-based algorithm. We describe the complete process, starting with data gathering, model learning, model checking procedures, and computing a policy. 1

5 0.16635126 89 nips-2009-FACTORIE: Probabilistic Programming via Imperatively Defined Factor Graphs

Author: Andrew McCallum, Karl Schultz, Sameer Singh

Abstract: Discriminatively trained undirected graphical models have had wide empirical success, and there has been increasing interest in toolkits that ease their application to complex relational data. The power in relational models is in their repeated structure and tied parameters; at issue is how to define these structures in a powerful and flexible way. Rather than using a declarative language, such as SQL or first-order logic, we advocate using an imperative language to express various aspects of model structure, inference, and learning. By combining the traditional, declarative, statistical semantics of factor graphs with imperative definitions of their construction and operation, we allow the user to mix declarative and procedural domain knowledge, and also gain significant efficiencies. We have implemented such imperatively defined factor graphs in a system we call FACTORIE, a software library for an object-oriented, strongly-typed, functional language. In experimental comparisons to Markov Logic Networks on joint segmentation and coreference, we find our approach to be 3-15 times faster while reducing error by 20-25%—achieving a new state of the art. 1

6 0.15150405 52 nips-2009-Code-specific policy gradient rules for spiking neurons

7 0.13673498 12 nips-2009-A Generalized Natural Actor-Critic Algorithm

8 0.13411351 218 nips-2009-Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining

9 0.13370809 14 nips-2009-A Parameter-free Hedging Algorithm

10 0.13147318 53 nips-2009-Complexity of Decentralized Control: Special Cases

11 0.13049017 60 nips-2009-Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

12 0.10603781 209 nips-2009-Robust Value Function Approximation Using Bilinear Programming

13 0.10565042 57 nips-2009-Conditional Random Fields with High-Order Features for Sequence Labeling

14 0.10026295 221 nips-2009-Solving Stochastic Games

15 0.090296738 134 nips-2009-Learning to Explore and Exploit in POMDPs

16 0.082856216 107 nips-2009-Help or Hinder: Bayesian Models of Social Goal Inference

17 0.078948654 181 nips-2009-Online Learning of Assignments

18 0.077118866 58 nips-2009-Constructing Topological Maps using Markov Random Fields and Loop-Closure Detection

19 0.074220456 35 nips-2009-Approximating MAP by Compensating for Structural Relaxations

20 0.073731162 141 nips-2009-Local Rules for Global MAP: When Do They Work ?


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.241), (1, -0.007), (2, 0.171), (3, -0.253), (4, -0.259), (5, 0.005), (6, 0.028), (7, 0.011), (8, -0.065), (9, -0.018), (10, 0.082), (11, 0.121), (12, 0.021), (13, 0.027), (14, 0.013), (15, 0.045), (16, 0.027), (17, -0.068), (18, 0.032), (19, -0.107), (20, 0.011), (21, -0.048), (22, 0.116), (23, 0.068), (24, -0.059), (25, 0.101), (26, -0.019), (27, 0.015), (28, 0.097), (29, 0.021), (30, 0.006), (31, 0.11), (32, -0.04), (33, 0.008), (34, 0.0), (35, -0.039), (36, -0.036), (37, 0.048), (38, -0.108), (39, 0.027), (40, 0.069), (41, 0.052), (42, -0.002), (43, -0.07), (44, -0.068), (45, 0.07), (46, -0.026), (47, -0.042), (48, -0.017), (49, -0.003)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9546634 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

Author: Khashayar Rohanimanesh, Sameer Singh, Andrew McCallum, Michael J. Black

Abstract: Large, relational factor graphs with structure defined by first-order logic or other languages give rise to notoriously difficult inference problems. Because unrolling the structure necessary to represent distributions over all hypotheses has exponential blow-up, solutions are often derived from MCMC. However, because of limitations in the design and parameterization of the jump function, these sampling-based methods suffer from local minima—the system must transition through lower-scoring configurations before arriving at a better MAP solution. This paper presents a new method of explicitly selecting fruitful downward jumps by leveraging reinforcement learning (RL). Rather than setting parameters to maximize the likelihood of the training data, parameters of the factor graph are treated as a log-linear function approximator and learned with methods of temporal difference (TD); MAP inference is performed by executing the resulting policy on held out test data. Our method allows efficient gradient updates since only factors in the neighborhood of variables affected by an action need to be computed—we bypass the need to compute marginals entirely. Our method yields dramatic empirical success, producing new state-of-the-art results on a complex joint model of ontology alignment, with a 48% reduction in error over state-of-the-art in that domain. 1

2 0.78660893 113 nips-2009-Improving Existing Fault Recovery Policies

Author: Guy Shani, Christopher Meek

Abstract: An automated recovery system is a key component in a large data center. Such a system typically employs a hand-made controller created by an expert. While such controllers capture many important aspects of the recovery process, they are often not systematically optimized to reduce costs such as server downtime. In this paper we describe a passive policy learning approach for improving existing recovery policies without exploration. We explain how to use data gathered from the interactions of the hand-made controller with the system, to create an improved controller. We suggest learning an indefinite horizon Partially Observable Markov Decision Process, a model for decision making under uncertainty, and solve it using a point-based algorithm. We describe the complete process, starting with data gathering, model learning, model checking procedures, and computing a policy. 1

3 0.75541127 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability

Author: Keith Bush, Joelle Pineau

Abstract: Interesting real-world datasets often exhibit nonlinear, noisy, continuous-valued states that are unexplorable, are poorly described by first principles, and are only partially observable. If partial observability can be overcome, these constraints suggest the use of model-based reinforcement learning. We experiment with manifold embeddings to reconstruct the observable state-space in the context of offline, model-based reinforcement learning. We demonstrate that the embedding of a system can change as a result of learning, and we argue that the best performing embeddings well-represent the dynamics of both the uncontrolled and adaptively controlled system. We apply this approach to learn a neurostimulation policy that suppresses epileptic seizures on animal brain slices. 1

4 0.7323249 12 nips-2009-A Generalized Natural Actor-Critic Algorithm

Author: Tetsuro Morimura, Eiji Uchibe, Junichiro Yoshimoto, Kenji Doya

Abstract: Policy gradient Reinforcement Learning (RL) algorithms have received substantial attention, seeking stochastic policies that maximize the average (or discounted cumulative) reward. In addition, extensions based on the concept of the Natural Gradient (NG) show promising learning efficiency because these regard metrics for the task. Though there are two candidate metrics, Kakade’s Fisher Information Matrix (FIM) for the policy (action) distribution and Morimura’s FIM for the stateaction joint distribution, but all RL algorithms with NG have followed Kakade’s approach. In this paper, we describe a generalized Natural Gradient (gNG) that linearly interpolates the two FIMs and propose an efficient implementation for the gNG learning based on a theory of the estimating function, the generalized Natural Actor-Critic (gNAC) algorithm. The gNAC algorithm involves a near optimal auxiliary function to reduce the variance of the gNG estimates. Interestingly, the gNAC can be regarded as a natural extension of the current state-of-the-art NAC algorithm [1], as long as the interpolating parameter is appropriately selected. Numerical experiments showed that the proposed gNAC algorithm can estimate gNG efficiently and outperformed the NAC algorithm.

5 0.72606695 159 nips-2009-Multi-Step Dyna Planning for Policy Evaluation and Control

Author: Hengshuai Yao, Shalabh Bhatnagar, Dongcui Diao, Richard S. Sutton, Csaba Szepesvári

Abstract: In this paper we introduce a multi-step linear Dyna-style planning algorithm. The key element of the multi-step linear Dyna is a multi-step linear model that enables multi-step projection of a sampled feature and multi-step planning based on the simulated multi-step transition experience. We propose two multi-step linear models. The first iterates the one-step linear model, but is generally computationally complex. The second interpolates between the one-step model and the infinite-step model (which turns out to be the LSTD solution), and can be learned efficiently online. Policy evaluation on Boyan Chain shows that multi-step linear Dyna learns a policy faster than single-step linear Dyna, and generally learns faster as the number of projection steps increases. Results on Mountain-car show that multi-step linear Dyna leads to much better online performance than single-step linear Dyna and model-free algorithms; however, the performance of multi-step linear Dyna does not always improve as the number of projection steps increases. Our results also suggest that previous attempts on extending LSTD for online control were unsuccessful because LSTD looks infinite steps into the future, and suffers from the model errors in non-stationary (control) environments.

6 0.69832581 218 nips-2009-Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining

7 0.6587556 242 nips-2009-The Infinite Partially Observable Markov Decision Process

8 0.65833426 134 nips-2009-Learning to Explore and Exploit in POMDPs

9 0.61346042 60 nips-2009-Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

10 0.59029102 89 nips-2009-FACTORIE: Probabilistic Programming via Imperatively Defined Factor Graphs

11 0.56609988 53 nips-2009-Complexity of Decentralized Control: Special Cases

12 0.55761737 150 nips-2009-Maximum likelihood trajectories for continuous-time Markov chains

13 0.52571589 209 nips-2009-Robust Value Function Approximation Using Bilinear Programming

14 0.50160497 56 nips-2009-Conditional Neural Fields

15 0.47235432 171 nips-2009-Nonparametric Bayesian Models for Unsupervised Event Coreference Resolution

16 0.44733229 197 nips-2009-Randomized Pruning: Efficiently Calculating Expectations in Large Dynamic Programs

17 0.42733693 14 nips-2009-A Parameter-free Hedging Algorithm

18 0.42583513 131 nips-2009-Learning from Neighboring Strokes: Combining Appearance and Context for Multi-Domain Sketch Recognition

19 0.42491665 221 nips-2009-Solving Stochastic Games

20 0.42142183 30 nips-2009-An Integer Projected Fixed Point Method for Graph Matching and MAP Inference


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.011), (21, 0.017), (24, 0.055), (25, 0.053), (35, 0.078), (36, 0.109), (39, 0.04), (55, 0.045), (58, 0.07), (61, 0.092), (65, 0.096), (71, 0.098), (81, 0.02), (82, 0.052), (86, 0.078), (91, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90702474 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

Author: Khashayar Rohanimanesh, Sameer Singh, Andrew McCallum, Michael J. Black

Abstract: Large, relational factor graphs with structure defined by first-order logic or other languages give rise to notoriously difficult inference problems. Because unrolling the structure necessary to represent distributions over all hypotheses has exponential blow-up, solutions are often derived from MCMC. However, because of limitations in the design and parameterization of the jump function, these sampling-based methods suffer from local minima—the system must transition through lower-scoring configurations before arriving at a better MAP solution. This paper presents a new method of explicitly selecting fruitful downward jumps by leveraging reinforcement learning (RL). Rather than setting parameters to maximize the likelihood of the training data, parameters of the factor graph are treated as a log-linear function approximator and learned with methods of temporal difference (TD); MAP inference is performed by executing the resulting policy on held out test data. Our method allows efficient gradient updates since only factors in the neighborhood of variables affected by an action need to be computed—we bypass the need to compute marginals entirely. Our method yields dramatic empirical success, producing new state-of-the-art results on a complex joint model of ontology alignment, with a 48% reduction in error over state-of-the-art in that domain. 1

2 0.87384623 14 nips-2009-A Parameter-free Hedging Algorithm

Author: Kamalika Chaudhuri, Yoav Freund, Daniel J. Hsu

Abstract: We study the problem of decision-theoretic online learning (DTOL). Motivated by practical applications, we focus on DTOL when the number of actions is very large. Previous algorithms for learning in this framework have a tunable learning rate parameter, and a barrier to using online-learning in practical applications is that it is not understood how to set this parameter optimally, particularly when the number of actions is large. In this paper, we offer a clean solution by proposing a novel and completely parameter-free algorithm for DTOL. We introduce a new notion of regret, which is more natural for applications with a large number of actions. We show that our algorithm achieves good performance with respect to this new notion of regret; in addition, it also achieves performance close to that of the best bounds achieved by previous algorithms with optimally-tuned parameters, according to previous notions of regret. 1

3 0.85087031 70 nips-2009-Discriminative Network Models of Schizophrenia

Author: Irina Rish, Benjamin Thyreau, Bertrand Thirion, Marion Plaze, Marie-laure Paillere-martinot, Catherine Martelli, Jean-luc Martinot, Jean-baptiste Poline, Guillermo A. Cecchi

Abstract: Schizophrenia is a complex psychiatric disorder that has eluded a characterization in terms of local abnormalities of brain activity, and is hypothesized to affect the collective, “emergent” working of the brain. We propose a novel data-driven approach to capture emergent features using functional brain networks [4] extracted from fMRI data, and demonstrate its advantage over traditional region-of-interest (ROI) and local, task-specific linear activation analyzes. Our results suggest that schizophrenia is indeed associated with disruption of global brain properties related to its functioning as a network, which cannot be explained by alteration of local activation patterns. Moreover, further exploitation of interactions by sparse Markov Random Field classifiers shows clear gain over linear methods, such as Gaussian Naive Bayes and SVM, allowing to reach 86% accuracy (over 50% baseline - random guess), which is quite remarkable given that it is based on a single fMRI experiment using a simple auditory task. 1

4 0.84543258 150 nips-2009-Maximum likelihood trajectories for continuous-time Markov chains

Author: Theodore J. Perkins

Abstract: Continuous-time Markov chains are used to model systems in which transitions between states as well as the time the system spends in each state are random. Many computational problems related to such chains have been solved, including determining state distributions as a function of time, parameter estimation, and control. However, the problem of inferring most likely trajectories, where a trajectory is a sequence of states as well as the amount of time spent in each state, appears unsolved. We study three versions of this problem: (i) an initial value problem, in which an initial state is given and we seek the most likely trajectory until a given final time, (ii) a boundary value problem, in which initial and final states and times are given, and we seek the most likely trajectory connecting them, and (iii) trajectory inference under partial observability, analogous to finding maximum likelihood trajectories for hidden Markov models. We show that maximum likelihood trajectories are not always well-defined, and describe a polynomial time test for well-definedness. When well-definedness holds, we show that each of the three problems can be solved in polynomial time, and we develop efficient dynamic programming algorithms for doing so. 1

5 0.84519517 242 nips-2009-The Infinite Partially Observable Markov Decision Process

Author: Finale Doshi-velez

Abstract: The Partially Observable Markov Decision Process (POMDP) framework has proven useful in planning domains where agents must balance actions that provide knowledge and actions that provide reward. Unfortunately, most POMDPs are complex structures with a large number of parameters. In many real-world problems, both the structure and the parameters are difficult to specify from domain knowledge alone. Recent work in Bayesian reinforcement learning has made headway in learning POMDP models; however, this work has largely focused on learning the parameters of the POMDP model. We define an infinite POMDP (iPOMDP) model that does not require knowledge of the size of the state space; instead, it assumes that the number of visited states will grow as the agent explores its world and only models visited states explicitly. We demonstrate the iPOMDP on several standard problems. 1

6 0.83807641 60 nips-2009-Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

7 0.82837063 113 nips-2009-Improving Existing Fault Recovery Policies

8 0.82462382 167 nips-2009-Non-Parametric Bayesian Dictionary Learning for Sparse Image Representations

9 0.81915498 215 nips-2009-Sensitivity analysis in HMMs with application to likelihood maximization

10 0.8166303 66 nips-2009-Differential Use of Implicit Negative Evidence in Generative and Discriminative Language Learning

11 0.81597805 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties

12 0.81482965 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models

13 0.81429297 218 nips-2009-Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining

14 0.81382608 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions

15 0.81335372 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability

16 0.8101179 159 nips-2009-Multi-Step Dyna Planning for Policy Evaluation and Control

17 0.80271012 33 nips-2009-Analysis of SVM with Indefinite Kernels

18 0.80264652 58 nips-2009-Constructing Topological Maps using Markov Random Fields and Loop-Closure Detection

19 0.80087817 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs

20 0.80024952 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior