nips nips2011 nips2011-136 knowledge-graph by maker-knowledge-mining

136 nips-2011-Inverting Grice's Maxims to Learn Rules from Natural Language Extractions


Source: pdf

Author: Mohammad S. Sorower, Janardhan R. Doppa, Walker Orr, Prasad Tadepalli, Thomas G. Dietterich, Xiaoli Z. Fern

Abstract: We consider the problem of learning rules from natural language text sources. These sources, such as news articles and web texts, are created by a writer to communicate information to a reader, where the writer and reader share substantial domain knowledge. Consequently, the texts tend to be concise and mention the minimum information necessary for the reader to draw the correct conclusions. We study the problem of learning domain knowledge from such concise texts, which is an instance of the general problem of learning in the presence of missing data. However, unlike standard approaches to missing data, in this setting we know that facts are more likely to be missing from the text in cases where the reader can infer them from the facts that are mentioned combined with the domain knowledge. Hence, we can explicitly model this “missingness” process and invert it via probabilistic inference to learn the underlying domain knowledge. This paper introduces a mention model that models the probability of facts being mentioned in the text based on what other facts have already been mentioned and domain knowledge in the form of Horn clause rules. Learning must simultaneously search the space of rules and learn the parameters of the mention model. We accomplish this via an application of Expectation Maximization within a Markov Logic framework. An experimental evaluation on synthetic and natural text data shows that the method can learn accurate rules and apply them to new texts to make correct inferences. Experiments also show that the method out-performs the standard EM approach that assumes mentions are missing at random. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract We consider the problem of learning rules from natural language text sources. [sent-4, score-0.291]

2 These sources, such as news articles and web texts, are created by a writer to communicate information to a reader, where the writer and reader share substantial domain knowledge. [sent-5, score-0.539]

3 Consequently, the texts tend to be concise and mention the minimum information necessary for the reader to draw the correct conclusions. [sent-6, score-0.495]

4 However, unlike standard approaches to missing data, in this setting we know that facts are more likely to be missing from the text in cases where the reader can infer them from the facts that are mentioned combined with the domain knowledge. [sent-8, score-0.772]

5 This paper introduces a mention model that models the probability of facts being mentioned in the text based on what other facts have already been mentioned and domain knowledge in the form of Horn clause rules. [sent-10, score-0.82]

6 Learning must simultaneously search the space of rules and learn the parameters of the mention model. [sent-11, score-0.55]

7 An experimental evaluation on synthetic and natural text data shows that the method can learn accurate rules and apply them to new texts to make correct inferences. [sent-13, score-0.395]

8 Experiments also show that the method out-performs the standard EM approach that assumes mentions are missing at random. [sent-14, score-0.268]

9 Introduction: The immense volume of textual information available on the web provides an important opportunity and challenge for AI: Can we develop methods that can learn domain knowledge by reading natural texts such as news articles and web pages? [sent-15, score-0.418]

10 Concrete facts can be extracted as logical relations or as tuples to populate a data base. [sent-17, score-0.258]

11 In this paper, we focus on the latter setting: Given a data base of literals extracted from natural language texts (e. [sent-22, score-0.385]

12 Unfortunately for rule learning algorithms, natural language texts are incomplete. [sent-25, score-0.227]

13 The writer tends to mention only enough information to allow the reader to easily infer the remaining facts from shared background knowledge. [sent-26, score-0.644]

14 This aspect of economy in language was first pointed out by Grice [7] in his maxims of cooperative conversation (see Table 1). [sent-27, score-0.228]

15 Table 1, maxim 2: “Be concise—say as much as necessary, but no more.” This mentions that Kansas City is the away team and ... [sent-29, score-0.393]

16 ... [the article does not] mention that Kansas City lost the game or that the ... Table 1, maxim 4: “Be clear.” [sent-31, score-0.435]

17 Of course these facts can be inferred from domain knowledge rules such as the rule that “if one team is the winner, the other is the loser (and vice versa)” and the rule “if one team is the home team, the other is the away team (and vice versa)”. [sent-33, score-1.388]

18 This explicitly mentions that Khadr is Canadian, because otherwise the reader would infer that he was Egyptian based on the domain knowledge rule “if a person is born in a country, then the person is a citizen of that country”. [sent-36, score-0.531]

19 This paper formalizes the first two maxims, including this corollary, and then shows how to apply them to learn probabilistic Horn clause rules from propositions extracted from news stories. [sent-38, score-0.527]

20 We show that rules learned this way are able to correctly infer more information from incomplete texts than a baseline approach that treats propositions in news stories as missing at random. [sent-39, score-0.771]

21 The problem of learning rules from extracted texts has been studied previously [11, 2, 17]. [sent-40, score-0.4]

22 These systems rely on finding documents in which all of the facts participating in a rule are mentioned. [sent-41, score-0.294]

23 A drawback of this approach is that it is difficult to learn rules unless there are many documents that provide such complete training examples. [sent-43, score-0.387]

24 The central hypothesis of our work is that by explicitly modeling the process by which facts are mentioned, we can learn rules from sets of documents that are smaller and less complete. [sent-44, score-0.486]

25 They study learning hard (non-probabilistic) rules from incomplete extractions. [sent-47, score-0.267]

26 In contrast with our approach of learning explicit probabilistic models, they take the simpler approach of implicitly inverting the conversational maxims when counting evidence for a proposed rule. [sent-48, score-0.259]

27 To handle these, these authors sort the rules by their scores and keep high-scoring rules even if they have some contradictions. [sent-51, score-0.482]

28 Such an approach can learn “almost hard” rules, but will have difficulty with rules that are highly probabilistic (e. [sent-52, score-0.285]

29 , that the home team is somewhat more likely to win a game than the away team). [sent-54, score-0.518]

30 Finally, we describe a method for probabilistically inverting the maxims to learn rules from textual mentions. [sent-61, score-0.463]

31 Consider a writer and a reader who share domain knowledge K. [sent-63, score-0.232]

32 We will write this as (K, MENTION(F) ⊢reader G), where ⊢reader represents the inference procedure of the reader and MENTION is a modal operator that captures the action of mentioning a fact in the text. [sent-65, score-0.345]

33 With this notation, we can formalize the first two Gricean maxims as follows. Mention true facts / don't lie: F ⇒ MENTION(F) (1) and MENTION(F) ⇒ F (2). The first formula is overly strong, because it requires the writer to mention all true facts. [sent-67, score-0.525]

34 ... a person is typically a citizen of only one country), it suffices to mention CITIZENOF(Khadr, Canada) to prevent the faulty inference CITIZENOF(Khadr, Egypt). [sent-86, score-0.326]

35 Hence, for rules of the form P(x, y) ⇒ Q(x, y), where Q is a function from its first to its second argument, we can implement the inference-by-omission maxim as w5: MENTIONP(x, y) ∧ FACTQ(x, z) ∧ (y ≠ z) ⇒ MENTIONQ(x, z). [sent-87, score-0.289]

36 Finally, the shared knowledge P ⇒ Q is represented by the fact-to-fact rule w6: FACTP ⇒ FACTQ. In Markov Logic, each of these rules is assigned a (learned) weight which can be viewed as a cost of violating the rule. [sent-88, score-0.241]

37 The probability of a world ω is proportional to exp(Σ_j w_j · I[Rule j is satisfied by ω]), where j iterates over all groundings of the Markov logic rules in world ω and I[φ] is 1 if φ is true and 0 otherwise. [sent-89, score-0.35]
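
To make the weighted-satisfaction semantics above concrete, here is a minimal Python sketch (the rules, weights, and tuple encoding of literals are invented for illustration; this is not the authors' Alchemy implementation):

    import math

    # A toy world-weight computation. A "world" is a set of ground facts taken to be true;
    # each grounded rule contributes its weight when the world satisfies it.
    rules = [
        # "if DEN is the home team then KC is the away team"
        (1.5, lambda w: ("home", "DEN") not in w or ("away", "KC") in w),
        # "if KC is the winner then DEN is the loser"
        (2.0, lambda w: ("winner", "KC") not in w or ("loser", "DEN") in w),
    ]

    def unnormalized_weight(world):
        # exp( sum_j w_j * I[rule j is satisfied by world] )
        return math.exp(sum(w for w, satisfied in rules if satisfied(world)))

    consistent = {("home", "DEN"), ("away", "KC"), ("winner", "KC"), ("loser", "DEN")}
    inconsistent = {("home", "DEN"), ("winner", "KC")}
    print(unnormalized_weight(consistent), unnormalized_weight(inconsistent))
    # The probability of each world is its weight divided by the sum of weights over all worlds.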

38 Hence, we can include both a rule that says “if the home team is mentioned, then the away team is not mentioned” and rules that say “the home team is always mentioned” and “the away team is always mentioned”. [sent-91, score-1.412]

39 The relative weights on the rules determine the probability that particular literals are actually mentioned. [sent-93, score-0.417]

40 We seek to learn both the rules and their weights. [sent-95, score-0.285]

41 We proceed by first proposing candidate fact-to-fact rules and then automatically generating the other rules (especially the mention-to-mention rules) from the general rule schemata described above. [sent-96, score-0.575]
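
A rough sketch of that schema expansion, under the assumption that each candidate fact-to-fact rule P(x, y) ⇒ Q(x, y) spawns the w5 and w6 clauses quoted above (the string notation is illustrative, not the authors' actual Markov Logic syntax, and the full system derives further mention rules not shown here):

    def expand_schemata(p, q):
        """Given a candidate fact-to-fact rule P(x,y) => Q(x,y), emit the companion
        clauses named in this summary; their weights are learned later."""
        return [
            f"w6: FACT{p}(x,y) => FACT{q}(x,y)",                                  # shared domain knowledge
            f"w5: MENTION{p}(x,y) ^ FACT{q}(x,z) ^ (y != z) => MENTION{q}(x,z)",  # inference by omission
        ]

    for clause in expand_schemata("BORNIN", "CITIZENOF"):
        print(clause)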

42 This has the effect of removing unnecessary rules by driving their weights to zero. [sent-98, score-0.241]

43 For the rule body (antecedent), we consider all conjunctions of literals involving other predicates (i. [sent-101, score-0.324]

44 Each candidate rule is scored on the mentions in the training documents for support (number of training examples that mention all facts in the body) and confidence (the conditional probability that the head is mentioned given that the body is satisfied). [sent-104, score-0.904]

45 We discard all rules that do not achieve minimum support σ and then keep the top τ most confident rules. [sent-105, score-0.241]
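
A minimal sketch of this support/confidence filter, assuming mentions are represented as sets of literals per document (the data layout and function name are assumptions, not the authors' code):

    def select_rules(candidates, documents, sigma, tau):
        """candidates: list of (body, head) where body is a set of literals and head a literal;
        documents: list of sets of mentioned literals."""
        scored = []
        for body, head in candidates:
            support = sum(1 for doc in documents if body <= doc)   # docs mentioning every body literal
            if support < sigma:
                continue                                           # discard low-support rules
            hits = sum(1 for doc in documents if body <= doc and head in doc)
            confidence = hits / support                            # P(head mentioned | body mentioned)
            scored.append((confidence, support, body, head))
        scored.sort(key=lambda r: r[0], reverse=True)
        return scored[:tau]                                        # keep the tau most confident rules

    docs = [{"A", "B"}, {"A"}, {"A", "B"}]
    print(select_rules([({"A"}, "B")], docs, sigma=2, tau=10))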

46 The selected rules are then entered into the knowledge base. [sent-107, score-0.241]

47 From each fact-to-fact rule, we derive mention-to-mention rules as described above. [sent-108, score-0.241]

48 Because our training data only consists of mentions and no facts, the facts are latent (hidden variables), and we must apply the EM algorithm to learn the weights. [sent-112, score-0.393]
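
The EM step itself runs inside a Markov Logic engine, but the idea of inverting a known missingness process can be illustrated with a deliberately tiny toy (a single Bernoulli fact with outcome-dependent mention probabilities; this is not the paper's algorithm):

    def em_missingness(n_true, n_false, n_missing, q1=0.9, q0=0.3, theta=0.5, iters=50):
        """The latent fact is true with probability theta; true facts are mentioned with
        probability q1 and false ones with q0, so an unmentioned fact is itself evidence."""
        for _ in range(iters):
            # E-step: posterior probability that an unmentioned fact is true
            p_true_missing = theta * (1 - q1)
            p_false_missing = (1 - theta) * (1 - q0)
            post = p_true_missing / (p_true_missing + p_false_missing)
            # M-step: re-estimate theta from expected counts
            theta = (n_true + post * n_missing) / (n_true + n_false + n_missing)
        return theta

    print(em_missingness(n_true=60, n_false=10, n_missing=30))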

49 [Table 3 column headers: Mentioned literals (%), Complete records (%)] [sent-119, score-0.215]

50 We apply the same method of learning rules (requiring minimum support σ and then taking the τ most confident rules). [sent-139, score-0.241]

51 The collection of rules is treated as a model of the joint distribution over the mentions. [sent-141, score-0.241]

52 Experimental Evaluation: We evaluated our mention model approach using data generated from a known mention model to understand its behavior. [sent-143, score-0.53]

53 Then we compared its performance to the MAR approach on actual extractions from news stories about NFL football games, citizenship, and Somali ship hijackings. [sent-144, score-0.331]

54 The goal of this experiment was to evaluate the ability of our method to learn accurate rules from data that match the assumptions of the algorithm. [sent-146, score-0.285]

55 From this ground truth, we generate a set of mentions for each game as follows. [sent-153, score-0.336]

56 Then each of the remaining literals is mentioned with probability 1−q, where q is a parameter that we varied in the experiments. [sent-155, score-0.257]
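
A hedged sketch of this synthetic-data step; the summary does not say how the initially mentioned literals are chosen, so the always_mention seed below is an assumption:

    import random

    def generate_mentions(game_literals, q, always_mention=(), rng=random):
        """Keep the seed literals, then mention each remaining ground-truth literal
        with probability 1 - q."""
        kept = set(always_mention)
        for lit in game_literals:
            if lit not in kept and rng.random() < 1 - q:
                kept.add(lit)
        return kept

    truth = {("home", "DEN"), ("away", "KC"), ("winner", "DEN"), ("loser", "KC")}
    print(generate_mentions(truth, q=0.4, always_mention=[("home", "DEN")]))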

57 Table 3 shows the average percentage of literals mentioned in each generated “news story” and the percentage of generated “news stories” that mentioned all literals. [sent-156, score-0.412]

58 [Table 4 values] ... The resulting learned rules were evaluated ... [sent-172, score-0.272]

59 Table 4 reports the proportion of complete game records (i. [sent-179, score-0.237]

60 Note that any facts mentioned in the generated articles are automatically correctly inferred, so if no inference was performed at all, the results would match the second row of Table 3. [sent-182, score-0.387]

61 17), the algorithm was able to learn rules that predict well for articles with much higher levels of missing values. [sent-186, score-0.496]

62 62% of the literals are missing in the training dataset, which results in 61. [sent-189, score-0.306]

63 However, as the proportion of missing literals in the training data increases, the algorithm starts learning incorrect rules, so performance drops. [sent-192, score-0.306]

64 Nonetheless, the learned rules are still able to completely and correctly reconstruct 41% of the games! [sent-195, score-0.314]

65 The rules learned under such high levels of missingness are not totally correct. [sent-196, score-0.327]

66 When appropriately weighted in Markov Logic, this is a reasonable rule even though it is not perfectly correct (nor was it a rule that we applied during the synthetic data generation process). [sent-200, score-0.212]

67 13% of literals are mentioned), the learned rules are able to correctly infer 85% of the literals. [sent-213, score-0.517]

68 Experiments with Real Data: We performed experiments on three datasets extracted from news stories: NFL games, citizenship, and Somali ship hijackings. [sent-216, score-0.244]

69 A state-of-the-art information extraction system from BBN ... [Table 6: Statistics on mentions for extracted NFL games (after repairing violations of integrity constraints).] [sent-218, score-0.413]

70 The BBN coreference system attempted to detect and combine multiple mentions of the same game within a single article. [Table 6 column groups: Home/Away and Winner/Loser, each with counts for none, one, or both literals mentioned.] [sent-221, score-1.092]

71 ... games apparently involving more than two teams or where one team achieved multiple scores. [sent-235, score-0.281]

72 To address these problems, we took each extracted game and applied a set of integrity constraints. [sent-236, score-0.327]

73 The integrity constraints were learned automatically from 5 complete game records. [sent-237, score-0.311]

74 Examples of the learned constraints include “Every game has exactly two teams” and “Every game has exactly one winner.” [sent-238, score-0.371]

75 Each extracted game was then converted into multiple games by deleting literals in all possible ways until all of the integrity constraints were satisfied. [sent-239, score-0.593]
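
One way the repair-by-deletion step could be implemented is sketched below (an interpretation of the description above, not the authors' code): enumerate maximal subsets of the extracted literals that satisfy every constraint.

    from itertools import combinations

    def repairs(literals, constraints):
        """Return the maximal subsets of the extracted literals that satisfy every
        integrity constraint; each constraint is a boolean function over a set of literals."""
        literals = list(literals)
        found = []
        for size in range(len(literals), -1, -1):              # try larger subsets first
            for subset in combinations(literals, size):
                s = set(subset)
                if all(c(s) for c in constraints) and not any(s <= f for f in found):
                    found.append(s)                            # keep only maximal repairs
        return found

    # Example constraint in the spirit of "every game has exactly one winner":
    at_most_one_winner = lambda s: sum(1 for pred, _ in s if pred == "winner") <= 1
    print(repairs({("winner", "KC"), ("winner", "DEN"), ("home", "DEN")}, [at_most_one_winner]))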

76 To develop a test set, NFL Test, we manually extracted 55 games from news stories about the 2010 NFL season (which has no overlap with Gigaword V4). [sent-243, score-0.35]

77 It is interesting to ask whether these data are consistent with the explicit mention model versus the missing-at-random model. [Table 7: Observed percentage of cases where exactly one literal is mentioned and the percentage predicted if the literals were missing at random; columns Home/Away and Winner/Loser.] [sent-248, score-0.49]

78 Let us suppose that under MAR, the probability that a fact will be mentioned is p. [sent-249, score-0.359]

79 Then the probability that both literals in a rule (e. [sent-250, score-0.269]

80 ... winner/loser) will be mentioned is p², the probability that both will be missing is (1 − p)², and the probability that exactly one will be mentioned is 2p(1 − p). [sent-256, score-0.573]
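
A worked version of this missing-at-random calculation (the counts in the example call are made up, not the values from Table 7):

    def mar_check(n_none, n_one, n_both):
        """Under MAR with per-literal mention probability p, a pair of literals is both
        mentioned with probability p^2, both missing with (1 - p)^2, and exactly one
        mentioned with 2p(1 - p)."""
        n = n_none + n_one + n_both
        p = (2 * n_both + n_one) / (2 * n)      # MLE of p: fraction of literal slots mentioned
        predicted_one = 2 * p * (1 - p)
        observed_one = n_one / n
        return predicted_one, observed_one

    print(mar_check(n_none=60, n_one=130, n_both=10))
    # A large gap between observed and predicted "exactly one" fractions favors the
    # explicit mention model over MAR.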

81 If the explicit mention model is correct, then the MAR fit will be a poor estimate of the fraction of cases where exactly one literal is missing. [sent-266, score-0.335]

82 We applied both our explicit mention model and the MAR model to the NFL dataset. [sent-272, score-0.291]

83 The cross-validated parameter values for the explicit mention model were = 0. [sent-273, score-0.291]

84 This reflects the extreme difficulty of the test set, where none of the articles mentions all literals involved in any rule. [sent-278, score-0.451]

85 Here are a few examples of the rules that are learned: [sent-282, score-0.241]

86 0.17445: MENTIONTEAMINGAME(g, t1) ∧ MENTIONGAMELOSER(g, t2) ∧ (t1 ≠ t2) ⇒ ¬MENTIONGAMEWINNER(g, t1). The first rule is a weak form of the “fact” rule that if one team is the loser, the other is the winner. [sent-284, score-0.377]

87 The second rule is the corresponding “mention” rule that if the loser is mentioned then the winner is not. [sent-285, score-0.308]

88 The small weights on these rules are difficult to interpret in isolation, because in Markov logic, all of the weights are coupled and there are other learned rules that involve the same literals. [sent-286, score-0.513]

89 In these articles, the citizenship of a person is mentioned 583 times and birthplace only 25 times. [sent-289, score-0.293]

90 Both are mentioned in the same article only 6 times (and of these, birthplace and citizenship are the same in only 4). [sent-290, score-0.259]

91 The cross-validated parameter values for the explicit mention model were = 0. [sent-293, score-0.291]

92 We collected a set of 41 news stories concerning ship hijackings based on ship names taken from the web. [Table 9: Birthplace and Citizenship: Predicted probability assigned to the correct interpretation by the Gricean mention model and the MAR model.] [sent-298, score-0.646]

93 ... extracted all mentions of the ownership country ... [Table 9 row: Citizenship missing.] [sent-305, score-0.336]

94 Twenty-five stories mentioned only one fact (ownership or flag), while 16 mentioned both. [sent-310, score-0.261]

95 ... = 5 and τ = 50; 2-3 EM iterations ... rules from facts extracted from documents. [sent-316, score-0.23]

96 Experiments on synthetic mentions showed that our method is able to correctly reconstruct ... [Table headers: Configuration, Gricean Model, MAR, Pred.] [sent-318, score-0.234]

97 Our three studies provide evidence that news articles obey the maxims across three domains. [sent-327, score-0.373]

98 Indeed, it allows rules to be learned correctly from only a handful of complete training examples. [sent-330, score-0.37]

99 The Gricean mention model predicts that if a news story mentions that a ship was released, then it does not need to mention that the ship was “attacked” or “captured”. [sent-333, score-0.948]

100 Learning rules from incomplete examples via implicit mention models. [sent-382, score-0.532]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ention', 0.369), ('mention', 0.265), ('rules', 0.241), ('gricean', 0.218), ('nfl', 0.218), ('ame', 0.205), ('eam', 0.205), ('mar', 0.199), ('team', 0.191), ('maxims', 0.178), ('literals', 0.176), ('game', 0.17), ('mentions', 0.166), ('facts', 0.155), ('men', 0.122), ('home', 0.121), ('reader', 0.115), ('logic', 0.109), ('articles', 0.109), ('missing', 0.102), ('stories', 0.099), ('citizenship', 0.096), ('rule', 0.093), ('games', 0.09), ('news', 0.086), ('texts', 0.084), ('ship', 0.083), ('birthplace', 0.082), ('integrity', 0.082), ('oser', 0.082), ('writer', 0.082), ('mentioned', 0.081), ('extracted', 0.075), ('horn', 0.072), ('chicagobears', 0.068), ('khadr', 0.068), ('ownership', 0.068), ('country', 0.059), ('em', 0.056), ('predicates', 0.055), ('actp', 0.055), ('conversational', 0.055), ('grice', 0.055), ('missingness', 0.055), ('sanf', 0.055), ('language', 0.05), ('alchemy', 0.048), ('clause', 0.048), ('maxim', 0.048), ('predicate', 0.048), ('documents', 0.046), ('literal', 0.044), ('learn', 0.044), ('correctly', 0.042), ('head', 0.042), ('bbn', 0.041), ('broncos', 0.041), ('doppa', 0.041), ('itizen', 0.041), ('loser', 0.041), ('mentionp', 0.041), ('somali', 0.041), ('records', 0.039), ('percentage', 0.037), ('table', 0.037), ('away', 0.036), ('kansas', 0.036), ('football', 0.036), ('gigaword', 0.036), ('clauses', 0.036), ('denver', 0.036), ('domain', 0.035), ('markov', 0.034), ('person', 0.034), ('propositions', 0.033), ('inner', 0.031), ('concise', 0.031), ('learned', 0.031), ('web', 0.03), ('logical', 0.028), ('training', 0.028), ('complete', 0.028), ('infer', 0.027), ('betteridge', 0.027), ('boschee', 0.027), ('citizen', 0.027), ('etzioni', 0.027), ('extractions', 0.027), ('hijacked', 0.027), ('hruschka', 0.027), ('sorower', 0.027), ('whirl', 0.027), ('explicit', 0.026), ('ag', 0.026), ('synthetic', 0.026), ('incomplete', 0.026), ('orr', 0.024), ('carlson', 0.024), ('coreference', 0.024), ('tioned', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 136 nips-2011-Inverting Grice's Maxims to Learn Rules from Natural Language Extractions

Author: Mohammad S. Sorower, Janardhan R. Doppa, Walker Orr, Prasad Tadepalli, Thomas G. Dietterich, Xiaoli Z. Fern

Abstract: We consider the problem of learning rules from natural language text sources. These sources, such as news articles and web texts, are created by a writer to communicate information to a reader, where the writer and reader share substantial domain knowledge. Consequently, the texts tend to be concise and mention the minimum information necessary for the reader to draw the correct conclusions. We study the problem of learning domain knowledge from such concise texts, which is an instance of the general problem of learning in the presence of missing data. However, unlike standard approaches to missing data, in this setting we know that facts are more likely to be missing from the text in cases where the reader can infer them from the facts that are mentioned combined with the domain knowledge. Hence, we can explicitly model this “missingness” process and invert it via probabilistic inference to learn the underlying domain knowledge. This paper introduces a mention model that models the probability of facts being mentioned in the text based on what other facts have already been mentioned and domain knowledge in the form of Horn clause rules. Learning must simultaneously search the space of rules and learn the parameters of the mention model. We accomplish this via an application of Expectation Maximization within a Markov Logic framework. An experimental evaluation on synthetic and natural text data shows that the method can learn accurate rules and apply them to new texts to make correct inferences. Experiments also show that the method out-performs the standard EM approach that assumes mentions are missing at random. 1

2 0.069703244 127 nips-2011-Image Parsing with Stochastic Scene Grammar

Author: Yibiao Zhao, Song-chun Zhu

Abstract: This paper proposes a parsing algorithm for scene understanding which includes four aspects: computing 3D scene layout, detecting 3D objects (e.g. furniture), detecting 2D faces (windows, doors etc.), and segmenting background. In contrast to previous scene labeling work that applied discriminative classifiers to pixels (or super-pixels), we use a generative Stochastic Scene Grammar (SSG). This grammar represents the compositional structures of visual entities from scene categories, 3D foreground/background, 2D faces, to 1D lines. The grammar includes three types of production rules and two types of contextual relations. Production rules: (i) AND rules represent the decomposition of an entity into sub-parts; (ii) OR rules represent the switching among sub-types of an entity; (iii) SET rules represent an ensemble of visual entities. Contextual relations: (i) Cooperative “+” relations represent positive links between binding entities, such as hinged faces of a object or aligned boxes; (ii) Competitive “-” relations represents negative links between competing entities, such as mutually exclusive boxes. We design an efficient MCMC inference algorithm, namely Hierarchical cluster sampling, to search in the large solution space of scene configurations. The algorithm has two stages: (i) Clustering: It forms all possible higher-level structures (clusters) from lower-level entities by production rules and contextual relations. (ii) Sampling: It jumps between alternative structures (clusters) in each layer of the hierarchy to find the most probable configuration (represented by a parse tree). In our experiment, we demonstrate the superiority of our algorithm over existing methods on public dataset. In addition, our approach achieves richer structures in the parse tree. 1

3 0.067248046 160 nips-2011-Linear Submodular Bandits and their Application to Diversified Retrieval

Author: Yisong Yue, Carlos Guestrin

Abstract: Diversified retrieval and online learning are two core research areas in the design of modern information retrieval systems. In this paper, we propose the linear submodular bandits problem, which is an online learning setting for optimizing a general class of feature-rich submodular utility models for diversified retrieval. We present an algorithm, called LSBG REEDY, and prove that it efficiently converges to a near-optimal model. As a case study, we applied our approach to the setting of personalized news recommendation, where the system must recommend small sets of news articles selected from tens of thousands of available articles each day. In a live user study, we found that LSBG REEDY significantly outperforms existing online learning approaches. 1

4 0.063628688 196 nips-2011-On Strategy Stitching in Large Extensive Form Multiplayer Games

Author: Richard G. Gibson, Duane Szafron

Abstract: Computing a good strategy in a large extensive form game often demands an extraordinary amount of computer memory, necessitating the use of abstraction to reduce the game size. Typically, strategies from abstract games perform better in the real game as the granularity of abstraction is increased. This paper investigates two techniques for stitching a base strategy in a coarse abstraction of the full game tree, to expert strategies in fine abstractions of smaller subtrees. We provide a general framework for creating static experts, an approach that generalizes some previous strategy stitching efforts. In addition, we show that static experts can create strong agents for both 2-player and 3-player Leduc and Limit Texas Hold’em poker, and that a specific class of static experts can be preferred among a number of alternatives. Furthermore, we describe a poker agent that used static experts and won the 3-player events of the 2010 Annual Computer Poker Competition.

5 0.063150756 218 nips-2011-Predicting Dynamic Difficulty

Author: Olana Missura, Thomas Gärtner

Abstract: Motivated by applications in electronic games as well as teaching systems, we investigate the problem of dynamic difficulty adjustment. The task here is to repeatedly find a game difficulty setting that is neither ‘too easy’ and bores the player, nor ‘too difficult’ and overburdens the player. The contributions of this paper are (i) the formulation of difficulty adjustment as an online learning problem on partially ordered sets, (ii) an exponential update algorithm for dynamic difficulty adjustment, (iii) a bound on the number of wrong difficulty settings relative to the best static setting chosen in hindsight, and (iv) an empirical investigation of the algorithm when playing against adversaries. 1

6 0.06021687 201 nips-2011-On the Completeness of First-Order Knowledge Compilation for Lifted Probabilistic Inference

7 0.057188455 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

8 0.037495032 17 nips-2011-Accelerated Adaptive Markov Chain for Partition Function Computation

9 0.034995951 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search

10 0.033856086 27 nips-2011-Advice Refinement in Knowledge-Based SVMs

11 0.032215737 102 nips-2011-Generalised Coupled Tensor Factorisation

12 0.031971779 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries

13 0.031279933 247 nips-2011-Semantic Labeling of 3D Point Clouds for Indoor Scenes

14 0.030876691 249 nips-2011-Sequence learning with hidden units in spiking neural networks

15 0.029391007 12 nips-2011-A Two-Stage Weighting Framework for Multi-Source Domain Adaptation

16 0.0284498 202 nips-2011-On the Universality of Online Mirror Descent

17 0.027506009 5 nips-2011-A Denoising View of Matrix Completion

18 0.027097007 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs

19 0.026840331 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data

20 0.026812145 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.089), (1, -0.009), (2, -0.007), (3, 0.015), (4, 0.018), (5, -0.035), (6, -0.014), (7, -0.01), (8, -0.023), (9, 0.02), (10, 0.036), (11, -0.038), (12, -0.007), (13, -0.029), (14, 0.011), (15, -0.015), (16, 0.004), (17, 0.003), (18, 0.048), (19, -0.005), (20, 0.021), (21, 0.038), (22, -0.061), (23, -0.011), (24, 0.021), (25, 0.036), (26, -0.049), (27, 0.018), (28, 0.006), (29, 0.028), (30, -0.07), (31, -0.105), (32, -0.056), (33, 0.033), (34, 0.071), (35, -0.059), (36, 0.082), (37, -0.028), (38, -0.064), (39, 0.022), (40, -0.167), (41, -0.031), (42, -0.086), (43, 0.007), (44, -0.12), (45, 0.099), (46, -0.032), (47, -0.007), (48, -0.123), (49, -0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93056798 136 nips-2011-Inverting Grice's Maxims to Learn Rules from Natural Language Extractions

Author: Mohammad S. Sorower, Janardhan R. Doppa, Walker Orr, Prasad Tadepalli, Thomas G. Dietterich, Xiaoli Z. Fern

Abstract: We consider the problem of learning rules from natural language text sources. These sources, such as news articles and web texts, are created by a writer to communicate information to a reader, where the writer and reader share substantial domain knowledge. Consequently, the texts tend to be concise and mention the minimum information necessary for the reader to draw the correct conclusions. We study the problem of learning domain knowledge from such concise texts, which is an instance of the general problem of learning in the presence of missing data. However, unlike standard approaches to missing data, in this setting we know that facts are more likely to be missing from the text in cases where the reader can infer them from the facts that are mentioned combined with the domain knowledge. Hence, we can explicitly model this “missingness” process and invert it via probabilistic inference to learn the underlying domain knowledge. This paper introduces a mention model that models the probability of facts being mentioned in the text based on what other facts have already been mentioned and domain knowledge in the form of Horn clause rules. Learning must simultaneously search the space of rules and learn the parameters of the mention model. We accomplish this via an application of Expectation Maximization within a Markov Logic framework. An experimental evaluation on synthetic and natural text data shows that the method can learn accurate rules and apply them to new texts to make correct inferences. Experiments also show that the method out-performs the standard EM approach that assumes mentions are missing at random. 1

2 0.80486828 196 nips-2011-On Strategy Stitching in Large Extensive Form Multiplayer Games

Author: Richard G. Gibson, Duane Szafron

Abstract: Computing a good strategy in a large extensive form game often demands an extraordinary amount of computer memory, necessitating the use of abstraction to reduce the game size. Typically, strategies from abstract games perform better in the real game as the granularity of abstraction is increased. This paper investigates two techniques for stitching a base strategy in a coarse abstraction of the full game tree, to expert strategies in fine abstractions of smaller subtrees. We provide a general framework for creating static experts, an approach that generalizes some previous strategy stitching efforts. In addition, we show that static experts can create strong agents for both 2-player and 3-player Leduc and Limit Texas Hold’em poker, and that a specific class of static experts can be preferred among a number of alternatives. Furthermore, we describe a poker agent that used static experts and won the 3-player events of the 2010 Annual Computer Poker Competition.

3 0.66897345 218 nips-2011-Predicting Dynamic Difficulty

Author: Olana Missura, Thomas Gärtner

Abstract: Motivated by applications in electronic games as well as teaching systems, we investigate the problem of dynamic difficulty adjustment. The task here is to repeatedly find a game difficulty setting that is neither ‘too easy’ and bores the player, nor ‘too difficult’ and overburdens the player. The contributions of this paper are (i) the formulation of difficulty adjustment as an online learning problem on partially ordered sets, (ii) an exponential update algorithm for dynamic difficulty adjustment, (iii) a bound on the number of wrong difficulty settings relative to the best static setting chosen in hindsight, and (iv) an empirical investigation of the algorithm when playing against adversaries. 1

4 0.60630006 201 nips-2011-On the Completeness of First-Order Knowledge Compilation for Lifted Probabilistic Inference

Author: Guy Van den Broeck

Abstract: Probabilistic logics are receiving a lot of attention today because of their expressive power for knowledge representation and learning. However, this expressivity is detrimental to the tractability of inference, when done at the propositional level. To solve this problem, various lifted inference algorithms have been proposed that reason at the first-order level, about groups of objects as a whole. Despite the existence of various lifted inference approaches, there are currently no completeness results about these algorithms. The key contribution of this paper is that we introduce a formal definition of lifted inference that allows us to reason about the completeness of lifted inference algorithms relative to a particular class of probabilistic models. We then show how to obtain a completeness result using a first-order knowledge compilation approach for theories of formulae containing up to two logical variables. 1 Introduction and related work Probabilistic logic models build on first-order logic to capture relational structure and on graphical models to represent and reason about uncertainty [1, 2]. Due to their expressivity, these models can concisely represent large problems with many interacting random variables. While the semantics of these logics is often defined through grounding the models [3], performing inference at the propositional level is – as for first-order logic – inefficient. This has motivated the quest for lifted inference methods that exploit the structure of probabilistic logic models for efficient inference, by reasoning about groups of objects as a whole and avoiding repeated computations. The first approaches to exact lifted inference have upgraded the variable elimination algorithm to the first-order level [4, 5, 6]. More recent work is based on methods from logical inference [7, 8, 9, 10], such as knowledge compilation. While these approaches often yield dramatic improvements in runtime over propositional inference methods on specific problems, it is still largely unclear for which classes of models these lifted inference operators will be useful and for which ones they will eventually have to resort to propositional inference. One notable exception in this regard is lifted belief propagation [11], which performs exact lifted inference on any model whose factor graph representation is a tree. A first contribution of this paper is that we introduce a notion of domain lifted inference, which formally defines what lifting means, and which can be used to characterize the classes of probabilistic models to which lifted inference applies. Domain lifted inference essentially requires that probabilistic inference runs in polynomial time in the domain size of the logical variables appearing in the model. As a second contribution we show that the class of models expressed as 2-WFOMC formulae (weighted first-order model counting with up to 2 logical variables per formula) can be domain lifted using an extended first-order knowledge compilation approach [10]. The resulting approach allows for lifted inference even in the presence of (anti-) symmetric or total relations in a theory. These are extremely common and useful concepts that cannot be lifted by any of the existing first-order knowledge compilation inference rules. 1 2 Background We will use standard concepts of function-free first-order logic (FOL). An atom p(t1 , . . . , tn ) consists of a predicate p/n of arity n followed by n arguments, which are either constants or logical variables. 
An atom is ground if it does not contain any variables. A literal is an atom a or its negation ¬a. A clause is a disjunction l1 ∨ ... ∨ lk of literals. If k = 1, it is a unit clause. An expression is an atom, literal or clause. The pred(a) function maps an atom to its predicate and the vars(e) function maps an expression to its logical variables. A theory in conjunctive normal form (CNF) is a conjunction of clauses. We often represent theories by their set of clauses and clauses by their set of literals. Furthermore, we will assume that all logical variables are universally quantified. In addition, we associate a set of constraints with each clause or atom, either of the form X = t, where X is a logical variable and t is a constant or variable, or of the form X ∈ D, where D is a domain, or the negation of these constraints. These define a finite domain for each logical variable. Abusing notation, we will use constraints of the form X = t to denote a substitution of X by t. The function atom(e) maps an expression e to its atoms, now associating the constraints on e with each atom individually. To add the constraint c to an expression e, we use the notation e ∧ c. Two atoms unify if there is a substitution which makes them identical and if the conjunction of the constraints on both atoms with the substitution is satisfiable. Two expressions e1 and e2 are independent, written e1 ⊥ e2 , if no atom a1 ∈ atom(e1 ) unifies with an atom a2 ∈ atom(e2 ). ⊥ We adopt the Weighted First-Order Model Counting (WFOMC) [10] formalism to represent probabilistic logic models, building on the notion of a Herbrand interpretation. Herbrand interpretations are subsets of the Herbrand base HB (T ), which consists of all ground atoms that can be constructed with the available predicates and constant symbols in T . The atoms in a Herbrand interpretation are assumed to be true. All other atoms in HB (T ) are assumed to be false. An interpretation I satisfies a theory T , written as I |= T , if it satisfies all the clauses c ∈ T . The WFOMC problem is defined on a weighted logic theory T , which is a logic theory augmented with a positive weight function w and a negative weight function w, which assign a weight to each predicate. The WFOMC problem involves computing wmc(T, w, w) = w(pred(a)) I|=T a∈I 3 3.1 w(pred(a)). (1) a∈HB(T )\I First-order knowledge compilation for lifted probabilistic inference Lifted probabilistic inference A first-order probabilistic model defines a probability distribution P over the set of Herbrand interpretations H. Probabilistic inference in these models is concerned with computing the posterior probability P(q|e) of query q given evidence e, where q and e are logical expressions in general: P(q|e) = h∈H,h|=q∧e P(h) h∈H,h|=e P(h) (2) We propose one notion of lifted inference for first-order probabilistic models, defined in terms of the computational complexity of inference w.r.t. the domains of logical variables. It is clear that other notions of lifted inference are conceivable, especially in the case of approximate inference. Definition 1 (Domain Lifted Probabilistic Inference). A probabilistic inference procedure is domain lifted for a model m, query q and evidence e iff the inference procedure runs in polynomial time in |D1 |, . . . , |Dk | with Di the domain of the logical variable vi ∈ vars(m, q, e). 
Domain lifted inference does not prohibit the algorithm to be exponential in the size of the vocabulary, that is, the number of predicates, arguments and constants, of the probabilistic model, query and evidence. In fact, the definition allows inference to be exponential in the number of constants which occur in arguments of atoms in the theory, query or evidence, as long as it is polynomial in the cardinality of the logical variable domains. This definition of lifted inference stresses the ability to efficiently deal with the domains of the logical variables that arise, regardless of their size, and formalizes what seems to be generally accepted in the lifted inference literature. 2 A class of probabilistic models is a set of probabilistic models expressed in a particular formalism. As examples, consider Markov logic networks (MLN) [12] or parfactors [4], or the weighted FOL theories for WFOMC that we introduced above, when the weights are normalized. Definition 2 (Completeness). Restricting queries to atoms and evidence to a conjunction of literals, a procedure that is domain lifted for all probabilistic models m in a class of models M and for all queries q and evidence e, is called complete for M . 3.2 First-order knowledge compilation First-order knowledge compilation is an approach to lifted probabilistic inference consisting of the following three steps (see Van den Broeck et al. [10] for details): 1. Convert the probabilistic logical model to a weighted CNF. Converting MLNs or parfactors requires adding new atoms to the theory that represent the (truth) value of each factor or formula. set-disjunction 2 friends(X, Y ) ∧ smokes(X) ⇒ smokes(Y ) Smokers ⊆ People decomposable conjunction unit clause leaf (a) MLN Model ∧ smokes(X), X ∈ Smokers smokes(Y ) ∨ ¬ smokes(X) ∨¬ friends(X, Y ) ∨ ¬ f(X, Y ) friends(X, Y ) ∨ f(X, Y ) smokes(X) ∨ f(X, Y ) ¬ smokes(Y ) ∨ f(X, Y ). ∧ f(X, Y ), Y ∈ Smokers ∧ ¬ smokes(Y ), Y ∈ Smokers / ∧ f(X, Y ), X ∈ Smokers, Y ∈ Smokers / / x ∈ Smokers (b) CNF Theory deterministic disjunction Predicate friends smokes f w 1 1 e2 w 1 1 1 y ∈ Smokers / ∨ set-conjunction ∧ f(x, y) (c) Weight Functions ¬ friends(x, y) ∧ friends(x, y) ¬ f(x, y) (d) First-Order d-DNNF Circuit Figure 1: Friends-smokers example (taken from [10]) Example 1. The MLN in Figure 1a assigns a weight to a formula in FOL. Figure 1b represents the same model as a weighted CNF, introducing a new atom f(X, Y ) to encode the truth value of the MLN formula. The probabilistic information is captured by the weight functions in Figure 1c. 2. Compile the logical theory into a First-Order d-DNNF (FO d-DNNF) circuit. Figure 1d shows an example of such a circuit. Leaves represent unit clauses. Inner nodes represent the disjunction or conjunction of their children l and r, but with the constraint that disjunctions must be deterministic (l ∧ r is unsatisfiable) and conjunctions must be decomposable (l ⊥ r). ⊥ 3. Perform WFOMC inference to compute posterior probabilities. In a FO d-DNNF circuit, WFOMC is polynomial in the size of the circuit and the cardinality of the domains. To compile the CNF theory into a FO d-DNNF circuit, Van den Broeck et al. [10] propose a set of compilation rules, which we will refer to as CR 1 . We will now briefly describe these rules. Unit Propagation introduces a decomposable conjunction when the theory contains a unit clause. Independence creates a decomposable conjunction when the theory contains independent subtheories. 
Shannon decomposition applies when the theory contains ground atoms and introduces a deterministic disjunction between two modified theories: one where the ground atom is true, and one where it is false. Shattering splits clauses in the theory until all pairs of atoms represent either a disjoint or identical set of ground atoms. Example 2. In Figure 2a, the first two clauses are made independent from the friends(X, X) clause and split off in a decomposable conjunction by unit propagation. The unit clause becomes a leaf of the FO d-DNNF circuit, while the other operand requires further compilation. 3 dislikes(X, Y ) ∨ friends(X, Y ) fun(X) ∨ ¬ friends(X, Y ) friends(X, Y ) ∨ dislikes(X, Y ) ¬ friends(X, Y ) ∨ likes(X, Y ) friends(X, X) fun(X) ∨ ¬ friends(X, Y ) fun(X) ∨ ¬ friends(Y, X) ∧ FunPeople ⊆ People x ∈ People friends(X, X) friends(X, Y ) ∨ dislikes(X, Y ), X = Y ¬ friends(X, Y ) ∨ likes(X, Y ), X = Y likes(X, X) dislikes(x, Y ) ∨ friends(x, Y ) fun(x) ∨ ¬ friends(x, Y ) fun(X), X ∈ FunPeople ¬ fun(X), X ∈ FunPeople / fun(X) ∨ ¬ friends(X, Y ) fun(X) ∨ ¬ friends(Y, X) (a) Unit propagation of friends(X, X) (b) Independent partial grounding (c) Atom counting of fun(X) Figure 2: Examples of compilation rules. Circles are FO d-DNNF inner nodes. White rectangles show theories before and after applying the rule. All variable domains are People. (taken from [10]) Independent Partial Grounding creates a decomposable conjunction over a set of child circuits, which are identical up to the value of a grounding constant. Since they are structurally identical, only one child circuit is actually compiled. Atom Counting applies when the theory contains an atom with a single logical variable X ∈ D. It explicitly represents the domain D ⊆ D of X for which the atom is true. It compiles the theory into a deterministic disjunction between all possible such domains. Again, these child circuits are identical up to the value of D and only one is compiled. Example 3. The theory in Figure 2b is compiled into a decomposable set-conjunction of theories that are independent and identical up to the value of the x constant. The theory in Figure 2c contains an atom with one logical variable: fun(X). Atom counting compiles it into a deterministic setdisjunction over theories that differ in FunPeople, which is the domain of X for which fun(X) is true. Subsequent steps of unit propagation remove the fun(X) atoms from the theory entirely. 3.3 Completeness We will now characterize those theories where the CR 1 compilation rules cannot be used, and where the inference procedure has to resort to grounding out the theory to propositional logic. For these, first-order knowledge compilation using CR 1 is not yet domain lifted. When a logical theory contains symmetric, anti-symmetric or total relations, such as friends(X, Y ) ⇒ friends(Y, X), parent(X, Y ) ⇒ ¬ parent(Y, X), X = Y, ≤ (X, Y) ∨ ≤ (Y, X), or more general formulas, such as enemies(X, Y ) ⇒ ¬ friend(X, Y ) ∧ ¬ friend(Y, X), (3) (4) (5) (6) none of the CR 1 rules apply. Intuitively, the underlying problem is the presence of either: • Two unifying (not independent) atoms in the same clause which contain the same logical variable in different positions of the argument list. Examples include (the CNF of) Formulas 3, 4 and 5, where the X and Y variable are bound by unifying two atoms from the same clause. • Two logical variables that bind when unifying one pair of atoms but appear in different positions of the argument list of two other unifying atoms. 
Examples include Formula 6, which in CNF is ¬ friend(X, Y ) ∨ ¬ enemies(X, Y ) ¬ friend(Y, X) ∨ ¬ enemies(X, Y ) Here, unifying the enemies(X, Y ) atoms binds the X variables from both clauses, which appear in different positions of the argument lists of the unifying atoms friend(X, Y ) and friend(Y, X). Both of these properties preclude the use of CR 1 rules. Also in the context of other model classes, such as MLNs, probabilistic versions of the above formulas cannot be processed by CR 1 rules. 4 Even though first-order knowledge compilation with CR 1 rules does not have a clear completeness result, we can show some properties of theories to which none of the compilation rules apply. First, we need to distinguish between the arity of an atom and its dimension. A predicate with arity two might have atoms with dimension one, when one of the arguments is ground or both are identical. Definition 3 (Dimension of an Expression). The dimension of an expression e is the number of logical variables it contains: dim(e) = | vars(e)|. Lemma 1 (CR 1 Postconditions). The CR 1 rules remove all atoms from the theory T which have zero or one logical variable arguments, such that afterwards ∀a ∈ atom(T ) : dim(a) > 1. When no CR 1 rule applies, the theory is shattered and contains no independent subtheories. Proof. Ground atoms are removed by the Shannon decomposition operator followed by unit propagation. Atoms with a single logical variable (including unary relations) are removed by the atom counting operator followed by unit propagation. If T contains independent subtheories, the independence operator can be applied. Shattering is always applied when T is not yet shattered. 4 Extending first-order knowledge compilation In this section we introduce a new operator which does apply to the theories from Section 3.3. 4.1 Logical variable properties To formally define the operator we propose, and prove its correctness, we first introduce some mathematical concepts related to the logical variables in a theory (partly after Jha et al. [8]). Definition 4 (Binding Variables). Two logical variables X, Y are directly binding b(X, Y ) if they are bound by unifying a pair of atoms in the theory. The binding relationship b+ (X, Y ) is the transitive closure of the directly binding relation b(X, Y ). Example 4. In the theory ¬ p(W, X) ∨ ¬ q(X) r(Y ) ∨ ¬ q(Y ) ¬ r(Z) ∨ s(Z) the variable pairs (X, Y ) and (Y, Z) are directly binding. The variables X, Y and Z are binding. Variable W does not bind to any other variable. Note that the binding relationship b+ (X, Y ) is an equivalence relation that defines two equivalence classes: {X, Y, Z} and {W }. Lemma 2 (Binding Domains). After shattering, binding logical variables have identical domains. Proof. During shattering (see Section 3.2), when two atoms unify, binding two variables with partially overlapping domains, the atoms’ clauses are split up into clauses where the domain of the variables is identical, and clauses where the domains are disjoint and the atoms no longer unify. Definition 5 (Root Binding Class). A root variable is a variable that appears in all the atoms in its clause. A root binding class is an equivalence class of binding variables where all variables are root. Example 5. In the theory of Example 4, {X, Y, Z} is a root binding class and {W } is not. 4.2 Domain recursion We will now introduce the new domain recursion operator, starting with its preconditions. Definition 6. 
A theory allows for domain recursion when (i) the theory is shattered, (ii) the theory contains no independent subtheories and (iii) there exists a root binding class. From now on, we will denote with C the set of clauses of the theory at hand and with B a root binding class guaranteed to exist if C allows for domain recursion. Lemma 2 states that all variables in B have identical domains. We will denote the domain of these variables with D. The intuition behind the domain recursion operator is that it modifies D by making one element explicit: D = D ∪ {xD } with xD ∈ D . This explicit domain element is introduced by the S PLIT D / function, which splits clauses w.r.t. the new subdomain D and element xD . 5 Definition 7 (S PLIT D). For a clause c and given set of variables Vc ⊆ vars(c) with domain D, let S PLIT D(c, Vc ) = c, if Vc = ∅ S PLIT D(c1 , Vc \ {V }) ∪ S PLIT D(c2 , Vc \ {V }), if Vc = ∅ (7) where c1 = c ∧ (V = xD ) and c2 = c ∧ (V = xD ) ∧ (V ∈ D ) for some V ∈ Vc . For a set of clauses C and set of variables V with domain D: S PLIT D(C, V) = c∈C S PLIT D(c, V ∩ vars(c)). The domain recursion operator creates three sets of clauses: S PLIT D(C, B) = Cx ∪ Cv ∪ Cr , with Cx = {c ∧ (V = xD )|c ∈ C}, (8) V ∈B∩vars(c) Cv = {c ∧ (V = xD ) ∧ (V ∈ D )|c ∈ C}, (9) V ∈B∩vars(c) Cr = S PLIT D(C, B) \ Cx \ Cv . (10) Proposition 3. The conjunction of the domain recursion sets is equivalent to the original theory: c∈Cr c . c∈Cv c ∧ c∈C c ≡ c∈S PLIT D(C,B) c and therefore c∈C c ≡ c∈Cx c ∧ We will now show that these sets are independent and that their conjunction is decomposable. Theorem 4. The theories Cx , Cv and Cr are independent: Cx ⊥ Cv , Cx ⊥ Cr and Cv ⊥ Cr . ⊥ ⊥ ⊥ The proof of Theorem 4 relies on the following Lemma. Lemma 5. If the theory allows for domain recursion, all clauses and atoms contain the same number of variables from B: ∃n, ∀c ∈ C, ∀a ∈ atom(C) : | vars(c) ∩ B | = | vars(a) ∩ B | = n. c Proof. Denote with Cn the clauses in C that contain n logical variables from B and with Cn its compliment in C. If C is nonempty, there is a n > 0 for which Cn is nonempty. Then every atom in Cn contains exactly n variables from B (Definition 5). Since the theory contains no independent c c subtheories, there must be an atom a in Cn which unifies with an atom ac in Cn , or Cn is empty. After shattering, all unifications bind one variable from a to a single variable from ac . Because a contains exactly n variables from B, ac must also contain exactly n (Definition 4), and because B is c a root binding class, the clause of ac also contains exactly n, which contradicts the definition of Cn . c Therefore, Cn is empty, and because the variables in B are root, they also appear in all atoms. Proof of Theorem 4. From Lemma 5, all atoms in C contain the same number of variables from B. In Cx , these variables are all constrained to be equal to xD , while in Cv and Cr at least one variable is constrained to be different from xD . An attempt to unify an atom from Cx with an atom from Cv or Cr therefore creates an unsatisfiable set of constraints. Similarly, atoms from Cv and Cr cannot be unified. Finally, we extend the FO d-DNNF language proposed in Van den Broeck et al. [10] with a new node, the recursive decomposable conjunction ∧ r , and define the domain recursion compilation rule. Definition 8 ( ∧ r ). The FO d-DNNF node ∧ r (nx , nr , D, D , V) represents a decomposable conjunction between the d-DNNF nodes nx , nr and a d-DNNF node isomorphic to the ∧ r node itself. 
In particular, the isomorphic operand is identical to the node itself, except for the size of the domain of the variables in V, which becomes one smaller, going from D to D in the isomorphic operand. We have shown that the conjunction between sets Cx , Cv and Cr is decomposable (Theorem 4) and logically equivalent to the original theory (Proposition 3). Furthermore, Cv is identical to C, up to the constraints on the domain of the variables in B. This leads us to the following definition of domain recursion. Definition 9 (Domain Recursion). The domain recursion compilation rule compiles C into ∧ r (nx , nr , D, D , B), where nx , nr are the compiled circuits for Cx , Cr . The third set Cv is represented by the recursion on D, according to Definition 8. 6 nv Cr ∧r ¬ friends(x, X) ∨ friends(X, x), X = x ¬ friends(X, x) ∨ friends(x, X), X = x P erson ← P erson \ {x} nr nx ∨ Cx ¬ friends(x, x) ∨ friends(x, x) x ∈P erson x =x ¬ friends(x, x) friends(x, x) ∨ ∧ ¬ friends(x, x ) ¬ friends(x , x) ∧ friends(x, x ) friends(x , x) Figure 3: Circuit for the symmetric relation in Equation 3, rooted in a recursive conjunction. Example 6. Figure 3 shows the FO d-DNNF circuit for Equation 3. The theory is split up into three independent theories: Cr and Cx , shown in the Figure 3, and Cv = {¬ friends(X, Y ) ∨ friends(Y, X), X = x, Y = x}. The conjunction of these theories is equivalent to Equation 3. Theory Cv is identical to Equation 3, up to the inequality constraints on X and Y . Theorem 6. Given a function size, which maps domains to their size, the weighted first-order model count of a ∧ r (nx , nr , D, D , V) node is size(D) wmc( ∧ r (nx , nr , D, D , V), size) = wmc(nx , size)size(D) wmc(nr , size ∪{D → s}), s=0 (11) where size ∪{D → s} adds to the size function that the subdomain D has cardinality s. Proof. If C allows for domain recursion, due to Theorem 4, the weighted model count is wmc(C, size) = 1, if size(D) = 0 wmc(Cx ) · wmc(Cv , size ) · wmc(Cr , size ) if size(D) > 0 (12) where size = size ∪{D → size(D) − 1}. Theorem 7. The Independent Partial Grounding compilation rule is a special case of the domain recursion rule, where ∀c ∈ C : | vars(c) ∩ B | = 1 (and therefore Cr = ∅). 4.3 Completeness In this section, we introduce a class of models for which first-order knowledge compilation with domain recursion is complete. Definition 10 (k-WFOMC). The class of k-WFOMC consist of WFOMC theories with clauses that have up to k logical variables. A first completeness result is for 2-WFOMC, using the set of knowledge compilation rules CR 2 , which are the rules in CR 1 extended with domain recursion. Theorem 8 (Completeness for 2-WFOMC). First-order knowledge compilation using the CR 2 compilation rules is a complete domain lifted probabilistic inference algorithm for 2-WFOMC. Proof. From Lemma 1, after applying the CR 1 rules, the theory contains only atoms with dimension larger than or equal to two. From Definition 10, each clause has dimension smaller than or equal to two. Therefore, each logical variable in the theory is a root variable and according to Definition 5, every equivalence class of binding variables is a root binding class. Because of Lemma 1, the theory allows for domain recursion, which requires further compilation of two theories: Cx and Cr into nx and nr . Both have dimension smaller than 2 and can be lifted by CR 1 compilation rules. The properties of 2-WFOMC are a sufficient but not necessary condition for first-order knowledge compilation to be domain lifted. 
4.3 Completeness

In this section, we introduce a class of models for which first-order knowledge compilation with domain recursion is complete.

Definition 10 (k-WFOMC). The class of k-WFOMC consists of WFOMC theories with clauses that have up to k logical variables.

A first completeness result is for 2-WFOMC, using the set of knowledge compilation rules CR2, which are the rules in CR1 extended with domain recursion.

Theorem 8 (Completeness for 2-WFOMC). First-order knowledge compilation using the CR2 compilation rules is a complete domain lifted probabilistic inference algorithm for 2-WFOMC.

Proof. From Lemma 1, after applying the CR1 rules, the theory contains only atoms with dimension larger than or equal to two. From Definition 10, each clause has dimension smaller than or equal to two. Therefore, each logical variable in the theory is a root variable, and according to Definition 5, every equivalence class of binding variables is a root binding class. Because of Lemma 1, the theory allows for domain recursion, which requires further compilation of two theories, Cx and Cr, into nx and nr. Both have dimension smaller than 2 and can be lifted by CR1 compilation rules.

The properties of 2-WFOMC are a sufficient but not necessary condition for first-order knowledge compilation to be domain lifted. We can obtain a similar result for MLNs or parfactors by reducing them to a WFOMC problem. If an MLN contains only formulae with up to k logical variables, then its WFOMC representation will be in k-WFOMC.

This result for 2-WFOMC is not trivial. Van den Broeck et al. [10] showed in their experiments that counting first-order variable elimination (C-FOVE) [6] fails to lift the "Friends Smoker Drinker" problem, which is in 2-WFOMC. We will show in the next section that the CR1 rules fail to lift the theory in Figure 4a, which is in 2-WFOMC. Note that there are also useful theories that are not in 2-WFOMC, such as those containing the transitive relation friends(X, Y) ∧ friends(Y, Z) ⇒ friends(X, Z).

5 Empirical evaluation

To complement the theoretical results of the previous section, we extended the WFOMC implementation¹ with the domain recursion rule. We performed experiments with the theory in Figure 4a, which is a version of the friends and smokers model [11] extended with the symmetric relation of Equation 3. We evaluate the performance of querying P(smokes(bob)) with increasing domain size, comparing our approach to the existing WFOMC implementation and to its propositional counterpart, which first grounds the theory and then compiles it with the c2d compiler [13] to a propositional d-DNNF circuit. We did not compare to C-FOVE [6] because it cannot perform lifted inference on this model.

[Figure 4: Symmetric friends and smokers experiment, comparing propositional knowledge compilation (c2d) to WFOMC using compilation rules CR1 and CR2 (which includes domain recursion). (a) MLN model: 2  smokes(X) ∧ friends(X, Y) ⇒ smokes(Y)  and  friends(X, Y) ⇒ friends(Y, X). (b) Evaluation runtime [s], log scale from 0.01 to 10000, against the number of people (10 to 80), for c2d, WFOMC-CR1 and WFOMC-CR2.]

Propositional inference quickly becomes intractable when there are more than 20 people. The lifted inference algorithms scale much better. The CR1 rules can exploit some regularities in the model. For example, they eliminate all the smokes(X) atoms from the theory. They do, however, resort to grounding at a later stage of the compilation process. With the domain recursion rule, there is no need for grounding. This advantage is clear in the experiments: our approach has an almost constant inference time over this range of domain sizes. Note that the runtimes for c2d include compilation and evaluation of the circuit, whereas the WFOMC runtimes only represent evaluation of the FO d-DNNF. After all, propositional compilation depends on the domain size but first-order compilation does not. First-order compilation takes a constant two seconds for both rule sets.
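To see concretely why the propositional baseline degrades, one can ground the Figure 4a model for a domain of n people and sum over all worlds by brute force. The sketch below does exactly that for tiny n, reading the leading 2 as the weight of the soft implication and the trailing period as marking the symmetry formula as a hard constraint (the usual MLN conventions); it is a naive enumerator for illustration, not the c2d or WFOMC pipeline.

```python
from itertools import product
from math import exp

def brute_force_partition(n, w=2.0):
    """Weighted model count (partition function) of the grounded Figure 4a model,
    enumerating every truth assignment to smokes(i) and friends(i, j)."""
    people = range(n)
    f_atoms = [(i, j) for i in people for j in people]   # friends(i, j), reflexive pairs included
    Z = 0.0
    for smokes in product([False, True], repeat=n):
        for friends_bits in product([False, True], repeat=len(f_atoms)):
            friends = dict(zip(f_atoms, friends_bits))
            # hard constraint: friends(X, Y) => friends(Y, X) must hold in every counted world
            if any(friends[(i, j)] and not friends[(j, i)] for (i, j) in f_atoms):
                continue
            # soft formula (weight w): smokes(X) ∧ friends(X, Y) => smokes(Y)
            satisfied = sum(
                1 for (i, j) in f_atoms
                if not (smokes[i] and friends[(i, j)]) or smokes[j]
            )
            Z += exp(w * satisfied)
    return Z

# n + n^2 ground atoms means 2^(n + n^2) worlds; enumeration is hopeless beyond a handful of people.
for n in range(1, 4):
    print(n, brute_force_partition(n))
```

The grounded theory has n + n² atoms and n² groundings per formula, and even sophisticated propositional compilation has to cope with this quadratic blow-up, which is consistent with the c2d curve in Figure 4b becoming intractable beyond roughly 20 people while the lifted approaches stay nearly flat.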
6 Conclusions

We proposed a definition of complete domain lifted probabilistic inference w.r.t. classes of probabilistic logic models. This definition considers algorithms to be lifted if they are polynomial in the size of logical variable domains. Existing first-order knowledge compilation turns out not to admit an intuitive completeness result. Therefore, we generalized the existing Independent Partial Grounding compilation rule to the domain recursion rule. With this one extra rule, we showed that first-order knowledge compilation is complete for a significant class of probabilistic logic models, where the WFOMC representation has up to two logical variables per clause.

Acknowledgments

The author would like to thank Luc De Raedt, Jesse Davis and the anonymous reviewers for valuable feedback. This work was supported by the Research Foundation-Flanders (FWO-Vlaanderen).

¹ http://dtai.cs.kuleuven.be/wfomc/

References

[1] Lise Getoor and Ben Taskar, editors. An Introduction to Statistical Relational Learning. MIT Press, 2007.
[2] Luc De Raedt, Paolo Frasconi, Kristian Kersting, and Stephen Muggleton, editors. Probabilistic Inductive Logic Programming: Theory and Applications. Springer-Verlag, Berlin, Heidelberg, 2008.
[3] Daan Fierens, Guy Van den Broeck, Ingo Thon, Bernd Gutmann, and Luc De Raedt. Inference in probabilistic logic programs using weighted CNF's. In Proceedings of UAI, pages 256–265, 2011.
[4] David Poole. First-order probabilistic inference. In Proceedings of IJCAI, pages 985–991, 2003.
[5] Rodrigo de Salvo Braz, Eyal Amir, and Dan Roth. Lifted first-order probabilistic inference. In Proceedings of IJCAI, pages 1319–1325, 2005.
[6] Brian Milch, Luke S. Zettlemoyer, Kristian Kersting, Michael Haimes, and Leslie Pack Kaelbling. Lifted probabilistic inference with counting formulas. In Proceedings of AAAI, pages 1062–1068, 2008.
[7] Vibhav Gogate and Pedro Domingos. Exploiting logical structure in lifted probabilistic inference. In Proceedings of StarAI, 2010.
[8] Abhay Jha, Vibhav Gogate, Alexandra Meliou, and Dan Suciu. Lifted inference seen from the other side: The tractable features. In Proceedings of NIPS, 2010.
[9] Vibhav Gogate and Pedro Domingos. Probabilistic theorem proving. In Proceedings of UAI, pages 256–265, 2011.
[10] Guy Van den Broeck, Nima Taghipour, Wannes Meert, Jesse Davis, and Luc De Raedt. Lifted probabilistic inference by first-order knowledge compilation. In Proceedings of IJCAI, pages 2178–2185, 2011.
[11] Parag Singla and Pedro Domingos. Lifted first-order belief propagation. In Proceedings of AAAI, pages 1094–1099, 2008.
[12] Matthew Richardson and Pedro Domingos. Markov logic networks. Machine Learning, 62(1):107–136, 2006.
[13] Adnan Darwiche. New advances in compiling CNF to decomposable negation normal form. In Proceedings of ECAI, pages 328–332, 2004.

5 0.44152799 17 nips-2011-Accelerated Adaptive Markov Chain for Partition Function Computation

Author: Stefano Ermon, Carla P. Gomes, Ashish Sabharwal, Bart Selman

Abstract: We propose a novel Adaptive Markov Chain Monte Carlo algorithm to compute the partition function. In particular, we show how to accelerate a flat histogram sampling technique by significantly reducing the number of “null moves” in the chain, while maintaining asymptotic convergence properties. Our experiments show that our method converges quickly to highly accurate solutions on a range of benchmark instances, outperforming other state-of-the-art methods such as IJGP, TRW, and Gibbs sampling both in run-time and accuracy. We also show how obtaining a so-called density of states distribution allows for efficient weight learning in Markov Logic theories. 1

6 0.4139055 40 nips-2011-Automated Refinement of Bayes Networks' Parameters based on Test Ordering Constraints

7 0.39125019 7 nips-2011-A Machine Learning Approach to Predict Chemical Reactions

8 0.38550618 27 nips-2011-Advice Refinement in Knowledge-Based SVMs

9 0.37272352 108 nips-2011-Greedy Algorithms for Structurally Constrained High Dimensional Problems

10 0.35590565 127 nips-2011-Image Parsing with Stochastic Scene Grammar

11 0.35357016 55 nips-2011-Collective Graphical Models

12 0.34728459 6 nips-2011-A Global Structural EM Algorithm for a Model of Cancer Progression

13 0.3336429 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search

14 0.33321056 255 nips-2011-Simultaneous Sampling and Multi-Structure Fitting with Adaptive Reversible Jump MCMC

15 0.33121967 202 nips-2011-On the Universality of Online Mirror Descent

16 0.32268041 176 nips-2011-Multi-View Learning of Word Embeddings via CCA

17 0.29968506 120 nips-2011-History distribution matching method for predicting effectiveness of HIV combination therapies

18 0.29758099 65 nips-2011-Convergent Fitted Value Iteration with Linear Function Approximation

19 0.28372329 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning

20 0.28341651 125 nips-2011-Identifying Alzheimer's Disease-Related Brain Regions from Multi-Modality Neuroimaging Data using Sparse Composite Linear Discrimination Analysis


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.023), (4, 0.042), (20, 0.028), (26, 0.012), (31, 0.076), (33, 0.022), (43, 0.032), (45, 0.079), (57, 0.025), (65, 0.014), (74, 0.05), (83, 0.021), (89, 0.454), (99, 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.79862756 136 nips-2011-Inverting Grice's Maxims to Learn Rules from Natural Language Extractions

Author: Mohammad S. Sorower, Janardhan R. Doppa, Walker Orr, Prasad Tadepalli, Thomas G. Dietterich, Xiaoli Z. Fern

Abstract: We consider the problem of learning rules from natural language text sources. These sources, such as news articles and web texts, are created by a writer to communicate information to a reader, where the writer and reader share substantial domain knowledge. Consequently, the texts tend to be concise and mention the minimum information necessary for the reader to draw the correct conclusions. We study the problem of learning domain knowledge from such concise texts, which is an instance of the general problem of learning in the presence of missing data. However, unlike standard approaches to missing data, in this setting we know that facts are more likely to be missing from the text in cases where the reader can infer them from the facts that are mentioned combined with the domain knowledge. Hence, we can explicitly model this “missingness” process and invert it via probabilistic inference to learn the underlying domain knowledge. This paper introduces a mention model that models the probability of facts being mentioned in the text based on what other facts have already been mentioned and domain knowledge in the form of Horn clause rules. Learning must simultaneously search the space of rules and learn the parameters of the mention model. We accomplish this via an application of Expectation Maximization within a Markov Logic framework. An experimental evaluation on synthetic and natural text data shows that the method can learn accurate rules and apply them to new texts to make correct inferences. Experiments also show that the method out-performs the standard EM approach that assumes mentions are missing at random. 1

2 0.60006338 296 nips-2011-Uniqueness of Belief Propagation on Signed Graphs

Author: Yusuke Watanabe

Abstract: While loopy Belief Propagation (LBP) has been utilized in a wide variety of applications with empirical success, it comes with few theoretical guarantees. In particular, if the interactions of random variables in a graphical model are strong, the behavior of the algorithm can be difficult to analyze due to underlying phase transitions. In this paper, we develop a novel approach to the uniqueness problem of the LBP fixed point; our new “necessary and sufficient” condition is stated in terms of graphs and signs, where the sign denotes the type (attractive/repulsive) of the interaction (i.e., compatibility function) on the edge. In all previous works, uniqueness is guaranteed only in situations where the strengths of the interactions are “sufficiently” small in certain senses. In contrast, our condition covers arbitrarily strong interactions on the specified class of signed graphs. The result of this paper is based on a recent theoretical advance in the LBP algorithm: the connection with the graph zeta function.

3 0.5595392 217 nips-2011-Practical Variational Inference for Neural Networks

Author: Alex Graves

Abstract: Variational methods have been previously explored as a tractable approximation to Bayesian inference for neural networks. However, the approaches proposed so far have only been applicable to a few simple network architectures. This paper introduces an easy-to-implement stochastic variational method (or equivalently, minimum description length loss function) that can be applied to most neural networks. Along the way it revisits several common regularisers from a variational perspective. It also provides a simple pruning heuristic that can both drastically reduce the number of network weights and lead to improved generalisation. Experimental results are provided for a hierarchical multidimensional recurrent neural network applied to the TIMIT speech corpus. 1

4 0.38842866 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time

Author: Elad Hazan, Tomer Koren, Nati Srebro

Abstract: We present an optimization approach for linear SVMs based on a stochastic primal-dual approach, where the primal step is akin to an importance-weighted SGD, and the dual step is a stochastic update on the importance weights. This yields an optimization method with a sublinear dependence on the training set size, and the first method for learning linear SVMs with runtime less than the size of the training set required for learning! 1

5 0.29123256 180 nips-2011-Multiple Instance Filtering

Author: Kamil A. Wnuk, Stefano Soatto

Abstract: We propose a robust filtering approach based on semi-supervised and multiple instance learning (MIL). We assume that the posterior density would be unimodal if not for the effect of outliers that we do not wish to explicitly model. Therefore, we seek a point estimate at the outset, rather than a generic approximation of the entire posterior. Our approach can be thought of as a combination of standard finite-dimensional filtering (Extended Kalman Filter, or Unscented Filter) with multiple instance learning, whereby the initial condition comes with a putative set of inlier measurements. We show how both the state (regression) and the inlier set (classification) can be estimated iteratively and causally by processing only the current measurement. We illustrate our approach on visual tracking problems whereby the object of interest (target) moves and evolves as a result of occlusions and deformations, and partial knowledge of the target is given in the form of a bounding box (training set). 1

6 0.28995523 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs

7 0.28818271 303 nips-2011-Video Annotation and Tracking with Active Learning

8 0.28774387 156 nips-2011-Learning to Learn with Compound HD Models

9 0.28762549 258 nips-2011-Sparse Bayesian Multi-Task Learning

10 0.28744707 283 nips-2011-The Fixed Points of Off-Policy TD

11 0.28714564 231 nips-2011-Randomized Algorithms for Comparison-based Search

12 0.28712565 276 nips-2011-Structured sparse coding via lateral inhibition

13 0.28669688 229 nips-2011-Query-Aware MCMC

14 0.28643155 186 nips-2011-Noise Thresholds for Spectral Clustering

15 0.28622159 66 nips-2011-Crowdclustering

16 0.28617224 197 nips-2011-On Tracking The Partition Function

17 0.28615904 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries

18 0.28536382 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes

19 0.28527349 246 nips-2011-Selective Prediction of Financial Trends with Hidden Markov Models

20 0.28463018 241 nips-2011-Scalable Training of Mixture Models via Coresets