acl acl2013 acl2013-27 knowledge-graph by maker-knowledge-mining

27 acl-2013-A Two Level Model for Context Sensitive Inference Rules

Source: pdf

Author: Oren Melamud ; Jonathan Berant ; Ido Dagan ; Jacob Goldberger ; Idan Szpektor

Abstract: Automatic acquisition of inference rules for predicates has been commonly addressed by computing distributional similarity between vectors of argument words, operating at the word space level. A recent line of work, which addresses context sensitivity of rules, represented contexts in a latent topic space and computed similarity over topic vectors. We propose a novel two-level model, which computes similarities between word-level vectors that are biased by topic-level context representations. Evaluations on a naturally-distributed dataset show that our model significantly outperforms prior word-level and topic-level models. We also release a first context-sensitive inference rule set.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Automatic acquisition of inference rules for predicates has been commonly addressed by computing distributional similarity between vectors of argument words, operating at the word space level. [sent-8, score-0.908]

2 A recent line of work, which addresses context sensitivity of rules, represented contexts in a latent topic space and computed similarity over topic vectors. [sent-9, score-0.62]

3 We also release a first context-sensitive inference rule set. [sent-12, score-0.499]

4 For example, the inference rule ‘X treat Y → X relieve Y’ can be useful to extract pairs of drugs and the illnesses which they relieve, or to answer a question like “Which drugs relieve headache?” [sent-14, score-0.585]

5 This research line was mainly initiated by the highly-cited DIRT algorithm (Lin and Pantel, 2001), which learns inference for binary predicates with two argument slots (like the rule in the example above). [sent-19, score-0.963]

6 DIRT represents a predicate by two vectors, one for each of the argument slots, where the vector entries correspond to the argument words that occurred with the predicate in the corpus. [sent-20, score-0.93]

7 Inference rules between pairs of predicates are then identified by measuring the similarity between their corresponding argument vectors. [sent-21, score-0.604]

8 Consequently, several knowledge resources of inference rules were released, containing the top scoring rules for each predicate (Schoenmackers et al. [sent-27, score-0.512]

9 Thus, a system that applies an inference rule to a text may estimate the validity of the rule application based on the pre-specified rule score. [sent-32, score-1.323]

10 However, the validity of an inference rule may depend on the context in which it is applied, such as the context specified by the given predicate’s arguments. [sent-33, score-0.703]

11 For example, ‘AT&T acquire T-Mobile → AT&T purchase T-Mobile’ is a valid application of the rule ‘X acquire Y → X purchase Y’, while ‘Children acquire skills → Children purchase skills’ is not. [sent-34, score-0.607]

12 To address this issue, a line of work emerged that computes a context-sensitive reliability score for each rule application, based on the given context. [sent-35, score-0.747]

13 Then, similarity is measured between the two topic distribution vectors corresponding to the two sides of the rule in the given context, yielding a context-sensitive score for each particular rule application. [sent-44, score-1.268]

14 We note at this point that while context-insensitive methods represent predicates by argument vectors in the original fine-grained word space, context-sensitive methods represent them as vectors at the level of latent topics. [sent-45, score-0.612]

15 This raises the question of whether such coarse-grained topic vectors might be less informative in determining the semantic similarity between the two predicates. [sent-46, score-0.411]

16 To address this hypothesized caveat of prior context-sensitive rule scoring methods, we propose a novel generic scheme that integrates word-level and topic-level representations. [sent-47, score-0.51]

17 Our scheme can be applied on top of any context-insensitive “base” similarity measure for rule learning, which operates at the word level, such as Cosine or Lin (Lin, 1998). [sent-48, score-0.699]

18 Rather than computing a single context-insensitive rule score, we compute a distinct word-level similarity score for each topic in an LDA model. [sent-49, score-0.784]

19 Then, when applying a rule in a given context, these different scores are weighed together based on the specific topic distribution under the given context. [sent-50, score-0.578]

20 This way, we calculate similarity over vectors in the original word space, while biasing them towards the given context via a topic model. [sent-51, score-0.557]

21 We first present context-insensitive rule learning, based on distributional similarity at the word level, and then context-sensitive scoring for rule applications, based on topic-level similarity. [sent-57, score-1.002]

22 1 Context-insensitive Rule Learning A predicate inference rule ‘LHS → RHS’, such as ‘X acquire Y → X purchase Y’, specifies a directional inference relation between two predicates. [sent-60, score-0.952]

23 Each rule side consists of a lexical predicate and (two) variable slots for its arguments. [sent-61, score-0.725]

24 A rule can be applied when its LHS matches a predicate with a pair of arguments in a text, allowing us to infer its RHS, with the corresponding instantiations for the argument variables. [sent-63, score-0.811]

25 The DIRT algorithm (Lin and Pantel, 2001) follows the distributional similarity paradigm to learn predicate inference rules. [sent-65, score-0.654]

26 For each predicate, DIRT represents each of its argument slots by an argument vector. [sent-66, score-0.544]

27 We denote the two vectors of the X and Y slots of a predicate pred by $v^x_{pred}$ and $v^y_{pred}$, respectively. [sent-67, score-0.582]

28 Each entry of a vector v corresponds to a particular word (or term) w that instantiated the argument slot in a learning corpus, with a value v(w) = PMI(pred, w) (with PMI standing for point-wise mutual information). [sent-68, score-0.403]
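The PMI-weighted argument vectors described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `pmi_vectors`, the toy co-occurrence list, and the choice to estimate probabilities by raw relative frequencies (with no smoothing and no filtering of negative values) are all assumptions.

```python
import math
from collections import Counter

def pmi_vectors(pairs):
    """Build one PMI-weighted argument vector per predicate, for a single slot.

    pairs: iterable of (pred, word) co-occurrences observed for that slot.
    Returns {pred: {word: PMI(pred, word)}}.
    """
    pairs = list(pairs)
    joint = Counter(pairs)                      # count(pred, word)
    pred_count = Counter(p for p, _ in pairs)   # count(pred)
    word_count = Counter(w for _, w in pairs)   # count(word)
    total = len(pairs)
    vectors = {}
    for (pred, word), n in joint.items():
        # PMI = log( p(pred, word) / (p(pred) * p(word)) ), from relative frequencies
        pmi = math.log((n * total) / (pred_count[pred] * word_count[word]))
        vectors.setdefault(pred, {})[word] = pmi
    return vectors

# Hypothetical Y-slot co-occurrences, for illustration only.
toy = [("acquire", "IBM"), ("acquire", "IBM"), ("acquire", "skill"),
       ("learn", "skill"), ("learn", "skill"), ("purchase", "IBM")]
vecs = pmi_vectors(toy)
```

Here `vecs["acquire"]` plays the role of the Y-slot vector of ‘acquire’; in practice one such mapping would be built separately for the X slot and for the Y slot.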

29 The rule is then scored by combining the measured similarities between the corresponding argument vectors of the two rule sides. [sent-71, score-0.657]

30 Concretely, denoting by l and r the predicates appearing in the two rule sides, DIRT’s reliability score is defined as follows: $score_{DIRT}(LHS \rightarrow RHS) = \sqrt{sim(v^x_l, v^x_r) \cdot sim(v^y_l, v^y_r)}$ (1) where $sim(v, v')$ is a vector similarity measure. [sent-72, score-0.896]
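As a small illustration of Equation (1), the sketch below computes the geometric mean of the X-slot and Y-slot similarities. Cosine is used as the base measure only because it is simple to state; the sparse-dict vector representation, the function names, and the toy values are assumptions, and the sketch assumes both slot similarities are non-negative.

```python
import math

def cosine(v, u):
    """Cosine similarity between two sparse vectors stored as {word: weight}."""
    dot = sum(x * u[w] for w, x in v.items() if w in u)
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    if norm_v == 0.0 or norm_u == 0.0:
        return 0.0
    return dot / (norm_v * norm_u)

def dirt_score(lx, rx, ly, ry, sim=cosine):
    """Geometric mean of the X-slot and Y-slot similarities (Equation 1).

    Assumes sim returns non-negative values for both slots.
    """
    return math.sqrt(sim(lx, rx) * sim(ly, ry))

# Hypothetical toy vectors for 'X acquire Y -> X purchase Y'.
acquire_x = {"IBM": 1.0, "child": 1.0}
purchase_x = {"IBM": 1.0}
acquire_y = {"company": 2.0, "skill": 1.0}
purchase_y = {"company": 2.0}
score = dirt_score(acquire_x, purchase_x, acquire_y, purchase_y)
```

Because the two slot similarities are multiplied, a rule scores highly only when both its X-slot and its Y-slot vectors are similar.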

31 This issue has been addressed in a separate line of research which introduced directional similarity measures suitable for inference relations (Bhagat et al. [sent-77, score-0.386]

32 In our experiments we apply our proposed context-sensitive similarity scheme over three different base similarity measures. [sent-80, score-0.51]

33 DIRT and similar context-insensitive inference methods provide a single reliability score for a learned inference rule, which aims to predict the validity of the rule’s applications. [sent-81, score-0.525]

34 However, as exemplified in the Introduction, an inference rule may be valid in some contexts but invalid in others (e. [sent-82, score-0.67]

35 Since vector similarity in DIRT is computed over the single aggregate argument vector, the obtained reliability score tends to be biased towards the dominant contexts of the involved predicates. [sent-85, score-0.697]

36 Following this observation, it is desired to obtain a context-sensitive reliability score for each rule application in a given context, as described next. [sent-87, score-0.596]

37 2 Context-sensitive Rule Applications To assess the reliability of applying an inference rule in a given context we need some model for context representation, that should affect the rule reliability score. [sent-89, score-1.257]

38 Similar to the construction of argument vectors in the distributional model (described above in subsection 2. [sent-100, score-0.377]

39 1), all arguments instantiating each predicate slot are extracted from a large learning corpus. [sent-101, score-0.458]

40 Then, for each slot of each predicate, a pseudo-document is constructed containing the set of all argument words that instantiated this slot in the corpus. [sent-102, score-0.532]

41 We denote the two documents constructed for the X and Y slots of a predicate pred by $d^x_{pred}$ and $d^y_{pred}$, respectively. [sent-103, score-0.478]

42 In comparison to the distributional model, these two documents correspond to the analogous argument vectors $v^x_{pred}$ and $v^y_{pred}$, both containing exactly the same set of words. [sent-104, score-0.436]
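The pseudo-document construction can be sketched as below. The (x, pred, y) tuple layout follows the extraction format mentioned later in the text; the function name, the `"X"`/`"Y"` slot keys, and the decision to keep repeated occurrences as a list (so that a topic model can see word counts) are assumptions made for illustration.

```python
from collections import defaultdict

def build_pseudo_documents(extractions):
    """Group argument words by predicate slot.

    extractions: iterable of (x, pred, y) tuples.
    Returns {(pred, slot): [argument words]} with slot in {"X", "Y"}.
    """
    docs = defaultdict(list)
    for x, pred, y in extractions:
        docs[(pred, "X")].append(x)   # words that filled the X slot of pred
        docs[(pred, "Y")].append(y)   # words that filled the Y slot of pred
    return dict(docs)

# Hypothetical extractions, for illustration only.
docs = build_pseudo_documents([
    ("IBM", "acquire", "Lotus"),
    ("children", "acquire", "skills"),
    ("IBM", "purchase", "Lotus"),
])
```

Each value in `docs` is one pseudo-document; a topic model such as LDA would then be trained over the collection of these documents.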

43 The learning process results in the construction of K latent topics, where each topic t specifies a distribution over all words, denoted by p(w|t), and a topic distribution for each pseudo-document d, denoted by p(t|d). [sent-106, score-0.404]

44 In the topic-level model, d corresponds to a predicate slot and w to a particular argument word instantiating this slot. [sent-108, score-0.668]

45 Hence, p(t|d, w) is viewed as specifying the relevance (or likelihood) of the topic t for the predicate slot in the context of the given argument instantiation. [sent-109, score-0.874]

46 For example, for the predicate slot ‘acquire Y’ in the context of the argument ‘IBM’, we expect high relevance for a topic about companies, while in the context of the argument ‘knowledge’ we expect high relevance for a topic about abstract concepts. [sent-110, score-1.348]

47 Accordingly, the distribution p(t|d, w) over all topics provides a topic-level representation for a predicate slot in the context of a particular argument w. [sent-111, score-0.803]

48 This representation is used by the topic-level model to compute a context-sensitive score for inference rule applications, as follows. [sent-112, score-0.595]

49 Consider the application of an inference rule ‘LHS → RHS’ in the context of a particular pair of arguments for the X and Y slots, denoted by $w_x$ and $w_y$, respectively. [sent-115, score-0.638]

50 (2010) utilized the dot product form for their similarity measure: $sim_{DC}(d, d', w) = \sum_t [p(t|d, w) \cdot p(t|d', w)]$ (4) (the subscript DC stands for double-conditioning, as both distributions are conditioned on the argument word, unlike the measure below). [sent-118, score-0.535]

51 Dinu and Lapata (2010b) presented a slightly different similarity measure for topic distributions that performed better in their setting as well as in a related later paper on context-sensitive scoring of lexical similarity (Dinu and Lapata, 2010a). [sent-119, score-0.595]

52 In this measure, the topic distribution for the right hand side of the rule is not conditioned on w: $sim_{SC}(d, d', w) = \sum_t [p(t|d, w) \cdot p(t|d')]$ (5) (the subscript SC stands for single-conditioning, as only the left distribution is conditioned on the argument word). [sent-120, score-0.88]

53 Comparing the context-insensitive and context-sensitive models, we see that both of them measure similarity between vector representations of corresponding predicate slots. [sent-123, score-0.702]

54 However, while DIRT computes $sim(v, v')$ over vectors in the original word-level space, topic-level models compute $sim(d, d', w)$ by measuring similarity of vectors in a reduced-dimensionality latent space. [sent-124, score-0.493]
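The two topic-level measures, Equations (4) and (5), can be sketched side by side. The extracted text does not show how p(t|d, w) is obtained, so this sketch assumes it is proportional to p(w|t) · p(t|d) and normalized over topics; the function names and toy distributions are likewise hypothetical.

```python
def topic_posterior(p_w_given_t, p_t_given_d):
    """p(t|d, w) for one word w, ASSUMING p(t|d, w) is proportional to p(w|t) * p(t|d)."""
    joint = [pw * pt for pw, pt in zip(p_w_given_t, p_t_given_d)]
    z = sum(joint)
    if z == 0.0:
        return [0.0] * len(joint)
    return [j / z for j in joint]

def sim_dc(p_t_d, p_t_d2, p_w_t):
    """Double-conditioning (Equation 4): both sides are conditioned on w."""
    left = topic_posterior(p_w_t, p_t_d)
    right = topic_posterior(p_w_t, p_t_d2)
    return sum(a * b for a, b in zip(left, right))

def sim_sc(p_t_d, p_t_d2, p_w_t):
    """Single-conditioning (Equation 5): only the left side is conditioned on w."""
    left = topic_posterior(p_w_t, p_t_d)
    return sum(a * b for a, b in zip(left, p_t_d2))

# Hypothetical two-topic model: p(w|t) for one argument word, and p(t|d) for two slots.
p_w_t = [0.5, 0.1]
p_t_lhs = [0.5, 0.5]
p_t_rhs = [0.8, 0.2]
dc = sim_dc(p_t_lhs, p_t_rhs, p_w_t)
sc = sim_sc(p_t_lhs, p_t_rhs, p_w_t)
```

Both measures operate on K-dimensional topic vectors only, so no word-level argument vector enters the computation.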

55 Hence, in the next section we propose a combined two-level model, which represents predicate slots in the original word-level space while biasing the similarity measure through topic-level context models. [sent-126, score-0.778]

56 3 Two-level Context-sensitive Inference Our model follows the general DIRT scheme while extending it to handle context-sensitive scoring of rule applications, addressing the scenario dealt with by the context-sensitive topic models. [sent-127, score-0.56]

57 Following the methods in Section 2, for each predicate pred we construct, from the learning corpus, its argument vectors $v^x_{pred}$ and $v^y_{pred}$ as well as its argument pseudo-documents $d^x_{pred}$ and $d^y_{pred}$. [sent-130, score-0.977]

58 At learning time, we compute for each candidate rule a separate, topic-biased similarity score for each of the topics in the LDA model. [sent-134, score-0.733]

59 Then, at rule application time, we compute an overall reliability score for the rule by combining the per-topic similarity scores, while biasing the score combination according to the given context of w. [sent-135, score-1.396]

60 1 Topic-biased Word-vector Similarities Given a pair of word vectors $v$ and $v'$, and any desired “base” vector similarity measure sim (e. [sent-138, score-0.455]

61 $sim_{Lin}$), we compute a topic-biased similarity score for each LDA topic t, denoted by $sim_t(v, v')$. [sent-140, score-0.457]

62 The notation $Lin_t$ denotes the $sim_t$ measure when applied using Lin as the base similarity measure sim. [sent-143, score-0.48]

63 Table 1 illustrates topic-biased similarities for the Y slot of two rules involving the predicate ‘acquire’. [sent-145, score-0.512]

64 On the other hand, the topic-biased similarity for t1 is substantially lower, since prominent words in this topic are likely to occur with ‘acquire’ but not with ‘learn’, yielding low distributional similarity. [sent-148, score-0.408]

65 2 Context-sensitive Similarity When applying an inference rule, we compute for each slot its context-sensitive similarity score $sim_{WT}(v, v', w)$, where $v$ and $v'$ are the slot’s argument vectors for the two rule sides and w is the word instantiating the slot in the given rule application. [sent-151, score-1.869]
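A per-topic, topic-biased similarity could be computed along the following lines. The extracted sentences do not include the definition of the biasing step, so this sketch assumes that each vector entry v(w) is scaled by the relevance of topic t for that word, and that the base measure is then applied to the two scaled vectors; `bias_vector`, `topic_biased_sims`, the relevance tables, and the use of cosine as the base measure are all hypothetical.

```python
import math

def cosine(v, u):
    """Cosine similarity between two sparse vectors stored as {word: weight}."""
    dot = sum(x * u[w] for w, x in v.items() if w in u)
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    if norm_v == 0.0 or norm_u == 0.0:
        return 0.0
    return dot / (norm_v * norm_u)

def bias_vector(v, relevance, t):
    """ASSUMED biasing step: scale v(w) by relevance[w][t], a stand-in for p(t|d_v, w)."""
    return {w: x * relevance[w][t] for w, x in v.items()}

def topic_biased_sims(v, u, rel_v, rel_u, num_topics, sim=cosine):
    """One base-measure similarity per topic, computed over the biased word vectors."""
    return [sim(bias_vector(v, rel_v, t), bias_vector(u, rel_u, t))
            for t in range(num_topics)]

# Hypothetical Y-slot vectors; topic 0 is 'companies', topic 1 is 'abstract concepts'.
acquire_y = {"IBM": 1.0, "skill": 1.0}
learn_y = {"skill": 1.0, "language": 1.0}
rel_acquire = {"IBM": [1.0, 0.0], "skill": [0.0, 1.0]}
rel_learn = {"skill": [0.0, 1.0], "language": [0.0, 1.0]}
sims = topic_biased_sims(acquire_y, learn_y, rel_acquire, rel_learn, num_topics=2)
```

In this toy setting the ‘companies’ topic yields zero similarity for ‘acquire → learn’ while the ‘abstract concepts’ topic yields a high one, which mirrors the qualitative behavior the text attributes to Table 1.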

66 In this average, each topic is weighed by its “relevance” for the context in which the rule is applied, which consists of the left-hand-side predicate v and the argument w. [sent-153, score-1.075]

67 The relevance of each topic to different arguments of ‘acquire’ is illustrated by showing the top 5 words in the argument vector $v^y_{acquire}$ for which the illustrated topic is the most likely one. [sent-155, score-0.555]

68 captured by $p(t|d_v, w)$: $sim_{WT}(v, v', w) = \sum_t [p(t|d_v, w) \cdot sim_t(v, v')]$ (7) This way, a rule application would obtain a high score only if the current context fits those topics for which the rule is indeed likely to be valid, as captured by a high topic-biased similarity. [sent-156, score-1.001]

69 Table 2 illustrates the calculation of context-sensitive similarity scores in four rule applications, involving the Y slot of the predicate ‘acquire’. [sent-158, score-1.152]

70 The opposite behavior is observed for ‘acquire → purchase’, altogether demonstrating how our model successfully biases the similarity score according to rule validity in context. [sent-160, score-0.659]

71 Table 2: Context-sensitive similarity scores (in bold) for the Y slots of four rule applications. [sent-163, score-0.682]

72 For each rule application, the table shows a couple of the topic-biased scores $Lin_t$ of the rule (as in Table 1), along with the topic relevance for the given context $p(t|d_v, w)$, which weighs the topic-biased scores in the $Lin_{WT}$ calculation. [sent-165, score-0.988]
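Equation (7) is a relevance-weighted sum of the per-topic similarities, which the following sketch makes concrete. The function name `sim_wt` and all numbers are hypothetical; the per-topic scores stand in for values that would be precomputed at learning time.

```python
def sim_wt(topic_relevance, topic_sims):
    """Equation (7): sum over topics t of p(t|d_v, w) * sim_t(v, v')."""
    if len(topic_relevance) != len(topic_sims):
        raise ValueError("one relevance weight is needed per topic-biased score")
    return sum(p * s for p, s in zip(topic_relevance, topic_sims))

# Hypothetical per-topic scores for 'acquire -> learn' (topic 0: companies, topic 1: abstract).
learn_sims = [0.05, 0.60]
# Hypothetical topic relevance p(t|d_v, w) for two different Y-slot arguments.
company_ctx = [0.9, 0.1]     # e.g. w = 'IBM'
abstract_ctx = [0.1, 0.9]    # e.g. w = 'knowledge'
score_company = sim_wt(company_ctx, learn_sims)
score_abstract = sim_wt(abstract_ctx, learn_sims)
```

The same rule therefore receives a low score when the argument points to the ‘companies’ topic and a high score when it points to the ‘abstract concepts’ topic, without recomputing any word-level similarity at application time.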

73 Since our model can contextualize various distributional similarity measures, we evaluated the performance of all the above methods on several base similarity measures and their learned rulesets, namely Lin (Lin, 1998), BInc (Szpektor and Dagan, 2008) and vector Cosine similarity. [sent-172, score-0.589]

74 BInc (Szpektor and Dagan, 2008) is a directional similarity measure between word vectors, which outperformed Lin for predicate inference (Szpektor and Dagan, 2008). [sent-174, score-0.735]

75 3 ReVerb template extractions/instantiations are in the form of a tuple (x, pred, y), containing pred, a verb predicate; x, the argument instantiation of the template’s slot X; and y, the instantiation of the template’s slot Y. [sent-177, score-0.648]

76 To complete the learning, we calculated the topic-biased similarity score for each learned rule under each LDA topic, as specified in our context-sensitive model. [sent-189, score-0.65]

77 We release a rule set comprising the top 500 context-sensitive rules that we learned for each of the verb predicates in our learning corpus, along with our trained LDA model. [sent-190, score-0.593]

78 Table 3: Sizes of rule application test set for each learned rule-set (Lin, BInc, Cosine). [sent-193, score-0.453]

79 5 This publicly available dataset contains about 6,500 manually annotated predicate template rule applications, each one labeled as correct or incorrect. [sent-198, score-0.637]

80 For example, ‘Jack agree with Jill ↛ Jack feel sorry for Jill’ is a rule application in this dataset, labeled as incorrect, and ‘Registration open this month → Registration begin this month’ is another rule application, labeled as correct. [sent-199, score-0.42]

81 ’s dataset in which the assessed rule is not in the context-insensitive rule-set learned for this measure or the argument instantiation of the rule is not in the LDA lexicon. [sent-203, score-1.139]

82 Finally, the task under which we assessed the tested models is to rank all rule applications in each test set, aiming to rank the valid rule applications above the invalid ones. [sent-208, score-0.929]
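One common way to quantify how well valid rule applications are ranked above invalid ones is average precision. The extracted text is truncated where the metric is named, so treating average precision as the metric is an assumption of this sketch, as are the function name and the toy scores.

```python
def average_precision(scores, labels):
    """Average precision of the ranking induced by scores (higher score ranks first).

    labels[i] is True when rule application i is valid.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, is_valid) in enumerate(ranked, start=1):
        if is_valid:
            hits += 1
            precision_sum += hits / rank   # precision at this valid item's rank
    return precision_sum / hits if hits else 0.0

# Hypothetical context-sensitive scores and gold labels for four rule applications.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [True, False, True, False]
ap = average_precision(scores, labels)
```

A perfect ranking, with every valid application above every invalid one, gives a value of 1.0; averaging this quantity over several test sets would give a mean average precision.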

83 , 2008) of the rule application ranking computed by this method. [sent-210, score-0.42]

84 Specifically, topics are leveraged for high-level domain disambiguation, while fine-grained word-level distributional similarity is computed for each rule under each such domain. [sent-244, score-0.767]

85 However, in higher numbers, topics relate to narrower domains and then topic-biased word-level similarity may become less effective due to potential sparseness. [sent-246, score-0.448]

86 provided a more detailed annotation for each invalid rule application. [sent-249, score-0.452]

87 Specifically, they annotated whether the context under which the rule is applied is valid. [sent-250, score-0.443]

88 This result more explicitly shows the advantages of integrating word-level and context-sensitive topic-level similarities for differentiating valid and invalid contexts for rule applications. [sent-258, score-0.617]

89 Yet, many invalid rule applications occur under valid contexts due to inherently incorrect rules, and we want to make sure that also in this scenario our model does not fall behind the context-insensitive measure. [sent-259, score-0.57]

90 Indeed, on test-set$_{vc}$, in which context mismatches are rare, our algorithm is still better than the original measure, indicating that WT can be safely applied to distributional similarity measures without concerns of reduced performance in different context scenarios. [sent-260, score-0.44]

91 6 Discussion and Future Work This paper addressed the problem of computing context-sensitive reliability scores for predicate inference rules. [sent-264, score-0.493]

92 In particular, we proposed a novel scheme that applies over any base distributional similarity measure which operates at the word level, and computes a single context-insensitive score for a rule. [sent-265, score-0.575]

93 Based on such a measure, our scheme constructs a context-sensitive similarity measure that computes a reliability score for predicate inference rule applications in the context of given arguments. [sent-266, score-1.122]

94 Then, given a specific candidate rule application, the LDA model is used to infer the topic distribution relevant to the context specified by the given arguments. [sent-269, score-0.606]

95 Finally, the context-sensitive rule application score is computed as a weighted average of the per-topic word-level similarity scores, which are weighed according to the inferred topic distribution. [sent-270, score-1.02]

96 While most works on context-insensitive predicate inference rules, such as DIRT (Lin and Pantel, 2001), are based on word-level similarity measures, almost all prior models addressing context-sensitive predicate inference rules are based on topic models (except for (Pantel et al. [sent-271, score-1.34]

97 We therefore focused on comparing the performance of our two-level scheme with state-of-the-art prior topic-level and word-level models of distributional similarity, over a random sample of inference rule applications. [sent-273, score-0.681]

98 While we focus on lexical-syntactic predicate templates and instantiations of their argument slots as context, lexical similarity methods consider various lexical units that are not necessarily predicates, with their context typically being the collection of words in a window around them. [sent-277, score-0.85]

99 In addition, (Dinu and Lapata, 2010a) adapted the predicate inference topic model from (Dinu and Lapata, 2010b) to compute lexical similarity in context. [sent-282, score-0.739]

100 A notable difference between our approach and theirs is that we use predicate pseudo-documents consisting of argument instantiations to learn our LDA model, while Eidelman et al. [sent-291, score-0.449]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('rule', 0.362), ('dirt', 0.26), ('predicate', 0.239), ('argument', 0.21), ('similarity', 0.196), ('lda', 0.194), ('dinu', 0.183), ('contextsensitive', 0.162), ('slot', 0.161), ('szpektor', 0.153), ('acquire', 0.147), ('inference', 0.137), ('predicates', 0.13), ('topic', 0.13), ('slots', 0.124), ('reliability', 0.117), ('dagan', 0.111), ('reverb', 0.101), ('lint', 0.099), ('lin', 0.098), ('invalid', 0.09), ('simt', 0.088), ('vectors', 0.085), ('distributional', 0.082), ('context', 0.081), ('lapata', 0.08), ('binc', 0.079), ('lhs', 0.079), ('topics', 0.079), ('sc', 0.075), ('pred', 0.075), ('measure', 0.073), ('rhs', 0.07), ('zeichner', 0.07), ('ritter', 0.069), ('ido', 0.069), ('sim', 0.069), ('scheme', 0.068), ('rules', 0.068), ('berant', 0.067), ('idan', 0.067), ('purchase', 0.067), ('pantel', 0.066), ('dc', 0.066), ('biasing', 0.065), ('dv', 0.065), ('contextinsensitive', 0.059), ('salary', 0.059), ('simwt', 0.059), ('vpxred', 0.059), ('vpyred', 0.059), ('score', 0.059), ('application', 0.058), ('instantiating', 0.058), ('conditioned', 0.056), ('bhagat', 0.055), ('directional', 0.053), ('accommodate', 0.053), ('weighed', 0.053), ('relevance', 0.053), ('vt', 0.052), ('base', 0.05), ('boss', 0.048), ('wordlevel', 0.048), ('computes', 0.047), ('similarities', 0.044), ('jill', 0.043), ('relieve', 0.043), ('biased', 0.043), ('latent', 0.043), ('validity', 0.042), ('jack', 0.042), ('sides', 0.041), ('oren', 0.041), ('valid', 0.041), ('contexts', 0.04), ('eidelman', 0.04), ('extractions', 0.04), ('instantiation', 0.04), ('dpxred', 0.04), ('linwt', 0.04), ('registration', 0.04), ('schoenmackers', 0.04), ('topiclevel', 0.04), ('pad', 0.039), ('erk', 0.038), ('applications', 0.037), ('outperformed', 0.037), ('compute', 0.037), ('georgiana', 0.037), ('template', 0.036), ('denoted', 0.035), ('shinyama', 0.035), ('wt', 0.035), ('substitution', 0.033), ('learned', 0.033), ('distribution', 0.033), ('prior', 0.032), ('vector', 0.032), ('calculation', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000024 27 acl-2013-A Two Level Model for Context Sensitive Inference Rules

Author: Oren Melamud ; Jonathan Berant ; Ido Dagan ; Jacob Goldberger ; Idan Szpektor

2 0.54285616 376 acl-2013-Using Lexical Expansion to Learn Inference Rules from Sparse Data

Author: Oren Melamud ; Ido Dagan ; Jacob Goldberger ; Idan Szpektor

Abstract: Automatic acquisition of inference rules for predicates is widely addressed by computing distributional similarity scores between vectors of argument words. In this scheme, prior work typically refrained from learning rules for low frequency predicates associated with very sparse argument vectors due to expected low reliability. To improve the learning of such rules in an unsupervised way, we propose to lexically expand sparse argument word vectors with semantically similar words. Our evaluation shows that lexical expansion significantly improves performance in comparison to state-of-the-art baselines.

3 0.20187643 314 acl-2013-Semantic Roles for String to Tree Machine Translation

Author: Marzieh Bazrafshan ; Daniel Gildea

Abstract: We experiment with adding semantic role information to a string-to-tree machine translation system based on the rule extraction procedure of Galley et al. (2004). We compare methods based on augmenting the set of nonterminals by adding semantic role labels, and altering the rule extraction process to produce a separate set of rules for each predicate that encompass its entire predicate-argument structure. Our results demonstrate that the second approach is effective in increasing the quality of translations.

4 0.19875579 306 acl-2013-SPred: Large-scale Harvesting of Semantic Predicates

Author: Tiziano Flati ; Roberto Navigli

Abstract: We present SPred, a novel method for the creation of large repositories of semantic predicates. We start from existing collocations to form lexical predicates (e.g., break ∗) and learn the semantic classes that best fit the ∗ argument. To do this, we extract all the occurrences in Wikipedia which match the predicate and abstract its arguments to general semantic classes (e.g., break BODY PART, break AGREEMENT, etc.). Our experiments show that we are able to create a large collection of semantic predicates from the Oxford Advanced Learner’s Dictionary with high precision and recall, and perform well against the most similar approach.

5 0.17592771 283 acl-2013-Probabilistic Domain Modelling With Contextualized Distributional Semantic Vectors

Author: Jackie Chi Kit Cheung ; Gerald Penn

Abstract: Generative probabilistic models have been used for content modelling and template induction, and are typically trained on small corpora in the target domain. In contrast, vector space models of distributional semantics are trained on large corpora, but are typically applied to domain-general lexical disambiguation tasks. We introduce Distributional Semantic Hidden Markov Models, a novel variant of a hidden Markov model that integrates these two approaches by incorporating contextualized distributional semantic vectors into a generative model as observed emissions. Experiments in slot induction show that our approach yields improvements in learning coherent entity clusters in a domain. In a subsequent extrinsic evaluation, we show that these improvements are also reflected in multi-document summarization.

6 0.17585529 17 acl-2013-A Random Walk Approach to Selectional Preferences Based on Preference Ranking and Propagation

7 0.15547436 285 acl-2013-Propminer: A Workflow for Interactive Information Extraction and Exploration using Dependency Trees

8 0.15099832 74 acl-2013-Building Comparable Corpora Based on Bilingual LDA Model

9 0.14671804 189 acl-2013-ImpAr: A Deterministic Algorithm for Implicit Semantic Role Labelling

10 0.13649377 197 acl-2013-Incremental Topic-Based Translation Model Adaptation for Conversational Spoken Language Translation

11 0.13239877 304 acl-2013-SEMILAR: The Semantic Similarity Toolkit

12 0.12992661 267 acl-2013-PARMA: A Predicate Argument Aligner

13 0.12770939 55 acl-2013-Are Semantically Coherent Topic Models Useful for Ad Hoc Information Retrieval?

14 0.12288976 351 acl-2013-Topic Modeling Based Classification of Clinical Reports

15 0.11892416 43 acl-2013-Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity

16 0.11740544 169 acl-2013-Generating Synthetic Comparable Questions for News Articles

17 0.11182485 104 acl-2013-DKPro Similarity: An Open Source Framework for Text Similarity

18 0.11131064 231 acl-2013-Linggle: a Web-scale Linguistic Search Engine for Words in Context

19 0.1075877 73 acl-2013-Broadcast News Story Segmentation Using Manifold Learning on Latent Topic Distributions

20 0.1071676 98 acl-2013-Cross-lingual Transfer of Semantic Role Labeling Models

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.256), (1, 0.076), (2, 0.057), (3, -0.213), (4, -0.075), (5, 0.048), (6, 0.001), (7, 0.058), (8, -0.179), (9, 0.006), (10, 0.108), (11, 0.193), (12, 0.276), (13, 0.043), (14, 0.147), (15, -0.073), (16, 0.114), (17, -0.022), (18, 0.24), (19, 0.074), (20, -0.051), (21, 0.11), (22, -0.129), (23, 0.142), (24, -0.013), (25, 0.111), (26, 0.142), (27, 0.025), (28, 0.082), (29, -0.074), (30, -0.015), (31, -0.138), (32, 0.039), (33, -0.024), (34, 0.007), (35, 0.087), (36, 0.028), (37, -0.047), (38, -0.014), (39, -0.019), (40, -0.105), (41, -0.025), (42, 0.037), (43, -0.038), (44, 0.088), (45, 0.015), (46, -0.002), (47, 0.033), (48, -0.053), (49, 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9666543 27 acl-2013-A Two Level Model for Context Sensitive Inference Rules

Author: Oren Melamud ; Jonathan Berant ; Ido Dagan ; Jacob Goldberger ; Idan Szpektor

2 0.95474231 376 acl-2013-Using Lexical Expansion to Learn Inference Rules from Sparse Data

Author: Oren Melamud ; Ido Dagan ; Jacob Goldberger ; Idan Szpektor

3 0.64587057 17 acl-2013-A Random Walk Approach to Selectional Preferences Based on Preference Ranking and Propagation

Author: Zhenhua Tian ; Hengheng Xiang ; Ziqi Liu ; Qinghua Zheng

Abstract: This paper presents an unsupervised random walk approach to alleviate data sparsity for selectional preferences. Based on the measure of preferences between predicates and arguments, the model aggregates all the transitions from a given predicate to its nearby predicates, and propagates their argument preferences as the given predicate’s smoothed preferences. Experimental results show that this approach outperforms several state-of-the-art methods on the pseudo-disambiguation task, and it better correlates with human plausibility judgements.

4 0.59523445 314 acl-2013-Semantic Roles for String to Tree Machine Translation

Author: Marzieh Bazrafshan ; Daniel Gildea

5 0.57894063 306 acl-2013-SPred: Large-scale Harvesting of Semantic Predicates

Author: Tiziano Flati ; Roberto Navigli

6 0.57862097 189 acl-2013-ImpAr: A Deterministic Algorithm for Implicit Semantic Role Labelling

7 0.54297507 285 acl-2013-Propminer: A Workflow for Interactive Information Extraction and Exploration using Dependency Trees

8 0.52467752 304 acl-2013-SEMILAR: The Semantic Similarity Toolkit

9 0.50841033 283 acl-2013-Probabilistic Domain Modelling With Contextualized Distributional Semantic Vectors

10 0.50749028 231 acl-2013-Linggle: a Web-scale Linguistic Search Engine for Words in Context

11 0.48752218 267 acl-2013-PARMA: A Predicate Argument Aligner

12 0.48370683 269 acl-2013-PLIS: a Probabilistic Lexical Inference System

13 0.42969584 104 acl-2013-DKPro Similarity: An Open Source Framework for Text Similarity

14 0.41702273 31 acl-2013-A corpus-based evaluation method for Distributional Semantic Models

15 0.41380847 237 acl-2013-Margin-based Decomposed Amortized Inference

16 0.4078829 126 acl-2013-Diverse Keyword Extraction from Conversations

17 0.39564273 55 acl-2013-Are Semantically Coherent Topic Models Useful for Ad Hoc Information Retrieval?

18 0.39278728 12 acl-2013-A New Set of Norms for Semantic Relatedness Measures

19 0.39129436 57 acl-2013-Arguments and Modifiers from the Learner's Perspective

20 0.38367468 74 acl-2013-Building Comparable Corpora Based on Bilingual LDA Model

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.074), (6, 0.017), (11, 0.203), (15, 0.016), (24, 0.051), (26, 0.016), (28, 0.2), (35, 0.108), (42, 0.037), (48, 0.078), (70, 0.026), (88, 0.014), (90, 0.038), (95, 0.045)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.88395751 27 acl-2013-A Two Level Model for Context Sensitive Inference Rules

Author: Oren Melamud ; Jonathan Berant ; Ido Dagan ; Jacob Goldberger ; Idan Szpektor

2 0.87257999 349 acl-2013-The mathematics of language learning

Author: Andras Kornai ; Gerald Penn ; James Rogers ; Anssi Yli-Jyra

Abstract: unknown-abstract

3 0.84663004 96 acl-2013-Creating Similarity: Lateral Thinking for Vertical Similarity Judgments

Author: Tony Veale ; Guofu Li

Abstract: Just as observing is more than just seeing, comparing is far more than mere matching. It takes understanding, and even inventiveness, to discern a useful basis for judging two ideas as similar in a particular context, especially when our perspective is shaped by an act of linguistic creativity such as metaphor, simile or analogy. Structured resources such as WordNet offer a convenient hierarchical means for converging on a common ground for comparison, but offer little support for the divergent thinking that is needed to creatively view one concept as another. We describe such a means here, by showing how the web can be used to harvest many divergent views for many familiar ideas. These lateral views complement the vertical views of WordNet, and support a system for idea exploration called Thesaurus Rex. We show also how Thesaurus Rex supports a novel, generative similarity measure for WordNet.

Affiliation (Guofu Li): School of Computer Science and Informatics, University College Dublin, Belfield, Dublin D2, Ireland.

1 Seeing is Believing (and Creating)

Similarity is a cognitive phenomenon that is both complex and subjective, yet for practical reasons it is often modeled as if it were simple and objective. This makes sense for the many situations where we want to align our similarity judgments with those of others, and thus focus on the same conventional properties that others are also likely to focus upon. This reliance on the consensus viewpoint explains why WordNet (Fellbaum, 1998) has proven so useful as a basis for computational measures of lexico-semantic similarity (e.g. see Pedersen et al., 2004; Budanitsky & Hirst, 2006; Seco et al., 2006). These measures reduce the similarity of two lexical concepts to a single number, by viewing similarity as an objective estimate of the overlap in their salient qualities.
This convenient perspective is poorly suited to creative or insightful comparisons, but it is sufficient for the many mundane comparisons we often perform in daily life, such as when we organize books or look for items in a supermarket. So if we do not know in which aisle to locate a given item (such as oatmeal), we may tacitly know how to locate a similar product (such as cornflakes) and orient ourselves accordingly. Yet there are occasions when the recognition of similarities spurs the creation of similarities, when the act of comparison spurs us to invent new ways of looking at an idea. By placing pop tarts in the breakfast aisle, food manufacturers encourage us to view them as a breakfast food that is not dissimilar to oatmeal or cornflakes. When ex-PM Tony Blair published his memoirs, a mischievous activist encouraged others to move his book from Biography to Fiction in bookshops, in the hope that buyers would see it in a new light. Whenever we use a novel metaphor to convey a non-obvious viewpoint on a topic, such as “cigarettes are time bombs”, the comparison may spur us to insight, to see aspects of the topic that make it more similar to the vehicle (see Ortony, 1979; Veale & Hao, 2007). In formal terms, assume agent A has an insight about concept X, and uses the metaphor X is a Y to also provoke this insight in agent B. To arrive at this insight for itself, B must intuit what X and Y have in common. But this commonality is surely more than a standard categorization of X, or else it would not count as an insight about X. To understand the metaphor, B must place X in a new category, so that X can be seen as more similar to Y. Metaphors shape the way we perceive the world by re-shaping the way we make similarity judgments.

Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 660–670, Sofia, Bulgaria, August 4-9 2013. ©2013 Association for Computational Linguistics
So if we want to imbue computers with the ability to make and to understand creative metaphors, we must first give them the ability to look beyond the narrow viewpoints of conventional resources. Any measure that models similarity as an objective function of a conventional worldview employs a convergent thought process. Using WordNet, for instance, a similarity measure can vertically converge on a common superordinate category of both inputs, and generate a single numeric result based on their distance to, and the information content of, this common generalization. So to find the most conventional ways of seeing a lexical concept, one simply ascends a narrowing concept hierarchy, using a process de Bono (1970) calls vertical thinking. To find novel, non-obvious and useful ways of looking at a lexical concept, one must use what Guilford (1967) calls divergent thinking and what de Bono calls lateral thinking. These processes cut across familiar category boundaries, to simultaneously place a concept in many different categories so that we can see it in many different ways. de Bono argues that vertical thinking is selective while lateral thinking is generative. Whereas vertical thinking concerns itself with the “right” way or a single “best” way of looking at things, lateral thinking focuses on producing alternatives to the status quo. To be as useful for creative tasks as they are for conventional tasks, we need to re-imagine our computational similarity measures as generative rather than selective, expansive rather than reductive, divergent as well as convergent and lateral as well as vertical. Though WordNet is ideally structured to support vertical, convergent reasoning, its comprehensive nature means it can also be used as a solid foundation for building a more lateral and divergent model of similarity. Here we will use the web as a source of diverse perspectives on familiar ideas, to complement the conventional and often narrow views codified by WordNet. 
Section 2 provides a brief overview of past work in the area of similarity measurement, before section 3 describes a simple bootstrapping loop for acquiring richly diverse perspectives from the web for a wide variety of familiar ideas. These perspectives are used to enhance a WordNet-based measure of lexico-semantic similarity in section 4, by broadening the range of informative viewpoints the measure can select from. Similarity is thus modeled as a process that is both generative and selective. This lateral-and-vertical approach is evaluated in section 5, on the Miller & Charles (1991) data-set. A web app for the lateral exploration of diverse viewpoints, named Thesaurus Rex, is also presented, before closing remarks are offered in section 6.

2 Related Work and Ideas

WordNet’s taxonomic organization of noun-senses and verb-senses – in which very general categories are successively divided into increasingly informative sub-categories or instance-level ideas – allows us to gauge the overlap in information content, and thus of meaning, of two lexical concepts. We need only identify the deepest point in the taxonomy at which this content starts to diverge. This point of divergence is often called the LCS, or least common subsumer, of two concepts (Pedersen et al., 2004). Since sub-categories add new properties to those they inherit from their parents – Aristotle called these properties the differentia that stop a category system from trivially collapsing into itself – the depth of a lexical concept in a taxonomy is an intuitive proxy for its information content. Wu & Palmer (1994) use the depth of a lexical concept in the WordNet hierarchy as such a proxy, and thereby estimate the similarity of two lexical concepts as twice the depth of their LCS divided by the sum of their individual depths. Leacock and Chodorow (1998) instead use the length of the shortest path between two concepts as a proxy for the conceptual distance between them.
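The Wu & Palmer estimate described above (twice the depth of the LCS divided by the sum of the two depths) can be illustrated with a minimal sketch. The toy taxonomy, the concept names, and the helper functions below are all hypothetical and stand in for WordNet; depth is counted in nodes, so the root has depth 1, which is one common convention and an assumption here.

```python
# Toy taxonomy as child -> parent links (hypothetical; not WordNet data).
PARENT = {
    "entity": None,
    "food": "entity",
    "cereal": "food",
    "oatmeal": "cereal",
    "cornflakes": "cereal",
    "pastry": "food",
    "pop_tart": "pastry",
}


def ancestors(concept):
    """Return the path from a concept up to the root, concept first."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def depth(concept):
    """Depth counted in nodes, so the root has depth 1 (an assumption)."""
    return len(ancestors(concept))


def lcs(a, b):
    """Least common subsumer: the deepest node shared by both ancestor paths."""
    seen = set(ancestors(a))
    for node in ancestors(b):  # walks upward, so the first hit is the deepest
        if node in seen:
            return node
    raise ValueError("concepts share no ancestor")


def wu_palmer(a, b):
    """Twice the depth of the LCS divided by the sum of the two depths."""
    return 2 * depth(lcs(a, b)) / (depth(a) + depth(b))


print(wu_palmer("oatmeal", "cornflakes"))  # 0.75 (LCS is "cereal", depth 3)
print(wu_palmer("oatmeal", "pop_tart"))    # 0.5  (LCS is "food", depth 2)
```

In this toy hierarchy, oatmeal and cornflakes converge on a deeper shared category than oatmeal and pop tarts, so they receive the higher score; this is the vertical, convergent behaviour the text attributes to such measures.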
To connect any two ideas in a hierarchical system, one must vertically ascend the hierarchy from one concept, change direction at a potential LCS, and then descend the hierarchy to reach the second concept. (Aristotle was also first to suggest this approach in his Poetics.) Leacock and Chodorow normalize the length of this path by dividing its size (in nodes) by twice the depth of the deepest concept in the hierarchy; the latter is an upper bound on the distance between any two concepts in the hierarchy. Negating the log of this normalized length yields a corresponding similarity score. While the role of an LCS is merely implied in Leacock and Chodorow’s use of a shortest path, the LCS is pivotal nonetheless, and like that of Wu & Palmer, the approach uses an essentially vertical reasoning process to identify a single “best” generalization. Depth is a convenient proxy for information content, but more nuanced proxies can yield more rounded similarity measures. Resnik (1995) draws on information theory to define the information content of a lexical concept as the negative log likelihood of its occurrence in a corpus, either explicitly (via a direct mention) or by presupposition (via a mention of any of its sub-categories or instances). Since the likelihood of a general category occurring in a corpus is higher than that of any of its sub-categories or instances, such categories are more predictable, and less informative, than rarer categories whose occurrences are less predictable and thus more informative. The negative log likelihood of the most informative LCS of two lexical concepts offers a reliable estimate of the amount of information shared by those concepts, and thus a good estimate of their similarity.
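The two measures described above can be sketched side by side: the path-based score (negative log of the path length in nodes divided by twice the maximum depth) and the corpus-based information content (negative log likelihood, where a mention of any descendant also counts as a mention of the category). The taxonomy and the mention counts below are invented for illustration only, and the function names are hypothetical.

```python
import math

# Toy taxonomy as child -> parent links (hypothetical; not WordNet data).
PARENT = {
    "entity": None,
    "food": "entity",
    "cereal": "food",
    "oatmeal": "cereal",
    "cornflakes": "cereal",
    "pastry": "food",
    "pop_tart": "pastry",
}

# Invented counts of direct mentions in some corpus (assumption).
DIRECT = {
    "entity": 10, "food": 20, "cereal": 10, "oatmeal": 5,
    "cornflakes": 5, "pastry": 8, "pop_tart": 2,
}


def ancestors(concept):
    """Return the path from a concept up to the root, concept first."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


MAX_DEPTH = max(len(ancestors(c)) for c in PARENT)  # depth counted in nodes


def path_nodes(a, b):
    """Nodes on the path from a up to the LCS and back down to b."""
    up_a, up_b = ancestors(a), ancestors(b)
    for i, node in enumerate(up_a):
        if node in up_b:
            return i + up_b.index(node) + 1
    raise ValueError("concepts share no ancestor")


def leacock_chodorow(a, b):
    """Negative log of the path length normalized by twice the max depth."""
    return -math.log(path_nodes(a, b) / (2 * MAX_DEPTH))


def cumulative_count(concept):
    """Mentions of the concept plus mentions of everything beneath it."""
    return sum(n for c, n in DIRECT.items() if concept in ancestors(c))


TOTAL = cumulative_count("entity")


def information_content(concept):
    """Negative log likelihood of encountering the concept in the corpus."""
    return -math.log(cumulative_count(concept) / TOTAL)


def resnik(a, b):
    """Information content of the most informative shared subsumer."""
    up_b = set(ancestors(b))
    return max(information_content(n) for n in ancestors(a) if n in up_b)


print(leacock_chodorow("oatmeal", "cornflakes"))
print(resnik("oatmeal", "cornflakes"), resnik("oatmeal", "pop_tart"))
```

With these invented counts, the general category "food" is far more predictable than "cereal", so a pair whose only shared subsumer is "food" shares less information than a pair that meets at "cereal".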
Lin (1998) combines the intuitions behind Resnik’s metric and that of Wu and Palmer to estimate the similarity of two lexical concepts as an information ratio: twice the information content of their LCS divided by the sum of their individual information contents. Jiang and Conrath (1997) consider the converse notion of dissimilarity, noting that two lexical concepts are dissimilar to the extent that each contains information that is not shared by the other. So if the information content of their most informative LCS is a good measure of what they do share, then the sum of their individual information contents, minus twice the content of their most informative LCS, is a reliable estimate of their dissimilarity. Seco et al. (2006) present a minor innovation, showing how Resnik’s notion of information content can be calculated without the use of an external corpus. Rather, when using Resnik’s metric (or that of Lin, or Jiang and Conrath) for measuring the similarity of lexical concepts in WordNet, one can use the category structure of WordNet itself to estimate information content. Typically, the more general a concept, the more descendants it will possess. Seco et al. thus estimate the information content of a lexical concept as the log of the sum of all its unique descendants (both direct and indirect), divided by the log of the total number of concepts in the entire hierarchy. Not only is this intrinsic view of information content convenient to use, without recourse to an external corpus, Seco et al. show that it offers a better estimate of information content than its extrinsic, corpus-based alternatives, as measured relative to average human similarity ratings for the 30 word-pairs in the Miller & Charles (1991) test set. A similarity measure can draw on other sources of information besides WordNet’s category structures.
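The Lin ratio and the Jiang and Conrath dissimilarity described above are both simple functions of three information-content values. The sketch below takes those values as plain numbers, so it is agnostic about whether they come from a corpus or from an intrinsic estimate; the function names and the sample values are hypothetical, and returning 0.0 when both concepts carry no information is a convention assumed here, not something the text specifies.

```python
def lin_similarity(ic_a, ic_b, ic_lcs):
    """Twice the LCS information content over the sum of both contents."""
    total = ic_a + ic_b
    if total == 0:
        return 0.0  # assumed convention when neither concept is informative
    return 2 * ic_lcs / total


def jiang_conrath_distance(ic_a, ic_b, ic_lcs):
    """Information held by either concept that is not shared via the LCS."""
    return ic_a + ic_b - 2 * ic_lcs


# Hypothetical information-content values for two concepts and their LCS.
print(lin_similarity(2.0, 2.0, 1.0))          # 0.5
print(jiang_conrath_distance(2.0, 2.0, 1.0))  # 2.0
```

The two measures use the same three quantities: the ratio approaches 1 and the distance approaches 0 as the LCS comes to carry as much information as the concepts themselves.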
One might eke out additional information from WordNet’s textual glosses, as in Lesk (1986), or use category structures other than those offered by WordNet. Looking beyond WordNet, entries in the online encyclopedia Wikipedia are not only connected by a dense topology of lateral links, they are also organized by a rich hierarchy of overlapping categories. Strube and Ponzetto (2006) show how Wikipedia can support a measure of similarity (and relatedness) that better approximates human judgments than many WordNet-based measures. Nonetheless, WordNet can be a valuable component of a hybrid measure, and Agirre et al. (2009) use an SVM (support vector machine) to combine information from WordNet with information harvested from the web. Their best similarity measure achieves a remarkable 0.93 correlation with human judgments on the Miller & Charles word-pair set. Similarity is not always applied to pairs of concepts; it is sometimes analogically applied to pairs of pairs of concepts, as in proportional analogies of the form A is to B as C is to D (e.g., hacks are to writers as mercenaries are to soldiers, or chisels are to sculptors as scalpels are to surgeons). In such analogies, one is really assessing the similarity of the unstated relationship between each pair of concepts: thus, mercenaries are soldiers whose allegiance is paid for, much as hacks are writers with income-driven loyalties; sculptors use chisels to carve stone, while surgeons use scalpels to cut or carve flesh. Veale (2004) used WordNet to assess the similarity of A:B to C:D as a function of the combined similarity of A to C and of B to D. In contrast, Turney (2005) used the web to pursue a more divergent course, to represent the tacit relationships of A to B and of C to D as points in a high-dimensional space. The dimensions of this space initially correspond to linking phrases on the web, before these dimensions are significantly reduced using singular value decomposition.
In the infamous SAT test, an analogy A:B::C:D has four other pairs of concepts that serve as likely distractors (e.g. singer:songwriter for hack:writer) and the goal is to choose the most appropriate C:D pair for a given A:B pairing. Using variants of Wu and Palmer (1994) on the 374 SAT analogies of Turney (2005), Veale (2004) reports a success rate of 38–44% using only WordNet-based similarity. In contrast, Turney (2005) reports up to 55% success on the same analogies, partly because his approach aims to match implicit relations rather than explicit concepts, and in part because it uses a divergent process to gather from the web as rich a perspective as it can on these latent relationships.

2.1 Clever Comparisons Create Similarity

Each of these approaches to similarity is a user of information, rather than a creator, and each fails to capture how a creative comparison (such as a metaphor) can spur a listener to view a topic from an atypical perspective. Camac & Glucksberg (1984) provide experimental evidence for the claim that “metaphors do not use preexisting associations to achieve their effects […] people use metaphors to create new relations between concepts.” They also offer a salutary reminder of an often overlooked fact: every comparison exploits information, but each is also a source of new information in its own right. Thus, “this cola is acid” reveals a different perspective on cola (e.g. as a corrosive substance or an irritating food) than “this acid is cola” highlights for acid (e.g. as a familiar substance). Veale & Keane (1994) model the role of similarity in realizing the long-term perlocutionary effect of an informative comparison. For example, to compare surgeons to butchers is to encourage one to see all surgeons as more bloody, … crude or careless. The reverse comparison, of butchers to surgeons, encourages one to see butchers as more skilled and precise.
Veale & Keane present a network model of memory, called Sapper, in which activation can spread between related concepts, thus allowing one concept to prime the properties of a neighbor. To interpret an analogy, Sapper lays down new activation-carrying bridges in memory between analogical counterparts, such as between surgeon & butcher, flesh & meat, and scalpel & cleaver. Comparisons can thus have lasting effects on how Sapper sees the world, changing the pattern of activation that arises when it primes a concept. Veale (2003) adopts a similarly dynamic view of similarity in WordNet, showing how an analogical comparison can result in the automatic addition of new categories and relations to WordNet itself. Veale considers the problem of finding an analogical mapping between different parts of WordNet’s noun-sense hierarchy, such as between instances of Greek god and Norse god, or between the letters of different alphabets, such as of Greek and Hebrew. But no structural similarity measure for WordNet exhibits enough discernment to e.g. assign a higher similarity to Zeus & Odin (each is the supreme deity of its pantheon) than to a pairing of Zeus and any other Norse god, just as no structural measure will assign a higher similarity to Alpha & Aleph or to Beta & Beth than to any random letter pairing. A fine-grained category hierarchy permits fine-grained similarity judgments, and though WordNet is useful, its sense hierarchies are not especially fine-grained. However, we can automatically make WordNet subtler and more discerning, by adding new fine-grained categories to unite lexical concepts whose similarity is not reflected by any existing categories. Veale (2003) shows how a property that is found in the glosses of two lexical concepts, of the same depth, can be combined with their LCS to yield a new fine-grained parent category, so e.g. “supreme” + deity = Supreme-deity (for Odin, Zeus, Jupiter, etc.) and “1st” + letter = 1st-letter (for Alpha, Aleph, etc.).
Selected aspects of the textual similarity of two WordNet glosses – the key to similarity in Lesk (1986) – can thus be reified into an explicitly categorical WordNet form.

3 Divergent (Re)Categorization

To tap into a richer source of concept properties than WordNet’s glosses, we can use web n-grams. Consider these descriptions of a cowboy from the Google n-grams (Brants & Franz, 2006). The numbers to the right are Google frequency counts.

a lonesome cowboy      432
a mounted cowboy       122
a grizzled cowboy       74
a swaggering cowboy     68

To find the stable properties that can underpin a meaningful fine-grained category for cowboy, we must seek out the properties that are so often presupposed to be salient of all cowboys that one can use them to anchor a simile, such as

4 0.81395358 124 acl-2013-Discriminative state tracking for spoken dialog systems

Author: Angeliki Metallinou ; Dan Bohus ; Jason Williams

Abstract: In spoken dialog systems, statistical state tracking aims to improve robustness to speech recognition errors by tracking a posterior distribution over hidden dialog states. Current approaches based on generative or discriminative models have different but important shortcomings that limit their accuracy. In this paper we discuss these limitations and introduce a new approach for discriminative state tracking that overcomes them by leveraging the problem structure. An offline evaluation with dialog data collected from real users shows improvements in both state tracking accuracy and the quality of the posterior probabilities. Features that encode speech recognition error patterns are particularly helpful, and training requires relatively few dialogs.

5 0.79630446 112 acl-2013-Dependency Parser Adaptation with Subtrees from Auto-Parsed Target Domain Data

Author: Xuezhe Ma ; Fei Xia

Abstract: In this paper, we propose a simple and effective approach to domain adaptation for dependency parsing. This is a feature augmentation approach in which the new features are constructed based on subtree information extracted from the auto-parsed target domain data. To demonstrate the effectiveness of the proposed approach, we evaluate it on three pairs of source-target data, compared with several common baseline systems and previous approaches. Our approach achieves significant improvement on all three pairs of data sets.

6 0.78595406 107 acl-2013-Deceptive Answer Prediction with User Preference Graph

7 0.77951014 376 acl-2013-Using Lexical Expansion to Learn Inference Rules from Sparse Data

8 0.77620488 286 acl-2013-Psycholinguistically Motivated Computational Models on the Organization and Processing of Morphologically Complex Words

9 0.77127588 61 acl-2013-Automatic Interpretation of the English Possessive

10 0.76338601 363 acl-2013-Two-Neighbor Orientation Model with Cross-Boundary Global Contexts

11 0.76114035 71 acl-2013-Bootstrapping Entity Translation on Weakly Comparable Corpora

12 0.75591123 50 acl-2013-An improved MDL-based compression algorithm for unsupervised word segmentation

13 0.75180113 26 acl-2013-A Transition-Based Dependency Parser Using a Dynamic Parsing Strategy

14 0.75083745 245 acl-2013-Modeling Human Inference Process for Textual Entailment Recognition

15 0.74197686 170 acl-2013-GlossBoot: Bootstrapping Multilingual Domain Glossaries from the Web

16 0.73424673 75 acl-2013-Building Japanese Textual Entailment Specialized Data Sets for Inference of Basic Sentence Relations

17 0.72955978 358 acl-2013-Transition-based Dependency Parsing with Selectional Branching

18 0.72748297 242 acl-2013-Mining Equivalent Relations from Linked Data

19 0.71774876 202 acl-2013-Is a 204 cm Man Tall or Small ? Acquisition of Numerical Common Sense from the Web

20 0.71127868 387 acl-2013-Why-Question Answering using Intra- and Inter-Sentential Causal Relations