acl acl2011 acl2011-22 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Eyal Shnarch ; Jacob Goldberger ; Ido Dagan
Abstract: Recognizing entailment at the lexical level is an important and commonly-addressed component in textual inference. Yet, this task has been mostly approached by simplified heuristic methods. This paper proposes an initial probabilistic modeling framework for lexical entailment, with suitable EM-based parameter estimation. Our model considers prominent entailment factors, including differences in lexical-resources reliability and the impacts of transitivity and multiple evidence. Evaluations show that the proposed model outperforms most prior systems while pointing at required future improvements. 1 Introduction and Background Textual Entailment was proposed as a generic paradigm for applied semantic inference (Dagan et al., 2006). This task requires deciding whether a textual statement (termed the hypothesis, H) can be inferred (entailed) from another text (termed the text, T). Since it was first introduced, the six rounds of the Recognizing Textual Entailment (RTE) challenges (http://www.nist.gov/tac/2010/RTE/index.html), currently organized under NIST, have become a standard benchmark for entailment systems. These systems tackle their complex task at various levels of inference, including logical representation (Tatu and Moldovan, 2007; MacCartney and Manning, 2007), semantic analysis (Burchardt et al., 2007) and syntactic parsing (Bar-Haim et al., 2008; Wang et al., 2009). Inference at these levels usually requires substantial processing and resources (e.g. parsing) aiming at high performance. Nevertheless, simple entailment methods, performing at the lexical level, provide strong baselines which most systems did not outperform (Mirkin et al., 2009; Majumdar and Bhattacharyya, 2010). Within complex systems, lexical entailment modeling is an important component. Finally, there are cases in which a full system cannot be used (e.g. lacking a parser for a targeted language) and one must resort to the simpler lexical approach. While lexical entailment methods are widely used, most of them apply ad hoc heuristics which do not rely on a principled underlying framework. Typically, such methods quantify the degree of lexical coverage of the hypothesis terms by the text's terms. Coverage is determined either by a direct match of identical terms in T and H or by utilizing lexical semantic resources, such as WordNet (Fellbaum, 1998), that capture lexical entailment relations (denoted here as entailment rules). Common heuristics for quantifying the degree of coverage are setting a threshold on the percentage coverage of H's terms (Majumdar and Bhattacharyya, 2010), counting the absolute number of uncovered terms (Clark and Harrison, 2010), or applying an Information Retrieval-style vector space similarity score (MacKinlay and Baldwin, 2009). Other works (Corley and Mihalcea, 2005; Zanzotto and Moschitti, 2006) have applied a heuristic formula to estimate the similarity between text fragments based on a similarity function between their terms. These heuristics do not capture several important aspects of entailment, such as varying reliability of entailment resources and the impact of rule chaining and multiple evidence on entailment likelihood. An additional observation from these and other systems is that their performance improves only moderately when utilizing lexical resources (see the ablation tests reports in http://aclweb.org/aclwiki/index.php?title=RTE Knowledge Resources#Ablation Tests).
We believe that the textual entailment field would benefit from more principled models for various entailment phenomena. Inspired by the earlier steps in the evolution of Statistical Machine Translation methods (such as the initial IBM models (Brown et al., 1993)), we formulate a concrete generative probabilistic modeling framework that captures the basic aspects of lexical entailment. Parameter estimation is addressed by an EM-based approach, which enables estimating the hidden lexical-level entailment parameters from entailment annotations which are available only at the sentence level. While heuristic methods are limited in their ability to wisely integrate indications for entailment, probabilistic methods have the advantage of being extendable and enabling the utilization of well-founded probabilistic methods such as the EM algorithm. We compared the performance of several model variations to previously published results on RTE data sets, as well as to our own implementation of typical lexical baselines. Results show that both the probabilistic model and our percentage-coverage baseline perform favorably relative to prior art. These results support the viability of the probabilistic framework while pointing at certain modeling aspects that need to be improved. 2 Probabilistic Model Under the lexical entailment scope, our modeling goal is obtaining a probabilistic score for the likelihood that all H's terms are entailed by T. To that end, we model prominent aspects of lexical entailment, which were mostly neglected by previous lexical methods: (1) distinguishing different reliability levels of lexical resources; (2) allowing transitive chains of rule applications and considering their length when estimating their validity; and (3) considering multiple entailments when entailing a term. Figure 1: The generative process of entailing terms of a hypothesis from a text. Edges represent entailment rules. There are 3 evidences for the entailment of hi: a rule from Resource1, another one from Resource3, both suggesting that tj entails it, and a chain from t1 through an intermediate term t0. 2.1 Model Description For T to entail H it is usually necessary, but not sufficient, that every term h ∈ H would be entailed by at least one term t ∈ T (Glickman et al., 2006). Figure 1 describes the process of entailing hypothesis terms. The trivial case is when identical terms, possibly at the stem or lemma level, appear in T and H (a direct match, as tn and hm in Figure 1). Alternatively, we can establish entailment based on knowledge of entailing lexical-semantic relations, such as synonyms, hypernyms and morphological derivations, available in lexical resources (e.g. the rule inference → reasoning from WordNet). We denote by R(r) the resource which provided the rule r. Since entailment is a transitive relation, rules may compose transitive chains that connect a term t ∈ T to a term h ∈ H through intermediate terms. For instance, from the rules infer → inference and inference → reasoning we can deduce the rule infer → reasoning (where inference is the intermediate term, as t0 in Figure 1).
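As an illustration of the chaining step just described, the following is a minimal sketch (in Python, not the authors' code) of how transitive rule chains up to a fixed length could be enumerated. The dictionary-of-rules representation, the resource names and the max_len parameter are assumptions made for the example; cycles are bounded only by the length cap, and direct matches are represented by the special MATCH resource introduced below.

def find_chains(t_terms, h, rules_by_resource, max_len=4):
    """Enumerate rule chains t ->c h of up to max_len rule applications.
    rules_by_resource maps a resource name to a dict {lhs: set_of_rhs}; each
    returned chain is the list of resource names of its rule applications."""
    chains = [["MATCH"]] if h in t_terms else []   # direct match case
    frontier = [(t, []) for t in t_terms]          # (current term, resources so far)
    for _ in range(max_len):
        next_frontier = []
        for term, path in frontier:
            for resource, rules in rules_by_resource.items():
                for rhs in rules.get(term, set()):
                    if rhs == h:
                        chains.append(path + [resource])
                    else:
                        next_frontier.append((rhs, path + [resource]))
        frontier = next_frontier
    return chains

For example, with rules_by_resource = {"WordNet": {"infer": {"inference"}, "inference": {"reasoning"}}}, calling find_chains(["infer"], "reasoning", rules_by_resource) returns the two-step chain ["WordNet", "WordNet"], mirroring the infer → inference → reasoning deduction above.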
Multiple chains may connect t to h (as for tj and hi in Figure 1) or connect several terms in T to h (as t1 and tj are indicating the entailment of hi in Figure 1), thus providing multiple evidence for h's entailment. It is reasonable to expect that if a term t indeed entails a term h, it is likely to find evidence for this relation in several resources. Taking a probabilistic perspective, we assume a parameter θR for each resource R, denoting its reliability, i.e. the prior probability that applying a rule from R corresponds to a valid entailment instance. Direct matches are considered as a special "resource", called MATCH, for which θMATCH is expected to be close to 1. We now present our probabilistic model. For a text term t ∈ T to entail a hypothesis term h by a chain c, denoted by t →c h, the application of every r ∈ c must be valid. Note that a rule r in a chain c connects two terms (its left-hand-side and its right-hand-side, denoted lhs → rhs). The lhs of the first rule in c is t ∈ T and the rhs of the last rule in it is h ∈ H. We denote the event of a valid rule application by lhs →r rhs. Since a-priori a rule r is valid with probability θR(r), and assuming independence of all r ∈ c, we obtain Eq. 1 to specify the probability of the event t →c h. Next, let C(h) denote the set of chains which suggest the entailment of h. The probability that T does not entail h at all (by any chain), specified in Eq. 2, is the probability that all these chains are not valid. Finally, the probability that T entails all of H, assuming independence of H's terms, is the probability that every h ∈ H is entailed, as given in Eq. 3. Notice that there could be a term h which is not covered by any available rule chain. Under this formulation, we assume that each such h is covered by a single rule coming from a special "resource" called UNCOVERED (expecting θUNCOVERED to be relatively small).

p(t →c h) = ∏_{r∈c} p(lhs →r rhs) = ∏_{r∈c} θR(r)    (1)

p(T ↛ h) = ∏_{c∈C(h)} [1 − p(t →c h)]    (2)

p(T → H) = ∏_{h∈H} p(T → h)    (3)

As can be seen, our model indeed distinguishes varying resource reliability, decreases entailment probability as rule chains grow and increases it when entailment of a term is supported by multiple chains. The above treatment of uncovered terms in H, as captured in Eq. 3, assumes that their entailment probability is independent of the rest of the hypothesis. However, when the number of covered hypothesis terms increases, the probability that the remaining terms are actually entailed by T increases too (even though we do not have supporting knowledge for their entailment). Thus, an alternative model is to group all uncovered terms together and estimate the overall probability of their joint entailment as a function of the lexical coverage of the hypothesis. We denote Hc as the subset of H's terms which are covered by some rule chain and Huc as the remaining uncovered part.
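To make Eqs. 1-3 concrete, here is a minimal sketch of the scoring computation. The θ values below are illustrative assumptions (not estimates from the paper), and a chain is represented as the list of resource names of its rule applications, as in the sketch above.

THETA = {"MATCH": 0.95, "WordNet": 0.8, "CatVar": 0.7, "UNCOVERED": 0.1}  # assumed values

def p_chain(chain, theta):
    """Eq. 1: a chain is valid iff every rule application in it is valid."""
    p = 1.0
    for resource in chain:
        p *= theta[resource]
    return p

def p_term_entailed(chains_for_h, theta):
    """Complement of Eq. 2: p(T -> h) = 1 - prod over chains of (1 - p(t ->c h)).
    A term with no chains falls back to the single UNCOVERED pseudo-rule."""
    chains_for_h = chains_for_h or [["UNCOVERED"]]
    p_no_valid_chain = 1.0
    for chain in chains_for_h:
        p_no_valid_chain *= 1.0 - p_chain(chain, theta)
    return 1.0 - p_no_valid_chain

def p_hypothesis_entailed(chains_per_term, theta=THETA):
    """Eq. 3: product over the hypothesis terms (independence assumption)."""
    p = 1.0
    for chains_for_h in chains_per_term.values():
        p *= p_term_entailed(chains_for_h, theta)
    return p

# Toy pair mirroring Figure 1: h1 has two one-rule chains and one two-rule chain,
# h2 is a direct match, h3 is uncovered.
example = {"h1": [["WordNet"], ["CatVar"], ["WordNet", "WordNet"]],
           "h2": [["MATCH"]],
           "h3": []}
print(p_hypothesis_entailed(example))

Longer chains and uncovered terms lower the product, while additional chains for the same term raise it, which is exactly the qualitative behavior described above.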
Eq. 3a then provides a refined entailment model for H, in which the second term specifies the probability that Huc is entailed given that Hc is validly entailed and the corresponding lengths:

p(T → H) = [∏_{h∈Hc} p(T → h)] · p(T → Huc | |Hc|, |H|)    (3a)

2.2 Parameter Estimation The difficulty in estimating the θR values is that these are term-level parameters while the RTE training entailment annotation is given at the sentence level. Therefore, we use EM-based estimation for the hidden parameters (Dempster et al., 1977). In the E step we use the current θR values to compute all whcr(T, H) values for each training pair. whcr(T, H) stands for the posterior probability that the application of the rule r in the chain c for h ∈ H is valid, given that T either entails H or not according to the training annotation (see Eq. 4). Remember that a rule r provides an entailment relation between its left-hand-side (lhs) and its right-hand-side (rhs). Therefore Eq. 4 uses the notation lhs →r rhs to designate the application of the rule r (similar to Eq. 1).

E: whcr(T, H) = p(lhs →r rhs | T → H) if T → H;  p(lhs →r rhs | T ↛ H) if T ↛ H    (4)

After applying Bayes' rule we get a fraction with Eq. 3 in its denominator and θR(r) as the second term of the numerator. The first numerator term is defined as in Eq. 3 except that for the corresponding rule application we substitute θR(r) by 1 (per the conditioning event). The probabilistic model defined by Eq. 1-3 is a loop-free directed acyclic graphical model (aka a Bayesian network). Hence the E-step probabilities can be efficiently calculated using the belief propagation algorithm (Pearl, 1988). The M step uses Eq. 5 to update the parameter set. For each resource R we average the whcr(T, H) values for all its rule applications in the training, whose total number is denoted nR.

M: θR = (1/nR) Σ_{T,H} Σ_{h∈H} Σ_{c∈C(h)} Σ_{r∈c : R(r)=R} whcr(T, H)    (5)

For Eq. 3a we need to estimate also p(T → Huc | |Hc|, |H|). This is done directly via a maximum likelihood estimation over the training set, by calculating the proportion of entailing examples within the set of all examples of a given hypothesis length (|H|) and a given number of covered terms (|Hc|). As |Hc| we take the number of identical terms in T and H (exact match), since in almost all cases terms in H which have an exact match in T are indeed entailed. We also tried initializing the EM algorithm with these direct estimations but did not obtain performance improvements. 3 Evaluations and Results The 5th Recognizing Textual Entailment challenge (RTE-5) introduced a new search task (Bentivogli et al., 2009) which became the main task in RTE-6 (Bentivogli et al., 2010). In this task participants should find all sentences that entail a given hypothesis in a given document cluster. This task's data sets reflect a natural distribution of entailments in a corpus and demonstrate a more realistic scenario than the previous RTE challenges. In our system, sentences are tokenized and stripped of stop words, and terms are lemmatized and tagged for part-of-speech. As lexical resources we use WordNet (WN) (Fellbaum, 1998), taking as entailment rules synonyms, derivations, hyponyms and meronyms of the first senses of T and H terms, and the CatVar (Categorial Variation) database (Habash and Dorr, 2003). We allow rule chains of length up to 4 in WordNet (WN4).
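Before turning to the evaluation, here is a minimal sketch of how the EM procedure of Section 2.2 could be realized, reusing the chain representation and the Eq. 1-3 computation from the sketches above. It is an illustration under simplifying assumptions rather than the paper's implementation: the paper computes the E-step exactly with belief propagation, whereas this sketch recomputes Eq. 3 with the relevant rule application clamped to valid, which is the "substitute θR(r) by 1" device behind Eq. 4.

def p_entail(chains_per_term, theta, clamp=None):
    """p(T -> H) per Eqs. 1-3; clamp=(h, chain_idx, rule_idx) forces that one
    rule application to be valid, i.e. substitutes its theta by 1."""
    p_H = 1.0
    for h, chains in chains_per_term.items():
        chains = chains or [["UNCOVERED"]]
        p_no_chain = 1.0
        for ci, chain in enumerate(chains):
            p_c = 1.0
            for ri, resource in enumerate(chain):
                p_c *= 1.0 if clamp == (h, ci, ri) else theta[resource]
            p_no_chain *= 1.0 - p_c
        p_H *= 1.0 - p_no_chain
    return p_H

def em_step(training_pairs, theta):
    """One EM iteration; each training pair is (chains_per_term, entails_flag)."""
    wsum = {R: 0.0 for R in theta}
    n = {R: 0 for R in theta}
    for chains_per_term, entails in training_pairs:                 # E step (Eq. 4)
        p_full = p_entail(chains_per_term, theta)
        for h, chains in chains_per_term.items():
            for ci, chain in enumerate(chains or [["UNCOVERED"]]):
                for ri, resource in enumerate(chain):
                    p_clamped = p_entail(chains_per_term, theta, clamp=(h, ci, ri))
                    if entails:
                        w = p_clamped * theta[resource] / p_full
                    else:
                        w = (1.0 - p_clamped) * theta[resource] / (1.0 - p_full)
                    wsum[resource] += w
                    n[resource] += 1
    # M step (Eq. 5): average the posterior weights per resource
    return {R: (wsum[R] / n[R] if n[R] else theta[R]) for R in theta}

Iterating em_step until the θ values stabilize yields the resource reliabilities; the exhaustive recomputation per rule application is quadratic per pair, which is what belief propagation avoids in the paper.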
We compare our model to two types of baselines: (1) RTE published results: the average of the best runs of all systems, the best and second best performing lexical systems, and the best full system of each challenge; (2) our implementation of a lexical coverage model, tuning the percentage-of-coverage threshold for entailment on the training set. This model uses the same configuration as our probabilistic model. We also implemented an Information Retrieval-style baseline (utilizing the Lucene search engine, http://lucene.apache.org), both with and without lexical expansions, but given its poorer performance we omit its results here. Table 1 presents the results. We can see that both our implemented models (probabilistic and coverage) outperform all RTE lexical baselines on both data sets, apart from (Majumdar and Bhattacharyya, 2010), which incorporates additional lexical resources, a named entity recognizer and a co-reference system. On RTE-5, the probabilistic model is comparable in performance to the best full system, while the coverage model achieves considerably better results. We notice that our implemented models successfully utilize resources to increase performance, as opposed to typical smaller or less consistent improvements in prior works (see Section 1). Table 1: Evaluation results (F1%) on RTE-5 and RTE-6 for the RTE baseline systems and for the coverage and probabilistic models under several resource configurations (no resources, +WN, +WN4, +CatVar and their combinations). RTE systems are: (1) (MacKinlay and Baldwin, 2009), (2) (Clark and Harrison, 2010), (3) (Mirkin et al., 2009) (2 submitted runs), (4) (Majumdar and Bhattacharyya, 2010) and (5) (Jia et al., 2010). While the probabilistic and coverage models are comparable on RTE-6 (with a non-significant advantage for the former), on RTE-5 the latter performs better, suggesting that the probabilistic model needs to be further improved. In particular, WN4 performs better than the single-step WN only on RTE-5, suggesting the need to improve the modeling of chaining. The fluctuations over the data sets and impacts of resources suggest the need for further investigation over additional data sets and resources. As for the coverage model, under our configuration it poses a bigger challenge for RTE systems than previously reported baselines. It is thus proposed as an easy to implement baseline for future entailment research. 4 Conclusions and Future Work This paper presented, for the first time, a principled and relatively rich probabilistic model for lexical entailment, amenable for estimation of hidden lexical-level parameters from standard sentence-level annotations. The positive results of the probabilistic model compared to prior art and its ability to exploit lexical resources indicate its future potential. Yet, further investigation is needed. For example, analyzing the current model's limitations, we observed that the multiplicative nature of Eqs. 1 and 3 (reflecting independence assumptions) is too restrictive, resembling a logical AND. Accordingly we plan to explore relaxing this strict conjunctive behavior through models such as noisy-AND (Pearl, 1988). We also intend to explore the contribution of our model, and particularly its estimated parameter values, within a complex system that integrates multiple levels of inference.
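For readers who want to reproduce the lexical baseline discussed in Section 3, here is a minimal sketch of a percentage-of-coverage classifier with a threshold tuned on training data. The single-step rule lookup, the candidate threshold grid and the F1-based selection are simplifying assumptions; the paper's coverage model shares the full chain configuration of the probabilistic model.

def coverage_score(h_terms, t_terms, rules):
    """Fraction of hypothesis terms covered by a direct match or by one rule
    whose lhs appears in the text; rules maps a text term to the terms it entails."""
    covered = sum(1 for h in h_terms
                  if h in t_terms or any(h in rules.get(t, set()) for t in t_terms))
    return covered / len(h_terms) if h_terms else 1.0

def tune_threshold(train_pairs, candidate_thresholds):
    """Pick the percentage-of-coverage threshold maximizing F1 on the training set.
    Each training pair is (h_terms, t_terms, rules, gold_label)."""
    def f1(th):
        tp = fp = fn = 0
        for h_terms, t_terms, rules, gold in train_pairs:
            pred = coverage_score(h_terms, t_terms, rules) >= th
            tp += pred and gold
            fp += pred and not gold
            fn += (not pred) and gold
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return max(candidate_thresholds, key=f1)

A typical call would be tune_threshold(train_pairs, [i / 100 for i in range(101)]), after which sentences whose coverage score reaches the selected threshold are classified as entailing.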
Acknowledgments This work was partially supported by the NEGEV Consortium of the Israeli Ministry of Industry, Trade and Labor (www.negev-initiative.org), the PASCAL-2 Network of Excellence of the European Community FP7-ICT-2007-1-216886, the FIRB-Israel research project N. RBIN045PXH and by the Israel Science Foundation grant 1112/08. References Roy Bar-Haim, Jonathan Berant, Ido Dagan, Iddo Greental, Shachar Mirkin, Eyal Shnarch, and Idan Szpektor. 2008. Efficient semantic deduction and approximate matching over compact parse forests. In Proceedings of Text Analysis Conference (TAC). Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo Giampiccolo, and Bernardo Magnini. 2009. The fifth PASCAL recognizing textual entailment challenge. In Proceedings of Text Analysis Conference (TAC). Luisa Bentivogli, Peter Clark, Ido Dagan, Hoa Trang Dang, and Danilo Giampiccolo. 2010. The sixth PASCAL recognizing textual entailment challenge. In Proceedings of Text Analysis Conference (TAC). Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19(2):263–311, June. Aljoscha Burchardt, Nils Reiter, Stefan Thater, and Anette Frank. 2007. A semantic approach to textual entailment: System evaluation and task analysis. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing. Peter Clark and Phil Harrison. 2010. BLUE-Lite: a knowledge-based lexical entailment system for RTE6. In Proceedings of Text Analysis Conference (TAC). Courtney Corley and Rada Mihalcea. 2005. Measuring the semantic similarity of texts. In Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment. Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Lecture Notes in Computer Science, volume 3944, pages 177–190. A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38. Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database (Language, Speech, and Communication). The MIT Press. Oren Glickman, Eyal Shnarch, and Ido Dagan. 2006. Lexical reference: a semantic matching subtask. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 172–179. Association for Computational Linguistics. Nizar Habash and Bonnie Dorr. 2003. A categorial variation database for English. In Proceedings of the North American Association for Computational Linguistics. Houping Jia, Xiaojiang Huang, Tengfei Ma, Xiaojun Wan, and Jianguo Xiao. 2010. PKUTM participation at TAC 2010 RTE and summarization track. In Proceedings of Text Analysis Conference (TAC). Bill MacCartney and Christopher D. Manning. 2007. Natural logic for textual inference. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing. Andrew MacKinlay and Timothy Baldwin. 2009. A baseline approach to the RTE5 search pilot. In Proceedings of Text Analysis Conference (TAC). Debarghya Majumdar and Pushpak Bhattacharyya. 2010. Lexical based text entailment system for main task of RTE6. In Proceedings of Text Analysis Conference (TAC). Shachar Mirkin, Roy Bar-Haim, Jonathan Berant, Ido Dagan, Eyal Shnarch, Asher Stern, and Idan Szpektor. 2009. Addressing discourse and document structure in the RTE search task. In Proceedings of Text Analysis Conference (TAC). Judea Pearl.
1988. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann. Marta Tatu and Dan Moldovan. 2007. COGEX at RTE 3. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing. Rui Wang, Yi Zhang, and Guenter Neumann. 2009. A joint syntactic-semantic representation for recognizing textual relatedness. In Proceedings of Text Analysis Conference (TAC). Fabio Massimo Zanzotto and Alessandro Moschitti. 2006. Automatic learning of textual entailments with cross-pair similarities. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Recognizing entailment at the lexical level is an important and commonly-addressed component in textual inference. [sent-10, score-0.877]
2 This paper proposes an initial probabilistic modeling framework for lexical entailment, with suitable EM-based parameter estimation. [sent-12, score-0.219]
3 Our model considers prominent entailment factors, including differences in lexical-resources reliability and the impacts of transitivity and multiple evidence. [sent-13, score-0.806]
4 Evaluations show that the proposed model outperforms most prior systems while pointing at required future improvements. [sent-14, score-0.06]
5 1 Introduction and Background Textual Entailment was proposed as a generic paradigm for applied semantic inference (Dagan et al. [sent-15, score-0.034]
6 Since it was first introduced, the six rounds of the Recognizing Textual Entailment (RTE) challenges1 , currently organized under NIST, have become a standard benchmark for entailment systems. [sent-18, score-0.687]
7 These systems tackle their complex task at various levels of inference, including logical representation (Tatu and Moldovan, 2007; MacCartney and Manning, 2007), semantic analysis (Burchardt et al. [sent-19, score-0.029]
8 Nevertheless, simple entailment methods, performing at the lexical level, provide strong baselines which most systems did not outperform (Mirkin et al. [sent-29, score-0.76]
9 Within complex systems, lexical entailment modeling is an important component. [sent-31, score-0.792]
10 lacking a parser for a targeted language) and one must resort to the simpler lexical approach. [sent-34, score-0.073]
11 While lexical entailment methods are widely used, most of them apply ad hoc heuristics which do not rely on a principled underlying framework. [sent-35, score-0.829]
12 Typically, such methods quantify the degree of lexical coverage of the hypothesis terms by the text’s terms. [sent-36, score-0.228]
13 Coverage is determined either by a direct match of identical terms in T and H or by utilizing lexical semantic resources, such as WordNet (Fellbaum, 1998), that capture lexical entailment relations (denoted here as entailment rules). [sent-37, score-1.603]
14 These heuristics do not capture several important aspects of entailment, such as varying reliability of [sent-40, score-0.118]
15 entailment resources and the impact of rule chaining and multiple evidence on entailment likelihood. [sent-42, score-1.525]
16 An additional observation from these and other systems is that their performance improves only moderately when utilizing lexical resources2. [sent-43, score-0.073]
17 We believe that the textual entailment field would benefit from more principled models for various entailment phenomena. [sent-44, score-1.531]
18 , 1993)), we formulate a concrete generative probabilistic modeling framework that captures the basic aspects of lexical entailment. [sent-46, score-0.221]
19 Parameter estimation is addressed by an EM-based approach, which enables estimating the hidden lexical-level entailment parameters from entailment annotations which are available only at the sentence-level. [sent-47, score-1.428]
20 While heuristic methods are limited in their ability to wisely integrate indications for entailment, probabilistic methods have the advantage of being extendable and enabling the utilization of well-founded probabilistic methods such as the EM algorithm. [sent-48, score-0.172]
21 We compared the performance of several model variations to previously published results on RTE data sets, as well as to our own implementation of typical lexical baselines. [sent-49, score-0.073]
22 Results show that both the probabilistic model and our percentage-coverage baseline perform favorably relative to prior art. [sent-50, score-0.114]
23 These results support the viability of the probabilistic framework while pointing at certain modeling aspects that need to be improved. [sent-51, score-0.18]
24 2 Probabilistic Model Under the lexical entailment scope, our modeling goal is obtaining a probabilistic score for the likelihood that all H’s terms are entailed by T. [sent-52, score-1.067]
25 Figure 1: The generative process of entailing terms of a hypothesis from a text. [sent-57, score-0.208]
26 There are 3 evidences for the entailment of hi: a rule from Resource1, another one from Resource3, both suggesting that tj entails it, and a chain from t1 through an intermediate term t0. [sent-59, score-1.08]
27 1 Model Description For T to entail H it is usually necessary, but not sufficient, that every term h ∈ H would be entailed by at least one term t ∈ T (Glickman et al. [sent-61, score-0.138]
28 Figure 1 describes the process of entailing hypothesis terms. [sent-63, score-0.163]
29 The trivial case is when identical terms, possibly at the stem or lemma level, appear in T and H (a direct match as tn and hm in Figure 1). [sent-64, score-0.038]
30 Alternatively, we can establish entailment based on knowledge of entailing lexical-semantic relations, such as synonyms, hypernyms and morphological derivations, available in lexical resources (e.g. the rule inference → reasoning from WordNet). [sent-65, score-0.932]
31 We denote by R(r) the resource which provided the rule r. [sent-68, score-0.048]
32 Since entailment is a transitive relation, rules may compose transitive chains that connect a term t ∈ T to a term h ∈ H through intermediate terms. [sent-70, score-0.965]
33 For instance, from the rules infer → inference and inference → reasoning we can deduce the rule infer → reasoning (where inference is the intermediate term, as t0 in Figure 1). [sent-71, score-0.204]
34 Multiple chains may connect t to h (as for tj and hi in Figure 1) or connect several terms in T to h (as t1 and tj are indicating the entailment of hi in Figure 1), thus providing multiple evidence for h’s entailment. [sent-72, score-1.07]
35 It is reasonable to expect that if a term t indeed entails a term h, it is likely to find evidences for this relation in several resources. [sent-73, score-0.204]
36 Taking a probabilistic perspective, we assume a parameter θR for each resource R, denoting its reliability, i. [sent-74, score-0.162]
37 the prior probability that applying a rule from R corresponds to a valid entailment instance. [sent-76, score-0.849]
38 For a text term t ∈ T to entail a hypothesis term h by a chain c, denoted by t →c h, the application of every r ∈ c must be valid. [sent-79, score-0.244]
39 Note that a rule r in a chain c connects two terms (its left-hand-side and its right-hand-side, denoted lhs → rhs). [sent-80, score-0.099]
40 The lhs of the first rule in c is t ∈ T and the rhs of the last rule in it is h ∈ H. [sent-81, score-0.13]
41 We denote the event of a valid rule application by lhs →r rhs. [sent-83, score-0.054]
42 Since a-priori a rule r is valid with probability θR(r), and assuming independence of all r ∈ c, we obtain Eq. [sent-84, score-0.125]
43 1 to specify the probability of the event t →c h. [sent-85, score-0.035]
44 Next, let C(h) denote the set of chains which suggest the entailment of h. [sent-86, score-0.094]
45 The probability that T does not entail h at all (by any chain), specified in Eq. [sent-87, score-0.11]
46 2, is the probability that all these chains are not valid. [sent-88, score-0.129]
47 Finally, the probability that T entails all of H, assuming independence of H’s terms, is the probability that every h ∈ H is entailed, as given in Eq. [sent-89, score-0.15]
48 Notice that there could be a term h which is not covered by any available rule chain. [sent-91, score-0.207]
49 Under this formulation, we assume that each such h is covered by a single rule coming from a special “resource” called UNCOVERED (expecting θUNCOVERED to be relatively small). [sent-92, score-0.144]
50 The above treatment of uncovered terms in H, as captured in Eq. [sent-94, score-0.158]
51 3, assumes that their entailment probability is independent of the rest of the hypothesis. [sent-95, score-0.722]
52 However, when the number of covered hypothesis terms increases, the probability that the remaining terms are actually entailed by T increases too (even though we do not have supporting knowledge for their entailment). [sent-96, score-0.357]
53 Thus, an alternative model is to group all uncovered terms together and estimate the overall probability of their joint entailment as a function of the lexical coverage of the hypothesis. [sent-97, score-1.02]
54 We denote Hc as the subset of H’s terms which are covered by some rule chain and Huc as the remaining uncovered part. [sent-98, score-0.344]
55 3a then provides a refined entailment model for H, in which the second term specifies the probability that Huc is entailed given that Hc is validly entailed and the corresponding lengths: p(T → H) = [∏_{h∈Hc} p(T → h)] · p(T → Huc | |Hc|, |H|) (3a). [sent-100, score-1.073]
56 2.2 Parameter Estimation The difficulty in estimating the θR values is that these are term-level parameters while the RTE training entailment annotation is given at the sentence level. [sent-101, score-0.715]
57 Therefore, we use EM-based estimation for the hidden parameters (Dempster et al. [sent-102, score-0.026]
58 In the E step we use the current θR values to compute all whcr (T, H) values for each training pair. [sent-104, score-0.082]
59 whcr (T, H) stands for the posterior probability that the application of the rule r in the chain c for h ∈ H is valid, given that T either entails H or not according to the training annotation (see Eq. [sent-105, score-0.258]
60 Remember that a rule r provides an entailment relation between its left-hand-side (lhs) and its right-hand-side (rhs). [sent-107, score-0.786]
61 4 uses the notation lhs →r rhs to designate the application of the rule r (similar to Eq. [sent-109, score-0.286]
62 whcr(T, H) = p(lhs →r rhs | T → H) if T → H; p(lhs →r rhs | T ↛ H) if T ↛ H (4). After applying Bayes’ rule we get a fraction with Eq. [sent-111, score-0.099]
63 3 in its denominator and θR(r) as the second term of the numerator. [sent-112, score-0.063]
64 3 except that for the corresponding rule application we substitute θR(r) by 1(per the conditioning event). [sent-114, score-0.099]
65 For each resource R we average the whcr (T, H) values for all its rule applications in the training, whose total number is denoted nR. [sent-120, score-0.229]
66 For Eq. 3a we need to estimate also p(T → Huc | |Hc|, |H|); this is done directly via a maximum likelihood estimation [sent-123, score-0.029]
67 over the training set, by calculating the proportion of entailing examples within the set of all examples of a given hypothesis length (|H|) and a given number of covered terms (|Hc|). [sent-124, score-0.163]
68 As |Hc| we take the number of identical terms in T and H (exact match), since in almost all cases terms in H which have an exact match in T are indeed entailed. [sent-126, score-0.038]
69 In this task participants should find all sentences that entail a given hypothesis in a given document cluster. [sent-131, score-0.118]
70 This task’s data sets reflect a natural distribution of entailments in a corpus and demonstrate a more realistic scenario than the previous RTE challenges. [sent-132, score-0.066]
71 In our system, sentences are tokenized and stripped of stop words and terms are lemmatized and tagged for part-of-speech. [sent-133, score-0.045]
72 As lexical resources we use WordNet (WN) (Fellbaum, 1998), taking as entailment rules synonyms, derivations, hyponyms and meronyms of the first senses of T and H terms, and the CatVar (Categorial Variation) database (Habash and Dorr, 2003). [sent-134, score-0.84]
73 We allow rule chains of length up to 4 in WordNet (WN4). [sent-135, score-0.193]
74 We also implemented an Information Retrieval style baseline (both with and without lexical expansions), but given its poorer performance we omit its results here. [sent-138, score-0.073]
75 We can see that both our implemented models (probabilistic and coverage) outperform all RTE lexical baselines on both data sets, apart from (Majumdar and Bhattacharyya, 2010) which incorporates additional lexical resources, a named entity recognizer and a co-reference system. [sent-140, score-0.146]
76 On RTE-5, the probabilistic model is comparable in performance to the best full system, while the coverage model achieves considerably better results. [sent-141, score-0.153]
77 We notice that our implemented models successfully utilize resources to increase performance, as opposed to typical smaller or less consistent improvements in prior works (see Section 1). [sent-142, score-0.08]
78 While the probabilistic and coverage models are comparable on RTE-6 (with a non-significant advantage for the former), on RTE-5 the latter performs better, [sent-156, score-0.153]
79 suggesting that the probabilistic model needs to be further improved. [sent-158, score-0.116]
80 In particular, WN4 performs better than the single-step WN only on RTE-5, suggesting the need to improve the modeling of chaining. [sent-159, score-0.062]
81 The fluctuations over the data sets and impacts of resources suggest the need for further investigation over additional data sets and resources. [sent-160, score-0.084]
82 As for the coverage model, under our configuration it poses a bigger challenge for RTE systems than previously reported baselines. [sent-161, score-0.067]
83 It is thus proposed as an easy to implement baseline for future entailment research. [sent-162, score-0.687]
84 4 Conclusions and Future Work This paper presented, for the first time, a principled and relatively rich probabilistic model for lexical entailment, amenable for estimation of hidden lexical-level parameters from standard sentence-level annotations. [sent-163, score-0.225]
85 The positive results of the probabilistic model compared to prior art and its ability to exploit lexical resources indicate its future potential. [sent-164, score-0.239]
86 1 and 3 (reflecting independence assumptions) is too restrictive, resembling a logical AND. [sent-167, score-0.036]
87 We also intend to explore the contribution of our model, and particularly its estimated parameter values, within a complex system that integrates multiple levels of inference. [sent-169, score-0.057]
88 A semantic approach to textual entailment: System evaluation and task analysis. [sent-196, score-0.117]
89 Lexical based text entailment system for main task of RTE6. [sent-248, score-0.687]
wordName wordTfidf (topN-words)
[('entailment', 0.687), ('rte', 0.16), ('tac', 0.156), ('hc', 0.145), ('entailed', 0.144), ('majumdar', 0.136), ('dagan', 0.125), ('ido', 0.123), ('entailing', 0.12), ('textual', 0.117), ('uncovered', 0.113), ('rule', 0.099), ('lhs', 0.099), ('huc', 0.096), ('chains', 0.094), ('bentivogli', 0.088), ('mirkin', 0.088), ('rhs', 0.088), ('shnarch', 0.088), ('probabilistic', 0.086), ('bhattacharyya', 0.083), ('whcr', 0.082), ('entail', 0.075), ('eyal', 0.075), ('lexical', 0.073), ('mackinlay', 0.072), ('coverage', 0.067), ('entailments', 0.066), ('term', 0.063), ('glickman', 0.062), ('israel', 0.061), ('reliability', 0.059), ('recognizing', 0.056), ('burchardt', 0.054), ('rluel', 0.054), ('resources', 0.052), ('reasoning', 0.051), ('pascal', 0.048), ('tatu', 0.048), ('resource', 0.048), ('covered', 0.045), ('terms', 0.045), ('pearl', 0.044), ('berant', 0.044), ('corley', 0.044), ('luisa', 0.044), ('ote', 0.044), ('shachar', 0.044), ('zanzotto', 0.044), ('entails', 0.044), ('tj', 0.043), ('wordnet', 0.043), ('hypothesis', 0.043), ('chain', 0.042), ('clark', 0.042), ('danilo', 0.041), ('hy', 0.041), ('connect', 0.041), ('principled', 0.04), ('transitive', 0.04), ('harrison', 0.039), ('hi', 0.038), ('idan', 0.038), ('yr', 0.038), ('match', 0.038), ('fellbaum', 0.037), ('termed', 0.036), ('trang', 0.036), ('independence', 0.036), ('hoa', 0.035), ('jia', 0.035), ('probability', 0.035), ('inference', 0.034), ('roy', 0.034), ('evidences', 0.034), ('bernardo', 0.034), ('dang', 0.033), ('cw', 0.033), ('ablation', 0.033), ('habash', 0.032), ('impacts', 0.032), ('pointing', 0.032), ('modeling', 0.032), ('baldwin', 0.031), ('td', 0.031), ('categorial', 0.03), ('maccartney', 0.03), ('suggesting', 0.03), ('yp', 0.03), ('aspects', 0.03), ('heuristics', 0.029), ('levels', 0.029), ('dempster', 0.029), ('eis', 0.029), ('estimating', 0.028), ('database', 0.028), ('prominent', 0.028), ('prior', 0.028), ('parameter', 0.028), ('estimation', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000011 22 acl-2011-A Probabilistic Modeling Framework for Lexical Entailment
Author: Eyal Shnarch ; Jacob Goldberger ; Ido Dagan
Abstract: Recognizing entailment at the lexical level is an important and commonly-addressed component in textual inference. Yet, this task has been mostly approached by simplified heuristic methods. This paper proposes an initial probabilistic modeling framework for lexical entailment, with suitable EM-based parameter estimation. Our model considers prominent entailment factors, including differences in lexical-resources reliability and the impacts of transitivity and multiple evidence. Evaluations show that the proposed model outperforms most prior systems while pointing at required future improvements.
2 0.40275866 144 acl-2011-Global Learning of Typed Entailment Rules
Author: Jonathan Berant ; Ido Dagan ; Jacob Goldberger
Abstract: Extensive knowledge bases of entailment rules between predicates are crucial for applied semantic inference. In this paper we propose an algorithm that utilizes transitivity constraints to learn a globally-optimal set of entailment rules for typed predicates. We model the task as a graph learning problem and suggest methods that scale the algorithm to larger graphs. We apply the algorithm over a large data set of extracted predicate instances, from which a resource of typed entailment rules has been recently released (Schoenmackers et al., 2010). Our results show that using global transitivity information substantially improves performance over this resource and several baselines, and that our scaling methods allow us to increase the scope of global learning of entailment-rule graphs.
3 0.31328323 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment
Author: Yashar Mehdad ; Matteo Negri ; Marcello Federico
Abstract: This paper explores the use of bilingual parallel corpora as a source of lexical knowledge for cross-lingual textual entailment. We claim that, in spite of the inherent difficulties of the task, phrase tables extracted from parallel data allow to capture both lexical relations between single words, and contextual information useful for inference. We experiment with a phrasal matching method in order to: i) build a system portable across languages, and ii) evaluate the contribution of lexical knowledge in isolation, without interaction with other inference mechanisms. Results achieved on an English-Spanish corpus obtained from the RTE3 dataset support our claim, with an overall accuracy above average scores reported by RTE participants on monolingual data. Finally, we show that using parallel corpora to extract paraphrase tables reveals their potential also in the monolingual setting, improving the results achieved with other sources of lexical knowledge.
4 0.22231057 315 acl-2011-Types of Common-Sense Knowledge Needed for Recognizing Textual Entailment
Author: Peter LoBue ; Alexander Yates
Abstract: Understanding language requires both linguistic knowledge and knowledge about how the world works, also known as common-sense knowledge. We attempt to characterize the kinds of common-sense knowledge most often involved in recognizing textual entailments. We identify 20 categories of common-sense knowledge that are prevalent in textual entailment, many of which have received scarce attention from researchers building collections of knowledge.
5 0.13023113 132 acl-2011-Extracting Paraphrases from Definition Sentences on the Web
Author: Chikara Hashimoto ; Kentaro Torisawa ; Stijn De Saeger ; Jun'ichi Kazama ; Sadao Kurohashi
Abstract: We propose an automatic method of extracting paraphrases from definition sentences, which are also automatically acquired from the Web. We observe that a huge number of concepts are defined in Web documents, and that the sentences that define the same concept tend to convey mostly the same information using different expressions and thus contain many paraphrases. We show that a large number of paraphrases can be automatically extracted with high precision by regarding the sentences that define the same concept as parallel corpora. Experimental results indicated that with our method it was possible to extract about 300,000 paraphrases from 6 × 10^8 Web documents with a precision rate of about 94%.
7 0.079581015 235 acl-2011-Optimal and Syntactically-Informed Decoding for Monolingual Phrase-Based Alignment
8 0.066577896 191 acl-2011-Knowledge Base Population: Successful Approaches and Challenges
9 0.065582052 268 acl-2011-Rule Markov Models for Fast Tree-to-String Translation
10 0.056979373 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing
11 0.05560796 174 acl-2011-Insights from Network Structure for Text Mining
12 0.055429868 231 acl-2011-Nonlinear Evidence Fusion and Propagation for Hyponymy Relation Mining
13 0.051303051 251 acl-2011-Probabilistic Document Modeling for Syntax Removal in Text Summarization
14 0.050683994 44 acl-2011-An exponential translation model for target language morphology
15 0.050080348 325 acl-2011-Unsupervised Word Alignment with Arbitrary Features
16 0.045949481 29 acl-2011-A Word-Class Approach to Labeling PSCFG Rules for Machine Translation
17 0.045103319 25 acl-2011-A Simple Measure to Assess Non-response
18 0.04508315 47 acl-2011-Automatic Assessment of Coverage Quality in Intelligence Reports
19 0.044984054 57 acl-2011-Bayesian Word Alignment for Statistical Machine Translation
20 0.043332458 171 acl-2011-Incremental Syntactic Language Models for Phrase-based Translation
topicId topicWeight
[(0, 0.148), (1, -0.012), (2, -0.046), (3, 0.029), (4, 0.015), (5, 0.02), (6, 0.047), (7, 0.001), (8, -0.063), (9, -0.214), (10, 0.009), (11, 0.093), (12, 0.01), (13, 0.122), (14, -0.032), (15, -0.097), (16, -0.004), (17, -0.064), (18, -0.051), (19, -0.02), (20, 0.151), (21, -0.021), (22, 0.005), (23, 0.077), (24, -0.096), (25, -0.107), (26, -0.203), (27, -0.065), (28, -0.001), (29, 0.359), (30, 0.047), (31, -0.001), (32, 0.094), (33, -0.242), (34, -0.092), (35, -0.07), (36, -0.054), (37, -0.037), (38, 0.065), (39, 0.143), (40, 0.078), (41, 0.061), (42, -0.005), (43, -0.043), (44, -0.143), (45, 0.168), (46, 0.054), (47, -0.035), (48, 0.035), (49, -0.008)]
simIndex simValue paperId paperTitle
same-paper 1 0.96011633 22 acl-2011-A Probabilistic Modeling Framework for Lexical Entailment
Author: Eyal Shnarch ; Jacob Goldberger ; Ido Dagan
We believe that the textual entailment field would benefit from more principled models for various entailment phenomena. Inspired by the earlier steps in the evolution of Statistical Machine Translation methods (such as the initial IBM models (Brown et al., 1993)), we formulate a concrete generative probabilistic modeling framework that captures the basic aspects of lexical entailment. Parameter estimation is addressed by an EM-based approach, which enables estimating the hidden lexical-level entailment parameters from entailment annotations which are available only at the sentence level. While heuristic methods are limited in their ability to wisely integrate indications for entailment, probabilistic methods have the advantage of being extendable and enabling the utilization of well-founded probabilistic methods such as the EM algorithm. We compared the performance of several model variations to previously published results on RTE data sets, as well as to our own implementation of typical lexical baselines. Results show that both the probabilistic model and our percentage-coverage baseline perform favorably relative to prior art. These results support the viability of the probabilistic framework while pointing at certain modeling aspects that need to be improved.

2 Probabilistic Model

Under the lexical entailment scope, our modeling goal is obtaining a probabilistic score for the likelihood that all of H's terms are entailed by T. To that end, we model prominent aspects of lexical entailment, which were mostly neglected by previous lexical methods: (1) distinguishing different reliability levels of lexical resources; (2) allowing transitive chains of rule applications and considering their length when estimating their validity; and (3) considering multiple entailments when entailing a term.

2 See ablation test reports in http://aclweb.org/aclwiki/index.php?title=RTE Knowledge Resources#Ablation Tests

Figure 1: The generative process of entailing terms of a hypothesis from a text. Edges represent entailment rules. There are three pieces of evidence for the entailment of hi: a rule from Resource1 and another one from Resource3, both suggesting that tj entails it, and a chain from t1 through an intermediate term t0.

2.1 Model Description

For T to entail H it is usually necessary, but not sufficient, that every term h ∈ H be entailed by at least one term t ∈ T (Glickman et al., 2006). Figure 1 describes the process of entailing hypothesis terms. The trivial case is when identical terms, possibly at the stem or lemma level, appear in T and H (a direct match, as tn and hm in Figure 1). Alternatively, we can establish entailment based on knowledge of entailing lexical-semantic relations, such as synonyms, hypernyms and morphological derivations, available in lexical resources (e.g. the rule inference → reasoning from WordNet). We denote by R(r) the resource which provided the rule r. Since entailment is a transitive relation, rules may compose transitive chains that connect a term t ∈ T to a term h ∈ H through intermediate terms. For instance, from the rules infer → inference and inference → reasoning we can deduce the rule infer → reasoning (where inference is the intermediate term, shown as t0 in Figure 1).
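To make this generative story concrete, the sketch below shows one possible way to represent entailment rules drawn from different resources and to compose them into transitive chains from text terms to a hypothesis term. The Rule class, the build_chains function, the cycle handling and the example resource names are illustrative assumptions, not details specified by the paper.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class Rule:
    lhs: str        # entailing term (left-hand side)
    rhs: str        # entailed term (right-hand side)
    resource: str   # e.g. "WordNet", "CatVar", or the special "MATCH"

def build_chains(rules: List[Rule], text_terms: List[str], h: str,
                 max_len: int = 4) -> List[List[Rule]]:
    """Enumerate transitive rule chains t -> ... -> h (at most max_len rules)
    from any text term t to the hypothesis term h."""
    by_lhs: Dict[str, List[Rule]] = {}
    for r in rules:
        by_lhs.setdefault(r.lhs, []).append(r)

    chains: List[List[Rule]] = []

    def expand(term: str, chain: List[Rule]) -> None:
        if term == h and chain:          # reached the hypothesis term
            chains.append(list(chain))
            return
        if len(chain) == max_len:
            return
        for r in by_lhs.get(term, []):
            # avoid revisiting a term already used earlier in this chain
            if any(r.rhs == used.lhs for used in chain):
                continue
            chain.append(r)
            expand(r.rhs, chain)
            chain.pop()

    for t in text_terms:
        if t == h:
            # identical terms: a direct match, modeled as a special resource
            chains.append([Rule(t, h, "MATCH")])
        else:
            expand(t, [])
    return chains
```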
Multiple chains may connect t to h (as for tj and hi in Figure 1) or connect several terms in T to h (as t1 and tj both indicate the entailment of hi in Figure 1), thus providing multiple evidence for h's entailment. It is reasonable to expect that if a term t indeed entails a term h, it is likely to find evidence for this relation in several resources. Taking a probabilistic perspective, we assume a parameter θR for each resource R, denoting its reliability, i.e. the prior probability that applying a rule from R corresponds to a valid entailment instance. Direct matches are considered as a special "resource", called MATCH, for which θMATCH is expected to be close to 1.

We now present our probabilistic model. For a text term t ∈ T to entail a hypothesis term h by a chain c, denoted by t →c h, the application of every rule r ∈ c must be valid. Note that a rule r in a chain c connects two terms (its left-hand-side and its right-hand-side, denoted lhs → rhs). The lhs of the first rule in c is t ∈ T and the rhs of its last rule is h ∈ H. We denote the event of a valid rule application by lhs →r rhs. Since a priori a rule r is valid with probability θR(r), and assuming independence of all r ∈ c, we obtain Eq. 1, which specifies the probability of the event t →c h. Next, let C(h) denote the set of chains which suggest the entailment of h. The probability that T does not entail h at all (by any chain), specified in Eq. 2, is the probability that all these chains are not valid. Finally, the probability that T entails all of H, assuming independence of H's terms, is the probability that every h ∈ H is entailed, as given in Eq. 3. Notice that there could be a term h which is not covered by any available rule chain. Under this formulation, we assume that each such h is covered by a single rule coming from a special "resource" called UNCOVERED (expecting θUNCOVERED to be relatively small).

\[ p(t \xrightarrow{c} h) = \prod_{r \in c} p(lhs \xrightarrow{r} rhs) = \prod_{r \in c} \theta_{R(r)} \qquad (1) \]

\[ p(T \nrightarrow h) = \prod_{c \in C(h)} \bigl[ 1 - p(t \xrightarrow{c} h) \bigr] \qquad (2) \]

\[ p(T \rightarrow H) = \prod_{h \in H} p(T \rightarrow h) \qquad (3) \]

As can be seen, our model indeed distinguishes varying resource reliability, decreases entailment probability as rule chains grow, and increases it when the entailment of a term is supported by multiple chains.
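A direct reading of Eqs. 1–3 in code: given, for each hypothesis term, the chains that reach it (each chain represented simply by the resources of its rules) and a reliability θR per resource, compute p(T → H). The function names, the dictionary-based representation and the example reliabilities below are assumptions for illustration only.

```python
from typing import Dict, List

def chain_prob(chain: List[str], theta: Dict[str, float]) -> float:
    """Eq. 1: p(t ->_c h) is the product of theta_R(r) over the chain's rules.
    A chain is represented here simply as the list of its rules' resources."""
    p = 1.0
    for resource in chain:
        p *= theta[resource]
    return p

def term_prob(chains: List[List[str]], theta: Dict[str, float]) -> float:
    """p(T -> h) = 1 - p(T -/-> h), with Eq. 2 giving p(T -/-> h).
    A term with no chain is treated as covered by the UNCOVERED resource."""
    if not chains:
        chains = [["UNCOVERED"]]
    p_no_chain_valid = 1.0
    for c in chains:
        p_no_chain_valid *= 1.0 - chain_prob(c, theta)
    return 1.0 - p_no_chain_valid

def sentence_prob(chains_per_term: List[List[List[str]]],
                  theta: Dict[str, float]) -> float:
    """Eq. 3: p(T -> H) is the product of p(T -> h) over the hypothesis terms."""
    p = 1.0
    for chains in chains_per_term:
        p *= term_prob(chains, theta)
    return p

# Illustrative usage with made-up reliabilities and chains:
theta = {"MATCH": 0.95, "WordNet": 0.8, "CatVar": 0.7, "UNCOVERED": 0.1}
chains_per_term = [
    [["MATCH"]],                            # h1: direct match
    [["WordNet", "WordNet"], ["CatVar"]],   # h2: a 2-rule chain and a 1-rule chain
    [],                                     # h3: uncovered
]
print(sentence_prob(chains_per_term, theta))
```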
The above treatment of uncovered terms in H, as captured in Eq. 3, assumes that their entailment probability is independent of the rest of the hypothesis. However, when the number of covered hypothesis terms increases, the probability that the remaining terms are actually entailed by T increases too (even though we do not have supporting knowledge for their entailment). Thus, an alternative model is to group all uncovered terms together and estimate the overall probability of their joint entailment as a function of the lexical coverage of the hypothesis. We denote Hc as the subset of H's terms which are covered by some rule chain and Huc as the remaining uncovered part. Eq. 3a then provides a refined entailment model for H, in which the second term specifies the probability that Huc is entailed given that Hc is validly entailed and the corresponding lengths:

\[ p(T \rightarrow H) = \Bigl[ \prod_{h \in H_c} p(T \rightarrow h) \Bigr] \cdot p(T \rightarrow H_{uc} \mid |H_c|, |H|) \qquad (3a) \]

2.2 Parameter Estimation

The difficulty in estimating the θR values is that these are term-level parameters, while the RTE training entailment annotation is given at the sentence level. Therefore, we use EM-based estimation for the hidden parameters (Dempster et al., 1977). In the E step we use the current θR values to compute all whcr(T, H) values for each training pair. whcr(T, H) stands for the posterior probability that the application of the rule r in the chain c for h ∈ H is valid, given that T either entails H or not, according to the training annotation (see Eq. 4). Remember that a rule r provides an entailment relation between its left-hand-side (lhs) and its right-hand-side (rhs). Therefore Eq. 4 uses the notation lhs →r rhs to designate the application of the rule r (similar to Eq. 1).

\[ E: \quad w_{hcr}(T,H) = \begin{cases} p(lhs \xrightarrow{r} rhs \mid T \rightarrow H) & \text{if } T \rightarrow H \\ p(lhs \xrightarrow{r} rhs \mid T \nrightarrow H) & \text{if } T \nrightarrow H \end{cases} \qquad (4) \]

After applying Bayes' rule we get a fraction with Eq. 3 in its denominator and θR(r) as the second term of the numerator. The first numerator term is defined as in Eq. 3, except that for the corresponding rule application we substitute θR(r) by 1 (per the conditioning event). The probabilistic model defined by Eqs. 1–3 is a loop-free directed acyclic graphical model (i.e., a Bayesian network). Hence the E-step probabilities can be efficiently calculated using the belief propagation algorithm (Pearl, 1988). The M step uses Eq. 5 to update the parameter set. For each resource R we average the whcr(T, H) values over all its rule applications in the training set, whose total number is denoted nR.

\[ M: \quad \theta_R = \frac{1}{n_R} \sum_{T,H} \; \sum_{h \in H} \; \sum_{c \in C(h)} \; \sum_{r \in c \,:\, R(r) = R} w_{hcr}(T,H) \qquad (5) \]

For Eq. 3a we also need to estimate p(T → Huc | |Hc|, |H|). This is estimated directly via maximum likelihood over the training set, by calculating the proportion of entailing examples within the set of all examples of a given hypothesis length (|H|) and a given number of covered terms (|Hc|). As |Hc| we take the number of identical terms in T and H (exact match), since in almost all cases terms in H which have an exact match in T are indeed entailed. We also tried initializing the EM algorithm with these direct estimations but did not obtain performance improvements.
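The sketch below runs one EM iteration for the θR parameters in the spirit of Eqs. 4–5. For simplicity it assumes that chains do not share rule applications, so the E-step posterior can be computed in closed form by re-evaluating Eq. 3 with the relevant rule's reliability clamped to 1, as described above; the paper instead uses belief propagation, which also covers the general case. All names, the data layout, and the assumption that all reliabilities lie strictly between 0 and 1 are illustrative choices, not the authors' implementation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# One training pair: (is_entailed, chains_per_term); chains_per_term[i] lists the
# chains of the i-th hypothesis term, each chain being the list of its rules' resources.
Pair = Tuple[bool, List[List[List[str]]]]

def term_prob(chains, theta, clamp=None):
    """p(T -> h); if clamp=(chain_idx, rule_idx), that rule's reliability is set to 1."""
    chains = chains or [["UNCOVERED"]]
    p_none = 1.0
    for ci, chain in enumerate(chains):
        p_chain = 1.0
        for ri, res in enumerate(chain):
            p_chain *= 1.0 if clamp == (ci, ri) else theta[res]
        p_none *= 1.0 - p_chain
    return 1.0 - p_none

def sentence_prob(chains_per_term, theta, clamp_term=None, clamp=None):
    """Eq. 3, optionally with one rule application clamped to be valid."""
    p = 1.0
    for ti, chains in enumerate(chains_per_term):
        p *= term_prob(chains, theta, clamp if ti == clamp_term else None)
    return p

def em_iteration(pairs: List[Pair], theta: Dict[str, float]) -> Dict[str, float]:
    """One EM iteration over the resource reliabilities; assumes 0 < theta < 1."""
    w_sum = defaultdict(float)
    n_rules = defaultdict(int)
    for entailed, cpt in pairs:
        p_TH = sentence_prob(cpt, theta)
        for ti, chains in enumerate(cpt):
            for ci, chain in enumerate(chains or [["UNCOVERED"]]):
                for ri, res in enumerate(chain):
                    # E step (Eq. 4): posterior that this rule application is valid,
                    # given the sentence-level annotation (via Bayes' rule).
                    p_TH_if_valid = sentence_prob(cpt, theta, ti, (ci, ri))
                    if entailed:
                        w = theta[res] * p_TH_if_valid / p_TH
                    else:
                        w = theta[res] * (1.0 - p_TH_if_valid) / (1.0 - p_TH)
                    w_sum[res] += w
                    n_rules[res] += 1
    # M step (Eq. 5): average the posteriors per resource.
    return {res: w_sum[res] / n_rules[res] for res in n_rules}
```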
3 Evaluations and Results

The 5th Recognizing Textual Entailment challenge (RTE-5) introduced a new search task (Bentivogli et al., 2009), which became the main task in RTE-6 (Bentivogli et al., 2010). In this task participants should find all sentences that entail a given hypothesis in a given document cluster. This task's data sets reflect a natural distribution of entailments in a corpus and demonstrate a more realistic scenario than the previous RTE challenges.

In our system, sentences are tokenized and stripped of stop words, and terms are lemmatized and tagged for part of speech. As lexical resources we use WordNet (WN) (Fellbaum, 1998), taking as entailment rules synonyms, derivations, hyponyms and meronyms of the first senses of T and H terms, and the CatVar (Categorial Variation) database (Habash and Dorr, 2003). We allow rule chains of length up to 4 in WordNet (WN4).

We compare our model to two types of baselines: (1) RTE published results: the average of the best runs of all systems, the best and second-best performing lexical systems, and the best full system of each challenge; (2) our implementation of a lexical coverage model, tuning the percentage-of-coverage threshold for entailment on the training set. This model uses the same configuration as our probabilistic model. We also implemented an Information Retrieval style baseline, utilizing the Lucene search engine (http://lucene.apache.org), both with and without lexical expansions, but given its poorer performance we omit its results here.

Table 1 presents the results. We can see that both our implemented models (probabilistic and coverage) outperform all RTE lexical baselines on both data sets, apart from (Majumdar and Bhattacharyya, 2010), which incorporates additional lexical resources, a named entity recognizer and a co-reference system. On RTE-5, the probabilistic model is comparable in performance to the best full system, while the coverage model achieves considerably better results. We notice that our implemented models successfully utilize resources to increase performance, as opposed to the typically smaller or less consistent improvements in prior works (see Section 1).

Table 1: Evaluation results (F1%) on RTE-5 and RTE-6 for the RTE baselines and for our coverage and probabilistic models under their resource configurations (no resource, +WN, +WN4, +CatVar, and their combinations). The compared RTE systems are: (1) (MacKinlay and Baldwin, 2009), (2) (Clark and Harrison, 2010), (3) (Mirkin et al., 2009) (2 submitted runs), (4) (Majumdar and Bhattacharyya, 2010) and (5) (Jia et al., 2010).

While the probabilistic and coverage models are comparable on RTE-6 (with a non-significant advantage for the former), on RTE-5 the latter performs better, suggesting that the probabilistic model needs to be further improved. In particular, WN4 performs better than the single-step WN only on RTE-5, suggesting the need to improve the modeling of chaining. The fluctuations over the data sets and the impacts of resources suggest the need for further investigation over additional data sets and resources. As for the coverage model, under our configuration it poses a bigger challenge for RTE systems than previously reported baselines. It is thus proposed as an easy-to-implement baseline for future entailment research.

4 Conclusions and Future Work

This paper presented, for the first time, a principled and relatively rich probabilistic model for lexical entailment, amenable to estimation of hidden lexical-level parameters from standard sentence-level annotations. The positive results of the probabilistic model compared to prior art, and its ability to exploit lexical resources, indicate its future potential. Yet, further investigation is needed. For example, analyzing the current model's limitations, we observed that the multiplicative nature of Eqs. 1 and 3 (reflecting independence assumptions) is too restrictive, resembling a logical AND. Accordingly, we plan to explore relaxing this strict conjunctive behavior through models such as noisy-AND (Pearl, 1988). We also intend to explore the contribution of our model, and particularly its estimated parameter values, within a complex system that integrates multiple levels of inference.
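The percentage-coverage baseline proposed above as an easy-to-implement reference point can be reproduced with very little code. The following sketch tunes a coverage threshold on training data; the representation of examples as (covered terms, hypothesis length, label) triples and the choice of F1 as the tuning criterion are illustrative assumptions rather than details taken from the paper.

```python
from typing import List, Tuple

def coverage(num_covered: int, hyp_len: int) -> float:
    """Fraction of hypothesis terms covered by a direct match or by some rule chain."""
    return num_covered / hyp_len if hyp_len else 0.0

def f1_score(preds: List[bool], golds: List[bool]) -> float:
    tp = sum(1 for p, g in zip(preds, golds) if p and g)
    prec = tp / sum(preds) if any(preds) else 0.0
    rec = tp / sum(golds) if any(golds) else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def tune_coverage_threshold(train: List[Tuple[int, int, bool]]) -> float:
    """train holds (num_covered, hyp_len, is_entailed) triples; return the coverage
    threshold that maximizes F1 on the training set."""
    golds = [e for _, _, e in train]
    scores = [coverage(c, n) for c, n, _ in train]
    best_t, best_f1 = 0.0, -1.0
    for t in sorted(set(scores)):
        preds = [s >= t for s in scores]
        f = f1_score(preds, golds)
        if f > best_f1:
            best_t, best_f1 = t, f
    return best_t

# At test time, a pair is classified as entailing iff its coverage >= best_t.
```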
2 0.89831412 144 acl-2011-Global Learning of Typed Entailment Rules
Author: Jonathan Berant ; Ido Dagan ; Jacob Goldberger
Abstract: Extensive knowledge bases of entailment rules between predicates are crucial for applied semantic inference. In this paper we propose an algorithm that utilizes transitivity constraints to learn a globally-optimal set of entailment rules for typed predicates. We model the task as a graph learning problem and suggest methods that scale the algorithm to larger graphs. We apply the algorithm over a large data set of extracted predicate instances, from which a resource of typed entailment rules has been recently released (Schoenmackers et al., 2010). Our results show that using global transitivity information substantially improves performance over this resource and several baselines, and that our scaling methods allow us to increase the scope of global learning of entailment-rule graphs.
3 0.78051662 315 acl-2011-Types of Common-Sense Knowledge Needed for Recognizing Textual Entailment
Author: Peter LoBue ; Alexander Yates
Abstract: Understanding language requires both linguistic knowledge and knowledge about how the world works, also known as common-sense knowledge. We attempt to characterize the kinds of common-sense knowledge most often involved in recognizing textual entailments. We identify 20 categories of common-sense knowledge that are prevalent in textual entailment, many of which have received scarce attention from researchers building collections of knowledge.
4 0.58572298 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment
Author: Yashar Mehdad ; Matteo Negri ; Marcello Federico
Abstract: This paper explores the use of bilingual parallel corpora as a source of lexical knowledge for cross-lingual textual entailment. We claim that, in spite of the inherent difficulties of the task, phrase tables extracted from parallel data allow to capture both lexical relations between single words, and contextual information useful for inference. We experiment with a phrasal matching method in order to: i) build a system portable across languages, and ii) evaluate the contribution of lexical knowledge in isolation, without interaction with other inference mechanisms. Results achieved on an English-Spanish corpus obtained from the RTE3 dataset support our claim, with an overall accuracy above average scores reported by RTE participants on monolingual data. Finally, we show that using parallel corpora to extract paraphrase tables reveals their potential also in the monolingual setting, improving the results achieved with other sources of lexical knowledge.
5 0.41843978 314 acl-2011-Typed Graph Models for Learning Latent Attributes from Names
Author: Delip Rao ; David Yarowsky
Abstract: This paper presents an original approach to semi-supervised learning of personal name ethnicity from typed graphs of morphophonemic features and first/last-name co-occurrence statistics. We frame this as a general solution to an inference problem over typed graphs where the edges represent labeled relations between features that are parameterized by the edge types. We propose a framework for parameter estimation on different constructions of typed graphs for this problem using a gradient-free optimization method based on grid search. Results on both in-domain and out-of-domain data show significant gains over 30% accuracy improvement using the techniques presented in the paper.
7 0.31643879 229 acl-2011-NULEX: An Open-License Broad Coverage Lexicon
8 0.31014413 291 acl-2011-SystemT: A Declarative Information Extraction System
9 0.29413486 222 acl-2011-Model-Portability Experiments for Textual Temporal Analysis
10 0.2902284 174 acl-2011-Insights from Network Structure for Text Mining
11 0.26968697 132 acl-2011-Extracting Paraphrases from Definition Sentences on the Web
12 0.26842257 231 acl-2011-Nonlinear Evidence Fusion and Propagation for Hyponymy Relation Mining
13 0.26667207 1 acl-2011-(11-06-spirl)
14 0.2573947 187 acl-2011-Jointly Learning to Extract and Compress
15 0.24134691 235 acl-2011-Optimal and Syntactically-Informed Decoding for Monolingual Phrase-Based Alignment
16 0.23644492 11 acl-2011-A Fast and Accurate Method for Approximate String Search
17 0.23472445 249 acl-2011-Predicting Relative Prominence in Noun-Noun Compounds
18 0.22639552 260 acl-2011-Recognizing Authority in Dialogue with an Integer Linear Programming Constrained Model
19 0.22027187 47 acl-2011-Automatic Assessment of Coverage Quality in Intelligence Reports
20 0.21758154 89 acl-2011-Creative Language Retrieval: A Robust Hybrid of Information Retrieval and Linguistic Creativity
topicId topicWeight
[(5, 0.028), (17, 0.051), (26, 0.017), (37, 0.062), (39, 0.044), (41, 0.041), (55, 0.051), (59, 0.045), (72, 0.016), (91, 0.027), (96, 0.127), (97, 0.016), (98, 0.399)]
simIndex simValue paperId paperTitle
same-paper 1 0.80586356 22 acl-2011-A Probabilistic Modeling Framework for Lexical Entailment
Author: Eyal Shnarch ; Jacob Goldberger ; Ido Dagan
2 0.71308428 312 acl-2011-Turn-Taking Cues in a Human Tutoring Corpus
Author: Heather Friedberg
Abstract: Most spoken dialogue systems are still lacking in their ability to accurately model the complex process that is human turntaking. This research analyzes a humanhuman tutoring corpus in order to identify prosodic turn-taking cues, with the hopes that they can be used by intelligent tutoring systems to predict student turn boundaries. Results show that while there was variation between subjects, three features were significant turn-yielding cues overall. In addition, a positive relationship between the number of cues present and the probability of a turn yield was demonstrated. 1
3 0.64288938 282 acl-2011-Shift-Reduce CCG Parsing
Author: Yue Zhang ; Stephen Clark
Abstract: CCGs are directly compatible with binary-branching bottom-up parsing algorithms, in particular CKY and shift-reduce algorithms. While the chart-based approach has been the dominant approach for CCG, the shift-reduce method has been little explored. In this paper, we develop a shift-reduce CCG parser using a discriminative model and beam search, and compare its strengths and weaknesses with the chart-based C&C parser. We study different errors made by the two parsers, and show that the shift-reduce parser gives competitive accuracies compared to C&C. Considering our use of a small beam, and given the high ambiguity levels in an automatically-extracted grammar and the amount of information in the CCG lexical categories which form the shift actions, this is a surprising result.
Author: Michael Mohler ; Razvan Bunescu ; Rada Mihalcea
Abstract: In this work we address the task of computerassisted assessment of short student answers. We combine several graph alignment features with lexical semantic similarity measures using machine learning techniques and show that the student answers can be more accurately graded than if the semantic measures were used in isolation. We also present a first attempt to align the dependency graphs of the student and the instructor answers in order to make use of a structural component in the automatic grading of student answers.
5 0.56705093 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models
Author: Ivan Vulic ; Wim De Smet ; Marie-Francine Moens
Abstract: A topic model outputs a set of multinomial distributions over words for each topic. In this paper, we investigate the value of bilingual topic models, i.e., a bilingual Latent Dirichlet Allocation model for finding translations of terms in comparable corpora without using any linguistic resources. Experiments on a document-aligned English-Italian Wikipedia corpus confirm that the developed methods which only use knowledge from word-topic distributions outperform methods based on similarity measures in the original word-document space. The best results, obtained by combining knowledge from wordtopic distributions with similarity measures in the original space, are also reported.
7 0.48988587 112 acl-2011-Efficient CCG Parsing: A* versus Adaptive Supertagging
8 0.471542 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment
9 0.45882317 33 acl-2011-An Affect-Enriched Dialogue Act Classification Model for Task-Oriented Dialogue
10 0.45388916 207 acl-2011-Learning to Win by Reading Manuals in a Monte-Carlo Framework
11 0.4501926 144 acl-2011-Global Learning of Typed Entailment Rules
12 0.42275146 128 acl-2011-Exploring Entity Relations for Named Entity Disambiguation
13 0.41988376 235 acl-2011-Optimal and Syntactically-Informed Decoding for Monolingual Phrase-Based Alignment
14 0.41281646 236 acl-2011-Optimistic Backtracking - A Backtracking Overlay for Deterministic Incremental Parsing
15 0.40689784 149 acl-2011-Hierarchical Reinforcement Learning and Hidden Markov Models for Task-Oriented Natural Language Generation
16 0.40365556 132 acl-2011-Extracting Paraphrases from Definition Sentences on the Web
17 0.40000781 309 acl-2011-Transition-based Dependency Parsing with Rich Non-local Features
18 0.39947763 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering
19 0.39914352 226 acl-2011-Multi-Modal Annotation of Quest Games in Second Life
20 0.39875913 170 acl-2011-In-domain Relation Discovery with Meta-constraints via Posterior Regularization