
72 acl-2012-Detecting Semantic Equivalence and Information Disparity in Cross-lingual Documents

Source: pdf

Author: Yashar Mehdad ; Matteo Negri ; Marcello Federico

Abstract: We address a core aspect of the multilingual content synchronization task: the identification of novel, more informative or semantically equivalent pieces of information in two documents about the same topic. This can be seen as an application-oriented variant of textual entailment recognition where: i) T and H are in different languages, and ii) entailment relations between T and H have to be checked in both directions. Using a combination of lexical, syntactic, and semantic features to train a cross-lingual textual entailment system, we report promising results on different datasets.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Detecting Semantic Equivalence and Information Disparity in Cross-lingual Documents Yashar Mehdad Matteo Negri Marcello Federico Fondazione Bruno Kessler, FBK-irst Trento, Italy {mehdad | negri | federico}@fbk.eu [sent-1, score-0.231]

2 Abstract We address a core aspect of the multilingual content synchronization task: the identification of novel, more informative or semantically equivalent pieces of information in two documents about the same topic. [sent-2, score-0.518]

3 This can be seen as an application-oriented variant of textual entailment recognition where: i) T and H are in different languages, and ii) entailment relations between T and H have to be checked in both directions. [sent-3, score-0.905]

4 Using a combination of lexical, syntactic, and semantic features to train a cross-lingual textual entailment system, we report promising results on different datasets. [sent-4, score-0.569]

5 1 Introduction Given two documents about the same topic written in different languages (e. [sent-5, score-0.028]

6 Wiki pages), content synchronization deals with the problem of automatically detecting and resolving differences in the information they provide, in order to produce aligned, mutually enriched versions. [sent-7, score-0.364]

7 A roadmap towards the solution of this problem has to take into account, among the many sub-tasks, the identification of information in one page that is semantically equivalent, novel, or more informative with respect to the content of the other page. [sent-8, score-0.195]

8 In this paper we set such problem as an application-oriented, crosslingual variant of the Textual Entailment (TE) recognition task (Dagan and Glickman, 2004). [sent-9, score-0.043]

9 Along this direction, we make two main contributions: (a) Experiments with multi-directional crosslingual textual entailment. [sent-10, score-0.138]

10 Instead, we experiment with the only corpus representative of the multilingual content synchronization scenario, and the richer inventory of phenomena arising from it (multi-directional entailment relations). [sent-13, score-0.737]

11 , 2010), or an “integrated solution” that exploits bilingual phrase tables to capture lexical relations and contextual information (Mehdad et al. [sent-16, score-0.309]

12 The promising results achieved with the integrated approach, however, still rely on phrasal matching techniques that disregard relevant semantic aspects of the problem. [sent-18, score-0.22]

13 By filling this gap integrating linguistically motivated features, we propose a novel approach that improves the state-of-the-art in CLTE. [sent-19, score-0.023]

14 2 CLTE-based content synchronization CLTE has been proposed by (Mehdad et al. [sent-20, score-0.331]

15 , 2010) as an extension of textual entailment which consists of deciding, given a text T and an hypothesis H in different languages, if the meaning of H can be inferred from the meaning of T. [sent-21, score-0.579]

16 The adoption of entailment-based techniques to address content synchronization looks promising, as several issues inherent to such task can be formalized as entailment-related problems. [Page footer: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Jeju, Republic of Korea, 8-14 July 2012.] [sent-22, score-0.399]

17 Given two pages (P1 and P2), these issues include identifying, and properly managing: (1) Text portions in P1 and P2 that express the same meaning (bi-directional entailment). [sent-25, score-0.185]

18 In such cases no information has to migrate across P1 and P2, and the two text portions will remain the same; (2) Text portions in P1 that are more informative than portions in P2 (forward entailment). [sent-26, score-0.428]

19 In such cases, the novel information from both sides has to be translated and migrated in order to mutually enrich the two pages; (5) Meaning discrepancies between text portions in the two pages (“contradictions” in RTE parlance). [sent-28, score-0.295]

20 CLTE has been previously modeled as a phrase matching problem that exploits dictionaries and phrase tables extracted from bilingual parallel corpora to determine the number of word sequences in H that can be mapped to word sequences in T. [sent-29, score-0.483]

21 In this way a semantic judgement about entailment is made exclusively on the basis of lexical evidence. [sent-30, score-0.494]

22 When only unidirectional entailment relations from T to H have to be determined (RTE-like setting), the full mapping of the hypothesis into the text usually provides enough evidence for a positive entailment judgement. [sent-31, score-0.88]

23 Unfortunately, when dealing with multidirectional entailment, the correlation between the proportion of matching terms and the correct entailment decisions is less strong. [sent-32, score-0.54]

24 In such framework, for instance, the full mapping of the hypothesis into the text is per se not sufficient to discriminate between forward entailment and semantic equivalence. [sent-33, score-0.517]

25 To cope with these issues, we explore the contribution of syntactic and semantic features as a complement to lexical ones in a supervised learning framework. [sent-34, score-0.197]

26 3 Beyond lexical CLTE In order to enrich the feature space beyond pure lexical match through phrase table entries, our model builds on two additional feature sets, derived from i) semantic phrase tables, and ii) dependency relations. [sent-35, score-0.435]

27 Semantic Phrase Table (SPT) matching represents a novel way to leverage the integration of semantics and MT-derived techniques. [sent-36, score-0.15]

28 SPT matching extends CLTE methods based on pure lexical match by means of “generalized” phrase tables annotated with shallow semantic labels. [sent-37, score-0.431]

29 wordn [LABEL]”, are used as a recall-oriented complement to the phrase tables used in MT. [sent-41, score-0.208]

30 A motivation for this augmentation is that semantic tags allow to match tokens that do not occur in the original bilingual parallel corpora used for phrase table extraction. [sent-42, score-0.302]

31 Our hypothesis is that the increase in recall obtained from relaxed matches through semantic tags in place of “out of vocabulary” terms (e. [sent-43, score-0.119]

32 Like lexical phrase tables, SPTs are extracted from parallel corpora. [sent-46, score-0.174]

33 As a first step we annotate the parallel corpora with named-entity taggers for the source and target languages, replacing named entities with general semantic labels chosen from a coarse-grained taxonomy (person, location, organization, date and numeric expression). [sent-47, score-0.156]

34 Then, we combine the sequences of unique labels into one single token of the same label, and we run Giza++ (Och and Ney, 2000) to align the resulting semantically augmented corpora. [sent-48, score-0.091]
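The two annotation steps described above (replacing named entities with coarse labels, then merging runs of the same label into one token) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `generalize`, the bracketed label format, and the tag spellings in `LABELS` are assumptions, and the tagger output is taken as given.

```python
# Coarse-grained taxonomy named in the text; the exact tag strings are assumed.
LABELS = {"PERSON", "LOCATION", "ORGANIZATION", "DATE", "NUMBER"}

def generalize(tokens, tags):
    """Replace named-entity tokens by their label and merge runs of one label.

    tokens: list of words; tags: one tag per word, as produced by some
    named-entity tagger (hypothetical input; "O" marks non-entities).
    """
    out = []
    prev_label = None
    for tok, tag in zip(tokens, tags):
        if tag in LABELS:
            # Emit the label only once per consecutive run of the same tag.
            if tag != prev_label:
                out.append(f"[{tag}]")
            prev_label = tag
        else:
            out.append(tok)
            prev_label = None
    return out
```

Applied to both sides of a parallel corpus, output of this kind is what would then be passed to the word aligner.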

35 Finally, we extract the semantic phrase table from the augmented aligned corpora using the Moses toolkit (Koehn et al. [sent-49, score-0.196]

36 For the matching phase, we first annotate T and H in the same way we labeled our parallel corpora. [sent-51, score-0.159]

37 Then, for each n-gram order (n=1 to 5) we use the SPT to calculate a matching score as the number of n-grams in H that match with phrases in T divided by the number of n-grams in H. [sent-52, score-0.168]
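The matching score defined above can be sketched in a few lines. The sketch assumes a simplified phrase table, represented as a dictionary from an H-side phrase to the set of T-side phrases it may translate to; real Moses phrase tables carry scores and are stored differently, and the names `ngrams` and `matching_scores` are hypothetical.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def matching_scores(t_tokens, h_tokens, phrase_table, max_n=5):
    """For each n in 1..max_n: matched n-grams in H / total n-grams in H.

    phrase_table (assumed format): dict mapping an H-language phrase
    (tuple of tokens) to a set of T-language phrases (tuples of tokens).
    """
    # Every contiguous phrase of T up to max_n tokens, for fast lookup.
    t_phrases = {g for n in range(1, max_n + 1) for g in ngrams(t_tokens, n)}
    scores = {}
    for n in range(1, max_n + 1):
        h_grams = ngrams(h_tokens, n)
        if not h_grams:
            scores[n] = 0.0  # H is shorter than n tokens
            continue
        # An H n-gram matches if one of its table entries occurs in T.
        matched = sum(1 for g in h_grams
                      if phrase_table.get(g, set()) & t_phrases)
        scores[n] = matched / len(h_grams)
    return scores
```

Swapping the roles of T and H gives the scores for the opposite entailment direction, in which case the denominator becomes the number of n-grams in T.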

38 1 Dependency Relation (DR) matching targets the increase of CLTE precision. [sent-53, score-0.105]

39 Adding syntactic constraints to the matching process, DR features aim to reduce the amount of wrong matches often occurring with bag-of-words methods (both at the lexical level and with recall-oriented SPTs). [sent-54, score-0.204]

40 For instance, the contradiction between “Yahoo acquired [Footnote 1: When checking for entailment from H to T, the normalization is carried out dividing by the number of n-grams in T.] [sent-55, score-0.463]

41 Overture” and “Overture compró Yahoo”, which is evident when syntax is taken into account, cannot be caught by shallow methods. [sent-56, score-0.065]

42 We define a dependency relation as a triple that connects pairs of words through a grammatical relation. [sent-57, score-0.082]

43 DR matching captures similarities between dependency relations, combining the syntactic and lexical level. [sent-58, score-0.209]

44 In a valid match, while the relation has to be the same, the connected words can be either the same, or semantically equivalent terms in the two languages (e. [sent-59, score-0.132]

45 Given the dependency tree representations of T and H, for each grammatical relation (r) we calculate a DR matching score as the number of matching occurrences of r in T and H, divided by the number of occurrences of r in H. [sent-62, score-0.316]

46 Separate DR matching scores are calculated for each relation r appearing both in T and H. [sent-63, score-0.159]
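A per-relation DR matching score of the kind described above can be sketched as follows. The triple layout `(head, relation, dependent)`, the relation names, the bilingual dictionary format, and the function names are all assumptions made for illustration; the parser output and the alignment dictionary are taken as given.

```python
from collections import defaultdict

def equivalent(w_t, w_h, dictionary):
    """Words match if identical or listed as translations (assumed format:
    dict from a T-language word to a set of H-language words)."""
    return w_t == w_h or w_h in dictionary.get(w_t, set())

def dr_scores(t_triples, h_triples, dictionary):
    """For each relation r in both T and H: matched r-triples in H / r-triples in H."""
    by_rel_t = defaultdict(list)
    for head, rel, dep in t_triples:
        by_rel_t[rel].append((head, dep))
    by_rel_h = defaultdict(list)
    for head, rel, dep in h_triples:
        by_rel_h[rel].append((head, dep))
    scores = {}
    # Scores are only defined for relations appearing on both sides.
    for rel in by_rel_t.keys() & by_rel_h.keys():
        h_pairs = by_rel_h[rel]
        matched = sum(
            1 for h_head, h_dep in h_pairs
            if any(equivalent(t_head, h_head, dictionary)
                   and equivalent(t_dep, h_dep, dictionary)
                   for t_head, t_dep in by_rel_t[rel]))
        scores[rel] = matched / len(h_pairs)
    return scores
```

With triples for “Yahoo acquired Overture” and “Overture compró Yahoo”, the subject and object relations both score 0 even though every word has a lexical match, which is the kind of contradiction bag-of-words matching misses.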

47 1 Content synchronization scenario In our first experiment we used the English-German portion of the CLTE corpus described in (Negri et al. [sent-65, score-0.307]

48 , 2011), consisting of 500 multi-directional entailment pairs which we equally divided into training and test sets. [sent-66, score-0.406]

49 Each pair in the dataset is annotated with “Bidirectional”, “Forward”, or “Backward” entailment judgements. [sent-67, score-0.409]

50 Although highly relevant for the content synchronization task, “Contradiction” and “Unknown” cases (i. [sent-68, score-0.331]

51 “NO” entailment in both directions) are not present in the annotation. [sent-70, score-0.382]

52 However, this is the only available dataset suitable to gather insights about the viability of our approach to multi-directional CLTE recognition. [sent-71, score-0.027]

53 2 We chose the ENG-GER portion of the dataset since for such language pair MT systems performance is often lower, making the adoption of simpler solutions based on pivoting more vulnerable. [sent-72, score-0.185]

54 To build the English-German phrase tables we combined the Europarl, News Commentary and “denews”3 parallel corpora. [sent-73, score-0.199]

55 After tokenization, Giza++ and Moses were respectively used to align the corpora and extract a lexical phrase table (PT). [sent-74, score-0.181]

56 Similarly, the semantic phrase table (SPT) has been [footnote 2:] Recently, a new dataset including “Unknown” pairs has been used in the “Cross-Lingual Textual Entailment for Content Synchronization” task at SemEval-2012 (Negri et al. [sent-75, score-0.161]

57 uk/pkoehn/ extracted from the same corpora annotated with the Stanford NE tagger (Faruqui and Padó, 2010; Finkel et al. [sent-81, score-0.039]

58 Dependency relations (DR) have been extracted running the Stanford parser (Rafferty and Manning, 2008; De Marneffe et al. [sent-83, score-0.046]

59 The dictionary created during the alignment of the parallel corpora provided the lexical knowledge to perform matches when the connected words are different, but semantically equivalent in the two languages. [sent-85, score-0.249]

60 Two-way classification casts multi-directional entailment as a unidirectional problem, where each pair is analyzed checking for entailment both from left to right and from right to left. [sent-88, score-0.87]

61 In this condition, each original test example is correctly classified if both pairs originated from it are correctly judged (“YES-YES” for bidirectional, “YES-NO” for forward, and “NO-YES” for backward entailment). [sent-89, score-0.043]

62 Two-way classification represents an intuitive solution to capture multidirectional entailment relations but, at the same time, a suboptimal approach in terms of efficiency since two checks are performed for each pair. [sent-90, score-0.481]

63 Three-way classification is more efficient, but at the same time more challenging due to the higher difficulty of multiclass learning, especially with small datasets. [sent-91, score-0.027]
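The way two unidirectional decisions are combined into one multi-directional judgement in the two-way setting can be sketched as below. The function names are hypothetical, and the handling of the “NO-NO” outcome is an assumption: the text states that such cases are absent from the annotation, so here they are mapped to “Unknown” and always count as errors on this dataset.

```python
def combine(forward_yes, backward_yes):
    """Map two binary entailment decisions (T->H, H->T) to one label."""
    if forward_yes and backward_yes:
        return "Bidirectional"   # YES-YES
    if forward_yes:
        return "Forward"         # YES-NO
    if backward_yes:
        return "Backward"        # NO-YES
    return "Unknown"             # NO-NO: assumed label, not in the gold data

def accuracy(decisions, gold):
    """An example is correct only if both unidirectional checks are right."""
    predicted = [combine(f, b) for f, b in decisions]
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```

A three-way classifier would instead predict one of the three labels directly, with a single check per pair.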

64 Results are compared with two pivoting approaches, checking for entailment between the original English texts and the translated German hypotheses. [sent-92, score-0.573]

65 The second (Pivot-PPT) exploits paraphrase tables for phrase matching, and represents the best monolingual model presented in (Mehdad et al. [sent-95, score-0.216]

66 6% accuracy achieved in the most challenging setting [footnote 4:] Using Google Translate. [sent-100, score-0.027]

67 5PPT Table 1: CLTE accuracy results over content synchronization and RTE3-derived datasets. [sent-118, score-0.331]

68 (3-way) demonstrates the effectiveness of our approach to capture meaning equivalence and information disparity in cross-lingual texts. [sent-119, score-0.181]

69 (b) In both settings the combination of lexical, syntactic and semantic features (PT+SPT+DR) significantly improves (see footnote 5) the state-of-the-art CLTE model (PT). [sent-120, score-0.085]

70 Such improvement is motivated by the joint contribution of SPTs (matching more and longer n-grams, with a consequent recall improvement), and DR matching (adding constraints, with a consequent gain in precision). [sent-121, score-0.197]

71 This might be due to the fact that both PT and DR features are precision-oriented, and their effectiveness becomes evident only in combination with recall-oriented features (SPT). [sent-123, score-0.064]

72 This suggests that the noise introduced by incorrect translations makes the pivoting approach less attractive in comparison with the more robust cross-lingual models. [sent-125, score-0.116]

73 2 RTE-like CLTE scenario Our second experiment aims at verifying the effectiveness of the improved model over RTE-derived CLTE data. [sent-127, score-0.063]

74 , 2011), calculated over an English-Spanish entailment corpus derived from the RTE-3 dataset (Negri and Mehdad, 2010). [sent-129, score-0.438]

75 In order to build the English-Spanish lexical phrase table (PT), we used the Europarl, News Commentary and United Nations parallel corpora. [sent-130, score-0.174]

76 The semantic phrase table (SPT) was extracted from the same corpora annotated with FreeLing (Carreras et al. [sent-131, score-0.173]

77 Dependency relations (DR) have been extracted parsing English texts and Spanish hypotheses with DepPattern (Gamallo and Gonzalez, 2011). [sent-133, score-0.046]

78 05, calculated using the approximate randomization test implemented in (Padó, 2006). [sent-135, score-0.029]
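For readers unfamiliar with approximate randomization, the following is a generic sketch of the paired test applied to the accuracy difference between two systems. It is not the cited implementation; the function name, the number of trials, and the input format (one 0/1 correctness flag per test pair and system) are assumptions.

```python
import random

def approximate_randomization(correct_a, correct_b, trials=10000, seed=0):
    """Estimate a p-value for the accuracy difference of two systems.

    correct_a, correct_b: equal-length lists of 0/1 flags, one per test
    example, saying whether each system judged that example correctly.
    """
    rng = random.Random(seed)
    # Observed difference in number of correct answers (integers, so the
    # comparison below is exact).
    observed = abs(sum(correct_a) - sum(correct_b))
    at_least = 0
    for _ in range(trials):
        sum_a = sum_b = 0
        for a, b in zip(correct_a, correct_b):
            # Under the null hypothesis the two outputs are exchangeable,
            # so swap them with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            sum_a += a
            sum_b += b
        if abs(sum_a - sum_b) >= observed:
            at_least += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (at_least + 1) / (trials + 1)
```

A difference would be reported as significant when the returned value falls below the chosen threshold.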

79 Accuracy results have been calculated over 800 test pairs of the CLTE corpus, after training the SVM binary classifier over the 800 development pairs. [sent-136, score-0.029]

80 Our new features have been compared with: i) the state-of-the-art CLTE model (PT), ii) the best monolingual model (Pivot-PPT) presented in (Mehdad et al. [sent-137, score-0.038]

81 , 2011), and iii) the average result achieved by participants in the monolingual English RTE-3 evaluation campaign (RTE-3 AVG). [sent-138, score-0.038]

82 As shown in Table 1, the combined feature set (PT+SPT+DR) significantly (see footnote 5) outperforms the lexical model (64. [sent-139, score-0.049]

83 5 Conclusion We addressed the identification of semantic equivalence and information disparity in two documents about the same topic, written in different languages. [sent-143, score-0.207]

84 This is a core aspect of the multilingual content synchronization task, which represents a challenging application scenario for a variety of NLP technologies, and a shared research framework for the integration of semantics and MT technology. [sent-144, score-0.445]

85 Casting the problem as a CLTE task, we extended previous lexical models with syntactic and semantic features. [sent-145, score-0.134]

86 A grammatical formalism based on patterns of part of speech tags. [sent-189, score-0.024]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('clte', 0.463), ('entailment', 0.382), ('mehdad', 0.302), ('spt', 0.266), ('synchronization', 0.266), ('dr', 0.238), ('negri', 0.231), ('pt', 0.145), ('portions', 0.122), ('pivoting', 0.116), ('spts', 0.106), ('matching', 0.105), ('textual', 0.095), ('disparity', 0.08), ('kouylekov', 0.079), ('tables', 0.074), ('phrase', 0.071), ('rte', 0.065), ('content', 0.065), ('pad', 0.063), ('complement', 0.063), ('semantic', 0.063), ('informative', 0.062), ('parallel', 0.054), ('faruqui', 0.053), ('freeling', 0.053), ('gamallo', 0.053), ('migrated', 0.053), ('multidirectional', 0.053), ('overture', 0.053), ('padr', 0.053), ('parlance', 0.053), ('lexical', 0.049), ('consequent', 0.046), ('rafferty', 0.046), ('semantically', 0.046), ('relations', 0.046), ('forward', 0.044), ('crosslingual', 0.043), ('backward', 0.043), ('adoption', 0.042), ('evident', 0.042), ('unidirectional', 0.042), ('equivalence', 0.042), ('checking', 0.041), ('scenario', 0.041), ('contradiction', 0.04), ('match', 0.039), ('german', 0.039), ('corpora', 0.039), ('monolingual', 0.038), ('meaning', 0.037), ('bilingual', 0.036), ('commentary', 0.036), ('translated', 0.034), ('bentivogli', 0.034), ('equivalent', 0.033), ('exploits', 0.033), ('mutually', 0.033), ('yahoo', 0.033), ('dependency', 0.033), ('moses', 0.032), ('pure', 0.03), ('ii', 0.03), ('carreras', 0.03), ('marneffe', 0.03), ('enrich', 0.03), ('dagan', 0.03), ('promising', 0.029), ('calculated', 0.029), ('mt', 0.029), ('hypothesis', 0.028), ('languages', 0.028), ('bidirectional', 0.028), ('matches', 0.028), ('dataset', 0.027), ('mechanical', 0.027), ('challenging', 0.027), ('issues', 0.026), ('finkel', 0.025), ('europarl', 0.025), ('relation', 0.025), ('multilingual', 0.024), ('divided', 0.024), ('grammatical', 0.024), ('unknown', 0.024), ('novel', 0.023), ('casts', 0.023), ('casting', 0.023), ('compr', 0.023), ('disregard', 0.023), ('fondazione', 0.023), ('submitting', 0.023), ('augmented', 0.023), ('identification', 0.022), ('align', 0.022), 
('syntactic', 0.022), ('federico', 0.022), ('integration', 0.022), ('effectiveness', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999958 72 acl-2012-Detecting Semantic Equivalence and Information Disparity in Cross-lingual Documents

Author: Yashar Mehdad ; Matteo Negri ; Marcello Federico

2 0.18001521 65 acl-2012-Crowdsourcing Inference-Rule Evaluation

Author: Naomi Zeichner ; Jonathan Berant ; Ido Dagan

Abstract: The importance of inference rules to semantic applications has long been recognized and extensive work has been carried out to automatically acquire inference-rule resources. However, evaluating such resources has turned out to be a non-trivial task, slowing progress in the field. In this paper, we suggest a framework for evaluating inference-rule resources. Our framework simplifies a previously proposed “instance-based evaluation” method that involved substantial annotator training, making it suitable for crowdsourcing. We show that our method produces a large amount of annotations with high inter-annotator agreement for a low cost at a short period of time, without requiring training expert annotators.

3 0.16651335 82 acl-2012-Entailment-based Text Exploration with Application to the Health-care Domain

Author: Meni Adler ; Jonathan Berant ; Ido Dagan

Abstract: We present a novel text exploration model, which extends the scope of state-of-the-art technologies by moving from standard concept-based exploration to statement-based exploration. The proposed scheme utilizes the textual entailment relation between statements as the basis of the exploration process. A user of our system can explore the result space of a query by drilling down/up from one statement to another, according to entailment relations specified by an entailment graph and an optional concept taxonomy. As a prominent use case, we apply our exploration system and illustrate its benefit on the health-care domain. To the best of our knowledge this is the first implementation of an exploration system at the statement level that is based on the textual entailment relation.

4 0.15687513 36 acl-2012-BIUTEE: A Modular Open-Source System for Recognizing Textual Entailment

Author: Asher Stern ; Ido Dagan

Abstract: This paper introduces BIUTEE, an open-source system for recognizing textual entailment. Its main advantages are its ability to utilize various types of knowledge resources, and its extensibility by which new knowledge resources and inference components can be easily integrated. These abilities make BIUTEE an appealing RTE system for two research communities: (1) researchers of end applications, that can benefit from generic textual inference, and (2) RTE researchers, who can integrate their novel algorithms and knowledge resources into our system, saving the time and effort of developing a complete RTE system from scratch. Notable assistance for these researchers is provided by a visual tracing tool, by which researchers can refine and “debug” their knowledge resources and inference components.

5 0.15490645 80 acl-2012-Efficient Tree-based Approximation for Entailment Graph Learning

Author: Jonathan Berant ; Ido Dagan ; Meni Adler ; Jacob Goldberger

Abstract: Learning entailment rules is fundamental in many semantic-inference applications and has been an active field of research in recent years. In this paper we address the problem of learning transitive graphs that describe entailment rules between predicates (termed entailment graphs). We first identify that entailment graphs exhibit a “tree-like” property and are very similar to a novel type of graph termed forest-reducible graph. We utilize this property to develop an iterative efficient approximation algorithm for learning the graph edges, where each iteration takes linear time. We compare our approximation algorithm to a recently-proposed state-of-the-art exact algorithm and show that it is more efficient and scalable both theoretically and empirically, while its output quality is close to that given by the optimal solution of the exact algorithm.

6 0.11582317 53 acl-2012-Combining Textual Entailment and Argumentation Theory for Supporting Online Debates Interactions

7 0.099260218 78 acl-2012-Efficient Search for Transformation-based Inference

8 0.089662924 184 acl-2012-String Re-writing Kernel

9 0.082127951 164 acl-2012-Private Access to Phrase Tables for Statistical Machine Translation

10 0.068513379 64 acl-2012-Crosslingual Induction of Semantic Roles

11 0.058412824 1 acl-2012-ACCURAT Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora

12 0.058375854 203 acl-2012-Translation Model Adaptation for Statistical Machine Translation with Monolingual Topic Information

13 0.057754923 141 acl-2012-Maximum Expected BLEU Training of Phrase and Lexicon Translation Models

14 0.056640524 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations

15 0.055124421 140 acl-2012-Machine Translation without Words through Substring Alignment

16 0.053894475 5 acl-2012-A Comparison of Chinese Parsers for Stanford Dependencies

17 0.052689515 194 acl-2012-Text Segmentation by Language Using Minimum Description Length

18 0.051870447 159 acl-2012-Pattern Learning for Relation Extraction with a Hierarchical Topic Model

19 0.051389761 152 acl-2012-Multilingual WSD with Just a Few Lines of Code: the BabelNet API

20 0.050874203 172 acl-2012-Selective Sharing for Multilingual Dependency Parsing

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.172), (1, 0.007), (2, -0.021), (3, 0.047), (4, 0.03), (5, 0.069), (6, -0.054), (7, 0.215), (8, -0.0), (9, 0.022), (10, -0.077), (11, 0.237), (12, 0.093), (13, -0.182), (14, 0.049), (15, -0.024), (16, 0.005), (17, -0.029), (18, 0.04), (19, 0.047), (20, -0.133), (21, -0.016), (22, 0.068), (23, -0.024), (24, 0.098), (25, -0.043), (26, -0.041), (27, 0.036), (28, 0.044), (29, -0.053), (30, -0.015), (31, -0.01), (32, 0.125), (33, 0.045), (34, -0.043), (35, -0.167), (36, -0.01), (37, 0.014), (38, 0.06), (39, 0.086), (40, 0.055), (41, 0.034), (42, 0.042), (43, -0.085), (44, -0.048), (45, -0.083), (46, 0.12), (47, 0.024), (48, 0.119), (49, -0.001)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92640567 72 acl-2012-Detecting Semantic Equivalence and Information Disparity in Cross-lingual Documents

Author: Yashar Mehdad ; Matteo Negri ; Marcello Federico

2 0.8776896 82 acl-2012-Entailment-based Text Exploration with Application to the Health-care Domain

Author: Meni Adler ; Jonathan Berant ; Ido Dagan

3 0.7411539 65 acl-2012-Crowdsourcing Inference-Rule Evaluation

Author: Naomi Zeichner ; Jonathan Berant ; Ido Dagan

4 0.71522081 80 acl-2012-Efficient Tree-based Approximation for Entailment Graph Learning

Author: Jonathan Berant ; Ido Dagan ; Meni Adler ; Jacob Goldberger

5 0.62100726 53 acl-2012-Combining Textual Entailment and Argumentation Theory for Supporting Online Debates Interactions

Author: Elena Cabrio ; Serena Villata

Abstract: Blogs and forums are widely adopted by online communities to debate about various issues. However, a user that wants to cut in on a debate may experience some difficulties in extracting the current accepted positions, and can be discouraged from interacting through these applications. In our paper, we combine textual entailment with argumentation theory to automatically extract the arguments from debates and to evaluate their acceptability.

6 0.5004344 36 acl-2012-BIUTEE: A Modular Open-Source System for Recognizing Textual Entailment

7 0.40246439 164 acl-2012-Private Access to Phrase Tables for Statistical Machine Translation

8 0.38924685 133 acl-2012-Learning to "Read Between the Lines" using Bayesian Logic Programs

9 0.37285972 184 acl-2012-String Re-writing Kernel

10 0.36451283 219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool

11 0.33846238 78 acl-2012-Efficient Search for Transformation-based Inference

12 0.3099525 77 acl-2012-Ecological Evaluation of Persuasive Messages Using Google AdWords

13 0.30950934 200 acl-2012-Toward Automatically Assembling Hittite-Language Cuneiform Tablet Fragments into Larger Texts

14 0.29784849 194 acl-2012-Text Segmentation by Language Using Minimum Description Length

15 0.28116363 156 acl-2012-Online Plagiarized Detection Through Exploiting Lexical, Syntax, and Semantic Information

16 0.2680949 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence

17 0.26645955 75 acl-2012-Discriminative Strategies to Integrate Multiword Expression Recognition and Parsing

18 0.26428166 163 acl-2012-Prediction of Learning Curves in Machine Translation

19 0.25310537 11 acl-2012-A Feature-Rich Constituent Context Model for Grammar Induction

20 0.2452939 215 acl-2012-WizIE: A Best Practices Guided Development Environment for Information Extraction

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(25, 0.036), (26, 0.041), (28, 0.051), (30, 0.045), (37, 0.046), (39, 0.043), (49, 0.011), (57, 0.015), (74, 0.04), (82, 0.017), (84, 0.021), (85, 0.068), (86, 0.254), (90, 0.089), (92, 0.046), (94, 0.028), (99, 0.07)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.78384328 32 acl-2012-Automated Essay Scoring Based on Finite State Transducer: towards ASR Transcription of Oral English Speech

Author: Xingyuan Peng ; Dengfeng Ke ; Bo Xu

Abstract: Conventional Automated Essay Scoring (AES) measures may cause severe problems when directly applied in scoring Automatic Speech Recognition (ASR) transcription as they are error sensitive and unsuitable for the characteristic of ASR transcription. Therefore, we introduce a framework of Finite State Transducer (FST) to avoid the shortcomings. Compared with the Latent Semantic Analysis with Support Vector Regression (LSA-SVR) method (stands for the conventional measures), our FST method shows better performance especially towards the ASR transcription. In addition, we apply the synonyms similarity to expand the FST model. The final scoring performance reaches an acceptable level of 0.80 which is only 0.07 lower than the correlation (0.87) between human raters.

2 0.76218343 85 acl-2012-Event Linking: Grounding Event Reference in a News Archive

Author: Joel Nothman ; Matthew Honnibal ; Ben Hachey ; James R. Curran

Abstract: Interpreting news requires identifying its constituent events. Events are complex linguistically and ontologically, so disambiguating their reference is challenging. We introduce event linking, which canonically labels an event reference with the article where it was first reported. This implicitly relaxes coreference to co-reporting, and will practically enable augmenting news archives with semantic hyperlinks. We annotate and analyse a corpus of 150 documents, extracting 501 links to a news archive with reasonable inter-annotator agreement.

same-paper 3 0.7406497 72 acl-2012-Detecting Semantic Equivalence and Information Disparity in Cross-lingual Documents

Author: Yashar Mehdad ; Matteo Negri ; Marcello Federico

4 0.56158787 150 acl-2012-Multilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia

Author: Sungchul Kim ; Kristina Toutanova ; Hwanjo Yu

Abstract: In this paper we propose a method to automatically label multi-lingual data with named entity tags. We build on prior work utilizing Wikipedia metadata and show how to effectively combine the weak annotations stemming from Wikipedia metadata with information obtained through English-foreign language parallel Wikipedia sentences. The combination is achieved using a novel semi-CRF model for foreign sentence tagging in the context of a parallel English sentence. The model outperforms both standard annotation projection methods and methods based solely on Wikipedia metadata.

5 0.5241251 206 acl-2012-UWN: A Large Multilingual Lexical Knowledge Base

Author: Gerard de Melo ; Gerhard Weikum

Abstract: We present UWN, a large multilingual lexical knowledge base that describes the meanings and relationships of words in over 200 languages. This paper explains how link prediction, information integration and taxonomy induction methods have been used to build UWN based on WordNet and extend it with millions of named entities from Wikipedia. We additionally introduce extensions to cover lexical relationships, frame-semantic knowledge, and language data. An online interface provides human access to the data, while a software API enables applications to look up over 16 million words and names.

6 0.51621604 29 acl-2012-Assessing the Effect of Inconsistent Assessors on Summarization Evaluation

7 0.51461107 214 acl-2012-Verb Classification using Distributional Similarity in Syntactic and Semantic Structures

8 0.51367092 152 acl-2012-Multilingual WSD with Just a Few Lines of Code: the BabelNet API

9 0.51118618 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence

10 0.50889039 80 acl-2012-Efficient Tree-based Approximation for Entailment Graph Learning

11 0.50684738 136 acl-2012-Learning to Translate with Multiple Objectives

12 0.50536156 52 acl-2012-Combining Coherence Models and Machine Translation Evaluation Metrics for Summarization Evaluation

13 0.5036518 219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool

14 0.50343299 175 acl-2012-Semi-supervised Dependency Parsing using Lexical Affinities

15 0.50267237 83 acl-2012-Error Mining on Dependency Trees

16 0.50052804 191 acl-2012-Temporally Anchored Relation Extraction

17 0.49975005 148 acl-2012-Modified Distortion Matrices for Phrase-Based Statistical Machine Translation

18 0.49762607 130 acl-2012-Learning Syntactic Verb Frames using Graphical Models

19 0.49733043 36 acl-2012-BIUTEE: A Modular Open-Source System for Recognizing Textual Entailment

20 0.49665922 156 acl-2012-Online Plagiarized Detection Through Exploiting Lexical, Syntax, and Semantic Information