acl acl2012 acl2012-63 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Lea Frermann ; Francis Bond
Abstract: We present a system for cross-lingual parse disambiguation, exploiting the assumption that the meaning of a sentence remains unchanged during translation and the fact that different languages have different ambiguities. We simultaneously reduce ambiguity in multiple languages in a fully automatic way. Evaluation shows that the system reliably discards dispreferred parses from the raw parser output, which results in a pre-selection that can speed up manual treebanking.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We present a system for cross-lingual parse disambiguation, exploiting the assumption that the meaning of a sentence remains unchanged during translation and the fact that different languages have different ambiguities. [sent-3, score-0.568]
2 We simultaneously reduce ambiguity in multiple languages in a fully automatic way. [sent-4, score-0.303]
3 Evaluation shows that the system reliably discards dispreferred parses from the raw parser output, which results in a pre-selection that can speed up manual treebanking. [sent-5, score-0.391]
4 The manual construction of treebanks, where a human annotator selects a gold parse from all parses returned by a parser, is a tedious and error-prone process. [sent-7, score-0.631]
5 We present a system for simultaneous and accurate partial parse disambiguation of multiple languages. [sent-8, score-0.553]
6 Using the pre-selected set of parses returned by the system, the treebanking process for multiple languages can be sped up. [sent-9, score-0.547]
7 The languages of the parallel corpus are considered as mutual semantic tags: As the meaning of a sentence stays constant during translation, we are able to resolve ambiguities which exist in only one of the languages by only accepting those interpretations which are licensed by the other language. [sent-11, score-0.523]
8 In particular, we select one language as the target language, translate the other language’s semantics for every parse into the target language and thus align maximally similar semantic representations. [sent-12, score-0.488]
9 The parses with the most overlapping semantics are selected as preferred parses. [sent-14, score-0.315]
10 (3) 彼ら は 5時 に 店 を 閉め た
kare-ra wa 5-ji ni mise wo shime-ta
he PL TOP 5 hour at shop ACC close PAST
“At 5 o’clock, they closed the shop.” [sent-16, score-0.462]
11 close(they, shop); at(close, 5)
(4) “At 5 o’clock, as for them, someone closed the shop.” [sent-17, score-0.688]
12 close(φ, shop); at(close, 5); topic(they, close)
We show the semantic representation of the ambiguity with each sentence. [sent-18, score-1.268]
13 Both languages are disambiguated by the other language as only the English interpretation (1) is supported in Japanese, and only the Japanese interpretation (3) leads to a grammatical English sentence. [sent-19, score-0.365]
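As an illustration, the selection of maximally overlapping semantics could be sketched as follows; the set-of-tuples EDG encoding and all names here are our own simplification, not the authors' implementation:

```python
# Minimal sketch of overlap-based parse selection; EDGs are modelled
# as sets of elementary-predicate tuples (our simplification).

def overlap(edg_a, edg_b):
    """Number of elementary-predicate tuples shared by two EDGs."""
    return len(edg_a & edg_b)

def select_preferred(target_edgs, translated_edgs):
    """Keep the target-language parses whose EDG overlaps most with
    any EDG translated from the other language."""
    best = max(overlap(t, s) for t in target_edgs for s in translated_edgs)
    return [t for t in target_edgs
            if any(overlap(t, s) == best for s in translated_edgs)]

# Hypothetical EDGs for the two readings of the running example:
p1 = frozenset({("close", "they", "shop"), ("at", "close", "5")})
p2 = frozenset({("close", "shop"), ("at", "close", "5"),
                ("topic", "they", "close")})
ja = frozenset({("close", "they", "shop"), ("at", "close", "5")})
print(select_preferred([p1, p2], [ja]))  # only p1 survives
```

Here the translated Japanese EDG matches reading (3) exactly, so the causative reading is kept and the topicalized one is discarded.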
14 2 Related Work There is no group using exactly the same approach as ours: automated parallel parse disambiguation on the basis of semantic analyses. [sent-20, score-0.608]
15 ¹In fact it has four, as they can be either plural or the androgynous singular; this is also disambiguated by the Japanese. [sent-21, score-0.175]
© 2012 Association for Computational Linguistics, pages 125–129.
16 Zhechev and Way (2008) automatically generate parallel treebanks for training of statistical machine translation (SMT) systems through sub-tree alignment. [sent-24, score-0.240]
17 We do not aim to carry out the complete treebanking process, but to optimize speed and precision of manual creation of high-quality treebanks. [sent-25, score-0.302]
18 Wu (1997) and others have tried to simultaneously learn grammars from bilingual texts. [sent-26, score-0.144]
19 Burkett and Klein (2008) induce node-alignments of syntactic trees with a log-linear model, in order to guide bilingual parsing. [sent-27, score-0.041]
20 (2011) translate an existing treebank using an SMT system and then project parse results from the treebank to the other language. [sent-29, score-0.371]
21 These approaches align at the syntactic level (using CFGs and dependencies respectively). [sent-31, score-0.096]
22 In contrast to the above approaches, we assume the existence of grammars and use a semantic representation as the appropriate level for cross-lingual processing. [sent-32, score-0.129]
23 We compare semantic sub-structures, as those are more straightforwardly comparable across different languages. [sent-33, score-0.112]
24 As a consequence, our system is applicable to any combination of languages. [sent-34, score-0.042]
25 The input is plain parallel text, neither side needs to be treebanked. [sent-35, score-0.086]
26 3 Materials and Methods We use grammars within the grammatical framework of head-driven phrase structure grammar (HPSG; Pollard and Sag, 1994), with the semantic representation of minimal recursion semantics (MRS; Copestake et al., 2005). [sent-36, score-0.286]
27 We use two large-scale HPSG grammars and a Japanese-English machine translation system, all of which were developed in the DELPH-IN framework:2 The English Resource Grammar (ERG; Flickinger (2000)) is used for English parsing, and Jacy (Bender and Siegel, 2004) for parsing Japanese. [sent-38, score-0.205]
28 For Japanese to English translation we use Jaen, a semantic-transfer-based machine translation system (Bond et al. [sent-39, score-0.230]
29 1 Semantic Interface and Alignment For the alignment, we convert the MRS structures into simplified elementary dependency graphs (EDGs), which abstract away information about grammatical properties of relations and scopal information. [sent-45, score-0.161]
²http://www.delph-in.net/
Figure 1: EDG for “They closed the shop at five.” [sent-44, score-0.429]
32 Preliminary experiments showed that the former kind of information did not contribute to disambiguation performance, as number is typically underspecified in Japanese. [sent-46, score-0.215]
33 As we only consider local information in the alignment, scopal information can be ignored as well. [sent-47, score-0.077]
34 An EDG consists of a bag of elementary predicates (EPs) which are themselves composed of relations. [sent-49, score-0.091]
35 Relations are the elementary building blocks of the EDG, and loosely correspond to words of the surface string. [sent-51, score-0.091]
36 EPs consist either of atomic relations (corresponding to quantifiers), or a predicateargument structure which is composed of several relations. [sent-52, score-0.084]
37 During alignment, we only consider nonatomic EPs, as quantifiers should be considered as grammatical properties of (lexical) relations, which we chose to ignore. [sent-53, score-0.114]
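The EP structure described above might be encoded along these lines (a hypothetical sketch; the class and predicate names are ours, loosely modelled on MRS-style predicate naming, not DELPH-IN's actual data structures):

```python
# Illustrative encoding of an EDG as a bag of elementary predicates
# (EPs); an EP with no arguments stands in for an atomic quantifier.
from dataclasses import dataclass

@dataclass(frozen=True)
class EP:
    predicate: str        # e.g. "_close_v_cause" (hypothetical name)
    args: tuple = ()      # argument relations; empty => atomic (quantifier)

    @property
    def atomic(self) -> bool:
        return not self.args

def alignable(edg):
    """Drop atomic EPs (quantifiers), which the alignment ignores."""
    return [ep for ep in edg if not ep.atomic]

edg = [EP("_the_q"),
       EP("_close_v_cause", ("they", "shop")),
       EP("_at_p_temp", ("close", "5"))]
print([ep.predicate for ep in alignable(edg)])
```

Filtering out the atomic quantifier EPs mirrors the paper's decision to treat quantifiers as grammatical properties and ignore them during alignment.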
38 Given the EDG representations of the translated Japanese sentence, and the original target language EDGs, we can straightforwardly align by matching substructures of different granularity. [sent-54, score-0.194]
39 We are experimenting with aligning further dependency-relation-based tuples, which would allow us to resolve more structural ambiguities. [sent-56, score-0.097]
40 2 The Disambiguation System Ambiguity in the analyses for both languages is reduced on the basis of the semantic analyses returned for each sentence-pair, and a reduced set of preferred analyses is returned for both languages. [sent-58, score-1.264]
41 We compare semantic representations of the same language: the English text from the bilingual corpus and the English machine translation of the Japanese text. [sent-60, score-0.198]
42 In order to increase the robustness of our alignment system, we not only consider complete translations but also accept partially translated MRSs when no complete translation could be produced. [sent-61, score-0.326]
43 This step significantly increases the recall, while the partial MRSs proved to be informative enough for parse disambiguation. [sent-62, score-0.332]
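The backoff to partial translations can be sketched as a simple preference rule (the function name and list encoding are our own assumptions):

```python
# Hedged sketch of the backoff: prefer complete translations, but fall
# back to partial MRSs so the sentence pair is not lost entirely.
def translations_for_alignment(complete_mrss, partial_mrss):
    """Return the translations to align against: complete ones if any
    exist, otherwise whatever partial translations were produced."""
    return complete_mrss if complete_mrss else partial_mrss

print(translations_for_alignment([], ["partial-1"]))  # backoff kicks in
```

This is the mechanism behind the recall gain: a pair only drops out when neither a complete nor a partial translation exists.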
44 4 Evaluation and Results We evaluate our model on the task of parse disambiguation. [sent-63, score-0.243]
45 We use full sentence match as evaluation metric, a challenging target. [sent-64, score-0.046]
46 It is an open corpus of Japanese–English sentence pairs. [sent-66, score-0.046]
47 We use version (2008-11), which contains 147,190 sentence pairs. [sent-67, score-0.046]
48 We hold out 4,500 sentence pairs each for development and test. [sent-68, score-0.046]
49 For each sentence, we compare the number of theoretically possible alignments with the number of preferred alignments returned by our system. [sent-69, score-0.219]
50 87 parses out of (at most) 11 analyses remain in the partially disambiguated list: both languages benefit equally from the disambiguation. [sent-73, score-0.654]
51 We evaluate disambiguation accuracy by counting the number of times the gold parse was present in the partially disambiguated set (full sentence match). [sent-74, score-0.777]
52 The correct parse is included in the reduced set in 80% of the cases for Japanese, and in 82% of the cases for English. [sent-76, score-0.393]
53 We match atomic relations when aligning the semantic structures, which is a very generic method applicable to the vast majority of sentence pairs. [sent-77, score-0.245]
54 This leads to a recall score of .837.³ [sent-78, score-0.085]
55 ³These are ranked with a model trained on a hand-treebanked set. The cutoff was determined empirically: for both languages the gold parse is included in the top 11 parses in more than 97% of the cases. [sent-79, score-0.578]
56 Table 1: Accuracy and F-scores for disambiguation performance of our system. [sent-92, score-0.179]
57 ’Included’ : whether the gold parse is included in the reduced set of parses or not. [sent-94, score-0.578]
58 ’First Rank’ : whether the preferred parse is ranked top in the reduced list. [sent-95, score-0.595]
59 ’MRR’ : mean reciprocal rank of the gold parse in the list. [sent-96, score-0.34]
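The MRR metric can be computed as below; we assume, as is standard, that a gold parse absent from the reduced list contributes zero:

```python
# Sketch of the mean-reciprocal-rank (MRR) computation over the
# reduced, ranked parse lists; missing gold parses contribute 0.
def mrr(ranked_lists, golds):
    total = 0.0
    for parses, gold in zip(ranked_lists, golds):
        if gold in parses:
            total += 1.0 / (parses.index(gold) + 1)
    return total / len(ranked_lists)

print(mrr([["a", "b"], ["x", "y", "z"]], ["a", "z"]))  # (1 + 1/3) / 2
```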
60 The reduced list of parser analyses can be further ranked by the parse ranking model which is included in the parsers of the respective languages (the same models with which we determined the top 11 analyses). [sent-100, score-0.978]
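For illustration, an empirical n-best cutoff of the kind used here (the top 11) could be determined as the smallest k whose top-k lists cover the gold parse often enough; the ranks below are toy data, not the authors':

```python
# How an n-best cutoff could be determined empirically: the smallest k
# whose top-k lists contain the gold parse in at least 97% of cases.
def smallest_cutoff(gold_ranks, coverage=0.97):
    for k in range(1, max(gold_ranks) + 1):
        if sum(r <= k for r in gold_ranks) / len(gold_ranks) >= coverage:
            return k
    return max(gold_ranks)

ranks = [1, 1, 2, 1, 3, 1, 1, 2, 1, 5]   # toy gold-parse ranks
print(smallest_cutoff(ranks))
```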
61 Given this ranking, we can evaluate how often the preferred parse is ranked top in our partially disambiguated list; results are shown in the two bottom lines of Table 1. [sent-101, score-0.687]
62 A ranked list of possible preferred parses whose top rank corresponds, with high probability, to the gold parse should further speed up the manual treebanking process. [sent-102, score-1.006]
63 Performance in the context of the whole pipeline: The performance of the parsers and the MT system strongly influences the end-to-end results of the presented system. [sent-103, score-0.094]
64 We lose around 29% of our data because no parse could be produced in one or both languages, or no translation could be produced. [sent-105, score-0.398]
65 A further 5% of the sentences did not have the gold parse in the original set of analyses (before alignment): our system could not possibly select the correct parse in those cases. [sent-106, score-0.792]
66 5 Discussion Our system builds on the output of two parsers and a machine translation system. [sent-107, score-0.188]
67 We reduce ambiguity for all sentence pairs where a parse could be created for both languages, and for which there was at least a partial translation. [sent-108, score-0.501]
68 For these sentences, the cross-lingual alignment component achieves a recall of above 99%, such that we do not lose any additional data. [sent-109, score-0.199]
69 The parsers and the MT system include parse ranking models trained on human gold annotations. [sent-110, score-0.541]
70 We use these models in parsing and translation to select the top 11 analyses. [sent-111, score-0.19]
71 Our system thus depends on a range of existing technologies. [sent-112, score-0.042]
72 The effectiveness of cross-lingual parse disambiguation on the basis of semantic alignment highly depends on the languages of choice. [sent-114, score-0.769]
73 Given that we exploit the differences between languages, pairs of less related languages should lead to better disambiguation performance. [sent-115, score-0.143]
74 Furthermore, disambiguating with more than two languages should improve performance.⁴ [sent-116, score-0.143]
75 One weakness when considering the disambiguated sentences as training data for a parse ranking model is that the translation fails on similar kinds of sentences, so there are some phenomena of which we get no examples: the automatically produced treebank does not have a uniform coverage of phenomena. [sent-118, score-0.66]
76 Our models may not discriminate some phenomena at all. [sent-119, score-0.04]
77 Our system provides large amounts of automatically annotated data at the sole cost of CPU time: so far we have disambiguated 25,000 sentences, ten times more than the existing hand-annotated gold data. [sent-120, score-0.314]
78 Using the parser output for speeding up manual treebanking is most effective if the gold parse is reliably included in the reduced set of parses. [sent-121, score-0.841]
79 Increasing the chance that the gold parse is included, by accepting more than only the most overlapping parses, may lead to more effective manual treebanking. [sent-122, score-0.263]
80 The alignment method we propose does not make any language-specific assumptions, nor is it limited to aligning only two languages. [sent-123, score-0.343]
81 6 Conclusion and Future Work Translating a sentence into a different language changes its surface form, but not its meaning. [sent-125, score-0.046]
82 ⁴For example, the PP attachment ambiguity in John said that he went on Tuesday, where either the saying or the going could have happened on Tuesday, holds in both English and Japanese. [sent-126, score-0.191]
83 In parallel corpora, one language can be viewed as a semantic tag of the other language and vice versa, which allows for disambiguation of phenomena which are ambiguous in only one of the languages. [sent-127, score-0.368]
84 We use the above observations for cross-lingual parse disambiguation. [sent-128, score-0.243]
85 We experimented with the language pair of English and Japanese, and were able to accurately reduce ambiguity in parser analyses simultaneously for both languages to 30% of the starting ambiguity. [sent-129, score-0.526]
86 The remaining parses can be used as a pre-selection to speed up the manual treebanking process. [sent-130, score-0.434]
87 We started working on an extrinsic evaluation of the presented system by training a discriminative parse ranking model on the output of our alignment process. [sent-131, score-0.494]
88 Our next step will be to evaluate the system as part of the treebanking process, and to optimize parameters such as disambiguation precision vs. recall. [sent-133, score-0.404]
89 As no language-specific assumptions are hard coded in our disambiguation system, it would be very interesting to apply the system to different language pairs as well as groups of more than two languages. [sent-135, score-0.221]
90 Using a group of languages for disambiguation will likely lead to increased and more accurate disambiguation, as more constraints are imposed on the data. [sent-136, score-0.322]
91 Probably the most important goal for future work is improving the recall achieved in the complete disambiguation pipeline. [sent-137, score-0.213]
92 Many sentence-pairs cannot be disambiguated because either no parse can be generated for one or both languages, or no (partial) translation can be produced. [sent-138, score-0.512]
93 Following the idea of partial translations, partial parses may be a valid backoff. [sent-139, score-0.31]
94 For purposes of cross-lingual alignment, partial structures may contribute enough information for disambiguation. [sent-140, score-0.125]
95 There has been work regarding partial parsing in the HPSG community (Zhang and Kordoni, 2008), which we would like to explore. [sent-141, score-0.134]
96 There is also current work on learning more types and instances of transfer rules (Haugereid and Bond, 2011). [sent-142, score-0.056]
97 Finally, we would like to investigate more alignment methods, such as dependency relation based alignment which we started experimenting with, or EDM-based metrics as presented in (Dridan and Oepen, 2011). [sent-143, score-0.293]
98 Two languages are better than one (for syntactic parsing). [sent-157, score-0.143]
99 Extracting transfer rules for multiword expressions from parallel corpora. [sent-180, score-0.193]
100 Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. [sent-193, score-0.238]
wordName wordTfidf (topN-words)
[('shop', 0.305), ('parse', 0.243), ('edg', 0.23), ('japanese', 0.197), ('mrss', 0.191), ('clo', 0.183), ('treebanking', 0.183), ('disambiguation', 0.179), ('disambiguated', 0.175), ('analyses', 0.167), ('languages', 0.143), ('bond', 0.134), ('parses', 0.132), ('preferred', 0.13), ('closed', 0.124), ('ambiguity', 0.123), ('hpsg', 0.118), ('eps', 0.115), ('flickinger', 0.114), ('reduced', 0.106), ('alignment', 0.104), ('gold', 0.097), ('align', 0.096), ('translation', 0.094), ('pollard', 0.091), ('mrs', 0.091), ('elementary', 0.091), ('returned', 0.089), ('partial', 0.089), ('parallel', 0.086), ('bender', 0.077), ('edgs', 0.077), ('haugereid', 0.077), ('petter', 0.077), ('scopal', 0.077), ('se', 0.076), ('francis', 0.073), ('manual', 0.07), ('quantifiers', 0.067), ('tuesday', 0.067), ('clock', 0.067), ('dridan', 0.067), ('oepen', 0.067), ('tanaka', 0.067), ('zhechev', 0.067), ('grammars', 0.066), ('ranking', 0.065), ('semantic', 0.063), ('lose', 0.061), ('accepting', 0.061), ('treebanks', 0.06), ('english', 0.058), ('copestake', 0.057), ('recursion', 0.057), ('parser', 0.056), ('transfer', 0.056), ('carl', 0.054), ('semantics', 0.053), ('parsers', 0.052), ('aligning', 0.052), ('ranked', 0.051), ('multiword', 0.051), ('top', 0.051), ('speed', 0.049), ('translated', 0.049), ('burkett', 0.049), ('interpretations', 0.049), ('straightforwardly', 0.049), ('atomic', 0.047), ('grammatical', 0.047), ('sentence', 0.046), ('experimenting', 0.045), ('parsing', 0.045), ('included', 0.044), ('treebank', 0.043), ('reliably', 0.042), ('ivan', 0.042), ('ambiguities', 0.042), ('system', 0.042), ('bilingual', 0.041), ('started', 0.04), ('ment', 0.04), ('phenomena', 0.04), ('smt', 0.037), ('simultaneously', 0.037), ('basis', 0.037), ('stephan', 0.037), ('partially', 0.037), ('relations', 0.037), ('contribute', 0.036), ('attachment', 0.035), ('recall', 0.034), ('melanie', 0.033), ('compilation', 0.033), ('licensed', 0.033), ('maximally', 0.033), ('mise', 0.033), ('saying', 0.033), 
('technological', 0.033), ('velldal', 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence
Author: Lea Frermann ; Francis Bond
Abstract: We present a system for cross-lingual parse disambiguation, exploiting the assumption that the meaning of a sentence remains unchanged during translation and the fact that different languages have different ambiguities. We simultaneously reduce ambiguity in multiple languages in a fully automatic way. Evaluation shows that the system reliably discards dispreferred parses from the raw parser output, which results in a pre-selection that can speed up manual treebanking.
2 0.15755005 25 acl-2012-An Exploration of Forest-to-String Translation: Does Translation Help or Hurt Parsing?
Author: Hui Zhang ; David Chiang
Abstract: Syntax-based translation models that operate on the output of a source-language parser have been shown to perform better if allowed to choose from a set of possible parses. In this paper, we investigate whether this is because it allows the translation stage to overcome parser errors or to override the syntactic structure itself. We find that it is primarily the latter, but that under the right conditions, the translation stage does correct parser errors, improving parsing accuracy on the Chinese Treebank.
3 0.12930177 4 acl-2012-A Comparative Study of Target Dependency Structures for Statistical Machine Translation
Author: Xianchao Wu ; Katsuhito Sudoh ; Kevin Duh ; Hajime Tsukada ; Masaaki Nagata
Abstract: This paper presents a comparative study of target dependency structures yielded by several state-of-the-art linguistic parsers. Our approach is to measure the impact of these nonisomorphic dependency structures to be used for string-to-dependency translation. Besides using traditional dependency parsers, we also use the dependency structures transformed from PCFG trees and predicate-argument structures (PASs) which are generated by an HPSG parser and a CCG parser. The experiments on Chinese-to-English translation show that the HPSG parser’s PASs achieved the best dependency and translation accuracies. 1
4 0.12616052 140 acl-2012-Machine Translation without Words through Substring Alignment
Author: Graham Neubig ; Taro Watanabe ; Shinsuke Mori ; Tatsuya Kawahara
Abstract: In this paper, we demonstrate that accurate machine translation is possible without the concept of “words,” treating MT as a problem of transformation between character strings. We achieve this result by applying phrasal inversion transduction grammar alignment techniques to character strings to train a character-based translation model, and using this in the phrase-based MT framework. We also propose a look-ahead parsing algorithm and substring-informed prior probabilities to achieve more effective and efficient alignment. In an evaluation, we demonstrate that character-based translation can achieve results that compare to word-based systems while effectively translating unknown and uncommon words over several language pairs.
5 0.11941727 19 acl-2012-A Ranking-based Approach to Word Reordering for Statistical Machine Translation
Author: Nan Yang ; Mu Li ; Dongdong Zhang ; Nenghai Yu
Abstract: Long distance word reordering is a major challenge in statistical machine translation research. Previous work has shown using source syntactic trees is an effective way to tackle this problem between two languages with substantial word order difference. In this work, we further extend this line of exploration and propose a novel but simple approach, which utilizes a ranking model based on word order precedence in the target language to reposition nodes in the syntactic parse tree of a source sentence. The ranking model is automatically derived from word aligned parallel data with a syntactic parser for source language based on both lexical and syntactical features. We evaluated our approach on largescale Japanese-English and English-Japanese machine translation tasks, and show that it can significantly outperform the baseline phrase- based SMT system.
6 0.10984815 109 acl-2012-Higher-order Constituent Parsing and Parser Combination
7 0.10441691 175 acl-2012-Semi-supervised Dependency Parsing using Lexical Affinities
8 0.097997881 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations
9 0.097624011 172 acl-2012-Selective Sharing for Multilingual Dependency Parsing
10 0.097556531 128 acl-2012-Learning Better Rule Extraction with Translation Span Alignment
11 0.094932012 134 acl-2012-Learning to Find Translations and Transliterations on the Web
12 0.090977542 162 acl-2012-Post-ordering by Parsing for Japanese-English Statistical Machine Translation
13 0.088710248 71 acl-2012-Dependency Hashing for n-best CCG Parsing
14 0.086477794 179 acl-2012-Smaller Alignment Models for Better Translations: Unsupervised Word Alignment with the l0-norm
15 0.0849232 87 acl-2012-Exploiting Multiple Treebanks for Parsing with Quasi-synchronous Grammars
16 0.082994044 147 acl-2012-Modeling the Translation of Predicate-Argument Structure for SMT
17 0.08203809 127 acl-2012-Large-Scale Syntactic Language Modeling with Treelets
18 0.081083953 64 acl-2012-Crosslingual Induction of Semantic Roles
19 0.081078358 155 acl-2012-NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation
20 0.079765484 122 acl-2012-Joint Evaluation of Morphological Segmentation and Syntactic Parsing
topicId topicWeight
[(0, -0.243), (1, -0.106), (2, -0.074), (3, -0.048), (4, 0.022), (5, -0.056), (6, -0.017), (7, -0.001), (8, 0.015), (9, 0.008), (10, 0.112), (11, 0.043), (12, -0.005), (13, 0.049), (14, -0.006), (15, -0.082), (16, -0.032), (17, -0.072), (18, -0.054), (19, 0.037), (20, -0.106), (21, -0.01), (22, 0.028), (23, -0.026), (24, -0.019), (25, 0.038), (26, -0.061), (27, -0.008), (28, 0.109), (29, -0.036), (30, 0.058), (31, 0.007), (32, 0.012), (33, -0.025), (34, -0.034), (35, 0.062), (36, 0.038), (37, -0.017), (38, 0.004), (39, 0.044), (40, 0.032), (41, 0.057), (42, -0.081), (43, 0.038), (44, -0.127), (45, -0.154), (46, -0.023), (47, 0.021), (48, 0.013), (49, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.9544192 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence
Author: Lea Frermann ; Francis Bond
Abstract: We present a system for cross-lingual parse disambiguation, exploiting the assumption that the meaning of a sentence remains unchanged during translation and the fact that different languages have different ambiguities. We simultaneously reduce ambiguity in multiple languages in a fully automatic way. Evaluation shows that the system reliably discards dispreferred parses from the raw parser output, which results in a pre-selection that can speed up manual treebanking.
2 0.64145249 162 acl-2012-Post-ordering by Parsing for Japanese-English Statistical Machine Translation
Author: Isao Goto ; Masao Utiyama ; Eiichiro Sumita
Abstract: Reordering is a difficult task in translating between widely different languages such as Japanese and English. We employ the postordering framework proposed by (Sudoh et al., 2011b) for Japanese to English translation and improve upon the reordering method. The existing post-ordering method reorders a sequence of target language words in a source language word order via SMT, while our method reorders the sequence by: 1) parsing the sequence to obtain syntax structures similar to a source language structure, and 2) transferring the obtained syntax structures into the syntax structures of the target language.
3 0.63468885 25 acl-2012-An Exploration of Forest-to-String Translation: Does Translation Help or Hurt Parsing?
Author: Hui Zhang ; David Chiang
Abstract: Syntax-based translation models that operate on the output of a source-language parser have been shown to perform better if allowed to choose from a set of possible parses. In this paper, we investigate whether this is because it allows the translation stage to overcome parser errors or to override the syntactic structure itself. We find that it is primarily the latter, but that under the right conditions, the translation stage does correct parser errors, improving parsing accuracy on the Chinese Treebank.
4 0.62204206 175 acl-2012-Semi-supervised Dependency Parsing using Lexical Affinities
Author: Seyed Abolghasem Mirroshandel ; Alexis Nasr ; Joseph Le Roux
Abstract: Treebanks are not large enough to reliably model precise lexical phenomena. This deficiency provokes attachment errors in the parsers trained on such data. We propose in this paper to compute lexical affinities, on large corpora, for specific lexico-syntactic configurations that are hard to disambiguate and introduce the new information in a parser. Experiments on the French Treebank showed a relative decrease ofthe error rate of 7. 1% Labeled Accuracy Score yielding the best parsing results on this treebank.
5 0.59363592 1 acl-2012-ACCURAT Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora
Author: Marcis Pinnis ; Radu Ion ; Dan Stefanescu ; Fangzhong Su ; Inguna Skadina ; Andrejs Vasiljevs ; Bogdan Babych
Abstract: The lack of parallel corpora and linguistic resources for many languages and domains is one of the major obstacles for the further advancement of automated translation. A possible solution is to exploit comparable corpora (non-parallel bi- or multi-lingual text resources) which are much more widely available than parallel translation data. Our presented toolkit deals with parallel content extraction from comparable corpora. It consists of tools bundled in two workflows: (1) alignment of comparable documents and extraction of parallel sentences and (2) extraction and bilingual mapping of terms and named entities. The toolkit pairs similar bilingual comparable documents and extracts parallel sentences and bilingual terminological and named entity dictionaries from comparable corpora. This demonstration focuses on the English, Latvian, Lithuanian, and Romanian languages.
6 0.58811617 122 acl-2012-Joint Evaluation of Morphological Segmentation and Syntactic Parsing
7 0.56884038 4 acl-2012-A Comparative Study of Target Dependency Structures for Statistical Machine Translation
8 0.56455922 75 acl-2012-Discriminative Strategies to Integrate Multiword Expression Recognition and Parsing
9 0.56143779 172 acl-2012-Selective Sharing for Multilingual Dependency Parsing
10 0.56123787 127 acl-2012-Large-Scale Syntactic Language Modeling with Treelets
11 0.55426556 140 acl-2012-Machine Translation without Words through Substring Alignment
12 0.54733187 81 acl-2012-Enhancing Statistical Machine Translation with Character Alignment
13 0.53902972 5 acl-2012-A Comparison of Chinese Parsers for Stanford Dependencies
14 0.53897429 87 acl-2012-Exploiting Multiple Treebanks for Parsing with Quasi-synchronous Grammars
15 0.5316959 30 acl-2012-Attacking Parsing Bottlenecks with Unlabeled Data and Relevant Factorizations
16 0.52138084 109 acl-2012-Higher-order Constituent Parsing and Parser Combination
17 0.51949298 219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool
18 0.51607817 19 acl-2012-A Ranking-based Approach to Word Reordering for Statistical Machine Translation
19 0.50471985 11 acl-2012-A Feature-Rich Constituent Context Model for Grammar Induction
20 0.50435829 160 acl-2012-Personalized Normalization for a Multilingual Chat System
topicId topicWeight
[(15, 0.013), (25, 0.019), (26, 0.048), (28, 0.055), (30, 0.032), (37, 0.063), (39, 0.054), (40, 0.265), (59, 0.012), (71, 0.017), (74, 0.031), (82, 0.016), (84, 0.026), (85, 0.057), (90, 0.095), (92, 0.062), (94, 0.012), (99, 0.061)]
simIndex simValue paperId paperTitle
same-paper 1 0.72832358 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence
Author: Lea Frermann ; Francis Bond
Abstract: We present a system for cross-lingual parse disambiguation, exploiting the assumption that the meaning of a sentence remains unchanged during translation and the fact that different languages have different ambiguities. We simultaneously reduce ambiguity in multiple languages in a fully automatic way. Evaluation shows that the system reliably discards dispreferred parses from the raw parser output, which results in a pre-selection that can speed up manual treebanking.
2 0.63905412 148 acl-2012-Modified Distortion Matrices for Phrase-Based Statistical Machine Translation
Author: Arianna Bisazza ; Marcello Federico
Abstract: This paper presents a novel method to suggest long word reorderings to a phrase-based SMT decoder. We address language pairs where long reordering concentrates on few patterns, and use fuzzy chunk-based rules to predict likely reorderings for these phenomena. Then we use reordered n-gram LMs to rank the resulting permutations and select the n-best for translation. Finally we encode these reorderings by modifying selected entries of the distortion cost matrix, on a per-sentence basis. In this way, we expand the search space by a much finer degree than if we simply raised the distortion limit. The proposed techniques are tested on Arabic-English and German-English using well-known SMT benchmarks.
3 0.52266562 214 acl-2012-Verb Classification using Distributional Similarity in Syntactic and Semantic Structures
Author: Danilo Croce ; Alessandro Moschitti ; Roberto Basili ; Martha Palmer
Abstract: In this paper, we propose innovative representations for automatic classification of verbs according to mainstream linguistic theories, namely VerbNet and FrameNet. First, syntactic and semantic structures capturing essential lexical and syntactic properties of verbs are defined. Then, we design advanced similarity functions between such structures, i.e., semantic tree kernel functions, for exploiting distributional and grammatical information in Support Vector Machines. The extensive empirical analysis on VerbNet class and frame detection shows that our models capture meaningful syntactic/semantic structures, which allows for improving the state-of-the-art.
4 0.51770401 206 acl-2012-UWN: A Large Multilingual Lexical Knowledge Base
Author: Gerard de Melo ; Gerhard Weikum
Abstract: We present UWN, a large multilingual lexical knowledge base that describes the meanings and relationships of words in over 200 languages. This paper explains how link prediction, information integration and taxonomy induction methods have been used to build UWN based on WordNet and extend it with millions of named entities from Wikipedia. We additionally introduce extensions to cover lexical relationships, frame-semantic knowledge, and language data. An online interface provides human access to the data, while a software API enables applications to look up over 16 million words and names.
5 0.51297629 80 acl-2012-Efficient Tree-based Approximation for Entailment Graph Learning
Author: Jonathan Berant ; Ido Dagan ; Meni Adler ; Jacob Goldberger
Abstract: Learning entailment rules is fundamental in many semantic-inference applications and has been an active field of research in recent years. In this paper we address the problem of learning transitive graphs that describe entailment rules between predicates (termed entailment graphs). We first identify that entailment graphs exhibit a “tree-like” property and are very similar to a novel type of graph termed forest-reducible graph. We utilize this property to develop an iterative efficient approximation algorithm for learning the graph edges, where each iteration takes linear time. We compare our approximation algorithm to a recently-proposed state-of-the-art exact algorithm and show that it is more efficient and scalable both theoretically and empirically, while its output quality is close to that given by the optimal solution of the exact algorithm.
6 0.51148295 174 acl-2012-Semantic Parsing with Bayesian Tree Transducers
7 0.50984931 219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool
8 0.50833166 72 acl-2012-Detecting Semantic Equivalence and Information Disparity in Cross-lingual Documents
9 0.50684667 175 acl-2012-Semi-supervised Dependency Parsing using Lexical Affinities
10 0.5065825 130 acl-2012-Learning Syntactic Verb Frames using Graphical Models
11 0.5045293 29 acl-2012-Assessing the Effect of Inconsistent Assessors on Summarization Evaluation
12 0.50357169 146 acl-2012-Modeling Topic Dependencies in Hierarchical Text Categorization
13 0.50329232 84 acl-2012-Estimating Compact Yet Rich Tree Insertion Grammars
14 0.50265425 132 acl-2012-Learning the Latent Semantics of a Concept from its Definition
15 0.5017823 191 acl-2012-Temporally Anchored Relation Extraction
16 0.50152856 184 acl-2012-String Re-writing Kernel
17 0.49993578 156 acl-2012-Online Plagiarized Detection Through Exploiting Lexical, Syntax, and Semantic Information
18 0.49932864 136 acl-2012-Learning to Translate with Multiple Objectives
19 0.49884042 167 acl-2012-QuickView: NLP-based Tweet Search
20 0.49849412 77 acl-2012-Ecological Evaluation of Persuasive Messages Using Google AdWords