acl acl2012 acl2012-116 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Wei He ; Hua Wu ; Haifeng Wang ; Ting Liu
Abstract: unknown-abstract
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: We propose a novel approach to improving SMT via paraphrase rules that are automatically extracted from the bilingual training data. [sent-5, score-0.992]
2 Without using extra paraphrase resources, we acquire the rules by comparing the source side of the parallel corpus with the target-to-source translations of the target side. [sent-6, score-1.103]
3 Besides word- and phrase-level paraphrases, the acquired paraphrase rules mainly cover structured paraphrases at the sentence level. [sent-7, score-1.399]
4 These rules are employed to enrich the SMT inputs for translation quality improvement. [sent-8, score-0.41]
5 The experimental results show that our proposed approach achieves significant improvements of 1. [sent-9, score-0.029]
6 1 Introduction The translation quality of an SMT system is highly related to the coverage of its translation models. [sent-13, score-0.334]
7 However, no matter how much data is used for training, it is still impossible to completely cover the unlimited variety of input sentences. [sent-14, score-0.184]
8 Naturally, a solution to the coverage problem is to bridge the gap between the input sentences and the translation models, either from the input side, which aims at rewriting the input sentences into MT-favored expressions, or from [sent-16, score-0.661]
9 the side of the translation models, which tries to enrich the translation models to cover more expressions. (This work was done when the first author was visiting Baidu.) [sent-20, score-0.516]
10 In recent years, paraphrasing has proven useful for improving SMT quality. [sent-21, score-0.103]
11 The proposed methods can be classified into two categories according to their paraphrase targets: (1) enriching the translation models to cover more bilingual expressions; (2) paraphrasing the input sentences to reduce OOVs or to generate multiple inputs. [sent-22, score-1.927]
12 (2008) and Nakov (2008) enriched the SMT models via paraphrasing the training corpora. [sent-25, score-0.073]
13 (2010) and Max (2010) used paraphrases to smooth translation models. [sent-27, score-0.422]
14 For the second category, previous studies mainly focus on finding translations for unknown terms using phrasal paraphrases. [sent-28, score-0.206]
15 (2009) paraphrase unknown terms in the input sentences using phrasal paraphrases extracted from bilingual and monolingual corpora. [sent-31, score-1.34]
16 (2009) rewrite OOVs with entailments and paraphrases acquired from WordNet. [sent-33, score-0.377]
17 (2010) use phrasal paraphrases to build a word lattice to get multiple input candidates. [sent-36, score-0.546]
18 In the above methods, only word or phrasal paraphrases are used for input sentence rewriting. [sent-37, score-0.559]
19 No structured paraphrases at the sentence level have been investigated. [sent-38, score-0.32]
20 However, information at the sentence level is very important for disambiguation. [sent-39, score-0.076]
21 For example, we can only substitute "play" with "drama" in a context related to stage or theatre. [sent-40, score-0.037]
22 Phrasal paraphrase substitution can hardly solve such problems. [sent-41, score-0.701]
23 In this paper, we propose a method that rewrites [sent-42, score-0.028]
24 the input sentences of the SMT system using automatically extracted paraphrase rules, which can capture structures at the sentence level in addition to paraphrases at the word or phrase level. [sent-45, score-1.396]
25 Without extra paraphrase resources, a novel approach is proposed to acquire paraphrase rules from the bilingual training corpus based on the results of Forward-Translation and Back-Translation. [sent-46, score-1.749]
26 The rules aim at rewriting the input sentences into MT-favored expressions to ensure a better translation. [sent-47, score-0.407]
27 The paraphrase rules cover all kinds of paraphrases at the word, phrase, and sentence levels, enabling structure reordering as well as word or phrase insertion, deletion, and substitution. [sent-48, score-1.337]
28 The experimental results show that our proposed approach achieves significant improvements of 1. [sent-49, score-0.029]
29 Section 3 introduces our method for extracting paraphrase rules from the bilingual SMT training corpus. [sent-54, score-0.961]
30 Section 4 describes the strategies for constructing a word lattice with the paraphrase rules. [sent-55, score-0.79]
31 Finally, Section 8 concludes the paper and suggests directions for future work. [sent-58, score-0.03]
32 The Back-Translation method is mainly used for automatic MT evaluation (Rapp, 2009). [sent-60, score-0.038]
33 This approach is very helpful when no target-language reference is available. [sent-61, score-0.046]
34 The procedure includes translating a text into a foreign language with the MT system (Forward-Translation), and then translating it back into the original language with the same system (Back-Translation). [sent-63, score-0.138]
35 Finally, the translation quality of the Back-Translation is evaluated using the original source texts as references. [sent-64, score-0.183]
36 (2010) reported an interesting phenomenon: given a bilingual text, the Back-Translation results of the target sentences are better than the Forward-Translation results of the source sentences. [sent-66, score-0.197]
37 Formally, let (S0, T0) be the initial pair of bilingual texts. [sent-67, score-0.144]
38 A source-to-target translation system SYS_ST and a target-to-source translation system SYS_TS are trained using the bilingual corpus. [sent-68, score-0.405]
39 BT(·) is the Back-Translation function, which can be deduced with two rounds of translation: BT(T0) = SYS_ST(SYS_TS(T0)). [sent-77, score-0.034]
40 In the first round of translation, S0 and T0 are fed into SYS_ST and SYS_TS respectively, and we get T1 and S1 as translation results. [sent-99, score-0.282]
41 In the second round, we translate S1 back into the target side with SYS_ST, and get the translation T2. [sent-100, score-0.275]
42 The procedure is illustrated in Figure 1, which can also formally be described as: 1. [sent-101, score-0.032]
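For concreteness, the two rounds can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code; `sys_st` and `sys_ts` are hypothetical sentence-level wrappers around the two PBMT systems.

```python
# Sketch of the Forward-/Back-Translation rounds described above.
# sys_st / sys_ts: hypothetical callables wrapping SYS_ST and SYS_TS.

def back_translation_rounds(s0, t0, sys_st, sys_ts):
    """Given aligned sentence lists (S0, T0), produce T1, S1 and T2."""
    t1 = [sys_st(s) for s in s0]  # round 1: T1 = SYS_ST(S0)
    s1 = [sys_ts(t) for t in t0]  # round 1: S1 = SYS_TS(T0)
    t2 = [sys_st(s) for s in s1]  # round 2: T2 = SYS_ST(S1)
    return t1, s1, t2
```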
43 (2010) that T2 achieves a higher score than T1 in automatic MT evaluation. [sent-106, score-0.029]
44 This outcome is important because T2 is translated from the machine-generated text S1, while T1 is translated from the human-written text S0. [sent-107, score-0.053]
45 Why does the machine-generated text result in a better translation than the human-written text? [sent-109, score-0.151]
46 Note that all the texts S0, S1, S2, T0 and T1 are sentence-aligned, because the initial parallel corpus (S0, T0) is aligned at the sentence level. [sent-111, score-0.38]
47 The aligned sentence pairs in (S0, S1) can be considered as paraphrases. [sent-112, score-0.155]
48 Taking (S0, S1) as a paraphrase resource, we propose a method that automatically extracts paraphrase rules to capture MT-favored structures. [sent-115, score-1.559]
49 3.1 Definition of Paraphrase Rules We define a paraphrase rule as follows: 1. [sent-117, score-0.771]
50 A paraphrase rule consists of two parts: the left-hand side (LHS) and the right-hand side (RHS). [sent-118, score-0.771]
51 Both the LHS and the RHS consist of nonterminals (slots) and terminals (words). [sent-119, score-0.028]
52 A paraphrase rule is written in the format LHS → RHS, which means the words matched by the LHS can be paraphrased to the RHS. [sent-124, score-0.922]
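As a rough illustration of this definition (not the authors' data structure), a rule can be encoded as two token sequences in which slot nonterminals are marked by identifiers such as X1:

```python
from dataclasses import dataclass

# Illustrative encoding of a paraphrase rule: LHS and RHS mix
# terminals (words) with slot nonterminals such as "X1".

@dataclass
class ParaphraseRule:
    lhs: list   # e.g. ["乘坐/ride", "X1", "公共汽车/bus"]
    rhs: list   # e.g. ["乘坐/ride", "X1", "巴士/bus"]
    freq: int = 1  # frequency of the rule in the training data

def is_slot(token):
    # A nonterminal slot is written as "X" followed by an index.
    return token.startswith("X") and token[1:].isdigit()
```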
53 3.2 Selecting Paraphrase Sentence Pairs Following the method in Section 2, the initial bilingual corpus is (S0, T0). [sent-127, score-1.002]
54 We train a source-to-target PBMT system (SYS_ST) and a target-to-source PBMT system (SYS_TS) on the parallel corpus. [sent-128, score-0.029]
55 As mentioned above, the detailed procedure is: T1 = SYS_ST(S0), S1 = SYS_TS(T0), T2 = SYS_ST(S1). [sent-130, score-0.032]
56 We then compute the BLEU (Papineni et al., 2002) score for every sentence in T2 and T1, using the corresponding sentence in T0 as the reference. [sent-132, score-0.152]
57 If the sentence in T2 has a higher BLEU score than the aligned sentence in T1, the corresponding sentences in S0 and S1 are selected as a candidate paraphrase sentence pair, to be used in the following steps of paraphrase extraction. [sent-133, score-1.757]
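A sketch of this selection step, assuming NLTK's sentence-level BLEU (the paper does not say which BLEU implementation it uses, and the smoothing choice here is ours):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Sketch: keep (S0, S1) pairs whose T2 sentence outscores the aligned
# T1 sentence against the T0 reference. All inputs are token lists.

def select_pairs(s0, s1, t0, t1, t2):
    smooth = SmoothingFunction().method1  # our choice, not the paper's
    selected = []
    for src0, src1, ref, hyp1, hyp2 in zip(s0, s1, t0, t1, t2):
        b1 = sentence_bleu([ref], hyp1, smoothing_function=smooth)
        b2 = sentence_bleu([ref], hyp2, smoothing_function=smooth)
        if b2 > b1:
            selected.append((src0, src1))
    return selected
```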
58 3.3 Word Alignment Filtering We can construct word alignments between S0 and S1 through T0. [sent-135, score-0.093]
59 On the initial corpus (S0, T0), we conduct word alignment with GIZA++ (Och and Ney, 2000) in both directions and then apply the grow-diag-final heuristic (Koehn et al., 2003). [sent-136, score-0.164]
60 Because S1 is generated by feeding T0 into the PBMT system SYS_TS, the word alignment between T0 and S1 can be acquired from the verbose information of the decoder. [sent-138, score-0.222]
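Pivoting the two alignments through T0 can be sketched as follows; representing an alignment as a set of index pairs is our simplification:

```python
# Sketch: derive an S0-S1 alignment by pivoting through T0.
# a_s0_t0 holds (i, j) pairs linking S0 position i to T0 position j;
# a_t0_s1 holds (j, k) pairs linking T0 position j to S1 position k.

def compose_alignments(a_s0_t0, a_t0_s1):
    by_t0 = {}
    for i, j in a_s0_t0:
        by_t0.setdefault(j, set()).add(i)
    a_s0_s1 = set()
    for j, k in a_t0_s1:
        for i in by_t0.get(j, ()):
            a_s0_s1.add((i, k))  # link S0 position i to S1 position k
    return a_s0_s1
```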
61 The word alignments between S0 and S1 contain noise, produced either by alignment errors of GIZA++ or by translation errors of SYS_TS. [sent-139, score-0.309]
62 To ensure alignment quality, we use the following heuristics to filter the alignments between S0 and S1: 1. [sent-140, score-0.224]
63 If two identical words are aligned in S0 and S1, then remove all the other links to the two words. [sent-141, score-0.079]
64 Stop words (including some function words and punctuation marks) can only be aligned to either stop words or null. [sent-143, score-0.115]
65 Figure 2 illustrates an example of using the heuristics to filter alignment. [sent-144, score-0.062]
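The two heuristics above might look like this in code (a sketch; exact string equality and a fixed stop-word list are simplifying assumptions):

```python
# Sketch of the two alignment-filtering heuristics. `links` is a set of
# (i, k) pairs aligning S0 token i to S1 token k; s0, s1 are token lists.

def filter_links(links, s0, s1, stopwords):
    # Heuristic 1: if identical words are aligned, drop all other
    # links touching either of those two positions.
    identical = {(i, k) for i, k in links if s0[i] == s1[k]}
    fixed_i = {i for i, _ in identical}
    fixed_k = {k for _, k in identical}
    links = {(i, k) for i, k in links
             if (i, k) in identical
             or (i not in fixed_i and k not in fixed_k)}
    # Heuristic 2: stop words may only align to stop words; dropping
    # a link leaves the word unaligned (i.e. aligned to null).
    links = {(i, k) for i, k in links
             if (s0[i] in stopwords) == (s1[k] in stopwords)}
    return links
```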
66 3.4 Extracting Paraphrase Rules From the word-aligned sentence pairs, we then extract a set of rules that are consistent with the word alignments. [sent-146, score-0.259]
67 We use the rule extraction method of Chiang (2005). [sent-147, score-0.07]
68 However, it is risky to directly replace the input sentence with a paraphrased sentence, since errors in automatic paraphrase substitution may seriously jeopardize the translation result. [sent-149, score-1.084]
69 To avoid such damage, for a given input sentence, we first transform all paraphrase rules that match the input sentence into phrasal paraphrases, and then build a word lattice [sent-150, score-1.491]
70 for the SMT decoder using the phrasal paraphrases. [Figure 3: example of applying the paraphrase rule LHS 乘坐/ride X1 公共汽车/bus → RHS 乘坐/ride X1 巴士/bus to an input containing 欢迎 乘坐 (welcome ride) ... 巴士 (bus).] [sent-152, score-0.228]
71 In this case, the decoder can search for the best result among all the possible paths. [sent-153, score-0.03]
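The lattice construction can be sketched as follows; edges are (start_node, end_node, word) triples, a simplification of whatever lattice format the actual decoder consumes:

```python
# Sketch: build a word lattice over a sub-sentence. The baseline path
# keeps the original tokens; each phrasal paraphrase, given as
# (span_start, span_end, replacement_tokens), adds an alternative path.
# Pure deletions would need an epsilon edge, which is omitted here.

def build_lattice(tokens, paraphrases):
    edges = [(i, i + 1, tokens[i]) for i in range(len(tokens))]
    next_node = len(tokens) + 1  # fresh ids for internal lattice nodes
    for start, end, repl in paraphrases:
        prev = start
        for n, word in enumerate(repl):
            if n == len(repl) - 1:
                cur = end        # rejoin the original path
            else:
                cur = next_node  # fresh internal node
                next_node += 1
            edges.append((prev, cur, word))
            prev = cur
    return edges

# e.g. tokens = ["欢迎", "乘坐", "10", "公共汽车"] with the paraphrase
# (1, 4, ["乘坐", "10", "巴士"]) adds an alternative path through 巴士/bus.
```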
72 The input sentences are first segmented into sub-sentences at punctuation marks. [sent-154, score-0.13]
73 Then, for each sub-sentence, the matched paraphrase rules are ranked according to: (1) the number of matched words; (2) the frequency of the paraphrase rule in the training data. [sent-155, score-1.783]
74 In effect, the ranking strategy tends to select paraphrase rules that have more matched words (and therefore less ambiguity) and higher frequency (and are therefore more reliable). [sent-156, score-0.935]
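The ranking criterion amounts to a lexicographic sort key. A sketch, reusing the ParaphraseRule class and is_slot helper from the earlier snippet; count_matched_words is a hypothetical helper, not the paper's:

```python
# Sketch of the ranking strategy: prefer rules whose LHS matches more
# words of the sub-sentence, then prefer more frequent rules.

def count_matched_words(rule, tokens):
    # Number of LHS terminals found in the sub-sentence; slot
    # nonterminals such as "X1" do not count as matched words.
    return sum(1 for t in rule.lhs if not is_slot(t) and t in tokens)

def rank_rules(rules, tokens):
    return sorted(
        rules,
        key=lambda r: (count_matched_words(r, tokens), r.freq),
        reverse=True,
    )
```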
wordName wordTfidf (topN-words)
[('paraphrase', 0.701), ('paraphrases', 0.244), ('lhs', 0.202), ('handbag', 0.169), ('rules', 0.157), ('translation', 0.151), ('phrasal', 0.131), ('rhs', 0.126), ('smt', 0.124), ('bilingual', 0.103), ('pbmt', 0.101), ('round', 0.1), ('blue', 0.097), ('ivery', 0.085), ('ride', 0.085), ('input', 0.082), ('interest', 0.08), ('aligned', 0.079), ('matched', 0.077), ('sentence', 0.076), ('reserved', 0.074), ('oovs', 0.074), ('paraphrased', 0.074), ('paraphrasing', 0.073), ('cover', 0.071), ('enrich', 0.07), ('feel', 0.07), ('rule', 0.07), ('mt', 0.069), ('oral', 0.067), ('bus', 0.067), ('alignment', 0.067), ('lattice', 0.063), ('acquired', 0.055), ('bleu', 0.054), ('sentences', 0.048), ('acquire', 0.047), ('target', 0.046), ('side', 0.046), ('rewriting', 0.046), ('rewrite', 0.044), ('targets', 0.042), ('initial', 0.041), ('orders', 0.04), ('extra', 0.04), ('points', 0.039), ('mainly', 0.038), ('translations', 0.037), ('rapp', 0.037), ('sxl', 0.037), ('verbose', 0.037), ('avti', 0.037), ('mirkin', 0.037), ('feeding', 0.037), ('noises', 0.037), ('damage', 0.037), ('drama', 0.037), ('translating', 0.037), ('reordering', 0.037), ('stop', 0.036), ('phenomenon', 0.035), ('baidu', 0.034), ('tliu', 0.034), ('entailments', 0.034), ('deduced', 0.034), ('sun', 0.033), ('back', 0.032), ('procedure', 0.032), ('giza', 0.032), ('heuristics', 0.032), ('quality', 0.032), ('reversed', 0.031), ('fed', 0.031), ('unlimited', 0.031), ('welcome', 0.031), ('phrase', 0.031), ('extracted', 0.031), ('filter', 0.03), ('decoder', 0.03), ('directions', 0.03), ('bond', 0.03), ('proven', 0.03), ('nakov', 0.03), ('parallel', 0.029), ('achieves', 0.029), ('alignments', 0.028), ('harbin', 0.028), ('terminals', 0.028), ('rewrites', 0.028), ('bridge', 0.028), ('ensure', 0.028), ('translated', 0.027), ('tries', 0.027), ('smooth', 0.027), ('marton', 0.027), ('word', 0.026), ('outcome', 0.026), ('slot', 0.026), ('visiting', 0.026), ('gaps', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 116 acl-2012-Improve SMT Quality with Automatically Extracted Paraphrase Rules
Author: Wei He ; Hua Wu ; Haifeng Wang ; Ting Liu
Abstract: unknown-abstract
2 0.56948429 125 acl-2012-Joint Learning of a Dual SMT System for Paraphrase Generation
Author: Hong Sun ; Ming Zhou
Abstract: SMT has been used in paraphrase generation by translating a source sentence into another (pivot) language and then back into the source. The resulting sentences can be used as candidate paraphrases of the source sentence. Existing work that uses two independently trained SMT systems cannot directly optimize the paraphrase results. Paraphrase criteria, especially the paraphrase rate, are not able to be ensured in that way. In this paper, we propose a joint learning method of two SMT systems to optimize the process of paraphrase generation. In addition, a revised BLEU score (called iBLEU) which measures the adequacy and diversity of the generated paraphrase sentence is proposed for tuning parameters in SMT systems. Our experiments on NIST 2008 testing data with automatic evaluation as well as human judgments suggest that the proposed method is able to enhance the paraphrase quality by adjusting between semantic equivalency and surface dissimilarity.
3 0.24095339 92 acl-2012-FLOW: A First-Language-Oriented Writing Assistant System
Author: MeiHua Chen ; ShihTing Huang ; HungTing Hsieh ; TingHui Kao ; Jason S. Chang
Abstract: Writing in English might be one of the most difficult tasks for EFL (English as a Foreign Language) learners. This paper presents FLOW, a writing assistance system. It is built based on first-language-oriented input function and context sensitive approach, aiming at providing immediate and appropriate suggestions including translations, paraphrases, and n-grams during composing and revising processes. FLOW is expected to help EFL writers achieve their writing flow without being interrupted by their insufficient lexical knowledge. 1.
4 0.19104326 184 acl-2012-String Re-writing Kernel
Author: Fan Bu ; Hang Li ; Xiaoyan Zhu
Abstract: Learning for sentence re-writing is a fundamental task in natural language processing and information retrieval. In this paper, we propose a new class of kernel functions, referred to as string re-writing kernel, to address the problem. A string re-writing kernel measures the similarity between two pairs of strings, each pair representing re-writing of a string. It can capture the lexical and structural similarity between two pairs of sentences without the need of constructing syntactic trees. We further propose an instance of string rewriting kernel which can be computed efficiently. Experimental results on benchmark datasets show that our method can achieve better results than state-of-the-art methods on two sentence re-writing learning tasks: paraphrase identification and recognizing textual entailment.
5 0.1610454 65 acl-2012-Crowdsourcing Inference-Rule Evaluation
Author: Naomi Zeichner ; Jonathan Berant ; Ido Dagan
Abstract: The importance of inference rules to semantic applications has long been recognized and extensive work has been carried out to automatically acquire inference-rule resources. However, evaluating such resources has turned out to be a non-trivial task, slowing progress in the field. In this paper, we suggest a framework for evaluating inference-rule resources. Our framework simplifies a previously proposed “instance-based evaluation” method that involved substantial annotator training, making it suitable for crowdsourcing. We show that our method produces a large amount of annotations with high inter-annotator agreement for a low cost at a short period of time, without requiring training expert annotators.
6 0.13588133 141 acl-2012-Maximum Expected BLEU Training of Phrase and Lexicon Translation Models
7 0.1313218 140 acl-2012-Machine Translation without Words through Substring Alignment
8 0.12350089 204 acl-2012-Translation Model Size Reduction for Hierarchical Phrase-based Statistical Machine Translation
9 0.12297256 128 acl-2012-Learning Better Rule Extraction with Translation Span Alignment
10 0.11202235 155 acl-2012-NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation
11 0.11095588 69 acl-2012-Deep Learning for NLP (without Magic)
12 0.10893459 178 acl-2012-Sentence Simplification by Monolingual Machine Translation
13 0.1040976 147 acl-2012-Modeling the Translation of Predicate-Argument Structure for SMT
14 0.10305449 25 acl-2012-An Exploration of Forest-to-String Translation: Does Translation Help or Hurt Parsing?
15 0.10230737 203 acl-2012-Translation Model Adaptation for Statistical Machine Translation with Monolingual Topic Information
16 0.10145123 22 acl-2012-A Topic Similarity Model for Hierarchical Phrase-based Translation
17 0.099576727 19 acl-2012-A Ranking-based Approach to Word Reordering for Statistical Machine Translation
18 0.093682148 179 acl-2012-Smaller Alignment Models for Better Translations: Unsupervised Word Alignment with the l0-norm
19 0.093517676 54 acl-2012-Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages
20 0.090575673 143 acl-2012-Mixing Multiple Translation Models in Statistical Machine Translation
topicId topicWeight
[(0, -0.228), (1, -0.232), (2, 0.137), (3, 0.075), (4, 0.122), (5, -0.039), (6, -0.044), (7, 0.181), (8, -0.086), (9, -0.026), (10, -0.066), (11, 0.211), (12, 0.023), (13, 0.223), (14, 0.191), (15, 0.209), (16, 0.058), (17, -0.253), (18, -0.303), (19, 0.194), (20, 0.134), (21, 0.129), (22, -0.239), (23, -0.068), (24, -0.079), (25, -0.034), (26, 0.075), (27, -0.03), (28, -0.023), (29, -0.003), (30, -0.004), (31, 0.06), (32, -0.05), (33, 0.066), (34, -0.024), (35, 0.002), (36, -0.034), (37, 0.003), (38, -0.051), (39, 0.048), (40, -0.068), (41, -0.018), (42, 0.019), (43, -0.048), (44, -0.017), (45, 0.01), (46, 0.034), (47, -0.043), (48, -0.019), (49, 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.94359028 116 acl-2012-Improve SMT Quality with Automatically Extracted Paraphrase Rules
Author: Wei He ; Hua Wu ; Haifeng Wang ; Ting Liu
Abstract: unknown-abstract
2 0.9140054 125 acl-2012-Joint Learning of a Dual SMT System for Paraphrase Generation
Author: Hong Sun ; Ming Zhou
Abstract: SMT has been used in paraphrase generation by translating a source sentence into another (pivot) language and then back into the source. The resulting sentences can be used as candidate paraphrases of the source sentence. Existing work that uses two independently trained SMT systems cannot directly optimize the paraphrase results. Paraphrase criteria, especially the paraphrase rate, are not able to be ensured in that way. In this paper, we propose a joint learning method of two SMT systems to optimize the process of paraphrase generation. In addition, a revised BLEU score (called iBLEU) which measures the adequacy and diversity of the generated paraphrase sentence is proposed for tuning parameters in SMT systems. Our experiments on NIST 2008 testing data with automatic evaluation as well as human judgments suggest that the proposed method is able to enhance the paraphrase quality by adjusting between semantic equivalency and surface dissimilarity.
3 0.70802736 92 acl-2012-FLOW: A First-Language-Oriented Writing Assistant System
Author: MeiHua Chen ; ShihTing Huang ; HungTing Hsieh ; TingHui Kao ; Jason S. Chang
Abstract: Writing in English might be one of the most difficult tasks for EFL (English as a Foreign Language) learners. This paper presents FLOW, a writing assistance system. It is built based on first-language-oriented input function and context sensitive approach, aiming at providing immediate and appropriate suggestions including translations, paraphrases, and n-grams during composing and revising processes. FLOW is expected to help EFL writers achieve their writing flow without being interrupted by their insufficient lexical knowledge. 1.
4 0.574242 184 acl-2012-String Re-writing Kernel
Author: Fan Bu ; Hang Li ; Xiaoyan Zhu
Abstract: Learning for sentence re-writing is a fundamental task in natural language processing and information retrieval. In this paper, we propose a new class of kernel functions, referred to as string re-writing kernel, to address the problem. A string re-writing kernel measures the similarity between two pairs of strings, each pair representing re-writing of a string. It can capture the lexical and structural similarity between two pairs of sentences without the need of constructing syntactic trees. We further propose an instance of string rewriting kernel which can be computed efficiently. Experimental results on benchmark datasets show that our method can achieve better results than state-of-the-art methods on two sentence re-writing learning tasks: paraphrase identification and recognizing textual entailment.
5 0.45214108 178 acl-2012-Sentence Simplification by Monolingual Machine Translation
Author: Sander Wubben ; Antal van den Bosch ; Emiel Krahmer
Abstract: In this paper we describe a method for simplifying sentences using Phrase Based Machine Translation, augmented with a re-ranking heuristic based on dissimilarity, and trained on a monolingual parallel corpus. We compare our system to a word-substitution baseline and two state-of-the-art systems, all trained and tested on paired sentences from the English part of Wikipedia and Simple Wikipedia. Human test subjects judge the output of the different systems. Analysing the judgements shows that by relatively careful phrase-based paraphrasing our model achieves similar simplification results to state-of-the-art systems, while generating better formed output. We also argue that text readability metrics such as the Flesch-Kincaid grade level should be used with caution when evaluating the output of simplification systems.
6 0.38295543 204 acl-2012-Translation Model Size Reduction for Hierarchical Phrase-based Statistical Machine Translation
7 0.36307696 162 acl-2012-Post-ordering by Parsing for Japanese-English Statistical Machine Translation
8 0.34891975 65 acl-2012-Crowdsourcing Inference-Rule Evaluation
9 0.33260369 54 acl-2012-Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages
10 0.32195038 105 acl-2012-Head-Driven Hierarchical Phrase-based Translation
11 0.32045415 128 acl-2012-Learning Better Rule Extraction with Translation Span Alignment
12 0.31999165 138 acl-2012-LetsMT!: Cloud-Based Platform for Do-It-Yourself Machine Translation
13 0.3104479 164 acl-2012-Private Access to Phrase Tables for Statistical Machine Translation
14 0.30749875 141 acl-2012-Maximum Expected BLEU Training of Phrase and Lexicon Translation Models
15 0.2979655 140 acl-2012-Machine Translation without Words through Substring Alignment
16 0.29088557 108 acl-2012-Hierarchical Chunk-to-String Translation
17 0.28851932 66 acl-2012-DOMCAT: A Bilingual Concordancer for Domain-Specific Computer Assisted Translation
18 0.28540608 1 acl-2012-ACCURAT Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora
19 0.27627596 155 acl-2012-NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation
20 0.27037331 81 acl-2012-Enhancing Statistical Machine Translation with Character Alignment
topicId topicWeight
[(26, 0.03), (28, 0.062), (30, 0.066), (37, 0.029), (39, 0.063), (48, 0.01), (60, 0.256), (74, 0.04), (82, 0.016), (84, 0.018), (85, 0.037), (90, 0.108), (92, 0.05), (94, 0.052), (99, 0.069)]
simIndex simValue paperId paperTitle
1 0.6974414 99 acl-2012-Finding Salient Dates for Building Thematic Timelines
Author: Remy Kessler ; Xavier Tannier ; Caroline Hagege ; Veronique Moriceau ; Andre Bittar
Abstract: We present an approach for detecting salient (important) dates in texts in order to automatically build event timelines from a search query (e.g. the name of an event or person, etc.). This work was carried out on a corpus of newswire texts in English provided by the Agence France Presse (AFP). In order to extract salient dates that warrant inclusion in an event timeline, we first recognize and normalize temporal expressions in texts and then use a machine-learning approach to extract salient dates that relate to a particular topic. We focused only on extracting the dates and not the events to which they are related.
same-paper 2 0.67580676 116 acl-2012-Improve SMT Quality with Automatically Extracted Paraphrase Rules
Author: Wei He ; Hua Wu ; Haifeng Wang ; Ting Liu
Abstract: unknown-abstract
3 0.58280337 196 acl-2012-The OpenGrm open-source finite-state grammar software libraries
Author: Brian Roark ; Richard Sproat ; Cyril Allauzen ; Michael Riley ; Jeffrey Sorensen ; Terry Tai
Abstract: In this paper, we present a new collection of open-source software libraries that provides command line binary utilities and library classes and functions for compiling regular expression and context-sensitive rewrite rules into finite-state transducers, and for n-gram language modeling. The OpenGrm libraries use the OpenFst library to provide an efficient encoding of grammars and general algorithms for building, modifying and applying models.
4 0.54364407 80 acl-2012-Efficient Tree-based Approximation for Entailment Graph Learning
Author: Jonathan Berant ; Ido Dagan ; Meni Adler ; Jacob Goldberger
Abstract: Learning entailment rules is fundamental in many semantic-inference applications and has been an active field of research in recent years. In this paper we address the problem of learning transitive graphs that describe entailment rules between predicates (termed entailment graphs). We first identify that entailment graphs exhibit a “tree-like” property and are very similar to a novel type of graph termed forest-reducible graph. We utilize this property to develop an iterative efficient approximation algorithm for learning the graph edges, where each iteration takes linear time. We compare our approximation algorithm to a recently-proposed state-of-the-art exact algorithm and show that it is more efficient and scalable both theoretically and empirically, while its output quality is close to that given by the optimal solution of the exact algorithm.
5 0.54005861 40 acl-2012-Big Data versus the Crowd: Looking for Relationships in All the Right Places
Author: Ce Zhang ; Feng Niu ; Christopher Re ; Jude Shavlik
Abstract: Classically, training relation extractors relies on high-quality, manually annotated training data, which can be expensive to obtain. To mitigate this cost, NLU researchers have considered two newly available sources of less expensive (but potentially lower quality) labeled data from distant supervision and crowd sourcing. There is, however, no study comparing the relative impact of these two sources on the precision and recall of post-learning answers. To fill this gap, we empirically study how state-of-the-art techniques are affected by scaling these two sources. We use corpus sizes of up to 100 million documents and tens of thousands of crowd-source labeled examples. Our experiments show that increasing the corpus size for distant supervision has a statistically significant, positive impact on quality (F1 score). In contrast, human feedback has a positive and statistically significant, but lower, impact on precision and recall.
6 0.53684747 206 acl-2012-UWN: A Large Multilingual Lexical Knowledge Base
7 0.5362556 28 acl-2012-Aspect Extraction through Semi-Supervised Modeling
8 0.53337193 139 acl-2012-MIX Is Not a Tree-Adjoining Language
9 0.53297997 75 acl-2012-Discriminative Strategies to Integrate Multiword Expression Recognition and Parsing
10 0.53264052 19 acl-2012-A Ranking-based Approach to Word Reordering for Statistical Machine Translation
11 0.53187335 83 acl-2012-Error Mining on Dependency Trees
12 0.53183818 29 acl-2012-Assessing the Effect of Inconsistent Assessors on Summarization Evaluation
13 0.53099024 148 acl-2012-Modified Distortion Matrices for Phrase-Based Statistical Machine Translation
14 0.53098589 22 acl-2012-A Topic Similarity Model for Hierarchical Phrase-based Translation
15 0.5304718 175 acl-2012-Semi-supervised Dependency Parsing using Lexical Affinities
16 0.52964312 21 acl-2012-A System for Real-time Twitter Sentiment Analysis of 2012 U.S. Presidential Election Cycle
17 0.52897328 140 acl-2012-Machine Translation without Words through Substring Alignment
18 0.52816474 156 acl-2012-Online Plagiarized Detection Through Exploiting Lexical, Syntax, and Semantic Information
19 0.52775913 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence
20 0.52738327 136 acl-2012-Learning to Translate with Multiple Objectives