acl acl2013 acl2013-355 knowledge-graph by maker-knowledge-mining

355 acl-2013-TransDoop: A Map-Reduce based Crowdsourced Translation for Complex Domain


Source: pdf

Author: Anoop Kunchukuttan ; Rajen Chatterjee ; Shourya Roy ; Abhijit Mishra ; Pushpak Bhattacharyya

Abstract: Large amounts of parallel corpora are required for building Statistical Machine Translation (SMT) systems. We describe the TransDoop system for gathering translations to create parallel corpora from an online crowd workforce who have familiarity with multiple languages but are not expert translators. Our system uses a Map-Reduce-like approach to translation crowdsourcing where sentence translation is decomposed into the following smaller tasks: (a) translation of constituent phrases of the sentence; (b) validation of quality of the phrase translations; and (c) composition of complete sentence translations from phrase translations. TransDoop incorporates quality control mechanisms and easy-to-use worker user interfaces designed to address issues with translation crowdsourcing. We have evaluated the crowd’s output using the METEOR metric. For a complex domain like judicial proceedings, the higher scores obtained by the map-reduce based approach compared to complete sentence translation establish the efficacy of our work.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Large amounts of parallel corpora are required for building Statistical Machine Translation (SMT) systems. [sent-9, score-0.079]

2 We describe the TransDoop system for gathering translations to create parallel corpora from online crowd workforce who have familiarity with multiple languages but are not expert translators. [sent-10, score-0.629]

3 TransDoop incorporates quality control mechanisms and easy-to-use worker user interfaces designed to address issues with translation crowdsourcing. [sent-12, score-0.83]

4 For a complex domain like judicial proceedings, the higher scores obtained by the map-reduce based approach compared to complete sentence translation establish the efficacy of our work. [sent-14, score-0.378]

5 Amazon Mechanical Turk (AMT) and CrowdFlower are representative general-purpose crowdsourcing platforms, whereas Lingotek and Gengo are companies targeted at localization and translation of content, typically leveraging freelancers. [sent-18, score-0.858]

6 Our interest is in developing a crowdsourcing-based system that enables general, non-expert crowd workers to generate natural language content equivalent in quality to that of expert linguists. [sent-19, score-0.741]

7 Realizing the scalability and cost benefits of crowdsourcing for natural language tasks is limited by the ability of novice multilingual workers to generate high-quality translations. [sent-20, score-0.997]

8 We have specific interest in Indian languages due to the large linguistic diversity as well as the scarcity of linguistic resources in these languages when compared to European languages. [sent-21, score-0.064]

9 However, this is a non-trivial task owing to lack of expertise of novice crowd workers in translation of content. [sent-23, score-0.707]

10 It is well understood that familiarity with multiple languages might not be good enough for people to generate high quality translations. [sent-24, score-0.153]

11 Common techniques for quality control like gold data based validation and worker reputation are not effective for a subjective task [sent-26, score-0.521]

12 [Page footer: © 2013 Association for Computational Linguistics, pages 175–180] like translation, which does not have any task-specific measurements. [sent-38, score-0.225]

13 Having expert linguists manually validate crowd generated content defies the purpose of deploying crowdsourcing on a large scale. [sent-39, score-0.837]

14 The technique can be considered similar to a Map-Reduce task run on crowd processors, where the translation task is split into simpler tasks distributed to the crowd (the map stage) and the results are later combined in a reduce stage to generate complete translations. [sent-41, score-0.853]

15 The attempt is to make translation tasks easy and intuitive for novice crowd workers by providing translation aids that help them generate high-quality translations. [sent-42, score-0.604]

16 The multi-stage, Map-Reduce approach simplifies the translation task for crowd workers, while the novel design of the user interface makes the task convenient for the worker and discourages spamming. [sent-44, score-0.857]

17 The system thus offers the potential to generate high quality parallel corpora on a large scale. [sent-45, score-0.194]

18 Section 4 describes the system architecture and workflow, while Section 5 presents important aspects of the user interfaces in the system. [sent-47, score-0.2]

19 2 Related Work Lately, crowdsourcing has been explored as a source for generating data for NLP tasks (Snow et al. [sent-50, score-0.639]

20 Specifically, it has been explored as a channel for collecting different resources for SMT - evaluations of MT output (Callison-Burch, 2009), word alignments in parallel sentences (Gao et al. [sent-52, score-0.099]

21 (2012) have shown the feasibility of crowdsourcing for collecting parallel corpora and pointed out that quality assurance is a major issue for successful translation crowdsourcing. [sent-56, score-1.003]

22 The most popular methods for quality control of crowdsourced tasks are based on sampling and redundancy. [sent-57, score-0.357]

23 (2010) use inter-translator agreement for selection of a good translation from multiple, redundant worker translations. [sent-59, score-0.496]

24 Zaidan and Callison-Burch (2011) score translations using a feature based model comprising sentence level, worker level and crowd ranking based features. [sent-60, score-0.677]

25 However, automatic evaluation of translation quality is difficult, such automatic methods being either inaccurate or expensive. [sent-61, score-0.312]

26 (2012) have collected Indic language corpora data utilizing the crowd for collecting translations as well as validations. [sent-63, score-0.444]

27 The quality of the validations is ensured using goldstandard sentence translations. [sent-64, score-0.209]

28 Our approach to quality control is similar to Post et al. [sent-65, score-0.194]

29 While most crowdsourcing activities for data gathering have been concerned with collecting simple annotations like relevance judgments, there has been work to explore the use of crowdsourcing for more complex tasks, of which translation is a good example. [sent-67, score-1.464]

30 (2010) propose that many complex tasks can be modeled either as iterative workflows (where workers iteratively build on each other’s works) or as parallel workflows (where workers solve the tasks in parallel, with the best result voted upon later). [sent-69, score-0.736]

31 (2011) suggest a map-and-reduce approach to solve complex problems, where a problem is decomposed into smaller problems, which are solved in the map stage and the results are combined in the reduce stage. [sent-71, score-0.218]

32 Our method can be seen as an instance of the map-reduce approach applied to translation crowdsourcing, with two map stages (phrase translation and translation validation) and one reduce stage (sentence combination). [sent-72, score-0.831]

33 3 Multi-Stage Crowdsourcing Pipeline Our system is based on a multi-stage pipeline, whose central idea is to simplify the translation task into smaller tasks. [sent-73, score-0.279]

34 Source language documents are sentencified using standard NLP tokenizers and sentence splitters. [sent-75, score-0.038]

35 This step creates small phrases from complex sentences which can be easily and independently translated. (Figure 1: Multistage crowdsourced translation.) [sent-77, score-0.406]

36 This leads to a crowdsourcing pipeline, with three stages of tasks for the crowd: Phrase Translation (PT), Phrase Translation Validation (PV), Sentence Composition (SC). [sent-78, score-0.664]

37 A group of crowd workers translate source language phrases, the translations are validated by a different group of workers and finally a third group of workers put the phrase translations together to create target language sentences. [sent-79, score-1.115]
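
The three stages described above (PT, PV, SC) can be sketched as a small map-reduce style program. This is only an illustrative sketch, not the TransDoop implementation: the function names, the comma-based phrase splitter, and the toy worker callables are all assumptions introduced here, and real crowd workers would replace the `translate`, `rate` and `compose` callables.

```python
def split_into_phrases(sentence):
    # Hypothetical stand-in for the phrase-splitting step: split on commas.
    return [p.strip() for p in sentence.split(",") if p.strip()]

def phrase_translation_stage(phrases, translate, redundancy=2):
    # Map stage 1 (PT): collect `redundancy` candidate translations per phrase.
    return {p: [translate(p, worker_id=w) for w in range(redundancy)] for p in phrases}

def phrase_validation_stage(candidates, rate):
    # Map stage 2 (PV): keep the highest-rated candidate for each phrase.
    return {p: max(cands, key=lambda c: rate(p, c)) for p, cands in candidates.items()}

def sentence_composition_stage(phrases, best, compose):
    # Reduce stage (SC): compose one sentence from the validated phrase translations.
    return compose([best[p] for p in phrases])

def run_pipeline(sentence, translate, rate, compose, redundancy=2):
    phrases = split_into_phrases(sentence)
    candidates = phrase_translation_stage(phrases, translate, redundancy)
    best = phrase_validation_stage(candidates, rate)
    return sentence_composition_stage(phrases, best, compose)

# Toy stand-ins for crowd workers (assumptions for illustration only).
def toy_translate(phrase, worker_id):
    # Worker 0 "translates" by upper-casing; other workers return the input unchanged.
    return phrase.upper() if worker_id == 0 else phrase

def toy_rate(phrase, candidate):
    # A validator that prefers the upper-cased candidate.
    return 5 if candidate.isupper() else 1

def toy_compose(parts):
    return " ".join(parts)
```

Calling `run_pipeline("the court held, that the appeal fails", toy_translate, toy_rate, toy_compose)` runs all three stages in order on the toy workers.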

38 The validation is done by workers by providing ratings on a k-point scale. [sent-80, score-0.267]

39 The engine manages the execution of the crowdsourcing pipeline, lifecycle of jobs and quality control of submitted tasks. [sent-85, score-0.975]

40 The Engine exposes its capabilities through the Requester API, which can be used by clients for setting up, customizing and monitoring translation crowdsourcing jobs and controlling their execution. [sent-86, score-0.869]

41 In order to make the crowdsourcing engine independent of any specific crowdsourcing platform, platform specific Connectors are developed. [sent-88, score-1.274]

42 The Crowdsourcing system makes the tasks to be crowdsourced available through the Connector API. [sent-89, score-0.191]

43 The connectors are responsible for polling the engine for tasks to be crowdsourced, pushing the tasks to crowdsourcing platforms, hosting worker interfaces for the tasks and pushing the results back to the engine after they have been completed by workers on the crowdsourcing platform. [sent-90, score-2.248]

44 Figure 3 depicts the lifecycle of a translation crowdsourcing job. [sent-92, score-0.831]

45 The requester initiates a translation job for a document (a set of sentences). [sent-93, score-0.52]

46 For the job, PT tasks are created and made available through the Connector API. [sent-96, score-0.075]

47 The connector for the specified platform periodically polls the Crowdsourcing Engine via the Connector API. [sent-97, score-0.234]

48 Once the connector has new PT tasks for crowdsourcing, it interacts with the crowdsourcing platform to request crowdsourcing services. [sent-98, score-1.437]

49 The connector monitors the progress of the tasks and on completion provides the results and execution status to the Crowdsourcing Engine. [sent-99, score-0.287]
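
The polling behaviour of a connector described above might look roughly like the following. This is a hedged sketch under stated assumptions: `InMemoryEngine`, `EchoPlatform`, and every method name are hypothetical stand-ins invented for illustration, since the source does not specify the Connector API's actual calls.

```python
from collections import deque

class InMemoryEngine:
    """Hypothetical stand-in for the Crowdsourcing Engine's Connector API."""
    def __init__(self, tasks):
        self.pending = deque(tasks)   # (task_id, payload) pairs awaiting crowdsourcing
        self.results = {}             # task_id -> (result, status)

    def fetch_new_tasks(self):
        tasks = list(self.pending)
        self.pending.clear()
        return tasks

    def submit_result(self, task_id, result, status):
        self.results[task_id] = (result, status)

class EchoPlatform:
    """Hypothetical crowdsourcing platform that completes every task immediately."""
    def __init__(self):
        self.done = {}

    def post(self, task_id, payload):
        self.done[task_id] = f"translated({payload})"

    def collect_completed(self):
        completed = dict(self.done)
        self.done.clear()
        return completed

def poll_once(engine, platform):
    """One polling cycle: pull new tasks, push them to the platform, report finished ones."""
    for task_id, payload in engine.fetch_new_tasks():
        platform.post(task_id, payload)
    for task_id, result in platform.collect_completed().items():
        engine.submit_result(task_id, result, status="completed")
```

A real connector would call something like `poll_once` periodically; keeping the platform behind a small interface is what lets the engine stay independent of any specific crowdsourcing platform.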

50 Once all the PT tasks for the job are completed, the crowdsourcing Engine initiates the PV task to obtain validations for the translations. [sent-100, score-0.87]

51 The Quality Control system kicks in when all the PV tasks for the job have been completed. [sent-101, score-0.204]

52 The quality control (QC) relies on a combination of sampling and redundancy. [sent-102, score-0.194]

53 Each PV task has a few gold-standard phrase translation pairs, which are used to ensure that the validators are honestly doing their tasks. [sent-103, score-0.351]

54 The judgments from the good validators are used to determine the quality of the phrase translation, based on majority voting, average rating, etc. (Figure 2: Architecture of TransDoop.) [sent-104, score-0.213]

55 If any phrase validations or translations are incorrect, then the corresponding phrases/translations are again sent to the PT/PV stage as the case may be. [sent-106, score-0.42]

56 This will continue until all phrases in the job are correctly translated or a pre-configured number of iterations is reached. [sent-107, score-0.332]
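
The quality control loop described above (gold-standard checks on validators, then majority voting, then re-sending rejected phrases) can be sketched as follows. This is a minimal sketch, not the paper's implementation: it assumes binary "ok"/"bad" judgments instead of the k-point ratings the paper mentions, and the function names and the 0.8 accuracy threshold are hypothetical.

```python
from collections import Counter

def trusted_validators(gold_answers, validator_gold_ratings, min_accuracy=0.8):
    """Keep validators whose ratings on gold-standard pairs match the gold labels often enough."""
    trusted = set()
    for validator, ratings in validator_gold_ratings.items():
        correct = sum(1 for pair, r in ratings.items() if gold_answers.get(pair) == r)
        if ratings and correct / len(ratings) >= min_accuracy:
            trusted.add(validator)
    return trusted

def majority_vote(votes):
    """Return the most common vote, or None when there are no votes or the top two tie."""
    if not votes:
        return None
    counts = Counter(votes).most_common(2)
    if len(counts) == 2 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

def phrases_to_retry(judgments, trusted):
    """judgments maps phrase -> {validator: 'ok' | 'bad'}; return phrases lacking a trusted 'ok' majority."""
    retry = []
    for phrase, by_validator in judgments.items():
        votes = [v for validator, v in by_validator.items() if validator in trusted]
        if majority_vote(votes) != "ok":
            retry.append(phrase)
    return retry
```

In an iterative setup, the phrases returned by `phrases_to_retry` would be sent back to the PT or PV stage until the list is empty or a configured iteration limit is hit.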

57 Once phrase translations are obtained for all phrases in a sentence, the Crowdsourcing Engine creates SC tasks, where the workers are asked to compose a single correct, coherent translation from the phrase translations obtained in the previous stages. [sent-108, score-1.051]

58 1 Worker User Interfaces This section describes the worker user interfaces for each stage in the pipeline. [sent-110, score-0.516]

59 These are managed by the Connector and have been designed to make the task convenient for the worker and prevent spam submissions. [sent-111, score-0.3]

60 PV UI is similar to k-scale voting tasks commonly found in crowdsourcing platforms. [sent-113, score-0.664]

61 The user interface discourages spamming by: (a) displaying source text as images; and (b) alerting workers if they don’t provide a translation or spend very little time on a task. [sent-115, score-0.526]

62 A Vocabulary Support, which shows translation suggestions for word sequences appearing in the source phrase, is also available. [sent-117, score-0.225]

63 Suggested translations can be copied to the input area with ease and speed. [sent-118, score-0.157]

64 • Sentence Translation Composition UI: The sentence translation composition UI (shown in Figure 4b) facilitates composition of sentence translations from phrase translations. [sent-119, score-0.487]

65 First, the worker can drag and rearrange the translated phrases into the right order, followed by reordering of individual words. [sent-120, score-0.328]

66 Finally, the synthesized language sentence can be post-edited to correct spelling, case marking, inflectional errors, etc. [sent-122, score-0.038]

67 The system also captures the reordering performed by the worker, an important byproduct, which can be used for training reordering models for SMT. [sent-123, score-0.082]

68 2 Requester UI The system provides a Requester Portal through which the requester can create, control and monitor jobs and retrieve results. [sent-125, score-0.334]

69 The portal allows the requester to customize the job during creation by configuring various parameters: (a) domain and language pair (b) entire sentence vs multistage translation (c) price for task at each stage (d) task design (number of tasks in a task group, etc. [sent-126, score-0.864]

70 Translation redundancy refers to the number of translations requested for a source phrase. [sent-128, score-0.206]

71 Validation redundancy refers to the number of validations collected for each phrase translation pair and the redundancy based acceptance criteria for phrase translations (majority, consensus, threshold, etc. [sent-129, score-0.712]

72 ) Figure 4: Worker User Interfaces: (a) Phrase Translation UI, (b) Sentence Composition UI. 6 Experiments and Observations Using TransDoop, we conducted a set of small-scale, preliminary translation experiments. [sent-130, score-0.225]

73 We obtained translations for English-Hindi and English-Marathi language pairs for the Judicial and Tourism domains. [sent-131, score-0.157]

74 For evaluation, we chose METEOR, a well-known translation evaluation metric (Banerjee and Lavie, 2005). [sent-133, score-0.225]

75 We compared the results obtained from the crowdsourcing system with an expert human translation and the output of Google Translate. [sent-134, score-0.879]

76 We also compared two expert translations using METEOR to establish a skyline for the translation accuracy. [sent-135, score-0.444]

77 The translations with Quality Control and multistage pipeline are better than Google translations and translations obtained from the crowd without any quality control, as evaluated by METEOR. [sent-137, score-0.919]

78 Moreover, the translation quality is comparable to that of expert human translation. [sent-139, score-0.374]

79 This can be seen in some examples of crowdsourced translations obtained through the system which are shown in Table 2. [sent-141, score-0.273]

80 Incorrect splitting of sentences can cause difficulties in translation for the worker. [sent-142, score-0.225]

81 For instance, discontinuous phrases will not be available to the worker as a single translation unit. [sent-143, score-0.526]

82 In the English interrogative sentence, the noun phrase splits the verb phrase, therefore the auxiliary and main verb could be in different translation units. [sent-144, score-0.299]

83 In addition, the phrase structures of the source and target languages may not map, making translation difficult. [sent-148, score-0.331]

84 It does not contain any tense information, therefore the tense of the English clause cannot be determined by the worker. [sent-150, score-0.052]

85 Lucknow vaalaa ladkaa could translate to any one of: the boy who lives/lived/is living in Lucknow We rely on the worker in sentence composition stage to correct mistakes due to these inadequacies and compose a good translation. [sent-153, score-0.559]

86 In addition, the worker in the PT stage could be provided with the sentence context for translation. [sent-154, score-0.414]

87 However, there is a tradeoff between the cognitive load of context processing versus uncertainty in translation. [sent-155, score-0.07]

88 More elaborately, to what extent can the cognitive load be reduced before uncertainty of translation sets in? [sent-156, score-0.295]

89 Similarly, how much of context can be shown before the cognitive load becomes pressing? [sent-157, score-0.07]

90 7 Conclusions In this system demonstration, we present TransDoop as a translation crowdsourcing system which has the potential to harness the strength of the crowd to collect high quality human translations on a large scale. [sent-158, score-1.3]

91 It simplifies the tedious translation tasks by decomposing them into several “easy-to-solve” subtasks while ensuring quality. [sent-159, score-0.331]

92 Our evaluation on small scale data shows that the multistage approach performs better than complete sentence translation. [sent-160, score-0.122]

93 We would like to extensively use this platform for large scale experiments on more language pairs and complex domains like Health, Parliamentary Proceedings, Technical and Scientific literature etc. [sent-161, score-0.09]

94 30 Translation with QC single stage multi stage 0. [sent-168, score-0.21]

95 Table 2: Examples of translation from Google and three staged pipeline for source sentence (2nd, 3rd and 1st rows of each table respectively). [sent-205, score-0.329]

96 the method for collection of parallel corpora on a large scale. [sent-207, score-0.079]

97 The impact of crowdsourcing postediting with the collaborative translation framework. [sent-210, score-0.789]

98 Can crowds build parallel corpora for machine translation systems? [sent-215, score-0.304]

99 Experiences in resource generation for machine translation through crowdsourcing. [sent-231, score-0.225]

100 Constructing parallel corpora for six indian languages via crowdsourcing. [sent-239, score-0.182]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('crowdsourcing', 0.564), ('worker', 0.271), ('translation', 0.225), ('crowd', 0.211), ('workers', 0.211), ('connector', 0.181), ('translations', 0.157), ('transdoop', 0.155), ('requester', 0.148), ('ui', 0.115), ('control', 0.107), ('stage', 0.105), ('job', 0.101), ('pv', 0.099), ('engine', 0.093), ('interfaces', 0.09), ('composition', 0.09), ('crowdsourced', 0.088), ('meteor', 0.087), ('quality', 0.087), ('ambati', 0.084), ('multistage', 0.084), ('validations', 0.084), ('judicial', 0.078), ('kunchukuttan', 0.078), ('tasks', 0.075), ('phrase', 0.074), ('indian', 0.071), ('platforms', 0.069), ('pipeline', 0.066), ('pt', 0.066), ('expert', 0.062), ('novice', 0.06), ('qc', 0.057), ('validation', 0.056), ('platform', 0.053), ('aikawa', 0.052), ('kittur', 0.052), ('lucknow', 0.052), ('shourya', 0.052), ('validators', 0.052), ('parallel', 0.051), ('jobs', 0.051), ('user', 0.05), ('redundancy', 0.049), ('tourism', 0.049), ('amt', 0.048), ('collecting', 0.048), ('pricing', 0.046), ('initiates', 0.046), ('indic', 0.046), ('portal', 0.046), ('vamshi', 0.046), ('load', 0.045), ('mechanical', 0.045), ('amazon', 0.044), ('lifecycle', 0.042), ('customize', 0.042), ('connectors', 0.04), ('discourages', 0.04), ('sentence', 0.038), ('workflows', 0.038), ('snow', 0.037), ('sc', 0.037), ('complex', 0.037), ('familiarity', 0.034), ('zaidan', 0.034), ('pushing', 0.033), ('architecture', 0.032), ('languages', 0.032), ('completed', 0.031), ('execution', 0.031), ('pushpak', 0.031), ('simplifies', 0.031), ('phrases', 0.03), ('google', 0.03), ('dredze', 0.029), ('capabilities', 0.029), ('convenient', 0.029), ('post', 0.029), ('anoop', 0.029), ('compose', 0.029), ('roy', 0.029), ('system', 0.028), ('corpora', 0.028), ('reordering', 0.027), ('banerjee', 0.027), ('map', 0.026), ('translate', 0.026), ('tense', 0.026), ('gathering', 0.026), ('smaller', 0.026), ('creates', 0.026), ('turk', 0.026), ('vogel', 0.025), ('scripts', 0.025), ('stages', 0.025), ('cognitive', 0.025), ('voting', 0.025), 
('decomposed', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000015 355 acl-2013-TransDoop: A Map-Reduce based Crowdsourced Translation for Complex Domain

2 0.22135451 265 acl-2013-Outsourcing FrameNet to the Crowd

Author: Marco Fossati ; Claudio Giuliano ; Sara Tonelli

Abstract: We present the first attempt to perform full FrameNet annotation with crowdsourcing techniques. We compare two approaches: the first one is the standard annotation methodology of lexical units and frame elements in two steps, while the second is a novel approach aimed at acquiring frames in a bottom-up fashion, starting from frame element annotation. We show that our methodology, relying on a single annotation step and on simplified role definitions, outperforms the standard one both in terms of accuracy and time.

3 0.13162152 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation

Author: Jiajun Zhang ; Chengqing Zong

Abstract: Currently, almost all of the statistical machine translation (SMT) models are trained with the parallel corpora in some specific domains. However, when it comes to a language pair or a different domain without any bilingual resources, the traditional SMT loses its power. Recently, some research works study the unsupervised SMT for inducing a simple word-based translation model from the monolingual corpora. It successfully bypasses the constraint of bitext for SMT and obtains a relatively promising result. In this paper, we take a step forward and propose a simple but effective method to induce a phrase-based model from the monolingual corpora given an automatically-induced translation lexicon or a manually-edited translation dictionary. We apply our method for the domain adaptation task and the extensive experiments show that our proposed method can substantially improve the translation quality. 1

4 0.11809402 374 acl-2013-Using Context Vectors in Improving a Machine Translation System with Bridge Language

Author: Samira Tofighi Zahabi ; Somayeh Bakhshaei ; Shahram Khadivi

Abstract: Mapping phrases between languages as translation of each other by using an intermediate language (pivot language) may generate translation pairs that are wrong. Since a word or a phrase has different meanings in different contexts, we should map source and target phrases in an intelligent way. We propose a pruning method based on the context vectors to remove those phrase pairs that connect to each other by a polysemous pivot phrase or by weak translations. We use context vectors to implicitly disambiguate the phrase senses and to recognize irrelevant phrase translation pairs. Using the proposed method a relative improvement of 2.8 percent in terms of BLEU score is achieved. 1

5 0.11608132 10 acl-2013-A Markov Model of Machine Translation using Non-parametric Bayesian Inference

Author: Yang Feng ; Trevor Cohn

Abstract: Most modern machine translation systems use phrase pairs as translation units, allowing for accurate modelling of phraseinternal translation and reordering. However phrase-based approaches are much less able to model sentence level effects between different phrase-pairs. We propose a new model to address this imbalance, based on a word-based Markov model of translation which generates target translations left-to-right. Our model encodes word and phrase level phenomena by conditioning translation decisions on previous decisions and uses a hierarchical Pitman-Yor Process prior to provide dynamic adaptive smoothing. This mechanism implicitly supports not only traditional phrase pairs, but also gapping phrases which are non-consecutive in the source. Our experiments on Chinese to English and Arabic to English translation show consistent improvements over competitive baselines, of up to +3.4 BLEU.

6 0.1042091 99 acl-2013-Crowd Prefers the Middle Path: A New IAA Metric for Crowdsourcing Reveals Turker Biases in Query Segmentation

7 0.10381298 385 acl-2013-WebAnno: A Flexible, Web-based and Visually Supported System for Distributed Annotations

8 0.10352691 100 acl-2013-Crowdsourcing Interaction Logs to Understand Text Reuse from the Web

9 0.10304727 305 acl-2013-SORT: An Interactive Source-Rewriting Tool for Improved Translation

10 0.10284972 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl

11 0.096081123 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation

12 0.093690433 361 acl-2013-Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers

13 0.092197321 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling

14 0.092096657 255 acl-2013-Name-aware Machine Translation

15 0.085369781 181 acl-2013-Hierarchical Phrase Table Combination for Machine Translation

16 0.08416748 195 acl-2013-Improving machine translation by training against an automatic semantic frame based evaluation metric

17 0.079445481 71 acl-2013-Bootstrapping Entity Translation on Weakly Comparable Corpora

18 0.07832177 240 acl-2013-Microblogs as Parallel Corpora

19 0.077787712 174 acl-2013-Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation

20 0.07762561 248 acl-2013-Modelling Annotator Bias with Multi-task Gaussian Processes: An Application to Machine Translation Quality Estimation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.172), (1, -0.068), (2, 0.158), (3, 0.033), (4, 0.002), (5, -0.01), (6, 0.004), (7, -0.01), (8, 0.104), (9, 0.051), (10, -0.077), (11, 0.08), (12, -0.037), (13, 0.069), (14, -0.033), (15, -0.062), (16, -0.053), (17, -0.049), (18, 0.035), (19, 0.021), (20, -0.041), (21, -0.058), (22, -0.118), (23, 0.01), (24, -0.109), (25, -0.094), (26, -0.024), (27, 0.063), (28, 0.026), (29, 0.078), (30, -0.022), (31, 0.038), (32, 0.078), (33, 0.008), (34, -0.033), (35, 0.073), (36, -0.027), (37, -0.049), (38, 0.0), (39, -0.01), (40, 0.161), (41, -0.068), (42, 0.025), (43, 0.033), (44, -0.094), (45, 0.034), (46, -0.078), (47, 0.006), (48, 0.001), (49, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91354108 355 acl-2013-TransDoop: A Map-Reduce based Crowdsourced Translation for Complex Domain

2 0.65732795 305 acl-2013-SORT: An Interactive Source-Rewriting Tool for Improved Translation

Author: Shachar Mirkin ; Sriram Venkatapathy ; Marc Dymetman ; Ioan Calapodescu

Abstract: The quality of automatic translation is affected by many factors. One is the divergence between the specific source and target languages. Another lies in the source text itself, as some texts are more complex than others. One way to handle such texts is to modify them prior to translation. Yet, an important factor that is often overlooked is the source translatability with respect to the specific translation system and the specific model that are being used. In this paper we present an interactive system where source modifications are induced by confidence estimates that are derived from the translation model in use. Modifications are automatically generated and proposed for the user’s ap- proval. Such a system can reduce postediting effort, replacing it by cost-effective pre-editing that can be done by monolinguals.

3 0.6542688 265 acl-2013-Outsourcing FrameNet to the Crowd

Author: Marco Fossati ; Claudio Giuliano ; Sara Tonelli

Abstract: We present the first attempt to perform full FrameNet annotation with crowdsourcing techniques. We compare two approaches: the first one is the standard annotation methodology of lexical units and frame elements in two steps, while the second is a novel approach aimed at acquiring frames in a bottom-up fashion, starting from frame element annotation. We show that our methodology, relying on a single annotation step and on simplified role definitions, outperforms the standard one both in terms of accuracy and time.

4 0.63802457 64 acl-2013-Automatically Predicting Sentence Translation Difficulty

Author: Abhijit Mishra ; Pushpak Bhattacharyya ; Michael Carl

Abstract: In this paper we introduce Translation Difficulty Index (TDI), a measure of difficulty in text translation. We first define and quantify translation difficulty in terms of TDI. We realize that any measure of TDI based on direct input by translators is fraught with subjectivity and adhocism. We, rather, rely on cognitive evidences from eye tracking. TDI is measured as the sum of fixation (gaze) and saccade (rapid eye movement) times of the eye. We then establish that TDI is correlated with three properties of the input sentence, viz. length (L), degree of polysemy (DP) and structural complexity (SC). We train a Support Vector Regression (SVR) system to predict TDIs for new sentences using these features as input. The prediction done by our framework is well correlated with the empirical gold standard data, which is a repository of < L, DP, SC > and TDI pairs for a set of sentences. The primary use of our work is a way of “binning” sentences (to be translated) in “easy”, “medium” and “hard” categories as per their predicted TDI. This can decide pricing of any translation task, especially useful in a scenario where parallel corpora for Machine Translation are built through translation crowdsourcing/outsourcing. This can also provide a way of monitoring progress of second language learners.

5 0.62743002 374 acl-2013-Using Context Vectors in Improving a Machine Translation System with Bridge Language

Author: Samira Tofighi Zahabi ; Somayeh Bakhshaei ; Shahram Khadivi

Abstract: Mapping phrases between languages as translation of each other by using an intermediate language (pivot language) may generate translation pairs that are wrong. Since a word or a phrase has different meanings in different contexts, we should map source and target phrases in an intelligent way. We propose a pruning method based on the context vectors to remove those phrase pairs that connect to each other by a polysemous pivot phrase or by weak translations. We use context vectors to implicitly disambiguate the phrase senses and to recognize irrelevant phrase translation pairs. Using the proposed method a relative improvement of 2.8 percent in terms of BLEU score is achieved.

6 0.61680365 201 acl-2013-Integrating Translation Memory into Phrase-Based Machine Translation during Decoding

7 0.59799433 100 acl-2013-Crowdsourcing Interaction Logs to Understand Text Reuse from the Web

8 0.5966661 195 acl-2013-Improving machine translation by training against an automatic semantic frame based evaluation metric

9 0.59038574 10 acl-2013-A Markov Model of Machine Translation using Non-parametric Bayesian Inference

10 0.55906302 92 acl-2013-Context-Dependent Multilingual Lexical Lookup for Under-Resourced Languages

11 0.55564064 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation

12 0.5490762 255 acl-2013-Name-aware Machine Translation

13 0.53549874 214 acl-2013-Language Independent Connectivity Strength Features for Phrase Pivot Statistical Machine Translation

14 0.53421563 250 acl-2013-Models of Translation Competitions

15 0.53301519 110 acl-2013-Deepfix: Statistical Post-editing of Statistical Machine Translation Using Deep Syntactic Analysis

16 0.52538836 289 acl-2013-QuEst - A translation quality estimation framework

17 0.52307701 68 acl-2013-Bilingual Data Cleaning for SMT using Graph-based Random Walk

18 0.52245325 135 acl-2013-English-to-Russian MT evaluation campaign

19 0.52163517 385 acl-2013-WebAnno: A Flexible, Web-based and Visually Supported System for Distributed Annotations

20 0.51430678 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.066), (6, 0.041), (11, 0.062), (15, 0.016), (24, 0.044), (26, 0.052), (28, 0.011), (35, 0.055), (42, 0.079), (48, 0.03), (64, 0.234), (70, 0.022), (71, 0.026), (80, 0.01), (88, 0.035), (90, 0.032), (95, 0.082)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.90270293 268 acl-2013-PATHS: A System for Accessing Cultural Heritage Collections

Author: Eneko Agirre ; Nikolaos Aletras ; Paul Clough ; Samuel Fernando ; Paula Goodale ; Mark Hall ; Aitor Soroa ; Mark Stevenson

Abstract: This paper describes a system for navigating large collections of information about cultural heritage which is applied to Europeana, the European Library. Europeana contains over 20 million artefacts with meta-data in a wide range of European languages. The system currently provides access to Europeana content with meta-data in English and Spanish. The paper describes how Natural Language Processing is used to enrich and organise this meta-data to assist navigation through Europeana and shows how this information is used within the system.

2 0.8579191 15 acl-2013-A Novel Graph-based Compact Representation of Word Alignment

Author: Qun Liu ; Zhaopeng Tu ; Shouxun Lin

Abstract: In this paper, we propose a novel compact representation called weighted bipartite hypergraph to exploit the fertility model, which plays a critical role in word alignment. However, estimating the probabilities of rules extracted from hypergraphs is an NP-complete problem, which is computationally infeasible. Therefore, we propose a divide-and-conquer strategy by decomposing a hypergraph into a set of independent subhypergraphs. The experiments show that our approach outperforms both 1-best and n-best alignments.

same-paper 3 0.84328634 355 acl-2013-TransDoop: A Map-Reduce based Crowdsourced Translation for Complex Domain

Author: Anoop Kunchukuttan ; Rajen Chatterjee ; Shourya Roy ; Abhijit Mishra ; Pushpak Bhattacharyya

Abstract: Large amount of parallel corpora is required for building Statistical Machine Translation (SMT) systems. We describe the TransDoop system for gathering translations to create parallel corpora from online crowd workforce who have familiarity with multiple languages but are not expert translators. Our system uses a Map-Reduce-like approach to translation crowdsourcing where sentence translation is decomposed into the following smaller tasks: (a) translation of constituent phrases of the sentence; (b) validation of quality of the phrase translations; and (c) composition of complete sentence translations from phrase translations. TransDoop incorporates quality control mechanisms and easy-to-use worker user interfaces designed to address issues with translation crowdsourcing. We have evaluated the crowd’s output using the METEOR metric. For a complex domain like judicial proceedings, the higher scores obtained by the map-reduce based approach compared to complete sentence translation establishes the efficacy of our work.

4 0.80844575 152 acl-2013-Extracting Definitions and Hypernym Relations relying on Syntactic Dependencies and Support Vector Machines

Author: Guido Boella ; Luigi Di Caro

Abstract: In this paper we present a technique to reveal definitions and hypernym relations from text. Instead of using pattern matching methods that rely on lexico-syntactic patterns, we propose a technique which only uses syntactic dependencies between terms extracted with a syntactic parser. The assumption is that syntactic information is more robust than patterns when coping with length and complexity of the sentences. Afterwards, we transform such syntactic contexts into abstract representations, that are then fed into a Support Vector Machine classifier. The results on an annotated dataset of definitional sentences demonstrate the validity of our approach overtaking current state-of-the-art techniques.

5 0.80529702 228 acl-2013-Leveraging Domain-Independent Information in Semantic Parsing

Author: Dan Goldwasser ; Dan Roth

Abstract: Semantic parsing is a domain-dependent process by nature, as its output is defined over a set of domain symbols. Motivated by the observation that interpretation can be decomposed into domain-dependent and independent components, we suggest a novel interpretation model, which augments a domain dependent model with abstract information that can be shared by multiple domains. Our experiments show that this type of information is useful and can reduce the annotation effort significantly when moving between domains.

6 0.75250316 22 acl-2013-A Structured Distributional Semantic Model for Event Co-reference

7 0.67484808 6 acl-2013-A Java Framework for Multilingual Definition and Hypernym Extraction

8 0.66334844 265 acl-2013-Outsourcing FrameNet to the Crowd

9 0.62424016 17 acl-2013-A Random Walk Approach to Selectional Preferences Based on Preference Ranking and Propagation

10 0.62195605 83 acl-2013-Collective Annotation of Linguistic Resources: Basic Principles and a Formal Model

11 0.616741 158 acl-2013-Feature-Based Selection of Dependency Paths in Ad Hoc Information Retrieval

12 0.61346215 159 acl-2013-Filling Knowledge Base Gaps for Distant Supervision of Relation Extraction

13 0.60301071 99 acl-2013-Crowd Prefers the Middle Path: A New IAA Metric for Crowdsourcing Reveals Turker Biases in Query Segmentation

14 0.60133427 176 acl-2013-Grounded Unsupervised Semantic Parsing

15 0.59583521 204 acl-2013-Iterative Transformation of Annotation Guidelines for Constituency Parsing

16 0.59454107 36 acl-2013-Adapting Discriminative Reranking to Grounded Language Learning

17 0.59411907 18 acl-2013-A Sentence Compression Based Framework to Query-Focused Multi-Document Summarization

18 0.59402192 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation

19 0.59361786 70 acl-2013-Bilingually-Guided Monolingual Dependency Grammar Induction

20 0.59340513 127 acl-2013-Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation