acl acl2013 acl2013-120 knowledge-graph by maker-knowledge-mining

120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl


Source: pdf

Author: Jason R. Smith ; Herve Saint-Amand ; Magdalena Plamada ; Philipp Koehn ; Chris Callison-Burch ; Adam Lopez

Abstract: Parallel text is the fuel that drives modern machine translation systems. The Web is a comprehensive source of preexisting parallel text, but crawling the entire web is impossible for all but the largest companies. We bring web-scale parallel text to the masses by mining the Common Crawl, a public Web crawl hosted on Amazon’s Elastic Cloud. Starting from nothing more than a set of common two-letter language codes, our open-source extension of the STRAND algorithm mined 32 terabytes of the crawl in just under a day, at a cost of about $500. Our large-scale experiment uncovers large amounts of parallel text in dozens of language pairs across a variety of domains and genres, some previously unavailable in curated datasets. Even with minimal cleaning and filtering, the resulting data boosts translation performance across the board for five different language pairs in the news domain, and on open domain test sets we see improvements of up to 5 BLEU. We make our code and data available for other researchers seeking to mine this rich new data resource.1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The Web is a comprehensive source of preexisting parallel text, but crawling the entire web is impossible for all but the largest companies. [sent-15, score-0.426]

2 We bring web-scale parallel text to the masses by mining the Common Crawl, a public Web crawl hosted on Amazon’s Elastic Cloud. [sent-16, score-0.842]

3 Starting from nothing more than a set of common two-letter language codes, our open-source extension of the STRAND algorithm mined 32 terabytes of the crawl in just under a day, at a cost of about $500. [sent-17, score-0.568]

4 Our large-scale experiment uncovers large amounts of parallel text in dozens of language pairs across a variety of domains and genres, some previously unavailable in curated datasets. [sent-18, score-0.579]

5 Even with minimal cleaning and filtering, the resulting data boosts translation performance across the board for five different language pairs in the news domain, and on open domain test sets we see improvements of up to 5 BLEU. [sent-19, score-0.29]

6 A key bottleneck in porting statistical machine translation (SMT) technology to new languages and domains is the lack of readily available parallel corpora beyond curated datasets. [sent-21, score-0.259]

7 For a handful of language pairs, large amounts of parallel data are readily available. (∗ This research was conducted while Chris Callison-Burch was at Johns Hopkins University.) [sent-22, score-0.328]

8 However, for most language pairs and domains there is little to no curated parallel data available. [sent-26, score-0.544]

9 Hence discovery of parallel data is an important first step for translation between most of the world’s languages. [sent-27, score-0.431]

10 Many websites are available in multiple languages, and unlike other potential sources— such as multilingual news feeds (Munteanu and Marcu, 2005) or Wikipedia (Smith et al. [sent-29, score-0.186]

11 , 2010)— it is common to find document pairs that are direct translations of one another. [sent-30, score-0.264]

12 However, anything more sophisticated generally requires direct access to web-crawled documents themselves along with the computing power to process them. [sent-34, score-0.107]

13 As a consequence, web-mined parallel text has become the exclusive purview of large companies with the computational resources to crawl, store, and process the entire Web. [sent-36, score-0.398]

14 To put web-mined parallel text back in the hands of individual researchers, we mine parallel text from the Common Crawl, a regularly updated 81-terabyte snapshot of the public internet hosted on Amazon’s Elastic Cloud. [sent-37, score-0.924]

15 2 Using the Common Crawl completely removes the bottleneck of web crawling, and makes it possible to run algorithms on a substantial portion of the web at very low cost. [sent-40, score-0.16]

16 Starting from nothing other than a set of language codes, our extension of the STRAND algorithm (Resnik and Smith, 2003) identifies potentially parallel documents using cues from URLs and document content (§2). [sent-41, score-0.489]

17 We conduct an extensive analysis of the web-mined data, demonstrating coverage in a wide variety of languages and domains (§3). [sent-43, score-0.156]

18 We show that the mined data improves translation performance on strong baseline news translation systems in five different language pairs (§4). [sent-45, score-0.343]

19 In our pipeline, we perform the first step of identifying candidate document pairs using Amazon EMR, download the resulting document pairs, and perform the remaining steps on our local cluster. [sent-52, score-0.32]

20 Candidate pair selection: Retrieve candidate document pairs from the CommonCrawl corpus. [sent-55, score-0.221]

21 (b) Align the linearized HTML of candidate document pairs. [sent-61, score-0.149]

22 Sentence Alignment: For each aligned pair of text chunks, perform the sentence alignment method of Gale and Church (1993). [sent-66, score-0.155]

23 Candidate Pair Selection We adopt a strategy similar to that of Resnik and Smith (2003) for finding candidate parallel documents, adapted to the parallel architecture of Map-Reduce. [sent-69, score-0.706]

24 The mapper operates on each website entry in the CommonCrawl data. [sent-70, score-0.104]

25 The reducer then receives all websites mapped to the same “language independent” URL. [sent-85, score-0.193]

26 If two or more websites are associated with the same key, the reducer will output all associated values, as long as they are not in the same language, as determined by the language identifier in the URL. [sent-86, score-0.193]

27 This URL-based matching is a simple and inexpensive solution to the problem of finding candidate document pairs. [sent-87, score-0.194]

28 The mapper will discard most entries, and neither the mapper nor the reducer does anything with the HTML of the documents aside from reading and writing them. [sent-88, score-0.342]
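
To make the URL-matching step concrete, here is a minimal sketch of the mapper/reducer logic described above. This is an illustrative reconstruction, not the authors' released code: the language-code set, the wildcard key format, and the record layout are assumptions.

```python
import re

# Assumed: a small set of two-letter language codes; the paper starts
# from "a set of common two-letter language codes".
LANG_CODES = {"de", "en", "es", "fr", "ja", "zh"}
CODE_RE = re.compile(r"(?<=[/._-])(" + "|".join(sorted(LANG_CODES)) + r")(?=[/._-]|$)")

def mapper(url, html):
    # Replace each embedded language code in the URL with a wildcard to
    # form a "language independent" key, and emit the page under that key.
    for m in CODE_RE.finditer(url):
        lang = m.group(1)
        key = url[:m.start()] + "*" + url[m.end():]
        yield key, (lang, url, html)

def reducer(key, values):
    # Output all pages sharing the same key, provided they span at least
    # two languages (same-language pages are not candidate pairs).
    values = list(values)
    if len({lang for lang, _, _ in values}) >= 2:
        for value in values:
            yield key, value
```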

29 This alignment is used to determine which document pairs are actually parallel, and if they are, to align pairs of text blocks within the documents. [sent-91, score-0.408]

30 The objective of the alignment is to maximize the number of matching items. [sent-96, score-0.128]

31 They annotated a set of document pairs as parallel or non-parallel, and trained a classifier on this data. [sent-98, score-0.499]

32 We also annotated 101 Spanish-English document pairs in this way and trained a maximum entropy classifier. [sent-99, score-0.171]

33 The strong performance of the naive baseline was likely due to the unbalanced nature of the annotated data— 80% of the document pairs that we annotated were parallel. [sent-102, score-0.216]
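A maximum entropy classifier of the kind described above can be sketched with scikit-learn's LogisticRegression (a maxent model). The feature extraction is a placeholder, since the paper's structural alignment features are not reproduced here; class weighting is one standard response to the 80/20 imbalance noted above.

```python
from sklearn.linear_model import LogisticRegression

def train_pair_classifier(feature_vectors, labels):
    # feature_vectors: one numeric vector per annotated document pair
    #                  (placeholder for STRAND-style alignment features)
    # labels: 1 = parallel, 0 = non-parallel; the annotated data was
    #         unbalanced (~80% parallel), so weighting classes guards
    #         against the trivially strong naive baseline.
    clf = LogisticRegression(max_iter=1000, class_weight="balanced")
    clf.fit(feature_vectors, labels)
    return clf
```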

34 Segmentation The text chunks from the previous step may contain several sentences, so before the sentence alignment step we must perform sentence segmentation. [sent-103, score-0.238]

35 We use the Punkt sentence splitter from NLTK (Loper and Bird, 2002) to perform both sentence and word segmentation on each text chunk. [sent-104, score-0.109]
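As a concrete illustration, segmentation with NLTK's Punkt models can be done as follows; treat this as a sketch, since the resource name and tokenizer calls are standard NLTK rather than the paper's own code.

```python
import nltk

nltk.download("punkt", quiet=True)  # Punkt models used by sent_tokenize

def segment(text_chunk, language="english"):
    # Split a text chunk into sentences, then each sentence into words.
    sentences = nltk.sent_tokenize(text_chunk, language=language)
    return [nltk.word_tokenize(s, language=language) for s in sentences]
```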

36 Sentence Alignment For each aligned text chunk pair, we perform sentence alignment using the algorithm of Gale and Church (1993). [sent-105, score-0.189]
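For illustration, a compact dynamic-programming sketch of length-based sentence alignment in the spirit of Gale and Church (1993) follows. It substitutes a squared-difference length cost and hand-picked bead penalties for the original Gaussian model, so it is a simplification, not the paper's implementation.

```python
def align(src_lens, tgt_lens):
    # src_lens / tgt_lens: character lengths of source/target sentences.
    INF = float("inf")
    n, m = len(src_lens), len(tgt_lens)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0

    def cost(s, t):  # penalize mismatched total character lengths
        return (s - t) ** 2 / (s + t + 1)

    # (src sentences, tgt sentences, fixed bead penalty); penalties are
    # illustrative constants, not the Gale-Church prior probabilities.
    MOVES = [(1, 1, 0.0), (1, 0, 50.0), (0, 1, 50.0), (2, 1, 20.0), (1, 2, 20.0)]
    for i in range(n + 1):
        for j in range(m + 1):
            if D[i][j] == INF:
                continue
            for di, dj, penalty in MOVES:
                if i + di <= n and j + dj <= m:
                    c = D[i][j] + penalty + cost(sum(src_lens[i:i + di]),
                                                 sum(tgt_lens[j:j + dj]))
                    if c < D[i + di][j + dj]:
                        D[i + di][j + dj] = c
                        back[i + di][j + dj] = (i, j)

    beads, i, j = [], n, m  # trace back the chosen alignment beads
    while back[i][j] is not None:
        pi, pj = back[i][j]
        beads.append(((pi, i), (pj, j)))
        i, j = pi, pj
    return list(reversed(beads))
```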

37 Sentence Filtering Since we do not perform any boilerplate removal in earlier steps, there are many sentence pairs produced by the pipeline which contain menu items or other bits of text which are not useful to an SMT system. [sent-106, score-0.31]

38 We avoid performing any complex boilerplate removal and only remove segment pairs where either the source and target text are identical, or where the source or target segments appear more than once in the extracted corpus. [sent-107, score-0.235]
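The duplicate-based filter described above reduces to a few lines; a minimal sketch:

```python
from collections import Counter

def filter_segment_pairs(pairs):
    # Drop pairs where source == target, or where the source or target
    # segment occurs more than once in the extracted corpus (a cheap
    # proxy for menu items and other boilerplate).
    src_counts = Counter(s for s, _ in pairs)
    tgt_counts = Counter(t for _, t in pairs)
    return [(s, t) for s, t in pairs
            if s != t and src_counts[s] == 1 and tgt_counts[t] == 1]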

39 This is different from running a single Map-Reduce job over the entire dataset, since websites in different subsets of the data cannot be matched. [sent-112, score-0.121]

40 However, since the data is stored as it is crawled, it is likely that matching websites will be found in the same split of the data. [sent-113, score-0.211]

41 Table 1 shows the amount of raw parallel data obtained for a large selection of language pairs. [sent-114, score-0.328]

42 As far as we know, ours is the first system built to mine parallel text from the Common Crawl. [sent-115, score-0.422]

43 Since our mining heuristics are very simple, these results can be construed as a lower bound on what is actually possible. [sent-118, score-0.109]

44 1 Recall Estimates Our first question is about recall: of all the possible parallel text that is actually available on the Web, how much does our algorithm actually find in the Common Crawl? [sent-120, score-0.457]

45 Although this question is difficult to answer precisely, we can estimate an answer by comparing our mined URLs against a large collection of previously mined URLs that were found using targeted techniques: those in the French-English Gigaword corpus (Callison-Burch et al. [sent-121, score-0.404]
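The URL-based recall estimate amounts to a set intersection; a sketch, with illustrative function and variable names:

```python
def url_recall(mined_urls, reference_urls):
    # Fraction of reference URLs (e.g. from the French-English Gigaword
    # crawl) that also appear among our mined URLs.
    mined = set(mined_urls)
    found = sum(1 for u in reference_urls if u in mined)
    return found / len(reference_urls)
```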

46 Table 1: The amount of parallel data mined from CommonCrawl for each language paired with English (the per-language Source Tokens / Target Tokens counts were garbled in extraction and are omitted here). [sent-147, score-0.53]

47 4 If we had included “f” and “e” as identifiers for French and English respectively, coverage of the URL pairs would increase to 74%. [sent-151, score-0.149]

48 2 Precision Estimates Since our algorithms rely on cues that are mostly external to the contents of the extracted data and have no knowledge of actual languages, we wanted to evaluate the precision of our algorithm: how much of the mined data actually consists of parallel sentences? [sent-154, score-0.577]

49 To measure this, we conducted a manual analysis of 200 randomly selected sentence pairs for each of three language pairs. [sent-155, score-0.109]

50 Furthermore, 22% of the true positives are potentially machine translations (judging by the quality), whereas in 13% of the cases one of the sentences contains additional content not expressed in the other. (Footnote 4: The difference is likely due to the coverage of the CommonCrawl corpus.) [sent-158, score-0.183]

51 5% of them have either the source or target sentence in the wrong language, and the remaining ones represent failures in the alignment process. [sent-161, score-0.12]

52 Table 2: Manual evaluation of precision (by sentence pair) on the extracted parallel data for Spanish, French, and German (paired with English); the per-language precision figures were garbled in extraction. [sent-165, score-0.365]

53 In addition to the manual evaluation of precision, we applied language identification to our extracted parallel data for several additional languages. [sent-166, score-0.37]

54 We applied the “langid.py” tool (Lui and Baldwin, 2012) at the segment level, and report the percentage of sentence pairs where both sentences were recognized as the correct language. [sent-168, score-0.109]
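A sketch of that segment-level check using the langid package; the function name and pair format here are illustrative, not the paper's code.

```python
import langid  # "langid.py" of Lui and Baldwin (2012)

def pair_language_precision(pairs, src_lang, tgt_lang):
    # Count sentence pairs where BOTH sides are recognized as the
    # expected language; langid.classify returns (language, score).
    ok = sum(1 for s, t in pairs
             if langid.classify(s)[0] == src_lang
             and langid.classify(t)[0] == tgt_lang)
    return ok / len(pairs)
```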

55 Comparing against our manual evaluation from Table 2, it appears that many sentence pairs are being incorrectly judged as nonparallel. [sent-170, score-0.109]

56 A major difficulty in applying SMT even on languages for which we have significant quantities of parallel text is that most of that parallel text is in the news and government domains. [sent-175, score-0.867]

57 In LDA a topic is a unigram distribution over words, and each document is modeled as a distribution over topics. [sent-183, score-0.142]

58 To create a set of documents from the extracted CommonCrawl data, we took the English side of the extracted parallel segments for each URL in the Spanish-English portion of the data. [sent-184, score-0.39]

59 Some of the topics that LDA finds correspond closely with specific domains, such as topics 1 (blingee…). [sent-187, score-0.138]

60 Several of the topics correspond to the travel domain. [sent-190, score-0.104]

61 We created a set of documents from both CommonCrawl and Europarl, and again used MALLET to generate 100 topics for this data. [sent-195, score-0.131]

62 6 We then labeled each document by its most likely topic (as determined by that topic’s mixture weights), and counted the number of documents from Europarl and CommonCrawl for which each topic was most prominent. [sent-196, score-0.292]
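The labeling-and-counting step is straightforward given per-document topic mixtures (for example, from MALLET's doc-topics output); a sketch with assumed input formats:

```python
from collections import Counter

def count_dominant_topics(doc_topics, corpus_of):
    # doc_topics: {doc_id: [weight of topic 0, weight of topic 1, ...]}
    # corpus_of:  {doc_id: "europarl" or "commoncrawl"}
    counts = Counter()
    for doc_id, mixture in doc_topics.items():
        top_topic = max(range(len(mixture)), key=mixture.__getitem__)
        counts[(corpus_of[doc_id], top_topic)] += 1
    return counts  # e.g. counts[("europarl", 17)] -> number of documents
```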

63 In addition to exploring topics in the datasets, we also performed additional intrinsic evaluation at the domain level, choosing top domains for three language pairs. [sent-199, score-0.198]

64 We specifically classified sentence pairs as useful or boilerplate (Table 7). [sent-200, score-0.2]

65 Among our observations, we find that commercial websites tend to contain less boilerplate material than encyclopedic websites, and that the ratios tend to be similar across languages in the same domain. [sent-201, score-0.253]

66 In these experiments, a baseline system is trained on an existing parallel corpus, and the experimental system is trained on the baseline corpus plus the mined parallel data. [sent-209, score-0.858]

67 In all experiments we include the target side of the mined parallel data in the language model, in order to distinguish whether results are due to influences from parallel or monolingual data. [sent-210, score-0.858]

68 6Documents were created from Europarl by taking “SPEAKER” tags as document boundaries, giving us 208,431 documents total. [sent-212, score-0.199]

69 Table 5: A list of 20 topics generated using the MALLET toolkit (McCallum, 2002) and their most likely tokens, ranked by probability. [sent-225, score-0.114]

70 Our baseline systems were trained for the WMT shared task (Callison-Burch et al., 2012) using all available parallel and monolingual data for that task, aside from the French-English Gigaword. [sent-228, score-0.328]

71 These results show that even on top of a different, larger parallel corpus mined from the web, adding CommonCrawl data still yields an improvement. [sent-235, score-0.53]

72 A substantial appeal of web-mined parallel data is that it might be suitable for translation of domains other than news, and our topic modeling analysis (§3. [sent-237, score-0.594]

73 Table 6: A sample of topics along with the number of Europarl and CommonCrawl documents where they are the most likely topic in the mixture. [sent-241, score-0.219]

74 Table 8: BLEU scores for several language pairs before and after adding the mined parallel data to systems trained on WMT data. [sent-248, score-0.602]

75 Table 9: BLEU scores for French-English and English-French before and after adding the mined parallel data to systems trained on WMT data including the French-English Gigaword (Callison-Burch et al. [sent-251, score-0.53]

76 For these experiments, we also include training data mined from Wikipedia using a simplified version of the sentence aligner described by Smith et al. [sent-253, score-0.239]

77 Wikipedia is copied across the public internet, and we did not have a simple way to filter such data from our mined datasets. [sent-268, score-0.245]

78 Tatoeba.org was discovered by our URL matching heuristics, but we excluded any sentence pairs that were found in the CommonCrawl data from this test set. [sent-272, score-0.154]

79 The second dataset is a set of crowdsourced translations of Spanish speech transcriptions from the Spanish Fisher corpus. [sent-273, score-0.103]

80 The advantage of this data for our open domain translation test is twofold. [sent-276, score-0.153]

81 Table 11: n-gram coverage percentages (up to 4-grams) of the source side of our test sets (WMT, Tatoeba, Fisher) given our different parallel training corpora, computed at the type level; the table values were garbled in extraction. [sent-279, score-0.364]

82 Table 12: BLEU scores for Spanish-English before and after adding the mined parallel data to a baseline Europarl system. [sent-283, score-0.53]

83 We see improvements of almost 5 BLEU on Tatoeba and Fisher. Web-mined parallel texts have been an exclusive resource of large companies for several years. [sent-291, score-0.363]

84 However, when web-mined parallel text is available to everyone at little or no cost, there will be much greater potential for groundbreaking research to come from all corners. [sent-292, score-0.363]

85 As we have shown, it is possible to obtain parallel text for many language pairs in a variety of domains very cheaply and quickly, and in sufficient quantity and quality to improve statistical machine translation systems. [sent-294, score-0.617]

86 There are many possible means through which the system could be improved, including more sophisticated techniques for identifying matching URLs, better alignment, better language identification, better filtering of data, and better exploitation of resulting cross-domain datasets. [sent-298, score-0.148]

87 (2011) gathered almost 1 trillion tokens of French-English parallel data this way. [sent-302, score-0.328]

88 Another strategy for mining parallel webpage pairs is to scan the HTML for links to the same page in another language (Nie et al. [sent-303, score-0.462]

89 Uszkoreit et al. (2010), for example, translated all non-English webpages into English using an existing translation system and used near-duplicate detection methods to find candidate parallel document pairs. [sent-307, score-0.58]

90 Ture and Lin (2012) had a similar approach for finding parallel Wikipedia documents by using near-duplicate detection, though they did not need to apply a full translation system to all non-English documents. [sent-308, score-0.493]

91 Instead, they represented documents in bag-of-words vector space, and projected non-English document vectors into the English vector space using the translation probabilities of a word alignment model. [sent-309, score-0.347]
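The projection idea can be written as one line of linear algebra; a sketch, where the probability-matrix layout is an assumption:

```python
import numpy as np

def project_to_english(doc_vec_f, p_e_given_f):
    # doc_vec_f:   length-|V_f| bag-of-words vector for a foreign document
    # p_e_given_f: |V_f| x |V_e| matrix of lexical translation
    #              probabilities p(e|f) from a word alignment model
    return doc_vec_f @ p_e_given_f  # length-|V_e| vector in English space
```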

92 However, with this system in place, we could obtain enough parallel data to bootstrap these more sophisticated approaches. [sent-311, score-0.408]

93 Smith et al. (2010) mine parallel sentences from comparable documents in Wikipedia, demonstrating substantial gains on open domain translation. [sent-314, score-0.499]

94 However, their approach required seed parallel data to learn models used in a classifier. [sent-315, score-0.328]

95 We imagine a two-step process, first obtaining parallel data from the web, followed by comparable data from sources such as Wikipedia using models bootstrapped from the web-mined data. [sent-316, score-0.328]

96 Such a process could be used to build translation systems for new language pairs in a very short period of time, hence fulfilling one of the original promises of SMT. [sent-317, score-0.175]

97 Europarl: A parallel corpus for statistical machine translation. [sent-377, score-0.328]

98 Cross-language information retrieval based on parallel texts and automatic mining of parallel texts from the web. [sent-415, score-0.718]

99 News from OPUS - A collection of multilingual parallel corpora with tools and interfaces. [sent-440, score-0.328]

100 Mining large corpora for parallel sentences to improve translation modeling. [sent-450, score-0.493]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('commoncrawl', 0.384), ('parallel', 0.328), ('crawl', 0.278), ('mined', 0.202), ('wmt', 0.167), ('url', 0.167), ('europarl', 0.139), ('websites', 0.121), ('tatoeba', 0.118), ('mapper', 0.104), ('translation', 0.103), ('amazon', 0.1), ('document', 0.099), ('elastic', 0.096), ('hosted', 0.096), ('boilerplate', 0.091), ('smith', 0.086), ('strand', 0.086), ('fisher', 0.086), ('alignment', 0.083), ('johns', 0.082), ('domains', 0.079), ('urls', 0.079), ('emr', 0.078), ('resnik', 0.075), ('nie', 0.075), ('reducer', 0.072), ('pairs', 0.072), ('html', 0.07), ('topics', 0.069), ('chris', 0.069), ('munteanu', 0.068), ('hopkins', 0.066), ('curated', 0.065), ('news', 0.065), ('mallet', 0.064), ('smt', 0.063), ('bleu', 0.063), ('mining', 0.062), ('web', 0.062), ('documents', 0.062), ('codes', 0.061), ('wikipedia', 0.061), ('herve', 0.059), ('mine', 0.059), ('filtering', 0.058), ('koehn', 0.058), ('translations', 0.057), ('uszkoreit', 0.053), ('omar', 0.053), ('terabytes', 0.052), ('domain', 0.05), ('candidate', 0.05), ('monz', 0.049), ('gale', 0.048), ('germanenglish', 0.048), ('actually', 0.047), ('chunks', 0.046), ('christof', 0.046), ('matching', 0.045), ('sophisticated', 0.045), ('positives', 0.045), ('likely', 0.045), ('loper', 0.043), ('topic', 0.043), ('philipp', 0.043), ('public', 0.043), ('lda', 0.043), ('identification', 0.042), ('spanish', 0.042), ('appeal', 0.041), ('identifiers', 0.041), ('languages', 0.041), ('cloud', 0.04), ('mapreduce', 0.04), ('genres', 0.04), ('french', 0.039), ('venugopal', 0.038), ('ite', 0.038), ('lui', 0.038), ('dean', 0.038), ('zaidan', 0.038), ('pipeline', 0.038), ('tags', 0.038), ('sentence', 0.037), ('removal', 0.037), ('nltk', 0.036), ('excellence', 0.036), ('crawling', 0.036), ('bottleneck', 0.036), ('common', 0.036), ('coverage', 0.036), ('text', 0.035), ('companies', 0.035), ('travel', 0.035), ('jakob', 0.035), ('www', 0.035), ('bootstrap', 0.035), ('government', 0.035), ('chunk', 0.034)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999923 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl

Author: Jason R. Smith ; Herve Saint-Amand ; Magdalena Plamada ; Philipp Koehn ; Chris Callison-Burch ; Adam Lopez

Abstract: Parallel text is the fuel that drives modern machine translation systems. The Web is a comprehensive source of preexisting parallel text, but crawling the entire web is impossible for all but the largest companies. We bring web-scale parallel text to the masses by mining the Common Crawl, a public Web crawl hosted on Amazon’s Elastic Cloud. Starting from nothing more than a set of common two-letter language codes, our open-source extension of the STRAND algorithm mined 32 terabytes of the crawl in just under a day, at a cost of about $500. Our large-scale experiment uncovers large amounts of parallel text in dozens of language pairs across a variety of domains and genres, some previously unavailable in curated datasets. Even with minimal cleaning and filtering, the resulting data boosts translation performance across the board for five different language pairs in the news domain, and on open domain test sets we see improvements of up to 5 BLEU. We make our code and data available for other researchers seeking to mine this rich new data resource.1

2 0.26017588 240 acl-2013-Microblogs as Parallel Corpora

Author: Wang Ling ; Guang Xiang ; Chris Dyer ; Alan Black ; Isabel Trancoso

Abstract: In the ever-expanding sea of microblog data, there is a surprising amount of naturally occurring parallel text: some users post multilingual messages targeting international audiences while others “retweet” translations. We present an efficient method for detecting these messages and extracting parallel segments from them. We have been able to extract over 1M Chinese-English parallel segments from Sina Weibo (the Chinese counterpart of Twitter) using only their public APIs. As a supplement to existing parallel training data, our automatically extracted parallel data yields substantial translation quality improvements in translating microblog text and modest improvements in translating edited news commentary. The resources described in this paper are available at http://www.cs.cmu.edu/∼lingwang/utopia.

3 0.16684222 338 acl-2013-Task Alternation in Parallel Sentence Retrieval for Twitter Translation

Author: Felix Hieber ; Laura Jehl ; Stefan Riezler

Abstract: We present an approach to mine comparable data for parallel sentences using translation-based cross-lingual information retrieval (CLIR). By iteratively alternating between the tasks of retrieval and translation, an initial general-domain model is allowed to adapt to in-domain data. Adaptation is done by training the translation system on a few thousand sentences retrieved in the step before. Our setup is time- and memory-efficient and of similar quality as CLIR-based adaptation on millions of parallel sentences.

4 0.14535636 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation

Author: Jiajun Zhang ; Chengqing Zong

Abstract: Currently, almost all of the statistical machine translation (SMT) models are trained with the parallel corpora in some specific domains. However, when it comes to a language pair or a different domain without any bilingual resources, the traditional SMT loses its power. Recently, some research works study the unsupervised SMT for inducing a simple word-based translation model from the monolingual corpora. It successfully bypasses the constraint of bitext for SMT and obtains a relatively promising result. In this paper, we take a step forward and propose a simple but effective method to induce a phrase-based model from the monolingual corpora given an automatically-induced translation lexicon or a manually-edited translation dictionary. We apply our method for the domain adaptation task and the extensive experiments show that our proposed method can substantially improve the translation quality. 1

5 0.14265861 74 acl-2013-Building Comparable Corpora Based on Bilingual LDA Model

Author: Zede Zhu ; Miao Li ; Lei Chen ; Zhenxin Yang

Abstract: Comparable corpora are important basic resources in cross-language information processing. However, the existing methods of building comparable corpora, which use intertranslated words and relative features, cannot evaluate the topical relation between document pairs. This paper adopts the bilingual LDA model to predict the topical structures of the documents and proposes three algorithms of document similarity in different languages. Experiments show that the novel method can obtain similar documents with consistent topics and offers better adaptability and stability.

6 0.13985448 95 acl-2013-Crawling microblogging services to gather language-classified URLs. Workflow and case study

7 0.13721265 68 acl-2013-Bilingual Data Cleaning for SMT using Graph-based Random Walk

8 0.13561811 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation

9 0.12993436 255 acl-2013-Name-aware Machine Translation

10 0.12354423 181 acl-2013-Hierarchical Phrase Table Combination for Machine Translation

11 0.11949573 174 acl-2013-Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation

12 0.11768735 316 acl-2013-SenseSpotting: Never let your parallel data tie you to an old domain

13 0.11425064 197 acl-2013-Incremental Topic-Based Translation Model Adaptation for Conversational Spoken Language Translation

14 0.11369652 154 acl-2013-Extracting bilingual terminologies from comparable corpora

15 0.11031114 235 acl-2013-Machine Translation Detection from Monolingual Web-Text

16 0.11017385 374 acl-2013-Using Context Vectors in Improving a Machine Translation System with Bridge Language

17 0.11000935 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling

18 0.10905857 217 acl-2013-Latent Semantic Matching: Application to Cross-language Text Categorization without Alignment Information

19 0.10584345 259 acl-2013-Non-Monotonic Sentence Alignment via Semisupervised Learning

20 0.10284972 355 acl-2013-TransDoop: A Map-Reduce based Crowdsourced Translation for Complex Domain


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.287), (1, -0.05), (2, 0.193), (3, 0.061), (4, 0.088), (5, -0.02), (6, 0.007), (7, -0.008), (8, 0.033), (9, -0.084), (10, 0.015), (11, 0.023), (12, 0.01), (13, 0.139), (14, -0.003), (15, -0.047), (16, -0.024), (17, -0.019), (18, -0.023), (19, -0.018), (20, -0.015), (21, -0.015), (22, -0.043), (23, -0.008), (24, -0.013), (25, 0.011), (26, -0.061), (27, 0.016), (28, 0.051), (29, 0.021), (30, -0.016), (31, -0.013), (32, -0.053), (33, 0.004), (34, 0.047), (35, -0.026), (36, 0.031), (37, 0.005), (38, 0.04), (39, 0.007), (40, -0.01), (41, -0.012), (42, -0.045), (43, 0.032), (44, 0.021), (45, 0.039), (46, 0.057), (47, -0.084), (48, 0.013), (49, 0.004)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96774715 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl

Author: Jason R. Smith ; Herve Saint-Amand ; Magdalena Plamada ; Philipp Koehn ; Chris Callison-Burch ; Adam Lopez

Abstract: Parallel text is the fuel that drives modern machine translation systems. The Web is a comprehensive source of preexisting parallel text, but crawling the entire web is impossible for all but the largest companies. We bring web-scale parallel text to the masses by mining the Common Crawl, a public Web crawl hosted on Amazon’s Elastic Cloud. Starting from nothing more than a set of common two-letter language codes, our open-source extension of the STRAND algorithm mined 32 terabytes of the crawl in just under a day, at a cost of about $500. Our large-scale experiment uncovers large amounts of parallel text in dozens of language pairs across a variety of domains and genres, some previously unavailable in curated datasets. Even with minimal cleaning and filtering, the resulting data boosts translation performance across the board for five different language pairs in the news domain, and on open domain test sets we see improvements of up to 5 BLEU. We make our code and data available for other researchers seeking to mine this rich new data resource.1

2 0.88494784 338 acl-2013-Task Alternation in Parallel Sentence Retrieval for Twitter Translation

Author: Felix Hieber ; Laura Jehl ; Stefan Riezler

Abstract: We present an approach to mine comparable data for parallel sentences using translation-based cross-lingual information retrieval (CLIR). By iteratively alternating between the tasks of retrieval and translation, an initial general-domain model is allowed to adapt to in-domain data. Adaptation is done by training the translation system on a few thousand sentences retrieved in the step before. Our setup is time- and memory-efficient and of similar quality as CLIR-based adaptation on millions of parallel sentences.

3 0.86675477 240 acl-2013-Microblogs as Parallel Corpora

Author: Wang Ling ; Guang Xiang ; Chris Dyer ; Alan Black ; Isabel Trancoso

Abstract: In the ever-expanding sea of microblog data, there is a surprising amount of naturally occurring parallel text: some users post multilingual messages targeting international audiences while others “retweet” translations. We present an efficient method for detecting these messages and extracting parallel segments from them. We have been able to extract over 1M Chinese-English parallel segments from Sina Weibo (the Chinese counterpart of Twitter) using only their public APIs. As a supplement to existing parallel training data, our automatically extracted parallel data yields substantial translation quality improvements in translating microblog text and modest improvements in translating edited news commentary. The resources described in this paper are available at http://www.cs.cmu.edu/∼lingwang/utopia.

4 0.777592 68 acl-2013-Bilingual Data Cleaning for SMT using Graph-based Random Walk

Author: Lei Cui ; Dongdong Zhang ; Shujie Liu ; Mu Li ; Ming Zhou

Abstract: The quality of bilingual data is a key factor in Statistical Machine Translation (SMT). Low-quality bilingual data tends to produce incorrect translation knowledge and also degrades translation modeling performance. Previous work often used supervised learning methods to filter lowquality data, but a fair amount of human labeled examples are needed which are not easy to obtain. To reduce the reliance on labeled examples, we propose an unsupervised method to clean bilingual data. The method leverages the mutual reinforcement between the sentence pairs and the extracted phrase pairs, based on the observation that better sentence pairs often lead to better phrase extraction and vice versa. End-to-end experiments show that the proposed method substantially improves the performance in largescale Chinese-to-English translation tasks.

5 0.75337023 197 acl-2013-Incremental Topic-Based Translation Model Adaptation for Conversational Spoken Language Translation

Author: Sanjika Hewavitharana ; Dennis Mehay ; Sankaranarayanan Ananthakrishnan ; Prem Natarajan

Abstract: We describe a translation model adaptation approach for conversational spoken language translation (CSLT), which encourages the use of contextually appropriate translation options from relevant training conversations. Our approach employs a monolingual LDA topic model to derive a similarity measure between the test conversation and the set of training conversations, which is used to bias translation choices towards the current context. A significant novelty of our adaptation technique is its incremental nature; we continuously update the topic distribution on the evolving test conversation as new utterances become available. Thus, our approach is well-suited to the causal constraint of spoken conversations. On an English-to-Iraqi CSLT task, the proposed approach gives significant improvements over a baseline system as measured by BLEU, TER, and NIST. Interestingly, the incremental approach outperforms a non-incremental oracle that has up-front knowledge of the whole conversation.

6 0.73974848 255 acl-2013-Name-aware Machine Translation

7 0.73733729 305 acl-2013-SORT: An Interactive Source-Rewriting Tool for Improved Translation

8 0.73431337 154 acl-2013-Extracting bilingual terminologies from comparable corpora

9 0.72890395 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation

10 0.72007346 181 acl-2013-Hierarchical Phrase Table Combination for Machine Translation

11 0.71665937 92 acl-2013-Context-Dependent Multilingual Lexical Lookup for Under-Resourced Languages

12 0.69548595 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation

13 0.69537234 64 acl-2013-Automatically Predicting Sentence Translation Difficulty

14 0.68917626 127 acl-2013-Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation

15 0.68778437 110 acl-2013-Deepfix: Statistical Post-editing of Statistical Machine Translation Using Deep Syntactic Analysis

16 0.68553364 235 acl-2013-Machine Translation Detection from Monolingual Web-Text

17 0.68323857 201 acl-2013-Integrating Translation Memory into Phrase-Based Machine Translation during Decoding

18 0.68051386 72 acl-2013-Bridging Languages through Etymology: The case of cross language text categorization

19 0.67855155 214 acl-2013-Language Independent Connectivity Strength Features for Phrase Pivot Statistical Machine Translation

20 0.67386997 74 acl-2013-Building Comparable Corpora Based on Bilingual LDA Model


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.06), (6, 0.044), (11, 0.059), (15, 0.012), (24, 0.073), (26, 0.09), (35, 0.081), (40, 0.014), (42, 0.052), (47, 0.131), (48, 0.024), (70, 0.038), (88, 0.033), (90, 0.063), (95, 0.139), (97, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90410447 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl

Author: Jason R. Smith ; Herve Saint-Amand ; Magdalena Plamada ; Philipp Koehn ; Chris Callison-Burch ; Adam Lopez

Abstract: Parallel text is the fuel that drives modern machine translation systems. The Web is a comprehensive source of preexisting parallel text, but crawling the entire web is impossible for all but the largest companies. We bring web-scale parallel text to the masses by mining the Common Crawl, a public Web crawl hosted on Amazon’s Elastic Cloud. Starting from nothing more than a set of common two-letter language codes, our open-source extension of the STRAND algorithm mined 32 terabytes of the crawl in just under a day, at a cost of about $500. Our large-scale experiment uncovers large amounts of parallel text in dozens of language pairs across a variety of domains and genres, some previously unavailable in curated datasets. Even with minimal cleaning and filtering, the resulting data boosts translation performance across the board for five different language pairs in the news domain, and on open domain test sets we see improvements of up to 5 BLEU. We make our code and data available for other researchers seeking to mine this rich new data resource.1

2 0.89534295 288 acl-2013-Punctuation Prediction with Transition-based Parsing

Author: Dongdong Zhang ; Shuangzhi Wu ; Nan Yang ; Mu Li

Abstract: Punctuations are not available in automatic speech recognition outputs, which could create barriers to many subsequent text processing tasks. This paper proposes a novel method to predict punctuation symbols for the stream of words in transcribed speech texts. Our method jointly performs parsing and punctuation prediction by integrating a rich set of syntactic features when processing words from left to right. It can exploit a global view to capture long-range dependencies for punctuation prediction with linear complexity. The experimental results on the test data sets of IWSLT and TDT4 show that our method can achieve high-level performance in punctuation prediction over the stream of words in transcribed speech text. 1

3 0.85141349 240 acl-2013-Microblogs as Parallel Corpora

Author: Wang Ling ; Guang Xiang ; Chris Dyer ; Alan Black ; Isabel Trancoso

Abstract: In the ever-expanding sea of microblog data, there is a surprising amount of naturally occurring parallel text: some users post multilingual messages targeting international audiences while others “retweet” translations. We present an efficient method for detecting these messages and extracting parallel segments from them. We have been able to extract over 1M Chinese-English parallel segments from Sina Weibo (the Chinese counterpart of Twitter) using only their public APIs. As a supplement to existing parallel training data, our automatically extracted parallel data yields substantial translation quality improvements in translating microblog text and modest improvements in translating edited news commentary. The resources described in this paper are available at http://www.cs.cmu.edu/∼lingwang/utopia.

4 0.8446272 380 acl-2013-VSEM: An open library for visual semantics representation

Author: Elia Bruni ; Ulisse Bordignon ; Adam Liska ; Jasper Uijlings ; Irina Sergienya

Abstract: VSEM is an open library for visual semantics. Starting from a collection of tagged images, it is possible to automatically construct an image-based representation of concepts by using off-the-shelf VSEM functionalities. VSEM is entirely written in MATLAB and its object-oriented design allows a large flexibility and reusability. The software is accompanied by a website with supporting documentation and examples.

5 0.83795142 236 acl-2013-Mapping Source to Target Strings without Alignment by Analogical Learning: A Case Study with Transliteration

Author: Phillippe Langlais

Abstract: Analogical learning over strings is a holistic model that has been investigated by a few authors as a means to map forms of a source language to forms of a target language. In this study, we revisit this learning paradigm and apply it to the transliteration task. We show that alone, it performs worse than a statistical phrase-based machine translation engine, but the combination of both approaches outperforms each one taken separately, demonstrating the usefulness of the information captured by a so-called formal analogy.

6 0.83570117 235 acl-2013-Machine Translation Detection from Monolingual Web-Text

7 0.83336008 326 acl-2013-Social Text Normalization using Contextual Graph Random Walks

8 0.83122981 174 acl-2013-Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation

9 0.83104825 361 acl-2013-Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers

10 0.83089471 5 acl-2013-A Decade of Automatic Content Evaluation of News Summaries: Reassessing the State of the Art

11 0.82958317 338 acl-2013-Task Alternation in Parallel Sentence Retrieval for Twitter Translation

12 0.82868266 316 acl-2013-SenseSpotting: Never let your parallel data tie you to an old domain

13 0.8283937 267 acl-2013-PARMA: A Predicate Argument Aligner

14 0.82802701 81 acl-2013-Co-Regression for Cross-Language Review Rating Prediction

15 0.82718158 207 acl-2013-Joint Inference for Fine-grained Opinion Extraction

16 0.82644504 137 acl-2013-Enlisting the Ghost: Modeling Empty Categories for Machine Translation

17 0.82592356 97 acl-2013-Cross-lingual Projections between Languages from Different Families

18 0.82339525 211 acl-2013-LABR: A Large Scale Arabic Book Reviews Dataset

19 0.82317388 130 acl-2013-Domain-Specific Coreference Resolution with Lexicalized Features

20 0.82306504 255 acl-2013-Name-aware Machine Translation