acl acl2013 acl2013-240 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Wang Ling ; Guang Xiang ; Chris Dyer ; Alan Black ; Isabel Trancoso
Abstract: In the ever-expanding sea of microblog data, there is a surprising amount of naturally occurring parallel text: some users post multilingual messages targeting international audiences while others “retweet” translations. We present an efficient method for detecting these messages and extracting parallel segments from them. We have been able to extract over 1M Chinese-English parallel segments from Sina Weibo (the Chinese counterpart of Twitter) using only their public APIs. As a supplement to existing parallel training data, our automatically extracted parallel data yields substantial translation quality improvements in translating microblog text and modest improvements in translating edited news commentary. The resources described in this paper are available at http://www.cs.cmu.edu/∼lingwang/utopia.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In the ever-expanding sea of microblog data, there is a surprising amount of naturally occurring parallel text: some users post multilingual messages targeting international audiences while others “retweet” translations. [sent-5, score-0.747]
2 We present an efficient method for detecting these messages and extracting parallel segments from them. [sent-6, score-0.701]
3 We have been able to extract over 1M Chinese-English parallel segments from Sina Weibo (the Chinese counterpart of Twitter) using only their public APIs. [sent-7, score-0.558]
4 As a supplement to existing parallel training data, our automatically extracted parallel data yields substantial translation quality improvements in translating microblog text and modest improvements in translating edited news commentary. [sent-8, score-1.086]
5 Machine translation suffers acutely from the domain-mismatch problem caused by microblog text. [sent-23, score-0.307]
6 However, more acutely, the data used to develop these systems and train their models is drawn from formal and carefully edited domains, such as parallel web pages and translated legal documents. [sent-25, score-0.449]
7 This paper introduces a method for finding naturally occurring parallel microblog text, which helps address the domain-mismatch problem. [sent-27, score-0.526]
8 Our method is inspired by the perhaps surprising observation that a reasonable number of microblog users tweet “in parallel” in two or more languages. [sent-28, score-0.331]
9 For instance, the American entertainer Snoop Dogg regularly posts parallel messages on Sina Weibo (Mainland China’s equivalent of Twitter), for example, watup Kenny Mayne! [sent-29, score-0.568]
10 Briefly, this requires determining if a tweet contains more than one language, if these multilingual utterances contain translated material (or are due to something else, such as code switching), and what the translated spans are. [sent-33, score-0.35]
11 Section 2 describes the related work in parallel data extraction. [sent-35, score-0.356]
12 Section 3 presents our model to extract parallel data within the same document. [sent-36, score-0.356]
13 We then present experiments showing that our harvested data not only substantially improves translations of microblog text with […] (page 176; Proceedings of the 51st Annual Meeting of t[…], Sofia, Bulgaria, August 4-9 2013). [sent-39, score-0.295]
14 2 Related Work Automatic collection of parallel data is a well-studied problem. [sent-43, score-0.356]
15 These broadly work by identifying promising candidates using simple features, such as URL similarity or “gist translations” and then identifying truly parallel segments with more expensive classifiers. [sent-47, score-0.558]
16 Mining parallel or comparable messages from microblogs has mainly relied on Cross-Lingual Information Retrieval techniques (CLIR). [sent-49, score-0.633]
17 (2012) attempt to find pairs of tweets in Twitter using Arabic tweets as search queries in a CLIR system. [sent-51, score-0.685]
18 , 2001) is applied to retrieve a set of ranked translation candidates for each Arabic tweet, which are then used as parallel candidates. [sent-53, score-0.449]
19 , 2008), which attempts to find translations within the same document, has some similarities with our work, since parenthetical translations are within the same document. [sent-55, score-0.239]
20 We aim to propose a method that acquires large amounts of parallel data for free. [sent-61, score-0.356]
21 The drawback is that there is a margin of error in the parallel segment identification and alignment. [sent-62, score-0.443]
22 3 Parallel Segment Retrieval We will first abstract from the domain of Microblogs and focus on the task of retrieving parallel segments from single documents. [sent-64, score-0.558]
23 Prior work on finding parallel data attempts to reason about the probability that pairs of documents (x, y) are parallel. [sent-65, score-0.385]
24 , xn, and consisting of n tokens, and need to determine whether there is parallel data in x, and if so, where are the parallel segments and their languages. [sent-69, score-0.914]
25 As representation for the parallel segments within the document, we use the tuple ([p, q] , l, [u, v] , r, a). [sent-71, score-0.558]
26 The word indexes [p, q] and [u, v] are used to identify the left segment (from p to q) and right segment (from u to v), which are parallel. [sent-72, score-0.28]
27 We shall refer to [p, q] and [u, v] as the spans of the left and right segments. [sent-73, score-0.252]
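The tuple ([p, q], l, [u, v], r, a) described above can be pictured as a small record type. The following is only a sketch under stated assumptions: the class name `ParallelSegments`, the 0-based inclusive indexes, the language codes, and the encoding of the alignment as (left index, right index) pairs are illustrative choices, not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ParallelSegments:
    """Hypothetical container for the tuple ([p, q], l, [u, v], r, a).

    Indexes are 0-based and inclusive here (an assumption of this sketch).
    """
    p: int            # first word of the left segment
    q: int            # last word of the left segment
    left_lang: str    # language l of the left segment
    u: int            # first word of the right segment
    v: int            # last word of the right segment
    right_lang: str   # language r of the right segment
    alignment: Tuple[Tuple[int, int], ...] = ()  # (left index, right index) pairs

    def left(self, tokens: List[str]) -> List[str]:
        """Return the words of the left segment."""
        return tokens[self.p:self.q + 1]

    def right(self, tokens: List[str]) -> List[str]:
        """Return the words of the right segment."""
        return tokens[self.u:self.v + 1]


# Toy document in which "a b" is translated as "A B".
tokens = ["a", "b", "A", "B"]
seg = ParallelSegments(p=0, q=1, left_lang="en", u=2, v=3, right_lang="zh",
                       alignment=((0, 2), (1, 3)))
```

With this record, `seg.left(tokens)` and `seg.right(tokens)` recover the two spans from the single document.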
28 The main problem we address is to find the parallel data when the boundaries of the parallel segments are not defined explicitly. [sent-78, score-0.914]
29 Thus, our model will attempt to find the optimal values for the segments [p, q] and [u, v], the languages l and r, and the word alignments a jointly. [sent-90, score-0.271]
30 In fact, because our model can freely choose the segments to align, choosing only one word as the left segment that is well aligned to a word in the right segment would be the best choice. [sent-93, score-0.491]
31 1 Model We propose a simple (non-probabilistic) three-factor model that models the spans of the parallel segments, their languages, and word alignments jointly. [sent-100, score-0.565]
32 The lexical tables P_{M1} for the various language pairs are trained a priori using available parallel corpora. [sent-118, score-0.385]
33 The insight we use to improve the runtime is that the Viterbi word alignment of a bispan can be reused to calculate the Viterbi word alignments of larger bispans. [sent-132, score-0.236]
34 • A_{p,q,u+1,v} = λ_{+u}(A_{p,q,u,v}) removes the first token of the right span, with index u, so we only need to remove the alignment from u, which can be done in time O(1). [sent-142, score-0.275]
35 • A_{p,q+1,u,v} = λ_{+q}(A_{p,q,u,v}) adds one token to the end of the left span, with index q + 1, so we need to check, for each word in the right span, whether aligning to the word at index q + 1 yields a better translation probability. [sent-143, score-0.356]
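The two update operations above can be sketched as in-place edits of an alignment map. This is a simplified illustration, not the paper's implementation: it assumes the alignment is stored as a dictionary from each right-span index to one left-span index, that lexical probabilities come from a hypothetical table keyed by (left word, right word), and it ignores null alignments.

```python
from typing import Dict, List, Tuple

LexTable = Dict[Tuple[str, str], float]


def lex_prob(table: LexTable, left_word: str, right_word: str) -> float:
    """Look up a lexical translation probability (tiny default for unseen pairs)."""
    return table.get((left_word, right_word), 1e-9)


def extend_left_span(alignment: Dict[int, int], table: LexTable,
                     tokens: List[str], new_q: int, u: int, v: int) -> None:
    """Add the token at index new_q to the left span.

    Each right-span word keeps its current link unless the new word is a
    better translation, so the cost is one pass over the right span.
    """
    for j in range(u, v + 1):
        current = alignment.get(j)
        if current is None or (lex_prob(table, tokens[new_q], tokens[j])
                               > lex_prob(table, tokens[current], tokens[j])):
            alignment[j] = new_q


def drop_first_right_token(alignment: Dict[int, int], u: int) -> None:
    """Remove the first right-span token: only its own link is deleted."""
    alignment.pop(u, None)


# Toy example: left span starts as ["a"], right span is ["A", "B"].
tokens = ["a", "b", "A", "B"]
table = {("a", "A"): 0.9, ("b", "B"): 0.8, ("a", "B"): 0.1, ("b", "A"): 0.1}
alignment = {2: 0, 3: 0}
extend_left_span(alignment, table, tokens, new_q=1, u=2, v=3)
```

After the call, "B" is re-linked to "b" while "A" keeps its link to "a", which is the reuse of the smaller bispan's Viterbi alignment that the text describes.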
36 The light gray boxes show the parallel span and the dark boxes show the span’s Viterbi alignment. [sent-155, score-0.484]
37 In this example, the parallel message contains a “translation” of a b to A B. [sent-156, score-0.356]
38 4 Parallel Data Extraction We will now describe our method to extract parallel data from Microblogs. [sent-173, score-0.356]
39 For the Twitter domain, we use a previously crawled dataset from the years 2008 to 2013, where one million tweets are crawled every day. [sent-176, score-0.564]
40 Regarding Sina Weibo, we built a crawler that continuously collects tweets from Weibo. [sent-179, score-0.328]
41 Thus, to optimize the number of parallel posts we can collect per request, we only crawl all messages from users that have at least 10 parallel tweets in their first 100 posts. [sent-183, score-1.348]
42 The number of parallel messages is estimated by running our alignment model, and checking if τ > φ, where φ was set empirically initially, and optimized after obtaining annotated data, which will be detailed in 5. [sent-184, score-0.579]
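The user-selection rule can be sketched as a threshold test over alignment scores. The function name, the argument layout, and the idea of passing precomputed scores τ are assumptions of this sketch; the limits of 10 parallel tweets and 100 posts come from the text, while the value of φ is left to the caller.

```python
from typing import Sequence


def should_crawl_user(first_post_scores: Sequence[float], phi: float,
                      min_parallel: int = 10, window: int = 100) -> bool:
    """Decide whether to crawl all messages of a user.

    first_post_scores holds the alignment score tau of each post in posting
    order; a post counts as parallel when tau > phi.
    """
    parallel = sum(1 for tau in first_post_scores[:window] if tau > phi)
    return parallel >= min_parallel
```

Only the first `window` posts are inspected, so a user who starts posting parallel messages later is not selected under this rule.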
43 Using this process, we crawled 65 million tweets from Sina Weibo within 4 months. [sent-186, score-0.428]
44 In both cases, we first filter the collection of tweets for messages containing at least one trigram in each language of the target language pair, determined by their Unicode ranges. [sent-187, score-0.471]
45 This means that for the Chinese-English language pair, we only keep tweets with more than 3 Mandarin characters and 3 Latin words. [sent-188, score-0.42]
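A Unicode-range filter of this kind might look like the sketch below. Several details are assumptions: the CJK Unified Ideographs block (U+4E00 to U+9FFF) stands in for "Mandarin characters", a Latin word is any run of ASCII letters, and the threshold is read as "at least 3" of each (the text says both "at least one trigram" and "more than 3", so the minimums are parameters).

```python
import re

# Assumed ranges: CJK Unified Ideographs for Chinese, ASCII letters for Latin.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")
LATIN_WORD = re.compile(r"[A-Za-z]+")


def passes_language_filter(text: str, min_cjk: int = 3, min_latin: int = 3) -> bool:
    """Keep a message only if it has enough Chinese characters and Latin words."""
    cjk_count = len(CJK_CHAR.findall(text))
    latin_count = len(LATIN_WORD.findall(text))
    return cjk_count >= min_cjk and latin_count >= min_latin
```

This sketch counts characters and words anywhere in the message and does not require them to be contiguous, which is looser than a strict trigram test.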
46 , 2012), if a tweet A is identified as a retweet, meaning that it references another tweet B, we also consider the hypothesis that these tweets may be mutual translations. [sent-190, score-0.576]
47 com/wiki/API文档/en these are also considered for the extraction of parallel data. [sent-193, score-0.356]
48 This is done by concatenating tweets A and B, and adding the constraint that [p, q] must be within A and [u, v] must be within B. [sent-194, score-0.328]
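The retweet constraint can be sketched as index bookkeeping over the concatenated message. The function names and the 0-based inclusive ranges are illustrative assumptions; the example tokens are invented.

```python
from typing import List, Tuple

Range = Tuple[int, int]


def retweet_document(tweet_a: List[str], tweet_b: List[str]) -> Tuple[List[str], Range, Range]:
    """Concatenate tweets A and B and return the index range each one occupies."""
    tokens = list(tweet_a) + list(tweet_b)
    left_range = (0, len(tweet_a) - 1)
    right_range = (len(tweet_a), len(tokens) - 1)
    return tokens, left_range, right_range


def spans_allowed(p: int, q: int, u: int, v: int,
                  left_range: Range, right_range: Range) -> bool:
    """Check that [p, q] lies within A and [u, v] lies within B."""
    return (left_range[0] <= p <= q <= left_range[1]
            and right_range[0] <= u <= v <= right_range[1])


doc, a_range, b_range = retweet_document(["我", "爱", "你"], ["I", "love", "you"])
```

A span search over `doc` would then skip any candidate for which `spans_allowed` is false.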
49 After filtering, we obtained 1124k ZH-EN tweets from Sina Weibo, 868k ZH-EN and 136k AR-EN tweets from Twitter. [sent-196, score-0.656]
50 Finally, we run our alignment model described in section 3, and obtain the parallel segments and their scores, which measure how likely those segments are parallel. [sent-198, score-0.84]
51 First, intrinsically, by observing how well our method identifies tweets containing parallel data, the language pair and what their spans are. [sent-202, score-0.867]
52 Our method needs to determine if a given tweet contains parallel data, and if so, what is the language pair of the data, and what segments are parallel. [sent-208, score-0.725]
53 Thus, we had a native Mandarin speaker, also fluent in English, annotate 2000 tweets sampled from the crawled Weibo tweets. [sent-209, score-0.428]
54 One important question to answer is what portion of the Microblogs contains parallel data. [sent-210, score-0.356]
55 Thus, we also used the random sample from Twitter and annotated 1200 samples, identifying whether each sample contains parallel data, for the EN-ZH and AR-EN filtered tweets. [sent-211, score-0.408]
56 We count a true positive (tp) if we correctly identify a parallel tweet, and a false positive (fp) if we spuriously detect a parallel tweet. [sent-215, score-0.746]
57 Finally, a true negative (tn) occurs when we correctly detect a non-parallel tweet, and a false negative (fn) if we miss a parallel tweet. [sent-216, score-0.356]
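The four counts defined above lead to the usual precision, recall and accuracy figures. The sketch below uses the standard definitions; the function name and the boolean-list input format are assumptions.

```python
from typing import Dict, Sequence


def detection_metrics(predicted: Sequence[bool], gold: Sequence[bool]) -> Dict[str, float]:
    """Compute tp/fp/tn/fn counts and the derived scores for parallel-tweet detection."""
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    tn = sum(1 for p, g in zip(predicted, gold) if not p and not g)
    fn = sum(1 for p, g in zip(predicted, gold) if not p and g)
    total = tp + fp + tn + fn
    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }


scores = detection_metrics(
    predicted=[True, True, False, False, True],
    gold=[True, False, False, True, True],
)
```

Sweeping the detection threshold and recomputing these values at each setting would give curves of the kind the text refers to.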
58 Finally, to evaluate the segment alignment, we use the Word Error Rate (WER) metric, without substitutions, where we compare the left and right spans of our system and the respective spans of the reference. [sent-219, score-0.442]
59 We count an insertion error (I) for each word in our system’s spans that is not present in the reference span and a deletion error (D) for each word in the reference span that is not present in our system’s spans. [sent-220, score-0.396]
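The span-level error count can be sketched with set differences over word indexes. Dividing by the number of reference words is the conventional WER normalization and is an assumption here, since the extracted sentences do not state the denominator; the function names are also illustrative.

```python
from typing import Sequence, Set, Tuple

Span = Tuple[int, int]  # inclusive word indexes


def span_indices(span: Span) -> Set[int]:
    """Expand an inclusive span into the set of word indexes it covers."""
    start, end = span
    return set(range(start, end + 1))


def span_wer(system_spans: Sequence[Span], reference_spans: Sequence[Span]) -> float:
    """WER without substitutions over paired (left, right) spans."""
    insertions = deletions = reference_words = 0
    for system, reference in zip(system_spans, reference_spans):
        sys_idx, ref_idx = span_indices(system), span_indices(reference)
        insertions += len(sys_idx - ref_idx)   # words we included but should not have
        deletions += len(ref_idx - sys_idx)    # reference words we missed
        reference_words += len(ref_idx)
    return (insertions + deletions) / reference_words


# Left span matches exactly; right span is shifted by one word.
wer = span_wer(system_spans=[(0, 3), (5, 8)], reference_spans=[(0, 3), (6, 9)])
```

In the example, one extra word and one missed word over eight reference words give a rate of 0.25.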
60 The quality of the parallel sentence detection did not vary significantly with different setups, so we will only show the results for the best setup, which is the baseline model with span constraints. [sent-225, score-0.527]
61 Figure 2: Precision, recall and accuracy curves for parallel data detection. [sent-226, score-0.387]
62 From the precision and recall curves, we observe that most of the parallel data can be found at the top 30% of the filtered tweets, where 5 in 6 tweets are detected correctly as parallel, and only 1 in every 6 parallel sentences is lost. [sent-228, score-1.139]
63 We also see that in total, 30% of the filtered tweets are parallel. [sent-231, score-0.38]
64 If we generalize this ratio for the complete set with 1124k tweets, we can expect approximately 337k parallel sentences. [sent-232, score-0.386]
65 Finally, since 65 million tweets were extracted to generate the 337k tweets, we estimate that approximately 1 parallel tweet can be found for every 200 tweets we process using our targeted approach. [sent-233, score-1.166]
66 On the other hand, of the 1200 tweets from Twitter, we found that 27 had parallel data in the ZH-EN pair; extrapolating to the whole 868k filtered tweets, we expect to find 19530. [sent-234, score-0.736]
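The extrapolations quoted above can be checked with a few lines of arithmetic. All figures are taken from the text; only the variable names are invented.

```python
# Weibo: 30% of the 1124k filtered tweets are estimated to be parallel.
weibo_filtered = 1_124_000
weibo_parallel = weibo_filtered * 30 // 100        # 337,200, i.e. about 337k

# 65 million tweets were crawled to obtain them.
weibo_crawled = 65_000_000
tweets_per_parallel = weibo_crawled / weibo_parallel   # about 193, roughly 1 in 200

# Twitter ZH-EN: 27 of 1200 annotated samples were parallel, scaled to 868k.
twitter_filtered = 868_000
twitter_parallel = 27 * twitter_filtered // 1200    # 19,530
```

The Weibo ratio comes out slightly under 200, consistent with the "approximately 1 in 200" statement.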
67 For AR-EN, a similar result was obtained where we expect 12407 tweets out of the 1. [sent-238, score-0.328]
68 This shows that targeted approaches can substantially reduce the crawling effort required to find parallel tweets. [sent-240, score-0.384]
69 Still, considering that billions of tweets are posted daily, this is a substantial source of parallel data. [sent-241, score-0.684]
70 The remainder of the tests will be performed on the Weibo dataset, which contains more parallel data. [sent-242, score-0.356]
71 Tests on the Twitter data will be conducted as future work, when we process Twitter data on a larger scale to obtain more parallel sentences. [sent-243, score-0.356]
72 This shows that, on average, approximately 1 in 9 words on the parallel segments is incorrect. [sent-251, score-0.588]
73 Among the 578 tweets that are parallel, 496 were extracted within the same tweet and 82 were extracted from retweets. [sent-253, score-0.452]
74 Thus, we see that the majority of the parallel data comes from within the same tweet. [sent-254, score-0.356]
75 To give an intuition about the contents of the parallel data we found, we looked at the distribution over topics of the parallel dataset inferred by LDA (Blei et al. [sent-256, score-0.748]
76 Thus, we grouped the Weibo filtered tweets by users, and ran LDA over the predicted English segments, with 12 topics. [sent-258, score-0.38]
77 To gain some perspective on the type of sentence pairs we are extracting, we will illustrate some sentence pairs we crawled and aligned automatically. [sent-264, score-0.284]
78 We extracted 1386 parallel sentences for tuning and another 1386 sentences for testing, from the manually aligned segments. [sent-291, score-0.396]
79 For this test set, we used 8 million sentences from the full NIST parallel dataset as the language model training data. [sent-292, score-0.392]
80 To carry out the microblog translation experiments, we need a high quality parallel test set. [sent-320, score-0.619]
81 Since we are not aware of such a test set, we created one by manually selecting parallel messages from Weibo. [sent-321, score-0.499]
82 […] number of parallel tweets according to our automatic method (at least 2 in every 5 tweets). [sent-328, score-0.684]
83 To these, we added another 2000 messages from our targeted Weibo crawl, but these had no requirement on the proportion of parallel tweets they had produced. [sent-329, score-0.827]
84 We identified 2374 parallel segments, of which we used 1187 for development and 1187 for testing. [sent-330, score-0.356]
85 Furthermore, to ensure that our training data was not too similar to the test set in the Weibo translation task, we filtered the training data to remove near duplicates by computing edit distance between each parallel sentence in the heldout set and each training instance. [sent-333, score-0.544]
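Near-duplicate filtering by edit distance might be sketched as follows. The character-level Levenshtein distance, the relative threshold `max_ratio`, and the function names are assumptions of this sketch; the extracted text does not give the threshold or say whether distance was computed over characters or words.

```python
from typing import List, Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            current.append(min(previous[j] + 1,              # deletion
                               current[j - 1] + 1,           # insertion
                               previous[j - 1] + (x != y)))  # substitution or match
        previous = current
    return previous[-1]


def remove_near_duplicates(training: List[str], heldout: List[str],
                           max_ratio: float = 0.2) -> List[str]:
    """Drop training sentences too close to any held-out sentence.

    max_ratio is a hypothetical threshold on distance / longer length.
    """
    kept = []
    for sentence in training:
        if all(edit_distance(sentence, h) > max_ratio * max(len(sentence), len(h), 1)
               for h in heldout):
            kept.append(sentence)
    return kept
```

This compares every training instance with every held-out sentence, so the cost grows with the product of the two set sizes; a large training set would likely need some pruning first.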
86 As for the language models, we collected a further 10M tweets from Twitter for the English language model and another 10M tweets from Weibo for the Chinese language model. [sent-335, score-0.656]
87 3We acknowledge that self-translated messages are probably not a typically representative sample of all microblog messages. [sent-336, score-0.313]
88 The BLEU scores for the different parallel corpora are shown in Table 3 and the top 10 out-of-vocabulary (OOV) words for each dataset are shown in Table 4. [sent-356, score-0.392]
89 However, by combining the Weibo parallel data with this standard data, improvements in BLEU are obtained. [sent-360, score-0.356]
90 Furthermore, we also note that the system built on the Weibo dataset does not perform substantially worse than the one trained on the FBIS dataset, a further indication that harvesting parallel microblog data yields a diverse collection of translated material. [sent-362, score-0.633]
91 For the Weibo test set, a significant improvement over the news datasets can be achieved using our crawled parallel data. [sent-363, score-0.562]
92 6 Conclusion We presented a framework to crawl parallel data from microblogs. [sent-375, score-0.415]
93 We find parallel data from single posts, with translations of the same sentence in two languages. [sent-376, score-0.465]
94 We show that a considerable amount of parallel sentence pairs can be crawled from microblogs and these can be used to improve Machine Translation by updating our translation tables with translations of newer terms. [sent-377, score-0.853]
95 Furthermore, the in-domain data can substantially improve the translation quality on microblog data. [sent-378, score-0.291]
96 Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora. [sent-405, score-0.479]
97 A fast and accurate method for detecting English-Japanese parallel texts. [sent-422, score-0.356]
98 Constructing parallel corpora for six Indian languages via crowdsourcing. [sent-473, score-0.356]
99 Extracting parallel sentences from comparable corpora using document level alignment. [sent-485, score-0.385]
100 Mining large corpora for parallel sentences to improve translation modeling. [sent-490, score-0.449]
wordName wordTfidf (topN-words)
[('weibo', 0.425), ('parallel', 0.356), ('tweets', 0.328), ('segments', 0.202), ('microblog', 0.17), ('twitter', 0.154), ('messages', 0.143), ('spans', 0.14), ('microblogs', 0.134), ('span', 0.128), ('sina', 0.128), ('tweet', 0.124), ('parenthetical', 0.107), ('crawled', 0.1), ('translation', 0.093), ('update', 0.093), ('fbis', 0.092), ('viterbi', 0.09), ('bispan', 0.087), ('jelh', 0.087), ('syndicate', 0.087), ('segment', 0.087), ('arabic', 0.084), ('alignment', 0.08), ('posts', 0.069), ('alignments', 0.069), ('mandarin', 0.067), ('translations', 0.066), ('nist', 0.063), ('news', 0.061), ('crawl', 0.059), ('zbib', 0.058), ('wer', 0.057), ('xu', 0.056), ('fukushima', 0.053), ('braune', 0.053), ('filtered', 0.052), ('mayne', 0.05), ('edited', 0.05), ('latin', 0.05), ('observe', 0.047), ('oov', 0.045), ('datasets', 0.045), ('axelrod', 0.044), ('acutely', 0.044), ('jargon', 0.044), ('onni', 0.044), ('translated', 0.043), ('sentence', 0.043), ('pair', 0.043), ('characters', 0.042), ('post', 0.041), ('aligned', 0.04), ('uszkoreit', 0.039), ('retweet', 0.039), ('nonstandard', 0.039), ('uniformity', 0.039), ('gimpel', 0.038), ('left', 0.038), ('ibm', 0.038), ('shall', 0.037), ('right', 0.037), ('users', 0.037), ('dataset', 0.036), ('clir', 0.036), ('kenny', 0.036), ('abbreviations', 0.035), ('sl', 0.034), ('pa', 0.034), ('athe', 0.034), ('spuriously', 0.034), ('emoticons', 0.034), ('originated', 0.034), ('colloquial', 0.034), ('furthermore', 0.033), ('operations', 0.033), ('orthographic', 0.033), ('vogel', 0.032), ('smith', 0.032), ('newer', 0.032), ('billion', 0.032), ('koehn', 0.031), ('curves', 0.031), ('ture', 0.031), ('yk', 0.031), ('indexes', 0.031), ('harvested', 0.031), ('approximately', 0.03), ('stroudsburg', 0.03), ('updates', 0.03), ('index', 0.03), ('pairs', 0.029), ('document', 0.029), ('portugal', 0.028), ('substantially', 0.028), ('eal', 0.028), ('resnik', 0.028), ('older', 0.027), ('extremely', 0.027), ('montr', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000018 240 acl-2013-Microblogs as Parallel Corpora
2 0.26017588 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl
Author: Jason R. Smith ; Herve Saint-Amand ; Magdalena Plamada ; Philipp Koehn ; Chris Callison-Burch ; Adam Lopez
Abstract: Parallel text is the fuel that drives modern machine translation systems. The Web is a comprehensive source of preexisting parallel text, but crawling the entire web is impossible for all but the largest companies. We bring web-scale parallel text to the masses by mining the Common Crawl, a public Web crawl hosted on Amazon’s Elastic Cloud. Starting from nothing more than a set of common two-letter language codes, our open-source extension of the STRAND algorithm mined 32 terabytes of the crawl in just under a day, at a cost of about $500. Our large-scale experiment uncovers large amounts of parallel text in dozens of language pairs across a variety of domains and genres, some previously unavailable in curated datasets. Even with minimal cleaning and filtering, the resulting data boosts translation performance across the board for five different language pairs in the news domain, and on open domain test sets we see improvements of up to 5 BLEU. We make our code and data available for other researchers seeking to mine this rich new data resource.
3 0.23934668 115 acl-2013-Detecting Event-Related Links and Sentiments from Social Media Texts
Author: Alexandra Balahur ; Hristo Tanev
Abstract: Nowadays, the importance of Social Media is constantly growing, as people often use such platforms to share mainstream media news and comment on the events that they relate to. As such, people no longer remain mere spectators to the events that happen in the world, but become part of them, commenting on their developments and the entities involved, sharing their opinions and distributing related content. This paper describes a system that links the main events detected from clusters of newspaper articles to tweets related to them, detects complementary information sources from the links they contain and subsequently applies sentiment analysis to classify them into positive, negative and neutral. In this manner, readers can follow the main events happening in the world, both from the perspective of mainstream as well as social media and the public’s perception on them. This system will be part of the EMM media monitoring framework working live and it will be demonstrated using Google Earth.
4 0.23709656 338 acl-2013-Task Alternation in Parallel Sentence Retrieval for Twitter Translation
Author: Felix Hieber ; Laura Jehl ; Stefan Riezler
Abstract: We present an approach to mine comparable data for parallel sentences using translation-based cross-lingual information retrieval (CLIR). By iteratively alternating between the tasks of retrieval and translation, an initial general-domain model is allowed to adapt to in-domain data. Adaptation is done by training the translation system on a few thousand sentences retrieved in the step before. Our setup is time- and memory-efficient and of similar quality as CLIR-based adaptation on millions of parallel sentences.
5 0.21271639 233 acl-2013-Linking Tweets to News: A Framework to Enrich Short Text Data in Social Media
Author: Weiwei Guo ; Hao Li ; Heng Ji ; Mona Diab
Abstract: Many current Natural Language Processing [NLP] techniques work well assuming a large context of text as input data. However they become ineffective when applied to short texts such as Twitter feeds. To overcome the issue, we want to find a related newswire document to a given tweet to provide contextual support for NLP tasks. This requires robust modeling and understanding of the semantics of short texts. The contribution of the paper is two-fold: 1. we introduce the Linking-Tweets-to-News task as well as a dataset of linked tweet-news pairs, which can benefit many NLP applications; 2. in contrast to previous research which focuses on lexical features within the short texts (text-to-word information), we propose a graph based latent variable model that models the inter short text correlations (text-to-text information). This is motivated by the observation that a tweet usually only covers one aspect of an event. We show that using tweet specific feature (hashtag) and news specific feature (named entities) as well as temporal constraints, we are able to extract text-to-text correlations, and thus completes the semantic picture of a short text. Our experiments show significant improvement of our new model over baselines with three evaluation metrics in the new task.
6 0.17829105 319 acl-2013-Sequential Summarization: A New Application for Timely Updated Twitter Trending Topics
7 0.17214711 301 acl-2013-Resolving Entity Morphs in Censored Data
8 0.15932199 20 acl-2013-A Stacking-based Approach to Twitter User Geolocation Prediction
9 0.14646249 147 acl-2013-Exploiting Topic based Twitter Sentiment for Stock Prediction
10 0.14544217 45 acl-2013-An Empirical Study on Uncertainty Identification in Social Media Context
11 0.14318021 256 acl-2013-Named Entity Recognition using Cross-lingual Resources: Arabic as an Example
12 0.1328847 148 acl-2013-Exploring Sentiment in Social Media: Bootstrapping Subjectivity Clues from Multilingual Twitter Streams
13 0.12933661 139 acl-2013-Entity Linking for Tweets
14 0.11902312 174 acl-2013-Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
15 0.11693559 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation
16 0.11258599 181 acl-2013-Hierarchical Phrase Table Combination for Machine Translation
17 0.11097826 289 acl-2013-QuEst - A translation quality estimation framework
18 0.11060601 374 acl-2013-Using Context Vectors in Improving a Machine Translation System with Bridge Language
19 0.10852541 67 acl-2013-Bi-directional Inter-dependencies of Subjective Expressions and Targets and their Value for a Joint Model
20 0.10655279 373 acl-2013-Using Conceptual Class Attributes to Characterize Social Media Users
topicId topicWeight
[(0, 0.28), (1, 0.024), (2, 0.157), (3, 0.15), (4, 0.181), (5, 0.05), (6, 0.099), (7, 0.109), (8, 0.183), (9, -0.19), (10, -0.116), (11, 0.006), (12, 0.066), (13, 0.002), (14, 0.023), (15, -0.035), (16, 0.069), (17, -0.058), (18, 0.004), (19, -0.045), (20, 0.023), (21, 0.036), (22, -0.007), (23, -0.045), (24, -0.01), (25, 0.026), (26, -0.008), (27, 0.052), (28, 0.096), (29, -0.017), (30, 0.0), (31, 0.022), (32, -0.005), (33, 0.032), (34, -0.038), (35, -0.014), (36, 0.046), (37, 0.02), (38, 0.079), (39, 0.079), (40, -0.019), (41, -0.067), (42, -0.015), (43, 0.042), (44, 0.045), (45, 0.021), (46, 0.089), (47, -0.021), (48, 0.007), (49, -0.049)]
simIndex simValue paperId paperTitle
same-paper 1 0.90606612 240 acl-2013-Microblogs as Parallel Corpora
2 0.78595746 45 acl-2013-An Empirical Study on Uncertainty Identification in Social Media Context
Author: Zhongyu Wei ; Junwen Chen ; Wei Gao ; Binyang Li ; Lanjun Zhou ; Yulan He ; Kam-Fai Wong
Abstract: Uncertainty text detection is important to many social-media-based applications since more and more users utilize social media platforms (e.g., Twitter, Facebook, etc.) as information source to produce or derive interpretations based on them. However, existing uncertainty cues are ineffective in social media context because of its specific characteristics. In this paper, we propose a variant of annotation scheme for uncertainty identification and construct the first uncertainty corpus based on tweets. We then conduct experiments on the generated tweets corpus to study the effectiveness of different types of features for uncertainty text identification.
3 0.7750656 20 acl-2013-A Stacking-based Approach to Twitter User Geolocation Prediction
Author: Bo Han ; Paul Cook ; Timothy Baldwin
Abstract: We implement a city-level geolocation prediction system for Twitter users. The system infers a user’s location based on both tweet text and user-declared metadata using a stacking approach. We demonstrate that the stacking method substantially outperforms benchmark methods, achieving 49% accuracy on a benchmark dataset. We further evaluate our method on a recent crawl of Twitter data to investigate the impact of temporal factors on model generalisation. Our results suggest that user-declared location metadata is more sensitive to temporal change than the text of Twitter messages. We also describe two ways of accessing/demoing our system.
4 0.73279309 338 acl-2013-Task Alternation in Parallel Sentence Retrieval for Twitter Translation
Author: Felix Hieber ; Laura Jehl ; Stefan Riezler
Abstract: We present an approach to mine comparable data for parallel sentences using translation-based cross-lingual information retrieval (CLIR). By iteratively alternating between the tasks of retrieval and translation, an initial general-domain model is allowed to adapt to in-domain data. Adaptation is done by training the translation system on a few thousand sentences retrieved in the step before. Our setup is time- and memory-efficient and of similar quality as CLIR-based adaptation on millions of parallel sentences.
5 0.72752297 233 acl-2013-Linking Tweets to News: A Framework to Enrich Short Text Data in Social Media
Author: Weiwei Guo ; Hao Li ; Heng Ji ; Mona Diab
Abstract: Many current Natural Language Processing [NLP] techniques work well assuming a large context of text as input data. However they become ineffective when applied to short texts such as Twitter feeds. To overcome the issue, we want to find a related newswire document to a given tweet to provide contextual support for NLP tasks. This requires robust modeling and understanding of the semantics of short texts. The contribution of the paper is two-fold: 1. we introduce the Linking-Tweets-to-News task as well as a dataset of linked tweet-news pairs, which can benefit many NLP applications; 2. in contrast to previous research which focuses on lexical features within the short texts (text-to-word information), we propose a graph based latent variable model that models the inter short text correlations (text-to-text information). This is motivated by the observation that a tweet usually only covers one aspect of an event. We show that using tweet specific feature (hashtag) and news specific feature (named entities) as well as temporal constraints, we are able to extract text-to-text correlations, and thus completes the semantic picture of a short text. Our experiments show significant improvement of our new model over baselines with three evaluation metrics in the new task.
7 0.68536699 115 acl-2013-Detecting Event-Related Links and Sentiments from Social Media Texts
8 0.66360617 319 acl-2013-Sequential Summarization: A New Application for Timely Updated Twitter Trending Topics
9 0.64366096 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl
10 0.64363903 33 acl-2013-A user-centric model of voting intention from Social Media
11 0.64133614 95 acl-2013-Crawling microblogging services to gather language-classified URLs. Workflow and case study
12 0.63452101 42 acl-2013-Aid is Out There: Looking for Help from Tweets during a Large Scale Disaster
13 0.63204622 301 acl-2013-Resolving Entity Morphs in Censored Data
14 0.62488478 114 acl-2013-Detecting Chronic Critics Based on Sentiment Polarity and User's Behavior in Social Media
15 0.6106323 256 acl-2013-Named Entity Recognition using Cross-lingual Resources: Arabic as an Example
16 0.51466036 373 acl-2013-Using Conceptual Class Attributes to Characterize Social Media Users
17 0.48859051 147 acl-2013-Exploiting Topic based Twitter Sentiment for Stock Prediction
18 0.48729929 255 acl-2013-Name-aware Machine Translation
19 0.48571429 359 acl-2013-Translating Dialectal Arabic to English
20 0.47835612 68 acl-2013-Bilingual Data Cleaning for SMT using Graph-based Random Walk
topicId topicWeight
[(0, 0.051), (6, 0.088), (11, 0.055), (15, 0.014), (24, 0.051), (26, 0.086), (29, 0.01), (35, 0.065), (38, 0.01), (42, 0.051), (48, 0.036), (70, 0.045), (87, 0.129), (88, 0.026), (90, 0.049), (95, 0.156), (97, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.90346074 240 acl-2013-Microblogs as Parallel Corpora
Author: Wang Ling ; Guang Xiang ; Chris Dyer ; Alan Black ; Isabel Trancoso
Abstract: In the ever-expanding sea of microblog data, there is a surprising amount of naturally occurring parallel text: some users post multilingual messages targeting international audiences while others “retweet” translations. We present an efficient method for detecting these messages and extracting parallel segments from them. We have been able to extract over 1M Chinese-English parallel segments from Sina Weibo (the Chinese counterpart of Twitter) using only their public APIs. As a supplement to existing parallel training data, our automatically extracted parallel data yields substantial translation quality improvements in translating microblog text and modest improvements in translating edited news commentary. The resources described in this paper are available at http://www.cs.cmu.edu/∼lingwang/utopia.
2 0.85511422 83 acl-2013-Collective Annotation of Linguistic Resources: Basic Principles and a Formal Model
Author: Ulle Endriss ; Raquel Fernandez
Abstract: Crowdsourcing, which offers new ways of cheaply and quickly gathering large amounts of information contributed by volunteers online, has revolutionised the collection of labelled data. Yet, to create annotated linguistic resources from this data, we face the challenge of having to combine the judgements of a potentially large group of annotators. In this paper we investigate how to aggregate individual annotations into a single collective annotation, taking inspiration from the field of social choice theory. We formulate a general formal model for collective annotation and propose several aggregation methods that go beyond the commonly used majority rule. We test some of our methods on data from a crowdsourcing experiment on textual entailment annotation.
3 0.8495822 5 acl-2013-A Decade of Automatic Content Evaluation of News Summaries: Reassessing the State of the Art
Author: Peter A. Rankel ; John M. Conroy ; Hoa Trang Dang ; Ani Nenkova
Abstract: How good are automatic content metrics for news summary evaluation? Here we provide a detailed answer to this question, with a particular focus on assessing the ability of automatic evaluations to identify statistically significant differences present in manual evaluation of content. Using four years of data from the Text Analysis Conference, we analyze the performance of eight ROUGE variants in terms of accuracy, precision and recall in finding significantly different systems. Our experiments show that some of the neglected variants of ROUGE, based on higher order n-grams and syntactic dependencies, are most accurate across the years; the commonly used ROUGE-1 scores find too many significant differences between systems which manual evaluation would deem comparable. We also test combinations of ROUGE variants and find that they considerably improve the accuracy of automatic prediction.
4 0.8431704 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl
Author: Jason R. Smith ; Herve Saint-Amand ; Magdalena Plamada ; Philipp Koehn ; Chris Callison-Burch ; Adam Lopez
Abstract: Parallel text is the fuel that drives modern machine translation systems. The Web is a comprehensive source of preexisting parallel text, but crawling the entire web is impossible for all but the largest companies. We bring web-scale parallel text to the masses by mining the Common Crawl, a public Web crawl hosted on Amazon’s Elastic Cloud. Starting from nothing more than a set of common two-letter language codes, our open-source extension of the STRAND algorithm mined 32 terabytes of the crawl in just under a day, at a cost of about $500. Our large-scale experiment uncovers large amounts of parallel text in dozens of language pairs across a variety of domains and genres, some previously unavailable in curated datasets. Even with minimal cleaning and filtering, the resulting data boosts translation performance across the board for five different language pairs in the news domain, and on open domain test sets we see improvements of up to 5 BLEU. We make our code and data available for other researchers seeking to mine this rich new data resource.
5 0.83583206 361 acl-2013-Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers
Author: Graham Neubig
Abstract: In this paper we describe Travatar, a forest-to-string machine translation (MT) engine based on tree transducers. It provides an open-source C++ implementation for the entire forest-to-string MT pipeline, including rule extraction, tuning, decoding, and evaluation. There are a number of options for model training, and tuning includes advanced options such as hypergraph MERT, and training of sparse features through online learning. The training pipeline is modeled after that of the popular Moses decoder, so users familiar with Moses should be able to get started quickly. We perform a validation experiment of the decoder on English-Japanese machine translation, and find that it is possible to achieve greater accuracy than translation using phrase-based and hierarchical-phrase-based translation. As auxiliary results, we also compare different syntactic parsers and alignment techniques that we tested in the process of developing the decoder. Travatar is available under the LGPL at http://phontron.com/travatar
6 0.83475792 97 acl-2013-Cross-lingual Projections between Languages from Different Families
7 0.83175546 289 acl-2013-QuEst - A translation quality estimation framework
9 0.829032 333 acl-2013-Summarization Through Submodularity and Dispersion
10 0.82866466 236 acl-2013-Mapping Source to Target Strings without Alignment by Analogical Learning: A Case Study with Transliteration
11 0.82843518 326 acl-2013-Social Text Normalization using Contextual Graph Random Walks
12 0.82775885 137 acl-2013-Enlisting the Ghost: Modeling Empty Categories for Machine Translation
13 0.82739329 259 acl-2013-Non-Monotonic Sentence Alignment via Semisupervised Learning
14 0.82608008 9 acl-2013-A Lightweight and High Performance Monolingual Word Aligner
15 0.82356316 24 acl-2013-A Tale about PRO and Monsters
16 0.82176304 255 acl-2013-Name-aware Machine Translation
17 0.81885839 207 acl-2013-Joint Inference for Fine-grained Opinion Extraction
18 0.81854564 226 acl-2013-Learning to Prune: Context-Sensitive Pruning for Syntactic MT
19 0.81819284 204 acl-2013-Iterative Transformation of Annotation Guidelines for Constituency Parsing
20 0.81814945 235 acl-2013-Machine Translation Detection from Monolingual Web-Text