acl acl2013 acl2013-326 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Hany Hassan ; Arul Menezes
Abstract: We introduce a social media text normalization system that can be deployed as a preprocessing step for Machine Translation and various NLP applications to handle social media text. The proposed system is based on unsupervised learning of the normalization equivalences from unlabeled text. The proposed approach uses Random Walks on a contextual similarity bipartite graph constructed from n-gram sequences on large unlabeled text corpus. We show that the proposed approach has a very high precision of (92.43) and a reasonable recall of (56.4). When used as a preprocessing step for a state-of-the-art machine translation system, the translation quality on social media text improved by 6%. The proposed approach is domain and language independent and can be deployed as a preprocessing step for any NLP application to handle social media text.
Reference: text
sentIndex sentText sentNum sentScore
1 com Abstract We introduce a social media text normalization system that can be deployed as a preprocessing step for Machine Translation and various NLP applications to handle social media text. [sent-2, score-1.439]
2 The proposed system is based on unsupervised learning of the normalization equivalences from unlabeled text. [sent-3, score-0.97]
3 The proposed approach uses Random Walks on a contextual similarity bipartite graph constructed from n-gram sequences on large unlabeled text corpus. [sent-4, score-0.551]
4 When used as a preprocessing step for a state-of-the-art machine translation system, the translation quality on social media text improved by 6%. [sent-8, score-0.492]
5 The proposed approach is domain and language independent and can be deployed as a preprocessing step for any NLP application to handle social media text. [sent-9, score-0.48]
6 1 Introduction Social Media text is usually very noisy and contains a lot of typos, ad-hoc abbreviations, phonetic substitutions, customized abbreviations and slang language. [sent-10, score-0.449]
7 Natural language processing and understanding systems such as Machine Translation, Information Extraction and Text-to-Speech are usually trained and optimized for clean data; therefore such systems would face a challenging problem with social media text. [sent-12, score-0.393]
8 It is crucial to have a solution for text normalization that can adapt to such variations automatically. [sent-26, score-0.663]
9 We propose a text normalization approach using an unsupervised method to induce normalization equivalences from noisy data which can adapt to any genre of social media. [sent-27, score-1.961]
10 In this paper, we focus on providing a solution for social media text normalization as a preprocessing step for NLP applications. [sent-28, score-1.031]
11 Second, the same noisy word may have a different appropriate normalization depending on the context and on the domain. [sent-36, score-0.656]
12 Third, text normalization as a preprocessing step should have very high precision; in other words, it should provide conservative and confident normalization and not overcorrect. [sent-37, score-1.414]
13 Moreover, the text normalization should have high recall, as well, to have a good impact on the NLP applications. [sent-38, score-0.663]
14 In this paper, we introduce a social media text normalization system which addresses the challenges mentioned above. [sent-39, score-0.995]
15 The proposed system is based on constructing a lattice from possible normalization candidates and finding the best normalization sequence according to an n-gram language model using a Viterbi decoder. [sent-40, score-1.447]
16 We propose an unsupervised approach to learn the normalization candidates from unlabeled text data. [sent-41, score-0.807]
17 We evaluate the approach on the normalization task as well as machine translation task. [sent-44, score-0.671]
18 2 Related Work Early work handled the text normalization problem as a noisy channel model where the normalized words go through a noisy channel to produce the noisy text. [sent-46, score-1.778]
19 (Brill and Moore, 2000) introduced an approach for modeling the spelling errors as a noisy channel model based on string to string edits. [sent-47, score-0.56]
20 , 2007) introduced a supervised HMM channel model for text normalization, which was expanded by (Cook and Stevenson, 2009) into an unsupervised noisy channel model using probabilistic models for common abbreviations and various spelling error types. [sent-51, score-1.228]
21 Some researchers used a Statistical Machine Translation approach for text normalization, formalizing the problem as a translation from the noisy forms to the normalized forms. [sent-52, score-0.495]
22 The main drawback of these approaches is that the noisy channel model cannot accurately represent the error types without contextual information. [sent-55, score-0.44]
23 More recent approaches tried to handle the text normalization problem using normalization lexicons which map the noisy form of the word to a normalized form. [sent-56, score-1.773]
24 , 2011) proposed an approach that uses a classifier to identify noisy words that are candidates for normalization, and then uses some rules to generate lexical variants and a small normalization lexicon. [sent-58, score-0.981]
25 , 2011) proposed an approach using an impoverished normalization lexicon based on string and distributional similarity along with a dictionary lookup approach to detect noisy words. [sent-60, score-1.227]
26 , 2012) introduced a similar approach by generating a normalization lexicon based on distributional similarity and string similarity. [sent-62, score-0.891]
27 This approach uses pairwise similarity where any two words that share the same context are considered as normalization equivalences. [sent-63, score-0.815]
28 First, it does not take into account the relative frequencies of the normalization equivalences that might share different contexts. [sent-65, score-0.872]
29 Therefore, the selection of the normalization equivalences is performed on a pairwise basis only and is not optimized over the whole data. [sent-66, score-0.889]
30 Secondly, the normalization equivalences must appear in the exact same context to be considered as a normalization candidate. [sent-67, score-1.485]
31 Our approach is also lexicon-based: we construct a lattice from possible normalization candidates and find the best normalization sequence according to an n-gram language model using a Viterbi decoder. [sent-69, score-1.57]
32 The normalization lexicon is acquired from unlabeled data using random walks on a contextual similarity graph constructed from n-gram sequences on a large unlabeled text corpus. [sent-70, score-1.518]
33 However, our approach is significantly different since we acquire the lexicon using random walks on a contextual similarity graph which has a number of advantages over the pairwise similarity approach used in (Han et al. [sent-73, score-0.796]
34 Namely, the acquired normalization equivalences are optimized globally over the whole data, rare equivalences are not considered good candidates unless there is strong statistical evidence across the data, and finally the normalization equivalences need not share the same context. [sent-75, score-1.769]
35 3 Text Normalization System In this paper, we handle text normalization as a lattice scoring approach, where the translation is performed from noisy text as the source side to the normalized text as the target side. [sent-78, score-1.292]
36 We construct a lattice from possible normalization candidates and find the best normalization sequence according to an n-gram language model using a Viterbi decoder. [sent-80, score-1.382]
37 In this paper, we restrict the normalization lexicon to one-to-one word mappings; we do not consider multi-word mappings for the lexicon induction. [sent-81, score-0.931]
38 To identify OOV candidates for normalization, we propose normalization candidates only for words that appear in our induced normalization lexicon. [sent-82, score-1.578]
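To make the lattice scoring concrete, here is a minimal Python sketch of candidate lattice construction and Viterbi decoding. The lexicon entries, candidate costs, and bigram log-probabilities are illustrative placeholders rather than the authors' actual models; a real system would use a large n-gram language model and the combined cost of Eqn (4).

```python
# Hypothetical induced normalization lexicon: noisy word -> [(candidate, cost), ...].
# The costs stand in for the combined contextual/lexical similarity cost (Eqn 4).
LEXICON = {
    "gr8": [("great", 0.2)],
    "u": [("you", 0.1)],
    "r": [("are", 0.2)],
}

# Toy bigram log-probabilities; a real system would use a large n-gram LM.
BIGRAM_LOGP = {("<s>", "you"): -1.0, ("you", "are"): -0.5, ("are", "great"): -0.7}

def bigram_logp(prev, word, floor=-6.0):
    return BIGRAM_LOGP.get((prev, word), floor)

def candidates(token):
    """The token itself (kept unchanged at no cost) plus any lexicon candidates."""
    return [(token, 0.0)] + LEXICON.get(token, [])

def viterbi_normalize(tokens, lm_weight=1.0, cand_weight=1.0):
    """Pick the best path through the normalization lattice with a Viterbi pass."""
    beams = [{"<s>": (0.0, None)}]            # word -> (score, backpointer)
    for i, tok in enumerate(tokens):
        beam = {}
        for cand, cost in candidates(tok):
            for prev, (prev_score, _) in beams[i].items():
                score = (prev_score
                         + lm_weight * bigram_logp(prev, cand)
                         - cand_weight * cost)
                if cand not in beam or score > beam[cand][0]:
                    beam[cand] = (score, prev)
        beams.append(beam)
    # Backtrace from the best-scoring final word.
    word = max(beams[-1], key=lambda w: beams[-1][w][0])
    output = [word]
    for i in range(len(tokens), 1, -1):
        word = beams[i][word][1]
        output.append(word)
    return list(reversed(output))

print(viterbi_normalize(["u", "r", "gr8"]))   # -> ['you', 'are', 'great']
```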
39 1 Baseline Normalization Candidates Generation We experimented with two normalization candidate generators as baseline systems. [sent-86, score-0.779]
40 This approach overcomes the main problem of the dictionary-based approach, namely that it proposes normalization candidates that are inappropriate for the error styles found in social media text. [sent-94, score-1.026]
41 As we will show in the experiments in Section(5), dictionary-based normalization methods proved to be inadequate for social media domain normalization for many reasons. [sent-95, score-1.584]
42 1 Bipartite Graph Representation The main motivation of this approach is that normalization equivalences share similar contexts, which we call contextual similarity. [sent-100, score-0.934]
43 For instance, assuming 5-gram sequences of words, two words may be normalization equivalences if their n-gram contexts share the same two words on the left and the same two words on the right. [sent-101, score-0.926]
44 This contextual similarity can be represented as a bipartite graph with the first partite representing the words and the second partite representing the n-gram contexts that may be shared by words. [sent-103, score-0.413]
45 A word node can be either a normalized word or a noisy word. [sent-104, score-0.48]
46 Identifying whether a word is normalized or noisy (a candidate for normalization) is crucial, since this decision limits which words can be candidates for normalization. [sent-105, score-0.733]
47 We adopted a soft criterion for identifying noisy words. (Figure 1: Bipartite Graph Representation; left nodes represent contexts, gray right nodes represent the noisy words, and white right nodes represent the normalized words.) [sent-106, score-0.715]
48 Any word that appears rarely in the clean-data vocabulary (less than 10 times) is considered as a candidate for normalization (noisy word). [sent-112, score-0.677]
49 Figure(1) shows a sample of the bipartite graph G(W, C, E), where noisy words are shown as gray nodes. [sent-113, score-0.495]
50 While constructing the graph, we identify if a node represents a noisy word (N) (called source node) or a normalized word (M) (called absorbing node). [sent-117, score-0.581]
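A minimal sketch of how such a bipartite graph could be assembled from 5-gram sequences is shown below. The tokenizer, the tiny corpus, and the frequency threshold of 10 (the value mentioned later in the evaluation section) are illustrative assumptions; the point is only the word-partite / context-partite structure and the labelling of rare words as source (noisy) nodes.

```python
from collections import Counter, defaultdict

NOISY_THRESHOLD = 10   # words seen fewer than this many times in clean data are "noisy"

def build_vocab(clean_sentences):
    """Frequency counts over clean data, used to decide noisy vs. normalized words."""
    counts = Counter()
    for sent in clean_sentences:
        counts.update(sent.lower().split())
    return counts

def build_bipartite_graph(sentences, vocab, n=5):
    """Return (word -> {context: count}) edges plus the set of noisy (source) words.

    A context is the n-gram with its center word removed, e.g. for a 5-gram
    it is the two words on the left plus the two words on the right.
    """
    edges = defaultdict(Counter)   # word partite -> context partite, with edge counts
    noisy_words = set()
    half = n // 2
    for sent in sentences:
        toks = sent.lower().split()
        for i in range(half, len(toks) - half):
            word = toks[i]
            context = tuple(toks[i - half:i] + toks[i + 1:i + half + 1])
            edges[word][context] += 1
            if vocab.get(word, 0) < NOISY_THRESHOLD:
                noisy_words.add(word)          # source node (candidate for normalization)
    return edges, noisy_words

# Tiny illustrative corpus (real input would be millions of sentences).
clean = ["i will see you tomorrow at the game"] * 20
noisy = ["i will see u tomorrow at the game"]
vocab = build_vocab(clean)
edges, sources = build_bipartite_graph(clean + noisy, vocab)
print(sources)   # {'u'}; 'u' shares a context node with the normalized word 'you'
```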
51 The main objective is to identify pairs of noisy and normalized words that can be considered as normalization equivalences. [sent-122, score-1.042]
52 For example, (Hughes and Ramage, 2007) used random walks on Wordnet graph to measure lexical semantic relatedness between words. [sent-124, score-0.41]
53 In this paper, we apply the label propagation approach to the text normalization problem. [sent-127, score-0.663]
54 Consider a random walk on the bipartite graph G(W, C, E) starting at a noisy word (source node) and ending at a normalized word (absorbing node). [sent-128, score-0.837]
55 The walker starts from any source node Ni belonging to the noisy words and then moves to any other connected node Mj with probability Pij. [sent-129, score-0.426]
56 This is due to the probability normalization which is done according to the nodes connectivity. [sent-132, score-0.727]
57 It is worth noting that, due to the bipartite graph representation, any word node, either noisy (source) or normalized (absorbing), is only connected to context nodes and not directly connected to any other word node. [sent-134, score-0.858]
58 For any random walk the number of steps taken to traverse between any two nodes is called the hitting time (Norris, 1997). [sent-136, score-0.467]
59 Therefore, the hitting time between a noisy and a normalized pair of nodes (n, m) with a walk r is hr (n, m). [sent-137, score-0.718]
60 It is worth noting that the random walks are selected according to the transition probability in Eqn(1); therefore, the more probable paths will be picked more frequently. [sent-142, score-0.422]
61 The same pair of nodes can be connected with many walks of various steps (hits), and the same noisy word can be connected to many other normalized words. [sent-143, score-0.865]
62 We define the contextual similarity probability of a normalization equivalence pair n, m as L(n, m). [sent-144, score-0.745]
63 This is the relative frequency of the average hits between those two nodes, H(n, m), with respect to all other normalized nodes linked to that noisy word. [sent-145, score-0.593]
64 Thus L(n, m) is calculated as: L(n, m) = H(n, m) / Σ_i H(n, m_i) (3). Furthermore, we add another similarity cost between a noisy word and a normalized word based on the lexical similarity cost, SimCost(n, m), which we will describe in the next section. [sent-146, score-0.603]
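The sketch below makes the walk concrete under simplifying assumptions: transition probabilities are proportional to edge counts (standing in for Eqn (1)), short walks are sampled from each source node, the step count to an absorbing node is recorded as the hitting time, and raw hit counts stand in for the averaged quantity H(n, m) when computing the relative frequency L(n, m) of Eqn (3). The toy graph and walk parameters are illustrative only.

```python
import random
from collections import defaultdict

def normalize_probs(counter):
    total = sum(counter.values())
    return {k: v / total for k, v in counter.items()}

def sample_walks(word_to_ctx, ctx_to_word, noisy, normalized,
                 n_walks=1000, max_steps=6, seed=0):
    """Sample random walks from each noisy node; return hit counts and summed hitting times."""
    rng = random.Random(seed)
    hits = defaultdict(int)          # (noisy, normalized) -> number of walks connecting them
    steps_sum = defaultdict(int)     # (noisy, normalized) -> total hitting time over those walks
    for n in noisy:
        for _ in range(n_walks):
            node, steps = n, 0
            while steps < max_steps:
                # word -> context -> word, each move chosen by transition probability
                ctx_probs = normalize_probs(word_to_ctx[node])
                ctx = rng.choices(list(ctx_probs), weights=ctx_probs.values())[0]
                word_probs = normalize_probs(ctx_to_word[ctx])
                node = rng.choices(list(word_probs), weights=word_probs.values())[0]
                steps += 2
                if node in normalized:           # absorbing node reached
                    hits[(n, node)] += 1
                    steps_sum[(n, node)] += steps
                    break
    return hits, steps_sum

def contextual_similarity(hits, noisy_word):
    """L(n, m): relative frequency of hits from one noisy word to each normalized word."""
    row = {m: h for (n, m), h in hits.items() if n == noisy_word}
    total = sum(row.values())
    return {m: h / total for m, h in row.items()} if total else {}

# Toy graph: 'u' (noisy) and 'you' (normalized) share a single context node.
word_to_ctx = {"u": {("will", "see", "tomorrow", "at"): 1},
               "you": {("will", "see", "tomorrow", "at"): 20}}
ctx_to_word = {("will", "see", "tomorrow", "at"): {"u": 1, "you": 20}}
hits, _ = sample_walks(word_to_ctx, ctx_to_word, noisy={"u"}, normalized={"you"})
print(contextual_similarity(hits, "u"))        # -> {'you': 1.0}
```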
65 This cost function is defined as the ratio of LCSR and Edit distance between two strings as follows: SimCost(n, m) = LCSR(n, m) / ED(n, m) (5), LCSR(n, m) = LCS(n, m) / MaxLength(n, m) (6). We have modified the Edit Distance calculation ED(n, m) to be more adequate for social media text. [sent-155, score-0.379]
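The lexical similarity cost of Eqns (5) and (6) can be sketched directly from the definitions: LCSR is the longest common subsequence length divided by the length of the longer string, and SimCost is LCSR divided by the edit distance. The modified, social-media-aware edit distance mentioned above is not reproduced here; this sketch assumes plain Levenshtein distance.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def edit_distance(a, b):
    """Plain Levenshtein distance (the paper modifies this for social media text)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

def lcsr(a, b):
    return lcs_length(a, b) / max(len(a), len(b))          # Eqn (6)

def sim_cost(a, b):
    ed = edit_distance(a, b)
    return lcsr(a, b) / ed if ed else float("inf")          # Eqn (5)

print(round(sim_cost("togethaa", "together"), 3))           # 0.375
```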
66 1 Training and Evaluation Data We collected a large amount of social media data to generate the normalization lexicon using the random walk approach. [sent-158, score-1.207]
67 We combined the noisy and the clean data to induce the normalization dictionary from them. [sent-162, score-0.954]
68 We constructed a test set of 1000 social media sentences that had been corrected by a native human annotator; the main guideline was to normalize noisy words to their corresponding clean words in a consistent way according to the evidence in the context. [sent-164, score-0.743]
69 Furthermore, we developed a test set for evaluating the effect of the normalization system when used as a preprocessing step for Machine translation. [sent-166, score-0.721]
70 The machine translation test set is composed of 500 sentences of social media English text translated to normalized Spanish text by a bi-lingual translator. [sent-167, score-0.566]
71 2 Evaluating Normalization Lexicon Generation We extracted 5-gram sequences from the combined noisy and clean data; we then limited the space of noisy 5-gram sequences to those in which the center word is the only noisy word and all other words, representing the context, are not noisy. [sent-169, score-0.974]
72 As we mentioned before, we identify whether the word is noisy or not by looking up a vocabulary list constructed from clean data. [sent-170, score-0.44]
73 Any word that appears less than 10 times in this vocabulary is considered noisy and a candidate for normalization during the lexicon induction process. [sent-172, score-1.126]
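A small sketch of this filtering step, using the same illustrative threshold: keep only 5-grams whose center word is rare in the clean-data vocabulary and whose context words are not.

```python
def is_noisy(word, clean_counts, threshold=10):
    """A word is treated as noisy if it is rare (or absent) in the clean-data vocabulary."""
    return clean_counts.get(word, 0) < threshold

def usable_5grams(five_grams, clean_counts):
    """Keep 5-grams whose center word is the only noisy word; the context must be clean."""
    kept = []
    for gram in five_grams:
        center = gram[2]
        context = gram[:2] + gram[3:]
        if is_noisy(center, clean_counts) and not any(is_noisy(w, clean_counts) for w in context):
            kept.append(gram)
    return kept

clean_counts = {"will": 50, "see": 40, "tomorrow": 30, "at": 60, "you": 45}
grams = [("will", "see", "u", "tomorrow", "at"),      # kept: only the center word is noisy
         ("will", "c", "u", "tomorrow", "at")]        # dropped: a context word is noisy too
print(usable_5grams(grams, clean_counts))
```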
74 It is worth noting that our notion of a noisy word does not mean it is an OOV that has to be corrected; instead, it indicates that the word is a candidate for correction but may be left unnormalized if there is no confident normalization for it. [sent-173, score-1.219]
75 We experimented with two candidate generators as baseline systems, namely the dictionary-based spelling correction and the trie approximate match with K errors; where K=3. [sent-181, score-0.396]
76 We compared those approaches with our newly proposed unsupervised normalization lexicon induction; for this case the cost for a candidate is the combined cost of the contextual similarity probability and the lexical similarity cost as defined in Eqn(4). [sent-183, score-1.302]
77 We examine the effect of data size and the steps of the random walks on the accuracy and the coverage of the induced dictionary. [sent-184, score-0.4]
78 Finally, we pruned the lexicon to keep the top 5 candidates per noisy word. [sent-192, score-0.488]
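The sketch below shows one way the lexicon entries could be assembled: combine the contextual similarity probability with the lexical similarity score and keep the top 5 candidates per noisy word. The product used here is only a stand-in for the paper's Eqn (4) combination, and the example scores are made up.

```python
from collections import defaultdict

def build_lexicon(contextual_prob, lexical_sim, top_k=5):
    """Combine the two scores and keep the top_k candidates per noisy word.

    contextual_prob: {(noisy, normalized): L(n, m)}
    lexical_sim:     {(noisy, normalized): lexical similarity score}
    The product below is only a stand-in for the paper's actual Eqn (4) combination.
    """
    combined = defaultdict(list)
    for (noisy, norm), l_prob in contextual_prob.items():
        score = round(l_prob * lexical_sim.get((noisy, norm), 0.0), 3)
        combined[noisy].append((norm, score))
    return {noisy: sorted(cands, key=lambda c: -c[1])[:top_k]
            for noisy, cands in combined.items()}

lexicon = build_lexicon({("gr8", "great"): 0.8, ("gr8", "grate"): 0.2},
                        {("gr8", "great"): 0.6, ("gr8", "grate"): 0.5})
print(lexicon)   # {'gr8': [('great', 0.48), ('grate', 0.1)]}
```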
79 Next, we will examine the effect of lexicon size on the normalization task. [sent-199, score-0.779]
80 On the other hand, the induced normalization lexicon approach performs much better even with a small amount of data, as we can see with system RW1, which uses Lex1 generated from 20M sentences and has 123K lexicon entries. [sent-210, score-0.996]
81 , 2012), which used pairwise contextual similarity to induce a normalization lexicon of 40K entries; we will refer to this lexicon as HB-Dict. [sent-217, score-1.136]
82 The contextual graph random walks approach helps in providing a high-precision lexicon, since the sampling nature of the approach helps filter out unreliable normalization equivalences. [sent-226, score-1.284]
83 The random walks will traverse more frequent paths, which leads to more probable normalization equivalences. [sent-227, score-0.989]
84 Since the proposed approach deploys random walks to sample paths that can traverse many steps, this relaxes the constraints that the normalization equivalences have to share the same context. [sent-229, score-1.305]
85 Instead a noisy word may share a context with another noisy word which in turn shares a context with a clean equivalent normalization word. [sent-230, score-1.382]
86 Therefore, we end up with a lexicon that has much higher recall than the pairwise similarity approach, since it explores equivalences beyond the pairwise relation. [sent-231, score-0.53]
87 5 Output Analysis Table 4 shows some examples of the induced normalization equivalences; the first part shows good examples where vowels are restored and phonetically similar words are matched. [sent-234, score-0.759]
88 On the other hand, the lexicon has some bad normalizations, such as "unrecycled", which should be normalized to "non recycled"; but since the system is limited to one-word corrections, it did not get this one. [sent-237, score-1.02]
89 Another interesting bad normalization is "tutting", which is a new type of dancing and should not be corrected to "tweeting". [sent-238, score-0.656]
90 Table 5 lists a number of examples and their normalization using both Baseline1 and RW3. [sent-239, score-0.627]
91 In the first example, RW3 got the correct normalization "interesting", which apparently is not the one with the shortest edit distance, though it is the most frequent candidate in the generated lexicon. [sent-240, score-0.737]
92 The baseline system did not get it right; it got a wrong normalization with shorter edit distance. [sent-241, score-0.716]
93 In Example 3, neither the baseline nor RW3 got the correct normalization of "yur" to "you are", which is currently a limitation of our system, since we only allow one-to-one word mappings in the generated lexicons, not one-to-many or many-to-many. [sent-243, score-0.686]
94 This shows a characteristic of the proposed approach; it is very conservative in proposing normalization which is desirable as a preprocessing step for NLP applications. [sent-245, score-0.756]
95 Finally, Example 4 also shows that the system normalizes "gr8", which is mainly due to having a flexible similarity cost during normalization lexicon construction. [sent-247, score-0.94]
96 The translation with normalization was improved by about 6% from 29. [sent-257, score-0.671]
97 6 Conclusion and Future Work We introduced a social media text normalization system that can be deployed as a preprocessor for MT and various NLP applications to handle social media text. [sent-262, score-1.374]
98 We show that the proposed unsupervised approach provides a normalization system with very high precision and a reasonable recall. [sent-264, score-0.76]
99 In future work, we will extend the approach to handle many-to-many normalization pairs; we also plan to apply the approach to more languages. [sent-267, score-0.665]
100 An improved error model for noisy channel spelling correction. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL 2000). [sent-280, score-0.42]
wordName wordTfidf (topN-words)
[('normalization', 0.627), ('noisy', 0.268), ('walks', 0.246), ('equivalences', 0.202), ('media', 0.164), ('lexicon', 0.152), ('normalized', 0.147), ('social', 0.139), ('bipartite', 0.133), ('walk', 0.125), ('absorbing', 0.101), ('nodes', 0.1), ('graph', 0.094), ('clean', 0.09), ('channel', 0.082), ('oov', 0.079), ('trie', 0.078), ('hitting', 0.078), ('eqn', 0.076), ('cost', 0.076), ('spelling', 0.07), ('random', 0.07), ('generators', 0.07), ('han', 0.069), ('candidates', 0.068), ('lcsr', 0.067), ('preprocessing', 0.065), ('node', 0.065), ('correction', 0.065), ('contextual', 0.062), ('pairwise', 0.06), ('edit', 0.06), ('lattice', 0.06), ('norris', 0.057), ('tkin', 0.057), ('similarity', 0.056), ('string', 0.056), ('phonetic', 0.054), ('constructed', 0.053), ('candidate', 0.05), ('steps', 0.048), ('sms', 0.048), ('hits', 0.047), ('traverse', 0.046), ('translation', 0.044), ('viterbi', 0.044), ('share', 0.043), ('pij', 0.042), ('vowels', 0.042), ('unlabeled', 0.041), ('transition', 0.04), ('sequences', 0.04), ('deployed', 0.038), ('handle', 0.038), ('contractor', 0.038), ('simcost', 0.038), ('zobel', 0.038), ('text', 0.036), ('induced', 0.036), ('proposed', 0.036), ('unsupervised', 0.035), ('paths', 0.035), ('partite', 0.034), ('checker', 0.034), ('customized', 0.034), ('szummer', 0.034), ('minkov', 0.034), ('precision', 0.033), ('substitution', 0.032), ('dictionary', 0.032), ('experimented', 0.032), ('accent', 0.031), ('dro', 0.031), ('gouws', 0.031), ('approximate', 0.031), ('abbreviations', 0.031), ('confident', 0.031), ('noting', 0.031), ('lexicons', 0.03), ('cook', 0.03), ('corrected', 0.029), ('system', 0.029), ('hughes', 0.029), ('messaging', 0.029), ('vocabulary', 0.029), ('context', 0.029), ('conservative', 0.028), ('errors', 0.028), ('connected', 0.028), ('spell', 0.028), ('normalisation', 0.028), ('melamed', 0.028), ('shares', 0.028), ('induce', 0.027), ('discusses', 0.027), ('messages', 0.027), ('inadequate', 0.027), ('brill', 0.027), ('slang', 0.026), ('ramage', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999994 326 acl-2013-Social Text Normalization using Contextual Graph Random Walks
Author: Hany Hassan ; Arul Menezes
Abstract: We introduce a social media text normalization system that can be deployed as a preprocessing step for Machine Translation and various NLP applications to handle social media text. The proposed system is based on unsupervised learning of the normalization equivalences from unlabeled text. The proposed approach uses Random Walks on a contextual similarity bipartite graph constructed from n-gram sequences on large unlabeled text corpus. We show that the proposed approach has a very high precision of (92.43) and a reasonable recall of (56.4). When used as a preprocessing step for a state-of-the-art machine translation system, the translation quality on social media text improved by 6%. The proposed approach is domain and language independent and can be deployed as a preprocessing step for any NLP application to handle social media text.
2 0.44981983 37 acl-2013-Adaptive Parser-Centric Text Normalization
Author: Congle Zhang ; Tyler Baldwin ; Howard Ho ; Benny Kimelfeld ; Yunyao Li
Abstract: Text normalization is an important first step towards enabling many Natural Language Processing (NLP) tasks over informal text. While many of these tasks, such as parsing, perform the best over fully grammatically correct text, most existing text normalization approaches narrowly define the task in the word-to-word sense; that is, the task is seen as that of mapping all out-of-vocabulary non-standard words to their in-vocabulary standard forms. In this paper, we take a parser-centric view of normalization that aims to convert raw informal text into grammatically correct text. To understand the real effect of normalization on the parser, we tie normal- ization performance directly to parser performance. Additionally, we design a customizable framework to address the often overlooked concept of domain adaptability, and illustrate that the system allows for transfer to new domains with a minimal amount of data and effort. Our experimental study over datasets from three domains demonstrates that our approach outperforms not only the state-of-the-art wordto-word normalization techniques, but also manual word-to-word annotations.
3 0.15994458 174 acl-2013-Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
Author: Majid Razmara ; Maryam Siahbani ; Reza Haffari ; Anoop Sarkar
Abstract: Out-of-vocabulary (oov) words or phrases still remain a challenge in statistical machine translation especially when a limited amount of parallel text is available for training or when there is a domain shift from training data to test data. In this paper, we propose a novel approach to finding translations for oov words. We induce a lexicon by constructing a graph on source language monolingual text and employ a graph propagation technique in order to find translations for all the source language phrases. Our method differs from previous approaches by adopting a graph propagation approach that takes into account not only one-step (from oov directly to a source language phrase that has a translation) but multi-step paraphrases from oov source language words to other source language phrases and eventually to target language translations. Experimental results show that our graph propagation method significantly improves performance over two strong baselines under intrinsic and extrinsic evaluation metrics.
Author: Simone Paolo Ponzetto ; Andrea Zielinski
Abstract: unkown-abstract
5 0.1045595 115 acl-2013-Detecting Event-Related Links and Sentiments from Social Media Texts
Author: Alexandra Balahur ; Hristo Tanev
Abstract: Nowadays, the importance of Social Media is constantly growing, as people often use such platforms to share mainstream media news and comment on the events that they relate to. As such, people no loger remain mere spectators to the events that happen in the world, but become part of them, commenting on their developments and the entities involved, sharing their opinions and distributing related content. This paper describes a system that links the main events detected from clusters of newspaper articles to tweets related to them, detects complementary information sources from the links they contain and subsequently applies sentiment analysis to classify them into positive, negative and neutral. In this manner, readers can follow the main events happening in the world, both from the perspective of mainstream as well as social media and the public’s perception on them. This system will be part of the EMM media monitoring framework working live and it will be demonstrated using Google Earth.
6 0.10044308 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation
7 0.096646287 68 acl-2013-Bilingual Data Cleaning for SMT using Graph-based Random Walk
8 0.089110874 148 acl-2013-Exploring Sentiment in Social Media: Bootstrapping Subjectivity Clues from Multilingual Twitter Streams
9 0.088827476 301 acl-2013-Resolving Entity Morphs in Censored Data
10 0.083756536 139 acl-2013-Entity Linking for Tweets
11 0.083440304 173 acl-2013-Graph-based Semi-Supervised Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging
12 0.082608983 43 acl-2013-Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity
13 0.077755168 17 acl-2013-A Random Walk Approach to Selectional Preferences Based on Preference Ranking and Propagation
14 0.068232998 45 acl-2013-An Empirical Study on Uncertainty Identification in Social Media Context
15 0.067297712 93 acl-2013-Context Vector Disambiguation for Bilingual Lexicon Extraction from Comparable Corpora
16 0.066812068 272 acl-2013-Paraphrase-Driven Learning for Open Question Answering
17 0.064774074 308 acl-2013-Scalable Modified Kneser-Ney Language Model Estimation
18 0.063611686 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling
19 0.062255733 221 acl-2013-Learning Non-linear Features for Machine Translation Using Gradient Boosting Machines
20 0.058413308 182 acl-2013-High-quality Training Data Selection using Latent Topics for Graph-based Semi-supervised Learning
topicId topicWeight
[(0, 0.197), (1, 0.024), (2, 0.039), (3, 0.017), (4, 0.072), (5, 0.002), (6, 0.02), (7, 0.061), (8, 0.062), (9, -0.064), (10, -0.08), (11, 0.03), (12, 0.026), (13, -0.118), (14, 0.017), (15, -0.038), (16, 0.034), (17, 0.012), (18, -0.048), (19, 0.027), (20, 0.042), (21, 0.047), (22, 0.116), (23, -0.016), (24, -0.026), (25, 0.073), (26, 0.016), (27, 0.123), (28, -0.031), (29, -0.039), (30, -0.124), (31, -0.113), (32, -0.185), (33, 0.183), (34, 0.165), (35, -0.149), (36, -0.142), (37, -0.061), (38, -0.121), (39, 0.037), (40, 0.109), (41, 0.085), (42, 0.167), (43, -0.194), (44, -0.103), (45, 0.053), (46, -0.123), (47, -0.042), (48, 0.03), (49, 0.163)]
simIndex simValue paperId paperTitle
same-paper 1 0.95892739 326 acl-2013-Social Text Normalization using Contextual Graph Random Walks
Author: Hany Hassan ; Arul Menezes
Abstract: We introduce a social media text normalization system that can be deployed as a preprocessing step for Machine Translation and various NLP applications to handle social media text. The proposed system is based on unsupervised learning of the normalization equivalences from unlabeled text. The proposed approach uses Random Walks on a contextual similarity bipartite graph constructed from n-gram sequences on large unlabeled text corpus. We show that the proposed approach has a very high precision of (92.43) and a reasonable recall of (56.4). When used as a preprocessing step for a state-of-the-art machine translation system, the translation quality on social media text improved by 6%. The proposed approach is domain and language independent and can be deployed as a preprocessing step for any NLP application to handle social media text.
2 0.89022785 37 acl-2013-Adaptive Parser-Centric Text Normalization
Author: Congle Zhang ; Tyler Baldwin ; Howard Ho ; Benny Kimelfeld ; Yunyao Li
Abstract: Text normalization is an important first step towards enabling many Natural Language Processing (NLP) tasks over informal text. While many of these tasks, such as parsing, perform the best over fully grammatically correct text, most existing text normalization approaches narrowly define the task in the word-to-word sense; that is, the task is seen as that of mapping all out-of-vocabulary non-standard words to their in-vocabulary standard forms. In this paper, we take a parser-centric view of normalization that aims to convert raw informal text into grammatically correct text. To understand the real effect of normalization on the parser, we tie normal- ization performance directly to parser performance. Additionally, we design a customizable framework to address the often overlooked concept of domain adaptability, and illustrate that the system allows for transfer to new domains with a minimal amount of data and effort. Our experimental study over datasets from three domains demonstrates that our approach outperforms not only the state-of-the-art wordto-word normalization techniques, but also manual word-to-word annotations.
3 0.52107537 174 acl-2013-Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
Author: Majid Razmara ; Maryam Siahbani ; Reza Haffari ; Anoop Sarkar
Abstract: Out-of-vocabulary (oov) words or phrases still remain a challenge in statistical machine translation especially when a limited amount of parallel text is available for training or when there is a domain shift from training data to test data. In this paper, we propose a novel approach to finding translations for oov words. We induce a lexicon by constructing a graph on source language monolingual text and employ a graph propagation technique in order to find translations for all the source language phrases. Our method differs from previous approaches by adopting a graph propagation approach that takes into account not only one-step (from oov directly to a source language phrase that has a translation) but multi-step paraphrases from oov source language words to other source language phrases and eventually to target language translations. Experimental results show that our graph propagation method significantly improves performance over two strong baselines under intrinsic and extrinsic evaluation metrics.
4 0.48321399 274 acl-2013-Parsing Graphs with Hyperedge Replacement Grammars
Author: David Chiang ; Jacob Andreas ; Daniel Bauer ; Karl Moritz Hermann ; Bevan Jones ; Kevin Knight
Abstract: Hyperedge replacement grammar (HRG) is a formalism for generating and transforming graphs that has potential applications in natural language understanding and generation. A recognition algorithm due to Lautemann is known to be polynomial-time for graphs that are connected and of bounded degree. We present a more precise characterization of the algorithm’s complexity, an optimization analogous to binarization of contextfree grammars, and some important implementation details, resulting in an algorithm that is practical for natural-language applications. The algorithm is part of Bolinas, a new software toolkit for HRG processing.
5 0.47008687 182 acl-2013-High-quality Training Data Selection using Latent Topics for Graph-based Semi-supervised Learning
Author: Akiko Eriguchi ; Ichiro Kobayashi
Abstract: In a multi-class document categorization using graph-based semi-supervised learning (GBSSL), it is essential to construct a proper graph expressing the relation among nodes and to use a reasonable categorization algorithm. Furthermore, it is also important to provide high-quality correct data as training data. In this context, we propose a method to construct a similarity graph by employing both surface information and latent information to express similarity between nodes and a method to select high-quality training data for GBSSL by means of the PageR- ank algorithm. Experimenting on Reuters21578 corpus, we have confirmed that our proposed methods work well for raising the accuracy of a multi-class document categorization.
6 0.4447785 1 acl-2013-"Let Everything Turn Well in Your Wife": Generation of Adult Humor Using Lexical Constraints
7 0.42303354 91 acl-2013-Connotation Lexicon: A Dash of Sentiment Beneath the Surface Meaning
8 0.4181135 293 acl-2013-Random Walk Factoid Annotation for Collective Discourse
10 0.39165887 89 acl-2013-Computerized Analysis of a Verbal Fluency Test
11 0.3898586 114 acl-2013-Detecting Chronic Critics Based on Sentiment Polarity and Userâ•Žs Behavior in Social Media
12 0.38629153 308 acl-2013-Scalable Modified Kneser-Ney Language Model Estimation
13 0.38192123 173 acl-2013-Graph-based Semi-Supervised Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging
14 0.37179852 301 acl-2013-Resolving Entity Morphs in Censored Data
15 0.35912099 371 acl-2013-Unsupervised joke generation from big data
16 0.35749969 65 acl-2013-BRAINSUP: Brainstorming Support for Creative Sentence Generation
17 0.34574464 48 acl-2013-An Open Source Toolkit for Quantitative Historical Linguistics
18 0.33899236 324 acl-2013-Smatch: an Evaluation Metric for Semantic Feature Structures
19 0.33258155 149 acl-2013-Exploring Word Order Universals: a Probabilistic Graphical Model Approach
20 0.32787055 280 acl-2013-Plurality, Negation, and Quantification:Towards Comprehensive Quantifier Scope Disambiguation
topicId topicWeight
[(0, 0.064), (6, 0.029), (11, 0.064), (24, 0.048), (26, 0.084), (28, 0.013), (35, 0.076), (42, 0.038), (48, 0.033), (64, 0.016), (68, 0.111), (70, 0.072), (88, 0.022), (90, 0.042), (95, 0.205)]
simIndex simValue paperId paperTitle
same-paper 1 0.93490601 326 acl-2013-Social Text Normalization using Contextual Graph Random Walks
Author: Hany Hassan ; Arul Menezes
Abstract: We introduce a social media text normalization system that can be deployed as a preprocessing step for Machine Translation and various NLP applications to handle social media text. The proposed system is based on unsupervised learning of the normalization equivalences from unlabeled text. The proposed approach uses Random Walks on a contextual similarity bipartite graph constructed from n-gram sequences on large unlabeled text corpus. We show that the proposed approach has a very high precision of (92.43) and a reasonable recall of (56.4). When used as a preprocessing step for a state-of-the-art machine translation system, the translation quality on social media text improved by 6%. The proposed approach is domain and language independent and can be deployed as a preprocessing step for any NLP application to handle social media text.
2 0.89431334 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl
Author: Jason R. Smith ; Herve Saint-Amand ; Magdalena Plamada ; Philipp Koehn ; Chris Callison-Burch ; Adam Lopez
Abstract: Parallel text is the fuel that drives modern machine translation systems. The Web is a comprehensive source of preexisting parallel text, but crawling the entire web is impossible for all but the largest companies. We bring web-scale parallel text to the masses by mining the Common Crawl, a public Web crawl hosted on Amazon’s Elastic Cloud. Starting from nothing more than a set of common two-letter language codes, our open-source extension of the STRAND algorithm mined 32 terabytes of the crawl in just under a day, at a cost of about $500. Our large-scale experiment uncovers large amounts of parallel text in dozens of language pairs across a variety of domains and genres, some previously unavailable in curated datasets. Even with minimal cleaning and filtering, the resulting data boosts translation performance across the board for five different language pairs in the news domain, and on open domain test sets we see improvements of up to 5 BLEU. We make our code and data available for other researchers seeking to mine this rich new data resource.1
3 0.89330208 240 acl-2013-Microblogs as Parallel Corpora
Author: Wang Ling ; Guang Xiang ; Chris Dyer ; Alan Black ; Isabel Trancoso
Abstract: In the ever-expanding sea of microblog data, there is a surprising amount of naturally occurring parallel text: some users create post multilingual messages targeting international audiences while others “retweet” translations. We present an efficient method for detecting these messages and extracting parallel segments from them. We have been able to extract over 1M Chinese-English parallel segments from Sina Weibo (the Chinese counterpart of Twitter) using only their public APIs. As a supplement to existing parallel training data, our automatically extracted parallel data yields substantial translation quality improvements in translating microblog text and modest improvements in translating edited news commentary. The resources in described in this paper are available at http://www.cs.cmu.edu/∼lingwang/utopia.
4 0.88769329 289 acl-2013-QuEst - A translation quality estimation framework
Author: Lucia Specia ; ; ; Kashif Shah ; Jose G.C. de Souza ; Trevor Cohn
Abstract: We describe QUEST, an open source framework for machine translation quality estimation. The framework allows the extraction of several quality indicators from source segments, their translations, external resources (corpora, language models, topic models, etc.), as well as language tools (parsers, part-of-speech tags, etc.). It also provides machine learning algorithms to build quality estimation models. We benchmark the framework on a number of datasets and discuss the efficacy of features and algorithms.
5 0.88527602 5 acl-2013-A Decade of Automatic Content Evaluation of News Summaries: Reassessing the State of the Art
Author: Peter A. Rankel ; John M. Conroy ; Hoa Trang Dang ; Ani Nenkova
Abstract: How good are automatic content metrics for news summary evaluation? Here we provide a detailed answer to this question, with a particular focus on assessing the ability of automatic evaluations to identify statistically significant differences present in manual evaluation of content. Using four years of data from the Text Analysis Conference, we analyze the performance of eight ROUGE variants in terms of accuracy, precision and recall in finding significantly different systems. Our experiments show that some of the neglected variants of ROUGE, based on higher order n-grams and syntactic dependencies, are most accurate across the years; the commonly used ROUGE-1 scores find too many significant differences between systems which manual evaluation would deem comparable. We also test combinations ofROUGE variants and find that they considerably improve the accuracy of automatic prediction.
6 0.88350725 361 acl-2013-Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers
7 0.88264 255 acl-2013-Name-aware Machine Translation
8 0.88222218 162 acl-2013-FrameNet on the Way to Babel: Creating a Bilingual FrameNet Using Wiktionary as Interlingual Connection
9 0.88170183 37 acl-2013-Adaptive Parser-Centric Text Normalization
10 0.87866902 292 acl-2013-Question Classification Transfer
11 0.87512553 180 acl-2013-Handling Ambiguities of Bilingual Predicate-Argument Structures for Statistical Machine Translation
12 0.87405801 97 acl-2013-Cross-lingual Projections between Languages from Different Families
13 0.87186497 135 acl-2013-English-to-Russian MT evaluation campaign
14 0.86977971 211 acl-2013-LABR: A Large Scale Arabic Book Reviews Dataset
15 0.86933166 66 acl-2013-Beam Search for Solving Substitution Ciphers
16 0.86849332 288 acl-2013-Punctuation Prediction with Transition-based Parsing
17 0.86732793 374 acl-2013-Using Context Vectors in Improving a Machine Translation System with Bridge Language
18 0.86499524 25 acl-2013-A Tightly-coupled Unsupervised Clustering and Bilingual Alignment Model for Transliteration
19 0.863612 81 acl-2013-Co-Regression for Cross-Language Review Rating Prediction
20 0.86334014 137 acl-2013-Enlisting the Ghost: Modeling Empty Categories for Machine Translation