acl acl2012 acl2012-2 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Fei Liu ; Fuliang Weng ; Xiao Jiang
Abstract: Social media language contains a huge amount and wide variety of nonstandard tokens, created both intentionally and unintentionally by the users. It is of crucial importance to normalize the noisy nonstandard tokens before applying other NLP techniques. A major challenge facing this task is the system coverage, i.e., for any user-created nonstandard term, the system should be able to restore the correct word within its top n output candidates. In this paper, we propose a cognitively-driven normalization system that integrates different human perspectives in normalizing the nonstandard tokens, including the enhanced letter transformation, visual priming, and string/phonetic similarity. The system was evaluated on both the word level and the message level using four SMS and Twitter data sets. Results show that our system achieves over 90% word-coverage across all data sets (a 10% absolute increase compared to state-of-the-art); the broad word-coverage can also successfully translate into message-level performance gain, yielding a 6% absolute increase compared to the best prior approach.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Social media language contains a huge amount and wide variety of nonstandard tokens, created both intentionally and unintentionally by the users. [sent-7, score-0.729]
2 It is of crucial importance to normalize the noisy nonstandard tokens before applying other NLP techniques. [sent-8, score-0.81]
3 i.e., for any user-created nonstandard term, the system should be able to restore the correct word within its top n output candidates. [sent-11, score-0.672]
4 In this paper, we propose a cognitively-driven normalization system that integrates different human perspectives in normalizing the nonstandard tokens, including the enhanced letter transformation, visual priming, and string/phonetic similarity. [sent-12, score-1.365]
5 1 Introduction The amount of user-generated content has increased drastically in the past few years, driven by the prosperous development of social media websites such as Twitter, Facebook, and Google+. [sent-16, score-0.157]
6 Yet existing systems often perform poorly in this domain due to the extensive use of nonstandard tokens, emoticons, incomplete and ungrammatical sentences, etc. [sent-28, score-0.604]
7 Text normalization is also crucial for building robust text-to-speech (TTS) systems, which need to determine the pronunciations for nonstandard words in the social text. [sent-37, score-0.866]
8 The goal of this work is to automatically convert the noisy nonstandard tokens observed in the social text into standard English words. [sent-38, score-0.886]
9 We aim for a robust text normalization system with “broad coverage”, i.e., [sent-39, score-0.18]
10 for any user-created nonstandard token, the system should be able to restore the correct word within its top n candidates (n = 1, 3, 10. [sent-41, score-0.782]
11 This is a very challenging task due to two facts: first, there exists a huge amount and a wide variety of nonstandard tokens. [sent-45, score-0.604]
12 , 2010); second, the nonstandard tokens consist [Footer: Proceedings of the 50th Annual Meeting of the A[...], Jeju, Republic of Korea, 8-14 July 2012.] [sent-48, score-0.697]
13 of a mixture of both unintentional misspellings and intentionally-created tokens for various reasons (footnote 1), including the need for speed, ease of typing (Crystal, 2009), expressing sentiment (e. [sent-52, score-0.191]
14 Existing spell checkers and normalization systems rely heavily on lexical/phonetic similarity to select the correct candidate words. [sent-57, score-0.356]
15 , (tomorrow, “tmrw”)2), yet the number of candidates increases dramatically as the system strives to increase the coverage by enlarging the threshold. [sent-60, score-0.204]
16 (Han and Baldwin, 2011) reported an average of 127 candidates per nonstandard token with the correct-word coverage of 84%. [sent-61, score-0.888]
17 Different from previous work, we tackle the text normalization problem from a cognitive-sensitive perspective and investigate the human rationales for normalizing the nonstandard tokens. [sent-63, score-0.854]
18 We argue that there exists a set of letter transformation patterns that humans use to decipher the nonstandard tokens. [sent-64, score-0.986]
19 In this paper, we propose a broad-coverage normalization system by integrating three human perspectives, (Footnote 1: For this reason, we will use the term “nonstandard tokens” instead of “ill-formed tokens” throughout the paper.) [sent-68, score-0.205]
20 (Footnote 2: We use the form (standard word, “nonstandard token”) to denote an example nonstandard token and its corresponding standard word.) [sent-69, score-0.751]
21 including the enhanced letter transformation, visual priming, and the string and phonetic similarity. [sent-70, score-0.483]
22 For an arbitrary nonstandard token, the three subnormalizers each suggest their most confident candidates from a different perspective. [sent-71, score-0.817]
23 Results show that our system can achieve over 90% word-coverage with a limited number of candidates, and the broad word-coverage can be successfully translated into message-level performance gain. [sent-74, score-0.187]
24 , 1991 ; Brill and Moore, 2000) proposed to use the noisy channel framework to generate a list of corrections for any misspelled word, ranked by the corresponding posterior probabilities. [sent-80, score-0.193]
25 , 2001) enhanced this framework by calculating the likelihood probability as the chance of a noisy token and its associated tag being generated by a specific word. [sent-82, score-0.301]
26 With the rapid growth of SMS and social media content, text normalization has drawn increasing attention in the past decade, where the focus is on converting the noisy nonstandard tokens in the informal text into standard dictionary words. [sent-83, score-1.205]
27 , 2010) employed the weighted finite-state machines (FSMs) and rewriting rules for normalizing French SMS; (Pennell and Liu, 2010) focused on tweets created by handsets and developed a CRF tagger for deletion-based abbreviation. [sent-86, score-0.177]
28 The text normalization problem was also tackled under the machine translation (MT) or speech recognition (ASR) framework. [sent-87, score-0.138]
29 , 2011b) proposed to normalize the nonstandard tokens without explicitly categorizing them. [sent-96, score-0.697]
30 In this paper, we propose a novel cognitively-driven text normalization system that robustly tackles both the unintentional misspellings and the intentionally-created noisy tokens. [sent-104, score-0.367]
31 We propose a global context-based approach to purify the automatically collected training data and learn the letter transformation patterns without human supervision. [sent-105, score-0.386]
32 To the best of our knowledge, we are the first to integrate these human perspectives in the text normalization system. [sent-108, score-0.179]
33 3 Broad-Coverage Normalization System In this section, we describe our broad-coverage normalization system, which consists of four key components. [sent-109, score-0.138]
34 candidates from a different perspective (footnote 3): “Enhanced Letter Transformation” automatically learns a set of letter transformation patterns and is most effective in normalizing the intentionally created nonstandard tokens through letter insertion, repetition, deletion, and substitution (Section 3. [sent-111, score-1.56]
35 1); “Visual Priming” proposes candidates based on the visual cues and a primed perspective (Section 3. [sent-112, score-0.286]
36 Note that it is crucial to integrate different human perspectives so that the system is flexible in processing both unintentional misspellings and various intentionally-created noisy tokens. [sent-117, score-0.294]
37 We formulate the process of generating a nonstandard token $t_i$ from the dictionary word $s_i$ using a letter transformation model, and use the model confidence as the probability $p(t_i \mid s_i)$. [sent-123, score-1.434]
38 To form a nonstandard token, each letter in the dictionary word can be labeled with: (a) one of the 0-9 digits; (b) one of the 26 characters including itself; (c) the null character “-”; (d) a letter combination (footnote 4). [sent-125, score-1.238]
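The labeling scheme in sentence 38 can be illustrated with a small sketch. The function name `apply_labels` and the label sequence used for (tomorrow, “tmrw”) are illustrative assumptions, not taken from the paper; the sketch only shows how one label per dictionary-word letter yields a nonstandard token.

```python
NULL = "-"  # the null character: the letter is dropped

def apply_labels(word, labels):
    """Rebuild a nonstandard token from one label per dictionary-word letter.

    Each label is a digit, a letter, the null character, or a letter
    combination (a multi-character string).
    """
    if len(word) != len(labels):
        raise ValueError("need exactly one label per letter")
    # Null labels delete the letter; every other label is emitted as-is.
    return "".join(label for label in labels if label != NULL)

# Hypothetical labeling of "tomorrow" (t-o-m-o-r-r-o-w) that yields "tmrw".
token = apply_labels("tomorrow", ["t", "-", "m", "-", "r", "-", "-", "w"])
```

A sequence labeler would predict such label sequences; here they are written by hand.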
39 This transformation process from dictionary words to nonstandard tokens will be learned by a character-level sequence labeling system using the automatically collected (word, token) pairs. Footnote 3: For the dictionary word, we allow the subnormalizers to either return the word itself or candidates that are the possibly intended words in the given context (e. [sent-126, score-0.436]
40 Footnote 4: The set of letter combinations used in this work is {ah, ai, aw, ay, ck, ea, ey, ie, ou, te, wh}. [sent-129, score-1.006]
41 Next, we create a large lookup table by applying the character-level labeling system to the standard dictionary words and generating multiple variations for each word using the n-best labeling output; the labeling confidence is used as $p(t_i \mid s_i)$. [sent-130, score-0.253]
42 During testing, we search this lookup table to find the best candidate words for the nonstandard tokens. [sent-131, score-0.688]
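A minimal sketch of the lookup table described in sentences 41 and 42, assuming the n-best labeling output is already available as (token, confidence) pairs per dictionary word. The function name, the data layout, and all numbers below are hypothetical.

```python
from collections import defaultdict

def build_lookup(variations):
    """Invert word -> [(token, confidence)] into token -> [(word, confidence)]."""
    table = defaultdict(list)
    for word, nbest in variations.items():
        for token, confidence in nbest:
            table[token].append((word, confidence))
    # Keep the most confident dictionary word first in each bucket.
    for candidates in table.values():
        candidates.sort(key=lambda wc: wc[1], reverse=True)
    return dict(table)

# Made-up n-best variations standing in for the labeler's output.
variations = {
    "tomorrow": [("tmrw", 0.30), ("tmr", 0.10)],
    "timer": [("tmr", 0.25)],
}
table = build_lookup(variations)
```

Searching the table at test time is then a single dictionary access, e.g. `table.get("tmr", [])`.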
43 For tokens with letter repetition, we first generate a set of variants by varying the repetitive letters (e. [sent-132, score-0.361]
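The source sentence is truncated before its example, so the following is only a sketch of one plausible way to vary repeated letters. It assumes that a run of three or more identical letters counts as repetition and that each such run is tried with one or two copies; neither rule is confirmed by the text above.

```python
from itertools import groupby, product

def repetition_variants(token):
    """Generate variants by shrinking long runs of a repeated letter."""
    options = []
    for ch, group in groupby(token):
        run = len(list(group))
        if run >= 3:
            # Assumed rule: a long run may stand for a single or a double letter.
            options.append([ch, ch * 2])
        else:
            options.append([ch * run])
    return {"".join(parts) for parts in product(*options)}

variants = repetition_variants("goooood")
```

Each variant can then be looked up like any other nonstandard token.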
44 1 Context-Aware Training Pair Selection Manual annotation of the noisy nonstandard tokens takes a lot of time and effort. [sent-139, score-0.786]
45 The ideal training data should consist of the most frequent nonstandard tokens paired with the corresponding corrections, so that the system can learn from the most representative letter transformation patterns. [sent-143, score-1.096]
46 Motivated by research on word sense disambiguation (WSD) (Mihalcea, 2007), we hypothesize the nonstandard token and the standard word share a lot of common terms in their global context. [sent-144, score-0.803]
47 To the best of our knowledge, we are the first to explore this global contextual similarity for the text normalization task. [sent-148, score-0.165]
48 The term weights are defined using a normalized TF-IDF method: $w_{i,k} = \frac{TF_{i,k}}{TF_i} \times \log\frac{N}{DF_k}$, where $TF_{i,k}$ is the count of term $t_k$ appearing within the context of term $t_i$; $TF_i$ is the total count of $t_i$ in the corpus. [sent-153, score-0.211]
49 $\frac{TF_{i,k}}{TF_i}$ is therefore the relative frequency of $t_k$ appearing in the context of $t_i$; $\log\frac{N}{DF_k}$ denotes the inverse document frequency of $t_k$, calculated as the logarithm of total tweets ($N$) divided by the number of tweets containing $t_k$. [sent-154, score-0.248]
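The weighting can be sketched as follows. The function and argument names are illustrative, and comparing two context vectors with cosine similarity is an assumption: the sentences above define the weights but do not say which similarity measure is applied to them.

```python
import math

def context_weights(context_counts, term_total, doc_freq, n_docs):
    """Weight each context term k as (TF_ik / TF_i) * log(N / DF_k)."""
    return {
        term: (count / term_total) * math.log(n_docs / doc_freq[term])
        for term, count in context_counts.items()
    }

def cosine(u, v):
    """Cosine similarity between two sparse weight vectors (assumed measure)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy counts: "see" occurs 2 times in the context of a term seen 4 times,
# and appears in 10 of 1000 tweets.
weights = context_weights({"see": 2}, term_total=4, doc_freq={"see": 10}, n_docs=1000)
```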
50 To select the most representative (word, token) pairs for training, we rank the automatically collected 46,288 pairs by the token frequency, filter out pairs whose contextual similarity is lower than a threshold θ (set empirically at 0. [sent-155, score-0.174]
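The selection step amounts to a filter followed by a ranking. The sketch below assumes each pair already carries its token frequency and contextual similarity; the tuple layout, the threshold value, and the toy data are hypothetical, since the actual value of θ is truncated in the source.

```python
def select_pairs(pairs, theta, top_k):
    """Keep pairs with similarity >= theta, ranked by token frequency.

    pairs: iterable of (word, token, token_frequency, contextual_similarity).
    """
    kept = [p for p in pairs if p[3] >= theta]
    kept.sort(key=lambda p: p[2], reverse=True)
    return [(word, token) for word, token, _, _ in kept[:top_k]]

# Made-up candidates; theta=0.1 is a placeholder, not the paper's setting.
pairs = [
    ("tomorrow", "tmrw", 120, 0.41),
    ("you", "u", 900, 0.35),
    ("together", "2gether", 40, 0.05),
    ("then", "the", 3000, 0.02),
]
training = select_pairs(pairs, theta=0.1, top_k=2)
```

The last toy pair shows the point of the filter: a frequent but contextually dissimilar pair is dropped before ranking.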
51 , 2011b), then construct a feature vector for each letter of the dictionary word, using its mapped character as the reference label. [sent-161, score-0.366]
52 , 2011b), but develop a set of boundary features to effectively characterize the letter transformation process. [sent-166, score-0.429]
53 We notice that in creating the nonstandard tokens, humans tend to drop certain letter units from the word or replace them with other letters. [sent-167, score-0.922]
54 On top of the boundary tags, we develop a set of conjunction features to accurately pinpoint the current character position. [sent-172, score-0.148]
55 We consider conjunction features formed by concatenating character position in syllable and current syllable position in the word (e. [sent-173, score-0.204]
56 , conjunction feature “L B” for the letter “d” in Table 2). [sent-175, score-0.276]
57 We consider conjunction of character/vowel feature and their boundary tags on the syllable/morpheme/word level; conjunction of phoneme and phoneme boundary tags, and absolute position of current character within the corre- [footnote 5] Phoneme decomposition is generated using the (Jiampojamarn et al. [sent-177, score-0.358]
58 , 2007) algorithm to map up to two letters to phonemes (2-to-2 alignment); syllable boundary acquired by the hyphenation algorithm (Liang, 1983); morpheme boundary determined by toolkit Morfessor 1. [sent-178, score-0.259]
59 We use the aforementioned features to train the CRF model, then apply the model on dictionary words $s_i$ to generate multiple variations $t_i$ for each word. [sent-182, score-0.3]
60 When a nonstandard token is seen during testing, we apply the noisy channel to generate a list of best candidate words: $\hat{s} = \arg\max_{s_i} p(t_i \mid s_i)\, p(s_i)$. [sent-183, score-0.964]
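A minimal sketch of the noisy-channel ranking, assuming the channel probabilities come from a token-keyed lookup table and the prior from a unigram table. Both data structures, the function name, and all probabilities are hypothetical.

```python
def best_candidates(token, lookup, prior, n=3):
    """Rank dictionary words s by p(t | s) * p(s) for an observed token t.

    lookup: token -> list of (word, p(t | s)); prior: word -> p(s).
    """
    scored = [
        (word, channel * prior.get(word, 0.0))
        for word, channel in lookup.get(token, [])
    ]
    scored.sort(key=lambda ws: ws[1], reverse=True)
    return [word for word, _ in scored[:n]]

# Made-up probabilities: "tumor" has the higher channel score, but the
# prior favors "tomorrow", so the product ranks "tomorrow" first.
lookup = {"2moro": [("tomorrow", 0.2), ("tumor", 0.3)]}
prior = {"tomorrow": 0.01, "tumor": 0.001}
ranked = best_candidates("2moro", lookup, prior)
```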
61 2 Visual Priming Approach A second key component of the broad-coverage normalization system is a novel “Visual Priming” subnormalizer. [sent-185, score-0.18]
62 A priming effect is observed when participants complete stems with words on the study list more often than with the novel words. [sent-189, score-0.324]
63 A person familiarized with the “social talk” is highly primed with the most commonly used words; later when a nonstandard token shows only minor visual cues or visual stimulus, it can still be quickly recognized by the person. [sent-192, score-1.106]
64 In this process, the first letter or first few letters of the word serve as a very important visual stimulus. [sent-193, score-0.441]
65 Based on this assumption, we introduce the “priming” subnormalizer based only on the word frequency and the minor visual stimulus. [sent-194, score-0.349]
66 Note that the first character has been shown to be a crucial visual cue for the brain to understand jumbled words (Davis, ); we therefore consider as candidates only those words $s_i$ that start with the same character as $t_i$. [sent-199, score-0.517]
67 In the case that the nonstandard token $t_i$ starts with a digit (e. [sent-200, score-0.843]
68 , “2moro”), we use the most likely corresponding letter to search the candidates (those starting with letter “t”). [sent-202, score-0.594]
69 The “visual priming” subnormalizer promotes the candidate words that are frequently used in the social talk and also bear visual similarity with the given noisy token. [sent-204, score-0.51]
70 This approach also inherently follows the noisy channel framework, with $p(t_i \mid s_i)$ representing the visual stimulus and $p(s_i)$ being the logarithm of frequency. [sent-206, score-0.316]
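The priming idea can be sketched as follows. The extracted sentences do not give the exact form of the visual-stimulus term, so this sketch assumes it is the longest-common-subsequence length divided by the token length; the digit-to-letter table contains only the “2” to “t” mapping mentioned above, and the word frequencies are made up.

```python
import math

DIGIT_TO_LETTER = {"2": "t"}  # only the mapping mentioned in the text

def lcs_len(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, other in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ch == other else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def visual_priming(token, word_freq, n=3):
    """Rank same-initial words by (assumed) stimulus * log(frequency)."""
    first = DIGIT_TO_LETTER.get(token[0], token[0])
    scored = []
    for word, freq in word_freq.items():
        if word[0] != first:
            continue  # candidates must share the first character
        stimulus = lcs_len(token, word) / len(token)
        scored.append((word, stimulus * math.log(freq)))
    scored.sort(key=lambda ws: ws[1], reverse=True)
    return [word for word, _ in scored[:n]]

# Hypothetical frequencies from a background corpus.
primed = visual_priming("2moro", {"tomorrow": 1000, "tumor": 50, "more": 5000})
```

The sketch expects a non-empty token and positive frequencies.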
71 3 Spell Checker The third subnormalizer is the spell checker, which combines the string and phonetic similarity algorithms and is most effective in normalizing the misspellings. [sent-210, score-0.395]
72 We use the Jazzy spell checker (Idzelis, 2005) that integrates the DoubleMetaphone phonetic matching algorithm and the Levenshtein distance using the near-miss strategy, which enables the interchange of two adjacent letters, and the replacing/deleting/adding of letters. [sent-211, score-0.294]
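The string-similarity side of such a checker can be sketched with an edit distance that also allows swapping two adjacent letters (the optimal string alignment variant of Damerau-Levenshtein). This is not Jazzy's implementation and leaves out the DoubleMetaphone phonetic matching entirely; it only illustrates the four edit operations listed above.

```python
def osa_distance(a, b):
    """Edit distance with insert, delete, replace, and adjacent transposition."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # replacement or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]
```

For example, `osa_distance("teh", "the")` is 1 because of the swap, while “tmrw” is four insertions away from “tomorrow”, which suggests why a distance threshold alone struggles with heavily abbreviated tokens.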
73 4 Candidate Combination Each of the three subnormalizers is a stand-alone system and can suggest corrections for the nonstandard tokens. [sent-213, score-0.812]
74 Yet we show that each subnormalizer mimics a different perspective that humans use to decode the nonstandard tokens; as a result, our broad-coverage normalization system is built by integrating candidates from the three subnormalizers using various strategies. [sent-214, score-1.14]
75 For a noisy token seen in the informal text, the most convenient way of system combination is to harvest up to n candidates from each of the subnormalizers, and use the pool of candidates (up to 3n) as the system output. [sent-215, score-0.54]
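The pooling strategy can be sketched in a few lines. The function name and the candidate lists are hypothetical, and removing duplicates while preserving order is a design choice of this sketch.

```python
def pool_candidates(ranked_lists, n):
    """Take up to n candidates from each subnormalizer, dropping duplicates."""
    pooled, seen = [], set()
    for ranked in ranked_lists:
        for word in ranked[:n]:
            if word not in seen:
                seen.add(word)
                pooled.append(word)
    return pooled

# Made-up outputs of the three subnormalizers for one token.
letter_transformation = ["tomorrow", "tumor", "timer"]
visual_priming_out = ["tomorrow", "to"]
spell_checker = ["moro", "tomorrow"]
pool = pool_candidates([letter_transformation, visual_priming_out, spell_checker], n=2)
```

With three subnormalizers the pool holds at most 3n words, and fewer when their suggestions overlap.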
76 In Table 3, we also present the number of distinct nonstandard tokens found in each data set, and notice that only a small portion of the nonstandard tokens correspond to multiple standard words. [sent-231, score-1.419]
77 We calculate the dictionary coverage of the manually annotated words since this sets an upper bound for any normalization system. [sent-232, score-0.247]
78 8 The nonstandard tokens may consist of both numbers/characters and apostrophe. [sent-235, score-0.697]
79 8The dictionary is created by combining the CMU (CMU, 2007) and Aspell (Atkinson, 2006) dictionaries and dropping words with frequency < 20 in the background corpus. [sent-236, score-0.16]
80 The goal of word-level normalization is to convert the list of distinct nonstandard tokens into standard words. [sent-246, score-0.835]
81 For each nonstandard token, the system is considered correct if any of the corresponding standard words is among the n-best output from the system. [sent-247, score-0.646]
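This word-level criterion can be written down directly. The data layout (dictionaries keyed by nonstandard token) and the toy entries are assumptions made for illustration.

```python
def nbest_accuracy(system_output, gold, n):
    """Fraction of tokens with any gold word among the top-n candidates.

    system_output: token -> ranked candidate list; gold: token -> set of words.
    """
    hits = sum(
        1
        for token, answers in gold.items()
        if set(system_output.get(token, [])[:n]) & set(answers)
    )
    return hits / len(gold)

# Toy data: "2" corresponds to several standard words, and any one counts.
gold = {"tmrw": {"tomorrow"}, "u": {"you"}, "2": {"to", "too", "two"}}
output = {"tmrw": ["tomorrow"], "u": ["up", "you"], "2": ["twelve"]}
top1 = nbest_accuracy(output, gold, n=1)
top2 = nbest_accuracy(output, gold, n=2)
```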
82 On the message level, we evaluate the 1-best system output using precision, recall, and F-score, calculated with respect to the nonstandard tokens. [sent-249, score-0.67]
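A sketch of such a message-level computation is below. The exact definitions are not spelled out in the sentences above, so this assumes precision is measured over the nonstandard tokens for which the system proposed a replacement and recall over all nonstandard tokens; positions and words are made up.

```python
def precision_recall_f(predictions, gold):
    """Score 1-best replacements at nonstandard-token positions.

    predictions: position -> proposed word, or None when nothing is proposed.
    gold: position -> correct standard word.
    """
    proposed = {pos: word for pos, word in predictions.items() if word is not None}
    correct = sum(1 for pos, word in proposed.items() if gold.get(pos) == word)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score

gold = {0: "you", 1: "tomorrow", 2: "great"}
predictions = {0: "you", 1: "tumor", 2: None}
p, r, f = precision_recall_f(predictions, gold)
```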
83 We present the n-best accuracy (n = 1, 3, 10, 20) of the system as well as the “Oracle” results generated by pooling the top-20 candidates from each of the three subnormalizers. [sent-252, score-0.152]
84 This is of crucial importance to a normalization system, since the high accuracy and limited number of candidates will enable more sophisticated reranking or supervised learning techniques to select the best candidate. [sent-257, score-0.272]
85 These are beyond the capabilities of the current text normalization system and partly explain the remaining 5% gap. [sent-262, score-0.18]
86 Regarding the subnormalizer performance, the spell checker yields only 50% to 60% accuracy on all data sets, indicating that the vast amount of the intentionally created nonstandard tokens can hardly be tackled by a system that relies solely on the lexical/phonetic similarity. [sent-263, score-1.19]
87 A minor side effect is that the candidates were restricted to have the same first letter as the noisy token; this sets the upper bound of the approach to 89. [sent-293, score-0.473]
88 ” is effective at normalizing intentionally created tokens and has better precision regarding its top candidate (n = 1). [sent-298, score-0.303]
89 We notice that the system can effectively learn the letter transformation patterns from a small number of high quality training pairs. [sent-300, score-0.424]
90 The final system was trained using the top 5,000 pairs and the lookup table was created by generating 50 variations for each dictionary word. [sent-301, score-0.179]
91 3 Message-level Results The goal of message-level normalization is to replace each occurrence of the nonstandard token with the candidate word that best fits the local context. [sent-303, score-0.971]
92 Figure 2: Learning curve of the enhanced letter transformation system using random training pair selection or the context-aware approach (x-axis: amount of training pairs, ~45K). [sent-316, score-0.464]
93 Following research in (Han and Baldwin, 2011), we focus on the normalization task and assume perfect nonstandard token detection. [sent-321, score-0.889]
94 The “Word-level w/o Context” results are generated by replacing each nonstandard token using the 1-best word-level candidate. [sent-322, score-0.751]
95 5 Conclusion In this paper, we propose a broad-coverage normalization system for the social media language without using the human annotations. [sent-346, score-0.337]
96 It integrates three key components: the enhanced letter transformation, visual priming, and string/phonetic similarity. [sent-347, score-0.454]
97 We observe that the social media is an emotion-rich language, therefore future normalization systems will need to address various sentiment-related expressions, such as emoticons (“:d”, “X8”), interjections (“bwahaha”, “brrrr”), acronyms (“lol”, “lmao”), etc. [sent-350, score-0.372]
98 A hybrid rule/model-based finite-state framework for normalizing sms messages. [sent-368, score-0.241]
99 Exploring multiple text sources for twitter topic summarization. [sent-479, score-0.15]
100 A character-level machine translation approach for normalization of sms abbreviations. [sent-508, score-0.293]
wordName wordTfidf (topN-words)
[('nonstandard', 0.604), ('priming', 0.324), ('letter', 0.242), ('spell', 0.162), ('sms', 0.155), ('twitter', 0.15), ('visual', 0.147), ('token', 0.147), ('normalization', 0.138), ('si', 0.126), ('subnormalizer', 0.118), ('transformation', 0.115), ('candidates', 0.11), ('subnormalizers', 0.103), ('checker', 0.103), ('social', 0.1), ('tokens', 0.093), ('ti', 0.092), ('noisy', 0.089), ('pennell', 0.088), ('normalizing', 0.086), ('dictionary', 0.082), ('boundary', 0.072), ('enhanced', 0.065), ('tweets', 0.064), ('corrections', 0.063), ('messagelevel', 0.059), ('misspellings', 0.059), ('media', 0.057), ('candidate', 0.056), ('messages', 0.053), ('phoneme', 0.052), ('liu', 0.052), ('bilou', 0.051), ('syllable', 0.051), ('lm', 0.044), ('deana', 0.044), ('fuliang', 0.044), ('maxsip', 0.044), ('visualprim', 0.044), ('tk', 0.044), ('character', 0.042), ('system', 0.042), ('intentionally', 0.041), ('perspectives', 0.041), ('channel', 0.041), ('petrovic', 0.039), ('stimulus', 0.039), ('unintentional', 0.039), ('morpheme', 0.038), ('han', 0.037), ('acronyms', 0.035), ('broad', 0.035), ('crf', 0.034), ('conjunction', 0.034), ('viterbi', 0.034), ('baldwin', 0.034), ('minor', 0.032), ('completion', 0.031), ('beaufort', 0.029), ('brill', 0.029), ('dnfk', 0.029), ('gouws', 0.029), ('isualprim', 0.029), ('jazzy', 0.029), ('kobus', 0.029), ('mays', 0.029), ('microtext', 0.029), ('primed', 0.029), ('purify', 0.029), ('subramaniam', 0.029), ('tulving', 0.029), ('phonetic', 0.029), ('lookup', 0.028), ('arg', 0.027), ('contextual', 0.027), ('coverage', 0.027), ('created', 0.027), ('edinburgh', 0.026), ('frequency', 0.026), ('advertisements', 0.026), ('celikyilmaz', 0.026), ('choudhury', 0.026), ('hogan', 0.026), ('jumbled', 0.026), ('morfessor', 0.026), ('rationales', 0.026), ('weng', 0.026), ('letters', 0.026), ('word', 0.026), ('aw', 0.025), ('pages', 0.025), ('humans', 0.025), ('notice', 0.025), ('labeling', 0.025), ('term', 0.025), ('background', 0.025), ('yet', 0.025), ('crucial', 
0.024), ('calculated', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 2 acl-2012-A Broad-Coverage Normalization System for Social Media Language
2 0.13189167 21 acl-2012-A System for Real-time Twitter Sentiment Analysis of 2012 U.S. Presidential Election Cycle
Author: Hao Wang ; Dogan Can ; Abe Kazemzadeh ; Francois Bar ; Shrikanth Narayanan
Abstract: This paper describes a system for real-time analysis of public sentiment toward presidential candidates in the 2012 U.S. election as expressed on Twitter, a microblogging service. Twitter has become a central site where people express their opinions and views on political parties and candidates. Emerging events or news are often followed almost instantly by a burst in Twitter volume, providing a unique opportunity to gauge the relation between expressed public sentiment and electoral events. In addition, sentiment analysis can help explore how these events affect public opinion. While traditional content analysis takes days or weeks to complete, the system demonstrated here analyzes sentiment in the entire Twitter traffic about the election, delivering results instantly and continuously. It offers the public, the media, politicians and scholars a new and timely perspective on the dynamics of the electoral process and public opinion.
3 0.11778974 160 acl-2012-Personalized Normalization for a Multilingual Chat System
Author: Ai Ti Aw ; Lian Hau Lee
Abstract: This paper describes the personalized normalization of a multilingual chat system that supports chatting in user-defined short-forms or abbreviations. One of the major challenges for multilingual chat realized through machine translation technology is the normalization of non-standard, self-created short-forms in the chat message to standard words before translation. Due to the lack of training data and the variations of short-forms used among different social communities, it is hard to normalize and translate chat messages if users use vocabulary outside the training data and create short-forms freely. We develop a personalized chat normalizer for English and integrate it with a multilingual chat system, allowing users to create and use personalized short-forms in multilingual chat.
4 0.0947593 76 acl-2012-Distributional Semantics in Technicolor
Author: Elia Bruni ; Gemma Boleda ; Marco Baroni ; Nam Khanh Tran
Abstract: Our research aims at building computational models of word meaning that are perceptually grounded. Using computer vision techniques, we build visual and multimodal distributional models and compare them to standard textual models. Our results show that, while visual models with state-of-the-art computer vision techniques perform worse than textual models in general tasks (accounting for semantic relatedness), they are as good or better models of the meaning of words with visual correlates such as color terms, even in a nontrivial task that involves nonliteral uses of such words. Moreover, we show that visual and textual information are tapping on different aspects of meaning, and indeed combining them in multimodal models often improves performance.
5 0.079540633 124 acl-2012-Joint Inference of Named Entity Recognition and Normalization for Tweets
Author: Xiaohua Liu ; Ming Zhou ; Xiangyang Zhou ; Zhongyang Fu ; Furu Wei
Abstract: Tweets represent a critical source of fresh information, in which named entities occur frequently with rich variations. We study the problem of named entity normalization (NEN) for tweets. Two main challenges are the errors propagated from named entity recognition (NER) and the dearth of information in a single tweet. We propose a novel graphical model to simultaneously conduct NER and NEN on multiple tweets to address these challenges. Particularly, our model introduces a binary random variable for each pair of words with the same lemma across similar tweets, whose value indicates whether the two related words are mentions of the same entity. We evaluate our method on a manually annotated data set, and show that our method outperforms the baseline that handles these two tasks separately, boosting the F1 from 80.2% to 83.6% for NER, and the Accuracy from 79.4% to 82.6% for NEN, respectively.
6 0.078983687 205 acl-2012-Tweet Recommendation with Graph Co-Ranking
7 0.076181658 167 acl-2012-QuickView: NLP-based Tweet Search
8 0.072656006 68 acl-2012-Decoding Running Key Ciphers
9 0.071243852 88 acl-2012-Exploiting Social Information in Grounded Language Learning via Grammatical Reduction
10 0.068464711 41 acl-2012-Bootstrapping a Unified Model of Lexical and Phonetic Acquisition
11 0.068079576 173 acl-2012-Self-Disclosure and Relationship Strength in Twitter Conversations
12 0.06529431 153 acl-2012-Named Entity Disambiguation in Streaming Data
13 0.062766381 27 acl-2012-Arabic Retrieval Revisited: Morphological Hole Filling
14 0.058551196 91 acl-2012-Extracting and modeling durations for habits and events from Twitter
15 0.058463514 134 acl-2012-Learning to Find Translations and Transliterations on the Web
16 0.057966929 165 acl-2012-Probabilistic Integration of Partial Lexical Information for Noise Robust Haptic Voice Recognition
17 0.057353918 67 acl-2012-Deciphering Foreign Language by Combining Language Models and Context Vectors
18 0.057036582 197 acl-2012-Tokenization: Returning to a Long Solved Problem A Survey, Contrastive Experiment, Recommendations, and Toolkit
19 0.056261979 43 acl-2012-Building Trainable Taggers in a Web-based, UIMA-Supported NLP Workbench
20 0.056254718 97 acl-2012-Fast and Scalable Decoding with Language Model Look-Ahead for Phrase-based Statistical Machine Translation
topicId topicWeight
[(0, -0.171), (1, 0.056), (2, 0.026), (3, 0.027), (4, 0.034), (5, 0.074), (6, 0.187), (7, 0.008), (8, 0.025), (9, 0.08), (10, -0.029), (11, -0.022), (12, -0.001), (13, 0.03), (14, 0.019), (15, -0.012), (16, 0.007), (17, 0.03), (18, 0.025), (19, 0.002), (20, -0.026), (21, -0.081), (22, 0.036), (23, -0.007), (24, -0.108), (25, 0.112), (26, 0.093), (27, -0.014), (28, -0.058), (29, -0.118), (30, 0.068), (31, -0.128), (32, -0.006), (33, 0.092), (34, 0.055), (35, 0.022), (36, 0.033), (37, -0.049), (38, -0.043), (39, 0.084), (40, -0.069), (41, -0.089), (42, -0.109), (43, -0.027), (44, -0.11), (45, -0.052), (46, -0.02), (47, 0.12), (48, 0.074), (49, -0.001)]
simIndex simValue paperId paperTitle
same-paper 1 0.91993713 2 acl-2012-A Broad-Coverage Normalization System for Social Media Language
2 0.62599832 160 acl-2012-Personalized Normalization for a Multilingual Chat System
Author: Ai Ti Aw ; Lian Hau Lee
Abstract: This paper describes the personalized normalization of a multilingual chat system that supports chatting in user-defined short-forms or abbreviations. One of the major challenges for multilingual chat realized through machine translation technology is the normalization of non-standard, self-created short-forms in the chat message to standard words before translation. Due to the lack of training data and the variations of short-forms used among different social communities, it is hard to normalize and translate chat messages if users use vocabulary outside the training data and create short-forms freely. We develop a personalized chat normalizer for English and integrate it with a multilingual chat system, allowing users to create and use personalized short-forms in multilingual chat.
3 0.56142831 68 acl-2012-Decoding Running Key Ciphers
Author: Sravana Reddy ; Kevin Knight
Abstract: There has been recent interest in the problem of decoding letter substitution ciphers using techniques inspired by natural language processing. We consider a different type of classical encoding scheme known as the running key cipher, and propose a search solution using Gibbs sampling with a word language model. We evaluate our method on synthetic ciphertexts of different lengths, and find that it outperforms previous work that employs Viterbi decoding with character-based models.
4 0.55438167 77 acl-2012-Ecological Evaluation of Persuasive Messages Using Google AdWords
Author: Marco Guerini ; Carlo Strapparava ; Oliviero Stock
Abstract: In recent years there has been a growing interest in crowdsourcing methodologies to be used in experimental research for NLP tasks. In particular, evaluation of systems and theories about persuasion is difficult to accommodate within existing frameworks. In this paper we present a new cheap and fast methodology that allows fast experiment building and evaluation with fully-automated analysis at a low cost. The central idea is exploiting existing commercial tools for advertising on the web, such as Google AdWords, to measure message impact in an ecological setting. The paper includes a description of the approach, tips for how to use AdWords for scientific research, and results of pilot experiments on the impact of affective text variations which confirm the effectiveness of the approach.
5 0.52978593 153 acl-2012-Named Entity Disambiguation in Streaming Data
Author: Alexandre Davis ; Adriano Veloso ; Altigran Soares ; Alberto Laender ; Wagner Meira Jr.
Abstract: The named entity disambiguation task is to resolve the many-to-many correspondence between ambiguous names and the unique real-world entity. This task can be modeled as a classification problem, provided that positive and negative examples are available for learning binary classifiers. High-quality sense-annotated data, however, are hard to be obtained in streaming environments, since the training corpus would have to be constantly updated in order to accommodate the fresh data coming on the stream. On the other hand, few positive examples plus large amounts of unlabeled data may be easily acquired. Producing binary classifiers directly from this data, however, leads to poor disambiguation performance. Thus, we propose to enhance the quality of the classifiers using finer-grained variations of the well-known Expectation-Maximization (EM) algorithm. We conducted a systematic evaluation using Twitter streaming data and the results show that our classifiers are extremely effective, providing improvements ranging from 1% to 20%, when compared to the current state-of-the-art biased SVMs, being more than 120 times faster.
6 0.51333362 76 acl-2012-Distributional Semantics in Technicolor
7 0.45827642 195 acl-2012-The Creation of a Corpus of English Metalanguage
8 0.45369044 165 acl-2012-Probabilistic Integration of Partial Lexical Information for Noise Robust Haptic Voice Recognition
9 0.45238736 219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool
10 0.451435 173 acl-2012-Self-Disclosure and Relationship Strength in Twitter Conversations
11 0.43480876 21 acl-2012-A System for Real-time Twitter Sentiment Analysis of 2012 U.S. Presidential Election Cycle
12 0.43460548 124 acl-2012-Joint Inference of Named Entity Recognition and Normalization for Tweets
13 0.4242782 51 acl-2012-Collective Generation of Natural Image Descriptions
14 0.4199641 39 acl-2012-Beefmoves: Dissemination, Diversity, and Dynamics of English Borrowings in a German Hip Hop Forum
15 0.41036218 7 acl-2012-A Computational Approach to the Automation of Creative Naming
16 0.3994596 205 acl-2012-Tweet Recommendation with Graph Co-Ranking
17 0.38529578 70 acl-2012-Demonstration of IlluMe: Creating Ambient According to Instant Message Logs
19 0.38303941 67 acl-2012-Deciphering Foreign Language by Combining Language Models and Context Vectors
20 0.37634099 180 acl-2012-Social Event Radar: A Bilingual Context Mining and Sentiment Analysis Summarization System
topicId topicWeight
[(25, 0.018), (26, 0.033), (28, 0.038), (30, 0.02), (37, 0.018), (39, 0.048), (74, 0.026), (75, 0.018), (81, 0.011), (82, 0.013), (84, 0.019), (85, 0.019), (90, 0.508), (92, 0.049), (94, 0.022), (99, 0.052)]
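The topic list above is a sparse vector: only topics with non-negligible weight are listed as (topicId, topicWeight) pairs. The listing does not say how its simValue scores are computed, so the following is only a minimal sketch of one common way to compare such vectors, cosine similarity over sparse dictionaries. The function names and the second vector (`other_paper`) are hypothetical and exist only for illustration.

```python
import ast
import math


def parse_topic_weights(line):
    """Parse a '[(topicId, weight), ...]' line into a {topicId: weight} dict."""
    return dict(ast.literal_eval(line))


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Topic weights copied from the listing above.
this_paper = parse_topic_weights(
    "[(25, 0.018), (26, 0.033), (28, 0.038), (30, 0.02), (37, 0.018), "
    "(39, 0.048), (74, 0.026), (75, 0.018), (81, 0.011), (82, 0.013), "
    "(84, 0.019), (85, 0.019), (90, 0.508), (92, 0.049), (94, 0.022), "
    "(99, 0.052)]"
)

# Hypothetical second paper, invented purely for illustration.
other_paper = {90: 0.45, 39: 0.10, 12: 0.05}

# The topic carrying the most weight for this paper.
dominant_topic = max(this_paper, key=this_paper.get)
print(dominant_topic)
print(cosine_similarity(this_paper, other_paper))
```

Storing only the listed topics keeps the dot product proportional to the number of non-zero entries, which suits vectors like this one where a single topic (90) dominates.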
simIndex simValue paperId paperTitle
same-paper 1 0.99791443 2 acl-2012-A Broad-Coverage Normalization System for Social Media Language
Author: Fei Liu ; Fuliang Weng ; Xiao Jiang
Abstract: Social media language contains huge amount and wide variety of nonstandard tokens, created both intentionally and unintentionally by the users. It is of crucial importance to normalize the noisy nonstandard tokens before applying other NLP techniques. A major challenge facing this task is the system coverage, i.e., for any user-created nonstandard term, the system should be able to restore the correct word within its top n output candidates. In this paper, we propose a cognitively-driven normalization system that integrates different human perspectives in normalizing the nonstandard tokens, including the enhanced letter transformation, visual priming, and string/phonetic similarity. The system was evaluated on both word- and message-level using four SMS and Twitter data sets. Results show that our system achieves over 90% word-coverage across all data sets (a 10% absolute increase compared to state-of-the-art); the broad word-coverage can also successfully translate into message-level performance gain, yielding 6% absolute increase compared to the best prior approach.
2 0.99706405 33 acl-2012-Automatic Event Extraction with Structured Preference Modeling
Author: Wei Lu ; Dan Roth
Abstract: This paper presents a novel sequence labeling model based on the latent-variable semi-Markov conditional random fields for jointly extracting argument roles of events from texts. The model takes in coarse mention and type information and predicts argument roles for a given event template. This paper addresses the event extraction problem in a primarily unsupervised setting, where no labeled training instances are available. Our key contribution is a novel learning framework called structured preference modeling (PM), that allows arbitrary preference to be assigned to certain structures during the learning procedure. We establish and discuss connections between this framework and other existing works. We show empirically that the structured preferences are crucial to the success of our task. Our model, trained without annotated data and with a small number of structured preferences, yields performance competitive to some baseline supervised approaches.
3 0.99644792 143 acl-2012-Mixing Multiple Translation Models in Statistical Machine Translation
Author: Majid Razmara ; George Foster ; Baskaran Sankaran ; Anoop Sarkar
Abstract: Statistical machine translation is often faced with the problem of combining training data from many diverse sources into a single translation model which then has to translate sentences in a new domain. We propose a novel approach, ensemble decoding, which combines a number of translation systems dynamically at the decoding step. In this paper, we evaluate performance on a domain adaptation setting where we translate sentences from the medical domain. Our experimental results show that ensemble decoding outperforms various strong baselines including mixture models, the current state-of-the-art for domain adaptation in machine translation.
4 0.99632573 212 acl-2012-Using Search-Logs to Improve Query Tagging
Author: Kuzman Ganchev ; Keith Hall ; Ryan McDonald ; Slav Petrov
Abstract: Syntactic analysis of search queries is important for a variety of information-retrieval tasks; however, the lack of annotated data makes training query analysis models difficult. We propose a simple, efficient procedure in which part-of-speech tags are transferred from retrieval-result snippets to queries at training time. Unlike previous work, our final model does not require any additional resources at run-time. Compared to a state-of-the-art approach, we achieve more than 20% relative error reduction. Additionally, we annotate a corpus of search queries with part-of-speech tags, providing a resource for future work on syntactic query analysis.
5 0.99465024 177 acl-2012-Sentence Dependency Tagging in Online Question Answering Forums
Author: Zhonghua Qu ; Yang Liu
Abstract: Online forums are becoming a popular resource in the state of the art question answering (QA) systems. Because of its nature as an online community, it contains more updated knowledge than other places. However, going through tedious and redundant posts to look for answers could be very time consuming. Most prior work focused on extracting only question answering sentences from user conversations. In this paper, we introduce the task of sentence dependency tagging. Finding dependency structure can not only help find answer quickly but also allow users to trace back how the answer is concluded through user conversations. We use linear-chain conditional random fields (CRF) for sentence type tagging, and a 2D CRF to label the dependency relation between sentences. Our experimental results show that our proposed approach performs well for sentence dependency tagging. This dependency information can benefit other tasks such as thread ranking and answer summarization in online forums.
6 0.97822767 23 acl-2012-A Two-step Approach to Sentence Compression of Spoken Utterances
7 0.9779157 216 acl-2012-Word Epoch Disambiguation: Finding How Words Change Over Time
8 0.96996057 131 acl-2012-Learning Translation Consensus with Structured Label Propagation
9 0.96948075 119 acl-2012-Incremental Joint Approach to Word Segmentation, POS Tagging, and Dependency Parsing in Chinese
10 0.96760541 55 acl-2012-Community Answer Summarization for Multi-Sentence Question with Group L1 Regularization
11 0.96550536 9 acl-2012-A Cost Sensitive Part-of-Speech Tagging: Differentiating Serious Errors from Minor Errors
12 0.96542203 172 acl-2012-Selective Sharing for Multilingual Dependency Parsing
13 0.9647007 150 acl-2012-Multilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia
15 0.95910496 213 acl-2012-Utilizing Dependency Language Models for Graph-based Dependency Parsing Models
16 0.95507228 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations
17 0.95273906 137 acl-2012-Lemmatisation as a Tagging Task
18 0.95059597 20 acl-2012-A Statistical Model for Unsupervised and Semi-supervised Transliteration Mining
19 0.94752079 45 acl-2012-Capturing Paradigmatic and Syntagmatic Lexical Relations: Towards Accurate Chinese Part-of-Speech Tagging
20 0.9455933 87 acl-2012-Exploiting Multiple Treebanks for Parsing with Quasi-synchronous Grammars