emnlp emnlp2013 emnlp2013-9 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yi Yang ; Jacob Eisenstein
Abstract: We present a unified unsupervised statistical model for text normalization. The relationship between standard and non-standard tokens is characterized by a log-linear model, permitting arbitrary features. The weights of these features are trained in a maximumlikelihood framework, employing a novel sequential Monte Carlo training algorithm to overcome the large label space, which would be impractical for traditional dynamic programming solutions. This model is implemented in a normalization system called UNLOL, which achieves the best known results on two normalization datasets, outperforming more complex systems. We use the output of UNLOL to automatically normalize a large corpus of social media text, revealing a set of coherent orthographic styles that underlie online language variation.
Reference: text
sentIndex sentText sentNum sentScore
1 The weights of these features are trained in a maximumlikelihood framework, employing a novel sequential Monte Carlo training algorithm to overcome the large label space, which would be impractical for traditional dynamic programming solutions. [sent-4, score-0.209]
2 This model is implemented in a normalization system called UNLOL, which achieves the best known results on two normalization datasets, outperforming more complex systems. [sent-5, score-0.652]
3 We use the output of UNLOL to automatically normalize a large corpus of social media text, revealing a set of coherent orthographic styles that underlie online language variation. [sent-6, score-0.442]
4 Many of the attempts to characterize and overcome this variation have focused on normalization: transforming social media language into text that better matches standard datasets (Sproat et al. [sent-8, score-0.208]
5 Because there is little available training data, and because social media language changes rapidly (Eisenstein, 2013b), fully supervised training is generally not considered appropriate for this task. [sent-11, score-0.174]
6 We propose a different approach, performing normalization in a maximum-likelihood framework. [sent-17, score-0.326]
7 There are two main sources of information to be exploited: local context, and surface similarity between the observed strings and normalization candidates. [sent-18, score-0.371]
8 Because labeled examples of normalized text are not available, this model cannot be trained in the standard supervised fashion. [sent-20, score-0.129]
9 , 2010), as their complexity is quadratic in the size of the label space; in normalization, the label space is the vocabulary itself, with at least $10^4$ elements. [sent-22, score-0.111]
10 This training method may be applicable in other unsupervised learning problems with a large label space. [sent-24, score-0.087]
11 This model is implemented in a normalization system called UNLOL (unsupervised normalization in a LOg-Linear model). [sent-25, score-0.652]
12 Our evaluations show that UNLOL outperforms the state-of-the-art on standard normalization datasets. [sent-29, score-0.326]
13 In addition, we demonstrate the linguistic insights that can be obtained from normalization, using UNLOL to identify classes of orthographic transformations that form coherent linguistic styles. [sent-30, score-0.105]
14 2 Background The text normalization task was introduced by Sproat et al. [sent-31, score-0.326]
15 It has become still more salient in the era of widespread social media, particularly Twitter. [sent-34, score-0.096]
16 Han and Baldwin (2011) formally define a normalization task for Twitter, focusing on normalizations between single tokens, and excluding multi-word tokens like lol (laugh out loud). [sent-35, score-0.455]
17 The normalization task has been criticized by Eisenstein (2013b), who argues that it strips away important social meanings. [sent-36, score-0.422]
18 In recent work, normalization has been shown to yield improvements for part-of-speech tagging (Han et al. [sent-37, score-0.326]
19 As we will show in Section 7, accurate automated normalization can also improve our understanding of the nature of social media language. [sent-40, score-0.5]
20 Supervised methods Early work on normalization focused on labeled SMS datasets, using approaches such as noisy-channel modeling (Choudhury et al. [sent-41, score-0.361]
21 The scalar parameters are then estimated using expectation maximization. [sent-50, score-0.084]
22 (2010) use string edit distance to identify closely-related candidate orthographic forms and then decode the message using a language model. [sent-53, score-0.261]
23 (2010), we apply string edit distance, and like Gouws et al. [sent-57, score-0.119]
24 At present, labeled data for Twitter normalization is available only in small quantities. [sent-72, score-0.361]
25 Moreover, as social media language is undergoing rapid change (Eisenstein, 2013b), labeled datasets may become stale and increasingly ill-suited to new spellings and words. [sent-73, score-0.243]
26 Resources that characterize the current state of internet language risk becoming outdated; in this paper we investigate whether high-quality normalization is possible without any such resources. [sent-77, score-0.326]
27 These features may include simple string edit distance metrics, as well as lexical features that memorize specific pairs of standard and nonstandard words. [sent-87, score-0.217]
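As an illustration of the kind of word-pair features described here, the sketch below computes a few surface-similarity features and a lexical feature that memorizes a specific (nonstandard, standard) pair. The feature names, thresholds, and the use of difflib are assumptions of this sketch, not the paper's exact feature templates.

```python
from difflib import SequenceMatcher

def pair_features(source, target):
    """Illustrative log-linear features for a (nonstandard, standard) word pair.
    Feature names and bins are assumptions, not the paper's exact feature set."""
    feats = {}
    # surface-similarity features (difflib stands in for an edit-distance metric)
    ratio = SequenceMatcher(None, source, target).ratio()
    feats["sim_bin_high"] = 1.0 if ratio > 0.8 else 0.0
    feats["sim_bin_mid"] = 1.0 if 0.5 < ratio <= 0.8 else 0.0
    feats["same_first_char"] = 1.0 if source[:1] == target[:1] else 0.0
    # lexical feature memorizing this specific word pair
    feats["pair=%s_%s" % (source, target)] = 1.0
    return feats

print(pair_features("suttin", "something"))
```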
28 For example, in the phrase give me suttin to believe in, even a reader who has never before seen the word suttin may recognize it as a phonetic transcription of something. [sent-91, score-0.164]
29 The relatively high string edit distance is overcome by the strong contextual preference for the word something over orthographically closer alternatives such as button or suiting. [sent-92, score-0.119]
30 We can apply an arbitrary target language model, leveraging large amounts of unlabeled data and catering to the desired linguistic characteristics of the normalized content. [sent-93, score-0.154]
31 None of these tokens are standard (except 2, which appears in a nonstandard sense here), so without joint inference, it would not be possible to use context to help normalize suttin. [sent-98, score-0.302]
32 However, unlike the supervised case, here both terms are expectations: the outer expectation is over all target sequences (given the observed source sequence), and the nested expectation is over all source sequences, given the target sequence. [sent-131, score-0.369]
33 As the space of possible target sequences t grows exponentially in the length of the source sequence, it will not be practical to compute this expectation directly. [sent-132, score-0.144]
34 First, while the forward-backward algorithm would enable us to compute $E_{t|s}$, it would not give us the nested expectation $E_{t|s}[E_{s'|t}]$; this is the classic challenge in training globally-normalized log-linear models without labeled data (Smith and Eisner, 2005). [sent-135, score-0.162]
35 SMC algorithms maintain a set of weighted hypotheses; the weights correspond to probabilities, and in our case, the hypotheses correspond to target language word sequences. [sent-142, score-0.16]
36 Specifically, we approximate the conditional probability, $P(t_{1:n}|s_{1:n}) \approx \sum_{k=1}^{K} \omega_n^k \, \delta_{t_{1:n}^k}(t_{1:n})$, where $\omega_n^k$ is the normalized weight of sample k at word n ($\tilde{\omega}_n^k$ is the unnormalized weight), and $\delta_{t_{1:n}^k}$ is a delta function centered at $t_{1:n}^k$. [sent-143, score-0.208]
37 At each step, and for each hypothesis k, a new target word is sampled from a proposal distribution, and the weight of the hypothesis is then updated. [sent-144, score-0.338]
38 We maintain feature counts for each hypothesis, and approximate the expectation by taking a weighted average using the hypothesis weights. [sent-145, score-0.162]
39 The proposal distribution will be described in detail later. [sent-146, score-0.242]
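The sketch below shows the overall shape of this sequential importance sampling loop: K weighted hypotheses extended left to right, one proposal draw per hypothesis per word, and a feature expectation taken as a weighted average under the final weights. The `propose`, `weight_update`, and `pair_features` callables are placeholders (assumptions of this sketch); in the actual system they would be backed by the log-linear channel model and the target language model, and resampling is omitted for brevity.

```python
import random
from collections import defaultdict

def sis_feature_expectation(source, K, propose, weight_update, pair_features):
    """Approximate the feature expectation over target sequences given `source`,
    using K weighted hypotheses (sequential importance sampling).
    propose(prev_word, s_n) -> (candidates, probs); weight_update(prev_w, t_n, prev_t, s_n)
    returns the new unnormalized weight.  All three callables are assumed."""
    hyps = [([], 1.0) for _ in range(K)]          # (target prefix, unnormalized weight)
    for s_n in source:                            # move left-to-right over the message
        new_hyps = []
        for prefix, w in hyps:
            prev = prefix[-1] if prefix else "<s>"
            cands, probs = propose(prev, s_n)
            t_n = random.choices(cands, weights=probs, k=1)[0]
            new_hyps.append((prefix + [t_n], weight_update(w, t_n, prev, s_n)))
        hyps = new_hyps
    total = sum(w for _, w in hyps) or 1.0        # normalize the final weights only
    expect = defaultdict(float)
    # the real system maintains running feature counts per hypothesis;
    # recomputing them at the end is equivalent for this sketch
    for prefix, w in hyps:
        for s_n, t_n in zip(source, prefix):
            for name, val in pair_features(s_n, t_n).items():
                expect[name] += (w / total) * val
    return expect
```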
40 With these assumptions, we can view normalization as a finite state-space model in which the target language model defines the prior distribution of the process and Equation 3 defines the likelihood function. [sent-158, score-0.468]
41 We are able to compute the posterior probability P(t|s) using sequential importance sampling, a member of the SMC family. [sent-159, score-0.158]
42 The crucial idea in sequential importance sampling is to update the hypotheses $t_{1:n}^k$ and their weights $\omega_n^k$ so that they approximate the posterior distribution at the next time step, $P(t_{1:n+1}|s_{1:n+1})$. [sent-160, score-0.374]
43 We further assume the proposal distribution Q can be factored as: $Q(t_{1:n}|s_{1:n}) = Q(t_n|t_{1:n-1}, s_{1:n})\,Q(t_{1:n-1}|s_{1:n-1}) = Q(t_n|t_{n-1}, s_n)\,Q(t_{1:n-1}|s_{1:n-1})$. [sent-162, score-0.242]
44 (Equation 10) where we sample $t_n^k$ and update $\omega_n^k$ while moving from left to right, and sample $s'_{n,k}$ at each n. [sent-165, score-0.329]
45 Note that although the sequential importance sampler moves left-to-right like a filter, we use only the final weights ωN to compute the expectation. [sent-166, score-0.213]
46 Thus, the resulting expectation is based on the distribution P(s1:N |t1:N), so that no backwards “smoothing” pass (Godsill et al. [sent-167, score-0.128]
47 Other applications of sequential Monte Carlo make use of resampling (Cappe et al. [sent-169, score-0.119]
48 2 Proposal distribution The major computational challenge for dynamic programming approaches to normalization is the large label space, equal to the size of the target vocabulary. [sent-172, score-0.465]
49 It may appear that all we have gained by applying sequential Monte Carlo is to convert a computational problem into a statistical one: a naive sampling approach will have little hope of finding the small high-probability region of the high-dimensional label space. [sent-173, score-0.188]
50 However, sequential importance sampling allows us to address this issue through the proposal distribution, from which we sample the candidate words tn. [sent-174, score-0.39]
51 Careful design of the proposal distribution can guide sampling towards the high-probability space. [sent-175, score-0.276]
52 In the asymptotic limit of an infinite number of samples, any non-pathological proposal distribution will ultimately arrive at the desired estimate, but a good proposal distribution can greatly reduce the number of samples needed. [sent-176, score-0.524]
53 In low-dimensional settings, a convenient solution is to set the proposal distribution equal to the transition distribution, Q(tnk |sn, tnk−1) = P(tnk|tnk−1, . [sent-180, score-0.242]
54 We strike a middle ground between efficiency and accuracy, using a proposal distribution that is closely related to the overall likelihood, yet is tractable to sample and compute: $Q(t_n^k|s_n, t_{n-1}^k) \stackrel{\mathrm{def}}{=} \frac{P(s_n|t_n^k)\,Z(t_n^k)\,P(t_n^k|t_{n-1}^k)}{\sum_{t'} P(s_n|t')\,Z(t')\,P(t'|t_{n-1}^k)}$ (12). [sent-188, score-0.242]
55 To update the unnormalized hypothesis weights $\tilde{\omega}_n^k$, we have $\tilde{\omega}_n^k = \omega_{n-1}^k \sum_{t'} P(s_n|t')\,Z(t')\,P(t'|t_{n-1}^k)\,/\,Z(t_n^k)$, i.e. the standard importance weight $\omega_{n-1}^k\,P(s_n|t_n^k)\,P(t_n^k|t_{n-1}^k)\,/\,Q(t_n^k|s_n,t_{n-1}^k)$ after substituting the proposal of Equation 12. [sent-197, score-0.171]
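A small sketch of one proposal draw and the corresponding weight update under the proposal in Equation 12. The `score` (the log-linear score, so that exp(score) = P(s|t)Z(t)), `lm_prob`, and `log_partition` (log Z(t)) callables are assumptions supplied by the caller; the simplified update follows from substituting the proposal into the standard sequential importance sampling weight update.

```python
import math
import random

def propose_and_reweight(s_n, prev_t, prev_w, candidates, score, lm_prob, log_partition):
    """Sample t_n from Q(t_n | s_n, prev_t) proportional to exp(score(s_n, t_n)) * P(t_n | prev_t)
    and return (t_n, new unnormalized weight).  All scoring callables are assumed."""
    unnorm = [math.exp(score(s_n, t)) * lm_prob(t, prev_t) for t in candidates]
    norm = sum(unnorm)
    t_n = random.choices(candidates, weights=unnorm, k=1)[0]
    # SIS update: w = prev_w * P(s_n|t_n) P(t_n|prev_t) / Q(t_n|s_n, prev_t),
    # which simplifies to prev_w * norm / Z(t_n) under this proposal.
    new_w = prev_w * norm / math.exp(log_partition(t_n))
    return t_n, new_w
```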
56 3 Decoding Given an input source sentence s, the decoding problem is to find a target sentence t that maximizes $P(t|s) \propto P(s|t)P(t) = \prod_n P(s_n|t_n)\,P(t_n|t_{n-1})$. [sent-201, score-0.105]
57 This must be multiplied by the cost of computing the normalized probability $P(s_n|t_n)$, resulting in a prohibitive time complexity of $O(|\mathcal{V}_S|\,|\mathcal{V}_T|^2 N)$. [sent-203, score-0.129]
58 The first is to simply apply the proposal distribution, with linear complexity in the size of the two vocabularies. [sent-205, score-0.198]
59 Alternatively, we can apply the proposal distribution for selecting target word candidates, then apply the Viterbi algorithm only within these candidates. [sent-207, score-0.302]
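A sketch of this second strategy: restrict each position to a short candidate list (for instance, the highest-probability words under the proposal) and run Viterbi only over that reduced lattice. The candidate generator and scoring callables are assumptions of the sketch.

```python
def viterbi_over_candidates(source, candidates_for, emit_logprob, trans_logprob):
    """Viterbi decoding restricted to per-position candidate lists.
    candidates_for(s_n) returns a short list of target words (e.g. from the proposal);
    emit_logprob(s_n, t) = log P(s_n|t); trans_logprob(t, prev) = log P(t|prev).
    All three are assumed callables."""
    lattice = [candidates_for(s) for s in source]
    # best[t] = (score of best path ending in t, backpointer to previous word)
    best = {t: (emit_logprob(source[0], t) + trans_logprob(t, "<s>"), None)
            for t in lattice[0]}
    history = [best]
    for n in range(1, len(source)):
        nxt = {}
        for t in lattice[n]:
            sc, bp = max(((best[p][0] + trans_logprob(t, p) + emit_logprob(source[n], t), p)
                          for p in best), key=lambda x: x[0])
            nxt[t] = (sc, bp)
        history.append(nxt)
        best = nxt
    # backtrace from the best final word
    t, _ = max(best.items(), key=lambda kv: kv[1][0])
    path = [t]
    for n in range(len(source) - 1, 0, -1):
        t = history[n][t][1]
        path.append(t)
    return list(reversed(path))
```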
60 (2010), which has proven effective for normalization in prior work. [sent-218, score-0.364]
61 We bin this similarity to create binary features indicating whether a string s is in the top-N most similar strings to t; this binning yields substantial speed improvements without negatively impacting accuracy. [sent-219, score-0.106]
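A sketch of this binned feature: precompute, for each target word, the set of its top-N most similar source strings, and fire a single binary feature when the observed string falls in that set. difflib's ratio is a stand-in (an assumption) for the similarity measure referenced above.

```python
from difflib import SequenceMatcher

def build_topn_index(target_vocab, source_vocab, n=50):
    """Map each target word to the set of its n most similar source strings.
    SequenceMatcher is a stand-in for the actual similarity measure."""
    index = {}
    for t in target_vocab:
        ranked = sorted(source_vocab,
                        key=lambda s: SequenceMatcher(None, s, t).ratio(),
                        reverse=True)
        index[t] = set(ranked[:n])
    return index

def topn_feature(s, t, index):
    # binary feature: is s among the top-N most similar strings to t?
    return 1.0 if s in index.get(t, ()) else 0.0
```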
62 5 Implementation and data The model and inference described in the previous section are implemented in a software system for normalizing text on Twitter, called UNLOL: unsupervised normalization in a LOg-Linear model. [sent-220, score-0.427]
63 1 Normalization candidates Most tokens in tweets do not require normalization. [sent-224, score-0.117]
64 The question of how to identify which words are to be normalized is still an open problem. [sent-225, score-0.094]
65 Following Han and Baldwin (2011), we build a dictionary of words which are permissible in the target domain, and make no attempt to normalize source strings that match these words. [sent-226, score-0.18]
66 As with other comparable approaches, we are therefore unable to normalize strings like ill into I'll. [sent-227, score-0.12]
67 For all in-vocabulary words, we define $P(s_n|t_n) = \delta(s_n, t_n)$, taking the value of zero when $s_n \neq t_n$. [sent-235, score-0.146]
68 In addition to words that are in the target vocabulary, there are many other strings that should not be normalized, such as names and multiword shortenings (e. [sent-237, score-0.146]
69 1 We follow prior work and assume that the set of normalization candidates is known in advance during test set decoding (Han et al. [sent-240, score-0.409]
70 Thus, during training we attempt to normalize all tokens that (1) are not in our lexicon of IV words, and (2) are composed of letters, numbers and the apostrophe. [sent-243, score-0.122]
71 This set includes contractions like "gonna" and "gotta", which would not appear in the test set, but are nonetheless normalized. (Footnote 1: Whether multiword shortenings should be normalized is arguable, but they are outside the scope of current normalization datasets (Han and Baldwin, 2011).) [sent-244, score-0.589]
72 For each OOV token, we conduct a pre-normalization step by reducing any repetitions of more than two letters in the nonstandard words to exactly two letters (e. [sent-246, score-0.098]
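This pre-normalization step can be written as a one-line regular expression; the implementation below is an illustration, since the paper does not spell out how it performs the reduction.

```python
import re

def squeeze_repeats(word):
    # reduce any run of more than two identical characters to exactly two, e.g. "goooood" -> "good"
    return re.sub(r"(.)\1{2,}", r"\1\1", word)

print(squeeze_repeats("sooooo"))  # -> "soo"
```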
73 3 Parameters The Monte Carlo approximations require two parameters: the number of samples for sequential Monte Carlo (K), and the number of samples for the non-sequential sampler of the nested expectation (L, from Equation 10). [sent-255, score-0.326]
74 , words that are not in the target vocabulary) and their normalized forms. [sent-262, score-0.154]
75 As this corpus does not provide linguistic context, its decoding must use a unigram target language model. [sent-264, score-0.105]
76 1 by its authors Han and Baldwin (2011) contains 549 complete tweets with 1,184 nonstandard tokens (558 unique word types). [sent-266, score-0.215]
77 1 revealed some inconsistencies in annotation (for example, y'all and 2 are sometimes normalized to you and to, but are left unnormalized in other cases). [sent-299, score-0.17]
78 For example, smh is normalized to somehow in LexNorm1 . [sent-301, score-0.135]
79 2 in the hope that it will become standard in future work on normalization in English. [sent-317, score-0.326]
80 Metrics Prior work on these datasets has assumed perfect detection of words requiring normalization, and has focused on finding the correct normalization for these words (Han and Baldwin, 2011; Han et al. [sent-328, score-0.36]
81 Recall has been defined as the proportion of words requiring normalization which are normalized correctly; precision is defined as the proportion of normalizations which are correct. [sent-330, score-0.502]
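A minimal sketch of these metrics, assuming aligned lists of gold and predicted normalizations and a known set of tokens that require normalization (as in the evaluation setup described here).

```python
def norm_precision_recall(gold, pred, needs_norm):
    """gold[i], pred[i]: normalized form (pred[i] is None if the system left token i alone);
    needs_norm[i]: whether token i requires normalization.  Illustrative only."""
    correct = sum(1 for g, p, n in zip(gold, pred, needs_norm)
                  if n and p is not None and p == g)
    attempted = sum(1 for p in pred if p is not None)
    required = sum(needs_norm)
    precision = correct / attempted if attempted else 0.0
    recall = correct / required if required else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```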
82 In the normalization task that we consider, the tokens to be normalized are specified in advance. [sent-337, score-0.467]
83 Regularization One potential concern is that the number of non-zero feature weights will continually increase until the memory cost becomes overwhelming. [sent-343, score-0.09]
84 7 Analysis We apply our normalization system to investigate the orthographic processes underlying language variation in social media. [sent-361, score-0.527]
85 We then treat these normalizations as labeled training data, and examine the Levenshtein alignment between the source and target tokens. [sent-363, score-0.177]
86 We apply non-negative matrix factorization (Lee and Seung, 2001), which characterizes each author by a vector of k style loadings, and simultaneously constructs k style dictionaries, which each put weight on different orthographic rules. [sent-367, score-0.266]
87 Because the loadings are constrained to be non-negative, the factorization can be seen as sparsely assigning varying amounts of each style to each author. [sent-368, score-0.141]
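A schematic version of this analysis: build a non-negative author-by-rule count matrix from the Levenshtein alignments and factor it into k styles. scikit-learn's NMF is used here in place of whatever implementation the authors used (the keyword list mentions nimfa); the matrix construction and rule naming are assumptions of this sketch.

```python
import numpy as np
from sklearn.decomposition import NMF

def style_factorization(author_rule_counts, rule_names, k=10):
    """author_rule_counts: (num_authors x num_rules) non-negative matrix of how often
    each author's normalizations exhibit each character-level rule (e.g. "ing -> in").
    Returns per-author style loadings and, for each style, its top rules."""
    model = NMF(n_components=k, init="nndsvd", random_state=0, max_iter=500)
    loadings = model.fit_transform(author_rule_counts)   # authors x k style loadings
    dictionaries = model.components_                     # k x rules style dictionaries
    top_rules = [[rule_names[i] for i in np.argsort(-dictionaries[j])[:5]]
                 for j in range(k)]
    return loadings, top_rules
```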
88 The resulting styles are shown in Table 3, for k = 10; other values of k give similar overall results with more or less detail. [sent-372, score-0.088]
89 The styles incorporate a number of linguistic phenomena, including: expressive lengthening (styles 7-9; see Brody and Diakopoulos, 2011); g- and t-dropping (style 5, see Eisenstein 2013a) ; th-stopping (style 6); and the dropping of several word-final vowels (styles 1-3). [sent-373, score-0.134]
90 Some of these styles, such as t-dropping and th-stopping, have direct analogues in spoken language varieties (Tagliamonte and Temple, 2005; Green, 2002), while others, like expressive lengthening, seem more unique to social media. [sent-374, score-0.132]
91 The relationships between these orthographic styles and social variables such as geography and demographics... (Footnote 2: We tried adding these rules as features and retraining the normalization system, but this hurt performance.) [sent-375, score-0.615]
92 The tokens ima, outta, and needa all refer to multi-word expressions in standard English, and are thus outside the scope of the normalization task as defined by Han et al. [sent-392, score-0.455]
93 But while these normalizations are wrong, the resulting style nonetheless captures a coherent orthographic phenomenon. [sent-395, score-0.248]
94 8 Conclusion We have presented a unified, unsupervised statistical model for normalizing social media text, attaining the best reported performance on the two standard normalization datasets. [sent-396, score-0.601]
95 The power of our approach comes from flexible modeling of word-to-word relationships through features, while exploiting contextual regularity to train the corresponding feature 70 weights without labeled data. [sent-397, score-0.09]
96 The primary technical challenge was overcoming the large label space of the normalization task; we accomplish this using sequential Monte Carlo. [sent-398, score-0.48]
97 Future work may consider whether sequential Monte Carlo can offer similar advantages in other unsupervised NLP tasks. [sent-399, score-0.171]
98 A hybrid rule/model-based finite-state framework for normalizing sms messages. [sent-420, score-0.121]
99 An overview of existing methods and recent advances in sequential monte carlo. [sent-453, score-0.337]
100 Joint inference of named entity recognition and normalization for tweets. [sent-576, score-0.326]
wordName wordTfidf (topN-words)
[('tnk', 0.329), ('normalization', 0.326), ('unlol', 0.247), ('tn', 0.225), ('monte', 0.218), ('proposal', 0.198), ('han', 0.169), ('carlo', 0.153), ('sn', 0.146), ('baldwin', 0.14), ('sequential', 0.119), ('orthographic', 0.105), ('eisenstein', 0.104), ('lexnorm', 0.103), ('nonstandard', 0.098), ('social', 0.096), ('normalized', 0.094), ('twitter', 0.089), ('styles', 0.088), ('oov', 0.086), ('expectation', 0.084), ('doucet', 0.082), ('sutt', 0.082), ('nk', 0.082), ('normalizations', 0.082), ('media', 0.078), ('unnormalized', 0.076), ('normalize', 0.075), ('sms', 0.072), ('tweets', 0.07), ('liu', 0.069), ('viterbi', 0.069), ('contractor', 0.065), ('choudhury', 0.065), ('cappe', 0.062), ('godsill', 0.062), ('gouws', 0.062), ('smc', 0.062), ('tkn', 0.062), ('style', 0.061), ('string', 0.061), ('target', 0.06), ('jacob', 0.059), ('edit', 0.058), ('cook', 0.057), ('kn', 0.056), ('weights', 0.055), ('unsupervised', 0.052), ('tf', 0.052), ('normalizing', 0.049), ('petrovi', 0.049), ('tokens', 0.047), ('tk', 0.047), ('lengthening', 0.046), ('sproat', 0.046), ('decoding', 0.045), ('strings', 0.045), ('expectations', 0.045), ('hypotheses', 0.045), ('distribution', 0.044), ('nested', 0.043), ('quadratic', 0.041), ('xk', 0.041), ('beaufort', 0.041), ('finna', 0.041), ('gud', 0.041), ('ima', 0.041), ('kobus', 0.041), ('loadings', 0.041), ('needa', 0.041), ('nimfa', 0.041), ('outt', 0.041), ('pkk', 0.041), ('saraf', 0.041), ('shortenings', 0.041), ('smh', 0.041), ('texting', 0.041), ('tpn', 0.041), ('xtp', 0.041), ('hypothesis', 0.04), ('samples', 0.04), ('factorization', 0.039), ('importance', 0.039), ('prior', 0.038), ('outer', 0.038), ('approximate', 0.038), ('message', 0.037), ('ech', 0.036), ('gat', 0.036), ('menezes', 0.036), ('nnk', 0.036), ('tagliamonte', 0.036), ('varieties', 0.036), ('hassan', 0.035), ('labeled', 0.035), ('label', 0.035), ('smith', 0.035), ('cost', 0.035), ('datasets', 0.034), ('sampling', 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999911 9 emnlp-2013-A Log-Linear Model for Unsupervised Text Normalization
Author: Yi Yang ; Jacob Eisenstein
Abstract: We present a unified unsupervised statistical model for text normalization. The relationship between standard and non-standard tokens is characterized by a log-linear model, permitting arbitrary features. The weights of these features are trained in a maximumlikelihood framework, employing a novel sequential Monte Carlo training algorithm to overcome the large label space, which would be impractical for traditional dynamic programming solutions. This model is implemented in a normalization system called UNLOL, which achieves the best known results on two normalization datasets, outperforming more complex systems. We use the output of UNLOL to automatically normalize a large corpus of social media text, revealing a set of coherent orthographic styles that underlie online language variation.
2 0.22701722 151 emnlp-2013-Paraphrasing 4 Microblog Normalization
Author: Wang Ling ; Chris Dyer ; Alan W Black ; Isabel Trancoso
Abstract: Compared to the edited genres that have played a central role in NLP research, microblog texts use a more informal register with nonstandard lexical items, abbreviations, and free orthographic variation. When confronted with such input, conventional text analysis tools often perform poorly. Normalization, replacing orthographically or lexically idiosyncratic forms with more standard variants, can improve performance. We propose a method for learning normalization rules from machine translations of a parallel corpus of microblog messages. To validate the utility of our approach, we evaluate extrinsically, showing that normalizing English tweets and then translating improves translation quality (compared to translating unnormalized text) using three standard web translation services as well as a phrase-based translation system trained on parallel microblog data.
3 0.091793582 148 emnlp-2013-Orthonormal Explicit Topic Analysis for Cross-Lingual Document Matching
Author: John Philip McCrae ; Philipp Cimiano ; Roman Klinger
Abstract: Cross-lingual topic modelling has applications in machine translation, word sense disambiguation and terminology alignment. Multilingual extensions of approaches based on latent (LSI), generative (LDA, PLSI) as well as explicit (ESA) topic modelling can induce an interlingual topic space allowing documents in different languages to be mapped into the same space and thus to be compared across languages. In this paper, we present a novel approach that combines latent and explicit topic modelling approaches in the sense that it builds on a set of explicitly defined topics, but then computes latent relations between these. Thus, the method combines the benefits of both explicit and latent topic modelling approaches. We show that on a crosslingual mate retrieval task, our model significantly outperforms LDA, LSI, and ESA, as well as a baseline that translates every word in a document into the target language.
4 0.088664994 150 emnlp-2013-Pair Language Models for Deriving Alternative Pronunciations and Spellings from Pronunciation Dictionaries
Author: Russell Beckley ; Brian Roark
Abstract: Pronunciation dictionaries provide a readily available parallel corpus for learning to transduce between character strings and phoneme strings or vice versa. Translation models can be used to derive character-level paraphrases on either side of this transduction, allowing for the automatic derivation of alternative pronunciations or spellings. We examine finite-state and SMT-based methods for these related tasks, and demonstrate that the tasks have different characteristics: finding alternative spellings is harder than alternative pronunciations and benefits from round-trip algorithms when the other does not. We also show that we can increase accuracy by modeling syllable stress.
5 0.087568164 109 emnlp-2013-Is Twitter A Better Corpus for Measuring Sentiment Similarity?
Author: Shi Feng ; Le Zhang ; Binyang Li ; Daling Wang ; Ge Yu ; Kam-Fai Wong
Abstract: Extensive experiments have validated the effectiveness of the corpus-based method for classifying the word’s sentiment polarity. However, no work is done for comparing different corpora in the polarity classification task. Nowadays, Twitter has aggregated huge amount of data that are full of people’s sentiments. In this paper, we empirically evaluate the performance of different corpora in sentiment similarity measurement, which is the fundamental task for word polarity classification. Experiment results show that the Twitter data can achieve a much better performance than the Google, Web1T and Wikipedia based methods.
6 0.085947908 135 emnlp-2013-Monolingual Marginal Matching for Translation Model Adaptation
7 0.078131422 89 emnlp-2013-Gender Inference of Twitter Users in Non-English Contexts
8 0.0776015 64 emnlp-2013-Discriminative Improvements to Distributional Sentence Similarity
9 0.077494003 127 emnlp-2013-Max-Margin Synchronous Grammar Induction for Machine Translation
10 0.075546175 27 emnlp-2013-Authorship Attribution of Micro-Messages
11 0.06944108 16 emnlp-2013-A Unified Model for Topics, Events and Users on Twitter
12 0.067970231 8 emnlp-2013-A Joint Learning Model of Word Segmentation, Lexical Acquisition, and Phonetic Variability
13 0.066786185 47 emnlp-2013-Collective Opinion Target Extraction in Chinese Microblogs
14 0.063156679 81 emnlp-2013-Exploring Demographic Language Variations to Improve Multilingual Sentiment Analysis in Social Media
15 0.06249008 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks
16 0.06132691 82 emnlp-2013-Exploring Representations from Unlabeled Data with Co-training for Chinese Word Segmentation
17 0.060003746 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization
18 0.059336312 167 emnlp-2013-Semi-Markov Phrase-Based Monolingual Alignment
19 0.058996003 169 emnlp-2013-Semi-Supervised Representation Learning for Cross-Lingual Text Classification
20 0.057613157 71 emnlp-2013-Efficient Left-to-Right Hierarchical Phrase-Based Translation with Improved Reordering
topicId topicWeight
[(0, -0.226), (1, -0.028), (2, -0.065), (3, -0.066), (4, -0.005), (5, -0.061), (6, 0.044), (7, 0.05), (8, 0.021), (9, -0.019), (10, -0.037), (11, 0.059), (12, 0.071), (13, 0.2), (14, 0.035), (15, -0.081), (16, 0.134), (17, -0.008), (18, -0.022), (19, -0.01), (20, -0.066), (21, -0.034), (22, 0.131), (23, 0.006), (24, -0.059), (25, 0.005), (26, -0.042), (27, 0.026), (28, 0.088), (29, 0.061), (30, 0.044), (31, 0.037), (32, 0.132), (33, 0.076), (34, 0.031), (35, 0.004), (36, -0.029), (37, -0.022), (38, 0.134), (39, 0.063), (40, 0.044), (41, -0.027), (42, -0.002), (43, -0.277), (44, -0.113), (45, 0.043), (46, -0.089), (47, 0.206), (48, -0.102), (49, -0.11)]
simIndex simValue paperId paperTitle
same-paper 1 0.92753822 9 emnlp-2013-A Log-Linear Model for Unsupervised Text Normalization
Author: Yi Yang ; Jacob Eisenstein
Abstract: We present a unified unsupervised statistical model for text normalization. The relationship between standard and non-standard tokens is characterized by a log-linear model, permitting arbitrary features. The weights of these features are trained in a maximumlikelihood framework, employing a novel sequential Monte Carlo training algorithm to overcome the large label space, which would be impractical for traditional dynamic programming solutions. This model is implemented in a normalization system called UNLOL, which achieves the best known results on two normalization datasets, outperforming more complex systems. We use the output of UNLOL to automatically normalize a large corpus of social media text, revealing a set of coherent orthographic styles that underlie online language variation.
2 0.79396677 151 emnlp-2013-Paraphrasing 4 Microblog Normalization
Author: Wang Ling ; Chris Dyer ; Alan W Black ; Isabel Trancoso
Abstract: Compared to the edited genres that have played a central role in NLP research, microblog texts use a more informal register with nonstandard lexical items, abbreviations, and free orthographic variation. When confronted with such input, conventional text analysis tools often perform poorly. Normalization, replacing orthographically or lexically idiosyncratic forms with more standard variants, can improve performance. We propose a method for learning normalization rules from machine translations of a parallel corpus of microblog messages. To validate the utility of our approach, we evaluate extrinsically, showing that normalizing English tweets and then translating improves translation quality (compared to translating unnormalized text) using three standard web translation services as well as a phrase-based translation system trained on parallel microblog data.
3 0.56320214 14 emnlp-2013-A Synchronous Context Free Grammar for Time Normalization
Author: Steven Bethard
Abstract: We present an approach to time normalization (e.g. the day before yesterday ⇒ 2013-04-12) based on a synchronous context free grammar. Synchronous rules map the source language to formally defined operators for manipulating times (FINDENCLOSED, STARTATENDOF, etc.). Time expressions are then parsed using an extended CYK+ algorithm, and converted to a normalized form by applying the operators recursively. For evaluation, a small set of synchronous rules for English time expressions were developed. Our model outperforms HeidelTime, the best time normalization system in TempEval 2013, on four different time normalization corpora.
4 0.43792558 148 emnlp-2013-Orthonormal Explicit Topic Analysis for Cross-Lingual Document Matching
Author: John Philip McCrae ; Philipp Cimiano ; Roman Klinger
Abstract: Cross-lingual topic modelling has applications in machine translation, word sense disambiguation and terminology alignment. Multilingual extensions of approaches based on latent (LSI), generative (LDA, PLSI) as well as explicit (ESA) topic modelling can induce an interlingual topic space allowing documents in different languages to be mapped into the same space and thus to be compared across languages. In this paper, we present a novel approach that combines latent and explicit topic modelling approaches in the sense that it builds on a set of explicitly defined topics, but then computes latent relations between these. Thus, the method combines the benefits of both explicit and latent topic modelling approaches. We show that on a crosslingual mate retrieval task, our model significantly outperforms LDA, LSI, and ESA, as well as a baseline that translates every word in a document into the target language.
Author: Russell Beckley ; Brian Roark
Abstract: Pronunciation dictionaries provide a readily available parallel corpus for learning to transduce between character strings and phoneme strings or vice versa. Translation models can be used to derive character-level paraphrases on either side of this transduction, allowing for the automatic derivation of alternative pronunciations or spellings. We examine finite-state and SMT-based methods for these related tasks, and demonstrate that the tasks have different characteristics: finding alternative spellings is harder than alternative pronunciations and benefits from round-trip algorithms when the other does not. We also show that we can increase accuracy by modeling syllable stress.
8 0.34603468 89 emnlp-2013-Gender Inference of Twitter Users in Non-English Contexts
9 0.34286615 64 emnlp-2013-Discriminative Improvements to Distributional Sentence Similarity
10 0.3376939 203 emnlp-2013-With Blinkers on: Robust Prediction of Eye Movements across Readers
11 0.33144325 178 emnlp-2013-Success with Style: Using Writing Style to Predict the Success of Novels
12 0.31056687 135 emnlp-2013-Monolingual Marginal Matching for Translation Model Adaptation
13 0.30747652 167 emnlp-2013-Semi-Markov Phrase-Based Monolingual Alignment
14 0.29673243 199 emnlp-2013-Using Topic Modeling to Improve Prediction of Neuroticism and Depression in College Students
15 0.29131526 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks
16 0.28779632 27 emnlp-2013-Authorship Attribution of Micro-Messages
17 0.28742126 127 emnlp-2013-Max-Margin Synchronous Grammar Induction for Machine Translation
18 0.28350464 47 emnlp-2013-Collective Opinion Target Extraction in Chinese Microblogs
19 0.27800706 86 emnlp-2013-Feature Noising for Log-Linear Structured Prediction
20 0.27477342 115 emnlp-2013-Joint Learning of Phonetic Units and Word Pronunciations for ASR
topicId topicWeight
[(3, 0.025), (18, 0.046), (22, 0.03), (30, 0.089), (45, 0.011), (47, 0.334), (50, 0.02), (51, 0.174), (66, 0.044), (71, 0.055), (75, 0.018), (77, 0.023), (96, 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.81431711 9 emnlp-2013-A Log-Linear Model for Unsupervised Text Normalization
Author: Yi Yang ; Jacob Eisenstein
Abstract: We present a unified unsupervised statistical model for text normalization. The relationship between standard and non-standard tokens is characterized by a log-linear model, permitting arbitrary features. The weights of these features are trained in a maximumlikelihood framework, employing a novel sequential Monte Carlo training algorithm to overcome the large label space, which would be impractical for traditional dynamic programming solutions. This model is implemented in a normalization system called UNLOL, which achieves the best known results on two normalization datasets, outperforming more complex systems. We use the output of UNLOL to automatically normalize a large corpus of social media text, revealing a set of coherent orthographic styles that underlie online language variation.
2 0.75970161 204 emnlp-2013-Word Level Language Identification in Online Multilingual Communication
Author: Dong Nguyen ; A. Seza Dogruoz
Abstract: Multilingual speakers switch between languages in online and spoken communication. Analyses of large scale multilingual data require automatic language identification at the word level. For our experiments with multilingual online discussions, we first tag the language of individual words using language models and dictionaries. Secondly, we incorporate context to improve the performance. We achieve an accuracy of 98%. Besides word level accuracy, we use two new metrics to evaluate this task.
3 0.75229657 118 emnlp-2013-Learning Biological Processes with Global Constraints
Author: Aju Thalappillil Scaria ; Jonathan Berant ; Mengqiu Wang ; Peter Clark ; Justin Lewis ; Brittany Harding ; Christopher D. Manning
Abstract: Biological processes are complex phenomena involving a series of events that are related to one another through various relationships. Systems that can understand and reason over biological processes would dramatically improve the performance of semantic applications involving inference such as question answering (QA), specifically "How?" and "Why?" questions. In this paper, we present the task of process extraction, in which events within a process and the relations between the events are automatically extracted from text. We represent processes by graphs whose edges describe a set of temporal, causal and co-reference event-event relations, and characterize the structural properties of these graphs (e.g., the graphs are connected). Then, we present a method for extracting relations between the events, which exploits these structural properties by performing joint inference over the set of extracted relations. On a novel dataset containing 148 descriptions of biological processes (released with this paper), we show significant improvement comparing to baselines that disregard process structure.
4 0.55803961 151 emnlp-2013-Paraphrasing 4 Microblog Normalization
Author: Wang Ling ; Chris Dyer ; Alan W Black ; Isabel Trancoso
Abstract: Compared to the edited genres that have played a central role in NLP research, microblog texts use a more informal register with nonstandard lexical items, abbreviations, and free orthographic variation. When confronted with such input, conventional text analysis tools often perform poorly. Normalization, replacing orthographically or lexically idiosyncratic forms with more standard variants, can improve performance. We propose a method for learning normalization rules from machine translations of a parallel corpus of microblog messages. To validate the utility of our approach, we evaluate extrinsically, showing that normalizing English tweets and then translating improves translation quality (compared to translating unnormalized text) using three standard web translation services as well as a phrase-based translation system trained on parallel microblog data.
5 0.53447497 143 emnlp-2013-Open Domain Targeted Sentiment
Author: Margaret Mitchell ; Jacqui Aguilar ; Theresa Wilson ; Benjamin Van Durme
Abstract: We propose a novel approach to sentiment analysis for a low resource setting. The intuition behind this work is that sentiment expressed towards an entity, targeted sentiment, may be viewed as a span of sentiment expressed across the entity. This representation allows us to model sentiment detection as a sequence tagging problem, jointly discovering people and organizations along with whether there is sentiment directed towards them. We compare performance in both Spanish and English on microblog data, using only a sentiment lexicon as an external resource. By leveraging linguisticallyinformed features within conditional random fields (CRFs) trained to minimize empirical risk, our best models in Spanish significantly outperform a strong baseline, and reach around 90% accuracy on the combined task of named entity recognition and sentiment prediction. Our models in English, trained on a much smaller dataset, are not yet statistically significant against their baselines.
6 0.53238356 47 emnlp-2013-Collective Opinion Target Extraction in Chinese Microblogs
7 0.53154618 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization
8 0.53026676 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging
9 0.52827376 38 emnlp-2013-Bilingual Word Embeddings for Phrase-Based Machine Translation
10 0.52744335 81 emnlp-2013-Exploring Demographic Language Variations to Improve Multilingual Sentiment Analysis in Social Media
11 0.52653962 13 emnlp-2013-A Study on Bootstrapping Bilingual Vector Spaces from Non-Parallel Data (and Nothing Else)
12 0.5263865 64 emnlp-2013-Discriminative Improvements to Distributional Sentence Similarity
13 0.52601266 123 emnlp-2013-Learning to Rank Lexical Substitutions
14 0.52539837 82 emnlp-2013-Exploring Representations from Unlabeled Data with Co-training for Chinese Word Segmentation
15 0.525316 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks
16 0.5249297 46 emnlp-2013-Classifying Message Board Posts with an Extracted Lexicon of Patient Attributes
17 0.52473634 124 emnlp-2013-Leveraging Lexical Cohesion and Disruption for Topic Segmentation
18 0.52441502 167 emnlp-2013-Semi-Markov Phrase-Based Monolingual Alignment
19 0.52384901 51 emnlp-2013-Connecting Language and Knowledge Bases with Embedding Models for Relation Extraction
20 0.52325773 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction