emnlp emnlp2012 emnlp2012-22 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Bo Han ; Paul Cook ; Timothy Baldwin
Abstract: Microblog normalisation methods often utilise complex models and struggle to differentiate between correctly-spelled unknown words and lexical variants of known words. In this paper, we propose a method for constructing a dictionary of lexical variants of known words that facilitates lexical normalisation via simple string substitution (e.g. tomorrow for tmrw). We use context information to generate possible variant and normalisation pairs and then rank these by string similarity. Highly-ranked pairs are selected to populate the dictionary. We show that a dictionary-based approach achieves state-of-the-art performance for both F-score and word error rate on a standard dataset. Compared with other methods, this approach offers a fast, lightweight and easy-to-use solution, and is thus suitable for high-volume microblog pre-processing.

1 Lexical Normalisation

A staggering number of short text “microblog” messages are produced every day through social media such as Twitter (Twitter, 2011). The immense volume of real-time, user-generated microblogs that flows through sites has been shown to have utility in applications such as disaster detection (Sakaki et al., 2010), sentiment analysis (Jiang et al., 2011; González-Ibáñez et al., 2011), and event discovery (Weng and Lee, 2011; Benson et al., 2011). However, due to the spontaneous nature of the posts, microblogs are notoriously noisy, containing many non-standard forms (e.g., tmrw “tomorrow” and 2day “today”) which degrade the performance of natural language processing (NLP) tools (Ritter et al., 2010; Han and Baldwin, 2011). To reduce this effect, attempts have been made to adapt NLP tools to microblog data (Gimpel et al., 2011; Foster et al., 2011; Liu et al., 2011b; Ritter et al., 2011). An alternative approach is to pre-normalise non-standard lexical variants to their standard orthography (Liu et al., 2011a; Han and Baldwin, 2011; Xue et al., 2011; Gouws et al., 2011). For example, se u 2morw!!!
would be normalised to see you tomorrow! The normalisation approach is especially attractive as a preprocessing step for applications which rely on keyword match or word frequency statistics. For example, earthqu, eathquake, and earthquakeee (all attested in a Twitter corpus) have the standard form earthquake; by normalising these types to their standard form, better coverage can be achieved for keyword-based methods, and better word frequency estimates can be obtained.

In this paper, we focus on the task of lexical normalisation of English Twitter messages, in which out-of-vocabulary (OOV) tokens are normalised to their in-vocabulary (IV) standard form, i.e., a standard form that is in a dictionary. Following other recent work on lexical normalisation (Liu et al., 2011a; Han and Baldwin, 2011; Gouws et al., 2011; Liu et al., 2012), we specifically focus on one-to-one normalisation in which one OOV token is normalised to one IV word.

Naturally, not all OOV words in microblogs are lexical variants of IV words: named entities, for example, are prevalent in microblogs, but not all named entities are included in our dictionary. One challenge for lexical normalisation is therefore to distinguish those OOV tokens that require normalisation from those that are well-formed. Recent unsupervised approaches have not attempted to distinguish such tokens from other types of OOV tokens (Cook and Stevenson, 2009; Liu et al., 2011a), limiting their applicability to real-world normalisation tasks. Other approaches (Han and Baldwin, 2011; Gouws et al., 2011) have followed a cascaded approach in which lexical variants are first identified, and then normalised.

Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 421–432, Jeju Island, Korea, 12–14 July 2012. ©2012 Association for Computational Linguistics
However, such two-step approaches suffer from poor lexical variant identification performance, which is propagated to the normalisation step. Motivated by the observation that most lexical variants have an unambiguous standard form (especially for longer tokens), and that a lexical variant and its standard form typically occur in similar contexts, in this paper we propose methods for automatically constructing a lexical normalisation dictionary (a dictionary whose entries consist of (lexical variant, standard form) pairs) that enables type-based normalisation.

Despite the simplicity of this dictionary-based normalisation method, we show it to outperform previously-proposed approaches. This very fast, lightweight solution is suitable for real-time processing of the large volume of streaming microblog data available from Twitter, and offers a simple solution to the lexical variant detection problem that hinders other normalisation methods. Furthermore, this dictionary-based method can be easily integrated with other more-complex normalisation approaches (Liu et al., 2011a; Han and Baldwin, 2011; Gouws et al., 2011) to produce hybrid systems.

After discussing related work in Section 2, we present an overview of our dictionary-based approach to normalisation in Section 3. In Sections 4 and 5 we experimentally select the optimised context similarity parameters and string similarity re-ranking method. We present experimental results on the unseen test data in Section 6, and offer some concluding remarks in Section 7.

2 Related Work

Given a token t, lexical normalisation is the task of finding $\arg\max_s P(s \mid t) \propto \arg\max_s P(t \mid s)\,P(s)$, where s is the standard form, i.e., an IV word. Standardly in lexical normalisation, t is assumed to be an OOV token, relative to a fixed dictionary.
In practice, not all OOV tokens should be normalised; i.e., only lexical variants (e.g., tmrw “tomorrow”) should be normalised and tokens that are OOV but otherwise not lexical variants (e.g., iPad “iPad”) should be unchanged. Most work in this area focuses only on the normalisation task itself, oftentimes assuming that the task of lexical variant detection has already been completed.

Various approaches have been proposed to estimate the error model, P(t|s). For example, in work on spell-checking, Brill and Moore (2000) improve on a standard edit-distance approach by considering multi-character edit operations; Toutanova and Moore (2002) build on this by incorporating phonological information. Li et al. (2006) utilise distributional similarity (Lin, 1998) to correct misspelled search queries. In text message normalisation, Choudhury et al. (2007) model the letter transformations and emissions using a hidden Markov model (Rabiner, 1989). Cook and Stevenson (2009) and Xue et al. (2011) propose multiple simple error models, each of which captures a particular way in which lexical variants are formed, such as phonetic spelling (e.g., epik “epic”) or clipping (e.g., walkin “walking”). Nevertheless, optimally weighting the various error models in these approaches is challenging.

Without pre-categorising lexical variants into different types, Liu et al. (2011a) collect Google search snippets from carefully-designed queries from which they then extract noisy lexical variant–standard form pairs. These pairs are used to train a conditional random field (Lafferty et al., 2001) to estimate P(t|s) at the character level. One shortcoming of querying a search engine to obtain training pairs is that it tends to be costly in terms of time and bandwidth. Here we exploit microblog data directly to derive (lexical variant, standard form) pairs, instead of relying on external resources. In more recent work, Liu et al.
(2012) endeavour to improve the accuracy of top-n normalisation candidates by integrating human cognitive inference, character-level transformations and spell checking in their normalisation model. The encouraging results shift the focus to re-ranking and promoting the correct normalisation to the top-1 position. However, like much previous work on lexical normalisation, this work assumes perfect lexical variant detection.

Aw et al. (2006) and Kaufmann and Kalita (2010) consider normalisation as a machine translation task from lexical variants to standard forms using off-the-shelf tools. These methods do not assume that lexical variants have been pre-identified; however, these methods do rely on large quantities of labelled training data, which is not available for microblogs.

Recently, Han and Baldwin (2011) and Gouws et al. (2011) propose two-step unsupervised approaches to normalisation, in which lexical variants are first identified, and then normalised. They approach lexical variant detection by using a context fitness classifier (Han and Baldwin, 2011) or through dictionary lookup (Gouws et al., 2011). However, the lexical variant detection of both methods is rather unreliable, indicating the challenge of this aspect of normalisation. Both of these approaches incorporate a relatively small normalisation dictionary to capture frequent lexical variants with high precision. In particular, Gouws et al. (2011) produce a small normalisation lexicon based on distributional similarity and string similarity (Lodhi et al., 2002). Our method adopts a similar strategy using distributional/string similarity, but instead of constructing a small lexicon for preprocessing, we build a much wider-coverage normalisation dictionary and opt for a fully lexicon-based end-to-end normalisation approach. In contrast to the normalisation dictionaries of Han and Baldwin (2011) and Gouws et al.
(2011), which focus on very frequent lexical variants, we focus on moderate frequency lexical variants of a minimum character length, which tend to have unambiguous standard forms; our intention is to produce normalisation lexicons that are complementary to those currently available. Furthermore, we investigate the impact of a variety of contextual and string similarity measures on the quality of the resulting lexicons. In summary, our dictionary-based normalisation approach is a lightweight end-to-end method which performs both lexical variant detection and normalisation, and thus is suitable for practical online preprocessing, despite its simplicity.

3 A Lexical Normalisation Dictionary

Before discussing our method for creating a normalisation dictionary, we first discuss the feasibility of such an approach.

3.1 Feasibility

Dictionary lookup approaches to normalisation have been shown to have high precision but low recall (Han and Baldwin, 2011; Gouws et al., 2011). Frequent (lexical variant, standard form) pairs such as (u, you) are typically included in the dictionaries used by such methods, while less-frequent items such as (g0tta, gotta) are generally omitted. Because of the degree of lexical creativity and large number of non-standard forms observed on Twitter, a wide-coverage normalisation dictionary would be expensive to construct manually. Based on the assumption that lexical variants occur in similar contexts to their standard forms, however, it should be possible to automatically construct a normalisation dictionary with wider coverage than is currently available.

Dictionary lookup is a type-based approach to normalisation, i.e., every token instance of a given type will always be normalised in the same way. However, lexical variants can be ambiguous, e.g., y corresponds to “you” in yeah, y r right! LOL but “why” in AM CONFUSED!!! y you did that?
Nevertheless, the relative occurrence of ambiguous lexical variants is small (Liu et al., 2011a), and it has been observed that while shorter variants such as y are often ambiguous, longer variants tend to be unambiguous. For example, bthday and 4eva are unlikely to have standard forms other than “birthday” and “forever”, respectively. Therefore, the normalisation lexicons we produce will only contain entries for OOVs with character length greater than a specified threshold, which are likely to have an unambiguous standard form.

3.2 Overview of approach

Our method for constructing a normalisation dictionary is as follows:

Input: Tokenised English tweets
1. Extract (OOV, IV) pairs based on distributional similarity.
2. Re-rank the extracted pairs by string similarity.
Output: A list of (OOV, IV) pairs ordered by string similarity; select the top-n pairs for inclusion in the normalisation lexicon.

In Step 1, we leverage large volumes of Twitter data to identify the most distributionally-similar IV type for each OOV type. The result of this process is a set of (OOV, IV) pairs, ranked by distributional similarity. The extracted pairs will include (lexical variant, standard form) pairs, such as (tmrw, tomorrow), but will also contain false positives such as (Tusday, Sunday), where Tusday is a lexical variant but its standard form is not “Sunday”, and (Youtube, web), where Youtube is an OOV named entity, not a lexical variant. Nevertheless, lexical variants are typically formed from their standard forms through regular processes (Thurlow, 2003), e.g., the omission of characters, and from this perspective Sunday and web are not plausible standard forms for Tusday and Youtube, respectively. In Step 2, we therefore capture this intuition to re-rank the extracted pairs by string similarity. The top-n items in this re-ranked list then form the normalisation lexicon, which is based only on development data.
Although computationally-expensive to build, this dictionary can be created offline. Once built, it then offers a very fast approach to normalisation.

We can only reliably compute distributional similarity for types that are moderately frequent in a corpus. Nevertheless, many lexical variants are sufficiently frequent to be able to compute distributional similarity, and can potentially make their way into our normalisation lexicon. This approach is not suitable for normalising low-frequency lexical variants, nor is it suitable for shorter lexical variant types which, as discussed in Section 3.1, are more likely to have an ambiguous standard form. Nevertheless, previously-proposed normalisation methods that can handle such phenomena also rely in part on a normalisation lexicon. The normalisation lexicons we create can therefore be easily integrated with previous approaches to form hybrid normalisation systems.

4 Contextually-similar Pair Generation

Our objective is to extract contextually-similar (OOV, IV) pairs from a large-scale collection of microblog data. Fundamentally, the surrounding words define the primary context, but there are different ways of representing context and different similarity measures we can use, which may influence the quality of generated normalisation pairs.

In representing the context, we experimentally explore the following factors: (1) context window size (from 1 to 3 tokens on both sides); (2) n-gram order of the context tokens (unigram, bigram, trigram); (3) whether context words are indexed for relative position or not; and (4) whether we use all context tokens, or only IV words. Because high-accuracy linguistic processing tools for Twitter are still under exploration (Liu et al., 2011b; Gimpel et al., 2011; Ritter et al., 2011; Foster et al., 2011), we do not consider richer representations of context, for example, incorporating information about part-of-speech tags or syntax.
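The context factors listed above can be illustrated with a minimal sketch. The function names (`context_features`, `build_context_counts`) and the exact way an n-gram is tied to a position are assumptions made for illustration; the paper does not spell out these details. Here an n-gram must lie entirely on one side of the target token and inside the window, and the positional variant tags it with the offset of its first token. Factor (4), restricting context to IV words, is left out for brevity.

```python
from collections import Counter, defaultdict


def context_features(tokens, i, window=2, n=2, positional=True):
    """Collect the context n-grams around tokens[i].

    Assumption: an n-gram must lie entirely on one side of the target and
    inside the +/-window span; with positional=True it is tagged with the
    offset of its first token relative to the target.
    """
    feats = []
    for offset in range(-window, window + 1):
        start, end = i + offset, i + offset + n  # end is exclusive
        if start < 0 or end > len(tokens):
            continue  # n-gram would run off the tweet
        if start <= i < end:
            continue  # n-gram would contain the target token itself
        if end - 1 > i + window:
            continue  # n-gram would extend past the right edge of the window
        gram = tuple(tokens[start:end])
        feats.append((offset, gram) if positional else gram)
    return feats


def build_context_counts(tweets, targets, **kwargs):
    """Map each target type to a Counter over its context features."""
    counts = defaultdict(Counter)
    for tokens in tweets:
        for i, tok in enumerate(tokens):
            if tok in targets:
                counts[tok].update(context_features(tokens, i, **kwargs))
    return counts
```

With a window of ±2 and bigrams, each occurrence contributes at most one left and one right bigram, so tmrw and tomorrow used in the same frame ("see you _ at the") end up with identical context counts.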
We also experiment with a number of simple but widely-used geometric and information theoretic distance/similarity measures. In particular, we use Kullback–Leibler (KL) divergence (Kullback and Leibler, 1951), Jensen–Shannon (JS) divergence (Lin, 1991), Euclidean distance and Cosine distance.

We use a corpus of 10 million English tweets to do parameter tuning over, and a larger corpus of tweets in the final candidate ranking. All tweets were collected from September 2010 to January 2011 via the Twitter API.[1] From the raw data we extract English tweets using a language identification tool (Lui and Baldwin, 2011), and then apply a simplified Twitter tokeniser (adapted from O’Connor et al. (2010)). We use the Aspell dictionary (v6.06)[2] to determine whether a word is IV, and only include in our normalisation dictionary OOV tokens with at least 64 occurrences in the corpus and character length ≥ 4, both of which were determined through empirical observation. For each OOV word type in the corpus, we select the most similar IV type to form (OOV, IV) pairs. To further narrow the search space, we only consider IV words which are morphophonemically similar to the OOV type, following settings in Han and Baldwin (2011).[3]

[1] https://dev.twitter.com/docs/streaming-api/methods
[2] http://aspell.net/
[3] We only consider IV words within an edit distance of 2 or a phonemic edit distance of 1 from the OOV type, and we further only consider the top 30% most-frequent of these IV words.

In order to evaluate the generated pairs, we randomly selected 1000 OOV words from the 10 million tweet corpus. We set up an annotation task on Amazon Mechanical Turk,[4] presenting five independent annotators with each word type (with no context) and asking for corrections where appropriate. For instance, given tmrw, the annotators would likely identify it as a non-standard variant of “tomorrow”. For correct OOV words like iPad, on the other hand, we would expect them to leave the word unchanged. If 3 or more of the 5 annotators make the same suggestion (in the form of either a canonical spelling or leaving the word unchanged), we include this in our gold standard for evaluation. In total, this resulted in 351 lexical variants and 282 correct OOV words, accounting for 63.3% of the 1000 OOV words. These 633 OOV words were used as (OOV, IV) pairs for parameter tuning. The remainder of the 1000 OOV words were ignored on the grounds that there was not sufficient consensus amongst the annotators.[5]

[4] https://www.mturk.com/mturk/welcome
[5] Note that the objective of this annotation task is to identify lexical variants that have agreed-upon standard forms irrespective of context, as a special case of the more general task of lexical normalisation (where context may or may not play a significant role in the determination of the normalisation).

Contextually-similar pair generation aims to include as many correct normalisation pairs as possible. We evaluate the quality of the normalisation pairs using “Cumulative Gain” (CG):

$$\mathrm{CG} = \sum_{i=1}^{N'} \mathrm{rel}'_i$$

Suppose there are $N'$ correct generated pairs $(oov_i, iv_i)$, each of which is weighted by $\mathrm{rel}'_i$, the frequency of $oov_i$ to indicate its relative importance; for example, (thinkin, thinking) has a higher weight than (g0tta, gotta) because thinkin is more frequent than g0tta in our corpus. In this evaluation we don’t consider the position of normalisation pairs, and nor do we penalise incorrect pairs. Instead, we push distinguishing between correct and incorrect pairs into the downstream re-ranking step in which we incorporate string similarity information.

Given the development data and CG, we run an exhaustive search of parameter combinations over our development corpus. The five best parameter combinations are shown in Table 1. We notice the CG is almost identical for the top combinations.
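As a concrete reading of the Cumulative Gain definition, the sketch below sums the corpus frequency of every generated pair that matches the gold standard and ignores incorrect pairs. The function name `cumulative_gain`, the dictionary-based data layout and the frequency counts are illustrative assumptions, not values from the paper.

```python
def cumulative_gain(generated_pairs, gold, oov_freq):
    """Sum the corpus frequency of each OOV whose generated IV is correct.

    generated_pairs: iterable of (oov, iv) tuples
    gold:            dict mapping an OOV type to its gold-standard form
    oov_freq:        dict mapping an OOV type to its corpus frequency
    Incorrect pairs contribute nothing, and ranking position is ignored.
    """
    return sum(oov_freq[oov] for oov, iv in generated_pairs if gold.get(oov) == iv)


# Toy data: the frequencies are made up for illustration only.
pairs = [("thinkin", "thinking"), ("g0tta", "gotta"), ("Tusday", "Sunday")]
gold = {"thinkin": "thinking", "g0tta": "gotta", "Tusday": "Tuesday"}
freq = {"thinkin": 500, "g0tta": 70, "Tusday": 90}
cg = cumulative_gain(pairs, gold, freq)
```

Because (Tusday, Sunday) is wrong it is simply skipped, so only thinkin and g0tta contribute, and the more frequent thinkin dominates the total.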
As a context window size of 3 incurs a heavy processing and memory overhead over a size of 2, we use the 3rd-best parameter combination for subsequent experiments, namely: context window of ±2 tokens, token bigrams, positional index, and KL divergence as our distance measure.

To better understand the sensitivity of the method to each parameter, we perform a post-hoc parameter analysis relative to a default setting (as underlined in Table 2), altering one parameter at a time. The results in Table 2 show that bigrams outperform other n-gram orders by a large margin (note that the evaluation is based on a log scale), and information-theoretic measures are superior to the geometric measures. Furthermore, it also indicates that using the positional indexing better captures context. However, there is little to distinguish context modelling with just IV words or all tokens. Similarly, the context window size has relatively little impact on the overall performance, supporting our earlier observation from Table 1.

5 Pair Re-ranking by String Similarity

Once the contextually-similar (OOV, IV) pairs are generated using the selected parameters in Section 4, we further re-rank this set of pairs in an attempt to boost morphophonemically-similar pairs like (bananaz, bananas), and penalise noisy pairs like (paninis, beans). Instead of using the small 10 million tweet corpus, from this step onwards, we use a larger corpus of 80 million English tweets (collected over the same period as the development corpus) to develop a larger-scale normalisation dictionary. This is because once pairs are generated, re-ranking based on string comparison is much faster. We only include in the dictionary OOV words with a token frequency > 15 to include more OOV types than in Section 4, and again apply a minimum length cutoff of 4 characters.
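The information-theoretic measures compared in Section 4 can be sketched as follows. Several details here are assumptions rather than the paper's settings: the additive smoothing constant `alpha`, the direction of the KL divergence (OOV distribution first), the shared feature vocabulary, and the helper names `to_distribution` and `most_similar_iv`.

```python
import math
from collections import Counter


def to_distribution(counts, vocab, alpha=1e-3):
    """Turn raw context counts into a smoothed probability distribution.

    Additive smoothing (alpha) is an assumption; it keeps KL divergence finite.
    """
    total = sum(counts.values()) + alpha * len(vocab)
    return {f: (counts.get(f, 0) + alpha) / total for f in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same feature set."""
    return sum(p[f] * math.log(p[f] / q[f]) for f in p)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetrised KL against the midpoint."""
    m = {f: 0.5 * (p[f] + q[f]) for f in p}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def most_similar_iv(oov_counts, iv_candidates, distance=kl_divergence):
    """Return (iv, distance) for the candidate closest to the OOV's contexts.

    iv_candidates maps each candidate IV word to its context Counter and
    must not be empty.
    """
    vocab = set(oov_counts)
    for counts in iv_candidates.values():
        vocab |= set(counts)
    p = to_distribution(oov_counts, vocab)
    scored = [(distance(p, to_distribution(c, vocab)), iv)
              for iv, c in iv_candidates.items()]
    best_distance, best_iv = min(scored)
    return best_iv, best_distance
```

In the pipeline described above, the candidate set would first be restricted to morphophonemically similar IV words; that filter is not part of this sketch.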
Rank  Window size  n-gram  Positional index?  Lex. choice  Sim/distance measure  log(CG)
1     ±3           2       Yes                All          KL divergence         19.571
2     ±3           2       No                 All          KL divergence         19.562
3     ±2           2       Yes                All          KL divergence         19.562
4     ±3           2       Yes                IVs          KL divergence         19.561
5     ±2           2       Yes                IVs          JS divergence         19.554

Table 1: The five best parameter combinations in the exhaustive search of parameter combinations

Parameter                    Setting        log(CG)
Window size                  ±1             19.325
                             ±2             19.327
                             ±3             19.328
n-gram                       1              19.328
                             2              19.571
                             3              19.324
Positional index?            Yes            19.328
                             No             19.263
Lexical choice               IVs            19.335
                             All            19.328
Similarity/distance measure  KL divergence  19.328
                             Euclidean      19.227
                             JS divergence  19.311
                             Cosine         19.170

Table 2: Parameter sensitivity analysis measured as log(CG) for correctly-generated pairs. We tune one parameter at a time, using the default (underlined) setting for other parameters; the non-exhaustive best-performing setting in each case is indicated in bold.

To measure how well our re-ranking method promotes correct pairs and demotes incorrect pairs (including both OOV words that should not be normalised, e.g. (Youtube, web), and incorrect normalisations for lexical variants, e.g. (bcuz, cause)), we modify our evaluation metric from Section 4 to evaluate the ranking at different points, using Discounted Cumulative Gain (DCG@N: Jarvelin and Kekalainen (2002)):

$$\mathrm{DCG@}N = \mathrm{rel}_1 + \sum_{i=2}^{N} \frac{\mathrm{rel}_i}{\log_2(i)}$$

where $\mathrm{rel}_i$ again represents the frequency of the OOV, but it can be gain (a positive number) or loss (a negative number), depending on whether the ith pair is correct or incorrect. Because we also expect correct pairs to be ranked higher than incorrect pairs, DCG@N takes both factors into account.
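The DCG@N definition above, with signed relevance, can be sketched in a few lines. The function name `dcg_at_n`, the data layout and the frequency values are assumptions chosen for illustration; a correct OOV such as Youtube is represented in the gold standard as mapping to itself.

```python
import math


def dcg_at_n(ranked_pairs, gold, oov_freq, n):
    """DCG@N with signed relevance.

    rel_i is +freq(oov_i) when the i-th pair matches the gold standard and
    -freq(oov_i) otherwise; ranks from 2 onwards are discounted by log2(rank).
    """
    score = 0.0
    for rank, (oov, iv) in enumerate(ranked_pairs[:n], start=1):
        freq = oov_freq[oov]
        rel = freq if gold.get(oov) == iv else -freq
        score += rel if rank == 1 else rel / math.log2(rank)
    return score


# Toy ranking: the frequencies are invented for illustration.
ranking = [("goin", "going"), ("Youtube", "web"), ("nite", "night"), ("bcuz", "cause")]
gold = {"goin": "going", "Youtube": "Youtube", "nite": "night", "bcuz": "because"}
freq = {"goin": 100, "Youtube": 80, "nite": 40, "bcuz": 16}
scores = [dcg_at_n(ranking, gold, freq, n) for n in range(1, 5)]
```

In this toy ranking the incorrect pair (Youtube, web) at rank 2 subtracts its full frequency because log2(2) = 1, which shows how a frequent named entity near the top of the list is penalised heavily.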
Given the generated pairs and the evaluation metric, we first consider three baselines: no re-ranking (i.e., the final ranking is that of the contextual similarity scores), and re-rankings of the pairs based on the frequencies of the OOVs in the Twitter corpus, and the IV unigram frequencies in the Google Web 1T corpus (Brants and Franz, 2006) to get less-noisy frequency estimates. We also compared a variety of re-rankings based on a number of string similarity measures that have been previously considered in normalisation work (reviewed in Section 2). We experiment with standard edit distance (Levenshtein, 1966), edit distance over double metaphone codes (phonetic edit distance: (Philips, 2000)), longest common subsequence ratio over the consonant edit distance of the paired words (hereafter, denoted as consonant edit distance: (Contractor et al., 2010)), and a string subsequence kernel (Lodhi et al., 2002).

In Figure 1, we present the DCG@N results for each of our ranking methods at different rank cutoffs. Ranking by OOV frequency is motivated by the assumption that lexical variants are frequently used by social media users. This is confirmed by our findings that lexical pairs like (goin, going) and (nite, night) are at the top of the ranking. However, many proper nouns and named entities are also used frequently, so that pairs like (Facebook, speech) and (Youtube, web) are ranked at the top, mixed with lexical variants. In ranking by IV word frequency, we assume the lexical variants are usually derived from frequently-used IV equivalents, e.g. (abou, about). However, many less-frequent lexical variant types have high-frequency (IV) normalisations. For instance, the highest-frequency IV word the has more than 40 OOV lexical variants, such as tthe and thhe. These less-frequent types occupy the top positions, reducing the cumulative gain. Compared with these two baselines, ranking by default contextual similarity scores delivers promising results.
It successfully ranks many more intuitive normalisation pairs at the top, such as (2day, today) and (wknd, weekend), but also ranks some incorrect pairs highly, such as (needa, gotta).

The string similarity-based methods perform better than our baselines in general. Through manual analysis, we found that standard edit distance ranking is fairly accurate for lexical variants with low edit distance to their standard forms, but fails to identify heavily-altered variants like (tmrw, tomorrow). Consonant edit distance is similar to standard edit distance, but places many longer words at the top of the ranking. Edit distance over double metaphone codes (phonetic edit distance) performs particularly well for lexical variants that include character repetitions (commonly used for emphasis on Twitter) because such repetitions do not typically alter the phonetic codes. Compared with the other methods, the string subsequence kernel delivers encouraging results. It measures common character subsequences of length n between (OOV, IV) pairs. Because it is computationally expensive to calculate similarity for larger n, we choose n=2, following Gouws et al. (2011). As N (the lexicon size cut-off) increases, the performance drops more slowly than that of the other methods. Although this method fails to rank heavily-altered variants such as (4get, forget) highly, it typically works well for longer words. Given that we focus on longer OOVs (specifically those of at least 4 characters), this ultimately isn’t a great handicap.

6 Evaluation

Given the re-ranked pairs from Section 5, here we apply them to a token-level normalisation task using the normalisation dataset of Han and Baldwin (2011).

6.1 Metrics

We evaluate using the standard evaluation metrics of precision (P), recall (R) and F-score (F) as detailed below. We also consider the false alarm rate (FA) and word error rate (WER), also as shown below.
FA measures the negative effects of applying normalisation; a good approach to normalisation should not (incorrectly) normalise tokens that are already in their standard form and do not require normalisation.[6] WER, like F-score, shows the overall benefits of normalisation, but unlike F-score, measures how many token-level edits are required for the output to be the same as the ground truth data. In general, dictionaries with a high F-score/low WER and low FA are preferable.

$$P = \frac{\#\,\text{correctly normalised tokens}}{\#\,\text{normalised tokens}}$$

$$R = \frac{\#\,\text{correctly normalised tokens}}{\#\,\text{tokens requiring normalisation}}$$

$$F = \frac{2PR}{P + R}$$

$$FA = \frac{\#\,\text{incorrectly normalised tokens}}{\#\,\text{normalised tokens}}$$

$$WER = \frac{\#\,\text{token edits needed after normalisation}}{\#\,\text{all tokens}}$$

[6] FA + P ≤ 1 because some lexical variants might be incorrectly normalised.

6.2 Results

We select the three best re-ranking methods, and best cut-off N for each method, based on the highest DCG@N value for a given method over the development data, as presented in Figure 1. Namely, they are string subsequence kernel (S-dict, N=40,000), double metaphone edit distance (DM-dict, N=10,000) and default contextual similarity without re-ranking (C-dict, N=10,000).[7] We evaluate each of the learned dictionaries in Table 3. We also compare each dictionary with the performance of the manually-constructed Internet slang dictionary (HB-dict) used by Han and Baldwin (2011), the small automatically-derived dictionary of Gouws et al. (2011) (GHM-dict), and combinations of the different dictionaries. In addition, the contribution of these dictionaries in hybrid normalisation approaches is also presented, in which we first normalise OOVs using a given dictionary (combined or otherwise), and then apply the normalisation method of Gouws et al. (2011) based on consonant edit distance (GHM-norm), or the approach of Han and Baldwin (2011) based on the summation of many unsupervised approaches (HB-norm), to the remaining OOVs.
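The metrics of Section 6.1 can be computed from per-token triples, as in the sketch below. The triple layout (source token, system output, gold form), the function name `evaluate`, and two interpretations are assumptions: false alarms are counted as normalised tokens that were already in their standard form (which is what makes FA + P ≤ 1 rather than = 1), and under one-to-one normalisation a token edit is counted as one substitution.

```python
def evaluate(triples):
    """Compute P, R, F, FA and WER from (source, output, gold) token triples."""
    normalised = [(s, o, g) for s, o, g in triples if o != s]
    correct = sum(1 for s, o, g in normalised if o == g)
    required = sum(1 for s, o, g in triples if g != s)
    # Assumption: a false alarm is a changed token that needed no change.
    false_alarms = sum(1 for s, o, g in normalised if g == s)
    # Assumption: one remaining mismatch equals one token-level edit.
    edits = sum(1 for s, o, g in triples if o != g)

    p = correct / len(normalised) if normalised else 0.0
    r = correct / required if required else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    fa = false_alarms / len(normalised) if normalised else 0.0
    wer = edits / len(triples) if triples else 0.0
    return {"P": p, "R": r, "F": f, "FA": fa, "WER": wer}


# Toy example covering the main outcomes.
triples = [
    ("tmrw", "tomorrow", "tomorrow"),  # correctly normalised
    ("iphone", "phone", "iphone"),     # false alarm on a correct OOV
    ("bcuz", "cause", "because"),      # variant normalised to the wrong form
    ("2day", "2day", "today"),         # variant left untouched (missed)
    ("see", "see", "see"),             # IV token, nothing to do
]
result = evaluate(triples)
```

The (bcuz, cause) case lowers precision without counting as a false alarm, which is the situation footnote 6 describes.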
Results are shown in Table 3, and discussed below.

[7] We also experimented with combining ranks using Mean Reciprocal Rank. However, the combined rank didn’t improve performance on the development data. We plan to explore other ranking aggregation methods in future work.

Figure 1: Re-ranking based on different string similarity methods (x-axis: N cut-offs).

6.2.1 Individual Dictionaries

Overall, the individual dictionaries derived by the re-ranking methods (DM-dict, S-dict) perform better than that based on contextual similarity (C-dict) in terms of precision and false alarm rate, indicating the importance of re-ranking. Even though C-dict delivers higher recall (indicating that many lexical variants are correctly normalised), this is offset by its high false alarm rate, which is particularly undesirable in normalisation. Because S-dict has better performance than DM-dict in terms of both F-score and WER, and a much lower false alarm rate than C-dict, subsequent results are presented using S-dict only.

Both HB-dict and GHM-dict achieve better than 90% precision with moderate recall. Compared to these methods, S-dict is not competitive in terms of either precision or recall. This result seems rather discouraging. However, considering that S-dict is an automatically-constructed dictionary targeting lexical variants of varying frequency, it is not surprising that the precision is worse than that of HB-dict, which is manually-constructed, and GHM-dict, which includes entries only for more-frequent OOVs for which distributional similarity is more accurate. Additionally, the recall of S-dict is hampered by the restriction on lexical variant token length of 4 characters.

6.2.2 Combined Dictionaries

Next we look to combining HB-dict, GHM-dict and S-dict. In combining the dictionaries, a given OOV word can be listed with different standard forms in different dictionaries.
In such cases we use the following preferences for dictionaries (motivated by our confidence in the normalisation pairs of the dictionaries) to resolve conflicts: HB-dict > GHM-dict > S-dict.

When we combine dictionaries in the second section of Table 3, we find that they contain complementary information: in each case the recall and F-score are higher for the combined dictionary than any of the individual dictionaries. The combination of HB-dict+GHM-dict produces only a small improvement in terms of F-score over HB-dict (the better-performing dictionary), suggesting that, as claimed, HB-dict and GHM-dict share many frequent normalisation pairs. HB-dict+S-dict and GHM-dict+S-dict, on the other hand, improve substantially over HB-dict and GHM-dict, respectively, indicating that S-dict contains markedly different entries to both HB-dict and GHM-dict. The best F-score and WER are obtained using the combination of all three dictionaries, HB-dict+GHM-dict+S-dict. Furthermore, the difference between the results using HB-dict+GHM-dict+S-dict and HB-dict+GHM-dict is statistically significant (p < 0.01), based on the computationally-intensive Monte Carlo method of Yeh (2000), demonstrating the contribution of S-dict.

Method                            Precision  Recall  F-Score  False Alarm  Word Error Rate
C-dict                            0.474      0.218   0.299    0.298        0.103
DM-dict                           0.727      0.106   0.185    0.145        0.102
S-dict                            0.700      0.179   0.285    0.162        0.097
HB-dict                           0.915      0.435   0.590    0.048        0.066
GHM-dict                          0.982      0.319   0.482    0.000        0.076

HB-dict+S-dict                    0.840      0.601   0.701    0.090        0.052
GHM-dict+S-dict                   0.863      0.498   0.632    0.072        0.061
HB-dict+GHM-dict                  0.920      0.465   0.618    0.045        0.063
HB-dict+GHM-dict+S-dict           0.847      0.630   0.723    0.086        0.049

GHM-dict+GHM-norm                 0.338      0.578   0.427    0.458        0.135
HB-dict+GHM-dict+S-dict+GHM-norm  0.406      0.715   0.518    0.468        0.124
HB-dict+HB-norm                   0.515      0.771   0.618    0.332        0.081
HB-dict+GHM-dict+S-dict+HB-norm   0.527      0.789   0.632    0.332        0.079

Table 3: Normalisation results using our derived dictionaries (contextual similarity (C-dict); double metaphone rendering (DM-dict); string subsequence kernel scores (S-dict)), the dictionary of Gouws et al. (2011) (GHM-dict), the Internet slang dictionary (HB-dict) from Han and Baldwin (2011), and combinations of these dictionaries. In addition, we combine the dictionaries with the normalisation method of Gouws et al. (2011) (GHM-norm) and the combined unsupervised approach of Han and Baldwin (2011) (HB-norm).

6.2.3 Hybrid Approaches

The methods of Gouws et al. (2011) (i.e. GHM-dict+GHM-norm) and Han and Baldwin (2011) (i.e. HB-dict+HB-norm) have lower precision and higher false alarm rates than the dictionary-based approaches; this is largely caused by lexical variant detection errors.[8] Using all dictionaries in combination with these methods (HB-dict+GHM-dict+S-dict+GHM-norm and HB-dict+GHM-dict+S-dict+HB-norm) gives some improvements, but the false alarm rates remain high. Despite the limitations of a pure dictionary-based approach to normalisation (discussed in Section 3.1), the current best practical approach to normalisation is to use a lexicon, combining hand-built and automatically-learned normalisation dictionaries.

[8] Here we report results that do not assume perfect detection of lexical variants, unlike the original published results in each case.

Error type            OOV      Standard form (Dict.)  Standard form (Gold)
(a) plurals           playe    players                player
(b) negation          unlike   like                   dislike
(c) possessives       anyones  anyone                 anyone ’s
(d) correct OOVs      iphone   phone                  iphone
(e) test data errors  durin    during                 durin
(f) ambiguity         siging   signing                singing

Table 4: Error types in the combined dictionary (HB-dict+GHM-dict+S-dict)

6.3 Discussion and Error Analysis

We first manually analyse the errors in the combined dictionary (HB-dict+GHM-dict+S-dict) and give examples of each error type in Table 4.
The most frequent word errors are caused by slight morphological variations, including plural forms (a), negations (b), possessive cases (c), and OOVs that are correct and do not require normalisation (d). In addition, we also notice some missing annotations where lexical variants are skipped by human annotations but captured by our method (e). Ambiguity (f) definitely exists in longer OOVs; however, these cases do not appear to have a strong negative impact on the normalisation performance. An example of a remaining miscellaneous error is bday “birthday”, which is mis-normalised as day.

Length cut-off (N)  #Variants  Precision  Recall (≥ N)  Recall (all)  False Alarm
≥ 4                 556        0.700      0.381         0.179         0.162
≥ 5                 382        0.814      0.471         0.152         0.122
≥ 6                 254        0.804      0.484         0.104         0.131
≥ 7                 138        0.793      0.471         0.055         0.122

Table 5: S-dict normalisation results broken down according to OOV token length. Recall is presented both over the subset of instances of length ≥ N in the data (“Recall (≥ N)”), and over the entirety of the dataset (“Recall (all)”); “#Variants” is the number of token instances of the indicated length in the test dataset.

To further study the influence of OOV word length relative to the normalisation performance, we conduct a fine-grained analysis of the performance of the derived dictionary (S-dict) in Table 5, broken down across different OOV word lengths. The results generally support our hypothesis that our method works better for longer OOV words. The derived dictionary is much more reliable for longer tokens (length 5, 6, and 7 characters) in terms of precision and false alarm. Although the recall is relatively modest, in the future we intend to improve recall by mining more normalisation pairs from larger collections of microblog data.
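The priority-based dictionary combination of Section 6.2.2 and the length-restricted evaluation behind Table 5 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names `combine_dictionaries` and `evaluate`, the toy dictionary entries, and the input format (a list of (observed token, gold standard form) pairs for OOV tokens) are all hypothetical. Precision, recall and false alarm rate follow the definitions used in the evaluation; false alarms are counted as tokens that needed no normalisation but were changed anyway.

```python
def combine_dictionaries(*dictionaries):
    """Merge normalisation dictionaries; earlier arguments win on conflicts.

    Passing (hb_dict, ghm_dict, s_dict) mirrors the preference
    HB-dict > GHM-dict > S-dict described in the text.
    """
    combined = {}
    for dictionary in dictionaries:
        for variant, standard in dictionary.items():
            # setdefault keeps the entry from the higher-priority dictionary.
            combined.setdefault(variant, standard)
    return combined


def evaluate(pairs, dictionary, min_length=4):
    """Score dictionary-lookup normalisation on (observed, gold) token pairs.

    Only tokens with at least `min_length` characters are looked up
    (hypothetical parameter standing in for the length cut-off N).
    """
    normalised = correct = false_alarms = 0
    needing_all = needing_long = 0
    for observed, gold in pairs:
        needs_fix = observed != gold
        is_long = len(observed) >= min_length
        needing_all += needs_fix
        needing_long += needs_fix and is_long
        proposal = dictionary.get(observed) if is_long else None
        if proposal is None or proposal == observed:
            continue  # token left untouched
        normalised += 1
        if proposal == gold:
            correct += 1
        elif not needs_fix:
            false_alarms += 1  # a well-formed token was changed

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "precision": ratio(correct, normalised),
        "recall_long": ratio(correct, needing_long),  # "Recall (>= N)"
        "recall_all": ratio(correct, needing_all),    # "Recall (all)"
        "false_alarm": ratio(false_alarms, normalised),
    }


if __name__ == "__main__":
    # Toy entries only; they are not taken from the real dictionaries.
    hb_dict = {"tmrw": "tomorrow", "bday": "birthday"}
    ghm_dict = {"2day": "today", "bday": "day"}
    s_dict = {"playe": "players", "iphone": "phone"}
    combined = combine_dictionaries(hb_dict, ghm_dict, s_dict)
    gold_pairs = [("tmrw", "tomorrow"), ("bday", "birthday"),
                  ("2day", "today"), ("playe", "player"),
                  ("iphone", "iphone"), ("u", "you")]
    print(evaluate(gold_pairs, combined, min_length=4))
```

In this toy run, playe is normalised to the wrong form and so counts towards neither precision nor false alarm, which is why the two rates need not sum to one.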
7 Conclusions and Future Work

In this paper, we describe a method for automatically constructing a normalisation dictionary that supports normalisation of microblog text through direct substitution of lexical variants with their standard forms. After investigating the impact of different distributional and string similarity methods on the quality of the dictionary, we present experimental results on a standard dataset showing that our proposed methods acquire high quality (lexical variant, standard form) pairs, with reasonable coverage, and achieve state-of-the-art end-to-end lexical normalisation performance on a real-world token-level task. Furthermore, this dictionary-lookup method combines the detection and normalisation of lexical variants into a simple, lightweight solution which is suitable for processing of high-volume microblog feeds.

In the future, we intend to improve our dictionary by leveraging the constantly-growing volume of microblog data, and considering alternative ways to combine distributional and string similarity. In addition to direct evaluation, we also want to explore the benefits of applying normalisation for downstream social media text processing applications, e.g. event detection.

Acknowledgements

We would like to thank the three anonymous reviewers for their insightful comments, and Stephan Gouws for kindly sharing his data and discussing his work. NICTA is funded by the Australian government as represented by Department of Broadband, Communication and Digital Economy, and the Australian Research Council through the ICT centre of Excellence programme.

References

AiTi Aw, Min Zhang, Juan Xiao, and Jian Su. 2006. A phrase-based statistical model for SMS text normalization. In Proceedings of COLING/ACL 2006, pages 33–40, Sydney, Australia.
Edward Benson, Aria Haghighi, and Regina Barzilay. 2011. Event discovery in social media feeds. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), pages 389–398, Portland, Oregon, USA.
Thorsten Brants and Alex Franz. 2006. Web 1T 5-gram Version 1.
Eric Brill and Robert C. Moore. 2000. An improved error model for noisy channel spelling correction. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 286–293, Hong Kong.
Monojit Choudhury, Rahul Saraf, Vijit Jain, Animesh Mukherjee, Sudeshna Sarkar, and Anupam Basu. 2007. Investigation and modeling of the structure of texting language. International Journal on Document Analysis and Recognition, 10:157–174.
Danish Contractor, Tanveer A. Faruquie, and L. Venkata Subramaniam. 2010. Unsupervised cleansing of noisy text. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), pages 189–196, Beijing, China.
Paul Cook and Suzanne Stevenson. 2009. An unsupervised model for text message normalization. In CALC ’09: Proceedings of the Workshop on Computational Approaches to Linguistic Creativity, pages 71–78, Boulder, USA.
Jennifer Foster, Özlem Çetinoglu, Joachim Wagner, Joseph L. Roux, Stephen Hogan, Joakim Nivre, Deirdre Hogan, and Josef van Genabith. 2011. #hardtoparse: POS Tagging and Parsing the Twitterverse. In Analyzing Microtext: Papers from the 2011 AAAI Workshop, volume WS-11-05 of AAAI Workshops, pages 20–25, San Francisco, CA, USA.
Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. 2011. Part-of-speech tagging for Twitter: Annotation, features, and experiments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), pages 42–47, Portland, Oregon, USA.
Roberto González-Ibáñez, Smaranda Muresan, and Nina Wacholder. 2011. Identifying sarcasm in Twitter: a closer look. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), pages 581–586, Portland, Oregon, USA.
Stephan Gouws, Dirk Hovy, and Donald Metzler. 2011. Unsupervised mining of lexical variants from noisy text. In Proceedings of the First workshop on Unsupervised Learning in NLP, pages 82–90, Edinburgh, Scotland, UK.
Bo Han and Timothy Baldwin. 2011. Lexical normalisation of short text messages: Makn sens a #twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), pages 368–378, Portland, Oregon, USA.
K. Jarvelin and J. Kekalainen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4).
Long Jiang, Mo Yu, Ming Zhou, Xiaohua Liu, and Tiejun Zhao. 2011. Target-dependent Twitter sentiment classification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), pages 151–160, Portland, Oregon, USA.
Joseph Kaufmann and Jugal Kalita. 2010. Syntactic normalization of Twitter messages. In International Conference on Natural Language Processing, Kharagpur, India.
S. Kullback and R. A. Leibler. 1951. On information and sufficiency. Annals of Mathematical Statistics, 22:49–86.
John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 282–289, San Francisco, CA, USA.
Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10:707–710.
Mu Li, Yang Zhang, Muhua Zhu, and Ming Zhou. 2006. Exploring distributional similarity based models for query spelling correction. In Proceedings of COLING/ACL 2006, pages 1025–1032, Sydney, Australia.
Jianhua Lin. 1991. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151.
Dekang Lin. 1998. Automatic retrieval and clustering of similar words. In Proceedings of the 36th Annual Meeting of the ACL and 17th International Conference on Computational Linguistics (COLING/ACL-98), pages 768–774, Montreal, Quebec, Canada.
Fei Liu, Fuliang Weng, Bingqing Wang, and Yang Liu. 2011a. Insertion, deletion, or substitution? normalizing text messages without pre-categorization nor supervision. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), pages 71–76, Portland, Oregon, USA.
Xiaohua Liu, Shaodian Zhang, Furu Wei, and Ming Zhou. 2011b. Recognizing named entities in tweets. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), pages 359–367, Portland, Oregon, USA.
Fei Liu, Fuliang Weng, and Xiao Jiang. 2012. A broad-coverage normalization system for social media language. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), Jeju, Republic of Korea.
Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. 2002. Text classification using string kernels. J. Mach. Learn. Res., 2:419–444.
Marco Lui and Timothy Baldwin. 2011. Cross-domain feature selection for language identification. In Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP 2011), pages 553–561, Chiang Mai, Thailand.
Brendan O’Connor, Michel Krieger, and David Ahn. 2010. TweetMotif: Exploratory search and topic summarization for Twitter. In Proceedings of the 4th International Conference on Weblogs and Social Media (ICWSM 2010), pages 384–385, Washington, USA.
Lawrence Philips. 2000. The double metaphone search algorithm. C/C++ Users Journal, 18:38–43.
Lawrence R. Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286.
Alan Ritter, Colin Cherry, and Bill Dolan. 2010. Unsupervised modeling of Twitter conversations. In Proceedings of Human Language Technologies: The 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2010), pages 172–180, Los Angeles, USA.
Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named entity recognition in tweets: An experimental study. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP 2011), pages 1524–1534, Edinburgh, Scotland, UK.
Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. 2010. Earthquake shakes Twitter users: real-time event detection by social sensors. In Proceedings of the 19th International Conference on the World Wide Web (WWW 2010), pages 851–860, Raleigh, North Carolina, USA.
Crispin Thurlow. 2003. Generation txt? The sociolinguistics of young people’s text-messaging. Discourse Analysis Online, 1(1).
Kristina Toutanova and Robert C. Moore. 2002. Pronunciation modeling for improved spelling correction. In Proceedings of the 40th Annual Meeting of the ACL and 3rd Annual Meeting of the NAACL (ACL-02), pages 144–151, Philadelphia, USA.
Official Blog Twitter. 2011. 200 million tweets per day. Retrieved at August 17th, 2011.
Jianshu Weng and Bu-Sung Lee. 2011. Event detection in Twitter. In Proceedings of the 5th International Conference on Weblogs and Social Media (ICWSM 2011), Barcelona, Spain.
Zhenzhen Xue, Dawei Yin, and Brian D. Davison. 2011. Normalizing microtext. In Proceedings of the AAAI-11 Workshop on Analyzing Microtext, pages 74–79, San Francisco, USA.
Alexander Yeh. 2000. More accurate tests for the statistical significance of result differences. In Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000), pages 947–953, Saarbrücken, Germany.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Microblog normalisation methods often utilise complex models and struggle to differentiate between correctly-spelled unknown words and lexical variants of known words. [sent-7, score-1.016]
2 In this paper, we propose a method for constructing a dictionary of lexical variants of known words that facilitates lexical normalisation via simple string substitution (e. [sent-8, score-1.298]
3 We use context information to generate possible variant and normalisation pairs and then rank these by string similarity. [sent-11, score-0.982]
4 Compared with other methods, this approach offers a fast, lightweight and easy-to-use solution, and is thus suitable for high-volume microblog pre-processing. [sent-14, score-0.154]
5 An alternative approach is to pre-normalise non-standard lexical variants to their standard orthography (Liu et al. [sent-30, score-0.242]
6 The normalisation approach is especially attractive as a preprocessing step for applications which rely on keyword match or word frequency statistics. [sent-38, score-0.795]
7 In this paper, we focus on the task of lexical normalisation of English Twitter messages, in which out-of-vocabulary (OOV) tokens are normalised to their in-vocabulary (IV) standard form, i. [sent-40, score-1.005]
8 Following other recent work on lexical normalisation (Liu et al. [sent-43, score-0.889]
9 , 2012), we specifically focus on one-to-one normalisation in which one OOV token is normalised to one IV word. [sent-46, score-0.881]
10 Naturally, not all OOV words in microblogs are lexical variants of IV words: named entities, e. [sent-47, score-0.263]
11 One challenge for lexical normalisation is therefore to dis[tinguish] (sentence interrupted by the page footer: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 421–432, Jeju Island, Korea, 12–14 July 2012. © 2012 Association for Computational Linguistics) [sent-50, score-0.889]
12 [dis]tinguish those OOV tokens that require normalisation from those that are well-formed. [sent-52, score-0.828]
13 , 2011) have followed a cascaded approach in which lexical variants are first identified, and then normalised. [sent-56, score-0.221]
14 However, such two-step approaches suffer from poor lexical variant identification performance, which is propagated to the normalisation step. [sent-57, score-0.964]
15 Despite the simplicity of this dictionary-based normalisation method, we show it to outperform previously-proposed approaches. [sent-59, score-0.795]
16 This very fast, lightweight solution is suitable for real-time processing of the large volume of streaming microblog data available from Twitter, and offers a simple solution to the lexical variant detection problem that hinders other normalisation methods. [sent-60, score-1.151]
17 Furthermore, this dictionary-based method can be easily integrated with other more-complex normalisation approaches (Liu et al. [sent-61, score-0.795]
18 After discussing related work in Section 2, we present an overview of our dictionary-based approach to normalisation in Section 3. [sent-64, score-0.814]
19 In Sections 4 and 5 we experimentally select the optimised context similarity parameters and string similarity reranking method. [sent-65, score-0.151]
20 2 Related Work Given a token t, lexical normalisation is the task of finding $\arg\max_s P(s \mid t) \propto \arg\max_s P(t \mid s) P(s)$, where s is the standard form, i. [sent-67, score-0.913]
21 , tmrw “tomorrow”) should be normalised and tokens that are OOV but otherwise not lexical variants (e. [sent-75, score-0.381]
22 Most work in this area focuses only on the normalisation task itself, oftentimes assuming that the task of lexical variant detection has already been completed. [sent-78, score-0.997]
23 (2011) propose multiple simple error models, each of which captures a particular way in which lexical variants are formed, such as phonetic spelling (e. [sent-87, score-0.294]
24 Without pre-categorising lexical variants into different types, Liu et al. [sent-93, score-0.221]
25 (2012) endeavour to improve the accuracy of top-n normalisation candidates by integrating human cognitive inference, character-level transformations and spell checking in their normalisation model. [sent-102, score-1.59]
26 The encouraging results shift the focus to reranking and promoting the correct normalisation to the top-1 position. [sent-103, score-0.795]
27 However, like much previous work on lexical normalisation, this work assumes perfect lexical variant detection. [sent-104, score-0.263]
28 (2006) and Kaufmann and Kalita (2010) consider normalisation as a machine translation task from lexical variants to standard forms using off-the-shelf tools. [sent-106, score-1.066]
29 These methods do not assume that lexical variants have been pre-identified; however, these methods do rely on large quantities of labelled training data, which is not available for microblogs. [sent-107, score-0.221]
30 (2011) propose two-step unsupervised approaches to normalisation, in which lexical variants are first identified, and then normalised. [sent-109, score-0.221]
31 They approach lexical variant detection by using a context fitness classifier (Han and Baldwin, 2011) or through dictionary lookup (Gouws et al. [sent-110, score-0.337]
32 However, the lexical variant detection of both methods is rather unreliable, indicating the challenge of this aspect of normalisation. [sent-112, score-0.202]
33 Both of these approaches incorporate a relatively small normalisation dictionary to capture frequent lexical variants with high precision. [sent-113, score-1.129]
34 (2011) produce a small normalisation lexicon based on distributional similarity and string similarity (Lodhi et al. [sent-115, score-0.978]
35 Our method adopts a similar strategy using distributional/string similarity, but instead of constructing a small lexicon for preprocessing, we build a much wider-coverage normalisation dictionary and opt for a fully lexicon-based end-to-end normalisation approach. [sent-117, score-1.703]
36 In contrast to the normalisation dictionaries of Han and Baldwin (2011) and Gouws et al. [sent-118, score-0.875]
37 (2011) which focus on very frequent lexical variants, we focus on moderate frequency lexical variants of a minimum character length, which tend to have unambiguous standard forms; our intention is to produce normalisation lexicons that are complementary to those currently available. [sent-119, score-1.176]
38 In summary, our dictionary-based normalisation approach is a lightweight end-to-end method which performs both lexical variant detection and normalisation, and thus is suitable for practical online preprocessing, despite its simplicity. [sent-121, score-1.043]
39 3 A Lexical Normalisation Dictionary Before discussing our method for creating a normalisation dictionary, we first discuss the feasibility of such an approach. [sent-122, score-0.814]
40 1 Feasibility Dictionary lookup approaches to normalisation have been shown to have high precision but low recall (Han and Baldwin, 2011; Gouws et al. [sent-124, score-0.817]
41 Because of the degree of lexical creativity and large number of non-standard forms observed on Twitter, a wide-coverage normalisation dictionary would be expensive to construct manually. [sent-127, score-1.053]
42 Based on the assumption that lexical variants occur in similar contexts to their standard forms, however, it should be possible to automatically construct a normalisation dictionary with wider coverage than is currently available. [sent-128, score-1.15]
43 Nevertheless, the relative occurrence of ambiguous lexical variants is small (Liu et al. [sent-139, score-0.221]
44 , 2011a), and it has been observed that while shorter variants such as y are often ambiguous, longer variants tend to be unambiguous. [sent-140, score-0.275]
45 Therefore, the normalisation lexicons we produce will only contain entries for OOVs with character length greater than a specified threshold, which are likely to have an unambiguous standard form. [sent-142, score-0.861]
46 2 Overview of approach Our method for constructing a normalisation dictionary is as follows: Input: Tokenised English tweets 1. [sent-144, score-0.845]
47 Output: A list of (OOV, IV) pairs ordered by string similarity; select the top-n pairs for inclusion in the normalisation lexicon. [sent-148, score-0.944]
48 Nevertheless, lexical variants are typically formed from their standard forms through regular processes (Thurlow, 2003) e. [sent-152, score-0.271]
49 The top-n items in this re-ranked list then form the normalisation lexicon, which is based only on development data. [sent-156, score-0.795]
50 Nevertheless, many lexical variants are sufficiently frequent to be able to compute distributional similarity, and can potentially make their way into our normalisation lexicon. [sent-160, score-1.048]
51 This approach is not suitable for normalising low-frequency lexical variants, nor is it suitable for shorter lexical variant types which as discussed in Section 3. [sent-161, score-0.319]
52 Nevertheless, previously-proposed normalisation methods that can handle such phenomena also rely in part on a normalisation lexicon. [sent-163, score-1.59]
53 The normalisation lexicons we create can therefore be easily integrated with previous approaches to form hybrid normalisation systems. [sent-164, score-1.59]
54 Fundamentally, the surrounding words define the primary context, but there are different ways of representing context and different similarity measures we can use, which may influence the quality of generated normalisation pairs. [sent-166, score-0.833]
55 06)2 to determine whether a word is IV, and only include in our normalisation dictionary OOV tokens with at least 64 occurrences in the corpus and character length ≥ 4, both of which were determined through empirical observation. [sent-180, score-0.965]
56 net / l 3 We only consider IV words within an edit distance of 2 or a phonemic edit distance of 1 from the OOV type, and we further In order to evaluate the generated pairs, we randomly selected 1000 OOV words from the 10 million tweet corpus. [sent-186, score-0.228]
57 In total, this resulted in 351 lexical variants and 282 correct OOV words, accounting for 63. [sent-191, score-0.221]
58 5 Contextually-similar pair generation aims to include as many correct normalisation pairs as possible. [sent-195, score-0.832]
59 In this evaluation we don’t consider the position of normalisation pairs, and nor do we penalise incorrect pairs. [sent-197, score-0.817]
60 Instead, we push distinguishing between correct and incorrect pairs into the downstream re-ranking step in which we incorporate string similarity information. [sent-198, score-0.172]
61 Instead of using the small 10 million tweet corpus, from this step onwards, we use a larger corpus of 80 million English tweets (collected over the same period as the development corpus) to develop a larger-scale normalisation dictionary. [sent-213, score-0.845]
62 We also compared a variety of re-rankings based on a number of string similarity measures that have been previously considered in normalisation work (reviewed in Section 2). [sent-251, score-0.908]
63 Ranking by OOV frequency is motivated by the assumption that lexical variants are frequently used by social media users. [sent-256, score-0.289]
64 However, many proper nouns and named entities are also used frequently and ranked at the top, mixed with lexical variants like (Facebook, speech) and (Youtube, web). [sent-258, score-0.221]
65 In ranking by IV word frequency, we assume the lexical variants are usually derived from frequently-used IV equivalents, e. [sent-259, score-0.242]
66 However, many less-frequent lexical variant types have high-frequency (IV) normalisations. [sent-262, score-0.169]
67 It successfully ranks many more intuitive normalisation pairs at the top, such as (2day, today) and (wknd, weekend), but also ranks some incorrect pairs highly, such as (needa, gotta). [sent-266, score-0.891]
68 Through manual analysis, we found that standard edit distance ranking is fairly accurate for lexical variants with low edit distance to their standard forms, but fails to identify heavily-altered variants like (tmrw, tomorrow). [sent-268, score-0.639]
69 Consonant edit distance is similar to standard edit distance, but places many longer words at the top of the ranking. [sent-269, score-0.229]
70 Edit distance over double metaphone codes (phonetic edit distance) performs particularly well for lexical variants that include character repetitions commonly used for emphasis on Twitter because such repetitions do not typically alter the phonetic codes. [sent-270, score-0.49]
71 6 Evaluation Given the re-ranked pairs from Section 5, here we apply them to a token-level normalisation task using the normalisation dataset of Han and Baldwin (2011). [sent-278, score-1.627]
72 FA measures the negative effects of applying normalisation; a good approach to normalisation should not (incorrectly) normalise tokens that are already in their standard form and do not require normalisation. [sent-282, score-0.871]
73 In general, dictionaries with a high F-score/low WER and low FA (footnote 6: FA + P ≤ 1 because some lexical variants might be incorrectly normalised.) [sent-284, score-0.301]
74 $P = \frac{\#\,\text{correctly normalised tokens}}{\#\,\text{normalised tokens}}$, $R = \frac{\#\,\text{correctly normalised tokens}}{\#\,\text{tokens requiring normalisation}}$, $F = \frac{2PR}{P + R}$, $FA = \frac{\#\,\text{incorrectly normalised tokens}}{\#\,\text{normalised tokens}}$, $WER = \frac{\#\,\text{token edits needed after normalisation}}{\#\,\text{all tokens}}$ 6. [sent-286, score-0.852]
75 Namely, they are string subsequence kernel (S-dict, N=40,000), double metaphone edit distance (DM-dict, N=10,000) and default contextual similarity without re-ranking (C-dict, N=10,000). [sent-288, score-0.359]
76 We also compare each dictionary with the performance of the manually-constructed Internet slang dictionary (HB-dict) used by Han and Baldwin (2011), the small automatically-derived dictionary of Gouws et al. [sent-290, score-0.339]
77 In addition, the contribution of these dictionaries in hybrid normalisation approaches is also presented, in which we first normalise OOVs using a given dictionary (combined or otherwise), and then apply the normalisation method of Gouws et al. [sent-292, score-1.805]
78 Even though C-dict delivers higher recall, indicating that many lexical variants are correctly normalised, this is offset by its high false alarm rate, which is particularly undesirable in normalisation. [sent-302, score-0.414]
79 Additionally, the recall of S-dict is hampered by the restriction on lexical variant token length of 4 characters. [sent-308, score-0.193]
80 In such cases we use the following preferences for dictionaries, motivated by our confidence in the normalisation pairs of the dictionaries, to resolve conflicts: HB-dict > GHM-dict > S-dict. [sent-313, score-0.992]
81 When we combine dictionaries in the second section of Table 3, we find that they contain complementary information: in each case the recall and F-score are higher for the combined dictionary than any of the individual dictionaries. [sent-314, score-0.193]
82 The combination of HB-dict+GHM-dict produces only a small improvement in terms of F-score over HB-dict (the better-performing dictionary), suggesting that, as claimed, HB-dict and GHM-dict share many frequent normalisation pairs. [sent-315, score-0.795]
83 079 Table 3: Normalisation results using our derived dictionaries (contextual similarity (C-dict); double metaphone rendering (DM-dict); string subsequence kernel scores (S-dict)), the dictionary of Gouws et al. [sent-381, score-0.438]
84 In addition, we combine the dictionaries with the normalisation method of Gouws et al. [sent-383, score-0.875]
85 HB-dict+HB-norm) have lower precision and higher false alarm rates than the dictionary-based approaches; this is largely caused by lexical variant detection errors. [sent-396, score-0.312]
86 Using all dictionaries in combination with these methods (HB-dict+GHM-dict+S-dict+GHM-norm and HB-dict+GHM-dict+S-dict+HB-norm) gives some improvements, but the false alarm rates remain high. [sent-397, score-0.19]
87 Despite the limitations of a pure dictionary-based approach to normalisation discussed in Section 3. [sent-398, score-0.795]
88 The most frequent word errors are caused by slight morphological variations, including plural forms (a), negations (b), possessive cases (c), and OOVs that are correct and do not require normalisation (d). [sent-404, score-0.824]
89 In addition, we also notice some missing annotations where lexical variants are skipped by human annotations but captured by our method (e). [sent-405, score-0.221]
90 Ambiguity (f) definitely exists in longer OOVs; however, these cases do not appear to have a strong negative impact on the normalisation performance. [sent-406, score-0.816]
91 122 Table 5: S-dict normalisation results broken down according to OOV token length. [sent-427, score-0.819]
92 To further study the influence of OOV word length relative to the normalisation performance, we conduct a fine-grained analysis of the performance of the derived dictionary (S-dict) in Table 5, broken down across different OOV word lengths. [sent-430, score-0.908]
93 The derived dictionary is much more reliable for longer tokens (length 5, 6, and 7 characters) in terms of precision and false alarm. [sent-432, score-0.202]
94 Although the recall is relatively modest, in the future we intend to improve recall by mining more normalisation pairs from larger collections of microblog data. [sent-433, score-0.94]
95 7 Conclusions and Future Work In this paper, we describe a method for automatically constructing a normalisation dictionary that supports normalisation of microblog text through direct substitution of lexical variants with their standard forms. [sent-434, score-2.053]
96 Furthermore, this dictionary-lookup method combines the detection and normalisation of lexical variants into a simple, lightweight solution which is suitable for processing of high-volume microblog feeds. [sent-436, score-1.203]
97 In the future, we intend to improve our dictionary by leveraging the constantly-growing volume of microblog data, and considering alternative ways to combine distributional and string similarity. [sent-437, score-0.328]
98 In addition to direct evaluation, we also want to explore the benefits of applying normalisation for downstream social media text processing applications, e. [sent-438, score-0.863]
99 Unsupervised mining of lexical variants from noisy text. [sent-489, score-0.221]
100 Lexical normalisation of short text messages: Makn sens a #twitter. [sent-493, score-0.795]
wordName wordTfidf (topN-words)
[('normalisation', 0.795), ('oov', 0.275), ('gouws', 0.183), ('iv', 0.138), ('variants', 0.127), ('dictionary', 0.113), ('microblog', 0.108), ('han', 0.103), ('baldwin', 0.1), ('lexical', 0.094), ('twitter', 0.093), ('dictionaries', 0.08), ('alarm', 0.075), ('tomorrow', 0.075), ('string', 0.075), ('variant', 0.075), ('oovs', 0.074), ('edit', 0.073), ('tmrw', 0.065), ('normalised', 0.062), ('metaphone', 0.054), ('cg', 0.05), ('wer', 0.05), ('tweets', 0.05), ('dcg', 0.046), ('youtube', 0.046), ('divergence', 0.046), ('weng', 0.043), ('microblogs', 0.042), ('distance', 0.041), ('similarity', 0.038), ('cook', 0.037), ('pairs', 0.037), ('ritter', 0.036), ('media', 0.036), ('liu', 0.036), ('false', 0.035), ('consonant', 0.033), ('tokens', 0.033), ('detection', 0.033), ('hbdict', 0.032), ('ipad', 0.032), ('lodhi', 0.032), ('tusday', 0.032), ('social', 0.032), ('distributional', 0.032), ('subsequence', 0.03), ('oregon', 0.03), ('forms', 0.029), ('spelling', 0.029), ('double', 0.029), ('kullback', 0.028), ('sunday', 0.028), ('lightweight', 0.027), ('portland', 0.026), ('js', 0.025), ('phonetic', 0.025), ('character', 0.024), ('token', 0.024), ('codes', 0.023), ('kl', 0.023), ('fa', 0.022), ('messages', 0.022), ('incorrect', 0.022), ('lookup', 0.022), ('birthday', 0.022), ('contractor', 0.022), ('creativity', 0.022), ('durin', 0.022), ('ez', 0.022), ('fuliang', 0.022), ('gonz', 0.022), ('ivs', 0.022), ('microtext', 0.022), ('normalise', 0.022), ('oovi', 0.022), ('thinkin', 0.022), ('standard', 0.021), ('ranking', 0.021), ('longer', 0.021), ('delivers', 0.021), ('unambiguous', 0.021), ('connor', 0.02), ('kernel', 0.019), ('suitable', 0.019), ('technologies', 0.019), ('error', 0.019), ('discussing', 0.019), ('got', 0.019), ('cumulative', 0.019), ('gimpel', 0.019), ('timothy', 0.019), ('yes', 0.019), ('choudhury', 0.018), ('benson', 0.018), ('jarvelin', 0.018), ('leibler', 0.018), ('lui', 0.018), ('nicta', 0.018), ('normalising', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 22 emnlp-2012-Automatically Constructing a Normalisation Dictionary for Microblogs
Author: Bo Han ; Paul Cook ; Timothy Baldwin
Abstract: Microblog normalisation methods often utilise complex models and struggle to differentiate between correctly-spelled unknown words and lexical variants of known words. In this paper, we propose a method for constructing a dictionary of lexical variants of known words that facilitates lexical normalisation via simple string substitution (e.g. tomorrow for tmrw). We use context information to generate possible variant and normalisation pairs and then rank these by string similarity. Highly-ranked pairs are selected to populate the dictionary. We show that a dictionary-based approach achieves state-of-the-art performance for both F-score and word error rate on a standard dataset. Compared with other methods, this approach offers a fast, lightweight and easy-to-use solution, and is thus suitable for high-volume microblog pre-processing.
1 Lexical Normalisation
A staggering number of short text “microblog” messages are produced every day through social media such as Twitter (Twitter, 2011). The immense volume of real-time, user-generated microblogs that flows through sites has been shown to have utility in applications such as disaster detection (Sakaki et al., 2010), sentiment analysis (Jiang et al., 2011; González-Ibáñez et al., 2011), and event discovery (Weng and Lee, 2011; Benson et al., 2011). However, due to the spontaneous nature of the posts, microblogs are notoriously noisy, containing many non-standard forms (e.g., tmrw “tomorrow” and 2day “today”) which degrade the performance of natural language processing (NLP) tools (Ritter et al., 2010; Han and Baldwin, 2011). To reduce this effect, attempts have been made to adapt NLP tools to microblog data (Gimpel et al., 2011; Foster et al., 2011; Liu et al., 2011b; Ritter et al., 2011). An alternative approach is to pre-normalise non-standard lexical variants to their standard orthography (Liu et al., 2011a; Han and Baldwin, 2011; Xue et al., 2011; Gouws et al., 2011). For example, se u 2morw!!!
would be normalised to see you tomorrow! The normalisation approach is especially attractive as a preprocessing step for applications which rely on keyword match or word frequency statistics. For example, earthqu, eathquake, and earthquakeee (all attested in a Twitter corpus) have the standard form earthquake; by normalising these types to their standard form, better coverage can be achieved for keyword-based methods, and better word frequency estimates can be obtained. In this paper, we focus on the task of lexical normalisation of English Twitter messages, in which out-of-vocabulary (OOV) tokens are normalised to their in-vocabulary (IV) standard form, i.e., a standard form that is in a dictionary. Following other recent work on lexical normalisation (Liu et al., 2011a; Han and Baldwin, 2011; Gouws et al., 2011; Liu et al., 2012), we specifically focus on one-to-one normalisation in which one OOV token is normalised to one IV word. Naturally, not all OOV words in microblogs are lexical variants of IV words: named entities are prevalent in microblogs, but not all named entities are included in our dictionary. One challenge for lexical normalisation is therefore to distinguish those OOV tokens that require normalisation from those that are well-formed.
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 421–432, Jeju Island, Korea, 12–14 July 2012. © 2012 Association for Computational Linguistics
Recent unsupervised approaches have not attempted to distinguish such tokens from other types of OOV tokens (Cook and Stevenson, 2009; Liu et al., 2011a), limiting their applicability to real-world normalisation tasks. Other approaches (Han and Baldwin, 2011; Gouws et al., 2011) have followed a cascaded approach in which lexical variants are first identified, and then normalised.
However, such two-step approaches suffer from poor lexical variant identification performance, which is propagated to the normalisation step. Motivated by the observation that most lexical variants have an unambiguous standard form (especially for longer tokens), and that a lexical variant and its standard form typically occur in similar contexts, in this paper we propose methods for automatically constructing a lexical normalisation dictionary, i.e., a dictionary whose entries consist of (lexical variant, standard form) pairs, that enables type-based normalisation. Despite the simplicity of this dictionary-based normalisation method, we show it to outperform previously-proposed approaches. This very fast, lightweight solution is suitable for real-time processing of the large volume of streaming microblog data available from Twitter, and offers a simple solution to the lexical variant detection problem that hinders other normalisation methods. Furthermore, this dictionary-based method can be easily integrated with other more-complex normalisation approaches (Liu et al., 2011a; Han and Baldwin, 2011; Gouws et al., 2011) to produce hybrid systems. After discussing related work in Section 2, we present an overview of our dictionary-based approach to normalisation in Section 3. In Sections 4 and 5 we experimentally select the optimised context similarity parameters and string similarity reranking method. We present experimental results on the unseen test data in Section 6, and offer some concluding remarks in Section 7.
2 Related Work
Given a token $t$, lexical normalisation is the task of finding $\arg\max_s P(s|t) \propto \arg\max_s P(t|s)P(s)$, where $s$ is the standard form, i.e., an IV word. Standardly in lexical normalisation, $t$ is assumed to be an OOV token, relative to a fixed dictionary.
In practice, not all OOV tokens should be normalised; i.e., only lexical variants (e.g., tmrw “tomorrow”) should be normalised and tokens that are OOV but otherwise not lexical variants (e.g., iPad “iPad”) should be unchanged. Most work in this area focuses only on the normalisation task itself, oftentimes assuming that the task of lexical variant detection has already been completed. Various approaches have been proposed to estimate the error model, $P(t|s)$. For example, in work on spell-checking, Brill and Moore (2000) improve on a standard edit-distance approach by considering multi-character edit operations; Toutanova and Moore (2002) build on this by incorporating phonological information. Li et al. (2006) utilise distributional similarity (Lin, 1998) to correct misspelled search queries. In text message normalisation, Choudhury et al. (2007) model the letter transformations and emissions using a hidden Markov model (Rabiner, 1989). Cook and Stevenson (2009) and Xue et al. (2011) propose multiple simple error models, each of which captures a particular way in which lexical variants are formed, such as phonetic spelling (e.g., epik “epic”) or clipping (e.g., walkin “walking”). Nevertheless, optimally weighting the various error models in these approaches is challenging. Without pre-categorising lexical variants into different types, Liu et al. (2011a) collect Google search snippets from carefully-designed queries from which they then extract noisy lexical variant–standard form pairs. These pairs are used to train a conditional random field (Lafferty et al., 2001) to estimate $P(t|s)$ at the character level. One shortcoming of querying a search engine to obtain training pairs is that it tends to be costly in terms of time and bandwidth. Here we exploit microblog data directly to derive (lexical variant, standard form) pairs, instead of relying on external resources. In more recent work, Liu et al.
(2012) endeavour to improve the accuracy of top-n normalisation candidates by integrating human cognitive inference, character-level transformations and spell checking in their normalisation model. The encouraging results shift the focus to reranking and promoting the correct normalisation to the top-1 position. However, like much previous work on lexical normalisation, this work assumes perfect lexical variant detection. Aw et al. (2006) and Kaufmann and Kalita (2010) consider normalisation as a machine translation task from lexical variants to standard forms using off-the-shelf tools. These methods do not assume that lexical variants have been pre-identified; however, these methods do rely on large quantities of labelled training data, which is not available for microblogs. Recently, Han and Baldwin (2011) and Gouws et al. (2011) propose two-step unsupervised approaches to normalisation, in which lexical variants are first identified, and then normalised. They approach lexical variant detection by using a context fitness classifier (Han and Baldwin, 2011) or through dictionary lookup (Gouws et al., 2011). However, the lexical variant detection of both methods is rather unreliable, indicating the challenge of this aspect of normalisation. Both of these approaches incorporate a relatively small normalisation dictionary to capture frequent lexical variants with high precision. In particular, Gouws et al. (2011) produce a small normalisation lexicon based on distributional similarity and string similarity (Lodhi et al., 2002). Our method adopts a similar strategy using distributional/string similarity, but instead of constructing a small lexicon for preprocessing, we build a much wider-coverage normalisation dictionary and opt for a fully lexicon-based end-to-end normalisation approach. In contrast to the normalisation dictionaries of Han and Baldwin (2011) and Gouws et al.
(2011), which focus on very frequent lexical variants, we focus on moderate frequency lexical variants of a minimum character length, which tend to have unambiguous standard forms; our intention is to produce normalisation lexicons that are complementary to those currently available. Furthermore, we investigate the impact of a variety of contextual and string similarity measures on the quality of the resulting lexicons. In summary, our dictionary-based normalisation approach is a lightweight end-to-end method which performs both lexical variant detection and normalisation, and thus is suitable for practical online preprocessing, despite its simplicity.
3 A Lexical Normalisation Dictionary
Before discussing our method for creating a normalisation dictionary, we first discuss the feasibility of such an approach.
3.1 Feasibility
Dictionary lookup approaches to normalisation have been shown to have high precision but low recall (Han and Baldwin, 2011; Gouws et al., 2011). Frequent (lexical variant, standard form) pairs such as (u, you) are typically included in the dictionaries used by such methods, while less-frequent items such as (g0tta, gotta) are generally omitted. Because of the degree of lexical creativity and large number of non-standard forms observed on Twitter, a wide-coverage normalisation dictionary would be expensive to construct manually. Based on the assumption that lexical variants occur in similar contexts to their standard forms, however, it should be possible to automatically construct a normalisation dictionary with wider coverage than is currently available. Dictionary lookup is a type-based approach to normalisation, i.e., every token instance of a given type will always be normalised in the same way. However, lexical variants can be ambiguous, e.g., y corresponds to “you” in yeah, y r right! LOL but “why” in AM CONFUSED!!! y you did that?
Nevertheless, the relative occurrence of ambiguous lexical variants is small (Liu et al., 2011a), and it has been observed that while shorter variants such as y are often ambiguous, longer variants tend to be unambiguous. For example, bthday and 4eva are unlikely to have standard forms other than “birthday” and “forever”, respectively. Therefore, the normalisation lexicons we produce will only contain entries for OOVs with character length greater than a specified threshold, which are likely to have an unambiguous standard form.
3.2 Overview of approach
Our method for constructing a normalisation dictionary is as follows:
Input: Tokenised English tweets
1. Extract (OOV, IV) pairs based on distributional similarity.
2. Re-rank the extracted pairs by string similarity.
Output: A list of (OOV, IV) pairs ordered by string similarity; select the top-n pairs for inclusion in the normalisation lexicon.
In Step 1, we leverage large volumes of Twitter data to identify the most distributionally-similar IV type for each OOV type. The result of this process is a set of (OOV, IV) pairs, ranked by distributional similarity. The extracted pairs will include (lexical variant, standard form) pairs, such as (tmrw, tomorrow), but will also contain false positives such as (Tusday, Sunday), where Tusday is a lexical variant but its standard form is not “Sunday”, and (Youtube, web), where Youtube is an OOV named entity, not a lexical variant. Nevertheless, lexical variants are typically formed from their standard forms through regular processes (Thurlow, 2003), e.g., the omission of characters, and from this perspective Sunday and web are not plausible standard forms for Tusday and Youtube, respectively. In Step 2, we therefore capture this intuition to re-rank the extracted pairs by string similarity. The top-n items in this re-ranked list then form the normalisation lexicon, which is based only on development data.
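The two steps in the overview above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `build_lexicon` and `normalise` are hypothetical names, the contextual similarity is passed in as a function (here a hand-made toy table), and `difflib.SequenceMatcher` is only a stand-in for the string similarity measures that the paper compares in Section 5.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, Iterable, List, Tuple


def build_lexicon(
    oov_types: Iterable[str],
    iv_types: List[str],
    context_sim: Callable[[str, str], float],
    string_sim: Callable[[str, str], float],
    top_n: int,
    min_len: int = 4,
) -> Dict[str, str]:
    """Sketch of the two-step construction (hypothetical helper)."""
    pairs: List[Tuple[str, str]] = []
    for oov in oov_types:
        if len(oov) < min_len:
            continue  # short OOVs are more likely to be ambiguous
        # Step 1: pick the most contextually similar IV type for this OOV.
        best_iv = max(iv_types, key=lambda iv: context_sim(oov, iv))
        pairs.append((oov, best_iv))
    # Step 2: re-rank the extracted pairs by string similarity.
    pairs.sort(key=lambda p: string_sim(p[0], p[1]), reverse=True)
    # Keep only the top-n pairs as the normalisation lexicon.
    return dict(pairs[:top_n])


def normalise(tokens: List[str], lexicon: Dict[str, str]) -> List[str]:
    """Type-based normalisation: plain dictionary substitution."""
    return [lexicon.get(tok, tok) for tok in tokens]


# Toy contextual similarity scores (invented for illustration only).
TOY_CONTEXT = {
    ("tmrw", "tomorrow"): 0.9, ("tmrw", "web"): 0.1,
    ("Youtube", "web"): 0.8, ("Youtube", "tomorrow"): 0.2,
}


def toy_context_sim(oov: str, iv: str) -> float:
    return TOY_CONTEXT.get((oov, iv), 0.0)


def toy_string_sim(a: str, b: str) -> float:
    # Stand-in measure; the paper uses other string similarity methods.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


lexicon = build_lexicon(["tmrw", "Youtube"], ["tomorrow", "web"],
                        toy_context_sim, toy_string_sim, top_n=1)
print(lexicon)
print(normalise(["see", "u", "tmrw", "on", "Youtube"], lexicon))
```

With the toy data, Step 1 proposes both (tmrw, tomorrow) and (Youtube, web); the string-based re-ranking pushes the implausible (Youtube, web) pair down, so a cut-off of one pair keeps only the genuine lexical variant.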
Although computationally-expensive to build, this dictionary can be created offline. Once built, it then offers a very fast approach to normalisation. We can only reliably compute distributional similarity for types that are moderately frequent in a corpus. Nevertheless, many lexical variants are sufficiently frequent to be able to compute distributional similarity, and can potentially make their way into our normalisation lexicon. This approach is not suitable for normalising low-frequency lexical variants, nor is it suitable for shorter lexical variant types which, as discussed in Section 3.1, are more likely to have an ambiguous standard form. Nevertheless, previously-proposed normalisation methods that can handle such phenomena also rely in part on a normalisation lexicon. The normalisation lexicons we create can therefore be easily integrated with previous approaches to form hybrid normalisation systems.
4 Contextually-similar Pair Generation
Our objective is to extract contextually-similar (OOV, IV) pairs from a large-scale collection of microblog data. Fundamentally, the surrounding words define the primary context, but there are different ways of representing context and different similarity measures we can use, which may influence the quality of generated normalisation pairs. In representing the context, we experimentally explore the following factors: (1) context window size (from 1 to 3 tokens on both sides); (2) n-gram order of the context tokens (unigram, bigram, trigram); (3) whether context words are indexed for relative position or not; and (4) whether we use all context tokens, or only IV words. Because high-accuracy linguistic processing tools for Twitter are still under exploration (Liu et al., 2011b; Gimpel et al., 2011; Ritter et al., 2011; Foster et al., 2011), we do not consider richer representations of context, for example, incorporating information about part-of-speech tags or syntax.
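To make the first three context factors concrete, the sketch below extracts context n-grams around a target token. The exact feature encoding is not given in the text, so everything here is an assumption for illustration: the function name `context_features`, the `offset:n-gram` string format for positional indexing, and the choice to drop n-grams that would cross a tweet boundary or overlap the target token.

```python
from typing import List


def context_features(
    tokens: List[str],
    index: int,
    window: int = 2,
    ngram: int = 2,
    positional: bool = True,
) -> List[str]:
    """Context n-grams lying wholly inside the left or right window of tokens[index].

    Assumed encoding: with positional indexing, each n-gram is prefixed by the
    signed offset of its first token relative to the target.
    """
    features = []
    # Candidate start positions: an n-gram must end no later than index + window.
    for start in range(index - window, index + window - ngram + 2):
        end = start + ngram  # exclusive end of the n-gram
        if start < 0 or end > len(tokens):
            continue  # n-gram would fall outside the tweet
        if start <= index < end:
            continue  # n-gram would contain the target token itself
        gram = " ".join(tokens[start:end])
        features.append(f"{start - index:+d}:{gram}" if positional else gram)
    return features


tweet = ["see", "you", "tmrw", "at", "noon"]
print(context_features(tweet, 2))                      # bigrams, window of 2
print(context_features(tweet, 2, window=1, ngram=1))   # unigrams, window of 1
```

Counting such features over every occurrence of a type gives the context distribution for that type; restricting the features to IV words (factor 4) would be an additional filter on `tokens` and is not shown.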
We also experiment with a number of simple but widely-used geometric and information theoretic distance/similarity measures. In particular, we use Kullback–Leibler (KL) divergence (Kullback and Leibler, 1951), Jensen–Shannon (JS) divergence (Lin, 1991), Euclidean distance and Cosine distance. We use a corpus of 10 million English tweets for parameter tuning, and a larger corpus of tweets in the final candidate ranking. All tweets were collected from September 2010 to January 2011 via the Twitter API.1 From the raw data we extract English tweets using a language identification tool (Lui and Baldwin, 2011), and then apply a simplified Twitter tokeniser (adapted from O’Connor et al. (2010)). We use the Aspell dictionary (v6.06)2 to determine whether a word is IV, and only include in our normalisation dictionary OOV tokens with at least 64 occurrences in the corpus and character length ≥ 4, both of which were determined through empirical observation. For each OOV word type in the corpus, we select the most similar IV type to form (OOV, IV) pairs. To further narrow the search space, we only consider IV words which are morphophonemically similar to the OOV type, following settings in Han and Baldwin (2011).3
1 https://dev.twitter.com/docs/streaming-api/methods
2 http://aspell.net/
3 We only consider IV words within an edit distance of 2 or a phonemic edit distance of 1 from the OOV type, and we further only consider the top 30% most-frequent of these IV words.
In order to evaluate the generated pairs, we randomly selected 1000 OOV words from the 10 million tweet corpus. We set up an annotation task on Amazon Mechanical Turk,4 presenting five independent annotators with each word type (with no context) and asking for corrections where appropriate. For instance, given tmrw, the annotators would likely identify it as a non-standard variant of “tomorrow”. For correct OOV words like iPad, on the other hand, we would expect them to leave the word unchanged. If 3 or more of the 5 annotators make the same suggestion (in the form of either a canonical spelling or leaving the word unchanged), we include this in our gold standard for evaluation. In total, this resulted in 351 lexical variants and 282 correct OOV words, accounting for 63.3% of the 1000 OOV words. These 633 OOV words were used as (OOV, IV) pairs for parameter tuning. The remainder of the 1000 OOV words were ignored on the grounds that there was not sufficient consensus amongst the annotators.5
4 https://www.mturk.com/mturk/welcome
5 Note that the objective of this annotation task is to identify lexical variants that have agreed-upon standard forms irrespective of context, as a special case of the more general task of lexical normalisation (where context may or may not play a significant role in the determination of the normalisation).
Contextually-similar pair generation aims to include as many correct normalisation pairs as possible. We evaluate the quality of the normalisation pairs using “Cumulative Gain” (CG):
$$CG = \sum_{i=1}^{N'} rel'_i$$
Suppose there are $N'$ correct generated pairs $(oov_i, iv_i)$, each of which is weighted by $rel'_i$, the frequency of $oov_i$ to indicate its relative importance; for example, (thinkin, thinking) has a higher weight than (g0tta, gotta) because thinkin is more frequent than g0tta in our corpus. In this evaluation we do not consider the position of normalisation pairs, nor do we penalise incorrect pairs. Instead, we push distinguishing between correct and incorrect pairs into the downstream re-ranking step in which we incorporate string similarity information. Given the development data and CG, we run an exhaustive search of parameter combinations over our development corpus. The five best parameter combinations are shown in Table 1. We notice the CG is almost identical for the top combinations.
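The two information-theoretic measures named in this section, KL and JS divergence, can be sketched as follows for context-feature counts. The formulas are the standard definitions; the additive smoothing (`alpha`) is an assumption introduced here so that KL divergence stays finite when a feature is unseen for one of the two types, since the text does not say how zero counts are handled. The feature strings and counts are invented toy data.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def to_distribution(counts: Counter, vocab: Iterable[str], alpha: float = 1e-3) -> Dict[str, float]:
    """Turn raw feature counts into a smoothed probability distribution.

    The additive smoothing constant is an assumption, not taken from the paper.
    """
    vocab = list(vocab)
    total = sum(counts.get(f, 0) for f in vocab) + alpha * len(vocab)
    return {f: (counts.get(f, 0) + alpha) / total for f in vocab}


def kl_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """KL(p || q) in bits; p and q must be defined over the same features."""
    return sum(p[f] * math.log2(p[f] / q[f]) for f in p if p[f] > 0)


def js_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Jensen-Shannon divergence: symmetric, and bounded by 1 when using log base 2."""
    m = {f: 0.5 * (p[f] + q[f]) for f in p}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy context counts (invented for illustration).
tmrw = Counter({"-2:see you": 3, "+1:at noon": 1})
tomorrow = Counter({"-2:see you": 5, "+1:at noon": 2, "+1:is monday": 1})
web = Counter({"-2:on the": 4, "+1:page for": 2})

vocab = sorted(set(tmrw) | set(tomorrow) | set(web))
p_tmrw, p_tomorrow, p_web = (to_distribution(c, vocab) for c in (tmrw, tomorrow, web))

# The lower the divergence, the more similar the contexts.
print(kl_divergence(p_tmrw, p_tomorrow) < kl_divergence(p_tmrw, p_web))
```

Under this sketch, selecting the most similar IV type for an OOV type amounts to taking the candidate with the lowest divergence from the OOV type's context distribution.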
As a context window size of 3 incurs a heavy processing and memory overhead over a size of 2, we use the 3rd-best parameter combination for subsequent experiments, namely: context window of ±2 tokens, token bigrams, positional index, and KL divergence as our distance measure. To better understand the sensitivity of the method to each parameter, we perform a post-hoc parameter analysis relative to a default setting (as underlined in Table 2), altering one parameter at a time. The results in Table 2 show that bigrams outperform other n-gram orders by a large margin (note that the evaluation is based on a log scale), and information-theoretic measures are superior to the geometric measures. Furthermore, it also indicates using the positional indexing better captures context. However, there is little to distinguish context modelling with just IV words or all tokens. Similarly, the context window size has relatively little impact on the overall performance, supporting our earlier observation from Table 1.

Rank | Window size | n-gram | Positional index? | Lex. choice | Sim/distance measure | log(CG)
1 | ±3 | 2 | Yes | All | KL divergence | 19.571
2 | ±3 | 2 | No | All | KL divergence | 19.562
3 | ±2 | 2 | Yes | All | KL divergence | 19.562
4 | ±3 | 2 | Yes | IVs | KL divergence | 19.561
5 | ±2 | 2 | Yes | IVs | JS divergence | 19.554
Table 1: The five best parameter combinations in the exhaustive search of parameter combinations

Parameter | Setting | log(CG)
Window size | ±1 | 19.325
Window size | ±2 | 19.327
Window size | ±3 | 19.328
n-gram | 1 | 19.328
n-gram | 2 | 19.571
n-gram | 3 | 19.324
Positional index? | Yes | 19.328
Positional index? | No | 19.263
Lexical choice | IVs | 19.335
Lexical choice | All | 19.328
Similarity/distance measure | KL divergence | 19.328
Similarity/distance measure | Euclidean | 19.227
Similarity/distance measure | JS divergence | 19.311
Similarity/distance measure | Cosine | 19.170
Table 2: Parameter sensitivity analysis measured as log(CG) for correctly-generated pairs. We tune one parameter at a time, using the default (underlined) setting for other parameters; the non-exhaustive best-performing setting in each case is indicated in bold.

5 Pair Re-ranking by String Similarity
Once the contextually-similar (OOV, IV) pairs are generated using the selected parameters in Section 4, we further re-rank this set of pairs in an attempt to boost morphophonemically-similar pairs like (bananaz, bananas), and penalise noisy pairs like (paninis, beans). Instead of using the small 10 million tweet corpus, from this step onwards, we use a larger corpus of 80 million English tweets (collected over the same period as the development corpus) to develop a larger-scale normalisation dictionary. This is because once pairs are generated, re-ranking based on string comparison is much faster. We only include in the dictionary OOV words with a token frequency > 15 to include more OOV types than in Section 4, and again apply a minimum length cutoff of 4 characters. To measure how well our re-ranking method promotes correct pairs and demotes incorrect pairs (including both OOV words that should not be normalised, e.g. (Youtube, web), and incorrect normalisations for lexical variants, e.g. (bcuz, cause)), we modify our evaluation metric from Section 4 to evaluate the ranking at different points, using Discounted Cumulative Gain (DCG@N: Jarvelin and Kekalainen (2002)):
$$DCG@N = rel_1 + \sum_{i=2}^{N} \frac{rel_i}{\log_2(i)}$$
where $rel_i$ again represents the frequency of the OOV, but it can be gain (a positive number) or loss (a negative number), depending on whether the $i$th pair is correct or incorrect. Because we also expect correct pairs to be ranked higher than incorrect pairs, DCG@N takes both factors into account.
Given the generated pairs and the evaluation metric, we first consider three baselines: no re-ranking (i.e., the final ranking is that of the contextual similarity scores), and re-rankings of the pairs based on the frequencies of the OOVs in the Twitter corpus, and the IV unigram frequencies in the Google Web 1T corpus (Brants and Franz, 2006) to get less-noisy frequency estimates. We also compared a variety of re-rankings based on a number of string similarity measures that have been previously considered in normalisation work (reviewed in Section 2). We experiment with standard edit distance (Levenshtein, 1966), edit distance over double metaphone codes (phonetic edit distance: (Philips, 2000)), longest common subsequence ratio over the consonant edit distance of the paired words (hereafter, denoted as consonant edit distance: (Contractor et al., 2010)), and a string subsequence kernel (Lodhi et al., 2002). In Figure 1, we present the DCG@N results for each of our ranking methods at different rank cutoffs. Ranking by OOV frequency is motivated by the assumption that lexical variants are frequently used by social media users. This is confirmed by our findings that lexical pairs like (goin, going) and (nite, night) are at the top of the ranking. However, many proper nouns and named entities are also used frequently, so pairs like (Facebook, speech) and (Youtube, web) are ranked at the top, mixed in with the lexical variants. In ranking by IV word frequency, we assume the lexical variants are usually derived from frequently-used IV equivalents, e.g. (abou, about). However, many less-frequent lexical variant types have high-frequency (IV) normalisations. For instance, the highest-frequency IV word the has more than 40 OOV lexical variants, such as tthe and thhe. These less-frequent types occupy the top positions, reducing the cumulative gain. Compared with these two baselines, ranking by default contextual similarity scores delivers promising results.
It successfully ranks many more intuitive normalisation pairs at the top, such as (2day, today) and (wknd, weekend), but also ranks some incorrect pairs highly, such as (needa, gotta). The string similarity-based methods perform better than our baselines in general. Through manual analysis, we found that standard edit distance ranking is fairly accurate for lexical variants with low edit distance to their standard forms, but fails to identify heavily-altered variants like (tmrw, tomorrow). Consonant edit distance is similar to standard edit distance, but places many longer words at the top of the ranking. Edit distance over double metaphone codes (phonetic edit distance) performs particularly well for lexical variants that include character repetitions (commonly used for emphasis on Twitter), because such repetitions do not typically alter the phonetic codes. Compared with the other methods, the string subsequence kernel delivers encouraging results. It measures common character subsequences of length n between (OOV, IV) pairs. Because it is computationally expensive to calculate similarity for larger n, we choose n=2, following Gouws et al. (2011). As N (the lexicon size cut-off) increases, the performance drops more slowly than the other methods. Although this method fails to rank heavily-altered variants such as (4get, forget) highly, it typically works well for longer words. Given that we focus on longer OOVs (specifically those of at least 4 characters), this ultimately isn’t a great handicap.
6 Evaluation
Given the re-ranked pairs from Section 5, here we apply them to a token-level normalisation task using the normalisation dataset of Han and Baldwin (2011).
6.1 Metrics
We evaluate using the standard evaluation metrics of precision (P), recall (R) and F-score (F) as detailed below. We also consider the false alarm rate (FA) and word error rate (WER), also as shown below.
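The observation about standard edit distance in Section 5 can be checked with a short Levenshtein implementation. This sketch covers only the standard edit distance baseline (unit-cost insertions, deletions and substitutions); the double metaphone, consonant edit distance and string subsequence kernel measures are not reproduced here, and the function name is chosen for illustration.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    # previous[j] holds the distance between source[:i-1] and target[:j].
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


# Pairs taken from the examples discussed in the text.
for oov, iv in [("bananaz", "bananas"), ("tmrw", "tomorrow"), ("4get", "forget")]:
    print(oov, iv, edit_distance(oov, iv))
```

A lightly altered variant such as (bananaz, bananas) is a single edit away from its standard form, whereas (tmrw, tomorrow) needs four insertions, which is why a ranking by raw edit distance places heavily altered variants low.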
FA measures the negative effects of applying normalisation; a good approach to normalisation should not (incorrectly) normalise tokens that are already in their standard form and do not require normalisation.6 WER, like F-score, shows the overall benefits of normalisation, but unlike F-score, measures how many token-level edits are required for the output to be the same as the ground truth data. In general, dictionaries with a high F-score/low WER and low FA are preferable.
6 FA + P ≤ 1 because some lexical variants might be incorrectly normalised.
$$P = \frac{\#\ \text{correctly normalised tokens}}{\#\ \text{normalised tokens}}$$
$$R = \frac{\#\ \text{correctly normalised tokens}}{\#\ \text{tokens requiring normalisation}}$$
$$F = \frac{2PR}{P + R}$$
$$FA = \frac{\#\ \text{incorrectly normalised tokens}}{\#\ \text{normalised tokens}}$$
$$WER = \frac{\#\ \text{token edits needed after normalisation}}{\#\ \text{all tokens}}$$
6.2 Results
We select the three best re-ranking methods, and best cut-off N for each method, based on the highest DCG@N value for a given method over the development data, as presented in Figure 1. Namely, they are string subsequence kernel (S-dict, N=40,000), double metaphone edit distance (DM-dict, N=10,000) and default contextual similarity without re-ranking (C-dict, N=10,000).7 We evaluate each of the learned dictionaries in Table 3. We also compare each dictionary with the performance of the manually-constructed Internet slang dictionary (HB-dict) used by Han and Baldwin (2011), the small automatically-derived dictionary of Gouws et al. (2011) (GHM-dict), and combinations of the different dictionaries. In addition, the contribution of these dictionaries in hybrid normalisation approaches is also presented, in which we first normalise OOVs using a given dictionary (combined or otherwise), and then apply the normalisation method of Gouws et al. (2011) based on consonant edit distance (GHM-norm), or the approach of Han and Baldwin (2011) based on the summation of many unsupervised approaches (HB-norm), to the remaining OOVs.
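The five metrics of Section 6.1 can be computed from three aligned token sequences: the input, the system output and the gold standard. The sketch below is one reading of the definitions, with assumptions stated plainly: a token counts as normalised when the output differs from the input; a false alarm is a normalised token whose input was already the gold form, which matches the footnote that FA + P ≤ 1; and, under one-to-one normalisation, the number of token edits still needed is the number of output tokens that differ from the gold tokens. The function name and the example tokens are invented.

```python
from typing import Dict, List


def evaluate(inputs: List[str], outputs: List[str], gold: List[str]) -> Dict[str, float]:
    """Precision, recall, F-score, false alarm rate and WER for aligned token lists."""
    triples = list(zip(inputs, outputs, gold))
    normalised = [(i, o, g) for i, o, g in triples if o != i]
    correct = sum(1 for _, o, g in normalised if o == g)
    required = sum(1 for i, _, g in triples if i != g)
    # Assumed reading of FA: the system changed a token that needed no change.
    false_alarms = sum(1 for i, _, g in normalised if i == g)
    edits_needed = sum(1 for _, o, g in triples if o != g)

    precision = correct / len(normalised) if normalised else 0.0
    recall = correct / required if required else 0.0
    denominator = precision + recall
    return {
        "P": precision,
        "R": recall,
        "F": 2 * precision * recall / denominator if denominator else 0.0,
        "FA": false_alarms / len(normalised) if normalised else 0.0,
        "WER": edits_needed / len(triples) if triples else 0.0,
    }


inputs = ["c", "u", "tmrw", "at", "iphone", "2nite"]
outputs = ["c", "you", "tomorrow", "at", "phone", "2nite"]
gold = ["see", "you", "tomorrow", "at", "iphone", "tonight"]
print(evaluate(inputs, outputs, gold))
```

In the toy example, three tokens are normalised, two of them correctly and one as a false alarm (iphone changed to phone), while two tokens that needed normalisation are left untouched; this gives P = 2/3, R = 1/2, FA = 1/3 and WER = 1/2.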
Results are shown in Table 3, and discussed below.
7 We also experimented with combining ranks using Mean Reciprocal Rank. However, the combined rank didn’t improve performance on the development data. We plan to explore other ranking aggregation methods in future work.
Figure 1: Re-ranking based on different string similarity methods (x-axis: N cut-offs).
6.2.1 Individual Dictionaries
Overall, the individual dictionaries derived by the re-ranking methods (DM-dict, S-dict) perform better than that based on contextual similarity (C-dict) in terms of precision and false alarm rate, indicating the importance of re-ranking. Even though C-dict delivers higher recall (indicating that many lexical variants are correctly normalised), this is offset by its high false alarm rate, which is particularly undesirable in normalisation. Because S-dict has better performance than DM-dict in terms of both F-score and WER, and a much lower false alarm rate than C-dict, subsequent results are presented using S-dict only. Both HB-dict and GHM-dict achieve better than 90% precision with moderate recall. Compared to these methods, S-dict is not competitive in terms of either precision or recall. This result seems rather discouraging. However, considering that S-dict is an automatically-constructed dictionary targeting lexical variants of varying frequency, it is not surprising that the precision is worse than that of HB-dict, which is manually-constructed, and GHM-dict, which includes entries only for more-frequent OOVs for which distributional similarity is more accurate. Additionally, the recall of S-dict is hampered by the restriction on lexical variant token length of 4 characters.
6.2.2 Combined Dictionaries
Next we look to combining HB-dict, GHM-dict and S-dict. In combining the dictionaries, a given OOV word can be listed with different standard forms in different dictionaries.
In such cases we use the following preferences for dictionaries, motivated by our confidence in the normalisation pairs of the dictionaries, to resolve conflicts: HB-dict > GHM-dict > S-dict.

When we combine dictionaries in the second section of Table 3, we find that they contain complementary information: in each case the recall and F-score are higher for the combined dictionary than any of the individual dictionaries. The combination of HB-dict+GHM-dict produces only a small improvement in terms of F-score over HB-dict (the better-performing dictionary), suggesting that, as claimed, HB-dict and GHM-dict share many frequent normalisation pairs. HB-dict+S-dict and GHM-dict+S-dict, on the other hand, improve substantially over HB-dict and GHM-dict, respectively, indicating that S-dict contains markedly different entries to both HB-dict and GHM-dict. The best F-score and WER are obtained using the combination of all three dictionaries, HB-dict+GHM-dict+S-dict. Furthermore, the difference between the results using HB-dict+GHM-dict+S-dict and HB-dict+GHM-dict is statistically significant (p < 0.01), based on the computationally-intensive Monte Carlo method of Yeh (2000), demonstrating the contribution of S-dict.

Method                              Precision  Recall  F-Score  False Alarm  Word Error Rate
C-dict                              0.474      0.218   0.299    0.298        0.103
DM-dict                             0.727      0.106   0.185    0.145        0.102
S-dict                              0.700      0.179   0.285    0.162        0.097
HB-dict                             0.915      0.435   0.590    0.048        0.066
GHM-dict                            0.982      0.319   0.482    0.000        0.076

HB-dict+S-dict                      0.840      0.601   0.701    0.090        0.052
GHM-dict+S-dict                     0.863      0.498   0.632    0.072        0.061
HB-dict+GHM-dict                    0.920      0.465   0.618    0.045        0.063
HB-dict+GHM-dict+S-dict             0.847      0.630   0.723    0.086        0.049

GHM-dict+GHM-norm                   0.338      0.578   0.427    0.458        0.135
HB-dict+GHM-dict+S-dict+GHM-norm    0.406      0.715   0.518    0.468        0.124
HB-dict+HB-norm                     0.515      0.771   0.618    0.332        0.081
HB-dict+GHM-dict+S-dict+HB-norm     0.527      0.789   0.632    0.332        0.079

Table 3: Normalisation results using our derived dictionaries (contextual similarity (C-dict); double metaphone rendering (DM-dict); string subsequence kernel scores (S-dict)), the dictionary of Gouws et al. (2011) (GHM-dict), the Internet slang dictionary (HB-dict) from Han and Baldwin (2011), and combinations of these dictionaries. In addition, we combine the dictionaries with the normalisation method of Gouws et al. (2011) (GHM-norm) and the combined unsupervised approach of Han and Baldwin (2011) (HB-norm).

6.2.3 Hybrid Approaches

The methods of Gouws et al. (2011) (i.e. GHM-dict+GHM-norm) and Han and Baldwin (2011) (i.e. HB-dict+HB-norm) have lower precision and higher false alarm rates than the dictionary-based approaches; this is largely caused by lexical variant detection errors.[8] Using all dictionaries in combination with these methods (HB-dict+GHM-dict+S-dict+GHM-norm and HB-dict+GHM-dict+S-dict+HB-norm) gives some improvements, but the false alarm rates remain high. Despite the limitations of a pure dictionary-based approach to normalisation, discussed in Section 3.1, the current best practical approach to normalisation is to use a lexicon, combining hand-built and automatically-learned normalisation dictionaries.

Footnote 8: Here we report results that do not assume perfect detection of lexical variants, unlike the original published results in each case.

Error type            OOV      Standard form (Dict.)  Standard form (Gold)
(a) plurals           playe    players                player
(b) negation          unlike   like                   dislike
(c) possessives       anyones  anyone                 anyone 's
(d) correct OOVs      iphone   phone                  iphone
(e) test data errors  durin    during                 durin
(f) ambiguity         siging   signing                singing

Table 4: Error types in the combined dictionary (HB-dict+GHM-dict+S-dict).

6.3 Discussion and Error Analysis

We first manually analyse the errors in the combined dictionary (HB-dict+GHM-dict+S-dict) and give examples of each error type in Table 4.
The most frequent word errors are caused by slight morphological variations, including plural forms (a), negations (b), possessive cases (c), and OOVs that are correct and do not require normalisation (d). In addition, we also notice some missing annotations where lexical variants are skipped by human annotations but captured by our method (e). Ambiguity (f) definitely exists in longer OOVs; however, these cases do not appear to have a strong negative impact on the normalisation performance. An example of a remaining miscellaneous error is bday “birthday”, which is mis-normalised as day.

Length cut-off (N)  #Variants  Precision  Recall (≥ N)  Recall (all)  False Alarm
≥4                  556        0.700      0.381         0.179         0.162
≥5                  382        0.814      0.471         0.152         0.122
≥6                  254        0.804      0.484         0.104         0.131
≥7                  138        0.793      0.471         0.055         0.122

Table 5: S-dict normalisation results broken down according to OOV token length. Recall is presented both over the subset of instances of length ≥ N in the data (“Recall (≥ N)”), and over the entirety of the dataset (“Recall (all)”); “#Variants” is the number of token instances of the indicated length in the test dataset.

To further study the influence of OOV word length relative to the normalisation performance, we conduct a fine-grained analysis of the performance of the derived dictionary (S-dict) in Table 5, broken down across different OOV word lengths. The results generally support our hypothesis that our method works better for longer OOV words. The derived dictionary is much more reliable for longer tokens (length 5, 6, and 7 characters) in terms of precision and false alarm. Although the recall is relatively modest, in the future we intend to improve recall by mining more normalisation pairs from larger collections of microblog data.
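The dictionary combination of Section 6.2.2 and the lookup-based normalisation evaluated above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the toy entries are invented for the example and are not taken from HB-dict, GHM-dict or S-dict, and case handling is ignored.

```python
def combine_dictionaries(*dictionaries):
    """Merge normalisation dictionaries listed in decreasing order of trust.

    When the same OOV appears in several dictionaries, the entry from the
    earliest (most trusted) dictionary is kept, mirroring the preference
    HB-dict > GHM-dict > S-dict.
    """
    combined = {}
    for dictionary in dictionaries:
        for variant, standard_form in dictionary.items():
            combined.setdefault(variant, standard_form)
    return combined


def normalise(tokens, dictionary):
    """Replace each token found in the dictionary; leave all others untouched."""
    return [dictionary.get(token, token) for token in tokens]


# Invented toy entries standing in for the three dictionaries.
hb_like = {"u": "you", "bday": "birthday"}
ghm_like = {"tmrw": "tomorrow"}
s_like = {"bday": "day", "4eva": "forever"}

lexicon = combine_dictionaries(hb_like, ghm_like, s_like)
print(normalise(["happy", "bday", "4eva", "iphone"], lexicon))
# ['happy', 'birthday', 'forever', 'iphone']
```

Because normalisation is a single hash lookup per token, detection and normalisation happen in one step: tokens absent from the lexicon, such as correct OOVs, are simply left as they are.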
7 Conclusions and Future Work

In this paper, we describe a method for automatically constructing a normalisation dictionary that supports normalisation of microblog text through direct substitution of lexical variants with their standard forms. After investigating the impact of different distributional and string similarity methods on the quality of the dictionary, we present experimental results on a standard dataset showing that our proposed methods acquire high quality (lexical variant, standard form) pairs, with reasonable coverage, and achieve state-of-the-art end-to-end lexical normalisation performance on a real-world token-level task. Furthermore, this dictionary-lookup method combines the detection and normalisation of lexical variants into a simple, lightweight solution which is suitable for processing of high-volume microblog feeds.

In the future, we intend to improve our dictionary by leveraging the constantly-growing volume of microblog data, and considering alternative ways to combine distributional and string similarity. In addition to direct evaluation, we also want to explore the benefits of applying normalisation for downstream social media text processing applications, e.g. event detection.

Acknowledgements

We would like to thank the three anonymous reviewers for their insightful comments, and Stephan Gouws for kindly sharing his data and discussing his work. NICTA is funded by the Australian government as represented by Department of Broadband, Communication and Digital Economy, and the Australian Research Council through the ICT Centre of Excellence programme.

References

AiTi Aw, Min Zhang, Juan Xiao, and Jian Su. 2006. A phrase-based statistical model for SMS text normalization. In Proceedings of COLING/ACL 2006, pages 33–40, Sydney, Australia.
Edward Benson, Aria Haghighi, and Regina Barzilay. 2011. Event discovery in social media feeds.
In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), pages 389–398, Portland, Oregon, USA.
Thorsten Brants and Alex Franz. 2006. Web 1T 5-gram Version 1.
Eric Brill and Robert C. Moore. 2000. An improved error model for noisy channel spelling correction. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 286–293, Hong Kong.
Monojit Choudhury, Rahul Saraf, Vijit Jain, Animesh Mukherjee, Sudeshna Sarkar, and Anupam Basu. 2007. Investigation and modeling of the structure of texting language. International Journal on Document Analysis and Recognition, 10:157–174.
Danish Contractor, Tanveer A. Faruquie, and L. Venkata Subramaniam. 2010. Unsupervised cleansing of noisy text. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), pages 189–196, Beijing, China.
Paul Cook and Suzanne Stevenson. 2009. An unsupervised model for text message normalization. In CALC ’09: Proceedings of the Workshop on Computational Approaches to Linguistic Creativity, pages 71–78, Boulder, USA.
Jennifer Foster, Özlem Çetinoğlu, Joachim Wagner, Joseph L. Roux, Stephen Hogan, Joakim Nivre, Deirdre Hogan, and Josef van Genabith. 2011. #hardtoparse: POS Tagging and Parsing the Twitterverse. In Analyzing Microtext: Papers from the 2011 AAAI Workshop, volume WS-11-05 of AAAI Workshops, pages 20–25, San Francisco, CA, USA.
Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. 2011. Part-of-speech tagging for Twitter: Annotation, features, and experiments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), pages 42–47, Portland, Oregon, USA.
Roberto González-Ibáñez, Smaranda Muresan, and Nina Wacholder. 2011.
Identifying sarcasm in Twitter: a closer look. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), pages 581–586, Portland, Oregon, USA.
Stephan Gouws, Dirk Hovy, and Donald Metzler. 2011. Unsupervised mining of lexical variants from noisy text. In Proceedings of the First workshop on Unsupervised Learning in NLP, pages 82–90, Edinburgh, Scotland, UK.
Bo Han and Timothy Baldwin. 2011. Lexical normalisation of short text messages: Makn sens a #twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), pages 368–378, Portland, Oregon, USA.
K. Jarvelin and J. Kekalainen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4).
Long Jiang, Mo Yu, Ming Zhou, Xiaohua Liu, and Tiejun Zhao. 2011. Target-dependent Twitter sentiment classification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), pages 151–160, Portland, Oregon, USA.
Joseph Kaufmann and Jugal Kalita. 2010. Syntactic normalization of Twitter messages. In International Conference on Natural Language Processing, Kharagpur, India.
S. Kullback and R. A. Leibler. 1951. On information and sufficiency. Annals of Mathematical Statistics, 22:49–86.
John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 282–289, San Francisco, CA, USA.
Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10:707–710.
Mu Li, Yang Zhang, Muhua Zhu, and Ming Zhou. 2006. Exploring distributional similarity based models for query spelling correction.
In Proceedings of COLING/ACL 2006, pages 1025–1032, Sydney, Australia.
Jianhua Lin. 1991. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151.
Dekang Lin. 1998. Automatic retrieval and clustering of similar words. In Proceedings of the 36th Annual Meeting of the ACL and 17th International Conference on Computational Linguistics (COLING/ACL-98), pages 768–774, Montreal, Quebec, Canada.
Fei Liu, Fuliang Weng, Bingqing Wang, and Yang Liu. 2011a. Insertion, deletion, or substitution? normalizing text messages without pre-categorization nor supervision. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), pages 71–76, Portland, Oregon, USA.
Xiaohua Liu, Shaodian Zhang, Furu Wei, and Ming Zhou. 2011b. Recognizing named entities in tweets. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), pages 359–367, Portland, Oregon, USA.
Fei Liu, Fuliang Weng, and Xiao Jiang. 2012. A broad-coverage normalization system for social media language. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), Jeju, Republic of Korea.
Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. 2002. Text classification using string kernels. J. Mach. Learn. Res., 2:419–444.
Marco Lui and Timothy Baldwin. 2011. Cross-domain feature selection for language identification. In Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP 2011), pages 553–561, Chiang Mai, Thailand.
Brendan O’Connor, Michel Krieger, and David Ahn. 2010. TweetMotif: Exploratory search and topic summarization for Twitter. In Proceedings of the 4th International Conference on Weblogs and Social Media (ICWSM 2010), pages 384–385, Washington, USA.
Lawrence Philips. 2000.
The double metaphone search algorithm. C/C++ Users Journal, 18:38–43.
Lawrence R. Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286.
Alan Ritter, Colin Cherry, and Bill Dolan. 2010. Unsupervised modeling of Twitter conversations. In Proceedings of Human Language Technologies: The 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2010), pages 172–180, Los Angeles, USA.
Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named entity recognition in tweets: An experimental study. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP 2011), pages 1524–1534, Edinburgh, Scotland, UK.
Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. 2010. Earthquake shakes Twitter users: real-time event detection by social sensors. In Proceedings of the 19th International Conference on the World Wide Web (WWW 2010), pages 851–860, Raleigh, North Carolina, USA.
Crispin Thurlow. 2003. Generation txt? The sociolinguistics of young people’s text-messaging. Discourse Analysis Online, 1(1).
Kristina Toutanova and Robert C. Moore. 2002. Pronunciation modeling for improved spelling correction. In Proceedings of the 40th Annual Meeting of the ACL and 3rd Annual Meeting of the NAACL (ACL-02), pages 144–151, Philadelphia, USA.
Official Blog Twitter. 2011. 200 million tweets per day. Retrieved on August 17th, 2011.
Jianshu Weng and Bu-Sung Lee. 2011. Event detection in Twitter. In Proceedings of the 5th International Conference on Weblogs and Social Media (ICWSM 2011), Barcelona, Spain.
Zhenzhen Xue, Dawei Yin, and Brian D. Davison. 2011. Normalizing microtext. In Proceedings of the AAAI-11 Workshop on Analyzing Microtext, pages 74–79, San Francisco, USA.
Alexander Yeh. 2000. More accurate tests for the statistical significance of result differences.
In Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000), pages 947–953, Saarbrücken, Germany.
2 0.054450508 138 emnlp-2012-Wiki-ly Supervised Part-of-Speech Tagging
Author: Shen Li ; Joao Graca ; Ben Taskar
Abstract: Despite significant recent work, purely unsupervised techniques for part-of-speech (POS) tagging have not achieved useful accuracies required by many language processing tasks. Use of parallel text between resource-rich and resource-poor languages is one source of weak supervision that significantly improves accuracy. However, parallel text is not always available and techniques for using it require multiple complex algorithmic steps. In this paper we show that we can build POS-taggers exceeding state-of-the-art bilingual methods by using simple hidden Markov models and a freely available and naturally growing resource, the Wiktionary. Across eight languages for which we have labeled data to evaluate results, we achieve accuracy that significantly exceeds best unsupervised and parallel text methods. We achieve highest accuracy reported for several languages and show that our approach yields better out-of-domain taggers than those trained using fully supervised Penn Treebank.
3 0.050868001 129 emnlp-2012-Type-Supervised Hidden Markov Models for Part-of-Speech Tagging with Incomplete Tag Dictionaries
Author: Dan Garrette ; Jason Baldridge
Abstract: Past work on learning part-of-speech taggers from tag dictionaries and raw data has reported good results, but the assumptions made about those dictionaries are often unrealistic: due to historical precedents, they assume access to information about labels in the raw and test sets. Here, we demonstrate ways to learn hidden Markov model taggers from incomplete tag dictionaries. Taking the MIN-GREEDY algorithm (Ravi et al., 2010) as a starting point, we improve it with several intuitive heuristics. We also define a simple HMM emission initialization that takes advantage of the tag dictionary and raw data to capture both the openness of a given tag and its estimated prevalence in the raw data. Altogether, our augmentations produce improvements to performance over the original MIN-GREEDY algorithm for both English and Italian data.
4 0.046635766 96 emnlp-2012-Name Phylogeny: A Generative Model of String Variation
Author: Nicholas Andrews ; Jason Eisner ; Mark Dredze
Abstract: Many linguistic and textual processes involve transduction of strings. We show how to learn a stochastic transducer from an unorganized collection of strings (rather than string pairs). The role of the transducer is to organize the collection. Our generative model explains similarities among the strings by supposing that some strings in the collection were not generated ab initio, but were instead derived by transduction from other, “similar” strings in the collection. Our variational EM learning algorithm alternately reestimates this phylogeny and the transducer parameters. The final learned transducer can quickly link any test name into the final phylogeny, thereby locating variants of the test name. We find that our method can effectively find name variants in a corpus of web strings used to refer to persons in Wikipedia, improving over standard untrained distances such as Jaro-Winkler and Levenshtein distance.
5 0.04486531 106 emnlp-2012-Part-of-Speech Tagging for Chinese-English Mixed Texts with Dynamic Features
Author: Jiayi Zhao ; Xipeng Qiu ; Shu Zhang ; Feng Ji ; Xuanjing Huang
Abstract: In modern Chinese articles or conversations, it is very popular to involve a few English words, especially in emails and Internet literature. Therefore, it becomes an important and challenging topic to analyze Chinese-English mixed texts. The underlying problem is how to tag part-of-speech (POS) for the English words involved. Due to the lack of specially annotated corpus, most of the English words are tagged as the oversimplified type, “foreign words”. In this paper, we present a method using dynamic features to tag POS of mixed texts. Experiments show that our method achieves higher performance than traditional sequence labeling methods. Meanwhile, our method also boosts the performance of POS tagging for pure Chinese texts.
6 0.044275861 131 emnlp-2012-Unified Dependency Parsing of Chinese Morphological and Syntactic Structures
7 0.042067789 63 emnlp-2012-Identifying Event-related Bursts via Social Media Activities
8 0.040330149 108 emnlp-2012-Probabilistic Finite State Machines for Regression-based MT Evaluation
9 0.036482442 47 emnlp-2012-Explore Person Specific Evidence in Web Person Name Disambiguation
10 0.036478166 123 emnlp-2012-Syntactic Transfer Using a Bilingual Lexicon
11 0.034348264 89 emnlp-2012-Mixed Membership Markov Models for Unsupervised Conversation Modeling
12 0.032807682 5 emnlp-2012-A Discriminative Model for Query Spelling Correction with Latent Structural SVM
13 0.032576103 121 emnlp-2012-Supervised Text-based Geolocation Using Language Models on an Adaptive Grid
14 0.030179683 111 emnlp-2012-Regularized Interlingual Projections: Evaluation on Multilingual Transliteration
15 0.028907003 13 emnlp-2012-A Unified Approach to Transliteration-based Text Input with Online Spelling Correction
16 0.028173016 32 emnlp-2012-Detecting Subgroups in Online Discussions by Modeling Positive and Negative Relations among Participants
17 0.027913671 4 emnlp-2012-A Comparison of Vector-based Representations for Semantic Composition
18 0.027678788 24 emnlp-2012-Biased Representation Learning for Domain Adaptation
19 0.02763954 58 emnlp-2012-Generalizing Sub-sentential Paraphrase Acquisition across Original Signal Type of Text Pairs
20 0.027605535 80 emnlp-2012-Learning Verb Inference Rules from Linguistically-Motivated Evidence
topicId topicWeight
[(0, 0.112), (1, 0.005), (2, 0.007), (3, 0.016), (4, -0.001), (5, 0.016), (6, 0.066), (7, -0.026), (8, 0.036), (9, -0.078), (10, 0.022), (11, -0.074), (12, -0.023), (13, -0.015), (14, -0.07), (15, -0.033), (16, 0.052), (17, -0.035), (18, -0.011), (19, 0.027), (20, -0.067), (21, -0.049), (22, 0.06), (23, -0.019), (24, 0.047), (25, 0.041), (26, -0.086), (27, 0.035), (28, 0.062), (29, 0.094), (30, 0.103), (31, 0.049), (32, -0.144), (33, 0.007), (34, -0.059), (35, 0.083), (36, -0.069), (37, -0.032), (38, 0.044), (39, -0.112), (40, -0.174), (41, -0.084), (42, -0.161), (43, -0.35), (44, -0.198), (45, 0.114), (46, -0.094), (47, 0.177), (48, -0.009), (49, -0.266)]
simIndex simValue paperId paperTitle
same-paper 1 0.95220566 22 emnlp-2012-Automatically Constructing a Normalisation Dictionary for Microblogs
Author: Bo Han ; Paul Cook ; Timothy Baldwin
Abstract: Microblog normalisation methods often utilise complex models and struggle to differentiate between correctly-spelled unknown words and lexical variants of known words. In this paper, we propose a method for constructing a dictionary of lexical variants of known words that facilitates lexical normalisation via simple string substitution (e.g. tomorrow for tmrw). We use context information to generate possible variant and normalisation pairs and then rank these by string similarity. Highly-ranked pairs are selected to populate the dictionary. We show that a dictionary-based approach achieves state-of-the-art performance for both F-score and word error rate on a standard dataset. Compared with other methods, this approach offers a fast, lightweight and easy-to-use solution, and is thus suitable for high-volume microblog pre-processing.

1 Lexical Normalisation

A staggering number of short text “microblog” messages are produced every day through social media such as Twitter (Twitter, 2011). The immense volume of real-time, user-generated microblogs that flows through sites has been shown to have utility in applications such as disaster detection (Sakaki et al., 2010), sentiment analysis (Jiang et al., 2011; González-Ibáñez et al., 2011), and event discovery (Weng and Lee, 2011; Benson et al., 2011). However, due to the spontaneous nature of the posts, microblogs are notoriously noisy, containing many non-standard forms (e.g., tmrw “tomorrow” and 2day “today”) which degrade the performance of natural language processing (NLP) tools (Ritter et al., 2010; Han and Baldwin, 2011). To reduce this effect, attempts have been made to adapt NLP tools to microblog data (Gimpel et al., 2011; Foster et al., 2011; Liu et al., 2011b; Ritter et al., 2011). An alternative approach is to pre-normalise non-standard lexical variants to their standard orthography (Liu et al., 2011a; Han and Baldwin, 2011; Xue et al., 2011; Gouws et al., 2011). For example, se u 2morw!!!
would be normalised to see you tomorrow! The normalisation approach is especially attractive as a preprocessing step for applications which rely on keyword match or word frequency statistics. For example, earthqu, eathquake, and earthquakeee, all attested in a Twitter corpus, have the standard form earthquake; by normalising these types to their standard form, better coverage can be achieved for keyword-based methods, and better word frequency estimates can be obtained.

Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 421–432, Jeju Island, Korea, 12–14 July 2012. © 2012 Association for Computational Linguistics

In this paper, we focus on the task of lexical normalisation of English Twitter messages, in which out-of-vocabulary (OOV) tokens are normalised to their in-vocabulary (IV) standard form, i.e., a standard form that is in a dictionary. Following other recent work on lexical normalisation (Liu et al., 2011a; Han and Baldwin, 2011; Gouws et al., 2011; Liu et al., 2012), we specifically focus on one-to-one normalisation in which one OOV token is normalised to one IV word.

Naturally, not all OOV words in microblogs are lexical variants of IV words: named entities, for example, are prevalent in microblogs, but not all named entities are included in our dictionary. One challenge for lexical normalisation is therefore to distinguish those OOV tokens that require normalisation from those that are well-formed. Recent unsupervised approaches have not attempted to distinguish such tokens from other types of OOV tokens (Cook and Stevenson, 2009; Liu et al., 2011a), limiting their applicability to real-world normalisation tasks. Other approaches (Han and Baldwin, 2011; Gouws et al., 2011) have followed a cascaded approach in which lexical variants are first identified, and then normalised.
However, such two-step approaches suffer from poor lexical variant identification performance, which is propagated to the normalisation step. Motivated by the observation that most lexical variants have an unambiguous standard form (especially for longer tokens), and that a lexical variant and its standard form typically occur in similar contexts, in this paper we propose methods for automatically constructing a lexical normalisation dictionary, i.e. a dictionary whose entries consist of (lexical variant, standard form) pairs, that enables type-based normalisation.

Despite the simplicity of this dictionary-based normalisation method, we show it to outperform previously-proposed approaches. This very fast, lightweight solution is suitable for real-time processing of the large volume of streaming microblog data available from Twitter, and offers a simple solution to the lexical variant detection problem that hinders other normalisation methods. Furthermore, this dictionary-based method can be easily integrated with other more-complex normalisation approaches (Liu et al., 2011a; Han and Baldwin, 2011; Gouws et al., 2011) to produce hybrid systems.

After discussing related work in Section 2, we present an overview of our dictionary-based approach to normalisation in Section 3. In Sections 4 and 5 we experimentally select the optimised context similarity parameters and string similarity re-ranking method. We present experimental results on the unseen test data in Section 6, and offer some concluding remarks in Section 7.

2 Related Work

Given a token t, lexical normalisation is the task of finding $\arg\max_s P(s \mid t) = \arg\max_s P(t \mid s)\,P(s)$, where s is the standard form, i.e., an IV word. Standardly in lexical normalisation, t is assumed to be an OOV token, relative to a fixed dictionary.
In practice, not all OOV tokens should be normalised; i.e., only lexical variants (e.g., tmrw “tomorrow”) should be normalised and tokens that are OOV but otherwise not lexical variants (e.g., iPad “iPad”) should be unchanged. Most work in this area focuses only on the normalisation task itself, oftentimes assuming that the task of lexical variant detection has already been completed.

Various approaches have been proposed to estimate the error model, P(t|s). For example, in work on spell-checking, Brill and Moore (2000) improve on a standard edit-distance approach by considering multi-character edit operations; Toutanova and Moore (2002) build on this by incorporating phonological information. Li et al. (2006) utilise distributional similarity (Lin, 1998) to correct misspelled search queries. In text message normalisation, Choudhury et al. (2007) model the letter transformations and emissions using a hidden Markov model (Rabiner, 1989). Cook and Stevenson (2009) and Xue et al. (2011) propose multiple simple error models, each of which captures a particular way in which lexical variants are formed, such as phonetic spelling (e.g., epik “epic”) or clipping (e.g., walkin “walking”). Nevertheless, optimally weighting the various error models in these approaches is challenging.

Without pre-categorising lexical variants into different types, Liu et al. (2011a) collect Google search snippets from carefully-designed queries from which they then extract noisy lexical variant–standard form pairs. These pairs are used to train a conditional random field (Lafferty et al., 2001) to estimate P(t|s) at the character level. One shortcoming of querying a search engine to obtain training pairs is that it tends to be costly in terms of time and bandwidth. Here we exploit microblog data directly to derive (lexical variant, standard form) pairs, instead of relying on external resources. In more-recent work, Liu et al.
(2012) endeavour to improve the accuracy of top-n normalisation candidates by integrating human cognitive inference, character-level transformations and spell checking in their normalisation model. The encouraging results shift the focus to re-ranking and promoting the correct normalisation to the top-1 position. However, like much previous work on lexical normalisation, this work assumes perfect lexical variant detection.

Aw et al. (2006) and Kaufmann and Kalita (2010) consider normalisation as a machine translation task from lexical variants to standard forms using off-the-shelf tools. These methods do not assume that lexical variants have been pre-identified; however, these methods do rely on large quantities of labelled training data, which is not available for microblogs.

Recently, Han and Baldwin (2011) and Gouws et al. (2011) propose two-step unsupervised approaches to normalisation, in which lexical variants are first identified, and then normalised. They approach lexical variant detection by using a context fitness classifier (Han and Baldwin, 2011) or through dictionary lookup (Gouws et al., 2011). However, the lexical variant detection of both methods is rather unreliable, indicating the challenge of this aspect of normalisation. Both of these approaches incorporate a relatively small normalisation dictionary to capture frequent lexical variants with high precision. In particular, Gouws et al. (2011) produce a small normalisation lexicon based on distributional similarity and string similarity (Lodhi et al., 2002). Our method adopts a similar strategy using distributional/string similarity, but instead of constructing a small lexicon for preprocessing, we build a much wider-coverage normalisation dictionary and opt for a fully lexicon-based end-to-end normalisation approach. In contrast to the normalisation dictionaries of Han and Baldwin (2011) and Gouws et al.
(2011), which focus on very frequent lexical variants, we focus on moderate-frequency lexical variants of a minimum character length, which tend to have unambiguous standard forms; our intention is to produce normalisation lexicons that are complementary to those currently available. Furthermore, we investigate the impact of a variety of contextual and string similarity measures on the quality of the resulting lexicons. In summary, our dictionary-based normalisation approach is a lightweight end-to-end method which performs both lexical variant detection and normalisation, and thus is suitable for practical online preprocessing, despite its simplicity.

3 A Lexical Normalisation Dictionary

Before discussing our method for creating a normalisation dictionary, we first discuss the feasibility of such an approach.

3.1 Feasibility

Dictionary lookup approaches to normalisation have been shown to have high precision but low recall (Han and Baldwin, 2011; Gouws et al., 2011). Frequent (lexical variant, standard form) pairs such as (u, you) are typically included in the dictionaries used by such methods, while less-frequent items such as (g0tta, gotta) are generally omitted. Because of the degree of lexical creativity and large number of non-standard forms observed on Twitter, a wide-coverage normalisation dictionary would be expensive to construct manually. Based on the assumption that lexical variants occur in similar contexts to their standard forms, however, it should be possible to automatically construct a normalisation dictionary with wider coverage than is currently available.

Dictionary lookup is a type-based approach to normalisation, i.e., every token instance of a given type will always be normalised in the same way. However, lexical variants can be ambiguous, e.g., y corresponds to “you” in yeah, y r right! LOL but “why” in AM CONFUSED!!! y you did that?
Nevertheless, the relative occurrence of ambiguous lexical variants is small (Liu et al., 2011a), and it has been observed that while shorter variants such as y are often ambiguous, longer variants tend to be unambiguous. For example, bthday and 4eva are unlikely to have standard forms other than “birthday” and “forever”, respectively. Therefore, the normalisation lexicons we produce will only contain entries for OOVs with character length greater than a specified threshold, which are likely to have an unambiguous standard form.

3.2 Overview of approach

Our method for constructing a normalisation dictionary is as follows:

Input: Tokenised English tweets
1. Extract (OOV, IV) pairs based on distributional similarity.
2. Re-rank the extracted pairs by string similarity.
Output: A list of (OOV, IV) pairs ordered by string similarity; select the top-n pairs for inclusion in the normalisation lexicon.

In Step 1, we leverage large volumes of Twitter data to identify the most distributionally-similar IV type for each OOV type. The result of this process is a set of (OOV, IV) pairs, ranked by distributional similarity. The extracted pairs will include (lexical variant, standard form) pairs, such as (tmrw, tomorrow), but will also contain false positives such as (Tusday, Sunday), where Tusday is a lexical variant but its standard form is not “Sunday”, and (Youtube, web), where Youtube is an OOV named entity, not a lexical variant. Nevertheless, lexical variants are typically formed from their standard forms through regular processes (Thurlow, 2003), e.g., the omission of characters, and from this perspective Sunday and web are not plausible standard forms for Tusday and Youtube, respectively. In Step 2, we therefore capture this intuition to re-rank the extracted pairs by string similarity. The top-n items in this re-ranked list then form the normalisation lexicon, which is based only on development data.
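The two steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `string_sim` and `build_lexicon`, the toy candidate lists and their distance values are all hypothetical, and `difflib`'s ratio is used as a stand-in string similarity rather than any of the measures evaluated in the paper.

```python
import difflib

def string_sim(a, b):
    # Stand-in similarity (difflib ratio); NOT one of the paper's measures.
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def build_lexicon(candidates, top_n, sim=string_sim):
    """candidates maps each OOV type to a list of (IV type, distributional distance)."""
    # Step 1: keep the distributionally closest IV type for each OOV type.
    pairs = [(oov, min(ivs, key=lambda x: x[1])[0])
             for oov, ivs in candidates.items() if ivs]
    # Step 2: re-rank the extracted pairs by string similarity, best first.
    ranked = sorted(pairs, key=lambda p: sim(p[0], p[1]), reverse=True)
    # Output: the top-n pairs form the normalisation lexicon.
    return dict(ranked[:top_n])

# Hypothetical candidates with made-up distances (lower = more similar contexts).
toy = {
    "tmrw": [("tomorrow", 0.2), ("today", 0.5)],
    "Youtube": [("web", 0.4)],
}
print(build_lexicon(toy, top_n=1))
```

In this toy run the named-entity pair (Youtube, web) survives Step 1 but is pushed below (tmrw, tomorrow) in Step 2, so a cut-off of one pair excludes it.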
Although computationally expensive to build, this dictionary can be created offline. Once built, it then offers a very fast approach to normalisation.

We can only reliably compute distributional similarity for types that are moderately frequent in a corpus. Nevertheless, many lexical variants are sufficiently frequent to be able to compute distributional similarity, and can potentially make their way into our normalisation lexicon. This approach is not suitable for normalising low-frequency lexical variants, nor is it suitable for shorter lexical variant types which, as discussed in Section 3.1, are more likely to have an ambiguous standard form. Nevertheless, previously-proposed normalisation methods that can handle such phenomena also rely in part on a normalisation lexicon. The normalisation lexicons we create can therefore be easily integrated with previous approaches to form hybrid normalisation systems.

4 Contextually-similar Pair Generation

Our objective is to extract contextually-similar (OOV, IV) pairs from a large-scale collection of microblog data. Fundamentally, the surrounding words define the primary context, but there are different ways of representing context and different similarity measures we can use, which may influence the quality of generated normalisation pairs.

In representing the context, we experimentally explore the following factors: (1) context window size (from 1 to 3 tokens on both sides); (2) n-gram order of the context tokens (unigram, bigram, trigram); (3) whether context words are indexed for relative position or not; and (4) whether we use all context tokens, or only IV words. Because high-accuracy linguistic processing tools for Twitter are still under exploration (Liu et al., 2011b; Gimpel et al., 2011; Ritter et al., 2011; Foster et al., 2011), we do not consider richer representations of context, for example, incorporating information about part-of-speech tags or syntax.
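To make the first three factors concrete, a small sketch of one possible feature extractor follows. The function name `context_features`, the feature string format, and the choice that n-grams never span the target token are assumptions made for illustration; the paper does not spell out these details. Factor (4), restricting context to IV words, is left out for brevity.

```python
def context_features(tokens, i, window=2, n=2, positional=True):
    """Collect n-gram context features around tokens[i] within +/- window tokens."""
    feats = []
    left_start = max(0, i - window)
    # Each side is (absolute index of first token, list of context tokens).
    sides = [(left_start, tokens[left_start:i]),
             (i + 1, tokens[i + 1:i + 1 + window])]
    for start, span in sides:
        # Slide an n-gram over this side only (assumed: no n-gram crosses the target).
        for j in range(len(span) - n + 1):
            gram = " ".join(span[j:j + n])
            offset = start + j - i  # position of the n-gram relative to the target
            feats.append(f"{offset:+d}:{gram}" if positional else gram)
    return feats

tweet = "see you tmrw at the party".split()
print(context_features(tweet, 2))                    # window 2, bigrams, positional
print(context_features(tweet, 2, positional=False))  # same without position indexing
```

With positional indexing, the same bigram seen to the left and to the right of a word yields two distinct features, which is the distinction that factor (3) varies.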
We also experiment with a number of simple but widely-used geometric and information-theoretic distance/similarity measures. In particular, we use Kullback–Leibler (KL) divergence (Kullback and Leibler, 1951), Jensen–Shannon (JS) divergence (Lin, 1991), Euclidean distance and Cosine distance.

We use a corpus of 10 million English tweets for parameter tuning, and a larger corpus of tweets in the final candidate ranking. All tweets were collected from September 2010 to January 2011 via the Twitter API (footnote 1). From the raw data we extract English tweets using a language identification tool (Lui and Baldwin, 2011), and then apply a simplified Twitter tokeniser (adapted from O’Connor et al. (2010)). We use the Aspell dictionary (v6.06; footnote 2) to determine whether a word is IV, and only include in our normalisation dictionary OOV tokens with at least 64 occurrences in the corpus and character length ≥ 4, both of which were determined through empirical observation. For each OOV word type in the corpus, we select the most similar IV type to form (OOV, IV) pairs. To further narrow the search space, we only consider IV words which are morphophonemically similar to the OOV type, following the settings in Han and Baldwin (2011) (footnote 3).

Footnote 1: https://dev.twitter.com/docs/streaming-api/methods
Footnote 2: http://aspell.net/
Footnote 3: We only consider IV words within an edit distance of 2 or a phonemic edit distance of 1 from the OOV type, and we further only consider the top 30% most-frequent of these IV words.

In order to evaluate the generated pairs, we randomly selected 1000 OOV words from the 10 million tweet corpus. We set up an annotation task on Amazon Mechanical Turk (footnote 4), presenting five independent annotators with each word type (with no context) and asking for corrections where appropriate. For instance, given tmrw, the annotators would likely identify it as a non-standard variant of “tomorrow”. For correct OOV words like iPad, on the other hand, we would expect them to leave the word unchanged. If 3 or more of the 5 annotators make the same suggestion (in the form of either a canonical spelling or leaving the word unchanged), we include this in our gold standard for evaluation. In total, this resulted in 351 lexical variants and 282 correct OOV words, accounting for 63.3% of the 1000 OOV words. These 633 OOV words were used as (OOV, IV) pairs for parameter tuning. The remainder of the 1000 OOV words were ignored on the grounds that there was not sufficient consensus amongst the annotators (footnote 5).

Footnote 4: https://www.mturk.com/mturk/welcome
Footnote 5: Note that the objective of this annotation task is to identify lexical variants that have agreed-upon standard forms irrespective of context, as a special case of the more general task of lexical normalisation (where context may or may not play a significant role in the determination of the normalisation).

Contextually-similar pair generation aims to include as many correct normalisation pairs as possible. We evaluate the quality of the normalisation pairs using “Cumulative Gain” (CG):

$$\mathrm{CG} = \sum_{i=1}^{N'} rel'_i$$

Suppose there are $N'$ correct generated pairs $(oov_i, iv_i)$, each of which is weighted by $rel'_i$, the frequency of $oov_i$, to indicate its relative importance; for example, (thinkin, thinking) has a higher weight than (g0tta, gotta) because thinkin is more frequent than g0tta in our corpus. In this evaluation we do not consider the position of normalisation pairs, nor do we penalise incorrect pairs. Instead, we push distinguishing between correct and incorrect pairs into the downstream re-ranking step, in which we incorporate string similarity information.

Given the development data and CG, we run an exhaustive search of parameter combinations over our development corpus. The five best parameter combinations are shown in Table 1. We notice the CG is almost identical for the top combinations.
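As a small worked illustration of CG, the sketch below sums the corpus frequency of every generated pair that agrees with the gold standard and ignores the rest. The function name `cumulative_gain` and all frequencies are hypothetical values chosen for the example; they are not figures from the paper.

```python
def cumulative_gain(pairs, gold, oov_freq):
    """Sum the OOV frequencies of generated pairs that match the gold standard.

    Incorrect pairs contribute nothing (they are not penalised), and the
    order of the pairs is irrelevant, as in the definition of CG.
    """
    return sum(oov_freq[oov] for oov, iv in pairs if gold.get(oov) == iv)

# Hypothetical generated pairs, gold standard and corpus frequencies.
generated = [("thinkin", "thinking"), ("g0tta", "gotta"), ("Tusday", "Sunday")]
gold = {"thinkin": "thinking", "g0tta": "gotta", "Tusday": "Tuesday"}
freq = {"thinkin": 500, "g0tta": 70, "Tusday": 90}

print(cumulative_gain(generated, gold, freq))  # 500 + 70; (Tusday, Sunday) is ignored
```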
As a context window size of 3 incurs a heavy processing and memory overhead over a size of 2, we use the 3rd-best parameter combination for subsequent experiments, namely: context window of ±2 tokens, token bigrams, positional index, and KL divergence as our distance measure.

To better understand the sensitivity of the method to each parameter, we perform a post-hoc parameter analysis relative to a default setting (as underlined in Table 2), altering one parameter at a time. The results in Table 2 show that bigrams outperform other n-gram orders by a large margin (note that the evaluation is based on a log scale), and information-theoretic measures are superior to the geometric measures. Furthermore, the results indicate that positional indexing better captures context. However, there is little to distinguish context modelling with just IV words or all tokens. Similarly, the context window size has relatively little impact on the overall performance, supporting our earlier observation from Table 1.

5 Pair Re-ranking by String Similarity

Once the contextually-similar (OOV, IV) pairs are generated using the selected parameters in Section 4, we further re-rank this set of pairs in an attempt to boost morphophonemically-similar pairs like (bananaz, bananas), and penalise noisy pairs like (paninis, beans). Instead of using the small 10 million tweet corpus, from this step onwards we use a larger corpus of 80 million English tweets (collected over the same period as the development corpus) to develop a larger-scale normalisation dictionary. This is because once pairs are generated, re-ranking based on string comparison is much faster. We only include in the dictionary OOV words with a token frequency > 15 to include more OOV types than in Section 4, and again apply a minimum length cutoff of 4 characters.
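For readers unfamiliar with the distance measure selected in Section 4, a minimal sketch of KL divergence over context-feature counts follows. Several details are assumptions, since the text does not specify them: additive smoothing with `alpha` to avoid zero probabilities, the direction of the divergence (OOV distribution against IV distribution), and the helper names `kl_divergence` and `closest_iv`. The feature counts are toy values.

```python
import math
from collections import Counter

def kl_divergence(p_counts, q_counts, alpha=1.0):
    """KL(P || Q) over the union of features, with add-alpha smoothing (assumed)."""
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + alpha * len(vocab)
    q_total = sum(q_counts.values()) + alpha * len(vocab)
    kl = 0.0
    for feat in vocab:
        p = (p_counts.get(feat, 0) + alpha) / p_total
        q = (q_counts.get(feat, 0) + alpha) / q_total
        kl += p * math.log(p / q)
    return kl

def closest_iv(oov_counts, iv_contexts):
    """Pick the IV type whose context distribution is nearest to the OOV's."""
    return min(iv_contexts, key=lambda iv: kl_divergence(oov_counts, iv_contexts[iv]))

# Toy context-feature counts (hypothetical).
tmrw = Counter({"-2:see you": 3, "+1:at the": 2})
ivs = {
    "tomorrow": Counter({"-2:see you": 4, "+1:at the": 3}),
    "banana": Counter({"-2:eat a": 5}),
}
print(closest_iv(tmrw, ivs))
```

KL divergence is asymmetric and undefined when Q assigns zero probability to a feature that P uses, which is why some form of smoothing is needed in practice; JS divergence avoids both issues at the cost of extra computation.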
Rank  Window size  n-gram  Positional index?  Lex. choice  Sim/distance measure  log(CG)
1     ±3           2       Yes                All          KL divergence         19.571
2     ±3           2       No                 All          KL divergence         19.562
3     ±2           2       Yes                All          KL divergence         19.562
4     ±3           2       Yes                IVs          KL divergence         19.561
5     ±2           2       Yes                IVs          JS divergence         19.554

Table 1: The five best parameter combinations in the exhaustive search of parameter combinations

Parameter                    Setting        log(CG)
Window size                  ±1             19.325
                             ±2             19.327
                             ±3             19.328
n-gram                       1              19.328
                             2              19.571
                             3              19.324
Positional index?            Yes            19.328
                             No             19.263
Lexical choice               IVs            19.335
                             All            19.328
Similarity/distance measure  KL divergence  19.328
                             Euclidean      19.227
                             JS divergence  19.311
                             Cosine         19.170

Table 2: Parameter sensitivity analysis measured as log(CG) for correctly-generated pairs. We tune one parameter at a time, using the default (underlined) setting for other parameters; the non-exhaustive best-performing setting in each case is indicated in bold.

To measure how well our re-ranking method promotes correct pairs and demotes incorrect pairs (including both OOV words that should not be normalised, e.g. (Youtube, web), and incorrect normalisations for lexical variants, e.g. (bcuz, cause)), we modify our evaluation metric from Section 4 to evaluate the ranking at different points, using Discounted Cumulative Gain (DCG@N: Jarvelin and Kekalainen (2002)):

$$\mathrm{DCG@}N = rel_1 + \sum_{i=2}^{N} \frac{rel_i}{\log_2(i)}$$

where $rel_i$ again represents the frequency of the OOV, but it can be gain (a positive number) or loss (a negative number), depending on whether the $i$th pair is correct or incorrect. Because we also expect correct pairs to be ranked higher than incorrect pairs, DCG@N takes both factors into account.
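The DCG@N definition can be sketched in a few lines. The reading that a loss is the negated OOV frequency is an assumption (the text only says the value is negative for incorrect pairs), and the function name `dcg_at_n` and the example frequencies are hypothetical.

```python
import math

def dcg_at_n(ranked, n):
    """ranked is a list of (oov_frequency, is_correct) tuples in rank order."""
    total = 0.0
    for i, (freq, correct) in enumerate(ranked[:n], start=1):
        rel = freq if correct else -freq  # gain for correct pairs, loss otherwise
        # Rank 1 is undiscounted; ranks 2..N are divided by log2(rank).
        total += rel if i == 1 else rel / math.log2(i)
    return total

# Hypothetical ranking: correct, incorrect, correct.
ranking = [(100, True), (50, False), (80, True)]
for cutoff in (1, 2, 3):
    print(cutoff, dcg_at_n(ranking, cutoff))
```

Because log2(2) = 1, the second-ranked pair is also effectively undiscounted; the discount only starts to bite from rank 3 onwards.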
Given the generated pairs and the evaluation metric, we first consider three baselines: no re-ranking (i.e., the final ranking is that of the contextual similarity scores), and re-rankings of the pairs based on the frequencies of the OOVs in the Twitter corpus, and the IV unigram frequencies in the Google Web 1T corpus (Brants and Franz, 2006) to get less-noisy frequency estimates. We also compared a variety of re-rankings based on a number of string similarity measures that have been previously considered in normalisation work (reviewed in Section 2). We experiment with standard edit distance (Levenshtein, 1966), edit distance over double metaphone codes (phonetic edit distance: (Philips, 2000)), longest common subsequence ratio over the consonant edit distance of the paired words (hereafter denoted as consonant edit distance: (Contractor et al., 2010)), and a string subsequence kernel (Lodhi et al., 2002).

In Figure 1, we present the DCG@N results for each of our ranking methods at different rank cutoffs. Ranking by OOV frequency is motivated by the assumption that lexical variants are frequently used by social media users. This is confirmed by our findings that lexical pairs like (goin, going) and (nite, night) are at the top of the ranking. However, many proper nouns and named entities are also used frequently, so pairs like (Facebook, speech) and (Youtube, web) are also ranked at the top, mixed in with the lexical variants. In ranking by IV word frequency, we assume the lexical variants are usually derived from frequently-used IV equivalents, e.g. (abou, about). However, many less-frequent lexical variant types have high-frequency (IV) normalisations. For instance, the highest-frequency IV word the has more than 40 OOV lexical variants, such as tthe and thhe. These less-frequent types occupy the top positions, reducing the cumulative gain. Compared with these two baselines, ranking by default contextual similarity scores delivers promising results.
It successfully ranks many more intuitive normalisation pairs at the top, such as (2day, today) and (wknd, weekend), but also ranks some incorrect pairs highly, such as (needa, gotta).

The string similarity-based methods perform better than our baselines in general. Through manual analysis, we found that standard edit distance ranking is fairly accurate for lexical variants with low edit distance to their standard forms, but fails to identify heavily-altered variants like (tmrw, tomorrow). Consonant edit distance is similar to standard edit distance, but places many longer words at the top of the ranking. Edit distance over double metaphone codes (phonetic edit distance) performs particularly well for lexical variants that include character repetitions, commonly used for emphasis on Twitter, because such repetitions do not typically alter the phonetic codes. Compared with the other methods, the string subsequence kernel delivers encouraging results. It measures common character subsequences of length n between (OOV, IV) pairs. Because it is computationally expensive to calculate similarity for larger n, we choose n=2, following Gouws et al. (2011). As N (the lexicon size cut-off) increases, the performance drops more slowly than the other methods. Although this method fails to rank heavily-altered variants such as (4get, forget) highly, it typically works well for longer words. Given that we focus on longer OOVs (specifically those with at least 4 characters), this ultimately isn’t a great handicap.

6 Evaluation

Given the re-ranked pairs from Section 5, here we apply them to a token-level normalisation task using the normalisation dataset of Han and Baldwin (2011).

6.1 Metrics

We evaluate using the standard evaluation metrics of precision (P), recall (R) and F-score (F) as detailed below. We also consider the false alarm rate (FA) and word error rate (WER), also as shown below.
FA measures the negative effects of applying normalisation; a good approach to normalisation should not (incorrectly) normalise tokens that are already in their standard form and do not require normalisation (footnote 6). WER, like F-score, shows the overall benefits of normalisation, but unlike F-score, measures how many token-level edits are required for the output to be the same as the ground truth data. In general, dictionaries with a high F-score/low WER and low FA are preferable.

$$P = \frac{\#\ \text{correctly normalised tokens}}{\#\ \text{normalised tokens}}$$

$$R = \frac{\#\ \text{correctly normalised tokens}}{\#\ \text{tokens requiring normalisation}}$$

$$F = \frac{2PR}{P + R}$$

$$FA = \frac{\#\ \text{incorrectly normalised tokens}}{\#\ \text{normalised tokens}}$$

$$WER = \frac{\#\ \text{token edits needed after normalisation}}{\#\ \text{all tokens}}$$

Footnote 6: FA + P ≤ 1 because some lexical variants might be incorrectly normalised.

6.2 Results

We select the three best re-ranking methods, and best cut-off N for each method, based on the highest DCG@N value for a given method over the development data, as presented in Figure 1. Namely, they are string subsequence kernel (S-dict, N=40,000), double metaphone edit distance (DM-dict, N=10,000) and default contextual similarity without re-ranking (C-dict, N=10,000) (footnote 7). We evaluate each of the learned dictionaries in Table 3. We also compare each dictionary with the performance of the manually-constructed Internet slang dictionary (HB-dict) used by Han and Baldwin (2011), the small automatically-derived dictionary of Gouws et al. (2011) (GHM-dict), and combinations of the different dictionaries. In addition, the contribution of these dictionaries in hybrid normalisation approaches is also presented, in which we first normalise OOVs using a given dictionary (combined or otherwise), and then apply the normalisation method of Gouws et al. (2011) based on consonant edit distance (GHM-norm), or the approach of Han and Baldwin (2011) based on the summation of many unsupervised approaches (HB-norm), to the remaining OOVs.
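The five metrics of Section 6.1 can be computed from three aligned token sequences, as in the sketch below. It rests on two assumptions that go beyond the text: normalisation is one-to-one at the token level, so WER reduces to the fraction of output tokens that differ from the gold standard, and, following footnote 6, a false alarm is a token that needed no normalisation but was changed anyway. The function name `evaluate` and the example tokens are hypothetical.

```python
def evaluate(source, predicted, gold):
    """Token-level P, R, F, FA and WER for aligned, equal-length token lists."""
    rows = list(zip(source, predicted, gold))
    normalised = sum(p != s for s, p, g in rows)               # tokens the system changed
    correct = sum(p != s and p == g for s, p, g in rows)       # changed and now right
    needed = sum(s != g for s, p, g in rows)                   # tokens requiring normalisation
    false_alarms = sum(s == g and p != s for s, p, g in rows)  # changed but already standard
    edits = sum(p != g for s, p, g in rows)                    # substitutions still required

    precision = correct / normalised if normalised else 0.0
    recall = correct / needed if needed else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    fa = false_alarms / normalised if normalised else 0.0
    wer = edits / len(rows) if rows else 0.0
    return {"P": precision, "R": recall, "F": f_score, "FA": fa, "WER": wer}

src = ["see", "u", "tmrw", "iphone", "bday"]
out = ["see", "you", "tomorrow", "phone", "day"]
ref = ["see", "you", "tomorrow", "iphone", "birthday"]
print(evaluate(src, out, ref))
```

In the example, bday is changed to the wrong form: it lowers precision but is not a false alarm, which is why FA + P can fall below 1.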
Results are shown in Table 3, and discussed below.

Footnote 7: We also experimented with combining ranks using Mean Reciprocal Rank. However, the combined rank didn’t improve performance on the development data. We plan to explore other ranking aggregation methods in future work.

Figure 1: Re-ranking based on different string similarity methods (x-axis: N cut-offs).

6.2.1 Individual Dictionaries

Overall, the individual dictionaries derived by the re-ranking methods (DM-dict, S-dict) perform better than that based on contextual similarity (C-dict) in terms of precision and false alarm rate, indicating the importance of re-ranking. Even though C-dict delivers higher recall, indicating that many lexical variants are correctly normalised, this is offset by its high false alarm rate, which is particularly undesirable in normalisation. Because S-dict has better performance than DM-dict in terms of both F-score and WER, and a much lower false alarm rate than C-dict, subsequent results are presented using S-dict only.

Both HB-dict and GHM-dict achieve better than 90% precision with moderate recall. Compared to these methods, S-dict is not competitive in terms of either precision or recall. This result seems rather discouraging. However, considering that S-dict is an automatically-constructed dictionary targeting lexical variants of varying frequency, it is not surprising that the precision is worse than that of HB-dict (which is manually constructed) and GHM-dict (which includes entries only for more-frequent OOVs for which distributional similarity is more accurate). Additionally, the recall of S-dict is hampered by the minimum lexical variant token length of 4 characters.

6.2.2 Combined Dictionaries

Next we look to combining HB-dict, GHM-dict and S-dict. In combining the dictionaries, a given OOV word can be listed with different standard forms in different dictionaries.
In such cases we use the following preferences for dictionaries (motivated by our confidence in the normalisation pairs of the dictionaries) to resolve conflicts: HB-dict > GHM-dict > S-dict.

When we combine dictionaries in the second section of Table 3, we find that they contain complementary information: in each case the recall and F-score are higher for the combined dictionary than any of the individual dictionaries. The combination of HB-dict+GHM-dict produces only a small improvement in terms of F-score over HB-dict (the better-performing dictionary), suggesting that, as claimed, HB-dict and GHM-dict share many frequent normalisation pairs. HB-dict+S-dict and GHM-dict+S-dict, on the other hand, improve substantially over HB-dict and GHM-dict, respectively, indicating that S-dict contains markedly different entries to both HB-dict and GHM-dict. The best F-score and WER are obtained using the combination of all three dictionaries, HB-dict+GHM-dict+S-dict. Furthermore, the difference between the results using HB-dict+GHM-dict+S-dict and HB-dict+GHM-dict is statistically significant (p < 0.01), based on the computationally-intensive Monte Carlo method of Yeh (2000), demonstrating the contribution of S-dict.

Method                            Precision  Recall  F-Score  False Alarm  Word Error Rate
C-dict                            0.474      0.218   0.299    0.298        0.103
DM-dict                           0.727      0.106   0.185    0.145        0.102
S-dict                            0.700      0.179   0.285    0.162        0.097
HB-dict                           0.915      0.435   0.590    0.048        0.066
GHM-dict                          0.982      0.319   0.482    0.000        0.076

HB-dict+S-dict                    0.840      0.601   0.701    0.090        0.052
GHM-dict+S-dict                   0.863      0.498   0.632    0.072        0.061
HB-dict+GHM-dict                  0.920      0.465   0.618    0.045        0.063
HB-dict+GHM-dict+S-dict           0.847      0.630   0.723    0.086        0.049

GHM-dict+GHM-norm                 0.338      0.578   0.427    0.458        0.135
HB-dict+GHM-dict+S-dict+GHM-norm  0.406      0.715   0.518    0.468        0.124
HB-dict+HB-norm                   0.515      0.771   0.618    0.332        0.081
HB-dict+GHM-dict+S-dict+HB-norm   0.527      0.789   0.632    0.332        0.079

Table 3: Normalisation results using our derived dictionaries (contextual similarity (C-dict); double metaphone rendering (DM-dict); string subsequence kernel scores (S-dict)), the dictionary of Gouws et al. (2011) (GHM-dict), the Internet slang dictionary (HB-dict) from Han and Baldwin (2011), and combinations of these dictionaries. In addition, we combine the dictionaries with the normalisation method of Gouws et al. (2011) (GHM-norm) and the combined unsupervised approach of Han and Baldwin (2011) (HB-norm).

6.2.3 Hybrid Approaches

The methods of Gouws et al. (2011) (i.e. GHM-dict+GHM-norm) and Han and Baldwin (2011) (i.e. HB-dict+HB-norm) have lower precision and higher false alarm rates than the dictionary-based approaches; this is largely caused by lexical variant detection errors (footnote 8). Using all dictionaries in combination with these methods (HB-dict+GHM-dict+S-dict+GHM-norm and HB-dict+GHM-dict+S-dict+HB-norm) gives some improvements, but the false alarm rates remain high. Despite the limitations of a pure dictionary-based approach to normalisation, discussed in Section 3.1, the current best practical approach to normalisation is to use a lexicon, combining hand-built and automatically-learned normalisation dictionaries.

Footnote 8: Here we report results that do not assume perfect detection of lexical variants, unlike the original published results in each case.

6.3 Discussion and Error Analysis

We first manually analyse the errors in the combined dictionary (HB-dict+GHM-dict+S-dict) and give examples of each error type in Table 4.

Error type            OOV      Standard form (Dict.)  Standard form (Gold)
(a) plurals           playe    players                player
(b) negation          unlike   like                   dislike
(c) possessives       anyones  anyone                 anyone’s
(d) correct OOVs      iphone   phone                  iphone
(e) test data errors  durin    during                 durin
(f) ambiguity         siging   signing                singing

Table 4: Error types in the combined dictionary (HB-dict+GHM-dict+S-dict)
The most frequent word errors are caused by slight morphological variations, including plural forms (a), negations (b), possessive cases (c), and OOVs that are correct and do not require normalisation (d). In addition, we also notice some missing annotations where lexical variants are skipped by human annotations but captured by our method (e). Ambiguity (f) definitely exists in longer OOVs; however, these cases do not appear to have a strong negative impact on the normalisation performance. An example of a remaining miscellaneous error is bday “birthday”, which is mis-normalised as day.

Length cut-off (N)  #Variants  Precision  Recall (≥ N)  Recall (all)  False Alarm
≥4                  556        0.700      0.381         0.179         0.162
≥5                  382        0.814      0.471         0.152         0.122
≥6                  254        0.804      0.484         0.104         0.131
≥7                  138        0.793      0.471         0.055         0.122

Table 5: S-dict normalisation results broken down according to OOV token length. Recall is presented both over the subset of instances of length ≥ N in the data (“Recall (≥ N)”), and over the entirety of the dataset (“Recall (all)”); “#Variants” is the number of token instances of the indicated length in the test dataset.

To further study the influence of OOV word length relative to the normalisation performance, we conduct a fine-grained analysis of the performance of the derived dictionary (S-dict) in Table 5, broken down across different OOV word lengths. The results generally support our hypothesis that our method works better for longer OOV words. The derived dictionary is much more reliable for longer tokens (length 5, 6, and 7 characters) in terms of precision and false alarm. Although the recall is relatively modest, in the future we intend to improve recall by mining more normalisation pairs from larger collections of microblog data.
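The dictionary combination of Section 6.2.2 and the lookup itself are simple enough to sketch together. This is an illustration only: the helper names `combine` and `normalise` and every dictionary entry below are hypothetical, and only the priority order HB-dict > GHM-dict > S-dict comes from the text.

```python
def combine(*dictionaries):
    """Merge lexicons listed from most to least trusted; earlier entries win conflicts."""
    merged = {}
    for lexicon in dictionaries:
        for oov, iv in lexicon.items():
            merged.setdefault(oov, iv)  # keep the entry from the higher-priority lexicon
    return merged

def normalise(tokens, lexicon):
    """Type-based lookup: substitute known variants, leave every other token as is."""
    return [lexicon.get(tok, tok) for tok in tokens]

# Hypothetical toy entries standing in for the three dictionaries.
hb_dict = {"u": "you", "bcuz": "because"}
ghm_dict = {"u": "you", "2morrow": "tomorrow"}
s_dict = {"bcuz": "cause", "bananaz": "bananas"}

combined = combine(hb_dict, ghm_dict, s_dict)
print(normalise("u like bananaz bcuz iphone".split(), combined))
```

Because detection and normalisation collapse into a single hash lookup per token, the run-time cost is linear in the number of tokens, whatever it cost to build the lexicons offline.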
7 Conclusions and Future Work

In this paper, we describe a method for automatically constructing a normalisation dictionary that supports normalisation of microblog text through direct substitution of lexical variants with their standard forms. After investigating the impact of different distributional and string similarity methods on the quality of the dictionary, we present experimental results on a standard dataset showing that our proposed methods acquire high-quality (lexical variant, standard form) pairs, with reasonable coverage, and achieve state-of-the-art end-to-end lexical normalisation performance on a real-world token-level task. Furthermore, this dictionary-lookup method combines the detection and normalisation of lexical variants into a simple, lightweight solution which is suitable for processing of high-volume microblog feeds.

In the future, we intend to improve our dictionary by leveraging the constantly-growing volume of microblog data, and considering alternative ways to combine distributional and string similarity. In addition to direct evaluation, we also want to explore the benefits of applying normalisation for downstream social media text processing applications, e.g. event detection.

Acknowledgements

We would like to thank the three anonymous reviewers for their insightful comments, and Stephan Gouws for kindly sharing his data and discussing his work. NICTA is funded by the Australian government as represented by Department of Broadband, Communication and Digital Economy, and the Australian Research Council through the ICT centre of Excellence programme.

References

AiTi Aw, Min Zhang, Juan Xiao, and Jian Su. 2006. A phrase-based statistical model for SMS text normalization. In Proceedings of COLING/ACL 2006, pages 33–40, Sydney, Australia.
Edward Benson, Aria Haghighi, and Regina Barzilay. 2011. Event discovery in social media feeds.
In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), pages 389–398, Portland, Oregon, USA.
Thorsten Brants and Alex Franz. 2006. Web 1T 5-gram Version 1.
Eric Brill and Robert C. Moore. 2000. An improved error model for noisy channel spelling correction. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 286–293, Hong Kong.
Monojit Choudhury, Rahul Saraf, Vijit Jain, Animesh Mukherjee, Sudeshna Sarkar, and Anupam Basu. 2007. Investigation and modeling of the structure of texting language. International Journal on Document Analysis and Recognition, 10:157–174.
Danish Contractor, Tanveer A. Faruquie, and L. Venkata Subramaniam. 2010. Unsupervised cleansing of noisy text. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), pages 189–196, Beijing, China.
Paul Cook and Suzanne Stevenson. 2009. An unsupervised model for text message normalization. In CALC ’09: Proceedings of the Workshop on Computational Approaches to Linguistic Creativity, pages 71–78, Boulder, USA.
Jennifer Foster, Özlem Çetinoglu, Joachim Wagner, Joseph L. Roux, Stephen Hogan, Joakim Nivre, Deirdre Hogan, and Josef van Genabith. 2011. #hardtoparse: POS Tagging and Parsing the Twitterverse. In Analyzing Microtext: Papers from the 2011 AAAI Workshop, volume WS-11-05 of AAAI Workshops, pages 20–25, San Francisco, CA, USA.
Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. 2011. Part-of-speech tagging for Twitter: Annotation, features, and experiments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), pages 42–47, Portland, Oregon, USA.
Roberto González-Ibáñez, Smaranda Muresan, and Nina Wacholder. 2011.
Identifying sarcasm in Twitter: a closer look. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), pages 581–586, Portland, Oregon, USA.
Stephan Gouws, Dirk Hovy, and Donald Metzler. 2011. Unsupervised mining of lexical variants from noisy text. In Proceedings of the First workshop on Unsupervised Learning in NLP, pages 82–90, Edinburgh, Scotland, UK.
Bo Han and Timothy Baldwin. 2011. Lexical normalisation of short text messages: Makn sens a #twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), pages 368–378, Portland, Oregon, USA.
K. Jarvelin and J. Kekalainen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4).
Long Jiang, Mo Yu, Ming Zhou, Xiaohua Liu, and Tiejun Zhao. 2011. Target-dependent Twitter sentiment classification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), pages 151–160, Portland, Oregon, USA.
Joseph Kaufmann and Jugal Kalita. 2010. Syntactic normalization of Twitter messages. In International Conference on Natural Language Processing, Kharagpur, India.
S. Kullback and R. A. Leibler. 1951. On information and sufficiency. Annals of Mathematical Statistics, 22:49–86.
John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 282–289, San Francisco, CA, USA.
Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10:707–710.
Mu Li, Yang Zhang, Muhua Zhu, and Ming Zhou. 2006. Exploring distributional similarity based models for query spelling correction.
In Proceedings of COLING/ACL 2006, pages 1025–1032, Sydney, Australia.
Jianhua Lin. 1991. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151.
Dekang Lin. 1998. Automatic retrieval and clustering of similar words. In Proceedings of the 36th Annual Meeting of the ACL and 17th International Conference on Computational Linguistics (COLING/ACL-98), pages 768–774, Montreal, Quebec, Canada.
Fei Liu, Fuliang Weng, Bingqing Wang, and Yang Liu. 2011a. Insertion, deletion, or substitution? Normalizing text messages without pre-categorization nor supervision. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), pages 71–76, Portland, Oregon, USA.
Xiaohua Liu, Shaodian Zhang, Furu Wei, and Ming Zhou. 2011b. Recognizing named entities in tweets. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), pages 359–367, Portland, Oregon, USA.
Fei Liu, Fuliang Weng, and Xiao Jiang. 2012. A broad-coverage normalization system for social media language. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), Jeju, Republic of Korea.
Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. 2002. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444.
Marco Lui and Timothy Baldwin. 2011. Cross-domain feature selection for language identification. In Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP 2011), pages 553–561, Chiang Mai, Thailand.
Brendan O’Connor, Michel Krieger, and David Ahn. 2010. TweetMotif: Exploratory search and topic summarization for Twitter. In Proceedings of the 4th International Conference on Weblogs and Social Media (ICWSM 2010), pages 384–385, Washington, USA.
Lawrence Philips. 2000.
The double metaphone search algorithm. C/C++ Users Journal, 18:38–43.
Lawrence R. Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286.
Alan Ritter, Colin Cherry, and Bill Dolan. 2010. Unsupervised modeling of Twitter conversations. In Proceedings of Human Language Technologies: The 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2010), pages 172–180, Los Angeles, USA.
Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named entity recognition in tweets: An experimental study. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP 2011), pages 1524–1534, Edinburgh, Scotland, UK.
Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. 2010. Earthquake shakes Twitter users: real-time event detection by social sensors. In Proceedings of the 19th International Conference on the World Wide Web (WWW 2010), pages 851–860, Raleigh, North Carolina, USA.
Crispin Thurlow. 2003. Generation txt? The sociolinguistics of young people’s text-messaging. Discourse Analysis Online, 1(1).
Kristina Toutanova and Robert C. Moore. 2002. Pronunciation modeling for improved spelling correction. In Proceedings of the 40th Annual Meeting of the ACL and 3rd Annual Meeting of the NAACL (ACL-02), pages 144–151, Philadelphia, USA.
Official Blog Twitter. 2011. 200 million tweets per day. Retrieved on August 17th, 2011.
Jianshu Weng and Bu-Sung Lee. 2011. Event detection in Twitter. In Proceedings of the 5th International Conference on Weblogs and Social Media (ICWSM 2011), Barcelona, Spain.
Zhenzhen Xue, Dawei Yin, and Brian D. Davison. 2011. Normalizing microtext. In Proceedings of the AAAI-11 Workshop on Analyzing Microtext, pages 74–79, San Francisco, USA.
Alexander Yeh. 2000. More accurate tests for the statistical significance of result differences.
In Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000), pages 947–953, Saarbrücken, Germany.
2 0.43145013 63 emnlp-2012-Identifying Event-related Bursts via Social Media Activities
Author: Xin Zhao ; Baihan Shu ; Jing Jiang ; Yang Song ; Hongfei Yan ; Xiaoming Li
Abstract: Activities on social media increase at a dramatic rate. When an external event happens, there is a surge in the degree of activities related to the event. These activities may be temporally correlated with one another, but they may also capture different aspects of an event and therefore exhibit different bursty patterns. In this paper, we propose to identify event-related bursts via social media activities. We study how to correlate multiple types of activities to derive a global bursty pattern. To model smoothness of one state sequence, we propose a novel function which can capture the state context. The experiments on a large Twitter dataset show our methods are very effective.
3 0.41838607 26 emnlp-2012-Building a Lightweight Semantic Model for Unsupervised Information Extraction on Short Listings
Author: Doo Soon Kim ; Kunal Verma ; Peter Yeh
Abstract: Short listings such as classified ads or product listings abound on the web. If a computer can reliably extract information from them, it will greatly benefit a variety of applications. Short listings are, however, challenging to process due to their informal styles. In this paper, we present an unsupervised information extraction system for short listings. Given a corpus of listings, the system builds a semantic model that represents typical objects and their attributes in the domain of the corpus, and then uses the model to extract information. Two key features in the system are a semantic parser that extracts objects and their attributes and a listing-focused clustering module that helps group together extracted tokens of the same type. Our evaluation shows that the semantic model learned by these two modules is effective across multiple domains.
4 0.36973929 118 emnlp-2012-Source Language Adaptation for Resource-Poor Machine Translation
Author: Pidong Wang ; Preslav Nakov ; Hwee Tou Ng
Abstract: We propose a novel, language-independent approach for improving machine translation from a resource-poor language to X by adapting a large bi-text for a related resource-rich language and X (the same target language). We assume a small bi-text for the resource-poor language to X pair, which we use to learn word-level and phrase-level paraphrases and cross-lingual morphological variants between the resource-rich and the resource-poor language; we then adapt the former to get closer to the latter. Our experiments for Indonesian/Malay–English translation show that using the large adapted resource-rich bi-text yields 6.7 BLEU points of improvement over the unadapted one and 2.6 BLEU points over the original small bi-text. Moreover, combining the small bi-text with the adapted bi-text outperforms the corresponding combinations with the unadapted bi-text by 1.5–3 BLEU points. We also demonstrate applicability to other languages and domains.
5 0.3330836 96 emnlp-2012-Name Phylogeny: A Generative Model of String Variation
Author: Nicholas Andrews ; Jason Eisner ; Mark Dredze
Abstract: Many linguistic and textual processes involve transduction of strings. We show how to learn a stochastic transducer from an unorganized collection of strings (rather than string pairs). The role of the transducer is to organize the collection. Our generative model explains similarities among the strings by supposing that some strings in the collection were not generated ab initio, but were instead derived by transduction from other, “similar” strings in the collection. Our variational EM learning algorithm alternately reestimates this phylogeny and the transducer parameters. The final learned transducer can quickly link any test name into the final phylogeny, thereby locating variants of the test name. We find that our method can effectively find name variants in a corpus of web strings used to refer to persons in Wikipedia, improving over standard untrained distances such as Jaro-Winkler and Levenshtein distance.
6 0.32173488 121 emnlp-2012-Supervised Text-based Geolocation Using Language Models on an Adaptive Grid
7 0.2634005 108 emnlp-2012-Probabilistic Finite State Machines for Regression-based MT Evaluation
8 0.2488775 138 emnlp-2012-Wiki-ly Supervised Part-of-Speech Tagging
9 0.23862717 38 emnlp-2012-Employing Compositional Semantics and Discourse Consistency in Chinese Event Extraction
10 0.23335429 129 emnlp-2012-Type-Supervised Hidden Markov Models for Part-of-Speech Tagging with Incomplete Tag Dictionaries
11 0.19798104 88 emnlp-2012-Minimal Dependency Length in Realization Ranking
12 0.19545469 25 emnlp-2012-Bilingual Lexicon Extraction from Comparable Corpora Using Label Propagation
13 0.19124214 100 emnlp-2012-Open Language Learning for Information Extraction
14 0.18176061 18 emnlp-2012-An Empirical Investigation of Statistical Significance in NLP
15 0.16758828 120 emnlp-2012-Streaming Analysis of Discourse Participants
16 0.16741571 111 emnlp-2012-Regularized Interlingual Projections: Evaluation on Multilingual Transliteration
17 0.15326603 79 emnlp-2012-Learning Syntactic Categories Using Paradigmatic Representations of Word Context
18 0.1472472 47 emnlp-2012-Explore Person Specific Evidence in Web Person Name Disambiguation
19 0.14530918 34 emnlp-2012-Do Neighbours Help? An Exploration of Graph-based Algorithms for Cross-domain Sentiment Classification
20 0.14526087 119 emnlp-2012-Spectral Dependency Parsing with Latent Variables
topicId topicWeight
[(2, 0.017), (16, 0.022), (20, 0.304), (25, 0.012), (34, 0.062), (41, 0.022), (45, 0.03), (60, 0.092), (63, 0.041), (64, 0.031), (65, 0.022), (68, 0.029), (70, 0.028), (73, 0.014), (74, 0.033), (76, 0.051), (80, 0.016), (86, 0.034), (94, 0.01), (95, 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.70141077 22 emnlp-2012-Automatically Constructing a Normalisation Dictionary for Microblogs
Author: Bo Han ; Paul Cook ; Timothy Baldwin
Abstract: Microblog normalisation methods often utilise complex models and struggle to differentiate between correctly-spelled unknown words and lexical variants of known words. In this paper, we propose a method for constructing a dictionary of lexical variants of known words that facilitates lexical normalisation via simple string substitution (e.g. tomorrow for tmrw). We use context information to generate possible variant and normalisation pairs and then rank these by string similarity. Highly-ranked pairs are selected to populate the dictionary. We show that a dictionary-based approach achieves state-of-the-art performance for both F-score and word error rate on a standard dataset. Compared with other methods, this approach offers a fast, lightweight and easy-to-use solution, and is thus suitable for high-volume microblog pre-processing.
1 Lexical Normalisation
A staggering number of short text “microblog” messages are produced every day through social media such as Twitter (Twitter, 2011). The immense volume of real-time, user-generated microblogs that flows through sites has been shown to have utility in applications such as disaster detection (Sakaki et al., 2010), sentiment analysis (Jiang et al., 2011; González-Ibáñez et al., 2011), and event discovery (Weng and Lee, 2011; Benson et al., 2011). However, due to the spontaneous nature of the posts, microblogs are notoriously noisy, containing many non-standard forms (e.g., tmrw “tomorrow” and 2day “today”) which degrade the performance of natural language processing (NLP) tools (Ritter et al., 2010; Han and Baldwin, 2011). To reduce this effect, attempts have been made to adapt NLP tools to microblog data (Gimpel et al., 2011; Foster et al., 2011; Liu et al., 2011b; Ritter et al., 2011). An alternative approach is to pre-normalise non-standard lexical variants to their standard orthography (Liu et al., 2011a; Han and Baldwin, 2011; Xue et al., 2011; Gouws et al., 2011). For example, se u 2morw!!!
would be normalised to see you tomorrow! The normalisation approach is especially attractive as a preprocessing step for applications which rely on keyword match or word frequency statistics. For example, earthqu, eathquake, and earthquakeee, all attested in a Twitter corpus, have the standard form earthquake; by normalising these types to their standard form, better coverage can be achieved for keyword-based methods, and better word frequency estimates can be obtained.
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 421–432, Jeju Island, Korea, 12–14 July 2012. © 2012 Association for Computational Linguistics
In this paper, we focus on the task of lexical normalisation of English Twitter messages, in which out-of-vocabulary (OOV) tokens are normalised to their in-vocabulary (IV) standard form, i.e., a standard form that is in a dictionary. Following other recent work on lexical normalisation (Liu et al., 2011a; Han and Baldwin, 2011; Gouws et al., 2011; Liu et al., 2012), we specifically focus on one-to-one normalisation in which one OOV token is normalised to one IV word.
Naturally, not all OOV words in microblogs are lexical variants of IV words: named entities, for example, are prevalent in microblogs, but not all named entities are included in our dictionary. One challenge for lexical normalisation is therefore to distinguish those OOV tokens that require normalisation from those that are well-formed. Recent unsupervised approaches have not attempted to distinguish such tokens from other types of OOV tokens (Cook and Stevenson, 2009; Liu et al., 2011a), limiting their applicability to real-world normalisation tasks. Other approaches (Han and Baldwin, 2011; Gouws et al., 2011) have followed a cascaded approach in which lexical variants are first identified, and then normalised.
However, such two-step approaches suffer from poor lexical variant identification performance, which is propagated to the normalisation step. Motivated by the observation that most lexical variants have an unambiguous standard form (especially for longer tokens), and that a lexical variant and its standard form typically occur in similar contexts, in this paper we propose methods for automatically constructing a lexical normalisation dictionary, i.e., a dictionary whose entries consist of (lexical variant, standard form) pairs, that enables type-based normalisation.
Despite the simplicity of this dictionary-based normalisation method, we show it to outperform previously-proposed approaches. This very fast, lightweight solution is suitable for real-time processing of the large volume of streaming microblog data available from Twitter, and offers a simple solution to the lexical variant detection problem that hinders other normalisation methods. Furthermore, this dictionary-based method can be easily integrated with other more-complex normalisation approaches (Liu et al., 2011a; Han and Baldwin, 2011; Gouws et al., 2011) to produce hybrid systems.
After discussing related work in Section 2, we present an overview of our dictionary-based approach to normalisation in Section 3. In Sections 4 and 5 we experimentally select the optimised context similarity parameters and string similarity re-ranking method. We present experimental results on the unseen test data in Section 6, and offer some concluding remarks in Section 7.
2 Related Work
Given a token t, lexical normalisation is the task of finding $\arg\max P(s \mid t) \propto \arg\max P(t \mid s)\,P(s)$, where s is the standard form, i.e., an IV word. Standardly in lexical normalisation, t is assumed to be an OOV token, relative to a fixed dictionary.
In practice, not all OOV tokens should be normalised; i.e., only lexical variants (e.g., tmrw “tomorrow”) should be normalised and tokens that are OOV but otherwise not lexical variants (e.g., iPad “iPad”) should be unchanged. Most work in this area focuses only on the normalisation task itself, oftentimes assuming that the task of lexical variant detection has already been completed.
Various approaches have been proposed to estimate the error model, P(t|s). For example, in work on spell-checking, Brill and Moore (2000) improve on a standard edit-distance approach by considering multi-character edit operations; Toutanova and Moore (2002) build on this by incorporating phonological information. Li et al. (2006) utilise distributional similarity (Lin, 1998) to correct misspelled search queries. In text message normalisation, Choudhury et al. (2007) model the letter transformations and emissions using a hidden Markov model (Rabiner, 1989). Cook and Stevenson (2009) and Xue et al. (2011) propose multiple simple error models, each of which captures a particular way in which lexical variants are formed, such as phonetic spelling (e.g., epik “epic”) or clipping (e.g., walkin “walking”). Nevertheless, optimally weighting the various error models in these approaches is challenging.
Without pre-categorising lexical variants into different types, Liu et al. (2011a) collect Google search snippets from carefully-designed queries from which they then extract noisy lexical variant–standard form pairs. These pairs are used to train a conditional random field (Lafferty et al., 2001) to estimate P(t|s) at the character level. One shortcoming of querying a search engine to obtain training pairs is that it tends to be costly in terms of time and bandwidth. Here we exploit microblog data directly to derive (lexical variant, standard form) pairs, instead of relying on external resources. In more recent work, Liu et al.
(2012) endeavour to improve the accuracy of top-n normalisation candidates by integrating human cognitive inference, character-level transformations and spell checking in their normalisation model. The encouraging results shift the focus to re-ranking and promoting the correct normalisation to the top-1 position. However, like much previous work on lexical normalisation, this work assumes perfect lexical variant detection.
Aw et al. (2006) and Kaufmann and Kalita (2010) consider normalisation as a machine translation task from lexical variants to standard forms using off-the-shelf tools. These methods do not assume that lexical variants have been pre-identified; however, these methods do rely on large quantities of labelled training data, which is not available for microblogs.
Recently, Han and Baldwin (2011) and Gouws et al. (2011) propose two-step unsupervised approaches to normalisation, in which lexical variants are first identified, and then normalised. They approach lexical variant detection by using a context fitness classifier (Han and Baldwin, 2011) or through dictionary lookup (Gouws et al., 2011). However, the lexical variant detection of both methods is rather unreliable, indicating the challenge of this aspect of normalisation. Both of these approaches incorporate a relatively small normalisation dictionary to capture frequent lexical variants with high precision. In particular, Gouws et al. (2011) produce a small normalisation lexicon based on distributional similarity and string similarity (Lodhi et al., 2002). Our method adopts a similar strategy using distributional/string similarity, but instead of constructing a small lexicon for preprocessing, we build a much wider-coverage normalisation dictionary and opt for a fully lexicon-based end-to-end normalisation approach. In contrast to the normalisation dictionaries of Han and Baldwin (2011) and Gouws et al.
(2011), which focus on very frequent lexical variants, we focus on moderate-frequency lexical variants of a minimum character length, which tend to have unambiguous standard forms; our intention is to produce normalisation lexicons that are complementary to those currently available. Furthermore, we investigate the impact of a variety of contextual and string similarity measures on the quality of the resulting lexicons. In summary, our dictionary-based normalisation approach is a lightweight end-to-end method which performs both lexical variant detection and normalisation, and thus is suitable for practical online preprocessing, despite its simplicity.
3 A Lexical Normalisation Dictionary
Before discussing our method for creating a normalisation dictionary, we first discuss the feasibility of such an approach.
3.1 Feasibility
Dictionary lookup approaches to normalisation have been shown to have high precision but low recall (Han and Baldwin, 2011; Gouws et al., 2011). Frequent (lexical variant, standard form) pairs such as (u, you) are typically included in the dictionaries used by such methods, while less-frequent items such as (g0tta, gotta) are generally omitted. Because of the degree of lexical creativity and large number of non-standard forms observed on Twitter, a wide-coverage normalisation dictionary would be expensive to construct manually. Based on the assumption that lexical variants occur in similar contexts to their standard forms, however, it should be possible to automatically construct a normalisation dictionary with wider coverage than is currently available.
Dictionary lookup is a type-based approach to normalisation, i.e., every token instance of a given type will always be normalised in the same way. However, lexical variants can be ambiguous, e.g., y corresponds to “you” in yeah, y r right! LOL but “why” in AM CONFUSED!!! y you did that?
Nevertheless, the relative occurrence of ambiguous lexical variants is small (Liu et al., 2011a), and it has been observed that while shorter variants such as y are often ambiguous, longer variants tend to be unambiguous. For example, bthday and 4eva are unlikely to have standard forms other than “birthday” and “forever”, respectively. Therefore, the normalisation lexicons we produce will only contain entries for OOVs with character length greater than a specified threshold, which are likely to have an unambiguous standard form.
3.2 Overview of approach
Our method for constructing a normalisation dictionary is as follows:
Input: Tokenised English tweets
1. Extract (OOV, IV) pairs based on distributional similarity.
2. Re-rank the extracted pairs by string similarity.
Output: A list of (OOV, IV) pairs ordered by string similarity; select the top-n pairs for inclusion in the normalisation lexicon.
In Step 1, we leverage large volumes of Twitter data to identify the most distributionally-similar IV type for each OOV type. The result of this process is a set of (OOV, IV) pairs, ranked by distributional similarity. The extracted pairs will include (lexical variant, standard form) pairs, such as (tmrw, tomorrow), but will also contain false positives such as (Tusday, Sunday), where Tusday is a lexical variant but its standard form is not “Sunday”, and (Youtube, web), where Youtube is an OOV named entity, not a lexical variant. Nevertheless, lexical variants are typically formed from their standard forms through regular processes (Thurlow, 2003), e.g., the omission of characters, and from this perspective Sunday and web are not plausible standard forms for Tusday and Youtube, respectively. In Step 2, we therefore capture this intuition to re-rank the extracted pairs by string similarity. The top-n items in this re-ranked list then form the normalisation lexicon, which is based only on development data.
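The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the candidate pairs are hypothetical stand-ins for the output of Step 1, and `difflib.SequenceMatcher` is a swapped-in string similarity, not one of the measures evaluated in the paper. The function names `string_similarity`, `build_lexicon` and `normalise` are invented for this example.

```python
from difflib import SequenceMatcher


def string_similarity(oov: str, iv: str) -> float:
    """Stand-in similarity in [0, 1] (assumption: not one of the paper's measures)."""
    return SequenceMatcher(None, oov.lower(), iv.lower()).ratio()


def build_lexicon(pairs, top_n):
    """Step 2: re-rank (OOV, IV) pairs by string similarity and keep the top_n."""
    ranked = sorted(pairs, key=lambda pair: string_similarity(*pair), reverse=True)
    return dict(ranked[:top_n])


def normalise(tokens, lexicon):
    """Type-based normalisation: each token is replaced by simple lookup."""
    return [lexicon.get(token, token) for token in tokens]


# Hypothetical output of Step 1 (contextually similar pairs, including noise).
candidate_pairs = [
    ("tmrw", "tomorrow"),
    ("bananaz", "bananas"),
    ("Youtube", "web"),
    ("paninis", "beans"),
]

lexicon = build_lexicon(candidate_pairs, top_n=2)
print(lexicon)
print(normalise(["see", "u", "tmrw", "Youtube"], lexicon))
```

With these toy pairs the two noisy candidates fall below the cut-off, so Youtube is left untouched while tmrw is replaced. The cut-off `top_n` plays the role of the lexicon size N discussed later.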
Although computationally-expensive to build, this dictionary can be created offline. Once built, it then offers a very fast approach to normalisation.
We can only reliably compute distributional similarity for types that are moderately frequent in a corpus. Nevertheless, many lexical variants are sufficiently frequent to be able to compute distributional similarity, and can potentially make their way into our normalisation lexicon. This approach is not suitable for normalising low-frequency lexical variants, nor is it suitable for shorter lexical variant types which, as discussed in Section 3.1, are more likely to have an ambiguous standard form. Nevertheless, previously-proposed normalisation methods that can handle such phenomena also rely in part on a normalisation lexicon. The normalisation lexicons we create can therefore be easily integrated with previous approaches to form hybrid normalisation systems.
4 Contextually-similar Pair Generation
Our objective is to extract contextually-similar (OOV, IV) pairs from a large-scale collection of microblog data. Fundamentally, the surrounding words define the primary context, but there are different ways of representing context and different similarity measures we can use, which may influence the quality of generated normalisation pairs.
In representing the context, we experimentally explore the following factors: (1) context window size (from 1 to 3 tokens on both sides); (2) n-gram order of the context tokens (unigram, bigram, trigram); (3) whether context words are indexed for relative position or not; and (4) whether we use all context tokens, or only IV words. Because high-accuracy linguistic processing tools for Twitter are still under exploration (Liu et al., 2011b; Gimpel et al., 2011; Ritter et al., 2011; Foster et al., 2011), we do not consider richer representations of context, for example, incorporating information about part-of-speech tags or syntax.
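To make the first three context factors concrete, the sketch below collects context n-grams around one occurrence of a target token. The paper does not spell out how n-grams are drawn from the window, so the choices here are assumptions: n-grams are contiguous, lie entirely on one side of the target and inside the window, and the positional index is the offset of the n-gram's first token relative to the target. The names `context_features` and `type_context` are hypothetical.

```python
from collections import Counter


def context_features(tokens, index, window=2, ngram=2, positional=True):
    """Count context n-grams around tokens[index] (assumed representation)."""
    features = Counter()
    for start in range(index - window, index + window + 1):
        end = start + ngram  # exclusive end of the candidate n-gram
        if start < 0 or end > len(tokens):
            continue  # n-gram would run off the tweet
        if start <= index < end:
            continue  # n-gram would contain the target token itself
        if end - 1 > index + window:
            continue  # n-gram would extend beyond the right edge of the window
        gram = " ".join(tokens[start:end])
        key = (start - index, gram) if positional else gram
        features[key] += 1
    return features


def type_context(tweets, word, **options):
    """Aggregate context features over every occurrence of a word type."""
    total = Counter()
    for tokens in tweets:
        for i, token in enumerate(tokens):
            if token == word:
                total.update(context_features(tokens, i, **options))
    return total


tweet = "see you tmrw at the gym".split()
print(context_features(tweet, 2))                      # positional bigrams, window ±2
print(context_features(tweet, 2, ngram=1))             # positional unigrams
print(context_features(tweet, 2, positional=False))    # bag of bigrams
```

Restricting the context to IV words (factor 4) would amount to filtering `tokens` against a dictionary before extracting n-grams; that step is omitted here.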
We also experiment with a number of simple but widely-used geometric and information theoretic distance/similarity measures. In particular, we use Kullback–Leibler (KL) divergence (Kullback and Leibler, 1951), Jensen–Shannon (JS) divergence (Lin, 1991), Euclidean distance and Cosine distance.
We use a corpus of 10 million English tweets for parameter tuning, and a larger corpus of tweets in the final candidate ranking. All tweets were collected from September 2010 to January 2011 via the Twitter API [1]. From the raw data we extract English tweets using a language identification tool (Lui and Baldwin, 2011), and then apply a simplified Twitter tokeniser (adapted from O’Connor et al. (2010)). We use the Aspell dictionary (v6.06) [2] to determine whether a word is IV, and only include in our normalisation dictionary OOV tokens with at least 64 occurrences in the corpus and character length ≥ 4, both of which were determined through empirical observation. For each OOV word type in the corpus, we select the most similar IV type to form (OOV, IV) pairs. To further narrow the search space, we only consider IV words which are morphophonemically similar to the OOV type, following settings in Han and Baldwin (2011) [3].
[1] https://dev.twitter.com/docs/streaming-api/methods
[2] http://aspell.net/
[3] We only consider IV words within an edit distance of 2 or a phonemic edit distance of 1 from the OOV type, and we further only consider the top 30% most-frequent of these IV words.
In order to evaluate the generated pairs, we randomly selected 1000 OOV words from the 10 million tweet corpus. We set up an annotation task on Amazon Mechanical Turk [4], presenting five independent annotators with each word type (with no context) and asking for corrections where appropriate. For instance, given tmrw, the annotators would likely identify it as a non-standard variant of “tomorrow”. For correct OOV words like iPad, on the other hand, we would expect them to leave the word unchanged. If 3 or more of the 5 annotators make the same suggestion (in the form of either a canonical spelling or leaving the word unchanged), we include this in our gold standard for evaluation. In total, this resulted in 351 lexical variants and 282 correct OOV words, accounting for 63.3% of the 1000 OOV words. These 633 OOV words were used as (OOV, IV) pairs for parameter tuning. The remainder of the 1000 OOV words were ignored on the grounds that there was not sufficient consensus amongst the annotators [5].
[4] https://www.mturk.com/mturk/welcome
[5] Note that the objective of this annotation task is to identify lexical variants that have agreed-upon standard forms irrespective of context, as a special case of the more general task of lexical normalisation (where context may or may not play a significant role in the determination of the normalisation).
Contextually-similar pair generation aims to include as many correct normalisation pairs as possible. We evaluate the quality of the normalisation pairs using “Cumulative Gain” (CG):
$$\mathrm{CG} = \sum_{i=1}^{N'} \mathrm{rel}'_i$$
Suppose there are N′ correct generated pairs (oov_i, iv_i), each of which is weighted by rel′_i, the frequency of oov_i, to indicate its relative importance; for example, (thinkin, thinking) has a higher weight than (g0tta, gotta) because thinkin is more frequent than g0tta in our corpus. In this evaluation we do not consider the position of normalisation pairs, nor do we penalise incorrect pairs. Instead, we push distinguishing between correct and incorrect pairs into the downstream re-ranking step in which we incorporate string similarity information.
Given the development data and CG, we run an exhaustive search of parameter combinations over our development corpus. The five best parameter combinations are shown in Table 1. We notice the CG is almost identical for the top combinations.
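As a small illustration of Cumulative Gain, the sketch below sums the corpus frequency of every generated pair that matches the gold standard and ignores incorrect pairs, as described above. The function name, the gold mapping and the frequencies are all hypothetical values chosen for the example.

```python
def cumulative_gain(pairs, gold, oov_freq):
    """Sum the OOV frequency of each generated pair that matches the gold standard.

    Incorrect pairs contribute nothing; rank position is ignored.
    """
    return sum(oov_freq[oov] for oov, iv in pairs if gold.get(oov) == iv)


# Hypothetical generated pairs, gold standard and corpus frequencies.
generated = [("thinkin", "thinking"), ("g0tta", "gotta"), ("Tusday", "Sunday")]
gold = {"thinkin": "thinking", "g0tta": "gotta", "Tusday": "Tuesday"}
oov_freq = {"thinkin": 500, "g0tta": 70, "Tusday": 90}

print(cumulative_gain(generated, gold, oov_freq))
```

Here the incorrect pair (Tusday, Sunday) is simply skipped, so only thinkin and g0tta contribute their frequencies.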
As a context window size of 3 incurs a heavy processing and memory overhead over a size of 2, we use the 3rd-best parameter combination for subsequent experiments, namely: context window of ±2 tokens, token bigrams, positional index, and KL divergence as our distance measure.
To better understand the sensitivity of the method to each parameter, we perform a post-hoc parameter analysis relative to a default setting (as underlined in Table 2), altering one parameter at a time. The results in Table 2 show that bigrams outperform other n-gram orders by a large margin (note that the evaluation is based on a log scale), and information-theoretic measures are superior to the geometric measures. Furthermore, it also indicates that using the positional indexing better captures context. However, there is little to distinguish context modelling with just IV words or all tokens. Similarly, the context window size has relatively little impact on the overall performance, supporting our earlier observation from Table 1.
5 Pair Re-ranking by String Similarity
Once the contextually-similar (OOV, IV) pairs are generated using the selected parameters in Section 4, we further re-rank this set of pairs in an attempt to boost morphophonemically-similar pairs like (bananaz, bananas), and penalise noisy pairs like (paninis, beans).
Instead of using the small 10 million tweet corpus, from this step onwards, we use a larger corpus of 80 million English tweets (collected over the same period as the development corpus) to develop a larger-scale normalisation dictionary. This is because once pairs are generated, re-ranking based on string comparison is much faster. We only include in the dictionary OOV words with a token frequency > 15 to include more OOV types than in Section 4, and again apply a minimum length cutoff of 4 characters.
Rank  Window size  n-gram  Positional index?  Lex. choice  Sim/distance measure  log(CG)
1     ±3           2       Yes                All          KL divergence         19.571
2     ±3           2       No                 All          KL divergence         19.562
3     ±2           2       Yes                All          KL divergence         19.562
4     ±3           2       Yes                IVs          KL divergence         19.561
5     ±2           2       Yes                IVs          JS divergence         19.554
Table 1: The five best parameter combinations in the exhaustive search of parameter combinations
Parameter                    Setting        log(CG)
Window size                  ±1             19.325
Window size                  ±2             19.327
Window size                  ±3             19.328
n-gram                       1              19.328
n-gram                       2              19.571
n-gram                       3              19.324
Positional index?            Yes            19.328
Positional index?            No             19.263
Lexical choice               IVs            19.335
Lexical choice               All            19.328
Similarity/distance measure  KL divergence  19.328
Similarity/distance measure  Euclidean      19.227
Similarity/distance measure  JS divergence  19.311
Similarity/distance measure  Cosine         19.170
Table 2: Parameter sensitivity analysis measured as log(CG) for correctly-generated pairs. We tune one parameter at a time, using the default (underlined) setting for other parameters; the non-exhaustive best-performing setting in each case is indicated in bold.
To measure how well our re-ranking method promotes correct pairs and demotes incorrect pairs (including both OOV words that should not be normalised, e.g. (Youtube, web), and incorrect normalisations for lexical variants, e.g. (bcuz, cause)), we modify our evaluation metric from Section 4 to evaluate the ranking at different points, using Discounted Cumulative Gain (DCG@N: Jarvelin and Kekalainen (2002)):
$$\mathrm{DCG@}N = \mathrm{rel}_1 + \sum_{i=2}^{N} \frac{\mathrm{rel}_i}{\log_2(i)}$$
where rel_i again represents the frequency of the OOV, but it can be a gain (a positive number) or a loss (a negative number), depending on whether the i-th pair is correct or incorrect. Because we also expect correct pairs to be ranked higher than incorrect pairs, DCG@N takes both factors into account.
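The signed DCG@N just described can be sketched as follows. The ranked list, gold standard and frequencies are hypothetical; the only assumption beyond the text is that an OOV that should stay unchanged is recorded in the gold mapping as mapping to itself.

```python
import math


def dcg_at_n(ranked_pairs, gold, oov_freq, n):
    """DCG@N where each pair's relevance is +freq if correct and -freq if not."""
    total = 0.0
    for i, (oov, iv) in enumerate(ranked_pairs[:n], start=1):
        rel = oov_freq[oov] if gold.get(oov) == iv else -oov_freq[oov]
        # The first position is undiscounted; later ones are divided by log2(i).
        total += rel if i == 1 else rel / math.log2(i)
    return total


# Hypothetical re-ranked pairs and gold standard.
ranked = [("goin", "going"), ("Youtube", "web"), ("nite", "night")]
gold = {"goin": "going", "nite": "night", "Youtube": "Youtube"}
oov_freq = {"goin": 100, "Youtube": 40, "nite": 30}

for n in (1, 2, 3):
    print(n, round(dcg_at_n(ranked, gold, oov_freq, n), 3))
```

In this toy ranking the incorrect pair (Youtube, web) at position 2 subtracts its full frequency because log2(2) = 1, which shows how a frequent named entity ranked highly can pull the score down.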
Given the generated pairs and the evaluation metric, we first consider three baselines: no re-ranking (i.e., the final ranking is that of the contextual similarity scores), re-ranking by the frequencies of the OOVs in the Twitter corpus, and re-ranking by the IV unigram frequencies in the Google Web 1T corpus (Brants and Franz, 2006), used to get less-noisy frequency estimates. We also compared a variety of re-rankings based on a number of string similarity measures that have been previously considered in normalisation work (reviewed in Section 2). We experiment with standard edit distance (Levenshtein, 1966), edit distance over double metaphone codes (phonetic edit distance: (Philips, 2000)), longest common subsequence ratio over the consonant edit distance of the paired words (hereafter, denoted as consonant edit distance: (Contractor et al., 2010)), and a string subsequence kernel (Lodhi et al., 2002).
In Figure 1, we present the DCG@N results for each of our ranking methods at different rank cut-offs. Ranking by OOV frequency is motivated by the assumption that lexical variants are frequently used by social media users. This is confirmed by our findings that lexical pairs like (goin, going) and (nite, night) are at the top of the ranking. However, many proper nouns and named entities are also used frequently and are ranked at the top, e.g. (Facebook, speech) and (Youtube, web), mixed in with the lexical variants. In ranking by IV word frequency, we assume the lexical variants are usually derived from frequently-used IV equivalents, e.g. (abou, about). However, many less-frequent lexical variant types have high-frequency (IV) normalisations. For instance, the highest-frequency IV word the has more than 40 OOV lexical variants, such as tthe and thhe. These less-frequent types occupy the top positions, reducing the cumulative gain. Compared with these two baselines, ranking by default contextual similarity scores delivers promising results.
Contextual similarity ranking successfully places many more intuitive normalisation pairs at the top, such as (2day, today) and (wknd, weekend), but also ranks some incorrect pairs highly, such as (needa, gotta).

The string similarity-based methods perform better than our baselines in general. Through manual analysis, we found that standard edit distance ranking is fairly accurate for lexical variants with low edit distance to their standard forms, but fails to identify heavily-altered variants like (tmrw, tomorrow). Consonant edit distance is similar to standard edit distance, but places many longer words at the top of the ranking. Edit distance over double metaphone codes (phonetic edit distance) performs particularly well for lexical variants that include character repetitions, which are commonly used for emphasis on Twitter, because such repetitions do not typically alter the phonetic codes. Compared with the other methods, the string subsequence kernel delivers encouraging results. It measures common character subsequences of length n between (OOV, IV) pairs. Because it is computationally expensive to calculate similarity for larger n, we choose n=2, following Gouws et al. (2011). As N (the lexicon size cut-off) increases, the performance drops more slowly than for the other methods. Although this method fails to rank heavily-altered variants such as (4get, forget) highly, it typically works well for longer words. Given that we focus on longer OOVs (specifically those of 4 or more characters), this ultimately isn’t a great handicap.

6 Evaluation

Given the re-ranked pairs from Section 5, here we apply them to a token-level normalisation task using the normalisation dataset of Han and Baldwin (2011).

6.1 Metrics

We evaluate using the standard evaluation metrics of precision (P), recall (R) and F-score (F) as detailed below. We also consider the false alarm rate (FA) and word error rate (WER), also as shown below.
FA measures the negative effects of applying normalisation; a good approach to normalisation should not (incorrectly) normalise tokens that are already in their standard form and do not require normalisation.⁶ WER, like F-score, shows the overall benefits of normalisation, but unlike F-score, measures how many token-level edits are required for the output to be the same as the ground truth data. In general, dictionaries with a high F-score/low WER and low FA are preferable.

\begin{align*}
P &= \frac{\#\ \text{correctly normalised tokens}}{\#\ \text{normalised tokens}} \\
R &= \frac{\#\ \text{correctly normalised tokens}}{\#\ \text{tokens requiring normalisation}} \\
F &= \frac{2PR}{P + R} \\
FA &= \frac{\#\ \text{incorrectly normalised tokens}}{\#\ \text{normalised tokens}} \\
WER &= \frac{\#\ \text{token edits needed after normalisation}}{\#\ \text{all tokens}}
\end{align*}

⁶ FA + P ≤ 1 because some lexical variants might be incorrectly normalised.

6.2 Results

We select the three best re-ranking methods, and the best cut-off N for each method, based on the highest DCG@N value for a given method over the development data, as presented in Figure 1. Namely, they are string subsequence kernel (S-dict, N=40,000), double metaphone edit distance (DM-dict, N=10,000) and default contextual similarity without re-ranking (C-dict, N=10,000).⁷ We evaluate each of the learned dictionaries in Table 3. We also compare each dictionary with the performance of the manually-constructed Internet slang dictionary (HB-dict) used by Han and Baldwin (2011), the small automatically-derived dictionary of Gouws et al. (2011) (GHM-dict), and combinations of the different dictionaries. In addition, the contribution of these dictionaries in hybrid normalisation approaches is also presented, in which we first normalise OOVs using a given dictionary (combined or otherwise), and then apply the normalisation method of Gouws et al. (2011) based on consonant edit distance (GHM-norm), or the approach of Han and Baldwin (2011) based on the summation of many unsupervised approaches (HB-norm), to the remaining OOVs.
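The dictionary-first hybrid set-up described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' system: the toy dictionaries, the `in_vocab` set standing in for the IV lexicon, and the `fallback` callable standing in for GHM-norm or HB-norm are all hypothetical. Conflicts between dictionaries are resolved by a fixed priority order, which the paper gives as HB-dict > GHM-dict > S-dict.

```python
def combine_dictionaries(*dictionaries):
    """Merge dictionaries passed in descending priority order."""
    combined = {}
    for dictionary in dictionaries:
        for oov, standard in dictionary.items():
            # setdefault keeps the entry from the higher-priority dictionary.
            combined.setdefault(oov, standard)
    return combined

def normalise(tokens, dictionary, in_vocab, fallback=None):
    """Replace OOV tokens by dictionary lookup, then by an optional fallback."""
    output = []
    for token in tokens:
        key = token.lower()
        if key in in_vocab:
            output.append(token)            # IV tokens are left untouched
        elif key in dictionary:
            output.append(dictionary[key])  # direct string substitution
        elif fallback is not None:
            output.append(fallback(token))  # e.g. a GHM-norm or HB-norm stand-in
        else:
            output.append(token)
    return output

# Hypothetical toy dictionaries; the second one disagrees on "tmrw".
hb_dict = {"u": "you", "tmrw": "tomorrow"}
s_dict = {"tmrw": "tomorow", "goin": "going"}
combined = combine_dictionaries(hb_dict, s_dict)

tokens = ["c", "u", "tmrw", "goin", "home"]
print(normalise(tokens, combined, in_vocab={"home"}))

# Hypothetical fallback normaliser for OOVs the dictionary misses.
toy_fallback = lambda token: {"c": "see"}.get(token, token)
print(normalise(tokens, combined, in_vocab={"home"}, fallback=toy_fallback))
```

Without a fallback, this reduces to the pure dictionary-based approach: tokens missing from the dictionary are passed through unchanged, so no separate lexical variant detection step is needed.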
Results are shown in Table 3, and discussed below.

Figure 1: Re-ranking based on different string similarity methods (x-axis: N cut-offs).

6.2.1 Individual Dictionaries

Overall, the individual dictionaries derived by the re-ranking methods (DM-dict, S-dict) perform better than that based on contextual similarity (C-dict) in terms of precision and false alarm rate, indicating the importance of re-ranking. Even though C-dict delivers higher recall (indicating that many lexical variants are correctly normalised), this is offset by its high false alarm rate, which is particularly undesirable in normalisation. Because S-dict has better performance than DM-dict in terms of both F-score and WER, and a much lower false alarm rate than C-dict, subsequent results are presented using S-dict only.

⁷ We also experimented with combining ranks using Mean Reciprocal Rank. However, the combined rank didn’t improve performance on the development data. We plan to explore other ranking aggregation methods in future work.

Both HB-dict and GHM-dict achieve better than 90% precision with moderate recall. Compared to these methods, S-dict is not competitive in terms of either precision or recall. This result seems rather discouraging. However, considering that S-dict is an automatically-constructed dictionary targeting lexical variants of varying frequency, it is not surprising that the precision is worse than that of HB-dict (which is manually constructed) and GHM-dict (which includes entries only for more-frequent OOVs, for which distributional similarity is more accurate). Additionally, the recall of S-dict is hampered by the restriction on lexical variant token length of 4 characters.

6.2.2 Combined Dictionaries

Next we look to combining HB-dict, GHM-dict and S-dict. In combining the dictionaries, a given OOV word can be listed with different standard forms in different dictionaries.
In such cases we use the following preferences for dictionaries (motivated by our confidence in the normalisation pairs of the dictionaries) to resolve conflicts: HB-dict > GHM-dict > S-dict.

When we combine dictionaries in the second section of Table 3, we find that they contain complementary information: in each case the recall and F-score are higher for the combined dictionary than any of the individual dictionaries. The combination of HB-dict+GHM-dict produces only a small improvement in terms of F-score over HB-dict (the better-performing dictionary), suggesting that, as claimed, HB-dict and GHM-dict share many frequent normalisation pairs. HB-dict+S-dict and GHM-dict+S-dict, on the other hand, improve substantially over HB-dict and GHM-dict, respectively, indicating that S-dict contains markedly different entries to both HB-dict and GHM-dict. The best F-score and WER are obtained using the combination of all three dictionaries, HB-dict+GHM-dict+S-dict. Furthermore, the difference between the results using HB-dict+GHM-dict+S-dict and HB-dict+GHM-dict is statistically significant (p < 0.01), based on the computationally-intensive Monte Carlo method of Yeh (2000), demonstrating the contribution of S-dict.

Method                            Precision  Recall  F-Score  False Alarm  Word Error Rate
C-dict                            0.474      0.218   0.299    0.298        0.103
DM-dict                           0.727      0.106   0.185    0.145        0.102
S-dict                            0.700      0.179   0.285    0.162        0.097
HB-dict                           0.915      0.435   0.590    0.048        0.066
GHM-dict                          0.982      0.319   0.482    0.000        0.076

HB-dict+S-dict                    0.840      0.601   0.701    0.090        0.052
GHM-dict+S-dict                   0.863      0.498   0.632    0.072        0.061
HB-dict+GHM-dict                  0.920      0.465   0.618    0.045        0.063
HB-dict+GHM-dict+S-dict           0.847      0.630   0.723    0.086        0.049

GHM-dict+GHM-norm                 0.338      0.578   0.427    0.458        0.135
HB-dict+GHM-dict+S-dict+GHM-norm  0.406      0.715   0.518    0.468        0.124
HB-dict+HB-norm                   0.515      0.771   0.618    0.332        0.081
HB-dict+GHM-dict+S-dict+HB-norm   0.527      0.789   0.632    0.332        0.079

Table 3: Normalisation results using our derived dictionaries (contextual similarity (C-dict); double metaphone rendering (DM-dict); string subsequence kernel scores (S-dict)), the dictionary of Gouws et al. (2011) (GHM-dict), the Internet slang dictionary (HB-dict) from Han and Baldwin (2011), and combinations of these dictionaries. In addition, we combine the dictionaries with the normalisation method of Gouws et al. (2011) (GHM-norm) and the combined unsupervised approach of Han and Baldwin (2011) (HB-norm).

6.2.3 Hybrid Approaches

The methods of Gouws et al. (2011) (i.e. GHM-dict+GHM-norm) and Han and Baldwin (2011) (i.e. HB-dict+HB-norm) have lower precision and higher false alarm rates than the dictionary-based approaches; this is largely caused by lexical variant detection errors.⁸ Using all dictionaries in combination with these methods (HB-dict+GHM-dict+S-dict+GHM-norm and HB-dict+GHM-dict+S-dict+HB-norm) gives some improvements, but the false alarm rates remain high.

Despite the limitations of a pure dictionary-based approach to normalisation (discussed in Section 3.1), the current best practical approach to normalisation is to use a lexicon, combining hand-built and automatically-learned normalisation dictionaries.

⁸ Here we report results that do not assume perfect detection of lexical variants, unlike the original published results in each case.

Error type            OOV      Standard form (Dict.)  Standard form (Gold)
(a) plurals           playe    players                player
(b) negation          unlike   like                   dislike
(c) possessives       anyones  anyone                 anyone ’s
(d) correct OOVs      iphone   phone                  iphone
(e) test data errors  durin    during                 durin
(f) ambiguity         siging   signing                singing

Table 4: Error types in the combined dictionary (HB-dict+GHM-dict+S-dict)

6.3 Discussion and Error Analysis

We first manually analyse the errors in the combined dictionary (HB-dict+GHM-dict+S-dict) and give examples of each error type in Table 4.
The most frequent word errors are caused by slight morphological variations, including plural forms (a), negations (b), possessive cases (c), and OOVs that are correct and do not require normalisation (d). In addition, we also notice some missing annotations where lexical variants are skipped by human annotations but captured by our method (e). Ambiguity (f) definitely exists in longer OOVs; however, these cases do not appear to have a strong negative impact on the normalisation performance. An example of a remaining miscellaneous error is bday “birthday”, which is mis-normalised as day.

Length cut-off (N)  #Variants  Precision  Recall (≥ N)  Recall (all)  False Alarm
≥4                  556        0.700      0.381         0.179         0.162
≥5                  382        0.814      0.471         0.152         0.122
≥6                  254        0.804      0.484         0.104         0.131
≥7                  138        0.793      0.471         0.055         0.122

Table 5: S-dict normalisation results broken down according to OOV token length. Recall is presented both over the subset of instances of length ≥ N in the data (“Recall (≥ N)”), and over the entirety of the dataset (“Recall (all)”); “#Variants” is the number of token instances of the indicated length in the test dataset.

To further study the influence of OOV word length relative to the normalisation performance, we conduct a fine-grained analysis of the performance of the derived dictionary (S-dict) in Table 5, broken down across different OOV word lengths. The results generally support our hypothesis that our method works better for longer OOV words. The derived dictionary is much more reliable for longer tokens (length 5, 6, and 7 characters) in terms of precision and false alarm. Although the recall is relatively modest, in the future we intend to improve recall by mining more normalisation pairs from larger collections of microblog data.
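A breakdown in the style of Table 5 can be sketched by restricting the dictionary to OOVs of at least N characters and recomputing the Section 6.1 metrics. The code below is a hedged illustration, not the authors' evaluation script: the function names, the toy data, and the toy dictionary are hypothetical, and the false alarm count follows footnote 6 in reading "incorrectly normalised" as tokens that were already in standard form but were changed anyway, which is an assumption about the intended definition.

```python
def ratio(numerator, denominator):
    """Division that returns 0.0 instead of failing on an empty denominator."""
    return numerator / denominator if denominator else 0.0

def evaluate(data, dictionary, min_len=1):
    """Score dictionary-lookup normalisation on (token, gold_form) pairs.

    Only dictionary entries whose OOV key has at least `min_len` characters
    are applied, mirroring the length cut-off N in Table 5.
    """
    active = {oov: std for oov, std in dictionary.items() if len(oov) >= min_len}
    normalised = correct = false_alarms = edits = 0
    need_all = need_long = 0
    for token, gold in data:
        output = active.get(token, token)
        needs_normalisation = gold != token
        need_all += needs_normalisation
        need_long += needs_normalisation and len(token) >= min_len
        if output != token:                  # the system changed this token
            normalised += 1
            if output == gold:
                correct += 1
            elif not needs_normalisation:    # token was already standard
                false_alarms += 1
        edits += output != gold              # token still differs from gold
    precision = ratio(correct, normalised)
    recall_all = ratio(correct, need_all)
    return {
        "precision": precision,
        "recall_long": ratio(correct, need_long),
        "recall_all": recall_all,
        "f_score": ratio(2 * precision * recall_all, precision + recall_all),
        "false_alarm": ratio(false_alarms, normalised),
        "wer": ratio(edits, len(data)),
    }

# Hypothetical (token, gold) data and dictionary, loosely based on examples in the text.
data = [("goin", "going"), ("tmrw", "tomorrow"), ("tomorow", "tomorrow"),
        ("ipod", "ipod"), ("bday", "birthday"), ("home", "home")]
toy_dict = {"goin": "going", "tmrw": "tomorrow", "tomorow": "tomorrow",
            "ipod": "pod", "bday": "day"}

all_lengths = evaluate(data, toy_dict, min_len=1)
long_only = evaluate(data, toy_dict, min_len=5)
print(all_lengths)
print(long_only)
```

On this toy data the longer cut-off raises precision and removes the false alarm while lowering recall over the whole dataset, the same trade-off that Table 5 reports for S-dict.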
7 Conclusions and Future Work

In this paper, we describe a method for automatically constructing a normalisation dictionary that supports normalisation of microblog text through direct substitution of lexical variants with their standard forms. After investigating the impact of different distributional and string similarity methods on the quality of the dictionary, we present experimental results on a standard dataset showing that our proposed methods acquire high quality (lexical variant, standard form) pairs, with reasonable coverage, and achieve state-of-the-art end-to-end lexical normalisation performance on a real-world token-level task. Furthermore, this dictionary-lookup method combines the detection and normalisation of lexical variants into a simple, lightweight solution which is suitable for processing of high-volume microblog feeds.

In the future, we intend to improve our dictionary by leveraging the constantly-growing volume of microblog data, and considering alternative ways to combine distributional and string similarity. In addition to direct evaluation, we also want to explore the benefits of applying normalisation for downstream social media text processing applications, e.g. event detection.

Acknowledgements

We would like to thank the three anonymous reviewers for their insightful comments, and Stephan Gouws for kindly sharing his data and discussing his work. NICTA is funded by the Australian government as represented by the Department of Broadband, Communication and Digital Economy, and the Australian Research Council through the ICT Centre of Excellence programme.

References

AiTi Aw, Min Zhang, Juan Xiao, and Jian Su. 2006. A phrase-based statistical model for SMS text normalization. In Proceedings of COLING/ACL 2006, pages 33–40, Sydney, Australia.
Edward Benson, Aria Haghighi, and Regina Barzilay. 2011. Event discovery in social media feeds. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), pages 389–398, Portland, Oregon, USA.
Thorsten Brants and Alex Franz. 2006. Web 1T 5-gram Version 1.
Eric Brill and Robert C. Moore. 2000. An improved error model for noisy channel spelling correction. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 286–293, Hong Kong.
Monojit Choudhury, Rahul Saraf, Vijit Jain, Animesh Mukherjee, Sudeshna Sarkar, and Anupam Basu. 2007. Investigation and modeling of the structure of texting language. International Journal on Document Analysis and Recognition, 10:157–174.
Danish Contractor, Tanveer A. Faruquie, and L. Venkata Subramaniam. 2010. Unsupervised cleansing of noisy text. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), pages 189–196, Beijing, China.
Paul Cook and Suzanne Stevenson. 2009. An unsupervised model for text message normalization. In CALC ’09: Proceedings of the Workshop on Computational Approaches to Linguistic Creativity, pages 71–78, Boulder, USA.
Jennifer Foster, Özlem Çetinoglu, Joachim Wagner, Joseph L. Roux, Stephen Hogan, Joakim Nivre, Deirdre Hogan, and Josef van Genabith. 2011. #hardtoparse: POS Tagging and Parsing the Twitterverse. In Analyzing Microtext: Papers from the 2011 AAAI Workshop, volume WS-11-05 of AAAI Workshops, pages 20–25, San Francisco, CA, USA.
Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. 2011. Part-of-speech tagging for Twitter: Annotation, features, and experiments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), pages 42–47, Portland, Oregon, USA.
Roberto González-Ibáñez, Smaranda Muresan, and Nina Wacholder. 2011. Identifying sarcasm in Twitter: a closer look. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), pages 581–586, Portland, Oregon, USA.
Stephan Gouws, Dirk Hovy, and Donald Metzler. 2011. Unsupervised mining of lexical variants from noisy text. In Proceedings of the First workshop on Unsupervised Learning in NLP, pages 82–90, Edinburgh, Scotland, UK.
Bo Han and Timothy Baldwin. 2011. Lexical normalisation of short text messages: Makn sens a #twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), pages 368–378, Portland, Oregon, USA.
K. Jarvelin and J. Kekalainen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4).
Long Jiang, Mo Yu, Ming Zhou, Xiaohua Liu, and Tiejun Zhao. 2011. Target-dependent Twitter sentiment classification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), pages 151–160, Portland, Oregon, USA.
Joseph Kaufmann and Jugal Kalita. 2010. Syntactic normalization of Twitter messages. In International Conference on Natural Language Processing, Kharagpur, India.
S. Kullback and R. A. Leibler. 1951. On information and sufficiency. Annals of Mathematical Statistics, 22:49–86.
John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 282–289, San Francisco, CA, USA.
Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10:707–710.
Mu Li, Yang Zhang, Muhua Zhu, and Ming Zhou. 2006. Exploring distributional similarity based models for query spelling correction. In Proceedings of COLING/ACL 2006, pages 1025–1032, Sydney, Australia.
Jianhua Lin. 1991. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151.
Dekang Lin. 1998. Automatic retrieval and clustering of similar words. In Proceedings of the 36th Annual Meeting of the ACL and 17th International Conference on Computational Linguistics (COLING/ACL-98), pages 768–774, Montreal, Quebec, Canada.
Fei Liu, Fuliang Weng, Bingqing Wang, and Yang Liu. 2011a. Insertion, deletion, or substitution? normalizing text messages without pre-categorization nor supervision. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), pages 71–76, Portland, Oregon, USA.
Xiaohua Liu, Shaodian Zhang, Furu Wei, and Ming Zhou. 2011b. Recognizing named entities in tweets. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), pages 359–367, Portland, Oregon, USA.
Fei Liu, Fuliang Weng, and Xiao Jiang. 2012. A broad-coverage normalization system for social media language. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), Jeju, Republic of Korea.
Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. 2002. Text classification using string kernels. J. Mach. Learn. Res., 2:419–444.
Marco Lui and Timothy Baldwin. 2011. Cross-domain feature selection for language identification. In Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP 2011), pages 553–561, Chiang Mai, Thailand.
Brendan O’Connor, Michel Krieger, and David Ahn. 2010. TweetMotif: Exploratory search and topic summarization for Twitter. In Proceedings of the 4th International Conference on Weblogs and Social Media (ICWSM 2010), pages 384–385, Washington, USA.
Lawrence Philips. 2000. The double metaphone search algorithm. C/C++ Users Journal, 18:38–43.
Lawrence R. Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286.
Alan Ritter, Colin Cherry, and Bill Dolan. 2010. Unsupervised modeling of Twitter conversations. In Proceedings of Human Language Technologies: The 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2010), pages 172–180, Los Angeles, USA.
Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named entity recognition in tweets: An experimental study. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP 2011), pages 1524–1534, Edinburgh, Scotland, UK.
Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. 2010. Earthquake shakes Twitter users: real-time event detection by social sensors. In Proceedings of the 19th International Conference on the World Wide Web (WWW 2010), pages 851–860, Raleigh, North Carolina, USA.
Crispin Thurlow. 2003. Generation txt? The sociolinguistics of young people’s text-messaging. Discourse Analysis Online, 1(1).
Kristina Toutanova and Robert C. Moore. 2002. Pronunciation modeling for improved spelling correction. In Proceedings of the 40th Annual Meeting of the ACL and 3rd Annual Meeting of the NAACL (ACL-02), pages 144–151, Philadelphia, USA.
Official Blog Twitter. 2011. 200 million tweets per day. Retrieved August 17th, 2011.
Jianshu Weng and Bu-Sung Lee. 2011. Event detection in Twitter. In Proceedings of the 5th International Conference on Weblogs and Social Media (ICWSM 2011), Barcelona, Spain.
Zhenzhen Xue, Dawei Yin, and Brian D. Davison. 2011. Normalizing microtext. In Proceedings of the AAAI-11 Workshop on Analyzing Microtext, pages 74–79, San Francisco, USA.
Alexander Yeh. 2000. More accurate tests for the statistical significance of result differences. In Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000), pages 947–953, Saarbrücken, Germany.
2 0.63199407 108 emnlp-2012-Probabilistic Finite State Machines for Regression-based MT Evaluation
Author: Mengqiu Wang ; Christopher D. Manning
Abstract: Accurate and robust metrics for automatic evaluation are key to the development of statistical machine translation (MT) systems. We first introduce a new regression model that uses a probabilistic finite state machine (pFSM) to compute weighted edit distance as predictions of translation quality. We also propose a novel pushdown automaton extension of the pFSM model for modeling word swapping and cross alignments that cannot be captured by standard edit distance models. Our models can easily incorporate a rich set of linguistic features, and automatically learn their weights, eliminating the need for ad-hoc parameter tuning. Our methods achieve state-of-the-art correlation with human judgments on two different prediction tasks across a diverse set of standard evaluations (NIST OpenMT06,08; WMT0608).
3 0.4317514 77 emnlp-2012-Learning Constraints for Consistent Timeline Extraction
Author: David McClosky ; Christopher D. Manning
Abstract: We present a distantly supervised system for extracting the temporal bounds of fluents (relations which only hold during certain times, such as attends school). Unlike previous pipelined approaches, our model does not assume independence between each fluent or even between named entities with known connections (parent, spouse, employer, etc.). Instead, we model what makes timelines of fluents consistent by learning cross-fluent constraints, potentially spanning entities as well. For example, our model learns that someone is unlikely to start a job at age two or to marry someone who hasn’t been born yet. Our system achieves a 36% error reduction over a pipelined baseline.
4 0.4204101 138 emnlp-2012-Wiki-ly Supervised Part-of-Speech Tagging
Author: Shen Li ; Joao Graca ; Ben Taskar
Abstract: Despite significant recent work, purely unsupervised techniques for part-of-speech (POS) tagging have not achieved useful accuracies required by many language processing tasks. Use of parallel text between resource-rich and resource-poor languages is one source of weak supervision that significantly improves accuracy. However, parallel text is not always available and techniques for using it require multiple complex algorithmic steps. In this paper we show that we can build POS-taggers exceeding state-of-the-art bilingual methods by using simple hidden Markov models and a freely available and naturally growing resource, the Wiktionary. Across eight languages for which we have labeled data to evaluate results, we achieve accuracy that significantly exceeds best unsupervised and parallel text methods. We achieve highest accuracy reported for several languages and show that our approach yields better out-of-domain taggers than those trained using fully supervised Penn Treebank.
5 0.41939124 135 emnlp-2012-Using Discourse Information for Paraphrase Extraction
Author: Michaela Regneri ; Rui Wang
Abstract: Previous work on paraphrase extraction using parallel or comparable corpora has generally not considered the documents’ discourse structure as a useful information source. We propose a novel method for collecting paraphrases relying on the sequential event order in the discourse, using multiple sequence alignment with a semantic similarity measure. We show that adding discourse information boosts the performance of sentence-level paraphrase acquisition, which consequently gives a tremendous advantage for extracting phrase-level paraphrase fragments from matched sentences. Our system beats an informed baseline by a margin of 50%.
6 0.418668 71 emnlp-2012-Joint Entity and Event Coreference Resolution across Documents
7 0.41836074 129 emnlp-2012-Type-Supervised Hidden Markov Models for Part-of-Speech Tagging with Incomplete Tag Dictionaries
8 0.41738957 92 emnlp-2012-Multi-Domain Learning: When Do Domains Matter?
9 0.4163577 136 emnlp-2012-Weakly Supervised Training of Semantic Parsers
10 0.41515577 14 emnlp-2012-A Weakly Supervised Model for Sentence-Level Semantic Orientation Analysis with Multiple Experts
11 0.41470513 131 emnlp-2012-Unified Dependency Parsing of Chinese Morphological and Syntactic Structures
12 0.41421908 107 emnlp-2012-Polarity Inducing Latent Semantic Analysis
13 0.41241503 123 emnlp-2012-Syntactic Transfer Using a Bilingual Lexicon
14 0.41197574 52 emnlp-2012-Fast Large-Scale Approximate Graph Construction for NLP
15 0.41178513 23 emnlp-2012-Besting the Quiz Master: Crowdsourcing Incremental Classification Games
16 0.41159803 89 emnlp-2012-Mixed Membership Markov Models for Unsupervised Conversation Modeling
17 0.41082969 24 emnlp-2012-Biased Representation Learning for Domain Adaptation
18 0.40747398 4 emnlp-2012-A Comparison of Vector-based Representations for Semantic Composition
19 0.40669408 20 emnlp-2012-Answering Opinion Questions on Products by Exploiting Hierarchical Organization of Consumer Reviews
20 0.40562233 3 emnlp-2012-A Coherence Model Based on Syntactic Patterns