emnlp emnlp2013 emnlp2013-151 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Wang Ling ; Chris Dyer ; Alan W Black ; Isabel Trancoso
Abstract: Compared to the edited genres that have played a central role in NLP research, microblog texts use a more informal register with nonstandard lexical items, abbreviations, and free orthographic variation. When confronted with such input, conventional text analysis tools often perform poorly. Normalization, replacing orthographically or lexically idiosyncratic forms with more standard variants, can improve performance. We propose a method for learning normalization rules from machine translations of a parallel corpus of microblog messages. To validate the utility of our approach, we evaluate extrinsically, showing that normalizing English tweets and then translating improves translation quality (compared to translating unnormalized text) using three standard web translation services as well as a phrase-based translation system trained on parallel microblog data.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Compared to the edited genres that have played a central role in NLP research, microblog texts use a more informal register with nonstandard lexical items, abbreviations, and free orthographic variation. [sent-5, score-0.535]
2 When confronted with such input, conventional text analysis tools often perform poorly. [sent-6, score-0.06]
3 Normalization, replacing orthographically or lexically idiosyncratic forms with more standard variants, can improve performance. [sent-7, score-0.14]
4 We propose a method for learning normalization rules from machine translations of a parallel corpus of microblog messages. [sent-8, score-0.768]
5 1 Introduction Microblogs such as Twitter, Sina Weibo (a popular Chinese microblog service) and Facebook have received increasing attention in diverse research communities (Han and Baldwin, 2011; Hawn, 2009, inter alia). [sent-10, score-0.248]
6 In contrast to traditional text domains that use carefully controlled, standardized language, microblog content is often informal, with less adherence to conventions regarding punctuation, spelling, and style, and with a higher proportion of dialect- or pronunciation-derived orthography. [sent-11, score-0.319]
7 If retaining variation due to sociolinguistic or phonological factors is not crucial, text normalization can improve performance on downstream tasks (§2). [sent-16, score-0.511]
8 Starting from a parallel corpus of microblog messages consisting of English paired with several other languages (Ling et al. [sent-19, score-0.358]
9 , 2013), we use standard web machine translation systems to re-translate the non-English segment, producing ⟨English original, English MT⟩ pairs (§3). [sent-20, score-0.095]
10 Several techniques for identifying high-precision normalization rules are proposed, and we introduce a character-based normalization model to account for predictable character-level processes, like repetition and substitution (§4). [sent-23, score-0.73]
11 We show that our normalization procedure improves translation quality for English–Chinese microblog translation (§6). [sent-25, score-0.343]
12 Consider the English tweet shown in the first row of Table 1, which contains several elements that NLP 1The datasets used in this paper are available from http://www. [sent-27, score-0.107]
13 Table 1: Translations of an English microblog message into Mandarin, using three web translation services. [sent-33, score-0.372]
14 [Table 1 content garbled in extraction: the tweet To DanielVeuleman yea iknw imma work on that and its three Mandarin translations] systems trained on edited domains may not handle well. [sent-35, score-0.043]
15 First, it contains several nonstandard abbreviations, such as yea, iknw, and imma (abbreviations of yes, I know, and I'm going to). [sent-36, score-0.85]
16 To illustrate the effect this can have, consider now the translations produced by Google Translate,2 Microsoft Bing,3 and Youdao,4 shown in rows 2–4. [sent-38, score-0.077]
17 Even with no knowledge of Chinese, it is not hard to see that all engines have produced poor translations: the abbreviation iknw is left untranslated by all engines, and imma is variously deleted, left untranslated, or transliterated into the meaningless sequence 伊马 (pronounced yī mǎ). [sent-39, score-0.713]
18 While normalization to a form like To Daniel Veuleman: Yes, I know. I [sent-40, score-0.365]
19 am going to work on that does indeed lose some information (information important for an analysis of sociolinguistic or phonological variation clearly goes missing), it expresses the propositional content of the original in a form that is more amenable to processing by traditional tools. [sent-42, score-0.203]
20 Translating the normalized form with Google Translate produces 要丹尼尔Veuleman: 是的, 我知道。我打算在那工作。, which is a substantial improvement over all translations in Table 1. [sent-43, score-0.251]
21 3 Obtaining Normalization Examples We want to treat normalization as a supervised learning problem akin to machine translation, and to do so, we need to obtain pairs of microblog posts and their normalized forms. [sent-44, score-0.819]
22 In this section, we propose a method for creating normalization examples without any human 2http://translate.google.com/ [sent-46, score-0.393]
23 Table 2: Translations of a Chinese original post into English using web-based services. [sent-52, score-0.11]
24 Source: 对DanielVeuleman说, 是的, 我知道, 我正在向那方面努力 MT1: Right DanielVeuleman say, yes, I know, I'm Xiangna efforts MT2: DanielVeuleman said, Yes, I know, I'm that hard MT3: Said to DanielVeuleman, yes, I know, I'm to that effort ... annotation, by leveraging existing tools and data resources. [sent-55, score-0.032]
25 The English example sentence in Table 1 was selected from the µtopia parallel corpus (Ling et al. [sent-56, score-0.078]
26 , 2013), which consists of self-translated messages from Twitter and Sina Weibo (i.e., [sent-57, score-0.032]
27 The key observation is what happens when we automatically translate the Mandarin version back into English. [sent-61, score-0.049]
28 Rows 3–5 show automatic translations from three standard web MT engines. [sent-62, score-0.077]
29 While not perfect, the translations contain several correctly normalized subphrases. [sent-63, score-0.251]
30 We will use such re-translations as a source of (noisy) normalization examples. [sent-64, score-0.365]
31 Of course, to motivate this paper, we argued that NLP tools like the very translation systems we propose to use often fail on unnormalized input. [sent-66, score-0.199]
32 Work in translation studies has observed that translation tends to be a generalizing process that "smooths out" author- and work-specific idiosyncrasies (Laviosa, 1998; Volansky et al. [sent-70, score-0.19]
33 Assuming this observation is robust, we expect dialectal variant forms found in microblogs to be normalized in translation. [sent-72, score-0.397]
34 Therefore, if the parallel segments in our microblog parallel corpus did indeed originate through a translation process (rather than, e.g., [sent-73, score-0.441]
35 Any written language has the potential to make creative use of orthography: alphabetic scripts can render approximations of pronunciation variants; logographic scripts can use homophonic substitutions. [sent-77, score-0.175]
36 However, the kinds of innovations used in particular languages will be language specific (depending on details of the phonology, lexicon, and orthography of the language). [sent-78, score-0.048]
37 However, for language pairs that differ substantially in these dimensions, it may not always be possible (or at least easy) to preserve particular kinds of nonstandard orthographic forms in translation. [sent-79, score-0.254]
38 Consider the (relatively common) pronoun-verb compounds like iknw and imma from our motivating example: since Chinese uses a logographic script without spaces, there is no obvious equivalent. [sent-80, score-0.651]
39 3.1 Variant–Normalized Parallel Corpus For the two reasons outlined above, we argue that we will be able to translate back into English using MT, even when the underlying English part of the parallel corpus has a great deal of nonstandard content. [sent-82, score-0.282]
40 We leverage this fact to build the normalization corpus, where the original English tweet is treated as the variant form, and the automatic translation obtained from another language is considered a potential normalization. [sent-83, score-0.723]
41 The respective non-English side is translated into English using different translation engines. [sent-88, score-0.095]
42 The different sets we used and the engines we used to translate are shown in Table 3. [sent-89, score-0.133]
43 Thus, for each original English post o, we obtain n paraphrases {p_i}_{i=1}^n from n different translation engines. [sent-90, score-0.11]
44 5We additionally assume that the translation engines are trained to output more standardized data, so there will be an additional normalizing effect from the machine translation system. [sent-91, score-0.379]
45 3.2 Alignment and Filtering Our parallel microblog corpus was crawled automatically and contains many misaligned sentences. [sent-96, score-0.326]
46 To address lexical variants, we allow fuzzy word matching; that is, we allow lexically similar words, such as yea and yes, to be aligned (similarity is determined by the Levenshtein distance). [sent-98, score-0.437]
47 We also perform phrasal matching, such as aligning iknw to I know. [sent-99, score-0.169]
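A minimal sketch of the fuzzy word-matching test in Python; the threshold below (allowing edits up to half the longer word's length) is an illustrative assumption, not the paper's tuned cutoff.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def fuzzy_match(u: str, v: str, max_ratio: float = 0.5) -> bool:
    """Allow alignment of lexically similar words, e.g. yea and yes."""
    return levenshtein(u, v) <= max_ratio * max(len(u), len(v))

assert fuzzy_match("yea", "yes")        # one substitution out of three letters
assert not fuzzy_match("imma", "that")  # four edits: clearly different words
```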
48 To do so, we extend the alignment algorithm from word to phrasal alignments. [sent-100, score-0.202]
49 More precisely, given the original post o and a candidate normalization n, we wish to find the optimal segmentation producing a good alignment. [sent-101, score-0.583]
50 segments that align as a block to a source word. [sent-106, score-0.037]
51 For instance, for the sentence yea iknw imma work on that, one possible segmentation could be s1 = yea iknw, s2 = imma, and s3 = work on that. [sent-107, score-0.836]
52 We define the score of an alignment a and segmentation s using a model that makes semi-Markov independence assumptions, similar to the work in (Bansal et al. [sent-109, score-0.15]
53 , 2011): u(a, s | o, n) = ∏_{i=1}^{|s|} [ u_e(s_i, a_i | n) · u_t(a_i | a_{i−1}) · u_ℓ(|s_i|) ]. In this model, the maximal scoring segmentation and alignment can be found using a polynomial-time dynamic programming algorithm. [sent-110, score-0.15]
54 Each segment can be aligned to any word or segment in o. [sent-111, score-0.27]
55 For the alignment score u_t, we assume that the relative order of the two sequences will be mostly monotone. [sent-114, score-0.098]
56 Thus, we approximate u_t with the following density: pos_s(a_k) − pos_e(a_{k−1}) ∼ N(1, 1), where pos_s is the index of the first word in the segment and pos_e that of the last word. [sent-115, score-0.305]
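The product model above admits an exact semi-Markov Viterbi search. Below is a minimal sketch under stated assumptions: u_e and u_len are toy placeholder scores supplied by the caller (the paper learns these from data), the transition term uses the N(1, 1) jump density just described, and the sketch segments the variant post and aligns each segment to a position in the candidate normalization (the direction is interchangeable).

```python
import math
from functools import lru_cache

def normal_pdf(x: float, mu: float = 1.0, sigma: float = 1.0) -> float:
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def viterbi_segment_align(o, n, u_e, u_len, max_seg=3):
    """Best segmentation of o with each segment aligned to a word of n.

    o, n  : token lists (variant post and candidate normalization)
    u_e   : emission score u_e(segment, aligned_index, n) -- placeholder
    u_len : segment-length score u_len(length)            -- placeholder
    Returns (score, ((segment, aligned_index), ...)).
    """
    @lru_cache(maxsize=None)
    def best(i, prev_a):
        if i == len(o):                      # all of o has been segmented
            return 1.0, ()
        top = (0.0, ())
        for length in range(1, min(max_seg, len(o) - i) + 1):
            seg = tuple(o[i:i + length])
            for a in range(len(n)):
                jump = a - prev_a            # pos_s(a_k) - pos_e(a_{k-1})
                local = u_e(seg, a, tuple(n)) * normal_pdf(jump) * u_len(length)
                rest, tail = best(i + length, a)
                if local * rest > top[0]:
                    top = (local * rest, ((seg, a),) + tail)
        return top
    return best(0, -1)
```

Memoizing over (position in o, previous alignment point) is what yields the polynomial-time search mentioned above.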
57 After finding the Viterbi alignments, we compute the similarity measure τ = |A| / (|A| + |U|) used in (Resnik and Smith, 2003), where |A| and |U| are the numbers of words that were aligned and unaligned, respectively. [sent-116, score-0.116]
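As a sketch, the filtering score is a one-liner over the Viterbi output; the rejection threshold of 0.5 below is an illustrative assumption.

```python
def tau(num_aligned: int, num_unaligned: int) -> float:
    """Fraction of words covered by the alignment: |A| / (|A| + |U|)."""
    return num_aligned / (num_aligned + num_unaligned)

def keep_pair(num_aligned: int, num_unaligned: int, threshold: float = 0.5) -> bool:
    # Pairs with low coverage are treated as likely misalignments and dropped.
    return tau(num_aligned, num_unaligned) >= threshold
```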
58 4 Normalization Model From the normalization corpus, we learn a normalization model that generalizes the normalization process. [sent-119, score-1.095]
59 That is, from the data we observe that To DanielVeuleman yea iknw imma work on that is normalized to To Daniel Veuleman: yes, I know. [sent-120, score-0.958]
60 However, this is not useful, since the chances of the exact sentence To DanielVeuleman yea iknw imma work on that occurring in the data are low. [sent-122, score-0.784]
61 We wish to learn a process to convert the original tweet into the normalized form. [sent-123, score-0.394]
62 That is, we wish to find word-to-word mappings: that DanielVeuleman is normalized to Daniel Veuleman, that iknw is normalized to I know, and that imma is normalized to I'm going. [sent-127, score-1.178]
63 These mappings are more useful, since whenever iknw occurs in the data, we have the option to normalize it to I know. [sent-128, score-0.359]
64 However, we wish to learn that it is uncommon for the letters l and v to occur sequentially in the same word, so that we can add missing spaces in words that contain the lv character sequence, such as normalizing phenomenalvoter to phenomenal voter. [sent-133, score-0.229]
65 Figure 1: Variant–normalized alignment with the variant form (I wanna go 4 pizza 2day) above and the normalized form (I want to go for pizza today) below; solid lines show potential normalizations, while dashed lines represent identical translations. [sent-134, score-0.729]
66 However, there are also cases where this is not true; for instance, in the word velvet, we do not wish to separate the letters l and v. [sent-135, score-0.101]
67 Thus, we shall describe the process we use to decide when to apply these transformations. [sent-136, score-0.037]
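One simple way to ground such a decision, sketched below: estimate character-bigram frequencies from clean text and treat rare within-word bigrams such as lv as candidate split points. The toy vocabulary and threshold are illustrative assumptions; the paper's character model is instead learned from the normalization corpus.

```python
from collections import Counter

def char_bigram_counts(words):
    counts = Counter()
    for w in words:
        counts.update(w[i:i + 2] for i in range(len(w) - 1))
    return counts

# Toy stand-in for a large clean corpus (illustrative only).
clean_vocab = ["phenomenal", "voter", "velvet", "want", "pizza", "today"]
bigrams = char_bigram_counts(clean_vocab)

def split_candidates(word, counts, min_count=2):
    """Propose a space before position i+1 when the bigram is rare."""
    return [i + 1 for i in range(len(word) - 1)
            if counts[word[i:i + 2]] < min_count]

# With realistic counts, lv is rare enough to suggest
# phenomenalvoter -> phenomenal voter, yet still attested (velvet), which
# is exactly why a probabilistic decision beats a hard rule here.
```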
68 The first step is to find the word-level alignments between the original post and its normalization. [sent-141, score-0.156]
69 Many alignment models have been proposed, such as the HMM-based word alignment models (Vogel et al. [sent-144, score-0.196]
70 Generally, a symmetrization step is performed, where the bidirectional alignments are combined heuristically. [sent-146, score-0.046]
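A hedged sketch of one common heuristic (the intersection grown toward the union, in the spirit of grow-diag); the specific heuristic used here is not detailed in this section, so the sketch is illustrative.

```python
def symmetrize(fwd, rev):
    """fwd, rev: sets of (i, j) links from the two alignment directions."""
    sym = fwd & rev                      # high-precision intersection links
    union = fwd | rev
    changed = True
    while changed:                       # grow toward the union
        changed = False
        for (i, j) in sorted(union - sym):
            # Admit a union link if it neighbors an accepted link.
            if any((i + di, j + dj) in sym
                   for di in (-1, 0, 1) for dj in (-1, 0, 1)):
                sym.add((i, j))
                changed = True
    return sym
```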
71 Figure 1 shows an example of a word-aligned pair of a tweet and its normalization. [sent-149, score-0.221]
72 , 2010), uses the word-aligned sentences and extracts phrasal mappings between the original tweet and its normalization, called phrase pairs. [sent-152, score-0.481]
73 For instance, in Figure 1, we would like to extract the phrasal mapping from go 4 to go for, so that we learn that the word 4 in the context of go is normalized to the preposition for. [sent-153, score-0.596]
74 Phrase pairs are extracted only if there are no words inside the pair that are aligned to words not in the pair. [sent-161, score-0.114]
75 For instance, in the example above, the phrase pair that normalizes wanna to want to would be extracted, but the phrase pair normalizing wanna to want to go would not, because the word go in the normalization is aligned to a word not in the pair. [sent-162, score-1.027]
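A minimal sketch of this consistency check over word links; the span-length limit and token handling are simplifying assumptions.

```python
def consistent(align, i1, i2, j1, j2):
    """align: set of (i, j) word links; spans are inclusive token indices."""
    for (i, j) in align:
        inside_src = i1 <= i <= i2
        inside_tgt = j1 <= j <= j2
        if inside_src != inside_tgt:     # a link crosses the span boundary
            return False
    return any(i1 <= i <= i2 for (i, _) in align)  # require >= 1 link

def extract_phrase_pairs(src, tgt, align, max_len=4):
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            for j1 in range(len(tgt)):
                for j2 in range(j1, min(j1 + max_len, len(tgt))):
                    if consistent(align, i1, i2, j1, j2):
                        pairs.append((" ".join(src[i1:i2 + 1]),
                                      " ".join(tgt[j1:j2 + 1])))
    return pairs
```

On the Figure 1 example, the pair (wanna, want to) passes the check, while (wanna, want to go) fails because go on the normalized side is linked to a word outside the pair.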
76 After extracting the phrase pairs, a model is produced with features derived from phrase pair occurrences during extraction. [sent-164, score-0.21]
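A sketch of one plausible such feature, relative frequency, which matches the probabilities shown in Table 4; the actual model may combine several features.

```python
from collections import Counter, defaultdict

def score_phrase_pairs(extracted):
    """extracted: list of (variant, normalized) phrase pairs from all posts."""
    joint = Counter(extracted)
    marginal = Counter(variant for variant, _ in extracted)
    table = defaultdict(dict)
    for (variant, norm), count in joint.items():
        table[variant][norm] = count / marginal[variant]  # P(norm | variant)
    return table

# table["wanna"] would then give "want to" a high probability and
# alternatives such as "going to" smaller ones, as in Table 4.
```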
77 This model is equivalent to the phrasal translation model in MT, but we shall refer to it as the normalization model. [sent-165, score-0.601]
78 Table 4 gives a fragment of the normalization model. [sent-167, score-0.419]
79 The columns represent the original phrase, its normalization and the probability, respectively. [sent-168, score-0.422]
80 In Table 4, we observe that the abbreviation wanna is normalized to want to with a relatively high probability, but it can also be normalized to other equivalent expressions, such as will and going to. [sent-169, score-0.49]
81 The word 4 by itself has a low probability of being normalized to the preposition for. [sent-170, score-0.174]
82 However, we see that the phrase go 4 is normalized to go for with a high probability, which specifies that within the context of go, 4 is generally used as a preposition. [sent-172, score-0.477]
83 4.2 From Phrases to Characters While we can learn lexical variants that are in the corpora using the phrase model, we can only address word forms that have been observed in the corpora. [sent-174, score-0.196]
84 Table 5: Fragment of the character normalization model, where examples representative of the lexical variant generation process are encoded in the model. [sent-175, score-0.5]
wordName wordTfidf (topN-words)
[('normalization', 0.365), ('danielveuleman', 0.293), ('iknw', 0.293), ('imma', 0.293), ('microblog', 0.248), ('yea', 0.198), ('normalized', 0.174), ('veuleman', 0.163), ('nonstandard', 0.155), ('yes', 0.118), ('tweet', 0.107), ('go', 0.106), ('phrasal', 0.104), ('variant', 0.099), ('alignment', 0.098), ('translation', 0.095), ('twitter', 0.094), ('wan', 0.093), ('segment', 0.092), ('phrase', 0.091), ('aligned', 0.086), ('engines', 0.084), ('ling', 0.084), ('parallel', 0.078), ('normalizations', 0.077), ('translations', 0.077), ('bing', 0.076), ('weibo', 0.072), ('sociolinguistic', 0.072), ('unnormalized', 0.072), ('mandarin', 0.072), ('going', 0.067), ('na', 0.066), ('ikwn', 0.065), ('laviosa', 0.065), ('logographic', 0.065), ('poss', 0.065), ('volansky', 0.065), ('youdao', 0.065), ('sina', 0.065), ('abbreviations', 0.065), ('normalizing', 0.062), ('variants', 0.061), ('original', 0.057), ('portugal', 0.057), ('pizza', 0.057), ('wish', 0.056), ('orthographic', 0.055), ('fragment', 0.054), ('post', 0.053), ('google', 0.053), ('segmentation', 0.052), ('microblogs', 0.052), ('english', 0.05), ('translate', 0.049), ('lisbon', 0.048), ('orthography', 0.048), ('mt', 0.048), ('ut', 0.047), ('alignments', 0.046), ('land', 0.045), ('forms', 0.044), ('standardized', 0.043), ('abbreviation', 0.043), ('edited', 0.043), ('know', 0.042), ('chinese', 0.042), ('scripts', 0.041), ('aan', 0.041), ('phonological', 0.04), ('ak', 0.04), ('segments', 0.037), ('shall', 0.037), ('character', 0.036), ('translating', 0.036), ('mappings', 0.036), ('pose', 0.036), ('lexically', 0.035), ('variation', 0.034), ('informal', 0.034), ('eisenstein', 0.033), ('tools', 0.032), ('messages', 0.032), ('want', 0.032), ('dw', 0.03), ('normalize', 0.03), ('spaces', 0.03), ('message', 0.029), ('paraphrasing', 0.029), ('pair', 0.028), ('dialectal', 0.028), ('dialect', 0.028), ('rans', 0.028), ('confronted', 0.028), ('homophonic', 0.028), ('ainndd', 0.028), ('isabel', 0.028), ('denkowski', 0.028), ('hue', 0.028), ('meteor', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 151 emnlp-2013-Paraphrasing 4 Microblog Normalization
Author: Wang Ling ; Chris Dyer ; Alan W Black ; Isabel Trancoso
Abstract: Compared to the edited genres that have played a central role in NLP research, microblog texts use a more informal register with nonstandard lexical items, abbreviations, and free orthographic variation. When confronted with such input, conventional text analysis tools often perform poorly. Normalization, replacing orthographically or lexically idiosyncratic forms with more standard variants, can improve performance. We propose a method for learning normalization rules from machine translations of a parallel corpus of microblog messages. To validate the utility of our approach, we evaluate extrinsically, showing that normalizing English tweets and then translating improves translation quality (compared to translating unnormalized text) using three standard web translation services as well as a phrase-based translation system trained on parallel microblog data.
2 0.22701722 9 emnlp-2013-A Log-Linear Model for Unsupervised Text Normalization
Author: Yi Yang ; Jacob Eisenstein
Abstract: We present a unified unsupervised statistical model for text normalization. The relationship between standard and non-standard tokens is characterized by a log-linear model, permitting arbitrary features. The weights of these features are trained in a maximumlikelihood framework, employing a novel sequential Monte Carlo training algorithm to overcome the large label space, which would be impractical for traditional dynamic programming solutions. This model is implemented in a normalization system called UNLOL, which achieves the best known results on two normalization datasets, outperforming more complex systems. We use the output of UNLOL to automatically normalize a large corpus of social media text, revealing a set of coherent orthographic styles that underlie online language variation.
3 0.15260717 130 emnlp-2013-Microblog Entity Linking by Leveraging Extra Posts
Author: Yuhang Guo ; Bing Qin ; Ting Liu ; Sheng Li
Abstract: Linking name mentions in microblog posts to a knowledge base, namely microblog entity linking, is useful for text mining tasks on microblogs. Entity linking in long text has been well studied in previous works. However, little work has focused on short text such as microblog posts. Microblog posts are short and noisy. Previous methods can extract only few features from the post context. In this paper we propose to use extra posts for the microblog entity linking task. Experimental results show that our proposed method significantly improves the linking accuracy over traditional methods by 8.3% and 7.5% respectively.
4 0.14686808 167 emnlp-2013-Semi-Markov Phrase-Based Monolingual Alignment
Author: Xuchen Yao ; Benjamin Van Durme ; Chris Callison-Burch ; Peter Clark
Abstract: We introduce a novel discriminative model for phrase-based monolingual alignment using a semi-Markov CRF. Our model achieves state-of-the-art alignment accuracy on two phrase-based alignment datasets (RTE and paraphrase), while doing significantly better than other strong baselines in both non-identical alignment and phrase-only alignment. Additional experiments highlight the potential benefit of our alignment model to RTE, paraphrase identification and question answering, where even a naive application of our model's alignment score approaches the state of the art.
5 0.12182074 96 emnlp-2013-Identifying Phrasal Verbs Using Many Bilingual Corpora
Author: Karl Pichotta ; John DeNero
Abstract: We address the problem of identifying multiword expressions in a language, focusing on English phrasal verbs. Our polyglot ranking approach integrates frequency statistics from translated corpora in 50 different languages. Our experimental evaluation demonstrates that combining statistical evidence from many parallel corpora using a novel ranking-oriented boosting algorithm produces a comprehensive set of English phrasal verbs, achieving performance comparable to a human-curated set.
6 0.10409791 47 emnlp-2013-Collective Opinion Target Extraction in Chinese Microblogs
7 0.098414987 148 emnlp-2013-Orthonormal Explicit Topic Analysis for Cross-Lingual Document Matching
8 0.093398906 4 emnlp-2013-A Dataset for Research on Short-Text Conversations
9 0.087337457 187 emnlp-2013-Translation with Source Constituency and Dependency Trees
10 0.081994154 103 emnlp-2013-Improving Pivot-Based Statistical Machine Translation Using Random Walk
11 0.081939161 150 emnlp-2013-Pair Language Models for Deriving Alternative Pronunciations and Spellings from Pronunciation Dictionaries
12 0.079571672 135 emnlp-2013-Monolingual Marginal Matching for Translation Model Adaptation
13 0.076554544 15 emnlp-2013-A Systematic Exploration of Diversity in Machine Translation
14 0.075749665 127 emnlp-2013-Max-Margin Synchronous Grammar Induction for Machine Translation
15 0.071345754 84 emnlp-2013-Factored Soft Source Syntactic Constraints for Hierarchical Machine Translation
16 0.070972577 38 emnlp-2013-Bilingual Word Embeddings for Phrase-Based Machine Translation
17 0.065024398 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging
18 0.064734697 8 emnlp-2013-A Joint Learning Model of Word Segmentation, Lexical Acquisition, and Phonetic Variability
19 0.064204507 104 emnlp-2013-Improving Statistical Machine Translation with Word Class Models
20 0.06291052 109 emnlp-2013-Is Twitter A Better Corpus for Measuring Sentiment Similarity?
topicId topicWeight
[(0, -0.202), (1, -0.101), (2, -0.035), (3, -0.048), (4, 0.06), (5, -0.057), (6, 0.002), (7, 0.126), (8, -0.007), (9, -0.056), (10, 0.006), (11, 0.146), (12, 0.222), (13, 0.277), (14, -0.074), (15, -0.042), (16, 0.122), (17, 0.03), (18, -0.036), (19, -0.058), (20, 0.046), (21, -0.142), (22, 0.024), (23, -0.059), (24, -0.098), (25, 0.024), (26, -0.039), (27, 0.087), (28, 0.104), (29, 0.039), (30, 0.054), (31, -0.011), (32, 0.135), (33, 0.18), (34, 0.004), (35, 0.052), (36, -0.033), (37, -0.0), (38, 0.147), (39, 0.078), (40, 0.073), (41, 0.027), (42, -0.035), (43, -0.197), (44, -0.039), (45, 0.056), (46, -0.078), (47, 0.158), (48, -0.091), (49, -0.138)]
simIndex simValue paperId paperTitle
same-paper 1 0.95296198 151 emnlp-2013-Paraphrasing 4 Microblog Normalization
Author: Wang Ling ; Chris Dyer ; Alan W Black ; Isabel Trancoso
Abstract: Compared to the edited genres that have played a central role in NLP research, microblog texts use a more informal register with nonstandard lexical items, abbreviations, and free orthographic variation. When confronted with such input, conventional text analysis tools often perform poorly. Normalization replacing orthographically or lexically idiosyncratic forms with more standard variants can improve performance. We propose a method for learning normalization rules from machine translations of a parallel corpus of microblog messages. To validate the utility of our approach, we evaluate extrinsically, showing that normalizing English tweets and then translating improves translation quality (compared to translating unnormalized text) using three standard web translation services as well as a phrase-based translation system trained — — on parallel microblog data.
2 0.75944185 9 emnlp-2013-A Log-Linear Model for Unsupervised Text Normalization
Author: Yi Yang ; Jacob Eisenstein
Abstract: We present a unified unsupervised statistical model for text normalization. The relationship between standard and non-standard tokens is characterized by a log-linear model, permitting arbitrary features. The weights of these features are trained in a maximumlikelihood framework, employing a novel sequential Monte Carlo training algorithm to overcome the large label space, which would be impractical for traditional dynamic programming solutions. This model is implemented in a normalization system called UNLOL, which achieves the best known results on two normalization datasets, outperforming more complex systems. We use the output of UNLOL to automatically normalize a large corpus of social media text, revealing a set of coherent orthographic styles that underlie online language variation.
3 0.51959145 14 emnlp-2013-A Synchronous Context Free Grammar for Time Normalization
Author: Steven Bethard
Abstract: We present an approach to time normalization (e.g., the day before yesterday ⇒ 2013-04-12) based on a synchronous context free grammar. Synchronous rules map the source language to formally defined operators for manipulating times (FINDENCLOSED, STARTATENDOF, etc.). Time expressions are then parsed using an extended CYK+ algorithm, and converted to a normalized form by applying the operators recursively. For evaluation, a small set of synchronous rules for English time expressions was developed. Our model outperforms HeidelTime, the best time normalization system in TempEval 2013, on four different time normalization corpora.
4 0.48973599 167 emnlp-2013-Semi-Markov Phrase-Based Monolingual Alignment
Author: Xuchen Yao ; Benjamin Van Durme ; Chris Callison-Burch ; Peter Clark
Abstract: We introduce a novel discriminative model for phrase-based monolingual alignment using a semi-Markov CRF. Our model achieves state-of-the-art alignment accuracy on two phrase-based alignment datasets (RTE and paraphrase), while doing significantly better than other strong baselines in both non-identical alignment and phrase-only alignment. Additional experiments highlight the potential benefit of our alignment model to RTE, paraphrase identification and question answering, where even a naive application of our model's alignment score approaches the state of the art.
Author: Russell Beckley ; Brian Roark
Abstract: Pronunciation dictionaries provide a readily available parallel corpus for learning to transduce between character strings and phoneme strings or vice versa. Translation models can be used to derive character-level paraphrases on either side of this transduction, allowing for the automatic derivation of alternative pronunciations or spellings. We examine finite-state and SMT-based methods for these related tasks, and demonstrate that the tasks have different characteristics: finding alternative spellings is harder than alternative pronunciations and benefits from round-trip algorithms when the other does not. We also show that we can increase accuracy by modeling syllable stress.
6 0.35443005 103 emnlp-2013-Improving Pivot-Based Statistical Machine Translation Using Random Walk
7 0.35420439 130 emnlp-2013-Microblog Entity Linking by Leveraging Extra Posts
8 0.34680349 148 emnlp-2013-Orthonormal Explicit Topic Analysis for Cross-Lingual Document Matching
9 0.34238195 96 emnlp-2013-Identifying Phrasal Verbs Using Many Bilingual Corpora
10 0.33242881 47 emnlp-2013-Collective Opinion Target Extraction in Chinese Microblogs
11 0.33102536 4 emnlp-2013-A Dataset for Research on Short-Text Conversations
12 0.32136181 135 emnlp-2013-Monolingual Marginal Matching for Translation Model Adaptation
13 0.3199237 101 emnlp-2013-Improving Alignment of System Combination by Using Multi-objective Optimization
14 0.31329504 127 emnlp-2013-Max-Margin Synchronous Grammar Induction for Machine Translation
15 0.29897258 187 emnlp-2013-Translation with Source Constituency and Dependency Trees
16 0.26054877 156 emnlp-2013-Recurrent Continuous Translation Models
17 0.25754887 139 emnlp-2013-Noise-Aware Character Alignment for Bootstrapping Statistical Machine Transliteration from Bilingual Corpora
18 0.24508165 64 emnlp-2013-Discriminative Improvements to Distributional Sentence Similarity
19 0.24489641 104 emnlp-2013-Improving Statistical Machine Translation with Word Class Models
20 0.24012811 22 emnlp-2013-Anchor Graph: Global Reordering Contexts for Statistical Machine Translation
topicId topicWeight
[(3, 0.026), (18, 0.042), (22, 0.033), (30, 0.083), (47, 0.033), (50, 0.012), (51, 0.17), (66, 0.053), (71, 0.361), (75, 0.025), (77, 0.032), (90, 0.016), (96, 0.012)]
simIndex simValue paperId paperTitle
1 0.98554075 161 emnlp-2013-Rule-Based Information Extraction is Dead! Long Live Rule-Based Information Extraction Systems!
Author: Laura Chiticariu ; Yunyao Li ; Frederick R. Reiss
Abstract: The rise of "Big Data" analytics over unstructured text has led to renewed interest in information extraction (IE). We surveyed the landscape of IE technologies and identified a major disconnect between industry and academia: while rule-based IE dominates the commercial world, it is widely regarded as dead-end technology by the academia. We believe the disconnect stems from the way in which the two communities measure the benefits and costs of IE, as well as academia's perception that rule-based IE is devoid of research challenges. We make a case for the importance of rule-based IE to industry practitioners. We then lay out a research agenda in advancing the state-of-the-art in rule-based IE systems which we believe has the potential to bridge the gap between academic research and industry practice.
2 0.89613265 63 emnlp-2013-Discourse Level Explanatory Relation Extraction from Product Reviews Using First-Order Logic
Author: Qi Zhang ; Jin Qian ; Huan Chen ; Jihua Kang ; Xuanjing Huang
Abstract: Explanatory sentences are employed to clarify reasons, details, facts, and so on. High quality online product reviews usually include not only positive or negative opinions, but also a variety of explanations of why these opinions were given. These explanations can help readers get easily comprehensible information of the discussed products and aspects. Moreover, explanatory relations can also benefit sentiment analysis applications. In this work, we focus on the task of identifying subjective text segments and extracting their corresponding explanations from product reviews in discourse level. We propose a novel joint extraction method using firstorder logic to model rich linguistic features and long distance constraints. Experimental results demonstrate the effectiveness of the proposed method.
same-paper 3 0.87141353 151 emnlp-2013-Paraphrasing 4 Microblog Normalization
Author: Wang Ling ; Chris Dyer ; Alan W Black ; Isabel Trancoso
Abstract: Compared to the edited genres that have played a central role in NLP research, microblog texts use a more informal register with nonstandard lexical items, abbreviations, and free orthographic variation. When confronted with such input, conventional text analysis tools often perform poorly. Normalization, replacing orthographically or lexically idiosyncratic forms with more standard variants, can improve performance. We propose a method for learning normalization rules from machine translations of a parallel corpus of microblog messages. To validate the utility of our approach, we evaluate extrinsically, showing that normalizing English tweets and then translating improves translation quality (compared to translating unnormalized text) using three standard web translation services as well as a phrase-based translation system trained on parallel microblog data.
4 0.82003665 123 emnlp-2013-Learning to Rank Lexical Substitutions
Author: Gyorgy Szarvas ; Robert Busa-Fekete ; Eyke Hullermeier
Abstract: The problem of replacing a word with a synonym that fits well in its sentential context is known as the lexical substitution task. In this paper, we tackle this task as a supervised ranking problem. Given a dataset of target words, their sentential contexts and the potential substitutions for the target words, the goal is to train a model that accurately ranks the candidate substitutions based on their contextual fitness. As a key contribution, we customize and evaluate several learning-to-rank models for the lexical substitution task, including classification-based and regression-based approaches. On two datasets widely used for lexical substitution, our best models significantly advance the state-of-the-art.
5 0.62975144 47 emnlp-2013-Collective Opinion Target Extraction in Chinese Microblogs
Author: Xinjie Zhou ; Xiaojun Wan ; Jianguo Xiao
Abstract: Microblog messages pose severe challenges for current sentiment analysis techniques due to some inherent characteristics such as the length limit and informal writing style. In this paper, we study the problem of extracting opinion targets of Chinese microblog messages. Such fine-grained word-level task has not been well investigated in microblogs yet. We propose an unsupervised label propagation algorithm to address the problem. The opinion targets of all messages in a topic are collectively extracted based on the assumption that similar messages may focus on similar opinion targets. Topics in microblogs are identified by hashtags or using clustering algorithms. Experimental results on Chinese microblogs show the effectiveness of our framework and algorithms.
7 0.61275983 81 emnlp-2013-Exploring Demographic Language Variations to Improve Multilingual Sentiment Analysis in Social Media
8 0.59620529 9 emnlp-2013-A Log-Linear Model for Unsupervised Text Normalization
9 0.58835274 144 emnlp-2013-Opinion Mining in Newspaper Articles by Entropy-Based Word Connections
10 0.58808434 143 emnlp-2013-Open Domain Targeted Sentiment
11 0.58081752 202 emnlp-2013-Where Not to Eat? Improving Public Policy by Predicting Hygiene Inspections Using Online Reviews
12 0.57225645 196 emnlp-2013-Using Crowdsourcing to get Representations based on Regular Expressions
13 0.56716639 77 emnlp-2013-Exploiting Domain Knowledge in Aspect Extraction
14 0.56508011 170 emnlp-2013-Sentiment Analysis: How to Derive Prior Polarities from SentiWordNet
15 0.56357884 2 emnlp-2013-A Convex Alternative to IBM Model 2
16 0.56181723 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks
18 0.56150496 121 emnlp-2013-Learning Topics and Positions from Debatepedia
19 0.56117475 99 emnlp-2013-Implicit Feature Detection via a Constrained Topic Model and SVM
20 0.56104529 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging