acl acl2013 acl2013-37 knowledge-graph by maker-knowledge-mining

37 acl-2013-Adaptive Parser-Centric Text Normalization


Source: pdf

Author: Congle Zhang ; Tyler Baldwin ; Howard Ho ; Benny Kimelfeld ; Yunyao Li

Abstract: Text normalization is an important first step towards enabling many Natural Language Processing (NLP) tasks over informal text. While many of these tasks, such as parsing, perform the best over fully grammatically correct text, most existing text normalization approaches narrowly define the task in the word-to-word sense; that is, the task is seen as that of mapping all out-of-vocabulary non-standard words to their in-vocabulary standard forms. In this paper, we take a parser-centric view of normalization that aims to convert raw informal text into grammatically correct text. To understand the real effect of normalization on the parser, we tie normalization performance directly to parser performance. Additionally, we design a customizable framework to address the often overlooked concept of domain adaptability, and illustrate that the system allows for transfer to new domains with a minimal amount of data and effort. Our experimental study over datasets from three domains demonstrates that our approach outperforms not only the state-of-the-art word-to-word normalization techniques, but also manual word-to-word annotations.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Text normalization is an important first step towards enabling many Natural Language Processing (NLP) tasks over informal text. [sent-5, score-0.606]

2 In this paper, we take a parser-centric view of normalization that aims to convert raw informal text into grammatically correct text. [sent-7, score-0.675]

3 To understand the real effect of normalization on the parser, we tie normalization performance directly to parser performance. [sent-8, score-0.569]

4 Our experimental study over datasets from three domains demonstrates that our approach outperforms not only the state-of-the-art word-to-word normalization techniques, but also manual word-to-word annotations. [sent-10, score-0.51]

5 1 Introduction Text normalization is the task of transforming informal writing into its standard form in the language. [sent-11, score-0.606]

6 The use of normalization in these applications poses multiple challenges. [sent-15, score-0.51]

7 First, as it is most often conceptualized, normalization is seen as the task of mapping all out-of-vocabulary non-standard word tokens to their in-vocabulary standard forms. [sent-16, score-0.57]

8 This broader definition of the normalization task may include modifying punctuation and capitalization, and adding, removing, or reordering words. [sent-18, score-0.679]

9 Second, as with other NLP techniques, normalization approaches are often focused on one primary domain of interest (e.g., Twitter). [sent-19, score-0.541]

10 This work introduces a customizable normalization approach designed with domain transfer in mind. [sent-24, score-0.571]

11 In short, customization is done by providing the normalizer with replacement generators, which we define in Section 3. [sent-25, score-0.336]

12 We show that the introduction of a small set of domain-specific generators and training data allows our model to outperform a set of competitive baselines, including state-of-the-art word-to-word normalization. [sent-26, score-0.419]

13 Additionally, the flexibility of the model allows it to attempt to produce fully grammatical sentences, something not typically handled by word-to-word normalization approaches. [sent-27, score-0.549]

14 Another potential problem with state-of-the-art normalization is the lack of appropriate evaluation metrics. [sent-28, score-0.51]

15 The normalization task is most frequently motivated by pointing to the need for clean text for downstream processing applications, such as syntactic parsing. [sent-29, score-0.638]

16 However, most studies of normalization give little insight into whether and to what degree the normalization process improves downstream performance. [sent-30, score-1.02]

17 For instance, it is unclear how performance measured by the typical normalization evaluation metrics of word error rate and BLEU score (Papineni et al. [sent-33, score-0.606]

18 To address this problem, this work introduces an evaluation metric that ties normalization performance directly to the performance of a downstream dependency parser. [sent-35, score-0.767]

19 In Section 2 we discuss previous approaches to the normalization problem. [sent-37, score-0.51]

20 Section 3 presents our normalization framework, including the actual normalization and learning procedures. [sent-38, score-1.02]

21 Sproat et al. (2001) took the first major look at the normalization problem, citing the need for normalized text for downstream applications. [sent-43, score-0.693]

22 Unlike later works that would primarily focus on specific noisy data sets, their work is notable for attempting to develop normalization as a general process that could be applied to different domains. [sent-44, score-0.51]

23 The recent rise of heavily informal writing styles such as Twitter and SMS messages set off a new round of interest in the normalization problem. [sent-45, score-0.666]

24 Research on SMS and Twitter normalization has been roughly categorized as drawing inspiration from three other areas of NLP (Kobus et al. [sent-46, score-0.55]

25 The statistical machine translation (SMT) metaphor was the first proposed to handle the text normalization problem (Aw et al. [sent-48, score-0.51]

26 Recent work has looked at the construction of normalization dictionaries (Han et al. [sent-64, score-0.51]

27 Although it is almost universally used as a motivating factor, most normalization work does not directly focus on improving downstream applications. [sent-67, score-0.638]

28 While a few notable exceptions highlight the need for normalization as part of text-to-speech systems (Beaufort et al. [sent-68, score-0.51]

29 , 2010; Pennell and Liu, 2010), these works do not give any direct insight into how much the normalization process actually improves the performance of these systems. [sent-69, score-0.539]

30 To our knowledge, the work presented here is the first to clearly link the output of a normalization system to the output of the downstream application. [sent-70, score-0.638]

31 3 Model In this section we introduce our normalization framework, which draws inspiration from our previous work on spelling correction for search (Bao et al. [sent-72, score-0.577]

32 Given the input x, we apply a series of replacement generators, where a replacement generator is a function that takes x as input and produces a collection of replacements. [sent-80, score-0.581]

33 Here, a replacement is a statement of the form “replace tokens xi, . . . , xj−1 with the string s.” [sent-81, score-0.288]

34 More precisely, a replacement is a triple ⟨i, j, s⟩, where 1 ≤ i ≤ j ≤ n + 1 and s is a sequence of tokens. [sent-85, score-0.309]

35 For instance, in our running example the replacement ⟨2, 3, would not⟩ replaces x2 with would not; ⟨1, 2, Ay⟩ replaces x1 with itself (hence, does not change x); and ⟨1, 2, ε⟩ deletes x1. [sent-90, score-0.336]

36 The provided replacement generators can be either generic (cross domain) or domain-specific, allowing for domain customization. [sent-95, score-0.761]

37 In Section 4, we discuss the replacement generators used in our empirical study. [sent-96, score-0.65]
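
To make the generator abstraction concrete, here is a minimal Python sketch (not the authors' code): it assumes 0-based, end-exclusive spans, invented generator names, and a made-up stand-in for the paper's running example.

```python
from typing import Callable, List, Tuple

# A replacement is a triple (i, j, s): replace tokens x[i:j] with the token
# tuple s (0-based and end-exclusive here; the paper uses 1-based indices).
Replacement = Tuple[int, int, Tuple[str, ...]]
Generator = Callable[[List[str]], List[Replacement]]

def identity_generator(x: List[str]) -> List[Replacement]:
    # Keep every token as-is, so a legal assignment always exists.
    return [(i, i + 1, (tok,)) for i, tok in enumerate(x)]

def dictionary_generator(mapping: dict) -> Generator:
    # Lookup-based rewrites, e.g. an SMS abbreviation dictionary.
    def gen(x: List[str]) -> List[Replacement]:
        return [(i, i + 1, tuple(mapping[t].split()))
                for i, t in enumerate(x) if t in mapping]
    return gen

def deletion_generator(deletable: set) -> Generator:
    # Allow tokens such as emoticons to be dropped (empty output tuple).
    def gen(x: List[str]) -> List[Replacement]:
        return [(i, i + 1, ()) for i, t in enumerate(x) if t in deletable]
    return gen

x = "Ay wudnt it b luvly :)".split()   # invented stand-in for the running example
generators: List[Generator] = [
    identity_generator,
    dictionary_generator({"wudnt": "would not", "b": "be", "luvly": "lovely"}),
    deletion_generator({":)"}),
]
replacements = [r for g in generators for r in g(x)]
```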

38 2 Normalization Graph Given the input x and the set of replacements produced by our generators, we associate a unique Boolean variable Xr with each replacement r. [sent-98, score-0.431]

39 As expected, Xr being true means that the replacement r takes place in producing the output sequence. [sent-99, score-0.26]

40 A truth assignment α to our variables Xr is sound if every two replacements r and r′ with α(Xr) = α(Xr′) = true are locally consistent. [sent-108, score-0.353]

41 We say that α is complete if every token of x is captured by at least one replacement r with α(Xr) = true. [sent-109, score-0.26]

42 The output (normalized sequence) defined by a legal assignment α is, naturally, the concatenation (from left to right) of the strings s in the replacements r = ⟨i, j, s⟩ with α(Xr) = true. [sent-111, score-0.353]
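
Continuing the sketch above, soundness, completeness, and output realization can be rendered roughly as follows (an illustration under the same assumptions, not the paper's implementation):

```python
def locally_consistent(r1: Replacement, r2: Replacement) -> bool:
    # Two replacements conflict iff their token spans overlap.
    return r1[1] <= r2[0] or r2[1] <= r1[0]

def is_legal(chosen: List[Replacement], n: int) -> bool:
    # Sound: pairwise locally consistent; complete: every token covered.
    sound = all(locally_consistent(a, b)
                for k, a in enumerate(chosen) for b in chosen[k + 1:])
    covered = {p for i, j, _ in chosen for p in range(i, j)}
    return sound and covered == set(range(n))

def realize(chosen: List[Replacement]) -> List[str]:
    # Output of a legal assignment: left-to-right concatenation of the strings s.
    return [tok for _, _, s in sorted(chosen) for tok in s]

full = [(0, 1, ("Ay",)), (1, 2, ("would", "not")), (2, 3, ("it",)),
        (3, 4, ("be",)), (4, 5, ("lovely",)), (5, 6, ())]
assert is_legal(full, len(x))
print(realize(full))   # ['Ay', 'would', 'not', 'it', 'be', 'lovely']
```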

43 In this work, dependencies of the second type are restricted to pairs of variables, where each pair corresponds to a replacement and a consistent follower thereof. [sent-117, score-0.307]

44 Therefore, we propose a clearer model based on a directed graph, as illustrated in Figure 1 (where nodes are represented by replacements r instead of the variables Xr, for readability). [sent-122, score-0.239]

45 Moreover, we introduce two dummy nodes, start and end, with an edge from start to each variable that corresponds to a prefix of the input sequence x, and an edge from each variable that corresponds to a suffix of x to end. [sent-124, score-0.189]

46 The principal advantage of modeling the dependencies in such a directed graph is that now, the legal assignments are in one-to-one correspondence with the paths from start to end; this is a straightforward observation that we do not prove here. [sent-125, score-0.201]

47 Figure 1: Example of a normalization graph; the nodes are replacements generated by the replacement generators, and every path from start to end implies a legal assignment. The model sets p(α | x, Θ) = 0 if α is not legal, and otherwise p(α | x, Θ) = (1/Z(x)) ∏_{X→Y ∈ α} exp(Σ_j θ_j φ_j(X, Y, x)). [sent-131, score-1.208]

48 By the construction in Section 3.2, a legal assignment α corresponds to a path from start to end. [sent-149, score-0.235]
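
Since legal assignments correspond exactly to start-to-end paths, decoding reduces to a best-path search over the DAG. Below is a minimal Viterbi sketch reusing the replacement triples above; the edge_logscore callback stands in for the learned sum of θ_j φ_j(X, Y, x):

```python
def decode(replacements, n, edge_logscore):
    # Viterbi over the normalization graph: the best start-to-end path is the
    # most probable legal assignment under the log-linear model above.
    # Assumes j > i for all replacements (so span order is topological) and
    # that end is reachable, e.g. because an identity generator covers x.
    start, end = (-1, 0, ()), (n, n + 1, ())        # dummy boundary nodes
    nodes = [start] + sorted(set(replacements)) + [end]
    best = {start: (0.0, None)}                     # node -> (log score, backpointer)
    for v in nodes[1:]:
        incoming = [(best[u][0] + edge_logscore(u, v), u)
                    for u in best if u[1] == v[0]]  # v is a consistent follower of u
        if incoming:
            best[v] = max(incoming)
    path, u = [], best[end][1]                      # backtrace, dropping the dummies
    while u is not None and u != start:
        path.append(u)
        u = best[u][1]
    return list(reversed(path))

# With all-zero weights every legal path ties; training sets the weights.
print(realize(decode(replacements, len(x), lambda u, v: 0.0)))
```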

49 4 Learning Our labeled data consists of pairs (xi, yi), where xi is an input sequence (to normalize) and yi is a (manually) normalized sequence. [sent-155, score-0.199]
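
The paper learns the parameters Θ of the log-linear model above; its exact procedure is not reproduced here. As a deliberately simplified stand-in, a structured-perceptron step over path features might look like this (hypothetical features function; reuses decode from the previous sketch):

```python
def perceptron_update(theta, x_tokens, gold_path, replacements, features, lr=1.0):
    # One structured-perceptron step: decode with current weights, then move
    # feature weights toward the gold path and away from the prediction.
    def logscore(u, v):
        return sum(theta.get(f, 0.0) * c
                   for f, c in features(u, v, x_tokens).items())
    predicted = decode(replacements, len(x_tokens), logscore)
    for path, sign in ((gold_path, +lr), (predicted, -lr)):
        for u, v in zip(path, path[1:]):            # boundary edges omitted for brevity
            for f, c in features(u, v, x_tokens).items():
                theta[f] = theta.get(f, 0.0) + sign * c
    return theta
```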

50 In particular, we describe our replacement generators and features. [sent-172, score-0.65]

51 1 Replacement Generators One advantage of our proposed model is that the reliance on replacement generators allows for strong flexibility. [sent-174, score-0.65]

52 Each generator can be seen as a black box, allowing replacements that are created heuristically, statistically, or by external tools to be incorporated within the same framework. [sent-175, score-0.264]

53 Table 1: Example replacement generators. To build a set of generic replacement generators suitable for normalizing a variety of data types, we collected a set of about 400 Twitter posts as development data. [sent-176, score-1.462]

54 Using that data, a series of generators was created; a sample of them is shown in Table 1. [sent-177, score-0.39]

55 As shown in the table, these generators cover a variety of normalization behavior, from changing non-standard word forms to inserting and deleting tokens. [sent-178, score-0.51]

56 Positional: Information from positions is used primarily to handle capitalization and punctuation insertion, for example, by incorporating features for capitalized words after stop punctuation or the insertion of stop punctuation at the end of the sentence. [sent-190, score-0.373]
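
A hedged sketch of what such positional feature functions could look like; the feature names and exact conditions are invented, not taken from the paper:

```python
STOP_PUNCT = {".", "!", "?"}

def positional_features(u, v, x_tokens):
    # Edge features over a replacement v = (i, j, s) and its predecessor u.
    feats = {}
    i, j, s = v
    prev_out = u[2][-1] if u[2] else None        # last output token of the predecessor
    if s and s[0][:1].isupper() and prev_out in STOP_PUNCT:
        feats["cap_after_stop_punct"] = 1.0      # capitalized word after ./!/?
    if j >= len(x_tokens) and s and s[-1] in STOP_PUNCT:
        feats["stop_punct_at_end"] = 1.0         # sentence-final stop punctuation
    return feats
```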

57 The goal is to evaluate the framework in two aspects: (1) usefulness for downstream applications (specifically dependency parsing), and (2) domain adaptability. [sent-194, score-0.19]

58 In this work, we aim to evaluate the performance of a normalizer based on how it affects the performance of downstream applications. [sent-198, score-0.262]

59 They also cannot take into account other aspects that may have an impact on downstream performance, such as the word reordering seen in the example in Figure 4. [sent-202, score-0.205]

60 Therefore, we propose a new evaluation metric that directly equates normalization performance with the performance of a common downstream application—dependency parsing. [sent-203, score-0.736]

61 First, we produce gold standard normalized data by manually normalizing sentences to their full grammatically correct form. [sent-205, score-0.276]

62 In addition to the word-to-word mapping performed in typical normalization gold standard generation, this annotation procedure includes all actions necessary to make the sentence grammatical, such as word reordering, modifying capitalization, and removing emoticons. [sent-206, score-0.602]

63 We then run an off-the-shelf dependency parser on the gold standard normalized data to produce our gold standard parses. [sent-207, score-0.281]

64 Table caption: SVOs identified on example test/gold text, and corresponding metric scores. To compare the parses produced over automatically normalized data to the gold standard, we look at the subjects, verbs, and objects (SVO) identified in each parse. [sent-209, score-0.186]

65 Note that SO denotes the set of identified subjects and objects, whereas SO_gold denotes the set of subjects and objects identified when parsing the gold-standard normalization. [sent-211, score-0.203]
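
A hypothetical rendering of this SVO-based metric in Python. The matching criterion (overlap of (lemma, role) pairs) and the spaCy-based extraction are assumptions; the paper does not prescribe this parser or these dependency labels:

```python
import spacy

def prf(pred: set, gold: set):
    # Precision/recall/F1 of predicted items against gold items.
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return p, r, (2 * p * r / (p + r) if p + r else 0.0)

def svo_scores(so, so_gold, verbs, verbs_gold):
    # Compare subjects-and-objects (SO) and verbs (V) against the sets
    # obtained by parsing the gold-standard normalization.
    return {"SO": prf(so, so_gold), "V": prf(verbs, verbs_gold)}

# Off-the-shelf dependency parser (assumes en_core_web_sm is installed).
nlp = spacy.load("en_core_web_sm")

def extract_svo(text: str):
    doc = nlp(text)
    so = ({(t.lemma_, "subj") for t in doc if t.dep_ in ("nsubj", "nsubjpass")}
          | {(t.lemma_, "obj") for t in doc if t.dep_ in ("dobj", "iobj")})
    verbs = {t.lemma_ for t in doc if t.pos_ == "VERB"}
    return so, verbs

so, v = extract_svo("I feel like I found a treasure.")
so_g, v_g = extract_svo("I feel like I have found a treasure!")
print(svo_scores(so, so_g, v, v_g))
```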

66 2 Results To establish the extensibility of our normalization system, we present results in three different domains: Twitter posts, Short Message Service (SMS) messages, and call-center logs. [sent-215, score-0.51]

67 In each case, we ran the proposed system with two different configurations: one using only the generic replacement generators presented in Section 4 (denoted as generic), and one that adds additional domain-specific generators for the corresponding domain (denoted as domain-specific). [sent-218, score-1.151]

68 We compare our system to the following baseline solutions: w/oN: No normalization is performed. [sent-227, score-0.51]

69 w2wN: The output of the word-to-word normalization of Han and Baldwin (2011). [sent-229, score-0.51]

70 To produce Twitter-specific generators, we examined the Twitter development data collected for generic generator production (Section 4). [sent-241, score-0.18]

71 These generators focused on the Twitter-specific notions of hashtags (#), ats (@), and retweets (RT). [sent-242, score-0.39]

72 For each case, we implemented generators that allowed for either the initial symbol or the entire token to be deleted (a sketch follows below). [sent-243, score-0.39]
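
A small sketch of what such Twitter-specific generators might look like, in the conventions of the earlier generator sketch (assumed behavior, inferred from the description above):

```python
def twitter_generator(x_tokens):
    # Hypothetical generator for hashtags (#), ats (@), and retweets (RT):
    # offer both "strip the symbol" and "delete the whole token".
    reps = []
    for i, tok in enumerate(x_tokens):
        if tok.lower() == "rt":
            reps.append((i, i + 1, ()))            # delete retweet marker
        elif tok.startswith(("#", "@")) and len(tok) > 1:
            reps.append((i, i + 1, (tok[1:],)))    # strip the leading symbol
            reps.append((i, i + 1, ()))            # or delete the whole token
    return reps
```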

73 As shown, the domain-specific generators yielded performance significantly above the generic ones and all baselines. [sent-248, score-0.499]

74 Even without domain-specific generators, our system outperformed the word-to-word normalization approaches. [sent-249, score-0.51]

75 These results validate the hypothesis that simple word-to-word normalization is insufficient if the goal of normalization is to improve dependency parsing; even if a system could produce perfect word-to-word normalization, it would produce lower quality parses than those produced by our approach. [sent-251, score-1.129]

76 As a replacement generator for SMS-specific substitutions, we used a mapping dictionary of SMS abbreviations. [sent-266, score-0.321]

77 Nonetheless, the trends on SMS data mirror those on Twitter data, with the domain-specific generators achieving the greatest overall performance. [sent-272, score-0.39]

78 However, while the generic setting still manages to outperform most baselines, it did not outperform the gold word-to-word normalization. [sent-273, score-0.201]

79 In fact, the gold word-to-word normalization was much more competitive on this data, outperforming even the domain-specific system on verbs alone. [sent-274, score-0.573]

80 This should not be seen as surprising, as word-to-word normalization is most likely to be beneficial for cases like this where the proportion of non-standard tokens is high. [sent-275, score-0.57]

81 The examination of call-center logs allows us to assess the ability of our system to perform normalization in more disparate domains. [sent-299, score-0.537]

82 However, the use of domain-specific generators once again led to significantly increased performance on subjects and objects. [sent-308, score-0.449]

83 6 Discussion The results presented in the previous section suggest that domain transfer using the proposed normalization framework is possible with only a small amount of effort. [sent-309, score-0.541]

84 The relatively modest set of additional replacement generators included in each data set allowed the domain-specific approaches to significantly outperform the generic approach. [sent-310, score-0.759]

85 These results establish a point that has often been assumed but, to the best of our knowledge, has never been explicitly shown: performing normalization is indeed beneficial to dependency parsing on informal text. [sent-316, score-0.666]

86 The parse of the normalized text was substantially better than the parse of the original raw text in all domains, with absolute performance increases ranging from about 18% to 25% on subjects and objects. [sent-317, score-0.175]

87 The proposed approach significantly outperforms the state-of-the-art word-to-word normalization approach. [sent-319, score-0.51]

88 This result gives strong evidence for the conclusion that parser-targeted normalization requires a broader understanding of the scope of the normalization task. [sent-321, score-1.054]

89 Although word reordering could be incorporated into the model as a combination of a deletion and an insertion, the model as currently devised cannot easily link these two replacements to one another. [sent-324, score-0.246]

90 As such, no reordering-based replacement generators were implemented in the presented system. [sent-326, score-0.65]

91 Similarly, punctuation insertion proved to be challenging, often requiring a deep analysis of the sentence. [sent-332, score-0.181]

92 7 Conclusions This work presents a framework for normalization with an eye towards domain adaptation. [sent-337, score-0.541]

93 The proposed framework builds a statistical model over a series of replacement generators. [sent-338, score-0.26]

94 Additionally, this work introduces a parser-centric view of normalization, in which the performance of the normalizer is directly tied to the performance of a downstream dependency parser. [sent-341, score-0.293]

95 This evaluation metric allows for a deeper understanding of how certain normalization actions impact the output of the parser. [sent-342, score-0.579]

96 Using this metric, this work established that, when dependency parsing is the goal, typical word-to-word normalization approaches are insufficient. [sent-343, score-0.618]

97 By taking a broader look at the normalization task, the approach presented here is able to outperform not only state-of-the-art word-to-word normalization approaches but also manual word-to-word annotations. [sent-344, score-1.083]

98 Although the work presented here established that more than word-to-word normalization was necessary to produce parser-ready normalizations, it remains unclear which specific normalization tasks are most critical to parser performance. [sent-345, score-1.173]

99 A hybrid rule/model-based finite-state framework for normalizing SMS messages. [sent-365, score-0.379]

100 A character-level machine translation approach for normalization of SMS abbreviations. [sent-432, score-0.51]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('normalization', 0.51), ('generators', 0.39), ('sms', 0.297), ('replacement', 0.26), ('replacements', 0.171), ('twitter', 0.151), ('downstream', 0.128), ('xr', 0.12), ('informal', 0.096), ('xi', 0.095), ('legal', 0.093), ('punctuation', 0.09), ('assignment', 0.089), ('ygiold', 0.085), ('normalizing', 0.082), ('generic', 0.08), ('normalizer', 0.076), ('pennell', 0.076), ('choudhury', 0.072), ('yunyao', 0.07), ('han', 0.069), ('giold', 0.064), ('hertz', 0.064), ('gold', 0.063), ('generator', 0.061), ('messages', 0.06), ('subjects', 0.059), ('insertion', 0.058), ('normalized', 0.055), ('path', 0.053), ('kobus', 0.052), ('normalizations', 0.052), ('baldwin', 0.051), ('sequence', 0.049), ('established', 0.048), ('spell', 0.047), ('follower', 0.047), ('reordering', 0.045), ('capitalization', 0.045), ('sproat', 0.045), ('feel', 0.043), ('instantiation', 0.043), ('bao', 0.043), ('beaufort', 0.043), ('benny', 0.043), ('kimelfeld', 0.043), ('reannotated', 0.043), ('treasure', 0.043), ('inspiration', 0.04), ('metric', 0.04), ('produce', 0.039), ('replaces', 0.038), ('directed', 0.038), ('edge', 0.038), ('graph', 0.038), ('deana', 0.038), ('pulls', 0.038), ('grammatically', 0.037), ('ritter', 0.037), ('message', 0.037), ('consistency', 0.037), ('unclear', 0.036), ('chiticariu', 0.035), ('fuliang', 0.035), ('monojit', 0.035), ('systemt', 0.035), ('wat', 0.035), ('broader', 0.034), ('cook', 0.034), ('ep', 0.034), ('ching', 0.033), ('proved', 0.033), ('start', 0.032), ('truth', 0.032), ('seen', 0.032), ('raw', 0.032), ('locally', 0.031), ('normalisation', 0.031), ('metrics', 0.031), ('dependency', 0.031), ('domain', 0.031), ('deletion', 0.03), ('variables', 0.03), ('liu', 0.03), ('parser', 0.03), ('customizable', 0.03), ('weng', 0.03), ('dataset', 0.03), ('parsing', 0.029), ('outperform', 0.029), ('actions', 0.029), ('xj', 0.029), ('performance', 0.029), ('afrl', 0.028), ('tokens', 0.028), ('normalize', 0.028), ('objects', 0.028), ('correction', 0.027), ('valued', 0.027), ('logs', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 37 acl-2013-Adaptive Parser-Centric Text Normalization

Author: Congle Zhang ; Tyler Baldwin ; Howard Ho ; Benny Kimelfeld ; Yunyao Li

Abstract: Text normalization is an important first step towards enabling many Natural Language Processing (NLP) tasks over informal text. While many of these tasks, such as parsing, perform the best over fully grammatically correct text, most existing text normalization approaches narrowly define the task in the word-to-word sense; that is, the task is seen as that of mapping all out-of-vocabulary non-standard words to their in-vocabulary standard forms. In this paper, we take a parser-centric view of normalization that aims to convert raw informal text into grammatically correct text. To understand the real effect of normalization on the parser, we tie normalization performance directly to parser performance. Additionally, we design a customizable framework to address the often overlooked concept of domain adaptability, and illustrate that the system allows for transfer to new domains with a minimal amount of data and effort. Our experimental study over datasets from three domains demonstrates that our approach outperforms not only the state-of-the-art word-to-word normalization techniques, but also manual word-to-word annotations.

2 0.44981983 326 acl-2013-Social Text Normalization using Contextual Graph Random Walks

Author: Hany Hassan ; Arul Menezes

Abstract: We introduce a social media text normalization system that can be deployed as a preprocessing step for Machine Translation and various NLP applications to handle social media text. The proposed system is based on unsupervised learning of the normalization equivalences from unlabeled text. The proposed approach uses Random Walks on a contextual similarity bipartite graph constructed from n-gram sequences on large unlabeled text corpus. We show that the proposed approach has a very high precision of (92.43) and a reasonable recall of (56.4). When used as a preprocessing step for a state-of-the-art machine translation system, the translation quality on social media text improved by 6%. The proposed approach is domain and language independent and can be deployed as a preprocessing step for any NLP application to handle social media text.

3 0.090281337 288 acl-2013-Punctuation Prediction with Transition-based Parsing

Author: Dongdong Zhang ; Shuangzhi Wu ; Nan Yang ; Mu Li

Abstract: Punctuations are not available in automatic speech recognition outputs, which could create barriers to many subsequent text processing tasks. This paper proposes a novel method to predict punctuation symbols for the stream of words in transcribed speech texts. Our method jointly performs parsing and punctuation prediction by integrating a rich set of syntactic features when processing words from left to right. It can exploit a global view to capture long-range dependencies for punctuation prediction with linear complexity. The experimental results on the test data sets of IWSLT and TDT4 show that our method can achieve high-level performance in punctuation prediction over the stream of words in transcribed speech text.

4 0.084797584 240 acl-2013-Microblogs as Parallel Corpora

Author: Wang Ling ; Guang Xiang ; Chris Dyer ; Alan Black ; Isabel Trancoso

Abstract: In the ever-expanding sea of microblog data, there is a surprising amount of naturally occurring parallel text: some users post multilingual messages targeting international audiences while others “retweet” translations. We present an efficient method for detecting these messages and extracting parallel segments from them. We have been able to extract over 1M Chinese-English parallel segments from Sina Weibo (the Chinese counterpart of Twitter) using only their public APIs. As a supplement to existing parallel training data, our automatically extracted parallel data yields substantial translation quality improvements in translating microblog text and modest improvements in translating edited news commentary. The resources described in this paper are available at http://www.cs.cmu.edu/∼lingwang/utopia.

5 0.084295951 115 acl-2013-Detecting Event-Related Links and Sentiments from Social Media Texts

Author: Alexandra Balahur ; Hristo Tanev

Abstract: Nowadays, the importance of Social Media is constantly growing, as people often use such platforms to share mainstream media news and comment on the events that they relate to. As such, people no longer remain mere spectators to the events that happen in the world, but become part of them, commenting on their developments and the entities involved, sharing their opinions and distributing related content. This paper describes a system that links the main events detected from clusters of newspaper articles to tweets related to them, detects complementary information sources from the links they contain and subsequently applies sentiment analysis to classify them into positive, negative and neutral. In this manner, readers can follow the main events happening in the world, both from the perspective of mainstream as well as social media and the public’s perception on them. This system will be part of the EMM media monitoring framework working live and it will be demonstrated using Google Earth.

6 0.078828052 221 acl-2013-Learning Non-linear Features for Machine Translation Using Gradient Boosting Machines

7 0.078620166 243 acl-2013-Mining Informal Language from Chinese Microtext: Joint Word Recognition and Segmentation

8 0.077603742 146 acl-2013-Exploiting Social Media for Natural Language Processing: Bridging the Gap between Language-centric and Real-world Applications

9 0.077470899 274 acl-2013-Parsing Graphs with Hyperedge Replacement Grammars

10 0.076227017 20 acl-2013-A Stacking-based Approach to Twitter User Geolocation Prediction

11 0.072808817 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation

12 0.062322617 62 acl-2013-Automatic Term Ambiguity Detection

13 0.060613599 251 acl-2013-Mr. MIRA: Open-Source Large-Margin Structured Learning on MapReduce

14 0.059564184 147 acl-2013-Exploiting Topic based Twitter Sentiment for Stock Prediction

15 0.056938656 101 acl-2013-Cut the noise: Mutually reinforcing reordering and alignments for improved machine translation

16 0.055787168 173 acl-2013-Graph-based Semi-Supervised Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging

17 0.05538458 164 acl-2013-FudanNLP: A Toolkit for Chinese Natural Language Processing

18 0.055365007 109 acl-2013-Decipherment Complexity in 1:1 Substitution Ciphers

19 0.054793723 319 acl-2013-Sequential Summarization: A New Application for Timely Updated Twitter Trending Topics

20 0.054336168 195 acl-2013-Improving machine translation by training against an automatic semantic frame based evaluation metric


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.187), (1, 0.004), (2, -0.004), (3, 0.023), (4, 0.045), (5, 0.01), (6, 0.055), (7, 0.043), (8, 0.06), (9, -0.061), (10, -0.106), (11, 0.042), (12, -0.007), (13, -0.103), (14, 0.0), (15, -0.05), (16, 0.027), (17, 0.014), (18, 0.002), (19, -0.014), (20, 0.042), (21, 0.052), (22, 0.085), (23, -0.054), (24, 0.006), (25, 0.077), (26, -0.012), (27, 0.07), (28, -0.042), (29, -0.039), (30, -0.053), (31, -0.073), (32, -0.156), (33, 0.167), (34, 0.149), (35, -0.105), (36, -0.112), (37, -0.05), (38, -0.081), (39, 0.064), (40, 0.091), (41, 0.076), (42, 0.134), (43, -0.187), (44, -0.126), (45, 0.062), (46, -0.087), (47, -0.045), (48, -0.029), (49, 0.169)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9405278 37 acl-2013-Adaptive Parser-Centric Text Normalization

Author: Congle Zhang ; Tyler Baldwin ; Howard Ho ; Benny Kimelfeld ; Yunyao Li

Abstract: Text normalization is an important first step towards enabling many Natural Language Processing (NLP) tasks over informal text. While many of these tasks, such as parsing, perform the best over fully grammatically correct text, most existing text normalization approaches narrowly define the task in the word-to-word sense; that is, the task is seen as that of mapping all out-of-vocabulary non-standard words to their in-vocabulary standard forms. In this paper, we take a parser-centric view of normalization that aims to convert raw informal text into grammatically correct text. To understand the real effect of normalization on the parser, we tie normalization performance directly to parser performance. Additionally, we design a customizable framework to address the often overlooked concept of domain adaptability, and illustrate that the system allows for transfer to new domains with a minimal amount of data and effort. Our experimental study over datasets from three domains demonstrates that our approach outperforms not only the state-of-the-art word-to-word normalization techniques, but also manual word-to-word annotations.

2 0.91587335 326 acl-2013-Social Text Normalization using Contextual Graph Random Walks

Author: Hany Hassan ; Arul Menezes

Abstract: We introduce a social media text normalization system that can be deployed as a preprocessing step for Machine Translation and various NLP applications to handle social media text. The proposed system is based on unsupervised learning of the normalization equivalences from unlabeled text. The proposed approach uses Random Walks on a contextual similarity bipartite graph constructed from n-gram sequences on large unlabeled text corpus. We show that the proposed approach has a very high precision of (92.43) and a reasonable recall of (56.4). When used as a preprocessing step for a state-of-the-art machine translation system, the translation quality on social media text improved by 6%. The proposed approach is domain and language independent and can be deployed as a preprocessing step for any NLP application to handle social media text.

3 0.52884334 1 acl-2013-"Let Everything Turn Well in Your Wife": Generation of Adult Humor Using Lexical Constraints

Author: Alessandro Valitutti ; Hannu Toivonen ; Antoine Doucet ; Jukka M. Toivanen

Abstract: We propose a method for automated generation of adult humor by lexical replacement and present empirical evaluation results of the obtained humor. We propose three types of lexical constraints as building blocks of humorous word substitution: constraints concerning the similarity of sounds or spellings of the original word and the substitute, a constraint requiring the substitute to be a taboo word, and constraints concerning the position and context of the replacement. Empirical evidence from extensive user studies indicates that these constraints can increase the effectiveness of humor generation significantly.

4 0.45844001 274 acl-2013-Parsing Graphs with Hyperedge Replacement Grammars

Author: David Chiang ; Jacob Andreas ; Daniel Bauer ; Karl Moritz Hermann ; Bevan Jones ; Kevin Knight

Abstract: Hyperedge replacement grammar (HRG) is a formalism for generating and transforming graphs that has potential applications in natural language understanding and generation. A recognition algorithm due to Lautemann is known to be polynomial-time for graphs that are connected and of bounded degree. We present a more precise characterization of the algorithm’s complexity, an optimization analogous to binarization of contextfree grammars, and some important implementation details, resulting in an algorithm that is practical for natural-language applications. The algorithm is part of Bolinas, a new software toolkit for HRG processing.

5 0.45096502 146 acl-2013-Exploiting Social Media for Natural Language Processing: Bridging the Gap between Language-centric and Real-world Applications

Author: Simone Paolo Ponzetto ; Andrea Zielinski

Abstract: unkown-abstract

6 0.43093762 371 acl-2013-Unsupervised joke generation from big data

7 0.42500779 65 acl-2013-BRAINSUP: Brainstorming Support for Creative Sentence Generation

8 0.39950326 174 acl-2013-Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation

9 0.39820066 308 acl-2013-Scalable Modified Kneser-Ney Language Model Estimation

10 0.39613202 89 acl-2013-Computerized Analysis of a Verbal Fluency Test

11 0.39304286 182 acl-2013-High-quality Training Data Selection using Latent Topics for Graph-based Semi-supervised Learning

12 0.37626991 324 acl-2013-Smatch: an Evaluation Metric for Semantic Feature Structures

13 0.37466335 149 acl-2013-Exploring Word Order Universals: a Probabilistic Graphical Model Approach

14 0.37466219 91 acl-2013-Connotation Lexicon: A Dash of Sentiment Beneath the Surface Meaning

15 0.37449843 293 acl-2013-Random Walk Factoid Annotation for Collective Discourse

16 0.37240842 114 acl-2013-Detecting Chronic Critics Based on Sentiment Polarity and Userâ•Žs Behavior in Social Media

17 0.36841512 48 acl-2013-An Open Source Toolkit for Quantitative Historical Linguistics

18 0.36681944 163 acl-2013-From Natural Language Specifications to Program Input Parsers

19 0.36094832 280 acl-2013-Plurality, Negation, and Quantification:Towards Comprehensive Quantifier Scope Disambiguation

20 0.35700577 301 acl-2013-Resolving Entity Morphs in Censored Data


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.047), (6, 0.032), (11, 0.044), (15, 0.014), (24, 0.041), (26, 0.065), (35, 0.058), (37, 0.011), (42, 0.048), (48, 0.034), (70, 0.035), (88, 0.015), (90, 0.02), (95, 0.461)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99469793 195 acl-2013-Improving machine translation by training against an automatic semantic frame based evaluation metric

Author: Chi-kiu Lo ; Karteek Addanki ; Markus Saers ; Dekai Wu

Abstract: We present the first ever results showing that tuning a machine translation system against a semantic frame based objective function, MEANT, produces more robustly adequate translations than tuning against BLEU or TER as measured across commonly used metrics and human subjective evaluation. Moreover, for informal web forum data, human evaluators preferred MEANT-tuned systems over BLEU- or TER-tuned systems by a significantly wider margin than that for formal newswire—even though automatic semantic parsing might be expected to fare worse on informal language. We argue that by preserving the meaning of the translations as captured by semantic frames right in the training process, an MT system is constrained to make more accurate choices of both lexical and reordering rules. As a result, MT systems tuned against semantic frame based MT evaluation metrics produce output that is more adequate. Tuning a machine translation system against a semantic frame based objective function is independent of the translation model paradigm, so any translation model can benefit from the semantic knowledge incorporated to improve translation adequacy through our approach.

2 0.99402153 256 acl-2013-Named Entity Recognition using Cross-lingual Resources: Arabic as an Example

Author: Kareem Darwish

Abstract: Some languages lack large knowledge bases and good discriminative features for Named Entity Recognition (NER) that can generalize to previously unseen named entities. One such language is Arabic, which: a) lacks a capitalization feature; and b) has relatively small knowledge bases, such as Wikipedia. In this work we address both problems by incorporating cross-lingual features and knowledge bases from English using cross-lingual links. We show that such features have a dramatic positive effect on recall. We show the effectiveness of cross-lingual features and resources on a standard dataset as well as on two new test sets that cover both news and microblogs. On the standard dataset, we achieved a 4.1% relative improvement in F-measure over the best reported result in the literature. The features led to improvements of 17.1% and 20.5% on the new news and microblogs test sets respectively.

3 0.99194676 359 acl-2013-Translating Dialectal Arabic to English

Author: Hassan Sajjad ; Kareem Darwish ; Yonatan Belinkov

Abstract: We present a dialectal Egyptian Arabic to English statistical machine translation system that leverages dialectal to Modern Standard Arabic (MSA) adaptation. In contrast to previous work, we first narrow down the gap between Egyptian and MSA by applying an automatic character-level transformational model that changes Egyptian to EG0, which looks similar to MSA. The transformations include morphological, phonological and spelling changes. The transformation reduces the out-of-vocabulary (OOV) words from 5.2% to 2.6% and gives a gain of 1.87 BLEU points. Further, adapting large MSA/English parallel data increases the lexical coverage, reduces OOVs to 0.7% and leads to an absolute BLEU improvement of 2.73 points.

4 0.98838508 336 acl-2013-Syntactic Patterns versus Word Alignment: Extracting Opinion Targets from Online Reviews

Author: Kang Liu ; Liheng Xu ; Jun Zhao

Abstract: Mining opinion targets is a fundamental and important task for opinion mining from online reviews. To this end, there are usually two kinds of methods: syntax based and alignment based methods. Syntax based methods usually exploited syntactic patterns to extract opinion targets, which were however prone to suffer from parsing errors when dealing with online informal texts. In contrast, alignment based methods used word alignment model to fulfill this task, which could avoid parsing errors without using parsing. However, there is no research focusing on which kind of method is better when given a certain amount of reviews. To fill this gap, this paper empirically studies how the performance of these two kinds of methods vary when changing the size, domain and language of the corpus. We further combine syntactic patterns with alignment model by using a partially supervised framework and investigate whether this combination is useful or not. In our experiments, we verify that our combination is effective on the corpus with small and medium size.

5 0.98285192 217 acl-2013-Latent Semantic Matching: Application to Cross-language Text Categorization without Alignment Information

Author: Tsutomu Hirao ; Tomoharu Iwata ; Masaaki Nagata

Abstract: Unsupervised object matching (UOM) is a promising approach to cross-language natural language processing such as bilingual lexicon acquisition, parallel corpus construction, and cross-language text categorization, because it does not require labor-intensive linguistic resources. However, UOM only finds one-to-one correspondences from data sets with the same number of instances in source and target domains, and this prevents us from applying UOM to real-world cross-language natural language processing tasks. To alleviate these limitations, we propose latent semantic matching, which embeds objects in both source and target language domains into a shared latent topic space. We demonstrate the effectiveness of our method on cross-language text categorization. The results show that our method outperforms conventional unsupervised object matching methods.

same-paper 6 0.98027241 37 acl-2013-Adaptive Parser-Centric Text Normalization

7 0.96523964 66 acl-2013-Beam Search for Solving Substitution Ciphers

8 0.96360075 162 acl-2013-FrameNet on the Way to Babel: Creating a Bilingual FrameNet Using Wiktionary as Interlingual Connection

9 0.94473606 289 acl-2013-QuEst - A translation quality estimation framework

10 0.8867051 374 acl-2013-Using Context Vectors in Improving a Machine Translation System with Bridge Language

11 0.88188082 135 acl-2013-English-to-Russian MT evaluation campaign

12 0.87762803 255 acl-2013-Name-aware Machine Translation

13 0.86974537 317 acl-2013-Sentence Level Dialect Identification in Arabic

14 0.85832644 326 acl-2013-Social Text Normalization using Contextual Graph Random Walks

15 0.85183161 5 acl-2013-A Decade of Automatic Content Evaluation of News Summaries: Reassessing the State of the Art

16 0.85127807 214 acl-2013-Language Independent Connectivity Strength Features for Phrase Pivot Statistical Machine Translation

17 0.8468141 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl

18 0.84456235 240 acl-2013-Microblogs as Parallel Corpora

19 0.8438186 13 acl-2013-A New Syntactic Metric for Evaluation of Machine Translation

20 0.82903051 97 acl-2013-Cross-lingual Projections between Languages from Different Families