emnlp emnlp2013 emnlp2013-204 knowledge-graph by maker-knowledge-mining

204 emnlp-2013-Word Level Language Identification in Online Multilingual Communication


Source: pdf

Author: Dong Nguyen ; A. Seza Dogruoz

Abstract: Multilingual speakers switch between languages in online and spoken communication. Analyses of large scale multilingual data require automatic language identification at the word level. For our experiments with multilingual online discussions, we first tag the language of individual words using language models and dictionaries. Secondly, we incorporate context to improve the performance. We achieve an accuracy of 98%. Besides word level accuracy, we use two new metrics to evaluate this task.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 A. Seza Doğruöz23 (1) Human Media Interaction, University of Twente, Enschede, The Netherlands (2) Tilburg School of Humanities, Tilburg University, Tilburg, The Netherlands (3) Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA dong . [sent-2, score-0.037]

2 Abstract Multilingual speakers switch between languages in online and spoken communication. [sent-5, score-0.391]

3 Analyses of large scale multilingual data require automatic language identification at the word level. [sent-6, score-0.362]

4 For our experiments with multilingual online discussions, we first tag the language of individual words using language models and dictionaries. [sent-7, score-0.316]

5 Besides word level accuracy, we use two new metrics to evaluate this task. [sent-10, score-0.07]

6 1 Introduction There are more multilingual speakers in the world than monolingual speakers (Auer and Wei, 2007). [sent-11, score-0.442]

7 Multilingual speakers switch across languages in daily communication (Auer, 1999). [sent-12, score-0.435]

8 With the increasing use of social media, multilingual speakers also communicate with each other in online environments (Paolillo, 2011). [sent-13, score-0.417]

9 Data from such resources can be used to study code switching patterns and language preferences in online multilingual conversations. [sent-14, score-0.318]

10 Although most studies on multilingual online communication rely on manual identification of languages in relatively small datasets (Danet and Herring, 2007; Androutsopoulos, 2007), there is a growing demand for automatic language identification in larger datasets. [sent-15, score-0.817]

11 Such a system would also be useful for selecting the right parsers to process multilingual documents and to build language resources for minority languages (King and Abney, 2013). [sent-16, score-0.303]

12 In this paper, we identify Dutch (NL) and Turkish (TR) at the word level in a large online forum for Turkish-Dutch speakers living in the Netherlands. [sent-20, score-0.309]

13 The users in the forum frequently switch languages within posts, for example: “Sariyi ver Wel mooi doelpunt”. So far, language identification has mostly been modeled as a document classification problem. [sent-21, score-0.534]

14 Most approaches rely on character or byte n-grams, by comparing n-gram profiles (Cavnar and Trenkle, 1994), or using various machine learning classifiers. [sent-22, score-0.091]
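To make the document-level baseline concrete, the following is a minimal sketch of rank-based n-gram profile comparison in the spirit of Cavnar and Trenkle (1994); the profile size, n-gram orders, and function names are illustrative rather than taken from the paper.

```python
# Rank-based character n-gram profiles compared with the "out-of-place"
# measure; a document is assigned the language with the smallest distance.
from collections import Counter

def ngram_profile(text, n_max=3, top_k=300):
    """Return the top_k most frequent character n-grams (n = 1..n_max), ranked."""
    counts = Counter()
    for n in range(1, n_max + 1):
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in counts.most_common(top_k)]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank displacements; unseen n-grams get the maximum penalty."""
    ranks = {g: r for r, g in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    score = 0
    for r, g in enumerate(doc_profile):
        score += abs(r - ranks[g]) if g in ranks else max_penalty
    return score

# Identification: min over languages of
# out_of_place(ngram_profile(doc), ngram_profile(lang_corpus)).
```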

15 While McNamee (2005) argues that language identification is a solved problem, classification on a more fine-grained level (instead of document level) remains a challenge (Hughes et al. [sent-23, score-0.238]

16 Furthermore, language identification is more difficult for short texts (Baldwin and Lui, 2010; Vatanen et al. [sent-25, score-0.16]

17 Tagging individual words (without context) has been done using dictionaries, affix statistics and classifiers using character n-grams (Hammarström, 2007; Gottron and Lipka, 2010). [sent-29, score-0.086]

18 Although Yamaguchi and Tanaka-Ishii (2012) segmented text by language, their data was artificially created by randomly sampling and concatenating text segments (40-160 characters) from monolingual texts. [sent-30, score-0.096]

19 Therefore, the language switches do not reflect realistic switches as they occur in natural texts. [sent-31, score-0.22]

20 Most related to ours is the work by King and Abney (2013) who labeled languages of words in multilingual web pages, but evaluated the task only using word level accuracy. [sent-32, score-0.34]

21 2 Corpus Our data1 comes from one of the largest online communities in The Netherlands for Turkish-Dutch speakers. [sent-39, score-0.075]

22 All posts from May 2006 until October 2012 were crawled. [sent-40, score-0.267]

23 Examples 1 and 2 illustrate switches between Dutch and Turkish within the same post. [sent-45, score-0.11]

24 Example 1 is a switch at sentence level, example 2 is a switch at word level. [sent-46, score-0.25]

25 hotttt), replacement of Turkish characters (kahvalti instead of kahvaltı) and spelling variations (tankyu instead of thank you). [sent-49, score-0.099]

26 (ben is “am” in Dutch and “I” in Turkish), making this a challenging task. [sent-52, score-0.037]

27 Since Dutch and English are typologically more similar to each other than to Turkish, the English phrases (less than 1%) are classified as Dutch. [sent-57, score-0.092]

28 A native Dutch speaker annotated a random set of 100 posts (Cohen’s kappa = 0. [sent-59, score-0.267]

29 The following tokens were ignored for language identification: smileys (as part of the forum markup, as well as textual smileys such as “:)”). [sent-61, score-0.248]

30 Posts for which all tokens are ignored are not included in the corpus. [sent-70, score-0.086]

31 posts) are short, with on average 18 tokens per document. [sent-76, score-0.086]

32 The data represents realistic texts found in online multilingual communication. [sent-77, score-0.277]

33 1 Training Corpora We used the following corpora to extract dictionaries and language models. [sent-80, score-0.118]

34 Each corpus was chunked into large segments which were then selected randomly until 5M tokens were obtained for each language. [sent-86, score-0.122]

35 TextCat is based on the comparison of n-gram profiles and langid. [sent-91, score-0.044]

36 Words for which no language could be determined were assigned to Dutch. [sent-94, score-0.038]

37 These models were developed to identify the languages of the documents instead of words and we did not retrain them. [sent-95, score-0.101]

38 3 Models We start with models that assign languages based on only the current word. [sent-98, score-0.101]

39 Words with the highest probability for English were assigned to Dutch for evaluation. [sent-100, score-0.038]

40 Dictionary lookup (DICT) We extract dictionaries with word frequencies from the training corpora. [sent-101, score-0.158]

41 This approach looks up the words in the dictionaries and chooses the language for which the word has the highest probability. [sent-102, score-0.118]
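A minimal sketch of how such a dictionary lookup could work, assuming per-language word-frequency dictionaries extracted from the training corpora; the data layout and the "nl"/"tr" language codes are our own.

```python
# Dictionary lookup (DICT) sketch: pick the language in which the word has
# the highest relative frequency; None if the word is in neither dictionary.
def dict_lookup(word, dictionaries):
    """dictionaries: {'nl': {word: count}, 'tr': {word: count}}."""
    totals = {lang: sum(d.values()) for lang, d in dictionaries.items()}
    best_lang, best_p = None, 0.0
    for lang, d in dictionaries.items():
        p = d.get(word, 0) / totals[lang]   # relative frequency of the word
        if p > best_p:
            best_lang, best_p = lang, p
    return best_lang
```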

42 Language model (LM) We build a character n-gram language model for each language (max. [sent-109, score-0.047]
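A sketch of such a character n-gram language model; the order n = 3 and add-one smoothing are our assumptions, since the extracted text truncates the maximum n-gram order and does not name a smoothing method. A word is tagged with the language whose model assigns the highest log probability.

```python
# Character n-gram language model with padding and add-one smoothing.
import math
from collections import Counter

class CharNgramLM:
    def __init__(self, corpus, n=3):
        self.n = n
        self.ngrams, self.contexts = Counter(), Counter()
        padded = " " * (n - 1) + corpus
        for i in range(len(padded) - n + 1):
            self.ngrams[padded[i:i + n]] += 1
            self.contexts[padded[i:i + n - 1]] += 1
        self.vocab = len(set(corpus)) + 1          # smoothing denominator

    def log_prob(self, word):
        padded = " " * (self.n - 1) + word
        return sum(
            math.log((self.ngrams[padded[i:i + self.n]] + 1) /
                     (self.contexts[padded[i:i + self.n - 1]] + self.vocab))
            for i in range(len(padded) - self.n + 1))
```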

43 Dictionary + Language model (DICT+LM) We first use the dictionary lookup approach (DICT). [sent-112, score-0.091]
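As we read the DICT+LM combination, dictionary lookup is tried first and the character language models act as a back-off for words the dictionaries do not cover; the exact back-off criterion is an assumption. Reusing the two sketches above:

```python
# DICT first, character LM as a back-off for out-of-dictionary words.
def dict_plus_lm(word, dictionaries, lms):
    label = dict_lookup(word, dictionaries)       # from the DICT sketch above
    if label is not None:
        return label
    return max(lms, key=lambda lang: lms[lang].log_prob(word))
```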

44 Logistic Regression (LR) We use a logistic regression model that incorporates context with the following features: (Individual word) Label assigned by the DICT+LM model. [sent-114, score-0.125]

45 the sequence “ben thuis” (am home) as a whole if ben is the current token). [sent-117, score-0.037]

46 We compare the use of the assigned labels (LAB) with the use of the log probability values (PROB) as feature values. [sent-119, score-0.038]
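A hypothetical feature extractor for this LAB/PROB comparison: the DICT+LM label or the per-language log probabilities of the current token and its neighbours become features. The window size and feature names are our own, not the paper's.

```python
# Context features for the logistic regression model (LAB vs. PROB modes).
def lr_features(tokens, i, dictlm_labels, dictlm_logprobs, mode="PROB"):
    feats = {}
    for offset in (-1, 0, 1):                     # context window, illustrative
        j = i + offset
        if 0 <= j < len(tokens):
            if mode == "LAB":
                feats[f"label[{offset}]={dictlm_labels[j]}"] = 1.0
            else:
                for lang, lp in dictlm_logprobs[j].items():
                    feats[f"logp[{offset}]:{lang}"] = lp
    return feats
```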

47 , 2001) in three settings: (Individual word) A CRF with only the tags assigned by the DICT+LM to the individual tokens as a feature (BASE). [sent-121, score-0.038]

48 A CRF with features (the same features as LAB and PROB in the logistic regression model) to capture additional context. [sent-124, score-0.087]

49 4 Implementation Language identification was not performed for texts within quotes. [sent-126, score-0.16]

50 lolllll), words are normalized by trimming same character sequences of three characters or more. [sent-129, score-0.084]
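This lengthening normalization can be done with a single regular expression; trimming runs to exactly two characters is an assumption, as the extracted text only says that sequences of three or more identical characters are trimmed.

```python
# Trim runs of three or more identical characters (to two, by assumption).
import re

def normalize(word):
    return re.sub(r"(.)\1{2,}", r"\1\1", word)

# normalize("lolllll") -> "loll", normalize("hotttt") -> "hott"
```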

51 5 Evaluation The assigned labels can be used for computational analysis of multilingual data in different ways. [sent-134, score-0.24]

52 For example, these labels can be used to analyze language preferences in multilingual communication or the direction of the switches (from Turkish to Dutch or the other way around). [sent-135, score-0.472]

53 The evaluation at word and post levels is done with the following metrics: • Word classification precision (P), recall (R) and accuracy. [sent-137, score-0.196]

54 Although this is the most straightforward approach to evaluate the task, it ignores the document boundaries. [sent-138, score-0.032]

55 This evaluates the measured proportion of languages in a post when the actual tags for individual words are not needed. [sent-140, score-0.295]

56 For example, such information is useful for analyzing the language preferences of users in the online forum. [sent-141, score-0.149]

57 Besides reporting the MAE over all posts, we also separate the performance over monolingual and bilingual posts (BL). [sent-142, score-0.435]
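A sketch of the MAE metric as we understand it: the absolute difference between the predicted and gold per-post proportion of one language, averaged over posts. The language code is illustrative.

```python
# MAE over per-post language proportions.
def mae(gold_posts, pred_posts, lang="tr"):
    def frac(labels):
        return sum(1 for l in labels if l == lang) / len(labels)
    errors = [abs(frac(g) - frac(p)) for g, p in zip(gold_posts, pred_posts)]
    return sum(errors) / len(errors)
```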

58 Post classification: Durham (2003) analyzed the switches between languages in terms of the amount of monolingual and bilingual posts. [sent-143, score-0.269]

59 Our posts are classified as NL or TR if all words are tagged in the particular language, and as bilingual (BL) if both occur. [sent-144, score-0.45]

60 Significance tests were done by comparing the results of the word and post classification measures using McNemar’s test, and comparing the MAEs using paired t-tests. [sent-147, score-0.196]
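A hedged sketch of this testing setup using SciPy and statsmodels (the paper does not name its tooling): McNemar's test on paired per-word decisions and a paired t-test on per-post absolute errors.

```python
# Compare two runs: McNemar on agreement/disagreement, paired t-test on MAEs.
import numpy as np
from scipy.stats import ttest_rel
from statsmodels.stats.contingency_tables import mcnemar

def compare_runs(correct_a, correct_b, errors_a, errors_b):
    """correct_*: per-word booleans; errors_*: per-post absolute errors."""
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    table = [[np.sum(a & b),  np.sum(a & ~b)],
             [np.sum(~a & b), np.sum(~a & ~b)]]
    p_mcnemar = mcnemar(table, exact=False).pvalue
    p_ttest = ttest_rel(errors_a, errors_b).pvalue
    return p_mcnemar, p_ttest
```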

61 All runs were significantly different from each other based on these tests (p < 0.05), [sent-148, score-0.04]

62 except the MAEs of the DICT+LM and LR+LAB runs and the MAEs and post classification metrics between the CRFs runs. [sent-149, score-0.269]

63 The difficulty of the task is illustrated by examining the coverage of the tokens by the dictionaries. [sent-150, score-0.086]

64 6% of the tokens (dev + test set) appear in both dictionaries, 31. [sent-152, score-0.086]

65 This confirms that language identification at the word level needs different approaches than identification at the document level. [sent-157, score-0.357]

66 The combination of language models and dictionaries is more effective than the individual models. [sent-160, score-0.157]

67 The results improved when context was added using a logistic regression model, especially with the probability values as feature values. [sent-161, score-0.087]

68 More specifically, CRFs improve the performance on monolingual posts, especially when a single word is tagged in the wrong language. [sent-163, score-0.093]

69 However, when the influence of the context is too high, CRFs reduce the performance in bilingual posts. [sent-164, score-0.108]

70 This is also illustrated with the results of the post classification. [sent-165, score-0.155]

71 559) for bilingual posts, while the CRF+PROB approach has a low recall (0. [sent-168, score-0.108]

72 The fraction of Dutch and Turkish in posts varies widely, providing additional challenges to the use of CRFs for this task. [sent-171, score-0.267]

73 Classifying posts first as monolingual/bilingual and tagging individual words afterwards for bilingual posts might improve the performance. [sent-172, score-0.681]

74 The evaluation metrics highlight different aspects of the task whereas word level accuracy gives a limited view. [sent-173, score-0.07]

75 We suggest using multiple metrics to evaluate this task for future research. [sent-174, score-0.033]

76 Dictionaries versus Language Models The results reported in Table 2 were obtained by sampling 5M tokens of each language. [sent-175, score-0.086]

77 To study the effect of the number of tokens on the performance of the DICT and LM runs, we vary the amount of data. [sent-176, score-0.086]

78 This is probably due to the highly informal and noisy nature of our data. [sent-179, score-0.065]

79 Figure 1: Effect of sampling size. Post classification We experimented with classifying posts into TR, NL and bilingual using the results of the word level language identification (Table 2: post classification). [sent-184, score-0.768]

80 Posts were classified as a particular language if all words were tagged as belonging to that language, and bilingual otherwise. [sent-185, score-0.183]

81 (A margin of 0.10 classifies posts as TR if at least 90% of the words are classified as TR.) [sent-190, score-0.309]
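A sketch of this margin-based post classification as we read Table 3's setup; a margin of 0.0 recovers the strict all-words rule described earlier, and the label strings are our own.

```python
# Label a post TR or NL if at least (1 - margin) of its word labels agree,
# bilingual otherwise; margin=0.0 is the strict all-words rule.
def classify_post(word_labels, margin=0.10):
    n = len(word_labels)
    for lang in ("tr", "nl"):
        if sum(1 for l in word_labels if l == lang) >= (1 - margin) * n:
            return lang
    return "bl"
```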

82 Allowing a small margin already improves the results of simpler approaches (such as the LR-PROB run, Table 3) by making them more robust against errors. [sent-191, score-0.054]

83 However, allowing a margin reduces the performance of the CRF runs. [sent-192, score-0.054]

84 Table 3: Effect of margin on post classification (LR-PROB run) Error analysis The manual analysis of the results revealed three main challenges: 1) Our data is highly informal with many spelling variations (e. [sent-203, score-0.377]

85 asdfghjfgshahaha) 2) Words sharing spelling in Dutch and Turkish are difficult to identify especially when there is no context available (e. [sent-207, score-0.062]

86 For example, the word super in “Seyma, super” is annotated as Turkish since Seyma is also a Turkish word. [sent-211, score-0.042]

87 Based on precompiled lists, our system ignores named entities. [sent-213, score-0.032]

88 5 Conclusion We presented experiments on identifying the language of individual words in multilingual conversational data. [sent-217, score-0.291]

89 Our results reveal that language models are more robust than dictionaries and adding context improves the performance. [sent-218, score-0.118]

90 We evaluate our methods from different perspectives based on how language identification at word level can be used to analyze multilingual data. [sent-219, score-0.399]

91 The highly informal spelling in online environments and the occurrences of named entities pose challenges. [sent-220, score-0.252]

92 Future work could focus on cases with more than two languages, and languages that are typologically less distinct from each other or dialects (Trieschnigg et al. [sent-221, score-0.151]

93 Language, Culture and communication online, chapter Language choice and code-switching in German-based diasporic web-forums. [sent-230, score-0.119]

94 From codeswitching via language mixing to fused lects: toward a dynamic typology of bilingual speech. [sent-244, score-0.171]

95 A comparison of language identification approaches on short, query-style texts. [sent-298, score-0.16]

96 Labeling the languages of words in mixed-language documents using weakly supervised methods. [sent-318, score-0.101]

97 An exploration of language identification techniques for the Dutch folktale database. [sent-396, score-0.16]

98 Language identification of short text segments with n-gram models. [sent-404, score-0.196]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('turkish', 0.419), ('dutch', 0.351), ('posts', 0.267), ('multilingual', 0.202), ('identification', 0.16), ('post', 0.155), ('switch', 0.125), ('communication', 0.119), ('dictionaries', 0.118), ('tilburg', 0.11), ('switches', 0.11), ('ict', 0.11), ('prob', 0.11), ('bilingual', 0.108), ('forum', 0.107), ('crfs', 0.106), ('nl', 0.104), ('lm', 0.103), ('languages', 0.101), ('auer', 0.094), ('cavnar', 0.094), ('lui', 0.094), ('trieschnigg', 0.094), ('speakers', 0.09), ('tokens', 0.086), ('netherlands', 0.084), ('tr', 0.082), ('yamaguchi', 0.082), ('maes', 0.082), ('online', 0.075), ('abney', 0.075), ('baldwin', 0.071), ('bl', 0.071), ('mae', 0.07), ('king', 0.069), ('informal', 0.065), ('ceylan', 0.063), ('codeswitching', 0.063), ('danet', 0.063), ('gottron', 0.063), ('kahvalti', 0.063), ('multilingualism', 0.063), ('pedregosa', 0.063), ('seyma', 0.063), ('textcat', 0.063), ('thuis', 0.063), ('trenkle', 0.063), ('usernames', 0.063), ('vatanen', 0.063), ('spelling', 0.062), ('monolingual', 0.06), ('smileys', 0.055), ('margin', 0.054), ('dictionary', 0.051), ('conversational', 0.05), ('oxford', 0.05), ('dict', 0.05), ('hughes', 0.05), ('environments', 0.05), ('mcnamee', 0.05), ('typologically', 0.05), ('carter', 0.05), ('humanities', 0.05), ('theune', 0.05), ('sak', 0.05), ('internet', 0.049), ('lr', 0.049), ('logistic', 0.049), ('character', 0.047), ('fer', 0.047), ('lengthening', 0.047), ('hammarstr', 0.047), ('crf', 0.046), ('lab', 0.045), ('culture', 0.044), ('schler', 0.044), ('profiles', 0.044), ('androutsopoulos', 0.044), ('classified', 0.042), ('lrec', 0.042), ('home', 0.042), ('super', 0.042), ('preferences', 0.041), ('classification', 0.041), ('runs', 0.04), ('lookup', 0.04), ('individual', 0.039), ('regression', 0.038), ('assigned', 0.038), ('characters', 0.037), ('level', 0.037), ('dong', 0.037), ('secondly', 0.037), ('ben', 0.037), ('segments', 0.036), ('tagged', 0.033), ('metrics', 0.033), ('analyzing', 0.033), ('bergsma', 0.033), ('ignores', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 204 emnlp-2013-Word Level Language Identification in Online Multilingual Communication

Author: Dong Nguyen ; A. Seza Dogruoz

Abstract: Multilingual speakers switch between languages in online and spoken communication. Analyses of large scale multilingual data require automatic language identification at the word level. For our experiments with multilingual online discussions, we first tag the language of individual words using language models and dictionaries. Secondly, we incorporate context to improve the performance. We achieve an accuracy of 98%. Besides word level accuracy, we use two new metrics to evaluate this task.

2 0.18930431 130 emnlp-2013-Microblog Entity Linking by Leveraging Extra Posts

Author: Yuhang Guo ; Bing Qin ; Ting Liu ; Sheng Li

Abstract: Linking name mentions in microblog posts to a knowledge base, namely microblog entity linking, is useful for text mining tasks on microblog. Entity linking in long text has been well studied in previous works. However few work has focused on short text such as microblog post. Microblog posts are short and noisy. Previous method can extract few features from the post context. In this paper we propose to use extra posts for the microblog entity linking task. Experimental results show that our proposed method significantly improves the linking accuracy over traditional methods by 8.3% and 7.5% respectively.

3 0.14206856 72 emnlp-2013-Elephant: Sequence Labeling for Word and Sentence Segmentation

Author: Kilian Evang ; Valerio Basile ; Grzegorz Chrupala ; Johan Bos

Abstract: Tokenization is widely regarded as a solved problem due to the high accuracy that rule-based tokenizers achieve. But rule-based tokenizers are hard to maintain and their rules are language specific. We show that high-accuracy word and sentence segmentation can be achieved by using supervised sequence labeling on the character level combined with unsupervised feature learning. We evaluated our method on three languages and obtained error rates of 0.27 ‰ (English), 0.35 ‰ (Dutch) and 0.76 ‰ (Italian) for our best models.

4 0.14069016 4 emnlp-2013-A Dataset for Research on Short-Text Conversations

Author: Hao Wang ; Zhengdong Lu ; Hang Li ; Enhong Chen

Abstract: Natural language conversation is widely regarded as a highly difficult problem, which is usually attacked with either rule-based or learning-based models. In this paper we propose a retrieval-based automatic response model for short-text conversation, to exploit the vast amount of short conversation instances available on social media. For this purpose we introduce a dataset of short-text conversation based on the real-world instances from Sina Weibo (a popular Chinese microblog service), which will be soon released to public. This dataset provides rich collection of instances for the research on finding natural and relevant short responses to a given short text, and useful for both training and testing of conversation models. This dataset consists of both naturally formed conversations, manually labeled data, and a large repository of candidate responses. Our preliminary experiments demonstrate that the simple retrieval-based conversation model performs reasonably well when combined with the rich instances in our dataset.

5 0.10388272 46 emnlp-2013-Classifying Message Board Posts with an Extracted Lexicon of Patient Attributes

Author: Ruihong Huang ; Ellen Riloff

Abstract: The goal of our research is to distinguish veterinary message board posts that describe a case involving a specific patient from posts that ask a general question. We create a text classifier that incorporates automatically generated attribute lists for veterinary patients to tackle this problem. Using a small amount of annotated data, we train an information extraction (IE) system to identify veterinary patient attributes. We then apply the IE system to a large collection of unannotated texts to produce a lexicon of veterinary patient attribute terms. Our experimental results show that using the learned attribute lists to encode patient information in the text classifier yields improved performance on this task.

6 0.09180674 169 emnlp-2013-Semi-Supervised Representation Learning for Cross-Lingual Text Classification

7 0.087133497 89 emnlp-2013-Gender Inference of Twitter Users in Non-English Contexts

8 0.074299924 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization

9 0.074206814 42 emnlp-2013-Building Specialized Bilingual Lexicons Using Large Scale Background Knowledge

10 0.067682274 96 emnlp-2013-Identifying Phrasal Verbs Using Many Bilingual Corpora

11 0.065126911 13 emnlp-2013-A Study on Bootstrapping Bilingual Vector Spaces from Non-Parallel Data (and Nothing Else)

12 0.062502205 27 emnlp-2013-Authorship Attribution of Micro-Messages

13 0.059450805 136 emnlp-2013-Multi-Domain Adaptation for SMT Using Multi-Task Learning

14 0.05890974 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging

15 0.058448169 38 emnlp-2013-Bilingual Word Embeddings for Phrase-Based Machine Translation

16 0.057533264 32 emnlp-2013-Automatic Idiom Identification in Wiktionary

17 0.056015074 70 emnlp-2013-Efficient Higher-Order CRFs for Morphological Tagging

18 0.055563323 37 emnlp-2013-Automatically Identifying Pseudepigraphic Texts

19 0.054525822 9 emnlp-2013-A Log-Linear Model for Unsupervised Text Normalization

20 0.053787 132 emnlp-2013-Mining Scientific Terms and their Definitions: A Study of the ACL Anthology


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.177), (1, -0.008), (2, -0.057), (3, -0.096), (4, -0.034), (5, -0.046), (6, 0.036), (7, 0.155), (8, 0.015), (9, -0.114), (10, -0.028), (11, 0.128), (12, 0.16), (13, 0.075), (14, -0.127), (15, -0.001), (16, -0.059), (17, 0.033), (18, -0.273), (19, 0.017), (20, 0.174), (21, 0.091), (22, 0.038), (23, -0.06), (24, 0.027), (25, -0.009), (26, 0.072), (27, -0.01), (28, -0.004), (29, -0.022), (30, -0.081), (31, 0.015), (32, -0.109), (33, -0.118), (34, -0.172), (35, -0.05), (36, 0.002), (37, 0.033), (38, 0.037), (39, 0.045), (40, -0.014), (41, 0.007), (42, -0.023), (43, 0.104), (44, 0.057), (45, 0.019), (46, 0.042), (47, -0.074), (48, 0.059), (49, 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95157945 204 emnlp-2013-Word Level Language Identification in Online Multilingual Communication

Author: Dong Nguyen ; A. Seza Dogruoz

Abstract: Multilingual speakers switch between languages in online and spoken communication. Analyses of large scale multilingual data require automatic language identification at the word level. For our experiments with multilingual online discussions, we first tag the language of individual words using language models and dictionaries. Secondly, we incorporate context to improve the performance. We achieve an accuracy of 98%. Besides word level accuracy, we use two new metrics to evaluate this task.

2 0.77512443 46 emnlp-2013-Classifying Message Board Posts with an Extracted Lexicon of Patient Attributes

Author: Ruihong Huang ; Ellen Riloff

Abstract: The goal of our research is to distinguish veterinary message board posts that describe a case involving a specific patient from posts that ask a general question. We create a text classifier that incorporates automatically generated attribute lists for veterinary patients to tackle this problem. Using a small amount of annotated data, we train an information extraction (IE) system to identify veterinary patient attributes. We then apply the IE system to a large collection of unannotated texts to produce a lexicon of veterinary patient attribute terms. Our experimental results show that using the learned attribute lists to encode patient information in the text classifier yields improved performance on this task.

3 0.72151107 4 emnlp-2013-A Dataset for Research on Short-Text Conversations

Author: Hao Wang ; Zhengdong Lu ; Hang Li ; Enhong Chen

Abstract: Natural language conversation is widely regarded as a highly difficult problem, which is usually attacked with either rule-based or learning-based models. In this paper we propose a retrieval-based automatic response model for short-text conversation, to exploit the vast amount of short conversation instances available on social media. For this purpose we introduce a dataset of short-text conversation based on the real-world instances from Sina Weibo (a popular Chinese microblog service), which will be soon released to public. This dataset provides rich collection of instances for the research on finding natural and relevant short responses to a given short text, and useful for both training and testing of conversation models. This dataset consists of both naturally formed conversations, manually labeled data, and a large repository of candidate responses. Our preliminary experiments demonstrate that the simple retrieval-based conversation model performs reasonably well when combined with the rich instances in our dataset.

4 0.62209868 130 emnlp-2013-Microblog Entity Linking by Leveraging Extra Posts

Author: Yuhang Guo ; Bing Qin ; Ting Liu ; Sheng Li

Abstract: Linking name mentions in microblog posts to a knowledge base, namely microblog entity linking, is useful for text mining tasks on microblog. Entity linking in long text has been well studied in previous works. However few work has focused on short text such as microblog post. Microblog posts are short and noisy. Previous method can extract few features from the post context. In this paper we propose to use extra posts for the microblog entity linking task. Experimental results show that our proposed method significantly improves the linking accuracy over traditional methods by 8.3% and 7.5% respectively.

5 0.45623702 72 emnlp-2013-Elephant: Sequence Labeling for Word and Sentence Segmentation

Author: Kilian Evang ; Valerio Basile ; Grzegorz Chrupala ; Johan Bos

Abstract: Tokenization is widely regarded as a solved problem due to the high accuracy that rulebased tokenizers achieve. But rule-based tokenizers are hard to maintain and their rules language specific. We show that highaccuracy word and sentence segmentation can be achieved by using supervised sequence labeling on the character level combined with unsupervised feature learning. We evaluated our method on three languages and obtained error rates of 0.27 ‰ (English), 0.35 ‰ (Dutch) and 0.76 ‰ (Italian) for our best models. 1 An Elephant in the Room Tokenization, the task of segmenting a text into words and sentences, is often regarded as a solved problem in natural language processing (Dridan and . Oepen, 2012), probably because many corpora are already in tokenized format. But like an elephant in the living room, it is a problem that is impossible to overlook whenever new raw datasets need to be processed or when tokenization conventions are reconsidered. It is moreover an important problem, because any errors occurring early in the NLP pipeline affect further analysis negatively. And even though current tokenizers reach high performance, there are three issues that we feel haven’t been addressed satisfactorily so far: • • Most tokenizers are rule-based and therefore hard to maintain and hard to adapt to new domains and new languages (Silla Jr. and Kaestner, 2004); Word and sentence segmentation are often seen as separate tasks, but they obviously inform each other and it could be advantageous to view them as a combined task; 1422 bo s }@ rug .nl † g .chrupal a @ uvt .nl • Most tokenization methods provide no align- ment between raw and tokenized text, which makes mapping the tokenized version back onto the actual source hard or impossible. In short, we believe that regarding tokenization, there is still room for improvement, in particular on the methodological side of the task. We are particularly interested in the following questions: Can we use supervised learning to avoid hand-crafting rules? Can we use unsupervised feature learning to reduce feature engineering effort and boost performance? Can we use the same method across languages? Can we combine word and sentence boundary detection into one task? 2 Related Work Usually the text segmentation task is split into word tokenization and sentence boundary detection. Rulebased systems for finding word and sentence boundaries often are variations on matching hand-coded regular expressions (Grefenstette, 1999; Silla Jr. and Kaestner, 2004; Jurafsky and Martin, 2008; Dridan and Oepen, 2012). Several unsupervised systems have been proposed for sentence boundary detection. Kiss and Strunk (2006) present a language-independent, unsupervised approach and note that abbreviations form a major source of ambiguity in sentence boundary detection and use collocation detection to build a high-accuracy abbreviation detector. The resulting system reaches high accuracy, rivalling handcrafted rule-based and supervised systems. A similar system was proposed earlier by Mikheev (2002). Existing supervised learning approaches for sentence boundary detection use as features tokens preceding and following potential sentence boundary, part of speech, capitalization information and lists of abbreviations. 
Learning methods employed in Proce Sdeiantgtlse o,f W thaesh 2i0n1gt3o nC,o UnSfeAre,n 1c8e- o2n1 E Omctpoibriecra 2l0 M13et.h ?oc d2s0 i1n3 N Aastusorcaila Ltiaon g fuoarg Ceo Pmrpoucetastsi on ga,l p Laignegsu 1is4t2ic2s–1426, these approaches include maximum entropy models (Reynar and Ratnaparkhi, 1997) decision trees (Riley, 1989), and neural networks (Palmer and Hearst, 1997). Closest to our work are approaches that present token and sentence splitters using conditional random fields (Tomanek et al., 2007; Fares et al., 2013). However, these previous approaches consider tokens (i.e. character sequences) as basic units for labeling, whereas we consider single characters. As a consequence, labeling is more resource-intensive, but it also gives us more expressive power. In fact, our approach kills two birds with one stone, as it allows us to integrate token and sentence boundaries detection into one task. 3 Method 3.1 IOB Tokenization IOB tagging is widely used in tasks identifying chunks of tokens. We use it to identify chunks of characters. Characters outside of tokens are labeled O, inside of tokens I. For characters at the beginning of tokens, we use S at sentence boundaries, otherwise T (for token). This scheme offers some nice features, like allowing for discontinuous tokens (e.g. hyphenated words at line breaks) and starting a new token in the middle of a typographic word if the tokenization scheme requires it, as e.g. in did|n ’t. An example ins given ien r Figure 1 i.t It didn ’ t matter i f the face s were male , S I I T I OT I I I IOT I OT I I OT I I I I OT I I I I OT I II I OT I TO female or tho se of chi ldren . Eighty T I I I I I I OT I I I I I I I OT OT I I OT I I I TOS I I I O III three percent o f people in the 3 0 -to-3 4 I I I I I I OT I I I I I I OT I I I I I I OT I I I OT I I OT OT I I I IO year old age range gave correct responses . T I I I OT I OT I I OT I I I I I OT I I I I T I OT I I II I OT I I I IIII Figure 1: Example of IOB-labeled characters 3.2 Datasets In our experiments we use three datasets to compare our method for different languages and for different domains: manually checked English newswire texts taken from the Groningen Meaning Bank, GMB (Basile et al., 2012), Dutch newswire texts, comprising two days from January 2000 extracted from the Twente News Corpus, TwNC (Ordelman et al., 1423 2007), and a random sample of Italian texts from the corpus (Borghetti et al., 2011). PAISA` Table 1: Datasets characteristics. NameLanguageDomainSentences Tokens TGNMCB EDnugtclihshNNeewwsswwiir ee492,,58387686 604,,644337 PAIItalianWeb/various42,674869,095 The data was converted into IOB format by inferring an alignment between the raw text and the segmented text. 3.3 Sequence labeling We apply the Wapiti implementation (Lavergne et al., 2010) of Conditional Random Fields (Lafferty et al., 2001), using as features the output label of each character, combined with 1) the character itself, 2) the output label on the previous character, 3) characters and/or their Unicode categories from context windows of varying sizes. For example, with a context size of 3, in Figure 1, features for the E in Eighty-three with the output label S would be E/S, O/S, /S, i/S, Space/S, Lowercase/S. The intuition is that the 3 1 existing Unicode categories can generalize across similar characters whereas character features can identify specific contexts such as abbreviations or contractions (e.g. didn ’t). 
The context window sizes we use are 0, 1, 3, 5, 7, 9, 11 and 13, centered around the focus character. 3.4 Deep learning of features Automatically learned word embeddings have been successfully used in NLP to reduce reliance on manual feature engineering and boost performance. We adapt this approach to the character level, and thus, in addition to hand-crafted features we use text representations induced in an unsupervised fashion from character strings. A complete discussion of our approach to learning text embeddings can be found in (Chrupała, 2013). Here we provide a brief overview. Our representations correspond to the activation of the hidden layer in a simple recurrent neural (SRN) network (Elman, 1990; Elman, 1991), implemented in a customized version of Mikolov (2010)’s RNNLM toolkit. The network is sequentially presented with a large amount of raw text and learns to predict the next character in the sequence. It uses the units in the hidden layer to store a generalized representation of the recent history. After training the network on large amounts on unlabeled text, we run it on the training and test data, and record the activation of the hidden layer at each position in the string as it tries to predict the next character. The vector of activations of the hidden layer provides additional features used to train and run the CRF. For each of the K = 10 most active units out of total J = 400 hidden units, we create features (f(1) . . . f(K)) defined as f(k) = 1if sj(k) > 0.5 and f(k) = 0 otherwise, where sj (k) returns the activation of the kth most active unit. For training the SRN only raw text is necessary. We trained on the entire GMB 2.0.0 (2.5M characters), the portion of TwNC corresponding to January 2000 (43M characters) and a sample of the PAISA` corpus (39M characters). 4 Results and Evaluation In order to evaluate the quality of the tokenization produced by our models we conducted several experiments with different combinations of features and context sizes. For these tests, the models are trained on an 80% portion of the data sets and tested on a 10% development set. Final results are obtained on a 10% test set. We report both absolute number of errors and error rates per thousand (‰). 4.1 Feature sets We experiment with two kinds of features at the character level, namely Unicode categories (31 dif- ferent ones), Unicode character codes, and a combination of them. Unicode categories are less sparse than the character codes (there are 88, 134, and 502 unique characters for English, Dutch and Italian, respectively), so the combination provide some generalization over just character codes. Table 2: Error rates obtained with different feature sets. Cat stands for Unicode category, Code for Unicode character code, and Cat-Code for a union of these features. Error rates per thousand (‰) Feature setEnglishDutchItalian C ao td-9eC-9ode-94568 ( 0 1. 241950) 1,7 4807243 ( 12 . 685078) 1,65 459872 ( 12 . 162470) 1424 From these results we see that categories alone perform worse than only codes. For English there is no gain from the combination over using only character codes. For Dutch and Italian there is an improvement, although it is only significant for Italian (p = 0.480 and p = 0.005 respectively, binomial exact test). We use this feature combination in the experiments that follow. Note that these models are trained using a symmetrical context of 9 characters (four left and four right of the current character). 
In the next section we show performance of models with different window sizes. 4.2 Context window We run an experiment to evaluate how the size of the context in the training phase impacts the classification. In Table 4.2 we show the results for symmetrical windows ranging in size from 1to 13. Table 3: Using different context window sizes. Feature setEngElisrhror rateDs puetrch thousandI (t‰al)ian C Ca t - C Co d e - 31957217830 ( 308 . 2635218) 4,39 2753742085(1 (017. 0956208 6) 92,1760 8516873 (1 (135. 31854617) CCaat - CCood e - 1 3198 ( 0 . 2 58) 7 561 ( 1 . 5 64) 6 9702 ( 1 . 1271) 4.3 SRN features We also tested the automatically learned features de- rived from the activation of the hidden layer of an SRN language model, as explained in Section 3. We combined these features with character code and Unicode category features in windows of different sizes. The results of this test are shown in Table 4. The first row shows the performance of SRN features on their own. The following rows show the combination of SRN features with the basic feature sets of varying window size. It can be seen that augmenting the feature sets with SRN features results in large reductions of error rates. The Cat-Code-1SRN setting has error rates comparable to Cat-Code9. The addition of SRN features to the two best previous models, Cat-Code-9 and Cat-Code-13, reduces the error rate by 83% resp. 81% for Dutch, and by 24% resp. 26% for Italian. All these differences are statistically significant according to the binomial test (p < 0.001). For English, there are too few errors to detect a statistically significant effect for Cat-Code-9 (p = 0.07), but for Cat-Code-13 we find p = 0.016. Table 4: Results obtained using different context window sizes and addition of SRN features. Error rates per thousand (‰) Feature setEnglishDutchItalian C SaRtN-C o d e -59173 S -R SN 27413( 0 . 2107635)12 7643251 (0 .42358697)45 90376489(01 .829631) In a final step, we selected the best models based on the development sets (Cat-Code-7-SRN for English and Dutch, Cat-Code-1 1-SRN for Italian), and checked their performance on the final test set. This resulted in 10 errors (0.27 ‰) for English (GMB corpus), 199 errors (0.35 ‰) for Dutch (TwNC corpus), and 454 errors (0.76 ‰) for Italian (PAISA` corpus). 5 Discussion It is interesting to examine what kind of errors the SRN features help avoid. In the English and Dutch datasets many errors are caused by failure to recognize personal titles and initials or misparsing of numbers. In the Italian data, a large fraction of errors is due to verbs with clitics, which are written as a single word, but treated as separate tokens. Table 5 shows examples of errors made by a simpler model that are fixed by adding SRN features. Table 6 shows the confusion matrices for the Cat-Code-7 and CatCode-7-SRN sets on the Dutch data. The mistake most improved by SRN features is T/I with 89% error reduction (see also Table 5). The is also the most common remaining mistake. A comparison with other approaches is hard because of the difference in datasets and task definition (combined word/sentence segmentation). Here we just compare our results for sentence segmentation (sentence F1 score) with Punkt, a state-of-the1425 Table 5: Positive impact of SRN features. Table 6: Confusion matrix for Dutch development set. GoTOSIld32P8r1e52d480iIc7te52d,3O0C4 at-32C So20d8e-47612T089P3r2e8d5ic43t1065Ied7,2C3Oa04 t-C3o1d2S0 e-78S1R0562TN038 art sentence boundary detection system (Kiss and Strunk, 2006). 
5 Discussion

It is interesting to examine what kind of errors the SRN features help avoid. In the English and Dutch datasets, many errors are caused by failure to recognize personal titles and initials, or by misparsing of numbers. In the Italian data, a large fraction of the errors is due to verbs with clitics, which are written as a single word but treated as separate tokens. Table 5 shows examples of errors made by a simpler model that are fixed by adding SRN features. Table 6 shows the confusion matrices for the Cat-Code-7 and Cat-Code-7-SRN feature sets on the Dutch data. The mistake most improved by SRN features is T/I, with an 89% error reduction (see also Table 5); it is also the most common remaining mistake.

Table 5: Positive impact of SRN features.

Table 6: Confusion matrices for the Dutch development set: gold labels (O, S, I, T) versus predictions, for the Cat-Code-7 and Cat-Code-7-SRN feature sets.

A comparison with other approaches is hard because of differences in datasets and task definition (combined word/sentence segmentation). Here we simply compare our results for sentence segmentation (sentence F1 score) with Punkt, a state-of-the-art sentence boundary detection system (Kiss and Strunk, 2006). With its standard distributed models, Punkt achieves 98.51% on our English test set, 98.87% on Dutch and 98.34% on Italian, compared with 100%, 99.54% and 99.51% for our system. Our system benefits here from its ability to adapt to a new domain with relatively little (but annotated) training data.

6 What Elephant?

Word and sentence segmentation can be recast as a combined character tagging task. Tokenization thus becomes a supervised learning problem, shifting the labor from writing rules to correcting labels. Learning this task with a CRF achieves high accuracy.1 Furthermore, our tagging method does not lose the connection between the original text and the tokens.
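Reading tokens and sentences back off the tag sequence is straightforward. The sketch below is our illustration, not the authors' code; it assumes the tag set suggested by the labels in Table 6, with S marking a sentence-initial (and token-initial) character, T a token-initial character, I a token-internal character, and O a character outside any token.

```python
# Sketch: recovering tokens and sentences from per-character tags,
# under the assumed tag set S / T / I / O described above.
def decode(text, tags):
    sentences, tokens, current = [], [], []
    for ch, tag in zip(text, tags):
        if tag in ('S', 'T'):
            if current:                    # close the previous token
                tokens.append(''.join(current))
                current = []
            if tag == 'S' and tokens:      # close the previous sentence
                sentences.append(tokens)
                tokens = []
            current = [ch]
        elif tag == 'I':
            current.append(ch)
        # 'O' characters (e.g. whitespace) contribute no token material
    if current:
        tokens.append(''.join(current))
    if tokens:
        sentences.append(tokens)
    return sentences
```

For instance, decode("Hi. Bye.", "SITOSIIT") returns [['Hi', '.'], ['Bye', '.']].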
In future work, we plan to broaden the scope of this work to other steps in document preparation, such as normalization of punctuation, and their interaction with segmentation. We further plan to test our method on a wider range of datasets, allowing a more direct comparison with other approaches. Finally, we plan to explore the possibility of a statistical universal segmentation model for multiple languages and domains.

In a famous scene with a live elephant on stage, the comedian Jimmy Durante was asked about it by a policeman and answered in surprise: "What elephant?" We feel we can say the same now as far as tokenization is concerned.

1 All software needed to replicate our experiments is available at http://gmb.let.rug.nl/elephant/experiments.php

References

Valerio Basile, Johan Bos, Kilian Evang, and Noortje Venhuizen. 2012. Developing a large semantically annotated corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), pages 3196–3200, Istanbul, Turkey.

Claudia Borghetti, Sara Castagnoli, and Marco Brunello. 2011. I testi del web: una proposta di classificazione sulla base del corpus PAISÀ. In M. Cerruti, E. Corino, and C. Onesti, editors, Formale e informale. La variazione di registro nella comunicazione elettronica, pages 147–170. Carocci, Roma.

Grzegorz Chrupała. 2013. Text segmentation with character-level text embeddings. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing, Atlanta, USA.

Rebecca Dridan and Stephan Oepen. 2012. Tokenization: Returning to a long solved problem. A survey, contrastive experiment, recommendations, and toolkit. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 378–382, Jeju Island, Korea. Association for Computational Linguistics.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

Jeffrey L. Elman. 1991. Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7(2):195–225.

Murhaf Fares, Stephan Oepen, and Yi Zhang. 2013. Machine learning for high-quality tokenization: replicating variable tokenization schemes. In A. Gelbukh, editor, CICLing 2013, volume 7816 of Lecture Notes in Computer Science, pages 231–244, Berlin Heidelberg. Springer-Verlag.

Gregory Grefenstette. 1999. Tokenization. In Hans van Halteren, editor, Syntactic Wordclass Tagging, pages 117–133. Kluwer Academic Publishers, Dordrecht.

Daniel Jurafsky and James H. Martin. 2008. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition.

Tibor Kiss and Jan Strunk. 2006. Unsupervised multilingual sentence boundary detection. Computational Linguistics, 32(4):485–525.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML-01, pages 282–289.

Thomas Lavergne, Olivier Cappé, and François Yvon. 2010. Practical very large scale CRFs. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 504–513, Uppsala, Sweden, July. Association for Computational Linguistics.

Andrei Mikheev. 2002. Periods, capitalized words, etc. Computational Linguistics, 28(3):289–318.

Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Interspeech.

Roeland Ordelman, Franciska de Jong, Arjan van Hessen, and Hendri Hondorp. 2007. TwNC: a multifaceted Dutch news corpus. ELRA Newsletter, 12(3/4):4–7.

David D. Palmer and Marti A. Hearst. 1997. Adaptive multilingual sentence boundary disambiguation. Computational Linguistics, 23(2):241–267.

Jeffrey C. Reynar and Adwait Ratnaparkhi. 1997. A maximum entropy approach to identifying sentence boundaries. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 16–19, Washington, DC, USA. Association for Computational Linguistics.

Michael D. Riley. 1989. Some applications of tree-based modelling to speech and language. In Proceedings of the Workshop on Speech and Natural Language, HLT '89, pages 339–352, Stroudsburg, PA, USA. Association for Computational Linguistics.

Carlos N. Silla Jr. and Celso A. A. Kaestner. 2004. An analysis of sentence boundary detection systems for English and Portuguese documents. In Fifth International Conference on Intelligent Text Processing and Computational Linguistics, volume 2945 of Lecture Notes in Computer Science, pages 135–141. Springer.

Katrin Tomanek, Joachim Wermter, and Udo Hahn. 2007. Sentence and token splitting based on conditional random fields. In Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics, pages 49–57, Melbourne, Australia.

6 0.390479 89 emnlp-2013-Gender Inference of Twitter Users in Non-English Contexts

7 0.33622667 13 emnlp-2013-A Study on Bootstrapping Bilingual Vector Spaces from Non-Parallel Data (and Nothing Else)

8 0.33547094 26 emnlp-2013-Assembling the Kazakh Language Corpus

9 0.33330923 42 emnlp-2013-Building Specialized Bilingual Lexicons Using Large Scale Background Knowledge

10 0.32525927 32 emnlp-2013-Automatic Idiom Identification in Wiktionary

11 0.2905713 23 emnlp-2013-Animacy Detection with Voting Models

12 0.28730506 37 emnlp-2013-Automatically Identifying Pseudepigraphic Texts

13 0.28414419 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization

14 0.28219068 132 emnlp-2013-Mining Scientific Terms and their Definitions: A Study of the ACL Anthology

15 0.27467909 38 emnlp-2013-Bilingual Word Embeddings for Phrase-Based Machine Translation

16 0.26662925 24 emnlp-2013-Application of Localized Similarity for Web Documents

17 0.26658276 96 emnlp-2013-Identifying Phrasal Verbs Using Many Bilingual Corpora

18 0.26075396 190 emnlp-2013-Ubertagging: Joint Segmentation and Supertagging for English

19 0.25848091 27 emnlp-2013-Authorship Attribution of Micro-Messages

20 0.25674105 198 emnlp-2013-Using Soft Constraints in Joint Inference for Clinical Concept Recognition


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.089), (18, 0.044), (22, 0.031), (30, 0.089), (36, 0.012), (47, 0.322), (50, 0.015), (51, 0.188), (66, 0.033), (71, 0.025), (75, 0.018), (77, 0.023), (96, 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.82118154 9 emnlp-2013-A Log-Linear Model for Unsupervised Text Normalization

Author: Yi Yang ; Jacob Eisenstein

Abstract: We present a unified unsupervised statistical model for text normalization. The relationship between standard and non-standard tokens is characterized by a log-linear model, permitting arbitrary features. The weights of these features are trained in a maximum-likelihood framework, employing a novel sequential Monte Carlo training algorithm to overcome the large label space, which would be impractical for traditional dynamic programming solutions. This model is implemented in a normalization system called UNLOL, which achieves the best known results on two normalization datasets, outperforming more complex systems. We use the output of UNLOL to automatically normalize a large corpus of social media text, revealing a set of coherent orthographic styles that underlie online language variation.

same-paper 2 0.79094827 204 emnlp-2013-Word Level Language Identification in Online Multilingual Communication

Author: Dong Nguyen ; A. Seza Dogruoz

Abstract: Multilingual speakers switch between languages in online and spoken communication. Analyses of large scale multilingual data require automatic language identification at the word level. For our experiments with multilingual online discussions, we first tag the language of individual words using language models and dictionaries. Secondly, we incorporate context to improve the performance. We achieve an accuracy of 98%. Besides word level accuracy, we use two new metrics to evaluate this task.

3 0.7689929 118 emnlp-2013-Learning Biological Processes with Global Constraints

Author: Aju Thalappillil Scaria ; Jonathan Berant ; Mengqiu Wang ; Peter Clark ; Justin Lewis ; Brittany Harding ; Christopher D. Manning

Abstract: Biological processes are complex phenomena involving a series of events that are related to one another through various relationships. Systems that can understand and reason over biological processes would dramatically improve the performance of semantic applications involving inference such as question answering (QA) – specifically "How?" and "Why?" questions. In this paper, we present the task of process extraction, in which events within a process and the relations between the events are automatically extracted from text. We represent processes by graphs whose edges describe a set of temporal, causal and co-reference event-event relations, and characterize the structural properties of these graphs (e.g., the graphs are connected). Then, we present a method for extracting relations between the events, which exploits these structural properties by performing joint inference over the set of extracted relations. On a novel dataset containing 148 descriptions of biological processes (released with this paper), we show significant improvement compared to baselines that disregard process structure.

4 0.56235063 12 emnlp-2013-A Semantically Enhanced Approach to Determine Textual Similarity

Author: Eduardo Blanco ; Dan Moldovan

Abstract: This paper presents a novel approach to determine textual similarity. A layered methodology to transform text into logic forms is proposed, and semantic features are derived from a logic prover. Experimental results show that incorporating the semantic structure of sentences is beneficial. When training data is unavailable, scores obtained from the logic prover in an unsupervised manner outperform supervised methods.

5 0.56120914 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging

Author: Xiaoqing Zheng ; Hanyang Chen ; Tianyu Xu

Abstract: This study explores the feasibility of performing Chinese word segmentation (CWS) and POS tagging by deep learning. We try to avoid task-specific feature engineering, and use deep layers of neural networks to discover relevant features for the tasks. We leverage large-scale unlabeled data to improve internal representations of Chinese characters, and use these improved representations to enhance supervised word segmentation and POS tagging models. Our networks achieved close to state-of-the-art performance with minimal computational cost. We also describe a perceptron-style algorithm for training the neural networks, as an alternative to the maximum-likelihood method, to speed up the training process and make the learning algorithm easier to implement.

6 0.5610109 72 emnlp-2013-Elephant: Sequence Labeling for Word and Sentence Segmentation

7 0.5596742 132 emnlp-2013-Mining Scientific Terms and their Definitions: A Study of the ACL Anthology

8 0.55962414 61 emnlp-2013-Detecting Promotional Content in Wikipedia

9 0.55947012 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization

10 0.55878317 140 emnlp-2013-Of Words, Eyes and Brains: Correlating Image-Based Distributional Semantic Models with Neural Representations of Concepts

11 0.55809003 36 emnlp-2013-Automatically Determining a Proper Length for Multi-Document Summarization: A Bayesian Nonparametric Approach

12 0.55767024 124 emnlp-2013-Leveraging Lexical Cohesion and Disruption for Topic Segmentation

13 0.55751508 46 emnlp-2013-Classifying Message Board Posts with an Extracted Lexicon of Patient Attributes

14 0.55663794 167 emnlp-2013-Semi-Markov Phrase-Based Monolingual Alignment

15 0.55537796 69 emnlp-2013-Efficient Collective Entity Linking with Stacking

16 0.55496323 106 emnlp-2013-Inducing Document Plans for Concept-to-Text Generation

17 0.55469435 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction

18 0.55387664 143 emnlp-2013-Open Domain Targeted Sentiment

19 0.55333817 86 emnlp-2013-Feature Noising for Log-Linear Structured Prediction

20 0.55317634 82 emnlp-2013-Exploring Representations from Unlabeled Data with Co-training for Chinese Word Segmentation