emnlp emnlp2013 emnlp2013-72 knowledge-graph by maker-knowledge-mining

72 emnlp-2013-Elephant: Sequence Labeling for Word and Sentence Segmentation


Source: pdf

Author: Kilian Evang ; Valerio Basile ; Grzegorz Chrupala ; Johan Bos

Abstract: Tokenization is widely regarded as a solved problem due to the high accuracy that rule-based tokenizers achieve. But rule-based tokenizers are hard to maintain and their rules are language-specific. We show that high-accuracy word and sentence segmentation can be achieved by using supervised sequence labeling on the character level combined with unsupervised feature learning. We evaluated our method on three languages and obtained error rates of 0.27 ‰ (English), 0.35 ‰ (Dutch) and 0.76 ‰ (Italian) for our best models.

1 An Elephant in the Room

Tokenization, the task of segmenting a text into words and sentences, is often regarded as a solved problem in natural language processing (Dridan and Oepen, 2012), probably because many corpora are already in tokenized format. But like an elephant in the living room, it is a problem that is impossible to overlook whenever new raw datasets need to be processed or when tokenization conventions are reconsidered. It is moreover an important problem, because any errors occurring early in the NLP pipeline affect further analysis negatively. And even though current tokenizers reach high performance, there are three issues that we feel haven’t been addressed satisfactorily so far:

• Most tokenizers are rule-based and therefore hard to maintain and hard to adapt to new domains and new languages (Silla Jr. and Kaestner, 2004);

• Word and sentence segmentation are often seen as separate tasks, but they obviously inform each other and it could be advantageous to view them as a combined task;

• Most tokenization methods provide no alignment between raw and tokenized text, which makes mapping the tokenized version back onto the actual source hard or impossible.

In short, we believe that regarding tokenization, there is still room for improvement, in particular on the methodological side of the task. We are particularly interested in the following questions: Can we use supervised learning to avoid hand-crafting rules? Can we use unsupervised feature learning to reduce feature engineering effort and boost performance? Can we use the same method across languages? Can we combine word and sentence boundary detection into one task?

2 Related Work

Usually the text segmentation task is split into word tokenization and sentence boundary detection. Rule-based systems for finding word and sentence boundaries are often variations on matching hand-coded regular expressions (Grefenstette, 1999; Silla Jr. and Kaestner, 2004; Jurafsky and Martin, 2008; Dridan and Oepen, 2012).

Several unsupervised systems have been proposed for sentence boundary detection. Kiss and Strunk (2006) present a language-independent, unsupervised approach; they note that abbreviations form a major source of ambiguity in sentence boundary detection and use collocation detection to build a high-accuracy abbreviation detector. The resulting system reaches high accuracy, rivalling hand-crafted rule-based and supervised systems. A similar system was proposed earlier by Mikheev (2002).

Existing supervised learning approaches for sentence boundary detection use as features the tokens preceding and following a potential sentence boundary, part of speech, capitalization information and lists of abbreviations.
Learning methods employed in these approaches include maximum entropy models (Reynar and Ratnaparkhi, 1997), decision trees (Riley, 1989), and neural networks (Palmer and Hearst, 1997). Closest to our work are approaches that present token and sentence splitters using conditional random fields (Tomanek et al., 2007; Fares et al., 2013). However, these previous approaches consider tokens (i.e. character sequences) as basic units for labeling, whereas we consider single characters. As a consequence, labeling is more resource-intensive, but it also gives us more expressive power. In fact, our approach kills two birds with one stone, as it allows us to integrate token and sentence boundary detection into one task.

3 Method

3.1 IOB Tokenization

IOB tagging is widely used in tasks identifying chunks of tokens. We use it to identify chunks of characters. Characters outside of tokens are labeled O, characters inside of tokens I. For characters at the beginning of tokens, we use S at sentence boundaries, otherwise T (for token). This scheme offers some nice features, like allowing for discontinuous tokens (e.g. hyphenated words at line breaks) and starting a new token in the middle of a typographic word if the tokenization scheme requires it, as e.g. in did|n’t. An example is given in Figure 1.

Figure 1: Example of IOB-labeled characters. [The figure labels every character of the text "It didn’t matter if the faces were male, female or those of children. Eighty-three percent of people in the 30-to-34 year old age range gave correct responses." with one of S, T, I or O; the label rows are not recoverable from this extraction.]

3.2 Datasets

In our experiments we use three datasets to compare our method for different languages and for different domains: manually checked English newswire texts taken from the Groningen Meaning Bank, GMB (Basile et al., 2012), Dutch newswire texts, comprising two days from January 2000 extracted from the Twente News Corpus, TwNC (Ordelman et al., 2007), and a random sample of Italian texts from the PAISÀ corpus (Borghetti et al., 2011).

Table 1: Dataset characteristics. [The sentence and token counts for GMB and TwNC are garbled beyond recovery in this extraction.]

Name    Language   Domain        Sentences     Tokens
GMB     English    Newswire      (illegible)   (illegible)
TwNC    Dutch      Newswire      (illegible)   (illegible)
PAISÀ   Italian    Web/various   42,674        869,095

The data was converted into IOB format by inferring an alignment between the raw text and the segmented text.

3.3 Sequence labeling

We apply the Wapiti implementation (Lavergne et al., 2010) of Conditional Random Fields (Lafferty et al., 2001), using as features the output label of each character, combined with 1) the character itself, 2) the output label on the previous character, 3) characters and/or their Unicode categories from context windows of varying sizes. For example, with a context size of 3, in Figure 1, the features for the E in Eighty-three with the output label S would be E/S, O/S, /S (the space character), i/S, Space/S, Lowercase/S. The intuition is that the 31 existing Unicode categories can generalize across similar characters, whereas character features can identify specific contexts such as abbreviations or contractions (e.g. didn’t).
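To make these feature templates concrete, the sketch below builds the observation side of the features for one position: the characters and their Unicode categories in a symmetric window around the focus character. This is our own illustration, not the paper's code (Wapiti would normally generate such features from template files), and the feature names are invented for readability.

    import unicodedata

    def char_features(text, i, window=9):
        # Observation features for position i: the characters and their
        # Unicode categories in a symmetric window of `window` characters
        # centered on the focus character. In the CRF these observations
        # are conjoined with the output label, and a separate label-bigram
        # feature adds the dependence on the previous label.
        half = (window - 1) // 2
        feats = []
        for offset in range(-half, half + 1):
            j = i + offset
            if 0 <= j < len(text):
                ch = text[j]
                feats.append("char[%d]=%s" % (offset, ch))
                # unicodedata.category returns two-letter codes such as
                # 'Lu' (uppercase letter), 'Ll' (lowercase letter) and
                # 'Zs' (space); these are the Unicode categories above.
                feats.append("cat[%d]=%s" % (offset, unicodedata.category(ch)))
        return feats

    # Example: features for the E of "Eighty" with a window of 3.
    text = "children. Eighty-three percent"
    print(char_features(text, text.index("E"), window=3))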
The context window sizes we use are 0, 1, 3, 5, 7, 9, 11 and 13, centered around the focus character.

3.4 Deep learning of features

Automatically learned word embeddings have been successfully used in NLP to reduce reliance on manual feature engineering and boost performance. We adapt this approach to the character level, and thus, in addition to hand-crafted features, we use text representations induced in an unsupervised fashion from character strings. A complete discussion of our approach to learning text embeddings can be found in (Chrupała, 2013). Here we provide a brief overview.

Our representations correspond to the activation of the hidden layer in a simple recurrent neural network (SRN) (Elman, 1990; Elman, 1991), implemented in a customized version of Mikolov (2010)’s RNNLM toolkit. The network is sequentially presented with a large amount of raw text and learns to predict the next character in the sequence. It uses the units in the hidden layer to store a generalized representation of the recent history. After training the network on large amounts of unlabeled text, we run it on the training and test data, and record the activation of the hidden layer at each position in the string as it tries to predict the next character. The vector of activations of the hidden layer provides additional features used to train and run the CRF. For each of the K = 10 most active units out of J = 400 hidden units in total, we create features (f(1) . . . f(K)) defined as f(k) = 1 if sj(k) > 0.5 and f(k) = 0 otherwise, where sj(k) returns the activation of the kth most active unit.

For training the SRN only raw text is necessary. We trained on the entire GMB 2.0.0 (2.5M characters), the portion of TwNC corresponding to January 2000 (43M characters) and a sample of the PAISÀ corpus (39M characters).

4 Results and Evaluation

In order to evaluate the quality of the tokenization produced by our models we conducted several experiments with different combinations of features and context sizes. For these tests, the models are trained on an 80% portion of the data sets and tested on a 10% development set. Final results are obtained on a 10% test set. We report both the absolute number of errors and error rates per thousand (‰).

4.1 Feature sets

We experiment with two kinds of features at the character level, namely Unicode categories (31 different ones) and Unicode character codes, as well as a combination of them. Unicode categories are less sparse than the character codes (there are 88, 134, and 502 unique characters for English, Dutch and Italian, respectively), so the combination provides some generalization over character codes alone.

Table 2: Error rates obtained with different feature sets. Cat stands for Unicode category, Code for Unicode character code, and Cat-Code for a union of these features. [Rows: Cat-9, Code-9, Cat-Code-9; columns: errors and error rates per thousand (‰) for English, Dutch and Italian; the values are not recoverable from this extraction.]

From these results we see that categories alone perform worse than codes alone. For English there is no gain from the combination over using only character codes. For Dutch and Italian there is an improvement, although it is only significant for Italian (p = 0.480 and p = 0.005 respectively, binomial exact test). We use this feature combination in the experiments that follow. Note that these models are trained using a symmetrical context of 9 characters (four left and four right of the current character).
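The binomial exact test mentioned above can be reproduced as a two-sided sign test on the characters where exactly one of the two compared models makes an error; note that this pairing is our reading of the setup, since the paper does not spell out the test protocol. A minimal sketch using SciPy:

    from scipy.stats import binomtest

    def compare_error_counts(only_a_errs, only_b_errs):
        # Sign test: among characters where exactly one model errs, the
        # null hypothesis of equal model quality makes each disagreement
        # equally likely to favor either model, i.e. Binomial(n, 0.5).
        n = only_a_errs + only_b_errs
        return binomtest(only_a_errs, n=n, p=0.5).pvalue

    # Hypothetical disagreement counts for two models on one language.
    print(compare_error_counts(40, 60))  # two-sided p-value, about 0.057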
In the next section we show the performance of models with different window sizes.

4.2 Context window

We run an experiment to evaluate how the size of the context in the training phase impacts the classification. In Table 3 we show the results for symmetrical windows ranging in size from 1 to 13.

Table 3: Using different context window sizes. [Rows: Cat-Code features with window sizes from 1 to 13; columns: errors and error rates per thousand (‰) for English, Dutch and Italian; the values are not recoverable from this extraction.]

4.3 SRN features

We also tested the automatically learned features derived from the activation of the hidden layer of an SRN language model, as explained in Section 3.4. We combined these features with character code and Unicode category features in windows of different sizes. The results of this test are shown in Table 4. The first row shows the performance of SRN features on their own. The following rows show the combination of SRN features with the basic feature sets of varying window size.

It can be seen that augmenting the feature sets with SRN features results in large reductions of error rates. The Cat-Code-1-SRN setting has error rates comparable to Cat-Code-9. The addition of SRN features to the two best previous models, Cat-Code-9 and Cat-Code-13, reduces the error rate by 83% resp. 81% for Dutch, and by 24% resp. 26% for Italian. All these differences are statistically significant according to the binomial test (p < 0.001). For English, there are too few errors to detect a statistically significant effect for Cat-Code-9 (p = 0.07), but for Cat-Code-13 we find p = 0.016.

Table 4: Results obtained using different context window sizes and addition of SRN features. [Rows: SRN alone and Cat-Code-n-SRN for several window sizes n; columns: errors and error rates per thousand (‰) for English, Dutch and Italian; the values are not recoverable from this extraction.]

In a final step, we selected the best models based on the development sets (Cat-Code-7-SRN for English and Dutch, Cat-Code-11-SRN for Italian), and checked their performance on the final test set. This resulted in 10 errors (0.27 ‰) for English (GMB corpus), 199 errors (0.35 ‰) for Dutch (TwNC corpus), and 454 errors (0.76 ‰) for Italian (PAISÀ corpus).

5 Discussion

It is interesting to examine what kind of errors the SRN features help avoid. In the English and Dutch datasets many errors are caused by failure to recognize personal titles and initials or by misparsing of numbers. In the Italian data, a large fraction of errors is due to verbs with clitics, which are written as a single word but treated as separate tokens. Table 5 shows examples of errors made by a simpler model that are fixed by adding SRN features. Table 6 shows the confusion matrices for the Cat-Code-7 and Cat-Code-7-SRN sets on the Dutch data. The mistake most improved by SRN features is T/I, with 89% error reduction (see also Table 5). It is also the most common remaining mistake.

Table 5: Positive impact of SRN features. [The example errors shown in this table are not recoverable from this extraction.]

Table 6: Confusion matrix for the Dutch development set. [Gold labels S, T, I, O against predicted labels, for Cat-Code-7 and Cat-Code-7-SRN; the counts are not recoverable from this extraction.]

A comparison with other approaches is hard because of differences in datasets and task definition (combined word/sentence segmentation). Here we just compare our results for sentence segmentation (sentence F1 score) with Punkt, a state-of-the-art sentence boundary detection system (Kiss and Strunk, 2006).
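For reference, the sentence F1 score used in the comparison that follows can be computed by matching predicted sentence-start offsets against gold ones (i.e. the positions labeled S); representing boundaries as character offsets is our assumption, as the paper does not specify the matching criterion. A minimal sketch:

    def sentence_f1(gold_starts, pred_starts):
        # F1 over sentence boundaries, each given as the character offset
        # at which a sentence starts (an S label in the tagging scheme).
        gold, pred = set(gold_starts), set(pred_starts)
        tp = len(gold & pred)
        if tp == 0:
            return 0.0
        precision = tp / len(pred)
        recall = tp / len(gold)
        return 2 * precision * recall / (precision + recall)

    # Hypothetical boundary offsets: one predicted boundary is off by three.
    print(sentence_f1([0, 35, 80], [0, 35, 77]))  # 0.666...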
With its standard distributed models, Punkt achieves 98.51% on our English test set, 98.87% on Dutch and 98.34% on Italian, compared with 100%, 99.54% and 99.51% for our system. Our system benefits here from its ability to adapt to a new domain with relatively little (but annotated) training data.

6 What Elephant?

Word and sentence segmentation can be recast as a combined tagging task. This way, tokenization is cast as a supervised learning task, causing a shift of labor from writing rules to manually correcting labels. Learning this task with a CRF achieves high accuracy.[1] Furthermore, our tagging method does not lose the connection between the original text and its tokens.

In future work, we plan to broaden the scope of this work to other steps in document preparation, such as normalization of punctuation, and their interaction with segmentation. We further plan to test our method on a wider range of datasets, allowing a more direct comparison with other approaches. Finally, we plan to explore the possibility of a statistical universal segmentation model for multiple languages and domains.

In a famous scene with a live elephant on stage, the comedian Jimmy Durante was asked about it by a policeman and, surprised, answered: “What elephant?” We feel we can say the same now as far as tokenization is concerned.

[1] All software needed to replicate our experiments is available at http://gmb.let.rug.nl/elephant/experiments.php

References

Valerio Basile, Johan Bos, Kilian Evang, and Noortje Venhuizen. 2012. Developing a large semantically annotated corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), pages 3196–3200, Istanbul, Turkey.

Claudia Borghetti, Sara Castagnoli, and Marco Brunello. 2011. I testi del web: una proposta di classificazione sulla base del corpus PAISÀ. In M. Cerruti, E. Corino, and C. Onesti, editors, Formale e informale. La variazione di registro nella comunicazione elettronica, pages 147–170. Carocci, Roma.

Grzegorz Chrupała. 2013. Text segmentation with character-level text embeddings. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing, Atlanta, USA.

Rebecca Dridan and Stephan Oepen. 2012. Tokenization: Returning to a long solved problem - a survey, contrastive experiment, recommendations, and toolkit. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 378–382, Jeju Island, Korea. Association for Computational Linguistics.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

Jeffrey L. Elman. 1991. Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7(2):195–225.

Murhaf Fares, Stephan Oepen, and Zhang Yi. 2013. Machine learning for high-quality tokenization - replicating variable tokenization schemes. In A. Gelbukh, editor, CICLING 2013, volume 7816 of Lecture Notes in Computer Science, pages 231–244, Berlin Heidelberg. Springer-Verlag.

Gregory Grefenstette. 1999. Tokenization. In Hans van Halteren, editor, Syntactic Wordclass Tagging, pages 117–133. Kluwer Academic Publishers, Dordrecht.

Daniel Jurafsky and James H. Martin. 2008. Speech and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition.

Tibor Kiss and Jan Strunk. 2006. Unsupervised multilingual sentence boundary detection. Computational Linguistics, 32(4):485–525.
John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML-01, pages 282–289.

Thomas Lavergne, Olivier Cappé, and François Yvon. 2010. Practical very large scale CRFs. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 504–513, Uppsala, Sweden, July. Association for Computational Linguistics.

Andrei Mikheev. 2002. Periods, capitalized words, etc. Computational Linguistics, 28(3):289–318.

Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Interspeech.

Roeland Ordelman, Franciska de Jong, Arjan van Hessen, and Hendri Hondorp. 2007. TwNC: a multifaceted Dutch news corpus. ELRA Newsletter, 12(3/4):4–7.

David D. Palmer and Marti A. Hearst. 1997. Adaptive multilingual sentence boundary disambiguation. Computational Linguistics, 23(2):241–267.

Jeffrey C. Reynar and Adwait Ratnaparkhi. 1997. A maximum entropy approach to identifying sentence boundaries. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 16–19, Washington, DC, USA. Association for Computational Linguistics.

Michael D. Riley. 1989. Some applications of tree-based modelling to speech and language. In Proceedings of the Workshop on Speech and Natural Language, HLT ’89, pages 339–352, Stroudsburg, PA, USA. Association for Computational Linguistics.

Carlos N. Silla Jr. and Celso A. A. Kaestner. 2004. An analysis of sentence boundary detection systems for English and Portuguese documents. In Fifth International Conference on Intelligent Text Processing and Computational Linguistics, volume 2945 of Lecture Notes in Computer Science, pages 135–141. Springer.

Katrin Tomanek, Joachim Wermter, and Udo Hahn. 2007. Sentence and token splitting based on conditional random fields. In Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics, pages 49–57, Melbourne, Australia.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Tokenization is widely regarded as a solved problem due to the high accuracy that rule-based tokenizers achieve. [sent-3, score-0.201]

2 But rule-based tokenizers are hard to maintain and their rules are language-specific. [sent-4, score-0.165]

3 We show that high-accuracy word and sentence segmentation can be achieved by using supervised sequence labeling on the character level combined with unsupervised feature learning. [sent-5, score-0.449]

4 We evaluated our method on three languages and obtained error rates of 0.27 ‰ (English), 0.35 ‰ (Dutch) and 0.76 ‰ (Italian) for our best models. [sent-6, score-0.167]

5 1 An Elephant in the Room Tokenization, the task of segmenting a text into words and sentences, is often regarded as a solved problem in natural language processing (Dridan and Oepen, 2012). [sent-10, score-0.034]

6 Oepen, 2012), probably because many corpora are already in tokenized format. [sent-11, score-0.046]

7 But like an elephant in the living room, it is a problem that is impossible to overlook whenever new raw datasets need to be processed or when tokenization conventions are reconsidered. [sent-12, score-0.55]

8 It is moreover an important problem, because any errors occurring early in the NLP pipeline affect further analysis negatively. [sent-13, score-0.063]

9 And even though current tokenizers reach high performance, there are three issues that we feel haven’t been addressed satisfactorily so far: Most tokenizers are rule-based and therefore hard to maintain and hard to adapt to new domains and new languages (Silla Jr. [sent-14, score-0.442]

10 and Kaestner, 2004); Word and sentence segmentation are often seen as separate tasks, but they obviously inform each other and it could be advantageous to view them as a combined task; [sent-15, score-0.229]

11 Most tokenization methods provide no alignment between raw and tokenized text, which makes mapping the tokenized version back onto the actual source hard or impossible. [sent-18, score-0.441]

12 In short, we believe that regarding tokenization, there is still room for improvement, in particular on the methodological side of the task. [sent-19, score-0.046]

13 Can we use unsupervised feature learning to reduce feature engineering effort and boost performance? [sent-21, score-0.029]

14 Can we combine word and sentence boundary detection into one task? [sent-23, score-0.218]

15 2 Related Work Usually the text segmentation task is split into word tokenization and sentence boundary detection. [sent-24, score-0.52]

16 Rule-based systems for finding word and sentence boundaries are often variations on matching hand-coded regular expressions (Grefenstette, 1999; Silla Jr. [sent-25, score-0.085]

17 Several unsupervised systems have been proposed for sentence boundary detection. [sent-27, score-0.202]

18 Kiss and Strunk (2006) present a language-independent, unsupervised approach; they note that abbreviations form a major source of ambiguity in sentence boundary detection and use collocation detection to build a high-accuracy abbreviation detector. [sent-28, score-0.334]

19 Existing supervised learning approaches for sentence boundary detection use as features the tokens preceding and following a potential sentence boundary, part of speech, capitalization information and lists of abbreviations. [sent-31, score-0.313]

20 Closest to our work are approaches that present token and sentence splitters using conditional random fields (Tomanek et al. [sent-35, score-0.123]

21 character sequences) as basic units for labeling, whereas we consider single characters. [sent-40, score-0.242]

22 As a consequence, labeling is more resource-intensive, but it also gives us more expressive power. [sent-41, score-0.044]

23 In fact, our approach kills two birds with one stone, as it allows us to integrate token and sentence boundary detection into one task. [sent-42, score-0.178]

24 1 IOB Tokenization IOB tagging is widely used in tasks identifying chunks of tokens. [sent-44, score-0.067]

25 Characters outside of tokens are labeled O, inside of tokens I. [sent-46, score-0.102]

26 For characters at the beginning of tokens, we use S at sentence boundaries, otherwise T (for token). [sent-47, score-0.139]

27 This scheme offers some nice features, like allowing for discontinuous tokens (e. [sent-48, score-0.051]

28 hyphenated words at line breaks) and starting a new token in the middle of a typographic word if the tokenization scheme requires it, as e. [sent-50, score-0.302]

29 It didn’t matter if the faces were male, female or those of children. [sent-54, score-0.056]

30 Figure 1: Example of IOB-labeled characters. [sent-56, score-0.095]

31 2 Datasets In our experiments we use three datasets to compare our method for different languages and for different domains: manually checked English newswire texts taken from the Groningen Meaning Bank, GMB (Basile et al. [sent-57, score-0.152]

32 , 2012), Dutch newswire texts, comprising two days from January 2000 extracted from the Twente News Corpus, TwNC (Ordelman et al. [sent-58, score-0.031]

33 The data was converted into IOB format by inferring an alignment between the raw text and the segmented text. [sent-62, score-0.058]

34 3 Sequence labeling We apply the Wapiti implementation (Lavergne et al. [sent-64, score-0.044]

35 , 2001), using as features the output label of each character, combined with 1) the character itself, 2) the output label on the previous character, 3) characters and/or their Unicode categories from context windows of varying sizes. [sent-66, score-0.422]

36 The intuition is that the 31 existing Unicode categories can generalize across similar characters whereas character features can identify specific contexts such as abbreviations or contractions (e. [sent-68, score-0.369]

37 The context window sizes we use are 0, 1, 3, 5, 7, 9, 11 and 13, centered around the focus character. [sent-71, score-0.06]

38 4 Deep learning of features Automatically learned word embeddings have been successfully used in NLP to reduce reliance on manual feature engineering and boost performance. [sent-73, score-0.036]

39 We adapt this approach to the character level, and thus, in addition to hand-crafted features we use text representations induced in an unsupervised fashion from character strings. [sent-74, score-0.474]

40 A complete discussion of our approach to learning text embeddings can be found in (Chrupała, 2013). [sent-75, score-0.036]

41 Our representations correspond to the activation of the hidden layer in a simple recurrent neural (SRN) network (Elman, 1990; Elman, 1991), implemented in a customized version of Mikolov (2010)’s RNNLM toolkit. [sent-77, score-0.296]

42 The network is sequentially presented with a large amount of raw text and learns to predict the next character in the sequence. [sent-78, score-0.299]

43 It uses the units in the hidden layer to store a generalized representation of the recent history. [sent-79, score-0.152]

44 After training the network on large amounts on unlabeled text, we run it on the training and test data, and record the activation of the hidden layer at each position in the string as it tries to predict the next character. [sent-80, score-0.24]

45 The vector of activations of the hidden layer provides additional features used to train and run the CRF. [sent-81, score-0.113]

46 For each of the K = 10 most active units out of total J = 400 hidden units, we create features (f(1) . . . f(K)). [sent-82, score-0.084]

47 f(k) = 1 if sj(k) > 0.5 and f(k) = 0 otherwise, where sj(k) returns the activation of the kth most active unit. [sent-86, score-0.123]

48 4 Results and Evaluation In order to evaluate the quality of the tokenization produced by our models we conducted several experiments with different combinations of features and context sizes. [sent-92, score-0.254]

49 We report both absolute number of errors and error rates per thousand (‰). [sent-95, score-0.243]

50 1 Feature sets We experiment with two kinds of features at the character level, namely Unicode categories (31 different ones) and Unicode character codes, as well as a combination of them. [sent-97, score-0.464]

51 Unicode categories are less sparse than the character codes (there are 88, 134, and 502 unique characters for English, Dutch and Italian, respectively), so the combination provides some generalization over just character codes. [sent-98, score-0.593]

52 Table 2: Error rates obtained with different feature sets. [sent-99, score-0.084]

53 Cat stands for Unicode category, Code for Unicode character code, and Cat-Code for a union of these features. [sent-100, score-0.203]

54 Error rates per thousand (‰) are reported per feature set for English, Dutch and Italian. [sent-101, score-0.134]

55 From these results we see that categories alone perform worse than codes alone. [sent-104, score-0.029]

56 For English there is no gain from the combination over using only character codes. [sent-105, score-0.232]

57 We use this feature combination in the experiments that follow. [sent-109, score-0.029]

58 Note that these models are trained using a symmetrical context of 9 characters (four left and four right of the current character). [sent-110, score-0.146]

59 In the next section we show performance of models with different window sizes. [sent-111, score-0.06]

60 2 Context window We run an experiment to evaluate how the size of the context in the training phase impacts the classification. [sent-113, score-0.06]

61 In Table 3 we show the results for symmetrical windows ranging in size from 1 to 13. [sent-115, score-0.11]

62 3 SRN features We also tested the automatically learned features derived from the activation of the hidden layer of an SRN language model, as explained in Section 3. [sent-124, score-0.202]

63 We combined these features with character code and Unicode category features in windows of different sizes. [sent-125, score-0.327]

64 The following rows show the combination of SRN features with the basic feature sets of varying window size. [sent-128, score-0.089]

65 It can be seen that augmenting the feature sets with SRN features results in large reductions of error rates. [sent-129, score-0.046]

66 The Cat-Code-1-SRN setting has error rates comparable to Cat-Code-9. [sent-130, score-0.13]

67 The addition of SRN features to the two best previous models, Cat-Code-9 and Cat-Code-13, reduces the error rate by 83% resp. [sent-131, score-0.046]

68 All these differences are statistically significant according to the binomial test (p < 0.001). [sent-68, score-0.047]

69 For English, there are too few errors to detect a statistically significant effect for Cat-Code-9 (p = 0.07). [sent-136, score-0.063]

70 Table 4: Results obtained using different context window sizes and addition of SRN features. [sent-139, score-0.06]

71 Error rates per thousand (‰) are reported per feature set with SRN features, for English, Dutch and Italian. [sent-140, score-0.134]

72 In a final step, we selected the best models based on the development sets (Cat-Code-7-SRN for English and Dutch, Cat-Code-11-SRN for Italian), and checked their performance on the final test set. [sent-143, score-0.038]

73 5 Discussion It is interesting to examine what kind of errors the SRN features help avoid. [sent-148, score-0.063]

74 In the English and Dutch datasets many errors are caused by failure to recognize personal titles and initials or misparsing of numbers. [sent-149, score-0.109]

75 In the Italian data, a large fraction of errors is due to verbs with clitics, which are written as a single word, but treated as separate tokens. [sent-150, score-0.063]

76 Table 5 shows examples of errors made by a simpler model that are fixed by adding SRN features. [sent-151, score-0.063]

77 Table 6 shows the confusion matrices for the Cat-Code-7 and Cat-Code-7-SRN sets on the Dutch data. [sent-152, score-0.034]

78 The mistake most improved by SRN features is T/I with 89% error reduction (see also Table 5). [sent-153, score-0.046]

79 A comparison with other approaches is hard because of the difference in datasets and task definition (combined word/sentence segmentation). [sent-155, score-0.083]

80 Here we just compare our results for sentence segmentation (sentence F1 score) with Punkt, a state-of-the-art sentence boundary detection system. Table 5: Positive impact of SRN features. [sent-156, score-0.093]

81 Table 6: Confusion matrix for the Dutch development set. Punkt is a state-of-the-art sentence boundary detection system (Kiss and Strunk, 2006). [sent-158, score-0.218]

82 Our system benefits here from its ability to adapt to a new domain with relatively little (but annotated) training data. [sent-165, score-0.039]

83 Word and sentence segmentation can be recast as a combined tagging task. [sent-167, score-0.201]

84 This way, tokenization is cast as a supervised learning task, causing a shift of labor from writing rules to manually correcting labels. [sent-168, score-0.254]

85 Furthermore, our tagging method does not lose the connection between original text and tokens. [sent-170, score-0.028]

86 In future work, we plan to broaden the scope of this work to other steps in document preparation; all software needed to replicate our experiments is available at http://gmb.let.rug.nl/elephant/experiments.php [sent-171, score-0.111]

87 Finally, we plan to explore the possibility of a statistical universal segmentation model for multiple languages and domains. [sent-177, score-0.13]

88 In a famous scene with a live elephant on stage, the comedian Jimmy Durante was asked about it by a policeman and, surprised, answered: “What elephant? [sent-178, score-0.192]

89 ” We feel we can say the same now as far as tokenization is concerned. [sent-179, score-0.29]

90 I testi del web: una proposta di classificazione sulla base del corpus PAISÀ. [sent-186, score-0.084]

91 Tokenization: Returning to a long solved problem - a survey, contrastive experiment, recommendations, and toolkit. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 378–382, Jeju Island, Korea. [sent-199, score-0.034]

92 Machine learning for high-quality tokenization - replicating variable tokenization schemes. [sent-213, score-0.508]

93 Conditional random fields: Probabilistic models for segmenting and labeling sequence data. [sent-235, score-0.044]

94 Some applications of tree-based modelling to speech and language. [sent-269, score-0.032]

95 An analysis of sentence boundary detection systems for English and Portuguese documents. [sent-278, score-0.218]

96 Sentence and token splitting based on conditional random fields. [sent-283, score-0.048]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ot', 0.399), ('srn', 0.362), ('tokenization', 0.254), ('dutch', 0.226), ('unicode', 0.223), ('character', 0.203), ('elephant', 0.192), ('italian', 0.163), ('boundary', 0.129), ('paisa', 0.128), ('tokenizers', 0.128), ('twnc', 0.128), ('gmb', 0.111), ('silla', 0.096), ('characters', 0.095), ('segmentation', 0.093), ('activation', 0.089), ('rates', 0.084), ('basile', 0.084), ('dridan', 0.084), ('evang', 0.084), ('groningen', 0.084), ('oepen', 0.084), ('iob', 0.076), ('chrupa', 0.076), ('kiss', 0.071), ('layer', 0.068), ('borghetti', 0.064), ('kaestner', 0.064), ('ordelman', 0.064), ('setenglishdutchitalian', 0.064), ('tomanek', 0.064), ('valerio', 0.064), ('errors', 0.063), ('window', 0.06), ('windows', 0.059), ('raw', 0.058), ('recurrent', 0.056), ('elman', 0.056), ('lavergne', 0.056), ('didn', 0.056), ('fares', 0.056), ('punkt', 0.056), ('reynar', 0.056), ('rug', 0.056), ('strunk', 0.056), ('tilburg', 0.056), ('tokens', 0.051), ('symmetrical', 0.051), ('thousand', 0.05), ('token', 0.048), ('bos', 0.047), ('binomial', 0.047), ('datasets', 0.046), ('error', 0.046), ('room', 0.046), ('tokenized', 0.046), ('detection', 0.045), ('hidden', 0.045), ('sentence', 0.044), ('labeling', 0.044), ('january', 0.042), ('abbreviations', 0.042), ('del', 0.042), ('netherlands', 0.042), ('boundaries', 0.041), ('kilian', 0.041), ('grzegorz', 0.041), ('adapt', 0.039), ('rulebased', 0.039), ('chunks', 0.039), ('units', 0.039), ('network', 0.038), ('checked', 0.038), ('hard', 0.037), ('languages', 0.037), ('feel', 0.036), ('embeddings', 0.036), ('mikolov', 0.036), ('jeffrey', 0.036), ('combined', 0.036), ('english', 0.035), ('codes', 0.034), ('confusion', 0.034), ('sj', 0.034), ('johan', 0.034), ('palmer', 0.034), ('solved', 0.034), ('lecture', 0.032), ('speech', 0.032), ('stephan', 0.031), ('fields', 0.031), ('newswire', 0.031), ('editor', 0.031), ('combination', 0.029), ('unsupervised', 0.029), ('lafferty', 0.029), ('categories', 0.029), ('code', 0.029), ('tagging', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 72 emnlp-2013-Elephant: Sequence Labeling for Word and Sentence Segmentation

Author: Kilian Evang ; Valerio Basile ; Grzegorz Chrupala ; Johan Bos

Abstract: Tokenization is widely regarded as a solved problem due to the high accuracy that rule-based tokenizers achieve. But rule-based tokenizers are hard to maintain and their rules are language-specific. We show that high-accuracy word and sentence segmentation can be achieved by using supervised sequence labeling on the character level combined with unsupervised feature learning. We evaluated our method on three languages and obtained error rates of 0.27 ‰ (English), 0.35 ‰ (Dutch) and 0.76 ‰ (Italian) for our best models.

2 0.18292108 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging

Author: Xiaoqing Zheng ; Hanyang Chen ; Tianyu Xu

Abstract: This study explores the feasibility of performing Chinese word segmentation (CWS) and POS tagging by deep learning. We try to avoid task-specific feature engineering, and use deep layers of neural networks to discover relevant features to the tasks. We leverage large-scale unlabeled data to improve internal representation of Chinese characters, and use these improved representations to enhance supervised word segmentation and POS tagging models. Our networks achieved close to state-of-theart performance with minimal computational cost. We also describe a perceptron-style algorithm for training the neural networks, as an alternative to maximum-likelihood method, to speed up the training process and make the learning algorithm easier to be implemented.

3 0.14206856 204 emnlp-2013-Word Level Language Identification in Online Multilingual Communication

Author: Dong Nguyen ; A. Seza Dogruoz

Abstract: Multilingual speakers switch between languages in online and spoken communication. Analyses of large scale multilingual data require automatic language identification at the word level. For our experiments with multilingual online discussions, we first tag the language of individual words using language models and dictionaries. Secondly, we incorporate context to improve the performance. We achieve an accuracy of 98%. Besides word level accuracy, we use two new metrics to evaluate this task.

4 0.11657235 82 emnlp-2013-Exploring Representations from Unlabeled Data with Co-training for Chinese Word Segmentation

Author: Longkai Zhang ; Houfeng Wang ; Xu Sun ; Mairgup Mansur

Abstract: Nowadays supervised sequence labeling models can reach competitive performance on the task of Chinese word segmentation. However, the ability of these models is restricted by the availability of annotated data and the design of features. We propose a scalable semi-supervised feature engineering approach. In contrast to previous works using pre-defined task-specific features with fixed values, we dynamically extract representations of label distributions from both an in-domain corpus and an out-of-domain corpus. We update the representation values with a semi-supervised approach. Experiments on the benchmark datasets show that our approach achieves good results, reaching an F-score of 0.961. The feature engineering approach proposed here is a general iterative semi-supervised method and is not limited to the word segmentation task.

5 0.097266294 21 emnlp-2013-An Empirical Study Of Semi-Supervised Chinese Word Segmentation Using Co-Training

Author: Fan Yang ; Paul Vozila

Abstract: In this paper we report an empirical study on semi-supervised Chinese word segmentation using co-training. We utilize two segmenters: 1) a word-based segmenter leveraging a word-level language model, and 2) a character-based segmenter using character-level features within a CRF-based sequence labeler. These two segmenters are initially trained with a small amount of segmented data, and then iteratively improve each other using the large amount of unlabelled data. Our experimental results show that co-training captures 20% and 31% of the performance improvement achieved by supervised training with an order of magnitude more data for the SIGHAN Bakeoff 2005 PKU and CU corpora respectively.

6 0.094439715 113 emnlp-2013-Joint Language and Translation Modeling with Recurrent Neural Networks

7 0.093167283 8 emnlp-2013-A Joint Learning Model of Word Segmentation, Lexical Acquisition, and Phonetic Variability

8 0.069318205 190 emnlp-2013-Ubertagging: Joint Segmentation and Supertagging for English

9 0.062900156 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization

10 0.055276398 89 emnlp-2013-Gender Inference of Twitter Users in Non-English Contexts

11 0.051446754 38 emnlp-2013-Bilingual Word Embeddings for Phrase-Based Machine Translation

12 0.051141176 59 emnlp-2013-Deriving Adjectival Scales from Continuous Space Word Representations

13 0.04714502 127 emnlp-2013-Max-Margin Synchronous Grammar Induction for Machine Translation

14 0.046898711 124 emnlp-2013-Leveraging Lexical Cohesion and Disruption for Topic Segmentation

15 0.046314374 111 emnlp-2013-Joint Chinese Word Segmentation and POS Tagging on Heterogeneous Annotated Corpora with Multiple Task Learning

16 0.045990311 27 emnlp-2013-Authorship Attribution of Micro-Messages

17 0.044508528 132 emnlp-2013-Mining Scientific Terms and their Definitions: A Study of the ACL Anthology

18 0.043301873 157 emnlp-2013-Recursive Autoencoders for ITG-Based Translation

19 0.04230421 70 emnlp-2013-Efficient Higher-Order CRFs for Morphological Tagging

20 0.041203506 169 emnlp-2013-Semi-Supervised Representation Learning for Cross-Lingual Text Classification


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.162), (1, -0.024), (2, -0.024), (3, -0.084), (4, -0.102), (5, -0.009), (6, 0.072), (7, 0.179), (8, -0.159), (9, 0.099), (10, 0.02), (11, 0.048), (12, 0.1), (13, -0.047), (14, -0.026), (15, 0.061), (16, 0.004), (17, -0.03), (18, -0.032), (19, -0.001), (20, -0.008), (21, 0.08), (22, 0.084), (23, -0.056), (24, -0.014), (25, -0.035), (26, 0.057), (27, -0.08), (28, 0.064), (29, 0.06), (30, -0.085), (31, -0.002), (32, -0.046), (33, -0.081), (34, -0.104), (35, 0.014), (36, 0.073), (37, -0.031), (38, -0.012), (39, -0.101), (40, 0.017), (41, 0.02), (42, 0.002), (43, 0.069), (44, 0.005), (45, 0.12), (46, -0.033), (47, 0.049), (48, -0.006), (49, -0.035)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92409509 72 emnlp-2013-Elephant: Sequence Labeling for Word and Sentence Segmentation

Author: Kilian Evang ; Valerio Basile ; Grzegorz Chrupala ; Johan Bos

2 0.70393121 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging

3 0.66163981 21 emnlp-2013-An Empirical Study Of Semi-Supervised Chinese Word Segmentation Using Co-Training

4 0.64572096 82 emnlp-2013-Exploring Representations from Unlabeled Data with Co-training for Chinese Word Segmentation

5 0.57411236 8 emnlp-2013-A Joint Learning Model of Word Segmentation, Lexical Acquisition, and Phonetic Variability

Author: Micha Elsner ; Sharon Goldwater ; Naomi Feldman ; Frank Wood

Abstract: We present a cognitive model of early lexical acquisition which jointly performs word segmentation and learns an explicit model of phonetic variation. We define the model as a Bayesian noisy channel; we sample segmentations and word forms simultaneously from the posterior, using beam sampling to control the size of the search space. Compared to a pipelined approach in which segmentation is performed first, our model is qualitatively more similar to human learners. On data with variable pronunciations, the pipelined approach learns to treat syllables or morphemes as words. In contrast, our joint model, like infant learners, tends to learn multiword collocations. We also conduct analyses of the phonetic variations that the model learns to accept and its patterns of word recognition errors, and relate these to developmental evidence.

6 0.52649063 204 emnlp-2013-Word Level Language Identification in Online Multilingual Communication

7 0.52401179 113 emnlp-2013-Joint Language and Translation Modeling with Recurrent Neural Networks

8 0.47902954 111 emnlp-2013-Joint Chinese Word Segmentation and POS Tagging on Heterogeneous Annotated Corpora with Multiple Task Learning

9 0.45006225 190 emnlp-2013-Ubertagging: Joint Segmentation and Supertagging for English

10 0.38454893 59 emnlp-2013-Deriving Adjectival Scales from Continuous Space Word Representations

11 0.36721915 46 emnlp-2013-Classifying Message Board Posts with an Extracted Lexicon of Patient Attributes

12 0.3613323 35 emnlp-2013-Automatically Detecting and Attributing Indirect Quotations

13 0.35866231 156 emnlp-2013-Recurrent Continuous Translation Models

14 0.35729113 52 emnlp-2013-Converting Continuous-Space Language Models into N-Gram Language Models for Statistical Machine Translation

15 0.35210475 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization

16 0.34841806 89 emnlp-2013-Gender Inference of Twitter Users in Non-English Contexts

17 0.34194601 55 emnlp-2013-Decoding with Large-Scale Neural Language Models Improves Translation

18 0.32619753 27 emnlp-2013-Authorship Attribution of Micro-Messages

19 0.32145622 132 emnlp-2013-Mining Scientific Terms and their Definitions: A Study of the ACL Anthology

20 0.31835365 196 emnlp-2013-Using Crowdsourcing to get Representations based on Regular Expressions


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.011), (3, 0.071), (18, 0.036), (22, 0.047), (30, 0.051), (43, 0.011), (47, 0.014), (50, 0.023), (51, 0.165), (66, 0.034), (67, 0.341), (71, 0.041), (75, 0.034), (77, 0.029), (96, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.75037843 72 emnlp-2013-Elephant: Sequence Labeling for Word and Sentence Segmentation

Author: Kilian Evang ; Valerio Basile ; Grzegorz Chrupala ; Johan Bos

2 0.67852485 66 emnlp-2013-Dynamic Feature Selection for Dependency Parsing

Author: He He ; Hal Daume III ; Jason Eisner

Abstract: Feature computation and exhaustive search have significantly restricted the speed of graph-based dependency parsing. We propose a faster framework of dynamic feature selection, where features are added sequentially as needed, edges are pruned early, and decisions are made online for each sentence. We model this as a sequential decision-making problem and solve it by imitation learning techniques. We test our method on 7 languages. Our dynamic parser can achieve accuracies comparable or even superior to parsers using a full set of features, while computing fewer than 30% of the feature templates.

3 0.49377686 140 emnlp-2013-Of Words, Eyes and Brains: Correlating Image-Based Distributional Semantic Models with Neural Representations of Concepts

Author: Andrew J. Anderson ; Elia Bruni ; Ulisse Bordignon ; Massimo Poesio ; Marco Baroni

Abstract: Traditional distributional semantic models extract word meaning representations from cooccurrence patterns of words in text corpora. Recently, the distributional approach has been extended to models that record the cooccurrence of words with visual features in image collections. These image-based models should be complementary to text-based ones, providing a more cognitively plausible view of meaning grounded in visual perception. In this study, we test whether image-based models capture the semantic patterns that emerge from fMRI recordings of the neural signal. Our results indicate that, indeed, there is a significant correlation between image-based and brain-based semantic similarities, and that image-based models complement text-based ones, so that the best correlations are achieved when the two modalities are combined. Despite some unsatisfactory, but explained outcomes (in particular, failure to detect differential association of models with brain areas), the results show, on the one hand, that image-based distributional semantic models can be a precious new tool to explore semantic representation in the brain, and, on the other, that neural data can be used as the ultimate test set to validate artificial semantic models in terms of their cognitive plausibility.

4 0.49288335 12 emnlp-2013-A Semantically Enhanced Approach to Determine Textual Similarity

Author: Eduardo Blanco ; Dan Moldovan

Abstract: This paper presents a novel approach to determine textual similarity. A layered methodology to transform text into logic forms is proposed, and semantic features are derived from a logic prover. Experimental results show that incorporating the semantic structure of sentences is beneficial. When training data is unavailable, scores obtained from the logic prover in an unsupervised manner outperform supervised methods.

5 0.49276853 152 emnlp-2013-Predicting the Presence of Discourse Connectives

Author: Gary Patterson ; Andrew Kehler

Abstract: We present a classification model that predicts the presence or omission of a lexical connective between two clauses, based upon linguistic features of the clauses and the type of discourse relation holding between them. The model is trained on a set of high frequency relations extracted from the Penn Discourse Treebank and achieves an accuracy of 86.6%. Analysis of the results reveals that the most informative features relate to the discourse dependencies between sequences of coherence relations in the text. We also present results of an experiment that provides insight into the nature and difficulty of the task.

6 0.49247742 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction

7 0.49188375 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging

8 0.49134836 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization

9 0.48985323 132 emnlp-2013-Mining Scientific Terms and their Definitions: A Study of the ACL Anthology

10 0.48759311 51 emnlp-2013-Connecting Language and Knowledge Bases with Embedding Models for Relation Extraction

11 0.48713338 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction

12 0.48675942 82 emnlp-2013-Exploring Representations from Unlabeled Data with Co-training for Chinese Word Segmentation

13 0.48652408 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks

14 0.48624358 204 emnlp-2013-Word Level Language Identification in Online Multilingual Communication

15 0.48618627 69 emnlp-2013-Efficient Collective Entity Linking with Stacking

16 0.48534864 42 emnlp-2013-Building Specialized Bilingual Lexicons Using Large Scale Background Knowledge

17 0.48495975 181 emnlp-2013-The Effects of Syntactic Features in Automatic Prediction of Morphology

18 0.48475811 36 emnlp-2013-Automatically Determining a Proper Length for Multi-Document Summarization: A Bayesian Nonparametric Approach

19 0.48384175 64 emnlp-2013-Discriminative Improvements to Distributional Sentence Similarity

20 0.4835957 193 emnlp-2013-Unsupervised Induction of Cross-Lingual Semantic Relations