emnlp emnlp2013 emnlp2013-21 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Fan Yang ; Paul Vozila
Abstract: In this paper we report an empirical study on semi-supervised Chinese word segmentation using co-training. We utilize two segmenters: 1) a word-based segmenter leveraging a word-level language model, and 2) a character-based segmenter using characterlevel features within a CRF-based sequence labeler. These two segmenters are initially trained with a small amount of segmented data, and then iteratively improve each other using the large amount of unlabelled data. Our experimental results show that co-training captures 20% and 31% of the performance improvement achieved by supervised training with an order of magnitude more data for the SIGHAN Bakeoff 2005 PKU and CU corpora respectively.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In this paper we report an empirical study on semi-supervised Chinese word segmentation using co-training. [sent-4, score-0.177]
2 We utilize two segmenters: 1) a word-based segmenter leveraging a word-level language model, and 2) a character-based segmenter using character-level features within a CRF-based sequence labeler. [sent-5, score-1.241]
3 These two segmenters are initially trained with a small amount of segmented data, and then iteratively improve each other using the large amount of unlabelled data. [sent-6, score-0.624]
4 1 Introduction In the literature there exist two general models for supervised Chinese word segmentation, the word-based approach and the character-based approach. [sent-8, score-0.146]
5 The character-based approach treats segmentation as a character sequence labeling problem, indicating whether a character is located at the boundary of a word. [sent-10, score-0.755]
6 Typically the word-based approach uses word-level features, such as word n-grams and word length, while the character-based approach uses character-level information, such as character n-grams. [sent-11, score-0.597]
7 The goal is to make use of the in-domain unsegmented data to improve the ultimate performance of word segmentation. [sent-17, score-0.325]
8 This naturally motivates the use of co-training, which utilizes two models trained on different views of the labeled input data and has them iteratively educate each other with the unlabelled data. [sent-20, score-0.181]
9 At the end of the co-training iterations, the initially weak models achieve improved performance. [sent-21, score-0.187]
10 [Figure 1: A search space for the word segmenter] ... describe our co-training experiments. [sent-29, score-0.631]
11 Given a character sequence c1c2 . . . cn, the word-based approach searches in all possible segmentations for one that maximizes a pre-defined utility function, formally represented as in Equation 1. [sent-39, score-0.247]
12 The search space of a character sequence (c1c2 . . . cn) can be represented as a lattice, where each vertex represents a character boundary index and each arc represents a word candidate, which is the sequence of characters within the index range. [sent-43, score-0.383]
13 For example, given the character sequence 发展中国家 and a dictionary that contains the words {发展, 中国, 国家} and all single Chinese characters, the search space is illustrated in Figure 1. [sent-45, score-0.333]
14 A simple utility function is the negative of the number of words (i.e. Util(W) = −|W|), which prefers segmentations with fewer words. (Footnote 1: A dictionary is not a must to create the search space, but it can shrink the search space and also lead to improved segmentation performance.) [sent-51, score-0.195]
15 Alternatively, one can search for the segmentation that maximizes the word sequence probability P(W) (i.e., the score of a word-level language model). [sent-53, score-0.233]
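To make the lattice search concrete, the following is a minimal sketch (not the authors' implementation) of a dictionary-driven word-based segmenter: a dynamic program over the lattice that maximizes the sum of per-word scores. The function names, the maximum word length, and the example dictionary are illustrative assumptions; the score function shown recovers the simple utility Util(W) = −|W|.

```python
def segment(chars, dictionary, word_score, max_word_len=6):
    """Lattice search for the best-scoring segmentation of a character string.

    word_score is a callable scoring one candidate word; the DP maximizes the
    sum of word scores, e.g. lambda w: -1.0 recovers Util(W) = -|W|.
    """
    n = len(chars)
    # best[i] = (score of the best segmentation of chars[:i], back-pointer)
    best = [(float("-inf"), None)] * (n + 1)
    best[0] = (0.0, None)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            word = chars[j:i]
            # single characters are always allowed; longer spans need the dictionary
            if len(word) > 1 and word not in dictionary:
                continue
            score = best[j][0] + word_score(word)
            if score > best[i][0]:
                best[i] = (score, j)
    # follow back-pointers to recover the word sequence
    words, i = [], n
    while i > 0:
        j = best[i][1]
        words.append(chars[j:i])
        i = j
    return list(reversed(words))

# Util(W) = -|W|: prefer segmentations with as few words as possible.
print(segment("发展中国家", {"发展", "中国", "国家", "发展中国家"}, lambda w: -1.0))
```

With a unigram language model, word_score(w) = ln p(w) − K would recover the Equation 4 utility introduced later; the paper's higher-order n-gram language model would additionally require carrying language-model state through the dynamic program.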
16 2 Character-Based Segmenter The character-based approach treats word segmentation as a character sequence labeling problem, to label each character with its location in a word, first proposed by Xue (2003). [sent-77, score-0.783]
17 The basic labeling scheme is to use two tags: ‘B’ for the beginning character of a word and ‘O’ for other characters (Peng et al., 2004). [sent-78, score-0.371]
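To make the labeling view concrete, here is a minimal sketch that converts a segmented sentence into per-character tags. It uses the four-tag "B I E S" scheme that the paper's own character-based segmenter adopts (see Section 4.3) rather than the two-tag ‘B’/‘O’ scheme above; the function name and example sentence are assumptions.

```python
def words_to_tags(words):
    """Map a segmented sentence (list of words) to per-character BIES tags:
    S = single-character word, B = first, E = last, I = internal character."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(w) - 2) + ["E"])
    return tags

# "发展 中国 的" -> ['B', 'E', 'B', 'E', 'S']
print(words_to_tags(["发展", "中国", "的"]))
```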
18 Training and decoding of the character labeling problem is similar to part-of-speech tagging. [sent-84, score-0.261]
19 (Footnote 2: Teahan et al. (2000) use a character language model to determine whether a word boundary should be inserted after each character, which can also be considered a character-based approach.) [sent-85, score-0.309]
20 3 Comparison and Combination It is more natural to use word-level information, such as word n-grams and word length, in a word-based segmenter, while it is more natural to use character-level information, such as character n-grams, in a character-based segmenter. [sent-92, score-0.311]
21 Andrew (2006) also shows that a semi-Markov CRF makes strictly weaker independence assumptions than a linear CRF, and so a word-based segmenter using an order-K semi-Markov model is more expressive than a character-based model using an order-K CRF. [sent-98, score-0.639]
22 For example, the character 们 usually works as a suffix to signal plural; the character 者 can also be a suffix meaning a group of people; and 阿 generally works as a prefix before a person’s nickname that has one character. [sent-101, score-0.522]
23 For example, a character-based model can learn that 阿 is usually tagged as ‘B’ and the next character is usually tagged as ‘E’ . [sent-103, score-0.261]
24 The inputs are two sets of data, a labelled set S and an unlabelled set U. [sent-117, score-0.186]
25 Select examples labelled by M2, add to L1. 7. Randomly move samples from U to C so that C maintains its size. UNTIL stopping criteria. [sent-131, score-0.183]
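Read as code, the Blum-and-Mitchell-style loop that Figure 2 describes can be sketched as below. This is a sketch under stated assumptions, not the paper's implementation: train, label and select stand in for the two segmenters' training, decoding and example-selection routines, and the cache C is kept here even though the paper later chooses to drop it.

```python
import random

def co_train(S, U, train, label, select, cache_size, n_iter):
    """Two-view co-training loop (a sketch of the Figure 2 framework).

    train(data) -> model; label(model, cache) -> auto-labelled cache;
    select(labelled_by_M1, labelled_by_M2) -> (for_L1, for_L2, used_indices).
    All three callables are assumptions standing in for the paper's word-based
    and character-based segmenters and their selection criterion."""
    L1, L2 = list(S), list(S)                        # per-view training sets
    pool = list(U)
    random.shuffle(pool)
    C = [pool.pop() for _ in range(min(cache_size, len(pool)))]  # unlabelled cache
    for _ in range(n_iter):                          # UNTIL stopping criteria
        M1, M2 = train(L1), train(L2)                # (re)train both views
        auto1, auto2 = label(M1, C), label(M2, C)    # label the cache with each view
        for_L1, for_L2, used = select(auto1, auto2)  # confident examples educate
        L1 += for_L1                                 # ... the other view
        L2 += for_L2
        C = [c for i, c in enumerate(C) if i not in used]
        while pool and len(C) < cache_size:          # refill C so it maintains its size
            C.append(pool.pop())
    return train(L1), train(L2)
```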
26 In practice it has been shown that co-training can still achieve improved performance when this assumption is violated, but conforming to the conditional independence assumption leads to a bigger gain (Nigam and Ghani, 2000; Pierce and Cardie, 2001). [sent-146, score-0.149]
27 Third, the decoding and training of the two models need to be efficient, as in co-training we need to segment the unlabelled data and re-train the models in each iteration. [sent-149, score-0.141]
28 Word-based segmenter In the word-based segmenter, we utilize a statistical n-gram language model and try to optimize the language modeling score together with a word insertion penalty, as shown in Equation 4. [sent-151, score-0.631]
29 Util(W) = ln(P(W)) − |W| ∗ K (4) Character-based segmenter We use an order-1 linear conditional random field to label a character sequence. [sent-155, score-0.893]
30 The features that we use are character n-grams within the neighboring 5-character window and tag bigrams. [sent-158, score-0.317]
31 Given a character c0 in the character sequence c−2c−1c0c1c2, we extract the following features: character unigrams c−2, c−1, c0, c1, c2, and character bigrams c−1c0 and c0c1. [sent-159, score-0.812]
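A minimal sketch of the feature templates just described (unigrams c−2..c2 and the two bigrams c−1c0 and c0c1 within the 5-character window). The padding symbol, feature-name format and function name are assumptions; in practice such templates would be handed to a CRF toolkit (the footnote's URL fragment suggests CRF++) rather than a hand-rolled extractor like this one.

```python
def char_features(chars, i, pad="<PAD>"):
    """Unigram and bigram features for position i, as described above."""
    def c(k):
        j = i + k
        return chars[j] if 0 <= j < len(chars) else pad
    feats = {f"U{k}={c(k)}" for k in (-2, -1, 0, 1, 2)}   # c-2 .. c2
    feats.add(f"B-1,0={c(-1)}{c(0)}")                     # bigram c-1 c0
    feats.add(f"B0,1={c(0)}{c(1)}")                       # bigram c0 c1
    return feats

# features for the third character of the example sequence
print(sorted(char_features("发展中国家", 2)))
```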
32 As can be seen, we build a word-based segmenter that uses only word-level features, and a character-based segmenter that uses only character-level features. [sent-161, score-1.531]
33 These two segmenters by no means satisfy the conditional independence assumption, but we hope that they are not too correlated, as they use different levels of information. (Footnote 3: http://crfpp.) [sent-162, score-0.251]
34 Also, the effectiveness of these two segmenters has been demonstrated in the literature and will be shown again in our results in Section 4. [sent-166, score-0.204]
35 Finally, both segmenters can be trained and can decode quickly. [sent-167, score-0.181]
36 2 Co-Training We follow the framework in Figure 2 for the co-training setup. [sent-171, score-0.164]
37 We do not use the cache C, but directly label the whole unlabelled data set U, because in our experiment setup (see Section 4) U is not huge and computationally we can afford to label the whole set. [sent-172, score-0.208]
38 In every iteration, we pick some sentences that are segmented by the character-based model with high confidence but are segmented by the word-based model with low confidence to add to the training data of the word-based model, and vice versa. [sent-177, score-0.495]
39 Thus we rank the sentences by their confidence scores in each segmenter respectively, and calculate the rank difference between the two segmenters. [sent-182, score-0.662]
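A minimal sketch of this selection rule, under the assumption that each segmenter exposes a per-sentence confidence score over the same unlabelled sentences; the function and parameter names are illustrative, not the paper's code.

```python
def select_by_rank_difference(conf_word, conf_char, top_k):
    """Return indices of sentences the character-based model is confident about
    but the word-based model is not (to augment the word-based training data),
    and vice versa, based on the rank difference between the two segmenters."""
    def ranks(scores):                        # rank 0 = most confident
        order = sorted(range(len(scores)), key=lambda i: -scores[i])
        r = [0] * len(scores)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rw, rc = ranks(conf_word), ranks(conf_char)
    # sort by rank_char - rank_word: very negative values mean the char model is
    # confident (low rank) exactly where the word model is not (high rank)
    diff = sorted(range(len(rw)), key=lambda i: rc[i] - rw[i])
    for_word_model = diff[:top_k]             # char-labelled, added to word model
    for_char_model = diff[-top_k:]            # word-labelled, added to char model
    return for_word_model, for_char_model
```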
40 In each run one set is used as the labelled data S, and the other nine sets are combined and used as the unlabelled data U with segmentations removed. [sent-191, score-0.297]
41 That is, 10% of the training data is used as segmented data, and the other 90% is used as unsegmented data in our semi-supervised training. [sent-192, score-0.492]
42 This setup resembles our semi-supervised application, where there is only a small amount of segmented data but a relatively large amount of in-domain unsegmented data available. [sent-193, score-0.577]
43 The final trained character-based and word-based segmenters from co-training are then evaluated on the testing data. [sent-194, score-0.21]
44 2 Co-Training Results For comparison, we measure the baseline as the performance of a model trained with the 10% segmented data only (referred to as the BASIC baselines). [sent-201, score-0.169]
45 The BASIC baselines, both for the word-based model and the character-based model, however, use only the segmented data and leave out the large amount of available unsegmented data. [sent-202, score-0.613]
46 We thus measure another baseline (referred to as FOLD-IN), which naively uses the unsegmented data. [sent-203, score-0.336]
47 In the FOLD-IN baseline, a model is first trained with the 10% segmented data, and then this model is used [sent-204, score-0.169]
[Table 1: Co-training results (BASIC, FOLD-IN, CO-TRAIN, CEILING)] [Figure 3: Gap filling with different split ratios]
48 to label the unsegmented data. [sent-208, score-0.38]
49 The automatic segmentation is then combined with the segmented data to build a new model. [sent-209, score-0.349]
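The FOLD-IN baseline just described is essentially one round of self-training; a minimal sketch follows, where train and decode are assumed interfaces of either segmenter rather than the paper's actual code.

```python
def fold_in_baseline(segmented, unsegmented, train, decode):
    """Self-training baseline: train on the small segmented set, auto-label the
    unsegmented set with that model, then retrain on the union."""
    seed_model = train(segmented)
    auto_labelled = [decode(seed_model, sentence) for sentence in unsegmented]
    return train(segmented + auto_labelled)
```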
50 In the CEILING condition, we use the true segmentations of the 90% unsegmented data together with the 10% segmented data to train a model. [sent-212, score-0.552]
51 The CEILING tells us the oracle performance when we have all segmented data for training, while the BASIC shows how much performance drops when we only have 10% of the segmented data. [sent-213, score-0.338]
52 The performance of co-training will tell us how much we can fill the gap by taking advantage of the other 90% as unsegmented data in the semi-supervised training. [sent-214, score-0.353]
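One natural reading of the 20% and 31% figures quoted in the abstract is this relative gap-filling ratio; a one-line computation, assuming F-scores for the three conditions (the function name is an assumption):

```python
def gap_filled(f_basic, f_cotrain, f_ceiling):
    """Fraction of the BASIC-to-CEILING gap recovered by co-training."""
    return (f_cotrain - f_basic) / (f_ceiling - f_basic)
```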
53 Co-training should perform better than naively folding in the unsegmented data. [sent-217, score-0.369]
54 Second, we see that under all four conditions, the character-based segmenter performs better than the word-based model. [sent-225, score-0.606]
55 The word-based segmenter implemented in this work is less powerful, and it needs a good dictionary to achieve good performance. [sent-227, score-0.649]
56 In our implementation, a dictionary is extracted from the segmented training set. [sent-228, score-0.235]
57 More interestingly, the character-based segmenter is able to benefit from the less powerful word-based segmenter. [sent-232, score-0.606]
58 Finally, comparing FOLD-IN and BASIC, we see that naively using the unsegmented data does not lead to a significant improvement. [sent-234, score-0.336]
59 This suggests that co-training provides a process that effectively makes use of the unsegmented data. [sent-235, score-0.464]
60 For completeness, in Figure 3 we also show the relative gap filling with different splits of the segmented vs unsegmented data. [sent-236, score-0.572]
61 With more data moving to the segmented set, the absolute improvement of co-training over BASIC gets smaller, while the gap between the BASIC and CEILING also becomes smaller. [sent-237, score-0.256]
62 3 Further Analysis It is not surprising that the word-based segmenter benefits from co-training since it learns from the more accurate character-based segmenter. [sent-242, score-0.63]
63 Our focus, however, is to better understand what benefit the character-based segmenter gains from the co-training procedure. [sent-243, score-0.77]
64 The character-based segmenter treats word segmentation as a character sequence labelling problem with four tags "B", "I", "E", "S". [sent-244, score-1.124]
65 Assuming that segmentation accuracy is proportional to tag accuracy, we examine the tag accuracy of the character-based segmenter before and after co-training. [sent-245, score-0.87]
66 If a character is labelled with tag T0 initially before co-training and with tag T1 after co-training, with the tag T1 different from T0, there can be one of three cases: 1) T0 is correct; 2) T1 is correct; or 3) neither is correct. [sent-246, score-0.52]
67 The absolute gain from co-training of switching from tag T0 to T1 is defined as the number of case 2 instances minus the number of case 1 instances. [sent-247, score-0.378]
68 Absolute gain indicates the gain of tag accuracy where co-training learns to switch from T0 to T1, and it contributes to the overall tag accuracy improvement. [sent-248, score-0.303]
69 We also define relative gain of switching from tag T0 to T1 as the absolute gain divided by the total number of cases switching from tag T0 to T1. [sent-249, score-0.418]
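These two quantities can be computed directly from the per-character tag decisions; a minimal sketch, where each record holds the tag before co-training, the tag after, and the gold tag (this record layout is an assumption, not the paper's data format):

```python
from collections import defaultdict

def tag_switch_gains(records):
    """records: iterable of (tag_before, tag_after, tag_gold) per character.
    For every switch T0 -> T1 (T0 != T1):
    absolute gain = #(T1 correct) - #(T0 correct),
    relative gain = absolute gain / #(all switches T0 -> T1)."""
    wins, total = defaultdict(int), defaultdict(int)
    for t0, t1, gold in records:
        if t0 == t1:
            continue
        total[(t0, t1)] += 1
        if gold == t1:
            wins[(t0, t1)] += 1        # case 2: the switch fixed the tag
        elif gold == t0:
            wins[(t0, t1)] -= 1        # case 1: the switch broke a correct tag
        # case 3 (neither correct) does not affect the absolute gain
    return {k: (wins[k], wins[k] / total[k]) for k in total}
```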
70 4 Feature Combination We split the features into two sets, a character-level feature set used by the character-based segmenter and a word-level feature set used by the word-based segmenter. [Table 2: Absolute Gain and Relative Gain] [sent-255, score-0.634]
71 We have shown that these two segmenters improve each other via co-training. [sent-256, score-0.181]
72 When a large amount of segmented data is available (e.g. under the CEILING condition), a segmenter with combined features tends to perform better than one using only a single set of features. [sent-261, score-0.634]
73 First, we want to understand whether co-training, which splits the features, can actually beat the BASIC and FOLD-IN baselines of a segmenter with combined features. [sent-263, score-0.634]
74 We use this segmenter because it is publicly available and it performs well on both the PKU corpus and CU corpus. [sent-266, score-0.606]
75 It models word segmentation as a character labelling problem, and solves it with a passive-aggressive optimization algorithm. [sent-267, score-0.46]
76 Character-level features include character unigrams and bigrams in the five-character window, and whether the current character is the same as the next character or the one after the next. [sent-270, score-0.783]
77 For ease of description, we will refer to Weiwei Sun's segmenter with combined features as the Sun-Segmenter, and to the character-based segmenter used in our co-training, which uses character-level features, as the Char-Segmenter. [sent-283, score-1.404]
78 The Sun-Segmenter has more gain when folding in the unsegmented data than the Char-Segmenter, further suggesting that the Sun-Segmenter is benefiting from the size of the data. [sent-290, score-0.416]
79 For both corpora, however, the Char-Segmenter after co-training beats the FOLD-IN baseline of the Sun-Segmenter by at least 0. [sent-291, score-0.164]
80 When there is only a small amount of segmented data available, using a more advanced segmenter with combined features still under-performs co-training. [sent-293, score-0.882]
81 Next we would like to explore whether we could further improve the co-training performance, given that we have a more advanced segmenter using combined features. [sent-295, score-0.631]
82 In the first approach, after all the iterations of co-training, the data are split into two sets, one set for training the word-based segmenter L1 and the other set for training the character-based segmenter L2. [sent-297, score-1.286]
83 The segmentations of these two sets of data are probably better than the segmentations under the FOLD-IN condition. [sent-298, score-0.166]
84 In the second approach, we use the character-based segmenter after co-training, which has an improved performance, to relabel the set of unsegmented data U, and then combine it with the segmented data set S. [sent-300, score-1.075]
85 These research works aim to use a huge amount of unsegmented data to further improve the performance of an already well-trained supervised model. [sent-310, score-0.385]
86 In this paper, we assume a much more limited amount of segmented data is available, and try to boost the performance by using in-domain unsegmented data. [sent-311, score-0.523]
87 For example, a CRF segmenter trained on the SIGHAN MSR training data, which achieves an F-measure of 96. [sent-313, score-0.629]
88 8% when applied to the PKU testing data; and the same CRF segmenter trained on the PKU training data achieves 94. [sent-315, score-0.658]
89 When one starts a new application that requires word segmentation in a new domain, it is likely that there is only a very small amount of segmented data available. [sent-317, score-0.4]
90 We propose the approach of co-training for Chinese word segmentation in the semi-supervised setting where there is only a limited amount of human-segmented data available, but there exists a relatively large amount of in-domain unsegmented data. [sent-318, score-0.585]
91 We split the feature set into character-level features and word-level features, and then build a character-based segmenter with character-level features and a word-based segmenter with word-level features, using the limited amount of available segmented data. [sent-319, score-1.463]
92 These two segmenters then iteratively educate and improve each other by making use of the large amount of unsegmented data. [sent-320, score-0.598]
93 Finally, we combine the word-level and character-level features with an advanced segmenter to further improve the co-training performance. [sent-321, score-0.661]
94 Our experiments show that, using 10% of the data as segmented data and the other 90% as unsegmented data, co-training captures 20% of the performance improvement achieved by supervised training with all data on the SIGHAN 2005 PKU corpus and 31% on the CU corpus. [sent-322, score-0.523]
95 Chinese word segmentation and named entity recognition: A pragmatic approach. [sent-348, score-0.177]
96 Chinese and Japanese word segmentation using word-level and character-level information. [sent-366, score-0.177]
97 Chinese segmentation and new word detection using conditional random fields. [sent-374, score-0.177]
98 A discriminative latent variable Chinese segmenter with hybrid word/character information. [sent-387, score-0.732]
99 Word-based and character-based word segmentation models: comparison and combination. [sent-391, score-0.177]
100 A unified character-based tagging framework for Chinese word segmentation. [sent-419, score-0.151]
wordName wordTfidf (topN-words)
[('segmenter', 0.606), ('unsegmented', 0.3), ('character', 0.261), ('pku', 0.188), ('ceiling', 0.181), ('segmenters', 0.181), ('segmented', 0.169), ('cotraining', 0.164), ('segmentation', 0.152), ('cu', 0.138), ('chinese', 0.126), ('sun', 0.125), ('unlabelled', 0.118), ('sunsegmenter', 0.094), ('util', 0.094), ('wordbased', 0.09), ('sighan', 0.088), ('gain', 0.083), ('segmentations', 0.083), ('pierce', 0.076), ('blum', 0.075), ('labelled', 0.068), ('nigam', 0.066), ('crf', 0.062), ('nuance', 0.06), ('weiwei', 0.06), ('ghani', 0.057), ('guz', 0.057), ('tag', 0.056), ('confidence', 0.056), ('amount', 0.054), ('gap', 0.053), ('abney', 0.045), ('characters', 0.045), ('conditionally', 0.044), ('dictionary', 0.043), ('dasgupta', 0.042), ('dat', 0.041), ('switching', 0.041), ('basic', 0.04), ('pool', 0.039), ('educate', 0.038), ('exampl', 0.038), ('foldin', 0.038), ('gokhan', 0.038), ('labe', 0.038), ('raining', 0.038), ('relabelling', 0.038), ('singlecharacter', 0.038), ('cache', 0.038), ('stopping', 0.038), ('xue', 0.037), ('naively', 0.036), ('absolute', 0.034), ('led', 0.034), ('characterbased', 0.033), ('folding', 0.033), ('supervised', 0.031), ('rain', 0.03), ('nakagawa', 0.03), ('wordlevel', 0.03), ('testing', 0.029), ('sequence', 0.029), ('treats', 0.029), ('split', 0.028), ('combined', 0.028), ('bakeoff', 0.028), ('seconds', 0.028), ('maximizes', 0.027), ('cardie', 0.026), ('satisfy', 0.026), ('filling', 0.026), ('abe', 0.026), ('punctuation', 0.026), ('condition', 0.026), ('peng', 0.026), ('label', 0.026), ('augment', 0.025), ('msr', 0.025), ('ample', 0.025), ('switch', 0.025), ('advanced', 0.025), ('iteratively', 0.025), ('wn', 0.025), ('word', 0.025), ('mitchell', 0.024), ('surprising', 0.024), ('utility', 0.024), ('mu', 0.024), ('wen', 0.024), ('relative', 0.024), ('initially', 0.023), ('searches', 0.023), ('boundary', 0.023), ('internal', 0.023), ('training', 0.023), ('effectiveness', 0.023), ('add', 0.022), ('labelling', 0.022), ('bigger', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 21 emnlp-2013-An Empirical Study Of Semi-Supervised Chinese Word Segmentation Using Co-Training
Author: Fan Yang ; Paul Vozila
Abstract: In this paper we report an empirical study on semi-supervised Chinese word segmentation using co-training. We utilize two segmenters: 1) a word-based segmenter leveraging a word-level language model, and 2) a character-based segmenter using characterlevel features within a CRF-based sequence labeler. These two segmenters are initially trained with a small amount of segmented data, and then iteratively improve each other using the large amount of unlabelled data. Our experimental results show that co-training captures 20% and 31% of the performance improvement achieved by supervised training with an order of magnitude more data for the SIGHAN Bakeoff 2005 PKU and CU corpora respectively.
2 0.26046616 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging
Author: Xiaoqing Zheng ; Hanyang Chen ; Tianyu Xu
Abstract: This study explores the feasibility of performing Chinese word segmentation (CWS) and POS tagging by deep learning. We try to avoid task-specific feature engineering, and use deep layers of neural networks to discover relevant features to the tasks. We leverage large-scale unlabeled data to improve internal representation of Chinese characters, and use these improved representations to enhance supervised word segmentation and POS tagging models. Our networks achieved close to state-of-theart performance with minimal computational cost. We also describe a perceptron-style algorithm for training the neural networks, as an alternative to maximum-likelihood method, to speed up the training process and make the learning algorithm easier to be implemented.
3 0.25586253 82 emnlp-2013-Exploring Representations from Unlabeled Data with Co-training for Chinese Word Segmentation
Author: Longkai Zhang ; Houfeng Wang ; Xu Sun ; Mairgup Mansur
Abstract: Nowadays supervised sequence labeling models can reach competitive performance on the task of Chinese word segmentation. However, the ability of these models is restricted by the availability of annotated data and the design of features. We propose a scalable semi-supervised feature engineering approach. In contrast to previous works using pre-defined taskspecific features with fixed values, we dynamically extract representations of label distributions from both an in-domain corpus and an out-of-domain corpus. We update the representation values with a semi-supervised approach. Experiments on the benchmark datasets show that our approach achieve good results and reach an f-score of 0.961. The feature engineering approach proposed here is a general iterative semi-supervised method and not limited to the word segmentation task.
Author: Xipeng Qiu ; Jiayi Zhao ; Xuanjing Huang
Abstract: Chinese word segmentation and part-ofspeech tagging (S&T;) are fundamental steps for more advanced Chinese language processing tasks. Recently, it has attracted more and more research interests to exploit heterogeneous annotation corpora for Chinese S&T.; In this paper, we propose a unified model for Chinese S&T; with heterogeneous annotation corpora. We first automatically construct a loose and uncertain mapping between two representative heterogeneous corpora, Penn Chinese Treebank (CTB) and PKU’s People’s Daily (PPD) . Then we regard the Chinese S&T; with heterogeneous corpora as two “related” tasks and train our model on two heterogeneous corpora simultaneously. Experiments show that our method can boost the performances of both of the heterogeneous corpora by using the shared information, and achieves significant im- provements over the state-of-the-art methods.
5 0.099042602 8 emnlp-2013-A Joint Learning Model of Word Segmentation, Lexical Acquisition, and Phonetic Variability
Author: Micha Elsner ; Sharon Goldwater ; Naomi Feldman ; Frank Wood
Abstract: We present a cognitive model of early lexical acquisition which jointly performs word segmentation and learns an explicit model of phonetic variation. We define the model as a Bayesian noisy channel; we sample segmentations and word forms simultaneously from the posterior, using beam sampling to control the size of the search space. Compared to a pipelined approach in which segmentation is performed first, our model is qualitatively more similar to human learners. On data with vari- able pronunciations, the pipelined approach learns to treat syllables or morphemes as words. In contrast, our joint model, like infant learners, tends to learn multiword collocations. We also conduct analyses of the phonetic variations that the model learns to accept and its patterns of word recognition errors, and relate these to developmental evidence.
6 0.097266294 72 emnlp-2013-Elephant: Sequence Labeling for Word and Sentence Segmentation
7 0.067202263 47 emnlp-2013-Collective Opinion Target Extraction in Chinese Microblogs
8 0.062446669 124 emnlp-2013-Leveraging Lexical Cohesion and Disruption for Topic Segmentation
9 0.061974239 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization
10 0.057618711 83 emnlp-2013-Exploring the Utility of Joint Morphological and Syntactic Learning from Child-directed Speech
11 0.05450435 58 emnlp-2013-Dependency Language Models for Sentence Completion
12 0.052823395 70 emnlp-2013-Efficient Higher-Order CRFs for Morphological Tagging
13 0.047145296 27 emnlp-2013-Authorship Attribution of Micro-Messages
14 0.046455145 190 emnlp-2013-Ubertagging: Joint Segmentation and Supertagging for English
15 0.043577563 204 emnlp-2013-Word Level Language Identification in Online Multilingual Communication
16 0.042763673 15 emnlp-2013-A Systematic Exploration of Diversity in Machine Translation
17 0.040117543 19 emnlp-2013-Adaptor Grammars for Learning Non-Concatenative Morphology
18 0.039367624 168 emnlp-2013-Semi-Supervised Feature Transformation for Dependency Parsing
19 0.037623826 128 emnlp-2013-Max-Violation Perceptron and Forced Decoding for Scalable MT Training
20 0.036544483 151 emnlp-2013-Paraphrasing 4 Microblog Normalization
topicId topicWeight
[(0, -0.155), (1, -0.026), (2, -0.022), (3, -0.091), (4, -0.159), (5, -0.035), (6, 0.103), (7, 0.254), (8, -0.204), (9, 0.199), (10, 0.023), (11, -0.03), (12, 0.121), (13, -0.054), (14, -0.037), (15, 0.067), (16, -0.063), (17, -0.03), (18, 0.039), (19, -0.044), (20, -0.102), (21, 0.003), (22, -0.041), (23, -0.047), (24, 0.034), (25, -0.078), (26, -0.078), (27, 0.04), (28, -0.057), (29, -0.011), (30, 0.014), (31, -0.068), (32, -0.068), (33, -0.056), (34, 0.028), (35, 0.016), (36, -0.029), (37, -0.063), (38, -0.025), (39, -0.002), (40, 0.023), (41, -0.06), (42, -0.084), (43, -0.021), (44, 0.057), (45, 0.094), (46, -0.048), (47, -0.024), (48, -0.062), (49, -0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.95024282 21 emnlp-2013-An Empirical Study Of Semi-Supervised Chinese Word Segmentation Using Co-Training
Author: Fan Yang ; Paul Vozila
Abstract: In this paper we report an empirical study on semi-supervised Chinese word segmentation using co-training. We utilize two segmenters: 1) a word-based segmenter leveraging a word-level language model, and 2) a character-based segmenter using characterlevel features within a CRF-based sequence labeler. These two segmenters are initially trained with a small amount of segmented data, and then iteratively improve each other using the large amount of unlabelled data. Our experimental results show that co-training captures 20% and 31% of the performance improvement achieved by supervised training with an order of magnitude more data for the SIGHAN Bakeoff 2005 PKU and CU corpora respectively.
2 0.9064244 82 emnlp-2013-Exploring Representations from Unlabeled Data with Co-training for Chinese Word Segmentation
Author: Longkai Zhang ; Houfeng Wang ; Xu Sun ; Mairgup Mansur
Abstract: Nowadays supervised sequence labeling models can reach competitive performance on the task of Chinese word segmentation. However, the ability of these models is restricted by the availability of annotated data and the design of features. We propose a scalable semi-supervised feature engineering approach. In contrast to previous works using pre-defined taskspecific features with fixed values, we dynamically extract representations of label distributions from both an in-domain corpus and an out-of-domain corpus. We update the representation values with a semi-supervised approach. Experiments on the benchmark datasets show that our approach achieve good results and reach an f-score of 0.961. The feature engineering approach proposed here is a general iterative semi-supervised method and not limited to the word segmentation task.
3 0.83539468 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging
Author: Xiaoqing Zheng ; Hanyang Chen ; Tianyu Xu
Abstract: This study explores the feasibility of performing Chinese word segmentation (CWS) and POS tagging by deep learning. We try to avoid task-specific feature engineering, and use deep layers of neural networks to discover relevant features to the tasks. We leverage large-scale unlabeled data to improve internal representation of Chinese characters, and use these improved representations to enhance supervised word segmentation and POS tagging models. Our networks achieved close to state-of-theart performance with minimal computational cost. We also describe a perceptron-style algorithm for training the neural networks, as an alternative to maximum-likelihood method, to speed up the training process and make the learning algorithm easier to be implemented.
Author: Xipeng Qiu ; Jiayi Zhao ; Xuanjing Huang
Abstract: Chinese word segmentation and part-ofspeech tagging (S&T;) are fundamental steps for more advanced Chinese language processing tasks. Recently, it has attracted more and more research interests to exploit heterogeneous annotation corpora for Chinese S&T.; In this paper, we propose a unified model for Chinese S&T; with heterogeneous annotation corpora. We first automatically construct a loose and uncertain mapping between two representative heterogeneous corpora, Penn Chinese Treebank (CTB) and PKU’s People’s Daily (PPD) . Then we regard the Chinese S&T; with heterogeneous corpora as two “related” tasks and train our model on two heterogeneous corpora simultaneously. Experiments show that our method can boost the performances of both of the heterogeneous corpora by using the shared information, and achieves significant im- provements over the state-of-the-art methods.
5 0.68435001 72 emnlp-2013-Elephant: Sequence Labeling for Word and Sentence Segmentation
Author: Kilian Evang ; Valerio Basile ; Grzegorz Chrupala ; Johan Bos
Abstract: Tokenization is widely regarded as a solved problem due to the high accuracy that rulebased tokenizers achieve. But rule-based tokenizers are hard to maintain and their rules language specific. We show that highaccuracy word and sentence segmentation can be achieved by using supervised sequence labeling on the character level combined with unsupervised feature learning. We evaluated our method on three languages and obtained error rates of 0.27 ‰ (English), 0.35 ‰ (Dutch) and 0.76 ‰ (Italian) for our best models.
6 0.51973146 8 emnlp-2013-A Joint Learning Model of Word Segmentation, Lexical Acquisition, and Phonetic Variability
7 0.40303263 47 emnlp-2013-Collective Opinion Target Extraction in Chinese Microblogs
8 0.36729038 190 emnlp-2013-Ubertagging: Joint Segmentation and Supertagging for English
9 0.33545101 79 emnlp-2013-Exploiting Multiple Sources for Open-Domain Hypernym Discovery
10 0.33379549 168 emnlp-2013-Semi-Supervised Feature Transformation for Dependency Parsing
11 0.33138639 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization
12 0.31006727 124 emnlp-2013-Leveraging Lexical Cohesion and Disruption for Topic Segmentation
13 0.28872359 70 emnlp-2013-Efficient Higher-Order CRFs for Morphological Tagging
14 0.28041089 61 emnlp-2013-Detecting Promotional Content in Wikipedia
15 0.27887955 83 emnlp-2013-Exploring the Utility of Joint Morphological and Syntactic Learning from Child-directed Speech
16 0.27028632 27 emnlp-2013-Authorship Attribution of Micro-Messages
17 0.25514054 38 emnlp-2013-Bilingual Word Embeddings for Phrase-Based Machine Translation
18 0.25345153 35 emnlp-2013-Automatically Detecting and Attributing Indirect Quotations
19 0.24686639 204 emnlp-2013-Word Level Language Identification in Online Multilingual Communication
topicId topicWeight
[(3, 0.055), (18, 0.041), (22, 0.061), (30, 0.079), (39, 0.275), (45, 0.024), (50, 0.024), (51, 0.182), (66, 0.031), (71, 0.033), (75, 0.024), (77, 0.015), (85, 0.012), (95, 0.012), (96, 0.019), (97, 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.75993961 21 emnlp-2013-An Empirical Study Of Semi-Supervised Chinese Word Segmentation Using Co-Training
Author: Fan Yang ; Paul Vozila
Abstract: In this paper we report an empirical study on semi-supervised Chinese word segmentation using co-training. We utilize two segmenters: 1) a word-based segmenter leveraging a word-level language model, and 2) a character-based segmenter using characterlevel features within a CRF-based sequence labeler. These two segmenters are initially trained with a small amount of segmented data, and then iteratively improve each other using the large amount of unlabelled data. Our experimental results show that co-training captures 20% and 31% of the performance improvement achieved by supervised training with an order of magnitude more data for the SIGHAN Bakeoff 2005 PKU and CU corpora respectively.
2 0.72626495 80 emnlp-2013-Exploiting Zero Pronouns to Improve Chinese Coreference Resolution
Author: Fang Kong ; Hwee Tou Ng
Abstract: Coreference resolution plays a critical role in discourse analysis. This paper focuses on exploiting zero pronouns to improve Chinese coreference resolution. In particular, a simplified semantic role labeling framework is proposed to identify clauses and to detect zero pronouns effectively, and two effective methods (refining syntactic parser and refining learning example generation) are employed to exploit zero pronouns for Chinese coreference resolution. Evaluation on the CoNLL-2012 shared task data set shows that zero pronouns can significantly improve Chinese coreference resolution.
3 0.69345367 64 emnlp-2013-Discriminative Improvements to Distributional Sentence Similarity
Author: Yangfeng Ji ; Jacob Eisenstein
Abstract: Matrix and tensor factorization have been applied to a number of semantic relatedness tasks, including paraphrase identification. The key idea is that similarity in the latent space implies semantic relatedness. We describe three ways in which labeled data can improve the accuracy of these approaches on paraphrase classification. First, we design a new discriminative term-weighting metric called TF-KLD, which outperforms TF-IDF. Next, we show that using the latent representation from matrix factorization as features in a classification algorithm substantially improves accuracy. Finally, we combine latent features with fine-grained n-gram overlap features, yielding performance that is 3% more accurate than the prior state-of-the-art.
4 0.61828417 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging
Author: Xiaoqing Zheng ; Hanyang Chen ; Tianyu Xu
Abstract: This study explores the feasibility of performing Chinese word segmentation (CWS) and POS tagging by deep learning. We try to avoid task-specific feature engineering, and use deep layers of neural networks to discover relevant features to the tasks. We leverage large-scale unlabeled data to improve internal representation of Chinese characters, and use these improved representations to enhance supervised word segmentation and POS tagging models. Our networks achieved close to state-of-theart performance with minimal computational cost. We also describe a perceptron-style algorithm for training the neural networks, as an alternative to maximum-likelihood method, to speed up the training process and make the learning algorithm easier to be implemented.
5 0.615601 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction
Author: Alla Rozovskaya ; Dan Roth
Abstract: State-of-the-art systems for grammatical error correction are based on a collection of independently-trained models for specific errors. Such models ignore linguistic interactions at the sentence level and thus do poorly on mistakes that involve grammatical dependencies among several words. In this paper, we identify linguistic structures with interacting grammatical properties and propose to address such dependencies via joint inference and joint learning. We show that it is possible to identify interactions well enough to facilitate a joint approach and, consequently, that joint methods correct incoherent predictions that independentlytrained classifiers tend to produce. Furthermore, because the joint learning model considers interacting phenomena during training, it is able to identify mistakes that require mak- ing multiple changes simultaneously and that standard approaches miss. Overall, our model significantly outperforms the Illinois system that placed first in the CoNLL-2013 shared task on grammatical error correction.
7 0.61236328 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization
8 0.61209816 82 emnlp-2013-Exploring Representations from Unlabeled Data with Co-training for Chinese Word Segmentation
9 0.61136699 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks
10 0.61106658 132 emnlp-2013-Mining Scientific Terms and their Definitions: A Study of the ACL Anthology
11 0.60911328 152 emnlp-2013-Predicting the Presence of Discourse Connectives
12 0.60838705 110 emnlp-2013-Joint Bootstrapping of Corpus Annotations and Entity Types
13 0.60766977 79 emnlp-2013-Exploiting Multiple Sources for Open-Domain Hypernym Discovery
14 0.60713834 51 emnlp-2013-Connecting Language and Knowledge Bases with Embedding Models for Relation Extraction
15 0.60710889 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction
16 0.60646677 168 emnlp-2013-Semi-Supervised Feature Transformation for Dependency Parsing
17 0.60629588 77 emnlp-2013-Exploiting Domain Knowledge in Aspect Extraction
18 0.60610378 47 emnlp-2013-Collective Opinion Target Extraction in Chinese Microblogs
19 0.60608429 86 emnlp-2013-Feature Noising for Log-Linear Structured Prediction
20 0.60595477 69 emnlp-2013-Efficient Collective Entity Linking with Stacking