emnlp emnlp2011 emnlp2011-48 knowledge-graph by maker-knowledge-mining

48 emnlp-2011-Enhancing Chinese Word Segmentation Using Unlabeled Data


Source: pdf

Author: Weiwei Sun ; Jia Xu

Abstract: This paper investigates improving supervised word segmentation accuracy with unlabeled data. Both large-scale in-domain data and small-scale document text are considered. We present a unified solution to include features derived from unlabeled data in a discriminative learning model. For the large-scale data, we derive string statistics from Gigaword to assist a character-based segmenter. In addition, we introduce the idea of transductive, document-level segmentation, which is designed to improve the system recall for out-of-vocabulary (OOV) words which appear more than once inside a document. Novel features result in relative error reductions of 13.8% and 15.4% in terms of F-score and the recall of OOV words respectively.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract This paper investigates improving supervised word segmentation accuracy with unlabeled data. [sent-5, score-0.577]

2 We present a unified solution to include features derived from unlabeled data in a discriminative learning model. [sent-7, score-0.411]

3 For the large-scale data, we derive string statistics from Gigaword to assist a character-based segmenter. [sent-8, score-0.288]

4 words, segmentation is a necessary initial step for Chinese language processing. [sent-16, score-0.31]

5 Previous research shows that word segmentation models trained on labeled data are reasonably accurate. [sent-17, score-0.402]

6 In this paper, we investigate improving supervised word segmentation with unlabeled data. [sent-18, score-0.577]

7 We distinguish three types of unlabeled data, namely large-scale in-domain data, out-of-domain data and small-scale document text. [sent-19, score-0.239]

8 In this situation, supervised learning can provide competitive results, and it is difficult to improve them any further by using extra unlabeled data. [sent-29, score-0.268]

9 Chinese word segmentation is one such task, since several large-scale manually annotated corpora are publicly available. [sent-30, score-0.351]

10 In this work, we first exploit unlabeled in-domain data to improve strong supervised models. [sent-31, score-0.266]

11 We introduce the third type of unlabeled data with a transductive-learning, document-level view. [sent-33, score-0.363]

12 Many applications of word segmentation involve processing a whole document, such as information retrieval. [sent-34, score-0.351]

13 In this work, we are also concerned with enhancing word segmentation with the document information. [sent-45, score-0.454]

14 We present a unified “feature engineering” approach for learning segmentation models from both labeled and unlabeled data. [sent-46, score-0.576]

15 First, we use an unannotated corpus to extract string and document information, and then we use this information to construct new statistics-based and document-based feature mappings for a discriminative word segmenter. [sent-48, score-0.442]

16 We are relying on the ability of discriminative learning methods to identify and explore informative features, which play a central role in boosting the segmentation performance. [sent-49, score-0.337]

17 In their implementations, word clusters derived from unlabeled data are imported as features to discriminative learning approaches. [sent-53, score-0.42]

18 This annotation style allows us to evaluate our transductive segmentation method. [sent-56, score-0.49]

19 Our experiments show that both statistics-based and document-based features are effective in the word segmentation application. [sent-57, score-0.422]

20 In general, the use of unlabeled data can be motivated by two concerns: First, given a fixed amount of labeled data, we might wish to leverage unlabeled data to improve the performance of a supervised model. [sent-58, score-0.485]

21 By conducting experiments on data sets of varying sizes, we demonstrate that for fixed levels of performance, the new features derived from unlabeled data can significantly reduce the need for labeled data. [sent-64, score-0.387]

22 1 Discriminative Character-based Word Segmentation The character-based approach is a dominant word segmentation solution for Chinese text processing. [sent-71, score-0.351]

23 This approach treats word segmentation as a sequence tagging problem, assigning labels to the characters to indicate whether a character is located at the beginning of, inside, or at the end of a word. [sent-72, score-0.762]

24 赵B紫I阳E 总B理E 的S 秘B密E 日B记E Key to our approach is to allow informative features derived from unlabeled data to assist the segmenter. [sent-81, score-0.339]
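
To make the B/I/E/S scheme concrete, here is a minimal sketch (not the authors' code; the function name is an assumption) that converts a segmented sentence into the per-character labels shown above.

```python
def words_to_bies(words):
    """Map a segmented sentence (a list of words) to (character, label) pairs:
    B = word-initial, I = word-internal, E = word-final, S = single-character word."""
    pairs = []
    for word in words:
        if len(word) == 1:
            pairs.append((word, "S"))
        else:
            pairs.append((word[0], "B"))
            pairs.extend((ch, "I") for ch in word[1:-1])
            pairs.append((word[-1], "E"))
    return pairs

# The example sentence from the text, segmented as 赵紫阳 / 总理 / 的 / 秘密 / 日记:
print(words_to_bies(["赵紫阳", "总理", "的", "秘密", "日记"]))
# [('赵', 'B'), ('紫', 'I'), ('阳', 'E'), ('总', 'B'), ('理', 'E'),
#  ('的', 'S'), ('秘', 'B'), ('密', 'E'), ('日', 'B'), ('记', 'E')]
```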

25 These features are divided into two types: character features and word type features. [sent-86, score-0.429]

26 Note that the word type features are indicator functions that fire when the local character sequence matches a word uni-gram or bi-gram. [sent-87, score-0.399]

27 To conveniently illustrate, we denote a candidate character token ci with a context. [sent-89, score-0.318]

28 We use c[s:e] to express a string that starts at the position s and ends at the position e. [sent-96, score-0.229]

29 For example, c[i:i+1] expresses the character bi-gram ci ci+1. [sent-97, score-0.246]

30 • The identity of the string c[s:i] (i − 6 < s < i), if it matches a word from the list of uni-gram words; • The identity of the string c[i:e] (i < e < i + 6), if it matches a word; multiple features could be generated. [sent-102, score-0.507]

31 If the string c[s:i] (s < i) matches an item from the idiom lexicon, the feature template receives a string value “E-IDIOM”. [sent-112, score-0.581]
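
The word-type indicator features described above can be sketched as follows; the feature-name strings, window size, and lexicon contents are illustrative assumptions rather than the paper's exact templates.

```python
def word_type_features(chars, i, unigram_words, idiom_lexicon, max_len=5):
    """Indicator features for character position i; c[s:i] and c[i:e] are
    closed intervals in the paper's notation."""
    feats = []
    # Strings c[s:i] ending at position i (lengths 2..max_len).
    for s in range(max(0, i - max_len + 1), i):
        left = "".join(chars[s:i + 1])
        if left in unigram_words:
            feats.append("END-UNIGRAM=" + left)     # hypothetical feature name
        if left in idiom_lexicon:
            feats.append("E-IDIOM")                 # template value from the text
    # Strings c[i:e] starting at position i (lengths 2..max_len).
    for e in range(i + 1, min(len(chars), i + max_len)):
        right = "".join(chars[i:e + 1])
        if right in unigram_words:
            feats.append("BEGIN-UNIGRAM=" + right)  # hypothetical feature name
    return feats
```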

32 3 Statistics-based Features In order to distill information from unlabeled data, we borrow ideas from some previous research on unsupervised word segmentation. [sent-116, score-0.224]

33 The statistical information acquired from a relatively large amount of unlabeled data is designed as features correlated with the position where a character is located in a word token. [sent-117, score-0.611]

34 We adopt this idea in our character-based segmentation model. [sent-125, score-0.31]

35 The empirical mutual information between two character bigrams is computed by counting how often they appear in the large-scale unlabeled corpus. [sent-126, score-0.506]

36 Given a Chinese character string c[i−2:i+1], the mutual information between the substrings c[i−2:i−1] and c[i:i+1] is computed as: MI(c[i−2:i−1], c[i:i+1]) = log( p(c[i−2:i+1]) / ( p(c[i−2:i−1]) · p(c[i:i+1]) ) ). For each character ci, we incorporate the MI of the character bi-grams into our model. [sent-132, score-0.992]
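
As a rough sketch of how this statistic could be estimated from raw corpus counts (maximum-likelihood estimates, no smoothing; not the authors' implementation, and the function names are ours):

```python
import math
from collections import Counter

def build_counts(corpus_text):
    """Count character bi-grams and 4-grams once over the unlabeled corpus."""
    n = len(corpus_text)
    bigrams = Counter(corpus_text[j:j + 2] for j in range(n - 1))
    fourgrams = Counter(corpus_text[j:j + 4] for j in range(n - 3))
    return bigrams, fourgrams, n

def mutual_information(bigrams, fourgrams, n, left, right):
    """MI(left, right) = log( p(left + right) / (p(left) * p(right)) ).
    Assumes all three counts are non-zero (no smoothing in this sketch)."""
    p_left = bigrams[left] / (n - 1)
    p_right = bigrams[right] / (n - 1)
    p_joint = fourgrams[left + right] / (n - 3)
    return math.log(p_joint / (p_left * p_right))
```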

37 This principle is introduced as the accessor variety criterion for identifying meaningful Chinese words in (Feng et al. [sent-137, score-0.66]

38 This criterion evaluates how independently a string is used, and thus how likely it is that the string can be a word. [sent-139, score-0.391]

39 Given a string s, which consists of l (l ≥ 2) characters, we define the left accessor variety of s, Lav(s), as the number of distinct characters that precede s in a corpus. [sent-140, score-0.906]

40 Similarly, the right accessor variety Rav(s) is defined as the number of distinct characters that succeed s. [sent-141, score-0.696]

41 We first extract all strings whose length is between 2 and 4 from the unlabeled data, and calculate their accessor variety values. [sent-142, score-0.954]
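
A minimal sketch of that extraction step, under the stated length restriction; all function and variable names are ours:

```python
from collections import defaultdict

def accessor_variety(corpus_text, min_len=2, max_len=4):
    """Return {string: Lav} and {string: Rav} for all substrings of length 2-4:
    Lav(s) = number of distinct characters preceding s in the corpus,
    Rav(s) = number of distinct characters following s."""
    left, right = defaultdict(set), defaultdict(set)
    for length in range(min_len, max_len + 1):
        for j in range(len(corpus_text) - length + 1):
            s = corpus_text[j:j + length]
            if j > 0:
                left[s].add(corpus_text[j - 1])
            if j + length < len(corpus_text):
                right[s].add(corpus_text[j + length])
    lav = {s: len(chars) for s, chars in left.items()}
    rav = {s: len(chars) for s, chars in right.items()}
    return lav, rav
```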

42 The preceding and succeeding strings of punctuation marks carry additional wordbreak information, since punctuation marks should be segmented as words. [sent-148, score-0.685]

43 When a string appears many times preceding or succeeding punctuation marks, there tend to be wordbreaks succeeding or preceding that string. [sent-152, score-0.454]

44 To utilize the wordbreak information provided by punctuation marks, we extract all strings with length l (2 ≤ l ≤ 4) which precede or succeed punctuation marks in the unlabeled data. [sent-153, score-0.588]

45 We define the left punctuation variety of s, Lpv(s), as the number of times a punctuation mark precedes s in a corpus. [sent-154, score-0.556]

46 Similarly, the right punctuation variety Rpv(s) is defined as the number of times a punctuation mark succeeds s. [sent-155, score-0.556]

47 We first gather all strings surrounding punctuation marks in the unlabeled data, and calculate their punctuation variety values. [sent-157, score-0.839]
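
This computation can be sketched analogously to the accessor variety pass; the punctuation set below is an illustrative assumption, not the paper's exact list:

```python
from collections import Counter

PUNCTUATION = set("，。！？；：、（）“”")  # illustrative set, not the paper's exact list

def punctuation_variety(corpus_text, min_len=2, max_len=4):
    """Lpv(s): times a punctuation mark immediately precedes s;
    Rpv(s): times a punctuation mark immediately follows s."""
    lpv, rpv = Counter(), Counter()
    for length in range(min_len, max_len + 1):
        for j in range(len(corpus_text) - length + 1):
            s = corpus_text[j:j + length]
            if j > 0 and corpus_text[j - 1] in PUNCTUATION:
                lpv[s] += 1
            if j + length < len(corpus_text) and corpus_text[j + length] in PUNCTUATION:
                rpv[s] += 1
    return lpv, rpv
```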

48 The length of each string is also restricted to between 2 and 4. [sent-158, score-0.214]

49 Our motivation to use the punctuation information to assist a word segmenter is similar to (Spitkovsky et al. [sent-161, score-0.359]

50 The reason is that these statistics behave non-linearly in predicting character labels. [sent-169, score-0.273]

51 For each type of statistic, one weight alone cannot capture the relation between its value and the possibility that a string forms a word. [sent-170, score-0.232]

52 The integer part of each MI value is used as a string feature. [sent-173, score-0.232]

53 For the accessor variety and punctuation variety information, since their values are integers, we can directly use them as string features. [sent-174, score-1.172]

54 The accessor variety and punctuation variety values can be very large, so we set thresholds to cut off large values to deal with the data sparseness problem. [sent-175, score-0.995]

55 Specifically, if an accessor variety value is greater than 50, it is incorporated as a feature “> 50”; if the value is greater than 30 but not greater than 50, it is incorporated as a feature “30 − 50”; otherwise the value is individually incorporated as a string feature. [sent-176, score-1.239]

56 For example, if the left accessor variety of a character bi-gram c[i:i+1] is 29, the binary feature “Lav(c[i:i+1])=29” is set to 1, while other related binary features such as “Lav(c[i:i+1])=15” or “Lav(c[i:i+1])>50” will be set to 0. [sent-177, score-0.579]
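
The discretization described above can be sketched as follows (the thresholds 30 and 50 come from the text; the helper names are ours, and the integer part of an MI value would be bucketed the same way):

```python
def bucketize(value, mid=30, high=50):
    """Map an integer statistic to a discrete string feature value."""
    if value > high:
        return ">50"
    if value > mid:
        return "30-50"
    return str(value)  # small values are kept individually

def variety_feature(template, value):
    # e.g. variety_feature("Lav(c[i:i+1])", 29) -> "Lav(c[i:i+1])=29"
    return template + "=" + bucketize(value)
```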

57 Instead, we propose the following binary features, which are based on the string count in the given document, i.e., simply the number of times a given string appears in that document. [sent-184, score-0.454]

58 For each character ci, our document-based features include: • Whether the string count of c[s:i] is equal to that of c[s:i+1] (i − 3 ≤ s ≤ i). [sent-185, score-0.494]

59 Multiple features are generated for different string lengths. [sent-186, score-0.248]

60 • Whether the string count of c[i:e] is equal to that of c[i−1:e] (i ≤ e ≤ i + 3); multiple features are generated for different string lengths. [sent-188, score-0.248]

61 The string counts of c[s:i] and c[s:i+1] being equal means that when c[s:i] appears, it appears inside c[s:i+1] . [sent-190, score-0.254]

62 In this case, c[s:i] is not independently used in this document, and this feature suggests that the segmenter not assign an “S” or “E” label to the character ci. [sent-191, score-0.481]

63 Similarly, the string counts of c[i:e] and c[i−1:e] being equal means that c[i:e] is not independently used in this document, and this feature suggests that the segmenter not assign an “S” or “B” label to ci. [sent-192, score-0.412]
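
A hedged sketch of these document-based features follows; the feature names are our inventions, and the paper's closed intervals c[s:i] map to Python's half-open slices:

```python
from collections import Counter

def substring_counts(doc_text, max_len=5):
    """Count every substring of the current document up to max_len characters."""
    counts = Counter()
    for length in range(1, max_len + 1):
        for j in range(len(doc_text) - length + 1):
            counts[doc_text[j:j + length]] += 1
    return counts

def document_features(doc_text, i, counts):
    """Binary features for character position i based on string-count equality."""
    feats = []
    # count(c[s:i]) == count(c[s:i+1]) argues against an "S"/"E" label at i.
    for s in range(max(0, i - 3), i + 1):
        if i + 1 < len(doc_text):
            eq = counts[doc_text[s:i + 1]] == counts[doc_text[s:i + 2]]
            feats.append("NO-END[%d]=%s" % (i - s + 1, eq))
    # count(c[i:e]) == count(c[i-1:e]) argues against an "S"/"B" label at i.
    for e in range(i, min(len(doc_text) - 1, i + 3) + 1):
        if i > 0:
            eq = counts[doc_text[i:e + 1]] == counts[doc_text[i - 1:e + 1]]
            feats.append("NO-BEGIN[%d]=%s" % (e - i + 1, eq))
    return feats
```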

64 It is also a popular data set for evaluating word segmentation methods, such as (Jiang et al. [sent-198, score-0.351]

65 Note that all idioms in our extra idiom lexicon are added into the in-vocabulary word list. [sent-220, score-0.292]

66 2 Main Results Table 3 summarizes the segmentation results on the development data with different configurations, representing a few choices between baseline, statistics-based and document-based feature sets. [sent-227, score-0.435]

67 In this table, the symbol “+” means that the features of the current configuration contain both the baseline features and the new features for semi-supervised or transductive learning. [sent-228, score-0.393]

68 From this table, we can clearly see the impact of features derived from the large-scale unlabeled data and the current document. [sent-229, score-0.336]

69 Both good segmentation techniques and valuable labeled corpora have been developed, and pure supervised systems can provide strong performance. [sent-231, score-0.444]

70 There are significant increases when accessor variety features and punctuation variety features are added. [sent-233, score-1.137]

71 For example, PU(2,3) means punctuation variety features of character bi-grams and tri-grams are added. [sent-281, score-0.689]

72 Extending the length of the neighboring strings from 2 to 3 helps a little. [sent-283, score-0.214]

73 Table 4 shows the segmentation performance on the test data set. [sent-291, score-0.31]

74 Label E’’ Label I’’ Feature value Label B’’ Feature value Label S’’ recSo- 12- 31 05 05 0510 520 530erocS- 2 1- 1 50 5 0 510 520 530 ocreS - -12 12 5 0 5 0 5 1015202530erocS -1 -1 05 05 51015202530 Feature value Feature value Figure 2: Scatter plot of feature score against feature value. [sent-297, score-0.403]

75 For example, the performance of the model with extra features trained on 500k characters is slightly higher than the performance of the model with only baseline features trained on the whole labeled data. [sent-322, score-0.308]

76 In our experiment, when the accessor variety and punctuation variety information are integrated as numeric features, they do not contribute. [sent-327, score-1.046]

77 To show the non-linear way that these features contribute to the prediction problem, we present the scatter plots of the score of each feature (i. [sent-328, score-0.247]

78 Figure 2 shows the relation between the score and the value of the punctuation variety features. [sent-331, score-0.427]

79 These plots indicate that the punctuation variety features contribute to the final model in a very complicated way. [sent-338, score-0.443]

80 The accessor variety features affect the model in the same way, so we do not give a detailed discussion. [sent-340, score-0.694]

81 We only show the same scatter plot for the Lav(c[i:i+1]) feature template in Figure 3. [sent-341, score-0.211]

82 (2008) presented a Bayesian semi-supervised approach to derive task-oriented word segmentation for machine translation (MT). [sent-344, score-0.388]

83 This method learns new word types and word distributions on unlabeled data by considering segmentation as a hidden variable in MT. [sent-345, score-0.575]

84 The accessor variety criterion is proposed to extract word types, i. [sent-354, score-0.701]

85 Different from their work, our method resolves the segmentation problem for running text, in which this criterion is used to define features correlated with the character position labels. [sent-358, score-0.69]

86 Li and Sun (2009) observed that punctuation marks are perfect delimiters which provide useful information for segmentation. [sent-359, score-0.25]

87 Their method can be viewed as a self-training procedure, in which extra punctuation information is incorporated to filter out automatically predicted samples. [sent-360, score-0.268]

88 In our method, the counts of the strings preceding and succeeding punctuation marks are incorporated directly as features into a supervised model. [sent-362, score-0.564]

89 In machine learning, transductive learning is a learning framework that typically makes use of unlabeled data. [sent-363, score-0.363]

90 The goal of transductive learning is to only infer labels for the unlabeled data points in the test set rather than to learn a general classification function that can be applied to any future data sets. [sent-364, score-0.363]

91 Although the idea to explore the document-level information in our work is similar to transductive learning, we do not use state-of-the-art transductive learning algorithms, which involve learning when they meet the test data. [sent-366, score-0.36]

92 5 Conclusion and Future Work In this paper, we have presented a simple yet effective approach to explore unlabeled data for Chinese word segmentation. [sent-368, score-0.224]

93 In particular, the informative features derived from unlabeled data lead to a significant improvement in the recall of unknown words. [sent-371, score-0.405]

94 Our immediate concern for future work is to exploit the out-of-domain data to improve the robustness of current word segmentation systems. [sent-372, score-0.351]

95 The idea would be to extract domain information from unlabeled data and define it as features in our unified approach. [sent-373, score-0.286]

96 The word segmentation task is similar to constituency parsing, in the sense of finding boundaries of language units. [sent-377, score-0.387]

97 Automatic adaptation of annotation standards: Chinese word segmentation and POS tagging – a case study. [sent-389, score-0.351]

98 Word-based and character-based word segmentation models: Comparison and combination. [sent-428, score-0.351]

99 A stacked sub-word model for joint Chinese word segmentation and part-of-speech tagging. [sent-433, score-0.351]

100 Bayesian semi-supervised Chinese word segmentation for statistical machine translation. [sent-452, score-0.351]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('accessor', 0.435), ('segmentation', 0.31), ('character', 0.246), ('variety', 0.188), ('punctuation', 0.184), ('unlabeled', 0.183), ('transductive', 0.18), ('string', 0.177), ('punctuations', 0.173), ('av', 0.166), ('chinese', 0.155), ('pu', 0.149), ('oov', 0.115), ('ctb', 0.111), ('idioms', 0.111), ('strings', 0.111), ('sun', 0.109), ('scatter', 0.102), ('doc', 0.1), ('idiom', 0.098), ('mi', 0.098), ('segmenter', 0.087), ('derived', 0.082), ('succeeding', 0.08), ('delimiters', 0.077), ('mutual', 0.077), ('label', 0.074), ('feature', 0.074), ('characters', 0.073), ('ci', 0.072), ('features', 0.071), ('jia', 0.066), ('weiwei', 0.066), ('document', 0.056), ('crfs', 0.055), ('value', 0.055), ('segmented', 0.053), ('numeric', 0.051), ('balanced', 0.051), ('aarl', 0.051), ('crfsuite', 0.051), ('statisticsbased', 0.051), ('wordbreak', 0.051), ('wsun', 0.051), ('labeled', 0.051), ('inside', 0.048), ('enhancing', 0.047), ('assist', 0.047), ('preceding', 0.044), ('dfki', 0.044), ('locates', 0.044), ('tseng', 0.044), ('discriminative', 0.043), ('supervised', 0.043), ('recall', 0.042), ('extra', 0.042), ('recognize', 0.042), ('incorporated', 0.042), ('feng', 0.041), ('identity', 0.041), ('cs', 0.041), ('word', 0.041), ('reductions', 0.04), ('pure', 0.04), ('committee', 0.04), ('indomain', 0.04), ('miller', 0.037), ('criterion', 0.037), ('derive', 0.037), ('length', 0.037), ('xu', 0.036), ('constituency', 0.036), ('jiang', 0.036), ('fields', 0.036), ('plot', 0.035), ('turian', 0.035), ('nianwen', 0.035), ('koo', 0.034), ('precede', 0.033), ('anchor', 0.033), ('sighan', 0.033), ('spitkovsky', 0.033), ('unified', 0.032), ('resolve', 0.032), ('iff', 0.031), ('article', 0.031), ('gigaword', 0.031), ('xue', 0.03), ('appears', 0.029), ('organizing', 0.029), ('german', 0.028), ('agency', 0.028), ('xinhua', 0.028), ('informative', 0.027), ('statistics', 0.027), ('nonetheless', 0.027), ('position', 0.026), ('observing', 0.026), ('separately', 0.025), ('wish', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999946 48 emnlp-2011-Enhancing Chinese Word Segmentation Using Unlabeled Data

Author: Weiwei Sun ; Jia Xu

Abstract: This paper investigates improving supervised word segmentation accuracy with unlabeled data. Both large-scale in-domain data and small-scale document text are considered. We present a unified solution to include features derived from unlabeled data in a discriminative learning model. For the large-scale data, we derive string statistics from Gigaword to assist a character-based segmenter. In addition, we introduce the idea of transductive, document-level segmentation, which is designed to improve the system recall for out-of-vocabulary (OOV) words which appear more than once inside a document. Novel features result in relative error reductions of 13.8% and 15.4% in terms of F-score and the recall of OOV words respectively.

2 0.20101565 99 emnlp-2011-Non-parametric Bayesian Segmentation of Japanese Noun Phrases

Author: Yugo Murawaki ; Sadao Kurohashi

Abstract: A key factor of high quality word segmentation for Japanese is a high-coverage dictionary, but it is costly to manually build such a lexical resource. Although external lexical resources for human readers are potentially good knowledge sources, they have not been utilized due to differences in segmentation criteria. To supplement a morphological dictionary with these resources, we propose a new task of Japanese noun phrase segmentation. We apply non-parametric Bayesian language models to segment each noun phrase in these resources according to the statistical behavior of its supposed constituents in text. For inference, we propose a novel block sampling procedure named hybrid type-based sampling, which has the ability to directly escape a local optimum that is not too distant from the global optimum. Experiments show that the proposed method efficiently corrects the initial segmentation given by a morphological analyzer.

3 0.1007923 98 emnlp-2011-Named Entity Recognition in Tweets: An Experimental Study

Author: Alan Ritter ; Sam Clark ; Mausam ; Oren Etzioni

Abstract: People tweet more than 100 Million times daily, yielding a noisy, informal, but sometimes informative corpus of 140-character messages that mirrors the zeitgeist in an unprecedented manner. The performance of standard NLP tools is severely degraded on tweets. This paper addresses this issue by re-building the NLP pipeline beginning with part-of-speech tagging, through chunking, to named-entity recognition. Our novel T-NER system doubles F1 score compared with the Stanford NER system. T-NER leverages the redundancy inherent in tweets to achieve this performance, using LabeledLDA to exploit Freebase dictionaries as a source of distant supervision. LabeledLDA outperforms co-training, increasing F1 by 25% over ten common entity types. Our NLP tools are available at: http://github.com/aritter/twitter_nlp

4 0.088094071 75 emnlp-2011-Joint Models for Chinese POS Tagging and Dependency Parsing

Author: Zhenghua Li ; Min Zhang ; Wanxiang Che ; Ting Liu ; Wenliang Chen ; Haizhou Li

Abstract: Part-of-speech (POS) is an indispensable feature in dependency parsing. Current research usually models POS tagging and dependency parsing independently. This may suffer from error propagation problem. Our experiments show that parsing accuracy drops by about 6% when using automatic POS tags instead of gold ones. To solve this issue, this paper proposes a solution by jointly optimizing POS tagging and dependency parsing in a unique model. We design several joint models and their corresponding decoding algorithms to incorporate different feature sets. We further present an effective pruning strategy to reduce the search space of candidate POS tags, leading to significant improvement of parsing speed. Experimental results on Chinese Penn Treebank 5 show that our joint models significantly improve the state-of-the-art parsing accuracy by about 1.5%. Detailed analysis shows that the joint method is able to choose such POS tags that are more helpful and discriminative from parsing viewpoint. This is the fundamental reason of parsing accuracy improvement.

5 0.084008865 88 emnlp-2011-Linear Text Segmentation Using Affinity Propagation

Author: Anna Kazantseva ; Stan Szpakowicz

Abstract: This paper presents a new algorithm for linear text segmentation. It is an adaptation of Affinity Propagation, a state-of-the-art clustering algorithm in the framework of factor graphs. Affinity Propagation for Segmentation, or APS, receives a set of pairwise similarities between data points and produces segment boundaries and segment centres, data points which best describe all other data points within the segment. APS iteratively passes messages in a cyclic factor graph, until convergence. Each iteration works with information on all available similarities, resulting in high-quality results. APS scales linearly for realistic segmentation tasks. We derive the algorithm from the original Affinity Propagation formulation, and evaluate its performance on topical text segmentation in comparison with two state-of-the-art segmenters. The results suggest that APS performs on par with or outperforms these two very competitive baselines.

6 0.082923472 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation

7 0.079155169 72 emnlp-2011-Improved Transliteration Mining Using Graph Reinforcement

8 0.068157472 124 emnlp-2011-Splitting Noun Compounds via Monolingual and Bilingual Paraphrasing: A Study on Japanese Katakana Words

9 0.063295558 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances

10 0.061429296 140 emnlp-2011-Universal Morphological Analysis using Structured Nearest Neighbor Prediction

11 0.05927043 141 emnlp-2011-Unsupervised Dependency Parsing without Gold Part-of-Speech Tags

12 0.05894237 20 emnlp-2011-Augmenting String-to-Tree Translation Models with Fuzzy Use of Source-side Syntax

13 0.058756806 12 emnlp-2011-A Weakly-supervised Approach to Argumentative Zoning of Scientific Documents

14 0.057064526 1 emnlp-2011-A Bayesian Mixture Model for PoS Induction Using Multiple Features

15 0.056033835 145 emnlp-2011-Unsupervised Semantic Role Induction with Graph Partitioning

16 0.05580898 54 emnlp-2011-Exploiting Parse Structures for Native Language Identification

17 0.055377733 44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection

18 0.054738268 4 emnlp-2011-A Fast, Accurate, Non-Projective, Semantically-Enriched Parser

19 0.054250624 23 emnlp-2011-Bootstrapped Named Entity Recognition for Product Attribute Extraction

20 0.053874724 96 emnlp-2011-Multilayer Sequence Labeling


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.207), (1, -0.029), (2, -0.018), (3, 0.019), (4, -0.038), (5, 0.02), (6, -0.09), (7, -0.06), (8, -0.2), (9, 0.039), (10, 0.019), (11, 0.022), (12, -0.041), (13, 0.129), (14, -0.132), (15, -0.09), (16, -0.1), (17, -0.053), (18, 0.219), (19, 0.109), (20, -0.018), (21, -0.005), (22, -0.159), (23, -0.018), (24, -0.102), (25, -0.112), (26, -0.078), (27, 0.138), (28, -0.036), (29, 0.161), (30, 0.013), (31, 0.077), (32, 0.131), (33, 0.028), (34, -0.076), (35, 0.076), (36, 0.127), (37, -0.078), (38, 0.037), (39, -0.033), (40, -0.028), (41, -0.068), (42, 0.16), (43, -0.121), (44, -0.02), (45, -0.045), (46, -0.093), (47, 0.054), (48, -0.093), (49, -0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94681048 48 emnlp-2011-Enhancing Chinese Word Segmentation Using Unlabeled Data

Author: Weiwei Sun ; Jia Xu

Abstract: This paper investigates improving supervised word segmentation accuracy with unlabeled data. Both large-scale in-domain data and small-scale document text are considered. We present a unified solution to include features derived from unlabeled data in a discriminative learning model. For the large-scale data, we derive string statistics from Gigaword to assist a character-based segmenter. In addition, we introduce the idea of transductive, document-level segmentation, which is designed to improve the system recall for out-of-vocabulary (OOV) words which appear more than once inside a document. Novel features result in relative error reductions of 13.8% and 15.4% in terms of F-score and the recall of OOV words respectively.

2 0.76400077 99 emnlp-2011-Non-parametric Bayesian Segmentation of Japanese Noun Phrases

Author: Yugo Murawaki ; Sadao Kurohashi

Abstract: A key factor of high quality word segmentation for Japanese is a high-coverage dictionary, but it is costly to manually build such a lexical resource. Although external lexical resources for human readers are potentially good knowledge sources, they have not been utilized due to differences in segmentation criteria. To supplement a morphological dictionary with these resources, we propose a new task of Japanese noun phrase segmentation. We apply non-parametric Bayesian language models to segment each noun phrase in these resources according to the statistical behavior of its supposed constituents in text. For inference, we propose a novel block sampling procedure named hybrid type-based sampling, which has the ability to directly escape a local optimum that is not too distant from the global optimum. Experiments show that the proposed method efficiently corrects the initial segmentation given by a morphological analyzer.

3 0.51943988 88 emnlp-2011-Linear Text Segmentation Using Affinity Propagation

Author: Anna Kazantseva ; Stan Szpakowicz

Abstract: This paper presents a new algorithm for linear text segmentation. It is an adaptation of Affinity Propagation, a state-of-the-art clustering algorithm in the framework of factor graphs. Affinity Propagation for Segmentation, or APS, receives a set of pairwise similarities between data points and produces segment boundaries and segment centres, data points which best describe all other data points within the segment. APS iteratively passes messages in a cyclic factor graph, until convergence. Each iteration works with information on all available similarities, resulting in high-quality results. APS scales linearly for realistic segmentation tasks. We derive the algorithm from the original Affinity Propagation formulation, and evaluate its performance on topical text segmentation in comparison with two state-of-the-art segmenters. The results suggest that APS performs on par with or outperforms these two very competitive baselines.

4 0.50322253 124 emnlp-2011-Splitting Noun Compounds via Monolingual and Bilingual Paraphrasing: A Study on Japanese Katakana Words

Author: Nobuhiro Kaji ; Masaru Kitsuregawa

Abstract: Word boundaries within noun compounds are not marked by white spaces in a number of languages, unlike in English, and it is beneficial for various NLP applications to split such noun compounds. In the case of Japanese, noun compounds made up of katakana words (i.e., transliterated foreign words) are particularly difficult to split, because katakana words are highly productive and are often outof-vocabulary. To overcome this difficulty, we propose using monolingual and bilingual paraphrases of katakana noun compounds for identifying word boundaries. Experiments demonstrated that splitting accuracy is substantially improved by extracting such paraphrases from unlabeled textual data, the Web in our case, and then using that information for constructing splitting models.

5 0.41757983 12 emnlp-2011-A Weakly-supervised Approach to Argumentative Zoning of Scientific Documents

Author: Yufan Guo ; Anna Korhonen ; Thierry Poibeau

Abstract: Argumentative Zoning (AZ), the analysis of the argumentative structure of a scientific paper, has proved useful for a number of information access tasks. Current approaches to AZ rely on supervised machine learning (ML). Requiring large amounts of annotated data, these approaches are expensive to develop and port to different domains and tasks. A potential solution to this problem is to use weakly-supervised ML instead. We investigate the performance of four weakly-supervised classifiers on scientific abstract data annotated for multiple AZ classes. Our best classifier, based on the combination of active learning and self-training, outperforms our best supervised classifier, yielding a high accuracy of 81% when using just 10% of the labeled data. This result suggests that weakly-supervised learning could be employed to improve the practical applicability and portability of AZ across different information access tasks.

6 0.390865 23 emnlp-2011-Bootstrapped Named Entity Recognition for Product Attribute Extraction

7 0.37938637 98 emnlp-2011-Named Entity Recognition in Tweets: An Experimental Study

8 0.32342032 72 emnlp-2011-Improved Transliteration Mining Using Graph Reinforcement

9 0.305372 96 emnlp-2011-Multilayer Sequence Labeling

10 0.29740262 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation

11 0.29009968 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances

12 0.28350329 54 emnlp-2011-Exploiting Parse Structures for Native Language Identification

13 0.28170276 82 emnlp-2011-Learning Local Content Shift Detectors from Document-level Information

14 0.27970529 75 emnlp-2011-Joint Models for Chinese POS Tagging and Dependency Parsing

15 0.27941915 1 emnlp-2011-A Bayesian Mixture Model for PoS Induction Using Multiple Features

16 0.278292 93 emnlp-2011-Minimum Imputed-Risk: Unsupervised Discriminative Training for Machine Translation

17 0.27126494 129 emnlp-2011-Structured Sparsity in Structured Prediction

18 0.26152366 106 emnlp-2011-Predicting a Scientific Communitys Response to an Article

19 0.24902773 111 emnlp-2011-Reducing Grounded Learning Tasks To Grammatical Inference

20 0.24593329 9 emnlp-2011-A Non-negative Matrix Factorization Based Approach for Active Dual Supervision from Document and Word Labels


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(23, 0.644), (36, 0.016), (37, 0.016), (45, 0.062), (53, 0.014), (54, 0.013), (57, 0.015), (64, 0.014), (66, 0.023), (79, 0.034), (82, 0.015), (85, 0.011), (87, 0.011), (96, 0.021), (98, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99629903 42 emnlp-2011-Divide and Conquer: Crowdsourcing the Creation of Cross-Lingual Textual Entailment Corpora

Author: Matteo Negri ; Luisa Bentivogli ; Yashar Mehdad ; Danilo Giampiccolo ; Alessandro Marchetti

Abstract: We address the creation of cross-lingual textual entailment corpora by means of crowdsourcing. Our goal is to define a cheap and replicable data collection methodology that minimizes the manual work done by expert annotators, without resorting to preprocessing tools or already annotated monolingual datasets. In line with recent works emphasizing the need of large-scale annotation efforts for textual entailment, our work aims to: i) tackle the scarcity of data available to train and evaluate systems, and ii) promote the recourse to crowdsourcing as an effective way to reduce the costs of data collection without sacrificing quality. We show that a complex data creation task, for which even experts usually feature low agreement scores, can be effectively decomposed into simple subtasks assigned to non-expert annotators. The resulting dataset, obtained from a pipeline of different jobs routed to Amazon Mechanical Turk, contains more than 1,600 aligned pairs for each combination of texts-hypotheses in English, Italian and German.

2 0.99421406 93 emnlp-2011-Minimum Imputed-Risk: Unsupervised Discriminative Training for Machine Translation

Author: Zhifei Li ; Ziyuan Wang ; Jason Eisner ; Sanjeev Khudanpur ; Brian Roark

Abstract: Discriminative training for machine translation has been well studied in the recent past. A limitation of the work to date is that it relies on the availability of high-quality in-domain bilingual text for supervised training. We present an unsupervised discriminative training framework to incorporate the usually plentiful target-language monolingual data by using a rough “reverse” translation system. Intuitively, our method strives to ensure that probabilistic “round-trip” translation from a target- language sentence to the source-language and back will have low expected loss. Theoretically, this may be justified as (discriminatively) minimizing an imputed empirical risk. Empirically, we demonstrate that augmenting supervised training with unsupervised data improves translation performance over the supervised case for both IWSLT and NIST tasks.

same-paper 3 0.99334174 48 emnlp-2011-Enhancing Chinese Word Segmentation Using Unlabeled Data

Author: Weiwei Sun ; Jia Xu

Abstract: This paper investigates improving supervised word segmentation accuracy with unlabeled data. Both large-scale in-domain data and small-scale document text are considered. We present a unified solution to include features derived from unlabeled data in a discriminative learning model. For the large-scale data, we derive string statistics from Gigaword to assist a character-based segmenter. In addition, we introduce the idea of transductive, document-level segmentation, which is designed to improve the system recall for out-of-vocabulary (OOV) words which appear more than once inside a document. Novel features result in relative error reductions of 13.8% and 15.4% in terms of F-score and the recall of OOV words respectively.

4 0.99258006 7 emnlp-2011-A Joint Model for Extended Semantic Role Labeling

Author: Vivek Srikumar ; Dan Roth

Abstract: This paper presents a model that extends semantic role labeling. Existing approaches independently analyze relations expressed by verb predicates or those expressed as nominalizations. However, sentences express relations via other linguistic phenomena as well. Furthermore, these phenomena interact with each other, thus restricting the structures they articulate. In this paper, we use this intuition to define a joint inference model that captures the inter-dependencies between verb semantic role labeling and relations expressed using prepositions. The scarcity of jointly labeled data presents a crucial technical challenge for learning a joint model. The key strength of our model is that we use existing structure predictors as black boxes. By enforcing consistency constraints between their predictions, we show improvements in the performance of both tasks without retraining the individual models.

5 0.9894855 135 emnlp-2011-Timeline Generation through Evolutionary Trans-Temporal Summarization

Author: Rui Yan ; Liang Kong ; Congrui Huang ; Xiaojun Wan ; Xiaoming Li ; Yan Zhang

Abstract: We investigate an important and challenging problem in summary generation, i.e., Evolutionary Trans-Temporal Summarization (ETTS), which generates news timelines from massive data on the Internet. ETTS greatly facilitates fast news browsing and knowledge comprehension, and hence is a necessity. Given the collection of time-stamped web documents related to the evolving news, ETTS aims to return news evolution along the timeline, consisting of individual but correlated summaries on each date. Existing summarization algorithms fail to utilize trans-temporal characteristics among these component summaries. We propose to model trans-temporal correlations among component summaries for timelines, using inter-date and intra-date sentence dependencies, and present a novel combination. We develop experimental systems to compare 5 rival algorithms on 6 instinctively different datasets which amount to 10251 documents. Evaluation results in ROUGE metrics indicate the effectiveness of the proposed approach based on trans-temporal information.

6 0.90255094 6 emnlp-2011-A Generate and Rank Approach to Sentence Paraphrasing

7 0.90152699 58 emnlp-2011-Fast Generation of Translation Forest for Large-Scale SMT Discriminative Training

8 0.90010893 137 emnlp-2011-Training dependency parsers by jointly optimizing multiple objectives

9 0.87495226 79 emnlp-2011-Lateen EM: Unsupervised Training with Multiple Objectives, Applied to Dependency Grammar Induction

10 0.8678574 44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection

11 0.86222315 61 emnlp-2011-Generating Aspect-oriented Multi-Document Summarization with Event-aspect model

12 0.85985935 17 emnlp-2011-Active Learning with Amazon Mechanical Turk

13 0.85931003 124 emnlp-2011-Splitting Noun Compounds via Monolingual and Bilingual Paraphrasing: A Study on Japanese Katakana Words

14 0.85820335 136 emnlp-2011-Training a Parser for Machine Translation Reordering

15 0.85403275 23 emnlp-2011-Bootstrapped Named Entity Recognition for Product Attribute Extraction

16 0.85333318 25 emnlp-2011-Cache-based Document-level Statistical Machine Translation

17 0.85113889 89 emnlp-2011-Linguistic Redundancy in Twitter

18 0.85096145 126 emnlp-2011-Structural Opinion Mining for Graph-based Sentiment Representation

19 0.8481279 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances

20 0.84650666 82 emnlp-2011-Learning Local Content Shift Detectors from Document-level Information