acl acl2013 acl2013-164 knowledge-graph by maker-knowledge-mining

164 acl-2013-FudanNLP: A Toolkit for Chinese Natural Language Processing


Source: pdf

Author: Xipeng Qiu ; Qi Zhang ; Xuanjing Huang

Abstract: The need for Chinese natural language processing (NLP) is growing rapidly in a range of research and commercial applications. However, most current Chinese NLP tools and components still have a wide range of issues that need to be addressed and further developed. FudanNLP is an open source toolkit for Chinese natural language processing (NLP), which uses statistics-based and rule-based methods to deal with Chinese NLP tasks, such as word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, time phrase recognition, anaphora resolution and so on.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 However, most current Chinese NLP tools and components still have a wide range of issues that need to be addressed and further developed. [sent-8, score-0.116]

2 Similar to English, the main tasks in Chinese NLP include word segmentation (CWS) , part-of-speech (POS) tagging, named entity recognition (NER) , syntactic parsing, anaphora resolution (AR) , and so on. [sent-12, score-0.716]

3 There are also some toolkits for NLP, such as Stanford CoreNLP, Apache OpenNLP, Curator and NLTK. [sent-15, score-0.054]

4 But these toolkits are developed mainly for English and are not optimized for Chinese. [sent-16, score-0.054]

5 In order to build a system optimized for Chinese language processing, we implement an open source toolkit, FudanNLP, which is written in Java. [sent-17, score-0.19]

6 Since most state-of-the-art methods for NLP are based on statistical learning, the whole framework of our toolkit is built around statistics-based methods, supplemented by some rule-based methods. [sent-18, score-0.167]

7 However, we find that there are some drawbacks in the most commonly used corpora, such as CTB (Xia, 2000) and CoNLL (Hajič et al. [sent-20, score-0.122]

8 In the CoNLL corpus, the head words are often interrogative particles and punctuation marks, which is unidiomatic in Chinese. [sent-23, score-0.124]

9 These drawbacks bring more challenges to further analyses, such as information extraction and semantic understanding. [sent-24, score-0.063]

10 Currently, our toolkit has been used by many universities and companies for various applications, such as dialogue systems, social computing, recommendation systems and vertical search. [sent-41, score-0.167]

11 We first briefly describe our system and its main components in section 2. [sent-43, score-0.098]

12 2 System Overview The components of our system are organized into three layers: data preprocessing, machine learning and natural language processing, as shown in Figure 1. [sent-47, score-0.098]

13 We will introduce these components in detail in the following subsections. [sent-48, score-0.057]

14 So we first need to preprocess the input texts and transform them into the required format. [sent-52, score-0.111]

15 Because text data is usually discrete and sparse, a sparse vector structure is used extensively. [sent-53, score-0.079]

16 Similar to Mallet (McCallum, 2002) , we use the pipeline structure for a flexible transformation of various data. [sent-54, score-0.145]

17 The pipeline consists of several serial or parallel modules. [sent-55, score-0.129]

18 For example, when we transform a sentence into a “bag-of-words” vector, the transformation process would involve the following serial pipes: [sent-57, score-0.126]

19 1. String2Token Pipe: to transform a string into word tokens. [sent-58, score-0.057]

20 With the pipeline structure, the data preprocessing component has good flexibility, extensibility and reusability. [sent-63, score-0.104]
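
To make the pipeline idea concrete, here is a minimal Java sketch of such a serial pipe chain. The interface and names (Pipe, then, string2Token) are illustrative assumptions, not FudanNLP's actual API, and parallel modules are omitted:

```java
import java.util.Arrays;
import java.util.List;

// A minimal pipe abstraction: each pipe transforms its input, and pipes are
// chained serially; parallel composition is omitted in this sketch.
interface Pipe<I, O> {
    O process(I input);

    // Chain this pipe with the next one, forming a serial pipeline.
    default <R> Pipe<I, R> then(Pipe<O, R> next) {
        return input -> next.process(this.process(input));
    }
}

public class PipelineSketch {
    public static void main(String[] args) {
        // String2Token: split a raw string into word tokens.
        Pipe<String, List<String>> string2Token = s -> Arrays.asList(s.split("\\s+"));
        // A trivial stand-in for a Token2Vector pipe: count the tokens.
        Pipe<List<String>, Integer> token2Count = List::size;

        Pipe<String, Integer> pipeline = string2Token.then(token2Count);
        System.out.println(pipeline.process("natural language processing")); // 3
    }
}
```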

21 2 Machine Learning Component The outputs of NLP are often structured, so structured learning is our core module. [sent-65, score-0.075]

22 Structured learning is the task of assigning a structured label y to an input x. [sent-66, score-0.115]

23 The label y can be a discrete variable, a sequence, a tree or a more complex structure. [sent-67, score-0.12]

24 Thus, we can label x with a score function, ŷ = argmax_y F(w, Φ(x, y)), (1) where w is the parameter vector of the function F(·). [sent-69, score-0.04]

25 For example, in sequence labeling, both x = x1, …, xn and y = y1, …, yn are sequences. [sent-72, score-0.061]

26 For first-order Markov sequence labeling, the feature can be denoted as ϕk(yi−1, yi, x, i), where i is the position in the sequence. [sent-79, score-0.061]
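
As a worked illustration of Eq. (1) with first-order features, the score F(w, Φ(x, y)) decomposes into per-position feature firings ϕk(yi−1, yi, x, i) weighted by w. Below is a minimal Java sketch with binary indicator features; the feature templates and names are assumptions for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: F(w, Phi(x, y)) = sum over positions i of w_k * phi_k(y_{i-1}, y_i, x, i)
// for binary indicator features, here one emission and one transition template.
public class LinearScoreSketch {
    static Map<String, Double> w = new HashMap<>(); // learned feature weights

    static double score(String[] x, String[] y) {
        double f = 0.0;
        for (int i = 0; i < x.length; i++) {
            String prev = (i == 0) ? "<s>" : y[i - 1];
            f += w.getOrDefault("emit:" + y[i] + ":" + x[i], 0.0);  // tag-word feature
            f += w.getOrDefault("trans:" + prev + ":" + y[i], 0.0); // tag-bigram feature
        }
        return f;
    }

    public static void main(String[] args) {
        w.put("emit:B:上", 1.0);
        w.put("trans:B:E", 0.5);
        System.out.println(score(new String[]{"上", "海"}, new String[]{"B", "E"})); // 1.5
    }
}
```

Decoding then amounts to searching for the y that maximizes this score, which the Viterbi sketch further below does exactly for first-order features.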

27 For example, in conditional random fields (CRFs) (Lafferty et al. [sent-83, score-0.041]

28 , 2006) was proposed for normal multi-class classification and can be easily extended to structured learning (Crammer et al. [sent-92, score-0.039]

29 In structured learning, the number of possible solutions is very large, so dynamic programming or approximate approaches are often used for efficiency. [sent-99, score-0.075]

30 For NLP tasks, the most popular structure is the sequence. [sent-100, score-0.039]

31 To label the sequence, we use Viterbi dynamic programming to solve the inference problem in Eq. (1). [sent-101, score-0.04]

32 Our system can support any order of Viterbi decoding. [sent-103, score-0.041]

33 In addition, we also implement a constrained Viterbi algorithm to reduce the number of possible solutions by pre-defined rules. [sent-104, score-0.072]

34 It is very useful for CWS and POS tagging with sequence labeling. [sent-106, score-0.163]
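
A minimal first-order Viterbi sketch follows; the constrained variant simply masks out tags that pre-defined rules disallow at a position. The score-table layout and the boolean mask are illustrative assumptions, not FudanNLP's actual interfaces:

```java
public class ViterbiSketch {
    /**
     * First-order Viterbi decoding. score[i][p][t] is the weight of tag t at
     * position i given tag p at position i-1 (emission included); at i = 0 the
     * entry score[0][0][t] is read as the start score of tag t. allowed[i][t]
     * = false masks tag t out at position i (constrained Viterbi); an all-true
     * mask gives plain Viterbi.
     */
    static int[] viterbi(double[][][] score, boolean[][] allowed) {
        int n = score.length, T = score[0][0].length;
        double[][] delta = new double[n][T];
        int[][] back = new int[n][T];
        for (int t = 0; t < T; t++)
            delta[0][t] = allowed[0][t] ? score[0][0][t] : Double.NEGATIVE_INFINITY;
        for (int i = 1; i < n; i++)
            for (int t = 0; t < T; t++) {
                delta[i][t] = Double.NEGATIVE_INFINITY;
                if (!allowed[i][t]) continue;
                for (int p = 0; p < T; p++) {
                    double s = delta[i - 1][p] + score[i][p][t];
                    if (s > delta[i][t]) { delta[i][t] = s; back[i][t] = p; }
                }
            }
        int[] best = new int[n];                      // backtrace the best path
        for (int t = 1; t < T; t++)
            if (delta[n - 1][t] > delta[n - 1][best[n - 1]]) best[n - 1] = t;
        for (int i = n - 1; i > 0; i--) best[i - 1] = back[i][best[i]];
        return best;
    }

    public static void main(String[] args) {
        double[][][] s = new double[2][2][2];
        s[0][0][0] = 1; s[1][0][1] = 2; s[1][1][0] = 5;
        boolean[][] all = {{true, true}, {true, true}};
        System.out.println(java.util.Arrays.toString(viterbi(s, all))); // [1, 0]
    }
}
```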

35 3 Other Algorithms Apart from the core modules of structured learning, our system also includes several traditional machine learning algorithms, such as Perceptron, AdaBoost, kNN, k-means, and so on. [sent-110, score-0.164]

36 3 Natural Language Processing Components Our toolkit provides the basic NLP functions, such as word segmentation, part-of-speech tagging, named entity recognition, syntactic parsing, temporal phrase recognition, anaphora resolution, and so on. [sent-112, score-0.748]

37 1 Chinese Word Segmentation Different from English, Chinese sentences are written as a continuous sequence of characters without explicit delimiters such as blank spaces. [sent-118, score-0.177]

38 Since the meanings of most Chinese characters are not complete, words are the basic syntactic and semantic units. [sent-119, score-0.061]

39 We implement a constrained Viterbi algorithm to allow users to add their own word dictionary. [sent-125, score-0.072]
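
Here is a sketch of how such a user dictionary could be turned into Viterbi constraints under the common B/M/E/S character-tagging scheme: wherever a dictionary word matches the input, the characters it covers are restricted to the corresponding tags. The scheme and names are illustrative assumptions; overlapping matches would need conflict resolution, which is omitted:

```java
import java.util.Arrays;
import java.util.Set;

public class DictConstraintSketch {
    // Tags: 0 = B (begin), 1 = M (middle), 2 = E (end), 3 = S (single).
    static boolean[][] constraints(String sent, Set<String> dict) {
        int n = sent.length();
        boolean[][] allowed = new boolean[n][4];
        for (boolean[] row : allowed) Arrays.fill(row, true);
        for (int i = 0; i < n; i++)
            for (int j = i + 2; j <= n; j++)          // dictionary words of length >= 2
                if (dict.contains(sent.substring(i, j))) {
                    restrict(allowed[i], 0);          // first character must be B
                    restrict(allowed[j - 1], 2);      // last character must be E
                    for (int k = i + 1; k < j - 1; k++)
                        restrict(allowed[k], 1);      // inner characters must be M
                }
        return allowed;                               // feed this mask to Viterbi
    }

    static void restrict(boolean[] row, int tag) {
        for (int t = 0; t < row.length; t++) row[t] = (t == tag);
    }

    public static void main(String[] args) {
        boolean[][] a = constraints("人民大会堂", Set.of("人民大会堂"));
        System.out.println(a[0][0] + " " + a[2][1] + " " + a[4][2]); // true true true
    }
}
```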

40 2 POS tagging Chinese POS tagging is very different from that in English. [sent-128, score-0.204]

41 For example, there are different morphologies in English for the word “毁灭” (destroy), such as “destroyed”, “destroying” and “destruction”. [sent-131, score-0.041]

42 There are two popular guidelines to tag the word’s POS: CTB (Xia, 2000) and PKU (Yu et al. [sent-133, score-0.058]

43 We take into account both the weaknesses and the strengths of these two guidelines, and propose our own guideline for better subsequent analyses, such as parsing and named entity recognition. [sent-135, score-0.341]

44 For example, a proper name is labeled as “NR” in CTB, while we label it with one of four categories: person, location, organization and other proper name. [sent-136, score-0.04]

45 Table 1 shows an example of the output representation of our toolkit (input: “John is from Washington, and he was born in 1980.”). [sent-138, score-0.126]

46 Conversely, we merge “VC” and “VE” into “VV” since there is no linking verb in Chinese. [sent-139, score-0.039]

47 Since a POS tag is assigned to each word, not to each character, Chinese POS tagging can be done in two ways: a pipeline method or a joint method. [sent-141, score-0.162]

48 Currently, the joint method is more popular and effective because it uses more flexible features and can reduce the error propagation (Ng and Low, 2004) . [sent-142, score-0.046]

49 In our system, we implement both methods for POS tagging. [sent-143, score-0.072]

50 Besides, we also use some prior knowledge to improve performance, such as Chinese surnames and the common suffixes of location and organization names. [sent-144, score-0.084]

51 3 Named Entity Recognition In Chinese named entity recognition (NER) , there are usually three kinds of named entities (NEs) to be dealt with: names of persons (PER) , locations (LOC) and organizations (ORG) . [sent-147, score-0.452]

52 The internal structures are also different for different kinds of NEs, so it is difficult to build a unified model for named entity recognition. [sent-149, score-0.231]

53 Our NER is based on the results of POS tagging and uses some customized features to detect NEs. [sent-150, score-0.179]

54 First, the number of NEs is very large and new NEs are endlessly emerging, so it is impossible to store them all in a dictionary. [sent-151, score-0.041]

55 Since the internal structures are relatively more important, we use language models to capture the internal structures. [sent-152, score-0.104]

56 Second, we merge the continuous NEs with some rulebased strategies. [sent-153, score-0.094]

57 For example, we combine the continuous words “人民/NN 大会堂/NN” into “人民大会堂/LOC”. [sent-154, score-0.055]
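
A sketch of that kind of rule-based merge: runs of adjacent words flagged as NE parts are concatenated into one entity. The flagging heuristic and the fixed /LOC label are illustrative assumptions only:

```java
import java.util.ArrayList;
import java.util.List;

public class NeMergeSketch {
    /** Merge runs of adjacent candidate words into single entity strings. */
    static List<String> mergeAdjacent(String[] words, boolean[] isNePart) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (isNePart[i]) {
                cur.append(words[i]);                 // extend the current entity
            } else {
                if (cur.length() > 0) { out.add(cur + "/LOC"); cur.setLength(0); }
                out.add(words[i]);
            }
        }
        if (cur.length() > 0) out.add(cur + "/LOC");  // flush a trailing entity
        return out;
    }

    public static void main(String[] args) {
        // "人民/NN 大会堂/NN" -> "人民大会堂/LOC"
        System.out.println(mergeAdjacent(
                new String[]{"人民", "大会堂"}, new boolean[]{true, true}));
    }
}
```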

58 4 Dependency parsing Our syntactic parser is currently a dependency parser, implemented with the deterministic shift-reduce algorithm based on the work of Yamada and Matsumoto (2003). [sent-157, score-0.223]

59 The syntactic structure of Chinese is more complex than that of English, and semantic meaning is more dominant than syntax in Chinese sentences. [sent-158, score-0.039]

60 So we choose a dependency parser to avoid the minutiae of syntactic constituents and to pay more attention to subsequent semantic analysis. [sent-159, score-0.121]

61 Since the structure of the Chinese language is quite different from that of English, we use more effective features according to the characteristics of Chinese sentences. [sent-160, score-0.039]

62 The commonly used corpus for Chinese dependency parsing is the CoNLL corpus (Hajič et al. [sent-161, score-0.117]

63 For example, the head words are often interrogative particles and punctuation marks. [sent-164, score-0.124]

64 Our guideline is based on common understanding for Chinese grammar. [sent-165, score-0.115]

65 The Chinese syntactic components usually include subject, predicate, object, attribute, adverbial modifier and complement. [sent-166, score-0.057]

66 Table 2 shows some primary dependency relations in our guideline. [sent-168, score-0.074]

67 5 Temporal Phrase Recognition and Normalization Chinese temporal phrases are more flexible than English ones. [sent-218, score-0.194]

68 Firstly, there are two calendars in use: the Gregorian calendar and the lunar calendar. [sent-219, score-0.041]

69 Secondly, the forms of the same temporal phrase vary and often mix Chinese characters, Arabic numerals and English letters, such as “早上 10 点” and “10:00 PM”. [sent-221, score-0.212]

70 Different from the general approach based on machine learning, we implement the time phrase recognizer with a rule-based method. [sent-222, score-0.136]

71 After recognizing the temporal phrases, we normalize them into a standard time format. [sent-224, score-0.148]

72 For a phrase indicating a relative time, such as “一年后” (one year later) and “一小时后” (one hour later), we first find the base time in the context. [sent-225, score-0.118]

73 If no base time is found, or there is no temporal phrase to indicate the base time (such as “明天”, tomorrow), we set the base time to the current system time. [sent-226, score-0.415]
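
Here is a sketch of that base-time logic: resolve a relative phrase against a base time found in context, falling back to the current system time. The phrase patterns are hard-coded illustrations of what a fuller rule set would cover:

```java
import java.time.LocalDateTime;
import java.util.Optional;

public class BaseTimeSketch {
    /** Resolve a relative temporal phrase against a contextual base time,
     *  falling back to "now" when the context supplies none. */
    static LocalDateTime resolve(String phrase, Optional<LocalDateTime> contextBase) {
        LocalDateTime base = contextBase.orElse(LocalDateTime.now());
        if (phrase.equals("一年后")) return base.plusYears(1);   // "one year later"
        if (phrase.equals("一小时后")) return base.plusHours(1);  // "one hour later"
        if (phrase.equals("明天")) return base.plusDays(1);      // "tomorrow"
        return base; // unknown phrase: no shift in this sketch
    }

    public static void main(String[] args) {
        LocalDateTime base = LocalDateTime.of(2012, 2, 22, 10, 0);
        System.out.println(resolve("一小时后", Optional.of(base))); // 2012-02-22T11:00
    }
}
```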

74 Table 3 gives examples for our temporal phrase recognition module. [sent-227, score-0.314]

75 今天我很忙, 晚上 9 点才能下班。周日也要加班。 (I’m busy today, and have to come off duty after 9:00 PM. I also have to work overtime on Sunday.) [sent-230, score-0.041]

76 Output: the normalized forms of the recognized phrases, given the base time 2012-02-22 10:00 AM (Table 3 output; the column itself is garbled in the extraction). [sent-232, score-0.054]

77 6 Anaphora Resolution Anaphora resolution detects pronouns and finds what they refer to. [sent-235, score-0.165]

78 We first find all pronouns and entity names, then use a classifier to predict whether there is a relation between each pair of pronoun and entity name. [sent-236, score-0.225]
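
A sketch of that pairwise step: enumerate (pronoun, preceding entity) pairs and ask a classifier whether they corefer. The classifier here is a distance stub standing in for a learned model over pair features (agreement, distance, entity type); all names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

public class AnaphoraPairSketch {
    record Mention(String text, int position, boolean isPronoun) {}

    /** Pair each pronoun with every preceding entity mention and classify. */
    static List<String> resolve(List<Mention> mentions) {
        List<String> links = new ArrayList<>();
        for (Mention p : mentions) {
            if (!p.isPronoun()) continue;
            for (Mention e : mentions) {
                if (e.isPronoun() || e.position() >= p.position()) continue;
                if (corefers(p, e))                   // classifier decides the link
                    links.add(p.text() + " -> " + e.text());
            }
        }
        return links;
    }

    // Stub for a learned pairwise classifier: here, just a distance cutoff.
    static boolean corefers(Mention pronoun, Mention entity) {
        return pronoun.position() - entity.position() <= 5;
    }

    public static void main(String[] args) {
        System.out.println(resolve(List.of(
                new Mention("John", 0, false), new Mention("he", 3, true)))); // [he -> John]
    }
}
```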

79 Table 4 gives examples for our anaphora resolution module. [sent-237, score-0.355]

80 The university has nurtured a lot of good students. [sent-240, score-0.041]

81 Secondly, users can also invoke the main NLP modules to process the inputs (strings or files) from the command line directly. [sent-253, score-0.048]

82 Thirdly, the web services are provided for platform-independent and language- independent use. [sent-254, score-0.044]

83 We use a REST (Representational State Transfer) architecture, in which the web services are viewed as resources and can be identified by their URLs. [sent-255, score-0.044]
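
The paper does not publish the service URLs, so the endpoint below is purely hypothetical; the sketch only shows the shape of a platform-independent call against such a RESTful resource:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestClientSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical resource URL; the actual FudanNLP endpoint is not given in the paper.
        URI uri = URI.create("http://example.org/fudannlp/seg?text="
                + URLEncoder.encode("今天天气不错", "UTF-8"));
        HttpResponse<String> resp = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(uri).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.body());
    }
}
```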

84 5 Conclusions In this demonstration, we have described FudanNLP, a Java-based open source toolkit for Chinese natural language processing. [sent-256, score-0.126]

85 Besides, we will also optimize the algorithms and code to improve system performance. [sent-258, score-0.041]

86 Conditional random fields: Probabilistic models for segmenting and labeling sequence data. [sent-307, score-0.103]

87 Chinese segmentation and new word detection using conditional random fields. [sent-330, score-0.08]

88 The part-of-speech tagging guidelines for the Penn Chinese Treebank (3.0). [sent-334, score-0.605]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('chinese', 0.445), ('fudannlp', 0.249), ('anaphora', 0.231), ('pipe', 0.166), ('nes', 0.164), ('temporal', 0.148), ('pos', 0.134), ('toolkit', 0.126), ('resolution', 0.124), ('guideline', 0.115), ('ctb', 0.11), ('tagging', 0.102), ('recognition', 0.102), ('performances', 0.101), ('cws', 0.099), ('conll', 0.097), ('argmyaxf', 0.094), ('crammer', 0.093), ('entity', 0.092), ('xia', 0.088), ('named', 0.087), ('viterbi', 0.086), ('ner', 0.082), ('segmentation', 0.08), ('haji', 0.077), ('hownet', 0.077), ('customize', 0.077), ('structured', 0.075), ('dependency', 0.074), ('nlp', 0.072), ('implement', 0.072), ('dong', 0.069), ('serial', 0.069), ('olympic', 0.069), ('oxford', 0.067), ('og', 0.066), ('phrase', 0.064), ('drawbacks', 0.063), ('particles', 0.063), ('sequence', 0.061), ('interrogative', 0.061), ('nv', 0.061), ('characters', 0.061), ('pipeline', 0.06), ('currently', 0.059), ('guidelines', 0.058), ('components', 0.057), ('transform', 0.057), ('continuous', 0.055), ('games', 0.055), ('apache', 0.055), ('base', 0.054), ('firstly', 0.054), ('toolkits', 0.054), ('internal', 0.052), ('mallet', 0.051), ('peng', 0.05), ('wan', 0.049), ('modules', 0.048), ('parser', 0.047), ('flexible', 0.046), ('yi', 0.045), ('services', 0.044), ('preprocessing', 0.044), ('functions', 0.043), ('parsing', 0.043), ('besides', 0.043), ('yamada', 0.043), ('locations', 0.043), ('labeling', 0.042), ('nurtured', 0.041), ('busy', 0.041), ('tpr', 0.041), ('uo', 0.041), ('morphologies', 0.041), ('argmyax', 0.041), ('calendars', 0.041), ('aosr', 0.041), ('depar', 0.041), ('destroying', 0.041), ('endlessly', 0.041), ('fudan', 0.041), ('lunar', 0.041), ('patternbased', 0.041), ('rquez', 0.041), ('supplemented', 0.041), ('xjhuang', 0.041), ('xpqiu', 0.041), ('zhangheng', 0.041), ('zw', 0.041), ('names', 0.041), ('system', 0.041), ('pronouns', 0.041), ('secondly', 0.04), ('discrete', 0.04), ('label', 0.04), ('tree', 0.04), ('perceptron', 0.039), ('merge', 0.039), ('structure', 0.039)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999958 164 acl-2013-FudanNLP: A Toolkit for Chinese Natural Language Processing

Author: Xipeng Qiu ; Qi Zhang ; Xuanjing Huang

Abstract: The need for Chinese natural language processing (NLP) is growing rapidly in a range of research and commercial applications. However, most current Chinese NLP tools and components still have a wide range of issues that need to be addressed and further developed. FudanNLP is an open source toolkit for Chinese natural language processing (NLP), which uses statistics-based and rule-based methods to deal with Chinese NLP tasks, such as word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, time phrase recognition, anaphora resolution and so on.

2 0.33524767 80 acl-2013-Chinese Parsing Exploiting Characters

Author: Meishan Zhang ; Yue Zhang ; Wanxiang Che ; Ting Liu

Abstract: Characters play an important role in the Chinese language, yet computational processing of Chinese has been dominated by word-based approaches, with leaves in syntax trees being words. We investigate Chinese parsing from the character-level, extending the notion of phrase-structure trees by annotating internal structures of words. We demonstrate the importance of character-level information to Chinese processing by building a joint segmentation, part-of-speech (POS) tagging and phrase-structure parsing system that integrates character-structure features. Our joint system significantly outperforms a state-of-the-art word-based baseline on the standard CTB5 test, and gives the best published results for Chinese parsing.

3 0.28227365 193 acl-2013-Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations

Author: Longkai Zhang ; Li Li ; Zhengyan He ; Houfeng Wang ; Ni Sun

Abstract: Micro-blog is a new kind of medium which is short and informal. While no segmented corpus of micro-blogs is available to train Chinese word segmentation model, existing Chinese word segmentation tools cannot perform equally well as in ordinary news texts. In this paper we present an effective yet simple approach to Chinese word segmentation of micro-blog. In our approach, we incorporate punctuation information of unlabeled micro-blog data by introducing characters behind or ahead of punctuations, for they indicate the beginning or end of words. Meanwhile a self-training framework to incorporate confident instances is also used, which prove to be helpful. Experiments on micro-blog data show that our approach improves performance, especially in OOV-recall.

4 0.25476289 44 acl-2013-An Empirical Examination of Challenges in Chinese Parsing

Author: Jonathan K. Kummerfeld ; Daniel Tse ; James R. Curran ; Dan Klein

Abstract: Aspects of Chinese syntax result in a distinctive mix of parsing challenges. However, the contribution of individual sources of error to overall difficulty is not well understood. We conduct a comprehensive automatic analysis of error types made by Chinese parsers, covering a broad range of error types for large sets of sentences, enabling the first empirical ranking of Chinese error types by their performance impact. We also investigate which error types are resolved by using gold part-of-speech tags, showing that improving Chinese tagging only addresses certain error types, leaving substantial outstanding challenges.

5 0.24808027 123 acl-2013-Discriminative Learning with Natural Annotations: Word Segmentation as a Case Study

Author: Wenbin Jiang ; Meng Sun ; Yajuan Lu ; Yating Yang ; Qun Liu

Abstract: Structural information in web text provides natural annotations for NLP problems such as word segmentation and parsing. In this paper we propose a discriminative learning algorithm to take advantage of the linguistic knowledge in large amounts of natural annotations on the Internet. It utilizes the Internet as an external corpus with massive (although slight and sparse) natural annotations, and enables a classifier to evolve on the large-scaled and real-time updated web text. With Chinese word segmentation as a case study, experiments show that the segmenter enhanced with the Chinese wikipedia achieves sig- nificant improvement on a series of testing sets from different domains, even with a single classifier and local features.

6 0.23428226 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing

7 0.20981784 82 acl-2013-Co-regularizing character-based and word-based models for semi-supervised Chinese word segmentation

8 0.18979248 173 acl-2013-Graph-based Semi-Supervised Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging

9 0.18517165 204 acl-2013-Iterative Transformation of Annotation Guidelines for Constituency Parsing

10 0.16761331 243 acl-2013-Mining Informal Language from Chinese Microtext: Joint Word Recognition and Segmentation

11 0.15490738 155 acl-2013-Fast and Accurate Shift-Reduce Constituent Parsing

12 0.15418148 97 acl-2013-Cross-lingual Projections between Languages from Different Families

13 0.14634195 71 acl-2013-Bootstrapping Entity Translation on Weakly Comparable Corpora

14 0.13004139 212 acl-2013-Language-Independent Discriminative Parsing of Temporal Expressions

15 0.12610285 132 acl-2013-Easy-First POS Tagging and Dependency Parsing with Beam Search

16 0.12228815 210 acl-2013-Joint Word Alignment and Bilingual Named Entity Recognition Using Dual Decomposition

17 0.12017834 56 acl-2013-Argument Inference from Relevant Event Mentions in Chinese Argument Extraction

18 0.11932104 177 acl-2013-GuiTAR-based Pronominal Anaphora Resolution in Bengali

19 0.11816476 339 acl-2013-Temporal Signals Help Label Temporal Relations

20 0.11637081 208 acl-2013-Joint Inference for Heterogeneous Dependency Parsing


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.286), (1, -0.131), (2, -0.288), (3, 0.043), (4, 0.264), (5, 0.025), (6, -0.117), (7, 0.052), (8, 0.051), (9, 0.093), (10, -0.054), (11, -0.018), (12, 0.021), (13, -0.008), (14, 0.017), (15, 0.004), (16, 0.002), (17, 0.006), (18, -0.024), (19, 0.029), (20, -0.027), (21, -0.048), (22, -0.024), (23, 0.023), (24, -0.037), (25, -0.006), (26, 0.028), (27, -0.031), (28, -0.003), (29, -0.05), (30, 0.053), (31, -0.041), (32, 0.115), (33, -0.112), (34, 0.046), (35, 0.041), (36, -0.034), (37, -0.051), (38, 0.099), (39, -0.032), (40, -0.118), (41, -0.021), (42, -0.027), (43, -0.061), (44, -0.047), (45, 0.022), (46, -0.03), (47, -0.038), (48, 0.061), (49, 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97207004 164 acl-2013-FudanNLP: A Toolkit for Chinese Natural Language Processing

Author: Xipeng Qiu ; Qi Zhang ; Xuanjing Huang

Abstract: The need for Chinese natural language processing (NLP) is growing rapidly in a range of research and commercial applications. However, most current Chinese NLP tools and components still have a wide range of issues that need to be addressed and further developed. FudanNLP is an open source toolkit for Chinese natural language processing (NLP), which uses statistics-based and rule-based methods to deal with Chinese NLP tasks, such as word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, time phrase recognition, anaphora resolution and so on.

2 0.85523319 243 acl-2013-Mining Informal Language from Chinese Microtext: Joint Word Recognition and Segmentation

Author: Aobo Wang ; Min-Yen Kan

Abstract: We address the problem of informal word recognition in Chinese microblogs. A key problem is the lack of word delimiters in Chinese. We exploit this reliance as an opportunity: recognizing the relation between informal word recognition and Chinese word segmentation, we propose to model the two tasks jointly. Our joint inference method significantly outperforms baseline systems that conduct the tasks individually or sequentially.

3 0.83844322 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing

Author: Zhiguo Wang ; Chengqing Zong ; Nianwen Xue

Abstract: For the cascaded task of Chinese word segmentation, POS tagging and parsing, the pipeline approach suffers from error propagation while the joint learning approach suffers from inefficient decoding due to the large combined search space. In this paper, we present a novel lattice-based framework in which a Chinese sentence is first segmented into a word lattice, and then a lattice-based POS tagger and a lattice-based parser are used to process the lattice from two different viewpoints: sequential POS tagging and hierarchical tree building. A strategy is designed to exploit the complementary strengths of the tagger and parser, and encourage them to predict agreed structures. Experimental results on Chinese Treebank show that our lattice-based framework significantly improves the accuracy of the three sub-tasks. 1

4 0.83398455 44 acl-2013-An Empirical Examination of Challenges in Chinese Parsing

Author: Jonathan K. Kummerfeld ; Daniel Tse ; James R. Curran ; Dan Klein

Abstract: Aspects of Chinese syntax result in a distinctive mix of parsing challenges. However, the contribution of individual sources of error to overall difficulty is not well understood. We conduct a comprehensive automatic analysis of error types made by Chinese parsers, covering a broad range of error types for large sets of sentences, enabling the first empirical ranking of Chinese error types by their performance impact. We also investigate which error types are resolved by using gold part-of-speech tags, showing that improving Chinese tagging only addresses certain error types, leaving substantial outstanding challenges.

5 0.8300243 80 acl-2013-Chinese Parsing Exploiting Characters

Author: Meishan Zhang ; Yue Zhang ; Wanxiang Che ; Ting Liu

Abstract: Characters play an important role in the Chinese language, yet computational processing of Chinese has been dominated by word-based approaches, with leaves in syntax trees being words. We investigate Chinese parsing from the character-level, extending the notion of phrase-structure trees by annotating internal structures of words. We demonstrate the importance of character-level information to Chinese processing by building a joint segmentation, part-of-speech (POS) tagging and phrase-structure parsing system that integrates character-structure features. Our joint system significantly outperforms a state-of-the-art word-based baseline on the standard CTB5 test, and gives the best published results for Chinese parsing.

6 0.79231256 193 acl-2013-Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations

7 0.77535683 123 acl-2013-Discriminative Learning with Natural Annotations: Word Segmentation as a Case Study

8 0.69136161 82 acl-2013-Co-regularizing character-based and word-based models for semi-supervised Chinese word segmentation

9 0.63658857 34 acl-2013-Accurate Word Segmentation using Transliteration and Language Model Projection

10 0.63505417 204 acl-2013-Iterative Transformation of Annotation Guidelines for Constituency Parsing

11 0.61000866 173 acl-2013-Graph-based Semi-Supervised Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging

12 0.58200341 155 acl-2013-Fast and Accurate Shift-Reduce Constituent Parsing

13 0.55207604 137 acl-2013-Enlisting the Ghost: Modeling Empty Categories for Machine Translation

14 0.51878381 71 acl-2013-Bootstrapping Entity Translation on Weakly Comparable Corpora

15 0.51520884 138 acl-2013-Enriching Entity Translation Discovery using Selective Temporality

16 0.49387008 301 acl-2013-Resolving Entity Morphs in Censored Data

17 0.49293816 205 acl-2013-Joint Apposition Extraction with Syntactic and Semantic Constraints

18 0.48980033 288 acl-2013-Punctuation Prediction with Transition-based Parsing

19 0.48512357 163 acl-2013-From Natural Language Specifications to Program Input Parsers

20 0.48376757 208 acl-2013-Joint Inference for Heterogeneous Dependency Parsing


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.084), (6, 0.026), (11, 0.083), (24, 0.044), (26, 0.077), (34, 0.096), (35, 0.06), (42, 0.079), (48, 0.08), (64, 0.011), (70, 0.078), (71, 0.039), (77, 0.012), (88, 0.037), (90, 0.011), (95, 0.095)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93645906 164 acl-2013-FudanNLP: A Toolkit for Chinese Natural Language Processing

Author: Xipeng Qiu ; Qi Zhang ; Xuanjing Huang

Abstract: The need for Chinese natural language processing (NLP) is growing rapidly in a range of research and commercial applications. However, most current Chinese NLP tools and components still have a wide range of issues that need to be addressed and further developed. FudanNLP is an open source toolkit for Chinese natural language processing (NLP), which uses statistics-based and rule-based methods to deal with Chinese NLP tasks, such as word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, time phrase recognition, anaphora resolution and so on.

2 0.9131093 18 acl-2013-A Sentence Compression Based Framework to Query-Focused Multi-Document Summarization

Author: Lu Wang ; Hema Raghavan ; Vittorio Castelli ; Radu Florian ; Claire Cardie

Abstract: We consider the problem of using sentence compression techniques to facilitate query-focused multi-document summarization. We present a sentence-compression-based framework for the task, and design a series of learning-based compression models built on parse trees. An innovative beam search decoder is proposed to efficiently find highly probable compressions. Under this framework, we show how to integrate various indicative metrics such as linguistic motivation and query relevance into the compression process by deriving a novel formulation of a compression scoring function. Our best model achieves statistically significant improvement over the state-of-the-art systems on several metrics (e.g. 8.0% and 5.4% improvements in ROUGE-2 respectively) for the DUC 2006 and 2007 summarization tasks.

3 0.91067696 21 acl-2013-A Statistical NLG Framework for Aggregated Planning and Realization

Author: Ravi Kondadadi ; Blake Howald ; Frank Schilder

Abstract: We present a hybrid natural language generation (NLG) system that consolidates macro and micro planning and surface realization tasks into one statistical learning process. Our novel approach is based on deriving a template bank automatically from a corpus of texts from a target domain. First, we identify domain specific entity tags and Discourse Representation Structures on a per sentence basis. Each sentence is then organized into semantically similar groups (representing a domain specific concept) by k-means clustering. After this semi-automatic processing (human review of cluster assignments), a number of corpus-level statistics are compiled and used as features by a ranking SVM to develop model weights from a training corpus. At generation time, a set of input data, the collection of semantically organized templates, and the model weights are used to select optimal templates. Our system is evaluated with automatic, non-expert crowdsourced and expert evaluation metrics. We also introduce a novel automatic metric, syntactic variability, that represents linguistic variation as a measure of unique template sequences across a collection of automatically generated documents. The metrics for generated weather and biography texts fall within acceptable ranges. In sum, we argue that our statistical approach to NLG reduces the need for complicated knowledge-based architectures and readily adapts to different domains with reduced development time. (*Ravi Kondadadi is now affiliated with Nuance Communications, Inc.)

4 0.88333946 80 acl-2013-Chinese Parsing Exploiting Characters

Author: Meishan Zhang ; Yue Zhang ; Wanxiang Che ; Ting Liu

Abstract: Characters play an important role in the Chinese language, yet computational processing of Chinese has been dominated by word-based approaches, with leaves in syntax trees being words. We investigate Chinese parsing from the character-level, extending the notion of phrase-structure trees by annotating internal structures of words. We demonstrate the importance of character-level information to Chinese processing by building a joint segmentation, part-of-speech (POS) tagging and phrase-structure parsing system that integrates character-structure features. Our joint system significantly outperforms a state-of-the-art word-based baseline on the standard CTB5 test, and gives the best published results for Chinese parsing.

5 0.87963498 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing

Author: Zhiguo Wang ; Chengqing Zong ; Nianwen Xue

Abstract: For the cascaded task of Chinese word segmentation, POS tagging and parsing, the pipeline approach suffers from error propagation while the joint learning approach suffers from inefficient decoding due to the large combined search space. In this paper, we present a novel lattice-based framework in which a Chinese sentence is first segmented into a word lattice, and then a lattice-based POS tagger and a lattice-based parser are used to process the lattice from two different viewpoints: sequential POS tagging and hierarchical tree building. A strategy is designed to exploit the complementary strengths of the tagger and parser, and encourage them to predict agreed structures. Experimental results on Chinese Treebank show that our lattice-based framework significantly improves the accuracy of the three sub-tasks. 1

6 0.87487108 276 acl-2013-Part-of-Speech Induction in Dependency Trees for Statistical Machine Translation

7 0.87358564 70 acl-2013-Bilingually-Guided Monolingual Dependency Grammar Induction

8 0.87343132 132 acl-2013-Easy-First POS Tagging and Dependency Parsing with Beam Search

9 0.87155616 173 acl-2013-Graph-based Semi-Supervised Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging

10 0.86929893 343 acl-2013-The Effect of Higher-Order Dependency Features in Discriminative Phrase-Structure Parsing

11 0.86866081 123 acl-2013-Discriminative Learning with Natural Annotations: Word Segmentation as a Case Study

12 0.86746228 44 acl-2013-An Empirical Examination of Challenges in Chinese Parsing

13 0.8664009 82 acl-2013-Co-regularizing character-based and word-based models for semi-supervised Chinese word segmentation

14 0.86454588 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages

15 0.86413711 9 acl-2013-A Lightweight and High Performance Monolingual Word Aligner

16 0.86404884 155 acl-2013-Fast and Accurate Shift-Reduce Constituent Parsing

17 0.86397636 288 acl-2013-Punctuation Prediction with Transition-based Parsing

18 0.86356717 275 acl-2013-Parsing with Compositional Vector Grammars

19 0.86345172 134 acl-2013-Embedding Semantic Similarity in Tree Kernels for Domain Adaptation of Relation Extraction

20 0.86222368 264 acl-2013-Online Relative Margin Maximization for Statistical Machine Translation