acl acl2013 acl2013-288 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Dongdong Zhang ; Shuangzhi Wu ; Nan Yang ; Mu Li
Abstract: Punctuations are not available in automatic speech recognition outputs, which could create barriers to many subsequent text processing tasks. This paper proposes a novel method to predict punctuation symbols for the stream of words in transcribed speech texts. Our method jointly performs parsing and punctuation prediction by integrating a rich set of syntactic features when processing words from left to right. It can exploit a global view to capture long-range dependencies for punctuation prediction with linear complexity. The experimental results on the test data sets of IWSLT and TDT4 show that our method can achieve high-level performance in punctuation prediction over the stream of words in transcribed speech text. 1
Reference: text
sentIndex sentText sentNum sentScore
1 This paper proposes a novel method to predict punctuation symbols for the stream of words in transcribed speech texts. [sent-3, score-1.254]
2 Our method jointly performs parsing and punctuation prediction by integrating a rich set of syntactic features when processing words from left to right. [sent-4, score-1.056]
3 It can exploit a global view to capture long-range dependencies for punctuation prediction with linear complexity. [sent-5, score-0.836]
4 The experimental results on the test data sets of IWSLT and TDT4 show that our method can achieve high-level performance in punctuation prediction over the stream of words in transcribed speech text. [sent-6, score-1.332]
5 They neither perform a proper segmentation of the output into sentences, nor predict punctuation symbols. [sent-8, score-0.676]
6 The lack of punctuations and sentence boundaries in transcribed speech texts creates barriers to many subsequent processing tasks, such as summarization, information extraction, question answering and machine translation. [sent-9, score-0.704]
7 For example, in speech-to-speech translation, continuously transcribed speech texts need to be segmented before being fed into subsequent machine translation systems (Takezawa et al. [sent-11, score-0.428]
8 The punctuation prediction problem has attracted research interest in both the speech processing community and the natural language processing community. [sent-14, score-0.921]
9 Naturally, global contexts are required to model the punctuation prediction, especially for long-range dependencies. [sent-20, score-0.693]
10 There has been some work trying to incorporate syntactic features to broaden the view of hypotheses in the punctuation prediction models (Roark et al. [sent-22, score-0.814]
11 In their methods, the punctuation prediction is treated as a separate post-processing step of parsing, which may suffer from the problem of error propagation. [sent-25, score-0.814]
12 In addition, these approaches cannot incrementally process inputs and are not efficient for very long inputs, especially long transcribed speech texts from presentations, where the number of streaming words can reach hundreds or thousands. [sent-26, score-0.482]
13 In this paper, we propose jointly performing punctuation prediction and transition-based dependency parsing over transcribed speech text. [sent-27, score-1.449]
14 When the transition-based parsing consumes the stream of words left to right with the shift-reduce decoding algorithm, punctuation symbols are predicted for each word based on the contexts of the parsing tree. [sent-28, score-1.404]
15 Two models are proposed to make punctuation prediction interact with the transition actions in parsing. [sent-29, score-1.087]
16 One is to conduct transition actions of parsing followed by punctuation predictions in a cascaded way. [sent-30, score-1.146]
17 The other is to associate the conventional transition actions of parsing with punctuation predictions, so that predicted punctuations are directly inferred from the transition actions. Figure 1 input: “anyway you may find your favorite if you go through the menu so could you tell me your choice” (a). [sent-31, score-1.472]
18 The transcribed speech text without punctuations. [figure residue: the dependency arcs of Figure 1(b) (advmod, nsubj, aux, dobj, advcl, mark, prep, det, iobj, pos) and per-word punctuation tags rendered over the example sentence; not recoverable as running text] [sent-32, score-0.558]
19 (b) Transition-based parsing trees and predicted punctuations over the transcribed text: “anyway, you may find your favorite if you go through the menu.” [sent-35, score-0.764]
20 (c) Two segmentations are formed when inserting the predicted punctuation symbols into the transcribed text (Figure 1). [sent-38, score-1.111]
21 In addition, the computation of the models uses a rich set of syntactic features, which improves complicated punctuation predictions from a global view, especially for long-range dependencies. [sent-42, score-0.726]
22 Figure 1 shows an example of how parsing helps punctuation prediction over the transcribed speech text. [sent-43, score-1.365]
23 Eventually, two segmentations are formed according to the punctuation prediction results, shown in Figure 1(c). [sent-47, score-0.862]
24 The training data used for our models is adapted from Treebank data by excluding all punctuations but keeping the punctuation contexts, so that it simulates transcribed speech texts, for which annotated data is unavailable. [sent-48, score-1.231]
25 In decoding, beam search is used to get optimal punctuation prediction results. [sent-49, score-0.881]
26 We explain our approach to predicting punctuations for transcribed speech texts in Section 4. [sent-54, score-0.592]
27 2 Related Work Sentence boundary detection and punctuation prediction have been extensively studied in the speech processing field and have attracted research interest in the natural language processing field as well. [sent-57, score-0.951]
28 ) to predict punctuation symbols during speech recognition, where Huang and Zweig (2002) use a maximum entropy model, Christensen et al. [sent-63, score-0.865]
29 (2006) integrate segmentation features into the log-linear model in the statistical machine translation (SMT) framework to improve the translation performance when translating transcribed speech texts. [sent-68, score-0.478]
30 The above work only exploits local features, so it is limited in capturing long-range dependencies for punctuation prediction. [sent-71, score-0.643]
31 It is natural to incorporate global knowledge, such as syntactic information, to improve punctuation prediction performance. [sent-72, score-0.836]
32 The punctuation prediction in these works is performed as a post-processing step of parsing, where a parse tree needs to be built in advance. [sent-77, score-0.863]
33 As their parsing over the stream of words in transcribed speech text is exponentially complex, their approaches are only feasible for short input processing. [sent-78, score-0.708]
34 Unlike these works, we incorporate punctuation prediction into the parsing, which processes the input left to right without length limitations. [sent-79, score-1.026]
35 Starting with the work from (Zhang and Nivre, 2011), in this paper we extend transition-based dependency parsing from the sentence-level to the stream of words and integrate the parsing with punctuation prediction. [sent-83, score-1.219]
36 3 Transition-based dependency parsing In a typical transition-based dependency parsing process, the shift-reduce decoding algorithm is applied and a queue and a stack are maintained (Zhang and Nivre, 2011). [sent-91, score-0.672]
37 The queue stores the stream of transcribed speech words, the front of which is indexed as the current word. [sent-92, score-0.569]
38 When words in the queue are consumed from left to right, a set of transition actions is applied to build a parse tree. [sent-94, score-0.396]
39 There are four kinds of transition actions conducted in the parsing process (Zhang and Nivre, 2011), as described in Table 1. [sent-95, score-0.465]
40 The choice of each transition action during the parsing is scored by a linear model that can be trained over a rich set of non-local features extracted from the contexts of the stack, the queue and the set of dependency labels. [sent-96, score-0.587]
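To make the shift-reduce mechanics concrete, here is a minimal sketch of the four arc-eager transitions operating on a queue and a stack. This is an illustration only, not the authors' implementation: actions are supplied by the caller rather than chosen by the scored linear model described above.

```python
# Minimal sketch of the four arc-eager transitions (after Zhang and Nivre, 2011).
# Illustrative only: the real parser scores each action with a trained
# linear model over stack/queue features.

class ArcEagerState:
    def __init__(self, words):
        self.queue = list(words)  # stream of transcribed words; front = current word
        self.stack = []           # partially processed words
        self.arcs = []            # (head, dependent) pairs built so far

    def shift(self):              # push the current word onto the stack
        self.stack.append(self.queue.pop(0))

    def left_arc(self):           # stack top becomes a dependent of the current word
        self.arcs.append((self.queue[0], self.stack.pop()))

    def right_arc(self):          # current word becomes a dependent of the stack top
        self.arcs.append((self.stack[-1], self.queue[0]))
        self.stack.append(self.queue.pop(0))

    def reduce(self):             # pop a word whose head has already been found
        self.stack.pop()

state = ArcEagerState(["you", "may", "find"])
for a in ["shift", "shift", "left_arc", "left_arc", "shift"]:
    getattr(state, a)()
print(state.arcs)  # [('find', 'may'), ('find', 'you')]
```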
41 4 Our Method 4.1 Model In the task of punctuation prediction, we are given a stream of words from an automatic transcription of speech text, denoted by $w_1^n = w_1 w_2 \dots w_n$. [sent-102, score-0.864]
42 We are asked to output a sequence of punctuation symbols $p_1^n = p_1 p_2 \dots p_n$, one per word. [sent-109, score-0.796]
43 We model the search for the best sequence of predicted punctuation symbols $\hat{p}_1^n$ with Model (1). [sent-127, score-0.869]
44 The parsing tree $T$ is used to guide the punctuation prediction in Model (2), where parsing trees are constructed over the transcribed text while containing no punctuations. [sent-135, score-1.3]
45 Rather than enumerate all possible parsing trees, we jointly optimize the punctuation prediction model and the transition-based parsing model in a product form (Model (3)). [sent-149, score-1.265]
46 Model (4) decomposes the parse-tree score over transition actions; it is noted that a partial parsing tree uniquely corresponds to a sequence of transition actions, and vice versa. [sent-188, score-0.464]
47 Thus, we synchronize the punctuation prediction with the application of Shift and RightArc during the parsing, which is explained by Model (5). [sent-214, score-0.814]
48 The same procedure is then applied for both parsing and punctuation prediction over the rest of the words in the stream. [sent-245, score-1.006]
49 Model (6) is the product of a transition (parsing) sub-model and a punctuation sub-model; with different computations of Model (6), we induce two joint models for punctuation prediction: the cascaded punctuation prediction model and the unified punctuation prediction model. [sent-298, score-2.393]
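The bodies of Models (1)–(6) were lost in extraction (only stray parentheses and equation numbers survive above). The following is a hedged LaTeX reconstruction from the surrounding prose; the notation ($w_1^n$ for the word stream, $p_1^n$ for punctuation symbols, $T$ for the parse tree, $a_i$ for transition actions, $T_i$ for the partial tree, $\mathrm{ctx}_i$ for the decoding context) is assumed, not copied from the paper.

```latex
% Hedged reconstruction of the model chain; notation assumed from context.
\begin{align}
\hat{p}_1^n &= \arg\max_{p_1^n} P(p_1^n \mid w_1^n) \tag{1}\\
(\hat{T}, \hat{p}_1^n) &= \arg\max_{T,\,p_1^n} P(T \mid w_1^n) \times P(p_1^n \mid T, w_1^n) \tag{3}\\
P(T \mid w_1^n) &= \prod_i P(a_i \mid a_1^{i-1}, w_1^n) \tag{4}\\
P(T, p_1^n \mid w_1^n) &\approx \prod_i P(a_i \mid \mathrm{ctx}_i) \times P(p_i \mid T_i, \mathrm{ctx}_i) \tag{6}
\end{align}
```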
50 4.2 Cascaded punctuation prediction model (CPP) In Model (6), the computation of the two sub-models is independent. [sent-300, score-0.872]
51 The first sub-model is computed based on the context of words and partial trees without any punctuation knowledge, while the computation of the second sub-model is conditional on the context from the partially built parsing tree $T_i$. [sent-301, score-0.953]
52 As the words in the stream are consumed, each computation of transition actions is followed by a computation of punctuation prediction. [sent-306, score-1.106]
53 Thus, the two sub-models are computed in a cascaded way, until the optimal parsing tree and optimal punctuation symbols are generated. [sent-307, score-1.118]
54 We call this model the cascaded punctuation prediction model (CPP). [sent-308, score-0.915]
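A sketch of the CPP decoding loop described above, with greedy selection standing in for the beam search of width 5 used in the paper; `legal_actions`, `score_action`, and `score_punct` are assumed stand-ins for the transition system and the trained linear models, and `ArcEagerState` is the class from the earlier sketch.

```python
# Sketch of cascaded decoding (CPP): each transition step that consumes a
# word is followed by a punctuation-prediction step conditioned on the
# partial tree. Greedy search replaces the paper's beam search for brevity.

def cpp_decode(words, legal_actions, score_action, score_punct):
    state = ArcEagerState(words)              # from the earlier sketch
    punct = []                                # one symbol per consumed word
    while state.queue or state.stack:
        actions = legal_actions(state)
        if not actions:
            break
        best = max(actions, key=lambda a: score_action(state, a))
        getattr(state, best)()
        if best in ("shift", "right_arc"):    # a word was just consumed
            punct.append(max(["N", "M", "F"],
                             key=lambda s: score_punct(state, s)))
    return state.arcs, punct
```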
55 4.3 Unified punctuation prediction model (UPP) In Model (6), if the punctuation symbols can be deterministically inferred from the partial tree, the punctuation sub-model can be folded into the parsing itself. [sent-310, score-1.621]
56 , 2011; Bohnet and Nivre, 2012), we propose attaching the punctuation prediction onto the parsing tree by embedding the punctuation symbols into the transition actions. [sent-325, score-1.055]
57 Thus, we extend the conventional transition actions illustrated in Table 1 to a new set of transition actions for the parsing, denoted as follows. [sent-330, score-1.219]
58 {Shift(s), RightArc(s), LeftArc, Reduce}, where Q is the set of punctuation symbols to be predicted, s [sent-362, score-0.758]
59 is a punctuation symbol belonging to Q, Shift(s) is an action that attaches s to the current word on the basis of the original Shift action in parsing, and RightArc(s) likewise attaches s to the current word on the basis of the original RightArc action. [sent-363, score-0.786]
60 (7) Here, the computation of the parsing tree and the punctuation prediction is unified into one model, where the sequence of transition action outputs uniquely determines the punctuations attached to the words. [sent-393, score-1.633]
61 We refer to it as the unified punctuation prediction model (UPP). [sent-394, score-0.86]
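As a sketch, the extended action set can be enumerated directly; the inventory Q = {N, M, F} follows the normalization described later in this section, and the set-builder form mirrors the one reconstructed above.

```python
# Sketch: the UPP action set pairs Shift/RightArc with each punctuation
# symbol in Q (null N, middle-paused M, full-stop F), keeping LeftArc
# and Reduce unchanged.
Q = ["N", "M", "F"]
UPP_ACTIONS = ["left_arc", "reduce"] \
            + [("shift", s) for s in Q] \
            + [("right_arc", s) for s in Q]
print(UPP_ACTIONS)
```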
62 (a) Parsing tree and attached punctuation symbols; (b) Shift(,), Shift(N), Shift(N), LeftArc, LeftArc, LeftArc, Shift(N), RightArc(?). [sent-397, score-0.853]
63 An example of punctuation prediction using the UPP model, where N is a null type punctuation symbol denoting no need to attach any punctuation to the word. [sent-400, score-2.054]
64 Given an input “so could you tell me”, the optimal sequence of transition actions in Figure 2(b) is calculated based on the UPP model to produce the parsing tree in Figure 2(a). [sent-402, score-0.676]
65 According to the sequence of actions, we can determine the sequence of predicted punctuation symbols, “,NNN?”. [sent-403, score-0.887]
66 The final segmentation with the predicted punctuation insertion could be “so, could you tell me?”. [sent-405, score-0.806]
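Replaying the Figure 2 action sequence makes the UPP bookkeeping explicit. A minimal sketch, not the authors' decoder: literal punctuation symbols are used instead of the normalized M/F types, and dependency-arc bookkeeping is omitted.

```python
# Sketch: recover the punctuated segmentation from a UPP action sequence.
# Shift(s) and RightArc(s) both consume the current word and attach s to it.

def upp_replay(words, actions):
    queue, stack, attached = list(words), [], []
    for act in actions:
        name, sym = act if isinstance(act, tuple) else (act, None)
        if name in ("shift", "right_arc"):    # consume current word, attach sym
            attached.append((queue.pop(0), sym))
            stack.append(attached[-1][0])
        else:                                 # left_arc or reduce: pop the stack
            stack.pop()
    return " ".join(w + (s if s != "N" else "") for w, s in attached)

actions = [("shift", ","), ("shift", "N"), ("shift", "N"),
           "left_arc", "left_arc", "left_arc",
           ("shift", "N"), ("right_arc", "?")]
print(upp_replay("so could you tell me".split(), actions))
# -> so, could you tell me?
```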
67 The training data for both the CPP and UPP models need to contain parsing trees and punctuation information. [sent-441, score-0.854]
68 To do this, we remove all types of syntactic information related to punctuation symbols from the raw Treebank data, but record what punctuation symbols are attached to the words. [sent-443, score-1.562]
69 We normalize various punctuation symbols into two types: Middle-paused punctuation (M) and Full-stop punctuation (F). [sent-444, score-1.998]
70 Together with the null type (N), there are three kinds of punctuation symbols attached to the words. [sent-445, score-0.804]
71 In the experiments, we did not further distinguish among full-stop punctuation types because the question mark and the exclamation mark have very low frequency in Treebank data. [sent-447, score-0.62]
72 But our CPP and UPP models are both independent of the number of punctuation types to be predicted. [sent-448, score-0.62]
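A sketch of this adaptation step follows, under assumed symbol inventories (the digest does not list which marks map to M and F):

```python
# Sketch: turn a punctuated Treebank token sequence into (words, labels)
# training pairs, normalizing punctuation to M (middle-paused), F (full
# stop) or N (none). The MIDDLE/FULL_STOP inventories are assumptions.

MIDDLE = {",", ";", ":"}
FULL_STOP = {".", "?", "!"}

def adapt(tokens):
    words, labels = [], []
    for tok in tokens:
        if tok in MIDDLE or tok in FULL_STOP:
            if labels:                       # attach to the preceding word
                labels[-1] = "M" if tok in MIDDLE else "F"
        else:
            words.append(tok.lower())        # lowercased to mimic ASR output
            labels.append("N")
    return words, labels

print(adapt(["So", ",", "could", "you", "tell", "me", "?"]))
# -> (['so', 'could', 'you', 'tell', 'me'], ['M', 'N', 'N', 'N', 'F'])
```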
73 In decoding, beam search is performed to get the optimal sequence of transition actions in CPP and UPP, and the optimal punctuation symbols in CPP. [sent-451, score-1.165]
74 To ensure that each segment decided by a full-stop punctuation corresponds to a single parsing tree, two constraints are applied in decoding to prune deficient search paths. [sent-452, score-0.889]
75 (1) Proceeding-constraint: If the partial parsing result is not a single tree, the full-stop punctuation prediction in CPP cannot be performed. [sent-453, score-1.035]
76 (2) Succeeding-constraint: If the full-stop punctuation is predicted in CPP, or Shift(F) and RightArc(F) are performed in UPP, the following transition actions must be a sequence of Reduce actions until the stack becomes empty. [sent-455, score-1.181]
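A sketch of how the two constraints can be enforced as a path filter in the beam; the single-tree test and the full-stop test are assumed simplifications of the actual conditions, and `state` follows the `ArcEagerState` sketch above.

```python
# Sketch: pruning deficient beam paths with the two constraints above.

def predicts_full_stop(action):
    # UPP-style action tuples such as ("shift", "F") / ("right_arc", "F");
    # in CPP the F decision is a separate step (not modeled here).
    return isinstance(action, tuple) and action[1] == "F"

def violates(state, action, pending_full_stop):
    # (1) Proceeding-constraint: F only when the partial result is a
    #     single tree (approximated here as a one-element stack).
    if predicts_full_stop(action) and len(state.stack) != 1:
        return True
    # (2) Succeeding-constraint: after F, only Reduce until the stack
    #     is empty, so each segment maps to exactly one parse tree.
    if pending_full_stop and state.stack and action != "reduce":
        return True
    return False
```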
77 To simulate the transcribed speech text, all words in dependency trees are lowercased and punctuations are excluded before model training. [sent-462, score-0.707]
78 In addition, every ten dependency trees are concatenated sequentially to simulate a parsing result of a stream of words in the model training. [sent-463, score-0.478]
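A sketch of the stream simulation, reusing the `adapt` helper from the preprocessing sketch above:

```python
# Sketch: concatenate every ten adapted sentences into one training stream.
def make_streams(token_sequences, group=10):
    for i in range(0, len(token_sequences), group):
        words, labels = [], []
        for toks in token_sequences[i:i + group]:
            w, l = adapt(toks)               # from the preprocessing sketch
            words += w
            labels += l
        yield words, labels
```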
79 In the decoding, the beam size of both the transition-based parsing and punctuation prediction is set to 5. [sent-467, score-1.044]
80 The former data set is used to evaluate the punctuation prediction capability of our algorithm regardless of the noise from speech data, as our model training data come from formal text instead of transcribed speech data. [sent-476, score-1.323]
81 In addition, we also evaluate the quality of our transition-based parsing, as its performance could have a big influence on the quality of punctuation prediction. [sent-478, score-0.664]
82 5.1 Performance on correctly recognized text We achieved good performance on full-stop punctuation compared to the baseline, which shows our method can efficiently perform sentence segmentation because each segment is decided by the structure of a single parsing tree. [sent-485, score-1.016]
83 The performance of middle-paused punctuation prediction is fairly low across all methods, which shows that predicting middle-paused punctuations is a difficult task. [sent-487, score-1.035]
84 We also note that UPP yields slightly better performance than CPP on full-stop and mixed punctuation prediction, and much better performance on middle-paused punctuation prediction. [sent-491, score-1.284]
85 This could be because the interaction between parsing and punctuation prediction is tighter in UPP than in CPP. [sent-492, score-1.028]
86 5.2 Performance on automatically recognized text Table 5 shows the experimental results of punctuation prediction on automatically recognized text from TDT4 data, which is recognized using SRI’s English broadcast news ASR system, where the word error rate is estimated to be 18%. [sent-495, score-1.052]
87 As the annotation of middle-paused punctuations in TDT4 is not available, we can only evaluate the performance of full-stop punctuation prediction (i. [sent-496, score-1.035]
88 This is in line with Table 4, which consistently illustrates that CPP gets higher recall on full-stop punctuation prediction for both correctly recognized and automatically recognized texts. [sent-507, score-1.077]
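For reference, precision/recall/F1 over predicted punctuation slots can be sketched as follows; the one-label-per-word alignment is an assumption about the evaluation setup, which the digest does not spell out.

```python
# Sketch: P/R/F1 for punctuation prediction, counting only non-null slots.
def prf(pred, gold):
    tp = sum(p == g != "N" for p, g in zip(pred, gold))
    p_den = sum(p != "N" for p in pred)
    g_den = sum(g != "N" for g in gold)
    precision = tp / p_den if p_den else 0.0
    recall = tp / g_den if g_den else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

print(prf(list("MNNNF"), list("NNNNF")))  # (0.5, 1.0, 0.666...)
```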
89 5.3 Performance of transition-based parsing The performance of parsing affects the quality of punctuation prediction in our work. [sent-510, score-1.198]
90 We first evaluate the performance of our transition-based parsing over texts containing punctuations (TCP). [sent-517, score-0.447]
91 Secondly, we evaluate our parsing model in CPP over the texts without punctuations (TOP). [sent-519, score-0.445]
92 These results illustrate that the performance of transition-based parsing in our method does not degrade after being integrated with punctuation prediction. [sent-522, score-0.834]
93 As a by-product of the punctuation prediction task, the output parsing trees can benefit subsequent text processing tasks. [sent-523, score-1.083]
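UAS and LAS, glossed in the next sentence, can be sketched as simple per-word attachment accuracies; the (head, label) pair representation is an assumption.

```python
# Sketch: unlabeled (UAS) and labeled (LAS) attachment scores over
# per-word (head_index, dependency_label) pairs.
def attachment_scores(pred, gold):
    uas = sum(p[0] == g[0] for p, g in zip(pred, gold)) / len(gold)
    las = sum(p == g for p, g in zip(pred, gold)) / len(gold)
    return uas, las
```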
94 UAS = unlabeled attachment score; LAS = labeled attachment score. 6 Conclusion and Future Work In this paper, we proposed a novel method for punctuation prediction over transcribed speech texts. [sent-525, score-1.173]
95 Our approach jointly performs parsing and punctuation prediction by integrating a rich set of syntactic features. [sent-526, score-1.056]
96 It can not only yield parse trees, but also determine sentence boundaries and predict punctuation symbols from a global view of the input. [sent-527, score-0.806]
97 In addition, the performance of the parsing over the stream of transcribed words is state-of-the-art, which can benefit many subsequent text processing tasks. [sent-532, score-0.638]
98 We would also like to test the MT performance over transcribed speech texts with punctuation symbols inserted based on our method proposed in this paper. [sent-534, score-1.173]
99 The use of prosody in a combined system for punctuation generation and speech recognition. [sent-605, score-0.766]
100 Automatic sentence segmentation and punctuation prediction for spoken language translation. [sent-640, score-0.87]
wordName wordTfidf (topN-words)
[('punctuation', 0.62), ('upp', 0.288), ('transcribed', 0.252), ('cpp', 0.252), ('punctuations', 0.199), ('prediction', 0.194), ('parsing', 0.192), ('symbols', 0.138), ('actions', 0.137), ('stream', 0.137), ('transition', 0.136), ('rightarc', 0.126), ('speech', 0.107), ('recognized', 0.101), ('shift', 0.087), ('queue', 0.073), ('nivre', 0.072), ('argmax', 0.065), ('cascaded', 0.061), ('stack', 0.06), ('favre', 0.059), ('dependency', 0.057), ('segmentation', 0.056), ('action', 0.055), ('christensen', 0.054), ('predicted', 0.053), ('consumed', 0.05), ('shriberg', 0.05), ('tree', 0.049), ('leftarc', 0.048), ('zweig', 0.048), ('attached', 0.046), ('matusov', 0.044), ('iwslt', 0.043), ('zhang', 0.043), ('trees', 0.042), ('decoding', 0.041), ('hatori', 0.041), ('icslp', 0.041), ('prosody', 0.039), ('sequence', 0.038), ('beam', 0.038), ('computation', 0.038), ('broadcast', 0.036), ('conversional', 0.036), ('fullstop', 0.036), ('hakkanitur', 0.036), ('takezawa', 0.036), ('yonu', 0.036), ('subsequent', 0.035), ('treebank', 0.035), ('transitionbased', 0.034), ('asr', 0.034), ('texts', 0.034), ('tell', 0.033), ('icassp', 0.032), ('stolcke', 0.032), ('contexts', 0.031), ('roark', 0.031), ('bohnet', 0.03), ('simulate', 0.03), ('boundary', 0.03), ('anyway', 0.029), ('advmod', 0.029), ('partial', 0.029), ('optimal', 0.029), ('barriers', 0.028), ('prosodic', 0.028), ('attaches', 0.028), ('pos', 0.027), ('jointly', 0.027), ('formed', 0.027), ('boundaries', 0.026), ('favorite', 0.026), ('unified', 0.026), ('correctly', 0.025), ('menu', 0.025), ('newsgroups', 0.025), ('long', 0.023), ('huang', 0.023), ('rich', 0.023), ('noises', 0.023), ('unavailable', 0.023), ('perceptron', 0.023), ('streams', 0.022), ('global', 0.022), ('could', 0.022), ('performance', 0.022), ('conditional', 0.021), ('integrate', 0.021), ('segmentations', 0.021), ('streaming', 0.021), ('email', 0.021), ('adverbial', 0.021), ('uas', 0.021), ('uniquely', 0.02), ('weblogs', 0.02), ('china', 0.02), ('input', 0.02), ('model', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.000001 288 acl-2013-Punctuation Prediction with Transition-based Parsing
Author: Dongdong Zhang ; Shuangzhi Wu ; Nan Yang ; Mu Li
Abstract: Punctuations are not available in automatic speech recognition outputs, which could create barriers to many subsequent text processing tasks. This paper proposes a novel method to predict punctuation symbols for the stream of words in transcribed speech texts. Our method jointly performs parsing and punctuation prediction by integrating a rich set of syntactic features when processing words from left to right. It can exploit a global view to capture long-range dependencies for punctuation prediction with linear complexity. The experimental results on the test data sets of IWSLT and TDT4 show that our method can achieve high-level performance in punctuation prediction over the stream of words in transcribed speech text. 1
2 0.25527054 193 acl-2013-Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations
Author: Longkai Zhang ; Li Li ; Zhengyan He ; Houfeng Wang ; Ni Sun
Abstract: Micro-blog is a new kind of medium which is short and informal. While no segmented corpus of micro-blogs is available to train a Chinese word segmentation model, existing Chinese word segmentation tools cannot perform equally well as in ordinary news texts. In this paper we present an effective yet simple approach to Chinese word segmentation of micro-blog. In our approach, we incorporate punctuation information of unlabeled micro-blog data by introducing characters behind or ahead of punctuations, for they indicate the beginning or end of words. Meanwhile a self-training framework to incorporate confident instances is also used, which proves to be helpful. Experiments on micro-blog data show that our approach improves performance, especially in OOV-recall. 1
3 0.18413213 358 acl-2013-Transition-based Dependency Parsing with Selectional Branching
Author: Jinho D. Choi ; Andrew McCallum
Abstract: We present a novel approach, called selectional branching, which uses confidence estimates to decide when to employ a beam, providing the accuracy of beam search at speeds close to a greedy transition-based dependency parsing approach. Selectional branching is guaranteed to perform a fewer number of transitions than beam search yet performs as accurately. We also present a new transition-based dependency parsing algorithm that gives a complexity of O(n) for projective parsing and an expected linear time speed for non-projective parsing. With the standard setup, our parser shows an unlabeled attachment score of 92.96% and a parsing speed of 9 milliseconds per sentence, which is faster and more accurate than the current state-of-the-art transitionbased parser that uses beam search.
4 0.15715899 155 acl-2013-Fast and Accurate Shift-Reduce Constituent Parsing
Author: Muhua Zhu ; Yue Zhang ; Wenliang Chen ; Min Zhang ; Jingbo Zhu
Abstract: Shift-reduce dependency parsers give comparable accuracies to their chartbased counterparts, yet the best shiftreduce constituent parsers still lag behind the state-of-the-art. One important reason is the existence of unary nodes in phrase structure trees, which leads to different numbers of shift-reduce actions between different outputs for the same input. This turns out to have a large empirical impact on the framework of global training and beam search. We propose a simple yet effective extension to the shift-reduce process, which eliminates size differences between action sequences in beam-search. Our parser gives comparable accuracies to the state-of-the-art chart parsers. With linear run-time complexity, our parser is over an order of magnitude faster than the fastest chart parser.
5 0.15681204 80 acl-2013-Chinese Parsing Exploiting Characters
Author: Meishan Zhang ; Yue Zhang ; Wanxiang Che ; Ting Liu
Abstract: Characters play an important role in the Chinese language, yet computational processing of Chinese has been dominated by word-based approaches, with leaves in syntax trees being words. We investigate Chinese parsing from the character-level, extending the notion of phrase-structure trees by annotating internal structures of words. We demonstrate the importance of character-level information to Chinese processing by building a joint segmentation, part-of-speech (POS) tagging and phrase-structure parsing system that integrates character-structure features. Our joint system significantly outperforms a state-of-the-art word-based baseline on the standard CTB5 test, and gives the best published results for Chinese parsing.
6 0.14257327 132 acl-2013-Easy-First POS Tagging and Dependency Parsing with Beam Search
7 0.13811813 26 acl-2013-A Transition-Based Dependency Parser Using a Dynamic Parsing Strategy
8 0.13128513 19 acl-2013-A Shift-Reduce Parsing Algorithm for Phrase-based String-to-Dependency Translation
9 0.10105815 133 acl-2013-Efficient Implementation of Beam-Search Incremental Parsers
10 0.097401954 123 acl-2013-Discriminative Learning with Natural Annotations: Word Segmentation as a Case Study
11 0.095585749 94 acl-2013-Coordination Structures in Dependency Treebanks
12 0.090281337 37 acl-2013-Adaptive Parser-Centric Text Normalization
13 0.083212584 343 acl-2013-The Effect of Higher-Order Dependency Features in Discriminative Phrase-Structure Parsing
14 0.083045863 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing
15 0.081954792 368 acl-2013-Universal Dependency Annotation for Multilingual Parsing
16 0.074612722 82 acl-2013-Co-regularizing character-based and word-based models for semi-supervised Chinese word segmentation
17 0.071854219 112 acl-2013-Dependency Parser Adaptation with Subtrees from Auto-Parsed Target Domain Data
18 0.070927605 205 acl-2013-Joint Apposition Extraction with Syntactic and Semantic Constraints
19 0.070234478 164 acl-2013-FudanNLP: A Toolkit for Chinese Natural Language Processing
20 0.067102082 208 acl-2013-Joint Inference for Heterogeneous Dependency Parsing
topicId topicWeight
[(0, 0.172), (1, -0.099), (2, -0.206), (3, 0.072), (4, 0.019), (5, -0.033), (6, 0.054), (7, -0.012), (8, 0.021), (9, -0.01), (10, -0.029), (11, 0.068), (12, -0.03), (13, -0.017), (14, 0.096), (15, -0.008), (16, -0.084), (17, -0.036), (18, 0.007), (19, -0.003), (20, -0.001), (21, -0.002), (22, 0.036), (23, -0.018), (24, 0.003), (25, 0.027), (26, -0.03), (27, -0.026), (28, 0.042), (29, 0.005), (30, -0.059), (31, -0.005), (32, 0.005), (33, 0.031), (34, 0.008), (35, 0.007), (36, -0.0), (37, 0.032), (38, -0.07), (39, 0.098), (40, 0.016), (41, 0.007), (42, -0.017), (43, 0.019), (44, -0.037), (45, -0.037), (46, 0.046), (47, -0.005), (48, -0.092), (49, 0.009)]
simIndex simValue paperId paperTitle
same-paper 1 0.95014113 288 acl-2013-Punctuation Prediction with Transition-based Parsing
Author: Dongdong Zhang ; Shuangzhi Wu ; Nan Yang ; Mu Li
Abstract: Punctuations are not available in automatic speech recognition outputs, which could create barriers to many subsequent text processing tasks. This paper proposes a novel method to predict punctuation symbols for the stream of words in transcribed speech texts. Our method jointly performs parsing and punctuation prediction by integrating a rich set of syntactic features when processing words from left to right. It can exploit a global view to capture long-range dependencies for punctuation prediction with linear complexity. The experimental results on the test data sets of IWSLT and TDT4 show that our method can achieve high-level performance in punctuation prediction over the stream of words in transcribed speech text. 1
2 0.81154889 132 acl-2013-Easy-First POS Tagging and Dependency Parsing with Beam Search
Author: Ji Ma ; Jingbo Zhu ; Tong Xiao ; Nan Yang
Abstract: In this paper, we combine easy-first dependency parsing and POS tagging algorithms with beam search and structured perceptron. We propose a simple variant of “early-update” to ensure valid update in the training process. The proposed solution can also be applied to combine beam search and structured perceptron with other systems that exhibit spurious ambiguity. On CTB, we achieve 94.01% tagging accuracy and 86.33% unlabeled attachment score with a relatively small beam width. On PTB, we also achieve state-of-the-art performance. 1
3 0.80959302 155 acl-2013-Fast and Accurate Shift-Reduce Constituent Parsing
Author: Muhua Zhu ; Yue Zhang ; Wenliang Chen ; Min Zhang ; Jingbo Zhu
Abstract: Shift-reduce dependency parsers give comparable accuracies to their chartbased counterparts, yet the best shiftreduce constituent parsers still lag behind the state-of-the-art. One important reason is the existence of unary nodes in phrase structure trees, which leads to different numbers of shift-reduce actions between different outputs for the same input. This turns out to have a large empirical impact on the framework of global training and beam search. We propose a simple yet effective extension to the shift-reduce process, which eliminates size differences between action sequences in beam-search. Our parser gives comparable accuracies to the state-of-the-art chart parsers. With linear run-time complexity, our parser is over an order of magnitude faster than the fastest chart parser.
4 0.80303502 358 acl-2013-Transition-based Dependency Parsing with Selectional Branching
Author: Jinho D. Choi ; Andrew McCallum
Abstract: We present a novel approach, called selectional branching, which uses confidence estimates to decide when to employ a beam, providing the accuracy of beam search at speeds close to a greedy transition-based dependency parsing approach. Selectional branching is guaranteed to perform a fewer number of transitions than beam search yet performs as accurately. We also present a new transition-based dependency parsing algorithm that gives a complexity of O(n) for projective parsing and an expected linear time speed for non-projective parsing. With the standard setup, our parser shows an unlabeled attachment score of 92.96% and a parsing speed of 9 milliseconds per sentence, which is faster and more accurate than the current state-of-the-art transitionbased parser that uses beam search.
5 0.7885375 26 acl-2013-A Transition-Based Dependency Parser Using a Dynamic Parsing Strategy
Author: Francesco Sartorio ; Giorgio Satta ; Joakim Nivre
Abstract: We present a novel transition-based, greedy dependency parser which implements a flexible mix of bottom-up and top-down strategies. The new strategy allows the parser to postpone difficult decisions until the relevant information becomes available. The novel parser has a ∼12% error reduction in unlabeled attachment score over an arc-eager parser, with a slow-down factor of 2.8.
6 0.76329637 133 acl-2013-Efficient Implementation of Beam-Search Incremental Parsers
7 0.67415947 80 acl-2013-Chinese Parsing Exploiting Characters
8 0.66509545 343 acl-2013-The Effect of Higher-Order Dependency Features in Discriminative Phrase-Structure Parsing
9 0.64077407 19 acl-2013-A Shift-Reduce Parsing Algorithm for Phrase-based String-to-Dependency Translation
10 0.6344381 208 acl-2013-Joint Inference for Heterogeneous Dependency Parsing
11 0.62597615 335 acl-2013-Survey on parsing three dependency representations for English
12 0.59125739 362 acl-2013-Turning on the Turbo: Fast Third-Order Non-Projective Turbo Parsers
13 0.5783056 331 acl-2013-Stop-probability estimates computed on a large corpus improve Unsupervised Dependency Parsing
14 0.55849069 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing
15 0.55324137 28 acl-2013-A Unified Morpho-Syntactic Scheme of Stanford Dependencies
16 0.54241902 123 acl-2013-Discriminative Learning with Natural Annotations: Word Segmentation as a Case Study
17 0.54071397 163 acl-2013-From Natural Language Specifications to Program Input Parsers
18 0.52493036 82 acl-2013-Co-regularizing character-based and word-based models for semi-supervised Chinese word segmentation
19 0.51430154 193 acl-2013-Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations
20 0.513843 112 acl-2013-Dependency Parser Adaptation with Subtrees from Auto-Parsed Target Domain Data
topicId topicWeight
[(0, 0.053), (6, 0.042), (11, 0.077), (24, 0.05), (26, 0.057), (28, 0.01), (35, 0.059), (42, 0.069), (47, 0.21), (48, 0.044), (70, 0.079), (88, 0.024), (90, 0.024), (95, 0.092)]
simIndex simValue paperId paperTitle
1 0.83553231 380 acl-2013-VSEM: An open library for visual semantics representation
Author: Elia Bruni ; Ulisse Bordignon ; Adam Liska ; Jasper Uijlings ; Irina Sergienya
Abstract: VSEM is an open library for visual semantics. Starting from a collection of tagged images, it is possible to automatically construct an image-based representation of concepts by using off-theshelf VSEM functionalities. VSEM is entirely written in MATLAB and its objectoriented design allows a large flexibility and reusability. The software is accompanied by a website with supporting documentation and examples.
same-paper 2 0.79174125 288 acl-2013-Punctuation Prediction with Transition-based Parsing
Author: Dongdong Zhang ; Shuangzhi Wu ; Nan Yang ; Mu Li
Abstract: Punctuations are not available in automatic speech recognition outputs, which could create barriers to many subsequent text processing tasks. This paper proposes a novel method to predict punctuation symbols for the stream of words in transcribed speech texts. Our method jointly performs parsing and punctuation prediction by integrating a rich set of syntactic features when processing words from left to right. It can exploit a global view to capture long-range dependencies for punctuation prediction with linear complexity. The experimental results on the test data sets of IWSLT and TDT4 show that our method can achieve high-level performance in punctuation prediction over the stream of words in transcribed speech text. 1
3 0.71039414 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl
Author: Jason R. Smith ; Herve Saint-Amand ; Magdalena Plamada ; Philipp Koehn ; Chris Callison-Burch ; Adam Lopez
Abstract: Parallel text is the fuel that drives modern machine translation systems. The Web is a comprehensive source of preexisting parallel text, but crawling the entire web is impossible for all but the largest companies. We bring web-scale parallel text to the masses by mining the Common Crawl, a public Web crawl hosted on Amazon’s Elastic Cloud. Starting from nothing more than a set of common two-letter language codes, our open-source extension of the STRAND algorithm mined 32 terabytes of the crawl in just under a day, at a cost of about $500. Our large-scale experiment uncovers large amounts of parallel text in dozens of language pairs across a variety of domains and genres, some previously unavailable in curated datasets. Even with minimal cleaning and filtering, the resulting data boosts translation performance across the board for five different language pairs in the news domain, and on open domain test sets we see improvements of up to 5 BLEU. We make our code and data available for other researchers seeking to mine this rich new data resource.1
4 0.68564171 80 acl-2013-Chinese Parsing Exploiting Characters
Author: Meishan Zhang ; Yue Zhang ; Wanxiang Che ; Ting Liu
Abstract: Characters play an important role in the Chinese language, yet computational processing of Chinese has been dominated by word-based approaches, with leaves in syntax trees being words. We investigate Chinese parsing from the character-level, extending the notion of phrase-structure trees by annotating internal structures of words. We demonstrate the importance of character-level information to Chinese processing by building a joint segmentation, part-of-speech (POS) tagging and phrase-structure parsing system that integrates character-structure features. Our joint system significantly outperforms a state-of-the-art word-based baseline on the standard CTB5 test, and gives the best published results for Chinese parsing.
5 0.68504989 155 acl-2013-Fast and Accurate Shift-Reduce Constituent Parsing
Author: Muhua Zhu ; Yue Zhang ; Wenliang Chen ; Min Zhang ; Jingbo Zhu
Abstract: Shift-reduce dependency parsers give comparable accuracies to their chartbased counterparts, yet the best shiftreduce constituent parsers still lag behind the state-of-the-art. One important reason is the existence of unary nodes in phrase structure trees, which leads to different numbers of shift-reduce actions between different outputs for the same input. This turns out to have a large empirical impact on the framework of global training and beam search. We propose a simple yet effective extension to the shift-reduce process, which eliminates size differences between action sequences in beam-search. Our parser gives comparable accuracies to the state-of-the-art chart parsers. With linear run-time complexity, our parser is over an order of magnitude faster than the fastest chart parser.
6 0.68049735 132 acl-2013-Easy-First POS Tagging and Dependency Parsing with Beam Search
7 0.6753369 134 acl-2013-Embedding Semantic Similarity in Tree Kernels for Domain Adaptation of Relation Extraction
8 0.6738621 343 acl-2013-The Effect of Higher-Order Dependency Features in Discriminative Phrase-Structure Parsing
9 0.67322624 164 acl-2013-FudanNLP: A Toolkit for Chinese Natural Language Processing
10 0.66838312 333 acl-2013-Summarization Through Submodularity and Dispersion
11 0.66504031 212 acl-2013-Language-Independent Discriminative Parsing of Temporal Expressions
12 0.66488552 174 acl-2013-Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
13 0.66475487 123 acl-2013-Discriminative Learning with Natural Annotations: Word Segmentation as a Case Study
14 0.66422945 82 acl-2013-Co-regularizing character-based and word-based models for semi-supervised Chinese word segmentation
15 0.66259235 9 acl-2013-A Lightweight and High Performance Monolingual Word Aligner
16 0.66067469 18 acl-2013-A Sentence Compression Based Framework to Query-Focused Multi-Document Summarization
17 0.66057873 276 acl-2013-Part-of-Speech Induction in Dependency Trees for Statistical Machine Translation
19 0.65987599 193 acl-2013-Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations
20 0.65978235 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing