
2 emnlp-2010-A Fast Decoder for Joint Word Segmentation and POS-Tagging Using a Single Discriminative Model


Source: pdf

Author: Yue Zhang; Stephen Clark

Abstract: We show that the standard beam-search algorithm can be used as an efficient decoder for the global linear model of Zhang and Clark (2008) for joint word segmentation and POS-tagging, achieving a significant speed improvement. Such decoding is enabled by: (1) separating full word features from partial word features so that feature templates can be instantiated incrementally, according to whether the current character is separated or appended; (2) deciding the POS-tag of a potential word when its first character is processed. Early-update is used with perceptron training so that the linear model gives a high score to a correct partial candidate as well as a full output. Effective scoring of partial structures allows the decoder to give high accuracy with a small beam-size of 16. In our 10-fold cross-validation experiments with the Chinese Treebank, our system performed over 10 times as fast as Zhang and Clark (2008) with little accuracy loss. The accuracy of our system on the standard CTB 5 test was competitive with the best in the literature.

1 Introduction and Motivation

Several approaches have been proposed to solve word segmentation and POS-tagging jointly, including the reranking approach (Shi and Wang, 2007; Jiang et al., 2008b), the hybrid approach (Nakagawa and Uchimoto, 2007; Jiang et al., 2008a), and the single-model approach (Ng and Low, 2004; Zhang and Clark, 2008; Kruengkrai et al., 2009). These methods led to accuracy improvements over the traditional, pipelined segmentation and POS-tagging baseline by avoiding segmentation error propagation and making use of part-of-speech information to improve segmentation.

The single-model approach to joint segmentation and POS-tagging offers consistent training of all information, concerning words, characters and parts-of-speech. However, exact inference with dynamic programming can be infeasible if features are defined over a large enough range of the output, such as over a two-word history. In our previous work (Zhang and Clark, 2008), which we refer to as Z&C08 from now on, we used an approximate decoding algorithm that keeps track of a set of partially built structures for each character, which can be seen as a dynamic programming chart that is greatly reduced by pruning.

In this paper we follow the line of single-model research, in particular the global linear model of Z&C08. We show that effective decoding can be achieved with standard beam-search, which gives significant speed improvements compared to the decoding algorithm of Z&C08, and achieves accuracies that are competitive with the state-of-the-art. Our research is also in line with recent research on improving the speed of NLP systems with little or no accuracy loss (Charniak et al., 2006; Roark and Hollingshead, 2008).

Our speed improvement is achieved by the use of a single-beam decoder. Given an input sentence, candidate outputs are built incrementally, one character at a time. When each character is processed, it is combined with existing candidates in all possible ways to generate new candidates, and an agenda is used to keep the N-best candidate outputs from the beginning of the sentence to the current character.
Compared to the multiple-beam search algorithm of Z&C08, the use of a single beam can lead to an order of magnitude faster decoding speed.

1.1 The processing of partial words

An important problem that we solve in this paper is the handling of partial words with a single-beam decoder for the global model. As we pointed out in Z&C08, it is very difficult to score partial words properly when they are compared with full words, although such comparison is necessary for incremental decoding with a single beam. To allow comparisons with full words, partial words can either be treated as full words, or handled differently.

We showed in Z&C08 that a naive single-beam decoder which treats partial words in the same way as full words failed to give a competitive accuracy. An important reason for the low accuracy is over-segmentation during beam-search. Consider the three characters “自 来 水 (tap water)”. The first two characters do not make sense when put together as a single word. Rather, when treated as two single-character words, they can make sense in a sentence such as “请 (please) 自 (self) 来 (come) 取 (take)”. Therefore, when using single-beam search to process “自 来 水 (tap water)”, the two-character word candidate “自 来” is likely to have been thrown off the agenda before the third character “水” is considered, leading to an unrecoverable segmentation error. This problem is even more severe for a joint segmentor and POS-tagger than for a pure word segmentor, since the POS-tags and POS-tag bigram of “自” and “来” further support them being separated when “来” is considered.

The multiple-beam search decoder we proposed in Z&C08 can be seen as a means to ensure that the three characters “自 来 水” always have a chance to be considered as a single word. It explores candidate segmentations from the beginning of the sentence until each character, and avoids the problem of processing partial words by considering only full words. However, since it explores a larger part of the search space than a single-beam decoder, its time complexity is correspondingly higher.

In this paper, we treat partial words differently from full words, so that in the previous example, the decoder can take the first two characters in “自 来 水 (tap water)” as a partial word, and keep it in the beam before the third character is processed.

One challenge is the representation of POS-tags for partial words. The POS of a partial word is undefined without the corresponding full word information. Though a partial word can make sense with a particular POS-tag when it is treated as a complete word, this POS-tag is not necessarily the POS of the full word which contains the partial word. Take the three-character sequence “下 雨 天” as an example. The first character “下” represents a single-character word “below”, for which the POS can be LC or VV. The first two characters “下 雨” represent a two-character word “rain”, for which the POS can be VV. Moreover, all three characters when put together make the word “rainy day”, for which the POS is NN. As discussed above, assigning POS-tags to partial words as if they were full words leads to low accuracy.

An obvious solution to the above problem is not to assign a POS to a partial word until it becomes a full word. However, lack of POS information for partial words makes them less competitive compared to full words in the beam, since the scores of full words are further supported by POS and POS n-gram information.
Therefore, not assigning POS to partial words potentially leads to over-segmentation. In our experiments, this method did not give comparable accuracies to our Z&C08 system.

In this paper, we take a different approach, and assign a POS-tag to a partial word when its first character is separated from the final character of the previous word. When more characters are appended to a partial word, the POS is not changed. The idea is to use the POS of a partial word as the predicted POS of the full word it will become. Possible predictions are made with the first character of the word, and the likely ones will be kept in the beam for the next processing steps. For example, with the three characters “下 雨 天”, we try to keep two partial words (besides full words) in the beam when the first word “下” is processed, with the POS being VV and NN, respectively. The first POS predicts the two-character word “下 雨”, and the second the three-character word “下 雨 天”. Now when the second character is processed, we still need to maintain the possible POS NN in the agenda, which predicts the three-character word “下 雨 天”.

As a main contribution of this paper, we show that the mechanism of predicting the POS at the first character gives competitive accuracy. This mechanism can be justified theoretically. Unlike alphabetical languages, each Chinese character represents some specific meanings. Given a character, it is natural for a human speaker to know immediately what types of words it can start. This allows the knowledge of possible POS-tags of words that a character can start, using information about the character from the training data. Moreover, the POS of the previous words to the current word is also useful in deciding possible POS for the word. (Footnote 1: The next incoming characters are also a useful source of information for predicting the POS. However, our system achieved competitive accuracy with Z&C08 without such character lookahead features.)

The mechanism of first-character decision of POS also boosts the efficiency, since the enumeration of POS is unnecessary when a character is appended to the end of an existing word. As a result, the complexity of each processing step is reduced by half compared to a method without POS prediction.
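To make the idea of character-based POS prediction concrete, the following is a minimal sketch of how the set of possible starting POS-tags for each character could be collected from a segmented and tagged training corpus. It illustrates the idea only and is not the authors' code; the function name and the toy corpus are hypothetical.

from collections import defaultdict

def collect_start_tags(corpus):
    # corpus: iterable of sentences, each a list of (word, POS) pairs
    start_tags = defaultdict(set)
    for sentence in corpus:
        for word, pos in sentence:
            start_tags[word[0]].add(pos)  # record the tag against the word's first character
    return start_tags

# Toy example (hypothetical analyses): words starting with "下" have been seen with LC and VV,
# so both tags would be predicted for a partial word that starts with "下".
corpus = [[("下", "LC"), ("雨", "VV")], [("下雨", "VV"), ("天", "NN")]]
print(collect_start_tags(corpus)["下"])  # -> {'LC', 'VV'} (set order may vary)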
Finally, an intuitive way to represent the status of a partial word is using a flag explicitly, which means an early decision of the segmentation of the next incoming character. We take a simpler alternative approach, and treat every word as a partial word until the next incoming character is separated from the last character of this word. Before a word is confirmed as a full word, we only apply to it features that represent its current partial status, such as character bigrams, its starting character and its part-of-speech, etc. Full word features, including the first and last characters of a word, are applied immediately after a word is confirmed as complete.

An important component for our proposed system is the training process, which needs to ensure that the model scores a partial word with predicted POS properly. We use the averaged perceptron (Collins, 2002) for training, together with the “early update” mechanism of Collins and Roark (2004). Rather than updating the parameters after decoding is complete, the modified algorithm updates parameters at any processing step if the correct partial candidate falls out of the beam.

In our experiments using the Chinese Treebank data, our system ran an order of magnitude faster than our Z&C08 system with little loss of accuracy. The accuracy of our system was competitive with other recent models.

2 Model and Feature Templates

We use a linear model to score both partial and full candidate outputs. Given an input x, the score of a candidate output y is computed as:

Score(y) = Φ(y) · w

where Φ(y) is the global feature vector extracted from y, and w is the parameter vector of the model.

Table 1 shows the feature templates for the model, where templates 1–14 contain only segmentation information and templates 15–29 contain both segmentation and POS information. Each template is instantiated according to the current character in the decoding process. Row “For” shows the conditions for template instantiation, where “s” indicates that the corresponding template is instantiated when the current character starts a new word, and “a” indicates that the corresponding template is instantiated when the current character does not start a new word. In the row for feature templates, w, t and c are used to represent a word, a POS-tag and a character, respectively. The subscripts are based on the current character, where w−1 represents the first word to the left of the current character, and p−2 represents the POS-tag on the second word to the left of the current character, and so on. As an example, feature template 1 is instantiated when the current character starts a new word, and the resulting feature value is the word to the left of this character. start(w), end(w) and len(w) represent the first character, the last character and the length of word w, respectively. The length of a word is normalized to 16 if it is larger than 16. cat(c) represents the POS category of character c, which is the set of POS-tags seen on character c, as we used in Z&C08.

[Table 1: Feature templates, with the “For” column giving their instantiation conditions; not reproduced here.]

Given a partial or complete candidate y, its global feature vector Φ(y) is computed by instantiating all applicable feature templates from Table 1 for each character in y, according to whether or not the character is separated from the previous character. The feature templates are mostly taken from, or inspired by, the feature templates of Z&C08. Templates 1, 2, 3, 4, 5, 8, 10, 12, 13, 14, 15, 19, 20, 24, 27 and 29 concern complete word information, and they are used in the model to differentiate correct and incorrect output structures in the same way as our Z&C08 model. Templates 6, 7, 9, 16, 17, 18, 21, 22, 23, 25, 26 and 28 concern partial word information, whose role in the model is to indicate the likelihood that the partial word including the current character will become a correct full word. They act as guidance for the action to take for the current character according to the context, and are the crucial reason for the effectiveness of the algorithm with a small beam-size.
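As a rough illustration of how such templates can be instantiated incrementally and scored, consider the sketch below. The template strings, helper arguments and weights are invented for the example and cover only a tiny, simplified subset of the 29 templates; the real feature set is the one in Table 1.

def instantiate(prev_word, prev_tag, cur_word, cur_tag, char, separated):
    # Return the feature strings triggered by the current character.
    # Only a handful of illustrative templates are shown.
    feats = []
    if separated:                                   # "s" templates: the character starts a new word
        feats.append("w-1=" + prev_word)            # e.g. completed word to the left (template 1)
        feats.append("t0=" + cur_tag)               # POS predicted for the new (partial) word
        feats.append("t-1t0=" + prev_tag + "_" + cur_tag)
        feats.append("start(w0)t0=" + char + "_" + cur_tag)
    else:                                           # "a" templates: the character is appended
        feats.append("c-1c0=" + cur_word[-2:])      # character bigram inside the current word
        feats.append("c0t0=" + char + "_" + cur_tag)
    return feats

def score(feats, weights):
    # Score(y) accumulates Phi(y) . w over the features instantiated so far.
    return sum(weights.get(f, 0.0) for f in feats)

weights = {"w-1=自来": 0.4, "t0=NN": 0.1}           # toy weights
print(score(instantiate("自来", "NN", "水", "NN", "水", True), weights))  # 0.5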
2.1 Decoding

The decoding algorithm builds an output candidate incrementally, one character at a time. Each character can either be attached to the current word or separated as the start of a new word. When the current character starts a new word, a POS-tag is assigned to the new word. An agenda is used by the decoder to keep the N-best candidates during the incremental process. Before decoding starts, the agenda is initialized with an empty sentence. When a character is processed, existing candidates are removed from the agenda and extended with the current character in all possible ways, and the N-best newly generated candidates are put back onto the agenda. After all input characters have been processed, the highest-scored candidate from the agenda is taken as the output.

Pseudocode for the decoder is shown in Figure 1.

function DECODE(sent, agenda):
    CLEAR(agenda)
    ADDITEM(agenda, "")
    for index in [0..LEN(sent)]:
        for cand in agenda:
            new ← APPEND(cand, sent[index])
            ADDITEM(agenda, new)
            for pos in TAGSET():
                new ← SEP(cand, sent[index], pos)
                ADDITEM(agenda, new)
        agenda ← N-BEST(agenda)
    return BEST(agenda)

Figure 1: The incremental beam-search decoder.

CLEAR removes all items from the agenda, ADDITEM adds a new item onto the agenda, N-BEST returns the N highest-scored items from the agenda, and BEST returns the highest-scored item from the agenda. LEN returns the number of characters in a sentence, and sent[i] returns the ith character from the sentence. APPEND appends a character to the last word in a candidate, and SEP joins a character as the start of a new word in a candidate, assigning a POS-tag to the new word.

Both our decoding algorithm and the decoding algorithm of Z&C08 run in linear time. However, in order to generate possible candidates for each character, Z&C08 uses an extra loop to search for possible words that end with the current character. A restriction to the maximum word length is applied to limit the number of iterations in this loop, without which the algorithm would have quadratic time complexity. In contrast, our decoder does not search backward for the possible starting character of any word. Segmentation ambiguities are resolved by binary choices between the actions append or separate for each character, and no POS enumeration is required when the character is appended. This improves the speed by a significant factor.
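The following is a minimal runnable sketch of this decoder, written to mirror Figure 1 rather than to reproduce the authors' implementation. Candidates are encoded as (score, words, tags) triples, and score_incr stands in for scoring the feature templates instantiated by the latest action; both the encoding and the toy scorer are assumptions made for the example.

import heapq

def decode(sent, score_incr, tagset, beam_size=16):
    agenda = [(0.0, (), ())]                            # ADDITEM(agenda, "")
    for ch in sent:                                     # one character at a time
        new_agenda = []
        for sc, words, tags in agenda:
            if words:                                   # APPEND: attach ch to the last word
                w2, t2 = words[:-1] + (words[-1] + ch,), tags
                new_agenda.append((sc + score_incr(w2, t2), w2, t2))
            for pos in tagset:                          # SEP: ch starts a new word tagged pos
                w2, t2 = words + (ch,), tags + (pos,)
                new_agenda.append((sc + score_incr(w2, t2), w2, t2))
        agenda = heapq.nlargest(beam_size, new_agenda)  # N-BEST(agenda)
    return max(agenda)                                  # BEST(agenda)

# Toy usage with a degenerate scorer that simply prefers longer words.
print(decode("自来水", lambda ws, ts: 0.1 * len(ws[-1]), ["NN", "VV"], beam_size=4))

Note that POS-tags are only enumerated for the separate action, which is where the halved per-step complexity mentioned above comes from.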
2.2 Training

The learning algorithm is based on the generalized perceptron (Collins, 2002), but parameter adjustments can be performed at any character during the decoding process, using the “early update” mechanism of Collins and Roark (2004). The parameter vector of the model is initialized as all zeros before training, and used to decode training examples. Each training example is turned into the raw input format, and processed in the same way as decoding. After each character is processed, partial candidates in the agenda are compared to the corresponding gold-standard output for the same characters. If none of the candidates in the agenda are correct, the decoding is stopped and the parameter vector is updated by adding the global feature vector of the gold-standard partial output and subtracting the global feature vector of the highest-scored partial candidate in the agenda. The training process then moves on to the next example. However, if any item in the agenda is the same as the corresponding gold-standard, the decoding process moves to the next character, without any change to the parameter values. After all characters are processed, the decoder prediction is compared with the training example. If the prediction is correct, the parameter vector is not changed; otherwise it is updated by adding the global feature vector of the training example and subtracting the global feature vector of the decoder prediction, just as the perceptron algorithm does. The same training examples can be used to train the model for multiple iterations. We use the averaged parameter vector (Collins, 2002) as the final model.

Pseudocode for the training algorithm is shown in Figure 2.

function TRAIN(sent, agenda, gold-standard, w):
01:   CLEAR(agenda)
02:   ADDITEM(agenda, "")
03:   for index in [0..LEN(sent)]:
04:     for cand in agenda:
05:       new ← APPEND(cand, sent[index])
06:       ADDITEM(agenda, new)
07:       for pos in TAGSET():
08:         new ← SEP(cand, sent[index], pos)
09:         ADDITEM(agenda, new)
10:     agenda ← N-BEST(agenda)
11:     for cand in agenda:
12:       if cand = gold-standard[0:index]:
13:         CONTINUE
14:     w ← w + Φ(gold-standard[0:index])
15:     w ← w − Φ(BEST(agenda))
16:     return w
17:   if BEST(agenda) ≠ gold-standard:
18:     w ← w + Φ(gold-standard)
19:     w ← w − Φ(BEST(agenda))
20:     return w
21:   return w

Figure 2: The incremental learning function.

It is based on the decoding algorithm in Figure 1, and the main differences are: (1) the training algorithm takes the gold-standard output and the parameter vector as two additional arguments; (2) the training algorithm does not return a prediction, but modifies the parameter vector when necessary; (3) lines 11 to 16 are additional lines of code for parameter updates. Without lines 11 to 16, the training algorithm is exactly the same as the generalized perceptron algorithm. These lines are added to ensure that the agenda contains highly probable candidates during the whole beam-search process, and they are crucial to the high accuracy of the system. As stated earlier, the decoder relies on proper scoring of partial words to maintain a set of high quality candidates in the agenda. Updating the value of the parameter vector for partial outputs can be seen as a means to ensure correct scoring of partial candidates at any character.
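The crucial addition relative to the standard perceptron is the per-character check in lines 11 to 16. A minimal sketch of that check is given below; the candidate and feature representations are assumptions carried over from the decoder sketch above, and phi stands in for computing the global feature vector Φ of a (partial) candidate.

def early_update_step(agenda, gold_prefix, weights, phi):
    # agenda: list of (score, candidate) pairs after N-BEST for the current character.
    # gold_prefix: the gold-standard partial output over the characters processed so far.
    # Returns True if an early update was made and decoding of this sentence should stop.
    if any(cand == gold_prefix for _, cand in agenda):
        return False                           # the gold prefix survived; keep decoding
    best = max(agenda)[1]                      # highest-scored (incorrect) partial candidate
    for f, v in phi(gold_prefix).items():      # w <- w + Phi(gold-standard[0:index])
        weights[f] = weights.get(f, 0.0) + v
    for f, v in phi(best).items():             # w <- w - Phi(BEST(agenda))
        weights[f] = weights.get(f, 0.0) - v
    return True

Averaging of the parameter vector over all updates, as in Collins (2002), would be layered on top of this step.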
2.3 Pruning

We follow Z&C08 and use several pruning methods, most of which serve to improve the accuracy by removing irrelevant candidates from the beam.

First, the system records the maximum number of characters that a word with a particular POS-tag can have. For example, from the Chinese Treebank that we used for our experiments, most POS-tags are associated only with one- or two-character words. The only POS-tags that are seen with words over ten characters long are NN (noun), NR (proper noun) and CD (numbers). The maximum word length information is initialized as all ones, and updated according to each training example before it is processed.

Second, a tag dictionary is used to record POS-tags associated with each word. During decoding, frequent words and words with “closed set” tags (Footnote 2: “Closed set” tags are the set of POS-tags which are only associated with a fixed set of words, according to the Penn Chinese Treebank specifications (Xia, 2000).) are only allowed POS-tags according to the tag dictionary, while other words are allowed every POS-tag to make candidate outputs. Whether a word is a frequent word is decided by the number of times it has been seen in the training process. Denoting the number of times the most frequent word has been seen with M, a word is a frequent word if it has been seen more than M/5000 + 5 times. The threshold value is taken from Z&C08, and we did not adjust it during development. Word frequencies are initialized as zeros and updated according to each training example before it is processed; the tag dictionary is initialized as empty and updated according to each training example before it is processed.

Third, we make an additional record of the initial characters for words with “closed set” tags. During decoding, when the current character is added as the start of a new word, “closed set” tags are only assigned to the word if it is consistent with the record. This type of pruning is used in addition to the tag dictionary to prune invalid partial words, while the tag dictionary is used to prune complete words. The record for initial character and POS is initially empty, and updated according to each training example before it is processed.

Finally, at any decoding step, we group partial candidates that are generated by separating the current character as the start of a new word by the signature p0p−1w−1, and keep only the best among those having the same p0p−1w−1. The signature p0p−1w−1 is decided by the feature templates we use: it can be shown that if two candidates cand1 and cand2 generated at the same step have the same signature, and the score of cand1 is higher than the score of cand2, then at any future step, the highest scored candidate generated from cand1 will always have a higher score than the highest scored candidate generated from cand2.

From the above pruning methods, only the third was not used by Z&C08. It can be seen as an extra mechanism to help keep likely partial words in the agenda and improve the accuracy, but it does not give our system a speed advantage over Z&C08.
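As a rough sketch of the last (signature-based) rule, the grouping can be implemented as below; candidates are again assumed to be (score, words, tags) triples, which is an illustrative encoding rather than the authors' actual data structure.

def prune_by_signature(separated_candidates):
    # Keep only the best-scored candidate for each signature (p0, p-1, w-1):
    # the tag of the new word, the previous tag and the previous word.
    best = {}
    for sc, words, tags in separated_candidates:
        sig = (tags[-1],
               tags[-2] if len(tags) > 1 else None,    # p-1
               words[-2] if len(words) > 1 else None)  # w-1
        if sig not in best or sc > best[sig][0]:
            best[sig] = (sc, words, tags)
    return list(best.values())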
3 Experiments

We used the Chinese Treebank (CTB) data to perform one set of development tests and two sets of final tests. The CTB 4 was split into two parts, with the CTB 3 being used for a 10-fold cross-validation test to compare speed and accuracies with Z&C08, and the rest being used for development. The CTB 5 was used to perform the additional set of experiments to compare accuracies with other recent work.

We use the standard F-measure to evaluate output accuracies. For word segmentation, precision is defined as the number of correctly segmented words divided by the total number of words in the output, and recall is defined as the number of correctly segmented words divided by the total number of words in the gold-standard output. For joint segmentation and POS-tagging, precision is defined as the number of correctly segmented and POS-tagged words divided by the total number of words in the output, and recall is defined as the number of correctly segmented and POS-tagged words divided by the total number of words in the gold-standard output. All our experiments were performed on a Linux platform, and a single 2.66GHz Intel Core 2 CPU.

3.1 Development tests

Our development data consists of 150K words in 4798 sentences. 80% of the data were randomly chosen as the development training data, while the rest were used as the development test data. Our development tests were mainly used to decide the size of the beam, the number of training iterations, the effect of partial features in beam-search decoding, and the effect of incremental learning (i.e. early update).

[Figure 3: The influence of beam-sizes, and the convergence of the perceptron (accuracy plotted against training iteration); not reproduced here.]

Figure 3 shows the accuracy curves for joint segmentation and POS-tagging by the number of training iterations, using different beam sizes. With the size of the beam increasing from 1 to 32, the accuracies generally increase, while the amount of increase becomes small when the size of the beam becomes 16. After the 10th iteration, a beam size of 32 does not always give better accuracies than a beam size of 16. We therefore chose 16 as the size of the beam for our system. The testing times for each beam size between 1 and 32 are 7.16s, 11.90s, 18.42s, 27.82s, 46.77s and 89.21s, respectively. The corresponding speeds in the number of sentences per second are 111.45, 67.06, 43.32, 28.68, 17.06 and 8.95, respectively.

Figure 3 also shows that the accuracy increases with an increased number of training iterations, but the amount of increase becomes small after the 25th iteration. We chose 29 as the number of iterations to train our system.

The effect of incremental training: We compare the accuracies by incremental training using early update and normal perceptron training. In the normal perceptron training case, lines 11 to 16 are taken out of the training algorithm in Figure 2. The algorithm reached the best performance at the 22nd iteration, with the segmentation F-score being 90.58% and the joint F-score being 83.38%. In the incremental training case, the algorithm reached the best accuracy at the 30th training iteration, obtaining a segmentation F-score of 91.14% and a joint F-score of 84.06%.

3.2 Final tests using CTB 3

CTB 3 consists of 150K words in 10364 sentences. We follow Z&C08 and split it into 10 equal-sized parts. In each test, one part is taken as the test data and the other nine are combined together as the training data. We compare the speed and accuracy with the joint segmentor and tagger of Z&C08, which is publicly available as the ZPar system, version 0.2 (http://www.sourceforge.net/projects/zpar). The results are shown in Table 2, where each row shows one cross-validation test. The column headings “sf”, “jf”, “time” and “speed” refer to segmentation F-measure, joint F-measure, testing time (in seconds) and testing speed (in the number of sentences per second), respectively.

[Table 2: 10-fold cross-validation results on CTB 3, comparing Z&C08 and this paper on sf, jf, time and speed; not reproduced here.]

Our system gave a joint segmentation and POS-tagging F-score of 91.37%, which is only 0.04% lower than that of ZPar 0.2. The speed of our system was over 10 times as fast as ZPar 0.2.

3.3 Final tests using CTB 5

We follow Kruengkrai et al. (2009) and split the CTB 5 into training, development testing and testing sets, as shown in Table 3. We ignored the development test data since our system had been developed in previous experiments.

[Table 3: Training, development and test sets on CTB 5 (sections, sentences and words); not reproduced here.]

Kruengkrai et al. (2009) made use of character type knowledge for spaces, numerals, symbols, alphabets, Chinese and other characters. In the previous experiments, our system did not use any knowledge beyond the training data. To make the comparison fairer, we included knowledge of English letters and Arabic numbers in this experiment. During both training and decoding, English letters and Arabic numbers are segmented using simple rules, treating consecutive English letters or Arabic numbers as a single word.

The results are shown in Table 4, where row “N07” refers to the model of Nakagawa and Uchimoto (2007), rows “J08a” and “b” refer to the models of Jiang et al. (2008a) and Jiang et al. (2008b), and row “K09” refers to the models of Kruengkrai et al. (2009). Columns “sf” and “jf” refer to segmentation and joint accuracies, respectively.

[Table 4: Accuracy comparison with recent studies on CTB 5; not reproduced here.]

Our system gave comparable accuracies to these recent works, obtaining the best (same as the error-driven version of K09) joint F-score.

4 Related Work

The effectiveness of our beam-search decoder showed that the joint segmentation and tagging problem may be less complex than previously perceived (Zhang and Clark, 2008; Jiang et al., 2008a).
At the very least, the single-model approach with a simple decoder achieved competitive accuracies to what has been achieved so far by the reranking models (Shi and Wang, 2007; Jiang et al., 2008b) and an ensemble model using machine-translation techniques (Jiang et al., 2008a). This may shed new light on joint segmentation and POS-tagging methods.

Kruengkrai et al. (2009) and Zhang and Clark (2008) are the most similar to our system among related work. Both systems use a discriminatively trained linear model to score candidate outputs. The work of Kruengkrai et al. (2009) is based on Nakagawa and Uchimoto (2007), which separates the processing of known words and unknown words, and uses a set of segmentation tags to represent the segmentation of characters. In contrast, our model is conceptually simpler, and does not differentiate known words and unknown words. Moreover, our model is based on our previous work, in line with Zhang and Clark (2007), which does not treat word segmentation as character sequence labeling.

Our learning and decoding algorithms are also different from Kruengkrai et al. (2009). While Kruengkrai et al. (2009) perform dynamic programming and MIRA learning, we use beam-search to perform incremental decoding, and the early-update version of the perceptron algorithm to train the model. Dynamic programming is exact inference, for which the time complexity is decided by the locality of feature templates. In contrast, beam-search is approximate and can run in linear time. The parameter updating for our algorithm is conceptually and computationally simpler than MIRA, though its performance can be slightly lower. However, the early-update mechanism we use is consistent with our incremental approach, and improves the learning of the beam-search process.

5 Conclusion

We showed that a simple beam-search decoding algorithm can be effectively applied to the decoding problem for a global linear model for joint word segmentation and POS-tagging. By guiding search with partial word information and performing learning for partial candidates, our system achieved significantly faster speed with little accuracy loss compared to the system of Z&C08. The source code of our joint segmentor and POS-tagger can be found at www.sourceforge.net/projects/zpar, version 0.4.

Acknowledgements

We thank Canasai Kruengkrai for discussion on efficiency issues, and the anonymous reviewers for their suggestions. Yue Zhang and Stephen Clark are supported by the European Union Seventh Framework Programme (FP7-ICT-2009-4) under grant agreement no. 247762.

References

Eugene Charniak, Mark Johnson, Micha Elsner, Joseph Austerweil, David Ellis, Isaac Haxton, Catherine Hill, R. Shrivaths, Jeremy Moore, Michael Pozar, and Theresa Vu. 2006. Multilevel coarse-to-fine PCFG parsing. In Proceedings of HLT/NAACL, pages 168–175, New York City, USA, June. Association for Computational Linguistics.

Michael Collins and Brian Roark. 2004. Incremental parsing with the perceptron algorithm. In Proceedings of ACL, pages 111–118, Barcelona, Spain, July.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of EMNLP, pages 1–8, Philadelphia, USA, July.

Wenbin Jiang, Liang Huang, Qun Liu, and Yajuan Lü. 2008a. A cascaded linear model for joint Chinese word segmentation and part-of-speech tagging. In Proceedings of ACL/HLT, pages 897–904, Columbus, Ohio, June.

Wenbin Jiang, Haitao Mi, and Qun Liu. 2008b. Word lattice reranking for Chinese word segmentation and part-of-speech tagging. In Proceedings of COLING, pages 385–392, Manchester, UK, August.
Canasai Kruengkrai, Kiyotaka Uchimoto, Jun'ichi Kazama, Yiou Wang, Kentaro Torisawa, and Hitoshi Isahara. 2009. An error-driven word-character hybrid model for joint Chinese word segmentation and POS tagging. In Proceedings of ACL/AFNLP, pages 513–521, Suntec, Singapore, August.

Tetsuji Nakagawa and Kiyotaka Uchimoto. 2007. A hybrid approach to word segmentation and POS tagging. In Proceedings of ACL Demo and Poster Session, Prague, Czech Republic, June.

Hwee Tou Ng and Jin Kiat Low. 2004. Chinese part-of-speech tagging: One-at-a-time or all-at-once? Word-based or character-based? In Proceedings of EMNLP, Barcelona, Spain.

Brian Roark and Kristy Hollingshead. 2008. Classifying chart cells for quadratic complexity context-free inference. In Proceedings of COLING, pages 745–752, Manchester, UK, August. Coling 2008 Organizing Committee.

Yanxin Shi and Mengqiu Wang. 2007. A dual-layer CRF based joint decoding method for cascade segmentation and labelling tasks. In Proceedings of IJCAI, Hyderabad, India.

Fei Xia. 2000. The part-of-speech tagging guidelines for the Chinese Treebank (3.0).

Yue Zhang and Stephen Clark. 2007. Chinese segmentation with a word-based perceptron algorithm. In Proceedings of ACL, pages 840–847, Prague, Czech Republic, June.

Yue Zhang and Stephen Clark. 2008. Joint word segmentation and POS tagging using a single perceptron. In Proceedings of ACL/HLT, pages 888–896, Columbus, Ohio, June.

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 zhang , stephen Abstract We show that the standard beam-search algorithm can be used as an efficient decoder for the global linear model of Zhang and Clark (2008) for joint word segmentation and POS-tagging, achieving a significant speed improvement. [sent-2, score-0.624]

2 Early-update is used with perceptron training so that the linear model gives a high score to a correct partial candidate as well as a full output. [sent-4, score-0.57]

3 Effective scoring of partial structures allows the decoder to give high accuracy with a small beam-size of 16. [sent-5, score-0.492]

4 The single-model approach to joint segmentation and POS-tagging offers consistent training of all in- formation, concerning words, characters and partsof-speech. [sent-17, score-0.345]

5 ; We show that effective decoding can be achieved with standard beam-search, which gives significant speed improvements compared to the decoding algorithm of Z&C08;, and achieves accuracies that are competitive with the state-of-the-art. [sent-21, score-0.472]

6 Given an input sentence, candidate outputs are built incrementally, one character at a time. [sent-25, score-0.457]

7 When each character is processed, it is combined with existing candidates in all possible ways to generate new candidates, and an agenda is used to keep the N-best candidate outputs from ProceMedITin,g Ms oasfs thaceh 2u0se1t0ts C,o UnSfAer,e n9c-e1 on O Ectmobpeir ic 2a0l1 M0. [sent-26, score-1.102]

8 Compared to the multiple-beam search algorithm of Z&C08;, the use of a single beam can lead to an order of magnitude faster decoding speed. [sent-29, score-0.27]

9 1 The processing of partial words An important problem that we solve in this paper is the handling of partial words with a single beam decoder for the global model. [sent-31, score-0.978]

10 As we pointed out in Z&C08;, it is very difficult to score partial words properly when they are compared with full words, although such comparison is necessary for incremental decoding with a single-beam. [sent-32, score-0.609]

11 To allow comparisons with full words, partial words can either be treated as full words, or handled differently. [sent-33, score-0.445]

12 We showed in Z&C08; that a naive single-beam decoder which treats partial words in the same way as full words failed to give a competitive accu- racy. [sent-34, score-0.572]

13 Therefore, when using single-beam search to process “ 自 来 水 (tap water)”, the two-character word candidate “ 自 来” is likely to have been thrown off the agenda before the third character “水” is considered, leading to an unrecoverable segmentation error. [sent-39, score-1.187]

14 The multiple-beam search decoder we proposed in Z&C08; can be seen as a means to ensure that the three characters “ 自 来 水” always have a chance to be considered as a single word. [sent-41, score-0.264]

15 It explores candidate segmentations from the beginning of the sentence until each character, and avoids the problem of processing partial words by considering only full words. [sent-42, score-0.466]

16 In this paper, we treat partial words differently from full words, so that in the previous example, 844 the decoder can take the first two characters in “ 自 来 水 (tap water)” as a partial word, and keep it in the beam before the third character is processed. [sent-44, score-1.53]

17 One challenge is the representation of POS-tags for partial words. [sent-45, score-0.339]

18 The POS of a partial word is undefined without the corresponding full word information. [sent-46, score-0.46]

19 Though a partial word can make sense with a particular POS-tag when it is treated as a complete word, this POS-tag is not necessarily the POS of the full word which contains the partial word. [sent-47, score-0.799]

20 The first character “下” represents a singlecharacter word “below”, for which the POS can be LC or VV. [sent-49, score-0.454]

21 As discussed above, assigning POS tags to partial words as if they were full words leads to low accuracy. [sent-52, score-0.392]

22 An obvious solution to the above problem is not to assign a POS to a partial word until it becomes a full word. [sent-53, score-0.426]

23 However, lack of POS information for partial words makes them less competitive compared to full words in the beam, since the scores of full words are futher supported by POS and POS ngram information. [sent-54, score-0.502]

24 Therefore, not assigning POS to partial words potentially leads to over segmentation. [sent-55, score-0.339]

25 In this paper, we take a different approach, and assign a POS-tag to a partial word when its first character is separated from the final character of the previous word. [sent-57, score-1.193]

26 When more characters are appended to a partial word, the POS is not changed. [sent-58, score-0.49]

27 The idea is to use the POS of a partial word as the predicted POS of the full word it will become. [sent-59, score-0.46]

28 Possible predictions are made with the first character of the word, and the likely ones will be kept in the beam for the next processing steps. [sent-60, score-0.526]

29 For example, with the three characters “下 雨 天”, we try to keep two partial words (besides full words) in the beam when the first word “下” is processed, with the POS being VV and NN, respectively. [sent-61, score-0.719]

30 Now when the second character is processed, we still need to maintain the possible POS NN in the agenda, which predicts the three-character word “下 雨 天”. [sent-63, score-0.417]

31 As a main contribution of this paper, we show that the mechanism ofpredicting the POS at the first character gives competitive accuracy. [sent-64, score-0.497]

32 Unlike alphabetical languages, each Chinese character represents some specific meanings. [sent-66, score-0.383]

33 The allows the knowledge of possible POS-tags of words that a character can start, using information about the character from the training data. [sent-68, score-0.766]

34 1 The mechanism of first-character decision of POS also boosts the efficiency, since the enumeration of POS is unecessary when a character is appended to the end of an existing word. [sent-70, score-0.506]

35 Finally, an intuitive way to represent the status of a partial word is using a flag explicitly, which means an early decision of the segmentation of the next incoming character. [sent-72, score-0.629]

36 We take a simpler alternative approach, and treat every word as a partial word until the next incoming character is separated from the last character of this word. [sent-73, score-1.275]

37 Before a word is confirmed as a full word, we only apply to it features that represent its current partial status, such as character bigrams, its starting character and its part-ofspeech, etc. [sent-74, score-1.224]

38 An important component for our proposed system is the training process, which needs to ensure that the model scores a partial word with predicted POS properly. [sent-76, score-0.373]

39 Rather than updating the parameters after decoding is com- plete, the modified algorithm updates parameters at any processing step if the correct partial candidate falls out of the beam. [sent-78, score-0.571]

40 However, our system achieved competitive accuracy with Z&C08; without such character lookahead features. [sent-80, score-0.47]

41 2 Model and Feature Templates We use a linear model to score both partial and full candidate outputs. [sent-83, score-0.492]

42 Figure 1 shows the feature templates for the model, where templates 1 14 contain only segmentation information and templates 15 29 contain w~ , w~ – – both segmentation and POS information. [sent-85, score-0.703]

43 Each template is instantiated according to the current character in the decoding process. [sent-86, score-0.649]

44 Row “For” shows the conditions for template instantiation, where “s” indicates that the corresponding template is instantiated when the current character starts a new word, and “a” indicates that the corresponding template is instantiated when the current character does not start a new word. [sent-87, score-1.108]

45 As an example, feature template 1is instantiated when the current character starts a new word, and the resulting feature value is the word to the left of this character. [sent-90, score-0.58]

46 start(w), end(w) and len(w) represent the first character, the last character and the length of word w, respectively. [sent-91, score-0.417]

47 cat(c) represents the POS category of character c, which is the set of POS-tags seen on character c, as we used in Z&C08. [sent-93, score-0.793]

48 ; Given a partial or complete candidate y, its global feature vector Φ(y) is computed by instantiating all applicable feature templates from Table 1 for each character in y, according to whether or not the character is separated from the previous character. [sent-94, score-1.423]

49 The feature templates are mostly taken from, or inspired by, the feature templates of Z&C08. [sent-95, score-0.234]

50 Templates 6, 7, 9, 16, 17, 18, 21, 22, 23, 25, 26 and 28 concern partial word information, whose role in the model is to indicate the likelihood that the partial word including the current character will become a correct full word. [sent-97, score-1.214]

51 LEN(sent)]: for cand in agenda: new ← APPEND(cand, sent[index]) ADDITEM(agenda, new) for pos in TAGSET(): new ← SEP(cand, sent[index], pos) ADDITEM(agenda, new) agenda ← N-BEST(agenda) retaugrenn BEST(agenda) Figure 1: The incremental beam-search decoder. [sent-100, score-0.88]

52 rent character according to the context, and are the crucial reason for the effectiveness of the algorithm with a small beam-size. [sent-101, score-0.383]

53 1 Decoding The decoding algorithm builds an output candidate incrementally, one character at a time. [sent-103, score-0.584]

54 Each character can either be attached to the current word or separated as the start a new word. [sent-104, score-0.503]

55 When the current character starts a new word, a POS-tag is assigned to the new word. [sent-105, score-0.439]

56 An agenda is used by the decoder to keep the N-best candidates during the incremental process. [sent-106, score-0.858]

57 Before decoding starts, the agenda is initialized with an empty sentence. [sent-107, score-0.716]

58 When a character is processed, existing candidates are removed from the agenda and extended with the current character in all possible ways, and the N-best newly generated candidates are put back onto the agenda. [sent-108, score-1.496]

59 After all input characters have been processed, the highest-scored candidate from the agenda is taken as the output. [sent-109, score-0.708]

60 LEN returns the number of characters in a sentence, and sent[i] returns the ith character from the sentence. [sent-112, score-0.545]

61 APPEND appends a character to the last word in a candidate, and SEP joins a character as the start of a new word in a candidate, assigning a POS-tag to the new word. [sent-113, score-0.834]

62 Both our decoding algorithm and the decoding algorithm of Z&C08; run in linear time. [sent-114, score-0.28]

63 In contrast, our decoder does not search backword for the possible starting character of any word. [sent-117, score-0.506]

64 Segmentation ambiguities are resolved by binary choices between the actions append or separate for each character, and no POS enumeration is required when the character is appended. [sent-118, score-0.459]

65 2 Training The learning algorithm is based on the generalized perceptron (Collins, 2002), but parameter adjustments can be performed at any character during the decoding process, using the “early update” mechanism of Collins and Roark (2004). [sent-121, score-0.67]

66 After each character is processed, partial candidates in the agenda are compared to the corresponding gold-standard output for the same characters. [sent-124, score-1.331]

67 If none of the candidates in the agenda are correct, the decoding is stopped and the parameter vector is updated by adding the global feature vector of the gold-standard partial output and subtracting the global feature vector of the highest-scored partial candidate in the agenda. [sent-125, score-1.772]

68 However, if any item in the agenda is the same as the corresponding gold-standard, the decoding process moves to the next character, without any change to the parameter values. [sent-127, score-0.729]

69 After all characters are processed, the decoder prediction is compared with the training example. [sent-128, score-0.237]

70 If the prediction is correct, the parameter vector is not changed; otherwise it is updated by adding the global feature vector of the training example and subtracting the global feature vector of the decoder prediction, just as the perceptron algorithm does. [sent-129, score-0.485]

71 These lines are added to ensure that the agenda contains highly probable candidates during the whole beam-search process, and they are crucial to the high accuracy of the system. [sent-135, score-0.667]

72 As stated earlier, the decoder relies on proper scoring of partial words to maintain a set of high quality candidates in the agenda. [sent-136, score-0.551]

73 Updating the value of the parameter vector for partial outputs can be seen as a means to ensure correct scoring of partial candidates at any character. [sent-137, score-0.858]

74 Word frequencies are initialized as zeros and updated according to each training example before it is processed; the tag dictionary is initialized as empty and updated according to each training example before it is processed. [sent-153, score-0.257]

75 During decoding, when the current character is added as the start of a new word, “closed set” tags are only assigned to the word if it is consistent with the record. [sent-155, score-0.449]

76 This type of pruning is used in addition to the tag + dictionary to prune invalid partial words, while the tag dictionary is used to prune complete words. [sent-156, score-0.421]

77 The record for initial character and POS is initially empty, and udpated according to each training example before it is processed. [sent-157, score-0.412]

78 Finally, at any decoding step, we group partial 848 candidates that are generated by separating the current character as the start of a new word by the signature p0p−1w−1, and keep only the best among those having the same p0p−1w−1. [sent-158, score-1.088]

79 ; It can be seen as an extra mechanism to help keep likely partial words in the agenda and improve the accuracy, but which does not give our system a speed advantage over Z&C08. [sent-161, score-1.058]

80 Our development tests were mainly used to decide the size ofthe beam, the number oftraining iterations, the ef- fect of partial features in beam-search decoding, and the effect of incremental learning (i. [sent-174, score-0.463]

81 849 Figure 3 shows the accuracy curves for joint segmentation and POS-tagging by the number of training iterations, using different beam sizes. [sent-177, score-0.404]

82 With the size of the beam increasing from 1to 32, the accuracies generally increase, while the amount of increase becomes small when the size of the beam becomes 16. [sent-178, score-0.368]

83 After the 10th iteration, a beam size of 32 does not always give better accuracies than a beam size of 16. [sent-179, score-0.368]

84 The effect of incremental training: We compare the accuracies by incremental training using early update and normal perceptron training. [sent-197, score-0.4]

85 In the incremental training case, the algorithm reached the best accuracy at the 30th training iteration, obtaining a segmentation F-score of 91. [sent-202, score-0.296]

86 The column head- ings “sf”, “jf”, “time” and “speed” refer to segmentation F-measure, joint F-measure, testing time (in 3http://www. [sent-212, score-0.231]

87 Our system gave a joint segmentation and POStagging F-score of 91. [sent-215, score-0.231]

88 (2009) made use of character type knowledge for spaces, numerals, symbols, alphabets, Chinese and other characters. [sent-226, score-0.383]

89 Columns “sf” and “jf” refer to segmentation and joint accuracies, respectively. [sent-234, score-0.231]

90 4 Related Work The effectiveness of our beam-search decoder showed that the joint segmentation and tagging problem may be less complex than previously perceived (Zhang and Clark, 2008; Jiang et al. [sent-240, score-0.354]

91 At the very least, the single model approach with a simple decoder achieved competitive accuracies to what has been achieved so far by the reranking (Shi and Wang, 2007; Jiang et al. [sent-242, score-0.262]

92 This may shed new light on joint segmentation and POS-tagging methods. [sent-245, score-0.231]

93 (2009) is based on Nakagawa and Uchimoto (2007), which separates the processing of known words and unknown words, and uses a set of segmentation tags to represent the segmentation of characters. [sent-250, score-0.352]

94 Moreover, our model is based on our previous work, in line with Zhang and Clark (2007), which does not treat word segmentation as character sequence labeling. [sent-252, score-0.593]

95 5 Conclusion We showed that a simple beam-search decoding algorithm can be effectively applied to the decoding problem for a global linear model for joint word segmentation and POS-tagging. [sent-261, score-0.579]

96 By guiding search with partial word information and performing learning for partial candidates, our system achieved sig- nificantly faster speed with little accuracy loss compared to the system of Z&C08. [sent-262, score-0.821]

97 A cascaded linear model for joint Chinese word segmentation and part-of-speech tagging. [sent-286, score-0.291]

98 An error-driven word-character hybrid model for joint Chinese word segmentation and POS tagging. [sent-294, score-0.29]

99 A hybrid approach to word segmentation and POS tagging. [sent-298, score-0.235]

100 A dual-layer CRF based joint decoding method for cascade segmentation and labelling tasks. [sent-312, score-0.358]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('agenda', 0.52), ('character', 0.383), ('partial', 0.339), ('kruengkrai', 0.187), ('segmentation', 0.176), ('ctb', 0.145), ('beam', 0.143), ('pos', 0.139), ('additem', 0.131), ('cand', 0.131), ('decoding', 0.127), ('decoder', 0.123), ('templates', 0.117), ('characters', 0.114), ('incremental', 0.09), ('candidates', 0.089), ('sent', 0.085), ('accuracies', 0.082), ('speed', 0.079), ('perceptron', 0.078), ('clark', 0.077), ('candidate', 0.074), ('processed', 0.074), ('yue', 0.072), ('instantiated', 0.067), ('nakagawa', 0.064), ('segmentor', 0.064), ('jiang', 0.064), ('zhang', 0.062), ('mechanism', 0.057), ('competitive', 0.057), ('zpar', 0.056), ('joint', 0.055), ('chinese', 0.055), ('separated', 0.054), ('full', 0.053), ('index', 0.051), ('roark', 0.05), ('uchimoto', 0.05), ('incoming', 0.048), ('signature', 0.048), ('tap', 0.048), ('water', 0.048), ('append', 0.047), ('updated', 0.045), ('ww', 0.045), ('initialized', 0.044), ('sep', 0.043), ('template', 0.04), ('vector', 0.039), ('segmented', 0.037), ('canasai', 0.037), ('singlecharacter', 0.037), ('appended', 0.037), ('keep', 0.036), ('collins', 0.036), ('stephen', 0.035), ('word', 0.034), ('uk', 0.034), ('closed', 0.034), ('global', 0.034), ('tests', 0.034), ('shi', 0.033), ('jf', 0.032), ('urn', 0.032), ('early', 0.032), ('current', 0.032), ('incrementally', 0.031), ('updating', 0.031), ('accuracy', 0.03), ('record', 0.029), ('decided', 0.029), ('enumeration', 0.029), ('moves', 0.029), ('subtracting', 0.029), ('wenbin', 0.029), ('update', 0.028), ('letters', 0.028), ('item', 0.028), ('pruning', 0.028), ('lines', 0.028), ('seen', 0.027), ('dictionary', 0.027), ('row', 0.027), ('kiyotaka', 0.027), ('len', 0.027), ('xia', 0.027), ('zeros', 0.027), ('iterations', 0.026), ('linear', 0.026), ('hybrid', 0.025), ('treebank', 0.025), ('decode', 0.025), ('parameter', 0.025), ('arabic', 0.025), ('empty', 0.025), ('starts', 0.024), ('nn', 0.024), ('programming', 0.024), ('returns', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 2 emnlp-2010-A Fast Decoder for Joint Word Segmentation and POS-Tagging Using a Single Discriminative Model

Author: Yue Zhang ; Stephen Clark

Abstract: We show that the standard beam-search algorithm can be used as an efficient decoder for the global linear model of Zhang and Clark (2008) for joint word segmentation and POS-tagging, achieving a significant speed improvement. Such decoding is enabled by: (1) separating full word features from partial word features so that feature templates can be instantiated incrementally, according to whether the current character is separated or appended; (2) deciding the POS-tag of a potential word when its first character is processed. Early-update is used with perceptron training so that the linear model gives a high score to a correct partial candidate as well as a full output. Effective scoring of partial structures allows the decoder to give high accuracy with a small beam-size of 16. In our 10-fold crossvalidation experiments with the Chinese Tree- . bank, our system performed over 10 times as fast as Zhang and Clark (2008) with little accuracy loss. The accuracy of our system on the standard CTB 5 test was competitive with the best in the literature. 1 Introduction and Motivation Several approaches have been proposed to solve word segmentation and POS-tagging jointly, including the reranking approach (Shi and Wang, 2007; Jiang et al., 2008b), the hybrid approach (Nakagawa and Uchimoto, 2007; Jiang et al., 2008a), and the single-model approach (Ng and Low, 2004; Zhang and Clark, 2008; Kruengkrai et al., 2009). These methods led to accuracy improvements over the traditional, pipelined segmentation and POS-tagging . . . 843 clark} @ cl cam ac uk baseline by avoiding segmentation error propagation and making use of part-of-speech information to improve segmentation. The single-model approach to joint segmentation and POS-tagging offers consistent training of all in- formation, concerning words, characters and partsof-speech. However, exact inference with dynamic programming can be infeasible if features are defined over a large enough range of the output, such as over a two-word history. In our previous work (Zhang and Clark, 2008), which we refer to as Z&C08; from now on, we used an approximate decoding algorithm that keeps track of a set of partially built structures for each character, which can be seen as a dynamic programming chart which is greatly reduced by pruning. In this paper we follow the line of single-model research, in particular the global linear model of Z&C08.; We show that effective decoding can be achieved with standard beam-search, which gives significant speed improvements compared to the decoding algorithm of Z&C08;, and achieves accuracies that are competitive with the state-of-the-art. Our research is also in line with recent research on improving the speed of NLP systems with little or no accuracy loss (Charniak et al., 2006; Roark and Hollingshead, 2008). Our speed improvement is achieved by the use of a single-beam decoder. Given an input sentence, candidate outputs are built incrementally, one character at a time. When each character is processed, it is combined with existing candidates in all possible ways to generate new candidates, and an agenda is used to keep the N-best candidate outputs from ProceMedITin,g Ms oasfs thaceh 2u0se1t0ts C,o UnSfAer,e n9c-e1 on O Ectmobpeir ic 2a0l1 M0.e ?tc ho2d0s10 in A Nsastoucira tlio Lnan fogru Cagoem Ppruotcaetisosninagl, L pinag eusis 8t4ic3s–852, the begining of the sentence to the current character. 
Compared to the multiple-beam search algorithm of Z&C08, the use of a single beam can lead to an order of magnitude faster decoding speed. 1.1 The processing of partial words An important problem that we solve in this paper is the handling of partial words with a single-beam decoder for the global model. As we pointed out in Z&C08, it is very difficult to score partial words properly when they are compared with full words, although such comparison is necessary for incremental decoding with a single beam. To allow comparisons with full words, partial words can either be treated as full words, or handled differently. We showed in Z&C08 that a naive single-beam decoder which treats partial words in the same way as full words failed to give a competitive accuracy. An important reason for the low accuracy is over-segmentation during beam-search. Consider the three characters “自来水 (tap water)”. The first two characters do not make sense when put together as a single word. Rather, when treated as two single-character words, they can make sense in a sentence such as “请 (please) 自 (self) 来 (come) 取 (take)”. Therefore, when using single-beam search to process “自来水 (tap water)”, the two-character word candidate “自来” is likely to have been thrown off the agenda before the third character “水” is considered, leading to an unrecoverable segmentation error. This problem is even more severe for a joint segmentor and POS-tagger than for a pure word segmentor, since the POS-tags and POS-tag bigram of “自” and “来” further support them being separated when “来” is considered. The multiple-beam search decoder we proposed in Z&C08 can be seen as a means to ensure that the three characters “自来水” always have a chance to be considered as a single word. It explores candidate segmentations from the beginning of the sentence until each character, and avoids the problem of processing partial words by considering only full words. However, since it explores a larger part of the search space than a single-beam decoder, its time complexity is correspondingly higher. In this paper, we treat partial words differently from full words, so that in the previous example, the decoder can take the first two characters in “自来水 (tap water)” as a partial word, and keep it in the beam before the third character is processed. One challenge is the representation of POS-tags for partial words. The POS of a partial word is undefined without the corresponding full word information. Though a partial word can make sense with a particular POS-tag when it is treated as a complete word, this POS-tag is not necessarily the POS of the full word which contains the partial word. Take the three-character sequence “下雨天” as an example. The first character “下” represents a single-character word “below”, for which the POS can be LC or VV. The first two characters “下雨” represent a two-character word “rain”, for which the POS can be VV. Moreover, all three characters when put together make the word “rainy day”, for which the POS is NN. As discussed above, assigning POS tags to partial words as if they were full words leads to low accuracy. An obvious solution to the above problem is not to assign a POS to a partial word until it becomes a full word. However, lack of POS information for partial words makes them less competitive compared to full words in the beam, since the scores of full words are further supported by POS and POS n-gram information.
Therefore, not assigning POS to partial words potentially leads to over-segmentation. In our experiments, this method did not give comparable accuracies to our Z&C08 system. In this paper, we take a different approach, and assign a POS-tag to a partial word when its first character is separated from the final character of the previous word. When more characters are appended to a partial word, the POS is not changed. The idea is to use the POS of a partial word as the predicted POS of the full word it will become. Possible predictions are made with the first character of the word, and the likely ones will be kept in the beam for the next processing steps. For example, with the three characters “下雨天”, we try to keep two partial words (besides full words) in the beam when the first character “下” is processed, with the POS being VV and NN, respectively. The first POS predicts the two-character word “下雨”, and the second the three-character word “下雨天”. Now when the second character is processed, we still need to maintain the possible POS NN in the agenda, which predicts the three-character word “下雨天”. As a main contribution of this paper, we show that the mechanism of predicting the POS at the first character gives competitive accuracy. This mechanism can be justified theoretically. Unlike alphabetical languages, each Chinese character represents some specific meanings. Given a character, it is natural for a human speaker to know immediately what types of words it can start. This allows the possible POS-tags of words that a character can start to be learned, using information about the character from the training data. Moreover, the POS-tags of the previous words are also useful in deciding possible POS for the current word.1 [Footnote 1: The next incoming characters are also a useful source of information for predicting the POS. However, our system achieved competitive accuracy with Z&C08 without such character lookahead features.] The mechanism of first-character decision of POS also boosts the efficiency, since the enumeration of POS is unnecessary when a character is appended to the end of an existing word. As a result, the complexity of each processing step is reduced by half compared to a method without POS prediction. Finally, an intuitive way to represent the status of a partial word is using a flag explicitly, which means an early decision of the segmentation of the next incoming character. We take a simpler alternative approach, and treat every word as a partial word until the next incoming character is separated from the last character of this word. Before a word is confirmed as a full word, we only apply to it features that represent its current partial status, such as character bigrams, its starting character and its part-of-speech, etc. Full word features, including the first and last characters of a word, are applied immediately after a word is confirmed as complete. An important component for our proposed system is the training process, which needs to ensure that the model scores a partial word with predicted POS properly. We use the averaged perceptron (Collins, 2002) for training, together with the “early update” mechanism of Collins and Roark (2004). Rather than updating the parameters after decoding is complete, the modified algorithm updates parameters at any processing step if the correct partial candidate falls out of the beam. In our experiments using the Chinese Treebank data, our system ran an order of magnitude faster than our Z&C08 system with little loss of accuracy.
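As an illustration of the first-character POS prediction described above, the following sketch collects, for every character, the POS-tags of training words that begin with that character; such a table is one plausible way to restrict which tags are proposed when a character starts a new word. The function and variable names are hypothetical and the snippet is only a minimal illustration of the idea, not the actual implementation.

    from collections import defaultdict

    def build_start_tag_table(training_sentences):
        # training_sentences: a list of segmented, tagged sentences,
        # each a list of (word, POS) pairs; names here are hypothetical.
        table = defaultdict(set)
        for sentence in training_sentences:
            for word, tag in sentence:
                table[word[0]].add(tag)   # tags seen on word-initial characters
        return table

    # Toy example with two segmented, tagged sentences.
    train = [
        [("下雨", "VV"), ("了", "SP")],
        [("下雨天", "NN"), ("很", "AD"), ("冷", "VA")],
    ]
    start_tags = build_start_tag_table(train)
    print(sorted(start_tags["下"]))   # ['NN', 'VV']: candidate tags when "下" starts a word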
The accuracy of our system was competitive with other recent models. 2 Model and Feature Templates We use a linear model to score both partial and full candidate outputs. Given an input x, the score of a candidate output y is computed as: Score(y) = Φ(y) · w, where Φ(y) is the global feature vector extracted from y, and w is the parameter vector of the model. Table 1 shows the feature templates for the model, where templates 1–14 contain only segmentation information and templates 15–29 contain both segmentation and POS information. Each template is instantiated according to the current character in the decoding process. Row “For” shows the conditions for template instantiation, where “s” indicates that the corresponding template is instantiated when the current character starts a new word, and “a” indicates that the corresponding template is instantiated when the current character does not start a new word. In the row for feature templates, w, t and c are used to represent a word, a POS-tag and a character, respectively. The subscripts are based on the current character, where w−1 represents the first word to the left of the current character, and p−2 represents the POS-tag on the second word to the left of the current character, and so on. As an example, feature template 1 is instantiated when the current character starts a new word, and the resulting feature value is the word to the left of this character. start(w), end(w) and len(w) represent the first character, the last character and the length of word w, respectively. The length of a word is normalized to 16 if it is larger than 16. cat(c) represents the POS category of character c, which is the set of POS-tags seen on character c, as we used in Z&C08. Given a partial or complete candidate y, its global feature vector Φ(y) is computed by instantiating all applicable feature templates from Table 1 for each character in y, according to whether or not the character is separated from the previous character. The feature templates are mostly taken from, or inspired by, the feature templates of Z&C08. Templates 1, 2, 3, 4, 5, 8, 10, 12, 13, 14, 15, 19, 20, 24, 27 and 29 concern complete word information, and they are used in the model to differentiate correct and incorrect output structures in the same way as our Z&C08 model. Templates 6, 7, 9, 16, 17, 18, 21, 22, 23, 25, 26 and 28 concern partial word information, whose role in the model is to indicate the likelihood that the partial word including the current character will become a correct full word. They act as guidance for the action to take for the current character according to the context, and are the crucial reason for the effectiveness of the algorithm with a small beam-size.

    function DECODE(sent, agenda):
        CLEAR(agenda)
        ADDITEM(agenda, “”)
        for index in [0..LEN(sent)]:
            for cand in agenda:
                new ← APPEND(cand, sent[index])
                ADDITEM(agenda, new)
                for pos in TAGSET():
                    new ← SEP(cand, sent[index], pos)
                    ADDITEM(agenda, new)
            agenda ← N-BEST(agenda)
        return BEST(agenda)

Figure 1: The incremental beam-search decoder.

2.1 Decoding The decoding algorithm builds an output candidate incrementally, one character at a time. Each character can either be attached to the current word or separated as the start of a new word. When the current character starts a new word, a POS-tag is assigned to the new word. An agenda is used by the decoder to keep the N-best candidates during the incremental process.
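To make the control flow of Figure 1 concrete, the following is a minimal runnable sketch of the single-beam decoder, with a candidate represented as a list of (word, POS) pairs. The scoring function is a stand-in passed as an argument (a real model would score candidates with the feature templates of Table 1 and the learned weights), and the tag set and beam size are illustrative; the operations correspond to APPEND, SEP and N-BEST, which are explained in the next paragraph.

    TAGSET = ["NN", "VV"]   # illustrative tag set; the real system uses the full CTB tag set
    BEAM = 16

    def append_char(cand, ch):
        # attach ch to the last word of the candidate (APPEND)
        return cand[:-1] + [(cand[-1][0] + ch, cand[-1][1])]

    def separate_char(cand, ch, pos):
        # start a new word with ch and assign it POS pos (SEP)
        return cand + [(ch, pos)]

    def decode(sent, score):
        agenda = [[]]                            # initialized with the empty candidate
        for ch in sent:
            new_agenda = []
            for cand in agenda:
                if cand:                         # appending needs an existing word
                    new_agenda.append(append_char(cand, ch))
                for pos in TAGSET:
                    new_agenda.append(separate_char(cand, ch, pos))
            agenda = sorted(new_agenda, key=score, reverse=True)[:BEAM]   # N-BEST
        return max(agenda, key=score)            # BEST

    # Toy scorer that simply prefers long words tagged NN, for illustration only.
    toy_score = lambda cand: sum(len(w) for w, t in cand if t == "NN")
    print(decode("下雨天", toy_score))

Because the agenda is cut back to a fixed beam after every character, each step takes time independent of the sentence length, which is the source of the linear-time behaviour discussed below.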
Before decoding starts, the agenda is initialized with an empty sentence. When a character is processed, existing candidates are removed from the agenda and extended with the current character in all possible ways, and the N-best newly generated candidates are put back onto the agenda. After all input characters have been processed, the highest-scored candidate from the agenda is taken as the output. Pseudocode for the decoder is shown in Figure 1. CLEAR removes all items from the agenda, ADDITEM adds a new item onto the agenda, N-BEST returns the N highest-scored items from the agenda, and BEST returns the highest-scored item from the agenda. LEN returns the number of characters in a sentence, and sent[i] returns the ith character from the sentence. APPEND appends a character to the last word in a candidate, and SEP joins a character as the start of a new word in a candidate, assigning a POS-tag to the new word. Both our decoding algorithm and the decoding algorithm of Z&C08 run in linear time. However, in order to generate possible candidates for each character, Z&C08 uses an extra loop to search for possible words that end with the current character. A restriction on the maximum word length is applied to limit the number of iterations in this loop, without which the algorithm would have quadratic time complexity. In contrast, our decoder does not search backward for the possible starting character of any word. Segmentation ambiguities are resolved by binary choices between the actions append or separate for each character, and no POS enumeration is required when the character is appended. This improves the speed by a significant factor. 2.2 Training The learning algorithm is based on the generalized perceptron (Collins, 2002), but parameter adjustments can be performed at any character during the decoding process, using the “early update” mechanism of Collins and Roark (2004). The parameter vector of the model is initialized as all zeros before training, and used to decode training examples. Each training example is turned into the raw input format, and processed in the same way as decoding. After each character is processed, partial candidates in the agenda are compared to the corresponding gold-standard output for the same characters. If none of the candidates in the agenda are correct, the decoding is stopped and the parameter vector is updated by adding the global feature vector of the gold-standard partial output and subtracting the global feature vector of the highest-scored partial candidate in the agenda. The training process then moves on to the next example. However, if any item in the agenda is the same as the corresponding gold-standard, the decoding process moves to the next character, without any change to the parameter values. After all characters are processed, the decoder prediction is compared with the training example. If the prediction is correct, the parameter vector is not changed; otherwise it is updated by adding the global feature vector of the training example and subtracting the global feature vector of the decoder prediction, just as the perceptron algorithm does. The same training examples can be used to train the model for multiple iterations. We use the averaged parameter vector (Collins, 2002) as the final model. Pseudocode for the training algorithm is shown in Figure 2.
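The early-update behaviour just described can be summarized in a few lines of code; the sketch below illustrates only the update logic of Figure 2 (shown further below), using hypothetical helper functions expand (standing in for APPEND/SEP over all POS-tags) and extract_features (standing in for the templates of Table 1), and omitting the averaging of the parameter vector.

    def perceptron_update(weights, gold_feats, pred_feats):
        # add the gold feature vector and subtract the predicted one
        for f in gold_feats:
            weights[f] = weights.get(f, 0.0) + 1.0
        for f in pred_feats:
            weights[f] = weights.get(f, 0.0) - 1.0

    def train_one(sent, gold_partials, weights, expand, extract_features, beam=16):
        # gold_partials[i] is the gold-standard partial output for sent[0..i]
        score = lambda cand: sum(weights.get(f, 0.0) for f in extract_features(cand))
        agenda = [[]]
        for i, ch in enumerate(sent):
            candidates = [new for cand in agenda for new in expand(cand, ch)]
            agenda = sorted(candidates, key=score, reverse=True)[:beam]
            if gold_partials[i] not in agenda:                    # early update
                perceptron_update(weights, extract_features(gold_partials[i]),
                                  extract_features(agenda[0]))
                return
        if agenda[0] != gold_partials[-1]:                        # standard final update
            perceptron_update(weights, extract_features(gold_partials[-1]),
                              extract_features(agenda[0]))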
The training algorithm in Figure 2 is based on the decoding algorithm in Figure 1, and the main differences are: (1) the training algorithm takes the gold-standard output and the parameter vector as two additional arguments; (2) the training algorithm does not return a prediction, but modifies the parameter vector when necessary; (3) lines 11 to 20 are additional lines of code for parameter updates. Without lines 11 to 16, the training algorithm is exactly the same as the generalized perceptron algorithm. These lines are added to ensure that the agenda contains highly probable candidates during the whole beam-search process, and they are crucial to the high accuracy of the system. As stated earlier, the decoder relies on proper scoring of partial words to maintain a set of high quality candidates in the agenda. Updating the value of the parameter vector for partial outputs can be seen as a means to ensure correct scoring of partial candidates at any character.

    function TRAIN(sent, agenda, gold-standard, w):
    01: CLEAR(agenda)
    02: ADDITEM(agenda, “”)
    03: for index in [0..LEN(sent)]:
    04:     for cand in agenda:
    05:         new ← APPEND(cand, sent[index])
    06:         ADDITEM(agenda, new)
    07:         for pos in TAGSET():
    08:             new ← SEP(cand, sent[index], pos)
    09:             ADDITEM(agenda, new)
    10:     agenda ← N-BEST(agenda)
    11:     for cand in agenda:
    12:         if cand = gold-standard[0:index]:
    13:             CONTINUE
    14:     w ← w + Φ(gold-standard[0:index])
    15:     w ← w − Φ(BEST(agenda))
    16:     return w
    17: if BEST(agenda) ≠ gold-standard:
    18:     w ← w + Φ(gold-standard)
    19:     w ← w − Φ(BEST(agenda))
    20:     return w
    21: return w

Figure 2: The incremental learning function.

2.3 Pruning We follow Z&C08 and use several pruning methods, most of which serve to improve the accuracy by removing irrelevant candidates from the beam. First, the system records the maximum number of characters that a word with a particular POS-tag can have. For example, from the Chinese Treebank that we used for our experiments, most POS are associated only with one- or two-character words. The only POS-tags that are seen with words over ten characters long are NN (noun), NR (proper noun) and CD (numbers). The maximum word length information is initialized as all ones, and updated according to each training example before it is processed. Second, a tag dictionary is used to record POS-tags associated with each word. During decoding, frequent words and words with “closed set” tags2 are only allowed POS-tags according to the tag dictionary, while other words are allowed every POS-tag to make candidate outputs. [Footnote 2: “Closed set” tags are the set of POS-tags which are only associated with a fixed set of words, according to the Penn Chinese Treebank specifications (Xia, 2000).] Whether a word is a frequent word is decided by the number of times it has been seen in the training process. Denoting the number of times the most frequent word has been seen with M, a word is a frequent word if it has been seen more than M/5000 + 5 times. The threshold value is taken from Z&C08, and we did not adjust it during development. Word frequencies are initialized as zeros and updated according to each training example before it is processed; the tag dictionary is initialized as empty and updated according to each training example before it is processed. Third, we make an additional record of the initial characters for words with “closed set” tags. During decoding, when the current character is added as the start of a new word, “closed set” tags are only assigned to the word if it is consistent with the record.
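The pruning information described above can be kept in a handful of dictionaries that are updated from each training example before it is decoded. The sketch below shows one plausible way to maintain these records; class and attribute names are hypothetical, and the closed-set tag list and the M/5000 + 5 threshold simply follow the description above.

    from collections import Counter, defaultdict

    class PruningRecords:
        def __init__(self, closed_set_tags):
            self.closed_set_tags = set(closed_set_tags)
            self.max_len = defaultdict(lambda: 1)    # POS -> longest word seen
            self.word_freq = Counter()               # word -> frequency
            self.tag_dict = defaultdict(set)         # word -> POS-tags seen
            self.start_chars = defaultdict(set)      # closed-set POS -> initial characters

        def update(self, gold_words):
            # gold_words: the segmented, tagged training example, [(word, POS), ...]
            for word, tag in gold_words:
                self.max_len[tag] = max(self.max_len[tag], len(word))
                self.word_freq[word] += 1
                self.tag_dict[word].add(tag)
                if tag in self.closed_set_tags:
                    self.start_chars[tag].add(word[0])

        def allowed_tags(self, word, full_tagset):
            # frequent words and words with closed-set tags are restricted
            # to the tags recorded in the tag dictionary
            threshold = max(self.word_freq.values(), default=0) / 5000 + 5
            known = self.tag_dict[word]
            if self.word_freq[word] > threshold or (known & self.closed_set_tags):
                return known
            return set(full_tagset)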
This third type of pruning is used in addition to the tag dictionary to prune invalid partial words, while the tag dictionary is used to prune complete words. The record for initial character and POS is initially empty, and updated according to each training example before it is processed. Finally, at any decoding step, we group partial candidates that are generated by separating the current character as the start of a new word by the signature p0p−1w−1, and keep only the best among those having the same p0p−1w−1. The signature p0p−1w−1 is decided by the feature templates we use: it can be shown that if two candidates cand1 and cand2 generated at the same step have the same signature, and the score of cand1 is higher than the score of cand2, then at any future step, the highest scored candidate generated from cand1 will always have a higher score than the highest scored candidate generated from cand2. From the above pruning methods, only the third was not used by Z&C08. It can be seen as an extra mechanism to help keep likely partial words in the agenda and improve the accuracy, but which does not give our system a speed advantage over Z&C08. 3 Experiments We used the Chinese Treebank (CTB) data to perform one set of development tests and two sets of final tests. The CTB 4 was split into two parts, with the CTB 3 being used for a 10-fold cross-validation test to compare speed and accuracies with Z&C08, and the rest being used for development. The CTB 5 was used to perform the additional set of experiments to compare accuracies with other recent work. We use the standard F-measure to evaluate output accuracies. For word segmentation, precision is defined as the number of correctly segmented words divided by the total number of words in the output, and recall is defined as the number of correctly segmented words divided by the total number of words in the gold-standard output. For joint segmentation and POS-tagging, precision is defined as the number of correctly segmented and POS-tagged words divided by the total number of words in the output, and recall is defined as the number of correctly segmented and POS-tagged words divided by the total number of words in the gold-standard output. All our experiments were performed on a Linux platform, and a single 2.66GHz Intel Core 2 CPU. 3.1 Development tests Our development data consists of 150K words in 4798 sentences. 80% of the data were randomly chosen as the development training data, while the rest were used as the development test data. Our development tests were mainly used to decide the size of the beam, the number of training iterations, the effect of partial features in beam-search decoding, and the effect of incremental learning (i.e. early update). [Figure 3: The influence of beam-sizes, and the convergence of the perceptron. X-axis: training iteration.] Figure 3 shows the accuracy curves for joint segmentation and POS-tagging by the number of training iterations, using different beam sizes. With the size of the beam increasing from 1 to 32, the accuracies generally increase, while the amount of increase becomes small when the size of the beam reaches 16. After the 10th iteration, a beam size of 32 does not always give better accuracies than a beam size of 16. We therefore chose 16 as the size of the beam for our system. The testing times for each beam size between 1 and 32 are 7.16s, 11.90s, 18.42s, 27.82s, 46.77s and 89.21s, respectively.
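The F-measure defined above can be computed by mapping each word to its character span, so that a word is counted as correct only if both of its boundaries match the gold standard; the short sketch below is one straightforward way to implement that definition (for the joint score, the POS-tag would simply be added to each span).

    def spans(words):
        # convert a word list into a set of (start, end) character spans
        result, pos = set(), 0
        for w in words:
            result.add((pos, pos + len(w)))
            pos += len(w)
        return result

    def segmentation_f(pred_words, gold_words):
        pred, gold = spans(pred_words), spans(gold_words)
        correct = len(pred & gold)
        p = correct / len(pred)      # correct words / words in the output
        r = correct / len(gold)      # correct words / words in the gold standard
        return 2 * p * r / (p + r) if p + r else 0.0

    print(segmentation_f(["自来", "水"], ["自来水"]))      # 0.0
    print(segmentation_f(["下雨", "天"], ["下雨", "天"]))   # 1.0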
The corresponding speeds for these beam sizes, in the number of sentences per second, are 111.45, 67.06, 43.32, 28.68, 17.06 and 8.95, respectively. Figure 3 also shows that the accuracy increases with an increased number of training iterations, but the amount of increase becomes small after the 25th iteration. We chose 29 as the number of iterations to train our system. The effect of incremental training: We compare the accuracies by incremental training using early update and normal perceptron training. In the normal perceptron training case, lines 11 to 16 are taken out of the training algorithm in Figure 2. The algorithm reached the best performance at the 22nd iteration, with the segmentation F-score being 90.58% and joint F-score being 83.38%. In the incremental training case, the algorithm reached the best accuracy at the 30th training iteration, obtaining a segmentation F-score of 91.14% and a joint F-score of 84.06%. 3.2 Final tests using CTB 3 CTB 3 consists of 150K words in 10364 sentences. We follow Z&C08 and split it into 10 equal-sized parts. In each test, one part is taken as the test data and the other nine are combined together as the training data. We compare the speed and accuracy with the joint segmentor and tagger of Z&C08, which is publicly available as the ZPar system, version 0.2 (http://www.sourceforge.net/projects/zpar). The results are shown in Table 2, where each row shows one cross-validation test. The column headings “sf”, “jf”, “time” and “speed” refer to segmentation F-measure, joint F-measure, testing time (in seconds) and testing speed (in the number of sentences per second), respectively. Our system gave a joint segmentation and POS-tagging F-score of 91.37%, which is only 0.04% lower than that of ZPar 0.2. The speed of our system was over 10 times as fast as ZPar 0.2. 3.3 Final tests using CTB 5 We follow Kruengkrai et al. (2009) and split the CTB 5 into training, development testing and testing sets, as shown in Table 3. [Table 3: the training, development and test sections, sentences and words of CTB 5.] We ignored the development test data since our system had been developed in previous experiments. Kruengkrai et al. (2009) made use of character type knowledge for spaces, numerals, symbols, alphabets, Chinese and other characters. In the previous experiments, our system did not use any knowledge beyond the training data. To make the comparison fairer, we included knowledge of English letters and Arabic numbers in this experiment. During both training and decoding, English letters and Arabic numbers are segmented using simple rules, treating consecutive English letters or Arabic numbers as a single word. The results are shown in Table 4, where row “N07” refers to the model of Nakagawa and Uchimoto (2007), rows “J08a” and “b” refer to the models of Jiang et al. (2008a) and Jiang et al. (2008b), and row “K09” refers to the models of Kruengkrai et al. (2009). Columns “sf” and “jf” refer to segmentation and joint accuracies, respectively. [Table 4: accuracy comparison (sf, jf) with recent studies on CTB 5.] Our system gave comparable accuracies to these recent works, obtaining the best (same as the error-driven version of K09) joint F-score. 4 Related Work The effectiveness of our beam-search decoder showed that the joint segmentation and tagging problem may be less complex than previously perceived (Zhang and Clark, 2008; Jiang et al., 2008a).
At the very least, the single-model approach with a simple decoder achieved competitive accuracies to what has been achieved so far by the reranking (Shi and Wang, 2007; Jiang et al., 2008b) models and an ensemble model using machine-translation techniques (Jiang et al., 2008a). This may shed new light on joint segmentation and POS-tagging methods. Kruengkrai et al. (2009) and Zhang and Clark (2008) are the most similar to our system among related work. Both systems use a discriminatively trained linear model to score candidate outputs. The work of Kruengkrai et al. (2009) is based on Nakagawa and Uchimoto (2007), which separates the processing of known words and unknown words, and uses a set of segmentation tags to represent the segmentation of characters. In contrast, our model is conceptually simpler, and does not differentiate known words and unknown words. Moreover, our model is based on our previous work, in line with Zhang and Clark (2007), which does not treat word segmentation as character sequence labeling. Our learning and decoding algorithms are also different from Kruengkrai et al. (2009). While Kruengkrai et al. (2009) perform dynamic programming and MIRA learning, we use beam-search to perform incremental decoding, and the early-update version of the perceptron algorithm to train the model. Dynamic programming is exact inference, for which the time complexity is decided by the locality of feature templates. In contrast, beam-search is approximate and can run in linear time. The parameter updating for our algorithm is conceptually and computationally simpler than MIRA, though its performance can be slightly lower. However, the early-update mechanism we use is consistent with our incremental approach, and improves the learning of the beam-search process. 5 Conclusion We showed that a simple beam-search decoding algorithm can be effectively applied to the decoding problem for a global linear model for joint word segmentation and POS-tagging. By guiding search with partial word information and performing learning for partial candidates, our system achieved significantly faster speed with little accuracy loss compared to the system of Z&C08. The source code of our joint segmentor and POS-tagger can be found at: www.sourceforge.net/projects/zpar, version 0.4. Acknowledgements We thank Canasai Kruengkrai for discussion on efficiency issues, and the anonymous reviewers for their suggestions. Yue Zhang and Stephen Clark are supported by the European Union Seventh Framework Programme (FP7-ICT-2009-4) under grant agreement no. 247762. References Eugene Charniak, Mark Johnson, Micha Elsner, Joseph Austerweil, David Ellis, Isaac Haxton, Catherine Hill, R. Shrivaths, Jeremy Moore, Michael Pozar, and Theresa Vu. 2006. Multilevel coarse-to-fine PCFG parsing. In Proceedings of HLT/NAACL, pages 168–175, New York City, USA, June. Association for Computational Linguistics. Michael Collins and Brian Roark. 2004. Incremental parsing with the perceptron algorithm. In Proceedings of ACL, pages 111–118, Barcelona, Spain, July. Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of EMNLP, pages 1–8, Philadelphia, USA, July. Wenbin Jiang, Liang Huang, Qun Liu, and Yajuan Lü. 2008a. A cascaded linear model for joint Chinese word segmentation and part-of-speech tagging. In Proceedings of ACL/HLT, pages 897–904, Columbus, Ohio, June. Wenbin Jiang, Haitao Mi, and Qun Liu. 2008b.
Word lattice reranking for Chinese word segmentation and part-of-speech tagging. In Proceedings of COLING, pages 385–392, Manchester, UK, August. Canasai Kruengkrai, Kiyotaka Uchimoto, Jun’ichi Kazama, Yiou Wang, Kentaro Torisawa, and Hitoshi Isahara. 2009. An error-driven word-character hybrid model for joint Chinese word segmentation and POS tagging. In Proceedings of ACL/AFNLP, pages 513–521, Suntec, Singapore, August. Tetsuji Nakagawa and Kiyotaka Uchimoto. 2007. A hybrid approach to word segmentation and POS tagging. In Proceedings of ACL Demo and Poster Session, Prague, Czech Republic, June. Hwee Tou Ng and Jin Kiat Low. 2004. Chinese part-of-speech tagging: One-at-a-time or all-at-once? Word-based or character-based? In Proceedings of EMNLP, Barcelona, Spain. Brian Roark and Kristy Hollingshead. 2008. Classifying chart cells for quadratic complexity context-free inference. In Proceedings of COLING, pages 745–752, Manchester, UK, August. Coling 2008 Organizing Committee. Yanxin Shi and Mengqiu Wang. 2007. A dual-layer CRF based joint decoding method for cascade segmentation and labelling tasks. In Proceedings of IJCAI, Hyderabad, India. Fei Xia. 2000. The part-of-speech tagging guidelines for the Chinese Treebank (3.0). Yue Zhang and Stephen Clark. 2007. Chinese segmentation with a word-based perceptron algorithm. In Proceedings of ACL, pages 840–847, Prague, Czech Republic, June. Yue Zhang and Stephen Clark. 2008. Joint word segmentation and POS tagging using a single perceptron. In Proceedings of ACL/HLT, pages 888–896, Columbus, Ohio, June.

2 0.1938419 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks

Author: Xian Qian ; Qi Zhang ; Yaqian Zhou ; Xuanjing Huang ; Lide Wu

Abstract: Many sequence labeling tasks in NLP require solving a cascade of segmentation and tagging subtasks, such as Chinese POS tagging, named entity recognition, and so on. Traditional pipeline approaches usually suffer from error propagation. Joint training/decoding in the cross-product state space could cause too many parameters and high inference complexity. In this paper, we present a novel method which integrates graph structures of two subtasks into one using virtual nodes, and performs joint training and decoding in the factorized state space. Experimental evaluations on CoNLL 2000 shallow parsing data set and Fourth SIGHAN Bakeoff CTB POS tagging data set demonstrate the superiority of our method over cross-product, pipeline and candidate reranking approaches.

3 0.18792263 43 emnlp-2010-Enhancing Domain Portability of Chinese Segmentation Model Using Chi-Square Statistics and Bootstrapping

Author: Baobao Chang ; Dongxu Han

Abstract: Almost all Chinese language processing tasks involve word segmentation of the language input as their first step, so robust and reliable segmentation techniques are always required to make sure those tasks are well performed. In recent years, machine learning and sequence labeling models such as Conditional Random Fields (CRFs) are often used in segmenting Chinese texts. Compared with traditional lexicon-driven models, machine-learned models achieve higher F-measure scores. But machine-learned models heavily depend on training materials. Although they can effectively process texts from the same domain as the training texts, they perform relatively poorly when texts from new domains are to be processed. In this paper, we propose to use χ2 statistics when training an SVM-HMM based segmentation model to improve its ability to recall OOV words and then use bootstrapping strategies to maintain its ability to recall IV words. Experiments show the approach proposed in this paper enhances the domain portability of the Chinese word segmentation model and prevents drastic decline in performance when processing texts across domains.

4 0.13096514 42 emnlp-2010-Efficient Incremental Decoding for Tree-to-String Translation

Author: Liang Huang ; Haitao Mi

Abstract: Syntax-based translation models should in principle be efficient with a polynomially-sized search space, but in practice they are often embarrassingly slow, partly due to the cost of language model integration. In this paper we borrow from phrase-based decoding the idea to generate a translation incrementally left-to-right, and show that for tree-to-string models, with a clever encoding of derivation history, this method runs in average-case polynomial time in theory, and linear time with beam search in practice (whereas phrase-based decoding is exponential-time in theory and quadratic-time in practice). Experiments show that, with comparable translation quality, our tree-to-string system (in Python) can run more than 30 times faster than the phrase-based system Moses (in C++).

5 0.11462383 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL

Author: Valentin Zhikov ; Hiroya Takamura ; Manabu Okumura

Abstract: This paper proposes a fast and simple unsupervised word segmentation algorithm that utilizes the local predictability of adjacent character sequences, while searching for a least-effort representation of the data. The model uses branching entropy as a means of constraining the hypothesis space, in order to efficiently obtain a solution that minimizes the length of a two-part MDL code. An evaluation with corpora in Japanese, Thai, English, and the “CHILDES” corpus for research in language development reveals that the algorithm achieves an accuracy comparable to that of the state-of-the-art methods in unsupervised word segmentation, in a significantly reduced computational time.

6 0.075960636 106 emnlp-2010-Top-Down Nearly-Context-Sensitive Parsing

7 0.070117466 99 emnlp-2010-Statistical Machine Translation with a Factorized Grammar

8 0.064143762 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding

9 0.063813202 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice

10 0.062770471 111 emnlp-2010-Two Decades of Unsupervised POS Induction: How Far Have We Come?

11 0.057411727 24 emnlp-2010-Automatically Producing Plot Unit Representations for Narrative Text

12 0.057191566 62 emnlp-2010-Improving Mention Detection Robustness to Noisy Input

13 0.055453505 97 emnlp-2010-Simple Type-Level Unsupervised POS Tagging

14 0.05530861 61 emnlp-2010-Improving Gender Classification of Blog Authors

15 0.052629814 10 emnlp-2010-A Probabilistic Morphological Analyzer for Syriac

16 0.050063122 115 emnlp-2010-Uptraining for Accurate Deterministic Question Parsing

17 0.048719857 95 emnlp-2010-SRL-Based Verb Selection for ESL

18 0.04795891 71 emnlp-2010-Latent-Descriptor Clustering for Unsupervised POS Induction

19 0.04741611 1 emnlp-2010-"Poetic" Statistical Machine Translation: Rhyme and Meter

20 0.047070622 9 emnlp-2010-A New Approach to Lexical Disambiguation of Arabic Text


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.187), (1, 0.054), (2, 0.095), (3, -0.035), (4, -0.267), (5, 0.071), (6, 0.001), (7, -0.118), (8, -0.235), (9, -0.066), (10, 0.005), (11, -0.026), (12, -0.01), (13, -0.199), (14, -0.04), (15, 0.087), (16, 0.195), (17, 0.022), (18, 0.109), (19, -0.145), (20, 0.105), (21, -0.03), (22, -0.064), (23, 0.113), (24, 0.116), (25, 0.107), (26, 0.123), (27, -0.089), (28, 0.142), (29, 0.081), (30, -0.029), (31, -0.019), (32, 0.025), (33, 0.079), (34, 0.055), (35, 0.151), (36, 0.023), (37, 0.024), (38, -0.005), (39, 0.058), (40, -0.064), (41, -0.043), (42, -0.025), (43, 0.074), (44, -0.133), (45, -0.001), (46, -0.062), (47, -0.035), (48, 0.029), (49, 0.038)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9718861 2 emnlp-2010-A Fast Decoder for Joint Word Segmentation and POS-Tagging Using a Single Discriminative Model

Author: Yue Zhang ; Stephen Clark

Abstract: We show that the standard beam-search algorithm can be used as an efficient decoder for the global linear model of Zhang and Clark (2008) for joint word segmentation and POS-tagging, achieving a significant speed improvement. Such decoding is enabled by: (1) separating full word features from partial word features so that feature templates can be instantiated incrementally, according to whether the current character is separated or appended; (2) deciding the POS-tag of a potential word when its first character is processed. Early-update is used with perceptron training so that the linear model gives a high score to a correct partial candidate as well as a full output. Effective scoring of partial structures allows the decoder to give high accuracy with a small beam-size of 16. In our 10-fold crossvalidation experiments with the Chinese Tree- . bank, our system performed over 10 times as fast as Zhang and Clark (2008) with little accuracy loss. The accuracy of our system on the standard CTB 5 test was competitive with the best in the literature. 1 Introduction and Motivation Several approaches have been proposed to solve word segmentation and POS-tagging jointly, including the reranking approach (Shi and Wang, 2007; Jiang et al., 2008b), the hybrid approach (Nakagawa and Uchimoto, 2007; Jiang et al., 2008a), and the single-model approach (Ng and Low, 2004; Zhang and Clark, 2008; Kruengkrai et al., 2009). These methods led to accuracy improvements over the traditional, pipelined segmentation and POS-tagging . . . 843 clark} @ cl cam ac uk baseline by avoiding segmentation error propagation and making use of part-of-speech information to improve segmentation. The single-model approach to joint segmentation and POS-tagging offers consistent training of all in- formation, concerning words, characters and partsof-speech. However, exact inference with dynamic programming can be infeasible if features are defined over a large enough range of the output, such as over a two-word history. In our previous work (Zhang and Clark, 2008), which we refer to as Z&C08; from now on, we used an approximate decoding algorithm that keeps track of a set of partially built structures for each character, which can be seen as a dynamic programming chart which is greatly reduced by pruning. In this paper we follow the line of single-model research, in particular the global linear model of Z&C08.; We show that effective decoding can be achieved with standard beam-search, which gives significant speed improvements compared to the decoding algorithm of Z&C08;, and achieves accuracies that are competitive with the state-of-the-art. Our research is also in line with recent research on improving the speed of NLP systems with little or no accuracy loss (Charniak et al., 2006; Roark and Hollingshead, 2008). Our speed improvement is achieved by the use of a single-beam decoder. Given an input sentence, candidate outputs are built incrementally, one character at a time. When each character is processed, it is combined with existing candidates in all possible ways to generate new candidates, and an agenda is used to keep the N-best candidate outputs from ProceMedITin,g Ms oasfs thaceh 2u0se1t0ts C,o UnSfAer,e n9c-e1 on O Ectmobpeir ic 2a0l1 M0.e ?tc ho2d0s10 in A Nsastoucira tlio Lnan fogru Cagoem Ppruotcaetisosninagl, L pinag eusis 8t4ic3s–852, the begining of the sentence to the current character. 
Compared to the multiple-beam search algorithm of Z&C08;, the use of a single beam can lead to an order of magnitude faster decoding speed. 1.1 The processing of partial words An important problem that we solve in this paper is the handling of partial words with a single beam decoder for the global model. As we pointed out in Z&C08;, it is very difficult to score partial words properly when they are compared with full words, although such comparison is necessary for incremental decoding with a single-beam. To allow comparisons with full words, partial words can either be treated as full words, or handled differently. We showed in Z&C08; that a naive single-beam decoder which treats partial words in the same way as full words failed to give a competitive accu- racy. An important reason for the low accuracy is over-segmentation during beam-search. Consider the three characters “ 自 来 水 (tap water)”. The first two characters do not make sense when put together as a single word. Rather, when treated as two singlecharacter words, they can make sense in a sentence such as “请 (please) 自 (self) 来 (come) 取 (take)”. Therefore, when using single-beam search to process “ 自 来 水 (tap water)”, the two-character word candidate “ 自 来” is likely to have been thrown off the agenda before the third character “水” is considered, leading to an unrecoverable segmentation error. This problem is even more severe for a joint segmentor and POS-tagger than for a pure word segmentor, since the POS-tags and POS-tag bigram of “ 自 and “来” further supports them being separated when ”来” is considered. The multiple-beam search decoder we proposed in Z&C08; can be seen as a means to ensure that the three characters “ 自 来 水” always have a chance to be considered as a single word. It explores candidate segmentations from the beginning of the sentence until each character, and avoids the problem of processing partial words by considering only full words. However, since it ex- ” plores a larger part of the search space than a singlebeam decoder, its time complexity is correspondingly higher. In this paper, we treat partial words differently from full words, so that in the previous example, 844 the decoder can take the first two characters in “ 自 来 水 (tap water)” as a partial word, and keep it in the beam before the third character is processed. One challenge is the representation of POS-tags for partial words. The POS of a partial word is undefined without the corresponding full word information. Though a partial word can make sense with a particular POS-tag when it is treated as a complete word, this POS-tag is not necessarily the POS of the full word which contains the partial word. Take the three-character sequence “下 雨 天” as an example. The first character “下” represents a singlecharacter word “below”, for which the POS can be LC or VV. The first two characters “下 雨” represent a two-character word “rain”, for which the POS can be VV. Moreover, all three characters when put together make the word “rainy day”, for which the POS is NN. As discussed above, assigning POS tags to partial words as if they were full words leads to low accuracy. An obvious solution to the above problem is not to assign a POS to a partial word until it becomes a full word. However, lack of POS information for partial words makes them less competitive compared to full words in the beam, since the scores of full words are futher supported by POS and POS ngram information. 
Therefore, not assigning POS to partial words potentially leads to over segmentation. In our experiments, this method did not give comparable accuracies to our Z&C08; system. In this paper, we take a different approach, and assign a POS-tag to a partial word when its first character is separated from the final character of the previous word. When more characters are appended to a partial word, the POS is not changed. The idea is to use the POS of a partial word as the predicted POS of the full word it will become. Possible predictions are made with the first character of the word, and the likely ones will be kept in the beam for the next processing steps. For example, with the three characters “下 雨 天”, we try to keep two partial words (besides full words) in the beam when the first word “下” is processed, with the POS being VV and NN, respectively. The first POS predicts the two-character word “下 雨” , and the second the three-character word “下 雨 天”. Now when the second character is processed, we still need to maintain the possible POS NN in the agenda, which predicts the three-character word “下 雨 天”. As a main contribution of this paper, we show that the mechanism ofpredicting the POS at the first character gives competitive accuracy. This mechanism can be justified theoretically. Unlike alphabetical languages, each Chinese character represents some specific meanings. Given a character, it is natural for a human speaker to know immediately what types of words it can start. The allows the knowledge of possible POS-tags of words that a character can start, using information about the character from the training data. Moreover, the POS of the previous words to the current word are also useful in deciding possible POS for the word.1 The mechanism of first-character decision of POS also boosts the efficiency, since the enumeration of POS is unecessary when a character is appended to the end of an existing word. As a result, the complexity of each processing step is reduce by half compared to a method without POS prediction. Finally, an intuitive way to represent the status of a partial word is using a flag explicitly, which means an early decision of the segmentation of the next incoming character. We take a simpler alternative approach, and treat every word as a partial word until the next incoming character is separated from the last character of this word. Before a word is confirmed as a full word, we only apply to it features that represent its current partial status, such as character bigrams, its starting character and its part-ofspeech, etc. Full word features, including the first and last characters of a word, are applied immediately after a word is confirmed as complete. An important component for our proposed system is the training process, which needs to ensure that the model scores a partial word with predicted POS properly. We use the averaged perceptron (Collins, 2002) for training, together with the “early update” mechanism of Collins and Roark (2004). Rather than updating the parameters after decoding is com- plete, the modified algorithm updates parameters at any processing step if the correct partial candidate falls out of the beam. In our experiments using the Chinese Treebank 1The next incoming characters are also a useful source of information for predicting the POS. However, our system achieved competitive accuracy with Z&C08; without such character lookahead features. 845 data, our system ran an order of magnitude faster than our Z&C08; system with little loss of accuracy. 
The accuracy of our system was competitive with other recent models. 2 Model and Feature Templates We use a linear model to score both partial and full candidate outputs. Given an input x, the score of a candidate output y is computed as: Score(y) = Φ(y) · where Φ(y) is the global feature vector extracted from y, and is the parameter vector of the model. Figure 1 shows the feature templates for the model, where templates 1 14 contain only segmentation information and templates 15 29 contain w~ , w~ – – both segmentation and POS information. Each template is instantiated according to the current character in the decoding process. Row “For” shows the conditions for template instantiation, where “s” indicates that the corresponding template is instantiated when the current character starts a new word, and “a” indicates that the corresponding template is instantiated when the current character does not start a new word. In the row for feature templates, w, t and c are used to represent a word, a POS-tag and a character, respectively. The subscripts are based on the current character, where w−1 represents the first word to the left of the current character, and p−2 represents the POS-tag on the second word to the left of the current character, and so on. As an example, feature template 1is instantiated when the current character starts a new word, and the resulting feature value is the word to the left of this character. start(w), end(w) and len(w) represent the first character, the last character and the length of word w, respectively. The length of a word is normalized to 16 if it is larger than 16. cat(c) represents the POS category of character c, which is the set of POS-tags seen on character c, as we used in Z&C08.; Given a partial or complete candidate y, its global feature vector Φ(y) is computed by instantiating all applicable feature templates from Table 1 for each character in y, according to whether or not the character is separated from the previous character. The feature templates are mostly taken from, or inspired by, the feature templates of Z&C08.; Templates 1, 2, 3, 4, 5, 8, 10, 12, 13, 14, 15, 19, 20, Feature templateFor 24, 27 and 29 concern complete word information, and they are used in the model to differentiate correct and incorrect output structures in the same way as our Z&C08; model. Templates 6, 7, 9, 16, 17, 18, 21, 22, 23, 25, 26 and 28 concern partial word information, whose role in the model is to indicate the likelihood that the partial word including the current character will become a correct full word. They act as guidance for the action to take for the cur846 function DECODE(sent, agenda): CLEAR(agenda) ADDITEM(agenda, “”) for index in [0..LEN(sent)]: for cand in agenda: new ← APPEND(cand, sent[index]) ADDITEM(agenda, new) for pos in TAGSET(): new ← SEP(cand, sent[index], pos) ADDITEM(agenda, new) agenda ← N-BEST(agenda) retaugrenn BEST(agenda) Figure 1: The incremental beam-search decoder. rent character according to the context, and are the crucial reason for the effectiveness of the algorithm with a small beam-size. 2.1 Decoding The decoding algorithm builds an output candidate incrementally, one character at a time. Each character can either be attached to the current word or separated as the start a new word. When the current character starts a new word, a POS-tag is assigned to the new word. An agenda is used by the decoder to keep the N-best candidates during the incremental process. 
Before decoding starts, the agenda is initialized with an empty sentence. When a character is processed, existing candidates are removed from the agenda and extended with the current character in all possible ways, and the N-best newly generated candidates are put back onto the agenda. After all input characters have been processed, the highest-scored candidate from the agenda is taken as the output. Pseudo code for the decoder is shown in Figure 1. CLEAR removes all items from the agenda, ADDITEM adds a new item onto the agenda, N-BEST returns the N highest-scored items from the agenda, and BEST returns the highest-scored item from the agenda. LEN returns the number of characters in a sentence, and sent[i] returns the ith character from the sentence. APPEND appends a character to the last word in a candidate, and SEP joins a character as the start of a new word in a candidate, assigning a POS-tag to the new word. Both our decoding algorithm and the decoding algorithm of Z&C08; run in linear time. However, in order to generate possible candidates for each character, Z&C08; uses an extra loop to search for possible words that end with the current character. A restriction to the maximum word length is applied to limit the number of iterations in this loop, without which the algorithm would have quadratic time complexity. In contrast, our decoder does not search backword for the possible starting character of any word. Segmentation ambiguities are resolved by binary choices between the actions append or separate for each character, and no POS enumeration is required when the character is appended. This improves the speed by a significant factor. 2.2 Training The learning algorithm is based on the generalized perceptron (Collins, 2002), but parameter adjustments can be performed at any character during the decoding process, using the “early update” mechanism of Collins and Roark (2004). The parameter vector of the model is initialized as all zeros before training, and used to decode training examples. Each training example is turned into the raw input format, and processed in the same way as decoding. After each character is processed, partial candidates in the agenda are compared to the corresponding gold-standard output for the same characters. If none of the candidates in the agenda are correct, the decoding is stopped and the parameter vector is updated by adding the global feature vector of the gold-standard partial output and subtracting the global feature vector of the highest-scored partial candidate in the agenda. The training process then moves on to the next example. However, if any item in the agenda is the same as the corresponding gold-standard, the decoding process moves to the next character, without any change to the parameter values. After all characters are processed, the decoder prediction is compared with the training example. If the prediction is correct, the parameter vector is not changed; otherwise it is updated by adding the global feature vector of the training example and subtracting the global feature vector of the decoder prediction, just as the perceptron algorithm does. The same training examples can be used to train the model for multiple iterations. We use 847 the averaged parameter vector (Collins, 2002) as the final model. Pseudocode for the training algorithm is shown in Figure 2. 
It is based on the decoding algorithm in Figure 1, and the main differences are: (1) the training algorithm takes the gold-standard output and the parameter vector as two additional arguments; (2) the training algorithm does not return a prediction, but modifies the parameter vector when necessary; (3) lines 11to 20 are additional lines of code for parameter updates. Without lines 11 to 16, the training algorithm is exactly the same as the generalized perceptron algorithm. These lines are added to ensure that the agenda contains highly probable candidates during the whole beam-search process, and they are crucial to the high accuracy of the system. As stated earlier, the decoder relies on proper scoring of partial words to maintain a set of high quality candidates in the agenda. Updating the value of the parameter vector for partial outputs can be seen as a means to ensure correct scoring of partial candidates at any character. 2.3 Pruning We follow Z&C08; and use several pruning methods, most of which serve to to improve the accuracy by removing irrelevant candidates from the beam. First, the system records the maximum number of characters that a word with a particular POS-tag can have. For example, from the Chinese Treebank that we used for our experiments, most POS are associated with only with one- or two-character words. The only POS-tags that are seen with words over ten characters long are NN (noun), NR (proper noun) and CD (numbers). The maximum word length information is initialized as all ones, and updated according to each training example before it is processed. Second, a tag dictionary is used to record POStags associated with each word. During decoding, frequent words and words with “closed set” tags2 are only allowed POS-tags according to the tag dictionary, while other words are allowed every POS-tag to make candidate outputs. Whether a word is a frequent word is decided by the number of times it has been seen in the training process. Denoting the num2“Closed set” tags are the set of POS-tags which are only associated with a fixed set of words, according to the Penn Chinese Treebank specifications (Xia, 2000). function TRAIN(sent, agenda, gold-standard, w~ ): 01: CLEAR(agenda) 02: ADDITEM(agenda, “”) 03: for index in [0..LEN(sent)]: 04: 05: 06: 07: 08: 09: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: for cand in agenda: new ← APPEND(cand, sent[index]) ADDITEM(agenda, new) for pos in TAGSET(): new ← SEP(cand, sent[index], pos) ADDITEM(agenda, new) agenda ← N-BEST(agenda) faogre cnadnad ← ←in agenda: if cand = gold-standard[0:index] : CONTINUE w~ ← w~ + Φ(gold-standard[0:index]) ww~~ ← ww~ ~ - Φ(BEST(agenda)) wr~et ←urn w~ if BEST(agenda) gold-standard: w~ ← a ~wg + Φ(gold-standard) ww~~ ← ww~ ~ - Φ(BEST(agenda)) wr~et ←urn w~ return = Figure 2: The incremental learning function. ber of times the most frequent word has been seen with M, a word is a frequent word if it has been seen more than M/5000 5 times. The threshold value is taken from Z&C08;, and we did not adjust it during development. Word frequencies are initialized as zeros and updated according to each training example before it is processed; the tag dictionary is initialized as empty and updated according to each training example before it is processed. Third, we make an additional record of the initial characters for words with “closed set” tags. During decoding, when the current character is added as the start of a new word, “closed set” tags are only assigned to the word if it is consistent with the record. 
This type of pruning is used in addition to the tag + dictionary to prune invalid partial words, while the tag dictionary is used to prune complete words. The record for initial character and POS is initially empty, and udpated according to each training example before it is processed. Finally, at any decoding step, we group partial 848 candidates that are generated by separating the current character as the start of a new word by the signature p0p−1w−1, and keep only the best among those having the same p0p−1w−1. The signature p0p−1w−1 is decided by the feature templates we use: it can be shown that if two candidates cand1 and cand2 generated at the same step have the same signature, and the score of cand1 is higher than the score of cand2, then at any future step, the highest scored candidate generated from cand1 will always have a higher score than the highest scored candidate generated from cand2. From the above pruning methods, only the third was not used by Z&C08.; It can be seen as an extra mechanism to help keep likely partial words in the agenda and improve the accuracy, but which does not give our system a speed advantage over Z&C08.; 3 Experiments We used the Chinese Treebank (CTB) data to perform one set of development tests and two sets of fi- Training iteration Figure 3: The influence of beam-sizes, and the convergence of the perceptron. nal tests. The CTB 4 was split into two parts, with the CTB 3 being used for a 10-fold cross validation test to compare speed and accuracies with Z&C08;, and the rest being used for development. The CTB 5 was used to perform the additional set of experiments to compare accuracies with other recent work. We use the standard F-measure to evaluate output accuracies. For word segmentation, precision is defined as the number of correctly segmented words divided by the total number of words in the output, and recall is defined as the number of correctly segmented words divided by the total number of words in the gold-standard output. For joint segmentation and POS-tagging, precision is defined as the number of correctly segmented and POS-tagged words divided by the total number of words from the output, and recall is defined as the correctly segmented and POS-tagged words divided by the total number of words in the gold-standard output. All our experiments were performed on a Linux platform, and a single 2.66GHz Intel Core 2 CPU. 3.1 Development tests Our development data consists of 150K words in 4798 sentences. 80% of the data were randomly chosen as the development training data, while the rest were used as the development test data. Our development tests were mainly used to decide the size ofthe beam, the number oftraining iterations, the ef- fect of partial features in beam-search decoding, and the effect of incremental learning (i.e. early update). 849 Figure 3 shows the accuracy curves for joint segmentation and POS-tagging by the number of training iterations, using different beam sizes. With the size of the beam increasing from 1to 32, the accuracies generally increase, while the amount of increase becomes small when the size of the beam becomes 16. After the 10th iteration, a beam size of 32 does not always give better accuracies than a beam size of 16. We therefore chose 16 as the size of the beam for our system. The testing times for each beam size between 1 and 32 are 7.16s, 11.90s, 18.42s, 27.82s, 46.77s and 89.21s, respectively. 
3.1 Development tests

Our development data consists of 150K words in 4798 sentences. 80% of the data were randomly chosen as the development training data, while the rest were used as the development test data. Our development tests were mainly used to decide the size of the beam, the number of training iterations, the effect of partial features in beam-search decoding, and the effect of incremental learning (i.e. early update).

[Figure 3: The influence of beam-sizes, and the convergence of the perceptron. Accuracy curves for joint segmentation and POS-tagging against the training iteration, one curve per beam size.]

Figure 3 shows the accuracy curves for joint segmentation and POS-tagging by the number of training iterations, using different beam sizes. With the size of the beam increasing from 1 to 32, the accuracies generally increase, while the amount of increase becomes small once the size of the beam reaches 16. After the 10th iteration, a beam size of 32 does not always give better accuracies than a beam size of 16. We therefore chose 16 as the size of the beam for our system. The testing times for each beam size between 1 and 32 are 7.16s, 11.90s, 18.42s, 27.82s, 46.77s and 89.21s, respectively; the corresponding speeds in the number of sentences per second are 111.45, 67.06, 43.32, 28.68, 17.06 and 8.95.

Figure 3 also shows that the accuracy increases with an increased number of training iterations, but the amount of increase becomes small after the 25th iteration. We chose 29 as the number of iterations to train our system.

The effect of incremental training: we compare the accuracies obtained by incremental training using early update and by normal perceptron training. In the normal perceptron training case, lines 11 to 16 are taken out of the training algorithm in Figure 2. The algorithm reached the best performance at the 22nd iteration, with the segmentation F-score being 90.58% and the joint F-score being 83.38%. In the incremental training case, the algorithm reached the best accuracy at the 30th training iteration, obtaining a segmentation F-score of 91.14% and a joint F-score of 84.06%.

3.2 Final tests using CTB 3

CTB 3 consists of 150K words in 10364 sentences. We follow Z&C08 and split it into 10 equal-sized parts. In each test, one part is taken as the test data and the other nine are combined together as the training data. We compare the speed and accuracy with the joint segmentor and tagger of Z&C08, which is publicly available as the ZPar system, version 0.2 (http://www.sourceforge.net/projects/zpar). The results are shown in Table 2, where each row shows one cross validation test. The column headings "sf", "jf", "time" and "speed" refer to segmentation F-measure, joint F-measure, testing time (in seconds) and testing speed (in the number of sentences per second), respectively.

[Table 2: Speed and accuracy comparison with Z&C08 over the ten cross-validation tests; sf, jf, time and speed are reported for Z&C08 and for this paper.]

Our system gave a joint segmentation and POS-tagging F-score of 91.37%, which is only 0.04% lower than that of ZPar 0.2. The speed of our system was over 10 times as fast as ZPar 0.2.

3.3 Final tests using CTB 5

We follow Kruengkrai et al. (2009) and split the CTB 5 into training, development testing and testing sets, as shown in Table 3.

[Table 3: Training, development and test data on CTB 5, giving the sections, sentence counts and word counts of each split.]

We ignored the development test data since our system had been developed in previous experiments. Kruengkrai et al. (2009) made use of character type knowledge for spaces, numerals, symbols, alphabets, Chinese and other characters. In the previous experiments, our system did not use any knowledge beyond the training data. To make the comparison fairer, we included knowledge of English letters and Arabic numbers in this experiment. During both training and decoding, English letters and Arabic numbers are segmented using simple rules, treating consecutive English letters or Arabic numbers as a single word.
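The simple rule for English letters and Arabic numbers could be realized as a pre-segmentation pass such as the following Python sketch; the regular expression and the function name are our own guesses at one reasonable implementation, not the authors' code.

import re

# Runs of English letters or Arabic digits are grouped into a single unit
# before the statistical model sees them; every other character stays a
# single unit.
LATIN_OR_DIGIT = re.compile(r"[A-Za-z]+|[0-9]+")

def presegment(sentence):
    units, i = [], 0
    while i < len(sentence):
        match = LATIN_OR_DIGIT.match(sentence, i)
        if match:
            units.append(match.group())
            i = match.end()
        else:
            units.append(sentence[i])
            i += 1
    return units

# e.g. presegment("GDP增长8%") returns ["GDP", "增", "长", "8", "%"]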
The results are shown in Table 4, where row "N07" refers to the model of Nakagawa and Uchimoto (2007), rows "J08a" and "J08b" refer to the models of Jiang et al. (2008a) and Jiang et al. (2008b), and row "K09" refers to the models of Kruengkrai et al. (2009). Columns "sf" and "jf" refer to segmentation and joint accuracies, respectively.

[Table 4: Accuracy comparison (sf and jf) with recent studies on CTB 5.]

Our system gave comparable accuracies to these recent works, obtaining the best joint F-score (the same as the error-driven version of K09).

4 Related Work

The effectiveness of our beam-search decoder showed that the joint segmentation and tagging problem may be less complex than previously perceived (Zhang and Clark, 2008; Jiang et al., 2008a). At the very least, the single-model approach with a simple decoder achieved accuracies competitive with what has been achieved so far by the reranking models (Shi and Wang, 2007; Jiang et al., 2008b) and by an ensemble model using machine-translation techniques (Jiang et al., 2008a). This may shed new light on joint segmentation and POS-tagging methods.

Among related work, Kruengkrai et al. (2009) and Zhang and Clark (2008) are the most similar to our system. Both systems use a discriminatively trained linear model to score candidate outputs. The work of Kruengkrai et al. (2009) is based on Nakagawa and Uchimoto (2007), which separates the processing of known words and unknown words, and uses a set of segmentation tags to represent the segmentation of characters. In contrast, our model is conceptually simpler, and does not differentiate known words and unknown words. Moreover, our model is based on our previous work, in line with Zhang and Clark (2007), which does not treat word segmentation as character sequence labeling.

Our learning and decoding algorithms are also different from Kruengkrai et al. (2009). While Kruengkrai et al. (2009) perform dynamic programming and MIRA learning, we use beam-search to perform incremental decoding, and the early-update version of the perceptron algorithm to train the model. Dynamic programming is exact inference, for which the time complexity is decided by the locality of feature templates. In contrast, beam-search is approximate and can run in linear time. The parameter updating for our algorithm is conceptually and computationally simpler than MIRA, though its performance can be slightly lower. However, the early-update mechanism we use is consistent with our incremental approach, and improves the learning of the beam-search process.

5 Conclusion

We showed that a simple beam-search decoding algorithm can be effectively applied to the decoding problem for a global linear model for joint word segmentation and POS-tagging. By guiding search with partial word information and performing learning for partial candidates, our system achieved significantly faster speed with little accuracy loss compared to the system of Z&C08. The source code of our joint segmentor and POS-tagger can be found at www.sourceforge.net/projects/zpar, version 0.4.

Acknowledgements

We thank Canasai Kruengkrai for discussion on efficiency issues, and the anonymous reviewers for their suggestions. Yue Zhang and Stephen Clark are supported by the European Union Seventh Framework Programme (FP7-ICT-2009-4) under grant agreement no. 247762.

References

Eugene Charniak, Mark Johnson, Micha Elsner, Joseph Austerweil, David Ellis, Isaac Haxton, Catherine Hill, R. Shrivaths, Jeremy Moore, Michael Pozar, and Theresa Vu. 2006. Multilevel coarse-to-fine PCFG parsing. In Proceedings of HLT/NAACL, pages 168–175, New York City, USA, June. Association for Computational Linguistics.

Michael Collins and Brian Roark. 2004. Incremental parsing with the perceptron algorithm. In Proceedings of ACL, pages 111–118, Barcelona, Spain, July.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of EMNLP, pages 1–8, Philadelphia, USA, July.

Wenbin Jiang, Liang Huang, Qun Liu, and Yajuan Lü. 2008a. A cascaded linear model for joint Chinese word segmentation and part-of-speech tagging. In Proceedings of ACL/HLT, pages 897–904, Columbus, Ohio, June.
Wenbin Jiang, Haitao Mi, and Qun Liu. 2008b. Word lattice reranking for Chinese word segmentation and part-of-speech tagging. In Proceedings of COLING, pages 385–392, Manchester, UK, August.

Canasai Kruengkrai, Kiyotaka Uchimoto, Jun'ichi Kazama, Yiou Wang, Kentaro Torisawa, and Hitoshi Isahara. 2009. An error-driven word-character hybrid model for joint Chinese word segmentation and POS tagging. In Proceedings of ACL/AFNLP, pages 513–521, Suntec, Singapore, August.

Tetsuji Nakagawa and Kiyotaka Uchimoto. 2007. A hybrid approach to word segmentation and POS tagging. In Proceedings of the ACL Demo and Poster Session, Prague, Czech Republic, June.

Hwee Tou Ng and Jin Kiat Low. 2004. Chinese part-of-speech tagging: One-at-a-time or all-at-once? Word-based or character-based? In Proceedings of EMNLP, Barcelona, Spain.

Brian Roark and Kristy Hollingshead. 2008. Classifying chart cells for quadratic complexity context-free inference. In Proceedings of COLING, pages 745–752, Manchester, UK, August. Coling 2008 Organizing Committee.

Yanxin Shi and Mengqiu Wang. 2007. A dual-layer CRF based joint decoding method for cascade segmentation and labelling tasks. In Proceedings of IJCAI, Hyderabad, India.

Fei Xia. 2000. The part-of-speech tagging guidelines for the Chinese Treebank (3.0).

Yue Zhang and Stephen Clark. 2007. Chinese segmentation with a word-based perceptron algorithm. In Proceedings of ACL, pages 840–847, Prague, Czech Republic, June.

Yue Zhang and Stephen Clark. 2008. Joint word segmentation and POS tagging using a single perceptron. In Proceedings of ACL/HLT, pages 888–896, Columbus, Ohio, June.

2 0.75641644 43 emnlp-2010-Enhancing Domain Portability of Chinese Segmentation Model Using Chi-Square Statistics and Bootstrapping

Author: Baobao Chang ; Dongxu Han

Abstract: Almost all Chinese language processing tasks involve word segmentation of the language input as their first steps, thus robust and reliable segmentation techniques are always required to make sure those tasks well performed. In recent years, machine learning and sequence labeling models such as Conditional Random Fields (CRFs) are often used in segmenting Chinese texts. Compared with traditional lexicon-driven models, machine learned models achieve higher F-measure scores. But machine learned models heavily depend on training materials. Although they can effectively process texts from the same domain as the training texts, they perform relatively poorly when texts from new domains are to be processed. In this paper, we propose to use χ2 statistics when training an SVM-HMM based segmentation model to improve its ability to recall OOV words and then use bootstrapping strategies to maintain its ability to recall IV words. Experiments show the approach proposed in this paper enhances the domain portability of the Chinese word segmentation model and prevents drastic decline in performance when processing texts across domains.

3 0.6847809 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks

Author: Xian Qian ; Qi Zhang ; Yaqian Zhou ; Xuanjing Huang ; Lide Wu

Abstract: Many sequence labeling tasks in NLP require solving a cascade of segmentation and tagging subtasks, such as Chinese POS tagging, named entity recognition, and so on. Traditional pipeline approaches usually suffer from error propagation. Joint training/decoding in the cross-product state space could cause too many parameters and high inference complexity. In this paper, we present a novel method which integrates graph structures of two subtasks into one using virtual nodes, and performs joint training and decoding in the factorized state space. Experimental evaluations on CoNLL 2000 shallow parsing data set and Fourth SIGHAN Bakeoff CTB POS tagging data set demonstrate the superiority of our method over cross-product, pipeline and candidate reranking approaches.

4 0.55175459 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL

Author: Valentin Zhikov ; Hiroya Takamura ; Manabu Okumura

Abstract: This paper proposes a fast and simple unsupervised word segmentation algorithm that utilizes the local predictability of adjacent character sequences, while searching for a least-effort representation of the data. The model uses branching entropy as a means of constraining the hypothesis space, in order to efficiently obtain a solution that minimizes the length of a two-part MDL code. An evaluation with corpora in Japanese, Thai, English, and the ”CHILDES” corpus for research in language development reveals that the algorithm achieves an accuracy, comparable to that of the state-of-the-art methods in unsupervised word segmentation, in a significantly reduced computational time.

5 0.41148791 42 emnlp-2010-Efficient Incremental Decoding for Tree-to-String Translation

Author: Liang Huang ; Haitao Mi

Abstract: Syntax-based translation models should in principle be efficient with polynomially-sized search space, but in practice they are often embarrassingly slow, partly due to the cost of language model integration. In this paper we borrow from phrase-based decoding the idea to generate a translation incrementally left-to-right, and show that for tree-to-string models, with a clever encoding of derivation history, this method runs in average-case polynomial-time in theory, and linear-time with beam search in practice (whereas phrase-based decoding is exponential-time in theory and quadratic-time in practice). Experiments show that, with comparable translation quality, our tree-to-string system (in Python) can run more than 30 times faster than the phrase-based system Moses (in C++).

6 0.29908901 24 emnlp-2010-Automatically Producing Plot Unit Representations for Narrative Text

7 0.27777722 61 emnlp-2010-Improving Gender Classification of Blog Authors

8 0.26260552 106 emnlp-2010-Top-Down Nearly-Context-Sensitive Parsing

9 0.25986686 111 emnlp-2010-Two Decades of Unsupervised POS Induction: How Far Have We Come?

10 0.2575272 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding

11 0.25722671 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors

12 0.24091619 99 emnlp-2010-Statistical Machine Translation with a Factorized Grammar

13 0.23361807 71 emnlp-2010-Latent-Descriptor Clustering for Unsupervised POS Induction

14 0.22847667 10 emnlp-2010-A Probabilistic Morphological Analyzer for Syriac

15 0.22524181 60 emnlp-2010-Improved Fully Unsupervised Parsing with Zoomed Learning

16 0.21188882 48 emnlp-2010-Exploiting Conversation Structure in Unsupervised Topic Segmentation for Emails

17 0.21063617 62 emnlp-2010-Improving Mention Detection Robustness to Noisy Input

18 0.20886843 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice

19 0.19986542 95 emnlp-2010-SRL-Based Verb Selection for ESL

20 0.19600981 118 emnlp-2010-Utilizing Extra-Sentential Context for Parsing


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.285), (12, 0.047), (29, 0.133), (32, 0.024), (52, 0.037), (56, 0.064), (62, 0.022), (66, 0.111), (72, 0.092), (76, 0.03), (87, 0.026), (89, 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.82433438 54 emnlp-2010-Generating Confusion Sets for Context-Sensitive Error Correction

Author: Alla Rozovskaya ; Dan Roth

Abstract: In this paper, we consider the problem of generating candidate corrections for the task of correcting errors in text. We focus on the task of correcting errors in preposition usage made by non-native English speakers, using discriminative classifiers. The standard approach to the problem assumes that the set of candidate corrections for a preposition consists of all preposition choices participating in the task. We determine likely preposition confusions using an annotated corpus of non-native text and use this knowledge to produce smaller sets of candidates. We propose several methods of restricting candidate sets. These methods exclude candidate prepositions that are not observed as valid corrections in the annotated corpus and take into account the likelihood of each preposition confusion in the non-native text. We find that restricting candidates to those that are observed in the non-native data improves both the precision and the recall compared to the approach that views all prepositions as possible candidates. Furthermore, the approach that takes into account the likelihood of each preposition confusion is shown to be the most effective.

same-paper 2 0.75848669 2 emnlp-2010-A Fast Decoder for Joint Word Segmentation and POS-Tagging Using a Single Discriminative Model

Author: Yue Zhang ; Stephen Clark


3 0.56520212 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice

Author: Samidh Chatterjee ; Nicola Cancedda

Abstract: Minimum Error Rate Training is the algorithm for log-linear model parameter training most used in state-of-the-art Statistical Machine Translation systems. In its original formulation, the algorithm uses N-best lists output by the decoder to grow the Translation Pool that shapes the surface on which the actual optimization is performed. Recent work has been done to extend the algorithm to use the entire translation lattice built by the decoder, instead of N-best lists. We propose here a third, intermediate way, consisting in growing the translation pool using samples randomly drawn from the translation lattice. We empirically measure a systematic improvement in the BLEU scores compared to training using N-best lists, without suffering the increase in computational complexity associated with operating with the whole lattice.

4 0.55995673 87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space

Author: Marco Baroni ; Roberto Zamparelli

Abstract: We propose an approach to adjective-noun composition (AN) for corpus-based distributional semantics that, building on insights from theoretical linguistics, represents nouns as vectors and adjectives as data-induced (linear) functions (encoded as matrices) over nominal vectors. Our model significantly outperforms the rivals on the task of reconstructing AN vectors not seen in training. A small post-hoc analysis further suggests that, when the model-generated AN vector is not similar to the corpus-observed AN vector, this is due to anomalies in the latter. We show moreover that our approach provides two novel ways to represent adjective meanings, alternative to its representation via corpus-based co-occurrence vectors, both outperforming the latter in an adjective clustering task.

5 0.55897909 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks

Author: Xian Qian ; Qi Zhang ; Yaqian Zhou ; Xuanjing Huang ; Lide Wu

Abstract: Many sequence labeling tasks in NLP require solving a cascade of segmentation and tagging subtasks, such as Chinese POS tagging, named entity recognition, and so on. Traditional pipeline approaches usually suffer from error propagation. Joint training/decoding in the cross-product state space could cause too many parameters and high inference complexity. In this paper, we present a novel method which integrates graph structures of two subtasks into one using virtual nodes, and performs joint training and decoding in the factorized state space. Experimental evaluations on CoNLL 2000 shallow parsing data set and Fourth SIGHAN Bakeoff CTB POS tagging data set demonstrate the superiority of our method over cross-product, pipeline and candidate reranking approaches.

6 0.55477351 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities

7 0.55452859 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics

8 0.55403107 86 emnlp-2010-Non-Isomorphic Forest Pair Translation

9 0.55352306 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding

10 0.55343831 65 emnlp-2010-Inducing Probabilistic CCG Grammars from Logical Form with Higher-Order Unification

11 0.55101526 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media

12 0.55048746 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation

13 0.55039406 34 emnlp-2010-Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Generation

14 0.5495984 82 emnlp-2010-Multi-Document Summarization Using A* Search and Discriminative Learning

15 0.54947478 84 emnlp-2010-NLP on Spoken Documents Without ASR

16 0.54891992 103 emnlp-2010-Tense Sense Disambiguation: A New Syntactic Polysemy Task

17 0.54820824 60 emnlp-2010-Improved Fully Unsupervised Parsing with Zoomed Learning

18 0.54508591 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment

19 0.54435498 105 emnlp-2010-Title Generation with Quasi-Synchronous Grammar

20 0.54409051 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions