emnlp emnlp2010 knowledge-graph by maker-knowledge-mining

emnlp 2010 knowledge graph

papers list:

1 emnlp-2010-"Poetic" Statistical Machine Translation: Rhyme and Meter

Author: Dmitriy Genzel ; Jakob Uszkoreit ; Franz Och

Abstract: As a prerequisite to translation of poetry, we implement the ability to produce translations with meter and rhyme for phrase-based MT, examine whether the hypothesis space of such a system is flexible enough to accomodate such constraints, and investigate the impact of such constraints on translation quality.

2 emnlp-2010-A Fast Decoder for Joint Word Segmentation and POS-Tagging Using a Single Discriminative Model

Author: Yue Zhang ; Stephen Clark

Abstract: We show that the standard beam-search algorithm can be used as an efficient decoder for the global linear model of Zhang and Clark (2008) for joint word segmentation and POS-tagging, achieving a significant speed improvement. Such decoding is enabled by: (1) separating full word features from partial word features so that feature templates can be instantiated incrementally, according to whether the current character is separated or appended; (2) deciding the POS-tag of a potential word when its first character is processed. Early-update is used with perceptron training so that the linear model gives a high score to a correct partial candidate as well as a full output. Effective scoring of partial structures allows the decoder to give high accuracy with a small beam-size of 16. In our 10-fold crossvalidation experiments with the Chinese Tree- . bank, our system performed over 10 times as fast as Zhang and Clark (2008) with little accuracy loss. The accuracy of our system on the standard CTB 5 test was competitive with the best in the literature. 1 Introduction and Motivation Several approaches have been proposed to solve word segmentation and POS-tagging jointly, including the reranking approach (Shi and Wang, 2007; Jiang et al., 2008b), the hybrid approach (Nakagawa and Uchimoto, 2007; Jiang et al., 2008a), and the single-model approach (Ng and Low, 2004; Zhang and Clark, 2008; Kruengkrai et al., 2009). These methods led to accuracy improvements over the traditional, pipelined segmentation and POS-tagging . . . 843 clark} @ cl cam ac uk baseline by avoiding segmentation error propagation and making use of part-of-speech information to improve segmentation. The single-model approach to joint segmentation and POS-tagging offers consistent training of all in- formation, concerning words, characters and partsof-speech. However, exact inference with dynamic programming can be infeasible if features are defined over a large enough range of the output, such as over a two-word history. In our previous work (Zhang and Clark, 2008), which we refer to as Z&C08; from now on, we used an approximate decoding algorithm that keeps track of a set of partially built structures for each character, which can be seen as a dynamic programming chart which is greatly reduced by pruning. In this paper we follow the line of single-model research, in particular the global linear model of Z&C08.; We show that effective decoding can be achieved with standard beam-search, which gives significant speed improvements compared to the decoding algorithm of Z&C08;, and achieves accuracies that are competitive with the state-of-the-art. Our research is also in line with recent research on improving the speed of NLP systems with little or no accuracy loss (Charniak et al., 2006; Roark and Hollingshead, 2008). Our speed improvement is achieved by the use of a single-beam decoder. Given an input sentence, candidate outputs are built incrementally, one character at a time. When each character is processed, it is combined with existing candidates in all possible ways to generate new candidates, and an agenda is used to keep the N-best candidate outputs from ProceMedITin,g Ms oasfs thaceh 2u0se1t0ts C,o UnSfAer,e n9c-e1 on O Ectmobpeir ic 2a0l1 M0.e ?tc ho2d0s10 in A Nsastoucira tlio Lnan fogru Cagoem Ppruotcaetisosninagl, L pinag eusis 8t4ic3s–852, the begining of the sentence to the current character. Compared to the multiple-beam search algorithm of Z&C08;, the use of a single beam can lead to an order of magnitude faster decoding speed. 1.1 The processing of partial words An important problem that we solve in this paper is the handling of partial words with a single beam decoder for the global model. As we pointed out in Z&C08;, it is very difficult to score partial words properly when they are compared with full words, although such comparison is necessary for incremental decoding with a single-beam. To allow comparisons with full words, partial words can either be treated as full words, or handled differently. We showed in Z&C08; that a naive single-beam decoder which treats partial words in the same way as full words failed to give a competitive accu- racy. An important reason for the low accuracy is over-segmentation during beam-search. Consider the three characters “ 自来水 (tap water)”. The first two characters do not make sense when put together as a single word. Rather, when treated as two singlecharacter words, they can make sense in a sentence such as “请 (please) 自 (self) 来 (come) 取 (take)”. Therefore, when using single-beam search to process “ 自来水 (tap water)”, the two-character word candidate “ 自来” is likely to have been thrown off the agenda before the third character “水” is considered, leading to an unrecoverable segmentation error. This problem is even more severe for a joint segmentor and POS-tagger than for a pure word segmentor, since the POS-tags and POS-tag bigram of “ 自 and “来” further supports them being separated when ”来” is considered. The multiple-beam search decoder we proposed in Z&C08; can be seen as a means to ensure that the three characters “ 自来水” always have a chance to be considered as a single word. It explores candidate segmentations from the beginning of the sentence until each character, and avoids the problem of processing partial words by considering only full words. However, since it ex- ” plores a larger part of the search space than a singlebeam decoder, its time complexity is correspondingly higher. In this paper, we treat partial words differently from full words, so that in the previous example, 844 the decoder can take the first two characters in “ 自来水 (tap water)” as a partial word, and keep it in the beam before the third character is processed. One challenge is the representation of POS-tags for partial words. The POS of a partial word is undefined without the corresponding full word information. Though a partial word can make sense with a particular POS-tag when it is treated as a complete word, this POS-tag is not necessarily the POS of the full word which contains the partial word. Take the three-character sequence “下雨天” as an example. The first character “下” represents a singlecharacter word “below”, for which the POS can be LC or VV. The first two characters “下雨” represent a two-character word “rain”, for which the POS can be VV. Moreover, all three characters when put together make the word “rainy day”, for which the POS is NN. As discussed above, assigning POS tags to partial words as if they were full words leads to low accuracy. An obvious solution to the above problem is not to assign a POS to a partial word until it becomes a full word. However, lack of POS information for partial words makes them less competitive compared to full words in the beam, since the scores of full words are futher supported by POS and POS ngram information. Therefore, not assigning POS to partial words potentially leads to over segmentation. In our experiments, this method did not give comparable accuracies to our Z&C08; system. In this paper, we take a different approach, and assign a POS-tag to a partial word when its first character is separated from the final character of the previous word. When more characters are appended to a partial word, the POS is not changed. The idea is to use the POS of a partial word as the predicted POS of the full word it will become. Possible predictions are made with the first character of the word, and the likely ones will be kept in the beam for the next processing steps. For example, with the three characters “下雨天”, we try to keep two partial words (besides full words) in the beam when the first word “下” is processed, with the POS being VV and NN, respectively. The first POS predicts the two-character word “下雨” ， and the second the three-character word “下雨天”. Now when the second character is processed, we still need to maintain the possible POS NN in the agenda, which predicts the three-character word “下雨天”. As a main contribution of this paper, we show that the mechanism ofpredicting the POS at the first character gives competitive accuracy. This mechanism can be justified theoretically. Unlike alphabetical languages, each Chinese character represents some specific meanings. Given a character, it is natural for a human speaker to know immediately what types of words it can start. The allows the knowledge of possible POS-tags of words that a character can start, using information about the character from the training data. Moreover, the POS of the previous words to the current word are also useful in deciding possible POS for the word.1 The mechanism of first-character decision of POS also boosts the efficiency, since the enumeration of POS is unecessary when a character is appended to the end of an existing word. As a result, the complexity of each processing step is reduce by half compared to a method without POS prediction. Finally, an intuitive way to represent the status of a partial word is using a flag explicitly, which means an early decision of the segmentation of the next incoming character. We take a simpler alternative approach, and treat every word as a partial word until the next incoming character is separated from the last character of this word. Before a word is confirmed as a full word, we only apply to it features that represent its current partial status, such as character bigrams, its starting character and its part-ofspeech, etc. Full word features, including the first and last characters of a word, are applied immediately after a word is confirmed as complete. An important component for our proposed system is the training process, which needs to ensure that the model scores a partial word with predicted POS properly. We use the averaged perceptron (Collins, 2002) for training, together with the “early update” mechanism of Collins and Roark (2004). Rather than updating the parameters after decoding is com- plete, the modified algorithm updates parameters at any processing step if the correct partial candidate falls out of the beam. In our experiments using the Chinese Treebank 1The next incoming characters are also a useful source of information for predicting the POS. However, our system achieved competitive accuracy with Z&C08; without such character lookahead features. 845 data, our system ran an order of magnitude faster than our Z&C08; system with little loss of accuracy. The accuracy of our system was competitive with other recent models. 2 Model and Feature Templates We use a linear model to score both partial and full candidate outputs. Given an input x, the score of a candidate output y is computed as: Score(y) = Φ(y) · where Φ(y) is the global feature vector extracted from y, and is the parameter vector of the model. Figure 1 shows the feature templates for the model, where templates 1 14 contain only segmentation information and templates 15 29 contain w~ , w~ – – both segmentation and POS information. Each template is instantiated according to the current character in the decoding process. Row “For” shows the conditions for template instantiation, where “s” indicates that the corresponding template is instantiated when the current character starts a new word, and “a” indicates that the corresponding template is instantiated when the current character does not start a new word. In the row for feature templates, w, t and c are used to represent a word, a POS-tag and a character, respectively. The subscripts are based on the current character, where w−1 represents the first word to the left of the current character, and p−2 represents the POS-tag on the second word to the left of the current character, and so on. As an example, feature template 1is instantiated when the current character starts a new word, and the resulting feature value is the word to the left of this character. start(w), end(w) and len(w) represent the first character, the last character and the length of word w, respectively. The length of a word is normalized to 16 if it is larger than 16. cat(c) represents the POS category of character c, which is the set of POS-tags seen on character c, as we used in Z&C08.; Given a partial or complete candidate y, its global feature vector Φ(y) is computed by instantiating all applicable feature templates from Table 1 for each character in y, according to whether or not the character is separated from the previous character. The feature templates are mostly taken from, or inspired by, the feature templates of Z&C08.; Templates 1, 2, 3, 4, 5, 8, 10, 12, 13, 14, 15, 19, 20, Feature templateFor 24, 27 and 29 concern complete word information, and they are used in the model to differentiate correct and incorrect output structures in the same way as our Z&C08; model. Templates 6, 7, 9, 16, 17, 18, 21, 22, 23, 25, 26 and 28 concern partial word information, whose role in the model is to indicate the likelihood that the partial word including the current character will become a correct full word. They act as guidance for the action to take for the cur846 function DECODE(sent, agenda): CLEAR(agenda) ADDITEM(agenda, “”) for index in [0..LEN(sent)]: for cand in agenda: new ← APPEND(cand, sent[index]) ADDITEM(agenda, new) for pos in TAGSET(): new ← SEP(cand, sent[index], pos) ADDITEM(agenda, new) agenda ← N-BEST(agenda) retaugrenn BEST(agenda) Figure 1: The incremental beam-search decoder. rent character according to the context, and are the crucial reason for the effectiveness of the algorithm with a small beam-size. 2.1 Decoding The decoding algorithm builds an output candidate incrementally, one character at a time. Each character can either be attached to the current word or separated as the start a new word. When the current character starts a new word, a POS-tag is assigned to the new word. An agenda is used by the decoder to keep the N-best candidates during the incremental process. Before decoding starts, the agenda is initialized with an empty sentence. When a character is processed, existing candidates are removed from the agenda and extended with the current character in all possible ways, and the N-best newly generated candidates are put back onto the agenda. After all input characters have been processed, the highest-scored candidate from the agenda is taken as the output. Pseudo code for the decoder is shown in Figure 1. CLEAR removes all items from the agenda, ADDITEM adds a new item onto the agenda, N-BEST returns the N highest-scored items from the agenda, and BEST returns the highest-scored item from the agenda. LEN returns the number of characters in a sentence, and sent[i] returns the ith character from the sentence. APPEND appends a character to the last word in a candidate, and SEP joins a character as the start of a new word in a candidate, assigning a POS-tag to the new word. Both our decoding algorithm and the decoding algorithm of Z&C08; run in linear time. However, in order to generate possible candidates for each character, Z&C08; uses an extra loop to search for possible words that end with the current character. A restriction to the maximum word length is applied to limit the number of iterations in this loop, without which the algorithm would have quadratic time complexity. In contrast, our decoder does not search backword for the possible starting character of any word. Segmentation ambiguities are resolved by binary choices between the actions append or separate for each character, and no POS enumeration is required when the character is appended. This improves the speed by a significant factor. 2.2 Training The learning algorithm is based on the generalized perceptron (Collins, 2002), but parameter adjustments can be performed at any character during the decoding process, using the “early update” mechanism of Collins and Roark (2004). The parameter vector of the model is initialized as all zeros before training, and used to decode training examples. Each training example is turned into the raw input format, and processed in the same way as decoding. After each character is processed, partial candidates in the agenda are compared to the corresponding gold-standard output for the same characters. If none of the candidates in the agenda are correct, the decoding is stopped and the parameter vector is updated by adding the global feature vector of the gold-standard partial output and subtracting the global feature vector of the highest-scored partial candidate in the agenda. The training process then moves on to the next example. However, if any item in the agenda is the same as the corresponding gold-standard, the decoding process moves to the next character, without any change to the parameter values. After all characters are processed, the decoder prediction is compared with the training example. If the prediction is correct, the parameter vector is not changed; otherwise it is updated by adding the global feature vector of the training example and subtracting the global feature vector of the decoder prediction, just as the perceptron algorithm does. The same training examples can be used to train the model for multiple iterations. We use 847 the averaged parameter vector (Collins, 2002) as the final model. Pseudocode for the training algorithm is shown in Figure 2. It is based on the decoding algorithm in Figure 1, and the main differences are: (1) the training algorithm takes the gold-standard output and the parameter vector as two additional arguments; (2) the training algorithm does not return a prediction, but modifies the parameter vector when necessary; (3) lines 11to 20 are additional lines of code for parameter updates. Without lines 11 to 16, the training algorithm is exactly the same as the generalized perceptron algorithm. These lines are added to ensure that the agenda contains highly probable candidates during the whole beam-search process, and they are crucial to the high accuracy of the system. As stated earlier, the decoder relies on proper scoring of partial words to maintain a set of high quality candidates in the agenda. Updating the value of the parameter vector for partial outputs can be seen as a means to ensure correct scoring of partial candidates at any character. 2.3 Pruning We follow Z&C08; and use several pruning methods, most of which serve to to improve the accuracy by removing irrelevant candidates from the beam. First, the system records the maximum number of characters that a word with a particular POS-tag can have. For example, from the Chinese Treebank that we used for our experiments, most POS are associated with only with one- or two-character words. The only POS-tags that are seen with words over ten characters long are NN (noun), NR (proper noun) and CD (numbers). The maximum word length information is initialized as all ones, and updated according to each training example before it is processed. Second, a tag dictionary is used to record POStags associated with each word. During decoding, frequent words and words with “closed set” tags2 are only allowed POS-tags according to the tag dictionary, while other words are allowed every POS-tag to make candidate outputs. Whether a word is a frequent word is decided by the number of times it has been seen in the training process. Denoting the num2“Closed set” tags are the set of POS-tags which are only associated with a fixed set of words, according to the Penn Chinese Treebank specifications (Xia, 2000). function TRAIN(sent, agenda, gold-standard, w~ ): 01: CLEAR(agenda) 02: ADDITEM(agenda, “”) 03: for index in [0..LEN(sent)]: 04: 05: 06: 07: 08: 09: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: for cand in agenda: new ← APPEND(cand, sent[index]) ADDITEM(agenda, new) for pos in TAGSET(): new ← SEP(cand, sent[index], pos) ADDITEM(agenda, new) agenda ← N-BEST(agenda) faogre cnadnad ← ←in agenda: if cand = gold-standard[0:index] : CONTINUE w~ ← w~ + Φ(gold-standard[0:index]) ww~~ ← ww~ ~ - Φ(BEST(agenda)) wr~et ←urn w~ if BEST(agenda) gold-standard: w~ ← a ~wg + Φ(gold-standard) ww~~ ← ww~ ~ - Φ(BEST(agenda)) wr~et ←urn w~ return = Figure 2: The incremental learning function. ber of times the most frequent word has been seen with M, a word is a frequent word if it has been seen more than M/5000 5 times. The threshold value is taken from Z&C08;, and we did not adjust it during development. Word frequencies are initialized as zeros and updated according to each training example before it is processed; the tag dictionary is initialized as empty and updated according to each training example before it is processed. Third, we make an additional record of the initial characters for words with “closed set” tags. During decoding, when the current character is added as the start of a new word, “closed set” tags are only assigned to the word if it is consistent with the record. This type of pruning is used in addition to the tag + dictionary to prune invalid partial words, while the tag dictionary is used to prune complete words. The record for initial character and POS is initially empty, and udpated according to each training example before it is processed. Finally, at any decoding step, we group partial 848 candidates that are generated by separating the current character as the start of a new word by the signature p0p−1w−1, and keep only the best among those having the same p0p−1w−1. The signature p0p−1w−1 is decided by the feature templates we use: it can be shown that if two candidates cand1 and cand2 generated at the same step have the same signature, and the score of cand1 is higher than the score of cand2, then at any future step, the highest scored candidate generated from cand1 will always have a higher score than the highest scored candidate generated from cand2. From the above pruning methods, only the third was not used by Z&C08.; It can be seen as an extra mechanism to help keep likely partial words in the agenda and improve the accuracy, but which does not give our system a speed advantage over Z&C08.; 3 Experiments We used the Chinese Treebank (CTB) data to perform one set of development tests and two sets of fi- Training iteration Figure 3: The influence of beam-sizes, and the convergence of the perceptron. nal tests. The CTB 4 was split into two parts, with the CTB 3 being used for a 10-fold cross validation test to compare speed and accuracies with Z&C08;, and the rest being used for development. The CTB 5 was used to perform the additional set of experiments to compare accuracies with other recent work. We use the standard F-measure to evaluate output accuracies. For word segmentation, precision is defined as the number of correctly segmented words divided by the total number of words in the output, and recall is defined as the number of correctly segmented words divided by the total number of words in the gold-standard output. For joint segmentation and POS-tagging, precision is defined as the number of correctly segmented and POS-tagged words divided by the total number of words from the output, and recall is defined as the correctly segmented and POS-tagged words divided by the total number of words in the gold-standard output. All our experiments were performed on a Linux platform, and a single 2.66GHz Intel Core 2 CPU. 3.1 Development tests Our development data consists of 150K words in 4798 sentences. 80% of the data were randomly chosen as the development training data, while the rest were used as the development test data. Our development tests were mainly used to decide the size ofthe beam, the number oftraining iterations, the ef- fect of partial features in beam-search decoding, and the effect of incremental learning (i.e. early update). 849 Figure 3 shows the accuracy curves for joint segmentation and POS-tagging by the number of training iterations, using different beam sizes. With the size of the beam increasing from 1to 32, the accuracies generally increase, while the amount of increase becomes small when the size of the beam becomes 16. After the 10th iteration, a beam size of 32 does not always give better accuracies than a beam size of 16. We therefore chose 16 as the size of the beam for our system. The testing times for each beam size between 1 and 32 are 7.16s, 11.90s, 18.42s, 27.82s, 46.77s and 89.21s, respectively. The corresponding speeds in the number of sentences per second are 111.45, 67.06, 43.32, 28.68, 17.06 and 8.95, respectively. Figure 3 also shows that the accuracy increases with an increased number of training iterations, but the amount of increase becomes small after the 25th iteration. We chose 29 as the number of iterations to train our system. The effect of incremental training: We compare the accuracies by incremental training using early update and normal perceptron training. In the normal perceptron training case, lines 11to 16 are taken out of the training algorithm in Figure 2. The algorithm reached the best performance at the 22nd iteration, with the segmentation F-score being 90.58% and joint F-score being 83.38%. In the incremental training case, the algorithm reached the best accuracy at the 30th training iteration, obtaining a segmentation F-score of 91.14% and a joint F-score of 84.06%. 3.2 Final tests using CTB 3 CTB 3 consists of 150K words in 10364 sentences. We follow Z&C08; and split it into 10 equal-sized parts. In each test, one part is taken as the test data and the other nine are combined together as the training data. We compare the speed and accuracy with the joint segmentor and tagger of Z&C08;, which is publicly available as the ZPar system, version 0.23. The results are shown in Table 2, where each row shows one cross validation test. The column head- ings “sf”, “jf”, “time” and “speed” refer to segmentation F-measure, joint F-measure, testing time (in 3http://www.sourceforge.net/projects/zpar #sZf&C08jftimespeed; tshfis papjefrtimespeed seconds) and testing speed (in the number of sentences per second), respectively. Our system gave a joint segmentation and POStagging F-score of 91.37%, which is only 0.04% lower than that of ZPar 0.2. The speed of our system was over 10 times as fast as ZPar 0.2. 3.3 Final tests using CTB 5 We follow Kruengkrai et al. (2009) and split the CTB 5 into training, development testing and testing sets, as shown in Table 3. We ignored the development test data since our system had been developed in previous experiments. Kruengkrai et al. (2009) made use of character type knowledge for spaces, numerals, symbols, alphabets, Chinese and other characters. In the previous experiments, our system did not use any knowledge beyond the training data. To make the comparison fairer, we included knowledge of English letters and Arabic numbers in this experiment. During both training and decoding, English letters and Arabic numbers are segmented using simple rules, treating consecutive English letters or Arabic numbers as a single word. The results are shown in Table 4, where row “N07” refers to the model of Nakagawa and Uchimoto (2007), rows “J08a” and “b” refer to the models of Jiang et al. (2008a) and Jiang et al. (2008b), and row “K09” refers to the models of Kruengkrai et al. (2009). Columns “sf” and “jf” refer to segmentation and joint accuracies, respectively. Our system 850 SectionsSentencesWords T Daerbsvltien3:gTrain14230i–7n021 g–71,3d90–21e035v 1elopm1e385n40t,8and5tes a648t,903o2n,18C92TB5. TJoKNab0ul8re79abs4(y:rtAesomcl-indurea)vycom9 pa7 r.i87s34o59n w3 i.t64h2710recntsudio sfjf CTB 5. gave comparable accuracies to these recent works, obtaining the best (same as the error-driven version of K09) joint F-score. 4 Related Work The effectiveness of our beam-search decoder showed that the joint segmentation and tagging problem may be less complex than previously perceived (Zhang and Clark, 2008; Jiang et al., 2008a). At the very least, the single model approach with a simple decoder achieved competitive accuracies to what has been achieved so far by the reranking (Shi and Wang, 2007; Jiang et al., 2008b) models and an ensemble model using machine-translation techniques (Jiang et al., 2008a). This may shed new light on joint segmentation and POS-tagging methods. Kruengkrai et al. (2009) and Zhang and Clark (2008) are the most similar to our system among related work. Both systems use a discriminatively trained linear model to score candidate outputs. The work of Kruengkrai et al. (2009) is based on Nakagawa and Uchimoto (2007), which separates the processing of known words and unknown words, and uses a set of segmentation tags to represent the segmentation of characters. In contrast, our model is conceptually simpler, and does not differentiate known words and unknown words. Moreover, our model is based on our previous work, in line with Zhang and Clark (2007), which does not treat word segmentation as character sequence labeling. Our learning and decoding algorithms are also different from Kruengkrai et al. (2009). While Kruengkrai et al. (2009) perform dynamic programming and MIRA learning, we use beam-search to perform incremental decoding, and the early-update version of the perceptron algorithm to train the model. Dynamic programming is exact inference, for which the time complexity is decided by the locality of feature templates. In contrast, beam-search is approximate and can run in linear time. The parameter updating for our algorithm is conceptually and computationally simpler than MIRA, though its performance can be slightly lower. However, the earlyupdate mechanism we use is consistent with our incremental approach, and improves the learning of the beam-search process. 5 Conclusion We showed that a simple beam-search decoding algorithm can be effectively applied to the decoding problem for a global linear model for joint word segmentation and POS-tagging. By guiding search with partial word information and performing learning for partial candidates, our system achieved sig- nificantly faster speed with little accuracy loss compared to the system of Z&C08.; The source code of our joint segmentor and POStagger can be found at: www.sourceforge.net/projects/zpar, version 0.4. 851 Acknowledgements We thank Canasai Kruengkrai for discussion on efficiency issues, and the anonymous reviewers for their suggestions. Yue Zhang and Stephen Clark are supported by the European Union Seventh Framework Programme (FP7-ICT-2009-4) under grant agreement no. 247762. References Eugene Charniak, Mark Johnson, Micha Elsner, Joseph Austerweil, David Ellis, Isaac Haxton, Catherine Hill, R. Shrivaths, Jeremy Moore, Michael Pozar, and Theresa Vu. 2006. Multilevel coarse-to-fine PCFG parsing. In Proceedings of HLT/NAACL, pages 168– 175, New York City, USA, June. Association for Computational Linguistics. Michael Collins and Brian Roark. 2004. Incremental parsing with the perceptron algorithm. In Proceedings of ACL, pages 111–1 18, Barcelona, Spain, July. Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of EMNLP, pages 1–8, Philadelphia, USA, July. Wenbin Jiang, Liang Huang, Qun Liu, and Yajuan L u¨. 2008a. A cascaded linear model for joint Chinese word segmentation and part-of-speech tagging. In Proceedings of ACL/HLT, pages 897–904, Columbus, Ohio, June. Wenbin Jiang, Haitao Mi, and Qun Liu. 2008b. Word lattice reranking for Chinese word segmentation and part-of-speech tagging. In Proceedings of COLING, pages 385–392, Manchester, UK, August. Canasai Kruengkrai, Kiyotaka Uchimoto, Jun’ichi Kazama, Yiou Wang, Kentaro Torisawa, and Hitoshi Isahara. 2009. An error-driven word-character hybrid model for joint Chinese word segmentation and POS tagging. In Proceedings of ACL/AFNLP, pages 5 13– 521, Suntec, Singapore, August. Tetsuji Nakagawa and Kiyotaka Uchimoto. 2007. A hybrid approach to word segmentation and POS tagging. In Proceedings of ACL Demo and Poster Session, Prague, Czech Republic, June. Hwee Tou Ng and Jin Kiat Low. 2004. Chinese part-ofspeech tagging: One-at-a-time or all-at-once? word- based or character-based? In Proceedings of EMNLP, Barcelona, Spain. Brian Roark and Kristy Hollingshead. 2008. Classifying chart cells for quadratic complexity context-free inference. In Proceedings of COLING, pages 745– 752, Manchester, UK, August. Coling 2008 Organizing Committee. Yanxin Shi and Mengqiu Wang. 2007. A dual-layer CRF based joint decoding method for cascade segmentation and labelling tasks. In Proceedings of IJCAI, Hyderabad, India. Fei Xia, 2000. The part-of-speech tagging guidelines for the Chinese Treebank (3.0). Yue Zhang and Stephen Clark. 2007. Chinese segmentation with a word-based perceptron algorithm. In Proceedings of ACL, pages 840–847, Prague, Czech Republic, June. Yue Zhang and Stephen Clark. 2008. Joint word segmentation and POS tagging using a single perceptron. In Proceedings of ACL/HLT, pages 888–896, Columbus, Ohio, June. 852

3 emnlp-2010-A Fast Fertility Hidden Markov Model for Word Alignment Using MCMC

Author: Shaojun Zhao ; Daniel Gildea

Abstract: A word in one language can be translated to zero, one, or several words in other languages. Using word fertility features has been shown to be useful in building word alignment models for statistical machine translation. We built a fertility hidden Markov model by adding fertility to the hidden Markov model. This model not only achieves lower alignment error rate than the hidden Markov model, but also runs faster. It is similar in some ways to IBM Model 4, but is much easier to understand. We use Gibbs sampling for parameter estimation, which is more principled than the neighborhood method used in IBM Model 4.

4 emnlp-2010-A Game-Theoretic Approach to Generating Spatial Descriptions

Author: Dave Golland ; Percy Liang ; Dan Klein

Abstract: Language is sensitive to both semantic and pragmatic effects. To capture both effects, we model language use as a cooperative game between two players: a speaker, who generates an utterance, and a listener, who responds with an action. Specifically, we consider the task of generating spatial references to objects, wherein the listener must accurately identify an object described by the speaker. We show that a speaker model that acts optimally with respect to an explicit, embedded listener model substantially outperforms one that is trained to directly generate spatial descriptions.

5 emnlp-2010-A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages

Author: Minh-Thang Luong ; Preslav Nakov ; Min-Yen Kan

Abstract: We propose a language-independent approach for improving statistical machine translation for morphologically rich languages using a hybrid morpheme-word representation where the basic unit of translation is the morpheme, but word boundaries are respected at all stages of the translation process. Our model extends the classic phrase-based model by means of (1) word boundary-aware morpheme-level phrase extraction, (2) minimum error-rate training for a morpheme-level translation model using word-level BLEU, and (3) joint scoring with morpheme- and word-level language models. Further improvements are achieved by combining our model with the classic one. The evaluation on English to Finnish using Europarl (714K sentence pairs; 15.5M English words) shows statistically significant improvements over the classic model based on BLEU and human judgments.

6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation

Author: Jacob Eisenstein ; Brendan O'Connor ; Noah A. Smith ; Eric P. Xing

Abstract: The rapid growth of geotagged social media raises new computational possibilities for investigating geographic linguistic variation. In this paper, we present a multi-level generative model that reasons jointly about latent topics and geographical regions. High-level topics such as “sports” or “entertainment” are rendered differently in each geographic region, revealing topic-specific regional distinctions. Applied to a new dataset of geotagged microblogs, our model recovers coherent topics and their regional variants, while identifying geographic areas of linguistic consistency. The model also enables prediction of an author’s geographic location from raw text, outperforming both text regression and supervised topic models.

7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics

Author: Joseph Reisinger ; Raymond Mooney

Abstract: We introduce tiered clustering, a mixture model capable of accounting for varying degrees of shared (context-independent) feature structure, and demonstrate its applicability to inferring distributed representations of word meaning. Common tasks in lexical semantics such as word relatedness or selectional preference can benefit from modeling such structure: Polysemous word usage is often governed by some common background metaphoric usage (e.g. the senses of line or run), and likewise modeling the selectional preference of verbs relies on identifying commonalities shared by their typical arguments. Tiered clustering can also be viewed as a form of soft feature selection, where features that do not contribute meaningfully to the clustering can be excluded. We demonstrate the applicability of tiered clustering, highlighting particular cases where modeling shared structure is beneficial and where it can be detrimental.

8 emnlp-2010-A Multi-Pass Sieve for Coreference Resolution

Author: Karthik Raghunathan ; Heeyoung Lee ; Sudarshan Rangarajan ; Nate Chambers ; Mihai Surdeanu ; Dan Jurafsky ; Christopher Manning

Abstract: Most coreference resolution models determine if two mentions are coreferent using a single function over a set of constraints or features. This approach can lead to incorrect decisions as lower precision features often overwhelm the smaller number of high precision ones. To overcome this problem, we propose a simple coreference architecture based on a sieve that applies tiers of deterministic coreference models one at a time from highest to lowest precision. Each tier builds on the previous tier’s entity cluster output. Further, our model propagates global information by sharing attributes (e.g., gender and number) across mentions in the same cluster. This cautious sieve guarantees that stronger features are given precedence over weaker ones and that each decision is made using all of the information available at the time. The framework is highly modular: new coreference modules can be plugged in without any change to the other modules. In spite of its simplicity, our approach outperforms many state-of-the-art supervised and unsupervised models on several standard corpora. This suggests that sievebased approaches could be applied to other NLP tasks.

9 emnlp-2010-A New Approach to Lexical Disambiguation of Arabic Text

Author: Rushin Shah ; Paramveer S. Dhillon ; Mark Liberman ; Dean Foster ; Mohamed Maamouri ; Lyle Ungar

Abstract: We describe a model for the lexical analysis of Arabic text, using the lists of alternatives supplied by a broad-coverage morphological analyzer, SAMA, which include stable lemma IDs that correspond to combinations of broad word sense categories and POS tags. We break down each of the hundreds of thousands of possible lexical labels into its constituent elements, including lemma ID and part-of-speech. Features are computed for each lexical token based on its local and document-level context and used in a novel, simple, and highly efficient two-stage supervised machine learning algorithm that over- comes the extreme sparsity of label distribution in the training data. The resulting system achieves accuracy of 90.6% for its first choice, and 96.2% for its top two choices, in selecting among the alternatives provided by the SAMA lexical analyzer. We have successfully used this system in applications such as an online reading helper for intermediate learners of the Arabic language, and a tool for improving the productivity of Arabic Treebank annotators. 1 Background and Motivation This paper presents a methodology for generating high quality lexical analysis of highly inflected languages, and demonstrates excellent performance applying our approach to Arabic. Lexical analysis of the written form of a language involves resolving, explicitly or implicitly, several different kinds ofambiguities. Unfortunately, the usual ways of talking about this process are also ambiguous, and our general approach to the problem, though not unprecedented, has uncommon aspects. Therefore, in order 725 Paramveer S. Dhillon, Mark Liberman, Dean Foster, Mohamed Maamouri and Lyle Ungar University of Pennsylvania 345 1Walnut Street Philadelphia, PA 19104, USA {dhi l lon | myl | ungar} @ cis .upenn .edu floo snt|emry@lw|huanrgta on .upenn .eednun maamouri @ ldc .upenn .edu , , to avoid confusion, we begin by describing how we define the problem. In an inflected language with an alphabetic writing system, a central issue is how to interpret strings of characters as forms of words. For example, the English letter-string ‘winds’ will normally be interpreted in one of four different ways, all four of which involve the sequence of two formatives wind+s. The stem ‘wind’ might be analyzed as (1) a noun meaning something like “air in motion”, pronounced [wInd] , which we can associate with an arbitrary but stable identifier like wind n1; (2) a verb wind v1 derived from that noun, and pronounced the same way; (3) a verb wind v2 meaning something like “(cause to) twist”, pronounced [waInd]; or (4) a noun wind n2 derived from that verb, and pro- nounced the same way. Each of these “lemmas”, or dictionary entries, will have several distinguishable senses, which we may also wish to associate with stable identifiers. The affix ‘-s’ might be analyzed as the plural inflection, if the stem is a noun; or as the third-person singular inflection, if the stem is a verb. We see this analysis as conceptually divided into four parts: 1) Morphological analysis, which recognizes that the letter-string ‘winds’ might be (perhaps among other things) wind/N s/PLURAL or wind/V s/3SING; 2) Morphological disambiguation, which involves deciding, for example, that in the phrase “the four winds”, ‘winds’ is probably a plural noun, i.e. wind/N s/PLURAL; 3) Lemma analysis, which involves recognizing that the stem wind in ‘winds’ might be any of the four lemmas listed above – perhaps with a further listing of senses or other sub-entries for each of them; and 4) Lemma disambiguation, deciding, for example, that + + + ProceMedITin,g Ms oasfs thaceh 2u0se1t0ts C,o UnSfAer,e n9c-e11 on O Ectmobpeir ic 2a0l1 M0.e ?tc ho2d0s10 in A Nsastouciraatlio Lnan fogru Cagoem Ppruotcaetisosninagl, L pinaggeusis 7t2ic5s–735, the phrase “the four winds” probably involves the lemma wind n1. Confusingly, the standard word-analysis tasks in computational linguistics involve various combinations of pieces of these logically-distinguished operations. Thus, “part of speech (POS) tagging” is mainly what we’ve called “morphological disambiguation”, except that it doesn’t necessarily require identifying the specific stems and affixes involved. In some cases, it also may require a small amount of “lemma disambiguation”, for example to distinguish a proper noun from a common noun. “Sense disambiguation” is basically a form of what we’ve called “lemma disambiguation”, except that the sense disambiguation task may assume that the part of speech is known, and may break down lexical identity more finely than our system happens to do. “Lemmatization” generally refers to a radically simplified form of “lemma analysis” and “lemma disambiguation”, where the goal is simply to collapse different inflected forms of any similarly-spelled stems, so that the strings ‘wind’, ‘winds’, ‘winded’, ‘winding’ will all be treated as instances of the same thing, without in fact making any attempt to determine the identity of “lemmas” in the traditional sense of dictionary entries. Linguists use the term morphology to include all aspects of lexical analysis under discussion here. But in most computational applications, “morphological analysis” does not include the disambiguation of lemmas, because most morphological analyzers do not reference a set of stable lemma IDs. So for the purposes of this paper, we will continue to discuss lemma analysis and disambiguation as conceptually distinct from morphological analysis and disambiguation, although, in fact, our system disambiguates both of these aspects of lexical analysis at the same time. The lexical analysis of textual character-strings is a more complex and consequential problem in Arabic than it is in English, for several reasons. First, Arabic inflectional morphology is more complex than English inflectional morphology is. Where an English verb has five basic forms, for example, an Arabic verb in principle may have dozens. Second, the Arabic orthographic system writes elements such as prepositions, articles, and possessive pronouns without setting them off by spaces, roughly 726 as if the English phrase “in a way” were written “inaway”. This leads to an enormous increase in the number of distinct “orthographic words”, and a substantial increase in ambiguity. Third, short vowels are normally omitted in Arabic text, roughly as if English “in a way” were written “nway”. As a result, a whitespace/punctuation-delimited letter-string in Arabic text typically has many more alternative analyses than a comparable English letter-string does, and these analyses have many more parts, drawn from a much larger vocabulary of form-classes. While an English “tagger” can specify the morphosyntactic status of a word by choosing from a few dozen tags, an equivalent level of detail in Arabic would require thousands of alternatives. Similarly, the number of lemmas that might play a role in a given letter-sequence is generally much larger in Arabic than in English. We start our labeling of Arabic text with the alternative analyses provided by SAMA v. 3.1, the Standard Arabic Morphological Analyzer (Maamouri et al., 2009). SAMA is an updated version of the earlier Buckwalter analyzers (Buckwalter, 2004), with a number of significant differences in analysis to make it compatible with the LDC Arabic Treebank 3-v3.2 (Maamouri et al., 2004). The input to SAMA is an Arabic orthographic word (a string of letters delimited by whitespace or punctuation), and the output of SAMA is a set of alternative analyses, as shown in Table 1. For a typical word, SAMA produces approximately a dozen alternative analyses, but for certain highly ambiguous words it can produce hundreds of alternatives. The SAMA analyzer has good coverage; for typical texts, the correct analysis of an orthographic word can be found somewhere in SAMA’s list of alternatives about 95% of the time. However, this broad coverage comes at a cost; the list of analytic alternatives must include a long Zipfian tail of rare or contextually-implausible analyses, which collectively are correct often enough to make a large contribution to the coverage statistics. Furthermore, SAMA’s long lists of alternative analyses are not evaluated or ordered in terms of overall or contextual plausibility. This makes the results less useful in most practical applications. Our goal is to rank these alternative analyses so that the correct answer is as near to the top of the list as possible. Despite some risk of confusion, we’ll refer to SAMA’s list of alternative analyses for an orthographic word as potential labels for that word. And despite a greater risk ofconfusion, we’ll refer to the assignment of probabilities to the set of SAMA labels for a particular Arabic word in a particular textual context as tagging, by analogy to the operation of a stochastic part-of-speech tagger, which similarly assigns probabilities to the set of labels available for a word in textual context. Although our algorithms have been developed for the particular case of Arabic and the particular set of lexical-analysis labels produced by SAMA, they should be applicable without modification to the sets of labels produced by any broad-coverage lexical analyzer for the orthographic words of any highlyinflected language. In choosing our approach, we have been moti- vated by two specific applications. One application aims to help learners of Arabic in reading text, by offering a choice of English glosses with associated Arabic morphological analyses and vocalizations. SAMA’s excellent coverage is an important basis for this help; but SAMA’s long, unranked list of alternative analyses for a particular letter-string, where many analyses may involve rare words or alternatives that are completely implausible in the context, will be confusing at best for a learner. It is much more helpful for the list to be ranked so that the correct answer is almost always near the top, and is usually one of the top two or three alternatives. In our second application, this same sort of ranking is also helpful for the linguistically expert native speakers who do Arabic Treebank analysis. These 727 annotators understand the text without difficulty, but find it time-consuming and fatiguing to scan a long list of rare or contextually-implausible alternatives for the correct SAMA output. Their work is faster and more accurate if they start with a list that is ranked accurately in order of contextual plausibility. Other applications are also possible, such as vocalization of Arabic text for text-to-speech synthesis, or lexical analysis for Arabic parsing. However, our initial goals have been to rank the list of SAMA outputs for human users. We note in passing that the existence of set of stable “lemma IDs” is an unusual feature of SAMA, which in our opinion ought to be emulated by approaches to lexical analysis in other languages. The lack of such stable lemma IDs has helped to disguise the fact that without lemma analysis and disambiguation, morphological analyses and disambiguation is only a partial solution to the problem of lexical analysis. In principle, it is obvious that lemma disambiguation and morphological disambiguation are mutually beneficial. If we know the answer to one of the questions, the other one is easier to answer. However, these two tasks require rather different sets of contextual features. Lemma disambiguation is similar to the problem of word-sense disambiguation on some definitions, they are identical and as a result, it benefits from paragraph-level and documentlevel bag-of-words attributes that help to character– – ize what the text is “about” and therefore which lemmas are more likely to play a role in it. In contrast, morphological disambiguation mainly depends on features of nearby words, which help to characterize how inflected forms of these lemmas might fit into local phrasal structures. 2 Problem and Methodology Consider a collection oftokens (observations), ti, referred to by index i∈ {1, . . . , n}, where each token fise raressdo tcoia bteyd i nwdiethx a s∈et { of p features, xij, efaocr hth teo k jethn feature, and a label, li, which is a combination of a lemma and a morphological analysis. We use indicator functions yik to indicate whether or not the kth label for the ith token is present. We represent the complete set of features and labels for the entire training data using matrix notation as X and Y , respectively. Our goal is to predict the label l (or equivalently, the vector y for a given feature vector x. A standard linear regression model of this problem would be y = xβ + ? (1) The standard linear regression estimate of β (ig- ×× × noring, for simplicity the fact that the ys are 0/1) is: βˆ = (XTtrainXtrain)−1XtTrainYtrain (2) where Ytrain is an n h matrix containing 0s and 1s indicating whise tahner n or nho mt aetarcixh coofn tthaien ihn possible labels is the correct label (li) for each of the n tokens ti, Xtrain is an n p matrix of context features for each of thei n tokens, pth mea ctoriexff oifcie cnotnst are p hs .f However, this is a large, sparse, multiple l hab.el problem, and the above formulation is neither statistically nor computationally efficient. Each observation (x, y) consists of thousands of features associated with thousands of potential labels, almost all of which are zero. Worse, the matrix of coefficients β, to be estimated is large (p h) and one should thus use some soatretd do ifs tr laarngsefe (pr learning dto o nshea srheo strength across the different labels. We present a novel principled and highly computationally efficient method of estimating this multilabel model. We use a two stage procedure, first using a subset (Xtrain1 , Ytrain1) of training data to give a fast approximate estimate of β; we then use a second smaller subset of the training data (Xtrain2, Ytrain2,) to “correct” these estimates in a eβˆx way that we will show can be viewed as a specialized shrinkage. Our first stage estimation approximates β, but avoids the expensive computa728 tion of (XTtrainXtrain)−1. Our second stage corrects (shrinks) these initial estimates in a manner specialized to this problem. The second stage takes advantage of the fact that we only need to consider those candidate labels produced by SAMA. Thus, only dozens of the thousands of possible labels are considered for each token. We now present our algorithm. We start with a corpus D of documents d of labeled Arabic text. As described above, each token, ti is associated with a set of features characterizing its context, computed from the other words in the same document, and a label, li = (lemmai, morphologyi), which is a combination of a lemma and a morphological analysis. As described below, we introduce a novel factorization of the morphology into 15 different components. Our estimation algorithm, shown in Algorithm 1, has two stages. We partition the training corpus into × two subsets, one of which (Xtrain1) is used to estimate the coefficients βs and the other of which (Xtrain2) is used to optimally “shrink” these coefficient estimates to reduce variance and prevent overfitting due to data sparsity. For the first stage of our estimation procedure, we simplify the estimate of the (β) matrix (Equation 2) to avoid the inversion of the very high dimensional (p p) matrix (XTX) by approximating (XTX) by (itps diagonal, Var(X), the inverse of which is trivial to compute; i.e. we estimate β using βˆ = Var(Xtrain1)−1XtTrain1Ytrain1 (3) For the second stage, we assume that the coefficients for each feature can be shrunk differently, but that coefficients for each feature should be shrunk the same regardless of what label they are predicting. Thus, for a given observation we predict: ˆgik=Xpwjβˆjkxij (4) Xj=1 where the weights wj indicate how much to shrink each of the p features. In practice, we fold the variance of each of the j features into the weight, giving a slightly modified equation: ˆgik=Xj=p1αjβj∗kxij (5) where β∗ = XtTrain1Ytrain1 is just a matrix of the counts of how often each context feature shows up with each label in the first training set. The vector α, which we will estimate by regression, is just the shrinkage weights w rescaled by the feature variance. Note that the formation here is different from the first stage. Instead of having each observation be a token, we now let each observation be a (token, label) pair, but only include those labels that were output by SAMA. For a given token ti and potential label lk, our goal is to approximate the indicator function g(i, k), which is 1 if the kth label of token ti is present, and 0 otherwise. We find candidate labels using a morphological analyzer (namely SAMA), which returns a set of possible candidate labels, say C(t), for each Arabic token t. Our pre- dicted label for ti is then argmaxk∈C(ti)g(i, k). The regression model for learning tthe weights αj in the second stage thus has a row for each label g(i, k) associated with a SAMA candidate for each token i = ntrain1+1 . . . ntrain2 in the second training set. The value of g(i, k) is predicted as a function of the feature vector zijk = βj∗kxij. The shrinkage coefficients, αj, could be estimated from theory, using a version of James-Stein shrinkage (James and Stein, 1961), but in practice, superior results are obtained by estimating them empirically. Since there are only p of them (unlike the p ∗ h βs), a relatively asmreal oln training sheetm mis ( usunflfi kceie tnhte. Wp ∗e hfou βnsd), that regression-SVMs work slightly better than linear regression and significantly better than standard classification SVMs for this problem. Prediction is then done in the obvious way by taking the tokens in a test corpus Dtest, generating context features and candidate SAMA labels for each token ti, and selected the candidate label with the highest score ˆ g(i, k) that we set out to learn. More formally, The model parameters β∗ and α produced by the algorithm allow one to estimate the most likely label for a new token ti out of a set of can- didate labels C(ti) using kpred= argmaxk∈C(ti)jX=p1αjβj∗kxij (6) The most expensive part of the procedure is estimating β∗, which requires for each token in cor729 Algorithm 1 Training algorithm. Input: A training corpusDtrainof n observations (Xtrain, Ytrain) Partition Dtrain into two sets, D1 and D2, of sizes ntrain1 and ntrain2 = n − ntrain1 observations // Using D1, estimat=e β∗ βj∗k = Pin=tr1ain1 xijyik for the jth feature and kth label // Using D2, estimate αj // Generate new “features” Z and the true labels g(i, k) for each of the SAMA candidate labels for each of the tokens in D2 zijk = βj∗kxij for iin i= ntrain1 + 1...ntrain2 Estimate αj for the above (feature,label) pairs (zijk, g(i, k)) using Regression SVMs Output: α and β∗ pus D1, (a subset of D), finding the co-occurrence frequencies of each label element (a lemma, or a part of the morphological segmentation) with the target token and jointly with the token and with other tokens or characters in the context of the token of interest. For example, given an Arabic token, “yHlm”, we count what fraction of the time it is associated with each lemma (e.g. Halamu 1), count(lemma=Halam-u 1, token=yHlm) and each segment (e.g. “ya”), count(segment=ya, token=yHlm). (Of course, most tokens never show up with most lemmas or segments; this is not a problem.) We also find the base rates of the components of the labels (e.g., count(lemma=Halam-u 1), and what fraction of the time the label shows up in various contexts, e.g. count(lemma=Halam-u 1, previous token = yHlm). We describe these features in more detail below. 3 Features and Labels used for Training Our approach to tagging Arabic differs from conventional approaches in the two-part shrinkage-based method used, and in the choice of both features and labels used in our model. For features, we study both local context variables, as described above, and document-level word frequencies. For the labels, the key question is what labels are included and how they are factored. Standard “taggers” work by doing an n-way classification of all the alternatives, which is not feasible here due to the thousands of possible labels. Standard approaches such as Conditional Random Fields (CRFs) are intractable with so many labels. Moreover, few if any taggers do any lemma disambiguation; that is partly because one must start with some standard inventory of lemmas, which are not available for most languages, perhaps because the importance of lemma disambiguation has been underestimated. We make a couple of innovations to deal with these issues. First, we perform lemma disambiguation in addition to “tagging”. As mentioned above, lemmas and morphological information are not independent; the choice of lemma often influences morphology and vice versa. For example, Table 1 contains two analyses for the word qbl. For the first analysis, where the lemma is qabil-a 1 and the gloss is accept/receive/approve + he/it [verb], the word is a verb. However, for the second analysis, where the lemma is qabol 1 and the gloss is before, the word is a noun. Simultaneous lemma disambiguation and tagging introduces additional complexity: An analysis of ATB and SAMA shows that there are approximately 2,200 possible morphological analyses (“tags”) and 40,000 possible lemmas; even accounting for the fact that most combinations of lemmas and morphological analyses don’t occur, the size of the label space is still in the order of tens of thousands. To deal with data sparsity, our second innovation is to factor the labels. We factor each label linto a set of 16 label elements (LEs). These include lemmas, as well as morphological elements such as basic partof-speech, suffix, gender, number, mood, etc. These are explained in detail below. Thus, since each label l is a set of 15 categorical variables, each y in the first learning stage is actually a vector with 16 nonzero components and thousands of zeros. Since we do simultaneous estimation of the entire set of label elements, the value g(i, k) being predicted in the second learning phase is 1 if the entire label set is correct, and zero otherwise. We do not learn separate models for each label. 3.1 Label Elements (LEs) The fact that there are tens of thousands of possible labels presents the problem of extreme sparsity of label distribution in the training data. We find that a model that estimates coefficients β∗ to predict a sin730 data on basic POS include whether a noun is proper or common, whether a verb is transitive or not, etc. Both the basic POS and its suffix may have person, gender and number data. gle label (a label being in the Cartesian product of the set of label elements) yields poor performance. Therefore, as just mentioned, we factor each label l into a set of label elements (LEs), and learn the correlations β∗ between features and label elements, rather than features and entire label sets. This reduces, but does not come close to eliminating, the problem sparsity. A complete list of these LEs and their possible values is detailed in Table 2. 3.2 Features 3.2.1 Local Context Features We take (t, l) pairs from D2, and for each such pair generate features Z based on co-occurrence statistics β∗ in D1, as mentioned in Algorithm 2. These statistics include unigram co-occurrence frequencies of each label with the target token and bigram co-occurrence of the label with the token and with other tokens or characters in the context of the target token. We define them formally in Table 3. Let Zbaseline denote the set of all such basic features based on the local context statistics of the target token, namely the words and letters preceding and following it. We will use this set to create a baseline model. generate feature sets for our regression SVMs. For each label element (LE) e, we define a set of features Ze similar to Zbaseline; these features are based on co-occurrence frequencies of the particular LE e, not the entire label l. Finally, we define an aggregate feature set Zaggr as follows: Zaggr = Zbaseline [ {Ze} (7) where e ∈ {lemma, pre1, pre2, det, pos, dpos, suf, perpos, numpos, genpos, persuf, numsuf, gensuf, mood, pron}. 3.2.2 Document Level Features When trying to predict the lemma, it is useful to include not just the words and characters immediately adjacent to the target token, but also the all the words in the document. These words capture the “topic” of the document, and help to disambiguate different lemmas, which tend to be used or not used based on the topic being discussed, similarly to the way that word sense disambiguation systems in English sometimes use the “bag of words” the document to disambiguate, for example a “bank” for depositing money from a “bank” of a river. More precisely, we augment the features for each target token with the counts of each word in the document (the “term frequency” tf) in which the token occurs with a given label. Zfull = Zaggr [ Ztf (8) This set Zfull is our final feature set. We use Zfull to train an SVM model Mfull; this is our final predictive model. 731 3.3 Corpora used for Training and Testing We use three modules of the Penn Arabic Treebank (ATB) (Maamouri et al., 2004), namely ATB 1, ATB2 and ATB3 as our corpus of labeled Arabic text, D. Each ATB module is a collection of newswire data from a particular agency. ATB1 uses the Associated Press as a source, ATB2 uses Ummah, and ATB3 uses Annahar. D contains a total of 1,835 documents, accounting for approximately 350,000 words. We construct the training and testing sets Dtrain and Dtest from D using 10-fold cross validation, and we construct D1 and D2 from Dtrain by randomly performing a 9: 1 split. As mentioned earlier, we use the SAMA morphological analyzer to obtain candidate labels C(t) for each token t while training and testing an SVM model on D2 and Dtest respectively. A sample output of SAMA is shown in Table 1. To improve coverage, we also add to C(t) all the labels lseen for t in D1. We find that doing so improves coverage to 98% . This is an upper bound on the accuracy of our model. C(t) = SAMA(t) 4 [ {l|(t, l) ∈ D1} (9) Results We use two metrics of accuracy: A1, which measures the percentage of tokens for which the model assigns the highest score to the correct label or LE value (or E1= 100 A1, the corresponding percentage error), 1a=nd 1 A2, wAh1i,ch th measures tnhdei percentage of tokens for which the correct label or LE value is one of the two highest ranked choices returned by the model (or E2 = 100 A2). We test our bmyod theel Mfull on Dtest a =nd 1 a0c0hi −eve A A2)1. and A2 scores of 90.6% and 96.2% respectively. The accuracy achieved by our Mfull model is, to the best of our knowledge, higher than prior approaches have been able to achieve so far for the problem of combined morphological and lemma disambiguation. This is all the more impressive considering that the upper bound on accuracy for our model is 98% because, as described above, our set of candidate labels is incomplete. In order to analyze how well different LEs can be predicted, we train an SVM model Me for each LE e using the feature set Ze, and test all such models − − on Dtest. The results for all the LEs are reported in the form of error percentages E1 and E2 in Table 4. reported are 10 fold cross validation test accuracies and no parameters have been tuned on them. A comparison of the results for Mfull with the results for Mlemma and Mpos is particularly informative. We see that Mfull is able to achieve a substantially lower E1 error score (9.4%) than Mlemma (11.1%) and Mpos (23.4%); in other words, we find that our full model is able to predict lemmas and basic parts-of-speech more accurately than the individ- ual models for each of these elements. We examine the effect of varying the size of D2, i.e. the number of SVM training instances, on the performance of Mfull on Dtest, and find that with increasing sizes of D2, E1 reduces only slightly from 9.5% to 9.4%, and shows no improvement thereafter. We also find that the use of documentlevel features in Mlemma reduces E1 and E2 percentages for Mlemma by 5.7% and 3.2% respectively. 4.1 Comparison to Alternate Approaches 4.1.1 Structured Prediction Models Preliminary experiments showed that knowing the predicted labels (lemma + morphology) of the surrounding words can slightly improve the predictive accuracy of our model. To further investigate this effect, we tried running experiments using different structured models, namely CRF (Conditional Random Fields) (Lafferty et al., 2001), (Structured) MIRA (Margin Infused Relaxation Algorithm) (Crammer et al., 2006) and Structured Perceptron (Collins, 2002). We used linear chain 732 CRFs as implemented in MALLET Toolbox (McCallum, 2001) and for Structured MIRA and Perceptron we used their implementations from EDLIN Toolbox (Ganchev and Georgiev, 2009). However, given the vast label space of our problem, running these methods proved infeasible. The time complexity of these methods scales badly with the number of labels; It took a week to train a linear chain CRF for only ∼ 50 labels and though MIRA and Perceptron are o 5n0lin leab algorithms, they MalsIoR Abec aonmde P ienr-tractable beyond a few hundred labels. Since our label space contains combinations of lemmas and morphologies, so even after factoring, the dimension of the label space is in the order of thousands. We also tried a na¨ ıve version (two-pass approximation) of these structured models. In addition to the features in Zfull, we include the predicted labels for the tokens preceding and following the target token as features. This new model is not only slow to train, but also achieves only slightly lower error rates (1.2% lower E1 and 1.0% lower E2) than Mfull. This provides an upper bound on the benefit of using the more complex structured models, and suggests that given their computational demands our (unstructured) model Mfull is a better choice. 4.1.2 MADA (Habash and Rambow, 2005) perform morphological disambiguation using a morphological analyzer. (Roth et al., 2008) augment this with lemma disambiguation; they call their system MADA. Our work differs from theirs in a number of respects. Firstly, they don’t use the two step regression procedure that we use. Secondly, they use only “unigram” features. Also, they do not learn a single model from a feature set based on labels and LEs; instead, they combine models for individual elements by using weighted agreement. We trained and tested MADA v2.32 using its full feature set on the same Dtrain and Dtest. We should point out that this is not an exact comparison, since MADA uses the older Buckwalter morphological analyzer.1 4.1.3 Other Alternatives Unfactored Labels: To illustrate the benefit obtained by breaking down each label l into 1A new version of MADA was released very close to the submission deadline for this conference. LEs, we contrast the performance of our Mfull model to an SVM model Mbaseline trained using only the feature set Zbaseline, which only contains features based on entire labels, those based on individual LEs. Independent lemma and morphology prediction: Another alternative approach is to predict lemmas and morphological analyses separately. We construct a feature set Zlemma0 = Zfull − Zlemma and train an SVM model Mlemma0 using this feature set. Labels are then predicted by simply combining the results predicted independently by Mlemma and Mlemma0 . Let Mind denote this approach. Unigram Features: Finally, we also consider a context-less approach, i.e. using only “unigram” features for labels as well as LEs. We call this feature set Zuni, and the corresponding SVM model Muni. The results of these various models, along with those of Mfull are summarized in Table 5. We see that Mfull has roughly half the error rate of the stateof-the-art MADA system. Note: The results reported are 10 fold cross validation test accuracies and no parameters have been tuned on them. We used same train-test splits for all the datasets. 5 Related Work (Hajic, 2000) show that for highly inflectional languages, the use of a morphological analyzer improves accuracy of disambiguation. (Diab et al., 2004) perform tokenization, POS tagging and base phrase chunking using an SVM based learner. (Ahmed and N ¨urnberger, 2008) perform word-sense disambiguation using a Naive Bayesian 733 model and rely on parallel corpora and match- ing schemes instead of a morphological analyzer. (Kulick, 2010) perform simultaneous tokenization and part-of-speech tagging for Arabic by separating closed and open-class items and focusing on the likelihood of possible stems of openclass words. (Mohamed and K ¨ubler, 2010) present a hybrid method between word-based and segmentbased POS tagging for Arabic and report good results. (Toutanova and Cherry, 2009) perform joint lemmatization and part-of-speech tagging for English, Bulgarian, Czech and Slovene, but they do not use the two step estimation-shrinkage model described in this paper; nor do they factor labels. The idea of joint lemmatization and part-of-speech tagging has also been discussed in the context of Hungarian in (Kornai, 1994). A substantial amount of relevant work has been done previously for Hebrew. (Adler and Elhadad, 2006) perform Hebrew morphological disambiguation using an unsupervised morpheme-based HMM, but they report lower scores than those achieved by our model. Moreover, their analysis doesn’t include lemma IDs, which is a novelty of our model. (Goldberg et al., 2008) extend the work of (Adler and El- hadad, 2006) by using an EM algorithm, and achieve an accuracy of 88% for full morphological analysis, but again, this does not include lemma IDs. To the best of our knowledge, there is no existing research for Hebrew that does what we did for Arabic, namely to use simultaneous lemma and morphological disambiguation to improve both. (Dinur et al., 2009) show that prepositions and function words can be accurately segmented using unsupervised methods. However, by using this method as a preprocessing step, we would lose the power of a simultaneous solution for these problems. Our method is closer in style to a CRF, giving much of the accuracy gains of simultaneous solution, while being about 4 orders of magnitude easier to train. We believe that our use of factored labels is novel for the problem of simultaneous lemma and morphological disambiguation; however, (Smith et al., 2005) and (Hatori et al., 2008) have previously made use of features based on parts of labels in CRF models for morphological disambiguation and word-sense disambiguation respectively. Also, we note that there is a similarity between our two-stage machine learning approach and log-linear models in machine translation that break the data in two parts, estimating log-probabilities of generative models from one part, and discriminatively re-weighting the models using the second part. 6 Conclusions We introduced a new approach to accurately predict labels consisting of both lemmas and morphological analyses for Arabic text. We obtained an accuracy of over 90% substantially higher than current state-of-the-art systems. Key to our success is the factoring of labels into lemma and a large set of morphosyntactic elements, and the use of an algorithm that computes a simple initial estimate of the coefficient relating each contextual feature to each label element (simply by counting co-occurrence) and then regularizes these features by shrinking each of the coefficients for each feature by an amount determined by supervised learning using only the candidate label sets produced by SAMA. We also showed that using features of word ngrams is preferable to using features of only individual tokens of data. Finally, we showed that a model using a full feature set based on labels as well as – factored components of labels, which we call label elements (LEs) works better than a model created by combining individual models for each LE. We believe that the approach we have used to create our model can be successfully applied not just to Arabic but also to other languages such as Turkish, Hungarian and Finnish that have highly inflectional morphology. The current accuracy of of our model, getting the correct answer among the top two choices 96.2% of the time is high enough to be highly useful for tasks such as aiding the manual annotation of Arabic text; a more complete automation would require that accuracy for the single top choice. Acknowledgments We woud like to thank everyone at the Linguistic Data Consortium, especially Christopher Cieri, David Graff, Seth Kulick, Ann Bies, Wajdi Zaghouani and Basma Bouziri for their help. We also wish to thank the anonymous reviewers for their comments and suggestions. 734 References Meni Adler and Michael Elhadad. 2006. An Unsupervised Morpheme-Based HMM for Hebrew Morphological Disambiguation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics. Farag Ahmed and Andreas N ¨urnberger. 2008. Arabic/English Word Translation Disambiguation using Parallel Corpora and Matching Schemes. In Proceedings of EAMT’08, Hamburg, Germany. Tim Buckwalter. 2004. Buckwalter Arabic Morphological Analyzer version 2.0. Michael Collins. 2002. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. In Proceedings of EMNLP’02. Koby Crammer, Ofer Dekel, Joseph Keshet, Shai ShalevShwartz, and Yoram Singer. 2006. Online PassiveAggressive Algorithms. Journal of Machine Learning Research, 7:551–585. Mona Diab, Kadri Hacioglu, and Daniel Jurafsky. 2004. Automatic Tagging of Arabic text: From Raw Text to Base Phrase Chunks. In Proceedings of the 5th Meeting of the North American Chapter of the Association for Computational Linguistics/Human Language Technologies Conference (HLT-NAACL’04). Elad Dinur, Dmitry Davidov, and Ari Rappoport. 2009. Unsupervised Concept Discovery in Hebrew Using Simple Unsupervised Word Prefix Segmentation for Hebrew and Arabic. In Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages. Kuzman Ganchev and Georgi Georgiev. 2009. Edlin: An Easy to Read Linear Learning Framework. In Proceedings of RANLP’09. Yoav Goldberg, Meni Adler, and Michael Elhadad. 2008. EM Can Find Pretty Good HMM POS-Taggers (When Given a Good Start)*. In Proceedings of ACL’08. Nizar Habash and Owen Rambow. 2005. Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop. In Proceedings of ACL’05, Ann Arbor, MI, USA. Jan Hajic. 2000. Morphological Tagging: Data vs. Dictionaries. In Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL’00). Jun Hatori, Yusuke Miyao, and Jun’ichi Tsujii. 2008. Word Sense Disambiguation for All Words using TreeStructured Conditional Random Fields. In Proceedings of COLing’08. W. James and Charles Stein. 1961 . Estimation with Quadratic Loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1. Andr a´s Kornai. 1994. On Hungarian morphology (LinDissertationes 14). Lin- guistica, Series A: Studia et guistics Institute of Hungarian Academy of Sciences, Budapest. Seth Kulick. 2010. Simultaneous Tokenization and Partof-Speech Tagging for Arabic without a Morphological Analyzer. In Proceedings of ACL’10. John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of ICML’01, pages 282–289. Mohamed Maamouri, Ann Bies, and Tim Buckwalter. 2004. The Penn Arabic Treebank: Building a Large Scale Annotated Arabic Corpus. In Proceedings of NEMLAR Conference on Arabic Language Resources and Tools. Mohamed Maamouri, David Graff, Basma Bouziri, Sondos Krouna, and Seth Kulick. 2009. LDC Standard Arabic Morphological Analyzer (SAMA) v. 3.0. Andrew McCallum, 2001. MALLET: A Machine Learning for Language Toolkit. Software available at http : / /mal let .cs .umas s .edu. Emad Mohamed and Sandra K ¨ubler. 2010. Arabic Part of Speech Tagging. In Proceedings of LREC’10. Ryan Roth, Owen Rambow, Nizar Habash, Mona Diab, and Cynthia Rudin. 2008. Arabic Morphological Tagging, Diacritization, and Lemmatization Using Lexeme Models and Feature Ranking. In Proceedings of ACL’08, Columbus, Ohio, USA. Noah A. Smith, David A. Smith, and Roy W. Tromble. 2005. Context-Based Morphological Disambiguation with Random Fields*. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP). Kristina Toutanova and Colin Cherry. 2009. A Global Model for Joint Lemmatization and Part-of-Speech Prediction. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing, pages 486–494. 735

10 emnlp-2010-A Probabilistic Morphological Analyzer for Syriac

Author: Peter McClanahan ; George Busby ; Robbie Haertel ; Kristian Heal ; Deryle Lonsdale ; Kevin Seppi ; Eric Ringger

Abstract: We define a probabilistic morphological analyzer using a data-driven approach for Syriac in order to facilitate the creation of an annotated corpus. Syriac is an under-resourced Semitic language for which there are no available language tools such as morphological analyzers. We introduce novel probabilistic models for segmentation, dictionary linkage, and morphological tagging and connect them in a pipeline to create a probabilistic morphological analyzer requiring only labeled data. We explore the performance of models with varying amounts of training data and find that with about 34,500 labeled tokens, we can outperform a reasonable baseline trained on over 99,000 tokens and achieve an accuracy of just over 80%. When trained on all available training data, our joint model achieves 86.47% accuracy, a 29.7% reduction in error rate over the baseline.

11 emnlp-2010-A Semi-Supervised Approach to Improve Classification of Infrequent Discourse Relations Using Feature Vector Extension

Author: Hugo Hernault ; Danushka Bollegala ; Mitsuru Ishizuka

Abstract: Several recent discourse parsers have employed fully-supervised machine learning approaches. These methods require human annotators to beforehand create an extensive training corpus, which is a time-consuming and costly process. On the other hand, unlabeled data is abundant and cheap to collect. In this paper, we propose a novel semi-supervised method for discourse relation classification based on the analysis of cooccurring features in unlabeled data, which is then taken into account for extending the feature vectors given to a classifier. Our experimental results on the RST Discourse Treebank corpus and Penn Discourse Treebank indicate that the proposed method brings a significant improvement in classification accuracy and macro-average F-score when small training datasets are used. For instance, with training sets of c.a. 1000 labeled instances, the proposed method brings improvements in accuracy and macro-average F-score up to 50% compared to a baseline classifier. We believe that the proposed method is a first step towards detecting low-occurrence relations, which is useful for domains with a lack of annotated data.

12 emnlp-2010-A Semi-Supervised Method to Learn and Construct Taxonomies Using the Web

Author: Zornitsa Kozareva ; Eduard Hovy

Abstract: Although many algorithms have been developed to harvest lexical resources, few organize the mined terms into taxonomies. We propose (1) a semi-supervised algorithm that uses a root concept, a basic level concept, and recursive surface patterns to learn automatically from the Web hyponym-hypernym pairs subordinated to the root; (2) a Web based concept positioning procedure to validate the learned pairs’ is-a relations; and (3) a graph algorithm that derives from scratch the integrated taxonomy structure of all the terms. Comparing results with WordNet, we find that the algorithm misses some concepts and links, but also that it discovers many additional ones lacking in WordNet. We evaluate the taxonomization power of our method on reconstructing parts of the WordNet taxonomy. Experiments show that starting from scratch, the algorithm can reconstruct 62% of the WordNet taxonomy for the regions tested.

13 emnlp-2010-A Simple Domain-Independent Probabilistic Approach to Generation

Author: Gabor Angeli ; Percy Liang ; Dan Klein

Abstract: Percy Liang UC Berkeley Berkeley, CA 94720 pliang@cs.berkeley.edu Dan Klein UC Berkeley Berkeley, CA 94720 klein@cs.berkeley.edu We operate in a setting in which we are only given examples consisting of (i) a set of database records We present a simple, robust generation system which performs content selection and surface realization in a unified, domain-independent framework. In our approach, we break up the end-to-end generation process into a sequence of local decisions, arranged hierarchically and each trained discriminatively. We deployed our system in three different domains—Robocup sportscasting, technical weather forecasts, and common weather forecasts, obtaining results comparable to state-ofthe-art domain-specific systems both in terms of BLEU scores and human evaluation.

14 emnlp-2010-A Tree Kernel-Based Unified Framework for Chinese Zero Anaphora Resolution

Author: Fang Kong ; Guodong Zhou

Abstract: This paper proposes a unified framework for zero anaphora resolution, which can be divided into three sub-tasks: zero anaphor detection, anaphoricity determination and antecedent identification. In particular, all the three sub-tasks are addressed using tree kernel-based methods with appropriate syntactic parse tree structures. Experimental results on a Chinese zero anaphora corpus show that the proposed tree kernel-based methods significantly outperform the feature-based ones. This indicates the critical role of the structural information in zero anaphora resolution and the necessity of tree kernel-based methods in modeling such structural information. To our best knowledge, this is the first systematic work dealing with all the three sub-tasks in Chinese zero anaphora resolution via a unified framework. Moreover, we release a Chinese zero anaphora corpus of 100 documents, which adds a layer of annotation to the manu- ally-parsed sentences in the Chinese Treebank (CTB) 6.0.

15 emnlp-2010-A Unified Framework for Scope Learning via Simplified Shallow Semantic Parsing

Author: Qiaoming Zhu ; Junhui Li ; Hongling Wang ; Guodong Zhou

Abstract: This paper approaches the scope learning problem via simplified shallow semantic parsing. This is done by regarding the cue as the predicate and mapping its scope into several constituents as the arguments of the cue. Evaluation on the BioScope corpus shows that the structural information plays a critical role in capturing the relationship between a cue and its dominated arguments. It also shows that our parsing approach significantly outperforms the state-of-the-art chunking ones. Although our parsing approach is only evaluated on negation and speculation scope learning here, it is portable to other kinds of scope learning. 1

16 emnlp-2010-An Approach of Generating Personalized Views from Normalized Electronic Dictionaries : A Practical Experiment on Arabic Language

Author: Aida Khemakhem ; Bilel Gargouri ; Abdelmajid Ben Hamadou

Abstract: Electronic dictionaries covering all natural language levels are very relevant for the human use as well as for the automatic processing use, namely those constructed with respect to international standards. Such dictionaries are characterized by a complex structure and an important access time when using a querying system. However, the need of a user is generally limited to a part of such a dictionary according to his domain and expertise level which corresponds to a specialized dictionary. Given the importance of managing a unified dictionary and considering the personalized needs of users, we propose an approach for generating personalized views starting from a normalized dictionary with respect to Lexical Markup Framework LMF-ISO 24613 norm. This approach provides the re-use of already defined views for a community of users by managing their profiles information and promoting the materialization of the generated views. It is composed of four main steps: (i) the projection of data categories controlled by a set of constraints (related to the user‟s profiles), (ii) the selection of values with consistency checking, (iii) the automatic generation of the query‟s model and finally, (iv) the refinement of the view. The proposed approach was con- solidated by carrying out an experiment on an LMF normalized Arabic dictionary. 1

17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL

Author: Valentin Zhikov ; Hiroya Takamura ; Manabu Okumura

Abstract: This paper proposes a fast and simple unsupervised word segmentation algorithm that utilizes the local predictability of adjacent character sequences, while searching for a leasteffort representation of the data. The model uses branching entropy as a means of constraining the hypothesis space, in order to efficiently obtain a solution that minimizes the length of a two-part MDL code. An evaluation with corpora in Japanese, Thai, English, and the ”CHILDES” corpus for research in language development reveals that the algorithm achieves an accuracy, comparable to that of the state-of-the-art methods in unsupervised word segmentation, in a significantly reduced . computational time.

18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding

Author: Guillaume Wisniewski ; Alexandre Allauzen ; Francois Yvon

Abstract: Extant Statistical Machine Translation (SMT) systems are very complex softwares, which embed multiple layers of heuristics and embark very large numbers of numerical parameters. As a result, it is difficult to analyze output translations and there is a real need for tools that could help developers to better understand the various causes of errors. In this study, we make a step in that direction and present an attempt to evaluate the quality of the phrase-based translation model. In order to identify those translation errors that stem from deficiencies in the phrase table (PT), we propose to compute the oracle BLEU-4 score, that is the best score that a system based on this PT can achieve on a reference corpus. By casting the computation of the oracle BLEU-1 as an Integer Linear Programming (ILP) problem, we show that it is possible to efficiently compute accurate lower-bounds of this score, and report measures performed on several standard benchmarks. Various other applications of these oracle decoding techniques are also reported and discussed. 1 Phrase-Based Machine Translation 1.1 Principle A Phrase-Based Translation System (PBTS) consists of a ruleset and a scoring function (Lopez, 2009). The ruleset, represented in the phrase table, is a set of phrase1pairs {(f, e) }, each pair expressing that the source phrase f can ,bee) r}e,w earicthten p (atirra enxslparteedss)i inngto t a target phrase e. Trarsaens flation hypotheses are generated by iteratively rewriting portions of the source sentence as prescribed by the ruleset, until each source word has been consumed by exactly one rule. The order of target words in an hypothesis is uniquely determined by the order in which the rewrite operation are performed. The search space ofthe translation model corresponds to the set of all possible sequences of 1Following the usage in statistical machine translation literature, use “phrase” to denote a subsequence of consecutive words. we 933 rules applications. The scoring function aims to rank all possible translation hypotheses in such a way that the best one has the highest score. A PBTS is learned from a parallel corpus in two independent steps. In a first step, the corpus is aligned at the word level, by using alignment tools such as Gi z a++ (Och and Ney, 2003) and some symmetrisation heuristics; phrases are then extracted by other heuristics (Koehn et al., 2003) and assigned numerical weights. In the second step, the parameters of the scoring function are estimated, typically through Minimum Error Rate training (Och, 2003). Translating a sentence amounts to finding the best scoring translation hypothesis in the search space. Because of the combinatorial nature of this problem, translation has to rely on heuristic search techniques such as greedy hill-climbing (Germann, 2003) or variants of best-first search like multi-stack decoding (Koehn, 2004). Moreover, to reduce the overall complexity of decoding, the search space is typically pruned using simple heuristics. For instance, the state-of-the-art phrase-based decoder Moses (Koehn et al., 2007) considers only a restricted number of translations for each source sequence2 and enforces a distortion limit3 over which phrases can be reordered. As a consequence, the best translation hypothesis returned by the decoder is not always the one with the highest score. 1.2 Typology of PBTS Errors Analyzing the errors of a SMT system is not an easy task, because of the number of models that are combined, the size of these models, and the high complexity of the various decision making processes. For a SMT system, three different kinds of errors can be distinguished (Germann et al., 2004; Auli et al., 2009): search errors, induction errors and model errors. The former corresponds to cases where the hypothesis with the best score is missed by the search procedure, either because of the use of an ap2the 3the option of Moses, defaulting to 20. dl option of Moses, whose default value is 7. tt l ProceMedITin,g Ms oasfs thaceh 2u0se1t0ts C,o UnSfAer,e n9c-e11 on O Ectmobpeir ic 2a0l1 M0.e ?tc ho2d0s10 in A Nsastouciraatlio Lnan fogru Cagoem Ppruotcaetisosninagl, L pinaggeusis 9t3ic3s–943, proximate search method or because of the restrictions of the search space. Induction errors correspond to cases where, given the model, the search space does not contain the reference. Finally, model errors correspond to cases where the hypothesis with the highest score is not the best translation according to the evaluation metric. Model errors encompass several types oferrors that occur during learning (Bottou and Bousquet, 2008)4. Approximation errors are errors caused by the use of a restricted and oversimplistic class of functions (here, finitestate transducers to model the generation of hypotheses and a linear scoring function to discriminate them) to model the translation process. Estimation errors correspond to the use of sub-optimal values for both the phrase pairs weights and the parameters of the scoring function. The reasons behind these errors are twofold: first, training only considers a finite sample of data; second, it relies on error prone alignments. As a result, some “good” phrases are extracted with a small weight, or, in the limit, are not extracted at all; and conversely that some “poor” phrases are inserted into the phrase table, sometimes with a really optimistic score. Sorting out and assessing the impact of these various causes of errors is of primary interest for SMT system developers: for lack of such diagnoses, it is difficult to figure out which components of the system require the most urgent attention. Diagnoses are however, given the tight intertwining among the various component of a system, very difficult to obtain: most evaluations are limited to the computation of global scores and usually do not imply any kind of failure analysis. 1.3 Contribution and organization To systematically assess the impact of the multiple heuristic decisions made during training and decoding, we propose, following (Dreyer et al., 2007; Auli et al., 2009), to work out oracle scores, that is to evaluate the best achievable performances of a PBTS. We aim at both studying the expressive power of PBTS and at providing tools for identifying and quantifying causes of failure. Under standard metrics such as BLEU (Papineni et al., 2002), oracle scores are difficult (if not impossible) to compute, but, by casting the computation of the oracle unigram recall and precision as an Integer Linear Programming (ILP) problem, we show that it is possible to efficiently compute accurate lower-bounds of the oracle BLEU-4 scores and report measurements performed on several standard benchmarks. The main contributions of this paper are twofold. We first introduce an ILP program able to efficiently find the best hypothesis a PBTS can achieve. This program can be easily extended to test various improvements to 4We omit here optimization errors. 934 phrase-base systems or to evaluate the impact of different parameter settings. Second, we present a number of complementary results illustrating the usage of our oracle decoder for identifying and analyzing PBTS errors. Our experimental results confirm the main conclusions of (Turchi et al., 2008), showing that extant PBTs have the potential to generate hypotheses having very high BLEU4 score and that their main bottleneck is their scoring function. The rest of this paper is organized as follows: in Section 2, we introduce and formalize the oracle decoding problem, and present a series of ILP problems of increasing complexity designed so as to deliver accurate lowerbounds of oracle score. This section closes with various extensions allowing to model supplementary constraints, most notably reordering constraints (Section 2.5). Our experiments are reported in Section 3, where we first introduce the training and test corpora, along with a description of our system building pipeline (Section 3. 1). We then discuss the baseline oracle BLEU scores (Section 3.2), analyze the non-reachable parts of the reference translations, and comment several complementary results which allow to identify causes of failures. Section 4 discuss our approach and findings with respect to the existing literature on error analysis and oracle decoding. We conclude and discuss further prospects in Section 5. 2 Oracle Decoder 2.1 The Oracle Decoding Problem Definition To get some insights on the errors of phrasebased systems and better understand their limits, we propose to consider the oracle decoding problem defined as follows: given a source sentence, its reference translation5 and a phrase table, what is the “best” translation hypothesis a system can generate? As usual, the quality of an hypothesis is evaluated by the similarity between the reference and the hypothesis. Note that in the oracle decoding problem, we are only assessing the ability of PBT systems to generate good candidate translations, irrespective of their ability to score them properly. We believe that studying this problem is interesting for various reasons. First, as described in Section 3.4, comparing the best hypothesis a system could have generated and the hypothesis it actually generates allows us to carry on both quantitative and qualitative failure analysis. The oracle decoding problem can also be used to assess the expressive power of phrase-based systems (Auli et al., 2009). Other applications include computing acceptable pseudo-references for discriminative training (Tillmann and Zhang, 2006; Liang et al., 2006; Arun and 5The oracle decoding problem can be extended to the case of multiple references. For the sake of simplicity, we only describe the case of a single reference. Koehn, 2007) or combining machine translation systems in a multi-source setting (Li and Khudanpur, 2009). We have also used oracle decoding to identify erroneous or difficult to translate references (Section 3.3). Evaluation Measure To fully define the oracle decoding problem, a measure of the similarity between a translation hypothesis and its reference translation has to be chosen. The most obvious choice is the BLEU-4 score (Papineni et al., 2002) used in most machine translation evaluations. However, using this metric in the oracle decoding problem raises several issues. First, BLEU-4 is a metric defined at the corpus level and is hard to interpret at the sentence level. More importantly, BLEU-4 is not decomposable6: as it relies on 4-grams statistics, the contribution of each phrase pair to the global score depends on the translation of the previous and following phrases and can not be evaluated in isolation. Because of its nondecomposability, maximizing BLEU-4 is hard; in particular, the phrase-level decomposability of the evaluation × metric is necessary in our approach. To circumvent this difficulty, we propose to evaluate the similarity between a translation hypothesis and a reference by the number of their common words. This amounts to evaluating translation quality in terms of unigram precision and recall, which are highly correlated with human judgements (Lavie et al., ). This measure is closely related to the BLEU-1 evaluation metric and the Meteor (Banerjee and Lavie, 2005) metric (when it is evaluated without considering near-matches and the distortion penalty). We also believe that hypotheses that maximize the unigram precision and recall at the sentence level yield corpus level BLEU-4 scores close the maximal achievable. Indeed, in the setting we will introduce in the next section, BLEU-1 and BLEU-4 are highly correlated: as all correct words of the hypothesis will be compelled to be at their correct position, any hypothesis with a high 1-gram precision is also bound to have a high 2-gram precision, etc. 2.2 Formalizing the Oracle Decoding Problem The oracle decoding problem has already been considered in the case of word-based models, in which all translation units are bound to contain only one word. The problem can then be solved by a bipartite graph matching algorithm (Leusch et al., 2008): given a n m binary matarligxo describing possible t 2r0an08sl)a:ti goinv elinn aks n b×emtw beeinna source words and target words7, this algorithm finds the subset of links maximizing the number of words of the reference that have been translated, while ensuring that each word 6Neither at the sentence (Chiang et al., 2008), nor at the phrase level. 7The (i, j) entry of the matrix is 1if the ith word of the source can be translated by the jth word of the reference, 0 otherwise. 935 is translated only once. Generalizing this approach to phrase-based systems amounts to solving the following problem: given a set of possible translation links between potential phrases of the source and of the target, find the subset of links so that the unigram precision and recall are the highest possible. The corresponding oracle hypothesis can then be easily generated by selecting the target phrases that are aligned with one source phrase, disregarding the others. In addition, to mimic the way OOVs are usually handled, we match identical OOV tokens appearing both in the source and target sentences. In this approach, the unigram precision is always one (every word generated in the oracle hypothesis matches exactly one word in the reference). As a consequence, to find the oracle hypothesis, we just have to maximize the recall, that is the number of words appearing both in the hypothesis and in the reference. Considering phrases instead of isolated words has a major impact on the computational complexity: in this new setting, the optimal segmentations in phrases of both the source and of the target have to be worked out in addition to links selection. Moreover, constraints have to be taken into account so as to enforce a proper segmentation of the source and target sentences. These constraints make it impossible to use the approach of (Leusch et al., 2008) and concur in making the oracle decoding problem for phrase-based models more complex than it is for word-based models: it can be proven, using arguments borrowed from (De Nero and Klein, 2008), that this problem is NP-hard even for the simple unigram precision measure. 2.3 An Integer Program for Oracle Decoding To solve the combinatorial problem introduced in the previous section, we propose to cast it into an Integer Linear Programming (ILP) problem, for which many generic solvers exist. ILP has already been used in SMT to find the optimal translation for word-based (Germann et al., 2001) and to study the complexity of learning phrase alignments (De Nero and Klein, 2008) models. Following the latter reference, we introduce the following variables: fi,j (resp. ek,l) is a binary indicator variable that is true when the phrase contains all spans from betweenword position i to j (resp. k to l) of the source (resp. target) sentence. We also introduce a binary variable, denoted ai,j,k,l, to describe a possible link between source phrase fi,j and target phrase ek,l. These variables are built from the entries of the phrase table according to selection strategies introduced in Section 2.4. In the following, index variables are so that: 0 ≤ i< j ≤ n, in the source sentence and 0 ≤ k < l ≤ m, in the target sentence, where n (resp. m) is the length of the source (resp. target) sentence. Solving the oracle decoding problem then amounts to optimizing the following objective function: mi,j,akx,li,Xj,k,lai,j,k,l· (l − k), (1) under the constraints: X ∀x ∈ J1,mK : ek,l ≤ 1 (2) = (3) 1∀,kn,lK : Xai,j,k,l = fk,l (4) ∀i,j : Xai,j,k,l (5) k,l s.tX. Xk≤x≤l ∀∀xy ∈∈ J11,,mnKK : X i,j s.tX. Xi≤y≤j fi,j 1 Xi,j = ei,j Xk,l The objective function (1) corresponds to the number of target words that are generated. The first set of constraints (2) ensures that each word in the reference e ap- pears in no more than one phrase. Maximizing the objective under these constraints amounts to maximizing the unigram recall. The second set of constraints (3) ensures that each word in the source f is translated exactly once, which guarantees that the search space of the ILP problem is the same as the search space of a phrase-based system. Constraints (4) bind the fk,l and ai,j,k,l variables, ensuring that whenever a link ai,j,k,l is active, the corresponding phrase fk,l is also active. Constraints (5) play a similar role for the reference. The Relaxed Problem Even though it accurately models the search space of a phrase-based decoder, this programs is not really useful as is: due to out-ofvocabulary words or missing entries in the phrase table, the constraint that all source words should be translated yields infeasible problems8. We propose to relax this problem and allow some source words to remain untranslated. This is done by replacing constraints (3) by: ∀y ∈ J1,nK : X i,j s.tX. Xi≤y≤j fi,j ≤ 1 To better ref∀lyec ∈t th J1e, bneKh :avior of phrase-based decoders, which attempt to translate all source words, we also need to modify the objective function as follows: X i,Xj,k,l ai,j,k,l · (l − k) +Xfi,j · (j − i) Xi,j (6) The second term in this new objective ensures that optimal solutions translate as many source words as possible. 8An ILP problem is said to be infeasible when tion violates at least one constraint. every possible solu- 936 The Relaxed-Distortion Problem A last caveat with the Relaxed optimization program is caused by frequently occurring source tokens, such as function words or punctuation signs, which can often align with more than one target word. For lack of taking distortion information into account in our objective function, all these alignments are deemed equivalent, even if some of them are clearly more satisfactory than others. This situation is illustrated on Figure 1. le chat et the cat and le the chien dog Figure 1: Equivalent alignments between “le” and “the”. The dashed lines corresponds to a less interpretable solution. To overcome this difficulty, we propose a last change to the objective function: X i,Xj,k,l ai,j,k,l · (l − k) +Xfi,j · (j − i) X ai,j,k,l|k − i| Xi,j −α (7) i Xk ,l X,j, Compared to the objective function of the relaxed problem (6), we introduce here a supplementary penalty factor which favors monotonous alignments. For each phrase pair, the higher the difference between source and target positions, the higher this penalty. If α is small enough, this extra term allows us to select, among all the optimal alignments of the re l axed problem, the one with the lowest distortion. In our experiments, we set α to min {n, m} to ensure that the penalty factor is always smminall{enr, ,tmha}n tthoe e rneswuarred t fhoart aligning atwltyo single iwso ardlwsa. 2.4 Selecting Indicator Variables In the approach introduced in the previous sections, the oracle decoding problem is solved by selecting, among a set of possible translation links, the ones that yield the solution with the highest unigram recall. We propose two strategies to build this set of possible translation links. In the first one, denoted exact match, an indicator ai,j,k,l is created if there is an entry (f, e) so that f spans from word position ito j in the source and e from word position k to l in the target. In this strategy, the ILP program considers exactly the same ruleset as conventional phrase-based decoders. We also consider an alternative strategy, which could help us to identify errors made during the phrase extraction process. In this strategy, denoted inside match, an indicator ai,j,k,l is created when the following three criteria are met: i) f spans from position ito j of the source; ii) a substring of e, denoted e, spans from position k to l of the reference; iii) (f, e¯) is not an entry of the phrase table. The resulting set of indicator variables thus contains, at least, all the variables used in the exact match strategy. In addition, we license here the use of phrases containing words that do not occur in the reference. In fact, using such solutions can yield higher BLEU scores when the reward for additional correct matches exceeds the cost incurred by wrong predictions. These cases are symptoms of situations where the extraction heuristic failed to extract potentially useful subphrases. 2.5 Oracle Decoding with Reordering Constraints The ILP problem introduced in the previous section can be extended in several ways to describe and test various improvements to phrase-based systems or to evaluate the impact of different parameter settings. This flexibility mainly stems from the possibility offered by our framework to express arbitrary constraints over variables. In this section, we illustrate these possibilities by describing how reordering constraints can easily be considered. As a first example, the Moses decoder uses a distortion limit to constrain the set of possible reorderings. This constraint “enforces (...) that the last word of a phrase chosen for translation cannot be more than d9 words from the leftmost untranslated word in the source” (Lopez, 2009) and is expressed as: ∀aijkl , ai0j0k0l0 s.t. k > k0, aijkl · ai0j0k0l0 · |j − i0 + 1| ≤ d, The maximum distortion limit strategy (Lopez, 2009) is also easily expressed and take the following form (assuming this constraint is parameterized by d): ∀l < m − 1, ai,j,k,l·ai0,j0,l+1,l0 · |i0 − j − 1| 71is%t e6hs.a distortion greater that Moses default distortion limit. alignment decisions enabled by the use of larger training corpora and phrase table. To evaluate the impact ofthe second heuristic, we computed the number of phrases discarded by Moses (be- cause of the default ttl limit) but used in the oracle hypotheses. In the English to French NEWSCO setting, they account for 34.11% of the total number of phrases used in the oracle hypotheses. When the oracle decoder is constrained to use the same phrase table as Moses, its BLEU-4 score drops to 42.78. This shows that filtering the phrase table prior to decoding discards many useful phrase pairs and is seriously limiting the best achievable performance, a conclusion shared with (Auli et al., 2009). Search Errors Search errors can be identified by comparing the score of the best hypothesis found by Moses and the score of the oracle hypothesis. If the score of the oracle hypothesis is higher, then there has been a search error; on the contrary, there has been an estimation error when the score of the oracle hypothesis is lower than the score of the best hypothesis found by Moses. 940 Based on the comparison of the score of Moses hypotheses and of oracle hypotheses for the English to French NEWSCO setting, our preliminary conclusion is that the number of search errors is quite limited: only about 5% of the hypotheses of our oracle decoder are actually getting a better score than Moses solutions. Again, this shows that the scoring function (model error) is one of the main bottleneck of current PBTS. Comparing these hypotheses is nonetheless quite revealing: while Moses mostly selects phrase pairs with high translation scores and generates monotonous alignments, our ILP decoder uses larger reorderings and less probable phrases to achieve better solutions: on average, the reordering score of oracle solutions is −5.74, compared to −76.78 fscoro rMeo osfe osr outputs. iGonivsen is −the5 weight assigned through MERT training to the distortion score, no wonder that these hypotheses are severely penalized. The Impact of Phrase Length The observed outputs do not only depend on decisions made during the search, but also on decisions made during training. One such decision is the specification of maximal length for the source and target phrases. In our framework, evaluating the impact of this decision is simple: it suffices to change the definition of indicator variables so as to consider only alignments between phrases of a given length. In the English-French NEWSCO setting, the most restrictive choice, when only alignments between single words are authorized, yields an oracle BLEU-4 of 48.68; however, authorizing phrases up to length 2 allows to achieve an oracle value of 66.57, very close to the score achieved when considering all extracted phrases (67.77). This is corroborated with a further analysis of our oracle alignments, which use phrases whose average source length is 1.21 words (respectively 1.31 for target words). If many studies have already acknowledged the predomi- nance of “small” phrases in actual translations, our oracle scores suggest that, for this language pair, increasing the phrase length limit beyond 2 or 3 might be a waste of computational resources. 4 Related Work To the best of our knowledge, there are only a few works that try to study the expressive power ofphrase-based machine translation systems or to provide tools for analyzing potential causes of failure. The approach described in (Auli et al., 2009) is very similar to ours: in this study, the authors propose to find and analyze the limits of machine translation systems by studying the reference reachability. A reference is reachable for a given system if it can be exactly generated by this system. Reference reachability is assessed using Moses in forced decoding mode: during search, all hypotheses that deviate from the reference are simply discarded. Even though the main goal of this study was to compare the search space of phrase-based and hierarchical systems, it also provides some insights on the impact of various search parameters in Moses, delivering conclusions that are consistent with our main results. As described in Section 1.2, these authors also propose a typology of the errors of a statistical translation systems, but do not attempt to provide methods for identifying them. The authors of (Turchi et al., 2008) study the learn- ing capabilities of Moses by extensively analyzing learning curves representing the translation performances as a function of the number of examples, and by corrupting the model parameters. Even though their focus is more on assessing the scoring function, they reach conclusions similar to ours: the current bottleneck of translation performances is not the representation power of the PBTS but rather in their scoring functions. Oracle decoding is useful to compute reachable pseudo-references in the context of discriminative training. This is the main motivation of (Tillmann and Zhang, 2006), where the authors compute high BLEU hypotheses by running a conventional decoder so as to maximize a per-sentence approximation of BLEU-4, under a simple (local) reordering model. Oracle decoding has also been used to assess the limitations induced by various reordering constraints in (Dreyer et al., 2007). To this end, the authors propose to use a beam-search based oracle decoder, which computes lower bounds of the best achievable BLEU-4 using dynamic programming techniques over finite-state (for so-called local and IBM constraints) or hierarchically structured (for ITG constraints) sets of hypotheses. Even 941 though the numbers reported in this study are not directly comparable with ours17, it seems that our decoder is not only conceptually much simpler, but also achieves much more optimistic lower-bounds of the oracle BLEU score. The approach described in (Li and Khudanpur, 2009) employs a similar technique, which is to guide a heuristic search in an hypergraph representing possible translation hypotheses with n-gram counts matches, which amounts to decoding with a n-gram model trained on the sole reference translation. Additional tricks are presented in this article to speed-up decoding. Computing oracle BLEU scores is also the subject of (Zens and Ney, 2005; Leusch et al., 2008), yet with a different emphasis. These studies are concerned with finding the best hypotheses in a word graph or in a consensus network, a problem that has various implications for multi-pass decoding and/or system combination techniques. The former reference describes an exponential approximate algorithm, while the latter proves the NPcompleteness of this problem and discuss various heuristic approaches. Our problem is somewhat more complex and using their techniques would require us to built word graphs containing all the translations induced by arbitrary segmentations and permutations of the source sentence. 5 Conclusions In this paper, we have presented a methodology for analyzing the errors of PBTS, based on the computation of an approximation of the BLEU-4 oracle score. We have shown that this approximation could be computed fairly accurately and efficiently using Integer Linear Programming techniques. Our main result is a confirmation of the fact that extant PBTS systems are expressive enough to achieve very high translation performance with respect to conventional quality measurements. The main efforts should therefore strive to improve on the way phrases and hypotheses are scored during training. This gives further support to attempts aimed at designing context-dependent scoring functions as in (Stroppa et al., 2007; Gimpel and Smith, 2008), or at attempts to perform discriminative training of feature-rich models. (Bangalore et al., 2007). We have shown that the examination of difficult-totranslate sentences was an effective way to detect errors or inconsistencies in the reference translations, making our approach a potential aid for controlling the quality or assessing the difficulty of test data. Our experiments have also highlighted the impact of various parameters. Various extensions of the baseline ILP program have been suggested and/or evaluated. In particular, the ILP formalism lends itself well to expressing various constraints that are typically used in conventional PBTS. In 17The best BLEU-4 oracle they achieve on Europarl German to English is approximately 48; but they considered a smaller version of the training corpus and the WMT’06 test set. our future work, we aim at using this ILP framework to systematically assess various search configurations. We plan to explore how replacing non-reachable references with high-score pseudo-references can improve discrim- inative training of PBTS. We are also concerned by determining how tight is our approximation of the BLEU4 score is: to this end, we intend to compute the best BLEU-4 score within the n-best solutions of the oracle decoding problem. Acknowledgments Warm thanks to Houda Bouamor for helping us with the annotation tool. This work has been partly financed by OSEO, the French State Agency for Innovation, under the Quaero program. References Tobias Achterberg. 2007. Constraint Integer Programming. Ph.D. thesis, Technische Universit a¨t Berlin. http : / / opus .kobv .de /tuberl in/vol ltexte / 2 0 0 7 / 16 11/ . Abhishek Arun and Philipp Koehn. 2007. Online learning methods for discriminative training of phrase based statistical machine translation. In Proc. of MT Summit XI, Copenhagen, Denmark. Michael Auli, Adam Lopez, Hieu Hoang, and Philipp Koehn. 2009. A systematic analysis of translation model search spaces. In Proc. of WMT, pages 224–232, Athens, Greece. Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proc. of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Srinivas Bangalore, Patrick Haffner, and Stephan Kanthak. 2007. Statistical machine translation through global lexical selection and sentence reconstruction. In Proc. of ACL, pages 152–159, Prague, Czech Republic. L e´on Bottou and Olivier Bousquet. 2008. The tradeoffs oflarge scale learning. In Proc. of NIPS, pages 161–168, Vancouver, B.C., Canada. Chris Callison-Burch, Philipp Koehn, Christof Monz, and Josh Schroeder. 2009. Findings of the 2009 Workshop on Statistical Machine Translation. In Proc. of WMT, pages 1–28, Athens, Greece. David Chiang, Steve DeNeefe, Yee Seng Chan, and Hwee Tou Ng. 2008. Decomposability of translation metrics for improved evaluation and efficient algorithms. In Proc. of ECML, pages 610–619, Honolulu, Hawaii. John De Nero and Dan Klein. 2008. The complexity of phrase alignment problems. In Proc. of ACL: HLT, Short Papers, pages 25–28, Columbus, Ohio. Markus Dreyer, Keith B. Hall, and Sanjeev P. Khudanpur. 2007. Comparing reordering constraints for smt using efficient bleu oracle computation. In NAACL-HLT/AMTA Workshop on Syntax and Structure in Statistical Translation, pages 103– 110, Rochester, New York. 942 Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. 2001 . Fast decoding and optimal decoding for machine translation. In Proc. of ACL, pages 228–235, Toulouse, France. Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. 2004. Fast and optimal decoding for machine translation. Artificial Intelligence, 154(1-2): 127– 143. Ulrich Germann. 2003. Greedy decoding for statistical machine translation in almost linear time. In Proc. of NAACL, pages 1–8, Edmonton, Canada. Kevin Gimpel and Noah A. Smith. 2008. Rich source-side context for statistical machine translation. In Proc. of WMT, pages 9–17, Columbus, Ohio. Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proc. of NAACL, pages 48–54, Edmonton, Canada. Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris CallisonBurch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proc. of ACL, demonstration session. Philipp Koehn. 2004. Pharaoh: A beam search decoder for phrase-based statistical machine translation models. In Proc. of AMTA, pages 115–124, Washington DC. Shankar Kumar and William Byrne. 2005. Local phrase reordering models for statistical machine translation. In Proc. of HLT, pages 161–168, Vancouver, Canada. Alon Lavie, Kenji Sagae, and Shyamsundar Jayaraman. The significance of recall in automatic metrics for MT evaluation. In In Proc. of AMTA, pages 134–143, Washington DC. Gregor Leusch, Evgeny Matusov, and Hermann Ney. 2008. Complexity of finding the BLEU-optimal hypothesis in a confusion network. In Proc. of EMNLP, pages 839–847, Honolulu, Hawaii. Zhifei Li and Sanjeev Khudanpur. 2009. Efficient extraction of oracle-best translations from hypergraphs. In Proc. of NAACL, pages 9–12, Boulder, Colorado. Percy Liang, Alexandre Bouchard-C oˆt´ e, Dan Klein, and Ben Taskar. 2006. An end-to-end discriminative approach to machine translation. In Proc. of ACL, pages 761–768, Sydney, Australia. Adam Lopez. 2009. Translation as weighted deduction. In Proc. of EACL, pages 532–540, Athens, Greece. Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Comput. Linguist. , 29(1): 19–5 1. Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proc. of ACL, pages 160–167, Sapporo, Japan. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-jing Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. Technical report, Philadelphia, Pennsylvania. D. Roth and W. Yih. 2005. Integer linear programming inference for conditional random fields. In Proc. of ICML, pages 737–744, Bonn, Germany. Nicolas Stroppa, Antal van den Bosch, and Andy Way. 2007. Exploiting source similarity for smt using context-informed features. In Andy Way and Barbara Proc. of TMI, pages Christoph Tillmann 231–240, Sk¨ ovde, and Tong Zhang. Gawronska, editors, Sweden. 2006. A discriminative global training algorithm for statistical mt. In Proc. of ACL, 721–728, Sydney, Australia. Turchi, Tijl De Bie, and Nello pages Marco Cristianini. 2008. Learn- ing performance of a machine translation system: a statistical and computational analysis. In Proc. of WMT, pages Columbus, Ohio. 35–43, Richard Zens and Hermann Ney. 2005. Word graphs for statistical machine translation. In Proc. of the ACL Workshop on Building and Using Parallel Texts, pages 191–198, Ann Arbor, Michigan. 943

19 emnlp-2010-Automatic Analysis of Rhythmic Poetry with Applications to Generation and Translation

Author: Erica Greene ; Tugba Bodrumlu ; Kevin Knight

Abstract: Tugba Bodrumlu Dept. of Computer Science Univ. of Southern California Los Angeles, CA 90089 bodrumlu@cs . usc . edu Kevin Knight Information Sciences Institute Univ. of Southern California 4676 Admiralty Way Marina del Rey, CA 90292 kn i @ i i ght s .edu from existing online poetry corpora. We use these patterns to generate new poems and translate exist- We employ statistical methods to analyze, generate, and translate rhythmic poetry. We first apply unsupervised learning to reveal word-stress patterns in a corpus of raw poetry. We then use these word-stress patterns, in addition to rhyme and discourse models, to generate English love poetry. Finally, we translate Italian poetry into English, choosing target realizations that conform to desired rhythmic patterns.

20 emnlp-2010-Automatic Detection and Classification of Social Events

Author: Apoorv Agarwal ; Owen Rambow

Abstract: In this paper we introduce the new task of social event extraction from text. We distinguish two broad types of social events depending on whether only one or both parties are aware of the social contact. We annotate part of Automatic Content Extraction (ACE) data, and perform experiments using Support Vector Machines with Kernel methods. We use a combination of structures derived from phrase structure trees and dependency trees. A characteristic of our events (which distinguishes them from ACE events) is that the participating entities can be spread far across the parse trees. We use syntactic and semantic insights to devise a new structure derived from dependency trees and show that this plays a role in achieving the best performing system for both social event detection and classification tasks. We also use three data sampling approaches to solve the problem of data skewness. Sampling methods improve the F1-measure for the task of relation detection by over 20% absolute over the baseline.

21 emnlp-2010-Automatic Discovery of Manner Relations and its Applications

22 emnlp-2010-Automatic Evaluation of Translation Quality for Distant Language Pairs

23 emnlp-2010-Automatic Keyphrase Extraction via Topic Decomposition

24 emnlp-2010-Automatically Producing Plot Unit Representations for Narrative Text

25 emnlp-2010-Better Punctuation Prediction with Dynamic Conditional Random Fields

26 emnlp-2010-Classifying Dialogue Acts in One-on-One Live Chats

27 emnlp-2010-Clustering-Based Stratified Seed Sampling for Semi-Supervised Relation Classification

28 emnlp-2010-Collective Cross-Document Relation Extraction Without Labelled Data

29 emnlp-2010-Combining Unsupervised and Supervised Alignments for MT: An Empirical Study

30 emnlp-2010-Confidence in Structured-Prediction Using Confidence-Weighted Models

31 emnlp-2010-Constraints Based Taxonomic Relation Classification

32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media