emnlp emnlp2010 emnlp2010-82 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ahmet Aker ; Trevor Cohn ; Robert Gaizauskas
Abstract: In this paper we address two key challenges for extractive multi-document summarization: the search problem of finding the best scoring summary and the training problem of learning the best model parameters. We propose an A* search algorithm to find the best extractive summary up to a given length, which is both optimal and efficient to run. Further, we propose a discriminative training algorithm which directly maximises the quality ofthe best summary, rather than assuming a sentence-level decomposition as in earlier work. Our approach leads to significantly better results than earlier techniques across a number of evaluation metrics.
Reference: text
sentIndex sentText sentNum sentScore
1 uk Abstract In this paper we address two key challenges for extractive multi-document summarization: the search problem of finding the best scoring summary and the training problem of learning the best model parameters. [sent-7, score-0.653]
2 We propose an A* search algorithm to find the best extractive summary up to a given length, which is both optimal and efficient to run. [sent-8, score-0.541]
3 This short summary can be used as a replacement for the original documents to reduce, for instance, the time a reader would spend if she were to read the original documents. [sent-12, score-0.345]
4 Following dominant trends in summarization research (Mani, 2001), we focus solely on extractive summarization which simplifies the summarization task to the problem of identifying a subset of units from the document collection (here sentences) which are concatenated to form the summary. [sent-13, score-0.514]
5 Most multi-document summarization systems define a model which assigns a score to a candidate summary based on the features of the sentences included in the summary. [sent-14, score-0.463]
6 Search is typically performed by a greedy algorithm which selects each sentence in decreasing order of model score until the desired summary – length is reached (see, e. [sent-16, score-0.423]
7 1 We show in this paper that the search problem can be solved optimally and efficiently using A* search (Russell et al. [sent-21, score-0.262]
8 Assuming the model only uses features local to each sentence in the summary, our algorithm finds the best scoring extractive summary up to a given length in words. [sent-23, score-0.614]
9 The model is then used during search to find a summary composed of high scoring (‘good’) sentences (see for a review Ouyang et al. [sent-28, score-0.584]
10 tc ho2d0s10 in A Nsastoucira tlio Lnan fogru Cagoem Ppruotcaetisosninagl, L pinag eusis 4t8ic2s–491, the best scoring whole summary under the model has a high score under the evaluation metric. [sent-38, score-0.441]
11 Next in section 3 we present an A* search algorithm for finding the best scoring (argmax) summary under the model with a constraint on the maximum summary length. [sent-42, score-0.832]
12 Following the predominant approach to data-driven summarisation, we define a linear model which scores summaries as the weighted sum of their features, s(y|x) = Φ(x, y) · λ , (1) where x is the document set, composed of k sentences, y ⊆ {1. [sent-49, score-0.755]
13 ·)k i}s a feea tthuere s eftun ocfti soenle wcthedich s rettuenrncse a nvdeiccteosr, o Φf( (f·e,a·)tu irses a f foera tthuree ec faunndcitdioante summary and λ are the model parameters. [sent-55, score-0.28]
14 This is achieved by finding λ such that yˆ is similar to the gold standard summary according to an automatic evaluation metric, as described in section 4. [sent-62, score-0.28]
15 3 A* Search The prediction problem is to find the best scoring extractive summary (see Equation 3) up to a given length, L. [sent-63, score-0.522]
16 At first glance, this appears to be a simple problem that might be solved efficiently with a greedy algorithm, say by taking the sentences in order of decreasing score and stopping just before the summary exceeds the length threshold. [sent-64, score-0.426]
17 The problem of constructing the summary can be considered a search problem in which we start with an empty summary and incrementally enlarge the summary by concatenating a sentence from our document set. [sent-66, score-1.1]
18 The search graph starts with an empty summary (the starting state) and each outgoing edge adds a sentence to produce a subsequent state, and is assigned a score under the model. [sent-67, score-0.488]
19 The summarisation problem is then equivalent to finding the best scoring path (summed over the edge scores) between the start state and a goal state. [sent-69, score-0.239]
20 A* is a best-first search algorithm which can efficiently find the best scoring path or the n-best paths (unlike the greedy algorithm which is not op- timal, and the backtracking variant which is not efficient). [sent-72, score-0.312]
21 The search procedure requires a scoring function for each state, here s(y|x) from (2), and function for e 2Our approach could be adapted to support global features, which would require changes to the heuristic for A* search to bound the score obtainable from the global features. [sent-73, score-0.571]
22 a heuristic function which estimates the additional score to get from a given state to a goal state. [sent-75, score-0.247]
23 For the search to be optimal guaranteed to find the best scoring path as the first solution the heuristic must be admissible, meaning that it bounds from above the score for reaching a goal state. [sent-76, score-0.473]
24 We present three different admissible heuristics later in this section, which bound the score with differing tightness and consequently different search cost. [sent-77, score-0.24]
25 Algorithm 1presents A* search for our extractive summarisation model. [sent-78, score-0.31]
26 Given a set of sentences to summary, a scoring and a heuristic function, it finds the best scoring summary. [sent-79, score-0.4]
27 This is achieved by building the search graph incrementally, and storing each frontier state in a priority queue (line 1) which is sorted by the sum of the state’s score and its heuristic. [sent-80, score-0.307]
28 Whenever one of these states is popped in line 2, we know that it outscores all competing hypotheses and therefore represents the optimal summary (because the heuristic is guaranteed to never underestimate the cost to a goal state from an unfinished state). [sent-84, score-0.576]
29 3 Note that in algorithm 1 we create the summary by building a list of sentence indices in sorted order to avoid spurious ambiguity which would unnecessarily expaPnd the search space. [sent-85, score-0.477]
30 We now retuPrn to the problem of defining the heuristic function, h(y; x, l) which provides an upper bound on the additional score achievable in reaching a goal state from state y. [sent-87, score-0.326]
31 In the algorithm, we use the shorthand sn = φ(xn) · λ for sentencPe n’s score, ln = length(xn) for its length and ly = Pn∈y ln for the total length of the current state (unfiPnisnh∈eyd summary). [sent-97, score-0.282]
32 The h1 heuristic is overly simple in that it assumes we can ‘reuse’ a high scoring short sentence many times despite this being disallowed by the model. [sent-102, score-0.26]
33 These nodes are scored with the score for the summary thus far plus a heuristic term. [sent-115, score-0.48]
34 Note that in finding the best two summaries the search process did not need to instantiate the full search graph. [sent-120, score-0.881]
35 To test the efficacy of A* search with each of the different heuristic functions, we now present empirical runtime results. [sent-121, score-0.251]
36 The h1 heuristic was used and the score and heuristic scores are shown separately for clarity. [sent-124, score-0.32]
37 generated the 100-best summaries with word limit L = 200. [sent-126, score-0.683]
38 6 sentences in document set Figure 2: Efficiency of A* search search is roughly linear in the number of sentences in the document set. [sent-136, score-0.474]
39 When we generate a summary about a location of type church, for instance, then we apply the church language model on the related input documents related to the location. [sent-172, score-0.417]
40 oTuhnist: :f Eeaatcuhre s eisn tuesnedce eto g eletsar ans swighneetdhe ar summaries with many sentences are better than summaries with few sentences or vice versa. [sent-174, score-1.3]
41 wordCount: Number of words in the summary, twoo drdeCcidoeu nwt:he Nthumer btehre omf owdoerld ssh ionu thlde f sauvmorm long summaries or short ones. [sent-175, score-0.619]
42 g Eiffel Tower, Mont Blanc) each with a manually assigned place name and object type category (e. [sent-178, score-0.241]
43 For each place name there are up to four model summaries that were created manually after reading existing image descriptions taken from the VirtualTourist travel community web-site. [sent-181, score-0.89]
44 Each summary contains a minimum of 190 and a maximum of 210 words. [sent-182, score-0.28]
45 We compute for each sentence in each description a ROUGE score by comparing the sentence to those included in the model summaries for that particular place name and retaining the highest score. [sent-187, score-0.901]
46 The21M07na15uxm6be51 Mrs5inthe12Ac30ov2glumns give detail about the number of documents (descriptions) for each place, number of sentences for each place and document (description) and the lengths of the sentences. [sent-200, score-0.319]
47 Testing For testing purposes we use the rest of the place names (105) from the 289 place name set. [sent-203, score-0.377]
48 For each place name we use a set of input documents, generate a summary from these documents using our summarizer and compare the results against model summaries of that place name using ROUGE. [sent-204, score-1.387]
49 The in domain documents are the VirtualTourist original image descriptions from which the model summaries were derived. [sent-206, score-0.812]
50 As with the training set we take all place name descriptions for a particular place and use them as input documents to our summarizer. [sent-207, score-0.415]
51 The numbers in the columns give detail about the number of sentences for each place and document and the lengths ofthe sentences. [sent-216, score-0.254]
52 WordLimit and sentenceLimit summaries are the ones generated using the model trained by MERT. [sent-222, score-0.619]
53 As described in section 4 we trained the summariser using the A* search decoder to maximise the ROUGE score of the best scoring summaries. [sent-223, score-0.324]
54 We used the heuristic function h3 in A* search because it is the best performing heuristic, and 100-best lists. [sent-224, score-0.251]
55 To experiment with different summary length conditions we differentiate between summaries with a word limit (wordLimit, set to 200 words) and summaries containing N number of sentences (sentenceLimit) as stop condition in A* search. [sent-225, score-1.652]
56 We set N so that in both wordLimit and sentenceLimit summaries we obtain more or less the same number of words (because our training data contains on average 17 words for each word we set N to 12, 12* 17=194). [sent-226, score-0.619]
57 In the testing for both wordLimit and sentenceLimit we generate summaries with the same word limit constraint which allows us to have a fair comparison between the ROUGE recall scores. [sent-228, score-0.762]
58 In these summaries the sentences are ranked based on the weighted features produced by Support Vector Regression (SVR). [sent-230, score-0.65]
59 (2010) use multi-document summarization and linear regression methods to rank sentences in the documents. [sent-232, score-0.308]
60 As regression model they used SVR and showed 9We use the term regression to refer to SVR. [sent-233, score-0.348]
61 However, to have a fair comparison between all our summary types we use these weights to generate summaries using the A* search with the word limit as constraint. [sent-237, score-1.125]
62 The table shows ROUGE recall numbers obtained by comparing model summaries against automatically generated summaries on the training data. [sent-240, score-1.238]
63 In Table 4 we can see that the scores for wordLimit and sentenceLimit type summaries are always at maximum on the metric they were trained on (this can be observed by following the main diagonal of the result matrix). [sent-242, score-0.735]
64 However, because we have two different input document sets we report separate results for each of these (Table 5 shows result for in domain data and Table 6 shows result for out automated summaries are generated using the in domain input documents. [sent-251, score-0.762]
65 automated summaries are generated using the out of domain input documents. [sent-252, score-0.653]
66 From Table 5 we can see that the wordLimit summaries score highest compared to the other two types of summaries. [sent-255, score-0.668]
67 This is different from the training results where sentenceLimit summary type summaries are the top scoring ones. [sent-256, score-1.046]
68 As mentioned earlier the sentenceLimit summaries contain exactly 12 sentences, where on average each sentence in the training data has 17 words. [sent-257, score-0.647]
69 We picked 12 sentences to achieve roughly the same word limit con- × straint (12 17 = 204) so they can be compared tsotr athinet (w1o2r ×dL 1im7it = =an 2d0 regression type esu cmommaparireesd. [sent-258, score-0.304]
70 However, these sentenceLimit summaries have an average of 221 words, which explains the higher ROUGE recall scores seen in training compared to testing (where a 200 word limit was imposed). [sent-259, score-0.764]
71 The wordLimit summaries are significantly better than the scores from the other summary types irrespective of the evaluation metric. [sent-260, score-0.93]
72 at 489 noted that these summaries are the only ones where the training and testing had the same condition in A* search concerning the summary word limit constraint. [sent-264, score-1.144]
73 The scores in sentenceLimit type summaries are significantly lower than wordLimit summaries, despite using MERT to learn the weights. [sent-265, score-0.685]
74 The regression type summaries achieved the worst ROUGE metric scores. [sent-267, score-0.878]
75 The weights used to generate these summaries were trained on single sentences using SVR. [sent-268, score-0.681]
76 These results indicate that if the goal is to generate high scoring summaries under a length limit in testing, then the same constraint should also be used in training. [sent-269, score-0.89]
77 From Table 5 and 6 we can see that the summaries obtained from VirtualTourist captions (in domain data) score roughly the same as the summaries generated using web-documents (out of domain data) as input. [sent-270, score-1.38]
78 2 Manual Evaluation We also evaluated our summaries using a readability assessment as in DUC and TAC. [sent-274, score-0.647]
79 DUC and TAC manually assess the quality of automatically generated summaries by asking human subjects to score each summary using five criteria grammaticality, redundancy, clarity, focus and coherence criteria. [sent-275, score-0.977]
80 For this evaluation we used the best scoring sum– maries from the wordLimit summary type (R-1, R-2 and R-SU4) generated using web-documents (out of domain documents) as input. [sent-277, score-0.461]
81 We also evaluate the regression summary types generated using the same input documents to investigate the correlation between high and low ROUGE metric scores to manual evaluation ones. [sent-278, score-0.657]
82 From the regression summary type we only use summaries under the R2 and RSU4 trained models. [sent-279, score-1.108]
83 In total we evaluated five different summary types (three from wordLimit and two from regression). [sent-280, score-0.28]
84 For each type we randomly selected 30 place names and asked three people to assess the summaries for these place names. [sent-281, score-0.922]
85 2t84S9n(U72R41, R2, RSU4) and regression (R2, RSU4) summary types. [sent-287, score-0.454]
86 summaries (30 from each summary type) in a random way and was asked to assess them according to the DUC and TAC manual assessment scheme. [sent-289, score-0.95]
87 11 From Table 7 we can see that overall the wordLimit type summaries perform better than the regression ones. [sent-291, score-0.828]
88 For each metric in regression summary types (R-2 and R-SU4) we compute the significance of the difference with the same metrics in wordLimit summary types. [sent-292, score-0.784]
89 12 The results for the clarity, coherence and focus criteria in wordLimit summaries are significantly better than in regression ones (p<0. [sent-293, score-0.822]
90 Although in regression type summaries the scores for the grammaticality criterion are lower than those in wordLimit summaries the difference is not significant. [sent-298, score-1.549]
91 Furthermore, we can see that the redundancy scores for regression summaries are slightly higher than those for wordLimit summaries. [sent-299, score-0.889]
92 As mentioned above, feature weights for wordLimit summaries are trained using summaries with a specific word limit constraint, whereas the weights for the regression summaries are learned using single sentences. [sent-301, score-2.157]
93 490 like summaries” will lead to a higher content agreement between the training and the model summaries whereas this is not guaranteed with single sentences. [sent-311, score-0.653]
94 However, when these sentences are combined into summaries it is not guaranteed that these summaries will also have high content overlap with the entire model ones. [sent-313, score-1.303]
95 Therefore we believe if there is a high content agreement between the training and model summaries this could lead to more readable summaries. [sent-314, score-0.619]
96 In case of the redundancy criterion we have compared to wordLimit summary type high scores in regression summaries although wordLimit summaries are significantly better than regression ones when it concerns the ROUGE scores. [sent-316, score-2.022]
97 To minimize redundancy in summaries it is necessary to also take into consideration global features addressing the linguistic aspects of the summaries. [sent-319, score-0.684]
98 7 Conclusion In this paper we have proposed an A* search approach for generating a summary from a ranked list of sentences and learning feature weights for a feature based extractive multi-document summarization system. [sent-321, score-0.706]
99 Furthermore, we highlighted the importance of uniformity in training and testing and argued that if the goal is to generate high scoring summaries under a length limit in testing, then the same constraint should also be used in training. [sent-323, score-0.94]
100 In the future we plan to expand this feature set with global features, especially ones measuring lexical diversity in the summaries to reduce the redundancy in them. [sent-325, score-0.684]
wordName wordTfidf (topN-words)
[('summaries', 0.619), ('wordlimit', 0.324), ('summary', 0.28), ('rouge', 0.244), ('regression', 0.174), ('sentencelimit', 0.162), ('search', 0.131), ('aker', 0.13), ('extractive', 0.13), ('heuristic', 0.12), ('place', 0.118), ('scoring', 0.112), ('summarization', 0.103), ('duc', 0.097), ('virtualtourist', 0.081), ('document', 0.075), ('summarizer', 0.069), ('redundancy', 0.065), ('abbey', 0.065), ('ouyang', 0.065), ('svr', 0.065), ('documents', 0.065), ('limit', 0.064), ('gaizauskas', 0.063), ('name', 0.059), ('optimise', 0.056), ('descriptions', 0.055), ('schedule', 0.054), ('state', 0.051), ('mert', 0.051), ('testing', 0.05), ('metric', 0.05), ('brandow', 0.049), ('edmundson', 0.049), ('summarisation', 0.049), ('score', 0.049), ('grammaticality', 0.046), ('backtracking', 0.042), ('russell', 0.042), ('ln', 0.041), ('visited', 0.041), ('length', 0.039), ('image', 0.039), ('sorted', 0.038), ('ly', 0.038), ('queue', 0.038), ('church', 0.037), ('type', 0.035), ('aggregated', 0.035), ('guaranteed', 0.034), ('correlation', 0.034), ('domain', 0.034), ('return', 0.033), ('sn', 0.033), ('names', 0.032), ('admissible', 0.032), ('conroy', 0.032), ('finishing', 0.032), ('incur', 0.032), ('mly', 0.032), ('optimises', 0.032), ('popped', 0.032), ('riedhammer', 0.032), ('rreeqquuiirree', 0.032), ('summariser', 0.032), ('westminster', 0.032), ('end', 0.032), ('cost', 0.032), ('scores', 0.031), ('nodes', 0.031), ('sentences', 0.031), ('weights', 0.031), ('lengths', 0.03), ('composed', 0.03), ('object', 0.029), ('coherence', 0.029), ('constraint', 0.029), ('sentence', 0.028), ('disconnect', 0.028), ('necessitate', 0.028), ('clarity', 0.028), ('assessment', 0.028), ('bound', 0.028), ('goal', 0.027), ('greedy', 0.027), ('incrementally', 0.026), ('max', 0.026), ('gets', 0.026), ('finds', 0.025), ('criterion', 0.025), ('abstractive', 0.025), ('captions', 0.025), ('optimising', 0.025), ('sheffield', 0.025), ('tac', 0.025), ('query', 0.024), ('radev', 0.023), ('yth', 0.023), ('manual', 0.023), ('salton', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 82 emnlp-2010-Multi-Document Summarization Using A* Search and Discriminative Learning
Author: Ahmet Aker ; Trevor Cohn ; Robert Gaizauskas
Abstract: In this paper we address two key challenges for extractive multi-document summarization: the search problem of finding the best scoring summary and the training problem of learning the best model parameters. We propose an A* search algorithm to find the best extractive summary up to a given length, which is both optimal and efficient to run. Further, we propose a discriminative training algorithm which directly maximises the quality ofthe best summary, rather than assuming a sentence-level decomposition as in earlier work. Our approach leads to significantly better results than earlier techniques across a number of evaluation metrics.
2 0.2438195 102 emnlp-2010-Summarizing Contrastive Viewpoints in Opinionated Text
Author: Michael Paul ; ChengXiang Zhai ; Roxana Girju
Abstract: This paper presents a two-stage approach to summarizing multiple contrastive viewpoints in opinionated text. In the first stage, we use an unsupervised probabilistic approach to model and extract multiple viewpoints in text. We experiment with a variety of lexical and syntactic features, yielding significant performance gains over bag-of-words feature sets. In the second stage, we introduce Comparative LexRank, a novel random walk formulation to score sentences and pairs of sentences from opposite viewpoints based on both their representativeness of the collection as well as their contrastiveness with each other. Exper- imental results show that the proposed approach can generate informative summaries of viewpoints in opinionated text.
3 0.14646335 105 emnlp-2010-Title Generation with Quasi-Synchronous Grammar
Author: Kristian Woodsend ; Yansong Feng ; Mirella Lapata
Abstract: The task of selecting information and rendering it appropriately appears in multiple contexts in summarization. In this paper we present a model that simultaneously optimizes selection and rendering preferences. The model operates over a phrase-based representation of the source document which we obtain by merging PCFG parse trees and dependency graphs. Selection preferences for individual phrases are learned discriminatively, while a quasi-synchronous grammar (Smith and Eisner, 2006) captures rendering preferences such as paraphrases and compressions. Based on an integer linear programming formulation, the model learns to generate summaries that satisfy both types of preferences, while ensuring that length, topic coverage and grammar constraints are met. Experiments on headline and image caption generation show that our method obtains state-of-the-art performance using essentially the same model for both tasks without any major modifications.
4 0.096914172 64 emnlp-2010-Incorporating Content Structure into Text Analysis Applications
Author: Christina Sauper ; Aria Haghighi ; Regina Barzilay
Abstract: In this paper, we investigate how modeling content structure can benefit text analysis applications such as extractive summarization and sentiment analysis. This follows the linguistic intuition that rich contextual information should be useful in these tasks. We present a framework which combines a supervised text analysis application with the induction of latent content structure. Both of these elements are learned jointly using the EM algorithm. The induced content structure is learned from a large unannotated corpus and biased by the underlying text analysis task. We demonstrate that exploiting content structure yields significant improvements over approaches that rely only on local context.1
5 0.063044764 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding
Author: Guillaume Wisniewski ; Alexandre Allauzen ; Francois Yvon
Abstract: Extant Statistical Machine Translation (SMT) systems are very complex softwares, which embed multiple layers of heuristics and embark very large numbers of numerical parameters. As a result, it is difficult to analyze output translations and there is a real need for tools that could help developers to better understand the various causes of errors. In this study, we make a step in that direction and present an attempt to evaluate the quality of the phrase-based translation model. In order to identify those translation errors that stem from deficiencies in the phrase table (PT), we propose to compute the oracle BLEU-4 score, that is the best score that a system based on this PT can achieve on a reference corpus. By casting the computation of the oracle BLEU-1 as an Integer Linear Programming (ILP) problem, we show that it is possible to efficiently compute accurate lower-bounds of this score, and report measures performed on several standard benchmarks. Various other applications of these oracle decoding techniques are also reported and discussed. 1 Phrase-Based Machine Translation 1.1 Principle A Phrase-Based Translation System (PBTS) consists of a ruleset and a scoring function (Lopez, 2009). The ruleset, represented in the phrase table, is a set of phrase1pairs {(f, e) }, each pair expressing that the source phrase f can ,bee) r}e,w earicthten p (atirra enxslparteedss)i inngto t a target phrase e. Trarsaens flation hypotheses are generated by iteratively rewriting portions of the source sentence as prescribed by the ruleset, until each source word has been consumed by exactly one rule. The order of target words in an hypothesis is uniquely determined by the order in which the rewrite operation are performed. The search space ofthe translation model corresponds to the set of all possible sequences of 1Following the usage in statistical machine translation literature, use “phrase” to denote a subsequence of consecutive words. we 933 rules applications. The scoring function aims to rank all possible translation hypotheses in such a way that the best one has the highest score. A PBTS is learned from a parallel corpus in two independent steps. In a first step, the corpus is aligned at the word level, by using alignment tools such as Gi z a++ (Och and Ney, 2003) and some symmetrisation heuristics; phrases are then extracted by other heuristics (Koehn et al., 2003) and assigned numerical weights. In the second step, the parameters of the scoring function are estimated, typically through Minimum Error Rate training (Och, 2003). Translating a sentence amounts to finding the best scoring translation hypothesis in the search space. Because of the combinatorial nature of this problem, translation has to rely on heuristic search techniques such as greedy hill-climbing (Germann, 2003) or variants of best-first search like multi-stack decoding (Koehn, 2004). Moreover, to reduce the overall complexity of decoding, the search space is typically pruned using simple heuristics. For instance, the state-of-the-art phrase-based decoder Moses (Koehn et al., 2007) considers only a restricted number of translations for each source sequence2 and enforces a distortion limit3 over which phrases can be reordered. As a consequence, the best translation hypothesis returned by the decoder is not always the one with the highest score. 1.2 Typology of PBTS Errors Analyzing the errors of a SMT system is not an easy task, because of the number of models that are combined, the size of these models, and the high complexity of the various decision making processes. For a SMT system, three different kinds of errors can be distinguished (Germann et al., 2004; Auli et al., 2009): search errors, induction errors and model errors. The former corresponds to cases where the hypothesis with the best score is missed by the search procedure, either because of the use of an ap2the 3the option of Moses, defaulting to 20. dl option of Moses, whose default value is 7. tt l ProceMedITin,g Ms oasfs thaceh 2u0se1t0ts C,o UnSfAer,e n9c-e11 on O Ectmobpeir ic 2a0l1 M0.e ?tc ho2d0s10 in A Nsastouciraatlio Lnan fogru Cagoem Ppruotcaetisosninagl, L pinaggeusis 9t3ic3s–943, proximate search method or because of the restrictions of the search space. Induction errors correspond to cases where, given the model, the search space does not contain the reference. Finally, model errors correspond to cases where the hypothesis with the highest score is not the best translation according to the evaluation metric. Model errors encompass several types oferrors that occur during learning (Bottou and Bousquet, 2008)4. Approximation errors are errors caused by the use of a restricted and oversimplistic class of functions (here, finitestate transducers to model the generation of hypotheses and a linear scoring function to discriminate them) to model the translation process. Estimation errors correspond to the use of sub-optimal values for both the phrase pairs weights and the parameters of the scoring function. The reasons behind these errors are twofold: first, training only considers a finite sample of data; second, it relies on error prone alignments. As a result, some “good” phrases are extracted with a small weight, or, in the limit, are not extracted at all; and conversely that some “poor” phrases are inserted into the phrase table, sometimes with a really optimistic score. Sorting out and assessing the impact of these various causes of errors is of primary interest for SMT system developers: for lack of such diagnoses, it is difficult to figure out which components of the system require the most urgent attention. Diagnoses are however, given the tight intertwining among the various component of a system, very difficult to obtain: most evaluations are limited to the computation of global scores and usually do not imply any kind of failure analysis. 1.3 Contribution and organization To systematically assess the impact of the multiple heuristic decisions made during training and decoding, we propose, following (Dreyer et al., 2007; Auli et al., 2009), to work out oracle scores, that is to evaluate the best achievable performances of a PBTS. We aim at both studying the expressive power of PBTS and at providing tools for identifying and quantifying causes of failure. Under standard metrics such as BLEU (Papineni et al., 2002), oracle scores are difficult (if not impossible) to compute, but, by casting the computation of the oracle unigram recall and precision as an Integer Linear Programming (ILP) problem, we show that it is possible to efficiently compute accurate lower-bounds of the oracle BLEU-4 scores and report measurements performed on several standard benchmarks. The main contributions of this paper are twofold. We first introduce an ILP program able to efficiently find the best hypothesis a PBTS can achieve. This program can be easily extended to test various improvements to 4We omit here optimization errors. 934 phrase-base systems or to evaluate the impact of different parameter settings. Second, we present a number of complementary results illustrating the usage of our oracle decoder for identifying and analyzing PBTS errors. Our experimental results confirm the main conclusions of (Turchi et al., 2008), showing that extant PBTs have the potential to generate hypotheses having very high BLEU4 score and that their main bottleneck is their scoring function. The rest of this paper is organized as follows: in Section 2, we introduce and formalize the oracle decoding problem, and present a series of ILP problems of increasing complexity designed so as to deliver accurate lowerbounds of oracle score. This section closes with various extensions allowing to model supplementary constraints, most notably reordering constraints (Section 2.5). Our experiments are reported in Section 3, where we first introduce the training and test corpora, along with a description of our system building pipeline (Section 3. 1). We then discuss the baseline oracle BLEU scores (Section 3.2), analyze the non-reachable parts of the reference translations, and comment several complementary results which allow to identify causes of failures. Section 4 discuss our approach and findings with respect to the existing literature on error analysis and oracle decoding. We conclude and discuss further prospects in Section 5. 2 Oracle Decoder 2.1 The Oracle Decoding Problem Definition To get some insights on the errors of phrasebased systems and better understand their limits, we propose to consider the oracle decoding problem defined as follows: given a source sentence, its reference translation5 and a phrase table, what is the “best” translation hypothesis a system can generate? As usual, the quality of an hypothesis is evaluated by the similarity between the reference and the hypothesis. Note that in the oracle decoding problem, we are only assessing the ability of PBT systems to generate good candidate translations, irrespective of their ability to score them properly. We believe that studying this problem is interesting for various reasons. First, as described in Section 3.4, comparing the best hypothesis a system could have generated and the hypothesis it actually generates allows us to carry on both quantitative and qualitative failure analysis. The oracle decoding problem can also be used to assess the expressive power of phrase-based systems (Auli et al., 2009). Other applications include computing acceptable pseudo-references for discriminative training (Tillmann and Zhang, 2006; Liang et al., 2006; Arun and 5The oracle decoding problem can be extended to the case of multiple references. For the sake of simplicity, we only describe the case of a single reference. Koehn, 2007) or combining machine translation systems in a multi-source setting (Li and Khudanpur, 2009). We have also used oracle decoding to identify erroneous or difficult to translate references (Section 3.3). Evaluation Measure To fully define the oracle decoding problem, a measure of the similarity between a translation hypothesis and its reference translation has to be chosen. The most obvious choice is the BLEU-4 score (Papineni et al., 2002) used in most machine translation evaluations. However, using this metric in the oracle decoding problem raises several issues. First, BLEU-4 is a metric defined at the corpus level and is hard to interpret at the sentence level. More importantly, BLEU-4 is not decomposable6: as it relies on 4-grams statistics, the contribution of each phrase pair to the global score depends on the translation of the previous and following phrases and can not be evaluated in isolation. Because of its nondecomposability, maximizing BLEU-4 is hard; in particular, the phrase-level decomposability of the evaluation × metric is necessary in our approach. To circumvent this difficulty, we propose to evaluate the similarity between a translation hypothesis and a reference by the number of their common words. This amounts to evaluating translation quality in terms of unigram precision and recall, which are highly correlated with human judgements (Lavie et al., ). This measure is closely related to the BLEU-1 evaluation metric and the Meteor (Banerjee and Lavie, 2005) metric (when it is evaluated without considering near-matches and the distortion penalty). We also believe that hypotheses that maximize the unigram precision and recall at the sentence level yield corpus level BLEU-4 scores close the maximal achievable. Indeed, in the setting we will introduce in the next section, BLEU-1 and BLEU-4 are highly correlated: as all correct words of the hypothesis will be compelled to be at their correct position, any hypothesis with a high 1-gram precision is also bound to have a high 2-gram precision, etc. 2.2 Formalizing the Oracle Decoding Problem The oracle decoding problem has already been considered in the case of word-based models, in which all translation units are bound to contain only one word. The problem can then be solved by a bipartite graph matching algorithm (Leusch et al., 2008): given a n m binary matarligxo describing possible t 2r0an08sl)a:ti goinv elinn aks n b×emtw beeinna source words and target words7, this algorithm finds the subset of links maximizing the number of words of the reference that have been translated, while ensuring that each word 6Neither at the sentence (Chiang et al., 2008), nor at the phrase level. 7The (i, j) entry of the matrix is 1if the ith word of the source can be translated by the jth word of the reference, 0 otherwise. 935 is translated only once. Generalizing this approach to phrase-based systems amounts to solving the following problem: given a set of possible translation links between potential phrases of the source and of the target, find the subset of links so that the unigram precision and recall are the highest possible. The corresponding oracle hypothesis can then be easily generated by selecting the target phrases that are aligned with one source phrase, disregarding the others. In addition, to mimic the way OOVs are usually handled, we match identical OOV tokens appearing both in the source and target sentences. In this approach, the unigram precision is always one (every word generated in the oracle hypothesis matches exactly one word in the reference). As a consequence, to find the oracle hypothesis, we just have to maximize the recall, that is the number of words appearing both in the hypothesis and in the reference. Considering phrases instead of isolated words has a major impact on the computational complexity: in this new setting, the optimal segmentations in phrases of both the source and of the target have to be worked out in addition to links selection. Moreover, constraints have to be taken into account so as to enforce a proper segmentation of the source and target sentences. These constraints make it impossible to use the approach of (Leusch et al., 2008) and concur in making the oracle decoding problem for phrase-based models more complex than it is for word-based models: it can be proven, using arguments borrowed from (De Nero and Klein, 2008), that this problem is NP-hard even for the simple unigram precision measure. 2.3 An Integer Program for Oracle Decoding To solve the combinatorial problem introduced in the previous section, we propose to cast it into an Integer Linear Programming (ILP) problem, for which many generic solvers exist. ILP has already been used in SMT to find the optimal translation for word-based (Germann et al., 2001) and to study the complexity of learning phrase alignments (De Nero and Klein, 2008) models. Following the latter reference, we introduce the following variables: fi,j (resp. ek,l) is a binary indicator variable that is true when the phrase contains all spans from betweenword position i to j (resp. k to l) of the source (resp. target) sentence. We also introduce a binary variable, denoted ai,j,k,l, to describe a possible link between source phrase fi,j and target phrase ek,l. These variables are built from the entries of the phrase table according to selection strategies introduced in Section 2.4. In the following, index variables are so that: 0 ≤ i< j ≤ n, in the source sentence and 0 ≤ k < l ≤ m, in the target sentence, where n (resp. m) is the length of the source (resp. target) sentence. Solving the oracle decoding problem then amounts to optimizing the following objective function: mi,j,akx,li,Xj,k,lai,j,k,l· (l − k), (1) under the constraints: X ∀x ∈ J1,mK : ek,l ≤ 1 (2) = (3) 1∀,kn,lK : Xai,j,k,l = fk,l (4) ∀i,j : Xai,j,k,l (5) k,l s.tX. Xk≤x≤l ∀∀xy ∈∈ J11,,mnKK : X i,j s.tX. Xi≤y≤j fi,j 1 Xi,j = ei,j Xk,l The objective function (1) corresponds to the number of target words that are generated. The first set of constraints (2) ensures that each word in the reference e ap- pears in no more than one phrase. Maximizing the objective under these constraints amounts to maximizing the unigram recall. The second set of constraints (3) ensures that each word in the source f is translated exactly once, which guarantees that the search space of the ILP problem is the same as the search space of a phrase-based system. Constraints (4) bind the fk,l and ai,j,k,l variables, ensuring that whenever a link ai,j,k,l is active, the corresponding phrase fk,l is also active. Constraints (5) play a similar role for the reference. The Relaxed Problem Even though it accurately models the search space of a phrase-based decoder, this programs is not really useful as is: due to out-ofvocabulary words or missing entries in the phrase table, the constraint that all source words should be translated yields infeasible problems8. We propose to relax this problem and allow some source words to remain untranslated. This is done by replacing constraints (3) by: ∀y ∈ J1,nK : X i,j s.tX. Xi≤y≤j fi,j ≤ 1 To better ref∀lyec ∈t th J1e, bneKh :avior of phrase-based decoders, which attempt to translate all source words, we also need to modify the objective function as follows: X i,Xj,k,l ai,j,k,l · (l − k) +Xfi,j · (j − i) Xi,j (6) The second term in this new objective ensures that optimal solutions translate as many source words as possible. 8An ILP problem is said to be infeasible when tion violates at least one constraint. every possible solu- 936 The Relaxed-Distortion Problem A last caveat with the Relaxed optimization program is caused by frequently occurring source tokens, such as function words or punctuation signs, which can often align with more than one target word. For lack of taking distortion information into account in our objective function, all these alignments are deemed equivalent, even if some of them are clearly more satisfactory than others. This situation is illustrated on Figure 1. le chat et the cat and le the chien dog Figure 1: Equivalent alignments between “le” and “the”. The dashed lines corresponds to a less interpretable solution. To overcome this difficulty, we propose a last change to the objective function: X i,Xj,k,l ai,j,k,l · (l − k) +Xfi,j · (j − i) X ai,j,k,l|k − i| Xi,j −α (7) i Xk ,l X,j, Compared to the objective function of the relaxed problem (6), we introduce here a supplementary penalty factor which favors monotonous alignments. For each phrase pair, the higher the difference between source and target positions, the higher this penalty. If α is small enough, this extra term allows us to select, among all the optimal alignments of the re l axed problem, the one with the lowest distortion. In our experiments, we set α to min {n, m} to ensure that the penalty factor is always smminall{enr, ,tmha}n tthoe e rneswuarred t fhoart aligning atwltyo single iwso ardlwsa. 2.4 Selecting Indicator Variables In the approach introduced in the previous sections, the oracle decoding problem is solved by selecting, among a set of possible translation links, the ones that yield the solution with the highest unigram recall. We propose two strategies to build this set of possible translation links. In the first one, denoted exact match, an indicator ai,j,k,l is created if there is an entry (f, e) so that f spans from word position ito j in the source and e from word position k to l in the target. In this strategy, the ILP program considers exactly the same ruleset as conventional phrase-based decoders. We also consider an alternative strategy, which could help us to identify errors made during the phrase extraction process. In this strategy, denoted inside match, an indicator ai,j,k,l is created when the following three criteria are met: i) f spans from position ito j of the source; ii) a substring of e, denoted e, spans from position k to l of the reference; iii) (f, e¯) is not an entry of the phrase table. The resulting set of indicator variables thus contains, at least, all the variables used in the exact match strategy. In addition, we license here the use of phrases containing words that do not occur in the reference. In fact, using such solutions can yield higher BLEU scores when the reward for additional correct matches exceeds the cost incurred by wrong predictions. These cases are symptoms of situations where the extraction heuristic failed to extract potentially useful subphrases. 2.5 Oracle Decoding with Reordering Constraints The ILP problem introduced in the previous section can be extended in several ways to describe and test various improvements to phrase-based systems or to evaluate the impact of different parameter settings. This flexibility mainly stems from the possibility offered by our framework to express arbitrary constraints over variables. In this section, we illustrate these possibilities by describing how reordering constraints can easily be considered. As a first example, the Moses decoder uses a distortion limit to constrain the set of possible reorderings. This constraint “enforces (...) that the last word of a phrase chosen for translation cannot be more than d9 words from the leftmost untranslated word in the source” (Lopez, 2009) and is expressed as: ∀aijkl , ai0j0k0l0 s.t. k > k0, aijkl · ai0j0k0l0 · |j − i0 + 1| ≤ d, The maximum distortion limit strategy (Lopez, 2009) is also easily expressed and take the following form (assuming this constraint is parameterized by d): ∀l < m − 1, ai,j,k,l·ai0,j0,l+1,l0 · |i0 − j − 1| 71is%t e6hs.a distortion greater that Moses default distortion limit. alignment decisions enabled by the use of larger training corpora and phrase table. To evaluate the impact ofthe second heuristic, we computed the number of phrases discarded by Moses (be- cause of the default ttl limit) but used in the oracle hypotheses. In the English to French NEWSCO setting, they account for 34.11% of the total number of phrases used in the oracle hypotheses. When the oracle decoder is constrained to use the same phrase table as Moses, its BLEU-4 score drops to 42.78. This shows that filtering the phrase table prior to decoding discards many useful phrase pairs and is seriously limiting the best achievable performance, a conclusion shared with (Auli et al., 2009). Search Errors Search errors can be identified by comparing the score of the best hypothesis found by Moses and the score of the oracle hypothesis. If the score of the oracle hypothesis is higher, then there has been a search error; on the contrary, there has been an estimation error when the score of the oracle hypothesis is lower than the score of the best hypothesis found by Moses. 940 Based on the comparison of the score of Moses hypotheses and of oracle hypotheses for the English to French NEWSCO setting, our preliminary conclusion is that the number of search errors is quite limited: only about 5% of the hypotheses of our oracle decoder are actually getting a better score than Moses solutions. Again, this shows that the scoring function (model error) is one of the main bottleneck of current PBTS. Comparing these hypotheses is nonetheless quite revealing: while Moses mostly selects phrase pairs with high translation scores and generates monotonous alignments, our ILP decoder uses larger reorderings and less probable phrases to achieve better solutions: on average, the reordering score of oracle solutions is −5.74, compared to −76.78 fscoro rMeo osfe osr outputs. iGonivsen is −the5 weight assigned through MERT training to the distortion score, no wonder that these hypotheses are severely penalized. The Impact of Phrase Length The observed outputs do not only depend on decisions made during the search, but also on decisions made during training. One such decision is the specification of maximal length for the source and target phrases. In our framework, evaluating the impact of this decision is simple: it suffices to change the definition of indicator variables so as to consider only alignments between phrases of a given length. In the English-French NEWSCO setting, the most restrictive choice, when only alignments between single words are authorized, yields an oracle BLEU-4 of 48.68; however, authorizing phrases up to length 2 allows to achieve an oracle value of 66.57, very close to the score achieved when considering all extracted phrases (67.77). This is corroborated with a further analysis of our oracle alignments, which use phrases whose average source length is 1.21 words (respectively 1.31 for target words). If many studies have already acknowledged the predomi- nance of “small” phrases in actual translations, our oracle scores suggest that, for this language pair, increasing the phrase length limit beyond 2 or 3 might be a waste of computational resources. 4 Related Work To the best of our knowledge, there are only a few works that try to study the expressive power ofphrase-based machine translation systems or to provide tools for analyzing potential causes of failure. The approach described in (Auli et al., 2009) is very similar to ours: in this study, the authors propose to find and analyze the limits of machine translation systems by studying the reference reachability. A reference is reachable for a given system if it can be exactly generated by this system. Reference reachability is assessed using Moses in forced decoding mode: during search, all hypotheses that deviate from the reference are simply discarded. Even though the main goal of this study was to compare the search space of phrase-based and hierarchical systems, it also provides some insights on the impact of various search parameters in Moses, delivering conclusions that are consistent with our main results. As described in Section 1.2, these authors also propose a typology of the errors of a statistical translation systems, but do not attempt to provide methods for identifying them. The authors of (Turchi et al., 2008) study the learn- ing capabilities of Moses by extensively analyzing learning curves representing the translation performances as a function of the number of examples, and by corrupting the model parameters. Even though their focus is more on assessing the scoring function, they reach conclusions similar to ours: the current bottleneck of translation performances is not the representation power of the PBTS but rather in their scoring functions. Oracle decoding is useful to compute reachable pseudo-references in the context of discriminative training. This is the main motivation of (Tillmann and Zhang, 2006), where the authors compute high BLEU hypotheses by running a conventional decoder so as to maximize a per-sentence approximation of BLEU-4, under a simple (local) reordering model. Oracle decoding has also been used to assess the limitations induced by various reordering constraints in (Dreyer et al., 2007). To this end, the authors propose to use a beam-search based oracle decoder, which computes lower bounds of the best achievable BLEU-4 using dynamic programming techniques over finite-state (for so-called local and IBM constraints) or hierarchically structured (for ITG constraints) sets of hypotheses. Even 941 though the numbers reported in this study are not directly comparable with ours17, it seems that our decoder is not only conceptually much simpler, but also achieves much more optimistic lower-bounds of the oracle BLEU score. The approach described in (Li and Khudanpur, 2009) employs a similar technique, which is to guide a heuristic search in an hypergraph representing possible translation hypotheses with n-gram counts matches, which amounts to decoding with a n-gram model trained on the sole reference translation. Additional tricks are presented in this article to speed-up decoding. Computing oracle BLEU scores is also the subject of (Zens and Ney, 2005; Leusch et al., 2008), yet with a different emphasis. These studies are concerned with finding the best hypotheses in a word graph or in a consensus network, a problem that has various implications for multi-pass decoding and/or system combination techniques. The former reference describes an exponential approximate algorithm, while the latter proves the NPcompleteness of this problem and discuss various heuristic approaches. Our problem is somewhat more complex and using their techniques would require us to built word graphs containing all the translations induced by arbitrary segmentations and permutations of the source sentence. 5 Conclusions In this paper, we have presented a methodology for analyzing the errors of PBTS, based on the computation of an approximation of the BLEU-4 oracle score. We have shown that this approximation could be computed fairly accurately and efficiently using Integer Linear Programming techniques. Our main result is a confirmation of the fact that extant PBTS systems are expressive enough to achieve very high translation performance with respect to conventional quality measurements. The main efforts should therefore strive to improve on the way phrases and hypotheses are scored during training. This gives further support to attempts aimed at designing context-dependent scoring functions as in (Stroppa et al., 2007; Gimpel and Smith, 2008), or at attempts to perform discriminative training of feature-rich models. (Bangalore et al., 2007). We have shown that the examination of difficult-totranslate sentences was an effective way to detect errors or inconsistencies in the reference translations, making our approach a potential aid for controlling the quality or assessing the difficulty of test data. Our experiments have also highlighted the impact of various parameters. Various extensions of the baseline ILP program have been suggested and/or evaluated. In particular, the ILP formalism lends itself well to expressing various constraints that are typically used in conventional PBTS. In 17The best BLEU-4 oracle they achieve on Europarl German to English is approximately 48; but they considered a smaller version of the training corpus and the WMT’06 test set. our future work, we aim at using this ILP framework to systematically assess various search configurations. We plan to explore how replacing non-reachable references with high-score pseudo-references can improve discrim- inative training of PBTS. We are also concerned by determining how tight is our approximation of the BLEU4 score is: to this end, we intend to compute the best BLEU-4 score within the n-best solutions of the oracle decoding problem. Acknowledgments Warm thanks to Houda Bouamor for helping us with the annotation tool. This work has been partly financed by OSEO, the French State Agency for Innovation, under the Quaero program. References Tobias Achterberg. 2007. Constraint Integer Programming. Ph.D. thesis, Technische Universit a¨t Berlin. http : / / opus .kobv .de /tuberl in/vol ltexte / 2 0 0 7 / 16 11/ . Abhishek Arun and Philipp Koehn. 2007. Online learning methods for discriminative training of phrase based statistical machine translation. In Proc. of MT Summit XI, Copenhagen, Denmark. Michael Auli, Adam Lopez, Hieu Hoang, and Philipp Koehn. 2009. A systematic analysis of translation model search spaces. In Proc. of WMT, pages 224–232, Athens, Greece. Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proc. of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Srinivas Bangalore, Patrick Haffner, and Stephan Kanthak. 2007. Statistical machine translation through global lexical selection and sentence reconstruction. In Proc. of ACL, pages 152–159, Prague, Czech Republic. L e´on Bottou and Olivier Bousquet. 2008. The tradeoffs oflarge scale learning. In Proc. of NIPS, pages 161–168, Vancouver, B.C., Canada. Chris Callison-Burch, Philipp Koehn, Christof Monz, and Josh Schroeder. 2009. Findings of the 2009 Workshop on Statistical Machine Translation. In Proc. of WMT, pages 1–28, Athens, Greece. David Chiang, Steve DeNeefe, Yee Seng Chan, and Hwee Tou Ng. 2008. Decomposability of translation metrics for improved evaluation and efficient algorithms. In Proc. of ECML, pages 610–619, Honolulu, Hawaii. John De Nero and Dan Klein. 2008. The complexity of phrase alignment problems. In Proc. of ACL: HLT, Short Papers, pages 25–28, Columbus, Ohio. Markus Dreyer, Keith B. Hall, and Sanjeev P. Khudanpur. 2007. Comparing reordering constraints for smt using efficient bleu oracle computation. In NAACL-HLT/AMTA Workshop on Syntax and Structure in Statistical Translation, pages 103– 110, Rochester, New York. 942 Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. 2001 . Fast decoding and optimal decoding for machine translation. In Proc. of ACL, pages 228–235, Toulouse, France. Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. 2004. Fast and optimal decoding for machine translation. Artificial Intelligence, 154(1-2): 127– 143. Ulrich Germann. 2003. Greedy decoding for statistical machine translation in almost linear time. In Proc. of NAACL, pages 1–8, Edmonton, Canada. Kevin Gimpel and Noah A. Smith. 2008. Rich source-side context for statistical machine translation. In Proc. of WMT, pages 9–17, Columbus, Ohio. Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proc. of NAACL, pages 48–54, Edmonton, Canada. Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris CallisonBurch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proc. of ACL, demonstration session. Philipp Koehn. 2004. Pharaoh: A beam search decoder for phrase-based statistical machine translation models. In Proc. of AMTA, pages 115–124, Washington DC. Shankar Kumar and William Byrne. 2005. Local phrase reordering models for statistical machine translation. In Proc. of HLT, pages 161–168, Vancouver, Canada. Alon Lavie, Kenji Sagae, and Shyamsundar Jayaraman. The significance of recall in automatic metrics for MT evaluation. In In Proc. of AMTA, pages 134–143, Washington DC. Gregor Leusch, Evgeny Matusov, and Hermann Ney. 2008. Complexity of finding the BLEU-optimal hypothesis in a confusion network. In Proc. of EMNLP, pages 839–847, Honolulu, Hawaii. Zhifei Li and Sanjeev Khudanpur. 2009. Efficient extraction of oracle-best translations from hypergraphs. In Proc. of NAACL, pages 9–12, Boulder, Colorado. Percy Liang, Alexandre Bouchard-C oˆt´ e, Dan Klein, and Ben Taskar. 2006. An end-to-end discriminative approach to machine translation. In Proc. of ACL, pages 761–768, Sydney, Australia. Adam Lopez. 2009. Translation as weighted deduction. In Proc. of EACL, pages 532–540, Athens, Greece. Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Comput. Linguist. , 29(1): 19–5 1. Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proc. of ACL, pages 160–167, Sapporo, Japan. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-jing Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. Technical report, Philadelphia, Pennsylvania. D. Roth and W. Yih. 2005. Integer linear programming inference for conditional random fields. In Proc. of ICML, pages 737–744, Bonn, Germany. Nicolas Stroppa, Antal van den Bosch, and Andy Way. 2007. Exploiting source similarity for smt using context-informed features. In Andy Way and Barbara Proc. of TMI, pages Christoph Tillmann 231–240, Sk¨ ovde, and Tong Zhang. Gawronska, editors, Sweden. 2006. A discriminative global training algorithm for statistical mt. In Proc. of ACL, 721–728, Sydney, Australia. Turchi, Tijl De Bie, and Nello pages Marco Cristianini. 2008. Learn- ing performance of a machine translation system: a statistical and computational analysis. In Proc. of WMT, pages Columbus, Ohio. 35–43, Richard Zens and Hermann Ney. 2005. Word graphs for statistical machine translation. In Proc. of the ACL Workshop on Building and Using Parallel Texts, pages 191–198, Ann Arbor, Michigan. 943
6 0.061423454 56 emnlp-2010-Hashing-Based Approaches to Spelling Correction of Personal Names
7 0.057385895 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation
8 0.051648874 66 emnlp-2010-Inducing Word Senses to Improve Web Search Result Clustering
9 0.046826646 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice
10 0.046671875 80 emnlp-2010-Modeling Organization in Student Essays
11 0.045697417 73 emnlp-2010-Learning Recurrent Event Queries for Web Search
12 0.044048131 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL
13 0.043441556 109 emnlp-2010-Translingual Document Representations from Discriminative Projections
14 0.042742942 89 emnlp-2010-PEM: A Paraphrase Evaluation Metric Exploiting Parallel Texts
15 0.042532839 61 emnlp-2010-Improving Gender Classification of Blog Authors
16 0.042435173 1 emnlp-2010-"Poetic" Statistical Machine Translation: Rhyme and Meter
17 0.042348523 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation
18 0.042011306 84 emnlp-2010-NLP on Spoken Documents Without ASR
19 0.042010039 22 emnlp-2010-Automatic Evaluation of Translation Quality for Distant Language Pairs
20 0.040123694 87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space
topicId topicWeight
[(0, 0.161), (1, 0.054), (2, -0.118), (3, -0.018), (4, 0.094), (5, 0.075), (6, -0.022), (7, -0.024), (8, -0.047), (9, -0.079), (10, -0.095), (11, 0.054), (12, 0.075), (13, -0.046), (14, 0.017), (15, 0.028), (16, 0.288), (17, -0.426), (18, -0.272), (19, 0.227), (20, 0.138), (21, 0.213), (22, -0.035), (23, 0.082), (24, 0.098), (25, -0.003), (26, 0.009), (27, -0.027), (28, -0.059), (29, -0.051), (30, -0.014), (31, -0.061), (32, -0.061), (33, -0.014), (34, 0.05), (35, -0.111), (36, -0.047), (37, -0.078), (38, -0.016), (39, 0.001), (40, -0.017), (41, -0.085), (42, -0.02), (43, -0.007), (44, -0.1), (45, 0.057), (46, -0.018), (47, 0.004), (48, 0.043), (49, 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.95998758 82 emnlp-2010-Multi-Document Summarization Using A* Search and Discriminative Learning
Author: Ahmet Aker ; Trevor Cohn ; Robert Gaizauskas
Abstract: In this paper we address two key challenges for extractive multi-document summarization: the search problem of finding the best scoring summary and the training problem of learning the best model parameters. We propose an A* search algorithm to find the best extractive summary up to a given length, which is both optimal and efficient to run. Further, we propose a discriminative training algorithm which directly maximises the quality ofthe best summary, rather than assuming a sentence-level decomposition as in earlier work. Our approach leads to significantly better results than earlier techniques across a number of evaluation metrics.
2 0.87594235 102 emnlp-2010-Summarizing Contrastive Viewpoints in Opinionated Text
Author: Michael Paul ; ChengXiang Zhai ; Roxana Girju
Abstract: This paper presents a two-stage approach to summarizing multiple contrastive viewpoints in opinionated text. In the first stage, we use an unsupervised probabilistic approach to model and extract multiple viewpoints in text. We experiment with a variety of lexical and syntactic features, yielding significant performance gains over bag-of-words feature sets. In the second stage, we introduce Comparative LexRank, a novel random walk formulation to score sentences and pairs of sentences from opposite viewpoints based on both their representativeness of the collection as well as their contrastiveness with each other. Exper- imental results show that the proposed approach can generate informative summaries of viewpoints in opinionated text.
3 0.50476331 105 emnlp-2010-Title Generation with Quasi-Synchronous Grammar
Author: Kristian Woodsend ; Yansong Feng ; Mirella Lapata
Abstract: The task of selecting information and rendering it appropriately appears in multiple contexts in summarization. In this paper we present a model that simultaneously optimizes selection and rendering preferences. The model operates over a phrase-based representation of the source document which we obtain by merging PCFG parse trees and dependency graphs. Selection preferences for individual phrases are learned discriminatively, while a quasi-synchronous grammar (Smith and Eisner, 2006) captures rendering preferences such as paraphrases and compressions. Based on an integer linear programming formulation, the model learns to generate summaries that satisfy both types of preferences, while ensuring that length, topic coverage and grammar constraints are met. Experiments on headline and image caption generation show that our method obtains state-of-the-art performance using essentially the same model for both tasks without any major modifications.
4 0.22783463 64 emnlp-2010-Incorporating Content Structure into Text Analysis Applications
Author: Christina Sauper ; Aria Haghighi ; Regina Barzilay
Abstract: In this paper, we investigate how modeling content structure can benefit text analysis applications such as extractive summarization and sentiment analysis. This follows the linguistic intuition that rich contextual information should be useful in these tasks. We present a framework which combines a supervised text analysis application with the induction of latent content structure. Both of these elements are learned jointly using the EM algorithm. The induced content structure is learned from a large unannotated corpus and biased by the underlying text analysis task. We demonstrate that exploiting content structure yields significant improvements over approaches that rely only on local context.1
5 0.22103639 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding
Author: Guillaume Wisniewski ; Alexandre Allauzen ; Francois Yvon
Abstract: Extant Statistical Machine Translation (SMT) systems are very complex softwares, which embed multiple layers of heuristics and embark very large numbers of numerical parameters. As a result, it is difficult to analyze output translations and there is a real need for tools that could help developers to better understand the various causes of errors. In this study, we make a step in that direction and present an attempt to evaluate the quality of the phrase-based translation model. In order to identify those translation errors that stem from deficiencies in the phrase table (PT), we propose to compute the oracle BLEU-4 score, that is the best score that a system based on this PT can achieve on a reference corpus. By casting the computation of the oracle BLEU-1 as an Integer Linear Programming (ILP) problem, we show that it is possible to efficiently compute accurate lower-bounds of this score, and report measures performed on several standard benchmarks. Various other applications of these oracle decoding techniques are also reported and discussed. 1 Phrase-Based Machine Translation 1.1 Principle A Phrase-Based Translation System (PBTS) consists of a ruleset and a scoring function (Lopez, 2009). The ruleset, represented in the phrase table, is a set of phrase1pairs {(f, e) }, each pair expressing that the source phrase f can ,bee) r}e,w earicthten p (atirra enxslparteedss)i inngto t a target phrase e. Trarsaens flation hypotheses are generated by iteratively rewriting portions of the source sentence as prescribed by the ruleset, until each source word has been consumed by exactly one rule. The order of target words in an hypothesis is uniquely determined by the order in which the rewrite operation are performed. The search space ofthe translation model corresponds to the set of all possible sequences of 1Following the usage in statistical machine translation literature, use “phrase” to denote a subsequence of consecutive words. we 933 rules applications. The scoring function aims to rank all possible translation hypotheses in such a way that the best one has the highest score. A PBTS is learned from a parallel corpus in two independent steps. In a first step, the corpus is aligned at the word level, by using alignment tools such as Gi z a++ (Och and Ney, 2003) and some symmetrisation heuristics; phrases are then extracted by other heuristics (Koehn et al., 2003) and assigned numerical weights. In the second step, the parameters of the scoring function are estimated, typically through Minimum Error Rate training (Och, 2003). Translating a sentence amounts to finding the best scoring translation hypothesis in the search space. Because of the combinatorial nature of this problem, translation has to rely on heuristic search techniques such as greedy hill-climbing (Germann, 2003) or variants of best-first search like multi-stack decoding (Koehn, 2004). Moreover, to reduce the overall complexity of decoding, the search space is typically pruned using simple heuristics. For instance, the state-of-the-art phrase-based decoder Moses (Koehn et al., 2007) considers only a restricted number of translations for each source sequence2 and enforces a distortion limit3 over which phrases can be reordered. As a consequence, the best translation hypothesis returned by the decoder is not always the one with the highest score. 1.2 Typology of PBTS Errors Analyzing the errors of a SMT system is not an easy task, because of the number of models that are combined, the size of these models, and the high complexity of the various decision making processes. For a SMT system, three different kinds of errors can be distinguished (Germann et al., 2004; Auli et al., 2009): search errors, induction errors and model errors. The former corresponds to cases where the hypothesis with the best score is missed by the search procedure, either because of the use of an ap2the 3the option of Moses, defaulting to 20. dl option of Moses, whose default value is 7. tt l ProceMedITin,g Ms oasfs thaceh 2u0se1t0ts C,o UnSfAer,e n9c-e11 on O Ectmobpeir ic 2a0l1 M0.e ?tc ho2d0s10 in A Nsastouciraatlio Lnan fogru Cagoem Ppruotcaetisosninagl, L pinaggeusis 9t3ic3s–943, proximate search method or because of the restrictions of the search space. Induction errors correspond to cases where, given the model, the search space does not contain the reference. Finally, model errors correspond to cases where the hypothesis with the highest score is not the best translation according to the evaluation metric. Model errors encompass several types oferrors that occur during learning (Bottou and Bousquet, 2008)4. Approximation errors are errors caused by the use of a restricted and oversimplistic class of functions (here, finitestate transducers to model the generation of hypotheses and a linear scoring function to discriminate them) to model the translation process. Estimation errors correspond to the use of sub-optimal values for both the phrase pairs weights and the parameters of the scoring function. The reasons behind these errors are twofold: first, training only considers a finite sample of data; second, it relies on error prone alignments. As a result, some “good” phrases are extracted with a small weight, or, in the limit, are not extracted at all; and conversely that some “poor” phrases are inserted into the phrase table, sometimes with a really optimistic score. Sorting out and assessing the impact of these various causes of errors is of primary interest for SMT system developers: for lack of such diagnoses, it is difficult to figure out which components of the system require the most urgent attention. Diagnoses are however, given the tight intertwining among the various component of a system, very difficult to obtain: most evaluations are limited to the computation of global scores and usually do not imply any kind of failure analysis. 1.3 Contribution and organization To systematically assess the impact of the multiple heuristic decisions made during training and decoding, we propose, following (Dreyer et al., 2007; Auli et al., 2009), to work out oracle scores, that is to evaluate the best achievable performances of a PBTS. We aim at both studying the expressive power of PBTS and at providing tools for identifying and quantifying causes of failure. Under standard metrics such as BLEU (Papineni et al., 2002), oracle scores are difficult (if not impossible) to compute, but, by casting the computation of the oracle unigram recall and precision as an Integer Linear Programming (ILP) problem, we show that it is possible to efficiently compute accurate lower-bounds of the oracle BLEU-4 scores and report measurements performed on several standard benchmarks. The main contributions of this paper are twofold. We first introduce an ILP program able to efficiently find the best hypothesis a PBTS can achieve. This program can be easily extended to test various improvements to 4We omit here optimization errors. 934 phrase-base systems or to evaluate the impact of different parameter settings. Second, we present a number of complementary results illustrating the usage of our oracle decoder for identifying and analyzing PBTS errors. Our experimental results confirm the main conclusions of (Turchi et al., 2008), showing that extant PBTs have the potential to generate hypotheses having very high BLEU4 score and that their main bottleneck is their scoring function. The rest of this paper is organized as follows: in Section 2, we introduce and formalize the oracle decoding problem, and present a series of ILP problems of increasing complexity designed so as to deliver accurate lowerbounds of oracle score. This section closes with various extensions allowing to model supplementary constraints, most notably reordering constraints (Section 2.5). Our experiments are reported in Section 3, where we first introduce the training and test corpora, along with a description of our system building pipeline (Section 3. 1). We then discuss the baseline oracle BLEU scores (Section 3.2), analyze the non-reachable parts of the reference translations, and comment several complementary results which allow to identify causes of failures. Section 4 discuss our approach and findings with respect to the existing literature on error analysis and oracle decoding. We conclude and discuss further prospects in Section 5. 2 Oracle Decoder 2.1 The Oracle Decoding Problem Definition To get some insights on the errors of phrasebased systems and better understand their limits, we propose to consider the oracle decoding problem defined as follows: given a source sentence, its reference translation5 and a phrase table, what is the “best” translation hypothesis a system can generate? As usual, the quality of an hypothesis is evaluated by the similarity between the reference and the hypothesis. Note that in the oracle decoding problem, we are only assessing the ability of PBT systems to generate good candidate translations, irrespective of their ability to score them properly. We believe that studying this problem is interesting for various reasons. First, as described in Section 3.4, comparing the best hypothesis a system could have generated and the hypothesis it actually generates allows us to carry on both quantitative and qualitative failure analysis. The oracle decoding problem can also be used to assess the expressive power of phrase-based systems (Auli et al., 2009). Other applications include computing acceptable pseudo-references for discriminative training (Tillmann and Zhang, 2006; Liang et al., 2006; Arun and 5The oracle decoding problem can be extended to the case of multiple references. For the sake of simplicity, we only describe the case of a single reference. Koehn, 2007) or combining machine translation systems in a multi-source setting (Li and Khudanpur, 2009). We have also used oracle decoding to identify erroneous or difficult to translate references (Section 3.3). Evaluation Measure To fully define the oracle decoding problem, a measure of the similarity between a translation hypothesis and its reference translation has to be chosen. The most obvious choice is the BLEU-4 score (Papineni et al., 2002) used in most machine translation evaluations. However, using this metric in the oracle decoding problem raises several issues. First, BLEU-4 is a metric defined at the corpus level and is hard to interpret at the sentence level. More importantly, BLEU-4 is not decomposable6: as it relies on 4-grams statistics, the contribution of each phrase pair to the global score depends on the translation of the previous and following phrases and can not be evaluated in isolation. Because of its nondecomposability, maximizing BLEU-4 is hard; in particular, the phrase-level decomposability of the evaluation × metric is necessary in our approach. To circumvent this difficulty, we propose to evaluate the similarity between a translation hypothesis and a reference by the number of their common words. This amounts to evaluating translation quality in terms of unigram precision and recall, which are highly correlated with human judgements (Lavie et al., ). This measure is closely related to the BLEU-1 evaluation metric and the Meteor (Banerjee and Lavie, 2005) metric (when it is evaluated without considering near-matches and the distortion penalty). We also believe that hypotheses that maximize the unigram precision and recall at the sentence level yield corpus level BLEU-4 scores close the maximal achievable. Indeed, in the setting we will introduce in the next section, BLEU-1 and BLEU-4 are highly correlated: as all correct words of the hypothesis will be compelled to be at their correct position, any hypothesis with a high 1-gram precision is also bound to have a high 2-gram precision, etc. 2.2 Formalizing the Oracle Decoding Problem The oracle decoding problem has already been considered in the case of word-based models, in which all translation units are bound to contain only one word. The problem can then be solved by a bipartite graph matching algorithm (Leusch et al., 2008): given a n m binary matarligxo describing possible t 2r0an08sl)a:ti goinv elinn aks n b×emtw beeinna source words and target words7, this algorithm finds the subset of links maximizing the number of words of the reference that have been translated, while ensuring that each word 6Neither at the sentence (Chiang et al., 2008), nor at the phrase level. 7The (i, j) entry of the matrix is 1if the ith word of the source can be translated by the jth word of the reference, 0 otherwise. 935 is translated only once. Generalizing this approach to phrase-based systems amounts to solving the following problem: given a set of possible translation links between potential phrases of the source and of the target, find the subset of links so that the unigram precision and recall are the highest possible. The corresponding oracle hypothesis can then be easily generated by selecting the target phrases that are aligned with one source phrase, disregarding the others. In addition, to mimic the way OOVs are usually handled, we match identical OOV tokens appearing both in the source and target sentences. In this approach, the unigram precision is always one (every word generated in the oracle hypothesis matches exactly one word in the reference). As a consequence, to find the oracle hypothesis, we just have to maximize the recall, that is the number of words appearing both in the hypothesis and in the reference. Considering phrases instead of isolated words has a major impact on the computational complexity: in this new setting, the optimal segmentations in phrases of both the source and of the target have to be worked out in addition to links selection. Moreover, constraints have to be taken into account so as to enforce a proper segmentation of the source and target sentences. These constraints make it impossible to use the approach of (Leusch et al., 2008) and concur in making the oracle decoding problem for phrase-based models more complex than it is for word-based models: it can be proven, using arguments borrowed from (De Nero and Klein, 2008), that this problem is NP-hard even for the simple unigram precision measure. 2.3 An Integer Program for Oracle Decoding To solve the combinatorial problem introduced in the previous section, we propose to cast it into an Integer Linear Programming (ILP) problem, for which many generic solvers exist. ILP has already been used in SMT to find the optimal translation for word-based (Germann et al., 2001) and to study the complexity of learning phrase alignments (De Nero and Klein, 2008) models. Following the latter reference, we introduce the following variables: fi,j (resp. ek,l) is a binary indicator variable that is true when the phrase contains all spans from betweenword position i to j (resp. k to l) of the source (resp. target) sentence. We also introduce a binary variable, denoted ai,j,k,l, to describe a possible link between source phrase fi,j and target phrase ek,l. These variables are built from the entries of the phrase table according to selection strategies introduced in Section 2.4. In the following, index variables are so that: 0 ≤ i< j ≤ n, in the source sentence and 0 ≤ k < l ≤ m, in the target sentence, where n (resp. m) is the length of the source (resp. target) sentence. Solving the oracle decoding problem then amounts to optimizing the following objective function: mi,j,akx,li,Xj,k,lai,j,k,l· (l − k), (1) under the constraints: X ∀x ∈ J1,mK : ek,l ≤ 1 (2) = (3) 1∀,kn,lK : Xai,j,k,l = fk,l (4) ∀i,j : Xai,j,k,l (5) k,l s.tX. Xk≤x≤l ∀∀xy ∈∈ J11,,mnKK : X i,j s.tX. Xi≤y≤j fi,j 1 Xi,j = ei,j Xk,l The objective function (1) corresponds to the number of target words that are generated. The first set of constraints (2) ensures that each word in the reference e ap- pears in no more than one phrase. Maximizing the objective under these constraints amounts to maximizing the unigram recall. The second set of constraints (3) ensures that each word in the source f is translated exactly once, which guarantees that the search space of the ILP problem is the same as the search space of a phrase-based system. Constraints (4) bind the fk,l and ai,j,k,l variables, ensuring that whenever a link ai,j,k,l is active, the corresponding phrase fk,l is also active. Constraints (5) play a similar role for the reference. The Relaxed Problem Even though it accurately models the search space of a phrase-based decoder, this programs is not really useful as is: due to out-ofvocabulary words or missing entries in the phrase table, the constraint that all source words should be translated yields infeasible problems8. We propose to relax this problem and allow some source words to remain untranslated. This is done by replacing constraints (3) by: ∀y ∈ J1,nK : X i,j s.tX. Xi≤y≤j fi,j ≤ 1 To better ref∀lyec ∈t th J1e, bneKh :avior of phrase-based decoders, which attempt to translate all source words, we also need to modify the objective function as follows: X i,Xj,k,l ai,j,k,l · (l − k) +Xfi,j · (j − i) Xi,j (6) The second term in this new objective ensures that optimal solutions translate as many source words as possible. 8An ILP problem is said to be infeasible when tion violates at least one constraint. every possible solu- 936 The Relaxed-Distortion Problem A last caveat with the Relaxed optimization program is caused by frequently occurring source tokens, such as function words or punctuation signs, which can often align with more than one target word. For lack of taking distortion information into account in our objective function, all these alignments are deemed equivalent, even if some of them are clearly more satisfactory than others. This situation is illustrated on Figure 1. le chat et the cat and le the chien dog Figure 1: Equivalent alignments between “le” and “the”. The dashed lines corresponds to a less interpretable solution. To overcome this difficulty, we propose a last change to the objective function: X i,Xj,k,l ai,j,k,l · (l − k) +Xfi,j · (j − i) X ai,j,k,l|k − i| Xi,j −α (7) i Xk ,l X,j, Compared to the objective function of the relaxed problem (6), we introduce here a supplementary penalty factor which favors monotonous alignments. For each phrase pair, the higher the difference between source and target positions, the higher this penalty. If α is small enough, this extra term allows us to select, among all the optimal alignments of the re l axed problem, the one with the lowest distortion. In our experiments, we set α to min {n, m} to ensure that the penalty factor is always smminall{enr, ,tmha}n tthoe e rneswuarred t fhoart aligning atwltyo single iwso ardlwsa. 2.4 Selecting Indicator Variables In the approach introduced in the previous sections, the oracle decoding problem is solved by selecting, among a set of possible translation links, the ones that yield the solution with the highest unigram recall. We propose two strategies to build this set of possible translation links. In the first one, denoted exact match, an indicator ai,j,k,l is created if there is an entry (f, e) so that f spans from word position ito j in the source and e from word position k to l in the target. In this strategy, the ILP program considers exactly the same ruleset as conventional phrase-based decoders. We also consider an alternative strategy, which could help us to identify errors made during the phrase extraction process. In this strategy, denoted inside match, an indicator ai,j,k,l is created when the following three criteria are met: i) f spans from position ito j of the source; ii) a substring of e, denoted e, spans from position k to l of the reference; iii) (f, e¯) is not an entry of the phrase table. The resulting set of indicator variables thus contains, at least, all the variables used in the exact match strategy. In addition, we license here the use of phrases containing words that do not occur in the reference. In fact, using such solutions can yield higher BLEU scores when the reward for additional correct matches exceeds the cost incurred by wrong predictions. These cases are symptoms of situations where the extraction heuristic failed to extract potentially useful subphrases. 2.5 Oracle Decoding with Reordering Constraints The ILP problem introduced in the previous section can be extended in several ways to describe and test various improvements to phrase-based systems or to evaluate the impact of different parameter settings. This flexibility mainly stems from the possibility offered by our framework to express arbitrary constraints over variables. In this section, we illustrate these possibilities by describing how reordering constraints can easily be considered. As a first example, the Moses decoder uses a distortion limit to constrain the set of possible reorderings. This constraint “enforces (...) that the last word of a phrase chosen for translation cannot be more than d9 words from the leftmost untranslated word in the source” (Lopez, 2009) and is expressed as: ∀aijkl , ai0j0k0l0 s.t. k > k0, aijkl · ai0j0k0l0 · |j − i0 + 1| ≤ d, The maximum distortion limit strategy (Lopez, 2009) is also easily expressed and take the following form (assuming this constraint is parameterized by d): ∀l < m − 1, ai,j,k,l·ai0,j0,l+1,l0 · |i0 − j − 1| 71is%t e6hs.a distortion greater that Moses default distortion limit. alignment decisions enabled by the use of larger training corpora and phrase table. To evaluate the impact ofthe second heuristic, we computed the number of phrases discarded by Moses (be- cause of the default ttl limit) but used in the oracle hypotheses. In the English to French NEWSCO setting, they account for 34.11% of the total number of phrases used in the oracle hypotheses. When the oracle decoder is constrained to use the same phrase table as Moses, its BLEU-4 score drops to 42.78. This shows that filtering the phrase table prior to decoding discards many useful phrase pairs and is seriously limiting the best achievable performance, a conclusion shared with (Auli et al., 2009). Search Errors Search errors can be identified by comparing the score of the best hypothesis found by Moses and the score of the oracle hypothesis. If the score of the oracle hypothesis is higher, then there has been a search error; on the contrary, there has been an estimation error when the score of the oracle hypothesis is lower than the score of the best hypothesis found by Moses. 940 Based on the comparison of the score of Moses hypotheses and of oracle hypotheses for the English to French NEWSCO setting, our preliminary conclusion is that the number of search errors is quite limited: only about 5% of the hypotheses of our oracle decoder are actually getting a better score than Moses solutions. Again, this shows that the scoring function (model error) is one of the main bottleneck of current PBTS. Comparing these hypotheses is nonetheless quite revealing: while Moses mostly selects phrase pairs with high translation scores and generates monotonous alignments, our ILP decoder uses larger reorderings and less probable phrases to achieve better solutions: on average, the reordering score of oracle solutions is −5.74, compared to −76.78 fscoro rMeo osfe osr outputs. iGonivsen is −the5 weight assigned through MERT training to the distortion score, no wonder that these hypotheses are severely penalized. The Impact of Phrase Length The observed outputs do not only depend on decisions made during the search, but also on decisions made during training. One such decision is the specification of maximal length for the source and target phrases. In our framework, evaluating the impact of this decision is simple: it suffices to change the definition of indicator variables so as to consider only alignments between phrases of a given length. In the English-French NEWSCO setting, the most restrictive choice, when only alignments between single words are authorized, yields an oracle BLEU-4 of 48.68; however, authorizing phrases up to length 2 allows to achieve an oracle value of 66.57, very close to the score achieved when considering all extracted phrases (67.77). This is corroborated with a further analysis of our oracle alignments, which use phrases whose average source length is 1.21 words (respectively 1.31 for target words). If many studies have already acknowledged the predomi- nance of “small” phrases in actual translations, our oracle scores suggest that, for this language pair, increasing the phrase length limit beyond 2 or 3 might be a waste of computational resources. 4 Related Work To the best of our knowledge, there are only a few works that try to study the expressive power ofphrase-based machine translation systems or to provide tools for analyzing potential causes of failure. The approach described in (Auli et al., 2009) is very similar to ours: in this study, the authors propose to find and analyze the limits of machine translation systems by studying the reference reachability. A reference is reachable for a given system if it can be exactly generated by this system. Reference reachability is assessed using Moses in forced decoding mode: during search, all hypotheses that deviate from the reference are simply discarded. Even though the main goal of this study was to compare the search space of phrase-based and hierarchical systems, it also provides some insights on the impact of various search parameters in Moses, delivering conclusions that are consistent with our main results. As described in Section 1.2, these authors also propose a typology of the errors of a statistical translation systems, but do not attempt to provide methods for identifying them. The authors of (Turchi et al., 2008) study the learn- ing capabilities of Moses by extensively analyzing learning curves representing the translation performances as a function of the number of examples, and by corrupting the model parameters. Even though their focus is more on assessing the scoring function, they reach conclusions similar to ours: the current bottleneck of translation performances is not the representation power of the PBTS but rather in their scoring functions. Oracle decoding is useful to compute reachable pseudo-references in the context of discriminative training. This is the main motivation of (Tillmann and Zhang, 2006), where the authors compute high BLEU hypotheses by running a conventional decoder so as to maximize a per-sentence approximation of BLEU-4, under a simple (local) reordering model. Oracle decoding has also been used to assess the limitations induced by various reordering constraints in (Dreyer et al., 2007). To this end, the authors propose to use a beam-search based oracle decoder, which computes lower bounds of the best achievable BLEU-4 using dynamic programming techniques over finite-state (for so-called local and IBM constraints) or hierarchically structured (for ITG constraints) sets of hypotheses. Even 941 though the numbers reported in this study are not directly comparable with ours17, it seems that our decoder is not only conceptually much simpler, but also achieves much more optimistic lower-bounds of the oracle BLEU score. The approach described in (Li and Khudanpur, 2009) employs a similar technique, which is to guide a heuristic search in an hypergraph representing possible translation hypotheses with n-gram counts matches, which amounts to decoding with a n-gram model trained on the sole reference translation. Additional tricks are presented in this article to speed-up decoding. Computing oracle BLEU scores is also the subject of (Zens and Ney, 2005; Leusch et al., 2008), yet with a different emphasis. These studies are concerned with finding the best hypotheses in a word graph or in a consensus network, a problem that has various implications for multi-pass decoding and/or system combination techniques. The former reference describes an exponential approximate algorithm, while the latter proves the NPcompleteness of this problem and discuss various heuristic approaches. Our problem is somewhat more complex and using their techniques would require us to built word graphs containing all the translations induced by arbitrary segmentations and permutations of the source sentence. 5 Conclusions In this paper, we have presented a methodology for analyzing the errors of PBTS, based on the computation of an approximation of the BLEU-4 oracle score. We have shown that this approximation could be computed fairly accurately and efficiently using Integer Linear Programming techniques. Our main result is a confirmation of the fact that extant PBTS systems are expressive enough to achieve very high translation performance with respect to conventional quality measurements. The main efforts should therefore strive to improve on the way phrases and hypotheses are scored during training. This gives further support to attempts aimed at designing context-dependent scoring functions as in (Stroppa et al., 2007; Gimpel and Smith, 2008), or at attempts to perform discriminative training of feature-rich models. (Bangalore et al., 2007). We have shown that the examination of difficult-totranslate sentences was an effective way to detect errors or inconsistencies in the reference translations, making our approach a potential aid for controlling the quality or assessing the difficulty of test data. Our experiments have also highlighted the impact of various parameters. Various extensions of the baseline ILP program have been suggested and/or evaluated. In particular, the ILP formalism lends itself well to expressing various constraints that are typically used in conventional PBTS. In 17The best BLEU-4 oracle they achieve on Europarl German to English is approximately 48; but they considered a smaller version of the training corpus and the WMT’06 test set. our future work, we aim at using this ILP framework to systematically assess various search configurations. We plan to explore how replacing non-reachable references with high-score pseudo-references can improve discrim- inative training of PBTS. We are also concerned by determining how tight is our approximation of the BLEU4 score is: to this end, we intend to compute the best BLEU-4 score within the n-best solutions of the oracle decoding problem. Acknowledgments Warm thanks to Houda Bouamor for helping us with the annotation tool. This work has been partly financed by OSEO, the French State Agency for Innovation, under the Quaero program. References Tobias Achterberg. 2007. Constraint Integer Programming. Ph.D. thesis, Technische Universit a¨t Berlin. http : / / opus .kobv .de /tuberl in/vol ltexte / 2 0 0 7 / 16 11/ . Abhishek Arun and Philipp Koehn. 2007. Online learning methods for discriminative training of phrase based statistical machine translation. In Proc. of MT Summit XI, Copenhagen, Denmark. Michael Auli, Adam Lopez, Hieu Hoang, and Philipp Koehn. 2009. A systematic analysis of translation model search spaces. In Proc. of WMT, pages 224–232, Athens, Greece. Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proc. of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Srinivas Bangalore, Patrick Haffner, and Stephan Kanthak. 2007. Statistical machine translation through global lexical selection and sentence reconstruction. In Proc. of ACL, pages 152–159, Prague, Czech Republic. L e´on Bottou and Olivier Bousquet. 2008. The tradeoffs oflarge scale learning. In Proc. of NIPS, pages 161–168, Vancouver, B.C., Canada. Chris Callison-Burch, Philipp Koehn, Christof Monz, and Josh Schroeder. 2009. Findings of the 2009 Workshop on Statistical Machine Translation. In Proc. of WMT, pages 1–28, Athens, Greece. David Chiang, Steve DeNeefe, Yee Seng Chan, and Hwee Tou Ng. 2008. Decomposability of translation metrics for improved evaluation and efficient algorithms. In Proc. of ECML, pages 610–619, Honolulu, Hawaii. John De Nero and Dan Klein. 2008. The complexity of phrase alignment problems. In Proc. of ACL: HLT, Short Papers, pages 25–28, Columbus, Ohio. Markus Dreyer, Keith B. Hall, and Sanjeev P. Khudanpur. 2007. Comparing reordering constraints for smt using efficient bleu oracle computation. In NAACL-HLT/AMTA Workshop on Syntax and Structure in Statistical Translation, pages 103– 110, Rochester, New York. 942 Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. 2001 . Fast decoding and optimal decoding for machine translation. In Proc. of ACL, pages 228–235, Toulouse, France. Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. 2004. Fast and optimal decoding for machine translation. Artificial Intelligence, 154(1-2): 127– 143. Ulrich Germann. 2003. Greedy decoding for statistical machine translation in almost linear time. In Proc. of NAACL, pages 1–8, Edmonton, Canada. Kevin Gimpel and Noah A. Smith. 2008. Rich source-side context for statistical machine translation. In Proc. of WMT, pages 9–17, Columbus, Ohio. Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proc. of NAACL, pages 48–54, Edmonton, Canada. Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris CallisonBurch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proc. of ACL, demonstration session. Philipp Koehn. 2004. Pharaoh: A beam search decoder for phrase-based statistical machine translation models. In Proc. of AMTA, pages 115–124, Washington DC. Shankar Kumar and William Byrne. 2005. Local phrase reordering models for statistical machine translation. In Proc. of HLT, pages 161–168, Vancouver, Canada. Alon Lavie, Kenji Sagae, and Shyamsundar Jayaraman. The significance of recall in automatic metrics for MT evaluation. In In Proc. of AMTA, pages 134–143, Washington DC. Gregor Leusch, Evgeny Matusov, and Hermann Ney. 2008. Complexity of finding the BLEU-optimal hypothesis in a confusion network. In Proc. of EMNLP, pages 839–847, Honolulu, Hawaii. Zhifei Li and Sanjeev Khudanpur. 2009. Efficient extraction of oracle-best translations from hypergraphs. In Proc. of NAACL, pages 9–12, Boulder, Colorado. Percy Liang, Alexandre Bouchard-C oˆt´ e, Dan Klein, and Ben Taskar. 2006. An end-to-end discriminative approach to machine translation. In Proc. of ACL, pages 761–768, Sydney, Australia. Adam Lopez. 2009. Translation as weighted deduction. In Proc. of EACL, pages 532–540, Athens, Greece. Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Comput. Linguist. , 29(1): 19–5 1. Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proc. of ACL, pages 160–167, Sapporo, Japan. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-jing Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. Technical report, Philadelphia, Pennsylvania. D. Roth and W. Yih. 2005. Integer linear programming inference for conditional random fields. In Proc. of ICML, pages 737–744, Bonn, Germany. Nicolas Stroppa, Antal van den Bosch, and Andy Way. 2007. Exploiting source similarity for smt using context-informed features. In Andy Way and Barbara Proc. of TMI, pages Christoph Tillmann 231–240, Sk¨ ovde, and Tong Zhang. Gawronska, editors, Sweden. 2006. A discriminative global training algorithm for statistical mt. In Proc. of ACL, 721–728, Sydney, Australia. Turchi, Tijl De Bie, and Nello pages Marco Cristianini. 2008. Learn- ing performance of a machine translation system: a statistical and computational analysis. In Proc. of WMT, pages Columbus, Ohio. 35–43, Richard Zens and Hermann Ney. 2005. Word graphs for statistical machine translation. In Proc. of the ACL Workshop on Building and Using Parallel Texts, pages 191–198, Ann Arbor, Michigan. 943
6 0.199265 84 emnlp-2010-NLP on Spoken Documents Without ASR
7 0.19451168 56 emnlp-2010-Hashing-Based Approaches to Spelling Correction of Personal Names
8 0.18826437 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL
9 0.18083766 66 emnlp-2010-Inducing Word Senses to Improve Web Search Result Clustering
10 0.1716297 80 emnlp-2010-Modeling Organization in Student Essays
11 0.16882002 109 emnlp-2010-Translingual Document Representations from Discriminative Projections
12 0.15798198 1 emnlp-2010-"Poetic" Statistical Machine Translation: Rhyme and Meter
13 0.15637766 22 emnlp-2010-Automatic Evaluation of Translation Quality for Distant Language Pairs
14 0.15604952 73 emnlp-2010-Learning Recurrent Event Queries for Web Search
15 0.15379672 23 emnlp-2010-Automatic Keyphrase Extraction via Topic Decomposition
16 0.15281956 87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space
17 0.14899606 61 emnlp-2010-Improving Gender Classification of Blog Authors
18 0.14741273 100 emnlp-2010-Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective
19 0.14520571 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media
20 0.14357395 83 emnlp-2010-Multi-Level Structured Models for Document-Level Sentiment Classification
topicId topicWeight
[(3, 0.029), (12, 0.04), (24, 0.288), (29, 0.095), (30, 0.01), (52, 0.038), (56, 0.15), (62, 0.03), (66, 0.084), (72, 0.057), (76, 0.027), (79, 0.011), (83, 0.013), (87, 0.021), (89, 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.70819819 82 emnlp-2010-Multi-Document Summarization Using A* Search and Discriminative Learning
Author: Ahmet Aker ; Trevor Cohn ; Robert Gaizauskas
Abstract: In this paper we address two key challenges for extractive multi-document summarization: the search problem of finding the best scoring summary and the training problem of learning the best model parameters. We propose an A* search algorithm to find the best extractive summary up to a given length, which is both optimal and efficient to run. Further, we propose a discriminative training algorithm which directly maximises the quality ofthe best summary, rather than assuming a sentence-level decomposition as in earlier work. Our approach leads to significantly better results than earlier techniques across a number of evaluation metrics.
2 0.56552219 102 emnlp-2010-Summarizing Contrastive Viewpoints in Opinionated Text
Author: Michael Paul ; ChengXiang Zhai ; Roxana Girju
Abstract: This paper presents a two-stage approach to summarizing multiple contrastive viewpoints in opinionated text. In the first stage, we use an unsupervised probabilistic approach to model and extract multiple viewpoints in text. We experiment with a variety of lexical and syntactic features, yielding significant performance gains over bag-of-words feature sets. In the second stage, we introduce Comparative LexRank, a novel random walk formulation to score sentences and pairs of sentences from opposite viewpoints based on both their representativeness of the collection as well as their contrastiveness with each other. Exper- imental results show that the proposed approach can generate informative summaries of viewpoints in opinionated text.
3 0.55844992 64 emnlp-2010-Incorporating Content Structure into Text Analysis Applications
Author: Christina Sauper ; Aria Haghighi ; Regina Barzilay
Abstract: In this paper, we investigate how modeling content structure can benefit text analysis applications such as extractive summarization and sentiment analysis. This follows the linguistic intuition that rich contextual information should be useful in these tasks. We present a framework which combines a supervised text analysis application with the induction of latent content structure. Both of these elements are learned jointly using the EM algorithm. The induced content structure is learned from a large unannotated corpus and biased by the underlying text analysis task. We demonstrate that exploiting content structure yields significant improvements over approaches that rely only on local context.1
4 0.55556709 1 emnlp-2010-"Poetic" Statistical Machine Translation: Rhyme and Meter
Author: Dmitriy Genzel ; Jakob Uszkoreit ; Franz Och
Abstract: As a prerequisite to translation of poetry, we implement the ability to produce translations with meter and rhyme for phrase-based MT, examine whether the hypothesis space of such a system is flexible enough to accomodate such constraints, and investigate the impact of such constraints on translation quality.
5 0.54969013 83 emnlp-2010-Multi-Level Structured Models for Document-Level Sentiment Classification
Author: Ainur Yessenalina ; Yisong Yue ; Claire Cardie
Abstract: In this paper, we investigate structured models for document-level sentiment classification. When predicting the sentiment of a subjective document (e.g., as positive or negative), it is well known that not all sentences are equally discriminative or informative. But identifying the useful sentences automatically is itself a difficult learning problem. This paper proposes a joint two-level approach for document-level sentiment classification that simultaneously extracts useful (i.e., subjec- tive) sentences and predicts document-level sentiment based on the extracted sentences. Unlike previous joint learning methods for the task, our approach (1) does not rely on gold standard sentence-level subjectivity annotations (which may be expensive to obtain), and (2) optimizes directly for document-level performance. Empirical evaluations on movie reviews and U.S. Congressional floor debates show improved performance over previous approaches.
6 0.54135573 105 emnlp-2010-Title Generation with Quasi-Synchronous Grammar
7 0.5298242 107 emnlp-2010-Towards Conversation Entailment: An Empirical Investigation
8 0.52110636 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks
9 0.52041805 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation
10 0.51683205 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions
11 0.5143894 49 emnlp-2010-Extracting Opinion Targets in a Single and Cross-Domain Setting with Conditional Random Fields
12 0.5108611 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding
13 0.50975931 80 emnlp-2010-Modeling Organization in Student Essays
14 0.50973022 65 emnlp-2010-Inducing Probabilistic CCG Grammars from Logical Form with Higher-Order Unification
15 0.50463951 25 emnlp-2010-Better Punctuation Prediction with Dynamic Conditional Random Fields
16 0.50363433 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation
17 0.50120711 103 emnlp-2010-Tense Sense Disambiguation: A New Syntactic Polysemy Task
18 0.49678418 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice
19 0.49553388 86 emnlp-2010-Non-Isomorphic Forest Pair Translation
20 0.49334928 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions