emnlp emnlp2011 emnlp2011-138 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mark Hopkins ; Jonathan May
Abstract: We offer a simple, effective, and scalable method for statistical machine translation parameter tuning based on the pairwise approach to ranking (Herbrich et al., 1999). Unlike the popular MERT algorithm (Och, 2003), our pairwise ranking optimization (PRO) method is not limited to a handful of parameters and can easily handle systems with thousands of features. Moreover, unlike recent approaches built upon the MIRA algorithm of Crammer and Singer (2003) (Watanabe et al., 2007; Chiang et al., 2008b), PRO is easy to implement. It uses off-the-shelf linear binary classifier software and can be built on top of an existing MERT framework in a matter of hours. We establish PRO’s scalability and effectiveness by comparing it to MERT and MIRA and demonstrate parity on both phrase-based and syntax-based systems in a variety of language pairs, using large scale data scenarios.
Reference: text
sentIndex sentText sentNum sentScore
1 Tuning as Ranking Mark Hopkins and Jonathan May SDL Language Weaver Los Angeles, CA 90045 {mhopkins , Abstract We offer a simple, effective, and scalable method for statistical machine translation parameter tuning based on the pairwise approach to ranking (Herbrich et al. [sent-1, score-0.444]
2 Unlike the popular MERT algorithm (Och, 2003), our pairwise ranking optimization (PRO) method is not limited to a handful of parameters and can easily handle systems with thousands of features. [sent-3, score-0.444]
3 We establish PRO’s scalability and effectiveness by comparing it to MERT and MIRA and demonstrate parity on both phrase-based and syntax-based systems in a variety of language pairs, using large scale data scenarios. [sent-8, score-0.149]
4 1 Introduction The MERT algorithm (Och, 2003) is currently the most popular way to tune the parameters of a statistical machine translation (MT) system. [sent-9, score-0.074]
5 MERT is well-understood, easy to implement, and runs quickly, but can behave erratically and does not scale beyond a handful of features. [sent-10, score-0.041]
6 This lack of scalability is a significant weakness, as it inhibits systems from using more than a couple dozen features to discriminate between candidate translations and stymies feature development innovation. [sent-11, score-0.551]
7 Watanabe et al. (2007) and Chiang et al. (2008b) have developed tuning methods using the MIRA algorithm (Crammer and Singer, 2003) as a nucleus. [sent-16, score-0.203]
8 This MIRA-based approach has been shown to perform well on large-scale tasks with hundreds or thousands of features (2009). [sent-18, score-0.043]
9 , 2010), only one paper describing a statistical MT system cited the use of MIRA for tuning (Chiang, 2010), while 15 used MERT. [sent-21, score-0.203]
10 Here we propose a simpler approach to tuning that scales similarly to high-dimensional feature spaces. [sent-22, score-0.245]
11 , 2009), where the explicit goal is to learn to correctly rank candidate translations. [sent-24, score-0.214]
12 Specifically, we follow the pairwise approach to ranking (Herbrich et al. [sent-25, score-0.204]
13 , 2007), in which the ranking problem is reduced to the binary classification task of deciding between candidate translation pairs. [sent-29, score-0.35]
14 Of primary concern to us is the ease of adoption of our proposed technique. [sent-30, score-0.027]
15 Because of this, we adhere as closely as possible to the established MERT architecture and use freely available machine learning software. [sent-31, score-0.075]
16 The end result is a technique that scales and performs just as well as MIRA-based tuning, but which can be implemented in a couple of hours by anyone with an existing MERT implementation. [sent-32, score-0.087]
17 Mindful that many would-be enhancements to the... [Footnote 1: The remainder either did not specify their tuning method (though a number of these used the Moses toolkit (Koehn et al.] [sent-33, score-0.203]
18 It is an example of a finite candidate space, defined as a candidate space for which I is finite and J maps each index of I to a finite set. [sent-40, score-0.851]
19 A policy of candidate space ⟨∆, I, J, f, e, x⟩ is a function that maps each member i ∈ I to a member of J(i). [sent-41, score-0.636]
20 A policy corresponds to a choice of one candidate translation for each source sentence. [sent-42, score-0.555]
21 For the example in Figure 1, policy p1 = {1 → 2, 2 → 3} corresponds to the choice of “he does not go” for the first source sentence and “I do not go” for the second source sentence. [sent-43, score-0.367]
22 Policy p2 = {1 → 3, 2 → 1} corresponds to the inferior translations “she not go” and “I go not.” [sent-45, score-0.045]
23 We assume the MT system distinguishes between policies using a scoring function for candidate translations of the form hw(i, j) = w · x(i, j), where w is a weight vector of the same dimension as feature vector x(i, j). [sent-46, score-1.351]
24 This scoring function extends to a policy p by summing the cost of each of the policy’s candidate translations: Hw(p) = Σi∈I hw(i, p(i)). [sent-47, score-1.125]
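To make the candidate space, policy, and scoring definitions above concrete, here is a minimal Python sketch. The feature values, the gold ordering, and the policy indices are illustrative inventions (only w = [−2, 1] is taken from the Figure 1 caption); the paper defines these objects abstractly.

# Minimal sketch of a finite candidate space, a policy, and the scores hw and Hw.
# All feature values and policies below are illustrative; only w = [-2, 1]
# comes from the Figure 1 caption.

# For each source sentence index i: a list of feature vectors x(i, j),
# one per candidate translation j (0-based here, 1-based in the text).
candidate_space = {
    1: [[0.0, 1.0], [-4.0, 4.0], [4.0, -2.0]],
    2: [[1.0, 0.0], [3.0, -1.0], [-2.0, 2.0]],
}

def h_w(w, x):
    """Local score hw(i, j) = w . x(i, j)."""
    return sum(wk * xk for wk, xk in zip(w, x))

def H_w(w, policy, space):
    """Policy score Hw(p) = sum over i of hw(i, p(i))."""
    return sum(h_w(w, space[i][j]) for i, j in policy.items())

w = [-2.0, 1.0]          # example weight vector from the Figure 1 caption
p1 = {1: 1, 2: 2}        # a policy: one candidate index per source sentence
print(H_w(w, p1, candidate_space))   # total score of policy p1 under w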
25 The goal of tuning is to learn a weight vector w such that Hw(p) assigns a high score to good policies, and a low score to bad policies. [sent-50, score-0.156]
26 To do so, we need information about which policies are good and which are bad. [sent-51, score-0.095]
27 This information is provided by a “gold” scoring function G that maps each policy to a real-valued score. [sent-52, score-0.452]
28 Typically this gold function is BLEU (Papineni et al. [sent-53, score-0.146]
29 We want to find a weight vector w such that Hw behaves “similarly” to G on a candidate space s. [sent-58, score-0.446]
30 We assume a loss function ls (Hw, G) which returns the real-valued loss of using scoring function Hw when the gold scoring function is G and the candidate space is s. [sent-59, score-0.904]
31 Thus, we may say the goal of tuning is to find the weight vector w that minimizes loss. [sent-60, score-0.412]
32 3 MERT In general, the candidate space may have infinitely many source sentences, as well as infinitely many candidate translations per source sentence. [sent-61, score-0.842]
33 In practice, tuning optimizes over a finite subset of source sentences and a finite subset of candidate translations as well. [sent-62, score-0.796]
34 The classic tuning architecture used in the dominant MERT approach (Och, 2003) forms the translation subset and learns weight vector w via... [Footnote 2: Without loss of generality, we assume that a higher score indicates a better translation.] [sent-63, score-0.561]
35 [Footnote fragment: ...2 for the tune sets used in this paper’s experiments.] [sent-65, score-0.037]
36 [Figure 1: an example candidate space of dimensionality 2, with a local scoring function hw(i, j) (where w = [−2, 1]) and a local gold scoring function g(i, j); J(1) = J(2) = {1, 2, 3}.] [sent-66, score-0.158] [sent-67, score-0.202]
38 During candidate generation, candidate translations are selected from a base candidate space s and added to a finite candidate space s′ called the candidate pool. [sent-74, score-1.356]
39 During optimization, the weight vector w is optimized to minimize loss ls′(Hw, G). [sent-75, score-0.291]
40 For its candidate generation phase, MERT generates the k-best candidate translations for each source sentence according to hw, where w is the weight vector from the previous optimization phase (or an arbitrary weight vector for the first iteration). [sent-76, score-1.195]
41 Typically the optimization phase is implemented using Och’s line optimization algorithm (2003). [sent-78, score-0.379]
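A schematic of this alternating MERT architecture, as a sketch: decode_kbest and optimize_weights are hypothetical callables standing in for the real decoder’s k-best extraction and for Och’s line optimization; nothing here is the paper’s actual implementation.

# Schematic of the alternating MERT architecture described above: k-best
# candidate generation grows the candidate pool s', then an optimization
# phase re-estimates w on the pool. decode_kbest and optimize_weights are
# hypothetical callables standing in for the decoder and for Och's line search.

def mert_tune(source_sentences, w_init, decode_kbest, optimize_weights,
              iterations=10, k=100):
    pool = {i: [] for i in range(len(source_sentences))}   # candidate pool s'
    w = w_init
    for _ in range(iterations):
        # Candidate generation: k-best translations per sentence under the current w.
        for i, sent in enumerate(source_sentences):
            for cand in decode_kbest(sent, w, k):
                if cand not in pool[i]:
                    pool[i].append(cand)
        # Optimization: choose the w that minimizes loss on the accumulated pool.
        w = optimize_weights(pool, w)
    return w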
42 MERT has proven itself effective at tuning candidate spaces with low dimensionality. [sent-79, score-0.447]
43 To test this claim, we devised the following synthetic data experiment: 1. [sent-81, score-0.182]
44 We created a gold scoring function G that is also a linear function of the same form as Hw, i. [sent-82, score-0.296]
45 Under this assumption, the role of the optimization phase reduces to learning back the gold weight vector w∗. [sent-85, score-0.478]
46 We generated a ∆-dimensionality candidate pool with 500 source “sentences” and 100 candidate “translations” per sentence. [sent-87, score-0.547]
47 We created the corresponding feature vectors by drawing ∆ random real numbers uniformly from the interval [0, 500] . [sent-88, score-0.071]
48 We ran MERT’s line optimization on this synthetic candidate pool and compared the learned weight vector w to the gold weight vector w∗ using cosine similarity. [sent-90, score-0.986]
49 We used line optimization in the standard way, by generating 20 random starting weight vectors and hill-climbing on each independently until no further progress is made, then choosing the final weight vector that minimizes loss. [sent-91, score-0.525]
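One way to reproduce this synthetic check, as a sketch: generate the random candidate pool and gold scores from a hidden w∗, run the optimizer of interest, and compare directions by cosine similarity. The uniform(-1, 1) draw for w∗ and the commented-out optimizer hook are illustrative assumptions.

# Sketch of the synthetic learnability experiment: can an optimizer recover
# a hidden gold weight vector w* from a random candidate pool?
import math
import random

def make_pool(dim, n_sents=500, n_cands=100, lo=0.0, hi=500.0):
    """Feature vectors drawn uniformly from [lo, hi], as in the setup above."""
    return [[[random.uniform(lo, hi) for _ in range(dim)]
             for _ in range(n_cands)]
            for _ in range(n_sents)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

dim = 10
pool = make_pool(dim)
w_star = [random.uniform(-1.0, 1.0) for _ in range(dim)]      # hidden gold weights (illustrative draw)
gold = [[dot(w_star, x) for x in cands] for cands in pool]    # g(i, j) = w* . x(i, j)

# learned_w = run_optimizer(pool, gold)    # hypothetical hook: MERT line search or PRO goes here
# print(cosine(learned_w, w_star))         # 1.0 means the gold direction was fully recovered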
50 The results in Figure 3 indicate that as the dimensionality of the problem increases MERT rapidly loses the ability to learn w∗. [sent-94, score-0.091]
51 Note that this synthetic problem is considerably easier than a real MT scenario, where the data is noisy and interdependent, and the gold scoring function is nonlinear. [sent-95, score-0.401]
52 If MERT cannot scale in this simple scenario, it has little hope of succeeding in a high-dimensionality deployment scenario. [sent-96, score-0.054]
53 4 Optimization via Pairwise Ranking We would like to modify MERT so that it scales well to high-dimensionality candidate spaces. [sent-97, score-0.256]
54 The most prominent example of a tuning method that performs well on high-dimensionality candidate spaces is the MIRA-based approach used by Watanabe et al. [sent-98, score-0.447]
55 Unfortunately, this approach requires a complex architecture that diverges significantly from the MERT approach, and consequently has not been widely adopted. [sent-101, score-0.075]
56 With MERT as a starting point, we have a choice: modify candidate generation, optimization, or both. [sent-103, score-0.214]
57 Although alternative candidate generation methods have been proposed (Macherey et al. [sent-104, score-0.264]
58 , 2008b; Chatterjee and Cancedda, 2010), we will restrict ourselves to MERT-style candidate generation, in order to minimize divergence from the established MERT tuning architecture. [sent-106, score-0.472]
59 4.1 Basic Approach While intuitive, the MERT optimization module focuses attention on Hw’s best policy, and not on its overall prowess at ranking policies. [sent-109, score-0.296]
60 We will create an optimization module that directly addresses Hw’s ability to rank policies in the hope that this more holistic approach will generalize better to unseen data. [sent-110, score-0.319]
61 Assume that the gold scoring function G decomposes in the following way: G(p) = Σi∈I g(i, p(i)) (1) where g(i, j) is a local scoring function that scores the single candidate translation e(i, j). [sent-111, score-0.65]
62 For an arbitrary pair of candidate translations e(i, j) and e(i, j′), the local gold function g tells us which is the better translation. [sent-113, score-0.476]
63 Note that this induces a ranking on the candidate translations for each source sentence. [sent-114, score-0.495]
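The decomposition above only requires some per-sentence gold scorer g(i, j). As one illustrative possibility (not necessarily the scorer used in the paper), a smoothed sentence-level BLEU-style score can play this role:

# Sketch of one possible local gold scorer g(i, j): a sentence-level,
# add-one-smoothed BLEU-style score. This is an illustrative stand-in; the
# text only requires that G decompose into per-sentence scores.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[k:k + n]) for k in range(len(tokens) - n + 1))

def sentence_gold(hyp, ref, max_n=4):
    hyp, ref = hyp.split(), ref.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        matched = sum(min(count, r[g]) for g, count in h.items())
        total = sum(h.values())
        log_prec += math.log((matched + 1.0) / (total + 1.0))  # add-one smoothing
    brevity = min(0.0, 1.0 - len(ref) / max(len(hyp), 1))      # brevity penalty, in log space
    return math.exp(brevity + log_prec / max_n)

print(sentence_gold("he does not go", "he does not go"))       # 1.0 for an exact match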
64 We follow the pairwise approach to ranking (Herbrich et al. [sent-115, score-0.204]
65 In the pairwise approach, the learning task is framed as the classification of candidate pairs into two categories: correctly ordered and incorrectly ordered. [sent-120, score-0.349]
66 Specifically, for candidate translation pair e(i, j) and e(i, j′), we want: g(i, j) > g(i, j′) ⇔ hw(i, j) > hw(i, j′). [sent-121, score-0.251]
67 We can re-express this condition: g(i, j) > g(i, j′) ⇔ hw(i, j) > hw(i, j′) ⇔ hw(i, j) − hw(i, j′) > 0 ⇔ w · x(i, j) − w · x(i, j′) > 0 ⇔ w · (x(i, j) − x(i, j′)) > 0. Thus optimization reduces to a classic binary classification problem. [sent-122, score-0.746]
68 We create a labeled training instance for this problem by computing difference vector x(i, j) − x(i, j′), and labeling it as a positive or negative instance based on whether, respectively, the first or second vector is superior according to gold function g. [sent-123, score-0.28]
69 To ensure balance, we consider both possible difference vectors from a pair. [sent-124, score-0.071]
70 For example, given the candidate space of Figure 1, since g(1, 1) > g(1, 3), we would add ([−4, 3], +) and ([4, −3], −) to our training set. [sent-125, score-0.26]
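A sketch of building the balanced difference-vector training set. The feature vectors and gold scores below are illustrative (chosen so that x(1, 1) − x(1, 3) = [−4, 3], matching the worked example above); pair_instances is a hypothetical helper name.

# Sketch: turn ranked candidate pairs into binary classification instances.
# For each pair (j, j'), the difference vector x(i,j) - x(i,j') is labeled
# +1 if the first candidate has the higher gold score, -1 otherwise; both
# orderings of every pair are added to keep the data balanced.

def pair_instances(x_i, g_i):
    """x_i: feature vectors of sentence i's candidates; g_i: their gold scores."""
    data = []
    for j in range(len(x_i)):
        for j2 in range(j + 1, len(x_i)):
            if g_i[j] == g_i[j2]:
                continue                                   # no preference, no instance
            diff = [a - b for a, b in zip(x_i[j], x_i[j2])]
            label = 1 if g_i[j] > g_i[j2] else -1
            data.append((diff, label))
            data.append(([-d for d in diff], -label))      # mirrored instance
    return data

x_1 = [[0.0, 1.0], [-4.0, 4.0], [4.0, -2.0]]   # illustrative feature vectors for sentence 1
g_1 = [2.0, 3.0, 1.0]                          # illustrative gold scores: g(1,1) > g(1,3)
print(pair_instances(x_1, g_1)[2:4])           # -> ([-4.0, 3.0], 1) and ([4.0, -3.0], -1)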
71 We can then feed this training data directly to any off-the-shelf classification tool that returns a linear classifier, in order to obtain a weight vector w that optimizes the above condition. [sent-126, score-0.027] [sent-127, score-0.197]
73 This weight vector can then be used directly by the MT system in the subsequent candidate generation phase. [sent-128, score-0.42]
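Any off-the-shelf tool that returns a linear classifier can be plugged in at this point. As one possible choice (an assumption on our part; the text does not name a specific package), scikit-learn’s LogisticRegression can be trained on the difference vectors and its coefficient vector read off as the new w:

# Sketch: learn w from the difference-vector instances with an off-the-shelf
# linear classifier (here scikit-learn's logistic regression, one possible choice).
from sklearn.linear_model import LogisticRegression

def learn_weights(instances):
    """instances: list of (difference_vector, +1/-1 label) pairs."""
    X = [diff for diff, _ in instances]
    y = [label for _, label in instances]
    clf = LogisticRegression(fit_intercept=False)   # no bias: the condition is w . diff > 0
    clf.fit(X, y)
    return clf.coef_[0]                             # the learned weight vector w

# Example, reusing the instances from the previous sketch:
# w = learn_weights(pair_instances(x_1, g_1))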
74 The exact loss function ls′(Hw, G) optimized depends on the choice of classifier. [sent-129, score-0.154]
75 Typical approaches to pairwise ranking enumerate all difference vectors as training data. [sent-130, score-0.275]
76 For tuning however, this means O(|I| ∗ Jmax²) vectors, where Jmax is the cardinality of the largest J(i). [sent-131, score-0.233]
77 Out of tractability considerations, we sample from the space of difference vectors, using the sampler template in Figure 4. [sent-133, score-0.117]
78 For each source sentence i, the sampler generates Γ candidate translation pairs ⟨j, j′⟩, and accepts each pair with probability αi(|g(i, j) − g(i, j′)|). [sent-134, score-0.358]
79 Among the accepted pairs, it keeps the Ξ with greatest g differential, and adds their difference vectors to the training data. [sent-135, score-0.071]
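Putting the sampling scheme into a sketch: Γ pairs are drawn per sentence, each accepted with probability αi based on the gold-score difference, and the Ξ accepted pairs with the largest differential are kept. The step-function α and the default values for Γ and Ξ below are illustrative choices, not taken from the text.

# Sketch of the sampler template: draw gamma random candidate pairs for a
# sentence, accept each with probability alpha(|g difference|), keep the xi
# accepted pairs with the largest gold differential, and emit their
# (mirrored) difference vectors as training instances.
import random

def alpha(score_diff, threshold=0.05):
    """Illustrative acceptance function: a simple step on the gold-score gap."""
    return 1.0 if score_diff > threshold else 0.0

def sample_pairs(x_i, g_i, gamma=5000, xi=50):
    n = len(x_i)
    accepted = []
    for _ in range(gamma):
        j, j2 = random.randrange(n), random.randrange(n)
        gap = abs(g_i[j] - g_i[j2])
        if j != j2 and random.random() < alpha(gap):
            accepted.append((gap, j, j2))
    accepted.sort(reverse=True)                    # largest gold differential first
    data = []
    for _, j, j2 in accepted[:xi]:
        diff = [a - b for a, b in zip(x_i[j], x_i[j2])]
        label = 1 if g_i[j] > g_i[j2] else -1
        data.append((diff, label))
        data.append(([-d for d in diff], -label))  # mirrored instance
    return data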
80 [Footnote 5: The intuition for biasing toward high score differential is...] [sent-138, score-0.082]
81 Figure 3: Result of synthetic data learning experiment for MERT and PRO, with and without added noise. [Plot title: “Synthetic parameter learning of MERT and PRO”; x-axis: Dimensionality.] [sent-139, score-0.152]
82 As the dimensionality increases MERT is unable to learn the original weights but PRO still performs adequately. [sent-140, score-0.064]
83 4.2 Scalability We repeated the scalability study from Section 3, now using our pairwise ranking optimization (hereafter, PRO) approach. [sent-142, score-0.479]
wordName wordTfidf (topN-words)
[('hw', 0.553), ('mert', 0.385), ('candidate', 0.214), ('policy', 0.208), ('tuning', 0.203), ('pro', 0.18), ('optimization', 0.156), ('synthetic', 0.152), ('scalability', 0.119), ('translations', 0.116), ('pairwise', 0.105), ('scoring', 0.103), ('chiang', 0.101), ('gold', 0.099), ('ranking', 0.099), ('policies', 0.095), ('maps', 0.094), ('mira', 0.091), ('weight', 0.089), ('herbrich', 0.082), ('loss', 0.08), ('finite', 0.078), ('vectors', 0.071), ('jmax', 0.07), ('phase', 0.067), ('vector', 0.067), ('source', 0.066), ('dimensionality', 0.064), ('watanabe', 0.064), ('infinitely', 0.06), ('differential', 0.055), ('integers', 0.055), ('ej', 0.055), ('minimizes', 0.053), ('pool', 0.053), ('mt', 0.052), ('generation', 0.05), ('index', 0.049), ('architecture', 0.048), ('burges', 0.048), ('freund', 0.048), ('function', 0.047), ('space', 0.046), ('couple', 0.045), ('claims', 0.045), ('cao', 0.045), ('go', 0.045), ('xi', 0.044), ('thousands', 0.043), ('scales', 0.042), ('sampler', 0.041), ('optimizes', 0.041), ('handful', 0.041), ('module', 0.041), ('och', 0.04), ('ls', 0.038), ('tune', 0.037), ('translation', 0.037), ('classic', 0.037), ('infinite', 0.035), ('singer', 0.035), ('ji', 0.034), ('validate', 0.031), ('hi', 0.03), ('crammer', 0.03), ('behaves', 0.03), ('sdl', 0.03), ('weaver', 0.03), ('ibse', 0.03), ('oefm', 0.03), ('parity', 0.03), ('inhibits', 0.03), ('pdo', 0.03), ('oefs', 0.03), ('chatterjee', 0.03), ('devised', 0.03), ('tractability', 0.03), ('denkowski', 0.03), ('framed', 0.03), ('thhei', 0.03), ('spaces', 0.03), ('minimize', 0.028), ('established', 0.027), ('optimized', 0.027), ('sir', 0.027), ('aer', 0.027), ('dozen', 0.027), ('loses', 0.027), ('cies', 0.027), ('melamed', 0.027), ('biasing', 0.027), ('succeeding', 0.027), ('tahnids', 0.027), ('xg', 0.027), ('adoption', 0.027), ('interdependent', 0.027), ('diverges', 0.027), ('tfhoer', 0.027), ('eed', 0.027), ('hope', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 138 emnlp-2011-Tuning as Ranking
Author: Mark Hopkins ; Jonathan May
Abstract: We offer a simple, effective, and scalable method for statistical machine translation parameter tuning based on the pairwise approach to ranking (Herbrich et al., 1999). Unlike the popular MERT algorithm (Och, 2003), our pairwise ranking optimization (PRO) method is not limited to a handful of parameters and can easily handle systems with thousands of features. Moreover, unlike recent approaches built upon the MIRA algorithm of Crammer and Singer (2003) (Watanabe et al., 2007; Chiang et al., 2008b), PRO is easy to implement. It uses off-the-shelf linear binary classifier software and can be built on top of an existing MERT framework in a matter of hours. We establish PRO’s scalability and effectiveness by comparing it to MERT and MIRA and demonstrate parity on both phrase-based and syntax-based systems in a variety of language pairs, using large scale data scenarios.
2 0.17092858 100 emnlp-2011-Optimal Search for Minimum Error Rate Training
Author: Michel Galley ; Chris Quirk
Abstract: Minimum error rate training is a crucial component to many state-of-the-art NLP applications, such as machine translation and speech recognition. However, common evaluation functions such as BLEU or word error rate are generally highly non-convex and thus prone to search errors. In this paper, we present LP-MERT, an exact search algorithm for minimum error rate training that reaches the global optimum using a series of reductions to linear programming. Given a set of N-best lists produced from S input sentences, this algorithm finds a linear model that is globally optimal with respect to this set. We find that this algorithm is polynomial in N and in the size of the model, but exponential in S. We present extensions of this work that let us scale to reasonably large tuning sets (e.g., one thousand sentences), by either searching only promising regions of the parameter space, or by using a variant of LP-MERT that relies on a beam-search approximation. Experimental results show improvements over the standard Och algorithm.
3 0.14432353 10 emnlp-2011-A Probabilistic Forest-to-String Model for Language Generation from Typed Lambda Calculus Expressions
Author: Wei Lu ; Hwee Tou Ng
Abstract: This paper describes a novel probabilistic approach for generating natural language sentences from their underlying semantics in the form of typed lambda calculus. The approach is built on top of a novel reduction-based weighted synchronous context free grammar formalism, which facilitates the transformation process from typed lambda calculus into natural language sentences. Sentences can then be generated based on such grammar rules with a log-linear model. To acquire such grammar rules automatically in an unsupervised manner, we also propose a novel approach with a generative model, which maps from sub-expressions of logical forms to word sequences in natural language sentences. Experiments on benchmark datasets for both English and Chinese generation tasks yield significant improvements over results obtained by two state-of-the-art machine translation models, in terms of both automatic metrics and human evaluation.
4 0.1254736 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation
Author: Chang Liu ; Daniel Dahlmeier ; Hwee Tou Ng
Abstract: Many machine translation evaluation metrics have been proposed after the seminal BLEU metric, and many among them have been found to consistently outperform BLEU, demonstrated by their better correlations with human judgment. It has long been the hope that by tuning machine translation systems against these new generation metrics, advances in automatic machine translation evaluation can lead directly to advances in automatic machine translation. However, to date there has been no unambiguous report that these new metrics can improve a state-of-theart machine translation system over its BLEUtuned baseline. In this paper, we demonstrate that tuning Joshua, a hierarchical phrase-based statistical machine translation system, with the TESLA metrics results in significantly better humanjudged translation quality than the BLEUtuned baseline. TESLA-M in particular is simple and performs well in practice on large datasets. We release all our implementation under an open source license. It is our hope that this work will encourage the machine translation community to finally move away from BLEU as the unquestioned default and to consider the new generation metrics when tuning their systems.
5 0.096386306 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation
Author: Kevin Gimpel ; Noah A. Smith
Abstract: We present a quasi-synchronous dependency grammar (Smith and Eisner, 2006) for machine translation in which the leaves of the tree are phrases rather than words as in previous work (Gimpel and Smith, 2009). This formulation allows us to combine structural components of phrase-based and syntax-based MT in a single model. We describe a method of extracting phrase dependencies from parallel text using a target-side dependency parser. For decoding, we describe a coarse-to-fine approach based on lattice dependency parsing of phrase lattices. We demonstrate performance improvements for Chinese-English and UrduEnglish translation over a phrase-based baseline. We also investigate the use of unsupervised dependency parsers, reporting encouraging preliminary results.
6 0.085631363 58 emnlp-2011-Fast Generation of Translation Forest for Large-Scale SMT Discriminative Training
7 0.075636685 93 emnlp-2011-Minimum Imputed-Risk: Unsupervised Discriminative Training for Machine Translation
8 0.072628453 137 emnlp-2011-Training dependency parsers by jointly optimizing multiple objectives
9 0.061167561 44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection
10 0.056675926 6 emnlp-2011-A Generate and Rank Approach to Sentence Paraphrasing
11 0.056586839 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation
12 0.052461077 125 emnlp-2011-Statistical Machine Translation with Local Language Models
13 0.051732704 20 emnlp-2011-Augmenting String-to-Tree Translation Models with Fuzzy Use of Source-side Syntax
14 0.050923675 29 emnlp-2011-Collaborative Ranking: A Case Study on Entity Linking
15 0.050162148 136 emnlp-2011-Training a Parser for Machine Translation Reordering
16 0.049940422 148 emnlp-2011-Watermarking the Outputs of Structured Prediction with an application in Statistical Machine Translation.
17 0.047722027 96 emnlp-2011-Multilayer Sequence Labeling
18 0.046069145 85 emnlp-2011-Learning to Simplify Sentences with Quasi-Synchronous Grammar and Integer Programming
19 0.043215163 129 emnlp-2011-Structured Sparsity in Structured Prediction
20 0.043195251 141 emnlp-2011-Unsupervised Dependency Parsing without Gold Part-of-Speech Tags
topicId topicWeight
[(0, 0.177), (1, 0.085), (2, 0.055), (3, -0.114), (4, 0.056), (5, -0.071), (6, 0.01), (7, -0.074), (8, 0.037), (9, 0.041), (10, -0.002), (11, 0.016), (12, -0.057), (13, -0.078), (14, 0.007), (15, 0.044), (16, 0.027), (17, 0.046), (18, 0.077), (19, -0.023), (20, -0.014), (21, 0.077), (22, 0.028), (23, 0.144), (24, 0.364), (25, -0.051), (26, -0.111), (27, 0.079), (28, 0.049), (29, -0.071), (30, -0.066), (31, -0.001), (32, 0.268), (33, -0.065), (34, 0.069), (35, -0.055), (36, 0.009), (37, -0.06), (38, 0.197), (39, 0.01), (40, 0.042), (41, 0.038), (42, -0.027), (43, 0.18), (44, -0.017), (45, 0.004), (46, 0.041), (47, 0.039), (48, 0.098), (49, 0.094)]
simIndex simValue paperId paperTitle
same-paper 1 0.98085898 138 emnlp-2011-Tuning as Ranking
Author: Mark Hopkins ; Jonathan May
Abstract: We offer a simple, effective, and scalable method for statistical machine translation parameter tuning based on the pairwise approach to ranking (Herbrich et al., 1999). Unlike the popular MERT algorithm (Och, 2003), our pairwise ranking optimization (PRO) method is not limited to a handful of parameters and can easily handle systems with thousands of features. Moreover, unlike recent approaches built upon the MIRA algorithm of Crammer and Singer (2003) (Watanabe et al., 2007; Chiang et al., 2008b), PRO is easy to implement. It uses off-the-shelf linear binary classifier software and can be built on top of an existing MERT framework in a matter of hours. We establish PRO’s scalability and effectiveness by comparing it to MERT and MIRA and demonstrate parity on both phrase-based and syntax-based systems in a variety of language pairs, using large scale data scenarios.
2 0.76324779 100 emnlp-2011-Optimal Search for Minimum Error Rate Training
Author: Michel Galley ; Chris Quirk
Abstract: Minimum error rate training is a crucial component to many state-of-the-art NLP applications, such as machine translation and speech recognition. However, common evaluation functions such as BLEU or word error rate are generally highly non-convex and thus prone to search errors. In this paper, we present LP-MERT, an exact search algorithm for minimum error rate training that reaches the global optimum using a series of reductions to linear programming. Given a set of N-best lists produced from S input sentences, this algorithm finds a linear model that is globally optimal with respect to this set. We find that this algorithm is polynomial in N and in the size of the model, but exponential in S. We present extensions of this work that let us scale to reasonably large tuning sets (e.g., one thousand sentences), by either searching only promising regions of the parameter space, or by using a variant of LP-MERT that relies on a beam-search approximation. Experimental results show improvements over the standard Och algorithm.
3 0.44887581 93 emnlp-2011-Minimum Imputed-Risk: Unsupervised Discriminative Training for Machine Translation
Author: Zhifei Li ; Ziyuan Wang ; Jason Eisner ; Sanjeev Khudanpur ; Brian Roark
Abstract: Discriminative training for machine translation has been well studied in the recent past. A limitation of the work to date is that it relies on the availability of high-quality in-domain bilingual text for supervised training. We present an unsupervised discriminative training framework to incorporate the usually plentiful target-language monolingual data by using a rough “reverse” translation system. Intuitively, our method strives to ensure that probabilistic “round-trip” translation from a target- language sentence to the source-language and back will have low expected loss. Theoretically, this may be justified as (discriminatively) minimizing an imputed empirical risk. Empirically, we demonstrate that augmenting supervised training with unsupervised data improves translation performance over the supervised case for both IWSLT and NIST tasks.
4 0.44853553 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation
Author: Chang Liu ; Daniel Dahlmeier ; Hwee Tou Ng
Abstract: Many machine translation evaluation metrics have been proposed after the seminal BLEU metric, and many among them have been found to consistently outperform BLEU, demonstrated by their better correlations with human judgment. It has long been the hope that by tuning machine translation systems against these new generation metrics, advances in automatic machine translation evaluation can lead directly to advances in automatic machine translation. However, to date there has been no unambiguous report that these new metrics can improve a state-of-theart machine translation system over its BLEUtuned baseline. In this paper, we demonstrate that tuning Joshua, a hierarchical phrase-based statistical machine translation system, with the TESLA metrics results in significantly better humanjudged translation quality than the BLEUtuned baseline. TESLA-M in particular is simple and performs well in practice on large datasets. We release all our implementation under an open source license. It is our hope that this work will encourage the machine translation community to finally move away from BLEU as the unquestioned default and to consider the new generation metrics when tuning their systems.
5 0.40424401 10 emnlp-2011-A Probabilistic Forest-to-String Model for Language Generation from Typed Lambda Calculus Expressions
Author: Wei Lu ; Hwee Tou Ng
Abstract: This paper describes a novel probabilistic approach for generating natural language sentences from their underlying semantics in the form of typed lambda calculus. The approach is built on top of a novel reduction-based weighted synchronous context free grammar formalism, which facilitates the transformation process from typed lambda calculus into natural language sentences. Sentences can then be generated based on such grammar rules with a log-linear model. To acquire such grammar rules automatically in an unsupervised manner, we also propose a novel approach with a generative model, which maps from sub-expressions of logical forms to word sequences in natural language sentences. Experiments on benchmark datasets for both English and Chinese generation tasks yield significant improvements over results obtained by two state-of-the-art machine translation models, in terms of both automatic metrics and human evaluation.
6 0.34268937 137 emnlp-2011-Training dependency parsers by jointly optimizing multiple objectives
7 0.322721 58 emnlp-2011-Fast Generation of Translation Forest for Large-Scale SMT Discriminative Training
9 0.24906839 53 emnlp-2011-Experimental Support for a Categorical Compositional Distributional Model of Meaning
10 0.24501094 85 emnlp-2011-Learning to Simplify Sentences with Quasi-Synchronous Grammar and Integer Programming
11 0.2429312 44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection
12 0.23733261 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation
13 0.23440048 66 emnlp-2011-Hierarchical Phrase-based Translation Representations
14 0.22677039 96 emnlp-2011-Multilayer Sequence Labeling
15 0.22085954 24 emnlp-2011-Bootstrapping Semantic Parsers from Conversations
16 0.21612506 29 emnlp-2011-Collaborative Ranking: A Case Study on Entity Linking
17 0.21362109 6 emnlp-2011-A Generate and Rank Approach to Sentence Paraphrasing
18 0.21153118 140 emnlp-2011-Universal Morphological Analysis using Structured Nearest Neighbor Prediction
19 0.20229326 129 emnlp-2011-Structured Sparsity in Structured Prediction
20 0.19744739 65 emnlp-2011-Heuristic Search for Non-Bottom-Up Tree Structure Prediction
topicId topicWeight
[(12, 0.286), (23, 0.115), (36, 0.063), (37, 0.033), (45, 0.046), (54, 0.034), (57, 0.016), (62, 0.055), (64, 0.026), (66, 0.052), (69, 0.016), (79, 0.062), (82, 0.031), (90, 0.011), (96, 0.033), (98, 0.037)]
simIndex simValue paperId paperTitle
same-paper 1 0.7574836 138 emnlp-2011-Tuning as Ranking
Author: Mark Hopkins ; Jonathan May
Abstract: We offer a simple, effective, and scalable method for statistical machine translation parameter tuning based on the pairwise approach to ranking (Herbrich et al., 1999). Unlike the popular MERT algorithm (Och, 2003), our pairwise ranking optimization (PRO) method is not limited to a handful of parameters and can easily handle systems with thousands of features. Moreover, unlike recent approaches built upon the MIRA algorithm of Crammer and Singer (2003) (Watanabe et al., 2007; Chiang et al., 2008b), PRO is easy to implement. It uses off-the-shelf linear binary classifier software and can be built on top of an existing MERT framework in a matter of hours. We establish PRO’s scalability and effectiveness by comparing it to MERT and MIRA and demonstrate parity on both phrase-based and syntax-based systems in a variety of language pairs, using large scale data scenarios.
2 0.71890712 132 emnlp-2011-Syntax-Based Grammaticality Improvement using CCG and Guided Search
Author: Yue Zhang ; Stephen Clark
Abstract: Machine-produced text often lacks grammaticality and fluency. This paper studies grammaticality improvement using a syntax-based algorithm based on CCG. The goal of the search problem is to find an optimal parse tree among all that can be constructed through selection and ordering of the input words. The search problem, which is significantly harder than parsing, is solved by guided learning for best-first search. In a standard word ordering task, our system gives a BLEU score of 40. 1, higher than the previous result of 33.7 achieved by a dependency-based system.
3 0.49295878 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation
Author: Kevin Gimpel ; Noah A. Smith
Abstract: We present a quasi-synchronous dependency grammar (Smith and Eisner, 2006) for machine translation in which the leaves of the tree are phrases rather than words as in previous work (Gimpel and Smith, 2009). This formulation allows us to combine structural components of phrase-based and syntax-based MT in a single model. We describe a method of extracting phrase dependencies from parallel text using a target-side dependency parser. For decoding, we describe a coarse-to-fine approach based on lattice dependency parsing of phrase lattices. We demonstrate performance improvements for Chinese-English and UrduEnglish translation over a phrase-based baseline. We also investigate the use of unsupervised dependency parsers, reporting encouraging preliminary results.
4 0.48653933 1 emnlp-2011-A Bayesian Mixture Model for PoS Induction Using Multiple Features
Author: Christos Christodoulopoulos ; Sharon Goldwater ; Mark Steedman
Abstract: In this paper we present a fully unsupervised syntactic class induction system formulated as a Bayesian multinomial mixture model, where each word type is constrained to belong to a single class. By using a mixture model rather than a sequence model (e.g., HMM), we are able to easily add multiple kinds of features, including those at both the type level (morphology features) and token level (context and alignment features, the latter from parallel corpora). Using only context features, our system yields results comparable to state-of-the art, far better than a similar model without the one-class-per-type constraint. Using the additional features provides added benefit, and our final system outperforms the best published results on most of the 25 corpora tested.
5 0.4853451 136 emnlp-2011-Training a Parser for Machine Translation Reordering
Author: Jason Katz-Brown ; Slav Petrov ; Ryan McDonald ; Franz Och ; David Talbot ; Hiroshi Ichikawa ; Masakazu Seno ; Hideto Kazawa
Abstract: We propose a simple training regime that can improve the extrinsic performance of a parser, given only a corpus of sentences and a way to automatically evaluate the extrinsic quality of a candidate parse. We apply our method to train parsers that excel when used as part of a reordering component in a statistical machine translation system. We use a corpus of weakly-labeled reference reorderings to guide parser training. Our best parsers contribute significant improvements in subjective translation quality while their intrinsic attachment scores typically regress.
6 0.48457566 69 emnlp-2011-Identification of Multi-word Expressions by Combining Multiple Linguistic Information Sources
7 0.48383772 137 emnlp-2011-Training dependency parsers by jointly optimizing multiple objectives
8 0.48262662 68 emnlp-2011-Hypotheses Selection Criteria in a Reranking Framework for Spoken Language Understanding
9 0.48100504 53 emnlp-2011-Experimental Support for a Categorical Compositional Distributional Model of Meaning
10 0.48100463 35 emnlp-2011-Correcting Semantic Collocation Errors with L1-induced Paraphrases
11 0.47985491 20 emnlp-2011-Augmenting String-to-Tree Translation Models with Fuzzy Use of Source-side Syntax
12 0.4795762 126 emnlp-2011-Structural Opinion Mining for Graph-based Sentiment Representation
13 0.479312 70 emnlp-2011-Identifying Relations for Open Information Extraction
14 0.47914931 65 emnlp-2011-Heuristic Search for Non-Bottom-Up Tree Structure Prediction
15 0.47847369 6 emnlp-2011-A Generate and Rank Approach to Sentence Paraphrasing
16 0.47828779 59 emnlp-2011-Fast and Robust Joint Models for Biomedical Event Extraction
17 0.47820708 100 emnlp-2011-Optimal Search for Minimum Error Rate Training
18 0.47752768 83 emnlp-2011-Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation
19 0.47595844 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances
20 0.47503978 128 emnlp-2011-Structured Relation Discovery using Generative Models