emnlp emnlp2011 emnlp2011-100 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Michel Galley ; Chris Quirk
Abstract: Minimum error rate training is a crucial component to many state-of-the-art NLP applications, such as machine translation and speech recognition. However, common evaluation functions such as BLEU or word error rate are generally highly non-convex and thus prone to search errors. In this paper, we present LP-MERT, an exact search algorithm for minimum error rate training that reaches the global optimum using a series of reductions to linear programming. Given a set of N-best lists produced from S input sentences, this algorithm finds a linear model that is globally optimal with respect to this set. We find that this algorithm is polynomial in N and in the size of the model, but exponential in S. We present extensions of this work that let us scale to reasonably large tuning sets (e.g., one thousand sentences), by either searching only promising regions of the parameter space, or by using a variant of LP-MERT that relies on a beam-search approximation. Experimental results show improvements over the standard Och algorithm.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Minimum error rate training is a crucial component to many state-of-the-art NLP applications, such as machine translation and speech recognition. [sent-2, score-0.18]
2 However, common evaluation functions such as BLEU or word error rate are generally highly non-convex and thus prone to search errors. [sent-3, score-0.19]
3 In this paper, we present LP-MERT, an exact search algorithm for minimum error rate training that reaches the global optimum using a series of reductions to linear programming. [sent-4, score-0.479]
4 Given a set of N-best lists produced from S input sentences, this algorithm finds a linear model that is globally optimal with respect to this set. [sent-5, score-0.211]
5 1 Introduction Minimum error rate training (MERT)—also known as direct loss minimization in machine learning—is a crucial component in many complex natural language applications such as speech recognition (Chou et al. [sent-11, score-0.266]
6 The unsmoothed error count is a highly non-convex objective function and therefore difficult to optimize directly; prior work offers no algorithm with a good approximation guarantee. [sent-24, score-0.189]
7 , 1997) relies on standard convex optimization techniques applied to non-convex problems, the Och algorithm (Och, 2003) represents a significant advance for MERT since it applies a series of special line minimizations that happen to be exhaustive and efficient. [sent-27, score-0.271]
8 Since this algorithm remains inexact in the multidimensional case, much of the recent work on MERT has focused on extending Och’s algorithm to find better search directions and starting points (Cer et al. [sent-28, score-0.247]
9 In this paper, we present LP-MERT, an exact search algorithm for N-best optimization that exploits general assumptions commonly made with MERT, e. [sent-31, score-0.193]
10 Instead, LP-MERT exploits algorithmic devices such as lazy enumeration, divide-and-conquer, and linear programming to efficiently discard partial solutions that cannot be maximized by any linear model. [sent-42, score-0.34]
11 We show that this algorithm is polynomial in N and in the size of the model, but exponential in the number of tuning sentences. [sent-45, score-0.17]
12 To handle reasonably large tuning sets, we present two modifications of LP-MERT that either search only promising regions of the parameter space, or that rely on a beam-search approximation. [sent-46, score-0.195]
13 To our knowledge, it is the first known exact search algorithm for optimizing task loss on N-best lists in general dimensions. [sent-49, score-0.355]
14 We also present an approximate version of LP-MERT that offers a natural means of trading speed for accuracy, as we are guaranteed to eventually find the global optimum as we gradually increase beam size. [sent-50, score-0.2]
15 In MERT, the goal is to minimize an error count E(r, e) by scoring translation hypotheses against a set of reference translations r1S = r1 . [sent-75, score-0.2]
16 Assuming as in Och (2003) that error count is additively decomposable by sentence—i. [sent-79, score-0.146]
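For reference, the decomposable objective that this assumption yields can be written out as below; this is a restatement of the standard Och-style formulation, with the feature-vector notation h(e, fs) assumed from context rather than defined in this excerpt.

```latex
\hat{w} \;=\; \operatorname*{argmin}_{w} \;\sum_{s=1}^{S} E\bigl(r_s,\; \hat{e}(f_s; w)\bigr),
\qquad
\hat{e}(f_s; w) \;=\; \operatorname*{argmax}_{e \,\in\, \{e_{s,1},\dots,e_{s,N}\}} w^{\top} h(e, f_s)
```

That is, the tuning-set error is a sum of per-sentence errors of the 1-best hypotheses selected from each N-best list under w.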
17 The quality of this approximation is dependent on how accurately the N-best lists represent the search space of the system. [sent-90, score-0.138]
18 Therefore, the hypothesis list is iteratively grown: decoding with an initial parameter vector seeds the N-best lists; next, parameter estimation and N-best list gathering alternate until the search space is deemed representative. [sent-91, score-0.146]
19 Furthermore, this function for a single sentence may be computed efficiently by first finding the hypotheses that form the upper envelope of the model score function, then gathering the error count for each hypothesis along the range for which it is optimal. [sent-93, score-0.191]
20 efficient algorithm for finding the global optimum of the error count along any single direction. [sent-97, score-0.234]
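To make the envelope computation concrete, here is a minimal sketch of the single-direction line search along w0 + γd for one sentence; the names line_search, feats, and err are illustrative, and tie-breaking and boundary handling are simplified assumptions rather than details taken from the text.

```python
import numpy as np

def line_search(feats, err, w0, d):
    """Sketch of 1D MERT for one sentence: feats is an (N, D) array of feature
    vectors, err an (N,) array of per-hypothesis error counts. Returns a gamma
    minimizing the error of the 1-best hypothesis under w0 + gamma * d."""
    a = feats @ w0                       # intercepts of the score lines in gamma
    b = feats @ d                        # slopes of the score lines
    order = sorted(range(len(err)), key=lambda n: (b[n], a[n]))
    hull, bounds = [], []                # envelope hypotheses and their left boundaries
    for n in order:                      # convex-hull-trick construction of the envelope
        while True:
            if not hull:
                hull.append(n); bounds.append(-np.inf); break
            top = hull[-1]
            if b[n] == b[top]:           # same slope: keep only the higher intercept
                hull.pop(); bounds.pop(); continue
            x = (a[top] - a[n]) / (b[n] - b[top])  # gamma where line n overtakes top
            if x <= bounds[-1]:          # top is never optimal: remove it
                hull.pop(); bounds.pop(); continue
            hull.append(n); bounds.append(x); break
    best_gamma, best_err = 0.0, np.inf   # sweep envelope segments, track lowest error
    for i, n in enumerate(hull):
        left = bounds[i]
        right = bounds[i + 1] if i + 1 < len(hull) else np.inf
        if err[n] < best_err:
            best_err = err[n]
            if np.isinf(left) and np.isinf(right): best_gamma = 0.0
            elif np.isinf(left):                   best_gamma = right - 1.0
            elif np.isinf(right):                  best_gamma = left + 1.0
            else:                                  best_gamma = 0.5 * (left + right)
    return best_gamma, best_err
```

For a whole tuning set, the envelope boundaries of all sentences are merged and the summed error is swept over the resulting intervals; that bookkeeping is omitted here.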
21 Such a hill-climbing algorithm in a non-convex space has no optimality guarantee: without a perfect direction finder, even a globally-exact line search may never encounter the global optimum. [sent-98, score-0.192]
22 The only difficulty with S = 1 is to know for each translation en whether its feature vector h1,n (or hn for short) can be maximized using any linear model. [sent-109, score-0.297]
23 In geometric terminology, the former points are commonly called extreme points, and the latter are interior points. [sent-116, score-0.512]
24 3 The problem of exactly optimizing a single N-best list is closely related to the convex hull problem in computational geometry, for which generic solvers such as the QuickHull algorithm exist (Eddy, 1977; Bykat, 1978; Barber et al. [sent-117, score-0.531]
25 A first approach would be to construct the convex hull conv(h1 . [sent-119, score-0.459]
26 hN) of the N-best list, then identify the point on the hull with lowest loss (h1 in Fig. [sent-122, score-0.411]
27 1) and finally compute an optimal weight vector using hull points that share common facets with it. Footnote 3: Specifically, a point h is extreme with respect to a convex set C (e. [sent-123, score-0.907]
28 In a minor abuse of terminology, we sometimes simply state that a given point h is extreme when the nature of C is clear from context. [sent-127, score-0.23]
29 hN) with associated losses (here, TER scores) for a single input sentence, whose convex hull is displayed with dotted lines in (a). [sent-135, score-0.496]
30 , the w in (b)), no linear model can possibly maximize any of the points strictly inside the convex hull. [sent-139, score-0.297]
31 Unfortunately, this doesn’t quite scale even with a single N-best list, since the best known convex hull algorithm runs in time (Barber et al. [sent-141, score-0.496]
32 4 Algorithms presented in this paper assume that D is unrestricted, therefore we cannot afford to build any convex hull explicitly. [sent-143, score-0.459]
33 Thus, we turn to linear programming (LP), for which we know algorithms (Karmarkar, 1984) that are polynomial in the number of dimensions and linear in the number of points, i. [sent-144, score-0.161]
34 , O(NT). To check if point hi is extreme, we really only need to know whether we can define a half-space containing all points h1 . [sent-146, score-0.237]
35 hN, with hi lying on the hyperplane delimiting that halfspace, as shown in Fig. [sent-149, score-0.135]
36 Formally, a vertex hi is optimal with respect to arg maxi {wᵀhi} if and only if the following constraints are satisfied. [sent-151, score-0.25]
37 Footnote 4: A convex hull algorithm polynomial in D is very unlikely. [sent-154, score-0.533]
38 Indeed, the expected number of facets of high-dimensional convex hulls grows dramatically, and—assuming a uniform distribution of points, D = 10, and a sufficiently large N—the expected number of facets is approximately 10^6 N (Buchta et al. [sent-155, score-0.308]
39 In the worst case, the maximum number of facets of a convex hull is O(N^⌊D/2⌋ / ⌊D/2⌋!
40 whether a given point is extreme is presented in http://www. [sent-161, score-0.23]
41 The vertex hi is extreme if and only if the LP solver finds a non-zero vector w satisfying the canonical system. [sent-172, score-0.423]
42 To ensure that w is zero only when hi is interior, we set c = hi − hµ, where hµ is a point known to be inside the hull (e. [sent-173, score-0.306]
43 hN), which returns the weight vector wˆ maximizing hi, or which returns 0 if hi is interior to conv(h1 . [sent-179, score-0.422]
44 hN) to denote whether hi is extreme with respect to this hull. [sent-186, score-0.329]
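As an illustration of this test, here is a minimal sketch of a LINOPTIMIZER-style check built on an off-the-shelf LP solver; the use of scipy.optimize.linprog, the box bound that keeps the LP bounded, and the tolerance eps are assumptions of this sketch, not details given in the text.

```python
import numpy as np
from scipy.optimize import linprog

def lin_optimizer(i, H, eps=1e-9):
    """Return a weight vector under which H[i] scores at least as high as every
    other row of H (and strictly higher than the mean point), or None if no such
    non-trivial vector exists. H is an (N, D) array of feature vectors."""
    h_i = H[i]
    h_mu = H.mean(axis=0)                 # a point known to lie inside the hull
    c = -(h_i - h_mu)                     # linprog minimizes, so negate to maximize
    A_ub = H - h_i                        # rows h_j - h_i: enforce w.(h_j - h_i) <= 0
    b_ub = np.zeros(len(H))
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(-1.0, 1.0)] * H.shape[1])  # box bound keeps the LP bounded
    if res.success and -res.fun > eps:    # found w with w.h_i > w.h_mu
        return res.x
    return None                           # treat H[i] as interior
```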
45 An exact search algorithm for optimizing a single N-best list is shown above. [sent-201, score-0.18]
46 It lazily enumerates feature vectors in increasing order of task loss, keeping only the extreme ones. [sent-202, score-0.194]
47 Such a vertex hj is known to be on the convex hull, and the returned vector wˆ maximizes it. [sent-203, score-0.425]
48 1, it would first run LINOPTIMIZER on h3, discard it since it is interior, and finally accept the extreme point h1. [sent-205, score-0.284]
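Continuing the sketch above, the S = 1 search then reduces to visiting hypotheses in increasing loss order and returning the first one the LP test accepts; best_extreme is an illustrative name and the degenerate-case handling is an assumption.

```python
def best_extreme(H, losses):
    """Sketch of the S = 1 exact search: return (loss, weight vector) for the
    lowest-loss hypothesis that some linear model can maximize."""
    for n in sorted(range(len(H)), key=lambda n: losses[n]):
        w = lin_optimizer(n, H)           # LP test from the previous sketch
        if w is not None:                 # first extreme point in loss order wins
            return losses[n], w
    raise RuntimeError("no extreme point found")  # only possible for degenerate lists
```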
49 Each execution of LINOPTIMIZER requires O(NT) time with the interior point method. Footnote 6: We assume that h1 . [sent-206, score-0.33]
50 To determine which feature vectors were on the hull, we use either linear programming (Karmarkar, 1984) or one of the most efficient convex hull computation tools (Barber et al. [sent-214, score-0.538]
51 Both h1,1 and h2,1 in (a) and (b) are extreme with respect to their own N-best list, and we ask whether we can find a weight vector that maximizes both h1,1 and h2,1. [sent-247, score-0.229]
52 The algorithmic trick is to geometrically translate one of the two N-best lists so that h1,1 = h'2,1, where h'2,1 is the translation of h2,1. [sent-248, score-0.167]
53 In the case of the combination of h1,1 and h2,2, we see in (d) that the combined set of points prevents the maximization of h1,1, since this point is clearly no longer on the hull. [sent-250, score-0.15]
54 Hence, the combination (h1,1,h2,2) cannot be maximized using any linear model. [sent-251, score-0.149]
55 2 Lazy enumeration, divide-and-conquer Now that we can determine whether a given combination is extreme, we must next enumerate candidate combinations to find the combination that has lowest task loss among all of those that are extreme. [sent-257, score-0.33]
56 Since the number of feature vector combinations is O(N^S), exhaustive enumeration is not a reasonable option. Figure 3: Given two N-best lists, (a) and (b), we use linear programming to determine which hypothesis combinations are extreme. [sent-258, score-0.431]
57 For instance, the combination h1,1 and h2,1 is extreme (c), while h1,1 and h2,2 is not (d). [sent-259, score-0.242]
58 Instead, we use lazy enumeration to process combinations in increasing order of task loss, which ensures that the first extreme combination for s = 1. [sent-261, score-0.531]
59 An S-ary lazy enumeration would not be particularly efficient, since the runtime is still O(N^S) in the worst case. [sent-265, score-0.246]
60 LP-MERT instead uses divide-and-conquer and binary lazy enumeration, which enables us to discard early on combinations that are not extreme. [sent-266, score-0.248]
61 For instance, if we find that (h1,1,h2,2) is interior for sentences s = 1, 2, the divide-and-conquer branch for s = 1. [sent-267, score-0.252]
62 4 never actually receives this bad combination from its left child, thus avoiding the cost of enumerating combinations that are known to be interior, e. [sent-270, score-0.136]
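The binary lazy enumeration itself can be sketched as the familiar two-way lazy merge below, in the spirit of Huang and Chiang (2005); the interleaved LP filtering that lets LP-MERT discard interior combinations (and the subtrees they would generate) is deliberately left out, and all names are illustrative.

```python
import heapq

def lazy_combine(left, right):
    """left, right: lists of (loss, item) pairs sorted by loss. Yields
    (total_loss, (item_left, item_right)) in non-decreasing total loss."""
    if not left or not right:
        return
    frontier = [(left[0][0] + right[0][0], 0, 0)]   # start from the best pair
    seen = {(0, 0)}
    while frontier:
        loss, i, j = heapq.heappop(frontier)
        yield loss, (left[i][1], right[j][1])
        for ni, nj in ((i + 1, j), (i, j + 1)):     # lazily extend the frontier
            if ni < len(left) and nj < len(right) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(frontier, (left[ni][0] + right[nj][0], ni, nj))
```

In LP-MERT, each popped combination would be passed through a COMBINE/LP check before being offered to the parent node, so interior combinations never reach higher levels of the divide-and-conquer tree.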
63 The latter function uses binary lazy enumeration in a manner similar to (Huang and Chiang, 2005), and relies on two global variables: I L. [sent-278, score-0.243]
64 O(N^4) translation combinations are possible, but the LP-MERT algorithm only tests two full combinations. [sent-290, score-0.18]
65 , using 4-ary lazy enumeration—ten full combinations would have been checked unnecessarily. [sent-293, score-0.194]
66 Input: sentence-level costs Es,n := E(rs, es,n), 1 ≤ s ≤ S; output: optimal weight vector wˆ and its loss L. [sent-295, score-0.137]
67 Once no more cells need to be added to the frontier, LP-MERT identifies the lowest loss combination on the frontier (BESTINFRONTIER), and uses LP to determine whether it is extreme. [sent-302, score-0.187]
68 Regarding ranges of length one (s = t), lines 3-10 are similar to Algorithm 1 for S = 1, but with one difference: GETNEXTBEST may be called multiple times with the same argument s, since the first output of GETNEXTBEST might not be extreme when combined with other feature vectors. [sent-307, score-0.194]
69 7 We can see that a strength of this algorithm is that inconsistent combinations are deleted as soon as possible, which allows us to discard fruitless candidates en masse. [sent-310, score-0.179]
70 We now present two such modifications. Footnote 7: Each N-best list is augmented with a placeholder hypothesis with loss +∞. [sent-313, score-0.148]
71 Function COMBINE(h, H, h', H'); input: H, H': constraint vertices; input: h, h': extreme vertices, wrt. [sent-315, score-0.251]
72 While this modification of our algorithm may lead to search errors, it nevertheless provides some theoretical guarantee: our algorithm finds the global optimum if it lies within the region defined by cos( wˆ, w0) ≥ t. [sent-326, score-0.232]
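A small sketch of the cosine constraint just mentioned; the helper name and the way the threshold t would be chosen are assumptions, and the text does not specify exactly where in the search the check is applied.

```python
import numpy as np

def in_search_region(w_hat, w0, t):
    """Return True if cos(w_hat, w0) >= t, i.e. w_hat lies in the cone around w0."""
    denom = np.linalg.norm(w_hat) * np.linalg.norm(w0)
    return denom > 0 and float(w_hat @ w0) / denom >= t
```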
73 The main idea is to prune the output of COMBINE (line 26) by model score with respect to wbest, where wbest is our current best model on the entire tuning set. [sent-328, score-0.252]
74 Note that beam pruning can discard h∗ (the current best extreme vertex), in which case LINOPTIMIZER returns 0. [sent-329, score-0.307]
75 wbest is updated as follows: each time we produce a new non-zero wˆ, we set wbest ← wˆ if wˆ has a lower loss than wbest on the entire tuning set. [sent-330, score-0.666]
76 The idea of using a beam here is similar to using cosine similarity (since wbest constrains the search towards a promising region), but beam pruning also helps reduce LP optimization time and thus enables us to 44 explore a wider space. [sent-331, score-0.387]
77 Since wbest often improves during search, it is useful to run multiple iterations of LP-MERT until wbest doesn’t change. [sent-332, score-0.312]
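A minimal sketch of the outer loop these two sentences describe; beam_lp_mert and corpus_loss are placeholders for the beam-pruned search and the tuning-set loss, neither of which is specified at this level of detail.

```python
def tune(nbests, losses, w_init, beam_lp_mert, corpus_loss, max_iters=10):
    """Rerun the beam search until the incumbent wbest stops improving."""
    w_best = w_init
    best_loss = corpus_loss(w_best, nbests, losses)
    for _ in range(max_iters):
        improved = False
        # the search prunes candidate combinations by model score under w_best
        for w_hat in beam_lp_mert(nbests, losses, w_best):
            loss = corpus_loss(w_hat, nbests, losses)
            if loss < best_loss:          # keep the incumbent up to date
                w_best, best_loss, improved = w_hat, loss, True
        if not improved:                  # wbest stopped changing: done
            break
    return w_best
```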
78 Before turning to large tuning sets, we first evaluate exact LP-MERT on data sizes that it can easily handle. [sent-373, score-0.139]
79 5 offers a comparison with 1D-MERT, for which we split the tuning set into 1,000 overlapping subsets for S = 2, 4, 8 on a combined N-best after five iterations of MERT with an average of 374 translations per sentence. [sent-375, score-0.199]
80 LP-MERT with S = 8 checks only 600K full combinations on average, much less than the total number of combinations (which is more than 10^20). [sent-381, score-0.176]
81 We also evaluate the beam version of LP-MERT, which allows us to exploit tuning sets of reasonable size. Figure 7: Effect of a constraint on w (runtime on 1 CPU). [sent-401, score-0.155]
82 For instance, lattices or hypergraphs may be used in place of N-best lists to form a more comprehensive view of the search space with fewer decoding runs (Macherey et al. [sent-434, score-0.178]
83 This may have to do with the fact that N-best lists with S = 2 have far fewer local maxima than with S = 4, 8, in which case 20 restarts is generally enough. [sent-441, score-0.14]
84 7 Conclusions Our primary contribution is the first known exact search algorithm for direct loss minimization on N-best lists in multiple dimensions. [sent-458, score-0.359]
85 4 is to enumerate all O(N^S) hypothesis combinations in M, discard the ones that are not extreme, and return the best-scoring one. [sent-475, score-0.227]
86 By definition, any interior point h can be written as a linear combination of other points: h = Σi λi hi, with ∀i (hi ∈ H, hi ≠ h, λi ≥ 0) and Σi λi = 1. [sent-494, score-0.516]
87 This same combination of points also demonstrates that h is interior to H ∪ G, thus conv(h; H ∪ G) is false as well. [sent-495, score-0.252]
88 Proof: To prove this equality, it suffices to show that: (1) if gu +hv is interior wrt. [sent-511, score-0.48]
89 the first conv binary predicate in the above equation, then it is interior wrt. [sent-512, score-0.583]
90 the second conv, and (2) if gu +hv is interior wrt. [sent-513, score-0.48]
91 Claim (1) is evident, since the set of points in the first conv is a subset of the other set of points. [sent-516, score-0.397]
92 Since the interior point is 0, the λi (i ∈ [1, U]) and λ'j (j ∈ [1, V]) values can be scaled so that they sum to 1 (a necessary condition in the definition of interior points), which proves that the following predicate is false: conv(…). [sent-521, score-0.288]
93 Convex hull of a finite set of points in two dimensions. [sent-547, score-0.339]
94 Regularization and search for minimum error rate training. [sent-555, score-0.261]
95 Minimum error rate training by sampling the translation lattice. [sent-559, score-0.18]
96 Beyond log-linear models: boosted minimum error rate training for N-best re-ranking. [sent-584, score-0.23]
97 Efficient minimum error rate training and minimum Bayes-risk decoding for translation hypergraphs and lattices. [sent-618, score-0.362]
98 Lattice-based minimum error rate training for statistical machine translation. [sent-650, score-0.196]
99 Random restarts in minimum error rate training for statistical machine translation. [sent-667, score-0.263]
100 A simplex Armijo downhill algorithm for optimizing statistical machine translation decoding parameters. [sent-744, score-0.169]
wordName wordTfidf (topN-words)
[('conv', 0.331), ('hull', 0.273), ('interior', 0.252), ('gu', 0.228), ('hv', 0.213), ('extreme', 0.194), ('mert', 0.189), ('convex', 0.186), ('wbest', 0.156), ('hj', 0.145), ('getnextbest', 0.136), ('hi', 0.135), ('linoptimizer', 0.117), ('och', 0.116), ('lazy', 0.106), ('hn', 0.106), ('loss', 0.102), ('lp', 0.101), ('tuning', 0.096), ('enumeration', 0.095), ('xv', 0.091), ('combinations', 0.088), ('gi', 0.084), ('barber', 0.078), ('ns', 0.074), ('lists', 0.073), ('minimum', 0.071), ('fs', 0.069), ('error', 0.067), ('restarts', 0.067), ('cos', 0.066), ('points', 0.066), ('search', 0.065), ('facets', 0.061), ('vertex', 0.059), ('beam', 0.059), ('quirk', 0.059), ('chou', 0.058), ('juang', 0.058), ('karmarkar', 0.058), ('rate', 0.058), ('vertices', 0.057), ('maximized', 0.056), ('optimal', 0.056), ('translation', 0.055), ('discard', 0.054), ('optimum', 0.051), ('nst', 0.05), ('optimization', 0.048), ('offers', 0.048), ('combination', 0.048), ('bleu', 0.048), ('optimality', 0.048), ('hypothesis', 0.046), ('powell', 0.046), ('nelder', 0.046), ('runtime', 0.045), ('appendix', 0.045), ('linear', 0.045), ('enumerate', 0.044), ('exact', 0.043), ('global', 0.042), ('multidimensional', 0.042), ('decomposable', 0.042), ('execution', 0.042), ('simplex', 0.042), ('duh', 0.042), ('xu', 0.042), ('chris', 0.042), ('hypotheses', 0.041), ('ter', 0.04), ('templates', 0.04), ('chiang', 0.04), ('hypergraphs', 0.04), ('jx', 0.04), ('minimization', 0.039), ('argmwin', 0.039), ('buchta', 0.039), ('geometrically', 0.039), ('halfspace', 0.039), ('quickhull', 0.039), ('summations', 0.039), ('zens', 0.038), ('searching', 0.038), ('sx', 0.037), ('macherey', 0.037), ('cells', 0.037), ('losses', 0.037), ('algorithm', 0.037), ('count', 0.037), ('polynomial', 0.037), ('point', 0.036), ('mappings', 0.036), ('ix', 0.036), ('optimizing', 0.035), ('vector', 0.035), ('programming', 0.034), ('moore', 0.034), ('cer', 0.034), ('regions', 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999911 100 emnlp-2011-Optimal Search for Minimum Error Rate Training
Author: Michel Galley ; Chris Quirk
Abstract: Minimum error rate training is a crucial component to many state-of-the-art NLP applications, such as machine translation and speech recognition. However, common evaluation functions such as BLEU or word error rate are generally highly non-convex and thus prone to search errors. In this paper, we present LP-MERT, an exact search algorithm for minimum error rate training that reaches the global optimum using a series of reductions to linear programming. Given a set of N-best lists produced from S input sentences, this algorithm finds a linear model that is globally optimal with respect to this set. We find that this algorithm is polynomial in N and in the size of the model, but exponential in S. We present extensions of this work that let us scale to reasonably large tuning sets (e.g., one thousand sentences), by either searching only promising regions of the parameter space, or by using a variant of LP-MERT that relies on a beam-search approximation. Experimental results show improvements over the standard Och algorithm.
2 0.17092858 138 emnlp-2011-Tuning as Ranking
Author: Mark Hopkins ; Jonathan May
Abstract: We offer a simple, effective, and scalable method for statistical machine translation parameter tuning based on the pairwise approach to ranking (Herbrich et al., 1999). Unlike the popular MERT algorithm (Och, 2003), our pairwise ranking optimization (PRO) method is not limited to a handful of parameters and can easily handle systems with thousands of features. Moreover, unlike recent approaches built upon the MIRA algorithm of Crammer and Singer (2003) (Watanabe et al., 2007; Chiang et al., 2008b), PRO is easy to implement. It uses off-the-shelf linear binary classifier software and can be built on top of an existing MERT framework in a matter of hours. We establish PRO’s scalability and effectiveness by comparing it to MERT and MIRA and demonstrate parity on both phrase-based and syntax-based systems in a variety of language pairs, using large scale data scenarios.
3 0.13248159 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation
Author: Chang Liu ; Daniel Dahlmeier ; Hwee Tou Ng
Abstract: Many machine translation evaluation metrics have been proposed after the seminal BLEU metric, and many among them have been found to consistently outperform BLEU, demonstrated by their better correlations with human judgment. It has long been the hope that by tuning machine translation systems against these new generation metrics, advances in automatic machine translation evaluation can lead directly to advances in automatic machine translation. However, to date there has been no unambiguous report that these new metrics can improve a state-of-theart machine translation system over its BLEUtuned baseline. In this paper, we demonstrate that tuning Joshua, a hierarchical phrase-based statistical machine translation system, with the TESLA metrics results in significantly better humanjudged translation quality than the BLEUtuned baseline. TESLA-M in particular is simple and performs well in practice on large datasets. We release all our implementation under an open source license. It is our hope that this work will encourage the machine translation community to finally move away from BLEU as the unquestioned default and to consider the new generation metrics when tuning their systems.
4 0.11216545 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation
Author: Kevin Gimpel ; Noah A. Smith
Abstract: We present a quasi-synchronous dependency grammar (Smith and Eisner, 2006) for machine translation in which the leaves of the tree are phrases rather than words as in previous work (Gimpel and Smith, 2009). This formulation allows us to combine structural components of phrase-based and syntax-based MT in a single model. We describe a method of extracting phrase dependencies from parallel text using a target-side dependency parser. For decoding, we describe a coarse-to-fine approach based on lattice dependency parsing of phrase lattices. We demonstrate performance improvements for Chinese-English and UrduEnglish translation over a phrase-based baseline. We also investigate the use of unsupervised dependency parsers, reporting encouraging preliminary results.
5 0.11089152 137 emnlp-2011-Training dependency parsers by jointly optimizing multiple objectives
Author: Keith Hall ; Ryan McDonald ; Jason Katz-Brown ; Michael Ringgaard
Abstract: We present an online learning algorithm for training parsers which allows for the inclusion of multiple objective functions. The primary example is the extension of a standard supervised parsing objective function with additional loss-functions, either based on intrinsic parsing quality or task-specific extrinsic measures of quality. Our empirical results show how this approach performs for two dependency parsing algorithms (graph-based and transition-based parsing) and how it achieves increased performance on multiple target tasks including reordering for machine translation and parser adaptation.
6 0.11031885 51 emnlp-2011-Exact Decoding of Phrase-Based Translation Models through Lagrangian Relaxation
7 0.10473047 125 emnlp-2011-Statistical Machine Translation with Local Language Models
8 0.10170177 44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection
9 0.095268071 58 emnlp-2011-Fast Generation of Translation Forest for Large-Scale SMT Discriminative Training
10 0.092072852 93 emnlp-2011-Minimum Imputed-Risk: Unsupervised Discriminative Training for Machine Translation
11 0.080228068 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation
12 0.078962527 15 emnlp-2011-A novel dependency-to-string model for statistical machine translation
13 0.076764658 45 emnlp-2011-Dual Decomposition with Many Overlapping Components
14 0.076336786 65 emnlp-2011-Heuristic Search for Non-Bottom-Up Tree Structure Prediction
15 0.075975135 10 emnlp-2011-A Probabilistic Forest-to-String Model for Language Generation from Typed Lambda Calculus Expressions
16 0.074810937 20 emnlp-2011-Augmenting String-to-Tree Translation Models with Fuzzy Use of Source-side Syntax
17 0.073050454 145 emnlp-2011-Unsupervised Semantic Role Induction with Graph Partitioning
18 0.072800554 5 emnlp-2011-A Fast Re-scoring Strategy to Capture Long-Distance Dependencies
19 0.070298806 83 emnlp-2011-Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation
20 0.068980433 126 emnlp-2011-Structural Opinion Mining for Graph-based Sentiment Representation
topicId topicWeight
[(0, 0.238), (1, 0.113), (2, 0.077), (3, -0.118), (4, 0.073), (5, -0.069), (6, 0.03), (7, -0.174), (8, -0.004), (9, 0.003), (10, 0.026), (11, -0.01), (12, 0.002), (13, -0.077), (14, -0.034), (15, 0.067), (16, -0.03), (17, 0.012), (18, 0.081), (19, -0.06), (20, -0.025), (21, 0.032), (22, 0.04), (23, 0.075), (24, 0.288), (25, 0.005), (26, -0.152), (27, 0.001), (28, -0.017), (29, -0.027), (30, -0.014), (31, 0.081), (32, 0.177), (33, -0.018), (34, 0.027), (35, -0.108), (36, -0.043), (37, -0.005), (38, 0.154), (39, -0.037), (40, -0.038), (41, -0.047), (42, -0.033), (43, 0.098), (44, -0.029), (45, -0.052), (46, 0.013), (47, -0.005), (48, 0.106), (49, 0.066)]
simIndex simValue paperId paperTitle
same-paper 1 0.93801033 100 emnlp-2011-Optimal Search for Minimum Error Rate Training
Author: Michel Galley ; Chris Quirk
Abstract: Minimum error rate training is a crucial component to many state-of-the-art NLP applications, such as machine translation and speech recognition. However, common evaluation functions such as BLEU or word error rate are generally highly non-convex and thus prone to search errors. In this paper, we present LP-MERT, an exact search algorithm for minimum error rate training that reaches the global optimum using a series of reductions to linear programming. Given a set of N-best lists produced from S input sentences, this algorithm finds a linear model that is globally optimal with respect to this set. We find that this algorithm is polynomial in N and in the size of the model, but exponential in S. We present extensions of this work that let us scale to reasonably large tuning sets (e.g., one thousand sentences), by either searching only promising regions of the parameter space, or by using a variant of LP-MERT that relies on a beam-search approximation. Experimental results show improvements over the standard Och algorithm.
2 0.8853898 138 emnlp-2011-Tuning as Ranking
Author: Mark Hopkins ; Jonathan May
Abstract: We offer a simple, effective, and scalable method for statistical machine translation parameter tuning based on the pairwise approach to ranking (Herbrich et al., 1999). Unlike the popular MERT algorithm (Och, 2003), our pairwise ranking optimization (PRO) method is not limited to a handful of parameters and can easily handle systems with thousands of features. Moreover, unlike recent approaches built upon the MIRA algorithm of Crammer and Singer (2003) (Watanabe et al., 2007; Chiang et al., 2008b), PRO is easy to implement. It uses off-the-shelf linear binary classifier software and can be built on top of an existing MERT framework in a matter of hours. We establish PRO’s scalability and effectiveness by comparing it to MERT and MIRA and demonstrate parity on both phrase-based and syntax-based systems in a variety of language pairs, using large scale data scenarios.
3 0.57173854 93 emnlp-2011-Minimum Imputed-Risk: Unsupervised Discriminative Training for Machine Translation
Author: Zhifei Li ; Ziyuan Wang ; Jason Eisner ; Sanjeev Khudanpur ; Brian Roark
Abstract: Discriminative training for machine translation has been well studied in the recent past. A limitation of the work to date is that it relies on the availability of high-quality in-domain bilingual text for supervised training. We present an unsupervised discriminative training framework to incorporate the usually plentiful target-language monolingual data by using a rough “reverse” translation system. Intuitively, our method strives to ensure that probabilistic “round-trip” translation from a target- language sentence to the source-language and back will have low expected loss. Theoretically, this may be justified as (discriminatively) minimizing an imputed empirical risk. Empirically, we demonstrate that augmenting supervised training with unsupervised data improves translation performance over the supervised case for both IWSLT and NIST tasks.
4 0.53500158 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation
Author: Chang Liu ; Daniel Dahlmeier ; Hwee Tou Ng
Abstract: Many machine translation evaluation metrics have been proposed after the seminal BLEU metric, and many among them have been found to consistently outperform BLEU, demonstrated by their better correlations with human judgment. It has long been the hope that by tuning machine translation systems against these new generation metrics, advances in automatic machine translation evaluation can lead directly to advances in automatic machine translation. However, to date there has been no unambiguous report that these new metrics can improve a state-of-theart machine translation system over its BLEUtuned baseline. In this paper, we demonstrate that tuning Joshua, a hierarchical phrase-based statistical machine translation system, with the TESLA metrics results in significantly better humanjudged translation quality than the BLEUtuned baseline. TESLA-M in particular is simple and performs well in practice on large datasets. We release all our implementation under an open source license. It is our hope that this work will encourage the machine translation community to finally move away from BLEU as the unquestioned default and to consider the new generation metrics when tuning their systems.
5 0.45437381 51 emnlp-2011-Exact Decoding of Phrase-Based Translation Models through Lagrangian Relaxation
Author: Yin-Wen Chang ; Michael Collins
Abstract: This paper describes an algorithm for exact decoding of phrase-based translation models, based on Lagrangian relaxation. The method recovers exact solutions, with certificates of optimality, on over 99% of test examples. The method is much more efficient than approaches based on linear programming (LP) or integer linear programming (ILP) solvers: these methods are not feasible for anything other than short sentences. We compare our method to MOSES (Koehn et al., 2007), and give precise estimates of the number and magnitude of search errors that MOSES makes.
7 0.40043825 137 emnlp-2011-Training dependency parsers by jointly optimizing multiple objectives
8 0.39638835 65 emnlp-2011-Heuristic Search for Non-Bottom-Up Tree Structure Prediction
9 0.37612391 44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection
10 0.36869234 132 emnlp-2011-Syntax-Based Grammaticality Improvement using CCG and Guided Search
11 0.36826178 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation
12 0.36735821 96 emnlp-2011-Multilayer Sequence Labeling
13 0.35693941 129 emnlp-2011-Structured Sparsity in Structured Prediction
14 0.35346511 66 emnlp-2011-Hierarchical Phrase-based Translation Representations
15 0.35010183 126 emnlp-2011-Structural Opinion Mining for Graph-based Sentiment Representation
16 0.34623781 58 emnlp-2011-Fast Generation of Translation Forest for Large-Scale SMT Discriminative Training
17 0.34422779 10 emnlp-2011-A Probabilistic Forest-to-String Model for Language Generation from Typed Lambda Calculus Expressions
18 0.34417713 45 emnlp-2011-Dual Decomposition with Many Overlapping Components
19 0.32586387 145 emnlp-2011-Unsupervised Semantic Role Induction with Graph Partitioning
20 0.29762441 85 emnlp-2011-Learning to Simplify Sentences with Quasi-Synchronous Grammar and Integer Programming
topicId topicWeight
[(23, 0.107), (36, 0.023), (37, 0.036), (45, 0.05), (53, 0.03), (54, 0.03), (62, 0.376), (64, 0.037), (65, 0.012), (66, 0.025), (69, 0.035), (79, 0.05), (82, 0.019), (85, 0.02), (96, 0.038), (98, 0.018)]
simIndex simValue paperId paperTitle
1 0.92237842 32 emnlp-2011-Computing Logical Form on Regulatory Texts
Author: Nikhil Dinesh ; Aravind Joshi ; Insup Lee
Abstract: The computation of logical form has been proposed as an intermediate step in the translation of sentences to logic. Logical form encodes the resolution of scope ambiguities. In this paper, we describe experiments on a modestsized corpus of regulation annotated with a novel variant of logical form, called abstract syntax trees (ASTs). The main step in computing ASTs is to order scope-taking operators. A learning model for ranking is adapted for this ordering. We design features by studying the problem ofcomparing the scope ofone operator to another. The scope comparisons are used to compute ASTs, with an F-score of 90.6% on the set of ordering decisons.
2 0.91269141 14 emnlp-2011-A generative model for unsupervised discovery of relations and argument classes from clinical texts
Author: Bryan Rink ; Sanda Harabagiu
Abstract: This paper presents a generative model for the automatic discovery of relations between entities in electronic medical records. The model discovers relation instances and their types by determining which context tokens express the relation. Additionally, the valid semantic classes for each type of relation are determined. We show that the model produces clusters of relation trigger words which better correspond with manually annotated relations than several existing clustering techniques. The discovered relations reveal some of the implicit semantic structure present in patient records.
same-paper 3 0.85238242 100 emnlp-2011-Optimal Search for Minimum Error Rate Training
Author: Michel Galley ; Chris Quirk
Abstract: Minimum error rate training is a crucial component to many state-of-the-art NLP applications, such as machine translation and speech recognition. However, common evaluation functions such as BLEU or word error rate are generally highly non-convex and thus prone to search errors. In this paper, we present LP-MERT, an exact search algorithm for minimum error rate training that reaches the global optimum using a series of reductions to linear programming. Given a set of N-best lists produced from S input sentences, this algorithm finds a linear model that is globally optimal with respect to this set. We find that this algorithm is polynomial in N and in the size of the model, but exponential in S. We present extensions of this work that let us scale to reasonably large tuning sets (e.g., one thousand sentences), by either searching only promising regions of the parameter space, or by using a variant of LP-MERT that relies on a beam-search approximation. Experimental results show improvements over the standard Och algorithm.
4 0.59709674 114 emnlp-2011-Relation Extraction with Relation Topics
Author: Chang Wang ; James Fan ; Aditya Kalyanpur ; David Gondek
Abstract: This paper describes a novel approach to the semantic relation detection problem. Instead of relying only on the training instances for a new relation, we leverage the knowledge learned from previously trained relation detectors. Specifically, we detect a new semantic relation by projecting the new relation’s training instances onto a lower dimension topic space constructed from existing relation detectors through a three step process. First, we construct a large relation repository of more than 7,000 relations from Wikipedia. Second, we construct a set of non-redundant relation topics defined at multiple scales from the relation repository to characterize the existing relations. Similar to the topics defined over words, each relation topic is an interpretable multinomial distribution over the existing relations. Third, we integrate the relation topics in a kernel function, and use it together with SVM to construct detectors for new relations. The experimental results on Wikipedia and ACE data have confirmed that backgroundknowledge-based topics generated from the Wikipedia relation repository can significantly improve the performance over the state-of-theart relation detection approaches.
5 0.56112748 128 emnlp-2011-Structured Relation Discovery using Generative Models
Author: Limin Yao ; Aria Haghighi ; Sebastian Riedel ; Andrew McCallum
Abstract: We explore unsupervised approaches to relation extraction between two named entities; for instance, the semantic bornIn relation between a person and location entity. Concretely, we propose a series of generative probabilistic models, broadly similar to topic models, each which generates a corpus of observed triples of entity mention pairs and the surface syntactic dependency path between them. The output of each model is a clustering of observed relation tuples and their associated textual expressions to underlying semantic relation types. Our proposed models exploit entity type constraints within a relation as well as features on the dependency path between entity mentions. We examine effectiveness of our approach via multiple evaluations and demonstrate 12% error reduction in precision over a state-of-the-art weakly supervised baseline.
6 0.55083305 40 emnlp-2011-Discovering Relations between Noun Categories
7 0.52847981 70 emnlp-2011-Identifying Relations for Open Information Extraction
8 0.50162297 57 emnlp-2011-Extreme Extraction - Machine Reading in a Week
9 0.4943119 138 emnlp-2011-Tuning as Ranking
10 0.48846051 94 emnlp-2011-Modelling Discourse Relations for Arabic
11 0.4788098 59 emnlp-2011-Fast and Robust Joint Models for Biomedical Event Extraction
12 0.47273198 53 emnlp-2011-Experimental Support for a Categorical Compositional Distributional Model of Meaning
13 0.46662641 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation
14 0.46420923 65 emnlp-2011-Heuristic Search for Non-Bottom-Up Tree Structure Prediction
15 0.45911676 15 emnlp-2011-A novel dependency-to-string model for statistical machine translation
16 0.45702195 64 emnlp-2011-Harnessing different knowledge sources to measure semantic relatedness under a uniform model
17 0.45701694 126 emnlp-2011-Structural Opinion Mining for Graph-based Sentiment Representation
18 0.45701414 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation
19 0.45680979 66 emnlp-2011-Hierarchical Phrase-based Translation Representations
20 0.45430163 109 emnlp-2011-Random Walk Inference and Learning in A Large Scale Knowledge Base