emnlp emnlp2013 emnlp2013-2 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Andrei Simion ; Michael Collins ; Cliff Stein
Abstract: The IBM translation models have been hugely influential in statistical machine translation; they are the basis of the alignment models used in modern translation systems. Excluding IBM Model 1, the IBM translation models, and practically all variants proposed in the literature, have relied on the optimization of likelihood functions or similar functions that are non-convex, and hence have multiple local optima. In this paper we introduce a convex relaxation of IBM Model 2, and describe an optimization algorithm for the relaxation based on a subgradient method combined with exponentiated-gradient updates. Our approach gives the same level of alignment accuracy as IBM Model 2.
Reference: text
sentIndex sentText sentNum sentScore
1 A Convex Alternative to IBM Model 2 Andrei Simion Columbia University IEOR Department New York, NY, 10027 aas218@columbia.edu [sent-1, score-0.086]
2 Michael Collins Columbia University Computer Science New York, NY, 10027 mc3354@columbia.edu [sent-2, score-0.086]
3 Clifford Stein Columbia University IEOR Department New York, NY, 10027 cs2035@columbia.edu [sent-3, score-0.086]
4 Abstract The IBM translation models have been hugely influential in statistical machine translation; they are the basis of the alignment models used in modern translation systems. [sent-4, score-0.359]
5 Excluding IBM Model 1, the IBM translation models, and practically all variants proposed in the literature, have relied on the optimization of likelihood functions or similar functions that are non-convex, and hence have multiple local optima. [sent-5, score-0.419]
6 In this paper we introduce a convex relaxation of IBM Model 2, and describe an optimization algorithm for the relaxation based on a subgradient method combined with exponentiated-gradient updates. [sent-6, score-0.86]
7 Our approach gives the same level of alignment accuracy as IBM Model 2. [sent-7, score-0.129]
8 1 Introduction The IBM translation models (Brown et al. [sent-8, score-0.08]
9 , 1993) have been tremendously important in statistical machine translation (SMT). [sent-9, score-0.11]
10 The IBM models were the first generation of SMT systems; in recent work, they play a central role in deriving alignments used within many modern SMT approaches, for example phrase-based translation models (Koehn, 2008) and syntax-based translation systems (e. [sent-10, score-0.304]
11 Since the original IBM paper, there has been a large amount of research exploring the original IBM models and modern variants (e. [sent-14, score-0.086]
12 Excluding IBM Model 1, the IBM translation models, and practically all variants proposed in the literature, have relied on the optimization of likelihood functions or similar functions that are nonconvex. [sent-19, score-0.419]
13 Unfortunately, non-convex objective functions have multiple local optima, and finding a global optimum of a non-convex function is typically a computationally intractable problem. [sent-20, score-0.329]
14 Typically, an EM algorithm is used, which often runs in a reasonable amount of time, but with no guarantees of finding a global optimum (or for that matter, even a near-optimal solution). [sent-21, score-0.178]
15 In this paper we make the following contributions: • We introduce a convex relaxation of IBM Model 2. [sent-22, score-0.385]
16 At a very high level, the relaxation is derived by replacing the product t(fj|ei) × d(i|j) with a relaxation that is commonly used in the linear programming literature (e. [sent-23, score-0.15]
17 ) • We describe an optimization algorithm for the relaxed objective, based on a combination of stochastic subgradient methods with the exponentiated-gradient (EG) algorithm (Kivinen and Warmuth, 1997; Beck and Teboulle, 2003). [sent-29, score-0.406]
18 • We describe experiments with the method on standard alignment datasets, showing that the EG algorithm converges in only a few passes over the data, and that our method achieves accuracies that are very similar to those of IBM Model 2. [sent-30, score-0.227]
19 Framing the unsupervised learning of alignment models as a convex optimization problem, with guaranteed convergence to a global optimum, has several clear advantages. [sent-31, score-0.67]
20 First, the method is easier to analyze, as the objective function is being truly maximized. [sent-32, score-0.2]
21 Second, there is no need for initialization heuristics with the approach, given that the method will always converge to a global optimum. [sent-33, score-0.192]
22 Finally, we expect that our convexity-based approach may facilitate the further development of more convex models. [sent-34, score-0.235]
23 In view of these developments, the lack of convex methods in translation alignment models has been noticeable, and we hope that our work will open up new directions and lead to further progress in this area. [sent-38, score-0.478]
24 , 1993) introduced IBM Models 1 through 5, and optimization methods for these models based on the EM algorithm. [sent-48, score-0.188]
25 While the models were originally introduced for full translation, they are now mainly used to derive alignments which are then used by phrase-based and other modern SMT systems. [sent-49, score-0.144]
26 5, which uses a parameterization that is similar to a hidden Markov model, and which allows the value of each alignment variable to be conditioned on a previous alignment variable. [sent-53, score-0.258]
27 , 2006) describe a method that explicitly incorporates agreement preferences during training. [sent-55, score-0.056]
28 (Och and Ney, 2003) give a systematic comparison of several alignment models in the literature. [sent-56, score-0.129]
29 , 2010) describes a method based on posterior regularization that incorporates additional constraints within the EM algorithm for estimation of IBM models. [sent-59, score-0.1]
30 All of these approaches are unsupervised, in that they do not require labeled alignment data; however several authors have considered supervised models (e. [sent-60, score-0.129]
31 The focus of the current paper is on unsupervised learning; the unsupervised variants described above all make use of non-convex objective functions during training, with the usual problems of multiple local maxima. [sent-66, score-0.474]
32 3 The IBM Model 1 and Model 2 Optimization Problems In this section we give a brief review of IBM Models 1 and 2, and the optimization problems arising from these models. [sent-67, score-0.188]
33 The standard approach for optimization within these models is the EM algorithm. [sent-68, score-0.188]
34 For each English word e ∈ E, we will assume that D(e) is a dictionary specifying the set of possible French words that can be translations of e. [sent-89, score-0.051]
35 In practice, D(e) can be derived in various ways; in our experiments we simply define D(e) to include all French words f such that e and f are seen in a translation pair. [sent-91, score-0.08]
36 Given these definitions, the IBM Model 2 optimization problem is given in Figure 1. [sent-92, score-0.221]
37 The parameters in this problem are t(f|e) and d(i|j). [sent-93, score-0.079]
38 The t(f|e) parameters are translation parameters specifying the probability of English word e being translated as French word f. [sent-94, score-0.143]
39 The distortion parameters d(i|j) specify the probability of the j'th French word in a sentence being aligned to the i'th English word. [sent-95, score-0.096]
40 We use a variant of IBM Model 2 where the distortion variables are shared across all sentence lengths (similar variants have been used in (Liang et al. [sent-96, score-0.13]
41 The objective function is then the log-likelihood of the training data (see Eq. [sent-98, score-0.2]
42 Crucially, while the constraints in the IBM Model 2 optimization problem are linear, the objective function in Eq. [sent-100, score-0.454]
43 Therefore, optimization methods for IBM Model 2, in particular the EM algorithm, are typically only guaranteed to reach a local maximum of the objective function. [sent-102, score-0.386]
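To make the non-convexity concrete, the sketch below evaluates the (normalized) IBM Model 2 log-likelihood from Figure 1 for a given parameter setting. It is a minimal illustration and not the authors' code: the nested-dictionary layout (t[e][f] for t(f|e), d[j][i] for d(i|j)) and the function name are assumptions introduced for the example.

```python
import math

def ibm2_log_likelihood(corpus, t, d):
    """Sketch of the normalized IBM Model 2 log-likelihood (Figure 1 objective).

    corpus: list of (french, english) sentence pairs, with english[0] = NULL.
    t: nested dict with t[e][f] = t(f|e); d: nested dict with d[j][i] = d(i|j).
    """
    total = 0.0
    for french, english in corpus:
        for j, f in enumerate(french, start=1):
            # inner sum over English positions i = 0, ..., l_k (position 0 is NULL)
            inner = sum(t.get(e, {}).get(f, 0.0) * d.get(j, {}).get(i, 0.0)
                        for i, e in enumerate(english))
            total += math.log(inner) if inner > 0.0 else float("-inf")
    return total / len(corpus)
```

Because each inner sum multiplies the two parameter groups before the log, this function is non-convex in (t, d), which is exactly the issue the relaxation in the next section addresses.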
44 For completeness, Figure 2 shows the optimization problem for IBM Model 1. [sent-103, score-0.221]
45 In IBM Model 1 the distortion parameters d(i|j) are all fixed to be the uniform distribution (i. [sent-104, score-0.096]
46 The objective function for IBM Model 1 is actually convex, so the EM algorithm will converge to a global maximum. [sent-107, score-0.24]
47 A common heuristic is to initialize the t(f|e) parameters in EM optimization of IBM Model 2 using the output from IBM Model 1. [sent-109, score-0.188]
48 We are not aware of any guarantees for this initialization heuristic, however. [sent-111, score-0.08]
49 4 A Convex Relaxation of IBM Model 2 We now introduce a convex optimization problem, the I2CR (IBM 2 Convex Relaxation) problem. [sent-112, score-0.423]
50 As its name suggests, this optimization problem is closely related to IBM Model 2, but is convex. [sent-113, score-0.221]
51 Because of this it will be relatively easy to derive an optimization algorithm that is guaranteed to converge to a global optimum. [sent-114, score-0.426]
52 Our experiments show that the relaxation gives very similar performance to the original IBM 2 optimization problem, as described in the previous section. [sent-115, score-0.338]
53 We first describe an optimization problem, I2CR-1, that illustrates the basic idea behind the convex relaxation. [sent-116, score-0.451]
54 The objective function is \frac{1}{n}\sum_{k=1}^{n}\sum_{j=1}^{m_k}\log\sum_{i=0}^{l_k} q(i,j,k), which is similar to the objective function in Figure 1, but where d(i|j) has been replaced by q(i,j,k). [sent-124, score-0.4]
55 13-15, we had the constraint q(i,j,k) = t(f_j^{(k)}|e_i^{(k)}) × d(i|j), (17) then the I2CR-1 problem would clearly be identical to the IBM Model 2 optimization problem. [sent-127, score-0.26]
56 We have used a standard relaxation of the non-linear constraint x = y × z, where x, y, z are all variables in the range [0, 1], namely x ≤ y, x ≤ z, x ≥ y + z − 1. [sent-128, score-0.184]
57 These inequalities are a relaxation in the sense that any (x, y, z) triple that satisfies x = y × z also satisfies these constraints. [sent-129, score-0.15]
58 This is because the task is to maximize the objective with respect to the q variables and the objective is strictly increasing as the q values increase—thus lower bounds on their values are redundant in the I2CR-1 problem. [sent-134, score-0.346]
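The small sketch below (an illustration, not from the paper) checks the two directions of this argument: any point with x = y·z satisfies the relaxed constraints, and when maximizing, only the upper bounds bind, so the largest feasible value is min(y, z).

```python
def relaxed_optimum(y, z):
    """Largest x in [0, 1] satisfying the linear relaxation of x = y * z:
    x <= y, x <= z, x >= y + z - 1. On [0, 1] we have y + z - 1 <= min(y, z),
    so the lower bound never binds and the maximizer is min(y, z). (Sketch.)"""
    assert 0.0 <= y <= 1.0 and 0.0 <= z <= 1.0
    return min(y, z)

# any (x, y, z) with x = y * z also satisfies the relaxed constraints
for y, z in [(0.3, 0.8), (0.0, 1.0), (0.9, 0.9)]:
    x = y * z
    assert x <= y and x <= z and x >= y + z - 1.0
    assert relaxed_optimum(y, z) >= x
```

This is why, at the optimum of the I2CR problems, each q(i, j, k) equals the minimum of the corresponding t and d parameters.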
59 It is easily verified that the constraints in the I2CR-1 problem are linear, and that the objective function is convex. [sent-135, score-0.266]
60 In Section 5 of this paper we describe an optimization method for the problem. [sent-136, score-0.216]
61 Note that because the objective function is being maximized, and the objective increases monotonically as the q values increase, at the global optimum1 (footnote 1: more precisely, at any global optimum: the objective function may not be strictly convex, in which case there will be multiple global optima). [sent-137, score-0.641]
62 The problem is identical to the I2CR-1 problem, but it also includes a term in the objective function that is identical to the IBM Model 1 objective. [sent-139, score-0.311]
63 2 The I2CR-2 Problem Figure 4 shows the refined optimization problem, which we call I2CR-2. [sent-144, score-0.188]
64 First, we modify the objective function to be \frac{1}{2n}\sum_{k=1}^{n}\sum_{j=1}^{m_k}\log\sum_{i=0}^{l_k} q(i,j,k) + \frac{1}{2n}\sum_{k=1}^{n}\sum_{j=1}^{m_k}\log\sum_{i=0}^{l_k}\frac{t(f_j^{(k)}|e_i^{(k)})}{L+1}. Thus the objective function includes a second term that is identical to the objective function for IBM Model 1 (see Figure 2). [sent-146, score-0.639]
65 In preliminary experiments with the I2CR-1 optimization problem, we found that the I2CR-1 objective was not sufficiently dependent on the t parameters: intuitively, if the d parameters achieve the min on many training examples, the values for the t variables become unimportant. [sent-147, score-0.424]
66 The optimization methods we use are gradient-based methods (or more precisely, subgradient-based methods), and we have found them to be considerably more stable when the values for gradients do not diverge to infinity. [sent-152, score-0.284]
67 5 A Stochastic Exponentiated-Gradient Algorithm for Optimization We now describe an algorithm for optimizing the I2CR-2 problem in Figure 4. [sent-154, score-0.1]
68 The algorithm is closely related to stochastic gradient ascent, but with two modifications: • First, because the t(f|e) and d(i|j) parameters have simplex constraints (see Figure 1), we use exponentiated gradient (EG) updates. [sent-155, score-0.385]
69 EG algorithms are gradient-based methods that maintain simplex constraints; see for example: (Kivinen and Warmuth, 1997; Beck and Teboulle, 2003; Collins et al. [sent-156, score-0.054]
70 • Second, the objective function in the I2CR-2 problem is convex, but it is not differentiable (the gradient may not exist at all points). [sent-158, score-0.309]
71 For this reason we use subgradients in the place of gradients. [sent-159, score-0.15]
72 In spite of the non-differentiability of the objective function, subgradient methods still have strong convergence guarantees when combined with EG updates (e. [sent-160, score-0.36]
73 To derive the updates, recall that we are maximizing the following objective function: h(t, d) = \frac{1}{2|T|}\sum_{k \in T}\sum_{j=1}^{m_k}\log\sum_{i=0}^{l_k} \min\{t(f_j^{(k)}|e_i^{(k)}), d(i|j)\} + \frac{1}{2|T|}\sum_{k \in T}\sum_{j=1}^{m_k}\log\sum_{i=0}^{l_k}\frac{t(f_j^{(k)}|e_i^{(k)})}{L+1}. [sent-163, score-0.156]
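As a concrete reading of the reconstructed h(t, d), the following sketch evaluates it on a minibatch with q(i, j, k) already eliminated via the min. The (L + 1) normalization of the second term follows the formula above; the function name and the nested-dictionary data layout are the same hypothetical ones used in the earlier sketch.

```python
import math

def i2cr2_objective(minibatch, t, d, L):
    """Sketch of h(t, d): the first term uses min{t(f_j|e_i), d(i|j)}, the
    second is the IBM Model 1 style term with uniform 1/(L+1) distortion."""
    first = second = 0.0
    for french, english in minibatch:
        for j, f in enumerate(french, start=1):
            s1 = sum(min(t.get(e, {}).get(f, 0.0), d.get(j, {}).get(i, 0.0))
                     for i, e in enumerate(english))
            s2 = sum(t.get(e, {}).get(f, 0.0) for e in english) / (L + 1.0)
            first += math.log(max(s1, 1e-300))   # guard against log(0)
            second += math.log(max(s2, 1e-300))
    z = 2.0 * len(minibatch)
    return first / z + second / z
```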
74 We use t and d to refer to the full set of t and d parameters respectively; h(t, d) is the function to be maximized. [sent-171, score-0.09]
75 Given a concave function f(x) where x ∈ R^d, a subgradient of f(x) at x is any vector g(x) ∈ R^d such that for any y ∈ R^d, f(y) ≤ f(x) + g(x) · (y − x), where u·v is the inner product between vectors u and v. [sent-173, score-0.157]
76 Subgradients are similar to gradients for differentiable concave functions, in that gradients satisfy the above property. [sent-174, score-0.163]
77 Subgradients can be used in the place of gradients in many optimization algorithms (see for example (Bertsekas, 1999)). [sent-175, score-0.248]
78 We set ∇t(f|e) and ∇d(i|j) to be subgradients of the objective function in Eq. [sent-181, score-0.156]
79 the subgradients with respect to t(f|e) and d(i|j) as ... and ∇d(i|j) = \frac{1}{2|T|}\sum_{k:\, i \le l_k,\, j \le m_k} \frac{I(i,j,k)}{Q(j,k)}. [sent-183, score-0.15]
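Since the definitions of I(i, j, k) and Q(j, k) are lost in the extraction, the sketch below derives a valid subgradient directly from the reconstructed h(t, d): for the min term, the weight 1/(inner sum) is routed to whichever of t(f|e) or d(i|j) achieves the minimum. The function name, the dictionary layout, and the tie-breaking rule are assumptions of the example rather than the paper's exact formulas.

```python
from collections import defaultdict

def compute_subgradients(minibatch, t, d, L):
    """Sketch: subgradients of the minibatch objective h(t, d) w.r.t. t(f|e)
    and d(i|j). For each (j, k), the min term contributes 1 / (inner sum) to
    the parameter achieving the minimum; the Model 1 term only touches t(f|e)."""
    grad_t = defaultdict(lambda: defaultdict(float))
    grad_d = defaultdict(lambda: defaultdict(float))
    scale = 1.0 / (2.0 * len(minibatch))
    for french, english in minibatch:
        for j, f in enumerate(french, start=1):
            q_sum = sum(min(t.get(e, {}).get(f, 0.0), d.get(j, {}).get(i, 0.0))
                        for i, e in enumerate(english))
            t_sum = sum(t.get(e, {}).get(f, 0.0) for e in english)
            for i, e in enumerate(english):
                t_val = t.get(e, {}).get(f, 0.0)
                d_val = d.get(j, {}).get(i, 0.0)
                if q_sum > 0.0:
                    if t_val <= d_val:      # t(f|e) achieves the min (ties go to t)
                        grad_t[e][f] += scale / q_sum
                    else:                   # d(i|j) achieves the min
                        grad_d[j][i] += scale / q_sum
                if t_sum > 0.0:             # Model 1 term; the 1/(L+1) factor cancels
                    grad_t[e][f] += scale / t_sum
    return grad_t, grad_d
```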
80 Exponentiated-gradient updates then take the following form: t(f|e) ← \frac{t(f|e)\,\exp\{\gamma \times \nabla t(f|e)\}}{\sum_{f} t(f|e)\,\exp\{\gamma \times \nabla t(f|e)\}} (20) and d(i|j) ← \frac{d(i|j)\,\exp\{\gamma \times \nabla d(i|j)\}}{\sum_{i} d(i|j)\,\exp\{\gamma \times \nabla d(i|j)\}} (21), where γ > 0 is a constant step size in the algorithm. [sent-184, score-0.056]
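A minimal EG step over one conditional distribution (one row of t or d) might look as follows; this is a sketch of Eqs. (20)-(21) as reconstructed above, with an illustrative function name and toy numbers.

```python
import math

def eg_update_row(row, grads, gamma):
    """Sketch of one exponentiated-gradient step on a conditional distribution,
    e.g. the row {f: t(f|e)} for a fixed e: multiply each entry by
    exp(gamma * subgradient) and renormalize, preserving the simplex constraint."""
    scaled = {key: prob * math.exp(gamma * grads.get(key, 0.0))
              for key, prob in row.items()}
    z = sum(scaled.values())
    return {key: val / z for key, val in scaled.items()}

# illustrative usage with made-up values
row = {"maison": 0.5, "bleu": 0.5}
grads = {"maison": 1.2, "bleu": -0.3}
new_row = eg_update_row(row, grads, gamma=0.5)
assert abs(sum(new_row.values()) - 1.0) < 1e-12
```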
81 Note that the EG updates make use of subgradients, but maintain the simplex constraints on the t and d variables. [sent-185, score-0.143]
82 The method just described is a batch gradient method, where the entire training set T = {1. [sent-186, score-0.123]
83 n} is used to derive the subgradients before the updates in Eqs. [sent-189, score-0.206]
84 Many results in machine learning and NLP have shown that stochastic gradient methods, where a subset of the training examples is used before each gradient-based update, can converge much more quickly than batch gradient methods. [sent-191, score-0.355]
85 The algorithm then performs EG updates using each minibatch T1 . [sent-202, score-0.095]
86 As can be seen in Table 3, our experiments show that the algorithm makes very significant progress in the first pass over the data, and takes very few iterations to converge to a good solution even though we initialized with uniform parameter values. [sent-206, score-0.224]
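Putting the pieces together, a skeleton of the stochastic EG loop could look like the following; it reuses the compute_subgradients and eg_update_row sketches above. The batch size default mirrors the value reported in the experiments section (B = 250), and the number of epochs matches the 10 iterations used on the Hansards data; the step size γ must be supplied, since its value is truncated in the extraction.

```python
def stochastic_eg_train(corpus, t, d, L, gamma, batch_size=250, epochs=10):
    """Sketch of the stochastic EG training loop (not the authors' code):
    cut the data into minibatches, compute subgradients on each minibatch,
    then apply the EG updates row by row to t and d."""
    n = len(corpus)
    for _ in range(epochs):
        for start in range(0, n, batch_size):
            minibatch = corpus[start:start + batch_size]
            grad_t, grad_d = compute_subgradients(minibatch, t, d, L)
            for e in list(t):
                t[e] = eg_update_row(t[e], grad_t.get(e, {}), gamma)
            for j in list(d):
                d[j] = eg_update_row(d[j], grad_d.get(j, {}), gamma)
    return t, d
```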
87 6 Experiments In this section we describe experiments using the I2CR-2 optimization problem combined with the stochastic EG algorithm for its optimization. [sent-207, score-0.437]
88 1 Data Sets We use data from the bilingual word alignment workshop held at HLT-NAACL 2003 (Michalcea and Pederson, 2003). [sent-212, score-0.129]
89 The development and test data have been manually aligned at the word level, annotating alignments between source and target words in the corpus as either “sure” (S) or “possible” (P) alignments, as described in (Och and Ney, 2003). [sent-215, score-0.104]
90 For the EG algorithm we use a batch size B = 250 and step size γ = 0. [sent-220, score-0.086]
91 If A is the set of alignments produced by an algorithm, S is the set of sure alignments as annotated in test data, and P is the set of possible alignments, then these quantities are defined as Recall = |A ∩ S| / |S|, Precision = |A ∩ S| / |A|, AER = 1 − (|A ∩ S| + |A ∩ P|) / (|A| + |S|), F-Measure = Re. [sent-224, score-0.208]
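Under the reconstructed definitions above (and with the standard harmonic-mean F-measure, since the paper's exact formula is cut off in the extraction), the metrics can be computed from sets of (English position, French position) pairs as follows; the function name and the toy sets are illustrative only.

```python
def alignment_metrics(A, S, P):
    """Sketch: Recall, Precision, AER and (harmonic-mean) F-measure for
    predicted alignments A, sure alignments S and possible alignments P
    (with S a subset of P), following the reconstructed definitions above."""
    recall = len(A & S) / len(S)
    precision = len(A & S) / len(A)
    aer = 1.0 - (len(A & S) + len(A & P)) / (len(A) + len(S))
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall > 0 else 0.0)
    return recall, precision, aer, f_measure

# toy example with (english_index, french_index) pairs
A = {(1, 1), (2, 3)}
S = {(1, 1)}
P = {(1, 1), (2, 3), (2, 4)}
print(alignment_metrics(A, S, P))
```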
92 Note that we report results in both AER and F-measure; however there is evidence (Fraser and Marcu, 2004) that F-measure is better correlated with translation quality when the alignments are used in a full system. [sent-227, score-0.184]
93 For the EG algorithm, we use 10 iterations over the training data for the Hansards data, and 15 iterations on the Romanian-English data (on the latter dataset results on the trial data showed that the method took slightly longer to converge). [sent-230, score-0.074]
94 It can be seen that both I2CR-2 and IBM Model 2 converge to a fairly stable result after 2-3 iterations. [sent-233, score-0.114]
95 On the right, Table 3 shows the values of the objective function at each iteration when using the EG algorithm to optimize the I2CR-2 objective. [sent-239, score-0.276]
96 The method makes a large amount of progress on the first iteration and then continues to improve. [sent-240, score-0.071]
97 Finally, we note that the memory requirements for I2CR-2 and IBM2 are about the same, but that the time for one iteration of I2CR-2 on the Hansards data is approximately one hour, while the time for one iteration of IBM2 was approximately 10 minutes. [sent-241, score-0.074]
98 7 Conclusions and Future Work We have introduced the first convex model for unsupervised learning of alignments in statistical machine translation, with performance comparable to IBM Model 2. [Table: iteration-by-iteration AER and F-Measure for IBM2 and I2CR-2, with test set statistics] [sent-242, score-0.449]
99 Table 3: Objective values for the EG algorithm optimization of I2CR-2 at each iteration. [sent-466, score-0.227]
100 Future work will consider ways to speed up our algorithm and extensions of the method to more complex alignment models. [sent-472, score-0.168]
wordName wordTfidf (topN-words)
[('ibm', 0.576), ('convex', 0.235), ('fj', 0.224), ('aer', 0.214), ('eg', 0.206), ('optimization', 0.188), ('objective', 0.156), ('relaxation', 0.15), ('subgradients', 0.15), ('alignment', 0.129), ('hansards', 0.117), ('converge', 0.114), ('alignments', 0.104), ('ei', 0.099), ('exponentiated', 0.098), ('columbia', 0.086), ('translation', 0.08), ('french', 0.079), ('gradient', 0.076), ('bertsimas', 0.074), ('kivinen', 0.074), ('taskar', 0.072), ('subgradient', 0.07), ('em', 0.068), ('beck', 0.065), ('warmuth', 0.064), ('gradients', 0.06), ('ben', 0.057), ('updates', 0.056), ('simplex', 0.054), ('teboulle', 0.054), ('optima', 0.051), ('specifying', 0.051), ('distortion', 0.05), ('athena', 0.049), ('bertsekas', 0.049), ('dimitris', 0.049), ('graca', 0.049), ('ieor', 0.049), ('maxkn', 0.049), ('michalcea', 0.049), ('xlkt', 0.049), ('ganchev', 0.049), ('lk', 0.049), ('optimum', 0.049), ('moore', 0.048), ('batch', 0.047), ('parameters', 0.046), ('variants', 0.046), ('guarantees', 0.045), ('function', 0.044), ('concave', 0.043), ('cto', 0.043), ('epx', 0.043), ('joao', 0.043), ('global', 0.043), ('stochastic', 0.042), ('guaranteed', 0.042), ('vogel', 0.042), ('smt', 0.041), ('modern', 0.04), ('identical', 0.039), ('algorithm', 0.039), ('kuzman', 0.039), ('riley', 0.039), ('xi', 0.038), ('liang', 0.038), ('functions', 0.037), ('iterations', 0.037), ('xk', 0.037), ('collins', 0.037), ('iteration', 0.037), ('diverge', 0.036), ('ney', 0.036), ('ef', 0.035), ('initialization', 0.035), ('marcu', 0.035), ('amir', 0.034), ('lem', 0.034), ('lj', 0.034), ('variables', 0.034), ('log', 0.034), ('och', 0.034), ('progress', 0.034), ('problem', 0.033), ('constraints', 0.033), ('convergence', 0.033), ('convention', 0.033), ('iin', 0.033), ('passes', 0.031), ('practically', 0.031), ('stein', 0.031), ('statistical', 0.03), ('ik', 0.03), ('nonlinear', 0.03), ('english', 0.03), ('fraser', 0.029), ('describe', 0.028), ('incorporates', 0.028), ('martins', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 2 emnlp-2013-A Convex Alternative to IBM Model 2
Author: Andrei Simion ; Michael Collins ; Cliff Stein
Abstract: The IBM translation models have been hugely influential in statistical machine translation; they are the basis of the alignment models used in modern translation systems. Excluding IBM Model 1, the IBM translation models, and practically all variants proposed in the literature, have relied on the optimization of likelihood functions or similar functions that are non-convex, and hence have multiple local optima. In this paper we introduce a convex relaxation of IBM Model 2, and describe an optimization algorithm for the relaxation based on a subgradient method combined with exponentiated-gradient updates. Our approach gives the same level of alignment accuracy as IBM Model 2.
Author: Valentin I. Spitkovsky ; Hiyan Alshawi ; Daniel Jurafsky
Abstract: Many statistical learning problems in NLP call for local model search methods. But accuracy tends to suffer with current techniques, which often explore either too narrowly or too broadly: hill-climbers can get stuck in local optima, whereas samplers may be inefficient. We propose to arrange individual local optimizers into organized networks. Our building blocks are operators of two types: (i) transform, which suggests new places to search, via non-random restarts from already-found local optima; and (ii) join, which merges candidate solutions to find better optima. Experiments on grammar induction show that pursuing different transforms (e.g., discarding parts of a learned model or ignoring portions of training data) results in improvements. Groups of locally-optimal solutions can be further perturbed jointly, by constructing mixtures. Using these tools, we designed several modular dependency grammar induction networks of increasing complexity. Our complete system achieves 48.6% accuracy (directed dependency macro-average over all 19 languages in the 2006/7 CoNLL data), more than 5% higher than the previous state-of-the-art.
3 0.099900119 127 emnlp-2013-Max-Margin Synchronous Grammar Induction for Machine Translation
Author: Xinyan Xiao ; Deyi Xiong
Abstract: Traditional synchronous grammar induction estimates parameters by maximizing likelihood, which only has a loose relation to translation quality. Alternatively, we propose a max-margin estimation approach to discriminatively inducing synchronous grammars for machine translation, which directly optimizes translation quality measured by BLEU. In the max-margin estimation of parameters, we only need to calculate Viterbi translations. This further facilitates the incorporation of various non-local features that are defined on the target side. We test the effectiveness of our max-margin estimation framework on a competitive hierarchical phrase-based system. Experiments show that our max-margin method significantly outperforms the traditional two-step pipeline for synchronous rule extraction by 1.3 BLEU points and is also better than previous max-likelihood estimation method.
4 0.099204078 104 emnlp-2013-Improving Statistical Machine Translation with Word Class Models
Author: Joern Wuebker ; Stephan Peitz ; Felix Rietig ; Hermann Ney
Abstract: Automatically clustering words from a monolingual or bilingual training corpus into classes is a widely used technique in statistical natural language processing. We present a very simple and easy to implement method for using these word classes to improve translation quality. It can be applied across different machine translation paradigms and with arbitrary types of models. We show its efficacy on a small German→English and a larger French→German translation task with both standard phrase-based and hierarchical phrase-based translation systems for a common set of models. Our results show that with word class models, the baseline can be improved by up to 1.4% BLEU and 1.0% TER on the French→German task and 0.3% BLEU and 1.1% TER on the German→English task.
5 0.096996278 167 emnlp-2013-Semi-Markov Phrase-Based Monolingual Alignment
Author: Xuchen Yao ; Benjamin Van Durme ; Chris Callison-Burch ; Peter Clark
Abstract: We introduce a novel discriminative model for phrase-based monolingual alignment using a semi-Markov CRF. Our model achieves state-of-the-art alignment accuracy on two phrase-based alignment datasets (RTE and paraphrase), while doing significantly better than other strong baselines in both non-identical alignment and phrase-only alignment. Additional experiments highlight the potential benefit of our alignment model to RTE, paraphrase identification and question answering, where even a naive application of our model’s alignment score approaches the state of the art.
6 0.09216889 101 emnlp-2013-Improving Alignment of System Combination by Using Multi-objective Optimization
7 0.091431014 3 emnlp-2013-A Corpus Level MIRA Tuning Strategy for Machine Translation
8 0.091122158 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization
9 0.086507872 145 emnlp-2013-Optimal Beam Search for Machine Translation
10 0.082288906 201 emnlp-2013-What is Hidden among Translation Rules
11 0.080180191 135 emnlp-2013-Monolingual Marginal Matching for Translation Model Adaptation
12 0.074078038 85 emnlp-2013-Fast Joint Compression and Summarization via Graph Cuts
13 0.073010132 159 emnlp-2013-Regularized Minimum Error Rate Training
14 0.069121048 38 emnlp-2013-Bilingual Word Embeddings for Phrase-Based Machine Translation
15 0.068568319 15 emnlp-2013-A Systematic Exploration of Diversity in Machine Translation
16 0.067382328 136 emnlp-2013-Multi-Domain Adaptation for SMT Using Multi-Task Learning
17 0.067253597 128 emnlp-2013-Max-Violation Perceptron and Forced Decoding for Scalable MT Training
18 0.066503018 161 emnlp-2013-Rule-Based Information Extraction is Dead! Long Live Rule-Based Information Extraction Systems!
19 0.063071996 22 emnlp-2013-Anchor Graph: Global Reordering Contexts for Statistical Machine Translation
20 0.057764053 169 emnlp-2013-Semi-Supervised Representation Learning for Cross-Lingual Text Classification
topicId topicWeight
[(0, -0.178), (1, -0.143), (2, 0.035), (3, 0.033), (4, 0.026), (5, -0.02), (6, 0.059), (7, 0.016), (8, 0.004), (9, -0.02), (10, 0.005), (11, -0.03), (12, 0.072), (13, 0.021), (14, 0.089), (15, -0.145), (16, 0.048), (17, 0.082), (18, 0.058), (19, 0.021), (20, 0.006), (21, 0.05), (22, -0.001), (23, 0.115), (24, 0.029), (25, 0.078), (26, -0.118), (27, -0.166), (28, 0.159), (29, -0.135), (30, -0.058), (31, 0.052), (32, -0.014), (33, 0.111), (34, 0.038), (35, 0.047), (36, 0.035), (37, 0.048), (38, -0.054), (39, 0.063), (40, -0.0), (41, 0.025), (42, 0.057), (43, 0.092), (44, 0.121), (45, -0.073), (46, -0.078), (47, 0.016), (48, 0.048), (49, 0.047)]
simIndex simValue paperId paperTitle
same-paper 1 0.96225876 2 emnlp-2013-A Convex Alternative to IBM Model 2
Author: Andrei Simion ; Michael Collins ; Cliff Stein
Abstract: The IBM translation models have been hugely influential in statistical machine translation; they are the basis of the alignment models used in modern translation systems. Excluding IBM Model 1, the IBM translation models, and practically all variants proposed in the literature, have relied on the optimization of likelihood functions or similar functions that are non-convex, and hence have multiple local optima. In this paper we introduce a convex relaxation of IBM Model 2, and describe an optimization algorithm for the relaxation based on a subgradient method combined with exponentiated-gradient updates. Our approach gives the same level of alignment accuracy as IBM Model 2.
2 0.74631411 101 emnlp-2013-Improving Alignment of System Combination by Using Multi-objective Optimization
Author: Tian Xia ; Zongcheng Ji ; Shaodan Zhai ; Yidong Chen ; Qun Liu ; Shaojun Wang
Abstract: This paper proposes a multi-objective optimization framework which supports heterogeneous information sources to improve alignment in machine translation system combination techniques. In this area, most of techniques usually utilize confusion networks (CN) as their central data structure to compact an exponential number of potential hypotheses, and because better hypothesis alignment may benefit constructing better quality confusion networks, it is natural to add more useful information to improve alignment results. However, these information may be heterogeneous, so the widely-used Viterbi algorithm for searching the best alignment may not apply here. In the multi-objective optimization framework, each information source is viewed as an independent objective, and a new goal of improving all objectives can be searched by mature algorithms. The solutions from this framework, termed Pareto optimal solutions, are then combined to construct confusion networks. Experiments on two Chinese-to-English translation datasets show significant improvements, 0.97 and 1.06 BLEU points over a strong Indirected Hidden Markov Model-based (IHMM) system, and 4.75 and 3.53 points over the best single machine translation systems.
3 0.62567472 195 emnlp-2013-Unsupervised Spectral Learning of WCFG as Low-rank Matrix Completion
Author: Raphael Bailly ; Xavier Carreras ; Franco M. Luque ; Ariadna Quattoni
Abstract: We derive a spectral method for unsupervised learning of Weighted Context Free Grammars. We frame WCFG induction as finding a Hankel matrix that has low rank and is linearly constrained to represent a function computed by inside-outside recursions. The proposed algorithm picks the grammar that agrees with a sample and is the simplest with respect to the nuclear norm of the Hankel matrix.
4 0.57536376 159 emnlp-2013-Regularized Minimum Error Rate Training
Author: Michel Galley ; Chris Quirk ; Colin Cherry ; Kristina Toutanova
Abstract: Minimum Error Rate Training (MERT) remains one of the preferred methods for tuning linear parameters in machine translation systems, yet it faces significant issues. First, MERT is an unregularized learner and is therefore prone to overfitting. Second, it is commonly used on a noisy, non-convex loss function that becomes more difficult to optimize as the number of parameters increases. To address these issues, we study the addition of a regularization term to the MERT objective function. Since standard regularizers such as ℓ2 are inapplicable to MERT due to the scale invariance of its objective function, we turn to two regularizers—ℓ0 and a modification of ℓ2—and present methods for efficiently integrating them during search. To improve search in large parameter spaces, we also present a new direction finding algorithm that uses the gradient of expected BLEU to orient MERT’s exact line searches. Experiments with up to 3600 features show that these extensions of MERT yield results comparable to PRO, a learner often used with large feature sets.
Author: Valentin I. Spitkovsky ; Hiyan Alshawi ; Daniel Jurafsky
Abstract: Many statistical learning problems in NLP call for local model search methods. But accuracy tends to suffer with current techniques, which often explore either too narrowly or too broadly: hill-climbers can get stuck in local optima, whereas samplers may be inefficient. We propose to arrange individual local optimizers into organized networks. Our building blocks are operators of two types: (i) transform, which suggests new places to search, via non-random restarts from already-found local optima; and (ii) join, which merges candidate solutions to find better optima. Experiments on grammar induction show that pursuing different transforms (e.g., discarding parts of a learned model or ignoring portions of training data) results in improvements. Groups of locally-optimal solutions can be further perturbed jointly, by constructing mixtures. Using these tools, we designed several modular dependency grammar induction networks of increasing complexity. Our complete system achieves 48.6% accuracy (directed dependency macro-average over all 19 languages in the 2006/7 CoNLL data), more than 5% higher than the previous state-of-the-art.
6 0.54116333 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization
7 0.51051974 167 emnlp-2013-Semi-Markov Phrase-Based Monolingual Alignment
9 0.49933386 3 emnlp-2013-A Corpus Level MIRA Tuning Strategy for Machine Translation
10 0.46199584 86 emnlp-2013-Feature Noising for Log-Linear Structured Prediction
11 0.43187955 15 emnlp-2013-A Systematic Exploration of Diversity in Machine Translation
12 0.42214212 127 emnlp-2013-Max-Margin Synchronous Grammar Induction for Machine Translation
13 0.42121351 201 emnlp-2013-What is Hidden among Translation Rules
14 0.39734411 198 emnlp-2013-Using Soft Constraints in Joint Inference for Clinical Concept Recognition
15 0.39675215 104 emnlp-2013-Improving Statistical Machine Translation with Word Class Models
16 0.37983531 107 emnlp-2013-Interactive Machine Translation using Hierarchical Translation Models
17 0.35513633 176 emnlp-2013-Structured Penalties for Log-Linear Language Models
18 0.34535044 135 emnlp-2013-Monolingual Marginal Matching for Translation Model Adaptation
19 0.34515119 145 emnlp-2013-Optimal Beam Search for Machine Translation
topicId topicWeight
[(3, 0.029), (18, 0.037), (22, 0.034), (26, 0.017), (30, 0.106), (40, 0.302), (43, 0.013), (50, 0.035), (51, 0.145), (66, 0.05), (71, 0.035), (75, 0.03), (77, 0.037), (90, 0.013), (95, 0.014), (96, 0.018)]
simIndex simValue paperId paperTitle
1 0.78295255 94 emnlp-2013-Identifying Manipulated Offerings on Review Portals
Author: Jiwei Li ; Myle Ott ; Claire Cardie
Abstract: Recent work has developed supervised methods for detecting deceptive opinion spam— fake reviews written to sound authentic and deliberately mislead readers. And whereas past work has focused on identifying individual fake reviews, this paper aims to identify offerings (e.g., hotels) that contain fake reviews. We introduce a semi-supervised manifold ranking algorithm for this task, which relies on a small set of labeled individual reviews for training. Then, in the absence of gold standard labels (at an offering level), we introduce a novel evaluation procedure that ranks artificial instances of real offerings, where each artificial offering contains a known number of injected deceptive reviews. Experiments on a novel dataset of hotel reviews show that the proposed method outperforms state-of-art learning baselines.
same-paper 2 0.76795202 2 emnlp-2013-A Convex Alternative to IBM Model 2
Author: Andrei Simion ; Michael Collins ; Cliff Stein
Abstract: The IBM translation models have been hugely influential in statistical machine translation; they are the basis of the alignment models used in modern translation systems. Excluding IBM Model 1, the IBM translation models, and practically all variants proposed in the literature, have relied on the optimization of likelihood functions or similar functions that are non-convex, and hence have multiple local optima. In this paper we introduce a convex relaxation of IBM Model 2, and describe an optimization algorithm for the relaxation based on a subgradient method combined with exponentiated-gradient updates. Our approach gives the same level of alignment accuracy as IBM Model 2.
3 0.5478968 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging
Author: Xiaoqing Zheng ; Hanyang Chen ; Tianyu Xu
Abstract: This study explores the feasibility of performing Chinese word segmentation (CWS) and POS tagging by deep learning. We try to avoid task-specific feature engineering, and use deep layers of neural networks to discover relevant features to the tasks. We leverage large-scale unlabeled data to improve internal representation of Chinese characters, and use these improved representations to enhance supervised word segmentation and POS tagging models. Our networks achieved close to state-of-the-art performance with minimal computational cost. We also describe a perceptron-style algorithm for training the neural networks, as an alternative to maximum-likelihood method, to speed up the training process and make the learning algorithm easier to be implemented.
4 0.54168671 143 emnlp-2013-Open Domain Targeted Sentiment
Author: Margaret Mitchell ; Jacqui Aguilar ; Theresa Wilson ; Benjamin Van Durme
Abstract: We propose a novel approach to sentiment analysis for a low resource setting. The intuition behind this work is that sentiment expressed towards an entity, targeted sentiment, may be viewed as a span of sentiment expressed across the entity. This representation allows us to model sentiment detection as a sequence tagging problem, jointly discovering people and organizations along with whether there is sentiment directed towards them. We compare performance in both Spanish and English on microblog data, using only a sentiment lexicon as an external resource. By leveraging linguisticallyinformed features within conditional random fields (CRFs) trained to minimize empirical risk, our best models in Spanish significantly outperform a strong baseline, and reach around 90% accuracy on the combined task of named entity recognition and sentiment prediction. Our models in English, trained on a much smaller dataset, are not yet statistically significant against their baselines.
5 0.54111516 38 emnlp-2013-Bilingual Word Embeddings for Phrase-Based Machine Translation
Author: Will Y. Zou ; Richard Socher ; Daniel Cer ; Christopher D. Manning
Abstract: We introduce bilingual word embeddings: semantic embeddings associated across two languages in the context of neural language models. We propose a method to learn bilingual embeddings from a large unlabeled corpus, while utilizing MT word alignments to constrain translational equivalence. The new embeddings significantly out-perform baselines in word semantic similarity. A single semantic similarity feature induced with bilingual embeddings adds near half a BLEU point to the results of NIST08 Chinese-English machine translation task.
6 0.5401848 107 emnlp-2013-Interactive Machine Translation using Hierarchical Translation Models
7 0.53891468 175 emnlp-2013-Source-Side Classifier Preordering for Machine Translation
8 0.53827834 15 emnlp-2013-A Systematic Exploration of Diversity in Machine Translation
9 0.53606635 47 emnlp-2013-Collective Opinion Target Extraction in Chinese Microblogs
10 0.53513533 157 emnlp-2013-Recursive Autoencoders for ITG-Based Translation
11 0.53486788 81 emnlp-2013-Exploring Demographic Language Variations to Improve Multilingual Sentiment Analysis in Social Media
12 0.53329784 13 emnlp-2013-A Study on Bootstrapping Bilingual Vector Spaces from Non-Parallel Data (and Nothing Else)
13 0.53306103 128 emnlp-2013-Max-Violation Perceptron and Forced Decoding for Scalable MT Training
14 0.53282273 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization
15 0.53246361 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction
16 0.53186494 40 emnlp-2013-Breaking Out of Local Optima with Count Transforms and Model Recombination: A Study in Grammar Induction
17 0.53079456 22 emnlp-2013-Anchor Graph: Global Reordering Contexts for Statistical Machine Translation
18 0.53002191 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks
19 0.5294224 187 emnlp-2013-Translation with Source Constituency and Dependency Trees
20 0.52823657 83 emnlp-2013-Exploring the Utility of Joint Morphological and Syntactic Learning from Child-directed Speech