emnlp emnlp2011 emnlp2011-93 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Zhifei Li ; Ziyuan Wang ; Jason Eisner ; Sanjeev Khudanpur ; Brian Roark
Abstract: Discriminative training for machine translation has been well studied in the recent past. A limitation of the work to date is that it relies on the availability of high-quality in-domain bilingual text for supervised training. We present an unsupervised discriminative training framework to incorporate the usually plentiful target-language monolingual data by using a rough “reverse” translation system. Intuitively, our method strives to ensure that probabilistic “round-trip” translation from a target- language sentence to the source-language and back will have low expected loss. Theoretically, this may be justified as (discriminatively) minimizing an imputed empirical risk. Empirically, we demonstrate that augmenting supervised training with unsupervised data improves translation performance over the supervised case for both IWSLT and NIST tasks.
Reference: text
sentIndex sentText sentNum sentScore
1 Discriminative training for machine translation has been well studied in the recent past. [sent-4, score-0.164]
2 We present an unsupervised discriminative training framework to incorporate the usually plentiful target-language monolingual data by using a rough “reverse” translation system. [sent-6, score-0.35]
3 Intuitively, our method strives to ensure that probabilistic “round-trip” translation from a target- language sentence to the source-language and back will have low expected loss. [sent-7, score-0.204]
4 Theoretically, this may be justified as (discriminatively) minimizing an imputed empirical risk. [sent-8, score-0.583]
5 Empirically, we demonstrate that augmenting supervised training with unsupervised data improves translation performance over the supervised case for both IWSLT and NIST tasks. [sent-9, score-0.296]
6 But bilingual data for such supervised training may be relatively scarce for a particular language pair (e. [sent-20, score-0.171]
7 We propose an unsupervised training approach, called minimum imputed risk training, which is conceptually straightforward: First guess x (probabilistically) from the observed y using a reverse English-to-Chinese translation model pφ(x | y). [sent-32, score-1.294]
8 Then train the model pθ(y | x) to do a good job at translating this imputed x back to y, as measured by a given performance metric. [sent-34, score-0.638]
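The two-step recipe can be sketched as a small training-time computation. This is only an illustrative sketch, not the authors' implementation: reverse_translate, forward_translate, and loss are hypothetical stand-ins for the reverse model pφ(x | y), the forward system δθ(x), and the task loss L.

```python
def imputed_round_trip_risk(theta, monolingual_targets,
                            reverse_translate, forward_translate, loss, k=1):
    """Average round-trip loss y -> imputed x -> y' under the current theta.

    reverse_translate(y, k) is assumed to return up to k pairs (x_hat, p), with p
    approximating p_phi(x_hat | y); forward_translate(theta, x) plays the role of
    delta_theta(x); loss(y_prime, y) is the task loss, e.g. 1 - sentence-level BLEU.
    """
    total = 0.0
    for y in monolingual_targets:
        for x_hat, p in reverse_translate(y, k):
            y_prime = forward_translate(theta, x_hat)   # forward-translate the imputed source
            total += p * loss(y_prime, y)               # weight by the imputation probability
    return total / len(monolingual_targets)
```

Tuning θ then amounts to minimizing this quantity over the monolingual English data, which is exactly the "low expected loss for round-trip translation" intuition stated next.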
9 Intuitively, our method strives to ensure that probabilistic “round-trip” translation from a targetlanguage sentence to the source-language and back again will have low expected loss. [sent-35, score-0.204]
10 of in-domain bilingual development data to discriminatively tune a small number of parameters in φ; and (3) a large amount of in-domain English monolingual data. [sent-39, score-0.251]
11 The novelty here is to exploit (3) to discriminatively tune the parameters θ of all translation model components,2 pθ(y|x) and pθ(y), not merely train a generative language model pθ(y), as is the norm. [sent-40, score-0.257]
12 BLEU) on a set of (x, y) pairs with our unsupervised discriminative training using only y. [sent-43, score-0.177]
13 One may hence contrast our approach with the traditional supervised methods applied to the MT task such as minimum error rate training (Och, 2003; Macherey et al. [sent-44, score-0.164]
14 , 2008), minimum risk (Smith and Eisner, 2006; Li and Eisner, 2009), and MIRA (Watanabe et al. [sent-47, score-0.236]
15 2 Supervised Discriminative Training via Minimization of Empirical Risk Let us first review discriminative training in the supervised setting—as used in MERT (Och, 2003) and subsequent work. [sent-52, score-0.193]
16 One wishes to tune the parameters θ of some complex translation system δθ (x). [sent-53, score-0.162]
17 The goal of discriminative training is to minimize the expected loss of δθ (·), under a given task-specific loss function L(y′, y) that measures how bad it is to output y′ when the reference is y. 2Note that the extra monolingual data is used only for tuning the model weights, but not for inducing new phrases or rules. [sent-56, score-0.521]
18 4 The true p(x, y) is, of course, not known and, in practice, one typically minimizes empirical risk by replacing p(x, y) above with the empirical distribution p˜(x, y) given by a supervised training set {(xi, yi) , i= 1, . [sent-64, score-0.315]
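Written out in the notation used here, the true risk and the empirical risk it is replaced by take the following standard forms (a reconstruction from the surrounding description; the numbering follows the text's references to (1) and (2)):

```latex
\theta^{*} \;=\; \arg\min_{\theta} \sum_{x,y} p(x,y)\, L\big(\delta_{\theta}(x),\, y\big)
\qquad (1)
\qquad\Longrightarrow\qquad
\theta^{*} \;\approx\; \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} L\big(\delta_{\theta}(x_i),\, y_i\big)
\qquad (2)
```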
19 So we propose to replace 3This goal is different from the minimum risk training of Li and Eisner (2009) in a subtle but important way. [sent-74, score-0.272]
20 In both cases, θ∗ minimizes risk or expected loss, but the expectation is w. [sent-75, score-0.28]
21 different distributions: the expectation in Li and Eisner (2009) is under the conditional distribution p(y | x), while the expectation in (1) is under the joint distribution p(x, y). [sent-78, score-0.164]
22 We seek a decision rule δθ (x) that will incur low expected loss on observations x that are generated from unseen states of nature. [sent-80, score-0.192]
23 L(δθ (xi) , yi) with the expectation ∑_x pφ(x | yi) L(δθ(x), yi),  (3) where pφ(· | ·) is a “reverse prediction model” that attempts to impute the missing data xi. [sent-83, score-0.550]
24 We call the resulting variant of (2) the minimization of imputed empirical risk, and say that θ∗ = argmin_θ (1/N) ∑_{i=1}^{N} ∑_x pφ(x | yi) L(δθ(x), yi)  (4) is the estimate with the minimum imputed risk.6 [sent-84, score-1.273]
25 The minimum imputed risk objective of (4) could be evaluated by brute force as follows. [sent-85, score-0.86]
26 For each unsupervised example yi, use the reverse prediction model pφ(· | yi) to impute possible reverse translations Xi = {xi1, xi2, . [sent-87, score-0.985]
27 }, and add each (xij , yi) pair, weighted by pφ(xij | yi) ≤ 1, to an imputed training set. [sent-90, score-0.649]
28 Perform the supervised training of (2) on the imputed and weighted training data. [sent-92, score-0.743]
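A minimal sketch of this brute-force construction, assuming a hypothetical reverse_translate(y, k) that returns imputed sources with their probabilities pφ(x | y); the resulting weighted pairs are then fed to whatever supervised trainer implements (2):

```python
def build_imputed_training_set(monolingual_targets, reverse_translate, k):
    """Expand each observed y_i into weighted (x_ij, y_i) training pairs."""
    imputed = []
    for y_i in monolingual_targets:
        for x_ij, weight in reverse_translate(y_i, k):   # weight = p_phi(x_ij | y_i) <= 1
            imputed.append((x_ij, y_i, weight))
    return imputed

# Supervised training of (2) then runs unchanged on the weighted data, e.g.
#   theta_star = weighted_supervised_train(build_imputed_training_set(ys, rev, k=10))
# where weighted_supervised_train is any existing risk minimizer that accepts
# per-example weights (hypothetical name).
```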
29 The second step means that we must use δθ to forward-translate each imputed xij, evaluate the loss of the translations y′ij against the corresponding true translation yi, and choose the θ that minimizes the weighted sum of these losses (i. [sent-93, score-0.965]
30 , the empirical risk when the empirical distribution p˜(x, y) is derived from the imputed training set). [sent-95, score-0.813]
31 Specific to our MT task, this tries to ensure that probabilistic “roundtrip” translation, from the target-language sentence yi to the source-language and back again, will have a low expected loss. [sent-96, score-0.287]
32 7 The trouble with this method is that the reverse model pφ generates a weighted lattice or hypergraph Xi encoding exponentially many translations of yi, and it is computationally infeasible to forward-translate each xij ∈ Xi. [sent-97, score-0.653]
33 6One may exploit both supervised data {(xi, yi)} and unsupervised data {yj} to perform semi-supervised training via an interpolation of (2) and (4). [sent-100, score-0.161]
34 2 The Reverse Prediction Model pφ A crucial ingredient in (4) is the reverse prediction model pφ(· |·) that attempts to impute the missing xi. [sent-106, score-0.643]
35 We will train this model in advance, doing the best job we can from available data, including any out-of-domain bilingual data as well as any in-domain monolingual data8 x. [sent-107, score-0.173]
36 Whereas δθ is a translation system that aims to produce a single, low-loss translation, the reverse version pφ is rather a probabilistic model. [sent-111, score-0.425]
37 It is supposed to give an accurate probability distribution over possible values xij of the missing input sentence xi. [sent-112, score-0.203]
38 All of these values are taken into account in (4), regardless of the loss that they would incur if they were evaluated for translation quality relative to the missing xi. [sent-113, score-0.351]
39 Thus, φ does not need to be trained to minimize the risk itself (so there is no circularity). [sent-114, score-0.195]
40 It may be tolerable for pφ to impute mediocre translations xij. [sent-117, score-0.353]
41 All that is necessary is that the (forward) translations generated from the imputed xij “simulate” the competing hypotheses that we would see when translating the correct Chinese input xi. [sent-118, score-0.763]
42 3 The Forward Translation System δθ and The Loss Function L(δθ(xi) , yi) The minimum empirical risk objective of (2) is quite general and various popular supervised training methods (Lafferty et al. [sent-120, score-0.371]
43 , 2006; Smith and Eisner, 8In a translation task from x to y, one usually does not make use of in-domain monolingual data x. [sent-122, score-0.173]
44 But we can exploit x to train a language model pφ (x) for the reverse translation system, which will make the imputed xij look like true Chinese inputs. [sent-123, score-1.114]
45 The generality of (2) extends to our minimum imputed risk objective of (4). [sent-125, score-0.86]
46 10 9One can manipulate the loss function to support other methods that use deterministic decoding, such as Perceptron (Collins, 2002) and MIRA (Crammer et al. [sent-142, score-0.188]
47 10Again, one may manipulate the loss function to support other probabilistic methods that use randomized decoding, such as CRFs (Lafferty et al. [sent-144, score-0.195]
48 1, it is computationally infeasible to forward-translate each of the imputed reverse translations xij. [sent-152, score-1.004]
49 For each yi, add to the imputed training set only the k most probable translations {xi1, . [sent-156, score-0.717]
50 ambiguous weighted finite-state automaton Xi, (b) the forward translation system δθ is structured in a certain way as a weighted synchronous context-free grammar, and (c) the loss function decomposes in a certain way. [sent-176, score-0.213]
51 Intuitively, the reason why the structure-sharing in the hypergraph Xi (generated by the reverse system) cannot be exploited during forward translation is that when the forward Hiero system translates a string xi ∈ Xi, it must parse it into recursive phrases. [sent-184, score-0.925]
52 But the structure-sharing within the hypergraph of Xi has already parsed xi into recursive phrases, in a way determined by the reverse Hiero system, with each translation phrase (or rule) corresponding to a hyperedge. [sent-185, score-0.622]
53 To exploit structure-sharing, we can use a forward translation system that decomposes according to that existing parse of xi. [sent-186, score-0.323]
54 We can do that by considering only forward translations that respect the hypergraph structure of Xi. [sent-187, score-0.34]
55 The simplest way to do this is to require complete isomorphism of the SCFG trees used for the reverse and forward translations. [sent-188, score-0.463]
56 Our deterministic test-time translation system δθ simply 12Note that the forward translation of a WFSA is tractable by using a lattice-based decoder such as that by Dyer et al. [sent-197, score-0.432]
57 For large γ, our training objective approaches the imputed risk of the deterministic test-time system while remaining differentiable. [sent-205, score-0.886]
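A hedged sketch of what such a γ-annealed objective looks like in this line of work (following the minimum-risk annealing of Smith and Eisner (2006); the exact parameterization used in the paper is not reproduced in these extracts):

```latex
R_{\gamma}(\theta) \;=\; \frac{1}{N}\sum_{i=1}^{N}\sum_{x} p_{\phi}(x \mid y_i)
\sum_{y'} p_{\theta,\gamma}(y' \mid x)\, L(y', y_i),
\qquad
p_{\theta,\gamma}(y' \mid x) \;\propto\; p_{\theta}(y' \mid x)^{\gamma}.
```

As γ → ∞ the annealed distribution concentrates on the 1-best output, so R_γ approaches the imputed risk of the deterministic decoder while remaining differentiable for finite γ.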
58 EM The notion of imputing missing data is familiar from other settings (Little and Rubin, 1987), particularly the expectation maximization (EM) algorithm, a widely used generative approach. [sent-213, score-0.18]
59 So it is instructive to compare EM with minimum imputed risk. [sent-214, score-0.653]
60 (14) Notice that if we replace pθt (x | yi) with pφ(x | yi) in the equation above, and admit negated log-likelihood as a loss function, then the EM update (14) becomes identical to (4). [sent-221, score-0.218]
61 In other words, the minimum imputed risk approach of Section 3. [sent-222, score-0.819]
62 1 differs from EM in (i) using an externally-provided and static pφ, instead of refining it at each iteration based on the current pθt , and (ii) using a specific loss function, namely negated log-likelihood. [sent-223, score-0.192]
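Concretely, under the substitution described in (i) and (ii), both procedures end up optimizing the same quantity, namely (4) with the loss taken to be negated log-likelihood (written here in the paper's notation as a reconstruction):

```latex
\theta^{*} \;=\; \arg\min_{\theta}\; \frac{1}{N}\sum_{i=1}^{N}\sum_{x}
p_{\phi}(x \mid y_i)\,\big[-\log p_{\theta}(y_i \mid x)\big].
```

The difference is that EM would re-estimate the imputation distribution as pθt(x | yi) at every iteration, whereas minimum imputed risk keeps pφ fixed and allows an arbitrary task loss in place of the log-loss.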
63 13 In summary, EM would impute missing data using pθ(x | y) and predict outputs using pθ (y | x), both being conditional forms of the same joint model pθ(x, y). [sent-233, score-0.348]
64 Our minimum imputed risk training method is similar, but it instead uses a pair of 13Analogously, discriminative CRFs have become more popular than generative HMMs because they permit efficient training even with a wide variety of log-linear features (Lafferty et al. [sent-234, score-0.99]
65 By sticking to conditional models, we can efficiently use more sophisticated model features, and we can incorporate the loss function when we train θ, which should improve both efficiency and accuracy at test time. [sent-237, score-0.186]
66 1 IWSLT Task We train both reverse and forward baseline systems. [sent-243, score-0.49]
67 The translation models are built using the corpus for the IWSLT 2005 Chinese to English translation task (Eck and Hori, 2005), which comprises 40,000 pairs of transcribed utterances in the travel domain. [sent-244, score-0.23]
68 2 Target-rule Bigram Features In this paper, we do not attempt to discriminatively tune a separate parameter for each bilingual rule in the Hiero grammar. [sent-260, score-0.18]
69 Note that the reverse model φ is always trained using the supervised data of Dev φ, while the forward model θ may be trained in a supervised or semisupervised manner, as we will show below. [sent-277, score-0.579]
70 In all three data sets, each Chinese sentence xi has 16 English reference translations, so each yi is actually a set of 16 translations. [sent-278, score-0.372]
71 When we impute data from yi (in the semi-supervised scenario), we 14Ideally, we should train φ to minimize the conditional cross-entropy (5) as suggested in section 3. [sent-279, score-0.566]
72 Dev φ is used for discriminative training of the reverse model φ, Dev θ is for the forward model, and Eval θ is for testing. [sent-283, score-0.57]
73 actually impute 16 different values of xi, by using pφ to separately reverse translate each sentence in yi. [sent-285, score-0.55]
74 4), where each xi is a different input sentence (imputed) in each case, but yi is always the original set of 16 references. [sent-287, score-0.372]
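A tiny sketch of this multi-reference set-up (hypothetical helper names; the reverse system's actual interface is not shown in these extracts):

```python
def impute_from_reference_set(references, reverse_translate_1best):
    """Reverse-translate each of the 16 English references separately.

    Every imputed Chinese input is paired with the *full* reference set, so the
    forward translation of each imputed x_i is still scored against all 16
    references (e.g. with multi-reference BLEU).
    """
    return [(reverse_translate_1best(english), references) for english in references]
```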
75 2 NIST Task For the NIST task, we use MT03 set (having 919 sentences) to tune the component parameters in both the forward and reverse baseline systems. [sent-290, score-0.495]
76 Additionally, we use the English side of MT04 (having 1788 sentences) to perform semi-supervised tuning of the forward model. [sent-291, score-0.168]
77 The supervised system (“Sup”) carries out discriminative training on a bilingual data set. [sent-296, score-0.298]
78 The semi-supervised system (“+Unsup”) additionally uses some monolingual English text for discriminative training (where we impute one Chinese translation per English sentence). [sent-297, score-0.591]
79 “+Unsup” means that we include 200×16 additional (monolingual) English sentences from Dev θ for semi-supervised training; for each English sentence, we impute the 1-best Chinese translation. [sent-309, score-0.255]
80 1 Imputation with Different Reverse Models A critical component of our unsupervised method is the reverse translation model pφ(x | y). [sent-322, score-0.439]
81 We wonder how the performance of our unsupervised method changes when the quality of the reverse system varies. [sent-323, score-0.365]
82 To study this question, we used two different reverse translation systems, one with a language model trained on the Chinese side of the bitext (“WLM”), and the other one without using such a Chinese LM (“NLM”). [sent-324, score-0.397]
83 Table 4 (in the fully unsupervised case) shows that the imputed Chinese translations have a far lower BLEU score without the language model,15 and that this costs us about 1 English BLEU point. 15The BLEU scores are low even with the language model because only one Chinese reference is available for scoring. [sent-325, score-0.723]
84 (Table caption) Results with/without using a language model in the reverse system. [sent-330, score-0.295]
85 A data size of 101 means that we use only the English sentences from a subset of Dev θ containing 101 Chinese sentences and 101×16 English translations; for each English sentence, we impute the 1-best Chinese translation. [sent-331, score-0.255]
86 “WLM” means a Chinese language model is used in the reverse system, while “NLM” means no Chinese language model is used. [sent-332, score-0.295]
87 In addition to reporting the BLEU score on Eval θ, we also report “Imputed-CN BLEU”, the BLEU score of the imputed Chinese sentences against their corresponding Chinese reference sentences. [sent-333, score-0.583]
88 Still, even with the worse imputation (in the case of “NLM”), our forward translations improve as we add more monolingual data. [sent-335, score-0.446]
89 2 Imputation with Different k-best Sizes In all the experiments so far, we used the reverse translation system to impute only a single Chinese translation for each English monolingual sentence. [sent-338, score-0.853]
90 16 list, a sample, or a lattice for xi (see section 3. [sent-343, score-0.197]
91 6 Conclusions In this paper, we present an unsupervised discriminative training method that works with missing inputs. [sent-345, score-0.27]
92 The key idea in our method is to use a reverse model to impute the missing input from the observed output. [sent-346, score-0.643]
93 The training will then forward translate the imputed input, and choose the parameters of the forward model such that the imputed risk (i. [sent-347, score-1.704]
94 16In the present experiments, however, we simply weighted all k imputed translations equally, rather than in proportion to their posterior probabilities as suggested in Section 3. [sent-349, score-0.711]
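The alternative weighting mentioned here is just a normalization of the reverse model's k-best scores; a minimal sketch, assuming the scores are nonnegative probabilities of the form pφ(xij | yi):

```python
def posterior_weights(kbest_probs):
    """Turn k-best reverse-translation probabilities into normalized weights.

    Using these instead of uniform 1/k weights makes each imputed x_ij count in
    proportion to its posterior under the reverse model.
    """
    total = sum(kbest_probs)
    return [p / total for p in kbest_probs]
```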
95 For each English sentence we impute the k-best Chinese translations using the reverse system. [sent-354, score-0.354]
96 the expected loss of the forward translations with respect to the observed output) is minimized. [sent-355, score-0.427]
97 This matches the intuition that the probabilistic “roundtrip” translation from the target-language sentence to the source-language and back should have low expected loss. [sent-356, score-0.168]
98 In future work, we plan to test our method in settings where there are large amounts of monolingual training data (enabling many discriminative features). [sent-362, score-0.206]
99 First- and second-order expectation semirings with applications to minimum-risk training on translation forests. [sent-432, score-0.189]
100 Unsupervised discriminative language model training for machine translation using simulated confusion sets. [sent-445, score-0.237]
wordName wordTfidf (topN-words)
[('imputed', 0.583), ('reverse', 0.295), ('impute', 0.255), ('yi', 0.221), ('forward', 0.168), ('risk', 0.166), ('xi', 0.151), ('iwslt', 0.14), ('loss', 0.125), ('chinese', 0.111), ('imputation', 0.109), ('sup', 0.109), ('unsup', 0.109), ('translation', 0.102), ('dev', 0.1), ('discriminative', 0.099), ('translations', 0.098), ('bleu', 0.095), ('missing', 0.093), ('hiero', 0.089), ('xij', 0.082), ('bilingual', 0.077), ('hypergraph', 0.074), ('discriminatively', 0.071), ('monolingual', 0.071), ('minimum', 0.07), ('mt', 0.068), ('negated', 0.067), ('nist', 0.066), ('eisner', 0.065), ('zhifei', 0.061), ('supervised', 0.058), ('nlm', 0.055), ('li', 0.055), ('arg', 0.053), ('chiang', 0.052), ('expectation', 0.051), ('em', 0.048), ('english', 0.048), ('eval', 0.047), ('differentiable', 0.047), ('lattice', 0.046), ('joshua', 0.045), ('rubin', 0.043), ('unsupervised', 0.042), ('och', 0.041), ('objective', 0.041), ('khudanpur', 0.041), ('translates', 0.041), ('sanjeev', 0.039), ('randomized', 0.039), ('decoding', 0.038), ('star', 0.037), ('eng', 0.037), ('minimization', 0.037), ('imputing', 0.036), ('roundtrip', 0.036), ('rxe', 0.036), ('signicantly', 0.036), ('sseenntteenncceess', 0.036), ('strives', 0.036), ('xik', 0.036), ('xnd', 0.036), ('training', 0.036), ('expected', 0.036), ('logp', 0.035), ('inputs', 0.034), ('conditional', 0.034), ('jason', 0.033), ('deterministic', 0.032), ('tune', 0.032), ('derivation', 0.032), ('wlm', 0.031), ('ziyuan', 0.031), ('incur', 0.031), ('manipulate', 0.031), ('back', 0.03), ('weighted', 0.03), ('minimize', 0.029), ('lafferty', 0.028), ('johns', 0.028), ('dong', 0.028), ('sons', 0.028), ('eck', 0.028), ('infeasible', 0.028), ('distribution', 0.028), ('system', 0.028), ('minimizes', 0.027), ('smith', 0.027), ('train', 0.027), ('ignacio', 0.026), ('loglikelihood', 0.026), ('comprises', 0.026), ('hu', 0.026), ('approximation', 0.026), ('exploit', 0.025), ('wolfgang', 0.025), ('roark', 0.025), ('fk', 0.025), ('job', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000013 93 emnlp-2011-Minimum Imputed-Risk: Unsupervised Discriminative Training for Machine Translation
Author: Zhifei Li ; Ziyuan Wang ; Jason Eisner ; Sanjeev Khudanpur ; Brian Roark
Abstract: Discriminative training for machine translation has been well studied in the recent past. A limitation of the work to date is that it relies on the availability of high-quality in-domain bilingual text for supervised training. We present an unsupervised discriminative training framework to incorporate the usually plentiful target-language monolingual data by using a rough “reverse” translation system. Intuitively, our method strives to ensure that probabilistic “round-trip” translation from a target- language sentence to the source-language and back will have low expected loss. Theoretically, this may be justified as (discriminatively) minimizing an imputed empirical risk. Empirically, we demonstrate that augmenting supervised training with unsupervised data improves translation performance over the supervised case for both IWSLT and NIST tasks.
2 0.14382003 44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection
Author: Amittai Axelrod ; Xiaodong He ; Jianfeng Gao
Abstract: We explore efficient domain adaptation for the task of statistical machine translation based on extracting sentences from a large general-domain parallel corpus that are most relevant to the target domain. These sentences may be selected with simple cross-entropy based methods, of which we present three. As these sentences are not themselves identical to the in-domain data, we call them pseudo in-domain subcorpora. These subcorpora, 1% the size of the original, can then be used to train small domain-adapted Statistical Machine Translation (SMT) systems which outperform systems trained on the entire corpus. Performance is further improved when we use these domain-adapted models in combination with a true in-domain model. The results show that more training data is not always better, and that best results are attained via proper domain-relevant data selection, as well as combining in- and general-domain systems during decoding.
3 0.13372487 137 emnlp-2011-Training dependency parsers by jointly optimizing multiple objectives
Author: Keith Hall ; Ryan McDonald ; Jason Katz-Brown ; Michael Ringgaard
Abstract: We present an online learning algorithm for training parsers which allows for the inclusion of multiple objective functions. The primary example is the extension of a standard supervised parsing objective function with additional loss-functions, either based on intrinsic parsing quality or task-specific extrinsic measures of quality. Our empirical results show how this approach performs for two dependency parsing algorithms (graph-based and transition-based parsing) and how it achieves increased performance on multiple target tasks including reordering for machine translation and parser adaptation.
4 0.13083567 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation
Author: Kevin Gimpel ; Noah A. Smith
Abstract: We present a quasi-synchronous dependency grammar (Smith and Eisner, 2006) for machine translation in which the leaves of the tree are phrases rather than words as in previous work (Gimpel and Smith, 2009). This formulation allows us to combine structural components of phrase-based and syntax-based MT in a single model. We describe a method of extracting phrase dependencies from parallel text using a target-side dependency parser. For decoding, we describe a coarse-to-fine approach based on lattice dependency parsing of phrase lattices. We demonstrate performance improvements for Chinese-English and UrduEnglish translation over a phrase-based baseline. We also investigate the use of unsupervised dependency parsers, reporting encouraging preliminary results.
5 0.12416626 58 emnlp-2011-Fast Generation of Translation Forest for Large-Scale SMT Discriminative Training
Author: Xinyan Xiao ; Yang Liu ; Qun Liu ; Shouxun Lin
Abstract: Although discriminative training guarantees to improve statistical machine translation by incorporating a large amount of overlapping features, it is hard to scale up to large data due to decoding complexity. We propose a new algorithm to generate translation forest of training data in linear time with the help of word alignment. Our algorithm also alleviates the oracle selection problem by ensuring that a forest always contains derivations that exactly yield the reference translation. With millions of features trained on 519K sentences in 0.03 second per sentence, our system achieves significant improvement by 0.84 BLEU over the baseline system on the NIST Chinese-English test sets.
6 0.11785474 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation
7 0.10965953 125 emnlp-2011-Statistical Machine Translation with Local Language Models
8 0.092072852 100 emnlp-2011-Optimal Search for Minimum Error Rate Training
9 0.091848798 20 emnlp-2011-Augmenting String-to-Tree Translation Models with Fuzzy Use of Source-side Syntax
10 0.086375028 83 emnlp-2011-Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation
11 0.085362844 10 emnlp-2011-A Probabilistic Forest-to-String Model for Language Generation from Typed Lambda Calculus Expressions
12 0.083310291 118 emnlp-2011-SMT Helps Bitext Dependency Parsing
13 0.079054616 65 emnlp-2011-Heuristic Search for Non-Bottom-Up Tree Structure Prediction
14 0.076841339 60 emnlp-2011-Feature-Rich Language-Independent Syntax-Based Alignment for Statistical Machine Translation
15 0.076303944 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation
16 0.075636685 138 emnlp-2011-Tuning as Ranking
17 0.075326636 15 emnlp-2011-A novel dependency-to-string model for statistical machine translation
18 0.067139663 148 emnlp-2011-Watermarking the Outputs of Structured Prediction with an application in Statistical Machine Translation.
19 0.064084701 25 emnlp-2011-Cache-based Document-level Statistical Machine Translation
20 0.062998526 96 emnlp-2011-Multilayer Sequence Labeling
topicId topicWeight
[(0, 0.224), (1, 0.147), (2, 0.105), (3, -0.142), (4, 0.052), (5, -0.04), (6, 0.005), (7, -0.086), (8, -0.053), (9, -0.007), (10, 0.012), (11, 0.08), (12, -0.017), (13, 0.028), (14, -0.004), (15, 0.058), (16, 0.048), (17, -0.076), (18, 0.024), (19, 0.01), (20, -0.07), (21, 0.032), (22, 0.053), (23, 0.014), (24, 0.257), (25, -0.017), (26, -0.118), (27, 0.097), (28, -0.046), (29, 0.186), (30, -0.054), (31, -0.017), (32, -0.051), (33, -0.046), (34, 0.029), (35, 0.069), (36, 0.037), (37, -0.002), (38, 0.077), (39, -0.038), (40, -0.003), (41, 0.033), (42, 0.019), (43, -0.037), (44, -0.166), (45, 0.052), (46, 0.001), (47, -0.043), (48, 0.078), (49, -0.212)]
simIndex simValue paperId paperTitle
same-paper 1 0.93964386 93 emnlp-2011-Minimum Imputed-Risk: Unsupervised Discriminative Training for Machine Translation
Author: Zhifei Li ; Ziyuan Wang ; Jason Eisner ; Sanjeev Khudanpur ; Brian Roark
Abstract: Discriminative training for machine translation has been well studied in the recent past. A limitation of the work to date is that it relies on the availability of high-quality in-domain bilingual text for supervised training. We present an unsupervised discriminative training framework to incorporate the usually plentiful target-language monolingual data by using a rough “reverse” translation system. Intuitively, our method strives to ensure that probabilistic “round-trip” translation from a target- language sentence to the source-language and back will have low expected loss. Theoretically, this may be justified as (discriminatively) minimizing an imputed empirical risk. Empirically, we demonstrate that augmenting supervised training with unsupervised data improves translation performance over the supervised case for both IWSLT and NIST tasks.
2 0.58257395 100 emnlp-2011-Optimal Search for Minimum Error Rate Training
Author: Michel Galley ; Chris Quirk
Abstract: Minimum error rate training is a crucial component to many state-of-the-art NLP applications, such as machine translation and speech recognition. However, common evaluation functions such as BLEU or word error rate are generally highly non-convex and thus prone to search errors. In this paper, we present LP-MERT, an exact search algorithm for minimum error rate training that reaches the global optimum using a series of reductions to linear programming. Given a set of N-best lists produced from S input sentences, this algorithm finds a linear model that is globally optimal with respect to this set. We find that this algorithm is polynomial in N and in the size of the model, but exponential in S. We present extensions of this work that let us scale to reasonably large tuning sets (e.g., one thousand sentences), by either searching only promising regions of the parameter space, or by using a variant of LP-MERT that relies on a beam-search approximation. Experimental results show improvements over the standard Och algorithm.
3 0.51506978 60 emnlp-2011-Feature-Rich Language-Independent Syntax-Based Alignment for Statistical Machine Translation
Author: Jason Riesa ; Ann Irvine ; Daniel Marcu
Abstract: unkown-abstract
4 0.48783278 138 emnlp-2011-Tuning as Ranking
Author: Mark Hopkins ; Jonathan May
Abstract: We offer a simple, effective, and scalable method for statistical machine translation parameter tuning based on the pairwise approach to ranking (Herbrich et al., 1999). Unlike the popular MERT algorithm (Och, 2003), our pairwise ranking optimization (PRO) method is not limited to a handful of parameters and can easily handle systems with thousands of features. Moreover, unlike recent approaches built upon the MIRA algorithm of Crammer and Singer (2003) (Watanabe et al., 2007; Chiang et al., 2008b), PRO is easy to implement. It uses off-the-shelf linear binary classifier software and can be built on top of an existing MERT framework in a matter of hours. We establish PRO’s scalability and effectiveness by comparing it to MERT and MIRA and demonstrate parity on both phrase-based and syntax-based systems in a variety of language pairs, using large scale data scenarios.
5 0.48777562 44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection
Author: Amittai Axelrod ; Xiaodong He ; Jianfeng Gao
Abstract: We explore efficient domain adaptation for the task of statistical machine translation based on extracting sentences from a large general-domain parallel corpus that are most relevant to the target domain. These sentences may be selected with simple cross-entropy based methods, of which we present three. As these sentences are not themselves identical to the in-domain data, we call them pseudo in-domain subcorpora. These subcorpora, 1% the size of the original, can then be used to train small domain-adapted Statistical Machine Translation (SMT) systems which outperform systems trained on the entire corpus. Performance is further improved when we use these domain-adapted models in combination with a true in-domain model. The results show that more training data is not always better, and that best results are attained via proper domain-relevant data selection, as well as combining in- and general-domain systems during decoding.
6 0.47090295 65 emnlp-2011-Heuristic Search for Non-Bottom-Up Tree Structure Prediction
7 0.46603283 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation
8 0.45765117 137 emnlp-2011-Training dependency parsers by jointly optimizing multiple objectives
9 0.4351747 58 emnlp-2011-Fast Generation of Translation Forest for Large-Scale SMT Discriminative Training
10 0.41725987 18 emnlp-2011-Analyzing Methods for Improving Precision of Pivot Based Bilingual Dictionaries
11 0.4076933 118 emnlp-2011-SMT Helps Bitext Dependency Parsing
12 0.39322346 143 emnlp-2011-Unsupervised Information Extraction with Distributional Prior Knowledge
13 0.38906506 66 emnlp-2011-Hierarchical Phrase-based Translation Representations
14 0.37585282 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation
15 0.36224106 25 emnlp-2011-Cache-based Document-level Statistical Machine Translation
16 0.35793537 73 emnlp-2011-Improving Bilingual Projections via Sparse Covariance Matrices
17 0.34524575 83 emnlp-2011-Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation
18 0.34376627 148 emnlp-2011-Watermarking the Outputs of Structured Prediction with an application in Statistical Machine Translation.
19 0.34253362 125 emnlp-2011-Statistical Machine Translation with Local Language Models
20 0.33874923 20 emnlp-2011-Augmenting String-to-Tree Translation Models with Fuzzy Use of Source-side Syntax
topicId topicWeight
[(23, 0.563), (36, 0.015), (37, 0.021), (45, 0.04), (53, 0.031), (54, 0.033), (62, 0.023), (64, 0.023), (66, 0.019), (69, 0.016), (79, 0.034), (82, 0.019), (87, 0.012), (90, 0.016), (96, 0.023), (98, 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.99664485 93 emnlp-2011-Minimum Imputed-Risk: Unsupervised Discriminative Training for Machine Translation
Author: Zhifei Li ; Ziyuan Wang ; Jason Eisner ; Sanjeev Khudanpur ; Brian Roark
Abstract: Discriminative training for machine translation has been well studied in the recent past. A limitation of the work to date is that it relies on the availability of high-quality in-domain bilingual text for supervised training. We present an unsupervised discriminative training framework to incorporate the usually plentiful target-language monolingual data by using a rough “reverse” translation system. Intuitively, our method strives to ensure that probabilistic “round-trip” translation from a target- language sentence to the source-language and back will have low expected loss. Theoretically, this may be justified as (discriminatively) minimizing an imputed empirical risk. Empirically, we demonstrate that augmenting supervised training with unsupervised data improves translation performance over the supervised case for both IWSLT and NIST tasks.
2 0.99526924 42 emnlp-2011-Divide and Conquer: Crowdsourcing the Creation of Cross-Lingual Textual Entailment Corpora
Author: Matteo Negri ; Luisa Bentivogli ; Yashar Mehdad ; Danilo Giampiccolo ; Alessandro Marchetti
Abstract: We address the creation of cross-lingual textual entailment corpora by means of crowdsourcing. Our goal is to define a cheap and replicable data collection methodology that minimizes the manual work done by expert annotators, without resorting to preprocessing tools or already annotated monolingual datasets. In line with recent works emphasizing the need of large-scale annotation efforts for textual entailment, our work aims to: i) tackle the scarcity of data available to train and evaluate systems, and ii) promote the recourse to crowdsourcing as an effective way to reduce the costs of data collection without sacrificing quality. We show that a complex data creation task, for which even experts usually feature low agreement scores, can be effectively decomposed into simple subtasks assigned to non-expert annotators. The resulting dataset, obtained from a pipeline of different jobs routed to Amazon Mechanical Turk, contains more than 1,600 aligned pairs for each combination of texts-hypotheses in English, Italian and German.
3 0.9933188 7 emnlp-2011-A Joint Model for Extended Semantic Role Labeling
Author: Vivek Srikumar ; Dan Roth
Abstract: This paper presents a model that extends semantic role labeling. Existing approaches independently analyze relations expressed by verb predicates or those expressed as nominalizations. However, sentences express relations via other linguistic phenomena as well. Furthermore, these phenomena interact with each other, thus restricting the structures they articulate. In this paper, we use this intuition to define a joint inference model that captures the inter-dependencies between verb semantic role labeling and relations expressed using prepositions. The scarcity of jointly labeled data presents a crucial technical challenge for learning a joint model. The key strength of our model is that we use existing structure predictors as black boxes. By enforcing consistency constraints between their predictions, we show improvements in the performance of both tasks without retraining the individual models.
4 0.98903328 48 emnlp-2011-Enhancing Chinese Word Segmentation Using Unlabeled Data
Author: Weiwei Sun ; Jia Xu
Abstract: This paper investigates improving supervised word segmentation accuracy with unlabeled data. Both large-scale in-domain data and small-scale document text are considered. We present a unified solution to include features derived from unlabeled data to a discriminative learning model. For the large-scale data, we derive string statistics from Gigaword to assist a character-based segmenter. In addition, we introduce the idea about transductive, document-level segmentation, which is designed to improve the system recall for out-ofvocabulary (OOV) words which appear more than once inside a document. Novel features1 result in relative error reductions of 13.8% and 15.4% in terms of F-score and the recall of OOV words respectively.
5 0.98640943 135 emnlp-2011-Timeline Generation through Evolutionary Trans-Temporal Summarization
Author: Rui Yan ; Liang Kong ; Congrui Huang ; Xiaojun Wan ; Xiaoming Li ; Yan Zhang
Abstract: We investigate an important and challenging problem in summary generation, i.e., Evolutionary Trans-Temporal Summarization (ETTS), which generates news timelines from massive data on the Internet. ETTS greatly facilitates fast news browsing and knowledge comprehension, and hence is a necessity. Given the collection oftime-stamped web documents related to the evolving news, ETTS aims to return news evolution along the timeline, consisting of individual but correlated summaries on each date. Existing summarization algorithms fail to utilize trans-temporal characteristics among these component summaries. We propose to model trans-temporal correlations among component summaries for timelines, using inter-date and intra-date sen- tence dependencies, and present a novel combination. We develop experimental systems to compare 5 rival algorithms on 6 instinctively different datasets which amount to 10251 documents. Evaluation results in ROUGE metrics indicate the effectiveness of the proposed approach based on trans-temporal information. 1
6 0.91719741 58 emnlp-2011-Fast Generation of Translation Forest for Large-Scale SMT Discriminative Training
7 0.91120231 6 emnlp-2011-A Generate and Rank Approach to Sentence Paraphrasing
8 0.9081527 137 emnlp-2011-Training dependency parsers by jointly optimizing multiple objectives
10 0.87069792 44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection
11 0.86999226 136 emnlp-2011-Training a Parser for Machine Translation Reordering
12 0.86471868 61 emnlp-2011-Generating Aspect-oriented Multi-Document Summarization with Event-aspect model
13 0.8629517 17 emnlp-2011-Active Learning with Amazon Mechanical Turk
14 0.86116874 126 emnlp-2011-Structural Opinion Mining for Graph-based Sentiment Representation
15 0.86091012 25 emnlp-2011-Cache-based Document-level Statistical Machine Translation
16 0.86061257 124 emnlp-2011-Splitting Noun Compounds via Monolingual and Bilingual Paraphrasing: A Study on Japanese Katakana Words
17 0.8600651 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation
18 0.85953963 23 emnlp-2011-Bootstrapped Named Entity Recognition for Product Attribute Extraction
19 0.85470152 89 emnlp-2011-Linguistic Redundancy in Twitter
20 0.85297692 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances