emnlp emnlp2013 emnlp2013-145 knowledge-graph by maker-knowledge-mining

145 emnlp-2013-Optimal Beam Search for Machine Translation


Source: pdf

Author: Alexander Rush ; Yin-Wen Chang ; Michael Collins

Abstract: Beam search is a fast and empirically effective method for translation decoding, but it lacks formal guarantees about search error. We develop a new decoding algorithm that combines the speed of beam search with the optimal certificate property of Lagrangian relaxation, and apply it to phrase- and syntax-based translation decoding. The new method is efficient, utilizes standard MT algorithms, and returns an exact solution on the majority of translation examples in our test data. The algorithm is 3.5 times faster than an optimized incremental constraint-based decoder for phrase-based translation and 4 times faster for syntax-based translation.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Beam search is a fast and empirically effective method for translation decoding, but it lacks formal guarantees about search error. [sent-3, score-0.347]

2 We develop a new decoding algorithm that combines the speed of beam search with the optimal certificate property of Lagrangian relaxation, and apply it to phrase- and syntax-based translation decoding. [sent-4, score-1.285]

3 In this work we present a variant of beam search decoding for phrase- and syntax-based translation. [sent-15, score-0.814]

4 The motivation is to exploit the effectiveness and efficiency of beam search, but still maintain formal guarantees. [sent-16, score-0.522]

5 In theory, it can provide a certificate of optimality; in practice, we show that it produces optimal hypotheses, with certificates of optimality, on the vast majority of examples. [sent-19, score-0.363]

6 The method only relies on having a constrained beam search algorithm and a fast unconstrained search algorithm. [sent-24, score-0.975]

7 We begin in Section 2 by describing constrained hypergraph search and showing how it generalizes translation decoding. [sent-26, score-0.53]

8 Section 3 introduces a variant of beam search that is, in theory, able to produce a certificate of optimality. [sent-27, score-0.892]

9 Section 4 shows how to improve the effectiveness of beam search by using weights derived from Lagrangian relaxation. [sent-28, score-0.661]

10 Section 5 puts everything together to derive a fast beam search algorithm that is often optimal in practice. [sent-29, score-0.794]

11 Experiments compare the new algorithm with several variants of beam search, cube pruning, A∗ search, and relaxation-based decoders on two translation tasks. [sent-30, score-0.765]

12 The optimal beam search algorithm is able to find exact solutions with certificates of optimality on 99% of translation examples, significantly more than other baselines. [sent-31, score-1.14]

13 The beam search algorithm is much faster than other exact methods. [sent-34, score-0.793]

14 Figure 1: Dynamic programming algorithm for unconstrained hypergraph search. [sent-89, score-0.389]
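The following is a minimal, self-contained Python sketch of the Figure 1 dynamic program, not the authors' implementation: it assumes hyperedges are (head, tails) tuples (with tails itself a tuple so each hyperedge can key the score dict theta), given in bottom-up topological order, and that vertex 1 is the root.

def unconstrained_search(vertices, terminals, edges, theta, tau):
    # pi[v] holds the best (Viterbi) inside score of vertex v
    pi = {v: float("-inf") for v in vertices}
    for v in terminals:                      # leaf vertices start with score 0
        pi[v] = 0.0
    for e in edges:                          # hyperedges in topological order
        head, tails = e
        s = theta[e] + sum(pi[t] for t in tails)
        if s > pi[head]:                     # keep the best derivation of head
            pi[head] = s
    return pi[1] + tau                       # score of the best hyperpath

The same routine also serves later as the inner solver that Lagrangian relaxation calls repeatedly, only with modified weights.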

15 Next consider a variant of this problem: constrained hypergraph search. [sent-97, score-0.35]

16 In the constrained hypergraph problem, hyperpaths must fulfill additional linear hyperedge constraints. [sent-101, score-0.485]
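Spelled out in the paper's notation (a restatement rather than a new result), with x the 0/1 indicator vector over hyperedges used by a hyperpath, the constrained problem is:

x^{*} \;=\; \arg\max_{x \in \mathcal{X}} \; \theta^{\top} x + \tau
\qquad \text{subject to} \qquad A x = b,

where \mathcal{X} is the set of valid hyperpaths and each row of Ax = b is one linear hyperedge constraint.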

17 Note that the constrained hypergraph search problem may be NP-Hard. [sent-106, score-0.403]

18 The translation decoding problem is to find the best derivation for a given source sentence. [sent-122, score-0.338]

19 We can represent this decoding problem as a constrained hypergraph using the construction of Chang and Collins (201 1). [sent-145, score-0.418]

20 The hypergraph weights encode the translation and language model scores, and its structure ensures that the count of source words translated is |w|, i. [sent-146, score-0.428]

21 While any valid derivation corresponds to a hyperpath in this graph, a hyperpath may not correspond to a valid derivation. [sent-167, score-0.467]

22 Example: Syntax-Based Machine Translation. Syntax-based machine translation with a language model can also be expressed as a constrained hypergraph problem. [sent-190, score-0.42]

23 3 A Variant of Beam Search. This section describes a variant of the beam search algorithm for finding the highest-scoring constrained hyperpath. [sent-193, score-0.836]

24 Any solution returned by the algorithm will be a valid constrained hyperpath and a member of X′. [sent-195, score-0.408]

25 Additionally, the algorithm returns a certificate flag opt that, if true, indicates that no beam pruning was used, implying the solution returned is optimal. [sent-196, score-0.895]

26 Generally it will be hard to produce a certificate even by reducing the amount of beam pruning; however in the next section we will introduce a method based on Lagrangian relaxation to tighten the upper bounds. [sent-197, score-1.014]

27 1 Algorithm Figure 3 shows the complete beam search algorithm. [sent-200, score-0.632]

28 The beam search chart indexes hypotheses by vertex v ∈ V as well as a signature sig, where |b| is the number of constraints. [sent-202, score-1.092]

29 For hypothesis x, the algorithm ensures that its signature sig is equal to Ax. [sent-208, score-0.333]

30 The algorithm takes as arguments a lower bound on the optimal score lb ≤ θ⊤x∗ + τ, and computes upper bounds on the outside score for all vertices v: ubs[v], i. [sent-218, score-0.52]

31 Note that pruning may remove optimal hypotheses, so we set the certificate flag opt to false if the chart is modified. [sent-228, score-0.643]

32 1: procedure BEAMSEARCH(θ, τ, lb, β) 2: ubs ← OUTSIDE(θ, τ) 3: opt ← true 4: π[v, sig] ← −∞ for all v ∈ V, sig 5: π[v, 0] ← 0 for all v ∈ T 6: for e ∈ E in topological order do 7: . [sent-232, score-0.523]

33 9: sig ← Aδ(e) + Σ_{i=2}^{|v|} sig(i) 10: s ← θ(e) + Σ_{i=2}^{|v|} π[vi, sig(i)] 11: if CHECK(sig) ∧ s + ubs[v1] ≥ lb then 12: π[v1, sig] ← s 13: if PRUNE(π, v1, sig, β) then opt ← false 14: lb′ ← π[1, c] + τ 15: return lb′, opt [sent-241, score-0.565]

34 Input: hypergraph (V, E), weights θ, τ, lower bound lb, pruning parameters β, and constraints. Output: a lower bound on the optimal constrained score and, possibly, a certificate of optimality. Figure 3: A variant of the beam search algorithm. [sent-242, score-0.689]
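To make the chart, signature, and pruning interplay concrete, here is a rough, self-contained Python sketch of this variant under simplifying assumptions (it is not the authors' code): hyperedges are (head, tails) tuples in topological order, delta[e] is the constraint-count increment of hyperedge e, signatures are count tuples that must finish exactly equal to b at the root, and pruning simply keeps the top `beam` items per head vertex.

import itertools
from collections import defaultdict

def beam_search_variant(root, terminals, edges, theta, tau, delta, b, lb, ubs, beam):
    opt = True
    chart = defaultdict(dict)                      # vertex -> {signature: score}
    zero = tuple(0 for _ in b)
    for v in terminals:
        chart[v][zero] = 0.0
    for e in edges:                                # topological order over hyperedges
        head, tails = e
        # combine one (signature, score) choice per tail vertex
        for combo in itertools.product(*(chart[t].items() for t in tails)):
            sig = tuple(d + sum(s[i] for s, _ in combo)
                        for i, d in enumerate(delta[e]))
            score = theta[e] + sum(sc for _, sc in combo)
            if any(sig[i] > b[i] for i in range(len(b))):
                continue                           # signature already violates a constraint
            if score + ubs[head] < lb:
                continue                           # provably non-optimal given the outside bound
            if score > chart[head].get(sig, float("-inf")):
                chart[head][sig] = score
        if len(chart[head]) > beam:                # beam pruning: keep the top `beam` items
            kept = sorted(chart[head].items(), key=lambda kv: -kv[1])[:beam]
            chart[head] = dict(kept)
            opt = False                            # pruning voids the optimality certificate
    lb_new = chart[root].get(tuple(b), float("-inf")) + tau
    return lb_new, opt                             # goal item must satisfy Ax = b exactly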

35 Uses dynamic programming to produce a lower bound on the optimal constrained solution and, possibly, a certificate of optimality. [sent-243, score-0.62]

36 Bounds lb and ubs are used to remove provably non-optimal solutions. [sent-247, score-0.337]

37 This variant on beam search satisfies the following two properties (recall x∗ is the optimal constrained solution). Property 3. [sent-249, score-0.811]

38 The returned score lb′ lower bounds the optimal constrained score, that is lb′ ≤ θ⊤x∗ + τ. [sent-251, score-0.389]

39 If beam search returns with opt = true, then the returned score is optimal, i. [sent-254, score-0.847]

40 1 is that the output of beam search, lb′, can be used as the input lb for future runs of the algorithm. [sent-258, score-0.723]

41 If we loosen the amount of beam pruning by adjusting the pruning parameter β, we can produce tighter lower bounds and discard more hypotheses. [sent-265, score-0.999]

42 A common beam pruning strategy is to group together items into a set C and retain a (possibly complete) subset. [sent-276, score-0.636]

43 For all v ∈ V, we set the unconstrained outside score ubs[v] = max_{x∈X: v∈x} Σ_{e∈O(v,x)} θ(e) + τ. This upper bound can be efficiently computed for all vertices using the standard outside dynamic programming algorithm. [sent-284, score-0.384]
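Under the same assumed hypergraph encoding as the earlier sketches, the bounds ubs[v] can be computed with an inside pass followed by an outside pass; this is illustrative rather than the paper's code.

def outside_bounds(root, vertices, terminals, edges, theta, tau):
    # inside (Viterbi) scores, computed bottom-up
    pi = {v: float("-inf") for v in vertices}
    for v in terminals:
        pi[v] = 0.0
    for e in edges:                                   # topological order
        head, tails = e
        pi[head] = max(pi[head], theta[e] + sum(pi[t] for t in tails))
    # outside scores, computed top-down in reverse topological order
    out = {v: float("-inf") for v in vertices}
    out[root] = 0.0
    for e in reversed(edges):
        head, tails = e
        for i, t in enumerate(tails):
            rest = sum(pi[s] for j, s in enumerate(tails) if j != i)
            out[t] = max(out[t], out[head] + theta[e] + rest)
    return {v: out[v] + tau for v in vertices}        # ubs[v]: best unconstrained completion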

44 4 Finding Tighter Bounds with Lagrangian Relaxation Beam search produces a certificate only if beam pruning is never used. [sent-288, score-0.949]

45 1 Algorithm In Lagrangian relaxation, instead of solving the constrained search problem, we relax the constraints and solve an unconstrained hypergraph problem with modified weights. [sent-296, score-0.564]
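Concretely, relaxing the constraints Ax = b with multipliers λ gives an unconstrained problem over modified weights (restating the construction in the paper's notation):

L(\lambda) \;=\; \max_{x \in \mathcal{X}} \; \theta^{\top} x + \tau - \lambda^{\top}(A x - b)
\;=\; \max_{x \in \mathcal{X}} \; \theta'^{\top} x + \tau',
\qquad \theta' = \theta - A^{\top}\lambda, \quad \tau' = \tau + \lambda^{\top} b,

so each evaluation of L(λ) is one run of the unconstrained hypergraph search of Figure 1 under θ′ and τ′.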

46 Note that for all valid constrained hyperpaths x ∈ X′, the term Ax − b equals 0, which implies that these hyperpaths have the same score under the modified weights as under the original weights, θ⊤x + τ = θ′⊤x + τ′. [sent-302, score-0.378]

47 This leads to the following two properties. procedure LRROUND(αk, λ): x ← argmax θ⊤x + τ − λ⊤(Ax − b); λ′ ← λ − αk(Ax − b); opt ← (Ax = b); ub ← θ⊤x + τ; return λ′, ub, opt. procedure LAGRANGIANRELAXATION(α): λ(0) ← 0; for k in 1 . [sent-303, score-0.346]

48 K do λ(k), ub, opt ← LRROUND(αk, λ(k−1)); if opt then return λ(k), ub, opt; return λ(K), ub, opt. Input: α1 . [sent-306, score-0.733]

49 αK sequence of subgradient rates. Output: λ, an upper bound ub, and a certificate of optimality opt for the constrained solution. Figure 5: Lagrangian relaxation algorithm. [sent-309, score-0.37]
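A hedged Python sketch of the Figure 5 loop follows. The callback best_hyperpath, which solves the unconstrained hypergraph problem under the given weights and returns the hyperedge indicator vector x and its score, is an assumed interface rather than part of the paper.

import numpy as np

def lr_round(alpha_k, lam, theta, tau, A, b, best_hyperpath):
    theta_mod = theta - A.T @ lam             # modified hyperedge weights theta'
    tau_mod = tau + lam @ b                   # modified constant tau'
    x, score = best_hyperpath(theta_mod, tau_mod)
    subgrad = A @ x - b                       # Ax - b
    opt = bool(np.all(subgrad == 0))          # all constraints satisfied: dual certificate
    lam_new = lam - alpha_k * subgrad         # subgradient step on lambda
    return lam_new, score, opt                # score equals L(lambda), an upper bound

def lagrangian_relaxation(alphas, theta, tau, A, b, best_hyperpath):
    lam = np.zeros(A.shape[0])
    ub = np.inf
    for alpha_k in alphas:
        lam, ub, opt = lr_round(alpha_k, lam, theta, tau, A, b, best_hyperpath)
        if opt:
            return lam, ub, True
    return lam, ub, False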

50 The value L(λ) upper bounds the optimal solution, that is L(λ) ≥ θ⊤x∗ + τ. Property 4. [sent-313, score-0.376]

51 1 states that L(λ) always produces some upper bound; however, to help beam search, we want as tight a bound as possible: minλ L(λ). [sent-319, score-0.682]

52 Subgradient descent iteratively solves unconstrained hypergraph search problems to compute these subgradients and updates λ. [sent-323, score-0.382]

53 However, if the solution uses each source word exactly once (Ax = 1), then we have a certificate and the solution is optimal. [sent-330, score-0.344]

54 To utilize these improved bounds, we simply replace the weights in beam search and the outside algorithm with the modified weights from Lagrangian relaxation, θ′ and τ′. [sent-335, score-0.822]

55 Since the result of beam search must be a valid constrained hyperpath x ∈ X′, and for all x ∈ X′, θ⊤x + τ = θ′⊤x + τ′, this substitution does not alter the necessary properties of the algorithm; i. [sent-336, score-0.954]

56 Additionally, the computation of upper bounds now becomes ubs[v] = max_{x∈X: v∈x} Σ_{e∈O(v,x)} θ′(e) + τ′. These outside paths may still violate constraints, but the modified weights now include penalty terms to discourage common violations. [sent-339, score-0.418]

57 5 Optimal Beam Search The optimality of the beam search algorithm is dependent on the tightness of the upper and lower bounds. [sent-340, score-0.822]

58 We can produce better lower bounds by varying the pruning parameter β; we can produce better upper bounds by running Lagrangian relaxation. [sent-341, score-0.528]

59 In this section we combine these two ideas and present a complete optimal beam search algorithm. [sent-342, score-0.754]

60 Our general strategy will be to use Lagrangian relaxation to compute modified weights and to use beam search over these modified weights to attempt to find an optimal solution. [sent-343, score-1.029]

61 The algorithm then iteratively runs beam search using the parameter sequence βk. [sent-346, score-0.705]

62 These parameters allow the algorithm to loosen the amount of beam pruning. [sent-347, score-0.6]

63 For example, in phrase-based pruning, we would raise the number of hypotheses stored per group until no beam pruning occurs. [sent-348, score-0.725]

64 A clear disadvantage of the staged approach is that it needs to wait until Lagrangian relaxation is completed before even running beam search. [sent-349, score-0.717]

65 Often beam search will be able to quickly find an optimal solution even with good but non-optimal λ. [sent-350, score-0.8]

66 In other cases, beam search may still improve the lower bound lb. [sent-351, score-0.698]

67 In each round, the algorithm alternates between computing subgradients to tighten ubs and running beam search to maximize lb. [sent-353, score-0.847]

68 In early rounds we set β for aggressive beam pruning, and as the upper bounds get tighter, we loosen pruning to try to get a certificate. [sent-354, score-0.967]

69 If at any point either a primal or dual certificate is found, the algorithm returns the optimal solution. [sent-355, score-0.501]

70 6 Related Work Approximate methods based on beam search and cube-pruning have been widely studied for phrasebased (Koehn et al. [sent-356, score-0.664]

71 procedure OPTBEAMSTAGED(α, β): λ, ub, opt ← LAGRANGIANRELAXATION(α); if opt then return ub; θ′ ← θ − A⊤λ; τ′ ← τ + λ⊤b; lb(0) ← −∞; for k in 1 . [sent-370, score-0.433]

72 K do lb(k), opt ← BEAMSEARCH(θ′, τ′, lb(k−1), βk); if opt then return lb(k); return maxk∈{1. [sent-373, score-0.428]

73 K do λ(k), ub(k), opt ← LRROUND(αk, λ(k−1)); if opt then return ub(k); θ′ ← θ − A⊤λ(k); τ′ ← τ + λ(k)⊤b; lb(k), opt ← BEAMSEARCH(θ′, τ′, lb(k−1), βk); if opt then return lb(k); return maxk∈{1. [sent-379, score-0.815]

74 Output: optimal constrained score or lower bound. Figure 6: Two versions of optimal beam search: staged and alternating. [sent-388, score-0.983]

75 Staged runs Lagrangian relaxation to find the optimal λ, uses λ to compute upper bounds, and then repeatedly runs beam search with pruning sequence β1 . [sent-389, score-1.179]

76 Alternating switches between running a round of Lagrangian relaxation and a round of beam search with the updated λ. [sent-393, score-0.783]
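The alternating scheme can be sketched as below, under the same assumptions as the earlier snippets; best_hyperpath and run_beam_search are assumed callbacks (the latter wrapping the Figure 3 variant and recomputing ubs under the modified weights), so this is illustrative rather than the authors' implementation.

import numpy as np

def opt_beam_alternating(alphas, betas, theta, tau, A, b,
                         best_hyperpath, run_beam_search):
    lam = np.zeros(A.shape[0])
    lb = float("-inf")
    for alpha_k, beta_k in zip(alphas, betas):
        # one round of Lagrangian relaxation: tightens the upper bounds
        x, ub = best_hyperpath(theta - A.T @ lam, tau + lam @ b)
        if np.all(A @ x == b):
            return ub                          # dual certificate: ub is the optimum
        lam = lam - alpha_k * (A @ x - b)      # subgradient step
        # one round of beam search under the updated modified weights
        theta_mod = theta - A.T @ lam
        tau_mod = tau + lam @ b
        lb_k, primal_opt = run_beam_search(theta_mod, tau_mod, lb, beta_k)
        lb = max(lb, lb_k)
        if primal_opt:
            return lb_k                        # primal certificate: no pruning was used
    return lb                                  # otherwise, the best lower bound found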

77 (2012) relate column generation to beam search and produce exact solutions for parsing and tagging problems. [sent-399, score-0.757]

78 The latter work also gives conditions for when beam search-style decoding is optimal. [sent-400, score-0.647]

79 7 Results To evaluate the effectiveness of optimal beam search for translation decoding, we implemented decoders for phrase- and syntax-based models. [sent-401, score-0.922]

80 The performance of optimal beam search is dependent on the sequences α and β. [sent-417, score-0.754]

81 2 Baseline Methods The experiments compare optimal beam search (OPTBEAM) to several different decoding methods. [sent-423, score-0.879]

82 For both systems we compare to: BEAM, the beam search decoder from Figure 3 using the original weights θ and τ, and β ∈ {100, 1000}; LRTIGHT, Lagrangian relaxation followed by incre- Figure 7: Two graphs from phrase-based decoding. [sent-424, score-0.694]

83 Graph (b) shows the % of certificates found for sentences with differing gap sizes and beam search parameters β. [sent-426, score-0.7]

84 For phrase-based translation we compare with: MOSES-GC, the standard Moses beam search decoder with β ∈ {100, 1000} (Koehn et al. [sent-429, score-0.792]

85 For syntax-based translation we compare with: ILP, a general-purpose integer linear programming solver (Gurobi Optimization, 2013) and CUBEPRUNING, an approximate decoding method similar to beam search (Chiang, 2007), tested with β ∈ {100, 1000}. [sent-434, score-0.92]

86 For phrase-based translation, OPTBEAM decodes the optimal translation with certificate in 99% of sentences with an average time of 17. [sent-437, score-0.452]

87 CUBE (1000) finds more exact solutions, but is comparable in speed to optimal beam search. [sent-468, score-0.716]

88 2 shows the relationship between beam search optimality and duality gap. [sent-471, score-0.72]

89 Graph (b) shows how beam search is more likely to find optimal solutions with tighter bounds. [sent-473, score-0.858]

90 For both methods, beam search has the most time variance and uses more time on longer sentences. [sent-476, score-0.632]

91 For phrase-based sentences, Lagrangian relaxation is fast, and hypergraph construction dominates. [sent-477, score-0.337]

92 Table 2 breaks down the decoding time of optimal beam search, including: hypergraph construction, Lagrangian relaxation, and beam search. [sent-482, score-0.708]

93 8 Conclusion In this work we develop an optimal variant of beam search and apply it to machine translation decoding. [sent-487, score-0.938]

94 The algorithm uses beam search to produce constrained solutions and bounds from Lagrangian relaxation to eliminate non-optimal solutions. [sent-488, score-1.143]

95 Exact decoding of phrase-based translation models through lagrangian relaxation. [sent-501, score-0.493]

96 Pharaoh: a beam search decoder for phrase-based statistical machine translation models. [sent-559, score-0.792]

97 Revisiting optimal decoding for machine translation IBM model 4. [sent-584, score-0.374]

98 Exact decoding of syntactic translation models through lagrangian relaxation. [sent-594, score-0.493]

99 A tutorial on dual decomposition and lagrangian relaxation for inference in natural language processing. [sent-598, score-0.454]

100 Word reordering and a dynamic programming beam search algorithm for statistical machine translation. [sent-602, score-0.748]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('beam', 0.522), ('lagrangian', 0.241), ('sig', 0.219), ('certificate', 0.203), ('hypergraph', 0.186), ('hyperpath', 0.177), ('opt', 0.173), ('lb', 0.168), ('bounds', 0.16), ('relaxation', 0.151), ('ubs', 0.131), ('translation', 0.127), ('decoding', 0.125), ('optimal', 0.122), ('pruning', 0.114), ('search', 0.11), ('constrained', 0.107), ('hyperpaths', 0.102), ('rush', 0.102), ('upper', 0.094), ('ax', 0.09), ('hypotheses', 0.089), ('optbeam', 0.087), ('ub', 0.087), ('unconstrained', 0.086), ('signature', 0.074), ('exact', 0.072), ('bound', 0.066), ('subgradient', 0.066), ('dual', 0.062), ('hyperedge', 0.061), ('hyperedges', 0.061), ('outside', 0.059), ('lrround', 0.058), ('pauvre', 0.058), ('sigs', 0.058), ('variant', 0.057), ('collins', 0.057), ('optimality', 0.056), ('solutions', 0.053), ('tighter', 0.051), ('tillmann', 0.051), ('faster', 0.049), ('source', 0.049), ('vertex', 0.047), ('solution', 0.046), ('axx', 0.044), ('beamsearch', 0.044), ('demuni', 0.044), ('iglesias', 0.044), ('staged', 0.044), ('tighten', 0.044), ('violate', 0.043), ('signatures', 0.043), ('constraints', 0.042), ('returns', 0.042), ('return', 0.041), ('decoders', 0.041), ('algorithm', 0.04), ('dynamic', 0.04), ('rounds', 0.039), ('valid', 0.038), ('gispert', 0.038), ('certificates', 0.038), ('loosen', 0.038), ('provably', 0.038), ('koehn', 0.037), ('derivation', 0.037), ('translated', 0.037), ('property', 0.036), ('programming', 0.036), ('cube', 0.035), ('xe', 0.035), ('chang', 0.035), ('check', 0.035), ('prunes', 0.035), ('tail', 0.034), ('riedel', 0.034), ('decoder', 0.033), ('modified', 0.033), ('runs', 0.033), ('mx', 0.032), ('duality', 0.032), ('primal', 0.032), ('phrasebased', 0.032), ('moses', 0.031), ('chart', 0.031), ('gurobi', 0.03), ('gap', 0.03), ('vertices', 0.03), ('weights', 0.029), ('argmaxx', 0.029), ('astar', 0.029), ('banga', 0.029), ('belanger', 0.029), ('fulfill', 0.029), ('lagrangianrelaxation', 0.029), ('maxk', 0.029), ('maxx', 0.029), ('mxa', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999905 145 emnlp-2013-Optimal Beam Search for Machine Translation

Author: Alexander Rush ; Yin-Wen Chang ; Michael Collins

Abstract: Beam search is a fast and empirically effective method for translation decoding, but it lacks formal guarantees about search error. We develop a new decoding algorithm that combines the speed of beam search with the optimal certificate property of Lagrangian relaxation, and apply it to phrase- and syntax-based translation decoding. The new method is efficient, utilizes standard MT algorithms, and returns an exact solution on the majority of translation examples in our test data. The algorithm is 3.5 times faster than an optimized incremental constraint-based decoder for phrase-based translation and 4 times faster for syntax-based translation.

2 0.22894403 141 emnlp-2013-Online Learning for Inexact Hypergraph Search

Author: Hao Zhang ; Liang Huang ; Kai Zhao ; Ryan McDonald

Abstract: Online learning algorithms like the perceptron are widely used for structured prediction tasks. For sequential search problems, like left-to-right tagging and parsing, beam search has been successfully combined with perceptron variants that accommodate search errors (Collins and Roark, 2004; Huang et al., 2012). However, perceptron training with inexact search is less studied for bottom-up parsing and, more generally, inference over hypergraphs. In this paper, we generalize the violation-fixing perceptron of Huang et al. (2012) to hypergraphs and apply it to the cube-pruning parser of Zhang and McDonald (2012). This results in the highest reported scores on WSJ evaluation set (UAS 93.50% and LAS 92.41% respectively) without the aid of additional resources.

3 0.16915321 71 emnlp-2013-Efficient Left-to-Right Hierarchical Phrase-Based Translation with Improved Reordering

Author: Maryam Siahbani ; Baskaran Sankaran ; Anoop Sarkar

Abstract: Left-to-right (LR) decoding (Watanabe et al., 2006b) is a promising decoding algorithm for hierarchical phrase-based translation (Hiero). It generates the target sentence by extending the hypotheses only on the right edge. LR decoding has complexity O(n2b) for input of n words and beam size b, compared to O(n3) for the CKY algorithm. It requires a single language model (LM) history for each target hypothesis rather than two LM histories per hypothesis as in CKY. In this paper we present an augmented LR decoding algorithm that builds on the original algorithm in (Watanabe et al., 2006b). Unlike that algorithm, using experiments over multiple language pairs we show two new results: our LR decoding algorithm provides demonstrably more efficient decoding than CKY Hiero, four times faster; and by introducing new distortion and reordering features for LR decoding, it maintains the same translation quality (as in BLEU scores) ob- tained phrase-based and CKY Hiero with the same translation model.

4 0.16794349 146 emnlp-2013-Optimal Incremental Parsing via Best-First Dynamic Programming

Author: Kai Zhao ; James Cross ; Liang Huang

Abstract: We present the first provably optimal polynomial time dynamic programming (DP) algorithm for best-first shift-reduce parsing, which applies the DP idea of Huang and Sagae (2010) to the best-first parser of Sagae and Lavie (2006) in a non-trivial way, reducing the complexity of the latter from exponential to polynomial. We prove the correctness of our algorithm rigorously. Experiments confirm that DP leads to a significant speedup on a probablistic best-first shift-reduce parser, and makes exact search under such a model tractable for the first time.

5 0.14247407 127 emnlp-2013-Max-Margin Synchronous Grammar Induction for Machine Translation

Author: Xinyan Xiao ; Deyi Xiong

Abstract: Traditional synchronous grammar induction estimates parameters by maximizing likelihood, which only has a loose relation to translation quality. Alternatively, we propose a max-margin estimation approach to discriminatively inducing synchronous grammars for machine translation, which directly optimizes translation quality measured by BLEU. In the max-margin estimation of parameters, we only need to calculate Viterbi translations. This further facilitates the incorporation of various non-local features that are defined on the target side. We test the effectiveness of our max-margin estimation framework on a competitive hierarchical phrase-based system. Experiments show that our max-margin method significantly outperforms the traditional twostep pipeline for synchronous rule extraction by 1.3 BLEU points and is also better than previous max-likelihood estimation method.

6 0.11610742 128 emnlp-2013-Max-Violation Perceptron and Forced Decoding for Scalable MT Training

7 0.11542978 88 emnlp-2013-Flexible and Efficient Hypergraph Interactions for Joint Hierarchical and Forest-to-String Decoding

8 0.10001766 84 emnlp-2013-Factored Soft Source Syntactic Constraints for Hierarchical Machine Translation

9 0.093874425 85 emnlp-2013-Fast Joint Compression and Summarization via Graph Cuts

10 0.086507872 2 emnlp-2013-A Convex Alternative to IBM Model 2

11 0.078165188 15 emnlp-2013-A Systematic Exploration of Diversity in Machine Translation

12 0.07547047 201 emnlp-2013-What is Hidden among Translation Rules

13 0.075251587 66 emnlp-2013-Dynamic Feature Selection for Dependency Parsing

14 0.073719203 22 emnlp-2013-Anchor Graph: Global Reordering Contexts for Statistical Machine Translation

15 0.073505618 3 emnlp-2013-A Corpus Level MIRA Tuning Strategy for Machine Translation

16 0.070132196 104 emnlp-2013-Improving Statistical Machine Translation with Word Class Models

17 0.066520967 166 emnlp-2013-Semantic Parsing on Freebase from Question-Answer Pairs

18 0.064217009 50 emnlp-2013-Combining PCFG-LA Models with Dual Decomposition: A Case Study with Function Labels and Binarization

19 0.063531458 70 emnlp-2013-Efficient Higher-Order CRFs for Morphological Tagging

20 0.062200066 103 emnlp-2013-Improving Pivot-Based Statistical Machine Translation Using Random Walk


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.192), (1, -0.19), (2, 0.075), (3, 0.081), (4, 0.015), (5, 0.003), (6, 0.047), (7, -0.009), (8, 0.091), (9, 0.206), (10, -0.102), (11, -0.058), (12, -0.082), (13, 0.074), (14, 0.136), (15, -0.243), (16, 0.017), (17, 0.025), (18, -0.083), (19, 0.026), (20, -0.006), (21, 0.0), (22, -0.085), (23, -0.071), (24, 0.044), (25, -0.063), (26, 0.12), (27, 0.125), (28, 0.02), (29, 0.104), (30, -0.038), (31, 0.023), (32, 0.018), (33, 0.04), (34, -0.034), (35, -0.046), (36, 0.084), (37, 0.089), (38, -0.084), (39, -0.062), (40, 0.051), (41, 0.03), (42, -0.014), (43, 0.052), (44, 0.072), (45, 0.018), (46, -0.065), (47, -0.048), (48, -0.021), (49, 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95406133 145 emnlp-2013-Optimal Beam Search for Machine Translation

Author: Alexander Rush ; Yin-Wen Chang ; Michael Collins

Abstract: Beam search is a fast and empirically effective method for translation decoding, but it lacks formal guarantees about search error. We develop a new decoding algorithm that combines the speed of beam search with the optimal certificate property of Lagrangian relaxation, and apply it to phrase- and syntax-based translation decoding. The new method is efficient, utilizes standard MT algorithms, and returns an exact solution on the majority of translation examples in our test data. The algorithm is 3.5 times faster than an optimized incremental constraint-based decoder for phrase-based translation and 4 times faster for syntax-based translation.

2 0.80072922 141 emnlp-2013-Online Learning for Inexact Hypergraph Search

Author: Hao Zhang ; Liang Huang ; Kai Zhao ; Ryan McDonald

Abstract: Online learning algorithms like the perceptron are widely used for structured prediction tasks. For sequential search problems, like left-to-right tagging and parsing, beam search has been successfully combined with perceptron variants that accommodate search errors (Collins and Roark, 2004; Huang et al., 2012). However, perceptron training with inexact search is less studied for bottom-up parsing and, more generally, inference over hypergraphs. In this paper, we generalize the violation-fixing perceptron of Huang et al. (2012) to hypergraphs and apply it to the cube-pruning parser of Zhang and McDonald (2012). This results in the highest reported scores on WSJ evaluation set (UAS 93.50% and LAS 92.41% respectively) without the aid of additional resources.

3 0.79452658 146 emnlp-2013-Optimal Incremental Parsing via Best-First Dynamic Programming

Author: Kai Zhao ; James Cross ; Liang Huang

Abstract: We present the first provably optimal polynomial time dynamic programming (DP) algorithm for best-first shift-reduce parsing, which applies the DP idea of Huang and Sagae (2010) to the best-first parser of Sagae and Lavie (2006) in a non-trivial way, reducing the complexity of the latter from exponential to polynomial. We prove the correctness of our algorithm rigorously. Experiments confirm that DP leads to a significant speedup on a probablistic best-first shift-reduce parser, and makes exact search under such a model tractable for the first time.

4 0.72510785 128 emnlp-2013-Max-Violation Perceptron and Forced Decoding for Scalable MT Training

Author: Heng Yu ; Liang Huang ; Haitao Mi ; Kai Zhao

Abstract: While large-scale discriminative training has triumphed in many NLP problems, its definite success on machine translation has been largely elusive. Most recent efforts along this line are not scalable (training on the small dev set with features from the top ∼100 most frequent words) and overly complicated. We instead present a very simple yet theoretically motivated approach by extending the recent framework of "violation-fixing perceptron", using forced decoding to compute the target derivations. Extensive phrase-based translation experiments on both Chinese-to-English and Spanish-to-English tasks show substantial gains in BLEU by up to +2.3/+2.0 on dev/test over MERT, thanks to 20M+ sparse features. This is the first successful effort of large-scale online discriminative training for MT.

5 0.56544143 88 emnlp-2013-Flexible and Efficient Hypergraph Interactions for Joint Hierarchical and Forest-to-String Decoding

Author: Martin Cmejrek ; Haitao Mi ; Bowen Zhou

Abstract: Machine translation benefits from system combination. We propose flexible interaction of hypergraphs as a novel technique combining different translation models within one decoder. We introduce features controlling the interactions between the two systems and explore three interaction schemes of hiero and forest-to-string models—specification, generalization, and interchange. The experiments are carried out on large training data with strong baselines utilizing rich sets of dense and sparse features. All three schemes significantly improve results of any single system on four testsets. We find that specification—a more constrained scheme that almost entirely uses forest-to-string rules, but optionally uses hiero rules for shorter spans—comes out as the strongest, yielding improvement up to 0.9 (T -B )/2 points. We also provide a detailed experimental and qualitative analysis of the results.

6 0.56301033 71 emnlp-2013-Efficient Left-to-Right Hierarchical Phrase-Based Translation with Improved Reordering

7 0.43756503 127 emnlp-2013-Max-Margin Synchronous Grammar Induction for Machine Translation

8 0.4327364 66 emnlp-2013-Dynamic Feature Selection for Dependency Parsing

9 0.37627208 50 emnlp-2013-Combining PCFG-LA Models with Dual Decomposition: A Case Study with Function Labels and Binarization

10 0.3743946 201 emnlp-2013-What is Hidden among Translation Rules

11 0.34343645 107 emnlp-2013-Interactive Machine Translation using Hierarchical Translation Models

12 0.34182745 22 emnlp-2013-Anchor Graph: Global Reordering Contexts for Statistical Machine Translation

13 0.32309058 3 emnlp-2013-A Corpus Level MIRA Tuning Strategy for Machine Translation

14 0.30707973 15 emnlp-2013-A Systematic Exploration of Diversity in Machine Translation

15 0.30036092 2 emnlp-2013-A Convex Alternative to IBM Model 2

16 0.28896162 156 emnlp-2013-Recurrent Continuous Translation Models

17 0.27786061 190 emnlp-2013-Ubertagging: Joint Segmentation and Supertagging for English

18 0.27739257 187 emnlp-2013-Translation with Source Constituency and Dependency Trees

19 0.26101983 84 emnlp-2013-Factored Soft Source Syntactic Constraints for Hierarchical Machine Translation

20 0.26035613 85 emnlp-2013-Fast Joint Compression and Summarization via Graph Cuts


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.025), (9, 0.019), (18, 0.056), (22, 0.031), (26, 0.403), (30, 0.085), (43, 0.01), (45, 0.016), (50, 0.013), (51, 0.109), (66, 0.031), (71, 0.031), (75, 0.014), (77, 0.036), (90, 0.01), (95, 0.012), (96, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.95367569 141 emnlp-2013-Online Learning for Inexact Hypergraph Search

Author: Hao Zhang ; Liang Huang ; Kai Zhao ; Ryan McDonald

Abstract: Online learning algorithms like the perceptron are widely used for structured prediction tasks. For sequential search problems, like left-to-right tagging and parsing, beam search has been successfully combined with perceptron variants that accommodate search errors (Collins and Roark, 2004; Huang et al., 2012). However, perceptron training with inexact search is less studied for bottom-up parsing and, more generally, inference over hypergraphs. In this paper, we generalize the violation-fixing perceptron of Huang et al. (2012) to hypergraphs and apply it to the cube-pruning parser of Zhang and McDonald (2012). This results in the highest reported scores on WSJ evaluation set (UAS 93.50% and LAS 92.41% respectively) without the aid of additional resources.

same-paper 2 0.84264994 145 emnlp-2013-Optimal Beam Search for Machine Translation

Author: Alexander Rush ; Yin-Wen Chang ; Michael Collins

Abstract: Beam search is a fast and empirically effective method for translation decoding, but it lacks formal guarantees about search error. We develop a new decoding algorithm that combines the speed of beam search with the optimal certificate property of Lagrangian relaxation, and apply it to phrase- and syntax-based translation decoding. The new method is efficient, utilizes standard MT algorithms, and returns an exact solution on the majority of translation examples in our test data. The algorithm is 3.5 times faster than an optimized incremental constraint-based decoder for phrase-based translation and 4 times faster for syntax-based translation.

3 0.70978063 122 emnlp-2013-Learning to Freestyle: Hip Hop Challenge-Response Induction via Transduction Rule Segmentation

Author: Dekai Wu ; Karteek Addanki ; Markus Saers ; Meriem Beloucif

Abstract: We present a novel model, Freestyle, that learns to improvise rhyming and fluent responses upon being challenged with a line of hip hop lyrics, by combining both bottom-up token based rule induction and top-down rule segmentation strategies to learn a stochastic transduction grammar that simultaneously learns both phrasing and rhyming associations. In this attack on the woefully under-explored natural language genre of music lyrics, we exploit a strictly unsupervised transduction grammar induction approach. Our task is particularly ambitious in that no use of any a priori linguistic or phonetic information is allowed, even though the domain of hip hop lyrics is particularly noisy and unstructured. We evaluate the performance of the learned model against a model learned only using the more conventional bottom-up token based rule induction, and demonstrate the superiority of our combined token based and rule segmentation induction method toward generating higher quality improvised responses, measured on fluency and rhyming criteria as judged by human evaluators. To highlight some of the inherent challenges in adapting other algorithms to this novel task, we also compare the quality of the responses generated by our model to those generated by an out-of-the-box phrase based SMT system. We tackle the challenge of selecting appropriate training data for our task via a dedicated rhyme scheme detection module, which is also acquired via unsupervised learning and report improved quality of the generated responses. Finally, we report results with Maghrebi French hip hop lyrics indicating that our model performs surprisingly well with no special adaptation to other languages.

4 0.49006176 128 emnlp-2013-Max-Violation Perceptron and Forced Decoding for Scalable MT Training

Author: Heng Yu ; Liang Huang ; Haitao Mi ; Kai Zhao

Abstract: While large-scale discriminative training has triumphed in many NLP problems, its definite success on machine translation has been largely elusive. Most recent efforts along this line are not scalable (training on the small dev set with features from the top ∼100 most frequent words) and overly complicated. We instead present a very simple yet theoretically motivated approach by extending the recent framework of "violation-fixing perceptron", using forced decoding to compute the target derivations. Extensive phrase-based translation experiments on both Chinese-to-English and Spanish-to-English tasks show substantial gains in BLEU by up to +2.3/+2.0 on dev/test over MERT, thanks to 20M+ sparse features. This is the first successful effort of large-scale online discriminative training for MT.

5 0.43290928 146 emnlp-2013-Optimal Incremental Parsing via Best-First Dynamic Programming

Author: Kai Zhao ; James Cross ; Liang Huang

Abstract: We present the first provably optimal polynomial time dynamic programming (DP) algorithm for best-first shift-reduce parsing, which applies the DP idea of Huang and Sagae (2010) to the best-first parser of Sagae and Lavie (2006) in a non-trivial way, reducing the complexity of the latter from exponential to polynomial. We prove the correctness of our algorithm rigorously. Experiments confirm that DP leads to a significant speedup on a probablistic best-first shift-reduce parser, and makes exact search under such a model tractable for the first time.

6 0.40160128 66 emnlp-2013-Dynamic Feature Selection for Dependency Parsing

7 0.39867038 88 emnlp-2013-Flexible and Efficient Hypergraph Interactions for Joint Hierarchical and Forest-to-String Decoding

8 0.38661644 50 emnlp-2013-Combining PCFG-LA Models with Dual Decomposition: A Case Study with Function Labels and Binarization

9 0.38523912 107 emnlp-2013-Interactive Machine Translation using Hierarchical Translation Models

10 0.37890348 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization

11 0.37657499 40 emnlp-2013-Breaking Out of Local Optima with Count Transforms and Model Recombination: A Study in Grammar Induction

12 0.37461469 2 emnlp-2013-A Convex Alternative to IBM Model 2

13 0.37004909 150 emnlp-2013-Pair Language Models for Deriving Alternative Pronunciations and Spellings from Pronunciation Dictionaries

14 0.36695376 171 emnlp-2013-Shift-Reduce Word Reordering for Machine Translation

15 0.36299014 168 emnlp-2013-Semi-Supervised Feature Transformation for Dependency Parsing

16 0.35987148 157 emnlp-2013-Recursive Autoencoders for ITG-Based Translation

17 0.35903475 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging

18 0.35746744 187 emnlp-2013-Translation with Source Constituency and Dependency Trees

19 0.35571459 175 emnlp-2013-Source-Side Classifier Preordering for Machine Translation

20 0.35562682 38 emnlp-2013-Bilingual Word Embeddings for Phrase-Based Machine Translation