emnlp emnlp2013 emnlp2013-146 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kai Zhao ; James Cross ; Liang Huang
Abstract: We present the first provably optimal polynomial time dynamic programming (DP) algorithm for best-first shift-reduce parsing, which applies the DP idea of Huang and Sagae (2010) to the best-first parser of Sagae and Lavie (2006) in a non-trivial way, reducing the complexity of the latter from exponential to polynomial. We prove the correctness of our algorithm rigorously. Experiments confirm that DP leads to a significant speedup on a probabilistic best-first shift-reduce parser, and makes exact search under such a model tractable for the first time.
Reference: text
sentIndex sentText sentNum sentScore
1 1 Introduction Best-first parsing, such as A* parsing, makes constituent parsing efficient, especially for bottom-up CKY style parsing (Caraballo and Charniak, 1998; Klein and Manning, 2003; Pauls and Klein, 2009). [sent-6, score-0.476]
2 Traditional CKY parsing performs cubic time exact search over an exponentially large space. [sent-7, score-0.345]
3 Best-first parsing significantly speeds this up by always preferring to explore states with higher probabilities. [sent-8, score-0.393]
4 Unlike other very fast greedy parsers that produce suboptimal results, this best-first parser still guarantees optimality but requires exponential time for very long sentences in the worst case, which is intractable in practice. [sent-10, score-0.36]
5 Because it needs to explore an exponentially large space in the worst case, a bounded priority queue becomes necessary to ensure limited parsing time. [sent-11, score-0.933]
6 On the other hand, Huang and Sagae (2010) explore the idea of dynamic programming, which originated in bottom-up constituent parsing algorithms like Earley (1970), but in a beam-based, non best-first parser. [sent-17, score-0.313]
7 In each beam step, they enable state merging in a style similar to the dynamic programming in bottom-up constituent parsing, based on an equivalence relation defined upon feature values. [sent-18, score-0.539]
8 Although in theory they successfully reduced the underlying deductive system to polynomial time complexity, their merging method is limited in that the state merging is only between two states in the same beam step. [sent-19, score-0.779]
9 This significantly reduces the number of possible merges, because: 1) there are only a very limited number of states in the beam at the same time; 2) many states at different steps cannot be merged. [sent-20, score-0.674]
10 We merge states with the same feature set globally to further reduce the number of possible states in the search graph. [sent-22, score-0.428]
11 We make the following contributions: • theoretically, we formally prove that our DP best-first parsing reaches optimality with polynomial time complexity. [sent-24, score-0.493]
12 Here ℓ is the step index (for beam search), S is the stack, c is the prefix score of the derivation, and sc_a(x) is the score of action a from derivation x. [sent-33, score-0.526]
13 • empirically, with exact search our DP best-first parser is faster than the non-DP one, and with a bounded priority queue size it can reach optimality with a significantly smaller priority queue size bound than the non-DP best-first parser. [sent-38, score-1.32]
14 2 Shift-Reduce and Best-First Parsing In this section we review the basics of shift-reduce parsing, beam search, and the best-first shift-reduce parsing algorithm of Sagae and Lavie (2006). [sent-43, score-0.451]
15 Basically, shift-reduce parsing (Aho and Ullman, 1972) performs a left-to-right scan of the input sentence, and at each step, chooses either to shift the next word onto the stack, or to reduce, i.e., to combine the top two trees on the stack. [sent-46, score-0.333]
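As a concrete illustration, a derivation can be represented as a stack of partial trees plus the index of the next input word; the Python representation and action names in this sketch are assumptions for illustration only, not the paper's data structures.

```python
# Minimal sketch of the three shift-reduce actions on a derivation (stack, j).
# A tree is represented as (head_word, list_of_children); j indexes the next word.

def shift(stack, j, words):
    # push the next input word onto the stack as a single-node tree
    return stack + [(words[j], [])], j + 1

def reduce_keep_top(stack, j):
    # combine the top two trees, keeping the top tree s0 as the head
    s1, s0 = stack[-2], stack[-1]
    head, children = s0
    return stack[:-2] + [(head, [s1] + children)], j

def reduce_keep_second(stack, j):
    # combine the top two trees, keeping the second tree s1 as the head
    s1, s0 = stack[-2], stack[-1]
    head, children = s1
    return stack[:-2] + [(head, children + [s0])], j
```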
16 To improve on strictly greedy search, shift-reduce parsing is often enhanced with beam search (Zhang and Clark, 2008), where b derivations develop in parallel. [sent-53, score-0.715]
17 At each step we extend the derivations in the current beam by applying each of the three actions, and then choose the best b resulting derivations for the next step. [sent-54, score-0.588]
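A single beam step can be sketched as follows; `actions`, `apply_action`, and `score` are assumed helper functions (enumerate legal actions, apply one, and score a derivation) and are not defined in the paper.

```python
# Sketch of one step of beam search: expand every derivation in the current
# beam with every legal action, then keep only the b best successors.
def beam_step(beam, b, actions, apply_action, score):
    candidates = [apply_action(d, a) for d in beam for a in actions(d)]
    return sorted(candidates, key=score, reverse=True)[:b]
```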
18 The best-first parsing algorithm is an application of Dijkstra's algorithm over the DAG above, where the score of each derivation is the priority. [sent-57, score-0.653]
19 Dijkstra's algorithm requires the priority to satisfy the superiority property, which means a descendant derivation should never have a higher score than its ancestors. [sent-58, score-0.738]
20 score, where the order x ≺ y means derivation x has a higher priority than y. [sent-63, score-0.581]
21 The vanilla best-first parsing algorithm inherits optimality directly from Dijkstra's algorithm. [sent-64, score-0.417]
22 However, it explores exponentially many derivations to reach the goal configuration in the worst case. [sent-65, score-0.387]
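The vanilla algorithm can be sketched as a Dijkstra-style loop over individual derivations; `is_goal`, `successors`, and `score` are assumed helpers, and without state merging the queue may come to hold exponentially many items.

```python
import heapq, itertools

# Sketch of vanilla (non-DP) best-first search over derivations.
# Scores are log-probabilities, so we negate them for Python's min-heap.
def best_first_parse(init, is_goal, successors, score):
    tie = itertools.count()                        # tie-breaker: never compare derivations
    queue = [(-score(init), next(tie), init)]
    while queue:
        _, _, x = heapq.heappop(queue)
        if is_goal(x):
            return x                               # first goal popped is optimal (superiority)
        for y in successors(x):
            heapq.heappush(queue, (-score(y), next(tie), y))
    return None
```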
23 Dynamic Programming Notations: The key innovation of this paper is to extend best-first parsing with the "state-merging" method of dynamic programming described in Huang and Sagae (2010). [sent-70, score-0.414]
24 The notation fk (sk) is used to indicate the features used by the parser from the tree sk on the stack. [sent-83, score-0.358]
25 Following Huang and Sagae (2010), this feature set of a derivation x is called its atomic features fe(x), defined as the smallest set of features s.t. all features used by the parser are determined by them. [sent-85, score-0.381]
26 Since only the top d+1 trees on the stack are used in atomic features, we only need to remember the necessary information and write the state as: ⟨i, j, s_d ... s_0⟩. [sent-108, score-0.298]
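In an implementation, such a state can be realized as a hashable signature built from the span and the atomic features of only the top d+1 trees; the concrete representation below (attribute names and the feature function f) is an illustrative assumption.

```python
# Sketch: two derivations belong to the same state iff they share this signature.
def state_signature(deriv, f, d):
    # deriv.i, deriv.j: span of the top tree; deriv.stack: list of partial trees
    top = deriv.stack[-(d + 1):]                 # only the top d+1 trees are visible
    feats = tuple(f(k, t) for k, t in enumerate(reversed(top)))   # f_0(s0), ..., f_d(sd)
    return (deriv.i, deriv.j) + feats
```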
27 In the rest of this paper, we always denote derivations with letters x, y, and z, and denote states with letters p, q, and r. [sent-113, score-0.365]
28 The deductive system for dynamic programming best-first parsing is adapted from Huang and Sagae (2010). [sent-114, score-0.587]
29 However, in practice we use one state’s best derivation found so far to represent the state. [sent-118, score-0.292]
30 Each state p keeps the prefix score p.pre, which is the score of the derivation to reach this state, and the inside score, p.ins. [sent-120, score-0.396]
31 With this deductive system we extend the concept of reducible states with the following definitions: The set of all states with which a state p can legally reduce from the right is denoted L(p), or left states. [sent-127, score-0.865]
32 L(p) ≜ { ⟨h, i, s'_d ... s'_0⟩ | f_k(s'_{k−1}) = f_k(s_k), ∀k ∈ [1, d] }  (1), in which the span of the "left" state's top tree ends where that of the "right" state's top tree begins, and f_k(s_k) = f_k(s'_{k−1}) for all k ∈ [1, d]. [sent-135, score-0.47]
33 The set of all states with which a state p can legally reduce from the left is denoted R(p), or right states; it is defined symmetrically to Equation (1), with the right state's span beginning where p's span ends. [sent-137, score-0.376]
34 Algorithm 1: We constrain the search time with a polynomial bound by transforming the original search graph with exponentially many derivations into a graph with a polynomial number of states. [sent-139, score-0.553]
35 In Algorithm 1, we maintain a chart C and a priority queue Q, both of which are based on hash tables. [sent-140, score-0.597]
36 In practice, we use the atomic features fe(p) as the signature of state p, since all derivations in the same state share the same atomic features. [sent-142, score-0.679]
37 Figure 2: The deductive system for dynamic programming best-first shift-reduce parsing (Huang and Sagae, 2010) (left, omitting the rey case), compared to weighted Earley parsing (Stolcke, 1995) (right). [sent-157, score-0.528]
39 Figure 3: Illustrations of left states L(p), right states R(p), and left corner states T (p). [sent-169, score-0.755]
40 Left corner states T(p) is the set of states that have the same reducibility as the shifted state p. [sent-175, score-0.294]
41 We use C[p] to retrieve the derivation in C that is associated with state p. [sent-181, score-0.417]
42 We sometimes abuse this notation to say C[x] to retrieve the derivation associated with signature fe(x) for derivation x. [sent-182, score-0.625]
43 Priority queue Q is defined similarly to C, except that it supports the operation POP that pops the highest priority item. [sent-187, score-0.521]
44 Following Stolcke (1995) and Nederhof (2003), we use the prefix score and the inside score as the priority in Q: x ≺ y ⇔ (x.pre, x.ins) > (y.pre, y.ins), i.e., derivations are compared first by prefix score and then by inside score (Equation 2). [sent-188, score-0.397]
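In a heap-based implementation this order can be expressed as a sort key (a sketch; `pre` and `ins` are assumed to be log-probability scores where larger is better, negated for a min-heap).

```python
# Sketch of the priority of Equation (2): compare prefix scores first,
# break ties by inside score; negation turns "higher is better" into a min-heap key.
def priority_key(x):
    return (-x.pre, -x.ins)
```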
45 In the DP best-first parsing algorithm, once a derivation x is popped from the priority queue Q, as usual we try to expand it with shift and reduce. [sent-198, score-1.221]
46 Note that both left and right reduces are between the derivation x of state p = [x]∼ and an in-chart derivation y of left state q = [y]∼ ∈ L(p) (Line 10 of Algorithm 1), as shown in the deductive system (Figure 2). [sent-199, score-1.095]
47 We further expand derivation x of state p with some in-chart derivation z of state r s.t. r ∈ R(p). [sent-201, score-0.834]
48 Such a derivation z is already in the chart (Line 11) because it is the descendant of some other derivation that has been explored before x. [sent-208, score-0.331]
49 Let LC(x) ≜ C[L([x]∼)] be the in-chart derivations of [x]∼'s left states, and let RC(x) ≜ C[R([x]∼)] be the in-chart derivations of [x]∼'s right states. [sent-212, score-0.568]
50 Algorithm 2: Lazy Expansion. We further improve DP best-first parsing with lazy expansion. [sent-227, score-0.319]
51 Assume a shifted derivation x of state p is a direct descendant of derivation x0 of state p0; then p ∈ R(p0), and we have: ∀y s.t. [sent-229, score-0.957]
52 [y]∼ = q ∈ REDUCE(L(p), T(p)), x ≺ y, where T(p) is the left corner states of the shifted state p, defined here as [sent-235, score-0.506]
53 T(p) ≜ { ⟨i, i+1, s'_d ... s'_0⟩ | f_k(s'_k) = f_k(s_k), ∀k ∈ [1, d] }, which represents the set of all states that have the same reducibility as the shifted state p. [sent-241, score-0.78]
54 function PARSE(w0 ... wn−1): C ← ∅ (empty chart); Q ← {INIT} (initial priority queue); while Q ≠ ∅ do: x ← POP(Q); if GOAL(x) then return x (found best parse); if [x]∼ ∉ C then: C[x] ← x (add x to chart); SHIFT(x, Q); REDUCE(LC(x), {x}); REDUCE({x}, RC(x)). [sent-253, score-0.701]
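The same control flow, in the eager (Algorithm 1) style where both left and right reductions happen as soon as a derivation is charted, can be sketched in Python as below; every helper (`signature`, `is_goal`, `priority_key`, `shift_succ`, `left_states`, `right_states`, `reduce_with`) is an assumed placeholder, not the authors' implementation.

```python
import heapq, itertools

def dp_best_first_parse(init, signature, is_goal, priority_key,
                        shift_succ, left_states, right_states, reduce_with):
    chart = {}                                    # signature -> best derivation of that state
    tie = itertools.count()
    queue = [(priority_key(init), next(tie), init)]
    while queue:
        _, _, x = heapq.heappop(queue)
        if is_goal(x):
            return x                              # first goal popped is the optimal parse
        sig = signature(x)
        if sig in chart:
            continue                              # the state already has a best derivation
        chart[sig] = x                            # fill the chart with x
        successors = [shift_succ(x)]              # SHIFT
        for q in left_states(sig):                # reduce x with in-chart left derivations
            if q in chart:
                successors.append(reduce_with(chart[q], x))
        for r in right_states(sig):               # reduce in-chart right derivations with x
            if r in chart:
                successors.append(reduce_with(x, chart[r]))
        for y in successors:
            if y is not None:
                heapq.heappush(queue, (priority_key(y), next(tie), y))
    return None
```

Algorithm 2 (lazy expansion) would differ only in postponing the right-reduction loop until the shifted derivation itself is popped.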
55 Based on this observation, we can safely delay the REDUCE({x}, RC(x)) operation (Line 11 in Algorithm 1), until the derivation x of a shifted state is popped out from Q. [sent-269, score-0.649]
56 We can delay even more derivations by extending the concept of left corner states to reduced states. [sent-271, score-0.552]
57 Suppose q has s0 as its left corner, and p, q share the same left states; then derivations of p should always have a higher priority than derivations of q. [sent-274, score-0.879]
58 We can further delay the generation of q’s derivations until p’s derivations are popped out. [sent-275, score-0.568]
59 DP best-first parsing enables state merging on the previous graph. [sent-284, score-0.394]
60 Proof of Optimality: We define a best derivation of state [x]∼ as a derivation x such that ∀y ∈ [x]∼, x ⪯ y. [sent-295, score-0.709]
61 We want to prove that Algorithm 1 actually fills the chart by assigning a best derivation to its state. [sent-298, score-0.469]
62 This is because we use a pair of prefix score and inside score, (pre, ins), as the priority (Equation 2) in the deductive system (Figure 2). [sent-305, score-0.543]
63 After derivation xk has been filled into the chart, ∀x s.t. [sent-307, score-0.366]
64 x ∈ Q, and x is a best derivation of state [x]∼, then x's descendants cannot have a higher priority than xk. [sent-309, score-0.706]
65 With induction we can easily show that any descendant of x cannot have a higher priority than xk. [sent-331, score-0.323]
66 Consider this derivation transition hypergraph, which starts at initial derivation x0 ∈ Ci−1, and ends at x ∉ Ci−1. [sent-348, score-0.584]
67 However, from Lemma 1, none of x0’s descendants can have a higher priority than xi, which contradicts x ≺ xi. [sent-367, score-0.351]
68 Stepwise Optimality means Algorithm 1 only fills the chart with a best derivation for each state. [sent-380, score-0.438]
69 Stepwise Completeness means every state whose best derivation is better than the best derivation of pi must have been filled before pi. [sent-381, score-0.772]
70 The first goal derivation popped off the priority queue is the optimal parse. [sent-395, score-0.888]
71 The key observation is that we delay the reduction of derivation x0 and a derivation of its right states R([x0]∼) (Line 11 of Algorithm 1), until the shifted derivation, x = sh(x0), is popped out (Line 11 of Algorithm 2). [sent-403, score-0.979]
72 However, this delayed reduction will not generate any derivation y s.t. y ≺ x. [sent-404, score-0.321]
73 Dynamic programming best-first parsing runs in worst-case polynomial time and space, as long as the atomic feature function satisfies: • bounded: ∀ derivation x, |fe(x)| is bounded by a constant. [sent-418, score-0.804]
74 • monotonic: – horizontal: ∀k, fk(s) = fk(t) ⇒ fk+1(s) = fk+1(t), for all possible trees s, t. [sent-419, score-0.322]
75 – vertical: ∀k, fk(s y s') = fk(t y t') ⇒ fk(s) = fk(t), and fk(s x s') = fk(t x t') ⇒ fk(s') = fk(t'). [sent-420, score-0.776]
76 This is necessary because we use the features as signature to match all possible left states and right states (Equation 1). [sent-423, score-0.499]
77 6 Experiments: In the experiments we compare our DP best-first parsing with non-DP best-first parsing, pure greedy parsing, and the beam parser of Huang and Sagae (2010). [sent-435, score-0.558]
78 We collect gold actions at different parsing configurations as positive examples. [sent-437, score-0.426]
79 Table 1: Dynamic programming best-first parsing reaches optimality faster. [sent-456, score-0.538]
80 To compare the performance we measure two sets of criteria: 1) the internal criteria consist of the model score of the parsing result, and the number of states explored; 2) the external criteria consist of the unlabeled accuracy of the parsing result, and the parsing time. [sent-466, score-0.962]
81 Search Quality & Speed: We first compare the DP best-first parsing algorithm with pure greedy parsing and non-DP best-first parsing without any extra constraints. [sent-471, score-0.843]
82 Compared to greedy parsing, DP best-first parsing reaches a significantly higher accuracy, with ∼2 times more parsing time. [sent-477, score-0.559]
83 Given the extra time spent maintaining the priority queue, this is consistent with the internal criteria: DP best-first parsing reaches a significantly higher model score, which is actually optimal, exploring twice as many states. [sent-478, score-0.559]
84 Beam parser (beam size 8) guarantees linear parsing time. [sent-481, score-0.336]
85 Furthermore, on average our DP best-first parsing is significantly faster than the beam parser, because most sentences are short. [sent-485, score-0.406]
86 As the time complexity grows exponentially with the sentence length, non-DP best-first parsing takes an extremely long time for long sentences. [sent-487, score-0.34]
87 In general DP best-first parsing manages to reach optimality in tractable time with exact search. [sent-489, score-0.443]
88 To further investigate the potential of this DP best-first parsing, we perform inexact search experiments with a bounded priority queue. [sent-490, score-0.514]
89 Parsing with Bounded Priority Queue: A bounded priority queue is a very practical choice when we want to parse with only limited memory. [sent-492, score-0.493]
90 We bound the priority queue size at 1, 2, 5, 10, 20, 50, 100, 500, and 1000, and once the priority queue size exceeds the bound, we discard the worst one in the priority queue. [sent-493, score-1.378]
91 The performance of non-DP best-first parsing and DP best-first parsing is illustrated in Figure 6 (a) (b). [sent-494, score-0.476]
92 Firstly, in Figure 6 (a), our DP best-first parsing reaches the optimal model score with bound 50, while non-DP best-first parsing fails even with bound 1000. [sent-495, score-0.653]
93 Lastly, we also compare to the beam parser with beam sizes 1, 2, 4, and 8. [sent-500, score-0.404]
94 Figure 6 (a) shows that the beam parser fails to reach optimality, while exploring significantly more states. [sent-501, score-0.307]
95 On the other hand, the beam parser also fails to reach an accuracy as high as the best-first parsers. [sent-502, score-0.307]
96 7 Conclusions and Future Work We have presented a dynamic programming algorithm for best-first shift-reduce parsing which is guaranteed to return the optimal solution in polynomial time. [sent-509, score-0.543]
97 Figure 6: (b) parsing accuracy vs. # of states explored (edge-factored); (d) parsing accuracy vs. parsing time (secs). [sent-517, score-0.476]
99 Note that to implement the bounded priority queue we use two priority queues to keep track of the worst elements, which introduces extra overhead, so that our bounded parser is slower than the unbounded version for large priority queue size bounds. [sent-522, score-1.582]
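One way to get this double-ended behavior is to index the same live items from a best-first heap and a worst-first heap with lazy deletion; the sketch below illustrates that idea under these assumptions and is not the authors' implementation.

```python
import heapq

class BoundedPriorityQueue:
    """Sketch: hold at most `bound` items; pushing past the bound evicts the worst.
    Two heaps index the same items; stale heap entries are skipped lazily."""
    def __init__(self, bound):
        self.bound = bound
        self.best = []        # min-heap on key         -> pops the best item
        self.worst = []       # min-heap on negated key -> pops the worst item
        self.live = {}        # uid -> item still in the queue
        self.next_uid = 0

    def push(self, key, item):
        uid, self.next_uid = self.next_uid, self.next_uid + 1
        self.live[uid] = item
        heapq.heappush(self.best, (key, uid))
        heapq.heappush(self.worst, (tuple(-k for k in key), uid))
        if len(self.live) > self.bound:
            self._discard_worst()

    def _discard_worst(self):
        while self.worst:
            _, uid = heapq.heappop(self.worst)
            if uid in self.live:
                del self.live[uid]
                return

    def pop_best(self):
        while self.best:
            _, uid = heapq.heappop(self.best)
            if uid in self.live:
                return self.live.pop(uid)
        return None
```

Here `key` is assumed to be a numeric tuple like the `(-pre, -ins)` priority sketched earlier; the second heap is exactly the extra overhead the sentence above refers to.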
100 An efficient probabilistic context-free parsing algorithm that computes prefix probabilities. [sent-585, score-0.325]
wordName wordTfidf (topN-words)
[('dp', 0.328), ('derivation', 0.292), ('priority', 0.289), ('parsing', 0.238), ('derivations', 0.21), ('sagae', 0.206), ('queue', 0.204), ('fk', 0.194), ('deductive', 0.179), ('beam', 0.168), ('states', 0.155), ('optimality', 0.134), ('stepwise', 0.127), ('state', 0.125), ('earley', 0.113), ('chart', 0.104), ('popped', 0.103), ('bounded', 0.096), ('shift', 0.095), ('programming', 0.095), ('polynomial', 0.09), ('atomic', 0.089), ('fe', 0.089), ('left', 0.085), ('shifted', 0.084), ('bestfirst', 0.081), ('knuth', 0.081), ('lazy', 0.081), ('tryadd', 0.081), ('hj', 0.077), ('dynamic', 0.075), ('xk', 0.074), ('reach', 0.071), ('scsh', 0.071), ('reduce', 0.07), ('parser', 0.068), ('theorem', 0.068), ('cky', 0.067), ('proof', 0.067), ('right', 0.063), ('hi', 0.061), ('sh', 0.061), ('huang', 0.061), ('exponentially', 0.059), ('ci', 0.059), ('corner', 0.057), ('dijkstra', 0.057), ('bound', 0.056), ('maxent', 0.055), ('sk', 0.055), ('completeness', 0.054), ('stack', 0.052), ('rey', 0.052), ('greedy', 0.051), ('screx', 0.049), ('secs', 0.049), ('search', 0.048), ('worst', 0.047), ('delay', 0.045), ('algorithm', 0.045), ('equivalence', 0.045), ('expansion', 0.044), ('hypergraph', 0.043), ('complexity', 0.043), ('fills', 0.042), ('lxe', 0.042), ('prefix', 0.042), ('tree', 0.041), ('signature', 0.041), ('lavie', 0.04), ('superiority', 0.04), ('descendant', 0.039), ('xi', 0.038), ('descendants', 0.034), ('expansions', 0.034), ('simulating', 0.034), ('pure', 0.033), ('score', 0.033), ('kuhlmann', 0.033), ('lefts', 0.033), ('legally', 0.033), ('qe', 0.033), ('rex', 0.033), ('xif', 0.033), ('pushed', 0.032), ('reaches', 0.032), ('trees', 0.032), ('merging', 0.031), ('prove', 0.031), ('parsers', 0.03), ('criteria', 0.03), ('guarantees', 0.03), ('reduction', 0.029), ('reduces', 0.028), ('reducibility', 0.028), ('pops', 0.028), ('caraballo', 0.028), ('aho', 0.028), ('contradicts', 0.028), ('ile', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 146 emnlp-2013-Optimal Incremental Parsing via Best-First Dynamic Programming
Author: Kai Zhao ; James Cross ; Liang Huang
Abstract: We present the first provably optimal polynomial time dynamic programming (DP) algorithm for best-first shift-reduce parsing, which applies the DP idea of Huang and Sagae (2010) to the best-first parser of Sagae and Lavie (2006) in a non-trivial way, reducing the complexity of the latter from exponential to polynomial. We prove the correctness of our algorithm rigorously. Experiments confirm that DP leads to a significant speedup on a probablistic best-first shift-reduce parser, and makes exact search under such a model tractable for the first time.
2 0.16794349 145 emnlp-2013-Optimal Beam Search for Machine Translation
Author: Alexander Rush ; Yin-Wen Chang ; Michael Collins
Abstract: Beam search is a fast and empirically effective method for translation decoding, but it lacks formal guarantees about search error. We develop a new decoding algorithm that combines the speed of beam search with the optimal certificate property of Lagrangian relaxation, and apply it to phrase- and syntax-based translation decoding. The new method is efficient, utilizes standard MT algorithms, and returns an exact solution on the majority of translation examples in our test data. The algorithm is 3.5 times faster than an optimized incremental constraint-based decoder for phrase-based translation and 4 times faster for syntax-based translation.
3 0.14611612 128 emnlp-2013-Max-Violation Perceptron and Forced Decoding for Scalable MT Training
Author: Heng Yu ; Liang Huang ; Haitao Mi ; Kai Zhao
Abstract: While large-scale discriminative training has triumphed in many NLP problems, its definite success on machine translation has been largely elusive. Most recent efforts along this line are not scalable (training on the small dev set with features from top ∼100 most frequent wt woridths) f eaantdu overly complicated. oWste f iren-stead present a very simple yet theoretically motivated approach by extending the recent framework of “violation-fixing perceptron”, using forced decoding to compute the target derivations. Extensive phrase-based translation experiments on both Chinese-to-English and Spanish-to-English tasks show substantial gains in BLEU by up to +2.3/+2.0 on dev/test over MERT, thanks to 20M+ sparse features. This is the first successful effort of large-scale online discriminative training for MT. 1Introduction Large-scale discriminative training has witnessed great success in many NLP problems such as parsing (McDonald et al., 2005) and tagging (Collins, 2002), but not yet for machine translation (MT) despite numerous recent efforts. Due to scalability issues, most of these recent methods can only train on a small dev set of about a thousand sentences rather than on the full training set, and only with 2,000–10,000 rather “dense-like” features (either unlexicalized or only considering highest-frequency words), as in MIRA (Watanabe et al., 2007; Chiang et al., 2008; Chiang, 2012), PRO (Hopkins and May, 2011), and RAMP (Gimpel and Smith, 2012). However, it is well-known that the most important features for NLP are lexicalized, most of which can not ∗ Work done while visiting City University of New York. Corresponding author. † 1112 be seen on a small dataset. Furthermore, these methods often involve complicated loss functions and intricate choices of the “target” derivations to update towards or against (e.g. k-best/forest oracles, or hope/fear derivations), and are thus hard to replicate. As a result, the classical method of MERT (Och, 2003) remains the default training algorithm for MT even though it can only tune a handful of dense features. See also Section 6 for other related work. As a notable exception, Liang et al. (2006) do train a structured perceptron model on the training data with sparse features, but fail to outperform MERT. We argue this is because structured perceptron, like many structured learning algorithms such as CRF and MIRA, assumes exact search, and search errors inevitably break theoretical properties such as convergence (Huang et al., 2012). Empirically, it is now well accepted that standard perceptron performs poorly when search error is severe (Collins and Roark, 2004; Zhang et al., 2013). To address the search error problem we propose a very simple approach based on the recent framework of “violation-fixing perceptron” (Huang et al., 2012) which is designed specifically for inexact search, with a theoretical convergence guarantee and excellent empirical performance on beam search parsing and tagging. The basic idea is to update when search error happens, rather than at the end of the search. To adapt it to MT, we extend this framework to handle latent variables corresponding to the hidden derivations. We update towards “gold-standard” derivations computed by forced decoding so that each derivation leads to the exact reference translation. Forced decoding is also used as a way of data selection, since those reachable sentence pairs are generally more literal and of higher quality, which the training should focus on. 
When the reachable subset is small for some language pairs, we augment Proce Sdeiantgtlse o,f W thaesh 2i0n1gt3o nC,o UnSfeAre,n 1c8e- o2n1 E Omctpoibriecra 2l0 M13et.h ?oc d2s0 i1n3 N Aastusorcaila Ltiaon g fuoarg Ceo Pmrpoucetastsi on ga,l p Laignegsu 1is1t1ic2s–1 23, it by including reachable prefix-pairs when the full sentence pair is not. We make the following contributions: 1. Our work is the first successful effort to scale online structured learning to a large portion of the training data (as opposed to the dev set). 2. Our work is the first to use a principled learning method customized for inexact search which updates on partial derivations rather than full ones in order to fix search errors. We adapt it to MT using latent variables for derivations. 3. Contrary to the common wisdom, we show that simply updating towards the exact reference translation is helpful, which is much simpler than k-best/forest oracles or loss-augmented (e.g. hope/fear) derivations, avoiding sentencelevel BLEU scores or other loss functions. 4. We present a convincing analysis that it is the search errors and standard perceptron’s inability to deal with them that prevent previous work, esp. Liang et al. (2006), from succeeding. 5. Scaling to the training data enables us to engineer a very rich feature set of sparse, lexicalized, and non-local features, and we propose various ways to alleviate overfitting. For simplicity and efficiency reasons, in this paper we use phrase-based translation, but our method has the potential to be applicable to other translation paradigms. Extensive experiments on both Chineseto-English and Spanish-to-English tasks show statistically significant gains in BLEU by up to +2.3/+2.0 on dev/test over MERT, and up to +1.5/+1.5 over PRO, thanks to 20M+ sparse features. 2 Phrase-Based MT and Forced Decoding We first review the basic phrase-based decoding algorithm (Koehn, 2004), which will be adapted for forced decoding. 2.1 Background: Phrase-based Decoding We will use the following running example from Chinese to English from Mi et al. (2008): 0123456 Figure 1: Standard beam-search phrase-based decoding. B `ush´ ı y uˇ Sh¯ al´ ong j ˇux ´ıng le hu` ıt´ an Bush with Sharon hold -ed meeting ‘Bush held a meeting with Sharon’ Phrase-based decoders generate partial targetlanguage outputs in left-to-right order in the form of hypotheses (or states) (Koehn, 2004). Each hypothesis has a coverage vector capturing the sourcelanguage words translated so far, and can be extended into a longer hypothesis by a phrase-pair translating an uncovered segment. For example, the following is one possible derivation: (• 3(• •() • :1( •s063),:“(Bs)u2s:,h)“(hBs:e1ul(d,s0“ht,aB“hleuk”ls) hdw”t)ailhkrsS1”h)aro2n”)r3 where a • in the coverage vector indicates the source wwoherdre a at •th i ns position aisg e“ vcoecvteorred in”d iacnadte ws thheer seo euarcche si is the score of each state, each adding the rule score and the distortion cost (dc) to the score of the previous state. To compute the distortion cost we also need to maintain the ending position of the last phrase (e.g., the 3 and 6 in the coverage vectors). In phrase-based translation there is also a distortionlimit which prohibits long-distance reorderings. 
The above states are called −LM states since they do Tnhoet ainbovovleve st language mlleodd −el LcMos tsst.a eTso iandcde a beiygram model, we split each −LM state into a series ogrfa +mL mMo states; ee sapchli t+ eaLcMh −staLtMe h satsa ttehe in ftoor ma (v,a) where a is the last word of the hypothesis. Thus a +LM version of the above derivation might be: (• 3(• ,(•Sh1a•(r6o0,nta)l:ks,()Bsu:03sh,(s“<)s02
4 0.12990567 71 emnlp-2013-Efficient Left-to-Right Hierarchical Phrase-Based Translation with Improved Reordering
Author: Maryam Siahbani ; Baskaran Sankaran ; Anoop Sarkar
Abstract: Left-to-right (LR) decoding (Watanabe et al., 2006b) is a promising decoding algorithm for hierarchical phrase-based translation (Hiero). It generates the target sentence by extending the hypotheses only on the right edge. LR decoding has complexity O(n2b) for input of n words and beam size b, compared to O(n3) for the CKY algorithm. It requires a single language model (LM) history for each target hypothesis rather than two LM histories per hypothesis as in CKY. In this paper we present an augmented LR decoding algorithm that builds on the original algorithm in (Watanabe et al., 2006b). Unlike that algorithm, using experiments over multiple language pairs we show two new results: our LR decoding algorithm provides demonstrably more efficient decoding than CKY Hiero, four times faster; and by introducing new distortion and reordering features for LR decoding, it maintains the same translation quality (as in BLEU scores) ob- tained phrase-based and CKY Hiero with the same translation model.
5 0.12349408 141 emnlp-2013-Online Learning for Inexact Hypergraph Search
Author: Hao Zhang ; Liang Huang ; Kai Zhao ; Ryan McDonald
Abstract: Online learning algorithms like the perceptron are widely used for structured prediction tasks. For sequential search problems, like left-to-right tagging and parsing, beam search has been successfully combined with perceptron variants that accommodate search errors (Collins and Roark, 2004; Huang et al., 2012). However, perceptron training with inexact search is less studied for bottom-up parsing and, more generally, inference over hypergraphs. In this paper, we generalize the violation-fixing perceptron of Huang et al. (2012) to hypergraphs and apply it to the cube-pruning parser of Zhang and McDonald (2012). This results in the highest reported scores on WSJ evaluation set (UAS 93.50% and LAS 92.41% respectively) without the aid of additional resources.
6 0.11557831 127 emnlp-2013-Max-Margin Synchronous Grammar Induction for Machine Translation
7 0.10030627 50 emnlp-2013-Combining PCFG-LA Models with Dual Decomposition: A Case Study with Function Labels and Binarization
8 0.095897615 66 emnlp-2013-Dynamic Feature Selection for Dependency Parsing
9 0.08910457 166 emnlp-2013-Semantic Parsing on Freebase from Question-Answer Pairs
10 0.073755138 10 emnlp-2013-A Multi-Teraflop Constituency Parser using GPUs
11 0.066550657 116 emnlp-2013-Joint Parsing and Disfluency Detection in Linear Time
12 0.066232495 164 emnlp-2013-Scaling Semantic Parsers with On-the-Fly Ontology Matching
13 0.065354422 171 emnlp-2013-Shift-Reduce Word Reordering for Machine Translation
14 0.061477408 190 emnlp-2013-Ubertagging: Joint Segmentation and Supertagging for English
15 0.060399368 168 emnlp-2013-Semi-Supervised Feature Transformation for Dependency Parsing
16 0.059345275 22 emnlp-2013-Anchor Graph: Global Reordering Contexts for Statistical Machine Translation
17 0.053708527 19 emnlp-2013-Adaptor Grammars for Learning Non-Concatenative Morphology
18 0.052777909 157 emnlp-2013-Recursive Autoencoders for ITG-Based Translation
19 0.051125918 88 emnlp-2013-Flexible and Efficient Hypergraph Interactions for Joint Hierarchical and Forest-to-String Decoding
20 0.049313433 40 emnlp-2013-Breaking Out of Local Optima with Count Transforms and Model Recombination: A Study in Grammar Induction
topicId topicWeight
[(0, -0.161), (1, -0.106), (2, 0.054), (3, 0.058), (4, -0.058), (5, 0.04), (6, 0.017), (7, -0.003), (8, 0.088), (9, 0.25), (10, -0.1), (11, -0.04), (12, -0.134), (13, 0.103), (14, 0.139), (15, -0.143), (16, -0.034), (17, 0.029), (18, -0.129), (19, 0.002), (20, 0.025), (21, -0.091), (22, 0.023), (23, -0.02), (24, 0.05), (25, -0.022), (26, 0.128), (27, 0.173), (28, 0.077), (29, 0.073), (30, -0.064), (31, 0.081), (32, 0.068), (33, 0.027), (34, -0.067), (35, 0.035), (36, 0.019), (37, 0.02), (38, -0.111), (39, -0.01), (40, 0.03), (41, -0.054), (42, -0.079), (43, 0.101), (44, 0.013), (45, 0.045), (46, 0.051), (47, 0.005), (48, -0.047), (49, 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.97602779 146 emnlp-2013-Optimal Incremental Parsing via Best-First Dynamic Programming
2 0.75753987 141 emnlp-2013-Online Learning for Inexact Hypergraph Search
3 0.74654251 145 emnlp-2013-Optimal Beam Search for Machine Translation
4 0.55800438 50 emnlp-2013-Combining PCFG-LA Models with Dual Decomposition: A Case Study with Function Labels and Binarization
Author: Joseph Le Roux ; Antoine Rozenknop ; Jennifer Foster
Abstract: It has recently been shown that different NLP models can be effectively combined using dual decomposition. In this paper we demonstrate that PCFG-LA parsing models are suitable for combination in this way. We experiment with the different models which result from alternative methods of extracting a grammar from a treebank (retaining or discarding function labels, left binarization versus right binarization) and achieve a labeled Parseval F-score of 92.4 on Wall Street Journal Section 23 this represents an absolute improvement of 0.7 and an error reduction rate of 7% over a strong PCFG-LA product-model baseline. Although we experiment only with binarization and function labels in this study, there is much scope for applying this approach to – other grammar extraction strategies.
5 0.54593056 128 emnlp-2013-Max-Violation Perceptron and Forced Decoding for Scalable MT Training
6 0.53138733 66 emnlp-2013-Dynamic Feature Selection for Dependency Parsing
7 0.49294642 116 emnlp-2013-Joint Parsing and Disfluency Detection in Linear Time
8 0.46757653 71 emnlp-2013-Efficient Left-to-Right Hierarchical Phrase-Based Translation with Improved Reordering
9 0.4421632 10 emnlp-2013-A Multi-Teraflop Constituency Parser using GPUs
10 0.4193449 190 emnlp-2013-Ubertagging: Joint Segmentation and Supertagging for English
11 0.34936383 171 emnlp-2013-Shift-Reduce Word Reordering for Machine Translation
12 0.33299044 168 emnlp-2013-Semi-Supervised Feature Transformation for Dependency Parsing
13 0.32842109 88 emnlp-2013-Flexible and Efficient Hypergraph Interactions for Joint Hierarchical and Forest-to-String Decoding
14 0.31358543 40 emnlp-2013-Breaking Out of Local Optima with Count Transforms and Model Recombination: A Study in Grammar Induction
15 0.29715785 166 emnlp-2013-Semantic Parsing on Freebase from Question-Answer Pairs
16 0.2927025 127 emnlp-2013-Max-Margin Synchronous Grammar Induction for Machine Translation
17 0.2792075 195 emnlp-2013-Unsupervised Spectral Learning of WCFG as Low-rank Matrix Completion
18 0.2669529 164 emnlp-2013-Scaling Semantic Parsers with On-the-Fly Ontology Matching
19 0.25978819 14 emnlp-2013-A Synchronous Context Free Grammar for Time Normalization
20 0.24210057 189 emnlp-2013-Two-Stage Method for Large-Scale Acquisition of Contradiction Pattern Pairs using Entailment
topicId topicWeight
[(3, 0.027), (18, 0.047), (22, 0.05), (26, 0.047), (30, 0.123), (45, 0.02), (50, 0.018), (51, 0.132), (66, 0.03), (69, 0.326), (71, 0.014), (75, 0.021), (77, 0.022), (95, 0.018), (96, 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.81150746 146 emnlp-2013-Optimal Incremental Parsing via Best-First Dynamic Programming
2 0.76146698 200 emnlp-2013-Well-Argued Recommendation: Adaptive Models Based on Words in Recommender Systems
Author: Julien Gaillard ; Marc El-Beze ; Eitan Altman ; Emmanuel Ethis
Abstract: Recommendation systems (RS) take advantage ofproducts and users information in order to propose items to consumers. Collaborative, content-based and a few hybrid RS have been developed in the past. In contrast, we propose a new domain-independent semantic RS. By providing textually well-argued recommendations, we aim to give more responsibility to the end user in his decision. The system includes a new similarity measure keeping up both the accuracy of rating predictions and coverage. We propose an innovative way to apply a fast adaptation scheme at a semantic level, providing recommendations and arguments in phase with the very recent past. We have performed several experiments on films data, providing textually well-argued recommendations.
3 0.51785457 122 emnlp-2013-Learning to Freestyle: Hip Hop Challenge-Response Induction via Transduction Rule Segmentation
Author: Dekai Wu ; Karteek Addanki ; Markus Saers ; Meriem Beloucif
Abstract: We present a novel model, Freestyle, that learns to improvise rhyming and fluent responses upon being challenged with a line of hip hop lyrics, by combining both bottomup token based rule induction and top-down rule segmentation strategies to learn a stochastic transduction grammar that simultaneously learns bothphrasing and rhyming associations. In this attack on the woefully under-explored natural language genre of music lyrics, we exploit a strictly unsupervised transduction grammar induction approach. Our task is particularly ambitious in that no use of any a priori linguistic or phonetic information is allowed, even though the domain of hip hop lyrics is particularly noisy and unstructured. We evaluate the performance of the learned model against a model learned only using the more conventional bottom-up token based rule induction, and demonstrate the superiority of our combined token based and rule segmentation induction method toward generating higher quality improvised responses, measured on fluency and rhyming criteria as judged by human evaluators. To highlight some of the inherent challenges in adapting other algorithms to this novel task, we also compare the quality ofthe responses generated by our model to those generated by an out-ofthe-box phrase based SMT system. We tackle the challenge of selecting appropriate training data for our task via a dedicated rhyme scheme detection module, which is also acquired via unsupervised learning and report improved quality of the generated responses. Finally, we report results with Maghrebi French hip hop lyrics indicating that our model performs surprisingly well with no special adaptation to other languages. 102
4 0.50030696 128 emnlp-2013-Max-Violation Perceptron and Forced Decoding for Scalable MT Training
5 0.49994543 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging
Author: Xiaoqing Zheng ; Hanyang Chen ; Tianyu Xu
Abstract: This study explores the feasibility of performing Chinese word segmentation (CWS) and POS tagging by deep learning. We try to avoid task-specific feature engineering, and use deep layers of neural networks to discover relevant features to the tasks. We leverage large-scale unlabeled data to improve internal representation of Chinese characters, and use these improved representations to enhance supervised word segmentation and POS tagging models. Our networks achieved close to state-of-theart performance with minimal computational cost. We also describe a perceptron-style algorithm for training the neural networks, as an alternative to maximum-likelihood method, to speed up the training process and make the learning algorithm easier to be implemented.
6 0.49964112 4 emnlp-2013-A Dataset for Research on Short-Text Conversations
7 0.49811 107 emnlp-2013-Interactive Machine Translation using Hierarchical Translation Models
8 0.49696067 66 emnlp-2013-Dynamic Feature Selection for Dependency Parsing
9 0.49612105 157 emnlp-2013-Recursive Autoencoders for ITG-Based Translation
10 0.49599007 175 emnlp-2013-Source-Side Classifier Preordering for Machine Translation
11 0.49508092 172 emnlp-2013-Simple Customization of Recursive Neural Networks for Semantic Relation Classification
12 0.49425721 15 emnlp-2013-A Systematic Exploration of Diversity in Machine Translation
13 0.49391562 163 emnlp-2013-Sarcasm as Contrast between a Positive Sentiment and Negative Situation
14 0.49328071 187 emnlp-2013-Translation with Source Constituency and Dependency Trees
15 0.49256769 13 emnlp-2013-A Study on Bootstrapping Bilingual Vector Spaces from Non-Parallel Data (and Nothing Else)
16 0.49198291 52 emnlp-2013-Converting Continuous-Space Language Models into N-Gram Language Models for Statistical Machine Translation
17 0.49049282 40 emnlp-2013-Breaking Out of Local Optima with Count Transforms and Model Recombination: A Study in Grammar Induction
18 0.49014094 38 emnlp-2013-Bilingual Word Embeddings for Phrase-Based Machine Translation
19 0.48822752 143 emnlp-2013-Open Domain Targeted Sentiment
20 0.48788106 2 emnlp-2013-A Convex Alternative to IBM Model 2