emnlp emnlp2013 emnlp2013-141 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Hao Zhang ; Liang Huang ; Kai Zhao ; Ryan McDonald
Abstract: Online learning algorithms like the perceptron are widely used for structured prediction tasks. For sequential search problems, like left-to-right tagging and parsing, beam search has been successfully combined with perceptron variants that accommodate search errors (Collins and Roark, 2004; Huang et al., 2012). However, perceptron training with inexact search is less studied for bottom-up parsing and, more generally, inference over hypergraphs. In this paper, we generalize the violation-fixing perceptron of Huang et al. (2012) to hypergraphs and apply it to the cube-pruning parser of Zhang and McDonald (2012). This results in the highest reported scores on the WSJ evaluation set (93.50% UAS and 92.41% LAS) without the aid of additional resources.
Reference: text
sentIndex sentText sentNum sentScore
1 Online Learning for Inexact Hypergraph Search. Hao Zhang (Google), Liang Huang (City University of New York), Kai Zhao (City University of New York), Ryan McDonald (Google). haozhang@google.com [sent-1, score-0.202]
2 Abstract Online learning algorithms like the perceptron are widely used for structured prediction tasks. [sent-6, score-0.627]
3 For sequential search problems, like left-to-right tagging and parsing, beam search has been successfully combined with perceptron variants that accommodate search errors (Collins and Roark, 2004; Huang et al. [sent-7, score-1.424]
4 However, perceptron training with inexact search is less studied for bottom-up parsing and, more generally, inference over hypergraphs. [sent-9, score-1.126]
5 In this paper, we generalize the violation-fixing perceptron of Huang et al. [sent-10, score-0.409]
6 (2012) to hypergraphs and apply it to the cube-pruning parser of Zhang and McDonald (2012). [sent-11, score-0.102]
7 1 Introduction Structured prediction problems generally deal with exponentially many outputs, often making exact search infeasible. [sent-15, score-0.483]
8 For sequential search problems, such as tagging and incremental parsing, beam search coupled with perceptron algorithms that account for potential search errors has been shown to be a powerful combination (Collins and Roark, 2004; Daumé and Marcu, 2005; Zhang and Clark, 2008; Huang et al. [sent-16, score-1.54]
9 However, sequential search algorithms, and in particular left-to-right beam search (Collins and Roark, 2004; Zhang and Clark, 2008), squeeze inference into a very narrow space. [sent-18, score-0.841]
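To make that narrowness concrete, the following is a minimal sketch of left-to-right beam search; it is an illustration only, not code from any of the cited systems, and expand_fn and score_fn are hypothetical stand-ins for a transition generator and a model scorer.

import heapq

def beam_search(initial_state, expand_fn, score_fn, beam_size, n_steps):
    """Generic left-to-right beam search.

    expand_fn(state) yields successor states (one per action);
    score_fn(state) returns the model score of a partial state.
    Only the beam_size highest-scoring states survive each step;
    discarding everything else is the source of search errors.
    """
    beam = [initial_state]
    for _ in range(n_steps):
        candidates = [s for state in beam for s in expand_fn(state)]
        beam = heapq.nlargest(beam_size, candidates, key=score_fn)
    return max(beam, key=score_fn)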
10 To address this, Huang (2008) formulated constituency parsing as approximate bottom-up inference in order to compactly represent an exponential number of outputs while scoring features of arbitrary scope. [sent-19, score-0.597]
11 This idea was adapted to graph-based dependency parsers by Zhang and McDonald (2012) and shown to outperform left-to-right beam search. [sent-20, score-0.318]
12 Both these examples, bottom-up approximate dependency and constituency parsing, can be viewed as specific instances of inexact hypergraph search. [sent-21, score-1.106]
13 Typically, the approximation is accomplished by cube-pruning throughout the hypergraph (Chiang, 2007). [sent-22, score-0.463]
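To make the cube-pruning step concrete, the sketch below lazily pops the best score combinations of two k-best child lists from a heap, in the spirit of Chiang (2007); it is a simplified illustration rather than the parser's actual code, and combine is a hypothetical scoring hook.

import heapq

def cube_prune(left, right, combine, k):
    """Lazily enumerate the top-k combinations of two descending-sorted
    k-best lists of (score, item) pairs, using a frontier heap.

    With only local scores the grid is monotonic and this is exact;
    non-local features break monotonicity, which is exactly the
    inexactness discussed in the surrounding text."""
    if not left or not right:
        return []
    seen = {(0, 0)}
    frontier = [(-combine(left[0], right[0]), 0, 0)]
    out = []
    while frontier and len(out) < k:
        neg, i, j = heapq.heappop(frontier)
        out.append((-neg, left[i][1], right[j][1]))
        for ni, nj in ((i + 1, j), (i, j + 1)):  # the two grid neighbors
            if ni < len(left) and nj < len(right) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(frontier, (-combine(left[ni], right[nj]), ni, nj))
    return out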
14 Unfortunately, as the scope of features at each node increases, the inexactness of search and its negative impact on learning can potentially be exacerbated. [sent-23, score-0.269]
15 Unlike in the sequential case, the impact of approximate hypergraph search on learning, as well as methods to mitigate any ill effects, has not been studied. [sent-24, score-0.962]
16 Motivated by this, we develop online learning algorithms for inexact hypergraph search by generalizing the violation-fixing perceptron of Huang et al. [sent-25, score-1.146]
17 We empirically validate the benefit of this approach within the cube-pruning dependency parser of Zhang and McDonald (2012). [sent-27, score-0.137]
18 2 Structured Perceptron for Inexact Hypergraph Search The structured perceptron algorithm (Collins, 2002) is a general learning algorithm. [sent-28, score-0.448]
19 The perceptron update rule is simply: w′ = w + f(x, ŷ) − f(x, y′). [sent-30, score-0.477]
20 The convergence of the original perceptron algorithm relies on the argmax function being exact so that the condition w · f(x, y′) > w · f(x, ŷ) (modulo ties) always holds. [sent-31, score-0.602]
21 This condition is called a violation because the prediction y′ scores higher than the correct label ŷ. [sent-32, score-0.077]
22 Each perceptron update moves weights [sent-33, score-0.481]
23 Figure 1: A hypergraph showing the union of the gold and Viterbi subtrees. [sent-35, score-0.47]
24 The hyperedges in bold and dashed are from the gold and Viterbi trees, respectively. [sent-36, score-0.221]
25 away from y′ and towards ŷ to fix such violations. [sent-37, score-0.078]
26 But when search is inexact, y′ could be suboptimal so that sometimes w · f(x, y′) < w · f(x, ŷ). [sent-38, score-0.25]
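Taken together, the update rule and the violation condition suggest the following minimal sketch; decode and f are hypothetical stand-ins for an inexact search routine and a sparse feature map, and the snippet shows only the violation-checked update for a single example, not the paper's hypergraph generalization.

def score(w, f, x, y):
    """Linear model score w · f(x, y), with f returning a sparse dict."""
    return sum(w.get(k, 0.0) * v for k, v in f(x, y).items())

def violation_fixing_step(w, f, decode, x, y_gold):
    """One online step: update only when w · f(x, y') > w · f(x, y_gold)
    actually holds, i.e. on a genuine violation. Under inexact search the
    prediction y' can be suboptimal, so sometimes
    w · f(x, y') < w · f(x, y_gold) and a standard update would be invalid."""
    y_pred = decode(w, x)  # inexact (beam / cube-pruned) search
    if y_pred != y_gold and score(w, f, x, y_pred) > score(w, f, x, y_gold):
        # w' = w + f(x, y_gold) - f(x, y_pred)
        for k, v in f(x, y_gold).items():
            w[k] = w.get(k, 0.0) + v
        for k, v in f(x, y_pred).items():
            w[k] = w.get(k, 0.0) - v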
wordName wordTfidf (topN-words)
[('inexact', 0.435), ('perceptron', 0.373), ('hypergraph', 0.365), ('beam', 0.192), ('search', 0.188), ('roark', 0.167), ('huang', 0.16), ('mcdonald', 0.155), ('sequential', 0.152), ('zhang', 0.142), ('collins', 0.118), ('hao', 0.111), ('constituency', 0.108), ('viterbi', 0.094), ('outputs', 0.091), ('google', 0.091), ('sfu', 0.085), ('isx', 0.085), ('parsing', 0.081), ('modulo', 0.078), ('iti', 0.078), ('cuny', 0.078), ('prediction', 0.077), ('approximate', 0.077), ('structured', 0.075), ('zhao', 0.073), ('compactly', 0.072), ('narrow', 0.072), ('qc', 0.068), ('argmax', 0.068), ('mitigate', 0.068), ('ill', 0.068), ('hyperedges', 0.068), ('gc', 0.068), ('gold', 0.063), ('clark', 0.063), ('kai', 0.062), ('accommodate', 0.062), ('ey', 0.062), ('uas', 0.062), ('suboptimal', 0.062), ('accomplished', 0.06), ('las', 0.06), ('fh', 0.06), ('won', 0.058), ('solves', 0.058), ('update', 0.057), ('hypergraphs', 0.056), ('problems', 0.055), ('online', 0.054), ('wd', 0.054), ('yin', 0.054), ('generalizing', 0.054), ('wsj', 0.052), ('com', 0.052), ('exact', 0.051), ('moves', 0.051), ('algorithms', 0.05), ('dependency', 0.05), ('dashed', 0.049), ('ties', 0.049), ('daum', 0.049), ('inference', 0.049), ('formulated', 0.048), ('marcu', 0.047), ('coupled', 0.047), ('ds', 0.047), ('aid', 0.047), ('parser', 0.046), ('exponentially', 0.045), ('impact', 0.044), ('convergence', 0.044), ('tagging', 0.043), ('ryan', 0.042), ('union', 0.042), ('incremental', 0.042), ('bold', 0.041), ('validate', 0.041), ('fix', 0.04), ('adapted', 0.039), ('city', 0.039), ('powerful', 0.039), ('errors', 0.038), ('away', 0.038), ('throughout', 0.038), ('nd', 0.038), ('scope', 0.037), ('chiang', 0.037), ('parsers', 0.037), ('valid', 0.037), ('instances', 0.036), ('generalize', 0.036), ('exponential', 0.036), ('viewed', 0.035), ('arbitrary', 0.035), ('condition', 0.034), ('generally', 0.034), ('deal', 0.033), ('relies', 0.032), ('motivated', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 141 emnlp-2013-Online Learning for Inexact Hypergraph Search
Author: Hao Zhang ; Liang Huang ; Kai Zhao ; Ryan McDonald
Abstract: Online learning algorithms like the perceptron are widely used for structured prediction tasks. For sequential search problems, like left-to-right tagging and parsing, beam search has been successfully combined with perceptron variants that accommodate search errors (Collins and Roark, 2004; Huang et al., 2012). However, perceptron training with inexact search is less studied for bottom-up parsing and, more generally, inference over hypergraphs. In this paper, we generalize the violation-fixing perceptron of Huang et al. (2012) to hypergraphs and apply it to the cube-pruning parser of Zhang and McDonald (2012). This results in the highest reported scores on the WSJ evaluation set (93.50% UAS and 92.41% LAS) without the aid of additional resources.
2 0.2820172 128 emnlp-2013-Max-Violation Perceptron and Forced Decoding for Scalable MT Training
Author: Heng Yu ; Liang Huang ; Haitao Mi ; Kai Zhao
Abstract: While large-scale discriminative training has triumphed in many NLP problems, its definite success on machine translation has been largely elusive. Most recent efforts along this line are not scalable (training on the small dev set with features from the top ∼100 most frequent words) and overly complicated. We instead present a very simple yet theoretically motivated approach by extending the recent framework of “violation-fixing perceptron”, using forced decoding to compute the target derivations. Extensive phrase-based translation experiments on both Chinese-to-English and Spanish-to-English tasks show substantial gains in BLEU by up to +2.3/+2.0 on dev/test over MERT, thanks to 20M+ sparse features. This is the first successful effort of large-scale online discriminative training for MT. 1 Introduction Large-scale discriminative training has witnessed great success in many NLP problems such as parsing (McDonald et al., 2005) and tagging (Collins, 2002), but not yet for machine translation (MT) despite numerous recent efforts. Due to scalability issues, most of these recent methods can only train on a small dev set of about a thousand sentences rather than on the full training set, and only with 2,000–10,000 rather “dense-like” features (either unlexicalized or only considering highest-frequency words), as in MIRA (Watanabe et al., 2007; Chiang et al., 2008; Chiang, 2012), PRO (Hopkins and May, 2011), and RAMP (Gimpel and Smith, 2012). However, it is well-known that the most important features for NLP are lexicalized, most of which cannot be seen on a small dataset. Furthermore, these methods often involve complicated loss functions and intricate choices of the “target” derivations to update towards or against (e.g. k-best/forest oracles, or hope/fear derivations), and are thus hard to replicate. As a result, the classical method of MERT (Och, 2003) remains the default training algorithm for MT even though it can only tune a handful of dense features. See also Section 6 for other related work. As a notable exception, Liang et al. (2006) do train a structured perceptron model on the training data with sparse features, but fail to outperform MERT. We argue this is because structured perceptron, like many structured learning algorithms such as CRF and MIRA, assumes exact search, and search errors inevitably break theoretical properties such as convergence (Huang et al., 2012). Empirically, it is now well accepted that standard perceptron performs poorly when search error is severe (Collins and Roark, 2004; Zhang et al., 2013). To address the search error problem we propose a very simple approach based on the recent framework of “violation-fixing perceptron” (Huang et al., 2012) which is designed specifically for inexact search, with a theoretical convergence guarantee and excellent empirical performance on beam search parsing and tagging. The basic idea is to update when search error happens, rather than at the end of the search. To adapt it to MT, we extend this framework to handle latent variables corresponding to the hidden derivations. We update towards “gold-standard” derivations computed by forced decoding so that each derivation leads to the exact reference translation. Forced decoding is also used as a way of data selection, since those reachable sentence pairs are generally more literal and of higher quality, which the training should focus on. 
When the reachable subset is small for some language pairs, we augment it by including reachable prefix-pairs when the full sentence pair is not. We make the following contributions: 1. Our work is the first successful effort to scale online structured learning to a large portion of the training data (as opposed to the dev set). 2. Our work is the first to use a principled learning method customized for inexact search which updates on partial derivations rather than full ones in order to fix search errors. We adapt it to MT using latent variables for derivations. 3. Contrary to the common wisdom, we show that simply updating towards the exact reference translation is helpful, which is much simpler than k-best/forest oracles or loss-augmented (e.g. hope/fear) derivations, avoiding sentence-level BLEU scores or other loss functions. 4. We present a convincing analysis that it is the search errors and standard perceptron’s inability to deal with them that prevent previous work, esp. Liang et al. (2006), from succeeding. 5. Scaling to the training data enables us to engineer a very rich feature set of sparse, lexicalized, and non-local features, and we propose various ways to alleviate overfitting. For simplicity and efficiency reasons, in this paper we use phrase-based translation, but our method has the potential to be applicable to other translation paradigms. Extensive experiments on both Chinese-to-English and Spanish-to-English tasks show statistically significant gains in BLEU by up to +2.3/+2.0 on dev/test over MERT, and up to +1.5/+1.5 over PRO, thanks to 20M+ sparse features. 2 Phrase-Based MT and Forced Decoding We first review the basic phrase-based decoding algorithm (Koehn, 2004), which will be adapted for forced decoding. 2.1 Background: Phrase-based Decoding We will use the following running example from Chinese to English from Mi et al. (2008): Figure 1: Standard beam-search phrase-based decoding. Bùshí yǔ Shālóng jǔxíng le huìtán / Bush with Sharon hold -ed meeting / ‘Bush held a meeting with Sharon’. Phrase-based decoders generate partial target-language outputs in left-to-right order in the form of hypotheses (or states) (Koehn, 2004). Each hypothesis has a coverage vector capturing the source-language words translated so far, and can be extended into a longer hypothesis by a phrase-pair translating an uncovered segment. For example, the following is one possible derivation: (______ 0): (s0, “”) → (•_____ 1): (s1, “Bush”) → (•__••• 6): (s2, “Bush held talks”) → (•••••• 3): (s3, “Bush held talks with Sharon”), where a • in the coverage vector indicates the source word at that position is “covered” and where each si is the score of each state, each adding the rule score and the distortion cost (dc) to the score of the previous state. To compute the distortion cost we also need to maintain the ending position of the last phrase (e.g., the 3 and 6 in the coverage vectors). In phrase-based translation there is also a distortion limit which prohibits long-distance reorderings. 
The above states are called −LM states since they do not involve language model costs. To add a bigram model, we split each −LM state into a series of +LM states; each +LM state has the form (v, a) where a is the last word of the hypothesis. Thus a +LM version of the above derivation might be: (______ 0, <s>): (s0, “”) → (•_____ 1, Bush) → (•__••• 6, talks) → (•••••• 3, Sharon).
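The coverage-vector bookkeeping described above admits a short sketch; the bitmask representation, the phrase scores, and the linear distortion penalty below are illustrative assumptions rather than the cited system's actual model.

from collections import namedtuple

# A -LM hypothesis: coverage bitmask, end position of the last
# phrase, model score, and the partial translation so far.
Hyp = namedtuple("Hyp", "coverage last_end score words")

def extend(hyp, start, end, phrase, phrase_score, distortion_weight=0.1):
    """Extend a hypothesis with a phrase pair covering source span
    [start, end); returns None if the span overlaps covered words."""
    span = ((1 << (end - start)) - 1) << start
    if hyp.coverage & span:
        return None
    dc = -distortion_weight * abs(start - hyp.last_end)  # distortion cost
    return Hyp(hyp.coverage | span, end,
               hyp.score + phrase_score + dc, hyp.words + [phrase])

# The derivation above, on the 6-word source sentence:
h0 = Hyp(0, 0, 0.0, [])
h1 = extend(h0, 0, 1, "Bush", 1.0)         # covers position 0, ends at 1
h2 = extend(h1, 3, 6, "held talks", 2.0)   # jumps ahead, ends at 6
h3 = extend(h2, 1, 3, "with Sharon", 1.5)  # comes back, ends at 3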
3 0.22894403 145 emnlp-2013-Optimal Beam Search for Machine Translation
Author: Alexander Rush ; Yin-Wen Chang ; Michael Collins
Abstract: Beam search is a fast and empirically effective method for translation decoding, but it lacks formal guarantees about search error. We develop a new decoding algorithm that combines the speed of beam search with the optimal certificate property of Lagrangian relaxation, and apply it to phrase- and syntax-based translation decoding. The new method is efficient, utilizes standard MT algorithms, and returns an exact solution on the majority of translation examples in our test data. The algorithm is 3.5 times faster than an optimized incremental constraint-based decoder for phrase-based translation and 4 times faster for syntax-based translation.
4 0.12349408 146 emnlp-2013-Optimal Incremental Parsing via Best-First Dynamic Programming
Author: Kai Zhao ; James Cross ; Liang Huang
Abstract: We present the first provably optimal polynomial time dynamic programming (DP) algorithm for best-first shift-reduce parsing, which applies the DP idea of Huang and Sagae (2010) to the best-first parser of Sagae and Lavie (2006) in a non-trivial way, reducing the complexity of the latter from exponential to polynomial. We prove the correctness of our algorithm rigorously. Experiments confirm that DP leads to a significant speedup on a probabilistic best-first shift-reduce parser, and makes exact search under such a model tractable for the first time.
5 0.10606561 127 emnlp-2013-Max-Margin Synchronous Grammar Induction for Machine Translation
Author: Xinyan Xiao ; Deyi Xiong
Abstract: Traditional synchronous grammar induction estimates parameters by maximizing likelihood, which only has a loose relation to translation quality. Alternatively, we propose a max-margin estimation approach to discriminatively inducing synchronous grammars for machine translation, which directly optimizes translation quality measured by BLEU. In the max-margin estimation of parameters, we only need to calculate Viterbi translations. This further facilitates the incorporation of various non-local features that are defined on the target side. We test the effectiveness of our max-margin estimation framework on a competitive hierarchical phrase-based system. Experiments show that our max-margin method significantly outperforms the traditional two-step pipeline for synchronous rule extraction by 1.3 BLEU points and is also better than the previous max-likelihood estimation method.
6 0.080170505 66 emnlp-2013-Dynamic Feature Selection for Dependency Parsing
7 0.079718761 88 emnlp-2013-Flexible and Efficient Hypergraph Interactions for Joint Hierarchical and Forest-to-String Decoding
8 0.066581383 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging
9 0.064886324 168 emnlp-2013-Semi-Supervised Feature Transformation for Dependency Parsing
10 0.063515738 187 emnlp-2013-Translation with Source Constituency and Dependency Trees
11 0.063515253 49 emnlp-2013-Combining Generative and Discriminative Model Scores for Distant Supervision
12 0.060178697 185 emnlp-2013-Towards Situated Dialogue: Revisiting Referring Expression Generation
13 0.055641718 40 emnlp-2013-Breaking Out of Local Optima with Count Transforms and Model Recombination: A Study in Grammar Induction
14 0.050068948 70 emnlp-2013-Efficient Higher-Order CRFs for Morphological Tagging
15 0.049119864 9 emnlp-2013-A Log-Linear Model for Unsupervised Text Normalization
16 0.047649868 201 emnlp-2013-What is Hidden among Translation Rules
17 0.044894602 50 emnlp-2013-Combining PCFG-LA Models with Dual Decomposition: A Case Study with Function Labels and Binarization
18 0.04426188 105 emnlp-2013-Improving Web Search Ranking by Incorporating Structured Annotation of Queries
19 0.042346932 116 emnlp-2013-Joint Parsing and Disfluency Detection in Linear Time
20 0.041697152 171 emnlp-2013-Shift-Reduce Word Reordering for Machine Translation
topicId topicWeight
[(0, -0.144), (1, -0.1), (2, 0.058), (3, 0.05), (4, -0.038), (5, 0.023), (6, 0.051), (7, 0.042), (8, 0.067), (9, 0.259), (10, -0.13), (11, -0.049), (12, -0.12), (13, 0.075), (14, 0.167), (15, -0.281), (16, -0.056), (17, 0.058), (18, -0.068), (19, -0.008), (20, 0.068), (21, 0.015), (22, -0.099), (23, -0.087), (24, 0.061), (25, -0.008), (26, 0.127), (27, 0.208), (28, -0.027), (29, 0.185), (30, -0.008), (31, 0.111), (32, 0.011), (33, 0.084), (34, -0.071), (35, -0.173), (36, -0.017), (37, 0.049), (38, 0.023), (39, -0.007), (40, -0.005), (41, -0.023), (42, -0.135), (43, 0.035), (44, -0.079), (45, -0.006), (46, -0.079), (47, 0.019), (48, 0.076), (49, -0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.98819655 141 emnlp-2013-Online Learning for Inexact Hypergraph Search
Author: Hao Zhang ; Liang Huang ; Kai Zhao ; Ryan McDonald
Abstract: Online learning algorithms like the perceptron are widely used for structured prediction tasks. For sequential search problems, like left-to-right tagging and parsing, beam search has been successfully combined with perceptron variants that accommodate search errors (Collins and Roark, 2004; Huang et al., 2012). However, perceptron training with inexact search is less studied for bottom-up parsing and, more generally, inference over hypergraphs. In this paper, we generalize the violation-fixing perceptron of Huang et al. (2012) to hypergraphs and apply it to the cube-pruning parser of Zhang and McDonald (2012). This results in the highest reported scores on the WSJ evaluation set (93.50% UAS and 92.41% LAS) without the aid of additional resources.
2 0.76839244 128 emnlp-2013-Max-Violation Perceptron and Forced Decoding for Scalable MT Training
Author: Heng Yu ; Liang Huang ; Haitao Mi ; Kai Zhao
Abstract: While large-scale discriminative training has triumphed in many NLP problems, its definite success on machine translation has been largely elusive. Most recent efforts along this line are not scalable (training on the small dev set with features from the top ∼100 most frequent words) and overly complicated. We instead present a very simple yet theoretically motivated approach by extending the recent framework of “violation-fixing perceptron”, using forced decoding to compute the target derivations. Extensive phrase-based translation experiments on both Chinese-to-English and Spanish-to-English tasks show substantial gains in BLEU by up to +2.3/+2.0 on dev/test over MERT, thanks to 20M+ sparse features. This is the first successful effort of large-scale online discriminative training for MT. 1 Introduction Large-scale discriminative training has witnessed great success in many NLP problems such as parsing (McDonald et al., 2005) and tagging (Collins, 2002), but not yet for machine translation (MT) despite numerous recent efforts. Due to scalability issues, most of these recent methods can only train on a small dev set of about a thousand sentences rather than on the full training set, and only with 2,000–10,000 rather “dense-like” features (either unlexicalized or only considering highest-frequency words), as in MIRA (Watanabe et al., 2007; Chiang et al., 2008; Chiang, 2012), PRO (Hopkins and May, 2011), and RAMP (Gimpel and Smith, 2012). However, it is well-known that the most important features for NLP are lexicalized, most of which cannot be seen on a small dataset. Furthermore, these methods often involve complicated loss functions and intricate choices of the “target” derivations to update towards or against (e.g. k-best/forest oracles, or hope/fear derivations), and are thus hard to replicate. As a result, the classical method of MERT (Och, 2003) remains the default training algorithm for MT even though it can only tune a handful of dense features. See also Section 6 for other related work. As a notable exception, Liang et al. (2006) do train a structured perceptron model on the training data with sparse features, but fail to outperform MERT. We argue this is because structured perceptron, like many structured learning algorithms such as CRF and MIRA, assumes exact search, and search errors inevitably break theoretical properties such as convergence (Huang et al., 2012). Empirically, it is now well accepted that standard perceptron performs poorly when search error is severe (Collins and Roark, 2004; Zhang et al., 2013). To address the search error problem we propose a very simple approach based on the recent framework of “violation-fixing perceptron” (Huang et al., 2012) which is designed specifically for inexact search, with a theoretical convergence guarantee and excellent empirical performance on beam search parsing and tagging. The basic idea is to update when search error happens, rather than at the end of the search. To adapt it to MT, we extend this framework to handle latent variables corresponding to the hidden derivations. We update towards “gold-standard” derivations computed by forced decoding so that each derivation leads to the exact reference translation. Forced decoding is also used as a way of data selection, since those reachable sentence pairs are generally more literal and of higher quality, which the training should focus on. 
When the reachable subset is small for some language pairs, we augment it by including reachable prefix-pairs when the full sentence pair is not. We make the following contributions: 1. Our work is the first successful effort to scale online structured learning to a large portion of the training data (as opposed to the dev set). 2. Our work is the first to use a principled learning method customized for inexact search which updates on partial derivations rather than full ones in order to fix search errors. We adapt it to MT using latent variables for derivations. 3. Contrary to the common wisdom, we show that simply updating towards the exact reference translation is helpful, which is much simpler than k-best/forest oracles or loss-augmented (e.g. hope/fear) derivations, avoiding sentence-level BLEU scores or other loss functions. 4. We present a convincing analysis that it is the search errors and standard perceptron’s inability to deal with them that prevent previous work, esp. Liang et al. (2006), from succeeding. 5. Scaling to the training data enables us to engineer a very rich feature set of sparse, lexicalized, and non-local features, and we propose various ways to alleviate overfitting. For simplicity and efficiency reasons, in this paper we use phrase-based translation, but our method has the potential to be applicable to other translation paradigms. Extensive experiments on both Chinese-to-English and Spanish-to-English tasks show statistically significant gains in BLEU by up to +2.3/+2.0 on dev/test over MERT, and up to +1.5/+1.5 over PRO, thanks to 20M+ sparse features. 2 Phrase-Based MT and Forced Decoding We first review the basic phrase-based decoding algorithm (Koehn, 2004), which will be adapted for forced decoding. 2.1 Background: Phrase-based Decoding We will use the following running example from Chinese to English from Mi et al. (2008): Figure 1: Standard beam-search phrase-based decoding. Bùshí yǔ Shālóng jǔxíng le huìtán / Bush with Sharon hold -ed meeting / ‘Bush held a meeting with Sharon’. Phrase-based decoders generate partial target-language outputs in left-to-right order in the form of hypotheses (or states) (Koehn, 2004). Each hypothesis has a coverage vector capturing the source-language words translated so far, and can be extended into a longer hypothesis by a phrase-pair translating an uncovered segment. For example, the following is one possible derivation: (______ 0): (s0, “”) → (•_____ 1): (s1, “Bush”) → (•__••• 6): (s2, “Bush held talks”) → (•••••• 3): (s3, “Bush held talks with Sharon”), where a • in the coverage vector indicates the source word at that position is “covered” and where each si is the score of each state, each adding the rule score and the distortion cost (dc) to the score of the previous state. To compute the distortion cost we also need to maintain the ending position of the last phrase (e.g., the 3 and 6 in the coverage vectors). In phrase-based translation there is also a distortion limit which prohibits long-distance reorderings. 
The above states are called −LM states since they do not involve language model costs. To add a bigram model, we split each −LM state into a series of +LM states; each +LM state has the form (v, a) where a is the last word of the hypothesis. Thus a +LM version of the above derivation might be: (______ 0, <s>): (s0, “”) → (•_____ 1, Bush) → (•__••• 6, talks) → (•••••• 3, Sharon).
3 0.73997653 146 emnlp-2013-Optimal Incremental Parsing via Best-First Dynamic Programming
Author: Kai Zhao ; James Cross ; Liang Huang
Abstract: We present the first provably optimal polynomial time dynamic programming (DP) algorithm for best-first shift-reduce parsing, which applies the DP idea of Huang and Sagae (2010) to the best-first parser of Sagae and Lavie (2006) in a non-trivial way, reducing the complexity of the latter from exponential to polynomial. We prove the correctness of our algorithm rigorously. Experiments confirm that DP leads to a significant speedup on a probabilistic best-first shift-reduce parser, and makes exact search under such a model tractable for the first time.
4 0.73431873 145 emnlp-2013-Optimal Beam Search for Machine Translation
Author: Alexander Rush ; Yin-Wen Chang ; Michael Collins
Abstract: Beam search is a fast and empirically effective method for translation decoding, but it lacks formal guarantees about search error. We develop a new decoding algorithm that combines the speed of beam search with the optimal certificate property of Lagrangian relaxation, and apply it to phrase- and syntax-based translation decoding. The new method is efficient, utilizes standard MT algorithms, and returns an exact solution on the majority of translation examples in our test data. The algorithm is 3.5 times faster than an optimized incremental constraint-based decoder for phrase-based translation and 4 times faster for syntax-based translation.
5 0.40336797 116 emnlp-2013-Joint Parsing and Disfluency Detection in Linear Time
Author: Mohammad Sadegh Rasooli ; Joel Tetreault
Abstract: We introduce a novel method to jointly parse and detect disfluencies in spoken utterances. Our model can use arbitrary features for parsing sentences and adapt itself with out-of-domain data. We show that our method, based on transition-based parsing, performs at a high level of accuracy for both the parsing and disfluency detection tasks. Additionally, our method is the fastest for the joint task, running in linear time.
6 0.37146211 66 emnlp-2013-Dynamic Feature Selection for Dependency Parsing
7 0.31180012 88 emnlp-2013-Flexible and Efficient Hypergraph Interactions for Joint Hierarchical and Forest-to-String Decoding
8 0.28920764 50 emnlp-2013-Combining PCFG-LA Models with Dual Decomposition: A Case Study with Function Labels and Binarization
9 0.28300112 71 emnlp-2013-Efficient Left-to-Right Hierarchical Phrase-Based Translation with Improved Reordering
10 0.27338085 168 emnlp-2013-Semi-Supervised Feature Transformation for Dependency Parsing
11 0.23826599 203 emnlp-2013-With Blinkers on: Robust Prediction of Eye Movements across Readers
12 0.23418978 190 emnlp-2013-Ubertagging: Joint Segmentation and Supertagging for English
13 0.23235637 70 emnlp-2013-Efficient Higher-Order CRFs for Morphological Tagging
14 0.22810303 105 emnlp-2013-Improving Web Search Ranking by Incorporating Structured Annotation of Queries
15 0.20442288 188 emnlp-2013-Tree Kernel-based Negation and Speculation Scope Detection with Structured Syntactic Parse Features
16 0.18765944 159 emnlp-2013-Regularized Minimum Error Rate Training
17 0.18729469 97 emnlp-2013-Identifying Web Search Query Reformulation using Concept based Matching
18 0.18611994 40 emnlp-2013-Breaking Out of Local Optima with Count Transforms and Model Recombination: A Study in Grammar Induction
19 0.18373895 3 emnlp-2013-A Corpus Level MIRA Tuning Strategy for Machine Translation
20 0.18181494 9 emnlp-2013-A Log-Linear Model for Unsupervised Text Normalization
topicId topicWeight
[(18, 0.034), (22, 0.025), (26, 0.539), (30, 0.096), (51, 0.109), (66, 0.023), (71, 0.02), (75, 0.016), (77, 0.027), (96, 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.92999965 141 emnlp-2013-Online Learning for Inexact Hypergraph Search
Author: Hao Zhang ; Liang Huang ; Kai Zhao ; Ryan McDonald
Abstract: Online learning algorithms like the perceptron are widely used for structured prediction tasks. For sequential search problems, like left-to-right tagging and parsing, beam search has been successfully combined with perceptron variants that accommodate search errors (Collins and Roark, 2004; Huang et al., 2012). However, perceptron training with inexact search is less studied for bottom-up parsing and, more generally, inference over hypergraphs. In this paper, we generalize the violation-fixing perceptron of Huang et al. (2012) to hypergraphs and apply it to the cube-pruning parser of Zhang and McDonald (2012). This results in the highest reported scores on the WSJ evaluation set (93.50% UAS and 92.41% LAS) without the aid of additional resources.
2 0.78484237 145 emnlp-2013-Optimal Beam Search for Machine Translation
Author: Alexander Rush ; Yin-Wen Chang ; Michael Collins
Abstract: Beam search is a fast and empirically effective method for translation decoding, but it lacks formal guarantees about search error. We develop a new decoding algorithm that combines the speed of beam search with the optimal certificate property of Lagrangian relaxation, and apply it to phrase- and syntax-based translation decoding. The new method is efficient, utilizes standard MT algorithms, and returns an exact solution on the majority of translation examples in our test data. The algorithm is 3.5 times faster than an optimized incremental constraint-based decoder for phrase-based translation and 4 times faster for syntax-based translation.
3 0.64375633 122 emnlp-2013-Learning to Freestyle: Hip Hop Challenge-Response Induction via Transduction Rule Segmentation
Author: Dekai Wu ; Karteek Addanki ; Markus Saers ; Meriem Beloucif
Abstract: We present a novel model, Freestyle, that learns to improvise rhyming and fluent responses upon being challenged with a line of hip hop lyrics, by combining both bottom-up token based rule induction and top-down rule segmentation strategies to learn a stochastic transduction grammar that simultaneously learns both phrasing and rhyming associations. In this attack on the woefully under-explored natural language genre of music lyrics, we exploit a strictly unsupervised transduction grammar induction approach. Our task is particularly ambitious in that no use of any a priori linguistic or phonetic information is allowed, even though the domain of hip hop lyrics is particularly noisy and unstructured. We evaluate the performance of the learned model against a model learned only using the more conventional bottom-up token based rule induction, and demonstrate the superiority of our combined token based and rule segmentation induction method toward generating higher quality improvised responses, measured on fluency and rhyming criteria as judged by human evaluators. To highlight some of the inherent challenges in adapting other algorithms to this novel task, we also compare the quality of the responses generated by our model to those generated by an out-of-the-box phrase-based SMT system. We tackle the challenge of selecting appropriate training data for our task via a dedicated rhyme scheme detection module, which is also acquired via unsupervised learning, and report improved quality of the generated responses. Finally, we report results with Maghrebi French hip hop lyrics indicating that our model performs surprisingly well with no special adaptation to other languages.
4 0.4091177 128 emnlp-2013-Max-Violation Perceptron and Forced Decoding for Scalable MT Training
Author: Heng Yu ; Liang Huang ; Haitao Mi ; Kai Zhao
Abstract: While large-scale discriminative training has triumphed in many NLP problems, its definite success on machine translation has been largely elusive. Most recent efforts along this line are not scalable (training on the small dev set with features from the top ∼100 most frequent words) and overly complicated. We instead present a very simple yet theoretically motivated approach by extending the recent framework of “violation-fixing perceptron”, using forced decoding to compute the target derivations. Extensive phrase-based translation experiments on both Chinese-to-English and Spanish-to-English tasks show substantial gains in BLEU by up to +2.3/+2.0 on dev/test over MERT, thanks to 20M+ sparse features. This is the first successful effort of large-scale online discriminative training for MT. 1 Introduction Large-scale discriminative training has witnessed great success in many NLP problems such as parsing (McDonald et al., 2005) and tagging (Collins, 2002), but not yet for machine translation (MT) despite numerous recent efforts. Due to scalability issues, most of these recent methods can only train on a small dev set of about a thousand sentences rather than on the full training set, and only with 2,000–10,000 rather “dense-like” features (either unlexicalized or only considering highest-frequency words), as in MIRA (Watanabe et al., 2007; Chiang et al., 2008; Chiang, 2012), PRO (Hopkins and May, 2011), and RAMP (Gimpel and Smith, 2012). However, it is well-known that the most important features for NLP are lexicalized, most of which cannot be seen on a small dataset. Furthermore, these methods often involve complicated loss functions and intricate choices of the “target” derivations to update towards or against (e.g. k-best/forest oracles, or hope/fear derivations), and are thus hard to replicate. As a result, the classical method of MERT (Och, 2003) remains the default training algorithm for MT even though it can only tune a handful of dense features. See also Section 6 for other related work. As a notable exception, Liang et al. (2006) do train a structured perceptron model on the training data with sparse features, but fail to outperform MERT. We argue this is because structured perceptron, like many structured learning algorithms such as CRF and MIRA, assumes exact search, and search errors inevitably break theoretical properties such as convergence (Huang et al., 2012). Empirically, it is now well accepted that standard perceptron performs poorly when search error is severe (Collins and Roark, 2004; Zhang et al., 2013). To address the search error problem we propose a very simple approach based on the recent framework of “violation-fixing perceptron” (Huang et al., 2012) which is designed specifically for inexact search, with a theoretical convergence guarantee and excellent empirical performance on beam search parsing and tagging. The basic idea is to update when search error happens, rather than at the end of the search. To adapt it to MT, we extend this framework to handle latent variables corresponding to the hidden derivations. We update towards “gold-standard” derivations computed by forced decoding so that each derivation leads to the exact reference translation. Forced decoding is also used as a way of data selection, since those reachable sentence pairs are generally more literal and of higher quality, which the training should focus on. 
When the reachable subset is small for some language pairs, we augment it by including reachable prefix-pairs when the full sentence pair is not. We make the following contributions: 1. Our work is the first successful effort to scale online structured learning to a large portion of the training data (as opposed to the dev set). 2. Our work is the first to use a principled learning method customized for inexact search which updates on partial derivations rather than full ones in order to fix search errors. We adapt it to MT using latent variables for derivations. 3. Contrary to the common wisdom, we show that simply updating towards the exact reference translation is helpful, which is much simpler than k-best/forest oracles or loss-augmented (e.g. hope/fear) derivations, avoiding sentence-level BLEU scores or other loss functions. 4. We present a convincing analysis that it is the search errors and standard perceptron’s inability to deal with them that prevent previous work, esp. Liang et al. (2006), from succeeding. 5. Scaling to the training data enables us to engineer a very rich feature set of sparse, lexicalized, and non-local features, and we propose various ways to alleviate overfitting. For simplicity and efficiency reasons, in this paper we use phrase-based translation, but our method has the potential to be applicable to other translation paradigms. Extensive experiments on both Chinese-to-English and Spanish-to-English tasks show statistically significant gains in BLEU by up to +2.3/+2.0 on dev/test over MERT, and up to +1.5/+1.5 over PRO, thanks to 20M+ sparse features. 2 Phrase-Based MT and Forced Decoding We first review the basic phrase-based decoding algorithm (Koehn, 2004), which will be adapted for forced decoding. 2.1 Background: Phrase-based Decoding We will use the following running example from Chinese to English from Mi et al. (2008): Figure 1: Standard beam-search phrase-based decoding. Bùshí yǔ Shālóng jǔxíng le huìtán / Bush with Sharon hold -ed meeting / ‘Bush held a meeting with Sharon’. Phrase-based decoders generate partial target-language outputs in left-to-right order in the form of hypotheses (or states) (Koehn, 2004). Each hypothesis has a coverage vector capturing the source-language words translated so far, and can be extended into a longer hypothesis by a phrase-pair translating an uncovered segment. For example, the following is one possible derivation: (______ 0): (s0, “”) → (•_____ 1): (s1, “Bush”) → (•__••• 6): (s2, “Bush held talks”) → (•••••• 3): (s3, “Bush held talks with Sharon”), where a • in the coverage vector indicates the source word at that position is “covered” and where each si is the score of each state, each adding the rule score and the distortion cost (dc) to the score of the previous state. To compute the distortion cost we also need to maintain the ending position of the last phrase (e.g., the 3 and 6 in the coverage vectors). In phrase-based translation there is also a distortion limit which prohibits long-distance reorderings. 
The above states are called −LM states since they do not involve language model costs. To add a bigram model, we split each −LM state into a series of +LM states; each +LM state has the form (v, a) where a is the last word of the hypothesis. Thus a +LM version of the above derivation might be: (______ 0, <s>): (s0, “”) → (•_____ 1, Bush) → (•__••• 6, talks) → (•••••• 3, Sharon).
5 0.35360366 146 emnlp-2013-Optimal Incremental Parsing via Best-First Dynamic Programming
Author: Kai Zhao ; James Cross ; Liang Huang
Abstract: We present the first provably optimal polynomial time dynamic programming (DP) algorithm for best-first shift-reduce parsing, which applies the DP idea of Huang and Sagae (2010) to the best-first parser of Sagae and Lavie (2006) in a non-trivial way, reducing the complexity of the latter from exponential to polynomial. We prove the correctness of our algorithm rigorously. Experiments confirm that DP leads to a significant speedup on a probabilistic best-first shift-reduce parser, and makes exact search under such a model tractable for the first time.
6 0.31160825 66 emnlp-2013-Dynamic Feature Selection for Dependency Parsing
7 0.31153235 88 emnlp-2013-Flexible and Efficient Hypergraph Interactions for Joint Hierarchical and Forest-to-String Decoding
8 0.29918155 50 emnlp-2013-Combining PCFG-LA Models with Dual Decomposition: A Case Study with Function Labels and Binarization
9 0.29574326 107 emnlp-2013-Interactive Machine Translation using Hierarchical Translation Models
10 0.29367757 2 emnlp-2013-A Convex Alternative to IBM Model 2
11 0.28980562 40 emnlp-2013-Breaking Out of Local Optima with Count Transforms and Model Recombination: A Study in Grammar Induction
12 0.2892122 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization
13 0.28542691 150 emnlp-2013-Pair Language Models for Deriving Alternative Pronunciations and Spellings from Pronunciation Dictionaries
14 0.27752128 168 emnlp-2013-Semi-Supervised Feature Transformation for Dependency Parsing
15 0.27611047 127 emnlp-2013-Max-Margin Synchronous Grammar Induction for Machine Translation
16 0.27217031 171 emnlp-2013-Shift-Reduce Word Reordering for Machine Translation
17 0.2684809 4 emnlp-2013-A Dataset for Research on Short-Text Conversations
18 0.26847255 163 emnlp-2013-Sarcasm as Contrast between a Positive Sentiment and Negative Situation
19 0.26805183 13 emnlp-2013-A Study on Bootstrapping Bilingual Vector Spaces from Non-Parallel Data (and Nothing Else)
20 0.26804599 15 emnlp-2013-A Systematic Exploration of Diversity in Machine Translation