acl acl2013 acl2013-38 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: lemao liu ; Taro Watanabe ; Eiichiro Sumita ; Tiejun Zhao
Abstract: Most statistical machine translation (SMT) systems are modeled using a loglinear framework. Although the log-linear model achieves success in SMT, it still suffers from some limitations: (1) the features are required to be linear with respect to the model itself; (2) features cannot be further interpreted to reach their potential. A neural network is a reasonable method to address these pitfalls. However, modeling SMT with a neural network is not trivial, especially when taking the decoding efficiency into consideration. In this paper, we propose a variant of a neural network, i.e. additive neural networks, for SMT to go beyond the log-linear translation model. In addition, word embedding is employed as the input to the neural network, which encodes each word as a feature vector. Our model outperforms the log-linear translation models with/without embedding features on Chinese-to-English and Japanese-to-English translation tasks.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Most statistical machine translation (SMT) systems are modeled using a loglinear framework. [sent-11, score-0.271]
2 Although the log-linear model achieves success in SMT, it still suffers from some limitations: (1) the features are required to be linear with respect to the model itself; (2) features cannot be further interpreted to reach their potential. [sent-12, score-0.259]
3 A neural network is a reasonable method to address these pitfalls. [sent-13, score-0.541]
4 However, modeling SMT with a neural network is not trivial, especially when taking the decoding efficiency into consideration. [sent-14, score-0.722]
5 In this paper, we propose a variant of a neural network, i. [sent-15, score-0.368]
6 additive neural networks, for SMT to go beyond the log-linear translation model. [sent-17, score-0.821]
7 In addition, word embedding is employed as the input to the neural network, which encodes each word as a feature vector. [sent-18, score-0.681]
8 Our model outperforms the log-linear translation models with/without embedding features on Chinese-to-English and Japanese-to-English translation tasks. [sent-19, score-0.813]
9 Thus, it casts complex translation between a pair of languages as feature engineering, which facilitates research and development for SMT. [sent-22, score-0.264]
10 On the one hand, features are required to be linear with respect to the objective of the translation model (Nguyen et al. [sent-25, score-0.349]
11 This induces modeling inadequacy (Duh and Kirchhoff, 2008), in which the translation performance may not improve, or may even decrease, after one integrates additional features into the model. [sent-27, score-0.248]
12 What may happen is that a feature p does not initially improve the translation performance, but after a nonlinear operation, e. [sent-29, score-0.285]
13 Situations such as this complicate feature design, since it is unclear whether such a feature contributes to the translation or not. [sent-33, score-0.267]
14 A neural network (Bishop, 1995) is a reasonable method to overcome the above shortcomings. [sent-34, score-0.541]
15 Decoding in SMT is treated as the expansion of translation states and is handled by a heuristic search (Koehn, 2004a). [sent-38, score-0.197]
16 In the search procedure, the model score must be computed frequently for the search heuristic function, which poses a decoding-efficiency challenge for a neural network based translation model. [sent-39, score-0.975]
17 Actually, even for the (log-) linear model, efficient decoding with the language model is not trivial (Chiang, 2007). [sent-41, score-0.295]
18 In this paper, we propose a variant of neural networks, i. [sent-42, score-0.368]
19 additive neural networks (see Section 3 for details), for SMT. [sent-44, score-0.752]
20 , neural network) which encodes lo… [sent-47, score-0.334]
21 Compared with the log-linear model, it has more powerful expressive abilities and can deeply interpret and represent features with hidden units in neural networks. [sent-55, score-0.535]
22 Moreover, our method is simple to implement and its decoding efficiency is comparable to that of the log-linear model. [sent-56, score-0.218]
23 We also integrate word embedding into the model by representing each word as a feature vector (Collobert and Weston, 2008). [sent-57, score-0.403]
24 The biggest contribution of this paper is that it goes beyond the log-linear model and proposes a non-linear translation model instead of a re-ranking model (Duh and Kirchhoff, 2008; Sokolov et al. [sent-61, score-0.395]
25 On both Chinese-to-English and Japanese-to-English translation tasks, experiment results show that our model can overcome the shortcomings suffered by the log-linear model, and thus achieves significant improvements over the log-linear based translation. [sent-63, score-0.283]
26 1 Log-linear Translation Model Och and Ney (2002) proposed the log-linear translation model, which can be formalized as follows: a collection of synchronous rules for Hiero grammar (Chiang, 2005), or phrase pairs in Moses (Koehn et al. [sent-65, score-0.227]
27 , 1993), the log-linear model does not assume that strong independence holds, and allows arbitrary features to be integrated into the model easily. [sent-70, score-0.237]
28 In other words, it can transform complex language translation into feature engineering: it can achieve high translation performance if reasonable features are chosen and appropriate parameters are assigned for the weight vector. [sent-71, score-0.593]
29 2 Decoding By Search Given a source sentence f and a weight vector W, decoding finds the best translation candidate via the following programming problem: ⟨ê, d̂⟩ = argmax_{e,d} P(e, d | f; W) = argmax_{e,d} … [sent-73, score-0.394]
30 (2) Since the range of ⟨e, d⟩ is exponential with respect to the size of f, the exact decoding is intractable and an inexact strategy such as beam search is used instead in practice. [sent-76, score-0.209]
31 …the source sentence and translation candidate; over the sentence pair ⟨f, e⟩, … [sent-84, score-0.197]
32 During the state expansion process, the score wi · hi(f, e, d) for a partial translation is calculated repeatedly. [sent-88, score-0.197]
33 The main reason why cube-pruning works is that the translation model is linear and the model score for the language model is approximately monotonic (Chiang, 2007). [sent-91, score-0.41]
34 1 Motivation Although the log-linear model has achieved great progress for SMT, it still suffers from some pitfalls: it requires features to be linear with respect to the model and it cannot interpret and represent features deeply. [sent-93, score-0.3]
35 The neural network model is a reasonable method to overcome these pitfalls. [sent-94, score-0.597]
36 However, neural network based machine translation is far from easy. [sent-95, score-0.701]
37 As mentioned in Section 2, the decoding procedure performs an expansion of translation states. [sent-96, score-0.362]
38 Firstly, let us consider a simple case in neural network based translation where all the features in the translation model are independent of the translation state, i. [sent-97, score-1.202]
39 In this way, we can easily define the following translation model with a single-layer neural network: S(f, e, d; W, M, B) = W⊤ · σ(M · h(f, e, d) + B), (3) where M ∈ R^{u×K} is a matrix, B ∈ R^u is a vector, and Mi… [sent-100, score-0.587]
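As a rough illustration of Eq. (3), the sketch below scores a derivation's K-dimensional feature vector with a single hidden layer of u sigmoid units; the function and variable names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def score_nn(h, W, M, B):
    """Minimal sketch of Eq. (3): S = W^T * sigma(M h + B).

    h : (K,)   feature vector h(f, e, d) of the derivation
    M : (u, K) hidden-layer weight matrix
    B : (u,)   hidden-layer bias
    W : (u,)   output weight vector
    """
    hidden = 1.0 / (1.0 + np.exp(-(M @ h + B)))  # sigmoid hidden units
    return float(W @ hidden)
```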
40 Suppose the current translation state is encoded as ⟨e1, d1⟩, which is expanded into ⟨e2, d2⟩ using the rule r2 (d2 = d1 ∪ {r2}). [sent-118, score-0.197]
41 …additive property holds in F, and then we can obtain a new translation model via the following recursive equation: S(f, e2, d2; W, M, B) = S(f, e1, d1; W, M, B) + S… [sent-131, score-0.253]
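The recursion above is what makes decoding tractable: when the feature vector decomposes additively over rules, expanding a state only adds the score of the newly applied rule. The sketch below illustrates this under that assumption; the names are illustrative, not from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def expand_score(prev_score, h_rule, W, M, B):
    """Incremental scoring during state expansion: S(f, e2, d2) is obtained
    from S(f, e1, d1) by adding only the term contributed by the newly
    applied rule r2 (with local feature vector h_rule), so the full
    derivation never has to be rescored."""
    return prev_score + float(W @ sigmoid(M @ h_rule + B))
```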
42 For the log-linear translation model, Chiang (2007) proposed a cube-pruning method for scoring the language model. [sent-138, score-0.271]
43 However, if the language model is scored with a neural network, this premise is difficult to maintain. [sent-140, score-0.422]
44 2 Definition According to the above analysis, we propose a variant of a neural network model for machine translation, and we call it Additive Neural Networks or AdNN for short. [sent-143, score-0.594]
45 The AdNN model is a combination of a linear model and a neural network: non-local features, e.g. [sent-144, score-0.491]
46 LM, are linearly modeled for the cube-pruning strategy, and local features are modeled by the neural network for deep interpretation and representation. [sent-146, score-0.641]
47 Formally, the AdNN based translation model is discriminative but non-probabilistic, and it can be defined as follows: W⊤ · h(f, e, d) + Σ W′⊤ · σ(…) [sent-147, score-0.197]
48 Eq. (5) is similar to both additive models (Buja et al. [sent-159, score-0.29]
49 , 1989) and generalized additive neural networks (Potts, 1999): it consists of many additive terms, and each term is either a linear or a nonlinear (neural network) model. [sent-160, score-1.513]
50 That is the reason why our model is called “additive neural networks”. [sent-161, score-0.39]
51 Secondly, some of its additive terms share the same parameters (M, B). [sent-164, score-0.334]
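Putting the two properties together, a minimal sketch of the AdNN score under these assumptions follows: a linear term over the non-local (default) features plus, for every rule in the derivation, a single-layer neural term over that rule's local features, with the hidden-layer parameters (M, B) shared by all rules. The names below (h_global, rule_local_feats, W_prime) are illustrative, not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adnn_score(h_global, rule_local_feats, W, W_prime, M, B):
    """Sketch of the AdNN score: linear part over non-local features
    (e.g. the language model), plus a neural part summed over the rules
    of the derivation; every rule shares the same (M, B)."""
    linear_part = float(W @ h_global)
    neural_part = sum(float(W_prime @ sigmoid(M @ h_r + B))
                      for h_r in rule_local_feats)
    return linear_part + neural_part
```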
52 under the log-linear translation framework, which first learn features or sub-models and then tune the log-linear model, including the learned features, in two separate steps. [sent-181, score-0.397]
53 By joint training, AdNN can learn the features towards the translation evaluation metric, which is the main advantage of our model over the log-linear model. [sent-182, score-0.304]
54 Eq. (5) includes 8 default features, which consist of translation probabilities, lexical translation probabilities, word penalty, glue rule penalty, synchronous rule penalty and language model. [sent-185, score-0.551]
55 For the local feature vector h′ in Eq. (5), we employ word embedding features as described in the following subsection. [sent-187, score-0.436]
56 3 Word Embedding features for AdNN Word embedding can relax the sparsity introduced by lexicalization in NLP, and it improves systems for many tasks such as language modeling, named entity recognition, and parsing (Collobert and Weston, 2008; Turian et al. [sent-189, score-0.363]
57 Here, we propose embedding features for rules in SMT by combining word embeddings. [sent-191, score-0.363]
58 Firstly, we will define the embedding for the source side α of a rule r : X → ⟨α, γ⟩. [sent-192, score-0.356]
59 Let VS be the vocabulary in the source language with size |VS|; R^{n×|VS|} be the word embedding matrix, each column of which is the word embedding (n-dimensional vector) for the corresponding word in VS; and maxSource be the maximal length of α for all rules. [sent-193, score-0.624]
60 After padding α with "NULL" words up to length maxSource, we define the embedding of α as the concatenation of the word embedding of each word in α. [sent-195, score-0.624]
61 In particular, for the non-terminal in α, we define its word embedding as the vector whose components are 0. [sent-196, score-0.312]
62 1; and we define the word embedding of "NULL" as 0. [sent-197, score-0.312]
63 Then, we similarly define the embedding for the target side of a rule, given an embedding matrix for the target vocabulary. [sent-198, score-0.624]
64 Finally, we define the embedding of a rule as the concatenation of the embedding of its source and target sides. [sent-199, score-0.668]
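A sketch of the rule embedding construction just described, under the stated conventions (0.1-vectors for non-terminals, zero "NULL" padding, concatenation of source and target sides); the helper names and the way non-terminals are marked are assumptions for illustration.

```python
import numpy as np

def side_embedding(tokens, emb, word2id, max_len, n=30):
    """Embedding of one rule side: concatenate the n-dim embedding of each
    token, use a constant 0.1 vector for non-terminals, and pad with the
    all-zero "NULL" embedding up to max_len positions."""
    cols = []
    for tok in tokens[:max_len]:
        if tok == "[X]":                      # assumed non-terminal marker
            cols.append(np.full(n, 0.1))
        else:
            cols.append(emb[:, word2id[tok]])
    while len(cols) < max_len:                # "NULL" padding
        cols.append(np.zeros(n))
    return np.concatenate(cols)

def rule_embedding(src_tokens, tgt_tokens, src_emb, tgt_emb,
                   src_vocab, tgt_vocab, max_src, max_tgt):
    """Rule embedding = concatenation of its source-side and target-side embeddings."""
    return np.concatenate([
        side_embedding(src_tokens, src_emb, src_vocab, max_src),
        side_embedding(tgt_tokens, tgt_emb, tgt_vocab, max_tgt),
    ])
```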
65 In this paper, we apply the word embedding matrices from the RNNLM toolkit (Mikolov et al. [sent-200, score-0.341]
66 It would be potentially better to train the word embedding matrix from a much larger corpus as in (Collobert and Weston, 2008), and we will leave this as a future task. [sent-202, score-0.312]
67 In the RNNLM toolkit, the default dimension for word embedding is n = 30. [sent-207, score-0.351]
68 …Eq. (6) defined over the sampled preference pairs; 4: θ_{t+1} = CG(θ_t, Obj, ∆, CGIter); 5: end for; Output: θ_{T+1}. In detail, line 2 in Algorithm 1 first follows PRO to sample a set of preference pairs from the k-best list, and then uniformly samples batch-size pairs from the preference pair set. [sent-248, score-0.252]
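The overall loop this fragment of Algorithm 1 describes can be sketched as below: sample PRO-style preference pairs from the current k-best lists, draw a mini-batch, and run a few conjugate-gradient steps on the pairwise objective. The sampling function, objective, and gradient are placeholders for the paper's PRO sampling and Eq. (6), so everything named here is an assumption, not the paper's implementation.

```python
import random
from scipy.optimize import minimize

def train_adnn(theta, kbest_lists, sample_pairs, objective, grad,
               max_iter=10, batch_size=100, cg_iter=20):
    """Illustrative outline of the mini-batch tuning loop around Algorithm 1."""
    for _ in range(max_iter):
        pairs = sample_pairs(kbest_lists)                  # PRO-style sampling (line 2)
        batch = random.sample(pairs, min(batch_size, len(pairs)))
        result = minimize(objective, theta, args=(batch,), jac=grad,
                          method="CG", options={"maxiter": cg_iter})
        theta = result.x                                   # theta_{t+1} = CG(theta_t, ...)
    return theta
```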
69 Actually, this problem is inherent, and it is one that many neural network based approaches to other NLP tasks, such as language modeling and parsing, also suffer from. [sent-255, score-0.56]
70 1 Experimental Setting We conduct our experiments on the Chinese-to-English and Japanese-to-English translation tasks. [sent-294, score-0.197]
71 In our experiments, the translation performances are measured by the case-sensitive BLEU4 metric (Papineni et al. [sent-301, score-0.197]
72 We use an in-house developed hierarchical phrase-based translation system (Chiang, 2005) as our baseline, which shares a similar setting with Hiero (Chiang, 2005), e. [sent-304, score-0.229]
73 Further, we integrate the embedding features (See Section 3. [sent-322, score-0.363]
74 AdNN-Hiero-E is our implementation of the AdNN model with embedding features, as discussed in Section 3, and it shares the same codebase and settings as L-Hiero. [sent-325, score-0.4]
75 Although there are several parameters in AdNN which may limit its practicability, according to many of our internal studies, AdNN is insensitive to most parameters except λ and MaxIter, which are common in other tuning toolkits such as MIRA and can be tuned on a development test dataset. [sent-329, score-0.212]
76 2 Results and Analysis As discussed in Section 3, our AdNN-Hiero-E shares the same decoding strategy and pruning method as L-Hiero. [sent-334, score-0.197]
77 Table 2: BLEU comparisons between AdNN-Hiero-E and log-linear translation models on the Chinese-to-English and Japanese-to-English tasks. [sent-351, score-0.197]
78 Since these features are not dependent on the translation states, they are computed and saved to memory when loading the translation model. [sent-354, score-0.445]
79 Therefore, the decoding efficiency of AdNN-Hiero-E is almost the same as that of L-Hiero. [sent-356, score-0.218]
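A tiny sketch of that caching idea, under the same assumptions as the earlier snippets: because a rule's local (embedding) features do not depend on the translation state, the neural term of every rule can be computed once at model-loading time and simply looked up during decoding. The names are illustrative.

```python
import numpy as np

def precompute_rule_scores(rule_feats, W_prime, M, B):
    """Compute and cache the per-rule neural term once, at load time.
    rule_feats maps a rule id to its local feature vector h'(r)."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    return {rid: float(W_prime @ sigmoid(M @ h_r + B))
            for rid, h_r in rule_feats.items()}
```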
80 Word embedding features can improve the performance on other NLP tasks (Turian et al. [sent-360, score-0.363]
81 the log-linear model requires the features to be linear with respect to the model and thus limits its expressive abilities. [sent-365, score-0.311]
82 However, after applying the single-layer non-linear operator (sigmoid function) to the embedding features for deep interpretation and representation, AdNN-Hiero-E gains improvements over both L-Hiero and L-Hiero-E, as depicted in Table 2. [sent-366, score-0.452]
83 Although there are serious overlaps between h and h′ for AdNN-Hiero-D which may limit its generalization abilities, as shown in Table 3, it is still comparable to L-Hiero on the Japanese-to-English task, and significantly outperforms L-Hiero on the Chinese-to-English translation task. [sent-393, score-0.197]
84 To investigate the reason why the gains for AdNN-Hiero-D on the two different translation tasks differ, we calculate the perplexities between the target side of the training data and the test datasets on both translation tasks. [sent-394, score-0.467]
85 Based on these similarity statistics, we conjecture that the log-linear model does not fit well for difficult translation tasks (e. [sent-398, score-0.253]
86 Unlike these works, we propose a variant of neural networks, i.e. [sent-412, score-0.368]
87 additive neural networks, starting from SMT itself and taking both the model definition and its inference (decoding) into account. [sent-414, score-0.68]
88 Our variant of neural network, AdNN, is highly related to both additive models (Buja et al. [sent-415, score-0.658]
89 , 1989) and generalized additive neural networks (Potts, 1999; Waal and Toit, 2007), in which an additive term is either a linear model or a neural network. [sent-416, score-1.516]
90 The idea of the neural network in machine translation has already been pioneered in previous works. [sent-418, score-0.701]
91 (1997) introduced a neural network for example-based machine translation. [sent-420, score-0.504]
92 (2012) and Schwenk (2012) employed a neural network to model the phrase translation probability on the rule level ⟨α, γ⟩ instead of the bilingual sentence level ⟨f, e⟩ as in Eq. [sent-422, score-0.801]
93 Unlike their post-processing models (either a re-ranking or a system combination model) in SMT, we propose a non-linear translation model which can be easily incorporated into the existing SMT framework. [sent-429, score-0.253]
94 One advantage of our model is that it jointly learns features and tunes the translation model and thus learns features towards the translation evaluation metric. [sent-432, score-0.608]
95 Additionally, the decoding of our model is as efficient as that of the log-linear model. [sent-433, score-0.25]
96 For Chinese-to-English and Japanese-to-English translation tasks, our model significantly outperforms the log-linear model, with the help of word embedding. [sent-434, score-0.253]
97 We plan to explore more work on the additive neural networks in the future. [sent-435, score-0.752]
98 Acknowledgments We would like to thank our colleagues in both HIT and NICT for insightful discussions, Tomas Mikolov for the helpful discussion about the word embedding in RNNLM, and three anonymous reviewers for many invaluable comments and suggestions to improve our paper. [sent-437, score-0.312]
99 Online large-margin training of syntactic and structural translation features. [sent-476, score-0.229]
100 A unified architecture for natural language processing: Deep neural networks with multitask learning. [sent-502, score-0.462]
wordName wordTfidf (topN-words)
[('adnn', 0.418), ('neural', 0.334), ('embedding', 0.312), ('additive', 0.29), ('translation', 0.197), ('network', 0.17), ('decoding', 0.165), ('maxiter', 0.156), ('cg', 0.129), ('networks', 0.128), ('pro', 0.115), ('smt', 0.114), ('cgiter', 0.11), ('collobert', 0.1), ('chiang', 0.085), ('mert', 0.077), ('loglinear', 0.074), ('preference', 0.07), ('weston', 0.07), ('conjugate', 0.068), ('buja', 0.066), ('maxsource', 0.066), ('sokolov', 0.066), ('substructure', 0.058), ('duh', 0.057), ('model', 0.056), ('stroudsburg', 0.055), ('tuning', 0.054), ('rnnlm', 0.054), ('och', 0.054), ('efficiency', 0.053), ('nonlinear', 0.053), ('features', 0.051), ('deep', 0.048), ('mikolov', 0.048), ('friendly', 0.046), ('abilities', 0.046), ('hiero', 0.046), ('linear', 0.045), ('hf', 0.045), ('parameters', 0.044), ('csg', 0.044), ('deselaers', 0.044), ('eters', 0.044), ('hager', 0.044), ('lhiero', 0.044), ('odel', 0.044), ('param', 0.044), ('rectangle', 0.044), ('thie', 0.044), ('tomas', 0.044), ('waal', 0.044), ('pa', 0.044), ('rule', 0.044), ('kirchhoff', 0.043), ('firstly', 0.042), ('interpret', 0.041), ('gains', 0.041), ('subgradient', 0.041), ('turian', 0.041), ('bleu', 0.04), ('moses', 0.04), ('default', 0.039), ('casta', 0.039), ('hhe', 0.039), ('overloading', 0.039), ('generalized', 0.039), ('toolkits', 0.038), ('local', 0.038), ('reasonable', 0.037), ('koehn', 0.037), ('vs', 0.036), ('algorithm', 0.036), ('fujii', 0.036), ('feature', 0.035), ('boosting', 0.034), ('association', 0.034), ('deeply', 0.034), ('potts', 0.034), ('variant', 0.034), ('solution', 0.033), ('premise', 0.032), ('accelerate', 0.032), ('cooperation', 0.032), ('nonlocal', 0.032), ('development', 0.032), ('training', 0.032), ('shares', 0.032), ('weight', 0.032), ('le', 0.031), ('nict', 0.031), ('decomposable', 0.031), ('goes', 0.03), ('synchronous', 0.03), ('runs', 0.03), ('suffered', 0.03), ('hi', 0.03), ('toolkit', 0.029), ('efficient', 0.029), ('expressive', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 38 acl-2013-Additive Neural Networks for Statistical Machine Translation
Author: lemao liu ; Taro Watanabe ; Eiichiro Sumita ; Tiejun Zhao
Abstract: Most statistical machine translation (SMT) systems are modeled using a loglinear framework. Although the log-linear model achieves success in SMT, it still suffers from some limitations: (1) the features are required to be linear with respect to the model itself; (2) features cannot be further interpreted to reach their potential. A neural network is a reasonable method to address these pitfalls. However, modeling SMT with a neural network is not trivial, especially when taking the decoding efficiency into consideration. In this paper, we propose a variant of a neural network, i.e. additive neural networks, for SMT to go beyond the log-linear translation model. In addition, word embedding is employed as the input to the neural network, which encodes each word as a feature vector. Our model outperforms the log-linear translation models with/without embedding features on Chinese-to-English and Japanese-to-English translation tasks.
2 0.29992512 388 acl-2013-Word Alignment Modeling with Context Dependent Deep Neural Network
Author: Nan Yang ; Shujie Liu ; Mu Li ; Ming Zhou ; Nenghai Yu
Abstract: In this paper, we explore a novel bilingual word alignment approach based on DNN (Deep Neural Network), which has been proven to be very effective in various machine learning tasks (Collobert et al., 2011). We describe in detail how we adapt and extend the CD-DNNHMM (Dahl et al., 2012) method introduced in speech recognition to the HMMbased word alignment model, in which bilingual word embedding is discriminatively learnt to capture lexical translation information, and surrounding words are leveraged to model context information in bilingual sentences. While being capable to model the rich bilingual correspondence, our method generates a very compact model with much fewer parameters. Experiments on a large scale EnglishChinese word alignment task show that the proposed method outperforms the HMM and IBM model 4 baselines by 2 points in F-score.
3 0.2174392 294 acl-2013-Re-embedding words
Author: Igor Labutov ; Hod Lipson
Abstract: We present a fast method for re-purposing existing semantic word vectors to improve performance in a supervised task. Recently, with an increase in computing resources, it became possible to learn rich word embeddings from massive amounts of unlabeled data. However, some methods take days or weeks to learn good embeddings, and some are notoriously difficult to train. We propose a method that takes as input an existing embedding, some labeled data, and produces an embedding in the same space, but with a better predictive performance in the supervised task. We show improvement on the task of sentiment classification with re- spect to several baselines, and observe that the approach is most useful when the training set is sufficiently small.
4 0.20215482 35 acl-2013-Adaptation Data Selection using Neural Language Models: Experiments in Machine Translation
Author: Kevin Duh ; Graham Neubig ; Katsuhito Sudoh ; Hajime Tsukada
Abstract: Data selection is an effective approach to domain adaptation in statistical machine translation. The idea is to use language models trained on small in-domain text to select similar sentences from large general-domain corpora, which are then incorporated into the training data. Substantial gains have been demonstrated in previous works, which employ standard ngram language models. Here, we explore the use of neural language models for data selection. We hypothesize that the continuous vector representation of words in neural language models makes them more effective than n-grams for modeling un- known word contexts, which are prevalent in general-domain text. In a comprehensive evaluation of 4 language pairs (English to German, French, Russian, Spanish), we found that neural language models are indeed viable tools for data selection: while the improvements are varied (i.e. 0.1 to 1.7 gains in BLEU), they are fast to train on small in-domain data and can sometimes substantially outperform conventional n-grams.
5 0.17513672 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation
Author: Rico Sennrich ; Holger Schwenk ; Walid Aransa
Abstract: While domain adaptation techniques for SMT have proven to be effective at improving translation quality, their practicality for a multi-domain environment is often limited because of the computational and human costs of developing and maintaining multiple systems adapted to different domains. We present an architecture that delays the computation of translation model features until decoding, allowing for the application of mixture-modeling techniques at decoding time. We also de- scribe a method for unsupervised adaptation with development and test data from multiple domains. Experimental results on two language pairs demonstrate the effectiveness of both our translation model architecture and automatic clustering, with gains of up to 1BLEU over unadapted systems and single-domain adaptation.
6 0.15156662 221 acl-2013-Learning Non-linear Features for Machine Translation Using Gradient Boosting Machines
7 0.1512191 361 acl-2013-Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers
9 0.1457506 40 acl-2013-Advancements in Reordering Models for Statistical Machine Translation
10 0.14330158 226 acl-2013-Learning to Prune: Context-Sensitive Pruning for Syntactic MT
11 0.13512713 275 acl-2013-Parsing with Compositional Vector Grammars
12 0.13305856 24 acl-2013-A Tale about PRO and Monsters
13 0.12756996 181 acl-2013-Hierarchical Phrase Table Combination for Machine Translation
14 0.12590292 10 acl-2013-A Markov Model of Machine Translation using Non-parametric Bayesian Inference
15 0.12468314 19 acl-2013-A Shift-Reduce Parsing Algorithm for Phrase-based String-to-Dependency Translation
16 0.12290535 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation
17 0.11928929 156 acl-2013-Fast and Adaptive Online Training of Feature-Rich Translation Models
18 0.1163165 264 acl-2013-Online Relative Margin Maximization for Statistical Machine Translation
19 0.11559714 314 acl-2013-Semantic Roles for String to Tree Machine Translation
20 0.11532369 328 acl-2013-Stacking for Statistical Machine Translation
topicId topicWeight
[(0, 0.258), (1, -0.164), (2, 0.175), (3, 0.104), (4, -0.028), (5, 0.023), (6, 0.051), (7, -0.03), (8, -0.051), (9, 0.127), (10, -0.011), (11, -0.032), (12, 0.055), (13, -0.126), (14, 0.016), (15, 0.133), (16, -0.118), (17, 0.013), (18, 0.039), (19, -0.191), (20, 0.099), (21, -0.027), (22, -0.122), (23, -0.016), (24, 0.052), (25, -0.074), (26, 0.141), (27, -0.089), (28, 0.088), (29, 0.014), (30, -0.05), (31, -0.011), (32, -0.012), (33, -0.119), (34, 0.066), (35, -0.094), (36, -0.021), (37, -0.056), (38, 0.024), (39, -0.089), (40, 0.009), (41, 0.066), (42, 0.032), (43, 0.074), (44, -0.017), (45, -0.074), (46, -0.008), (47, 0.035), (48, -0.066), (49, -0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.92254633 38 acl-2013-Additive Neural Networks for Statistical Machine Translation
Author: lemao liu ; Taro Watanabe ; Eiichiro Sumita ; Tiejun Zhao
Abstract: Most statistical machine translation (SMT) systems are modeled using a loglinear framework. Although the log-linear model achieves success in SMT, it still suffers from some limitations: (1) the features are required to be linear with respect to the model itself; (2) features cannot be further interpreted to reach their potential. A neural network is a reasonable method to address these pitfalls. However, modeling SMT with a neural network is not trivial, especially when taking the decoding efficiency into consideration. In this paper, we propose a variant of a neural network, i.e. additive neural networks, for SMT to go beyond the log-linear translation model. In addition, word embedding is employed as the input to the neural network, which encodes each word as a feature vector. Our model outperforms the log-linear translation models with/without embedding features on Chinese-to-English and Japanese-to-English translation tasks.
2 0.83278894 35 acl-2013-Adaptation Data Selection using Neural Language Models: Experiments in Machine Translation
Author: Kevin Duh ; Graham Neubig ; Katsuhito Sudoh ; Hajime Tsukada
Abstract: Data selection is an effective approach to domain adaptation in statistical machine translation. The idea is to use language models trained on small in-domain text to select similar sentences from large general-domain corpora, which are then incorporated into the training data. Substantial gains have been demonstrated in previous works, which employ standard ngram language models. Here, we explore the use of neural language models for data selection. We hypothesize that the continuous vector representation of words in neural language models makes them more effective than n-grams for modeling un- known word contexts, which are prevalent in general-domain text. In a comprehensive evaluation of 4 language pairs (English to German, French, Russian, Spanish), we found that neural language models are indeed viable tools for data selection: while the improvements are varied (i.e. 0.1 to 1.7 gains in BLEU), they are fast to train on small in-domain data and can sometimes substantially outperform conventional n-grams.
3 0.79473174 388 acl-2013-Word Alignment Modeling with Context Dependent Deep Neural Network
Author: Nan Yang ; Shujie Liu ; Mu Li ; Ming Zhou ; Nenghai Yu
Abstract: In this paper, we explore a novel bilingual word alignment approach based on DNN (Deep Neural Network), which has been proven to be very effective in various machine learning tasks (Collobert et al., 2011). We describe in detail how we adapt and extend the CD-DNNHMM (Dahl et al., 2012) method introduced in speech recognition to the HMMbased word alignment model, in which bilingual word embedding is discriminatively learnt to capture lexical translation information, and surrounding words are leveraged to model context information in bilingual sentences. While being capable to model the rich bilingual correspondence, our method generates a very compact model with much fewer parameters. Experiments on a large scale EnglishChinese word alignment task show that the proposed method outperforms the HMM and IBM model 4 baselines by 2 points in F-score.
4 0.68735349 294 acl-2013-Re-embedding words
Author: Igor Labutov ; Hod Lipson
Abstract: We present a fast method for re-purposing existing semantic word vectors to improve performance in a supervised task. Recently, with an increase in computing resources, it became possible to learn rich word embeddings from massive amounts of unlabeled data. However, some methods take days or weeks to learn good embeddings, and some are notoriously difficult to train. We propose a method that takes as input an existing embedding, some labeled data, and produces an embedding in the same space, but with a better predictive performance in the supervised task. We show improvement on the task of sentiment classification with re- spect to several baselines, and observe that the approach is most useful when the training set is sufficiently small.
Author: Heike Adel ; Ngoc Thang Vu ; Tanja Schultz
Abstract: In this paper, we investigate the application of recurrent neural network language models (RNNLM) and factored language models (FLM) to the task of language modeling for Code-Switching speech. We present a way to integrate partof-speech tags (POS) and language information (LID) into these models which leads to significant improvements in terms of perplexity. Furthermore, a comparison between RNNLMs and FLMs and a detailed analysis of perplexities on the different backoff levels are performed. Finally, we show that recurrent neural networks and factored language models can . be combined using linear interpolation to achieve the best performance. The final combined language model provides 37.8% relative improvement in terms of perplexity on the SEAME development set and a relative improvement of 32.7% on the evaluation set compared to the traditional n-gram language model. Index Terms: multilingual speech processing, code switching, language modeling, recurrent neural networks, factored language models
6 0.63290972 216 acl-2013-Large tagset labeling using Feed Forward Neural Networks. Case study on Romanian Language
7 0.62159187 328 acl-2013-Stacking for Statistical Machine Translation
8 0.60989285 221 acl-2013-Learning Non-linear Features for Machine Translation Using Gradient Boosting Machines
9 0.60599989 275 acl-2013-Parsing with Compositional Vector Grammars
10 0.58843988 127 acl-2013-Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation
11 0.58483976 24 acl-2013-A Tale about PRO and Monsters
12 0.57897347 156 acl-2013-Fast and Adaptive Online Training of Feature-Rich Translation Models
13 0.56894308 219 acl-2013-Learning Entity Representation for Entity Disambiguation
14 0.56696397 254 acl-2013-Multimodal DBN for Predicting High-Quality Answers in cQA portals
15 0.55966413 137 acl-2013-Enlisting the Ghost: Modeling Empty Categories for Machine Translation
16 0.54525232 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation
17 0.54165322 361 acl-2013-Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers
18 0.52856916 110 acl-2013-Deepfix: Statistical Post-editing of Statistical Machine Translation Using Deep Syntactic Analysis
19 0.52101833 201 acl-2013-Integrating Translation Memory into Phrase-Based Machine Translation during Decoding
20 0.52037299 180 acl-2013-Handling Ambiguities of Bilingual Predicate-Argument Structures for Statistical Machine Translation
topicId topicWeight
[(0, 0.048), (6, 0.03), (11, 0.066), (15, 0.026), (24, 0.039), (26, 0.069), (29, 0.02), (35, 0.059), (40, 0.149), (42, 0.103), (48, 0.06), (67, 0.039), (70, 0.041), (88, 0.024), (90, 0.057), (95, 0.068)]
simIndex simValue paperId paperTitle
1 0.91790748 94 acl-2013-Coordination Structures in Dependency Treebanks
Author: Martin Popel ; David Marecek ; Jan StÄłpanek ; Daniel Zeman ; ZdÄłnÄłk Zabokrtsky
Abstract: Paratactic syntactic structures are notoriously difficult to represent in dependency formalisms. This has painful consequences such as high frequency of parsing errors related to coordination. In other words, coordination is a pending problem in dependency analysis of natural languages. This paper tries to shed some light on this area by bringing a systematizing view of various formal means developed for encoding coordination structures. We introduce a novel taxonomy of such approaches and apply it to treebanks across a typologically diverse range of 26 languages. In addition, empirical observations on convertibility between selected styles of representations are shown too.
2 0.91056442 308 acl-2013-Scalable Modified Kneser-Ney Language Model Estimation
Author: Kenneth Heafield ; Ivan Pouzyrevsky ; Jonathan H. Clark ; Philipp Koehn
Abstract: We present an efficient algorithm to estimate large modified Kneser-Ney models including interpolation. Streaming and sorting enables the algorithm to scale to much larger models by using a fixed amount of RAM and variable amount of disk. Using one machine with 140 GB RAM for 2.8 days, we built an unpruned model on 126 billion tokens. Machine translation experiments with this model show improvement of 0.8 BLEU point over constrained systems for the 2013 Workshop on Machine Translation task in three language pairs. Our algorithm is also faster for small models: we estimated a model on 302 million tokens using 7.7% of the RAM and 14.0% of the wall time taken by SRILM. The code is open source as part of KenLM.
3 0.87825853 260 acl-2013-Nonconvex Global Optimization for Latent-Variable Models
Author: Matthew R. Gormley ; Jason Eisner
Abstract: Many models in NLP involve latent variables, such as unknown parses, tags, or alignments. Finding the optimal model parameters is then usually a difficult nonconvex optimization problem. The usual practice is to settle for local optimization methods such as EM or gradient ascent. We explore how one might instead search for a global optimum in parameter space, using branch-and-bound. Our method would eventually find the global maximum (up to a user-specified ?) if run for long enough, but at any point can return a suboptimal solution together with an upper bound on the global maximum. As an illustrative case, we study a generative model for dependency parsing. We search for the maximum-likelihood model parameters and corpus parse, subject to posterior constraints. We show how to formulate this as a mixed integer quadratic programming problem with nonlinear constraints. We use the Reformulation Linearization Technique to produce convex relaxations during branch-and-bound. Although these techniques do not yet provide a practical solution to our instance of this NP-hard problem, they sometimes find better solutions than Viterbi EM with random restarts, in the same time.
4 0.87711978 163 acl-2013-From Natural Language Specifications to Program Input Parsers
Author: Tao Lei ; Fan Long ; Regina Barzilay ; Martin Rinard
Abstract: We present a method for automatically generating input parsers from English specifications of input file formats. We use a Bayesian generative model to capture relevant natural language phenomena and translate the English specification into a specification tree, which is then translated into a C++ input parser. We model the problem as a joint dependency parsing and semantic role labeling task. Our method is based on two sources of information: (1) the correlation between the text and the specification tree and (2) noisy supervision as determined by the success of the generated C++ parser in reading input examples. Our results show that our approach achieves 80.0% F-Score accu- , racy compared to an F-Score of 66.7% produced by a state-of-the-art semantic parser on a dataset of input format specifications from the ACM International Collegiate Programming Contest (which were written in English for humans with no intention of providing support for automated processing).1
same-paper 5 0.86673802 38 acl-2013-Additive Neural Networks for Statistical Machine Translation
Author: lemao liu ; Taro Watanabe ; Eiichiro Sumita ; Tiejun Zhao
Abstract: Most statistical machine translation (SMT) systems are modeled using a loglinear framework. Although the log-linear model achieves success in SMT, it still suffers from some limitations: (1) the features are required to be linear with respect to the model itself; (2) features cannot be further interpreted to reach their potential. A neural network is a reasonable method to address these pitfalls. However, modeling SMT with a neural network is not trivial, especially when taking the decoding efficiency into consideration. In this paper, we propose a variant of a neural network, i.e. additive neural networks, for SMT to go beyond the log-linear translation model. In addition, word embedding is employed as the input to the neural network, which encodes each word as a feature vector. Our model outperforms the log-linear translation models with/without embedding features on Chinese-to-English and Japanese-to-English translation tasks.
6 0.83269119 235 acl-2013-Machine Translation Detection from Monolingual Web-Text
7 0.77083302 101 acl-2013-Cut the noise: Mutually reinforcing reordering and alignments for improved machine translation
8 0.75939703 226 acl-2013-Learning to Prune: Context-Sensitive Pruning for Syntactic MT
9 0.75355715 132 acl-2013-Easy-First POS Tagging and Dependency Parsing with Beam Search
10 0.75169647 70 acl-2013-Bilingually-Guided Monolingual Dependency Grammar Induction
11 0.75119966 276 acl-2013-Part-of-Speech Induction in Dependency Trees for Statistical Machine Translation
12 0.7510404 343 acl-2013-The Effect of Higher-Order Dependency Features in Discriminative Phrase-Structure Parsing
13 0.74695134 264 acl-2013-Online Relative Margin Maximization for Statistical Machine Translation
14 0.746503 388 acl-2013-Word Alignment Modeling with Context Dependent Deep Neural Network
15 0.74628872 47 acl-2013-An Information Theoretic Approach to Bilingual Word Clustering
16 0.74583441 56 acl-2013-Argument Inference from Relevant Event Mentions in Chinese Argument Extraction
17 0.74519616 225 acl-2013-Learning to Order Natural Language Texts
18 0.74488938 127 acl-2013-Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation
19 0.74488801 80 acl-2013-Chinese Parsing Exploiting Characters
20 0.74406654 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation