acl acl2013 acl2013-388 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Nan Yang ; Shujie Liu ; Mu Li ; Ming Zhou ; Nenghai Yu
Abstract: In this paper, we explore a novel bilingual word alignment approach based on DNN (Deep Neural Network), which has been proven to be very effective in various machine learning tasks (Collobert et al., 2011). We describe in detail how we adapt and extend the CD-DNN-HMM (Dahl et al., 2012) method introduced in speech recognition to the HMM-based word alignment model, in which bilingual word embedding is discriminatively learnt to capture lexical translation information, and surrounding words are leveraged to model context information in bilingual sentences. While capable of modeling the rich bilingual correspondence, our method generates a very compact model with far fewer parameters. Experiments on a large-scale English-Chinese word alignment task show that the proposed method outperforms the HMM and IBM model 4 baselines by 2 points in F-score.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In this paper, we explore a novel bilingual word alignment approach based on DNN (Deep Neural Network), which has been proven to be very effective in various machine learning tasks (Collobert et al. [sent-4, score-0.374]
2 , 2012) method introduced in speech recognition to the HMM-based word alignment model, in which bilingual word embedding is discriminatively learnt to capture lexical translation information, and surrounding words are leveraged to model context information in bilingual sentences. [sent-7, score-0.898]
3 Experiments on a large-scale English-Chinese word alignment task show that the proposed method outperforms the HMM and IBM model 4 baselines by 2 points in F-score. [sent-9, score-0.336]
4 1 Introduction In recent years, research communities have seen a strong resurgence of interest in modeling with deep (multi-layer) neural networks. [sent-10, score-0.42]
5 The unsupervised pretraining trains the network one layer at a time, and helps to guide the parameters of the layer towards better regions in parameter space (Bengio, 2009). [sent-19, score-0.752]
6 , 2012) proposed a context-dependent neural network with a large vocabulary, which achieved 16. [sent-26, score-0.545]
7 Word embedding is usually first learned from a huge amount of monolingual text, and then fine-tuned with task-specific objectives. [sent-31, score-0.261]
8 Inspired by successful previous works, we propose a new DNN-based word alignment method, which exploits contextual and semantic similarities between words. [sent-36, score-0.299]
9 Figure 1: Two examples of word alignment. ... the English word “mammoth” is not, so it is very hard to align them correctly. [sent-53, score-0.384]
10 As we mentioned in the last paragraph, word embedding (trained on huge monolingual texts) has the ability to map a word into a vector space, in which similar words are near each other. [sent-55, score-0.373]
11 In the rest of this paper, related work on DNN and word alignment is first reviewed in Section 2, followed by a brief introduction of DNN in Section 3. [sent-60, score-0.299]
12 We then introduce the details of leveraging DNN for word alignment, including the details of our network structure in Section 4 and the training method in Section 5. [sent-61, score-0.308]
13 , 2006) proposed to use a multi-layer neural network for the language modeling task. [sent-75, score-0.545]
14 (Niehues and Waibel, 2012) shows that machine translation results can be improved by combining a neural language model with a traditional n-gram language model. [sent-78, score-0.429]
15 , 2012) improves the translation quality of an n-gram translation model by using a bilingual neural language model. [sent-80, score-0.574]
16 , 2012) learns context-free cross-lingual word embeddings to facilitate cross-lingual information retrieval. [sent-82, score-0.35]
17 Word embeddings often implicitly encode syntactic or semantic knowledge of the words. [sent-93, score-0.294]
18 Assuming a finite-sized vocabulary V, word embeddings form an (L × |V|)-dimension embedding matrix WV, where L is a pre-determined embedding length; mapping words to embeddings is done by simply looking up their respective columns in the embedding matrix WV. [sent-94, score-1.15]
19 The lookup process is called a lookup layer LT, which is usually the first layer after the input layer in a neural network. [sent-95, score-1.099]
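As a concrete illustration of the lookup layer described above, here is a minimal sketch in Python/numpy; the embedding length and vocabulary size are assumed values (the paper later sets L to 20), and the variable names are ours rather than the authors'.

```python
import numpy as np

# Minimal sketch of a lookup layer LT: word ids index columns of an
# L x |V| embedding matrix WV, and a window of words is mapped to the
# concatenation of its column vectors.
L, V_size = 20, 100000                         # assumed embedding length and vocabulary size
rng = np.random.default_rng(0)
WV = 0.01 * rng.standard_normal((L, V_size))   # embedding matrix, one column per word

def lookup(word_ids):
    """Return the concatenated embeddings of a window of word ids."""
    return np.concatenate([WV[:, w] for w in word_ids])   # shape (L * len(word_ids),)
```

The concatenated vector then feeds the first linear layer of the network.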
20 If the input must be of variable length, a convolution layer and a max layer can be used (Collobert et al. [sent-101, score-0.402]
21 Multi-layer neural networks are trained with the standard back propagation algorithm (LeCun, 1985). [sent-103, score-0.362]
22 , 1998) have been developed to train better neural networks. [sent-107, score-0.351]
23 Besides that, neural network training also involves some hyperparameters such as the learning rate and the number of hidden layers. [sent-108, score-0.649]
24 4 DNN for word alignment Our DNN word alignment model extends the classic HMM word alignment model (Vogel et al. [sent-110, score-1.118]
25 Given a sentence pair (e, f), HMM word alignment takes the following form: P(a, e|f) = ∏_{i=1}^{|e|} P_lex(e_i | f_{a_i}) P_d(a_i - a_{i-1}) (4), where P_lex is the lexical translation probability and P_d is the jump-distance distortion probability. [sent-112, score-0.63]
26 One straightforward way to integrate DNN into HMM is to use a neural network to compute the emission (lexical translation) probability P_lex. [sent-113, score-0.545]
27 Such an approach requires a softmax layer in the neural network to normalize over all words in the source vocabulary. [sent-114, score-0.775]
28 Hence we give up the probabilistic interpretation and resort to a non-probabilistic, discriminative view: s_NN(a|e, f) = ∏_{i=1}^{|e|} t_lex(e_i, f_{a_i}|e, f) t_d(a_i, a_{i-1}|e, f) (5), where t_lex is a lexical translation score computed by a neural network, and t_d is a distortion score. [sent-116, score-0.745]
29 In the classic HMM word alignment model, context is not considered in the lexical translation probability. [sent-117, score-0.554]
30 In contrast, our model does not maintain separate translation score parameters for every source-target word pair, but computes t_lex through a multi-layer network, which naturally handles contexts on both sides without an explosive growth in the number of parameters. [sent-120, score-0.363]
31 Figure 2 shows the neural network we used to compute the context-dependent lexical translation score t_lex. [sent-123, score-0.653]
32 For a word pair (e_i, f_j), we take fixed-length windows surrounding both e_i and f_j as input: (e_{i-sw/2}, . [sent-124, score-0.307]
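Continuing the lookup-layer sketch above, the following is a hedged illustration of how such a context-dependent t_lex score could be computed: the source and target windows are embedded, concatenated, and passed through linear layers with a hard-tanh activation to a single score. The window sizes, layer sizes, and the htanh choice follow the general Collobert-style setup mentioned in the text; the exact values used here are assumptions.

```python
def htanh(x):
    """Hard tanh activation, clipped to [-1, 1]."""
    return np.clip(x, -1.0, 1.0)

# Sketch of the t_lex network: embed the windows around e_i and f_j, concatenate,
# then apply linear + hard-tanh hidden layers and a final linear layer that
# outputs one context-dependent lexical translation score.
sw = tw = 5                                   # source / target window sizes (assumed)
sizes = [L * (sw + tw), 120, 10, 1]           # input, two hidden layers, scalar output (assumed)
Ws = [0.01 * rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]

def t_lex(src_window_ids, tgt_window_ids):
    h = np.concatenate([lookup(src_window_ids), lookup(tgt_window_ids)])
    for l, (W, b) in enumerate(zip(Ws, bs)):
        h = W @ h + b
        if l < len(Ws) - 1:                   # no non-linearity on the output score
            h = htanh(h)
    return float(h[0])
```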
33 For the distortion t_d, we could use a lexicalized distortion model: t_d(a_i, a_{i-1}|e, f) = t_d(a_i - a_{i-1} | window(f_a)) (7), which can be computed by a neural network similar to the one used to compute lexical translation scores. [sent-132, score-0.907]
34 If we map the jump distance (a_i - a_{i-1}) to B buckets, we can change the length of the output layer to B, where each dimension in the output stands for a different bucket of jump distances. [sent-133, score-0.389]
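The extracted text does not spell out the bucketing scheme, so the following is only one plausible way to map signed jump distances to B output dimensions; the clipping behaviour and the value of B are assumptions.

```python
def bucket_jump(distance, B=11):
    """Map a signed jump distance (a_i - a_{i-1}) to a bucket index in [0, B).

    Distances beyond +/- (B // 2) are clipped into the outermost buckets;
    this particular scheme is an assumption, not the paper's exact choice.
    """
    half = B // 2
    return max(-half, min(half, distance)) + half
```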
35 But we found in our initial experiments on small-scale data that lexicalized distortion does not produce better alignment than the simple jump-distance based model. [sent-134, score-0.389]
36 So we drop the lexicalized distortion and revert to the simple version: t_d(a_i, a_{i-1}|e, f) = t_d(a_i - a_{i-1}) (8). The vocabulary V of our alignment model consists of a source vocabulary Ve and a target vocabulary Vf. [sent-135, score-0.52]
37 To decode our model, the lexical translation scores are computed for each source-target word pair in the sentence pair, which requires going through the neural network (|e| × |f|) times; after that, the forward-backward algorithm can be used to find the Viterbi path as in the classic HMM model. [sent-139, score-0.856]
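A minimal, self-contained sketch of this decoding step follows: assuming the (|e| × |f|) lexical score matrix has already been computed with the network, a standard Viterbi pass over positions recovers the best alignment path. The function and variable names are ours, not the paper's.

```python
import numpy as np

def viterbi_align(lex_scores, t_d):
    """lex_scores[i, j]: score of linking word e_i to source word f_j.
    t_d(jump): distortion score for a signed jump distance.
    Returns one source index per word e_i (the best-scoring path)."""
    I, J = lex_scores.shape
    delta = np.full((I, J), -np.inf)
    back = np.zeros((I, J), dtype=int)
    delta[0] = lex_scores[0]                          # scores combined additively (log domain)
    for i in range(1, I):
        for j in range(J):
            trans = [delta[i - 1, k] + t_d(j - k) for k in range(J)]
            back[i, j] = int(np.argmax(trans))
            delta[i, j] = trans[back[i, j]] + lex_scores[i, j]
    path = [int(np.argmax(delta[-1]))]                # follow back-pointers from the best end state
    for i in range(I - 1, 0, -1):
        path.append(back[i, path[-1]])
    return path[::-1]

# Example with a simple distance-penalty distortion: t_d = lambda d: -abs(d)
```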
38 The majority of tunable parameters in our model resides in the lookup table LT, which is an (L × (|Ve| + |Vf|))-dimension matrix. [sent-140, score-0.24]
39 In fact, discriminative word alignment can model contexts by deploying arbitrary features (Moore, 2005). [sent-144, score-0.408]
40 Different from previous discriminative word alignment, our model does not use manually engineered features, but learns “features” automatically from raw words with the neural network. [sent-145, score-0.451]
41 , 1996) use a maximum entropy model to model the bag-of-words context for word alignment, but their model treats each word as a distinct feature, which cannot leverage the similarity between words as our model does. [sent-147, score-0.261]
42 , 2011) can be adapted to train our model from raw sentence pairs. (Footnote: In practice, the number of non-zero parameters in the classic HMM model would be much smaller, as many words do not co-occur in bilingual sentence pairs. [sent-149, score-0.36]
43 In our experiments, the number of non-zero parameters in the classic HMM model is about 328 million, while the NN model has only about 4 million.) [sent-150, score-0.293]
44 Such methods could train our model from raw sentence pairs, but they are too computationally demanding, as the lexical translation probabilities must be computed from neural networks. [sent-151, score-0.429]
45 As we do not have a large manually word aligned corpus, we use traditional word alignment models such as HMM and IBM model 4 to generate word alignment on a large parallel corpus. [sent-153, score-0.691]
46 We obtain bidirectional alignment by running the usual grow-diag-final heuristics (Koehn et al. [sent-154, score-0.243]
47 , 2012), where training data for the neural network model is generated by forced decoding with traditional Gaussian mixture models. [sent-157, score-0.611]
48 Tunable parameters in the neural network alignment model include: word embeddings in the lookup table LT, parameters Wl and bl for the linear transformations in the hidden layers of the neural network, and distortion parameters sd for jump distances. [sent-158, score-2.32]
49 One nuance here is that the gold alignment after grow-diag-final contains many-to-many links, which cannot be generated by any path. [sent-161, score-0.243]
50 Our solution is that for each source word aligned to multiple target words, we randomly choose one link among all candidates as the gold link. [sent-162, score-0.299]
51 Because our multi-layer neural network is inherently non-linear and non-convex, directly training against the above criteria is unlikely to yield good results. [sent-163, score-0.637]
52 1 Pre-training initial word embedding with monolingual data Most parameters reside in the word embeddings. [sent-166, score-0.413]
53 To get a good initial value, the usual approach is to pre-train the embeddings on a large monolingual corpus. [sent-167, score-0.37]
54 , 2011) and train word embeddings for the source and target languages from their respective monolingual corpora. [sent-169, score-0.455]
55 We set the word embedding length to 20, the window size to 5, and the length of the only hidden layer to 40. [sent-171, score-0.655]
56 Note that the embeddings for the null word in Ve and Vf cannot be trained from a monolingual corpus, and we simply leave them untouched at their initial values. [sent-177, score-0.327]
57 Word embeddings from a monolingual corpus capture strong syntactic knowledge of each word, which is not always desirable for word alignment between some language pairs like English and Chinese. [sent-178, score-0.669]
58 For example, many Chinese words can act as a verb, noun and adjective without any change, while their English counterparts are distinct words with quite different word embeddings due to their different syntactic roles. [sent-179, score-0.35]
59 Thus we have to modify the word embeddings in subsequent steps according to bilingual data. [sent-180, score-0.425]
60 2 Training neural network based on local criteria Training the network against the sentence level criteria Eq. [sent-182, score-0.894]
61 This training criterion essentially means our model suffers a loss unless it gives correct word pairs a higher score, by some margin, than random pairs from the same sentence pair. [sent-186, score-0.277]
62 We initialize the lookup table with embeddings obtained from monolingual training, and randomly initialize all Wl and bl in linear layers to [-0. [sent-187, score-0.717]
63 We randomly cycle through all sentence pairs in the training data; for each correct word pair (including null alignments), we generate a positive example, and generate two negative examples by randomly corrupting either side of the pair with another word in the sentence pair. [sent-191, score-0.259]
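To make the local criterion above concrete, here is a hedged sketch of the per-link ranking loss with two corrupted negatives. It reuses the t_lex network from the earlier sketch; window_e and window_f are hypothetical helpers that return the word-id window around a position, and the margin value as well as the gradient/update step (back-propagation) are omitted assumptions.

```python
import random

MARGIN = 1.0   # assumed margin value

def local_loss(sent_e, sent_f, links, window_e, window_f):
    """links: gold (i, j) alignment links for one sentence pair, including null links.
    Each gold link should outscore a randomly corrupted link by MARGIN."""
    loss = 0.0
    for (i, j) in links:
        pos = t_lex(window_e(sent_e, i), window_f(sent_f, j))
        # Two negative examples per positive: corrupt either side of the pair.
        for ci, cj in ((random.randrange(len(sent_e)), j),
                       (i, random.randrange(len(sent_f)))):
            neg = t_lex(window_e(sent_e, ci), window_f(sent_f, cj))
            loss += max(0.0, MARGIN - pos + neg)   # hinge (ranking) loss
    return loss
```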
64 To make our model concrete, there are still hyper-parameters to be determined: the window sizes sw and tw, and the length of each hidden layer Ll. [sent-196, score-0.48]
65 3 Training distortion parameters We fix the neural network parameters obtained from the last step, and tune the distortion parameters sd with respect to the sentence-level loss using standard stochastic gradient descent. [sent-199, score-1.194]
66 4 Tuning neural network based on sentence-level criteria Up to now, parameters in the lexical translation neural network have not been trained against the sentence-level criteria Eq. [sent-205, score-1.358]
67 We could achieve this by re-using the same online training method used to train distortion parameters, except that we now fix the distortion parameters and let the loss back-propagate through the neural networks. [sent-207, score-0.798]
68 This tuning is quite slow, and it did not improve alignment in an initial small-scale experiment; so we skip this step in all subsequent experiments in this work. [sent-209, score-0.243]
69 6 Experiments and Results We conduct our experiments on the Chinese-to-English word alignment task. [sent-210, score-0.299]
70 We use the manually aligned Chinese-English alignment corpus (Haghighi et al. [sent-211, score-0.243]
71 The monolingual corpora used to pre-train word embeddings are also crawled from the web, amounting to about 1. [sent-216, score-0.426]
72 We train our proposed model from the results of the classic HMM and IBM model 4 separately. [sent-221, score-0.25]
73 3 Alignment Result As can be seen from Table 1, the proposed model consistently outperforms its corresponding baseline, whether it is trained from the alignment of the classic HMM or IBM model 4. [sent-225, score-0.464]
74 In the future, we would like to explore whether our method can improve other word alignment models. [sent-234, score-0.299]
75 Despite the different alignment scores, we do not obtain a significant difference in translation performance. [sent-237, score-0.313]
76 307 for models trained from IBM-4 and NN alignment results. [sent-241, score-0.243]
77 The result is not surprising considering our parallel corpus is quite large, and similar observations have been made in previous work such as (DeNero and Macherey, 2011): better alignment quality does not necessarily lead to a better end-to-end result. [sent-242, score-0.243]
78 By analyzing the results, we found that for both the baselines and our model, a large part of the missing alignment links involves stop words such as the English words “the”, “a”, “it” and the Chinese word “de”. [sent-247, score-0.243]
79 Stop words are inherently hard to align, which often requires grammatical judgment unavailable to our models; as they are also extremely frequent, our model fully learns the baseline models' alignment patterns for them, including errors. [sent-248, score-0.28]
80 In our model, different person names have very similar word embeddings on both English side and Chinese side, due to monolingual pre-training; what is more, different person names often appear in similar contexts. [sent-252, score-0.598]
81 As our model considers both word embeddings and contexts, it learns that English person names should be aligned to Chinese person names, which corrects errors of baseline models and leads to better precision. [sent-253, score-0.51]
82 2 Effect of context To examine how context contributes to alignment quality, we re-train our model with different window sizes, all from the result of IBM model 4. [sent-256, score-0.495]
83 Figure 3: Effect of different window sizes on word alignment F-score. [sent-264, score-0.401]
84 With larger window size, our model is able to produce more accurate translation scores based on more contexts, which leads to better alignment despite the simpler distortions. [sent-270, score-0.452]
85 Two hidden layers outperform one hidden layer, while three hidden layers do not bring further improvement. [sent-273, score-0.489]
86 3 Effect of number of hidden layers Our neural network contains two hidden layers besides the lookup layer. [sent-276, score-1.046]
87 For the 1-hidden-layer setting, we set the hidden layer length to 120; and for the 3-hidden-layer setting, we set the hidden layer lengths to 120, 100 and 10 respectively. [sent-279, score-0.586]
88 As can be seen from Table 3, the 2-hidden-layer setting outperforms the 1-hidden-layer setting, while another hidden layer does not bring further improvement. Table 2: Nearest neighbors of several words according to their embedding distance. [sent-280, score-0.485]
89 LM shows neighbors of word embeddings trained by monolingual language model method; WA shows neighbors of word embeddings trained by our word alignment model. [sent-281, score-1.224]
90 Due to time constraints, we have not tuned hyper-parameters such as the hidden layer lengths in the 1- and 3-hidden-layer settings, nor have we tested settings with more hidden layers. [sent-283, score-0.273]
91 While this is true for relatively frequent nouns such as “lab” and “labs”, rarer nouns still remain near their monolingual embeddings as they are only modified a few times during the bilingual training. [sent-293, score-0.445]
92 7 Conclusion In this paper, we explore applying deep neural networks to the word alignment task. [sent-295, score-0.942]
93 Our model integrates a multi-layer neural network into an HMM-like framework, where the context-dependent lexical translation score is computed by a neural network, and distortion is modeled by a simple jump-distance scheme. [sent-296, score-1.158]
94 Our model is discriminatively trained on a bilingual corpus, while huge monolingual data is used to pre-train word embeddings. [sent-297, score-0.258]
95 Experiments on a large-scale Chinese-to-English task show that the proposed method produces better word alignment results, compared with both the classic HMM model and IBM model 4. [sent-298, score-0.52]
96 Secondly, we want to explore the possibility of unsupervised training of our neural word alignment model, without reliance on the alignment results of other models. [sent-300, score-0.893]
97 Neurocomputing: Algorithms, architectures and applications, chapter probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. [sent-329, score-0.259]
98 Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. [sent-344, score-0.46]
99 Advances in neural information processing systems, 20: 1185–1192. [sent-397, score-0.338]
100 Parsing natural scenes and natural language with recursive neural networks. [sent-422, score-0.36]
wordName wordTfidf (topN-words)
[('dnn', 0.349), ('neural', 0.322), ('embeddings', 0.294), ('alignment', 0.243), ('network', 0.223), ('layer', 0.201), ('hmm', 0.195), ('embedding', 0.153), ('classic', 0.147), ('distortion', 0.146), ('layers', 0.132), ('collobert', 0.122), ('dahl', 0.11), ('bengio', 0.109), ('window', 0.102), ('deep', 0.098), ('tlex', 0.092), ('lookup', 0.087), ('ibm', 0.086), ('zl', 0.085), ('ai', 0.084), ('mammoth', 0.083), ('yibula', 0.083), ('td', 0.079), ('jump', 0.077), ('monolingual', 0.076), ('hidden', 0.075), ('bilingual', 0.075), ('parameters', 0.072), ('translation', 0.07), ('vf', 0.07), ('yann', 0.064), ('yoshua', 0.064), ('criteria', 0.063), ('htanh', 0.062), ('juda', 0.062), ('krizhevsky', 0.062), ('fj', 0.062), ('ei', 0.06), ('plex', 0.058), ('surrounding', 0.057), ('socher', 0.057), ('word', 0.056), ('neighbors', 0.056), ('hyperbolic', 0.055), ('lecun', 0.055), ('pretraining', 0.055), ('lt', 0.055), ('loss', 0.054), ('chinese', 0.053), ('fl', 0.051), ('hinton', 0.05), ('bl', 0.05), ('names', 0.049), ('fai', 0.048), ('kavukcuoglu', 0.048), ('imagenet', 0.048), ('ve', 0.048), ('vocabulary', 0.047), ('stochastic', 0.047), ('tunable', 0.044), ('null', 0.042), ('boureau', 0.042), ('nongmin', 0.042), ('seide', 0.042), ('itg', 0.041), ('networks', 0.04), ('sd', 0.04), ('della', 0.039), ('initialize', 0.039), ('discriminatively', 0.038), ('optimizer', 0.038), ('wl', 0.038), ('pair', 0.038), ('context', 0.038), ('recursive', 0.038), ('person', 0.037), ('distortions', 0.037), ('rbm', 0.037), ('model', 0.037), ('architectures', 0.036), ('activation', 0.036), ('contexts', 0.036), ('discriminative', 0.036), ('nearest', 0.035), ('length', 0.034), ('denero', 0.034), ('huge', 0.032), ('settings', 0.032), ('niehues', 0.032), ('koray', 0.032), ('convolutional', 0.032), ('boltzmann', 0.032), ('vogel', 0.031), ('sw', 0.031), ('shujie', 0.029), ('softmax', 0.029), ('training', 0.029), ('align', 0.029), ('train', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000013 388 acl-2013-Word Alignment Modeling with Context Dependent Deep Neural Network
Author: Nan Yang ; Shujie Liu ; Mu Li ; Ming Zhou ; Nenghai Yu
Abstract: In this paper, we explore a novel bilingual word alignment approach based on DNN (Deep Neural Network), which has been proven to be very effective in various machine learning tasks (Collobert et al., 2011). We describe in detail how we adapt and extend the CD-DNNHMM (Dahl et al., 2012) method introduced in speech recognition to the HMMbased word alignment model, in which bilingual word embedding is discriminatively learnt to capture lexical translation information, and surrounding words are leveraged to model context information in bilingual sentences. While being capable to model the rich bilingual correspondence, our method generates a very compact model with much fewer parameters. Experiments on a large scale EnglishChinese word alignment task show that the proposed method outperforms the HMM and IBM model 4 baselines by 2 points in F-score.
2 0.37220952 294 acl-2013-Re-embedding words
Author: Igor Labutov ; Hod Lipson
Abstract: We present a fast method for re-purposing existing semantic word vectors to improve performance in a supervised task. Recently, with an increase in computing resources, it became possible to learn rich word embeddings from massive amounts of unlabeled data. However, some methods take days or weeks to learn good embeddings, and some are notoriously difficult to train. We propose a method that takes as input an existing embedding, some labeled data, and produces an embedding in the same space, but with a better predictive performance in the supervised task. We show improvement on the task of sentiment classification with re- spect to several baselines, and observe that the approach is most useful when the training set is sufficiently small.
3 0.29992512 38 acl-2013-Additive Neural Networks for Statistical Machine Translation
Author: lemao liu ; Taro Watanabe ; Eiichiro Sumita ; Tiejun Zhao
Abstract: Most statistical machine translation (SMT) systems are modeled using a loglinear framework. Although the log-linear model achieves success in SMT, it still suffers from some limitations: (1) the features are required to be linear with respect to the model itself; (2) features cannot be further interpreted to reach their potential. A neural network is a reasonable method to address these pitfalls. However, modeling SMT with a neural network is not trivial, especially when taking the decoding efficiency into consideration. In this paper, we propose a variant of a neural network, i.e. additive neural networks, for SMT to go beyond the log-linear translation model. In addition, word embedding is employed as the input to the neural network, which encodes each word as a feature vector. Our model outperforms the log-linear translation models with/without embedding features on Chinese-to-English and Japanese-to-English translation tasks.
4 0.19534577 219 acl-2013-Learning Entity Representation for Entity Disambiguation
Author: Zhengyan He ; Shujie Liu ; Mu Li ; Ming Zhou ; Longkai Zhang ; Houfeng Wang
Abstract: We propose a novel entity disambiguation model, based on Deep Neural Network (DNN). Instead of utilizing simple similarity measures and their disjoint combinations, our method directly optimizes document and entity representations for a given similarity measure. Stacked Denoising Auto-encoders are first employed to learn an initial document representation in an unsupervised pre-training stage. A supervised fine-tuning stage follows to optimize the representation towards the similarity measure. Experiment results show that our method achieves state-of-the-art performance on two public datasets without any manually designed features, even beating complex collective approaches.
5 0.1839762 216 acl-2013-Large tagset labeling using Feed Forward Neural Networks. Case study on Romanian Language
Author: Tiberiu Boros ; Radu Ion ; Dan Tufis
Abstract: Radu Ion Research Institute for Artificial Intelligence "Mihai Drăgănescu", Romanian Academy radu@racai.ro Dan Tufiş Research Institute for Artificial Intelligence "Mihai Drăgănescu", Romanian Academy tufis@racai.ro Networks (Marques and Lopes, 1996) and Conditional Random Fields (CRF) (Lafferty et Standard methods for part-of-speech tagging suffer from data sparseness when used on highly inflectional languages (which require large lexical tagset inventories). For this reason, a number of alternative methods have been proposed over the years. One of the most successful methods used for this task, Tiered Tagging (Tufiş, 1999), exploits a reduced set of tags derived by removing several recoverable features from the lexicon morpho-syntactic descriptions. A second phase is aimed at recovering the full set of morpho-syntactic features. In this paper we present an alternative method to Tiered Tagging, based on local optimizations with Neural Networks, and we show how, by properly encoding the input sequence in a general Neural Network architecture, we achieve results similar to the Tiered Tagging methodology, significantly faster and without requiring extensive linguistic knowledge as implied by the previously mentioned method. 1
6 0.1833403 259 acl-2013-Non-Monotonic Sentence Alignment via Semisupervised Learning
7 0.1721358 35 acl-2013-Adaptation Data Selection using Neural Language Models: Experiments in Machine Translation
8 0.16224593 210 acl-2013-Joint Word Alignment and Bilingual Named Entity Recognition Using Dual Decomposition
9 0.15241854 143 acl-2013-Exact Maximum Inference for the Fertility Hidden Markov Model
10 0.14641367 254 acl-2013-Multimodal DBN for Predicting High-Quality Answers in cQA portals
11 0.14201215 9 acl-2013-A Lightweight and High Performance Monolingual Word Aligner
12 0.13906401 275 acl-2013-Parsing with Compositional Vector Grammars
13 0.13685043 40 acl-2013-Advancements in Reordering Models for Statistical Machine Translation
14 0.13522601 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation
15 0.13246627 347 acl-2013-The Role of Syntax in Vector Space Models of Compositional Semantics
17 0.1273853 10 acl-2013-A Markov Model of Machine Translation using Non-parametric Bayesian Inference
18 0.12695554 354 acl-2013-Training Nondeficient Variants of IBM-3 and IBM-4 for Word Alignment
19 0.10738189 46 acl-2013-An Infinite Hierarchical Bayesian Model of Phrasal Translation
20 0.10735064 125 acl-2013-Distortion Model Considering Rich Context for Statistical Machine Translation
topicId topicWeight
[(0, 0.255), (1, -0.105), (2, 0.121), (3, 0.067), (4, 0.018), (5, -0.031), (6, -0.042), (7, -0.024), (8, -0.072), (9, 0.083), (10, 0.009), (11, -0.231), (12, 0.114), (13, -0.242), (14, 0.008), (15, 0.099), (16, -0.056), (17, 0.017), (18, -0.016), (19, -0.338), (20, 0.013), (21, -0.096), (22, -0.162), (23, -0.055), (24, 0.039), (25, -0.062), (26, 0.084), (27, -0.087), (28, 0.138), (29, 0.016), (30, -0.241), (31, -0.021), (32, -0.01), (33, -0.066), (34, 0.013), (35, -0.043), (36, -0.043), (37, -0.096), (38, 0.066), (39, -0.04), (40, 0.013), (41, 0.035), (42, 0.054), (43, 0.047), (44, 0.009), (45, -0.112), (46, 0.039), (47, -0.033), (48, -0.085), (49, -0.008)]
simIndex simValue paperId paperTitle
same-paper 1 0.93163627 388 acl-2013-Word Alignment Modeling with Context Dependent Deep Neural Network
Author: Nan Yang ; Shujie Liu ; Mu Li ; Ming Zhou ; Nenghai Yu
Abstract: In this paper, we explore a novel bilingual word alignment approach based on DNN (Deep Neural Network), which has been proven to be very effective in various machine learning tasks (Collobert et al., 2011). We describe in detail how we adapt and extend the CD-DNNHMM (Dahl et al., 2012) method introduced in speech recognition to the HMMbased word alignment model, in which bilingual word embedding is discriminatively learnt to capture lexical translation information, and surrounding words are leveraged to model context information in bilingual sentences. While being capable to model the rich bilingual correspondence, our method generates a very compact model with much fewer parameters. Experiments on a large scale EnglishChinese word alignment task show that the proposed method outperforms the HMM and IBM model 4 baselines by 2 points in F-score.
2 0.75249332 294 acl-2013-Re-embedding words
Author: Igor Labutov ; Hod Lipson
Abstract: We present a fast method for re-purposing existing semantic word vectors to improve performance in a supervised task. Recently, with an increase in computing resources, it became possible to learn rich word embeddings from massive amounts of unlabeled data. However, some methods take days or weeks to learn good embeddings, and some are notoriously difficult to train. We propose a method that takes as input an existing embedding, some labeled data, and produces an embedding in the same space, but with a better predictive performance in the supervised task. We show improvement on the task of sentiment classification with re- spect to several baselines, and observe that the approach is most useful when the training set is sufficiently small.
3 0.72389543 216 acl-2013-Large tagset labeling using Feed Forward Neural Networks. Case study on Romanian Language
Author: Tiberiu Boros ; Radu Ion ; Dan Tufis
Abstract: Radu Ion Research Institute for Artificial Intelligence "Mihai Drăgănescu", Romanian Academy radu@racai.ro Dan Tufiş Research Institute for Artificial Intelligence "Mihai Drăgănescu", Romanian Academy tufis@racai.ro Networks (Marques and Lopes, 1996) and Conditional Random Fields (CRF) (Lafferty et Standard methods for part-of-speech tagging suffer from data sparseness when used on highly inflectional languages (which require large lexical tagset inventories). For this reason, a number of alternative methods have been proposed over the years. One of the most successful methods used for this task, Tiered Tagging (Tufiş, 1999), exploits a reduced set of tags derived by removing several recoverable features from the lexicon morpho-syntactic descriptions. A second phase is aimed at recovering the full set of morpho-syntactic features. In this paper we present an alternative method to Tiered Tagging, based on local optimizations with Neural Networks, and we show how, by properly encoding the input sequence in a general Neural Network architecture, we achieve results similar to the Tiered Tagging methodology, significantly faster and without requiring extensive linguistic knowledge as implied by the previously mentioned method. 1
4 0.71350074 35 acl-2013-Adaptation Data Selection using Neural Language Models: Experiments in Machine Translation
Author: Kevin Duh ; Graham Neubig ; Katsuhito Sudoh ; Hajime Tsukada
Abstract: Data selection is an effective approach to domain adaptation in statistical machine translation. The idea is to use language models trained on small in-domain text to select similar sentences from large general-domain corpora, which are then incorporated into the training data. Substantial gains have been demonstrated in previous works, which employ standard ngram language models. Here, we explore the use of neural language models for data selection. We hypothesize that the continuous vector representation of words in neural language models makes them more effective than n-grams for modeling un- known word contexts, which are prevalent in general-domain text. In a comprehensive evaluation of 4 language pairs (English to German, French, Russian, Spanish), we found that neural language models are indeed viable tools for data selection: while the improvements are varied (i.e. 0.1 to 1.7 gains in BLEU), they are fast to train on small in-domain data and can sometimes substantially outperform conventional n-grams.
5 0.70059603 38 acl-2013-Additive Neural Networks for Statistical Machine Translation
Author: lemao liu ; Taro Watanabe ; Eiichiro Sumita ; Tiejun Zhao
Abstract: Most statistical machine translation (SMT) systems are modeled using a loglinear framework. Although the log-linear model achieves success in SMT, it still suffers from some limitations: (1) the features are required to be linear with respect to the model itself; (2) features cannot be further interpreted to reach their potential. A neural network is a reasonable method to address these pitfalls. However, modeling SMT with a neural network is not trivial, especially when taking the decoding efficiency into consideration. In this paper, we propose a variant of a neural network, i.e. additive neural networks, for SMT to go beyond the log-linear translation model. In addition, word embedding is employed as the input to the neural network, which encodes each word as a feature vector. Our model outperforms the log-linear translation models with/without embedding features on Chinese-to-English and Japanese-to-English translation tasks.
7 0.57088858 219 acl-2013-Learning Entity Representation for Entity Disambiguation
8 0.56396008 354 acl-2013-Training Nondeficient Variants of IBM-3 and IBM-4 for Word Alignment
9 0.5634923 275 acl-2013-Parsing with Compositional Vector Grammars
10 0.55803454 254 acl-2013-Multimodal DBN for Predicting High-Quality Answers in cQA portals
11 0.54731226 259 acl-2013-Non-Monotonic Sentence Alignment via Semisupervised Learning
12 0.50026846 210 acl-2013-Joint Word Alignment and Bilingual Named Entity Recognition Using Dual Decomposition
13 0.5 9 acl-2013-A Lightweight and High Performance Monolingual Word Aligner
14 0.48614866 349 acl-2013-The mathematics of language learning
15 0.47505251 143 acl-2013-Exact Maximum Inference for the Fertility Hidden Markov Model
16 0.45746645 15 acl-2013-A Novel Graph-based Compact Representation of Word Alignment
17 0.43789354 40 acl-2013-Advancements in Reordering Models for Statistical Machine Translation
18 0.42578727 308 acl-2013-Scalable Modified Kneser-Ney Language Model Estimation
19 0.41610557 390 acl-2013-Word surprisal predicts N400 amplitude during reading
20 0.4148654 25 acl-2013-A Tightly-coupled Unsupervised Clustering and Bilingual Alignment Model for Transliteration
topicId topicWeight
[(0, 0.053), (6, 0.08), (11, 0.056), (15, 0.01), (24, 0.042), (26, 0.04), (28, 0.013), (35, 0.085), (42, 0.077), (48, 0.054), (67, 0.226), (70, 0.036), (88, 0.022), (90, 0.028), (95, 0.081), (97, 0.01)]
simIndex simValue paperId paperTitle
same-paper 1 0.84053463 388 acl-2013-Word Alignment Modeling with Context Dependent Deep Neural Network
Author: Nan Yang ; Shujie Liu ; Mu Li ; Ming Zhou ; Nenghai Yu
Abstract: In this paper, we explore a novel bilingual word alignment approach based on DNN (Deep Neural Network), which has been proven to be very effective in various machine learning tasks (Collobert et al., 2011). We describe in detail how we adapt and extend the CD-DNNHMM (Dahl et al., 2012) method introduced in speech recognition to the HMMbased word alignment model, in which bilingual word embedding is discriminatively learnt to capture lexical translation information, and surrounding words are leveraged to model context information in bilingual sentences. While being capable to model the rich bilingual correspondence, our method generates a very compact model with much fewer parameters. Experiments on a large scale EnglishChinese word alignment task show that the proposed method outperforms the HMM and IBM model 4 baselines by 2 points in F-score.
2 0.82510674 138 acl-2013-Enriching Entity Translation Discovery using Selective Temporality
Author: Gae-won You ; Young-rok Cha ; Jinhan Kim ; Seung-won Hwang
Abstract: This paper studies named entity translation and proposes “selective temporality” as a new feature, as using temporal features may be harmful for translating “atemporal” entities. Our key contribution is building an automatic classifier to distinguish temporal and atemporal entities then align them in separate procedures to boost translation accuracy by 6. 1%.
3 0.81017607 110 acl-2013-Deepfix: Statistical Post-editing of Statistical Machine Translation Using Deep Syntactic Analysis
Author: Rudolf Rosa ; David Marecek ; Ales Tamchyna
Abstract: Deepfix is a statistical post-editing system for improving the quality of statistical machine translation outputs. It attempts to correct errors in verb-noun valency using deep syntactic analysis and a simple probabilistic model of valency. On the English-to-Czech translation pair, we show that statistical post-editing of statistical machine translation leads to an improvement of the translation quality when helped by deep linguistic knowledge.
4 0.80884445 100 acl-2013-Crowdsourcing Interaction Logs to Understand Text Reuse from the Web
Author: Martin Potthast ; Matthias Hagen ; Michael Volske ; Benno Stein
Abstract: unkown-abstract
5 0.75292361 252 acl-2013-Multigraph Clustering for Unsupervised Coreference Resolution
Author: Sebastian Martschat
Abstract: We present an unsupervised model for coreference resolution that casts the problem as a clustering task in a directed labeled weighted multigraph. The model outperforms most systems participating in the English track of the CoNLL’ 12 shared task.
6 0.73155206 309 acl-2013-Scaling Semi-supervised Naive Bayes with Feature Marginals
7 0.68648148 38 acl-2013-Additive Neural Networks for Statistical Machine Translation
8 0.6693756 216 acl-2013-Large tagset labeling using Feed Forward Neural Networks. Case study on Romanian Language
9 0.6660955 275 acl-2013-Parsing with Compositional Vector Grammars
10 0.6587283 254 acl-2013-Multimodal DBN for Predicting High-Quality Answers in cQA portals
11 0.64436591 219 acl-2013-Learning Entity Representation for Entity Disambiguation
12 0.64350992 294 acl-2013-Re-embedding words
13 0.63753641 83 acl-2013-Collective Annotation of Linguistic Resources: Basic Principles and a Formal Model
14 0.6367197 18 acl-2013-A Sentence Compression Based Framework to Query-Focused Multi-Document Summarization
15 0.63667929 36 acl-2013-Adapting Discriminative Reranking to Grounded Language Learning
16 0.63653433 353 acl-2013-Towards Robust Abstractive Multi-Document Summarization: A Caseframe Analysis of Centrality and Domain
17 0.63032728 210 acl-2013-Joint Word Alignment and Bilingual Named Entity Recognition Using Dual Decomposition
18 0.62902707 101 acl-2013-Cut the noise: Mutually reinforcing reordering and alignments for improved machine translation
19 0.62768805 172 acl-2013-Graph-based Local Coherence Modeling
20 0.62727982 259 acl-2013-Non-Monotonic Sentence Alignment via Semisupervised Learning