emnlp emnlp2010 emnlp2010-108 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Hai Son Le ; Alexandre Allauzen ; Guillaume Wisniewski ; Francois Yvon
Abstract: Using multi-layer neural networks to estimate the probabilities of word sequences is a promising research area in statistical language modeling, with applications in speech recognition and statistical machine translation. However, training such models for large vocabulary tasks is computationally challenging and does not scale easily to the huge corpora that are nowadays available. In this work, we study the performance and behavior of two neural statistical language models so as to highlight some important caveats of the classical training algorithms. The induced word embeddings for extreme cases are also analysed, thus providing insight into the convergence issues. A new initialization scheme and new training techniques are then introduced. These methods are shown to greatly reduce the training time and to significantly improve performance, both in terms of perplexity and on a large-scale translation task.
Reference: text
sentIndex sentText sentNum sentScore
1 Training continuous space language models: some practical issues Le Hai Son and Alexandre Allauzen and Guillaume Wisniewski and François Yvon Univ. [sent-1, score-0.333]
2 Abstract: Using multi-layer neural networks to estimate the probabilities of word sequences is a promising research area in statistical language modeling, with applications in speech recognition and statistical machine translation. [sent-4, score-0.208]
3 However, training such models for large vocabulary tasks is computationally challenging and does not scale easily to the huge corpora that are nowadays available. [sent-5, score-0.231]
4 In this work, we study the performance and behavior of two neural statistical language models so as to highlight some important caveats of the classical training algorithms. [sent-6, score-0.252]
5 The induced word embeddings for extreme cases are also analysed, thus providing insight into the convergence issues. [sent-7, score-0.228]
6 A new initialization scheme and new training techniques are then introduced. [sent-8, score-0.223]
7 n-gram language models rely on a discrete space representation of the vocabulary, where each word is associated with a discrete index. [sent-17, score-0.241]
8 These representations and the associated probability estimates are jointly computed in a multi-layer neural network architecture. [sent-28, score-0.271]
9 Hence, continuous space language models are becoming increasingly used. [sent-32, score-0.337]
10 In this paper, we empirically study the convergence behavior of two multi-layer neural networks for statistical language modeling, comparing the standard model of (Bengio et al. [sent-37, score-0.419]
11 We first investigate a re-initialization method which makes it possible to escape from the local extremum the standard model converges to. [sent-41, score-0.213]
12 We therefore introduce a different initialization strategy, called one vector initialization. [sent-43, score-0.285]
13 Experimental results show that these novel training strategies drastically reduce the total training time, while delivering significant improvements both in terms of perplexity and in a large-scale translation task. [sent-44, score-0.408]
14 2 Continuous space language models. Learning a language model amounts to estimating the parameters of the discrete conditional distribution over words given each possible history, where the history corresponds to some function of the preceding words. [sent-52, score-0.418]
15 parameters correspond to P(w_l | w_{l-n+1}^{l-1}). Continuous space language models aim at computing these estimates based on a distributed representation of words (Bengio et al. [sent-54, score-0.220]
16 In this approach, each word in the vocabulary is mapped into a real-valued vector and the conditional probability distributions are then expressed as a (parameterized) smooth function of these feature vectors. [sent-56, score-0.207]
17 The formalism of neural networks makes it possible to express these two steps in a well-known framework, where, crucially, the mapping and the model parameters can be learned in conjunction. [sent-57, score-0.243]
18 In the next paragraphs, we describe the two continuous space language models considered in our study and present the various issues associated with the training of such models, as well as their most common remedies. [sent-58, score-0.371]
19 , 2003), the feed-forward network takes as input the n−1 word history and delivers an estimate of the probability P(w_l | w_{l-n+1}^{l-1}) as its output. [sent-63, score-0.269]
20 The first layer builds a continuous representation of the history by mapping each word into its real-valued representation. [sent-65, score-0.48]
21 This mapping is defined by R^T v, where R ∈ R^{V×m} is a projection matrix and m is the dimension of the continuous word projection space. [sent-66, score-0.558]
22 The output of this layer is a vector of (n − 1)m real numbers obtained by concatenating the representations of the context words. [sent-67, score-0.278]
23 The projection matrix R is shared across all positions in the history vector and is learned automatically. [sent-68, score-0.451]
24 The second layer introduces a non-linear transform, where this layer's activation values are defined by h = tanh(W^{ih} i + b^{ih}), where i is the input vector, and W^{ih} ∈ R^{H×(n−1)m} and b^{ih} ∈ R^H are the parameters of this layer. [sent-69, score-0.469]
25 The wth component in P corresponds to the estimated probability of the wth word of the vocabulary given the input history vector. [sent-72, score-0.382]
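To make the architecture concrete, here is a minimal NumPy sketch of the forward pass of the standard feed-forward model described above. This is not the authors' code: the parameter names follow the notation in the text (R, Wih, bih, Who, bho), the dimensions mirror the experimental setting reported later (m = H = 200, a 3-word history), and the uniform initialization range is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
V, m, H, n = 10000, 200, 200, 4        # vocabulary size, embedding dim, hidden size, n-gram order

# Parameters, following the notation used in the text (initialization range is illustrative).
R   = rng.uniform(-0.1, 0.1, (V, m))            # shared projection matrix (context space)
Wih = rng.uniform(-0.1, 0.1, (H, (n - 1) * m))
bih = np.zeros(H)
Who = rng.uniform(-0.1, 0.1, (V, H))            # output weights (prediction space)
bho = np.zeros(V)

def standard_model_probs(history):
    """P(w | history) for the standard model; `history` holds the n-1 context word indices."""
    i = np.concatenate([R[w] for w in history])   # 1-of-V coding selects one row of R per word
    h = np.tanh(Wih @ i + bih)                    # non-linear second layer
    o = Who @ h + bho                             # one score per word of the prediction vocabulary
    e = np.exp(o - o.max())                       # softmax (shifted for numerical stability)
    return e / e.sum()

p = standard_model_probs([12, 7, 42])             # toy 3-word history
print(p.shape, round(float(p.sum()), 6))          # (10000,) 1.0
```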
26 In this model, the projection matrices R and Who play similar roles as they define maps between the vocabulary and the hidden representation. [sent-74, score-0.283]
27 The fact that R assigns similar representations to history words w1 and w2 implies that these words can be exchanged with little impact on the resulting probability distribution. [sent-75, score-0.234]
28 In the remainder, we will therefore refer to R 780 as the matrix representing the context space, and to Who as the matrix for the prediction space. [sent-79, score-0.402]
29 According to (Mnih and Hinton, 2007), this model, termed the log-bilinear language model (LBL), achieves, for large vocabulary tasks, better results in terms of perplexity than the standard model, even if the reasons behind this improvement remain unclear. [sent-83, score-0.567]
30 ... − v_l^T b_v (5), where R is the projection matrix introduced above, (v_k)_{l−n+1 ≤ k ≤ l−1} are the 1-of-V coding vectors for the history words, v_l is the coding vector for w_l, C_k ∈ R^{m×m} is a combination matrix, and b_r and b_v denote bias vectors. [sent-87, score-0.670]
31 With these new notations, equations (4) and (3) can be rewritten as: h = W^{ih} i + b^{ih}, o = R h + b^{ho}, P(w_l = k | w_{l−n+1}^{l−1}) = exp(o_k) / Σ_{k'} exp(o_{k'}). This formulation highlights the similarity between the LBL model and the standard model. [sent-92, score-0.318]
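For comparison, here is a sketch of the LBL model in the rewritten notation above: the intermediate layer is linear and of size m, and the output layer reuses the projection matrix R, so the context and prediction spaces are tied. Again this is only an illustration, not the authors' implementation; Wih_lbl stands for the stacked combination matrices C_k, and the initialization range is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
V, m, n = 10000, 200, 4

R       = rng.uniform(-0.1, 0.1, (V, m))             # single projection matrix, used for both roles
Wih_lbl = rng.uniform(-0.1, 0.1, (m, (n - 1) * m))   # stacks the combination matrices C_k
bih_lbl = np.zeros(m)                                 # plays the role of b_r
bho_lbl = np.zeros(V)                                 # gathers the b_v terms

def lbl_probs(history):
    i = np.concatenate([R[w] for w in history])   # concatenated context embeddings
    h = Wih_lbl @ i + bih_lbl                     # linear, unlike the tanh layer of the standard model
    o = R @ h + bho_lbl                           # tied output weights: the same R as the context projection
    e = np.exp(o - o.max())
    return e / e.sum()

print(lbl_probs([12, 7, 42]).shape)               # (10000,)
```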
32 Let x denote the binary vector formed by stacking the (n-1) 1-of-V encodings of the history words; then the conditional probability distributions estimated in the model are proportional to exp F(x), where F is an affine transform of x. [sent-97, score-0.247]
33 Learning starts with a random initialization of the parameters under the uniform distribution and converges to a local maximum of the log-likelihood function. [sent-105, score-0.316]
34 4 Complexity issues. The main problem with neural language models is their computational complexity. [sent-108, score-0.214]
35 The projection of the context words amounts to selecting one row of the projection matrix R, as the words are represented with a 1-of-V coding vector. [sent-110, score-0.51]
36 In this case, two vocabularies need to be considered, corresponding respectively to the context vocabulary Vc used to define the history, and the prediction vocabulary Vp. [sent-115, score-0.496]
37 In practice, neural network [...] (Footnote 1: Recall that learning requires repeatedly predicting the label for all the examples in the training set.) [sent-117, score-0.228]
38 3 A head-to-head comparison. In this section, we analyze a first experimental study of the two neural network language models introduced in Section 2 in order to better understand the differences between these models, especially in terms of the word representations they induce. [sent-126, score-0.382]
39 The perplexity is computed with respect to the 2006 NIST test data, which is used here as our development data. [sent-132, score-0.287]
40 The same vocabulary is used to constrain the words occurring in the history and the words to be predicted. [sent-140, score-0.298]
41 The size of the hidden layer is set to m = H = 200, the history contains the 3 preceding words, and we use a batch size of 64, a resampling rate of 5%, and no weight decay. [sent-141, score-0.391]
42 Figure 1 displays the perplexity convergence curve measured on the development data for the standard and the LBL models (see footnote 4). [sent-142, score-0.499]
43 The convergence perplexities after the combination with the standard back-off model are also provided for all the models in table 2 (see section 4. [sent-143, score-0.282]
44 We can observe that the LBL model converges faster than the standard model: the latter needs 13 epochs to reach the stopping criterion, while the former needs only 6. [sent-145, score-0.234]
45 However, upon convergence, the standard model reaches a lower perplexity than the LBL model. [sent-146, score-0.38]
46 [Figure 1: Convergence rate of the standard and the LBL models, evaluated by the evolution of the perplexity on a development set; x-axis: epochs, y-axis: perplexity.] As described in Section 2. [sent-147, score-0.561]
47 This difference in convergence can be explained by the scarcity of the updates in the projection matrix R in the standard model: during backpropagation, only those weights that are associated with words in the history are updated. [sent-151, score-0.63]
48 By contrast, each training sample updates all the weights in the prediction matrix Who. [sent-152, score-0.257]
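The update-sparsity argument can be illustrated with a toy gradient step. This is an illustration of the argument, not the authors' code, and the gradient values are random stand-ins: for one training example, only the n−1 rows of R indexed by the history words are modified, whereas the dense softmax gradient touches every row of Who.

```python
import numpy as np

rng = np.random.default_rng(0)
V, m, H, lr = 10000, 200, 200, 0.1
history = [12, 7, 42]

R   = rng.uniform(-0.1, 0.1, (V, m))
Who = rng.uniform(-0.1, 0.1, (V, H))

# Random stand-ins for quantities computed during backpropagation.
h      = rng.standard_normal(H)                   # hidden activations
grad_o = rng.standard_normal(V)                   # gradient at the softmax output (dense over the vocabulary)
grad_i = rng.standard_normal(len(history) * m)    # gradient at the projection-layer output

Who -= lr * np.outer(grad_o, h)                   # every one of the V rows of Who is updated
for k, w in enumerate(history):                   # only the history words' rows of R are updated
    R[w] -= lr * grad_i[k * m:(k + 1) * m]
```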
49 For instance, the date 1947 seems to have essentially random neighbours in the context space, while its 5 nearest words in the prediction space form a consistent set of dates. [sent-157, score-0.321]
50 By contrast, the similarities in the (unique) projection space of the LBL remain consistent for all frequency ranges, and are very similar to the prediction space of the standard model. [sent-160, score-0.559]
51 This seems to validate our hypothesis that in the standard model, the prediction space is learned much faster than the context space and corroborates our interpretation of the impact of the scarce updates of rare words. [sent-161, score-0.604]
52 However, this would also increase the size of the vocabulary and cause two new issues: on the one hand, the time complexity would drastically increase for the LBL model, and on the other hand, the two models would no longer be comparable in terms of perplexity as their vocabularies would differ. [sent-163, score-0.683]
53 [...] between the context space and the target function: the context space is learned only indirectly by backpropagation. [sent-164, score-0.382]
54 As a result, due to the random initialization of the parameters and to data sparsity, many vectors of R might be blocked in some local maxima, meaning that similar vectors cannot be grouped in a consistent way and that the induced similarity is more “loose”. [sent-165, score-0.352]
55 Both effects can be attributed to the particular parameterization of this model, which uses the same projection matrix both for the context and for the prediction spaces. [sent-169, score-0.496]
56 In this section, we propose several new learning regimes that allow us to improve the standard model in terms of both speed and prediction capacity. [sent-170, score-0.277]
57 Thus, we introduce a new learning regime, called re-initialization, which aims to improve the context space by re-injecting the information on word neighborhoods that emerges in the prediction space. [sent-176, score-0.386]
58 use the prediction space of this model to initialize the context space of a new model; the prediction space is chosen randomly; 3. [sent-179, score-0.713]
59 Figure 2: Evolution of the perplexity on a development set for various initialization regimes. [sent-182, score-0.51]
60 The evolution of the perplexity with respect to training epochs for this new method is plotted in Figure 2, where we only represent the evolution of the perplexity during the third training step. [sent-183, score-0.819]
61 As can be seen, at convergence, the perplexity of the model estimated with this technique is about 10% lower than the perplexity of the standard model. [sent-184, score-0.667]
62 This result can be explained by viewing the re-initialization as a form of annealing: re-initializing the context space makes it possible to escape from the local extrema the standard model converges to. [sent-185, score-0.404]
63 The fact that the prediction space provides a good initialization of the context space also confirms our analysis that one difficulty with the standard model is the estimation of the context space parameters. [sent-186, score-0.943]
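A schematic rendering of the re-initialization regime, as we read it from the description above (and from the mention of a "third training step"): train a standard model, copy its prediction space into the context space of a fresh model whose prediction space is drawn at random, and train again. `train_until_convergence` is a placeholder for the usual stochastic backpropagation loop, and the sketch assumes m = H and a shared context/prediction vocabulary.

```python
import numpy as np

def train_until_convergence(params, data=None):
    """Placeholder for the stochastic backpropagation loop (omitted here)."""
    return params

def reinitialization_regime(V, m, H, n, rng, data=None):
    # Step 1: train a standard model until convergence.
    params = {
        "R":   rng.uniform(-0.1, 0.1, (V, m)),
        "Wih": rng.uniform(-0.1, 0.1, (H, (n - 1) * m)),
        "bih": np.zeros(H),
        "Who": rng.uniform(-0.1, 0.1, (V, H)),
        "bho": np.zeros(V),
    }
    params = train_until_convergence(params, data)

    # Step 2: the converged prediction space initializes the context space of a
    # new model; the new prediction space is chosen randomly (requires m == H).
    new_params = {k: v.copy() for k, v in params.items()}
    new_params["R"]   = params["Who"].copy()
    new_params["Who"] = rng.uniform(-0.1, 0.1, (V, H))

    # Step 3: train the new model until convergence.
    return train_until_convergence(new_params, data)
```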
64 As we now know that the parameters of the prediction space converge faster, we introduce a second training regime, called iterative re-initialization, which aims to take advantage of this property. [sent-189, score-0.446]
65 Use the prediction space parameters to reinitialize the context space. [sent-193, score-0.357]
66 Figure 3: Evolution of the perplexity on the training data for various initialization regimes. [sent-196, score-0.51]
67 This relationship is however not expressed through the tying of the corresponding parameters; instead we let the prediction space guide the convergence of the context space. [sent-198, score-0.472]
68 As a consequence, we hope that it can achieve a convergence speed as fast as that of the LBL model without degrading its prediction capacity. [sent-199, score-0.313]
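A sketch of the iterative variant, again under the assumptions of a shared vocabulary and m = H. The exact scheduling of the copy is our reading of the description: after each training epoch, the faster-converging prediction space is copied back into the context space, so that it guides the latter's convergence without tying the parameters.

```python
def iterative_reinitialization(params, n_epochs, train_one_epoch):
    """`train_one_epoch` is a placeholder for one pass of stochastic backpropagation."""
    for _ in range(n_epochs):
        params = train_one_epoch(params)
        params["R"] = params["Who"].copy()   # prediction space re-initializes the context space
    return params
```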
69 Figure 3 displays the perplexity convergence curve measured on the training data for the standard learning regime as well as for the re-initialization and iterative re-initialization. [sent-201, score-0.624]
70 These results show the same trend as for the perplexity measured on the development data, and suggest a regularization effect of the re-initialization schemes rather than allowing the models to escape local optima. [sent-202, score-0.388]
71 3 One vector initialization. Principle: The new training regimes introduced above outperform the standard training regime both in terms of perplexity and training time. [sent-204, score-0.847]
72 However, exchanging information between the context and prediction spaces is only possible when the same vocabulary is used in both spaces. [sent-205, score-0.412]
73 It is nonetheless possible to continue drawing inspiration from the observations made in Section 3, and, crucially, to question the random initialization strategy. [sent-210, score-0.223]
74 As discussed above, this strategy may explain why the neighborhoods in the induced context space for the less frequent types were difficult to interpret. [sent-211, score-0.324]
75 As a straightforward alternative, we consider a different initialization strategy where all the words in the context vocabulary are initially projected onto the same (random) point in the context space. [sent-212, score-0.553]
76 This model is termed the one vector initialization model. [sent-214, score-0.359]
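The one vector initialization scheme is trivial to state in code: every row of the context projection matrix starts from the same random point, and training then spreads the word representations apart. A minimal sketch follows; the dimension and initialization range are illustrative, and in the large-vocabulary experiment below the context vocabulary actually has 532,557 entries.

```python
import numpy as np

def one_vector_initialization(Vc, m, rng):
    """All Vc context words start from the same random point of the m-dimensional context space."""
    point = rng.uniform(-0.1, 0.1, m)      # a single random vector
    return np.tile(point, (Vc, 1))         # every row of R is identical at the start of training

rng = np.random.default_rng(0)
R0 = one_vector_initialization(10000, 200, rng)   # small sizes for the example
print(R0.shape, bool((R0[0] == R0[1]).all()))     # (10000, 200) True
```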
77 Experimental evaluation: To validate this approach, we compare the convergence of a standard model trained with the standard learning regime to that of a model trained with the one vector initialization regime. [sent-215, score-0.59]
78 The context vocabulary is defined by the 532,557 words occurring in the training data and the prediction vocabulary by the 10,000 most frequent words (see footnote 6). [sent-216, score-0.496]
79 Based on the curves displayed on Figure 4, we can observe that the model obtained with the one vector initialization regime outperforms the model trained with a completely random initialization. [sent-218, score-0.474]
80 Moreover, the latter reaches convergence only after 14 epochs, while the learning regime we propose needs only 9 epochs. [sent-219, score-0.276]
81 Convergence is even faster than when we used the standard training regime and a small context vocabulary. [sent-220, score-0.302]
82 (Footnote 6) In this case, the distinction between the context and the prediction vocabulary rules out the possibility of a relevant comparison based on perplexity between the continuous space language model and a standard back-off language model. [sent-221, score-1.03]
83 [Figure 4: Perplexity with the all-10,000, 200−200 models; x-axis: epochs, y-axis: perplexity.] [Table 2: Summary of the perplexity (PPX) results measured on the same development set with the different continuous space language models.] [sent-222, score-0.729]
84 [Table 2 body (#epochs and PPX columns) garbled in extraction.] To illustrate the impact of our initialization scheme, we also used a principal component analysis to represent the induced word representations in a two-dimensional space. [sent-225, score-0.339]
85 Two different models are used: the standard model on the left, and the one vector initialization model on the right. [sent-227, score-0.448]
86 By contrast, for the one vector initialization model, points associated with numbers are much more concentrated: this is simply because all the points are originally identical, and training aims to spread the points around this starting point. [sent-229, score-0.285]
87 [Figure 5: Comparison of the word embeddings in the context space for numbers (red points): (a) with the standard model; (b) with the one vector initialization model.] [sent-232, score-0.601]
88 Finally, it is noteworthy that when used with a small context vocabulary (as in the experimental setting of Section 4. [sent-234, score-0.221]
89 This is simply due to the much greater data sparsity in the large context vocabulary experiments, where the rarer word types are really rare (they typically occur once or twice). [sent-236, score-0.275]
90 By contrast, the rarer words in the small vocabulary tasks occurred more than several hundred times in the training corpus, which was more than sufficient to guide the model towards satisfactory projection matrices. [sent-237, score-0.369]
91 This finally suggests that there still exists room for improvement if we can find more efficient initialization strategies than starting from one or several random points. [sent-238, score-0.26]
92 The integration of a neural network language model in such a system is far from easy, given the computational cost of computing word probabilities, a task that is performed repeatedly during the search for the best translation. [sent-242, score-0.26]
93 For the continuous space language model, the training data consists of the parallel corpus used to train the translation model (previously described in section 3. [sent-245, score-0.379]
94 The best result is obtained by the one vector initialization standard model which achieves a 0. [sent-257, score-0.378]
95 5 Conclusion. In this work, we proposed three new methods for training neural network language models and showed their efficiency both in terms of computational complexity and generalization performance in a real-world machine translation task. [sent-260, score-0.346]
96 These methods rely on conclusions drawn from a careful study of the convergence rate of two state-of-the-art models and are based on the idea of sharing the distributed word representations during training. [sent-261, score-0.263]
97 Our work highlights the impact of the initialization and the training scheme for neural network language models. [sent-262, score-0.489]
98 A unified architecture for natural language processing: deep neural networks with multitask learning. [sent-302, score-0.205]
99 Empirical study of neural network language models for Arabic speech recognition. [sent-308, score-0.299]
100 Connectionist language modeling for large vocabulary continuous speech recognition. [sent-377, score-0.362]
wordName wordTfidf (topN-words)
[('lbl', 0.46), ('perplexity', 0.287), ('initialization', 0.223), ('continuous', 0.184), ('wl', 0.179), ('mnih', 0.167), ('schwenk', 0.161), ('history', 0.153), ('convergence', 0.151), ('vocabulary', 0.145), ('layer', 0.143), ('neural', 0.142), ('projection', 0.138), ('prediction', 0.13), ('regime', 0.125), ('space', 0.115), ('hinton', 0.111), ('bih', 0.105), ('epochs', 0.105), ('matrix', 0.098), ('bengio', 0.097), ('network', 0.086), ('bho', 0.084), ('rh', 0.084), ('context', 0.076), ('rv', 0.072), ('evolution', 0.07), ('neighborhoods', 0.065), ('emami', 0.063), ('epoch', 0.063), ('escape', 0.063), ('wll', 0.063), ('vector', 0.062), ('standard', 0.061), ('spaces', 0.061), ('coding', 0.06), ('converges', 0.057), ('nist', 0.055), ('bilmes', 0.054), ('collobert', 0.054), ('rarer', 0.054), ('parameterization', 0.054), ('wih', 0.054), ('br', 0.054), ('regimes', 0.054), ('batch', 0.053), ('nowadays', 0.048), ('allauzen', 0.048), ('holger', 0.048), ('translation', 0.048), ('discrete', 0.044), ('representations', 0.043), ('activation', 0.042), ('andriy', 0.042), ('embeddings', 0.042), ('kuo', 0.042), ('lau', 0.042), ('lidia', 0.042), ('lxl', 0.042), ('neuronal', 0.042), ('oparin', 0.042), ('termed', 0.042), ('wihi', 0.042), ('wth', 0.042), ('resampling', 0.042), ('geoffrey', 0.042), ('faster', 0.04), ('impact', 0.038), ('models', 0.038), ('arabic', 0.037), ('strategies', 0.037), ('parameters', 0.036), ('decay', 0.036), ('weston', 0.036), ('mangu', 0.036), ('bv', 0.036), ('caveats', 0.036), ('highlight', 0.036), ('drastically', 0.036), ('bleu', 0.035), ('induced', 0.035), ('introduced', 0.035), ('issues', 0.034), ('red', 0.034), ('networks', 0.033), ('conventional', 0.033), ('speech', 0.033), ('strategy', 0.033), ('hardly', 0.032), ('model', 0.032), ('complexity', 0.032), ('distributed', 0.031), ('architecture', 0.03), ('concatenating', 0.03), ('morphological', 0.03), ('cois', 0.03), ('thee', 0.03), ('notations', 0.03), ('vectors', 0.029), ('updates', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 108 emnlp-2010-Training Continuous Space Language Models: Some Practical Issues
Author: Hai Son Le ; Alexandre Allauzen ; Guillaume Wisniewski ; Francois Yvon
Abstract: Using multi-layer neural networks to estimate the probabilities of word sequences is a promising research area in statistical language modeling, with applications in speech recognition and statistical machine translation. However, training such models for large vocabulary tasks is computationally challenging which does not scale easily to the huge corpora that are nowadays available. In this work, we study the performance and behavior of two neural statistical language models so as to highlight some important caveats of the classical training algorithms. The induced word embeddings for extreme cases are also analysed, thus providing insight into the convergence issues. A new initialization scheme and new training techniques are then introduced. These methods are shown to greatly reduce the training time and to significantly improve performance, both in terms ofperplexity and on a large-scale translation task.
2 0.08054582 25 emnlp-2010-Better Punctuation Prediction with Dynamic Conditional Random Fields
Author: Wei Lu ; Hwee Tou Ng
Abstract: This paper focuses on the task of inserting punctuation symbols into transcribed conversational speech texts, without relying on prosodic cues. We investigate limitations associated with previous methods, and propose a novel approach based on dynamic conditional random fields. Different from previous work, our proposed approach is designed to jointly perform both sentence boundary and sentence type prediction, and punctuation prediction on speech utterances. We performed evaluations on a transcribed conversational speech domain consisting of both English and Chinese texts. Empirical results show that our method outperforms an approach based on linear-chain conditional random fields and other previous approaches.
3 0.075859733 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice
Author: Samidh Chatterjee ; Nicola Cancedda
Abstract: Minimum Error Rate Training is the algorithm for log-linear model parameter training most used in state-of-the-art Statistical Machine Translation systems. In its original formulation, the algorithm uses N-best lists output by the decoder to grow the Translation Pool that shapes the surface on which the actual optimization is performed. Recent work has been done to extend the algorithm to use the entire translation lattice built by the decoder, instead of N-best lists. We propose here a third, intermediate way, consisting in growing the translation pool using samples randomly drawn from the translation lattice. We empirically measure a systematic im- provement in the BLEU scores compared to training using N-best lists, without suffering the increase in computational complexity associated with operating with the whole lattice.
4 0.072044045 87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space
Author: Marco Baroni ; Roberto Zamparelli
Abstract: We propose an approach to adjective-noun composition (AN) for corpus-based distributional semantics that, building on insights from theoretical linguistics, represents nouns as vectors and adjectives as data-induced (linear) functions (encoded as matrices) over nominal vectors. Our model significantly outperforms the rivals on the task of reconstructing AN vectors not seen in training. A small post-hoc analysis further suggests that, when the model-generated AN vector is not similar to the corpus-observed AN vector, this is due to anomalies in the latter. We show moreover that our approach provides two novel ways to represent adjective meanings, alternative to its representation via corpus-based co-occurrence vectors, both outperforming the latter in an adjective clustering task.
5 0.071155898 109 emnlp-2010-Translingual Document Representations from Discriminative Projections
Author: John Platt ; Kristina Toutanova ; Wen-tau Yih
Abstract: Representing documents by vectors that are independent of language enhances machine translation and multilingual text categorization. We use discriminative training to create a projection of documents from multiple languages into a single translingual vector space. We explore two variants to create these projections: Oriented Principal Component Analysis (OPCA) and Coupled Probabilistic Latent Semantic Analysis (CPLSA). Both of these variants start with a basic model of documents (PCA and PLSA). Each model is then made discriminative by encouraging comparable document pairs to have similar vector representations. We evaluate these algorithms on two tasks: parallel document retrieval for Wikipedia and Europarl documents, and cross-lingual text classification on Reuters. The two discriminative variants, OPCA and CPLSA, significantly outperform their corre- sponding baselines. The largest differences in performance are observed on the task of retrieval when the documents are only comparable and not parallel. The OPCA method is shown to perform best.
6 0.070336834 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding
7 0.067235634 39 emnlp-2010-EMNLP 044
8 0.066486321 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL
9 0.06538938 77 emnlp-2010-Measuring Distributional Similarity in Context
10 0.062491845 47 emnlp-2010-Example-Based Paraphrasing for Improved Phrase-Based Statistical Machine Translation
11 0.061136089 71 emnlp-2010-Latent-Descriptor Clustering for Unsupervised POS Induction
12 0.060894523 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning
13 0.053801127 63 emnlp-2010-Improving Translation via Targeted Paraphrasing
14 0.053120013 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation
15 0.050755754 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment
16 0.050647266 9 emnlp-2010-A New Approach to Lexical Disambiguation of Arabic Text
17 0.050175656 5 emnlp-2010-A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages
18 0.050025716 41 emnlp-2010-Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models
19 0.049621928 99 emnlp-2010-Statistical Machine Translation with a Factorized Grammar
20 0.049461443 44 emnlp-2010-Enhancing Mention Detection Using Projection via Aligned Corpora
topicId topicWeight
[(0, 0.192), (1, -0.002), (2, -0.03), (3, -0.016), (4, -0.061), (5, 0.005), (6, -0.047), (7, 0.011), (8, -0.022), (9, 0.002), (10, 0.005), (11, 0.007), (12, -0.069), (13, 0.019), (14, 0.007), (15, -0.08), (16, -0.064), (17, 0.019), (18, -0.04), (19, -0.057), (20, 0.032), (21, 0.082), (22, -0.07), (23, 0.064), (24, -0.063), (25, -0.232), (26, 0.067), (27, -0.232), (28, -0.086), (29, -0.123), (30, 0.118), (31, 0.098), (32, -0.049), (33, 0.211), (34, 0.035), (35, 0.02), (36, -0.023), (37, 0.09), (38, 0.12), (39, -0.143), (40, 0.029), (41, -0.121), (42, -0.149), (43, 0.14), (44, 0.03), (45, -0.095), (46, 0.103), (47, 0.084), (48, -0.204), (49, -0.21)]
simIndex simValue paperId paperTitle
same-paper 1 0.94389844 108 emnlp-2010-Training Continuous Space Language Models: Some Practical Issues
Author: Hai Son Le ; Alexandre Allauzen ; Guillaume Wisniewski ; Francois Yvon
Abstract: Using multi-layer neural networks to estimate the probabilities of word sequences is a promising research area in statistical language modeling, with applications in speech recognition and statistical machine translation. However, training such models for large vocabulary tasks is computationally challenging which does not scale easily to the huge corpora that are nowadays available. In this work, we study the performance and behavior of two neural statistical language models so as to highlight some important caveats of the classical training algorithms. The induced word embeddings for extreme cases are also analysed, thus providing insight into the convergence issues. A new initialization scheme and new training techniques are then introduced. These methods are shown to greatly reduce the training time and to significantly improve performance, both in terms ofperplexity and on a large-scale translation task.
2 0.52017713 87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space
Author: Marco Baroni ; Roberto Zamparelli
Abstract: We propose an approach to adjective-noun composition (AN) for corpus-based distributional semantics that, building on insights from theoretical linguistics, represents nouns as vectors and adjectives as data-induced (linear) functions (encoded as matrices) over nominal vectors. Our model significantly outperforms the rivals on the task of reconstructing AN vectors not seen in training. A small post-hoc analysis further suggests that, when the model-generated AN vector is not similar to the corpus-observed AN vector, this is due to anomalies in the latter. We show moreover that our approach provides two novel ways to represent adjective meanings, alternative to its representation via corpus-based co-occurrence vectors, both outperforming the latter in an adjective clustering task.
3 0.41710567 77 emnlp-2010-Measuring Distributional Similarity in Context
Author: Georgiana Dinu ; Mirella Lapata
Abstract: The computation of meaning similarity as operationalized by vector-based models has found widespread use in many tasks ranging from the acquisition of synonyms and paraphrases to word sense disambiguation and textual entailment. Vector-based models are typically directed at representing words in isolation and thus best suited for measuring similarity out of context. In his paper we propose a probabilistic framework for measuring similarity in context. Central to our approach is the intuition that word meaning is represented as a probability distribution over a set of latent senses and is modulated by context. Experimental results on lexical substitution and word similarity show that our algorithm outperforms previously proposed models.
4 0.41275507 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL
Author: Valentin Zhikov ; Hiroya Takamura ; Manabu Okumura
Abstract: This paper proposes a fast and simple unsupervised word segmentation algorithm that utilizes the local predictability of adjacent character sequences, while searching for a leasteffort representation of the data. The model uses branching entropy as a means of constraining the hypothesis space, in order to efficiently obtain a solution that minimizes the length of a two-part MDL code. An evaluation with corpora in Japanese, Thai, English, and the ”CHILDES” corpus for research in language development reveals that the algorithm achieves an accuracy, comparable to that of the state-of-the-art methods in unsupervised word segmentation, in a significantly reduced . computational time.
5 0.40394291 25 emnlp-2010-Better Punctuation Prediction with Dynamic Conditional Random Fields
Author: Wei Lu ; Hwee Tou Ng
Abstract: This paper focuses on the task of inserting punctuation symbols into transcribed conversational speech texts, without relying on prosodic cues. We investigate limitations associated with previous methods, and propose a novel approach based on dynamic conditional random fields. Different from previous work, our proposed approach is designed to jointly perform both sentence boundary and sentence type prediction, and punctuation prediction on speech utterances. We performed evaluations on a transcribed conversational speech domain consisting of both English and Chinese texts. Empirical results show that our method outperforms an approach based on linear-chain conditional random fields and other previous approaches.
6 0.37883464 71 emnlp-2010-Latent-Descriptor Clustering for Unsupervised POS Induction
7 0.37869853 109 emnlp-2010-Translingual Document Representations from Discriminative Projections
8 0.32587463 83 emnlp-2010-Multi-Level Structured Models for Document-Level Sentiment Classification
9 0.32476243 9 emnlp-2010-A New Approach to Lexical Disambiguation of Arabic Text
10 0.30873448 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice
11 0.29251355 55 emnlp-2010-Handling Noisy Queries in Cross Language FAQ Retrieval
12 0.27987581 5 emnlp-2010-A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages
13 0.2742396 39 emnlp-2010-EMNLP 044
14 0.25891316 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation
15 0.2547462 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding
16 0.23604985 22 emnlp-2010-Automatic Evaluation of Translation Quality for Distant Language Pairs
17 0.23312563 118 emnlp-2010-Utilizing Extra-Sentential Context for Parsing
18 0.23078273 30 emnlp-2010-Confidence in Structured-Prediction Using Confidence-Weighted Models
19 0.21691947 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions
20 0.20623177 113 emnlp-2010-Unsupervised Induction of Tree Substitution Grammars for Dependency Parsing
topicId topicWeight
[(10, 0.012), (12, 0.021), (29, 0.118), (30, 0.029), (32, 0.463), (52, 0.037), (56, 0.048), (62, 0.011), (66, 0.09), (72, 0.034), (76, 0.018), (87, 0.011), (89, 0.016), (92, 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.75251633 108 emnlp-2010-Training Continuous Space Language Models: Some Practical Issues
Author: Hai Son Le ; Alexandre Allauzen ; Guillaume Wisniewski ; Francois Yvon
Abstract: Using multi-layer neural networks to estimate the probabilities of word sequences is a promising research area in statistical language modeling, with applications in speech recognition and statistical machine translation. However, training such models for large vocabulary tasks is computationally challenging which does not scale easily to the huge corpora that are nowadays available. In this work, we study the performance and behavior of two neural statistical language models so as to highlight some important caveats of the classical training algorithms. The induced word embeddings for extreme cases are also analysed, thus providing insight into the convergence issues. A new initialization scheme and new training techniques are then introduced. These methods are shown to greatly reduce the training time and to significantly improve performance, both in terms ofperplexity and on a large-scale translation task.
2 0.74371219 71 emnlp-2010-Latent-Descriptor Clustering for Unsupervised POS Induction
Author: Michael Lamar ; Yariv Maron ; Elie Bienenstock
Abstract: We present a novel approach to distributionalonly, fully unsupervised, POS tagging, based on an adaptation of the EM algorithm for the estimation of a Gaussian mixture. In this approach, which we call Latent-Descriptor Clustering (LDC), word types are clustered using a series of progressively more informative descriptor vectors. These descriptors, which are computed from the immediate left and right context of each word in the corpus, are updated based on the previous state of the cluster assignments. The LDC algorithm is simple and intuitive. Using standard evaluation criteria for unsupervised POS tagging, LDC shows a substantial improvement in performance over state-of-the-art methods, along with a several-fold reduction in computational cost.
Author: Marco Baroni ; Roberto Zamparelli
Abstract: We propose an approach to adjective-noun composition (AN) for corpus-based distributional semantics that, building on insights from theoretical linguistics, represents nouns as vectors and adjectives as data-induced (linear) functions (encoded as matrices) over nominal vectors. Our model significantly outperforms the rivals on the task of reconstructing AN vectors not seen in training. A small post-hoc analysis further suggests that, when the model-generated AN vector is not similar to the corpus-observed AN vector, this is due to anomalies in the latter. We show moreover that our approach provides two novel ways to represent adjective meanings, alternative to its representation via corpus-based co-occurrence vectors, both outperforming the latter in an adjective clustering task.
4 0.40020803 109 emnlp-2010-Translingual Document Representations from Discriminative Projections
Author: John Platt ; Kristina Toutanova ; Wen-tau Yih
Abstract: Representing documents by vectors that are independent of language enhances machine translation and multilingual text categorization. We use discriminative training to create a projection of documents from multiple languages into a single translingual vector space. We explore two variants to create these projections: Oriented Principal Component Analysis (OPCA) and Coupled Probabilistic Latent Semantic Analysis (CPLSA). Both of these variants start with a basic model of documents (PCA and PLSA). Each model is then made discriminative by encouraging comparable document pairs to have similar vector representations. We evaluate these algorithms on two tasks: parallel document retrieval for Wikipedia and Europarl documents, and cross-lingual text classification on Reuters. The two discriminative variants, OPCA and CPLSA, significantly outperform their corre- sponding baselines. The largest differences in performance are observed on the task of retrieval when the documents are only comparable and not parallel. The OPCA method is shown to perform best.
5 0.3875199 84 emnlp-2010-NLP on Spoken Documents Without ASR
Author: Mark Dredze ; Aren Jansen ; Glen Coppersmith ; Ken Church
Abstract: There is considerable interest in interdisciplinary combinations of automatic speech recognition (ASR), machine learning, natural language processing, text classification and information retrieval. Many of these boxes, especially ASR, are often based on considerable linguistic resources. We would like to be able to process spoken documents with few (if any) resources. Moreover, connecting black boxes in series tends to multiply errors, especially when the key terms are out-ofvocabulary (OOV). The proposed alternative applies text processing directly to the speech without a dependency on ASR. The method finds long (∼ 1 sec) repetitions in speech, fainndd scl luostnegrs t∼he 1m sinecto) pseudo-terms (roughly phrases). Document clustering and classification work surprisingly well on pseudoterms; performance on a Switchboard task approaches a baseline using gold standard man- ual transcriptions.
6 0.38095629 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice
7 0.37431735 4 emnlp-2010-A Game-Theoretic Approach to Generating Spatial Descriptions
9 0.36730152 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation
10 0.36728349 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics
11 0.36635977 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding
12 0.36550462 75 emnlp-2010-Lessons Learned in Part-of-Speech Tagging of Conversational Speech
13 0.36541083 53 emnlp-2010-Fusing Eye Gaze with Speech Recognition Hypotheses to Resolve Exophoric References in Situated Dialogue
14 0.36003914 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation
15 0.35739979 25 emnlp-2010-Better Punctuation Prediction with Dynamic Conditional Random Fields
16 0.35615286 86 emnlp-2010-Non-Isomorphic Forest Pair Translation
17 0.35593173 89 emnlp-2010-PEM: A Paraphrase Evaluation Metric Exploiting Parallel Texts
18 0.35525501 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment
19 0.35488769 34 emnlp-2010-Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Generation
20 0.35434657 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation