
46 emnlp-2011-Efficient Subsampling for Training Complex Language Models


Source: pdf

Author: Puyang Xu ; Asela Gunawardana ; Sanjeev Khudanpur

Abstract: We propose an efficient way to train maximum entropy language models (MELM) and neural network language models (NNLM). The advantage of the proposed method comes from a more robust and efficient subsampling technique. The original multi-class language modeling problem is transformed into a set of binary problems where each binary classifier predicts whether or not a particular word will occur. We show that the binarized model is as powerful as the standard model and allows us to aggressively subsample negative training examples without sacrificing predictive performance. Empirical results show that we can train MELM and NNLM at 1% ∼ 5% of the standard complexity with no loss in performance.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The advantage of the proposed method comes from a more robust and efficient subsampling technique. [sent-5, score-0.471]

2 The original multi-class language modeling problem is transformed into a set of binary problems where each binary classifier predicts whether or not a particular word will occur. [sent-6, score-0.27]

3 We show that the binarized model is as powerful as the standard model and allows us to aggressively subsample negative training examples without sacrificing predictive performance. [sent-7, score-0.206]

4 For complex models such as NNLM, it has been shown that subsampling can speed up training greatly, at the cost of some degradation in predictive performance (Schwenk, 2007), allowing for trade-off between computational cost and LM quality. [sent-27, score-0.509]

5 Our contribution is a novel way to train complex LMs such as MELM and NNLM which allows much more aggressive subsampling without incurring as high a cost in predictive performance. [sent-28, score-0.571]

6 The key to our approach is reducing the multi-class LM problem into a set of binary problems. [sent-29, score-0.109]

7 Given the V words in the vocabulary, we train V binary classifiers, each one of which performs a one-against-all classification. [sent-32, score-0.141]

8 The V trained binary probabilities are then renormalized to obtain a valid distribution over the V words. [sent-33, score-0.146]

9 Since the majority of training examples are negative for each of the binary classifiers, we can achieve substantial computational saving by only keeping subsets of them. [sent-35, score-0.22]
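
To make the construction concrete, here is a minimal Python sketch (illustrative only, not the authors' code; function and variable names are hypothetical) of how a corpus of (history, word) pairs could be binarized into one training set per word, keeping every positive example and only a random fraction alpha of the negatives.

```python
import random
from collections import defaultdict

def binarize_and_subsample(corpus, vocab, alpha, seed=0):
    """Turn (history, word) pairs into one binary training set per word.

    Each occurrence of word w is a positive example for classifier w; for
    every other classifier it would be a negative example, and we keep each
    such negative only with probability alpha.
    """
    rng = random.Random(seed)
    datasets = defaultdict(list)  # word -> list of (history, label) pairs
    for history, word in corpus:
        datasets[word].append((history, 1))             # keep all positives
        for other in vocab:                             # naive O(V) loop for clarity;
            if other != word and rng.random() < alpha:  # a real system would sample
                datasets[other].append((history, 0))    # the negatives directly
    return datasets

# Toy usage: with alpha = 0.25, roughly 25% of the negatives are retained.
corpus = [(("the",), "cat"), (("a",), "dog"), (("the",), "dog")]
data = binarize_and_subsample(corpus, vocab=["cat", "dog", "the", "a"], alpha=0.25)
```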

10 For certain types of LMs such as MELM, there are more benefits: the binarization leads to a set of completely independent classifiers to train, which allows easy parallelization and significantly lowers the memory requirement. [sent-37, score-0.145]

11 The goal of this paper is to show that a similar technique can also be used for language modeling and that it enables us to subsample data much more efficiently. [sent-40, score-0.096]

12 In section 2, we describe our binarization and subsampling techniques for language models with MELM and NNLM as two specific examples. [sent-43, score-0.502]

13 2 Approximating Language Models with Binary Classifiers. Suppose we have an LM that can be written in the form $P(w|h) = \frac{\exp a_w(h;\theta)}{\sum_{w'} \exp a_{w'}(h;\theta)}$ (2), where $a_w(h;\theta)$ is a parametrized history representation for word w. [sent-45, score-0.081]

14 Given a training corpus of word-history pairs with empirical distribution $\tilde{P}(h,w)$, the regularized log likelihood of the training set can be written as $L = \sum_h \tilde{P}(h) \sum_w \tilde{P}(w|h) \log P(w|h) - r(\theta)$ (3), where $r(\theta)$ is the regularizing function over the parameters. [sent-46, score-0.118]

15 For each word w, we can define a binary classifier that predicts whether the next word is w by $P_b(w|h) = \frac{\exp a_w(h;\theta)}{1 + \exp a_w(h;\theta)}$ (5). [sent-50, score-0.14]
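
As a quick contrast between the multinomial form in (2) and the per-word binary form in (5), the following NumPy sketch (hypothetical score vector, illustrative only) computes both from the same scores $a_w(h;\theta)$.

```python
import numpy as np

def softmax_lm(scores):
    """P(w|h) as in (2): normalize exp(a_w(h)) over the whole vocabulary."""
    e = np.exp(scores - scores.max())   # subtract max for numerical stability
    return e / e.sum()

def binary_lm(scores):
    """P_b(w|h) as in (5): an independent sigmoid per word, no shared normalizer."""
    return 1.0 / (1.0 + np.exp(-scores))

scores = np.array([2.0, 0.5, -1.0, 0.0])  # a_w(h; theta) for a toy 4-word vocabulary
p_softmax = softmax_lm(scores)            # sums to 1 by construction
p_binary = binary_lm(scores)              # sums to 1 only approximately
```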

16 The regularized training-set log likelihood for all the binary classifiers is given by $L_b = \sum_w \sum_h \tilde{P}(h) \left[ \tilde{P}(w|h) \log P_b(w|h) + (1-\tilde{P}(w|h)) \log(1-P_b(w|h)) \right] - \sum_w r_w(\theta)$ (6). [sent-51, score-0.26]

17 The regularized MLE for the binary classifiers satisfies $\sum_h \tilde{P}(h) \sum_w P_b(w|h) \nabla_\theta a_w(h;\theta) = \sum_{w,h} \tilde{P}(w,h) \nabla_\theta a_w(h;\theta) - \sum_w \nabla_\theta r_w(\theta)$ (7). [sent-55, score-0.24]

18 Thus, taking $P'(w|h) = P_b(w|h)$ from ML-trained binary classifiers gives an LM $P(w|h)$ that meets the MLE constraints for language models. [sent-57, score-0.109]

19 Therefore, if $\sum_w P_b(w|h) = 1$, ML training for the language model is equivalent to ML training of the binary classifiers and using the probabilities given by the classifiers as our LM probabilities. [sent-58, score-0.238]

20 Note that in practice, the probabilities given by the binary classifiers are not guaranteed to sum up to one. [sent-59, score-0.198]

21 Our hope is that for large enough data sets and rich enough history representations $a_w(h;\theta)$, we will get $\sum_w P_b(w|h) \approx 1$, so that renormalizing the classifiers to get $P'(w|h) = \frac{P_b(w|h)}{\sum_{w'} P_b(w'|h)}$ (8) will not change the MLE constraint too much. [sent-61, score-0.081]
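
A minimal sketch of the renormalization in (8): the per-word binary probabilities for a given test history are divided by their sum so that they form a valid distribution (names are illustrative).

```python
import numpy as np

def renormalize(p_binary):
    """P'(w|h) = P_b(w|h) / sum_w' P_b(w'|h), applied separately for each test history."""
    return p_binary / p_binary.sum()

p_binary = np.array([0.70, 0.20, 0.15, 0.05])  # per-word sigmoid outputs for one history
p_lm = renormalize(p_binary)                   # valid distribution that sums to 1
```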

22 The complexity of estimating each of the V binary classifiers is O(T) per iteration, also giving O(V T) per iteration in total. [sent-65, score-0.257]

23 However, as mentioned earlier, we are able to maximally subsample negative examples for each classifier. [sent-66, score-0.107]

24 Thus the classifier for w is trained using the C(w) positive examples and a proportion α of the T − C(w) negative examples. [sent-67, score-0.089]

25 Thus, our complexity for estimating all V classifiers is O(αV T). [sent-72, score-0.129]

26 The resulting training set for each binary classifier is a stratified sample (Neyman, 1934), and our estimate needs to be calibrated to account for this. [sent-73, score-0.187]

27 Since the training set subsamples negative examples by $\alpha$, the resulting classifier will have a likelihood ratio $\frac{P_b(w|h)}{1 - P_b(w|h)} = \frac{1}{\alpha} \exp a_w(h;\theta)$ (9). [sent-74, score-0.109]
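
One common way to calibrate a logistic classifier trained on a stratified sample is to scale its odds back by the subsampling rate, i.e. shift the logit by log(alpha); the sketch below assumes this is the correction implied by (9), with illustrative names.

```python
import math

def calibrated_prob(a_wh, alpha):
    """Correct a logit a_w(h) learned with negatives subsampled at rate alpha.

    Training on the stratified sample inflates the learned odds by roughly
    1/alpha, so the odds are multiplied back by alpha (equivalently, log(alpha)
    is added to the logit) before renormalizing across words.
    """
    odds = alpha * math.exp(a_wh)
    return odds / (1.0 + odds)

p = calibrated_prob(a_wh=1.3, alpha=0.05)  # example logit, 5% of negatives kept
```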

28 As described earlier, the complexity for each iteration of training is O(V T), where T is the size of the training corpus. [sent-91, score-0.099]

29 For arbitrary feature sets, however, it may not be possible to establish the required hierarchical relations and the normalizer still needs to be computed explicitly. [sent-93, score-0.086]

30 We propose a way to do this without incurring a significant loss of modeling power, by reframing the problem in terms of binary classification. [sent-98, score-0.155]

31 As mentioned above, we build V binary classifiers of the form in (5) to model the distribution over the V words. [sent-99, score-0.198]

32 The binary classifiers use the same features as the MELM of (10), and are given by $P_b(w|h) = \frac{\exp \sum_i \theta_i f_i(h,w)}{1 + \exp \sum_i \theta_i f_i(h,w)}$. [sent-100, score-0.198]
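
Because each word's classifier has its own parameter vector and no features are shared, scoring is just a sparse dot product followed by a sigmoid; the sketch below is illustrative (feature names and weights are made up), and it is this independence that lets the V classifiers be trained on separate machines.

```python
import math

def binary_melm_prob(active_features, theta_w):
    """P_b(w|h) for one word w: sigmoid of the sum of its active feature weights.

    active_features: ids of features firing for (h, w), e.g. n-gram or skip n-gram ids.
    theta_w: weight vector (here a dict) that belongs only to classifier w.
    """
    score = sum(theta_w.get(f, 0.0) for f in active_features)
    return 1.0 / (1.0 + math.exp(-score))

theta_dog = {"bigram:the_dog": 1.2, "skip1:a_*_dog": 0.4}  # hypothetical weights
p = binary_melm_prob(["bigram:the_dog", "skip1:a_*_dog"], theta_dog)
```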

33 This gives an important advantage in terms of parallelization: we have a set of binary classifiers with no feature sharing, which can be trained separately on different machines. [sent-104, score-0.198]

34 Through a nonlinear hidden layer, the neural network constructs a multinomial distribution at the output layer. [sent-115, score-0.109]

35 $P(w_i = k \mid w_{i-1}, \ldots, w_{i-n+1}) = \frac{e^{a_k}}{\sum_m e^{a_m}}$ (13), with $a_k = b_k + \sum_{l=1}^{h} W_{kl} \tanh\left(c_l + \sum_{j=1}^{(n-1)d} U_{lj} r_j\right)$ (14), where h denotes the hidden layer size, and b and c are the bias vectors for the output nodes and hidden nodes, respectively. [sent-119, score-0.153]
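
A NumPy sketch of the forward pass in (13)-(14), with shapes matching the notation (d-dimensional word representations concatenated over the n-1 history words, h hidden units, V output units); the parameter values here are random placeholders.

```python
import numpy as np

n, d, h, V = 3, 50, 200, 10000           # trigram NNLM, as in the experiments
rng = np.random.default_rng(0)

U = rng.normal(size=(h, (n - 1) * d))    # input-to-hidden weights U_lj
c = np.zeros(h)                          # hidden biases c_l
W = rng.normal(size=(V, h))              # hidden-to-output weights W_kl
b = np.zeros(V)                          # output biases b_k

def nnlm_forward(r):
    """r: concatenated representations of the n-1 history words, shape ((n-1)*d,)."""
    hidden = np.tanh(c + U @ r)          # inner part of eq. (14)
    a = b + W @ hidden                   # output activations a_k, eq. (14)
    e = np.exp(a - a.max())
    return e / e.sum()                   # softmax over the vocabulary, eq. (13)

p = nnlm_forward(rng.normal(size=(n - 1) * d))
```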

36 Stochastic gradient descent is often used to maximize the training data likelihood under such a model. [sent-121, score-0.086]

37 The four terms in the complexity correspond to computing the hidden layer, applying the nonlinearity, computing the output layer, and normalization, respectively. [sent-124, score-0.16]

38 For typical values as used in our experiments, namely n = 3, d = 50, h = 200, V = 10000, the majority of the complexity per iteration comes from the term hV . [sent-126, score-0.077]
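
Plugging the stated values into the four cost terms makes the claim easy to check; the counts below treat each term as a number of multiply-adds, which is a simplification.

```python
n, d, h, V = 3, 50, 200, 10000

hidden    = h * (n - 1) * d   # computing the hidden layer: 20,000
nonlin    = h                 # applying the tanh nonlinearity: 200
output    = h * V             # computing the output layer: 2,000,000
normalize = V                 # summing for the softmax normalizer: 10,000

total = hidden + nonlin + output + normalize
print(output / total)         # ~0.985: the hV term dominates the per-example cost
```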

39 A similar technique was introduced even earlier, taking the idea of factorizing the output layer to the extreme (Morin, 2005) by replacing the V-way prediction with a tree-style hierarchical prediction. [sent-134, score-0.113]

40 The idea is to select random subsets of the training data in each epoch of stochastic gradient descent. [sent-137, score-0.165]

41 We will show that our binary classifier representation leads to a more robust and promising subsampling strategy. [sent-139, score-0.611]

42 As with MELM, we notice that the parameters of (14) can be interpreted as also defining a set of V per-word binary classifiers $P_b(w_i = k \mid w_{i-1}, \ldots$ [sent-140, score-0.198]

43 $\ldots, w_{i-n+1}) = \frac{e^{a_k}}{1 + e^{a_k}}$ (16), but with a common hidden layer representation. [sent-143, score-0.12]

44 Since the hidden layer is shared, the classifiers are not independent, and the computations cannot easily be parallelized across multiple machines. [sent-146, score-0.268]

45 However, subsampling can be done differently for each classifier. [sent-147, score-0.471]

46 Each training instance serves as a positive example for one classifier and as a negative example for only a fraction α of the others. [sent-148, score-0.09]
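
A sketch of how one training instance could be handled in the binary setting: the shared hidden layer is computed once, the observed next word is the single positive example, and only a sampled fraction alpha of the other words contribute as negatives (illustrative NumPy, not the authors' implementation).

```python
import numpy as np

def binary_nnlm_loss(r, target, U, c, W, b, alpha, rng):
    """Per-instance loss for the binarized NNLM output of (16).

    The hidden layer is shared by all V classifiers, so it is computed once;
    the target word is the positive example, and roughly alpha*V of the
    remaining words are drawn as negatives for this instance.
    """
    hidden = np.tanh(c + U @ r)                  # shared hidden layer
    a = b + W @ hidden                           # activations a_k for every word k
    p = 1.0 / (1.0 + np.exp(-a))                 # per-word sigmoids, eq. (16)

    keep = rng.random(len(b)) < alpha            # subsample the negatives
    keep[target] = False
    pos_loss = -np.log(p[target] + 1e-12)
    neg_loss = -np.log(1.0 - p[keep] + 1e-12).sum()
    return pos_loss + neg_loss

V, h, d_in = 1000, 200, 100
rng = np.random.default_rng(0)
U, c = rng.normal(size=(h, d_in)), np.zeros(h)
W, b = rng.normal(size=(V, h)), np.zeros(V)
loss = binary_nnlm_loss(rng.normal(size=d_in), target=42,
                        U=U, c=c, W=W, b=b, alpha=0.05, rng=rng)
```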

47 We calibrate the classifiers after subsampled training as described above for MELM. [sent-150, score-0.17]

48 We want to point out that compared with MELM, subsampling the negatives here does not always reduce the complexity proportionally. [sent-152, score-0.538]

49 In cases where the vocabulary is very small, as shown in (15), computing the hidden layer can no longer be ignored. [sent-153, score-0.152]

50 Nonetheless, real-world applications such as speech recognition usually involve a vocabulary of considerable size; therefore, subsampling in the binary setting can still achieve a substantial speedup for NNLM. [sent-154, score-0.674]

51 This is one of the standard setups on which many researchers have reported perplexity results (Mikolov et al.). [sent-160, score-0.093]

52 The binary MELM is trained using stochastic gradient descent; no explicit regularization is performed (Zhang, 2004). [sent-162, score-0.176]

53 The learning rate starts at 0.1 and is halved every time the perplexity on the validation set stops decreasing. [sent-164, score-0.117]
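
A small sketch of this learning-rate schedule (the starting value is treated as a placeholder): the rate is halved whenever the validation perplexity fails to improve.

```python
def update_learning_rate(lr, val_ppl, best_ppl):
    """Halve the learning rate when validation perplexity stops decreasing."""
    if val_ppl >= best_ppl:            # no improvement this epoch
        lr *= 0.5
    return lr, min(val_ppl, best_ppl)

lr, best = 0.1, float("inf")           # 0.1 is a placeholder starting value
for val_ppl in [180.0, 172.0, 173.5, 171.9]:   # hypothetical per-epoch perplexities
    lr, best = update_learning_rate(lr, val_ppl, best)
```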

54 We compare perplexity with both the standard interpolated Kneser-Ney trigram model and the standard MELM. [sent-167, score-0.194]

55 The MELM is L2-regularized and estimated using a variant of generalized iterative scaling; the regularizer is tuned on the validation data. [sent-168, score-0.092]

56 To demonstrate the effectiveness of our subsampling approach, we compare the subsampled versions of the binary MELM and the standard MELM. [sent-169, score-0.664]

57 In order to obtain valid perplexities, the binary LMs are first renormalized explicitly according to equation (8) for each test history. [sent-170, score-0.146]

58 Table 1 shows the perplexity results when no subsampling is performed. [sent-179, score-0.541]

59 With only n-gram features, the binary MELM is able to match both standard MELM and the Kneser-Ney model. [sent-180, score-0.132]

60 We can also see that by adding features that are known to be able to improve the standard MELM, we can get the same improvement in the binary setting. [sent-181, score-0.132]

61 In contrast, the binary n-gram MELM(Feat-I) does not appear to be hurt by aggressive subsampling, even when 99% of the negative examples are discarded. [sent-184, score-0.192]

62 This suggests a very efficient way of training MELM: with only 1% of the computational cost, we are able to train an LM as powerful as the standard MELM. [sent-186, score-0.094]

63 This set of experiments is intended to demonstrate that the binary subsampling technique is useful on a large text corpus where training a standard MELM is not practical, and gives a better LM than the commonly used Kneser-Ney baseline. [sent-190, score-0.649]

64 The binary MELM is trained in the same way as described in the previous experiment. [sent-192, score-0.061]

65 We were unable to train a standard MELM with feat-III or a binary MELM without subsampling because of the computational cost. [sent-196, score-0.635]

66 However, with our binary subsampling technique, as shown in Table 2, we are able to benefit from skip n-gram features with only 5% of the standard MELM complexity. [sent-197, score-0.603]

67 To show that such improvement in perplexity translates into gains in practical applications, we conducted a set of speech recognition experiments. [sent-199, score-0.091]

68 The vocabulary is 20K; for the purpose of rescoring, we are only interested in the words that exist in the n-best list; therefore, for the binary MELM, we only have to train about 5300 binary classifiers. [sent-207, score-0.295]
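
Since only words that actually occur in the n-best lists need a classifier for rescoring, the set of classifiers to train can be collected directly from the hypotheses; this is a hedged sketch with made-up data structures, not the authors' pipeline.

```python
def rescoring_vocabulary(nbest_lists):
    """Collect the word types that occur in any n-best hypothesis.

    Only these words need a binary classifier for rescoring, even if the
    recognizer vocabulary itself is much larger (e.g. 20K words).
    """
    vocab = set()
    for hypotheses in nbest_lists:     # one list of hypothesis strings per utterance
        for hyp in hypotheses:
            vocab.update(hyp.split())
    return vocab

nbest = [["the cat sat", "the cat sack"], ["a dog ran"]]
words_to_model = rescoring_vocabulary(nbest)   # train classifiers only for these words
```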

69 The features for the binary MELM are n-gram features up to 4-grams plus skip-1 bigrams and skip1trigrams. [sent-209, score-0.109]

70 Table 3 demonstrates the word error rate (WER) improvement enabled by our binary subsampling technique. [sent-212, score-0.58]

71 More specifically, with only 50 machines, such a reduction in complexity allows us to train a binary MELM with skip n-gram features in less than two hours, which is not possible for the standard MELM on 37M words. [sent-215, score-0.204]

72 2 NNLM We evaluate our binary subsampling technique on the same Penn Treebank corpus as described for the MELM experiments. [sent-219, score-0.606]

73 Taking random subsets of the training data with the standard model is our primary baseline to compare with. [sent-220, score-0.096]

74 The NNLM we train is a trigram LM with tanh hidden units. [sent-221, score-0.098]

75 The size of the word representation and the size of the hidden layer are tuned minimally on the validation set (hidden layer size 200; representation size 50). [sent-222, score-0.236]

76 We adopt the same learning rate strategy as for training MELM, and the validation set is used to track perplexity performance and adjust learning rate correspondingly. [sent-223, score-0.119]

77 As with binary MELM, binary NNLM are explicitly renormalized to obtain valid perplexities. [sent-237, score-0.255]

78 For the standard NNLM, it means only a subset of the data is seen by the model and it does not change through epochs; for the binary NNLM, it means the subset of negative examples for each binary classifier does not change. [sent-239, score-0.088]

79 Table 4 shows the perplexity results for the NNLM alone, and the interpolated results are shown in Table 5. [sent-240, score-0.115]

80 With binary NNLM, we are able to retain all the gain after interpolation with only 20% of the negative examples. [sent-243, score-0.148]

81 Notice that with a fixed random subset, we are not replicating the experiments of Schwenk (Schwenk, 2007) exactly, although it is reasonable to expect both models are able to benefit from seeing different random subsets of the training data. [sent-244, score-0.093]

82 The standard NNLM benefits quite a lot going from using a fixed random subset to a variable random subset, but still demonstrates a clear tendency to deteriorate as we discard more and more data. [sent-246, score-0.088]

83 In contrast, the binary NNLM maintains all the performance gain with only 5% of the negative examples and still clearly outperforms its counterpart. [sent-247, score-0.167]

84 4 Discussion For the standard models, the amount of existent patterns fed into training heavily depends on the subsampling rate α. [sent-263, score-0.539]

85 Taking variable random subsets in each epoch can alleviate this problem to some extent, but still can not solve the fundamental problem. [sent-265, score-0.078]

86 In the binary setting, we are able to do subsampling differently. [sent-266, score-0.58]

87 While the complexity remains the same without subsampling, the majority of the complexity comes from processing negative examples for each binary classifier. [sent-267, score-0.235]

88 Therefore, we can achieve the same level of speedup as standard subsampling by only subsampling negative examples, and most importantly, it allows us to keep all the existent patterns (positive examples) in the training data. [sent-268, score-1.09]

89 Of course, negative examples are important and even in the binary case, we benefit from including more of them, but since we have so many of them, they might not be as critical as positive examples in determining the distribution. [sent-269, score-0.186]

90 The explanation here might lead us to wonder whether for the multi-class problem, subsampling the terms in the normalizer would achieve the same results. [sent-274, score-0.557]

91 However, in the multi-class setup, subsampling like this has to be very careful. [sent-277, score-0.471]

92 It is very unlikely that an arbitrary random subsampling will not harm the model. [sent-279, score-0.491]

93 Fortunately, in the binary case, the effect of random subsampling is much easier to analyze. [sent-280, score-0.6]

94 More generally, for many large-scale multiclass problems, binarization and subsampling can be an effective combination to consider. [sent-286, score-0.521]

95 5 Conclusion We propose efficient subsampling techniques for training large multi-class classifiers such as maximum entropy language models and neural network language models. [sent-287, score-0.683]

96 The main idea is to replace a multi-way decision by a set of binary decisions. [sent-288, score-0.109]

97 Since most of the training instances in the binary setting are negative examples, we can achieve substantial speedup by subsampling only the negatives. [sent-289, score-0.668]

98 We show by extensive experiments that this is more robust than subsampling subsets of training data for the original multi-class classifier. [sent-290, score-0.524]

99 Adaptive importance sampling to accelerate training of a neural probabilistic language model. IEEE Transactions on Neural Networks, Apr. [sent-301, score-0.108]

100 Solving large scale linear prediction problems using stochastic gradient descent algorithms. [sent-352, score-0.088]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('melm', 0.642), ('subsampling', 0.471), ('nnlm', 0.414), ('lm', 0.12), ('binary', 0.109), ('classifiers', 0.089), ('layer', 0.087), ('pb', 0.087), ('normalizer', 0.086), ('schwenk', 0.082), ('perplexity', 0.07), ('subsampled', 0.061), ('bengio', 0.058), ('khudanpur', 0.056), ('lms', 0.052), ('mikolov', 0.049), ('subsample', 0.049), ('gradient', 0.045), ('aw', 0.045), ('interpolated', 0.045), ('xw', 0.044), ('xp', 0.043), ('goodman', 0.043), ('senecal', 0.043), ('regularized', 0.042), ('speedup', 0.041), ('complexity', 0.04), ('negative', 0.039), ('neural', 0.039), ('kn', 0.039), ('xh', 0.039), ('sanjeev', 0.038), ('network', 0.037), ('parallelized', 0.037), ('renormalized', 0.037), ('history', 0.036), ('hidden', 0.033), ('trigram', 0.033), ('subsets', 0.033), ('vocabulary', 0.032), ('train', 0.032), ('classifier', 0.031), ('binarization', 0.031), ('epochs', 0.031), ('validation', 0.029), ('pw', 0.029), ('allwein', 0.029), ('eak', 0.029), ('logpb', 0.029), ('morin', 0.029), ('neyman', 0.029), ('puyang', 0.029), ('rifkin', 0.029), ('negatives', 0.027), ('stratified', 0.027), ('entropy', 0.027), ('wu', 0.027), ('technique', 0.026), ('trick', 0.026), ('mle', 0.026), ('wi', 0.025), ('rosenfeld', 0.025), ('aggressive', 0.025), ('deteriorate', 0.025), ('accelerate', 0.025), ('parallelization', 0.025), ('epoch', 0.025), ('existent', 0.025), ('incurring', 0.025), ('klautau', 0.025), ('rw', 0.024), ('sampling', 0.024), ('standard', 0.023), ('expensive', 0.022), ('computations', 0.022), ('stochastic', 0.022), ('modeling', 0.021), ('fi', 0.021), ('speech', 0.021), ('adaptive', 0.021), ('descent', 0.021), ('plm', 0.021), ('regularizer', 0.021), ('training', 0.02), ('xu', 0.02), ('random', 0.02), ('binarized', 0.019), ('trigrams', 0.019), ('multiclass', 0.019), ('iteration', 0.019), ('examples', 0.019), ('powerful', 0.019), ('stops', 0.018), ('acoustics', 0.018), ('yoshua', 0.018), ('derivative', 0.018), ('continuous', 0.018), ('namely', 0.018), ('ieee', 0.018), ('predictive', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 46 emnlp-2011-Efficient Subsampling for Training Complex Language Models

Author: Puyang Xu ; Asela Gunawardana ; Sanjeev Khudanpur

Abstract: We propose an efficient way to train maximum entropy language models (MELM) and neural network language models (NNLM). The advantage of the proposed method comes from a more robust and efficient subsampling technique. The original multi-class language modeling problem is transformed into a set of binary problems where each binary classifier predicts whether or not a particular word will occur. We show that the binarized model is as powerful as the standard model and allows us to aggressively subsample negative training examples without sacrificing predictive performance. Empirical results show that we can train MELM and NNLM at 1% ∼ 5% of the standard complexity with no loss in performance.

2 0.1070381 5 emnlp-2011-A Fast Re-scoring Strategy to Capture Long-Distance Dependencies

Author: Anoop Deoras ; Tomas Mikolov ; Kenneth Church

Abstract: A re-scoring strategy is proposed that makes it feasible to capture more long-distance dependencies in the natural language. Two pass strategies have become popular in a number of recognition tasks such as ASR (automatic speech recognition), MT (machine translation) and OCR (optical character recognition). The first pass typically applies a weak language model (n-grams) to a lattice and the second pass applies a stronger language model to N best lists. The stronger language model is intended to capture more longdistance dependencies. The proposed method uses RNN-LM (recurrent neural network language model), which is a long span LM, to rescore word lattices in the second pass. A hill climbing method (iterative decoding) is proposed to search over islands of confusability in the word lattice. An evaluation based on Broadcast News shows speedups of 20 over basic N best re-scoring, and word error rate reduction of 8% (relative) on a highly competitive setup.

3 0.068139769 76 emnlp-2011-Language Models for Machine Translation: Original vs. Translated Texts

Author: Gennadi Lembersky ; Noam Ordan ; Shuly Wintner

Abstract: We investigate the differences between language models compiled from original target-language texts and those compiled from texts manually translated to the target language. Corroborating established observations of Translation Studies, we demonstrate that the latter are significantly better predictors of translated sentences than the former, and hence fit the reference set better. Furthermore, translated texts yield better language models for statistical machine translation than original texts.

4 0.054241035 44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection

Author: Amittai Axelrod ; Xiaodong He ; Jianfeng Gao

Abstract: We explore efficient domain adaptation for the task of statistical machine translation based on extracting sentences from a large general-domain parallel corpus that are most relevant to the target domain. These sentences may be selected with simple cross-entropy based methods, of which we present three. As these sentences are not themselves identical to the in-domain data, we call them pseudo in-domain subcorpora. These subcorpora, 1% the size of the original, can then be used to train small domain-adapted Statistical Machine Translation (SMT) systems which outperform systems trained on the entire corpus. Performance is further improved when we use these domain-adapted models in combination with a true in-domain model. The results show that more training data is not always better, and that best results are attained via proper domain-relevant data selection, as well as combining in- and general-domain systems during decoding.

5 0.053714547 131 emnlp-2011-Syntactic Decision Tree LMs: Random Selection or Intelligent Design?

Author: Denis Filimonov ; Mary Harper

Abstract: Decision trees have been applied to a variety of NLP tasks, including language modeling, for their ability to handle a variety of attributes and sparse context space. Moreover, forests (collections of decision trees) have been shown to substantially outperform individual decision trees. In this work, we investigate methods for combining trees in a forest, as well as methods for diversifying trees for the task of syntactic language modeling. We show that our tree interpolation technique outperforms the standard method used in the literature, and that, on this particular task, restricting tree contexts in a principled way produces smaller and better forests, with the best achieving an 8% relative reduction in Word Error Rate over an n-gram baseline.

6 0.03822571 129 emnlp-2011-Structured Sparsity in Structured Prediction

7 0.031441741 120 emnlp-2011-Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions

8 0.030819081 125 emnlp-2011-Statistical Machine Translation with Local Language Models

9 0.028779916 100 emnlp-2011-Optimal Search for Minimum Error Rate Training

10 0.025508085 96 emnlp-2011-Multilayer Sequence Labeling

11 0.024542745 58 emnlp-2011-Fast Generation of Translation Forest for Large-Scale SMT Discriminative Training

12 0.024224313 133 emnlp-2011-The Imagination of Crowds: Conversational AAC Language Modeling using Crowdsourcing and Large Data Sources

13 0.024070218 65 emnlp-2011-Heuristic Search for Non-Bottom-Up Tree Structure Prediction

14 0.023928264 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances

15 0.023523347 99 emnlp-2011-Non-parametric Bayesian Segmentation of Japanese Noun Phrases

16 0.023158632 93 emnlp-2011-Minimum Imputed-Risk: Unsupervised Discriminative Training for Machine Translation

17 0.022775294 66 emnlp-2011-Hierarchical Phrase-based Translation Representations

18 0.021460431 10 emnlp-2011-A Probabilistic Forest-to-String Model for Language Generation from Typed Lambda Calculus Expressions

19 0.021167329 16 emnlp-2011-Accurate Parsing with Compact Tree-Substitution Grammars: Double-DOP

20 0.020936452 17 emnlp-2011-Active Learning with Amazon Mechanical Turk


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.093), (1, -0.003), (2, 0.013), (3, -0.039), (4, 0.029), (5, -0.005), (6, -0.02), (7, -0.09), (8, -0.025), (9, -0.041), (10, 0.052), (11, 0.007), (12, -0.029), (13, 0.034), (14, -0.082), (15, 0.081), (16, 0.064), (17, -0.001), (18, -0.002), (19, 0.044), (20, 0.133), (21, -0.067), (22, 0.099), (23, -0.148), (24, -0.05), (25, 0.025), (26, -0.072), (27, -0.082), (28, 0.117), (29, 0.115), (30, -0.004), (31, -0.127), (32, -0.117), (33, -0.091), (34, -0.103), (35, -0.096), (36, -0.039), (37, 0.042), (38, 0.206), (39, -0.113), (40, 0.093), (41, -0.14), (42, 0.026), (43, -0.042), (44, 0.027), (45, 0.1), (46, -0.145), (47, 0.155), (48, -0.062), (49, -0.001)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93341148 46 emnlp-2011-Efficient Subsampling for Training Complex Language Models

Author: Puyang Xu ; Asela Gunawardana ; Sanjeev Khudanpur

Abstract: We propose an efficient way to train maximum entropy language models (MELM) and neural network language models (NNLM). The advantage of the proposed method comes from a more robust and efficient subsampling technique. The original multi-class language modeling problem is transformed into a set of binary problems where each binary classifier predicts whether or not a particular word will occur. We show that the binarized model is as powerful as the standard model and allows us to aggressively subsample negative training examples without sacrificing predictive performance. Empirical results show that we can train MELM and NNLM at 1% ∼ 5% of the standard complexity with no loss in performance.

2 0.59942228 5 emnlp-2011-A Fast Re-scoring Strategy to Capture Long-Distance Dependencies

Author: Anoop Deoras ; Tomas Mikolov ; Kenneth Church

Abstract: A re-scoring strategy is proposed that makes it feasible to capture more long-distance dependencies in the natural language. Two pass strategies have become popular in a number of recognition tasks such as ASR (automatic speech recognition), MT (machine translation) and OCR (optical character recognition). The first pass typically applies a weak language model (n-grams) to a lattice and the second pass applies a stronger language model to N best lists. The stronger language model is intended to capture more longdistance dependencies. The proposed method uses RNN-LM (recurrent neural network language model), which is a long span LM, to rescore word lattices in the second pass. A hill climbing method (iterative decoding) is proposed to search over islands of confusability in the word lattice. An evaluation based on Broadcast News shows speedups of 20 over basic N best re-scoring, and word error rate reduction of 8% (relative) on a highly competitive setup.

3 0.50993156 131 emnlp-2011-Syntactic Decision Tree LMs: Random Selection or Intelligent Design?

Author: Denis Filimonov ; Mary Harper

Abstract: Decision trees have been applied to a variety of NLP tasks, including language modeling, for their ability to handle a variety of attributes and sparse context space. Moreover, forests (collections of decision trees) have been shown to substantially outperform individual decision trees. In this work, we investigate methods for combining trees in a forest, as well as methods for diversifying trees for the task of syntactic language modeling. We show that our tree interpolation technique outperforms the standard method used in the literature, and that, on this particular task, restricting tree contexts in a principled way produces smaller and better forests, with the best achieving an 8% relative reduction in Word Error Rate over an n-gram baseline.

4 0.45334423 76 emnlp-2011-Language Models for Machine Translation: Original vs. Translated Texts

Author: Gennadi Lembersky ; Noam Ordan ; Shuly Wintner

Abstract: We investigate the differences between language models compiled from original target-language texts and those compiled from texts manually translated to the target language. Corroborating established observations of Translation Studies, we demonstrate that the latter are significantly better predictors of translated sentences than the former, and hence fit the reference set better. Furthermore, translated texts yield better language models for statistical machine translation than original texts.

5 0.37453902 129 emnlp-2011-Structured Sparsity in Structured Prediction

Author: Andre Martins ; Noah Smith ; Mario Figueiredo ; Pedro Aguiar

Abstract: Linear models have enjoyed great success in structured prediction in NLP. While a lot of progress has been made on efficient training with several loss functions, the problem of endowing learners with a mechanism for feature selection is still unsolved. Common approaches employ ad hoc filtering or L1-regularization; both ignore the structure of the feature space, preventing practitioners from encoding structural prior knowledge. We fill this gap by adopting regularizers that promote structured sparsity, along with efficient algorithms to handle them. Experiments on three tasks (chunking, entity recognition, and dependency parsing) show gains in performance, compactness, and model interpretability.

6 0.29806593 106 emnlp-2011-Predicting a Scientific Communitys Response to an Article

7 0.27776325 32 emnlp-2011-Computing Logical Form on Regulatory Texts

8 0.25846165 44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection

9 0.23929688 133 emnlp-2011-The Imagination of Crowds: Conversational AAC Language Modeling using Crowdsourcing and Large Data Sources

10 0.19126977 27 emnlp-2011-Classifying Sentences as Speech Acts in Message Board Posts

11 0.18896732 96 emnlp-2011-Multilayer Sequence Labeling

12 0.18427807 93 emnlp-2011-Minimum Imputed-Risk: Unsupervised Discriminative Training for Machine Translation

13 0.18036428 26 emnlp-2011-Class Label Enhancement via Related Instances

14 0.17927793 120 emnlp-2011-Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions

15 0.17291865 48 emnlp-2011-Enhancing Chinese Word Segmentation Using Unlabeled Data

16 0.17156117 65 emnlp-2011-Heuristic Search for Non-Bottom-Up Tree Structure Prediction

17 0.16403785 12 emnlp-2011-A Weakly-supervised Approach to Argumentative Zoning of Scientific Documents

18 0.15935661 84 emnlp-2011-Learning the Information Status of Noun Phrases in Spoken Dialogues

19 0.15271454 74 emnlp-2011-Inducing Sentence Structure from Parallel Corpora for Reordering

20 0.15173422 7 emnlp-2011-A Joint Model for Extended Semantic Role Labeling


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.258), (23, 0.117), (36, 0.026), (37, 0.089), (45, 0.069), (53, 0.024), (54, 0.024), (57, 0.019), (62, 0.023), (64, 0.022), (66, 0.025), (69, 0.027), (79, 0.036), (82, 0.048), (85, 0.018), (90, 0.012), (96, 0.03), (98, 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.85705745 26 emnlp-2011-Class Label Enhancement via Related Instances

Author: Zornitsa Kozareva ; Konstantin Voevodski ; Shanghua Teng

Abstract: Class-instance label propagation algorithms have been successfully used to fuse information from multiple sources in order to enrich a set of unlabeled instances with class labels. Yet, nobody has explored the relationships between the instances themselves to enhance an initial set of class-instance pairs. We propose two graph-theoretic methods (centrality and regularization), which start with a small set of labeled class-instance pairs and use the instance-instance network to extend the class labels to all instances in the network. We carry out a comparative study with state-of-the-art knowledge harvesting algorithm and show that our approach can learn additional class labels while maintaining high accuracy. We conduct a comparative study between class-instance and instance-instance graphs used to propagate the class labels and show that the latter one achieves higher accuracy.

same-paper 2 0.73684549 46 emnlp-2011-Efficient Subsampling for Training Complex Language Models

Author: Puyang Xu ; Asela Gunawardana ; Sanjeev Khudanpur

Abstract: We propose an efficient way to train maximum entropy language models (MELM) and neural network language models (NNLM). The advantage of the proposed method comes from a more robust and efficient subsampling technique. The original multi-class language modeling problem is transformed into a set of binary problems where each binary classifier predicts whether or not a particular word will occur. We show that the binarized model is as powerful as the standard model and allows us to aggressively subsample negative training examples without sacrificing predictive performance. Empirical results show that we can train MELM and NNLM at 1% ∼ 5% of the standard complexity with no loss in performance.

3 0.54618895 12 emnlp-2011-A Weakly-supervised Approach to Argumentative Zoning of Scientific Documents

Author: Yufan Guo ; Anna Korhonen ; Thierry Poibeau

Abstract: Argumentative Zoning (AZ), the analysis of the argumentative structure of a scientific paper, has proved useful for a number of information access tasks. Current approaches to AZ rely on supervised machine learning (ML). Requiring large amounts of annotated data, these approaches are expensive to develop and port to different domains and tasks. A potential solution to this problem is to use weakly-supervised ML instead. We investigate the performance of four weakly-supervised classifiers on scientific abstract data annotated for multiple AZ classes. Our best classifier based on the combination of active learning and self-training outperforms our best supervised classifier, yielding a high accuracy of 81% when using just 10% of the labeled data. This result suggests that weakly-supervised learning could be employed to improve the practical applicability and portability of AZ across different information access tasks.

4 0.54352951 68 emnlp-2011-Hypotheses Selection Criteria in a Reranking Framework for Spoken Language Understanding

Author: Marco Dinarelli ; Sophie Rosset

Abstract: Reranking models have been successfully applied to many tasks of Natural Language Processing. However, there are two aspects of this approach that need a deeper investigation: (i) Assessment of hypotheses generated for reranking at classification phase: baseline models generate a list of hypotheses and these are used for reranking without any assessment; (ii) Detection of cases where reranking models provide a worst result: the best hypothesis provided by the reranking model is assumed to be always the best result. In some cases the reranking model provides an incorrect hypothesis while the baseline best hypothesis is correct, especially when baseline models are accurate. In this paper we propose solutions for these two aspects: (i) a semantic inconsistency metric to select possibly more correct n-best hypotheses, from a large set generated by an SLU baseline model. The selected hypotheses are reranked applying a state-of-the-art model based on Partial Tree Kernels, which encode SLU hypotheses in Support Vector Machines with complex structured features; (ii) finally, we apply a decision strategy, based on confidence values, to select the final hypothesis between the first ranked hypothesis provided by the baseline SLU model and the first ranked hypothesis provided by the re-ranker. We show the effectiveness of these solutions presenting comparative results obtained reranking hypotheses generated by a very accurate Conditional Random Field model. We evaluate our approach on the French MEDIA corpus. The results show significant improvements with respect to current state-of-the-art and previous re-ranking models.

5 0.53176802 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation

Author: Kevin Gimpel ; Noah A. Smith

Abstract: We present a quasi-synchronous dependency grammar (Smith and Eisner, 2006) for machine translation in which the leaves of the tree are phrases rather than words as in previous work (Gimpel and Smith, 2009). This formulation allows us to combine structural components of phrase-based and syntax-based MT in a single model. We describe a method of extracting phrase dependencies from parallel text using a target-side dependency parser. For decoding, we describe a coarse-to-fine approach based on lattice dependency parsing of phrase lattices. We demonstrate performance improvements for Chinese-English and UrduEnglish translation over a phrase-based baseline. We also investigate the use of unsupervised dependency parsers, reporting encouraging preliminary results.

6 0.52728379 5 emnlp-2011-A Fast Re-scoring Strategy to Capture Long-Distance Dependencies

7 0.52598232 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation

8 0.5251233 66 emnlp-2011-Hierarchical Phrase-based Translation Representations

9 0.52491939 59 emnlp-2011-Fast and Robust Joint Models for Biomedical Event Extraction

10 0.52466124 13 emnlp-2011-A Word Reordering Model for Improved Machine Translation

11 0.52428567 136 emnlp-2011-Training a Parser for Machine Translation Reordering

12 0.52136385 107 emnlp-2011-Probabilistic models of similarity in syntactic context

13 0.51994634 1 emnlp-2011-A Bayesian Mixture Model for PoS Induction Using Multiple Features

14 0.51368344 23 emnlp-2011-Bootstrapped Named Entity Recognition for Product Attribute Extraction

15 0.51353395 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances

16 0.51308298 53 emnlp-2011-Experimental Support for a Categorical Compositional Distributional Model of Meaning

17 0.51282108 78 emnlp-2011-Large-Scale Noun Compound Interpretation Using Bootstrapping and the Web as a Corpus

18 0.51141179 74 emnlp-2011-Inducing Sentence Structure from Parallel Corpora for Reordering

19 0.51048732 8 emnlp-2011-A Model of Discourse Predictions in Human Sentence Processing

20 0.50896311 137 emnlp-2011-Training dependency parsers by jointly optimizing multiple objectives