acl acl2011 acl2011-38 knowledge-graph by maker-knowledge-mining

38 acl-2011-An Empirical Investigation of Discounting in Cross-Domain Language Models


Source: pdf

Author: Greg Durrett ; Dan Klein

Abstract: We investigate the empirical behavior of ngram discounts within and across domains. When a language model is trained and evaluated on two corpora from exactly the same domain, discounts are roughly constant, matching the assumptions of modified Kneser-Ney LMs. However, when training and test corpora diverge, the empirical discount grows essentially as a linear function of the n-gram count. We adapt a Kneser-Ney language model to incorporate such growing discounts, resulting in perplexity improvements over modified Kneser-Ney and Jelinek-Mercer baselines.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We investigate the empirical behavior of ngram discounts within and across domains. [sent-3, score-0.688]

2 When a language model is trained and evaluated on two corpora from exactly the same domain, discounts are roughly constant, matching the assumptions of modified Kneser-Ney LMs. [sent-4, score-0.767]

3 However, when training and test corpora diverge, the empirical discount grows essentially as a linear function of the n-gram count. [sent-5, score-0.67]

4 We adapt a Kneser-Ney language model to incorporate such growing discounts, resulting in perplexity improvements over modified Kneser-Ney and Jelinek-Mercer baselines. [sent-6, score-0.318]

5 1 Introduction: Discounting, or subtracting from the count of each n-gram, is one of the core aspects of Kneser-Ney language modeling (Kneser and Ney, 1995). [sent-7, score-0.141]

6 For all but the smallest n-gram counts, Kneser-Ney uses a single discount, one that does not grow with the ngram count, because such constant-discounting was seen in early experiments on held-out data (Church and Gale, 1991). [sent-8, score-0.114]

7 However, due to increasing computational power and corpus sizes, language modeling today presents a different set of challenges than it did 20 years ago. [sent-9, score-0.022]

8 In particular, modeling crossdomain effects has become increasingly more important (Klakow, 2000; Moore and Lewis, 2010), and deployed systems must frequently process data that is out-of-domain from the standpoint of the language model. [sent-10, score-0.022]

9 In this work, we perform experiments on heldout data to evaluate how discounting behaves in the 24 cross-domain setting. [sent-11, score-0.214]

10 We find that, when training and testing on corpora that are as similar as possible, empirical discounts indeed do not grow with ngram count, which validates the parametric assumption of Kneser-Ney smoothing. [sent-12, score-0.932]

11 However, when the train and evaluation corpora differ, even slightly, discounts generally exhibit linear growth in the count of the n-gram, with the amount of growth being closely correlated with the corpus divergence. [sent-13, score-1.059]

12 Finally, we build a language model exploiting a parametric form of the growing discount and show perplexity gains of up to 5. [sent-14, score-0.75]

13 2 Discount Analysis: Underlying discounting is the idea that n-grams will occur fewer times in test data than they do in training data. [sent-16, score-0.304]

14 Suppose that we have collected counts on two corpora of the same size, which we will call our train and test corpora. [sent-18, score-0.221]

15 For an n-gram w = (w1, . . . , wn), let ktrain(w) denote the number of occurrences of w in the training corpus, and ktest(w) denote the number of occurrences of w in the test corpus. [sent-22, score-0.122]

16 We define the empirical discount of w to be d(w) = ktrain(w) − ktest(w) ; this will be negative when the n-gram occurs more in the test data than in the training data. [sent-23, score-0.576]

17 Let Wi = {w : ktrain(w) = i} be the set of n-grams with count i in the training corpus. [sent-24, score-0.025]

18 We define the average empirical discount function as d¯(i) = (1/|Wi|) Σ_{w ∈ Wi} d(w). [sent-25, score-0.552]
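
The quantities defined above can be computed directly from raw counts. A minimal sketch in Python, assuming whitespace-tokenized token lists as input; the function names and the max_count cutoff are illustrative assumptions, not taken from the paper:

from collections import Counter, defaultdict

def ngram_counts(tokens, n=3):
    # Count n-grams (as tuples of tokens) in a token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def average_empirical_discounts(train_tokens, test_tokens, n=3, max_count=100):
    # d(w) = ktrain(w) - ktest(w); d_bar(i) averages d(w) over all w with ktrain(w) = i.
    k_train = ngram_counts(train_tokens, n)
    k_test = ngram_counts(test_tokens, n)
    totals, sizes = defaultdict(float), defaultdict(int)
    for w, i in k_train.items():
        if i <= max_count:
            totals[i] += i - k_test.get(w, 0)
            sizes[i] += 1
    return {i: totals[i] / sizes[i] for i in sorted(sizes)}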

19 Kneser-Ney implicitly makes two assumptions: first, that discounts do not depend on n-gram count, i.e. [sent-27, score-0.571]

20 Modified Kneser-Ney relaxes this assumption slightly by having independent parameters for 1-count, 2-count, and many-count n-grams, but still assumes that d¯(i) is constant for i greater than two. [sent-30, score-0.19]

21 Second, by using the same discount for all n-grams with a given count, KneserNey assumes that the distribution of d(w) for w in a particular Wi is well-approximated by its mean. [sent-31, score-0.465]

22 In this section, we analyze whether or not the behavior of the average empirical discount function supports these two assumptions. [sent-32, score-0.552]

23 We perform experiments on various subsets of the documents in the English Gigaword corpus, chiefly drawn from New York Times (NYT) and Agence France Presse (AFP). [sent-33, score-0.053]

24 Similar corpora: To begin, we consider the NYT documents from Gigaword for the year 1995. [sent-36, score-0.121]

25 In order to create two corpora that are maximally domain-similar, we randomly assign half of these documents to train and half of them to test, yielding train and test corpora of approximately 50M words each, which we denote by NYT95 and NYT950. [sent-37, score-0.383]
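
A minimal sketch of the random document split described here (Python; the function name and fixed seed are illustrative assumptions):

import random

def split_documents_in_half(documents, seed=0):
    # Randomly assign whole documents to two halves to build maximally
    # domain-similar train/test corpora (roughly 50M words each in the paper's setup).
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    half = len(docs) // 2
    return docs[:half], docs[half:]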

26 Figure 1 shows the average empirical discounts d¯(i) for trigrams on this pair of corpora. [sent-38, score-0.713]

27 In this setting, we recover the results of Church and Gale (1991) in that discounts are approximately constant for ngram counts of two or greater. [sent-39, score-0.807]

28 Divergent corpora: In addition to these two corpora, which were produced from a single contiguous batch of documents, we consider testing on corpus pairs with varying degrees of domain difference. [sent-40, score-0.118]

29 We construct additional corpora NYT96, NYT06, AFP95, AFP96, and AFP06, by taking 50M words from documents in the indicated years of NYT and AFP data. [sent-41, score-0.121]

30 We then collect training counts on NYT95 and alternately take each of our five new corpora as the test data. [sent-42, score-0.232]

31 Figure 1 also shows the average empirical discount curves for these train/test pairs. [sent-43, score-0.586]

32 Even within NYT newswire data, we see growing discounts when the train and test corpora are drawn (Footnote 1: Gigaword is drawn from six newswire sources and contains both miscellaneous text and complete, contiguous documents, sorted chronologically.) [sent-44, score-1.118]

33 Our experiments deal exclusively with the document text, which constitutes the majority of Gigaword and is of higher quality than the miscellaneous text. [sent-45, score-0.035]

34 Figure 1: Average empirical trigram discounts d¯(i) (y-axis) vs. trigram count in train (x-axis) for six configurations, training on NYT95 and testing on the indicated corpora. [sent-46, score-0.892]

35 For each n-gram count k, we compute the average number of occurrences in test for all n-grams occurring k times in training data, then report k minus this quantity as the discount. [sent-47, score-0.357]

36 from different years, and between the NYT and AFP newswire, discounts grow even more quickly. [sent-49, score-0.623]

37 We observed these trends continuing steadily up into ngram counts in the hundreds, beyond which point it becomes difficult to robustly estimate discounts due to fewer n-gram types in this count range. [sent-50, score-0.831]

38 This result is surprising in light of the constant discounts observed for the NYT95/NYT950 pair. [sent-51, score-0.658]

39 Both of these factors are at play in the NYT95/NYT950 experiment, and yet only a small, constant discount is observed. [sent-53, score-0.552]

40 Our growing discounts must therefore be caused by other, larger-scale phenomena, such as shifts in the subjects of news articles over time or in the style of the writing between newswire sources. [sent-54, score-0.807]

41 The increasing rate of discount growth as the source changes and temporal divergence increases lends credence to this hypothesis. [sent-55, score-0.608]

42 Discounting by a single value is plausible in the case of similar train and test corpora, where the mean of the distribution (8. [sent-58, score-0.07]

43 0), but not in the case of divergent corpora, where the mean (6. [sent-60, score-0.068]

44 In Figure 2, we investigate the second assumption, namely that the distribution over discounts for a given n-gram count is well-approximated by its mean. [sent-64, score-0.712]

45 For similar corpora, this seems to be true, with a histogram of test counts for trigrams of count 10 that is nearly symmetric. [sent-65, score-0.284]

46 For divergent corpora, the data exhibit high skew: almost 40% of the trigrams simply never appear in the test data, and the distribution has very high standard deviation (17. [sent-66, score-0.187]

47 Using a discount that depends only on the n-gram count is less appropriate in this case. [sent-68, score-0.606]

48 In combination with the growing discounts of section 2. [sent-69, score-0.713]

49 1, these results point to the fact that modified Kneser-Ney does not faithfully model the discounting in even a mildly cross-domain setting. [sent-70, score-0.316]

50 2.3 Correlation of Divergence and Discounts: Intuitively, corpora that are more temporally distant within a particular newswire source should perhaps be slightly more distinct, and still a higher degree of divergence should exist between corpora from different newswire sources. [sent-72, score-0.362]

51 We now ask whether growth in discounts is correlated with train/test dissimilarity in a more quantitative way. [sent-74, score-0.716]

52 More negative values of the log likelihood indicate more dissimilar corpora, as the trained model is doing less well relative to the jackknife model. [sent-76, score-0.187]

53 count for n-grams occurring 30 times in training. [sent-77, score-0.212]

54 This dissimilarity metric resembles the cross-entropy difference used by Moore and Lewis (2010) to subsample for domain adaptation. [sent-79, score-0.041]

55 We compute this canonicalization for each of twenty pairs of corpora, with each corpus containing 240M trigram tokens between train and test. [sent-80, score-0.124]

56 The corpus pairs were chosen to span varying numbers of newswire sources and lengths of time in order to capture a wide range of corpus divergences. [sent-81, score-0.065]

57 The log likelihood difference and d¯(30) are negatively correlated with a correlation coefficient value of r = −0. [sent-83, score-0.108]
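
The correlation reported here can be reproduced with standard tools once the per-pair dissimilarity scores and d¯(30) values are in hand. A sketch in Python/numpy; the jackknife likelihood computation itself is not detailed in this excerpt, so only the correlation step is shown:

import numpy as np

def pearson_r(loglik_differences, d_bar_30_values):
    # Pearson correlation between corpus-pair dissimilarity (log likelihood
    # difference) and the average empirical discount at training count 30.
    return np.corrcoef(loglik_differences, d_bar_30_values)[0, 1]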

58 However, there was sufficient nonlinearity in the average empirical discount curves that neither of these parameters was an accurate proxy for d¯(i). [sent-86, score-0.662]

59 1 point to a remarkably pervasive phenomenon of growing empirical discounts, except in the case of extremely similar corpora. [sent-90, score-0.229]

60 Growing discounts of this sort were previously suggested by the model of Teh (2006). [sent-91, score-0.594]

61 However, we claim that the discounting phenomenon in our data is fundamentally different from his model’s prediction. [sent-92, score-0.246]

62 1, growing discounts only emerge when one evaluates against a dissimilar held-out corpus, whereas his model would predict discount growth even in NYT95/NYT950, where we do not observe it. [sent-94, score-1.31]

63 Bellegarda (2004) describes a range of techniques, from interpolation at either the count level or the model level (Bacchiani and Roark, 2003; Bacchiani et al. [sent-96, score-0.21]

64 Their work also improves on the second assumption of Kneser-Ney, that of the inadequacy of the average empirical discount as a discount constant, by employing various other features in order to provide other criteria on which to discount n-grams. [sent-99, score-1.509]

65 Taking a different approach, both Klakow (2000) and Moore and Lewis (2010) use subsampling to select the domain-relevant portion of a large, general corpus given a small in-domain corpus. [sent-100, score-0.051]

66 This can be interpreted as a form of hard discounting, and implicitly models both growing discounts, since frequent n-grams will appear in more of the rejected sentences, and nonuniform discounting over n-grams of each count, since the sentences are chosen according to a likelihood criterion. [sent-101, score-0.405]

67 Although we do not consider this second point in constructing our language model, an advantage of our approach over subsampling is that we use our entire training corpus, and in so doing compromise between minimizing errors from data sparsity and accommodating domain shifts to the extent possible. [sent-102, score-0.105]

68 3 A Growing Discount Language Model: We now implement and evaluate a language model that incorporates growing discounts. [sent-103, score-0.165]

69 3.1 Methods: Instead of using a fixed discount for most n-gram counts, as prescribed by modified Kneser-Ney, we discount by an increasing parametric function of the n-gram count. [sent-105, score-0.241]

70 To improve the fit of the model, we use dedicated parameters for count-1 and count-2 ngrams as in modified Kneser-Ney, yielding a model with five parameters per n-gram order. [sent-107, score-0.264]

71 We also instantiate this model with c fixed to one, so that the model is strictly linear (GDLM-LIN). [sent-109, score-0.046]
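
The exact parametric form is not given in this excerpt; below is a plausible sketch with five parameters per n-gram order (dedicated discounts d1 and d2 for counts one and two, and a + b·i^c above that, where fixing c = 1 gives the strictly linear GDLM-LIN variant). The parameter names and the power-law form are assumptions:

def growing_discount(i, d1, d2, a, b, c=1.0):
    # Count-dependent discount: dedicated values for count-1 and count-2
    # n-grams, and an increasing parametric function of the count above that.
    # With c fixed to one, the discount grows linearly in i (GDLM-LIN).
    if i <= 0:
        return 0.0
    if i == 1:
        return d1
    if i == 2:
        return d2
    return a + b * (i ** c)

Under this sketch, an n-gram seen i times in training would contribute a discounted count of max(i − growing_discount(i, ...), 0), with the subtracted mass redistributed through lower-order interpolation as in Kneser-Ney.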

72 As baselines for comparison, we use basic interpolated Kneser-Ney (KNLM), with one discount parameter per n-gram order, and modified interpolated Kneser-Ney (MKNLM), with three parameters per n-gram order, as described in (Chen and Goodman, 1998). [sent-110, score-0.644]

73 According to Chen and Goodman (1998), it is common to use different interpolation weights depending on the history count of an n-gram, since MLEs based on many samples are presumed to be more accurate than those with few samples. [sent-112, score-0.187]

74 We used five history count buckets so that JMLM would have the same number of parameters as GDLM. [sent-113, score-0.22]
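
A sketch of bucketed Jelinek-Mercer interpolation along these lines; the bucket boundaries below are illustrative assumptions, since the excerpt only states that five history-count buckets were used:

def history_bucket(history_count, boundaries=(1, 2, 5, 20)):
    # Map a history count to one of five buckets; each bucket has its own
    # interpolation weight, so JMLM has five parameters per n-gram order.
    for b, threshold in enumerate(boundaries):
        if history_count <= threshold:
            return b
    return len(boundaries)

def jm_interpolate(p_ml, p_backoff, history_count, lambdas):
    # lambdas: one weight per bucket; mix the MLE with the backoff estimate.
    lam = lambdas[history_bucket(history_count)]
    return lam * p_ml + (1.0 - lam) * p_backoff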

75 All five models are trigram models with type counts at the lower orders and independent discount or interpolation parameters for each order. [sent-114, score-0.708]

76 Parameters for GDLM, MKNLM, and KNLM are initialized based on estimates from d¯(i): the regression thereof for GDLM, and raw discounts for MKNLM and KNLM. [sent-115, score-0.594]

77 The parameters of JMLM are initialized to constants independent of the data. [sent-116, score-0.077]

78 These initializations are all heuristic and not guaranteed to be optimal, so we then iterate through the parameters of each model several times and perform line search. [sent-117, score-0.111]

79 Table 1: Perplexities of the growing discounts language model (GDLM) and its purely linear variant (GDLM-LIN), which are contributions of this work, versus the modified Kneser-Ney (MKNLM), basic Kneser-Ney (KNLM), and Jelinek-Mercer (JMLM) baselines. [sent-118, score-0.102]
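
A sketch of the coordinate-wise tuning loop mentioned in sentence 78 (Python); the offset grid and number of passes are assumptions, and perplexity stands for any callable that scores a parameter setting on the tune set:

def coordinate_line_search(perplexity, params,
                           deltas=(-0.2, -0.1, -0.05, 0.05, 0.1, 0.2), passes=3):
    # Iterate through the parameters several times; for each one, try a small
    # grid of offsets and keep whichever value lowers tune-set perplexity.
    best_params, best_score = dict(params), perplexity(params)
    for _ in range(passes):
        for name in list(best_params):
            for delta in deltas:
                trial = dict(best_params)
                trial[name] = best_params[name] + delta
                score = perplexity(trial)
                if score < best_score:
                    best_params, best_score = trial, score
    return best_params, best_score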

80 We report results for in-domain (NYT00+01) and out-of-domain (AFP02+05+06) training corpora, for two methods of closing the vocabulary. [sent-119, score-0.058]

81 For evaluation, we train, tune, and test on three disjoint corpora. [sent-121, score-0.031]

82 We consider two different training sets: one of 110M words of NYT from 2000 and 2001 (NYT00+01), and one of 110M words of AFP from 2002, 2005, and 2006 (AFP02+05+06). [sent-122, score-0.025]

83 In both cases, we compute d¯(i) and tune parameters on 110M words of NYT from 2002 and 2003, and do our final perplexity evaluation on 4M words of NYT from 2004. [sent-123, score-0.211]

84 Our tune set is chosen to be large so that we can initialize parameters based on the average empirical discount curve; in practice, one could compute empirical discounts based on a smaller tune set with the counts scaled up proportionately, or simply initialize to constant values. [sent-125, score-1.566]

85 We use two different methods to handle out-of-vocabulary (OOV) words: one scheme replaces any unigram token occurring fewer than five times in training with an UNK token, yielding a vocabulary of approximately 157K words, and the other scheme only keeps the top 50K words in the vocabulary. [sent-126, score-0.206]

86 9% in the NYT/NYT and NYT/AFP settings, respectively, and the constant-size vocabulary has OOV rates of 2% and 3. [sent-129, score-0.05]
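
A minimal sketch of the two vocabulary-closing schemes described in sentence 85 (Python; the function names are illustrative assumptions):

from collections import Counter

def close_vocabulary(train_tokens, min_count=5, top_k=None):
    # Scheme 1 (default): keep words seen at least min_count times in training.
    # Scheme 2: pass top_k (e.g. 50000) to keep only the most frequent words.
    counts = Counter(train_tokens)
    if top_k is not None:
        return {w for w, _ in counts.most_common(top_k)}
    return {w for w, c in counts.items() if c >= min_count}

def replace_oov(tokens, vocab, unk="UNK"):
    # Map out-of-vocabulary tokens to the UNK symbol.
    return [t if t in vocab else unk for t in tokens]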

87 As expected, for in-domain data, GDLM performs comparably to MKNLM, since the discounts do not grow and so there is little to be gained by choosing a parameterization that permits this. [sent-133, score-0.102]

88 Out-of-domain, our model outperforms MKNLM and JMLM by approximately 5% for both vocabulary sizes. [sent-134, score-0.079]

89 The out-of-domain perplexity values are competitive with those of Rosenfeld (1996), who trained on New York Times data and tested on AP News data under similar conditions, and even more aggressive closing of the vocabulary. [sent-135, score-0.107]

90 Moore and Lewis (2010) achieve lower perplexities, but they use in-domain training data that we do not include in our setting. [sent-136, score-0.025]

91 In the small vocabulary cross-domain setting, for GDLM-LIN, we find dtri(i) = 1. [sent-138, score-0.026]

92 05i as the trigram and bigram discount functions that minimize tune set perplexity. [sent-142, score-0.617]

93 In both cases, a growing discount is indeed learned from the tuning procedure, demonstrating the importance of this in our model. [sent-149, score-0.086]

94 Modeling nonlinear discount growth in GDLM yields only a small marginal improvement over the linear discounting model GDLM-LIN, so we prefer GDLM-LIN for its simplicity. [sent-150, score-0.779]

95 A somewhat surprising result is the strong performance of JMLM relative to MKNLM on the divergent corpus pair. [sent-151, score-0.068]

96 We conjecture that this is because the bucketed parameterization of JMLM gives it the freedom to change interpolation weights with n-gram count, whereas MKNLM has essentially a fixed discount. [sent-152, score-0.046]

97 This suggests that modified Kneser-Ney as it is usually parameterized may be a particularly poor choice in cross-domain settings. [sent-153, score-0.079]

98 Overall, these results show that the growing discount phenomenon detailed in section 2, beyond simply being present in out-of-domain held-out data, provides the basis for a new discounting scheme that allows us to improve perplexity relative to modified Kneser-Ney and Jelinek-Mercer baselines. [sent-154, score-1.006]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('discounts', 0.571), ('discount', 0.465), ('discounting', 0.214), ('mknlm', 0.204), ('gdlm', 0.153), ('jmlm', 0.153), ('growing', 0.142), ('nyt', 0.141), ('count', 0.141), ('bacchiani', 0.102), ('moore', 0.095), ('corpora', 0.094), ('constant', 0.087), ('modified', 0.079), ('afp', 0.078), ('growth', 0.077), ('knlm', 0.076), ('ktrain', 0.076), ('perplexity', 0.074), ('divergent', 0.068), ('newswire', 0.065), ('ngram', 0.062), ('lewis', 0.062), ('trigram', 0.061), ('tune', 0.059), ('counts', 0.057), ('trigrams', 0.055), ('empirical', 0.055), ('parameters', 0.054), ('church', 0.053), ('grow', 0.052), ('dbi', 0.051), ('dtri', 0.051), ('jackknife', 0.051), ('ktest', 0.051), ('mles', 0.051), ('subsampling', 0.051), ('likelihood', 0.049), ('parametric', 0.046), ('interpolation', 0.046), ('oov', 0.045), ('gale', 0.045), ('kneserney', 0.045), ('michiel', 0.045), ('divergence', 0.044), ('curve', 0.043), ('acoustics', 0.042), ('dissimilarity', 0.041), ('hsu', 0.039), ('train', 0.039), ('occurring', 0.037), ('klakow', 0.037), ('miscellaneous', 0.035), ('times', 0.034), ('curves', 0.034), ('goodman', 0.034), ('exhibit', 0.033), ('occurrences', 0.033), ('closing', 0.033), ('average', 0.032), ('bigram', 0.032), ('phenomenon', 0.032), ('dissimilar', 0.032), ('log', 0.032), ('test', 0.031), ('kneser', 0.031), ('gigaword', 0.031), ('approximately', 0.03), ('yielding', 0.029), ('signal', 0.029), ('speech', 0.029), ('shifts', 0.029), ('assumption', 0.027), ('documents', 0.027), ('roark', 0.027), ('correlated', 0.027), ('median', 0.026), ('vocabulary', 0.026), ('drawn', 0.026), ('brian', 0.026), ('training', 0.025), ('five', 0.025), ('adaptation', 0.024), ('compute', 0.024), ('initialize', 0.024), ('contiguous', 0.024), ('rates', 0.024), ('model', 0.023), ('initialized', 0.023), ('interpolated', 0.023), ('standpoint', 0.022), ('agence', 0.022), ('bellegarda', 0.022), ('jerome', 0.022), ('nonlinearity', 0.022), ('proportionately', 0.022), ('relaxes', 0.022), ('wid', 0.022), ('wi', 0.022), ('increasing', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 38 acl-2011-An Empirical Investigation of Discounting in Cross-Domain Language Models

Author: Greg Durrett ; Dan Klein

Abstract: We investigate the empirical behavior of ngram discounts within and across domains. When a language model is trained and evaluated on two corpora from exactly the same domain, discounts are roughly constant, matching the assumptions of modified Kneser-Ney LMs. However, when training and test corpora diverge, the empirical discount grows essentially as a linear function of the n-gram count. We adapt a Kneser-Ney language model to incorporate such growing discounts, resulting in perplexity improvements over modified Kneser-Ney and Jelinek-Mercer baselines.

2 0.22246392 175 acl-2011-Integrating history-length interpolation and classes in language modeling

Author: Hinrich Schutze

Abstract: Building on earlier work that integrates different factors in language modeling, we view (i) backing off to a shorter history and (ii) class-based generalization as two complementary mechanisms of using a larger equivalence class for prediction when the default equivalence class is too small for reliable estimation. This view entails that the classes in a language model should be learned from rare events only and should be preferably applied to rare events. We construct such a model and show that both training on rare events and preferable application to rare events improve perplexity when compared to a simple direct interpolation of class-based with standard language models.

3 0.076437131 163 acl-2011-Improved Modeling of Out-Of-Vocabulary Words Using Morphological Classes

Author: Thomas Mueller ; Hinrich Schuetze

Abstract: We present a class-based language model that clusters rare words of similar morphology together. The model improves the prediction of words after histories containing outof-vocabulary words. The morphological features used are obtained without the use of labeled data. The perplexity improvement compared to a state of the art Kneser-Ney model is 4% overall and 81% on unknown histories.

4 0.076047957 24 acl-2011-A Scalable Probabilistic Classifier for Language Modeling

Author: Joel Lang

Abstract: We present a novel probabilistic classifier, which scales well to problems that involve a large number ofclasses and require training on large datasets. A prominent example of such a problem is language modeling. Our classifier is based on the assumption that each feature is associated with a predictive strength, which quantifies how well the feature can predict the class by itself. The predictions of individual features can then be combined according to their predictive strength, resulting in a model, whose parameters can be reliably and efficiently estimated. We show that a generative language model based on our classifier consistently matches modified Kneser-Ney smoothing and can outperform it if sufficiently rich features are incorporated.

5 0.07067959 43 acl-2011-An Unsupervised Model for Joint Phrase Alignment and Extraction

Author: Graham Neubig ; Taro Watanabe ; Eiichiro Sumita ; Shinsuke Mori ; Tatsuya Kawahara

Abstract: We present an unsupervised model for joint phrase alignment and extraction using nonparametric Bayesian methods and inversion transduction grammars (ITGs). The key contribution is that phrases of many granularities are included directly in the model through the use of a novel formulation that memorizes phrases generated not only by terminal, but also non-terminal symbols. This allows for a completely probabilistic model that is able to create a phrase table that achieves competitive accuracy on phrase-based machine translation tasks directly from unaligned sentence pairs. Experiments on several language pairs demonstrate that the proposed model matches the accuracy of traditional two-step word alignment/phrase extraction approach while reducing the phrase table to a fraction of the original size.

6 0.070194364 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction

7 0.063984618 142 acl-2011-Generalized Interpolation in Decision Tree LM

8 0.058142029 233 acl-2011-On-line Language Model Biasing for Statistical Machine Translation

9 0.055763517 109 acl-2011-Effective Measures of Domain Similarity for Parsing

10 0.053484704 275 acl-2011-Semi-Supervised Modeling for Prenominal Modifier Ordering

11 0.050359339 152 acl-2011-How Much Can We Gain from Supervised Word Alignment?

12 0.047807857 268 acl-2011-Rule Markov Models for Fast Tree-to-String Translation

13 0.047721498 17 acl-2011-A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation

14 0.045884736 203 acl-2011-Learning Sub-Word Units for Open Vocabulary Speech Recognition

15 0.04414314 333 acl-2011-Web-Scale Features for Full-Scale Parsing

16 0.043827288 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words

17 0.043652847 208 acl-2011-Lexical Normalisation of Short Text Messages: Makn Sens a #twitter

18 0.043327678 335 acl-2011-Why Initialization Matters for IBM Model 1: Multiple Optima and Non-Strict Convexity

19 0.043125387 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents

20 0.042068146 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.117), (1, -0.005), (2, -0.012), (3, 0.017), (4, -0.012), (5, -0.021), (6, 0.02), (7, 0.013), (8, -0.002), (9, 0.057), (10, -0.009), (11, 0.008), (12, 0.027), (13, 0.09), (14, 0.019), (15, 0.037), (16, -0.143), (17, 0.081), (18, 0.013), (19, -0.068), (20, 0.095), (21, -0.082), (22, 0.036), (23, -0.078), (24, 0.005), (25, -0.052), (26, 0.081), (27, 0.013), (28, -0.021), (29, -0.022), (30, -0.021), (31, -0.148), (32, -0.037), (33, -0.072), (34, 0.123), (35, -0.026), (36, -0.048), (37, -0.017), (38, 0.086), (39, -0.013), (40, -0.019), (41, 0.061), (42, 0.087), (43, 0.02), (44, -0.042), (45, -0.014), (46, 0.042), (47, 0.131), (48, -0.031), (49, 0.034)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92592084 38 acl-2011-An Empirical Investigation of Discounting in Cross-Domain Language Models

Author: Greg Durrett ; Dan Klein

Abstract: We investigate the empirical behavior of ngram discounts within and across domains. When a language model is trained and evaluated on two corpora from exactly the same domain, discounts are roughly constant, matching the assumptions of modified Kneser-Ney LMs. However, when training and test corpora diverge, the empirical discount grows essentially as a linear function of the n-gram count. We adapt a Kneser-Ney language model to incorporate such growing discounts, resulting in perplexity improvements over modified Kneser-Ney and Jelinek-Mercer baselines.

2 0.84432143 175 acl-2011-Integrating history-length interpolation and classes in language modeling

Author: Hinrich Schutze

Abstract: Building on earlier work that integrates different factors in language modeling, we view (i) backing off to a shorter history and (ii) class-based generalization as two complementary mechanisms of using a larger equivalence class for prediction when the default equivalence class is too small for reliable estimation. This view entails that the classes in a language model should be learned from rare events only and should be preferably applied to rare events. We construct such a model and show that both training on rare events and preferable application to rare events improve perplexity when compared to a simple direct interpolation of class-based with standard language models.

3 0.70681614 24 acl-2011-A Scalable Probabilistic Classifier for Language Modeling

Author: Joel Lang

Abstract: We present a novel probabilistic classifier, which scales well to problems that involve a large number ofclasses and require training on large datasets. A prominent example of such a problem is language modeling. Our classifier is based on the assumption that each feature is associated with a predictive strength, which quantifies how well the feature can predict the class by itself. The predictions of individual features can then be combined according to their predictive strength, resulting in a model, whose parameters can be reliably and efficiently estimated. We show that a generative language model based on our classifier consistently matches modified Kneser-Ney smoothing and can outperform it if sufficiently rich features are incorporated.

4 0.7048589 142 acl-2011-Generalized Interpolation in Decision Tree LM

Author: Denis Filimonov ; Mary Harper

Abstract: In the face of sparsity, statistical models are often interpolated with lower order (backoff) models, particularly in Language Modeling. In this paper, we argue that there is a relation between the higher order and the backoff model that must be satisfied in order for the interpolation to be effective. We show that in n-gram models, the relation is trivially held, but in models that allow arbitrary clustering of context (such as decision tree models), this relation is generally not satisfied. Based on this insight, we also propose a generalization of linear interpolation which significantly improves the performance of a decision tree language model.

5 0.61588526 163 acl-2011-Improved Modeling of Out-Of-Vocabulary Words Using Morphological Classes

Author: Thomas Mueller ; Hinrich Schuetze

Abstract: We present a class-based language model that clusters rare words of similar morphology together. The model improves the prediction of words after histories containing outof-vocabulary words. The morphological features used are obtained without the use of labeled data. The perplexity improvement compared to a state of the art Kneser-Ney model is 4% overall and 81% on unknown histories.

6 0.49459773 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction

7 0.49326029 301 acl-2011-The impact of language models and loss functions on repair disfluency detection

8 0.47327954 17 acl-2011-A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation

9 0.45681524 319 acl-2011-Unsupervised Decomposition of a Document into Authorial Components

10 0.45525384 320 acl-2011-Unsupervised Discovery of Domain-Specific Knowledge from Text

11 0.45317155 203 acl-2011-Learning Sub-Word Units for Open Vocabulary Speech Recognition

12 0.41683307 208 acl-2011-Lexical Normalisation of Short Text Messages: Makn Sens a #twitter

13 0.41206318 233 acl-2011-On-line Language Model Biasing for Statistical Machine Translation

14 0.40607473 35 acl-2011-An ERP-based Brain-Computer Interface for text entry using Rapid Serial Visual Presentation and Language Modeling

15 0.40470222 97 acl-2011-Discovering Sociolinguistic Associations with Structured Sparsity

16 0.40447125 116 acl-2011-Enhancing Language Models in Statistical Machine Translation with Backward N-grams and Mutual Information Triggers

17 0.38615415 335 acl-2011-Why Initialization Matters for IBM Model 1: Multiple Optima and Non-Strict Convexity

18 0.37606245 74 acl-2011-Combining Indicators of Allophony

19 0.37173861 29 acl-2011-A Word-Class Approach to Labeling PSCFG Rules for Machine Translation

20 0.36323118 165 acl-2011-Improving Classification of Medical Assertions in Clinical Notes


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.015), (5, 0.043), (17, 0.067), (26, 0.038), (37, 0.082), (39, 0.058), (41, 0.048), (55, 0.054), (59, 0.028), (63, 0.23), (72, 0.066), (91, 0.041), (96, 0.138)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.80277538 38 acl-2011-An Empirical Investigation of Discounting in Cross-Domain Language Models

Author: Greg Durrett ; Dan Klein

Abstract: We investigate the empirical behavior of ngram discounts within and across domains. When a language model is trained and evaluated on two corpora from exactly the same domain, discounts are roughly constant, matching the assumptions of modified Kneser-Ney LMs. However, when training and test corpora diverge, the empirical discount grows essentially as a linear function of the n-gram count. We adapt a Kneser-Ney language model to incorporate such growing discounts, resulting in perplexity improvements over modified Kneser-Ney and Jelinek-Mercer baselines.

2 0.77256024 133 acl-2011-Extracting Social Power Relationships from Natural Language

Author: Philip Bramsen ; Martha Escobar-Molano ; Ami Patel ; Rafael Alonso

Abstract: Sociolinguists have long argued that social context influences language use in all manner of ways, resulting in lects 1. This paper explores a text classification problem we will call lect modeling, an example of what has been termed computational sociolinguistics. In particular, we use machine learning techniques to identify social power relationships between members of a social network, based purely on the content of their interpersonal communication. We rely on statistical methods, as opposed to language-specific engineering, to extract features which represent vocabulary and grammar usage indicative of social power lect. We then apply support vector machines to model the social power lects representing superior-subordinate communication in the Enron email corpus. Our results validate the treatment of lect modeling as a text classification problem – albeit a hard one – and constitute a case for future research in computational sociolinguistics. 1

3 0.66122806 32 acl-2011-Algorithm Selection and Model Adaptation for ESL Correction Tasks

Author: Alla Rozovskaya ; Dan Roth

Abstract: We consider the problem of correcting errors made by English as a Second Language (ESL) writers and address two issues that are essential to making progress in ESL error correction - algorithm selection and model adaptation to the first language of the ESL learner. A variety of learning algorithms have been applied to correct ESL mistakes, but often comparisons were made between incomparable data sets. We conduct an extensive, fair comparison of four popular learning methods for the task, reversing conclusions from earlier evaluations. Our results hold for different training sets, genres, and feature sets. A second key issue in ESL error correction is the adaptation of a model to the first language ofthe writer. Errors made by non-native speakers exhibit certain regularities and, as we show, models perform much better when they use knowledge about error patterns of the nonnative writers. We propose a novel way to adapt a learned algorithm to the first language of the writer that is both cheaper to implement and performs better than other adaptation methods.

4 0.65644467 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning

Author: Ines Rehbein ; Josef Ruppenhofer

Abstract: Active Learning (AL) has been proposed as a technique to reduce the amount of annotated data needed in the context of supervised classification. While various simulation studies for a number of NLP tasks have shown that AL works well on goldstandard data, there is some doubt whether the approach can be successful when applied to noisy, real-world data sets. This paper presents a thorough evaluation of the impact of annotation noise on AL and shows that systematic noise resulting from biased coder decisions can seriously harm the AL process. We present a method to filter out inconsistent annotations during AL and show that this makes AL far more robust when ap- plied to noisy data.

5 0.65279436 88 acl-2011-Creating a manually error-tagged and shallow-parsed learner corpus

Author: Ryo Nagata ; Edward Whittaker ; Vera Sheinman

Abstract: The availability of learner corpora, especially those which have been manually error-tagged or shallow-parsed, is still limited. This means that researchers do not have a common development and test set for natural language processing of learner English such as for grammatical error detection. Given this background, we created a novel learner corpus that was manually error-tagged and shallowparsed. This corpus is available for research and educational purposes on the web. In this paper, we describe it in detail together with its data-collection method and annotation schemes. Another contribution of this paper is that we take the first step toward evaluating the performance of existing POStagging/chunking techniques on learner corpora using the created corpus. These contributions will facilitate further research in related areas such as grammatical error detection and automated essay scoring.

6 0.65102959 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition

7 0.64806724 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

8 0.64785957 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models

9 0.64422083 209 acl-2011-Lexically-Triggered Hidden Markov Models for Clinical Document Coding

10 0.64418793 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment

11 0.64412493 175 acl-2011-Integrating history-length interpolation and classes in language modeling

12 0.64394802 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora

13 0.64319682 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations

14 0.6427123 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling

15 0.64166105 44 acl-2011-An exponential translation model for target language morphology

16 0.64108813 5 acl-2011-A Comparison of Loopy Belief Propagation and Dual Decomposition for Integrated CCG Supertagging and Parsing

17 0.64094049 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction

18 0.64051402 141 acl-2011-Gappy Phrasal Alignment By Agreement

19 0.64042211 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation

20 0.64036131 300 acl-2011-The Surprising Variance in Shortest-Derivation Parsing