acl acl2011 acl2011-24 knowledge-graph by maker-knowledge-mining

24 acl-2011-A Scalable Probabilistic Classifier for Language Modeling


Source: pdf

Author: Joel Lang

Abstract: We present a novel probabilistic classifier, which scales well to problems that involve a large number of classes and require training on large datasets. A prominent example of such a problem is language modeling. Our classifier is based on the assumption that each feature is associated with a predictive strength, which quantifies how well the feature can predict the class by itself. The predictions of individual features can then be combined according to their predictive strength, resulting in a model whose parameters can be reliably and efficiently estimated. We show that a generative language model based on our classifier consistently matches modified Kneser-Ney smoothing and can outperform it if sufficiently rich features are incorporated.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We present a novel probabilistic classifier, which scales well to problems that involve a large number of classes and require training on large datasets. [sent-5, score-0.077]

2 Our classifier is based on the assumption that each feature is associated with a predictive strength, which quantifies how well the feature can predict the class by itself. [sent-7, score-0.554]

3 The predictions of individual features can then be combined according to their predictive strength, resulting in a model whose parameters can be reliably and efficiently estimated. [sent-8, score-0.445]

4 We show that a generative language model based on our classifier consistently matches modified Kneser-Ney smoothing and can outperform it if sufficiently rich features are incorporated. [sent-9, score-0.452]

5 1 Introduction A Language Model (LM) is an important component within many natural language applications including speech recognition and machine translation. [sent-10, score-0.038]

6 The task of a generative LM is to assign a probability p(w) to a sequence of words w = w1 … [sent-11, score-0.03]

7 p(w) = ∏i p(wi | w1 … wi−1) (1) Thus, the central problem that arises from this formulation consists of estimating the probability p(wi | wi−N+1 … wi−1). [sent-18, score-0.035]

8 This can be viewed as a classification problem in which the target word Wi corresponds to the class that must be predicted, based on features extracted from the conditioning context, e.g. [sent-22, score-0.21]
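
The factorization behind sentences 6–8 can be sketched as a short, runnable illustration. It shows the chain rule with an order-N truncation of the history; the function and the toy uniform model below are our own illustrative names, not code or notation from the paper.

```python
import math

def sentence_log_prob(words, cond_prob, order=4):
    """p(w) = prod_i p(w_i | w_1 ... w_{i-1}), with the history truncated to the
    last N-1 words (order-N Markov assumption), as in Equation 1."""
    logp = 0.0
    for i, w in enumerate(words):
        context = tuple(words[max(0, i - order + 1):i])  # conditioning context
        logp += math.log(cond_prob(w, context))
    return logp

# Toy conditional model: uniform over a 4-word vocabulary, just to make this runnable.
VOCAB = ["the", "press", "conference", "</s>"]
uniform = lambda w, ctx: 1.0 / len(VOCAB)
print(sentence_log_prob(["the", "press", "conference", "</s>"], uniform))
```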

9 This paper describes a novel approach for modeling such conditional probabilities. [sent-25, score-0.049]

10 We propose a classifier which is based on the assumption that each feature has a predictive strength, quantifying how well the feature can predict the class (target word) by itself. [sent-26, score-0.605]

11 Then the predictions made by individual features can be combined into a mixture model, in which the prediction of each feature is weighted according to its predictive strength. [sent-27, score-0.589]

12 Some features (e.g. certain context words) are much more predictive than others, but the predictive strength of a particular feature often does not vary much across classes and can thus be assumed constant. [sent-30, score-0.638]

13 The main advantage of our model is that it is straightforward to incorporate rich features without sacrificing scalability or reliability of parameter estimation. [sent-31, score-0.22]

14 In addition, it is simple to implement and no feature selection is required. [sent-32, score-0.114]

15 Section 3 shows that a generative LM built with our classifier is competitive with modified Kneser-Ney smoothing and can outperform it if sufficiently rich features are incorporated. [sent-33, score-0.353]

16 The classification-based approach to language modeling was introduced by Rosenfeld (1996), who proposed an optimized variant of the maximum-entropy classifier (Berger et al. [sent-34, score-0.197]

17 Unfortunately, data sparsity resulting from the large number of classes makes it difficult to obtain reliable parameter estimates, even on large datasets, and the high computational costs make it difficult to train models on large datasets in the first place. [sent-36, score-0.249]

18 conditioning on the contextual features, the resulting LM is generative. [sent-39, score-0.088]

19 For example, using a vocabulary of 20000 words, Rosenfeld (1994) trained his model on up to 40M words, however employing heavy feature pruning and indicating that "the computational load was quite severe for a system this size". [sent-42, score-0.185]

20 Scalability is, however, very important, since moving to larger datasets is often the simplest way to obtain a better model. [sent-45, score-0.092]

21 Even the more scalable variant proposed by Mnih and Hinton (2008) is trained on a dataset consisting of only 14M words, also using a vocabulary of around 20000 words. [sent-48, score-0.103]

22 Van den Bosch (2005) proposes a decision-tree classifier which has been applied to training datasets with more than 100M words. [sent-49, score-0.218]

23 However, his model is non-probabilistic and thus a standard comparison with probabilistic models in terms of perplexity isn’t possible. [sent-50, score-0.039]
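
Sentence 23 refers to perplexity as the standard basis for comparing probabilistic language models. As a reminder of what is measured in the experiments later on, here is the usual computation from per-word log-probabilities; this is a generic illustration, not code from the paper.

```python
import math

def perplexity(word_log_probs):
    """Perplexity = exp(-(1/T) * sum_i log p(w_i | context_i))."""
    return math.exp(-sum(word_log_probs) / len(word_log_probs))

# Three words, each predicted with probability 0.1, give a perplexity of 10.
print(perplexity([math.log(0.1)] * 3))
```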

24 N-Gram models (Goodman, 2001) obtain estimates for p(wi | wi−N+1 … wi−1). [sent-51, score-0.034]

25 Because directly using the maximum-likelihood estimate would result in poor predictions, smoothing techniques are applied. [sent-55, score-0.088]

26 A modified interpolated form of Kneser-Ney smoothing (Kneser and Ney, 1995) was shown to consistently outperform a variety of other smoothing techniques (Chen and Goodman, 1999) and currently constitutes a state-of-the-art generative LM. [sent-56, score-0.271]

27 2 Model We are concerned with estimating a probability distribution p(Y | x) over a categorical class variable Y, conditional on a categorical feature vector x = (X1, …, XM). [sent-57, score-0.256]

28 While generalizations are conceivable, we will restrict the features Xk to be binary, i. [sent-61, score-0.065]

29 The binary input features x are extracted from the conditioning context wi−N+1 … wi−1. [sent-66, score-0.153]

30 The specific features we use for language modeling are given in Section 3. [sent-70, score-0.114]

31 We assume sparse features, such that typically only a small number of the binary features take value 1. [sent-71, score-0.065]

32 These features are referred to as the active features and predictions are based on them. [sent-72, score-0.306]

33 We introduce a bias feature which is active for every instance, in order to ensure that the set of active features is non-empty for each instance. [sent-73, score-0.436]

34 Individually, each active feature Xk is predictive of the class variable and predicts the class through a categorical distribution p(Y | xk). The model of Wood et al. [sent-74, score-0.666]

35 (2009) has somewhat higher performance; however, again due to high computational costs, the model has only been trained on limited data. [sent-75, score-0.031]

36 Since instances typically have several active features, the question is how to combine the individual predictions of these features into an overall prediction. [sent-77, score-0.181]

37 To this end we make the assumption that each feature Xk has a certain predictive strength θk ∈ R, where larger values indicate that the feature is more likely to predict correctly. [sent-78, score-0.572]

38 Note that since the set of active features varies across instances, so do the mixing proportions vk(x), and thus this is not a conventional mixture model, but rather a variable one. [sent-80, score-0.353]

39 We will therefore refer to our model as the variable mixture model (VMM). [sent-81, score-0.11]
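
A minimal sketch of the combination rule described in sentences 34–39: each active feature's categorical distribution αj,k = p(yj | xk) is mixed with a proportion vk(x) derived from the predictive strengths θk of the instance's active features. Normalizing the strengths with a softmax over the active set is an assumption made here for illustration; the extracted sentences only state that the proportions depend on the θk and on which features are active.

```python
import math
from collections import defaultdict

def vmm_predict(active_features, alpha, theta):
    """Variable mixture: p(y|x) = sum_k v_k(x) * alpha[k][y] over active features k,
    with mixing proportions v_k(x) obtained by normalizing the predictive strengths
    theta_k over this instance's active feature set (softmax, as an assumption)."""
    zs = [math.exp(theta[k]) for k in active_features]
    norm = sum(zs)
    v = {k: z / norm for k, z in zip(active_features, zs)}  # mixing proportions
    p = defaultdict(float)
    for k in active_features:
        for y, a in alpha[k].items():   # alpha[k][y] = p(y | feature k active)
            p[y] += v[k] * a
    return dict(p)

# Toy example: the bias feature "* * *" plus one context-word feature (names are illustrative).
alpha = {"* * *": {"said": 0.2, "the": 0.8},
         "w-1=Thompson": {"said": 0.9, "the": 0.1}}
theta = {"* * *": 0.0, "w-1=Thompson": 2.0}
print(vmm_predict(["* * *", "w-1=Thompson"], alpha, theta))
```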

40 In particular, our model differs from linear or log-linear interpolation models (Klakow, 1998), which combine a typically small number of components that are common across instances. [sent-82, score-0.03]

41 In contrast to the maximum-entropy classifier, which applies a linear predictor β⊤φ(y, x) to binary indicator variables and requires the normalizer Q(x), the VMM directly uses parameters which can be efficiently estimated. [sent-87, score-0.049]

42 the categorical parameters αj,k = p(yj | xk), which determine the likelihood of class yj in the presence of feature Xk; 2. [sent-90, score-0.484]

43 the parameters θk quantifying the predictive strength of each feature Xk. [sent-91, score-0.558]

44 The two types of parameters are estimated from a training dataset consisting of instances (x, y). [sent-92, score-0.049]

45 The smoothed count is computed as c′j,k = cj,k − D if cj,k > 0, and c′j,k = (D · NZk) / Zk if cj,k = 0, where cj,k is the raw count for class yj and feature Xk, NZk is the number of classes for which the raw count is non-zero, and Zk is the number of classes for which the raw count is zero. [sent-95, score-0.728]

46 The smoothing thus subtracts D from each non-zero count and redistributes the so-obtained mass evenly amongst all zero counts. [sent-97, score-0.145]

47 If all counts are non-zero no mass is redistributed. [sent-98, score-0.03]
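
Written out, the absolute-discounting step of sentences 45–47 looks roughly as follows. The sketch transcribes the reconstructed formula literally (D is subtracted from every non-zero count; the collected mass is spread over the zero-count classes, and nothing is redistributed when no count is zero); the discount value and toy counts are placeholders.

```python
def smooth_counts(raw_counts, classes, D=0.5):
    """c'_{j,k} = c_{j,k} - D        if c_{j,k} > 0
       c'_{j,k} = D * N_Zk / Z_k     if c_{j,k} = 0
    N_Zk = number of classes with non-zero raw count, Z_k = number with zero count.
    If no count is zero, no mass is redistributed (sentence 47)."""
    nonzero = {y for y in classes if raw_counts.get(y, 0) > 0}
    zero = [y for y in classes if y not in nonzero]
    bonus = D * len(nonzero) / len(zero) if zero else 0.0
    return {y: (raw_counts[y] - D if y in nonzero else bonus) for y in classes}

print(smooth_counts({"said": 3, "the": 1, "a": 0}, ["said", "the", "a"], D=0.5))
# -> {'said': 2.5, 'the': 0.5, 'a': 1.0}; the total mass (4.0) is preserved.
```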

48 Once the categorical parameters have been computed, we proceed by estimating the predictive strengths θ = (θ1, …, θM). [sent-99, score-0.386]

49 SGA is an online optimization method which iteratively computes the gradient ∇ for each instance and takes a step of size η in the direction of that gradient: θ(t+1) ← θ(t) + η∇ (8), where ∇ = (∂ll(h)/∂θ1, …, ∂ll(h)/∂θM). [sent-104, score-0.228]

50 The magnitude of the … Table 2: Feature types and examples for a model of order N = 4 and for the context Yesterday at the press conference Mr Thompson said. [sent-109, score-0.032]

51 For each feature type we write in parentheses the feature sets which include that type of feature. [sent-110, score-0.228]

52 The wildcard symbol * is used as a placeholder for arbitrary regular words. [sent-111, score-0.034]

53 The bias feature, which is active for each instance, is written as * * *. [sent-112, score-0.167]

54 In standard N-Gram models the bias feature corresponds to the unigram distribution. [sent-113, score-0.151]

55 In other words, we subtract the counts for a particular instance before computing the update (Equation 8) and add them back when the update has been executed. [sent-115, score-0.162]

56 In total, training only requires two passes over the data, as opposed to a single pass (plus smoothing) required by N-Gram models. [sent-116, score-0.074]
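
Sentences 48–49 and 55–56 describe the second training pass: stochastic gradient ascent over the predictive strengths θ, with the current instance's counts removed before each update and restored afterwards. The sketch below only shows the shape of that loop; the actual gradient of the per-instance log-likelihood is not reproduced on this page, so a dummy gradient function stands in for it.

```python
def sga_train(instances, theta, grad_fn, counts, eta=0.1, epochs=1):
    """Stochastic gradient ascent over the predictive strengths theta (sentence 49):
    theta <- theta + eta * grad.  Following sentence 55, the counts contributed by
    the current instance are subtracted before the update and added back afterwards
    (a leave-one-out correction)."""
    for _ in range(epochs):
        for x_active, y in instances:
            for k in x_active:                     # subtract this instance's counts
                counts[k][y] -= 1
            grad = grad_fn(x_active, y, theta, counts)  # dict: feature -> partial derivative
            for k, g in grad.items():
                theta[k] = theta.get(k, 0.0) + eta * g
            for k in x_active:                     # add the counts back
                counts[k][y] += 1
    return theta

# Toy run with a placeholder gradient (the true gradient of the VMM log-likelihood
# is not given on this page, so a dummy that nudges active features upward is used).
toy_counts = {"* * *": {"said": 2, "the": 3}, "w-1=Thompson": {"said": 2, "the": 1}}
toy_data = [(["* * *", "w-1=Thompson"], "said")]
dummy_grad = lambda x, y, th, c: {k: 1.0 for k in x}
print(sga_train(toy_data, {}, dummy_grad, toy_counts))
```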

57 All our models were built with the same 30,367-word vocabulary, which includes the sentence-end symbol and a special symbol for out-of-vocabulary words (UNK). [sent-128, score-0.068]

58 The vocabulary was compiled by selecting all words which occur more than four times in the data of week 31, which was not otherwise used for training or testing. [sent-129, score-0.124]

59 As development set we used the articles of week 50 (4.1M words). [sent-130, score-0.09]

60 … and as test set the articles of week 51 (3.…M words). [sent-131, score-0.09]

61 For training we used datasets of four different sizes: D1 (week 1, 3. [sent-133, score-0.092]

62 Features We use three different feature sets in our experiments. [sent-135, score-0.114]

63 The first feature set (basic, BA) consists of all features also used in standard N-Gram models, i. [sent-136, score-0.179]

64 The second feature set (short-range, SR) consists of all basic features as well as all skip N-Grams (Ney et al. [sent-141, score-0.179]

65 Moreover, all words occurring within the N − 1 context are included as bag features, i.e. [sent-143, score-0.033]

66 as features which indicate the occurrence of a word but not the particular position. [sent-145, score-0.065]

67 The third feature set (long-range, LR) is an extension of SR which also includes longer-distance features. [sent-146, score-0.181]

68 Specifically, this feature set additionally includes all unigram bag features up to a distance d = 9. [sent-147, score-0.212]

69 The feature types and examples of extracted features are given in Table 2. [sent-148, score-0.179]
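
The three feature sets of sentences 62–69 (basic N-gram context features, skip N-grams with the wildcard *, positionless bag features, long-range unigram bag features up to distance d = 9, plus the always-active bias feature * * *) can be sketched as below. The exact templates are only summarized in Table 2 of the paper, so the string encodings here are illustrative.

```python
def extract_features(history, order=4, long_range=False, max_dist=9):
    """Binary features for predicting the next word from `history` (preceding words,
    most recent last).  BA: suffixes of the last N-1 words, as in a standard N-gram
    model (the empty suffix is the bias feature '* * *').  SR: additionally, skip
    patterns with '*' wildcards and positionless bag features inside the N-1 window.
    LR: additionally, unigram bag features up to distance max_dist."""
    window = history[-(order - 1):]
    feats = {"* * *"}                                  # bias feature, always active
    for i in range(len(window)):                       # basic (BA) suffix features
        feats.add("ngram:" + " ".join(window[i:]))
    for i in range(len(window)):                       # short-range (SR) features
        skipped = window[:]
        skipped[i] = "*"                               # wildcard placeholder, as in Table 2
        feats.add("skip:" + " ".join(skipped))
        feats.add("bag:" + window[i])
    if long_range:                                     # long-range (LR) bag features
        for w in history[-max_dist:]:
            feats.add("bag:" + w)
    return feats

ctx = "Yesterday at the press conference Mr Thompson said".split()
print(sorted(extract_features(ctx, order=4, long_range=True)))
```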

70 Model Comparison We compared the VMM to modified Kneser-Ney (KN, see Section 1). [sent-149, score-0.035]

71 The order of a VMM is defined through the length of the context from which the basic and short-range features are extracted. [sent-150, score-0.097]

72 In particular, VM-BA of a certain order uses the same features as the N-Gram models of the same order and VM-SR uses the same conditioning context as the N-Gram models of the same order. [sent-151, score-0.217]

73 VM-LR in addition contains longer-distance features, beyond the order of the corresponding N-Gram models. [sent-152, score-0.099]

74 The order of the models was varied between N = 2 and N = 5. [sent-153, score-0.032]

75 However, for the larger two datasets D3 and D4, the order-5 models would not fit into the available RAM, which is why for order 5 we can only report scores for D1 and D2. [sent-156, score-0.156]

76 Model Parametrization We used the development set to determine the values for the absolute discounting parameter D (defined in Section 2.1). [sent-161, score-0.072]

77 … and the number of iterations for stochastic gradient ascent. [sent-162, score-0.164]

78 Stochastic gradient ascent yields best results with a single pass through all instances. [sent-165, score-0.136]

79 Results The results of our experiments are given in Table 3, which shows that for sufficiently high orders VM-SR matches KN on each dataset. [sent-171, score-0.139]

80 As expected, the VMM's strength partly stems from making better use of the information in the conditioning context than KN, as indicated by VM-SR matching KN whereas VM-BA does not. [sent-172, score-0.291]

81 At orders 4 and 5, VM-LR outperforms KN on all datasets, bringing improvements of around 10% for the two smaller training datasets D1 and D2. [sent-173, score-0.147]

82 Comparing VM-BA and VM-SR at order 4, we see that the 7 additional features used by VM-SR for every instance significantly improve performance, and the long-range features improve it further. [sent-174, score-0.133]

83 Thus richer feature sets consistently lead to higher model accuracy. [sent-175, score-0.179]

84 For orders 2 and 3, VM-SR is inferior to KN, because the SR feature set at order 2 contains no additional features over KN and at order 3 it contains only one additional feature per instance. [sent-177, score-0.412]

85 At order 4 VM-SR matches KN and, while KN gets worse at order 5, the VMM improves and outperforms KN by around 14%. [sent-178, score-0.103]

86 The training time (including disk IO) of the order 4 VM-SR on the largest dataset D4 is about 30 minutes, whereas KN takes about 6 minutes to train. [sent-179, score-0.067]

87 The main advantage of the VMM is that it is straightforward to incorporate rich features without sacrificing scalability or reliability of parameter estimation. [sent-181, score-0.22]

88 Moreover, the VMM is simple to implement and works ‘out-of-the-box’ without feature selection, or any special tuning or tweaking. [sent-182, score-0.114]

89 Applied to language modeling, the VMM results in a state-of-the-art generative language model whose relative performance compared to N-Gram models gets better as one incorporates richer feature sets. [sent-183, score-0.214]

90 It scales almost as well to large datasets as standard N-Gram models: training requires only two passes over the data as opposed to a single pass required by N-Gram models. [sent-184, score-0.204]

91 Thus, the experiments provide empirical evidence that the VMM is based on a reasonable set of modeling assumptions, which translate into an accurate and scalable model. [sent-185, score-0.118]

92 as a language model within a speech recognition or machine translation system. [sent-188, score-0.038]

93 Moreover, optimizing memory usage, for example via feature pruning or randomized algorithms, would allow incorporation of richer feature sets and would likely lead to further improvements, as indicated by the experiments in this paper. [sent-189, score-0.3]

94 We also intend to evaluate the performance of the VMM on other lexical prediction tasks and more generally, on other classification tasks with similar characteristics. [sent-190, score-0.046]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('vmm', 0.605), ('xk', 0.282), ('kn', 0.246), ('predictive', 0.18), ('strength', 0.164), ('yj', 0.142), ('categorical', 0.122), ('feature', 0.114), ('wi', 0.108), ('gradient', 0.096), ('active', 0.094), ('datasets', 0.092), ('week', 0.09), ('classifier', 0.089), ('conditioning', 0.088), ('smoothing', 0.088), ('weeks', 0.082), ('predictions', 0.082), ('scalable', 0.069), ('mixture', 0.068), ('stochastic', 0.068), ('longerdistance', 0.067), ('sga', 0.067), ('vyk', 0.067), ('wood', 0.067), ('vk', 0.067), ('features', 0.065), ('rosenfeld', 0.065), ('maximumentropy', 0.059), ('class', 0.057), ('count', 0.057), ('xa', 0.057), ('sr', 0.057), ('orders', 0.055), ('mnih', 0.055), ('quantifying', 0.051), ('sacrificing', 0.051), ('modeling', 0.049), ('parameters', 0.049), ('update', 0.048), ('lm', 0.048), ('zk', 0.047), ('equation', 0.047), ('prediction', 0.046), ('argm', 0.045), ('sufficiently', 0.045), ('smoothed', 0.044), ('bengio', 0.043), ('srilm', 0.043), ('variable', 0.042), ('mixing', 0.042), ('xy', 0.042), ('kneser', 0.041), ('lewis', 0.041), ('pass', 0.04), ('probabilistic', 0.039), ('matches', 0.039), ('proportions', 0.039), ('berger', 0.039), ('scalability', 0.039), ('speech', 0.038), ('scales', 0.038), ('discounting', 0.038), ('pruning', 0.037), ('bias', 0.037), ('den', 0.037), ('raw', 0.037), ('roark', 0.036), ('instance', 0.036), ('chen', 0.035), ('estimating', 0.035), ('whose', 0.035), ('richer', 0.035), ('modified', 0.035), ('minutes', 0.035), ('vocabulary', 0.034), ('parameter', 0.034), ('individual', 0.034), ('symbol', 0.034), ('passes', 0.034), ('xx', 0.034), ('estimates', 0.034), ('neural', 0.033), ('edinburgh', 0.033), ('bag', 0.033), ('order', 0.032), ('computed', 0.032), ('costs', 0.031), ('rich', 0.031), ('counts', 0.03), ('adaptive', 0.03), ('interpolation', 0.03), ('ney', 0.03), ('consistently', 0.03), ('generative', 0.03), ('goodman', 0.03), ('jy', 0.03), ('abby', 0.03), ('aec', 0.03), ('bec', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 24 acl-2011-A Scalable Probabilistic Classifier for Language Modeling

Author: Joel Lang

Abstract: We present a novel probabilistic classifier, which scales well to problems that involve a large number of classes and require training on large datasets. A prominent example of such a problem is language modeling. Our classifier is based on the assumption that each feature is associated with a predictive strength, which quantifies how well the feature can predict the class by itself. The predictions of individual features can then be combined according to their predictive strength, resulting in a model whose parameters can be reliably and efficiently estimated. We show that a generative language model based on our classifier consistently matches modified Kneser-Ney smoothing and can outperform it if sufficiently rich features are incorporated.

2 0.14363645 175 acl-2011-Integrating history-length interpolation and classes in language modeling

Author: Hinrich Schutze

Abstract: Building on earlier work that integrates different factors in language modeling, we view (i) backing off to a shorter history and (ii) class-based generalization as two complementary mechanisms of using a larger equivalence class for prediction when the default equivalence class is too small for reliable estimation. This view entails that the classes in a language model should be learned from rare events only and should be preferably applied to rare events. We construct such a model and show that both training on rare events and preferable application to rare events improve perplexity when compared to a simple direct interpolation of class-based with standard language models.

3 0.12311565 163 acl-2011-Improved Modeling of Out-Of-Vocabulary Words Using Morphological Classes

Author: Thomas Mueller ; Hinrich Schuetze

Abstract: We present a class-based language model that clusters rare words of similar morphology together. The model improves the prediction of words after histories containing outof-vocabulary words. The morphological features used are obtained without the use of labeled data. The perplexity improvement compared to a state of the art Kneser-Ney model is 4% overall and 81% on unknown histories.

4 0.11424962 100 acl-2011-Discriminative Feature-Tied Mixture Modeling for Statistical Machine Translation

Author: Bing Xiang ; Abraham Ittycheriah

Abstract: In this paper we present a novel discriminative mixture model for statistical machine translation (SMT). We model the feature space with a log-linear combination of multiple mixture components. Each component contains a large set of features trained in a maximum-entropy framework. All features within the same mixture component are tied and share the same mixture weights, where the mixture weights are trained discriminatively to maximize the translation performance. This approach aims at bridging the gap between the maximum-likelihood training and the discriminative training for SMT. It is shown that the feature space can be partitioned in a variety of ways, such as based on feature types, word alignments, or domains, for various applications. The proposed approach improves the translation performance significantly on a large-scale Arabic-to-English MT task.

5 0.10043386 142 acl-2011-Generalized Interpolation in Decision Tree LM

Author: Denis Filimonov ; Mary Harper

Abstract: In the face of sparsity, statistical models are often interpolated with lower order (backoff) models, particularly in Language Modeling. In this paper, we argue that there is a relation between the higher order and the backoff model that must be satisfied in order for the interpolation to be effective. We show that in n-gram models, the relation is trivially held, but in models that allow arbitrary clustering of context (such as decision tree models), this relation is generally not satisfied. Based on this insight, we also propose a generalization of linear interpolation which significantly improves the performance of a decision tree language model.

6 0.084601313 103 acl-2011-Domain Adaptation by Constraining Inter-Domain Variability of Latent Feature Representation

7 0.076047957 38 acl-2011-An Empirical Investigation of Discounting in Cross-Domain Language Models

8 0.073410019 14 acl-2011-A Hierarchical Model of Web Summaries

9 0.073211454 204 acl-2011-Learning Word Vectors for Sentiment Analysis

10 0.067581199 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models

11 0.065012604 78 acl-2011-Confidence-Weighted Learning of Factored Discriminative Language Models

12 0.064587899 44 acl-2011-An exponential translation model for target language morphology

13 0.064021938 325 acl-2011-Unsupervised Word Alignment with Arbitrary Features

14 0.063207582 79 acl-2011-Confidence Driven Unsupervised Semantic Parsing

15 0.059824016 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling

16 0.05923599 335 acl-2011-Why Initialization Matters for IBM Model 1: Multiple Optima and Non-Strict Convexity

17 0.058442853 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning

18 0.057887334 233 acl-2011-On-line Language Model Biasing for Statistical Machine Translation

19 0.053899538 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction

20 0.053873725 29 acl-2011-A Word-Class Approach to Labeling PSCFG Rules for Machine Translation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.17), (1, 0.002), (2, -0.013), (3, -0.003), (4, -0.015), (5, -0.027), (6, 0.024), (7, 0.019), (8, -0.003), (9, 0.095), (10, 0.025), (11, -0.01), (12, 0.021), (13, 0.095), (14, 0.027), (15, 0.02), (16, -0.131), (17, 0.036), (18, 0.018), (19, -0.033), (20, 0.074), (21, -0.141), (22, 0.065), (23, -0.068), (24, -0.043), (25, -0.035), (26, 0.052), (27, -0.009), (28, -0.043), (29, 0.011), (30, 0.029), (31, -0.063), (32, -0.066), (33, -0.018), (34, 0.106), (35, -0.018), (36, 0.001), (37, -0.044), (38, 0.104), (39, -0.009), (40, 0.079), (41, -0.021), (42, 0.068), (43, -0.036), (44, -0.045), (45, -0.027), (46, 0.06), (47, 0.064), (48, -0.013), (49, 0.064)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92232031 24 acl-2011-A Scalable Probabilistic Classifier for Language Modeling

Author: Joel Lang

Abstract: We present a novel probabilistic classifier, which scales well to problems that involve a large number of classes and require training on large datasets. A prominent example of such a problem is language modeling. Our classifier is based on the assumption that each feature is associated with a predictive strength, which quantifies how well the feature can predict the class by itself. The predictions of individual features can then be combined according to their predictive strength, resulting in a model whose parameters can be reliably and efficiently estimated. We show that a generative language model based on our classifier consistently matches modified Kneser-Ney smoothing and can outperform it if sufficiently rich features are incorporated.

2 0.86728364 175 acl-2011-Integrating history-length interpolation and classes in language modeling

Author: Hinrich Schutze

Abstract: Building on earlier work that integrates different factors in language modeling, we view (i) backing off to a shorter history and (ii) class-based generalization as two complementary mechanisms of using a larger equivalence class for prediction when the default equivalence class is too small for reliable estimation. This view entails that the classes in a language model should be learned from rare events only and should be preferably applied to rare events. We construct such a model and show that both training on rare events and preferable application to rare events improve perplexity when compared to a simple direct interpolation of class-based with standard language models.

3 0.8596313 38 acl-2011-An Empirical Investigation of Discounting in Cross-Domain Language Models

Author: Greg Durrett ; Dan Klein

Abstract: We investigate the empirical behavior of ngram discounts within and across domains. When a language model is trained and evaluated on two corpora from exactly the same domain, discounts are roughly constant, matching the assumptions of modified Kneser-Ney LMs. However, when training and test corpora diverge, the empirical discount grows essentially as a linear function of the n-gram count. We adapt a Kneser-Ney language model to incorporate such growing discounts, resulting in perplexity improvements over modified Kneser-Ney and Jelinek-Mercer baselines.

4 0.81473464 142 acl-2011-Generalized Interpolation in Decision Tree LM

Author: Denis Filimonov ; Mary Harper

Abstract: In the face of sparsity, statistical models are often interpolated with lower order (backoff) models, particularly in Language Modeling. In this paper, we argue that there is a relation between the higher order and the backoff model that must be satisfied in order for the interpolation to be effective. We show that in n-gram models, the relation is trivially held, but in models that allow arbitrary clustering of context (such as decision tree models), this relation is generally not satisfied. Based on this insight, we also propose a generalization of linear interpolation which significantly improves the performance of a decision tree language model.

5 0.66486549 78 acl-2011-Confidence-Weighted Learning of Factored Discriminative Language Models

Author: Viet Ha Thuc ; Nicola Cancedda

Abstract: Language models based on word surface forms only are unable to benefit from available linguistic knowledge, and tend to suffer from poor estimates for rare features. We propose an approach to overcome these two limitations. We use factored features that can flexibly capture linguistic regularities, and we adopt confidence-weighted learning, a form of discriminative online learning that can better take advantage of a heavy tail of rare features. Finally, we extend the confidence-weighted learning to deal with label noise in training data, a common case with discriminative language modeling.

6 0.66181749 163 acl-2011-Improved Modeling of Out-Of-Vocabulary Words Using Morphological Classes

7 0.62047273 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction

8 0.60903257 97 acl-2011-Discovering Sociolinguistic Associations with Structured Sparsity

9 0.60889316 301 acl-2011-The impact of language models and loss functions on repair disfluency detection

10 0.58621883 17 acl-2011-A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation

11 0.57156277 165 acl-2011-Improving Classification of Medical Assertions in Clinical Notes

12 0.55020607 199 acl-2011-Learning Condensed Feature Representations from Large Unsupervised Data Sets for Supervised Learning

13 0.52862293 116 acl-2011-Enhancing Language Models in Statistical Machine Translation with Backward N-grams and Mutual Information Triggers

14 0.52400506 100 acl-2011-Discriminative Feature-Tied Mixture Modeling for Statistical Machine Translation

15 0.52032584 335 acl-2011-Why Initialization Matters for IBM Model 1: Multiple Optima and Non-Strict Convexity

16 0.51914585 35 acl-2011-An ERP-based Brain-Computer Interface for text entry using Rapid Serial Visual Presentation and Language Modeling

17 0.51873732 319 acl-2011-Unsupervised Decomposition of a Document into Authorial Components

18 0.51611513 278 acl-2011-Semi-supervised condensed nearest neighbor for part-of-speech tagging

19 0.51246369 320 acl-2011-Unsupervised Discovery of Domain-Specific Knowledge from Text

20 0.50881886 102 acl-2011-Does Size Matter - How Much Data is Required to Train a REG Algorithm?


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.02), (17, 0.048), (26, 0.016), (37, 0.092), (39, 0.028), (41, 0.041), (55, 0.459), (59, 0.027), (72, 0.034), (91, 0.041), (96, 0.118)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.86315972 24 acl-2011-A Scalable Probabilistic Classifier for Language Modeling

Author: Joel Lang

Abstract: We present a novel probabilistic classifier, which scales well to problems that involve a large number of classes and require training on large datasets. A prominent example of such a problem is language modeling. Our classifier is based on the assumption that each feature is associated with a predictive strength, which quantifies how well the feature can predict the class by itself. The predictions of individual features can then be combined according to their predictive strength, resulting in a model whose parameters can be reliably and efficiently estimated. We show that a generative language model based on our classifier consistently matches modified Kneser-Ney smoothing and can outperform it if sufficiently rich features are incorporated.

2 0.83631498 78 acl-2011-Confidence-Weighted Learning of Factored Discriminative Language Models

Author: Viet Ha Thuc ; Nicola Cancedda

Abstract: Language models based on word surface forms only are unable to benefit from available linguistic knowledge, and tend to suffer from poor estimates for rare features. We propose an approach to overcome these two limitations. We use factored features that can flexibly capture linguistic regularities, and we adopt confidence-weighted learning, a form of discriminative online learning that can better take advantage of a heavy tail of rare features. Finally, we extend the confidence-weighted learning to deal with label noise in training data, a common case with discriminative language modeling.

3 0.82778651 275 acl-2011-Semi-Supervised Modeling for Prenominal Modifier Ordering

Author: Margaret Mitchell ; Aaron Dunlop ; Brian Roark

Abstract: In this paper, we argue that ordering prenominal modifiers, typically pursued as a supervised modeling task, is particularly well-suited to semi-supervised approaches. By relying on automatic parses to extract noun phrases, we can scale up the training data by orders of magnitude. This minimizes the predominant issue of data sparsity that has informed most previous approaches. We compare several recent approaches, and find improvements from additional training data across the board; however, none outperform a simple n-gram model.

4 0.81811202 124 acl-2011-Exploiting Morphology in Turkish Named Entity Recognition System

Author: Reyyan Yeniterzi

Abstract: Turkish is an agglutinative language with complex morphological structures, therefore using only word forms is not enough for many computational tasks. In this paper we analyze the effect of morphology in a Named Entity Recognition system for Turkish. We start with the standard word-level representation and incrementally explore the effect of capturing syntactic and contextual properties of tokens. Furthermore, we also explore a new representation in which roots and morphological features are represented as separate tokens instead of representing only words as tokens. Using syntactic and contextual properties with the new representation provide an 7.6% relative improvement over the baseline.

5 0.7482686 144 acl-2011-Global Learning of Typed Entailment Rules

Author: Jonathan Berant ; Ido Dagan ; Jacob Goldberger

Abstract: Extensive knowledge bases of entailment rules between predicates are crucial for applied semantic inference. In this paper we propose an algorithm that utilizes transitivity constraints to learn a globally-optimal set of entailment rules for typed predicates. We model the task as a graph learning problem and suggest methods that scale the algorithm to larger graphs. We apply the algorithm over a large data set of extracted predicate instances, from which a resource of typed entailment rules has been recently released (Schoenmackers et al., 2010). Our results show that using global transitivity information substantially improves performance over this resource and several baselines, and that our scaling methods allow us to increase the scope of global learning of entailment-rule graphs.

6 0.72837865 237 acl-2011-Ordering Prenominal Modifiers with a Reranking Approach

7 0.67247438 245 acl-2011-Phrase-Based Translation Model for Question Retrieval in Community Question Answer Archives

8 0.53043336 150 acl-2011-Hierarchical Text Classification with Latent Concepts

9 0.52639991 175 acl-2011-Integrating history-length interpolation and classes in language modeling

10 0.52041399 116 acl-2011-Enhancing Language Models in Statistical Machine Translation with Backward N-grams and Mutual Information Triggers

11 0.51834977 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning

12 0.51673871 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling

13 0.51258105 85 acl-2011-Coreference Resolution with World Knowledge

14 0.50378752 17 acl-2011-A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation

15 0.4985708 197 acl-2011-Latent Class Transliteration based on Source Language Origin

16 0.49787396 135 acl-2011-Faster and Smaller N-Gram Language Models

17 0.49641073 280 acl-2011-Sentence Ordering Driven by Local and Global Coherence for Summary Generation

18 0.49522218 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora

19 0.49437103 38 acl-2011-An Empirical Investigation of Discounting in Cross-Domain Language Models

20 0.49093571 9 acl-2011-A Cross-Lingual ILP Solution to Zero Anaphora Resolution