acl acl2011 acl2011-175 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Hinrich Schutze
Abstract: Building on earlier work that integrates different factors in language modeling, we view (i) backing off to a shorter history and (ii) class-based generalization as two complementary mechanisms of using a larger equivalence class for prediction when the default equivalence class is too small for reliable estimation. This view entails that the classes in a language model should be learned from rare events only and should be preferably applied to rare events. We construct such a model and show that both training on rare events and preferable application to rare events improve perplexity when compared to a simple direct interpolation of class-based with standard language models.
Reference: text
sentIndex sentText sentNum sentScore
1 This view entails that the classes in a language model should be learned from rare events only and should be preferably applied to rare events. [sent-2, score-0.73]
2 We construct such a model and show that both training on rare events and preferable application to rare events improve perplexity when compared to a simple direct interpolation of class-based with standard language models. [sent-3, score-1.092]
3 This is partially due to the additional cost of creating classes and using classes as part of the model. [sent-9, score-0.274]
4 But an equally important reason is that most models that integrate class-based information do so by way of a simple interpolation and achieve only a modest improvement in performance. [sent-10, score-0.172]
5 In particular, the best probability estimate for frequent events is often the maximum likelihood estimator and this estimator is hard to improve by using other information sources like classes or word similarity. [sent-13, score-0.466]
6 We therefore design a model that attempts to focus the effect of class-based generalization on rare events. [sent-14, score-0.248]
7 HI models address the challenge that frequent events are best estimated by a method close to maximum likelihood by selecting appropriate values for the interpolation weights. [sent-20, score-0.445]
8 In fact, we will use the interpolation weights of a KN model to determine how much weight to give to each component of the interpolation. [sent-25, score-0.169]
9 The difference from a KN model is merely that the lower-order distribution is not the lower-order KN distribution (as in standard KN), but instead an interpolation of the lower-order KN distribution and a class-based distribution. [sent-26, score-0.169]
10 We will show that this method of integrating history interpolation and classes significantly increases the performance of a language model. [sent-27, score-0.333]
11 Focusing the effect of classes on rare events has another important consequence: if this is the right way of using classes, then they should not be formed based on all events in the training set, but only based on rare events. [sent-28, score-0.948]
12 Finally, we introduce a second discounting method into the model that differs from KN. [sent-30, score-0.202]
13 We propose a polynomial discount and show a significant improvement compared to using KN discounting only. [sent-32, score-0.415]
14 Section 3 reviews the KN model and introduces two models, the Dupont-Rosenfeld model (a “recursive” model) and a top-level interpolated model, that integrate the KN model (a history interpolation model) with a class model. [sent-35, score-0.417]
15 Based on an analysis of strengths and weaknesses of Dupont-Rosenfeld and top-level interpolated models, we present a new polynomial discounting mechanism that does better than either in Section 6. [sent-38, score-0.393]
16 The novelty of our approach is that we integrate phrase-level classes into a KN model. [sent-52, score-0.17]
17 Hierarchical clustering (McMahon and Smith, 1996; Zitouni and Zhou, 2007; Zitouni and Zhou, 2008) has the advantage that the size of the class to be used in a specific context is not fixed, but can be chosen at an optimal level of the hierarchy. [sent-53, score-0.198]
18 The key novelty of our clustering method is that clusters are formed based on rare events in the training corpus. [sent-55, score-0.65]
19 However, the importance of rare events for clustering in language modeling has not been investigated before. [sent-58, score-0.552]
20 3 Models In this section, we introduce the three models that we compare in our experiments: Kneser-Ney model, Dupont-Rosenfeld model, and top-level interpolation model. [sent-64, score-0.172]
21 The key idea of the improved model we will adopt is that class generalization ought to play the same role in history-interpolated models as the lower-order distributions: they should improve estimates for unseen and rare events. [sent-75, score-0.37]
22 For a trigram model, this means that we interpolate pKN(w3|w2) and pB(w3|w1w2) on the first backoff level and pKN(w3) and pB(w3|w2) on the second backoff level, where pB is the (Brown) class model (see Section 4 for details on pB). [sent-77, score-0.288]
23 We cluster bigram histories and unigram histories separately and write pB(w3|w1w2) for the bigram cluster model and pB(w3|w2) for the unigram cluster model. [sent-82, score-0.827]
24 The unigram distribution of the Dupont-Rosenfeld model is set to the unigram distribution of the KN model: pDR(w) = pKN(w). [sent-84, score-0.206]
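To make this construction concrete, the following is a minimal sketch of such a Dupont-Rosenfeld-style trigram model. Every name here (kn.discounted_*, kn.gamma_*, pb_bi, pb_uni, alpha1, alpha2) is an illustrative assumption rather than the paper's notation, and the paper's exact estimator may differ in its details.

```python
# Minimal sketch, assuming a KN object that exposes discounted higher-order
# estimates, backoff weights gamma, and a unigram distribution; pb_bi/pb_uni
# are the bigram- and unigram-history class models, alpha1/alpha2 are
# hypothetical interpolation weights.

def p_dr_bigram(w3, w2, kn, pb_uni, alpha2):
    # Second backoff level: mix the unigram-history class model with pKN(w3).
    lower = alpha2 * pb_uni(w3, w2) + (1.0 - alpha2) * kn.p_unigram(w3)
    return kn.discounted_bigram(w3, w2) + kn.gamma_bigram(w2) * lower

def p_dr_trigram(w3, w1, w2, kn, pb_uni, pb_bi, alpha1, alpha2):
    # First backoff level: mix the bigram-history class model with the
    # recursively defined lower-order distribution.
    lower = (alpha1 * pb_bi(w3, w1, w2)
             + (1.0 - alpha1) * p_dr_bigram(w3, w2, kn, pb_uni, alpha2))
    return kn.discounted_trigram(w3, w1, w2) + kn.gamma_trigram(w1, w2) * lower
```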
25 Most importantly, it allows a truly parallel backoff whereas in our model the recursive backoff distribution pDR is interpolated with a class distribution pB that is not backed off. [sent-86, score-0.273]
26 The strength of history interpolation is that estimates for frequent events are close to ML, e.g. [sent-90, score-0.491]
27 3 Top-level interpolation Class-based models are often combined with other models by interpolation, starting with the work by Brown et al. [sent-96, score-0.211]
28 Since we cluster both unigrams and bigrams, we interpolate three models: pTOP(w3|w1w2) = µ1(w1w2)pB(w3|w1w2) + µ2(w2)pB(w3|w2) + (1 − µ1(w1w2) − µ2(w2))pKN(w3|w1w2), where µ1(w1w2) = λ1 if w1w2 ∈ B2 and 0 otherwise, µ2(w2) = λ2 if w2 ∈ B1 and 0 otherwise, and λ1 and λ2 are parameters. [sent-98, score-0.245]
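This interpolation translates almost directly into code. The sketch below assumes the class models and the KN model are given as callables and B1/B2 as sets, which is an implementation choice for illustration, not something specified in the text.

```python
# Sketch of the top-level interpolation pTOP defined above.
# pb_bi/pb_uni: bigram- and unigram-history class models; pkn: KN trigram
# model; B1/B2: sets of clustered unigrams and bigrams; lambda1/lambda2:
# the two tuned parameters.

def p_top(w3, w1, w2, pkn, pb_uni, pb_bi, B1, B2, lambda1, lambda2):
    mu1 = lambda1 if (w1, w2) in B2 else 0.0
    mu2 = lambda2 if w2 in B1 else 0.0
    return (mu1 * pb_bi(w3, w1, w2)
            + mu2 * pb_uni(w3, w2)
            + (1.0 - mu1 - mu2) * pkn(w3, w1, w2))
```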
29 The training set contains 256,873 unique unigrams and 4,494,222 unique bigrams. [sent-102, score-0.212]
30 We therefore represent a bigram as a hyphenated word in bigram clustering, e.g. [sent-113, score-0.186]
31 The input to the clustering is the vocabulary Bi and the cluster training corpus. [sent-116, score-0.162]
32 For a particular base set size b, the unigram input vocabulary B1 is set to the b most frequent unigrams in the training set and the bigram input vocabulary B2 is set to the b most frequent bigrams in the training set. [sent-117, score-0.462]
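A small sketch of this base-set construction, assuming the training set is available as a flat token list (the data handling is an assumption; the frequency-based selection itself is what the text describes).

```python
from collections import Counter

def base_sets(tokens, b):
    """B1: the b most frequent unigrams; B2: the b most frequent bigrams."""
    uni_counts = Counter(tokens)
    bi_counts = Counter(zip(tokens, tokens[1:]))
    B1 = {w for w, _ in uni_counts.most_common(b)}
    B2 = {bg for bg, _ in bi_counts.most_common(b)}
    return B1, B2
```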
33 In this section, we call the WSJ training corpus the raw corpus and the cluster training corpus the cluster corpus to be able to distinguish them. [sent-118, score-0.284]
34 The unigram cluster corpus is the set of all sequences of two unigrams ∈ B1 that occur in the raw corpus, one sequence per line. [sent-134, score-0.15]
35 The bigram cluster corpus is the set of all sequences of two bigrams ∈ B2 that occur in the training corpus, one sequence per line. [sent-137, score-0.183]
36 As mentioned above, we need both unigram and bigram clusters because we want to incorporate class-based generalization for histories of lengths 1 and 2. [sent-139, score-0.348]
37 Since the focus of this paper is not on clustering algorithms, reformatting the training corpus as described above (as a sequence of hyphenated bigrams) is a simple way of using SRILM for bigram clustering. [sent-141, score-0.279]
38 The unique-event clusterings are motivated by the fact that in the Dupont-Rosenfeld model, frequent events are handled by discounted ML estimates. [sent-142, score-0.39]
39 Consequently, we should form clusters not based on all events in the training corpus, but only on events that are rare because this is the type of event that classes will then be applied to in prediction. [sent-144, score-0.876]
40 In practice this means that clustering is mostly influenced by rare events since, on the level of types, most events are rare. [sent-146, score-0.759]
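A sketch of how such cluster corpora might be written out follows. The file layout, the hyphenation of bigrams, and the unique-event deduplication follow the description above, but how exactly the paper pairs bigrams into "sequences of two bigrams" is not recoverable from this extract, so the non-overlapping pairing below is an assumption.

```python
def write_cluster_corpora(tokens, B1, B2, uni_path, bi_path, unique=True):
    # Unigram events: adjacent pairs of unigrams from B1.
    uni_events = [(w1, w2) for w1, w2 in zip(tokens, tokens[1:])
                  if w1 in B1 and w2 in B1]
    # Bigram events: adjacent, non-overlapping pairs of bigrams from B2
    # (assumed pairing), each bigram written as a single hyphenated word so
    # that a standard word-clustering tool such as SRILM can be reused.
    bi_events = [(tokens[i], tokens[i + 1], tokens[i + 2], tokens[i + 3])
                 for i in range(len(tokens) - 3)
                 if (tokens[i], tokens[i + 1]) in B2
                 and (tokens[i + 2], tokens[i + 3]) in B2]
    if unique:
        # Unique-event variant: each event type is written exactly once, so
        # the clustering objective is dominated by rare events.
        uni_events = sorted(set(uni_events))
        bi_events = sorted(set(bi_events))
    with open(uni_path, "w", encoding="utf-8") as f:
        for w1, w2 in uni_events:
            f.write(f"{w1} {w2}\n")
    with open(bi_path, "w", encoding="utf-8") as f:
        for a, b, c, d in bi_events:
            f.write(f"{a}-{b} {c}-{d}\n")
```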
41 This is not surprising as the class-based component of the model can only benefit rare events and it is therefore reasonable to estimate this component based on a corpus dominated by rare events. [sent-148, score-0.628]
42 We started experimenting with reweighted corpora because class sizes become very lopsided in regular SRILM clustering as the size of the base set increases. [sent-149, score-0.284]
43 Highly differentiated classes for frequent words contribute substantially to this objective function whereas putting all rare words in a few large clusters does not hurt the objective much. [sent-151, score-0.407]
44 However, our focus is on using clustering for improving prediction for rare events; this means that the objective function is counterproductive when contexts are frequency-weighted as they occur in the corpus. [sent-152, score-0.301]
45 After overweighting rare contexts, the objective function is more in sync with what we use clusters for in our model. [sent-153, score-0.226]
46 512 unigram classes and 512 bigram classes roughly correspond to this number. [sent-160, score-0.452]
47 We prefer powers of 2 to facilitate efficient storage of cluster ids (one such cluster id must be stored for each unigram and each bigram) and therefore choose k = 512. [sent-161, score-0.295]
48 To estimate n-gram emission probabilities pE, we first introduce an additional cluster for all unigrams that are not in the base set; emission probabilities are then estimated by maximum likelihood. [sent-163, score-0.318]
49 The two class distributions are then defined as follows: pB(w3|w1w2) = pT(g(w3)|g(w1w2)) pE(w3|g(w3)) and pB(w3|w2) = pT(g(w3)|g(w2)) pE(w3|g(w3)), where g(v) is the class of the uni- or bigram v. [sent-166, score-0.249]
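The two-factor class distribution above can be sketched as follows; representing pT and pE as nested dictionaries of ML estimates and g as a dictionary from uni-/bigrams to cluster ids is an assumption made only for illustration.

```python
def p_b(w3, history, g, pT, pE):
    """pT[h][c]: probability of class c following history class h;
    pE[c][w]: probability of word w given its class c; g: cluster map;
    history is either a unigram w2 or a bigram (w1, w2)."""
    h_class = g[history]   # class of the uni- or bigram history
    w_class = g[w3]        # class of the predicted word
    return pT[h_class].get(w_class, 0.0) * pE[w_class].get(w3, 0.0)
```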
50 Table 3: Optimal parameters for the Dupont-Rosenfeld (pDR, left) and top-level (pTOP, right) models on the validation set and perplexity on the validation set. [sent-167, score-0.195]
51 The two tables compare performance when using a class model trained on all events vs a class model trained on unique events. [sent-168, score-0.48]
52 |B1| = |B2| is the number of unigrams and bigrams in the clusters, e.g. [sent-169, score-0.146]
53 5 Results Table 3 shows the performance of pDR and pTOP for a range of base set sizes |Bi| and for classes trained on all events and on unique events. [sent-174, score-0.247]
54 All following tables also optimize on the validation set and report results on the validation set. [sent-177, score-0.156]
55 Table 3 confirms previous findings that classes improve language model performance. [sent-179, score-0.173]
56 All models have a perplexity that is lower than KN (88. [sent-180, score-0.151]
57 This indicates that classes and history interpolation are both valuable when the model is backing off. [sent-195, score-0.156]
58 This again is evidence that rare-event clustering is the correct approach: only clusters derived in rare-event clustering receive high weights αi in the interpolation. [sent-201, score-0.336]
59 This effect can also be observed for pTOP: the value of λ1 (the weight of bigrams) is higher for unique-event clustering than for all-event clustering (with the exception of lines 1b&2b). [sent-202, score-0.274]
60 The quality of bigram clusters seems to be low in all-event clustering when the base set becomes too large. [sent-203, score-0.345]
61 Table 4 compares the two models in two different conditions: (i) b-: using unigram clusters only and (ii) b+: using unigram clusters and bigram clusters. [sent-205, score-0.426]
62 However, for unique events, the model that includes bigrams (b+) does better than the model without bigrams (b-). [sent-207, score-0.275]
63 The effect is larger for pDR than for pTOP because (for unique events) a larger weight for the unigram model (λ2 = . [sent-208, score-0.222]
64 Given that training large class models with SRILM on all events would take several weeks or even months, we restrict our direct comparison. [sent-212, score-0.354]
65 For both models, perplexity steadily decreases as |Bi| is increased from 60,000 to 400,000. [sent-238, score-0.156]
66 The improvements in perplexity become smaller for larger base set sizes, but it is reassuring to see that the general trend continues for large base set sizes. [sent-240, score-0.24]
67 Our explanation is that the class component is focused on rare events and the items that are being added to the clustering for large base sets are all rare events. [sent-241, score-0.808]
68 Dupont and Rosenfeld (1997) found a relatively large improvement of the “global” linear interpolation model – pTOP in our terminology – compared to the baseline, whereas pTOP performs less well in our experiments. [sent-243, score-0.807]
69 6 Polynomial discounting Further comparative analysis of pDR and pTOP revealed that pDR is not uniformly better than pTOP. [sent-245, score-0.166]
70 For example, for the history w1w2 = cents a, the continuation w3 = share dominates. [sent-247, score-0.182]
71 pDR deals well with this situation because pDR(w3|w1w2) is the discounted ML estimate, with a discount that is small relative to the 10,768 occurrences of cents a share in the training set. [sent-248, score-0.192]
72 In the pTOP model on the last line in Table 5, the discounted ML estimate is multiplied by 1−. [sent-249, score-0.178]
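A purely illustrative calculation of this effect (the actual interpolation weights and probabilities are not given in this extract; the numbers below are assumptions):

```python
# Hypothetical numbers only.
p_ml_discounted = 0.90   # discounted ML estimate of a dominant continuation
lam1, lam2 = 0.15, 0.05  # assumed top-level weights lambda1, lambda2
p_class = 0.10           # assumed class-model probability of the continuation

p_top = (1 - lam1 - lam2) * p_ml_discounted + lam1 * p_class + lam2 * p_class
print(p_top)  # ≈ 0.74: the reliable discounted ML estimate is pulled down,
              # whereas pDR keeps it near 0.90 because its class component
              # only receives the small KN backoff mass for this history.
```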
73 Because their training set unigram frequency is at least 10, they have a good chance of being assigned to a class that captures their distributional behavior well and pB (w3 |w1w2) is then likely to be a good estimate. [sent-259, score-0.171]
74 For a history with these properties, it is advantageous to further discount the discounted ML estimates by multiplying them with . [sent-260, score-0.272]
75 However, it looks like the KN discounts are not large enough for productive histories, at least not in a combined history-length/class model. [sent-265, score-0.157]
76 Apparently, when incorporating the strengths of a class-based model into KN, the default discounting mechanism does not reallocate enough probability mass from high-frequency to low-frequency events. [sent-266, score-0.306]
77 The incorporation of the additional polynomial discount into KN is straightforward. [sent-273, score-0.249]
78 pPOLKN directly implements the insight that, when using class-based generalization, discounts for counts x ≥ 4 should be larger than they are in KN. [sent-277, score-0.153]
79 It allows us to determine whether a polynomial discount by itself (without using KN discounts in addition) is sufficient. [sent-279, score-0.38]
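One way to sketch the two variants is shown below; the parameterization d(x) = D·x**r and the way it combines with the KN discount are assumptions for illustration, since the exact formula is not recoverable from this extract.

```python
def poly_discount(count, D=1.0, r=0.5):
    # The discount grows with the count; r = 0 reduces to absolute
    # discounting and r = 1 to linear discounting (see below), while the
    # paper argues for exponents between 0 and 1.
    return D * count ** r

def discounted_count(count, D=1.0, r=0.5, kn_discount=0.0):
    # pPOL-style: polynomial discount only (kn_discount = 0);
    # pPOLKN-style: additionally subtract the usual KN discount.
    return max(count - kn_discount - poly_discount(count, D, r), 0.0)
```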
80 Results for the two models are shown in Table 6 and compared with the two best models from Table 5, for |Bi | = 400,000, classes trained on unique events. [sent-280, score-0.272]
81 This shows that using discounts that are larger than KN discounts for large counts is potentially advantageous. [sent-282, score-0.284]
82 The linear interpolation αp + (1−α)q of two distributions p and q is a form of linear discounting: p is discounted by 1 − α and q by α. [sent-296, score-0.204]
83 It can thus be viewed as polynomial discounting for r = 1. [sent-299, score-0.148]
84 Absolute discounting could be viewed as a form of polynomial discounting for r = 0. [sent-300, score-0.465]
85 We know of no other work that has explored exponents between 0 and 1 and shown that for this type of exponent, one obtains competitive discounts that could be argued to be simpler than more complex discounts like KN discounts. [sent-301, score-0.285]
86 We can then compute perplexity for each bin, compare perplexities for different experiments and use the sign test for determining significance. [sent-314, score-0.148]
87 Here 3 <∗ 2 means that test set perplexity on line 3 is significantly lower than test set perplexity on line 2. [sent-318, score-0.184]
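A sketch of the binned sign test described here; how the bins are constructed is not recoverable from this extract, so only the compare-per-bin-and-count-wins logic is shown.

```python
from math import comb

def sign_test_p(wins_a, wins_b):
    # Two-sided sign test: under the null hypothesis each bin is equally
    # likely to favor either system (ties are excluded before calling this).
    n, k = wins_a + wins_b, min(wins_a, wins_b)
    p = 2.0 * sum(comb(n, i) for i in range(k + 1)) / 2.0 ** n
    return min(1.0, p)

def compare_binned_perplexities(ppl_a, ppl_b):
    # ppl_a, ppl_b: per-bin perplexities of two experiments on the same bins.
    wins_a = sum(a < b for a, b in zip(ppl_a, ppl_b))
    wins_b = sum(b < a for a, b in zip(ppl_a, ppl_b))
    return sign_test_p(wins_a, wins_b)
```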
88 The main findings on the validation set also hold for the test set: (i) Trained on unique events and with a sufficiently large |Bi|, both pDR and pTOP are better than KN: 10 <∗ 1, 11 <∗ 1. [sent-319, score-0.364]
89 (ii) Training on unique events is better than training on all events: 3 <∗ 2, 5 <∗ 4, 7<∗ 6, 9 <∗ 8. [sent-320, score-0.311]
90 (iii) For unique events, using bigram and unigram classes gives better results than using unigram classes only: 3<∗ 7. [sent-321, score-0.594]
91 (vi) Polynomial discounting is significantly better than KN discounting for the Dupont-Rosenfeld model pDR although the absolute difference in perplexity is small: 13<∗10. [sent-325, score-0.48]
92 The main result of the experiments is that Dupont-Rosenfeld models (which focus on rare events) are better than the standardly used top-level models; and that training classes on unique events is better than training classes on all events. [sent-330, score-0.813]
93 (ii) The pDR model (which adjusts the interpolation weight given to classes based on the prevalence of nonfrequent events following) is better than top-level model pTOP (which uses a fixed weight for classes). [sent-337, score-0.571]
94 A comparison of Dupont-Rosenfeld and top-level results suggested that the KN discount mechanism does not discount high-frequency events enough. [sent-340, score-0.503]
95 We empirically determined that better discounts are obtained by letting the discount grow as a function of the count of the discounted event and implemented this as polynomial discounting, an arguably simpler way of discounting than Kneser-Ney discounting. [sent-341, score-0.67]
96 In future work, we would like to find a theoretical justification for the surprising fact that polynomial discounting does at least as well as Kneser-Ney discounting. [sent-344, score-0.299]
97 Finally, training classes on unique events is an extreme way of highly weighting rare events. [sent-348, score-0.612]
98 We would like to explore training regimes that lie between unique-event clustering and all-event clustering and upweight rare events less. [sent-349, score-0.692]
99 Distributed word clustering for large scale class-based language modeling in machine translation. [sent-461, score-0.159]
100 Hierarchical linear discounting class n-gram language models: A multilevel class hierarchy approach. [sent-491, score-0.288]
wordName wordTfidf (topN-words)
[('pdr', 0.494), ('ptop', 0.319), ('kn', 0.318), ('events', 0.229), ('pkn', 0.203), ('discounting', 0.166), ('rare', 0.164), ('pb', 0.153), ('classes', 0.137), ('clustering', 0.137), ('bi', 0.134), ('polynomial', 0.133), ('interpolation', 0.133), ('discounts', 0.131), ('discount', 0.116), ('perplexity', 0.112), ('pml', 0.107), ('cluster', 0.105), ('dupontrosenfeld', 0.099), ('bigram', 0.093), ('unigram', 0.085), ('validation', 0.078), ('unigrams', 0.073), ('bigrams', 0.073), ('cents', 0.072), ('discounted', 0.071), ('interpolate', 0.067), ('dupont', 0.066), ('zitouni', 0.066), ('history', 0.063), ('clusters', 0.062), ('backoff', 0.062), ('class', 0.061), ('histories', 0.06), ('unique', 0.057), ('ml', 0.053), ('base', 0.053), ('interpolated', 0.052), ('ppolkn', 0.049), ('generalization', 0.048), ('clusterings', 0.046), ('frequent', 0.044), ('pe', 0.044), ('rosenfeld', 0.042), ('mechanism', 0.042), ('classbased', 0.041), ('models', 0.039), ('jelinek', 0.039), ('icassp', 0.039), ('line', 0.036), ('ii', 0.036), ('model', 0.036), ('perplexities', 0.036), ('estimate', 0.035), ('srilm', 0.035), ('speech', 0.035), ('distributions', 0.034), ('deligne', 0.033), ('justo', 0.033), ('mcmahon', 0.033), ('qiru', 0.033), ('reweighted', 0.033), ('suhm', 0.033), ('yokoyama', 0.033), ('novelty', 0.033), ('xr', 0.033), ('brown', 0.032), ('hi', 0.031), ('pt', 0.03), ('event', 0.03), ('mall', 0.029), ('waibel', 0.029), ('imed', 0.029), ('kuo', 0.029), ('momtazi', 0.029), ('frederick', 0.028), ('stanley', 0.027), ('wiegand', 0.027), ('productive', 0.026), ('emission', 0.026), ('emami', 0.025), ('dietrich', 0.025), ('training', 0.025), ('sequence', 0.024), ('raw', 0.024), ('klakow', 0.024), ('share', 0.024), ('reichart', 0.023), ('backing', 0.023), ('uszkoreit', 0.023), ('continuation', 0.023), ('simpler', 0.023), ('larger', 0.022), ('whittaker', 0.022), ('estimates', 0.022), ('modeling', 0.022), ('chen', 0.022), ('bilmes', 0.021), ('sch', 0.021), ('probability', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000008 175 acl-2011-Integrating history-length interpolation and classes in language modeling
Author: Hinrich Schutze
Abstract: Building on earlier work that integrates different factors in language modeling, we view (i) backing off to a shorter history and (ii) class-based generalization as two complementary mechanisms of using a larger equivalence class for prediction when the default equivalence class is too small for reliable estimation. This view entails that the classes in a language model should be learned from rare events only and should be preferably applied to rare events. We construct such a model and show that both training on rare events and preferable application to rare events improve perplexity when compared to a simple direct interpolation of class-based with standard language models.
2 0.25478411 163 acl-2011-Improved Modeling of Out-Of-Vocabulary Words Using Morphological Classes
Author: Thomas Mueller ; Hinrich Schuetze
Abstract: We present a class-based language model that clusters rare words of similar morphology together. The model improves the prediction of words after histories containing outof-vocabulary words. The morphological features used are obtained without the use of labeled data. The perplexity improvement compared to a state of the art Kneser-Ney model is 4% overall and 81% on unknown histories.
3 0.22246392 38 acl-2011-An Empirical Investigation of Discounting in Cross-Domain Language Models
Author: Greg Durrett ; Dan Klein
Abstract: We investigate the empirical behavior of ngram discounts within and across domains. When a language model is trained and evaluated on two corpora from exactly the same domain, discounts are roughly constant, matching the assumptions of modified Kneser-Ney LMs. However, when training and test corpora diverge, the empirical discount grows essentially as a linear function of the n-gram count. We adapt a Kneser-Ney language model to incorporate such growing discounts, resulting in perplexity improvements over modified Kneser-Ney and Jelinek-Mercer baselines.
4 0.15270455 142 acl-2011-Generalized Interpolation in Decision Tree LM
Author: Denis Filimonov ; Mary Harper
Abstract: In the face of sparsity, statistical models are often interpolated with lower order (backoff) models, particularly in Language Modeling. In this paper, we argue that there is a relation between the higher order and the backoff model that must be satisfied in order for the interpolation to be effective. We show that in n-gram models, the relation is trivially held, but in models that allow arbitrary clustering of context (such as decision tree models), this relation is generally not satisfied. Based on this insight, we also propose a generalization of linear interpolation which significantly improves the performance of a decision tree language model.
5 0.14363645 24 acl-2011-A Scalable Probabilistic Classifier for Language Modeling
Author: Joel Lang
Abstract: We present a novel probabilistic classifier, which scales well to problems that involve a large number ofclasses and require training on large datasets. A prominent example of such a problem is language modeling. Our classifier is based on the assumption that each feature is associated with a predictive strength, which quantifies how well the feature can predict the class by itself. The predictions of individual features can then be combined according to their predictive strength, resulting in a model, whose parameters can be reliably and efficiently estimated. We show that a generative language model based on our classifier consistently matches modified Kneser-Ney smoothing and can outperform it if sufficiently rich features are incorporated.
6 0.11562113 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling
7 0.094851457 122 acl-2011-Event Extraction as Dependency Parsing
8 0.090634905 29 acl-2011-A Word-Class Approach to Labeling PSCFG Rules for Machine Translation
9 0.0834025 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction
10 0.079504982 293 acl-2011-Template-Based Information Extraction without the Templates
11 0.077717252 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents
12 0.074782342 39 acl-2011-An Ensemble Model that Combines Syntactic and Semantic Clustering for Discriminative Dependency Parsing
13 0.074326448 65 acl-2011-Can Document Selection Help Semi-supervised Learning? A Case Study On Event Extraction
14 0.071564719 328 acl-2011-Using Cross-Entity Inference to Improve Event Extraction
15 0.071363494 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering
16 0.069749266 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing
17 0.068567805 320 acl-2011-Unsupervised Discovery of Domain-Specific Knowledge from Text
18 0.066415094 18 acl-2011-A Latent Topic Extracting Method based on Events in a Document and its Application
19 0.065085016 277 acl-2011-Semi-supervised Relation Extraction with Large-scale Word Clustering
20 0.054476969 286 acl-2011-Social Network Extraction from Texts: A Thesis Proposal
topicId topicWeight
[(0, 0.152), (1, 0.016), (2, -0.068), (3, 0.002), (4, 0.036), (5, 0.013), (6, 0.006), (7, 0.018), (8, 0.05), (9, 0.099), (10, 0.016), (11, -0.017), (12, 0.028), (13, 0.152), (14, 0.019), (15, -0.045), (16, -0.163), (17, 0.107), (18, 0.045), (19, -0.032), (20, 0.116), (21, -0.144), (22, 0.062), (23, -0.171), (24, 0.004), (25, -0.105), (26, 0.125), (27, -0.016), (28, -0.043), (29, 0.009), (30, -0.012), (31, -0.211), (32, -0.129), (33, -0.081), (34, 0.151), (35, -0.045), (36, -0.015), (37, -0.038), (38, 0.044), (39, -0.041), (40, -0.004), (41, 0.001), (42, 0.052), (43, 0.034), (44, -0.055), (45, 0.041), (46, 0.05), (47, 0.092), (48, -0.022), (49, 0.096)]
simIndex simValue paperId paperTitle
same-paper 1 0.94600809 175 acl-2011-Integrating history-length interpolation and classes in language modeling
Author: Hinrich Schutze
Abstract: Building on earlier work that integrates different factors in language modeling, we view (i) backing off to a shorter history and (ii) class-based generalization as two complementary mechanisms of using a larger equivalence class for prediction when the default equivalence class is too small for reliable estimation. This view entails that the classes in a language model should be learned from rare events only and should be preferably applied to rare events. We construct such a model and show that both training on rare events and preferable application to rare events improve perplexity when compared to a simple direct interpolation of class-based with standard language models.
2 0.81369448 38 acl-2011-An Empirical Investigation of Discounting in Cross-Domain Language Models
Author: Greg Durrett ; Dan Klein
Abstract: We investigate the empirical behavior of ngram discounts within and across domains. When a language model is trained and evaluated on two corpora from exactly the same domain, discounts are roughly constant, matching the assumptions of modified Kneser-Ney LMs. However, when training and test corpora diverge, the empirical discount grows essentially as a linear function of the n-gram count. We adapt a Kneser-Ney language model to incorporate such growing discounts, resulting in perplexity improvements over modified Kneser-Ney and Jelinek-Mercer baselines.
3 0.80925256 142 acl-2011-Generalized Interpolation in Decision Tree LM
Author: Denis Filimonov ; Mary Harper
Abstract: In the face of sparsity, statistical models are often interpolated with lower order (backoff) models, particularly in Language Modeling. In this paper, we argue that there is a relation between the higher order and the backoff model that must be satisfied in order for the interpolation to be effective. We show that in n-gram models, the relation is trivially held, but in models that allow arbitrary clustering of context (such as decision tree models), this relation is generally not satisfied. Based on this insight, we also propose a generalization of linear interpolation which significantly improves the performance of a decision tree language model.
4 0.71163011 163 acl-2011-Improved Modeling of Out-Of-Vocabulary Words Using Morphological Classes
Author: Thomas Mueller ; Hinrich Schuetze
Abstract: We present a class-based language model that clusters rare words of similar morphology together. The model improves the prediction of words after histories containing outof-vocabulary words. The morphological features used are obtained without the use of labeled data. The perplexity improvement compared to a state of the art Kneser-Ney model is 4% overall and 81% on unknown histories.
5 0.68815714 24 acl-2011-A Scalable Probabilistic Classifier for Language Modeling
Author: Joel Lang
Abstract: We present a novel probabilistic classifier, which scales well to problems that involve a large number ofclasses and require training on large datasets. A prominent example of such a problem is language modeling. Our classifier is based on the assumption that each feature is associated with a predictive strength, which quantifies how well the feature can predict the class by itself. The predictions of individual features can then be combined according to their predictive strength, resulting in a model, whose parameters can be reliably and efficiently estimated. We show that a generative language model based on our classifier consistently matches modified Kneser-Ney smoothing and can outperform it if sufficiently rich features are incorporated.
6 0.49233896 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction
7 0.49120799 319 acl-2011-Unsupervised Decomposition of a Document into Authorial Components
8 0.46797091 203 acl-2011-Learning Sub-Word Units for Open Vocabulary Speech Recognition
9 0.46254483 320 acl-2011-Unsupervised Discovery of Domain-Specific Knowledge from Text
10 0.44213927 301 acl-2011-The impact of language models and loss functions on repair disfluency detection
11 0.43644086 17 acl-2011-A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation
12 0.39553815 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling
14 0.39092362 82 acl-2011-Content Models with Attitude
15 0.38669214 233 acl-2011-On-line Language Model Biasing for Statistical Machine Translation
16 0.37854812 210 acl-2011-Lexicographic Semirings for Exact Automata Encoding of Sequence Models
17 0.36090082 293 acl-2011-Template-Based Information Extraction without the Templates
18 0.36035171 102 acl-2011-Does Size Matter - How Much Data is Required to Train a REG Algorithm?
19 0.35827062 29 acl-2011-A Word-Class Approach to Labeling PSCFG Rules for Machine Translation
20 0.34732795 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing
topicId topicWeight
[(5, 0.025), (17, 0.072), (26, 0.032), (31, 0.011), (37, 0.093), (39, 0.055), (41, 0.063), (55, 0.083), (59, 0.021), (63, 0.013), (72, 0.065), (76, 0.188), (91, 0.036), (96, 0.158), (97, 0.011)]
simIndex simValue paperId paperTitle
1 0.88530904 236 acl-2011-Optimistic Backtracking - A Backtracking Overlay for Deterministic Incremental Parsing
Author: Gisle Ytrestl
Abstract: This paper describes a backtracking strategy for an incremental deterministic transitionbased parser for HPSG. The method could theoretically be implemented on any other transition-based parser with some adjustments. In this paper, the algorithm is evaluated on CuteForce, an efficient deterministic shiftreduce HPSG parser. The backtracking strategy may serve to improve existing parsers, or to assess if a deterministic parser would benefit from backtracking as a strategy to improve parsing.
same-paper 2 0.82935464 175 acl-2011-Integrating history-length interpolation and classes in language modeling
Author: Hinrich Schutze
Abstract: Building on earlier work that integrates different factors in language modeling, we view (i) backing off to a shorter history and (ii) class-based generalization as two complementary mechanisms of using a larger equivalence class for prediction when the default equivalence class is too small for reliable estimation. This view entails that the classes in a language model should be learned from rare events only and should be preferably applied to rare events. We construct such a model and show that both training on rare events and preferable application to rare events improve perplexity when compared to a simple direct interpolation of class-based with standard language models.
3 0.82356894 255 acl-2011-Query Snowball: A Co-occurrence-based Approach to Multi-document Summarization for Question Answering
Author: Hajime Morita ; Tetsuya Sakai ; Manabu Okumura
Abstract: We propose a new method for query-oriented extractive multi-document summarization. To enrich the information need representation of a given query, we build a co-occurrence graph to obtain words that augment the original query terms. We then formulate the summarization problem as a Maximum Coverage Problem with Knapsack Constraints based on word pairs rather than single words. Our experiments with the NTCIR ACLIA question answering test collections show that our method achieves a pyramid F3-score of up to 0.3 13, a 36% improvement over a baseline using Maximal Marginal Relevance. 1
4 0.76989144 169 acl-2011-Improving Question Recommendation by Exploiting Information Need
Author: Shuguang Li ; Suresh Manandhar
Abstract: In this paper we address the problem of question recommendation from large archives of community question answering data by exploiting the users’ information needs. Our experimental results indicate that questions based on the same or similar information need can provide excellent question recommendation. We show that translation model can be effectively utilized to predict the information need given only the user’s query question. Experiments show that the proposed information need prediction approach can improve the performance of question recommendation.
5 0.76074761 32 acl-2011-Algorithm Selection and Model Adaptation for ESL Correction Tasks
Author: Alla Rozovskaya ; Dan Roth
Abstract: We consider the problem of correcting errors made by English as a Second Language (ESL) writers and address two issues that are essential to making progress in ESL error correction - algorithm selection and model adaptation to the first language of the ESL learner. A variety of learning algorithms have been applied to correct ESL mistakes, but often comparisons were made between incomparable data sets. We conduct an extensive, fair comparison of four popular learning methods for the task, reversing conclusions from earlier evaluations. Our results hold for different training sets, genres, and feature sets. A second key issue in ESL error correction is the adaptation of a model to the first language ofthe writer. Errors made by non-native speakers exhibit certain regularities and, as we show, models perform much better when they use knowledge about error patterns of the nonnative writers. We propose a novel way to adapt a learned algorithm to the first language of the writer that is both cheaper to implement and performs better than other adaptation methods.
6 0.75664097 38 acl-2011-An Empirical Investigation of Discounting in Cross-Domain Language Models
7 0.75607133 237 acl-2011-Ordering Prenominal Modifiers with a Reranking Approach
8 0.75117826 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning
9 0.74816418 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models
10 0.74548382 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition
11 0.74343556 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora
12 0.74273634 44 acl-2011-An exponential translation model for target language morphology
13 0.7422744 133 acl-2011-Extracting Social Power Relationships from Natural Language
15 0.73977941 144 acl-2011-Global Learning of Typed Entailment Rules
16 0.73872948 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction
17 0.7384541 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment
18 0.73797488 311 acl-2011-Translationese and Its Dialects
19 0.73784882 137 acl-2011-Fine-Grained Class Label Markup of Search Queries
20 0.73768431 141 acl-2011-Gappy Phrasal Alignment By Agreement