emnlp emnlp2011 emnlp2011-131 knowledge-graph by maker-knowledge-mining

131 emnlp-2011-Syntactic Decision Tree LMs: Random Selection or Intelligent Design?


Source: pdf

Author: Denis Filimonov ; Mary Harper

Abstract: Decision trees have been applied to a variety of NLP tasks, including language modeling, for their ability to handle a variety of attributes and sparse context space. Moreover, forests (collections of decision trees) have been shown to substantially outperform individual decision trees. In this work, we investigate methods for combining trees in a forest, as well as methods for diversifying trees for the task of syntactic language modeling. We show that our tree interpolation technique outperforms the standard method used in the literature, and that, on this particular task, restricting tree contexts in a principled way produces smaller and better forests, with the best achieving an 8% relative reduction in Word Error Rate over an n-gram baseline.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: Decision trees have been applied to a variety of NLP tasks, including language modeling, for their ability to handle a variety of attributes and sparse context space. [sent-4, score-0.371]

2 Moreover, forests (collections of decision trees) have been shown to substantially outperform individual decision trees. [sent-5, score-0.667]

3 In this work, we investigate methods for combining trees in a forest, as well as methods for diversifying trees for the task of syntactic language modeling. [sent-6, score-0.496]

4 In Section 2, we describe the details of the syntactic decision tree LM. [sent-46, score-0.431]

5 Construction of a single-tree model is difficult due to the inevitable greediness of the tree construction process and its tendency to overfit the data. [sent-47, score-0.196]

6 This problem is often addressed by interpolating with lower order decision trees. [sent-48, score-0.299]

7 In Section 3, we point out the inappropriateness of backoff methods borrowed from n-gram models for decision tree LMs and briefly describe a generalized interpolation for such models. [sent-49, score-1.061]

8 , ΦN}, A decision tree provides us with a clustering func- Φ(wii−−1n+1tii−−1n+1) {Φ1, tion → where N is the number ofclusters, and clusters Φk are disjoint subsets of the context space. [sent-58, score-0.477]

9 1 Decision Tree Construction: We use recursive partitioning to grow decision trees. [sent-62, score-0.305]

10 Binary splits are often referred to as questions about the context because a binary partition can be represented by a binary function that decides whether an element of context space belongs to one partition or the other. [sent-65, score-0.168]
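To make the recursive partitioning concrete, here is a minimal sketch (Python, not from the paper; candidate_questions, split_gain, and the stopping threshold are illustrative stand-ins for the authors' likelihood-based criterion): each node greedily picks the binary question with the highest gain and recurses on the two resulting partitions.

    # Minimal sketch of greedy recursive partitioning with binary questions.
    # `events` is a list of (context, target_word) training events; a question is
    # any boolean predicate over contexts. The gain function is a stand-in for the
    # likelihood-based splitting criterion a real decision tree LM would use.
    def grow_tree(events, candidate_questions, split_gain, min_gain=1e-4):
        best_q, best_gain = None, min_gain
        for q in candidate_questions:
            left = [e for e in events if q(e[0])]
            right = [e for e in events if not q(e[0])]
            if not left or not right:
                continue
            gain = split_gain(events, left, right)
            if gain > best_gain:
                best_q, best_gain = q, gain
        if best_q is None:                      # stop: no question helps enough
            return {"leaf": True, "events": events}
        left = [e for e in events if best_q(e[0])]
        right = [e for e in events if not best_q(e[0])]
        return {"leaf": False,
                "question": best_q,
                "yes": grow_tree(left, candidate_questions, split_gain, min_gain),
                "no": grow_tree(right, candidate_questions, split_gain, min_gain)}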

11 We utilize univariate questions where each question partitions the context on one attribute, e. [sent-66, score-0.174]

12 The questions about words and tags are constructed differently: • The questions q about the words are in the form q(x) ≡ w_{i+x} ∈ S, where x is an integer between −n + 1 and −1, and S ⊂ V is a subset of the word vocabulary V. [sent-69, score-0.279]

13 Because the algorithm is greedy and depends on the initialization, we construct 4 questions per word position using different random initializations of the Exchange algorithm. [sent-72, score-0.24]

14 To estimate the probability at the backoff node (B in Figure 1), we can either use the probability from its grandparent node A or estimate it using a lower order tree (see Section 3), or combine the two. [sent-74, score-0.635]

15 We have observed no noticeable difference between these methods, which suggests that only a small fraction of probability is estimated from these nodes; therefore, for simplicity, we use the probability estimated at the backoff node’s grandparent. [sent-75, score-0.383]
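A small sketch of the fallback just described, assuming each tree node exposes is_backoff, parent, and a smoothed distribution (all hypothetical names): contexts routed to a backoff node reuse the estimate from that node's grandparent.

    def prob_at_node(node, word):
        # Backoff nodes cover contexts unseen in training; following the choice
        # described above, answer with the grandparent's smoothed estimate.
        if node.is_backoff:
            return node.parent.parent.distribution.get(word, 0.0)
        return node.distribution.get(word, 0.0)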

16 • To create questions about tags we create a hierarchical clustering of all tags in the form of a binary tree. [sent-76, score-0.223]

17 In this tree, each leaf is an individual tag and each internal node is associated with the subset of tags that the node dominates. [sent-78, score-0.19]

18 Questions about tags are constructed in the form q(x, k) ≡ t_{i+x} ∈ T_k, where k is a node in the tag tree and T_k is the subset of tags associated with that node. [sent-79, score-0.16]
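The two question types above can be pictured as simple membership predicates. The sketch below is illustrative Python, not the authors' code; the context layout (a pair of word and tag tuples, with position -1 the previous word or tag) and the class names are assumptions.

    class WordQuestion:
        """q(x) = [ w_{i+x} in S ], with S a subset of the vocabulary."""
        def __init__(self, x, word_set):
            self.x, self.S = x, frozenset(word_set)
        def __call__(self, context):
            words, _tags = context
            return words[self.x] in self.S

    class TagQuestion:
        """q(x, k) = [ t_{i+x} in T_k ], where T_k is the set of tags dominated
        by node k of the binary tag-clustering tree."""
        def __init__(self, x, tag_subset):
            self.x, self.Tk = x, frozenset(tag_subset)
        def __call__(self, context):
            _words, tags = context
            return tags[self.x] in self.Tk

    # Example: does the previous word belong to {"on", "in"}?
    q = WordQuestion(x=-1, word_set={"on", "in"})
    print(q((("june", "on"), ("NNP", "IN"))))   # True: words[-1] == "on"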

19 Then, among the questions about attribute ... (Figure 1 caption: A fragment of the decision tree with a backoff node.) [sent-85, score-0.782]

20 To account for unseen words, we add the backoff node B. [sent-88, score-0.326]

21 Note that the tree induction algorithm can also be used to construct trees without tags: p(w_i | w_{i−n+1}^{i−1}) ≈ p(w_i | Φ(w_{i−n+1}^{i−1})). We refer to this model as the word-tree model. [sent-91, score-0.414]

22 By comparing both the syntactic and word-tree models to an n-gram baseline, we are able to separate the effects of decision tree modeling and syntactic information on language modeling. [sent-92, score-0.54]

23 2 In-tree Smoothing: A decision tree offers a hierarchy of clusterings that can be exploited for smoothing. [sent-94, score-0.403]

24 , p(w_i t_i | Φ(w_{i−n+1}^{i−1} t_{i−n+1}^{i−1})), p(w_i t_i | Φ(w_{i−n+2}^{i−1} t_{i−n+2}^{i−1})), and so on up until p(w_i t_i), which is a one-node tree (essentially a unigram model). [sent-100, score-0.155]

25 Although superficially similar to backoff in n-gram models, lower order decision trees differ substantially from lower order n-gram models and require different interpolation methods. [sent-101, score-1.185]

26 In the next section, we discuss this difference and present a generalized interpolation that is more suitable for combining decision tree models. [sent-102, score-0.757]

27 2 in a more generic way: p̃(w_i | w_1^{i−1}) = ρ_n(w_i | Φ_n(w_1^{i−1})) + γ(Φ_n(w_1^{i−1})) · p̃(w_i | BO_{n−1}(w_1^{i−1}))  (6), where ρ_n is a discounted distribution, Φ_n is a clustering function of order n, and γ(Φ_n(w_1^{i−1})) is the backoff weight chosen to normalize the distribution. [sent-107, score-0.352]
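A minimal sketch of this recursive backoff scheme (Python; the cluster, discounted, backoff_weight, and unigram accessors are hypothetical stand-ins for Φ_n, ρ_n, γ, and the one-node tree, and the history is assumed ordered from most distant to most recent word):

    def backoff_prob(word, history, order, trees):
        """p~(w | history) = rho_n(w | Phi_n(history))
                             + gamma(Phi_n(history)) * p~(w | BO_{n-1}(history))"""
        if order == 1:
            return trees[1].unigram(word)          # one-node tree: unigram estimate
        tree = trees[order]
        cluster = tree.cluster(history)            # Phi_n(history)
        p = tree.discounted(word, cluster)         # rho_n(w | Phi_n)
        gamma = tree.backoff_weight(cluster)       # mass reserved by discounting
        shorter = history[1:]                      # reduce context: drop most distant word
        return p + gamma * backoff_prob(word, shorter, order - 1, trees)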

28 BO_{n−1} is the backoff clustering function of order n − 1, representing a reduction of context size. [sent-108, score-0.37]

29 In the case of a decision tree model, the same backoff function is typically used, but the clustering function can be arbitrary. [sent-112, score-0.441]

30 6 is that the backoff context BO_{n−1}(w_1^{i−1}) allows for a more robust (but less informed) probability estimation than the context cluster Φ_n(w_1^{i−1}). [sent-114, score-0.376]

31 More precisely: ∀ w_1^{i−1}, W : W ∈ Φ_n(w_1^{i−1}) ⇒ W ∈ BO_{n−1}(w_1^{i−1})  (7); that is, every word sequence W that belongs to a context cluster Φ_n(w_1^{i−1}) belongs to the same backoff cluster BO_{n−1}(w_1^{i−1}) (hence has the same backoff distribution). [sent-115, score-0.712]

32 , a decision tree, the property is not necessarily satisfied. [sent-119, score-0.305]

33 Let us consider what happens when we have two context sequences W and W′ that belong to the same cluster Φ_n(W) = Φ_n(W′) but different backoff clusters BO_{n−1}(W) ≠ BO_{n−1}(W′). [sent-121, score-0.351]

34 For example: suppose we have Φ(w_{i−2} w_{i−1}) = ({on}, {may, june}) and two corresponding backoff clusters: BO′ = ({may}) and BO″ = ({june}). [sent-122, score-0.273]

35 However this would not be possible in the backoff scheme in Eq. [sent-126, score-0.273]

36 Hence arbitrary clustering (an advantage of decision trees) leads to a violation of Property 7, which is likely to produce a degradation in performance if backoff interpolation Eq. [sent-128, score-0.926]

37 p_n(w_i | φ_n) is the probability distribution at the cluster φ_n in the tree of order n. [sent-132, score-0.222]

38 This interpolation method is particularly useful as, unlike count-based discounting methods (e. [sent-133, score-0.326]

39 In (Filimonov and Harper, 2011), we observed that because of the violation of Property 7 in decision tree models, the interpolation method of Eq. [sent-136, score-0.729]

40 Instead we proposed the following generalized form of linear interpolation: p̃_n(w_i | w_{i−n+1}^{i−1}) = [Σ_{m=1..n} λ_m(φ_m) · p_m(w_i | φ_m)] / [Σ_{m=1..n} λ_m(φ_m)]  (9). Note that the recursive interpolation of Eq. [sent-138, score-0.387]

41 8 can be represented in this form with the additional constraint Σ_{m=1..n} λ_m(φ_m) = 1, which is not required in the generalized interpolation of Eq. [sent-139, score-0.297]
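A minimal sketch of this generalized interpolation (Python; the per-tree cluster, lam, and prob accessors are assumptions, not the authors' API): every tree contributes through a cluster-dependent weight λ_m(φ_m), and the weights are renormalized rather than constrained to sum to one.

    def forest_prob(word, history, trees):
        """p~_n(w | history) = sum_m lam_m(phi_m) * p_m(w | phi_m)
                               / sum_m lam_m(phi_m)        (Eq. 9 style)"""
        num, den = 0.0, 0.0
        for tree in trees:
            phi = tree.cluster(history)        # phi_m: cluster of this context in tree m
            lam = tree.lam(phi)                # lambda_m(phi_m), e.g. tuned on heldout data
            num += lam * tree.prob(word, phi)  # p_m(w | phi_m)
            den += lam
        return num / den if den > 0.0 else 0.0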

42 9, individual trees do not have explicit higher-lower order relations; they are treated as a collection of trees, i. [sent-145, score-0.262]

43 Naturally, to benefit from the forest model, its trees must differ in some way. [sent-148, score-0.339]

44 Different trees can be created based on differences in the training data, differences in the tree growing algorithm, or some non-determinism in the way the trees are constructed. [sent-149, score-0.623]

45 (Xu, 2005) used randomization techniques to produce a large forest of decision trees that were combined as follows: p(w_i | w_{i−n+1}^{i−1}) = (1/M) Σ_{m=1..M} p_m(w_i | w_{i−n+1}^{i−1})  (10), where M is the number of decision trees in the forest (he proposed M = 100) and p_m is the m-th tree model. [sent-150, score-1.201]
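For contrast, the uniform combination of Eq. 10 is just an unweighted average over same-order trees (a sketch under the same assumed per-tree interface as above):

    def uniform_forest_prob(word, history, trees):
        """Eq. 10 style: p(w | history) = (1/M) * sum_m p_m(w | history)."""
        return sum(t.prob(word, t.cluster(history)) for t in trees) / len(trees)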

46 Note that this type of interpolation assumes that each tree model is “equal” a priori and therefore is only appropriate when the tree models are grown in the same way (particularly, using the same order of context). [sent-151, score-0.741]

47 (Xu, 2005) showed that, although each individual tree is a fairly weak model, their combination outperforms the n-gram baseline substantially. [sent-155, score-0.183]

48 However, we find this approach impractical for online application of any sizable model: In our experiments, fourgram trees have approximately 1. [sent-156, score-0.518]

49 It would be infeasible to apply a model consisting of more than a handful of such trees without distributed computing of some sort. [sent-158, score-0.27]

50 Therefore, we pose the following question: If we can afford to have only a handful of trees in the model, what would be the best approach to construct those trees? [sent-159, score-0.295]

51 In the remainder of this section, we will describe the experimental setup, discuss and evaluate different ways of building decision tree forests for language modeling, and compare combination methods based on Eq. [sent-160, score-0.546]

52 For the syntactic modeling, we used tags comprised of the POS tags of the word and its head. [sent-169, score-0.184]

53 We constructed two sets of decision trees (a joint syntactic model and a word-tree model) as described in Section 2. [sent-176, score-0.593]

54 Each set comprised a fourgram tree with backoff trigram, bigram, and unigram trees. [sent-177, score-0.756]

55 He found that when the Exchange algorithm was initialized randomly, the Bernoulli trial parameter did not matter; however, when the Exchange algorithm was initialized deterministically, lower values for the Bernoulli trial parameter r yielded better overall forest performance. [sent-189, score-0.212]

56 Introducing Bernoulli trials, on the other hand, is a deliberate choice to degrade the quality of individual trees in the hope that the additional diversity enables their combination to compensate for that loss. [sent-198, score-0.353]

57 Another way of introducing randomness to the tree construction without apparent degradation of individual tree quality is through varying the data, e. [sent-199, score-0.483]

58 Let us take a closer look at the effect of different types of randomization on individual trees and their combinations. [sent-203, score-0.357]

59 In the first set of experiments, we compare the performance of a single undegraded fourgram tree with forests of fourgram trees grown randomly with Bernoulli trials. [sent-204, score-1.231]

60 Having only same-order trees in a forest allows us to apply interpolation of Eq. [sent-205, score-0.636]

61 10 (used in (Xu, 2005)) and compare with the interpolation method presented in Eq. [sent-206, score-0.297]

62 By comparing forests of different sizes with the baseline from Table 1, we are able to evaluate the effect of randomization in decision tree growing and assess the importance of the lower order trees. [sent-208, score-0.692]

63 Note that, while an undegraded syntactic tree is better than the word tree, the situation is reversed when the trees are grown randomly. [sent-210, score-0.703]

64 As we increase the number of random trees in the forest, the perplexity decreases as expected, with the interpolation method of Eq. [sent-212, score-0.748]

65 Note that in the case of the word-tree model, it takes 4 random decision trees to reach the performance of a single undegraded tree, while in the joint model, even (footnote 8: Here and henceforth, by “undegraded” we mean “according to the algorithm described in Section 2. [sent-215, score-0.731]

66 Percentage numbers in parentheses denote the reduction of perplexity relative to the lower order model of the same type. [sent-235, score-0.298]

67 Table 2: Perplexity numbers obtained using fourgram trees only. [sent-246, score-0.556]

68 Note that “undgr” and “rnd” denote undegraded and randomly grown trees with Bernoulli trials, respectively, and the number indicates the number of trees in the forest. [sent-247, score-0.754]

69 Also “baseline” refers to the fourgram models with lower order trees (from Table 1, Eq. [sent-248, score-0.6]

70 5 trees are much worse than a single decision tree constructed without randomization. [sent-250, score-0.685]

71 In Table 3, we evaluate forests of fourgram trees produced using randomizations without degrading the tree construction algorithm. [sent-252, score-0.898]

72 All forests in this table use the interpolation method of Eq. [sent-254, score-0.44]

73 Note that, while these perplexity numbers are substantially better than those of the trees produced with Bernoulli trials in Table 2, they are still significantly worse than the baseline model from Table 1. [sent-256, score-0.563]

74 These results suggest that, while it is beneficial to combine different decision trees, we should introduce differences to the tree construction process [sent-257, score-0.444]

75 Table 3: Perplexity numbers obtained using fourgram trees produced using random initialization of the Exchange algorithm (Exchng. [sent-265, score-0.614]

76 Note that “baseline” refers to the fourgram models with lower order trees (from Table 1). [sent-267, score-0.6]

77 without degrading the trees when introducing randomness, especially for joint models. [sent-270, score-0.338]

78 In addition, lower order trees seem to play an important role for high quality model combination. [sent-271, score-0.285]

79 3 Context-Restricted Forest: As we have mentioned above, combining higher and lower order decision trees produces much better results. [sent-273, score-0.533]

80 A lower order decision tree is grown from a lower order context space, i. [sent-274, score-0.644]

81 Note that in this case, rather than randomly ignoring contexts via Bernoulli trials at every node in the decision tree, we discard some context attributes upfront in a principled manner (i. [sent-277, score-0.497]

82 , most distant context) and then grow the decision tree without degradation. [sent-279, score-0.427]

83 In Table 4, we present the perplexity numbers for our standard model with additional trees. [sent-281, score-0.224]

84 We denote context-restricted trees by their Markovian or- (Table 4 fragment: Model, size, PPL; row: 1w1t + 2w2t + 3w3t + 4w4t (*), 294MB, 147.) [sent-282, score-0.234]

85 “bernoulli-rnd” and “datarnd” indicate fourgram trees randomized using Bernoulli trials and varying training data, respectively. [sent-287, score-0.623]

86 The second column shows the combined size of decision trees in the forest. [sent-288, score-0.482]

87 ders (words w and tags t independently), so 3w2t indicates a decision tree implementing the probability function: p(w_i t_i | w_{i−1} w_{i−2} t_{i−1}). [sent-289, score-0.484]
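As a small illustration of how such a label maps to an upfront context restriction (Python; the parsing convention and helper are hypothetical, not the authors' code), the restriction simply selects which word and tag positions the tree-growing algorithm is allowed to see:

    import re

    def restrict_context(label, words, tags):
        """Keep only the positions allowed by a label like "3w2t":
        word order 3 -> w_{i-1}, w_{i-2}; tag order 2 -> t_{i-1}."""
        m = re.fullmatch(r"(\d+)w(\d+)t", label)
        assert m, "expected a label like '3w2t'"
        w_order, t_order = int(m.group(1)), int(m.group(2))
        kept_words = words[-(w_order - 1):] if w_order > 1 else ()
        kept_tags = tags[-(t_order - 1):] if t_order > 1 else ()
        return kept_words, kept_tags

    # Example: full context (w_{i-3}, w_{i-2}, w_{i-1}) and (t_{i-3}, t_{i-2}, t_{i-1})
    words = ("w-3", "w-2", "w-1")
    tags = ("t-3", "t-2", "t-1")
    print(restrict_context("3w2t", words, tags))   # (("w-2", "w-1"), ("t-1",))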

88 The fourgram joint model presented in Table 1 has four trees and is labeled with the formula “1w1t + 2w2t + 3w3t + 4w4t” in Table 4. [sent-290, score-0.553]

89 The randomly grown trees (denoted “bernoulli-rnd”) are grown utilizing the full context 4w4t using the methods described in Section 4. [sent-291, score-0.476]

90 All models utilize the generalized interpolation method described in Section 3. [sent-293, score-0.426]

91 As can be seen in Table 4, adding undegraded trees consistently improves the performance of an already strong baseline, while adding random trees only increases the perplexity because their quality is worse than that of undegraded trees. [sent-295, score-1.051]

92 Note that the last two rows are syntactic models using the interpolation method of Eq. [sent-309, score-0.356]

93 In Table 5, we present WER results along with the corresponding perplexity numbers from Tables 1 and 4 for our lowest perplexity syntactic model, as well as the baselines (modified KN n-gram model and standard decision tree models using interpolation methods of Eq. [sent-319, score-1.169]

94 9 substantially improves performance over the interpolation method of Eq. [sent-323, score-0.297]

95 Adding four trees utilizing context restricted in different ways further reduces WER by 0. [sent-326, score-0.27]

96 6 Conclusion In this paper, we investigate various aspects of combining multiple decision trees in a single language model. [sent-331, score-0.482]

97 9) for decision tree models proposed in (Filimonov and Harper, 2011) is in fact a forest interpolation method rather than a backoff interpolation because, in Eq. [sent-333, score-1.109]

98 9, models do not have explicit higher-lower order relation as they do in backoff interpolation (Eq. [sent-334, score-0.601]

99 Thus, in this paper we investigate the question of how to construct decision trees so that their combination results in improved performance (under the assumption that computational tractability allows only a handful of decision trees in a forest). [sent-336, score-1.025]

100 Additionally, we observe that simply restricting the context used to construct trees in different ways not only produces smaller trees (because of the context reduction), but the resulting variations in the trees also produce forests that are at least as good as forests of larger trees. [sent-340, score-1.085]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('wii', 0.301), ('interpolation', 0.297), ('fourgram', 0.284), ('backoff', 0.273), ('decision', 0.248), ('witi', 0.243), ('trees', 0.234), ('wi', 0.197), ('perplexity', 0.186), ('undegraded', 0.183), ('exchange', 0.158), ('tree', 0.155), ('bernoulli', 0.151), ('forests', 0.143), ('filimonov', 0.14), ('bon', 0.14), ('forest', 0.105), ('trials', 0.105), ('grown', 0.103), ('wer', 0.099), ('harper', 0.096), ('randomization', 0.095), ('initializations', 0.087), ('asr', 0.083), ('questions', 0.073), ('lms', 0.062), ('property', 0.057), ('generalized', 0.057), ('tags', 0.056), ('attributes', 0.055), ('node', 0.053), ('heldout', 0.052), ('lower', 0.051), ('constructed', 0.048), ('mb', 0.047), ('comprised', 0.044), ('xu', 0.043), ('rescoring', 0.042), ('cluster', 0.042), ('acoustic', 0.041), ('construction', 0.041), ('degradation', 0.041), ('utilize', 0.041), ('degrading', 0.041), ('discounted', 0.041), ('pnm', 0.041), ('denis', 0.039), ('numbers', 0.038), ('clustering', 0.038), ('mary', 0.037), ('jelinek', 0.037), ('context', 0.036), ('handful', 0.036), ('lattices', 0.036), ('randomness', 0.035), ('bahl', 0.035), ('purposely', 0.035), ('bilmes', 0.035), ('heeman', 0.035), ('joint', 0.035), ('wsj', 0.034), ('lm', 0.034), ('recursive', 0.033), ('attribute', 0.033), ('folds', 0.033), ('stopping', 0.033), ('speech', 0.033), ('models', 0.031), ('random', 0.031), ('estimated', 0.03), ('violation', 0.029), ('discounting', 0.029), ('vocabulary', 0.029), ('individual', 0.028), ('trigram', 0.028), ('syntactic', 0.028), ('introducing', 0.028), ('chelba', 0.028), ('frederick', 0.028), ('kn', 0.028), ('trial', 0.028), ('tk', 0.028), ('initialization', 0.027), ('pn', 0.027), ('violated', 0.026), ('construct', 0.025), ('recursively', 0.025), ('modeling', 0.025), ('transcription', 0.025), ('pm', 0.025), ('probability', 0.025), ('greedy', 0.024), ('iy', 0.024), ('rosenfeld', 0.024), ('fluent', 0.024), ('grow', 0.024), ('partitions', 0.024), ('reduction', 0.023), ('variety', 0.023), ('belongs', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000012 131 emnlp-2011-Syntactic Decision Tree LMs: Random Selection or Intelligent Design?

Author: Denis Filimonov ; Mary Harper

Abstract: Decision trees have been applied to a variety of NLP tasks, including language modeling, for their ability to handle a variety of attributes and sparse context space. Moreover, forests (collections of decision trees) have been shown to substantially outperform individual decision trees. In this work, we investigate methods for combining trees in a forest, as well as methods for diversifying trees for the task of syntactic language modeling. We show that our tree interpolation technique outperforms the standard method used in the literature, and that, on this particular task, restricting tree contexts in a principled way produces smaller and better forests, with the best achieving an 8% relative reduction in Word Error Rate over an n-gram baseline.

2 0.12399378 5 emnlp-2011-A Fast Re-scoring Strategy to Capture Long-Distance Dependencies

Author: Anoop Deoras ; Tomas Mikolov ; Kenneth Church

Abstract: A re-scoring strategy is proposed that makes it feasible to capture more long-distance dependencies in the natural language. Two pass strategies have become popular in a number of recognition tasks such as ASR (automatic speech recognition), MT (machine translation) and OCR (optical character recognition). The first pass typically applies a weak language model (n-grams) to a lattice and the second pass applies a stronger language model to N best lists. The stronger language model is intended to capture more longdistance dependencies. The proposed method uses RNN-LM (recurrent neural network language model), which is a long span LM, to rescore word lattices in the second pass. A hill climbing method (iterative decoding) is proposed to search over islands of confusability in the word lattice. An evaluation based on Broadcast News shows speedups of 20 over basic N best re-scoring, and word error rate reduction of 8% (relative) on a highly competitive setup.

3 0.11429693 44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection

Author: Amittai Axelrod ; Xiaodong He ; Jianfeng Gao

Abstract: We explore efficient domain adaptation for the task of statistical machine translation based on extracting sentences from a large general-domain parallel corpus that are most relevant to the target domain. These sentences may be selected with simple cross-entropy based methods, of which we present three. As these sentences are not themselves identical to the in-domain data, we call them pseudo in-domain subcorpora. These subcorpora, 1% the size of the original, can then be used to train small domain-adapted Statistical Machine Translation (SMT) systems which outperform systems trained on the entire corpus. Performance is further improved when we use these domain-adapted models in combination with a true in-domain model. The results show that more training data is not always better, and that best results are attained via proper domain-relevant data selection, as well as combining in- and general-domain systems during decoding.

4 0.10095574 50 emnlp-2011-Evaluating Dependency Parsing: Robust and Heuristics-Free Cross-Annotation Evaluation

Author: Reut Tsarfaty ; Joakim Nivre ; Evelina Andersson

Abstract: unkown-abstract

5 0.096421674 58 emnlp-2011-Fast Generation of Translation Forest for Large-Scale SMT Discriminative Training

Author: Xinyan Xiao ; Yang Liu ; Qun Liu ; Shouxun Lin

Abstract: Although discriminative training guarantees to improve statistical machine translation by incorporating a large amount of overlapping features, it is hard to scale up to large data due to decoding complexity. We propose a new algorithm to generate translation forest of training data in linear time with the help of word alignment. Our algorithm also alleviates the oracle selection problem by ensuring that a forest always contains derivations that exactly yield the reference translation. With millions of features trained on 519K sentences in 0.03 second per sentence, our system achieves significant improvement by 0.84 BLEU over the baseline system on the NIST Chinese-English test sets.

6 0.095432527 76 emnlp-2011-Language Models for Machine Translation: Original vs. Translated Texts

7 0.093433358 125 emnlp-2011-Statistical Machine Translation with Local Language Models

8 0.084209085 134 emnlp-2011-Third-order Variational Reranking on Packed-Shared Dependency Forests

9 0.081324376 10 emnlp-2011-A Probabilistic Forest-to-String Model for Language Generation from Typed Lambda Calculus Expressions

10 0.064415373 141 emnlp-2011-Unsupervised Dependency Parsing without Gold Part-of-Speech Tags

11 0.063838758 56 emnlp-2011-Exploring Supervised LDA Models for Assigning Attributes to Adjective-Noun Phrases

12 0.059835203 75 emnlp-2011-Joint Models for Chinese POS Tagging and Dependency Parsing

13 0.055505626 147 emnlp-2011-Using Syntactic and Semantic Structural Kernels for Classifying Definition Questions in Jeopardy!

14 0.054737337 74 emnlp-2011-Inducing Sentence Structure from Parallel Corpora for Reordering

15 0.05469634 148 emnlp-2011-Watermarking the Outputs of Structured Prediction with an application in Statistical Machine Translation.

16 0.053714547 46 emnlp-2011-Efficient Subsampling for Training Complex Language Models

17 0.052769605 146 emnlp-2011-Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

18 0.05252137 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation

19 0.052097891 127 emnlp-2011-Structured Lexical Similarity via Convolution Kernels on Dependency Trees

20 0.051030379 16 emnlp-2011-Accurate Parsing with Compact Tree-Substitution Grammars: Double-DOP


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.184), (1, 0.051), (2, -0.01), (3, -0.025), (4, -0.003), (5, 0.026), (6, -0.062), (7, -0.114), (8, 0.072), (9, -0.084), (10, 0.067), (11, 0.061), (12, -0.063), (13, 0.116), (14, -0.128), (15, 0.123), (16, 0.121), (17, 0.069), (18, -0.024), (19, 0.136), (20, 0.145), (21, -0.001), (22, 0.056), (23, -0.138), (24, -0.08), (25, -0.112), (26, 0.027), (27, -0.103), (28, 0.122), (29, 0.035), (30, -0.018), (31, -0.154), (32, -0.075), (33, -0.15), (34, -0.201), (35, -0.129), (36, -0.062), (37, -0.051), (38, -0.003), (39, -0.005), (40, 0.04), (41, 0.035), (42, -0.08), (43, -0.137), (44, 0.108), (45, 0.242), (46, 0.017), (47, -0.062), (48, 0.114), (49, 0.106)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95731735 131 emnlp-2011-Syntactic Decision Tree LMs: Random Selection or Intelligent Design?

Author: Denis Filimonov ; Mary Harper

Abstract: Decision trees have been applied to a variety of NLP tasks, including language modeling, for their ability to handle a variety of attributes and sparse context space. Moreover, forests (collections of decision trees) have been shown to substantially outperform individual decision trees. In this work, we investigate methods for combining trees in a forest, as well as methods for diversifying trees for the task of syntactic language modeling. We show that our tree interpolation technique outperforms the standard method used in the literature, and that, on this particular task, restricting tree contexts in a principled way produces smaller and better forests, with the best achieving an 8% relative reduction in Word Error Rate over an n-gram baseline.

2 0.59285659 46 emnlp-2011-Efficient Subsampling for Training Complex Language Models

Author: Puyang Xu ; Asela Gunawardana ; Sanjeev Khudanpur

Abstract: We propose an efficient way to train maximum entropy language models (MELM) and neural network language models (NNLM). The advantage of the proposed method comes from a more robust and efficient subsampling technique. The original multi-class language modeling problem is transformed into a set of binary problems where each binary classifier predicts whether or not a particular word will occur. We show that the binarized model is as powerful as the standard model and allows us to aggressively subsample negative training examples without sacrificing predictive performance. Empirical results show that we can train MELM and NNLM at 1% ∼ 5% of the standard complexity with no loss in performance.

3 0.52295756 76 emnlp-2011-Language Models for Machine Translation: Original vs. Translated Texts

Author: Gennadi Lembersky ; Noam Ordan ; Shuly Wintner

Abstract: We investigate the differences between language models compiled from original target-language texts and those compiled from texts manually translated to the target language. Corroborating established observations of Translation Studies, we demonstrate that the latter are significantly better predictors of translated sentences than the former, and hence fit the reference set better. Furthermore, translated texts yield better language models for statistical machine translation than original texts.

4 0.40595454 47 emnlp-2011-Efficient retrieval of tree translation examples for Syntax-Based Machine Translation

Author: Fabien Cromieres ; Sadao Kurohashi

Abstract: We propose an algorithm allowing to efficiently retrieve example treelets in a parsed tree database in order to allow on-the-fly extraction of syntactic translation rules. We also propose improvements of this algorithm allowing several kinds of flexible matchings.

5 0.3843427 5 emnlp-2011-A Fast Re-scoring Strategy to Capture Long-Distance Dependencies

Author: Anoop Deoras ; Tomas Mikolov ; Kenneth Church

Abstract: A re-scoring strategy is proposed that makes it feasible to capture more long-distance dependencies in the natural language. Two pass strategies have become popular in a number of recognition tasks such as ASR (automatic speech recognition), MT (machine translation) and OCR (optical character recognition). The first pass typically applies a weak language model (n-grams) to a lattice and the second pass applies a stronger language model to N best lists. The stronger language model is intended to capture more longdistance dependencies. The proposed method uses RNN-LM (recurrent neural network language model), which is a long span LM, to rescore word lattices in the second pass. A hill climbing method (iterative decoding) is proposed to search over islands of confusability in the word lattice. An evaluation based on Broadcast News shows speedups of 20 over basic N best re-scoring, and word error rate reduction of 8% (relative) on a highly competitive setup.

6 0.36766738 50 emnlp-2011-Evaluating Dependency Parsing: Robust and Heuristics-Free Cross-Annotation Evaluation

7 0.34923962 44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection

8 0.32739934 16 emnlp-2011-Accurate Parsing with Compact Tree-Substitution Grammars: Double-DOP

9 0.31392083 58 emnlp-2011-Fast Generation of Translation Forest for Large-Scale SMT Discriminative Training

10 0.30963978 133 emnlp-2011-The Imagination of Crowds: Conversational AAC Language Modeling using Crowdsourcing and Large Data Sources

11 0.29983079 62 emnlp-2011-Generating Subsequent Reference in Shared Visual Scenes: Computation vs Re-Use

12 0.29321545 134 emnlp-2011-Third-order Variational Reranking on Packed-Shared Dependency Forests

13 0.28811297 148 emnlp-2011-Watermarking the Outputs of Structured Prediction with an application in Statistical Machine Translation.

14 0.28358802 111 emnlp-2011-Reducing Grounded Learning Tasks To Grammatical Inference

15 0.28095034 10 emnlp-2011-A Probabilistic Forest-to-String Model for Language Generation from Typed Lambda Calculus Expressions

16 0.27185899 74 emnlp-2011-Inducing Sentence Structure from Parallel Corpora for Reordering

17 0.27096665 143 emnlp-2011-Unsupervised Information Extraction with Distributional Prior Knowledge

18 0.2594561 141 emnlp-2011-Unsupervised Dependency Parsing without Gold Part-of-Speech Tags

19 0.24936759 125 emnlp-2011-Statistical Machine Translation with Local Language Models

20 0.24500728 11 emnlp-2011-A Simple Word Trigger Method for Social Tag Suggestion


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(23, 0.071), (36, 0.025), (37, 0.054), (45, 0.043), (53, 0.036), (54, 0.041), (57, 0.434), (62, 0.013), (64, 0.018), (66, 0.046), (69, 0.012), (79, 0.033), (82, 0.017), (90, 0.015), (96, 0.034), (98, 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.85663247 113 emnlp-2011-Relation Acquisition using Word Classes and Partial Patterns

Author: Stijn De Saeger ; Kentaro Torisawa ; Masaaki Tsuchida ; Jun'ichi Kazama ; Chikara Hashimoto ; Ichiro Yamada ; Jong Hoon Oh ; Istvan Varga ; Yulan Yan

Abstract: This paper proposes a semi-supervised relation acquisition method that does not rely on extraction patterns (e.g. “X causes Y” for causal relations) but instead learns a combination of indirect evidence for the target relation semantic word classes and partial patterns. This method can extract long tail instances of semantic relations like causality from rare and complex expressions in a large Japanese Web corpus in extreme cases, patterns that occur only once in the entire corpus. Such patterns are beyond the reach ofcurrent pattern based methods. We show that our method performs on par with state-of-the-art pattern based methods, and maintains a reasonable level of accuracy even for instances — — acquired from infrequent patterns. This ability to acquire long tail instances is crucial for risk management and innovation, where an exhaustive database of high-level semantic relations like causation is of vital importance.

same-paper 2 0.84365624 131 emnlp-2011-Syntactic Decision Tree LMs: Random Selection or Intelligent Design?

Author: Denis Filimonov ; Mary Harper

Abstract: Decision trees have been applied to a variety of NLP tasks, including language modeling, for their ability to handle a variety of attributes and sparse context space. Moreover, forests (collections of decision trees) have been shown to substantially outperform individual decision trees. In this work, we investigate methods for combining trees in a forest, as well as methods for diversifying trees for the task of syntactic language modeling. We show that our tree interpolation technique outperforms the standard method used in the literature, and that, on this particular task, restricting tree contexts in a principled way produces smaller and better forests, with the best achieving an 8% relative reduction in Word Error Rate over an n-gram baseline.

3 0.83735108 130 emnlp-2011-Summarize What You Are Interested In: An Optimization Framework for Interactive Personalized Summarization

Author: Rui Yan ; Jian-Yun Nie ; Xiaoming Li

Abstract: Most traditional summarization methods treat their outputs as static and plain texts, which fail to capture user interests during summarization because the generated summaries are the same for different users. However, users have individual preferences on a particular source document collection and obviously a universal summary for all users might not always be satisfactory. Hence we investigate an important and challenging problem in summary generation, i.e., Interactive Personalized Summarization (IPS), which generates summaries in an interactive and personalized manner. Given the source documents, IPS captures user interests by enabling interactive clicks and incorporates personalization by modeling captured reader preference. We develop . experimental systems to compare 5 rival algorithms on 4 instinctively different datasets which amount to 5197 documents. Evaluation results in ROUGE metrics indicate the comparable performance between IPS and the best competing system but IPS produces summaries with much more user satisfaction according to evaluator ratings. Besides, low ROUGE consistency among these user preferred summaries indicates the existence of personalization.

4 0.46743304 78 emnlp-2011-Large-Scale Noun Compound Interpretation Using Bootstrapping and the Web as a Corpus

Author: Su Nam Kim ; Preslav Nakov

Abstract: Responding to the need for semantic lexical resources in natural language processing applications, we examine methods to acquire noun compounds (NCs), e.g., orange juice, together with suitable fine-grained semantic interpretations, e.g., squeezed from, which are directly usable as paraphrases. We employ bootstrapping and web statistics, and utilize the relationship between NCs and paraphrasing patterns to jointly extract NCs and such patterns in multiple alternating iterations. In evaluation, we found that having one compound noun fixed yields both a higher number of semantically interpreted NCs and improved accuracy due to stronger semantic restrictions.

5 0.4264296 57 emnlp-2011-Extreme Extraction - Machine Reading in a Week

Author: Marjorie Freedman ; Lance Ramshaw ; Elizabeth Boschee ; Ryan Gabbard ; Gary Kratkiewicz ; Nicolas Ward ; Ralph Weischedel

Abstract: We report on empirical results in extreme extraction. It is extreme in that (1) from receipt of the ontology specifying the target concepts and relations, development is limited to one week and that (2) relatively little training data is assumed. We are able to surpass human recall and achieve an F1 of 0.5 1 on a question-answering task with less than 50 hours of effort using a hybrid approach that mixes active learning, bootstrapping, and limited (5 hours) manual rule writing. We compare the performance of three systems: extraction with handwritten rules, bootstrapped extraction, and a combination. We show that while the recall of the handwritten rules surpasses that of the learned system, the learned system is able to improve the overall recall and F1.

6 0.38505292 26 emnlp-2011-Class Label Enhancement via Related Instances

7 0.38403261 144 emnlp-2011-Unsupervised Learning of Selectional Restrictions and Detection of Argument Coercions

8 0.3819544 147 emnlp-2011-Using Syntactic and Semantic Structural Kernels for Classifying Definition Questions in Jeopardy!

9 0.37846419 70 emnlp-2011-Identifying Relations for Open Information Extraction

10 0.37717164 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances

11 0.35830143 104 emnlp-2011-Personalized Recommendation of User Comments via Factor Models

12 0.34690124 6 emnlp-2011-A Generate and Rank Approach to Sentence Paraphrasing

13 0.34625039 128 emnlp-2011-Structured Relation Discovery using Generative Models

14 0.34488648 67 emnlp-2011-Hierarchical Verb Clustering Using Graph Factorization

15 0.34216753 23 emnlp-2011-Bootstrapped Named Entity Recognition for Product Attribute Extraction

16 0.33926919 105 emnlp-2011-Predicting Thread Discourse Structure over Technical Web Forums

17 0.33923572 64 emnlp-2011-Harnessing different knowledge sources to measure semantic relatedness under a uniform model

18 0.33618987 68 emnlp-2011-Hypotheses Selection Criteria in a Reranking Framework for Spoken Language Understanding

19 0.33531824 83 emnlp-2011-Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation

20 0.33259639 124 emnlp-2011-Splitting Noun Compounds via Monolingual and Bilingual Paraphrasing: A Study on Japanese Katakana Words