acl acl2011 acl2011-14 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yves Petinot ; Kathleen McKeown ; Kapil Thadani
Abstract: We investigate the relevance of hierarchical topic models to represent the content of Web gists. We focus our attention on DMOZ, a popular Web directory, and propose two algorithms to infer such a model from its manually-curated hierarchy of categories. Our first approach, based on information-theoretic grounds, uses an algorithm similar to recursive feature selection. Our second approach is fully Bayesian and derived from the more general model, hierarchical LDA. We evaluate the performance of both models against a flat 1-gram baseline and show improvements in terms of perplexity over held-out data.
Reference: text
sentIndex sentText sentNum sentScore
1 A Hierarchical Model of Web Summaries Yves Petinot and Kathleen McKeown and Kapil Thadani Department of Computer Science Columbia University New York, NY 10027 { ypetinot | kathy | kapil @ cs . [sent-1, score-0.033]
2 edu } Abstract We investigate the relevance of hierarchical topic models to represent the content of Web gists. [sent-3, score-0.516]
3 We focus our attention on DMOZ, a popular Web directory, and propose two algorithms to infer such a model from its manually-curated hierarchy of categories. [sent-4, score-0.355]
4 Our second approach is fully Bayesian and derived from the more general model, hierarchical LDA. [sent-6, score-0.185]
5 We evaluate the performance of both models against a flat 1-gram baseline and show improvements in terms of perplexity over held-out data. [sent-7, score-0.239]
6 1 Introduction The work presented in this paper is aimed at leveraging a manually created document ontology to model the content of an underlying document collection. [sent-8, score-0.347]
7 Our study focuses on the ontology underlying DMOZ1 , a popular Web directory. [sent-10, score-0.118]
8 We propose two methods for crystalizing a hierarchical topic model against its hierarchy and show that the resulting models outperform a flat unigram model in its predictive power over held-out data. [sent-11, score-0.96]
9 To construct our hierarchical topic models, we adopt the mixed membership formalism (Hofmann, 1999; Blei et al. [sent-14, score-0.409]
10 , 2010), where a document is represented as a mixture over a set of word multinomials. [sent-15, score-0.161]
11 the DMOZ hierarchy) as a tree where internal nodes (category nodes) and leaf nodes (documents), as well as the edges connecting them, are known a priori. [sent-18, score-0.038]
12 Each node Ni in H is mapped to a multinomial word distribution MultNi , and each path cd to a leaf node D is associated with a mixture over the multinomials (MultC0 . [sent-19, score-0.749]
13 The mixture components are combined using a mixing proportion vector (θC0 . [sent-23, score-0.201]
14 θCk ), so that the likelihood of string w being produced by path cd is: p(w|cd) = ∏_{i=0}^{|w|} ∑_{j=0}^{|cd|} θj p(wi|cd,j) (1), where ∑_{j=0}^{|cd|} θj = 1, ∀d (2). [sent-26, score-0.381]
15 We describe how they allow the derivation of both p(wi|cd,j) and θ, and present early experimental results showing that the explicit hierarchical organization of content can indeed be used as a basis for content modeling purposes. [sent-27, score-0.309]
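To make the mixture formulation concrete, the following is a minimal sketch (our illustration, not code from the paper; all function and variable names are assumed) of how the likelihood of Equation 1 can be evaluated for a single gist once the per-node word multinomials and the mixing proportions along its path are known.

```python
import math

def gist_log_likelihood(words, path_multinomials, theta):
    """Log-likelihood of a gist under Equation 1: each word is generated by first
    picking a level j on the path (probability theta[j]) and then drawing the word
    from that node's multinomial."""
    assert abs(sum(theta) - 1.0) < 1e-9  # Equation 2: the mixing proportions sum to 1
    log_lik = 0.0
    for w in words:
        word_prob = sum(theta[j] * node.get(w, 0.0)
                        for j, node in enumerate(path_multinomials))
        log_lik += math.log(word_prob) if word_prob > 0 else float("-inf")
    return log_lik

# Toy example with a two-level path (root category -> leaf category).
root = {"the": 0.5, "site": 0.3, "offers": 0.2}
leaf = {"jazz": 0.6, "records": 0.4}
print(gist_log_likelihood(["the", "jazz", "site"], [root, leaf], [0.5, 0.5]))
```

The two models described below differ only in how they obtain the p(wi|cd,j) tables and the θ vector used in such a computation.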
16 2 Related Work While several efforts have focused on the DMOZ corpus, often as a reference for Web summarization [sent-28, score-0.074]
17 , 2009b), very little research has attempted to make use of its hierarchy as is. [sent-32, score-0.322]
18 (2005), where the DMOZ hierarchy is used as a basis for a hierarchical lexicon, is closest to ours although their contribution is not a full-fledged content model, but a selection of highly salient vocabulary for every category of the hierarchy. [sent-34, score-0.613]
19 The problem considered in this paper is connected to the area of Topic Modeling (Blei and Lafferty, 2009) where the goal is to reduce the surface complexity of text documents by modeling them as mixtures over a finite set of topics2. [sent-35, score-0.034]
20 While the inferred models are usually flat, in that no explicit relationship exists among topics, more complex, non-parametric, representations have been proposed to elicit the hierarchical structure of various datasets (Hofmann, 1999; Blei et al. [sent-36, score-0.308]
21 , 2009a) or Fixed hLDA (Reisinger and Paşca, 2009) where the set of topics associated with a document is known a priori. [sent-40, score-0.224]
22 In both cases, document labels are mapped to constraints on the set of topics on which the - otherwise unaltered - topic inference algorithm is to be applied. [sent-41, score-0.507]
23 Lastly, while most recent developments have been based on unsupervised data, it is also worth mentioning earlier approaches like Topic Signatures (Lin and Hovy, 2000) where words (or phrases) characteristic of a topic are identified using a statistical test of dependence. [sent-42, score-0.224]
24 Our first model extends this approach to the hierarchical setting, building actual topic models based on the selected vocabulary. [sent-43, score-0.454]
25 3 Information-Theoretic Approach The assumption that topics are known a priori allows us to extend the concept of Topic Signatures to a hierarchical setting. [sent-44, score-0.309]
26 Lin and Hovy (2000) describe a Topic Signature as a list of words highly correlated with a target concept, and use a χ2 estimator over labeled data to decide as to the allocation of a word to a topic. [sent-45, score-0.044]
27 Here, the sub-categories of a node correspond to the topics. [sent-46, score-0.102]
28 However, since the hierarchy is naturally organized in a generic-to-specific fashion, [footnote 2: Here we use the term topic to describe a normalized distribution over a fixed vocabulary V.] [sent-47, score-0.597]
29 for each node we select words that have the least discriminative power between the node's children. [sent-48, score-0.174]
30 The rationale is that, if a word can discriminate well between one child and all others, then it belongs in that child’s node. [sent-49, score-0.076]
31 In the first phase, the hierarchy tree is traversed in a bottom-up fashion to compile word frequency information under each node. [sent-52, score-0.406]
32 In the second phase, the hierarchy is traversed top-down and, at each step, words get assigned to the current node based on whether they can discriminate between the current node’s children. [sent-53, score-0.522]
33 Once a word has been assigned on a given path, it can no longer be assigned to any other node on this path. [sent-54, score-0.102]
34 Thus, within a path, a word always takes on the meaning of the one topic to which it has been assigned. [sent-55, score-0.224]
35 The discriminative power of a term with respect to node N is formalized based on one of the following measures: Entropy of the a posteriori children category distribution for a given w. [sent-56, score-0.382]
36 Ent(w) = −∑_{C∈Sub(N)} p(C|w) log(p(C|w)) (3) Cross-Entropy between the a priori children category distribution and the a posteriori children category distribution conditioned on the appearance of w. [sent-57, score-0.444]
37 The number of degrees of freedom of the χ2 distribution is a function of the number of children. [sent-59, score-0.051]
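As an illustration of the top-down assignment phase, the sketch below scores candidate words at a node with the entropy measure of Equation 3 and keeps the least discriminative ones; the counts, threshold, and function names are hypothetical and not taken from the paper.

```python
import math
from collections import Counter

def child_posterior(word, child_counts):
    """p(C|w): a posteriori distribution over the node's children given word w."""
    total = sum(counts[word] for counts in child_counts.values())
    return {c: counts[word] / total for c, counts in child_counts.items()} if total else {}

def word_entropy(word, child_counts):
    """Equation 3: entropy of the a posteriori child-category distribution."""
    return -sum(p * math.log(p) for p in child_posterior(word, child_counts).values() if p > 0)

def assign_words_to_node(candidate_words, child_counts, threshold):
    """High-entropy words do not discriminate between the children, so they are kept
    at the current node; discriminative words are pushed down to the children."""
    kept, pushed_down = [], []
    for w in candidate_words:
        (kept if word_entropy(w, child_counts) >= threshold else pushed_down).append(w)
    return kept, pushed_down

# Toy example with two children of a hypothetical "Music" node.
child_counts = {"Jazz": Counter({"music": 10, "sax": 8}),
                "Rock": Counter({"music": 12, "guitar": 9})}
print(assign_words_to_node(["music", "sax", "guitar"], child_counts, threshold=0.5))
```

The cross-entropy and χ2 measures described above would simply replace word_entropy in this scheme.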
38 [footnote 3: Although this makes the decision process less arbitrary ...] Algorithm 1 Generative process for hLLDA • For each topic t ∈ H: Draw βt = (βt,1, . [sent-61, score-0.224]
39 K} • Draw a random path assignment cd ∈ H • Draw a distribution over levels along cd, θd ∼ Dir(·|α) • Draw a document length n ∼ φH • For each word wd,i ∈ {wd,1 , wd,2 , . [sent-67, score-0.476]
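The generative story can be simulated roughly as follows; this is our own sketch, with the elided draws filled in by standard LDA-style choices (Dirichlet priors for βt and θd, a Poisson length), so treat those details as assumptions rather than the paper's exact specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_gist(path_topics, vocab, alpha=0.1, eta=0.01, mean_length=20):
    """Generate one synthetic gist for a document whose path assignment cd
    (the node ids from the hierarchy root down to the leaf) is known a priori."""
    V = len(vocab)
    # Per-node word multinomials, drawn once per node on the path.
    betas = {t: rng.dirichlet([eta] * V) for t in path_topics}
    # Per-document distribution over the levels of the path.
    theta = rng.dirichlet([alpha] * len(path_topics))
    n = max(1, rng.poisson(mean_length))  # document length
    words = []
    for _ in range(n):
        level = rng.choice(len(path_topics), p=theta)          # pick a level on cd
        word_id = rng.choice(V, p=betas[path_topics[level]])   # emit a word from that node
        words.append(vocab[word_id])
    return words

print(generate_gist(["Top", "Top/Arts", "Top/Arts/Music"],
                    ["jazz", "band", "site", "records"]))
```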
40 3.2 Topic Definition & Mixing Proportions Based on the final word assignments, we estimate the probability of word wi in topic Tk as: P(wi|Tk) = nCk(wi) / nCk (6), with nCk(wi) the total number of occurrences of wi in documents under Ck, and nCk the total number of words in documents under Ck. [sent-71, score-0.556]
41 Given the individual word assignments we evaluate the mixing proportions using corpus-level estimates, which are computed by averaging the mixing proportions of all the training documents. [sent-72, score-0.535]
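Under the assumption that every training token has already been assigned to exactly one node on its document's path, the estimates of Equation 6 and the corpus-level mixing proportions can be sketched as below; the data layout and names are illustrative only.

```python
from collections import Counter, defaultdict

def estimate_topics_and_mixing(assigned_docs):
    """assigned_docs: list of (path, tokens) pairs, where tokens is a list of
    (word, level) with `level` indexing the node on `path` the word was assigned to.
    Returns P(w|Tk) per node (Equation 6) and corpus-level mixing proportions obtained
    by averaging the per-document level proportions."""
    word_counts = defaultdict(Counter)   # node -> word counts under that node
    level_props = defaultdict(list)      # level -> list of per-document proportions
    for path, tokens in assigned_docs:
        per_doc = Counter(level for _, level in tokens)
        for word, level in tokens:
            word_counts[path[level]][word] += 1
        for level in range(len(path)):
            level_props[level].append(per_doc[level] / len(tokens))
    topics = {node: {w: c / sum(counts.values()) for w, c in counts.items()}
              for node, counts in word_counts.items()}
    theta = {level: sum(props) / len(props) for level, props in level_props.items()}
    return topics, theta

docs = [(["Top", "Top/Arts"], [("welcome", 0), ("gallery", 1), ("painting", 1)]),
        (["Top", "Top/Sports"], [("welcome", 0), ("soccer", 1)])]
print(estimate_topics_and_mixing(docs))
```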
42 4 Hierarchical Bayesian Approach The previous approach, while attractive in its simplicity, makes a strong claim that a word can be emitted by at most one node on any given path. [sent-73, score-0.102]
43 A more interesting model might stem from allowing soft word-topic assignments, where any topic on the document’s path may emit any word in the vocabulary space. [sent-74, score-0.345]
44 We consider a modified version of hierarchical LDA (Blei et al. [sent-75, score-0.185]
45 , 2010), where the underlying tree structure is known a priori and does not have to be inferred from data. [sent-76, score-0.128]
46 The generative story for this model, which we designate as hierarchical LabeledLDA (hLLDA), is shown in Algorithm 1. [sent-77, score-0.185]
47 Just as with Fixed Structure LDA4 (Reisinger and Paşca, [footnote 3, cont.: than with a hand-selected threshold, this raises the issue of identifying the true distribution for the estimator used.] [sent-78, score-0.095]
48 2009), the topics used for inference are, for each document, those found on the path from the hierarchy root to the document itself. [sent-80, score-0.095]
49 Once the target path cd ∈ H is known, the model reduces to LDA over the set of topics comprising cd. [sent-81, score-0.46]
50 Given that the joint distribution p(θ, z, w|cd) is intractable (Blei et al. [sent-82, score-0.051]
51 Equation 7 can be understood as defining the unnormalized posterior word-level assignment distribution as the product of the current level mixing proportion θi and of the current estimate of the word-topic conditional probability p(wi|zi). [sent-85, score-0.231]
52 By repeatedly resampling from this distribution we obtain individual word assignments which in turn allow us to estimate the topic multinomials and the per-document mixing proportions. [sent-86, score-0.533]
53 Specifically, the topic multinomials are estimated as: βcd[j],i = p(wi|zcd[j]) = (n^{wi}_{zcd[j]} + η) / (n^{(·)}_{zcd[j]} + Vη) (8), while the per-document mixing proportions θd can be estimated as: θd,j ≈ (nd,j + α) / (nd,· + |cd|α), ∀j ∈ 1, . . . , |cd| (9). [sent-87, score-0.396]
54 Although we experimented with hyper-parameter learning (Dirichlet concentration parameter η), doing so did not significantly impact the final model. [sent-90, score-0.054]
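A bare-bones Gibbs step consistent with these estimates might look like the following; it is our illustrative sketch rather than the authors' implementation, and it assumes count tables indexed by the nodes on the document's path (n_wz: word counts per node, n_z: total counts per node, n_dj: per-document counts per level).

```python
import numpy as np

def resample_word_level(w, old_level, path, n_wz, n_z, n_dj, alpha, eta, V, rng):
    """Resample the level assignment of one word token, in the spirit of Equation 7:
    p(z = j) proportional to (n_dj[j] + alpha) * (n_wz[path[j]][w] + eta) / (n_z[path[j]] + V*eta)."""
    # Remove the token's current assignment from the counts.
    n_wz[path[old_level]][w] -= 1
    n_z[path[old_level]] -= 1
    n_dj[old_level] -= 1
    # Unnormalized posterior over the |cd| levels of the path.
    weights = np.array([(n_dj[j] + alpha) *
                        (n_wz[path[j]].get(w, 0) + eta) / (n_z[path[j]] + V * eta)
                        for j in range(len(path))])
    new_level = rng.choice(len(path), p=weights / weights.sum())
    # Record the new assignment.
    n_wz[path[new_level]][w] = n_wz[path[new_level]].get(w, 0) + 1
    n_z[path[new_level]] += 1
    n_dj[new_level] += 1
    return new_level
```

Sweeping this step over all tokens until the counts stabilize, Equations 8 and 9 then read the topic multinomials and mixing proportions directly off the final counts.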
55 5 Experimental Results We compared the predictive power of our model to that of several language models. [sent-93, score-0.118]
56 In every case, we compute the perplexity of the model over the held-out data W = {w1 . [sent-94, score-0.128]
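Perplexity here is the exponentiated negative average per-token log-likelihood over the held-out text; a model-agnostic sketch (the log_prob callback and the unigram example are our own, for illustration) is shown below.

```python
import math

def perplexity(heldout_docs, log_prob):
    """heldout_docs: list of token lists; log_prob(word, doc) returns the model's
    log-probability of `word` given the document's context (e.g. its path mixture)."""
    total_log_prob, total_tokens = 0.0, 0
    for doc in heldout_docs:
        for word in doc:
            total_log_prob += log_prob(word, doc)
            total_tokens += 1
    return math.exp(-total_log_prob / total_tokens)

# Example with a flat unigram baseline.
unigram = {"the": 0.5, "site": 0.3, "offers": 0.19, "OOV": 0.01}
docs = [["the", "site", "offers"], ["the", "site"]]
print(perplexity(docs, lambda w, d: math.log(unigram.get(w, unigram["OOV"]))))
```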
57 5.1 Data Preprocessing Our experiments focused on the English portion of the DMOZ dataset5 (about 2. [sent-98, score-0.071]
58 Akin to Berger and Mittal (2000) we mapped numerical tokens to the NUM placeholder and selected the V = 65535 most frequent words as our vocabulary. [sent-102, score-0.089]
59 Any token outside of this set was mapped to the OOV token. [sent-103, score-0.059]
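The preprocessing just described can be sketched as follows; the tokenizer-level details (the numeric-token pattern, lowercasing) are assumptions on our part, while the V = 65535 cutoff and the NUM/OOV placeholders come from the text above.

```python
import re
from collections import Counter

NUM_RE = re.compile(r"\d+([.,]\d+)?$")

def build_vocabulary(token_stream, size=65535):
    """Map numeric tokens to NUM, keep the `size` most frequent word types, and
    map every remaining token to OOV."""
    def canon(tok):
        return "NUM" if NUM_RE.match(tok) else tok.lower()
    counts = Counter(canon(tok) for tok in token_stream)
    vocab = {w for w, _ in counts.most_common(size)}
    def normalize(tok):
        tok = canon(tok)
        return tok if tok in vocab else "OOV"
    return vocab, normalize

vocab, normalize = build_vocabulary(["Jazz", "records", "since", "1956"], size=3)
print([normalize(t) for t in ["Jazz", "records", "1956", "unseen"]])
```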
60 5.2 Reference Models Our reference models consist of several n-gram (n ∈ [1, 3]) language models, none of which makes use of the hierarchical information available from the corpus. [sent-106, score-0.119]
61 Note that an interesting model to include here would have been one that jointly infers a hierarchy of topics as well as the topics that comprise it, much like the regular hierarchical LDA algorithm (Blei et al. [sent-110, score-0.755]
62 We are especially interested in seeing whether an automatically inferred hierarchy of topics would fundamentally differ from the manually-curated hierarchy used by DMOZ. [sent-113, score-0.816]
63 [footnote 5: We discarded the Top/World portion of the hierarchy.] [sent-114, score-0.071]
64 5.3 Experimental Results The perplexities obtained for the hierarchical and n-gram models are reported in Table 1. [sent-116, score-0.23]
65 Table 1: Perplexity of the hierarchical models and the reference n-gram models over the entire DMOZ dataset (all), and the non-Regional portion of the dataset (reg). [sent-124, score-0.418]
66 When taken on the entire hierarchy (all), the performance of the Bayesian and entropy-based models significantly exceeds that of the 1-gram model (significant under paired t-test, both with p-value < 2. [sent-125, score-0.353]
67 While it is not clear how one could extend the information-theoretic models to include such context, we are currently investigating enhancements to the hLLDA model along the lines of the approach proposed in Wallach (2006). [sent-132, score-0.079]
68 A second area of analysis is to compare the performance of the various models on the entire hierarchy versus on the non-Regional portion of the tree (reg). [sent-133, score-0.469]
69 We can see that the perplexity of the proposed models decreases while that of the flat n-gram models increases. [sent-134, score-0.284]
70 Since the non-Regional portion of the DMOZ hierarchy is organized more consistently in a semantic fashion,6 we believe this reflects the ability of the hierarchical models to take advantage of [footnote 6: The specificity of the Regional sub-tree has also been discussed by previous work (Ramage et al. [sent-135, score-0.714]
71 , 2009b), justifying a special treatment for that part of the DMOZ dataset.] [sent-136, score-0.033]
72 the corpus structure to represent the content of the summaries. [sent-138, score-0.062]
73 On the other hand, the Regional portion of the dataset seems to contribute a significant amount of noise to the hierarchy, leading to a loss in performance for those models. [sent-139, score-0.071]
74 We can observe that while hLLDA outperforms all information-theoretical models when applied to the entire DMOZ corpus, it falls behind the entropy-based model when restricted to the non-Regional section of the corpus. [sent-140, score-0.076]
75 Also, while the reduction in perplexity remains limited for the entropy, χ2, and hLLDA models, the cross-entropy-based model sees a more significant boost in performance when applied to the more semantically-organized portion of the corpus. [sent-141, score-0.199]
76 The reason behind such disparity in behavior is not clear and we plan on investigating this issue as part of our future work. [sent-142, score-0.034]
77 Further analyzing the impact of the respective DMOZ sub-sections, we show in Figure 1 results for the hierarchical and 1-gram models when trained and tested over the 14 main sub-trees of the hierarchy. [sent-143, score-0.23]
78 Our intuition is that differences in the organization of those sub-trees might affect the predictive power of the various models. [sent-144, score-0.118]
79 Looking at sub-trees we can see that the trend is the same for most of them, with the best level of perplexity being achieved by the hierarchical Bayesian model, closely followed by the information-theoretical model using entropy as its selection criterion. [sent-145, score-0.313]
80 6 Conclusion In this paper we have demonstrated the creation of a topic model of Web summaries using the hierarchy of a popular Web directory. [sent-146, score-0.405]
81 This hierarchy provides a backbone around which we crystalize hierarchical topic models. [sent-147, score-0.731]
82 Individual topics exhibit increasing specificity as one goes down a path in the tree. [sent-148, score-0.336]
83 While we focused on Web summaries, this model can be readily adapted to any Web-related content that can be seen as a mixture of the component topics appearing along a path in the hierarchy. [sent-149, score-0.247]
84 The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. [sent-180, score-0.319]
85 The cluster-abstraction model: Unsupervised learning of topic hierarchies from text data. [sent-194, score-0.224]
86 The automated acquisition of topic signatures for text summarization. [sent-206, score-0.294]
87 Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. [sent-211, score-0.257]
wordName wordTfidf (topN-words)
[('dmoz', 0.45), ('hierarchy', 0.322), ('hllda', 0.225), ('topic', 0.224), ('cd', 0.215), ('hierarchical', 0.185), ('blei', 0.176), ('mixing', 0.14), ('wi', 0.132), ('perplexity', 0.128), ('ramage', 0.125), ('topics', 0.124), ('path', 0.121), ('node', 0.102), ('document', 0.1), ('lda', 0.095), ('specificity', 0.091), ('proportions', 0.086), ('multinomials', 0.086), ('regional', 0.086), ('assignments', 0.083), ('reisinger', 0.081), ('delort', 0.075), ('gists', 0.075), ('nck', 0.075), ('xsub', 0.075), ('web', 0.072), ('power', 0.072), ('draw', 0.071), ('portion', 0.071), ('signatures', 0.07), ('children', 0.066), ('flat', 0.066), ('berger', 0.065), ('content', 0.062), ('mixture', 0.061), ('mapped', 0.059), ('bayesian', 0.058), ('traversed', 0.054), ('concentration', 0.054), ('mult', 0.054), ('reg', 0.054), ('sca', 0.054), ('griffiths', 0.053), ('mittal', 0.052), ('distribution', 0.051), ('summaries', 0.05), ('appearance', 0.048), ('inferred', 0.048), ('posteriori', 0.047), ('predictive', 0.046), ('hofmann', 0.045), ('ontology', 0.045), ('zi', 0.045), ('models', 0.045), ('hovy', 0.045), ('category', 0.044), ('ck', 0.044), ('discriminate', 0.044), ('estimator', 0.044), ('tk', 0.043), ('sigir', 0.042), ('dirichlet', 0.041), ('reference', 0.041), ('assignment', 0.04), ('priori', 0.04), ('underlying', 0.04), ('thomas', 0.04), ('leaf', 0.038), ('nonparametric', 0.037), ('xj', 0.036), ('investigating', 0.034), ('documents', 0.034), ('summarization', 0.033), ('micheal', 0.033), ('navigating', 0.033), ('auai', 0.033), ('fav', 0.033), ('hector', 0.033), ('hlda', 0.033), ('justifying', 0.033), ('kapi', 0.033), ('multilabeled', 0.033), ('nallapati', 0.033), ('oerai', 0.033), ('pachinko', 0.033), ('ramesh', 0.033), ('thadani', 0.033), ('lin', 0.033), ('popular', 0.033), ('srilm', 0.032), ('child', 0.032), ('conditioned', 0.031), ('entire', 0.031), ('fashion', 0.03), ('deviations', 0.03), ('elicit', 0.03), ('hof', 0.03), ('placeholder', 0.03), ('pnas', 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 14 acl-2011-A Hierarchical Model of Web Summaries
Author: Yves Petinot ; Kathleen McKeown ; Kapil Thadani
Abstract: We investigate the relevance of hierarchical topic models to represent the content of Web gists. We focus our attention on DMOZ, a popular Web directory, and propose two algorithms to infer such a model from its manually-curated hierarchy of categories. Our first approach, based on information-theoretic grounds, uses an algorithm similar to recursive feature selection. Our second approach is fully Bayesian and derived from the more general model, hierarchical LDA. We evaluate the performance of both models against a flat 1-gram baseline and show improvements in terms of perplexity over held-out data.
2 0.19972101 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models
Author: Ivan Vulic ; Wim De Smet ; Marie-Francine Moens
Abstract: A topic model outputs a set of multinomial distributions over words for each topic. In this paper, we investigate the value of bilingual topic models, i.e., a bilingual Latent Dirichlet Allocation model for finding translations of terms in comparable corpora without using any linguistic resources. Experiments on a document-aligned English-Italian Wikipedia corpus confirm that the developed methods which only use knowledge from word-topic distributions outperform methods based on similarity measures in the original word-document space. The best results, obtained by combining knowledge from wordtopic distributions with similarity measures in the original space, are also reported.
3 0.19038619 52 acl-2011-Automatic Labelling of Topic Models
Author: Jey Han Lau ; Karl Grieser ; David Newman ; Timothy Baldwin
Abstract: We propose a method for automatically labelling topics learned via LDA topic models. We generate our label candidate set from the top-ranking topic terms, titles of Wikipedia articles containing the top-ranking topic terms, and sub-phrases extracted from the Wikipedia article titles. We rank the label candidates using a combination of association measures and lexical features, optionally fed into a supervised ranking model. Our method is shown to perform strongly over four independent sets of topics, significantly better than a benchmark method.
4 0.18751 178 acl-2011-Interactive Topic Modeling
Author: Yuening Hu ; Jordan Boyd-Graber ; Brianna Satinoff
Abstract: Topic models have been used extensively as a tool for corpus exploration, and a cottage industry has developed to tweak topic models to better encode human intuitions or to better model data. However, creating such extensions requires expertise in machine learning unavailable to potential end-users of topic modeling software. In this work, we develop a framework for allowing users to iteratively refine the topics discovered by models such as latent Dirichlet allocation (LDA) by adding constraints that enforce that sets of words must appear together in the same topic. We incorporate these constraints interactively by selectively removing elements in the state of a Markov Chain used for inference; we investigate a variety of methods for incorporating this information and demonstrate that these interactively added constraints improve topic usefulness for simulated and actual user sessions.
5 0.17970976 117 acl-2011-Entity Set Expansion using Topic information
Author: Kugatsu Sadamitsu ; Kuniko Saito ; Kenji Imamura ; Genichiro Kikui
Abstract: This paper proposes three modules based on latent topics of documents for alleviating “semantic drift” in bootstrapping entity set expansion. These new modules are added to a discriminative bootstrapping algorithm to realize topic feature generation, negative example selection and entity candidate pruning. In this study, we model latent topics with LDA (Latent Dirichlet Allocation) in an unsupervised way. Experiments show that the accuracy of the extracted entities is improved by 6.7 to 28.2% depending on the domain.
6 0.178811 98 acl-2011-Discovery of Topically Coherent Sentences for Extractive Summarization
7 0.17332762 18 acl-2011-A Latent Topic Extracting Method based on Events in a Document and its Application
8 0.15961318 287 acl-2011-Structural Topic Model for Latent Topical Structure Analysis
9 0.15928322 150 acl-2011-Hierarchical Text Classification with Latent Concepts
10 0.127729 251 acl-2011-Probabilistic Document Modeling for Syntax Removal in Text Summarization
11 0.11293586 204 acl-2011-Learning Word Vectors for Sentiment Analysis
12 0.098473467 19 acl-2011-A Mobile Touchable Application for Online Topic Graph Extraction and Exploration of Web Content
13 0.087939158 54 acl-2011-Automatically Extracting Polarity-Bearing Topics for Cross-Domain Sentiment Classification
14 0.087339006 142 acl-2011-Generalized Interpolation in Decision Tree LM
15 0.084598221 163 acl-2011-Improved Modeling of Out-Of-Vocabulary Words Using Morphological Classes
16 0.082526006 21 acl-2011-A Pilot Study of Opinion Summarization in Conversations
17 0.078386553 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction
18 0.074817054 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing
19 0.073884144 109 acl-2011-Effective Measures of Domain Similarity for Parsing
20 0.073410019 24 acl-2011-A Scalable Probabilistic Classifier for Language Modeling
topicId topicWeight
[(0, 0.211), (1, 0.083), (2, -0.054), (3, 0.11), (4, -0.03), (5, -0.108), (6, -0.15), (7, 0.226), (8, -0.016), (9, 0.083), (10, -0.087), (11, 0.029), (12, 0.082), (13, 0.039), (14, 0.151), (15, 0.003), (16, -0.052), (17, -0.055), (18, -0.027), (19, 0.08), (20, 0.042), (21, 0.044), (22, 0.025), (23, -0.023), (24, 0.002), (25, -0.041), (26, 0.015), (27, 0.021), (28, -0.102), (29, 0.037), (30, -0.033), (31, -0.025), (32, -0.042), (33, 0.042), (34, 0.035), (35, 0.001), (36, 0.025), (37, -0.039), (38, 0.005), (39, -0.002), (40, 0.044), (41, 0.007), (42, 0.025), (43, -0.007), (44, -0.048), (45, 0.01), (46, -0.079), (47, -0.005), (48, 0.042), (49, 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 0.9706226 14 acl-2011-A Hierarchical Model of Web Summaries
Author: Yves Petinot ; Kathleen McKeown ; Kapil Thadani
Abstract: We investigate the relevance of hierarchical topic models to represent the content of Web gists. We focus our attention on DMOZ, a popular Web directory, and propose two algorithms to infer such a model from its manually-curated hierarchy of categories. Our first approach, based on information-theoretic grounds, uses an algorithm similar to recursive feature selection. Our second approach is fully Bayesian and derived from the more general model, hierarchical LDA. We evaluate the performance of both models against a flat 1-gram baseline and show improvements in terms of perplexity over held-out data.
2 0.87002683 178 acl-2011-Interactive Topic Modeling
Author: Yuening Hu ; Jordan Boyd-Graber ; Brianna Satinoff
Abstract: Topic models have been used extensively as a tool for corpus exploration, and a cottage industry has developed to tweak topic models to better encode human intuitions or to better model data. However, creating such extensions requires expertise in machine learning unavailable to potential end-users of topic modeling software. In this work, we develop a framework for allowing users to iteratively refine the topics discovered by models such as latent Dirichlet allocation (LDA) by adding constraints that enforce that sets of words must appear together in the same topic. We incorporate these constraints interactively by selectively removing elements in the state of a Markov Chain used for inference; we investigate a variety of methods for incorporating this information and demonstrate that these interactively added constraints improve topic usefulness for simulated and actual user sessions.
3 0.86660415 287 acl-2011-Structural Topic Model for Latent Topical Structure Analysis
Author: Hongning Wang ; Duo Zhang ; ChengXiang Zhai
Abstract: Topic models have been successfully applied to many document analysis tasks to discover topics embedded in text. However, existing topic models generally cannot capture the latent topical structures in documents. Since languages are intrinsically cohesive and coherent, modeling and discovering latent topical transition structures within documents would be beneficial for many text analysis tasks. In this work, we propose a new topic model, Structural Topic Model, which simultaneously discovers topics and reveals the latent topical structures in text through explicitly modeling topical transitions with a latent first-order Markov chain. Experiment results show that the proposed Structural Topic Model can effectively discover topical structures in text, and the identified structures significantly improve the performance of tasks such as sentence annotation and sentence ordering. ,
4 0.81698221 98 acl-2011-Discovery of Topically Coherent Sentences for Extractive Summarization
Author: Asli Celikyilmaz ; Dilek Hakkani-Tur
Abstract: Extractive methods for multi-document summarization are mainly governed by information overlap, coherence, and content constraints. We present an unsupervised probabilistic approach to model the hidden abstract concepts across documents as well as the correlation between these concepts, to generate topically coherent and non-redundant summaries. Based on human evaluations our models generate summaries with higher linguistic quality in terms of coherence, readability, and redundancy compared to benchmark systems. Although our system is unsupervised and optimized for topical coherence, we achieve a 44.1 ROUGE on the DUC-07 test set, roughly in the range of state-of-the-art supervised models.
5 0.80245805 52 acl-2011-Automatic Labelling of Topic Models
Author: Jey Han Lau ; Karl Grieser ; David Newman ; Timothy Baldwin
Abstract: We propose a method for automatically labelling topics learned via LDA topic models. We generate our label candidate set from the top-ranking topic terms, titles of Wikipedia articles containing the top-ranking topic terms, and sub-phrases extracted from the Wikipedia article titles. We rank the label candidates using a combination of association measures and lexical features, optionally fed into a supervised ranking model. Our method is shown to perform strongly over four independent sets of topics, significantly better than a benchmark method.
6 0.80020744 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models
7 0.7753163 251 acl-2011-Probabilistic Document Modeling for Syntax Removal in Text Summarization
8 0.76481175 18 acl-2011-A Latent Topic Extracting Method based on Events in a Document and its Application
9 0.75377983 117 acl-2011-Entity Set Expansion using Topic information
10 0.70070255 305 acl-2011-Topical Keyphrase Extraction from Twitter
11 0.65055549 150 acl-2011-Hierarchical Text Classification with Latent Concepts
12 0.58557487 19 acl-2011-A Mobile Touchable Application for Online Topic Graph Extraction and Exploration of Web Content
13 0.56938541 17 acl-2011-A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation
14 0.54002368 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction
15 0.51390421 142 acl-2011-Generalized Interpolation in Decision Tree LM
16 0.48639482 76 acl-2011-Comparative News Summarization Using Linear Programming
17 0.4733018 295 acl-2011-Temporal Restricted Boltzmann Machines for Dependency Parsing
18 0.45067868 21 acl-2011-A Pilot Study of Opinion Summarization in Conversations
19 0.44761831 82 acl-2011-Content Models with Attitude
20 0.44517258 54 acl-2011-Automatically Extracting Polarity-Bearing Topics for Cross-Domain Sentiment Classification
topicId topicWeight
[(5, 0.029), (17, 0.056), (26, 0.015), (28, 0.013), (37, 0.105), (39, 0.071), (41, 0.074), (44, 0.011), (53, 0.011), (55, 0.049), (59, 0.042), (72, 0.047), (91, 0.027), (96, 0.148), (97, 0.221)]
simIndex simValue paperId paperTitle
1 0.95325583 315 acl-2011-Types of Common-Sense Knowledge Needed for Recognizing Textual Entailment
Author: Peter LoBue ; Alexander Yates
Abstract: Understanding language requires both linguistic knowledge and knowledge about how the world works, also known as common-sense knowledge. We attempt to characterize the kinds of common-sense knowledge most often involved in recognizing textual entailments. We identify 20 categories of common-sense knowledge that are prevalent in textual entailment, many of which have received scarce attention from researchers building collections of knowledge.
2 0.89696544 10 acl-2011-A Discriminative Model for Joint Morphological Disambiguation and Dependency Parsing
Author: John Lee ; Jason Naradowsky ; David A. Smith
Abstract: Most previous studies of morphological disambiguation and dependency parsing have been pursued independently. Morphological taggers operate on n-grams and do not take into account syntactic relations; parsers use the “pipeline” approach, assuming that morphological information has been separately obtained. However, in morphologically-rich languages, there is often considerable interaction between morphology and syntax, such that neither can be disambiguated without the other. In this paper, we propose a discriminative model that jointly infers morphological properties and syntactic structures. In evaluations on various highly-inflected languages, this joint model outperforms both a baseline tagger in morphological disambiguation, and a pipeline parser in head selection.
3 0.89267182 167 acl-2011-Improving Dependency Parsing with Semantic Classes
Author: Eneko Agirre ; Kepa Bengoetxea ; Koldo Gojenola ; Joakim Nivre
Abstract: This paper presents the introduction of WordNet semantic classes in a dependency parser, obtaining improvements on the full Penn Treebank for the first time. We tried different combinations of some basic semantic classes and word sense disambiguation algorithms. Our experiments show that selecting the adequate combination of semantic features on development data is key for success. Given the basic nature of the semantic classes and word sense disambiguation algorithms used, we think there is ample room for future improvements. 1
4 0.8605752 336 acl-2011-Why Press Backspace? Understanding User Input Behaviors in Chinese Pinyin Input Method
Author: Yabin Zheng ; Lixing Xie ; Zhiyuan Liu ; Maosong Sun ; Yang Zhang ; Liyun Ru
Abstract: Chinese Pinyin input method is very important for Chinese language information processing. Users may make errors when they are typing in Chinese words. In this paper, we are concerned with the reasons that cause the errors. Inspired by the observation that pressing backspace is one of the most common user behaviors to modify the errors, we collect 54, 309, 334 error-correction pairs from a realworld data set that contains 2, 277, 786 users via backspace operations. In addition, we present a comparative analysis of the data to achieve a better understanding of users’ input behaviors. Comparisons with English typos suggest that some language-specific properties result in a part of Chinese input errors. 1
same-paper 5 0.83860326 14 acl-2011-A Hierarchical Model of Web Summaries
Author: Yves Petinot ; Kathleen McKeown ; Kapil Thadani
Abstract: We investigate the relevance of hierarchical topic models to represent the content of Web gists. We focus our attention on DMOZ, a popular Web directory, and propose two algorithms to infer such a model from its manually-curated hierarchy of categories. Our first approach, based on information-theoretic grounds, uses an algorithm similar to recursive feature selection. Our second approach is fully Bayesian and derived from the more general model, hierarchical LDA. We evaluate the performance of both models against a flat 1-gram baseline and show improvements in terms of perplexity over held-out data.
6 0.76639652 158 acl-2011-Identification of Domain-Specific Senses in a Machine-Readable Dictionary
7 0.75251067 164 acl-2011-Improving Arabic Dependency Parsing with Form-based and Functional Morphological Features
8 0.73685992 229 acl-2011-NULEX: An Open-License Broad Coverage Lexicon
9 0.73279595 309 acl-2011-Transition-based Dependency Parsing with Rich Non-local Features
10 0.72619659 111 acl-2011-Effects of Noun Phrase Bracketing in Dependency Parsing and Machine Translation
11 0.72204608 304 acl-2011-Together We Can: Bilingual Bootstrapping for WSD
12 0.72200543 11 acl-2011-A Fast and Accurate Method for Approximate String Search
13 0.71998602 222 acl-2011-Model-Portability Experiments for Textual Temporal Analysis
14 0.71743798 320 acl-2011-Unsupervised Discovery of Domain-Specific Knowledge from Text
15 0.71714711 242 acl-2011-Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments
16 0.71431327 44 acl-2011-An exponential translation model for target language morphology
17 0.71229643 178 acl-2011-Interactive Topic Modeling
18 0.71154052 28 acl-2011-A Statistical Tree Annotator and Its Applications
19 0.71101969 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment
20 0.71060473 196 acl-2011-Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models