emnlp emnlp2012 emnlp2012-115 knowledge-graph by maker-knowledge-mining

115 emnlp-2012-SSHLDA: A Semi-Supervised Hierarchical Topic Model


Source: pdf

Author: Xian-Ling Mao ; Zhao-Yan Ming ; Tat-Seng Chua ; Si Li ; Hongfei Yan ; Xiaoming Li

Abstract: Supervised hierarchical topic modeling and unsupervised hierarchical topic modeling are usually used to obtain hierarchical topics, such as hLLDA and hLDA. Supervised hierarchical topic modeling makes heavy use of the information from observed hierarchical labels, but cannot explore new topics; while unsupervised hierarchical topic modeling is able to detect automatically new topics in the data space, but does not make use of any information from hierarchical labels. In this paper, we propose a semi-supervised hierarchical topic model which aims to explore new topics automatically in the data space while incorporating the information from observed hierarchical labels into the modeling process, called Semi-Supervised Hierarchical Latent Dirichlet Allocation (SSHLDA). We also prove that hLDA and hLLDA are special cases of SSHLDA. We conduct experiments on Yahoo! Answers and ODP datasets, and assess the performance in terms of perplexity and clustering. The experimental results show that the predictive ability of SSHLDA is better than that of baselines, and SSHLDA can also achieve significant improvement over baselines for clustering on the FScore measure.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Supervised hierarchical topic modeling and unsupervised hierarchical topic modeling are usually used to obtain hierarchical topics, such as hLLDA and hLDA. [sent-6, score-1.157]

2 1 Introduction Topic models, such as latent Dirichlet allocation (LDA), are useful NLP tools for the statistical analysis of document collections and other discrete data. [sent-14, score-0.253]

3 Furthermore, hierarchical topic modeling is able to obtain the relations between topics, such as parent-child and sibling relations. [sent-24, score-0.669]

4 Unsupervised hierarchical topic modeling is able to detect automatically new topics in the data space, such as hierarchical Latent Dirichlet Allocation (hLDA) (Blei et al. [sent-25, score-0.833]

5 hLDA makes use of the nested Dirichlet Process to automatically obtain an L-level hierarchy of topics. [sent-27, score-0.243]

6 They are usually documents with hierarchical labels such as Web pages and their placement in hierarchical directories (Ming et al. [sent-29, score-0.512]

7 Unsupervised hierarchical topic modeling cannot make use of any information from hierarchical labels, thus supervised hierarchical topic models, such as hierarchical Labeled Latent Dirichlet Allocation (hLLDA) (Petinot et al. [sent-31, score-1.302]

8 hLLDA uses hierarchical labels to automatically build a corresponding topic for each label, but it cannot find new latent topics in the data space, depending only on the hierarchy of labels. [sent-33, score-0.994]

9 We think that a corpus with hierarchical labels should include not only the observed topics of labels, but also more latent topics, just like icebergs. [sent-35, score-0.59]

10 … node in BT, called Leaf Topic Hierarchy (LTH); finally, we add each LTH to the corresponding leaf in the BT and obtain a hierarchy for the entire dataset. [sent-40, score-0.317]

11 To tackle the above drawbacks, we explore the use of probabilistic models for such a task where the hierarchical labels are merely viewed as a part of a hierarchy of topics, and the topics of a path in the whole hierarchy generate a corresponding document. [sent-44, score-0.904]

12 Our proposed generative model learns both the latent topics of the underlying data and the labeling strategies in a joint model, by leveraging the hierarchical structure of labels and the Hierarchical Dirichlet Process. [sent-45, score-0.57]

13 We demonstrate the effectiveness of the proposed model on large, real-world datasets in the question answering and website category domains on two tasks: the topic modeling of documents, and the use of the generated topics for document clustering. [sent-46, score-0.589]

14 Our results show that our joint, semi-hierarchical model outperforms the state-of-the-art supervised and unsupervised hierarchical algorithms. [sent-47, score-0.249]

15 The contributions of this paper are threefold: (1) We propose a joint, generative semi-supervised hierarchical topic model, i. [sent-48, score-0.455]

16 SSHLDA is able not only to explore new latent topics in the data space, but also to make use of the information from the hierarchy of observed labels; (2) We prove that hLDA and hLLDA are special cases of SSHLDA; (3) We develop a Gibbs sampling inference algorithm for the proposed model. [sent-51, score-0.605]

17 2 Related Work There have been many variations of topic models. [sent-57, score-0.236]

18 The existing topic models can be divided into four categories: Unsupervised non-hierarchical topic models, Unsupervised hierarchical topic models, and their corresponding supervised counterparts. [sent-58, score-0.932]

19 Unsupervised non-hierarchical topic models are widely studied, such as LSA (Deerwester et al. [sent-59, score-0.236]

20 LDA is similar to pLSA, except that in LDA the topic distribution is assumed to have a Dirichlet prior. [sent-67, score-0.271]
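To make the LDA setting concrete, here is a minimal numpy sketch of LDA's generative story with a Dirichlet prior on the per-document topic distribution. The vocabulary size, topic count, and hyperparameter values are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Illustrative sketch of LDA's generative story (all sizes/priors are assumptions).
rng = np.random.default_rng(0)
V, K, alpha, eta = 1000, 20, 0.1, 0.01   # vocab size, number of topics, Dirichlet priors

beta = rng.dirichlet([eta] * V, size=K)  # K topic-word distributions, beta_k ~ Dir(eta)

def generate_document(n_words):
    """theta ~ Dir(alpha); for each word: z ~ Mult(theta), w ~ Mult(beta_z)."""
    theta = rng.dirichlet([alpha] * K)           # per-document topic proportions
    z = rng.choice(K, size=n_words, p=theta)     # per-word topic assignments
    return [int(rng.choice(V, p=beta[k])) for k in z]

doc = generate_document(50)                      # a toy document of 50 word ids
```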

21 Another famous model that not only represents topic correlations, but also learns them, is the Correlated Topic Model (CTM). [sent-69, score-0.26]

22 (2004) proposed the hLDA model that simultaneously learns the structure of a topic hierarchy and the topics that are contained within that hierarchy. [sent-78, score-0.634]

23 This algorithm can be used to extract topic hierarchies from large document collections. [sent-79, score-0.342]

24 Although unsupervised topic models are sufficiently expressive to model multiple topics per document, they are inappropriate for labeled corpora because they are unable to incorporate the observed labels into their learning procedure. [sent-80, score-0.597]

25 However, these models obtain topics that do not correspond directly with the labels. [sent-92, score-0.215]

26 For hierarchical labeled data, there are also few models that are able to handle the label relations in data. [sent-96, score-0.272]

27 Although hLLDA can obtain a distribution over words for each label, hLLDA is unable to capture the relations between parent and child nodes using parameters, and it also cannot automatically detect latent topics in the data space. [sent-101, score-0.379]

28 In this paper, we will propose a generative topic model to tackle these problems of hLLDA. [sent-102, score-0.261]

29 3 Preliminaries The nested Chinese restaurant process (nCRP) is a distribution over hierarchical partitions (Blei et al. [sent-103, score-0.318]

30 In this paper, we will make use of nested CRP to explore latent topics in data space. [sent-114, score-0.308]
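As a hedged illustration of how the nested CRP yields paths, the sketch below draws a root-to-leaf path level by level: a customer sits at an existing table with probability proportional to its count, or opens a new table with probability proportional to γ. The tree representation and the value of γ are assumptions for illustration.

```python
import random

class NCRPNode:
    """One restaurant in a nested CRP tree; its children are the tables."""
    def __init__(self):
        self.customers = 0
        self.children = []

def sample_ncrp_path(root, depth, gamma=1.0):
    """Draw one root-to-leaf path of length `depth` from the nested CRP."""
    path, node = [root], root
    node.customers += 1
    for _ in range(depth - 1):
        counts = [c.customers for c in node.children] + [gamma]  # existing tables + new table
        total = sum(counts)
        r, acc, idx = random.uniform(0, total), 0.0, 0
        for idx, w in enumerate(counts):
            acc += w
            if r <= acc:
                break
        if idx == len(node.children):      # the "new table" bucket: create a new child node
            node.children.append(NCRPNode())
        node = node.children[idx]
        node.customers += 1
        path.append(node)
    return path

root = NCRPNode()
paths = [sample_ncrp_path(root, depth=3) for _ in range(5)]   # 5 documents share the tree
```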

31 If a model can learn a distribution over words for a label, we refer to the topic with a corresponding label as a labeled topic. [sent-116, score-0.349]

32 If a model can learn an unseen and latent topic without a label, we refer to it as a latent topic. (Figure 1: The graphical model of SSHLDA.) [sent-117, score-0.325]

33 4 The Semi-Supervised Hierarchical Topic Model In this section, we will introduce a semi-supervised hierarchical topic model, i.e. [sent-119, score-0.451]

34 SSHLDA is a probabilistic graphical model that describes a process for generating a hierarchical labeled document collection. [sent-122, score-0.301]

35 , 2011), SSHLDA can incorporate labeled topics into the generative process of documents. [sent-124, score-0.239]

36 , 2004), SSHLDA can automatically explore latent topics in the data space, and extend the existing hierarchy of observed topics. [sent-126, score-0.583]

37 ) is a k-dimensional Dirichlet distribution, Tj is a set of paths in the hierarchy of latent topics for the jth leaf node in the hierarchy of observed … (Figure 2: One illustration of SSHLDA.) [sent-130, score-0.796]

38 The shaded nodes are observed topics, and circled nodes are latent topics. [sent-132, score-0.297]

39 The latent topics are generated automatically by SSHLDA model. [sent-133, score-0.274]

40 SSHLDA, as shown in Figure 1, assumes the following generative process: (1) For each table k ∈ T in the infinite tree, (a) Draw a topic βk ∼ Dir(η). [sent-139, score-0.293]

41 (c) Draw an L-dimensional topic proportion vector θm from Dir(α). [sent-149, score-0.236]
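A hedged sketch of the document-level part of this generative process: given the path of L topics assigned to a document (observed label topics followed by latent topics), draw θm ~ Dir(α) over the L levels and emit each word from the topic at its sampled level. All sizes, topic ids, and hyperparameters below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
V, L, alpha, eta = 500, 5, 1.0, 0.01        # vocab size, tree depth, priors (illustrative)
n_topics = 12                               # total topics in the tree (observed + latent)
topic_word = rng.dirichlet([eta] * V, size=n_topics)   # beta_k ~ Dir(eta) for every node

def generate_document(path_topic_ids, n_words):
    """Emit words for one document from the L topics on its path (SSHLDA-style sketch)."""
    theta = rng.dirichlet([alpha] * L)                   # theta_m over the L levels
    words = []
    for _ in range(n_words):
        level = rng.choice(L, p=theta)                   # z_{m,n}: level on the path
        k = path_topic_ids[level]                        # topic sitting at that level
        words.append(int(rng.choice(V, p=topic_word[k])))  # w_{m,n} ~ Mult(beta_k)
    return words

# A path: first levels are observed label topics, deeper levels are latent topics.
doc = generate_document(path_topic_ids=[0, 3, 7, 9, 11], n_words=40)
```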

42 As the example shown in Figure 2, we assume that we have known a hierarchy of observed topics: {A1,A2,A17,A3,A4}, and assume the height of the desired topical tree is L = 5. [sent-161, score-0.52]

43 All the circled nodes are latent topics, and shaded nodes are observed topics. [sent-162, score-0.297]

44 After getting …, each word comes from one of the topics in this set of topics cm. [sent-166, score-0.37]

45 5 Probabilistic Inference In this section, we describe a Gibbs sampling algorithm for sampling from the posterior and corresponding topics in the SSHLDA model. [sent-167, score-0.251]

46 The Gibbs sampler provides a method for simultaneously exploring the model parameter space (the latent topics of the whole corpus) and the model structure space (L-level trees). [sent-168, score-0.274]

47 In SSHLDA, we sample the paths cm for document m and the per-word level allocations to topics in those paths zm,n. [sent-169, score-0.438]
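As a hedged illustration of the second step, the function below resamples one per-word level allocation z_{m,n} in a collapsed Gibbs sweep using standard LDA-style count arrays; the data structures and hyperparameters are assumptions for the sketch and do not reproduce the paper's exact bookkeeping.

```python
import numpy as np

def resample_level(n, m, w, z, path, doc_level_counts, topic_word_counts, topic_counts,
                   alpha, eta, V, rng):
    """Resample z[m][n], the level (0..L-1) of word w in document m, collapsed-Gibbs style."""
    old = z[m][n]
    k_old = path[m][old]
    doc_level_counts[m][old] -= 1          # remove the word's current assignment
    topic_word_counts[k_old][w] -= 1
    topic_counts[k_old] -= 1

    L = len(path[m])
    probs = np.empty(L)
    for l in range(L):
        k = path[m][l]
        probs[l] = (doc_level_counts[m][l] + alpha) * \
                   (topic_word_counts[k][w] + eta) / (topic_counts[k] + V * eta)
    probs /= probs.sum()

    new = int(rng.choice(L, p=probs))      # draw the new level for this word
    k_new = path[m][new]
    z[m][n] = new
    doc_level_counts[m][new] += 1          # add the word back under the new assignment
    topic_word_counts[k_new][w] += 1
    topic_counts[k_new] += 1
```

In a full sweep this step would be applied to every word of every document, alternating with the path resampling step described above.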

48 … 1 will tend to choose those topics with fewer (high-probability words); µci is the Dirichlet prior of δci, and µ is the set of µci. [sent-175, score-0.291]

49 wm,n denotes the nth word in the mth document; and cm,l represents the restaurant corresponding to the lth-level topic in document m; and zm,n, the assignment of the nth word in the mth document to one of the L available topics. [sent-176, score-0.535]

50 Note that cm is composed of com and cem: com is the set of observed topics for document m, and cem is the set of latent topics for document m. [sent-181, score-0.981]

51 The conditional distribution for cm, the L topics associated with document m, is:

p(cm | z, w, c−m, µ) = p(com | µ) p(cem | zem, wem, ce−m)
                     ∝ p(com | µ) p(wem | cem, we−m, zem) p(cem | ce−m)   (2)

p(com | µ) = ∏_{i=0}^{|com|−1} p(ci,m | µci)   (3)

and

p(wem | cem, we−m, zem) = ∏_{l=1}^{|cem|} [ Γ(n(·)cem,l,−m + |V|η) / ∏_w Γ(n(w)cem,l,−m + η) ] · [ ∏_w Γ(n(w)cem,l,−m + n(w)cem,l,m + η) / Γ(n(·)cem,l,−m + n(·)cem,l,m + |V|η) ]   (4)

[sent-182, score-0.298]

52 where ce−m is the set of latent topics for all documents other than m, zem is the assignment of the words in the mth document to one of the latent topics, and wem is the set of the words belonging to one of the latent topics in the mth document. [sent-184, score-0.949]

53 n(w)cem,l,−m is the number of instances of word w that have been assigned to the topic indexed by cem,l, not including those in document m. [sent-185, score-0.314]
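A hedged sketch of evaluating the collapsed word likelihood of Eq. (4) in log space for one candidate path of latent topics, using math.lgamma; the count dictionaries and toy values are assumptions for illustration.

```python
import math
from collections import Counter

def log_path_word_likelihood(doc_level_words, path, topic_word_counts, topic_totals, eta, V):
    """log p(wem | cem, we−m, zem) for one candidate path of latent topics.

    doc_level_words[l]: Counter of document m's words assigned to latent level l.
    topic_word_counts[k][w], topic_totals[k]: counts excluding document m's words.
    """
    logp = 0.0
    for l, k in enumerate(path):
        n_minus = topic_totals.get(k, 0)
        n_doc = sum(doc_level_words[l].values())
        logp += math.lgamma(n_minus + V * eta) - math.lgamma(n_minus + n_doc + V * eta)
        for w, c in doc_level_words[l].items():
            nw_minus = topic_word_counts.get(k, {}).get(w, 0)
            logp += math.lgamma(nw_minus + c + eta) - math.lgamma(nw_minus + eta)
    return logp

# Toy usage: one latent level, one candidate topic, made-up counts.
print(log_path_word_likelihood([Counter({3: 2, 7: 1})], path=[0],
                               topic_word_counts={0: {3: 5}}, topic_totals={0: 20},
                               eta=0.01, V=100))
```

Terms for words the document never uses at a level cancel between numerator and denominator, which is why the inner loop only visits the document's own words.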

54 After obtaining individual word assignments z, we can estimate the topic multinomials and the per-document mixing proportions. [sent-191, score-0.258]

55 Specifically, the topic multinomials are estimated as: βcm,j,i = p(wi | zcm,j) = (η + n(wi)zcm,j) / (|V|η + ∑_w n(w)zcm,j). [sent-192, score-0.236]
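A small sketch of this smoothed estimate, assuming a topic-by-vocabulary count matrix accumulated from the sampled assignments z; the names and toy counts are illustrative.

```python
import numpy as np

def estimate_topic_multinomials(topic_word_counts, eta):
    """beta[k, w] = (eta + n_k^w) / (|V| * eta + sum_w n_k^w), the smoothed estimate above."""
    counts = np.asarray(topic_word_counts, dtype=float)   # shape (K, V)
    V = counts.shape[1]
    return (counts + eta) / (counts.sum(axis=1, keepdims=True) + V * eta)

beta = estimate_topic_multinomials([[3, 0, 1], [0, 2, 2]], eta=0.01)
print(beta.sum(axis=1))   # each row sums to 1
```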

56 1 Relation to Existing Models In this section, we draw comparisons with the current state-of-the-art models for hierarchical topic modeling (Blei et al. [sent-199, score-0.493]

57 Our proposed model provides a unified framework allowing us to model hierarchical labels while to explore new latent topics. [sent-203, score-0.36]

58 Equivalence to hLDA As introduced in Section 2, hLDA is an unsupervised hierarchical topic model. [sent-204, score-0.455]

59 In this case, there are no observed nodes, that is, the corpus has no hierarchical labels. [sent-205, score-0.239]

60 Equivalence to hLLDA hLLDA is a supervised hierarchical topic model, which means all nodes in the hierarchy are observed. [sent-208, score-0.551]

61 6 Experiments We demonstrate the effectiveness of the proposed model on large, real-world datasets in the question answering and website category domains on two tasks: the topic modeling of documents, and the use of the generated topics for document clustering. [sent-211, score-0.589]

62 From this table, we can see that these datasets are very diverse: Y Ans has far fewer labels than O Hlth and O Home, but has many more documents for each label; meanwhile the depth of the hierarchical tree for O Hlth and O Home can reach level 9 or above. [sent-227, score-0.34]

63 2 Case Study With topic modeling, the top associated words of topics can be used as good descriptors for topics in a hierarchy (Blei et al. [sent-232, score-0.819]

64 The tree-based topic visualizations of Figure 3 (a) and (b) are the results of SSHLDA and Simp-hLDA. [sent-235, score-0.236]

65 We have three major observations from the example: (i) SSHLDA is a unified and generative model; after learning, it can obtain a hierarchy of topics; ∗http://dmoz. [sent-236, score-0.268]

66 In both figures, the shaded and squared nodes are observed labels, not topics; the shaded and round nodes are topics with observed labels; blue nodes are topics without labels, and the yellow node is one of the leaves in the hierarchy of labels. [sent-238, score-1.053]

67 while Simp-hLDA is a heuristic method, and its result is a mixture of label nodes and topical nodes. [sent-240, score-0.215]

68 For example, Figure 3 (b) shows that the hierarchy includes label nodes and topic nodes, and each of the labeled nodes just has a label, but label nodes in Figure 3 (a) have their corresponding topics. [sent-241, score-0.765]

69 For example, in Figure 3 (b), although label "root" is a parent of label "Computers & Internet", the topical words of label "Computers & Internet" show that the topical node is not a child of label "root". [sent-243, score-0.442]

70 (iii) In a hierarchy of topics, if a topical node has a corresponding label, the label can help people understand descendant topical nodes. [sent-245, score-0.508]

71 For example, when we know node "error files click screen virus" in Figure 3 (a) has its label "Computers & Internet", we can understand the child topic "hard screen usb power dell" is about "computer hardware". [sent-246, score-0.325]

72 3 Perplexity Comparison A good topic model should be able to generalize to unseen data. [sent-250, score-0.236]

73 Perplexity, which is widely used in the language modeling and topic modeling community, is equivalent algebraically to the inverse of the geometric mean per-word likelihood (Blei et al. [sent-252, score-0.284]

74 The perplexity of M test documents is calculated as: perplexity(Dtest) = exp{ − (∑_{d=1}^{M} ∑_{m=1}^{Nd} log p(wdm)) / (∑_{d=1}^{M} Nd) }   (10) where Dtest is the test collection of M documents, Nd is the length of document d, and wdm is the mth word in document d. [sent-262, score-0.417]
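A hedged sketch of Eq. (10): given the per-word log-likelihoods log p(wdm) that a trained model assigns to the test documents, perplexity is the exponentiated negative average per-word log-likelihood. The toy inputs are assumptions for illustration.

```python
import math

def perplexity(per_doc_word_loglikes):
    """perplexity(Dtest) = exp( - sum_d sum_m log p(wdm) / sum_d Nd ), per Eq. (10)."""
    total_loglike = sum(sum(doc) for doc in per_doc_word_loglikes)
    total_words = sum(len(doc) for doc in per_doc_word_loglikes)
    return math.exp(-total_loglike / total_words)

# Two toy test documents with per-word log-likelihoods from some trained model.
print(perplexity([[-5.1, -4.8, -6.0], [-5.5, -4.9]]))
```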

75 It shows that the performance of SSHLDA for L ∈ {5, 6, 7} is always better than the state-of-the-art baselines, which means that our proposed model can model the hierarchical labeled data better than the state-of-the-art models. [sent-270, score-0.223]

76 Given a document collection DS with an H-level hierarchy of labels, each label in the hierarchy and its corresponding documents will be taken as the ground truth of clustering algorithms. [sent-276, score-0.64]

77 Base Tree (BT), and we need to construct an L-level hierarchy (l < L <= H) over the documents DS using a … (Figure 4: Perplexities of hLLDA, hLDA, Simp-hLDA and SSHLDA.) [sent-281, score-0.26]

78 The results are run over the O Hlth dataset, with the height of the hierarchy of observed labels l = 3. [sent-282, score-0.494]

79 The X-axis is the height of the whole topical tree (L), and Y-axis is the perplexity. [sent-283, score-0.295]

80 (ii) for Simp-hLDA, hLDA runs on the documents in each leaf-node in BT, and the height parameter is (L − l) for each hLDA. [sent-287, score-0.206]

81 After training, each document is assigned to the top-1 topic according to the distribution over topics for the document. [sent-288, score-0.456]

82 Each topic and its corresponding documents form a new cluster. [sent-289, score-0.283]

83 Each topic and its corresponding documents form a new cluster. [sent-292, score-0.283]

84 Topics and their corresponding documents form a hierarchy of clusters. [sent-295, score-0.26]
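A hedged sketch of this cluster construction: assign each document to its highest-probability topic, and let each topic's documents form a cluster; the flat list of per-document topic distributions is an assumption for illustration.

```python
import numpy as np

def clusters_from_top_topic(doc_topic_dists):
    """Assign each document to its top-1 topic; documents sharing a topic form a cluster."""
    clusters = {}
    for doc_id, dist in enumerate(doc_topic_dists):
        top = int(np.argmax(dist))
        clusters.setdefault(top, set()).add(doc_id)
    return clusters

# Three toy documents over three topics -> clusters keyed by topic id.
print(clusters_from_top_topic([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.6, 0.3, 0.1]]))
```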

85 Thus we can use clustering metrics to measure the quality of various algorithms by using a measure that takes into account the overall set of clusters that are represented in the newly generated part of a hierarchical tree. [sent-299, score-0.234]

86 The FScore of the class Cr is the maximum FScore value attained at any node in the hierarchical clustering tree T. [sent-303, score-0.307]
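A hedged sketch of the FScore measure: for each ground-truth class Cr, take the best F1 attained by any node of the generated tree; the overall score below aggregates classes by a size-weighted average, which is the common convention and an assumption here, as is the flat set-of-documents representation of tree nodes.

```python
def fscore(classes, tree_nodes):
    """Hierarchical-clustering FScore.

    classes: dict class_name -> set of document ids (ground truth).
    tree_nodes: list of sets of document ids, one per node of the clustering tree.
    Returns the size-weighted average, over classes, of the best F1 at any node.
    """
    total_docs = sum(len(c) for c in classes.values())
    score = 0.0
    for c_docs in classes.values():
        best = 0.0
        for node_docs in tree_nodes:
            overlap = len(c_docs & node_docs)
            if overlap == 0:
                continue
            precision = overlap / len(node_docs)
            recall = overlap / len(c_docs)
            best = max(best, 2 * precision * recall / (precision + recall))
        score += (len(c_docs) / total_docs) * best
    return score

# Toy ground truth and toy tree nodes.
print(fscore({"A": {1, 2, 3}, "B": {4, 5}}, [{1, 2}, {3, 4, 5}, {1, 2, 3, 4, 5}]))
```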

87 2 Experimental Results Each of hLDA, Simp-hLDA and SSHLDA needs a parameter—the height of the topical tree, i. [sent-310, score-0.262]

88 L; and for Simp-hLDA and SSHLDA, they need another parameter—the height of the hierarchical observed labels, i. [sent-312, score-0.398]

89 The h-clustering does not have any height parameters; thus its FScore keeps the same value at different heights of the topical tree. [sent-314, score-0.421]

90 Choosing the height of the hierarchical labels for O Home as 4, i.e. [sent-315, score-0.43]

91 l = 4, the results of our model and baselines with respect to the height of a hierarchy are shown in Figure 5. [sent-317, score-0.372]

92 The results are run over the O Home dataset, with the height of the hierarchy of observed labels l= 3. [sent-330, score-0.494]

93 The X-axis is the height of the whole topical tree (L), and Y-axis is the FScore measure. [sent-331, score-0.295]

94 7 Conclusion and Future work In this paper, we have proposed a semi-supervised hierarchical topic model, i.e. [sent-336, score-0.43]

95 Specifically, SSHLDA incorporates the information of labels into the generative process of topic modeling while exploring latent topics in the data space. [sent-339, score-0.636]

96 In the future, we will continue to explore novel topic models for hierarchical labeled data to further improve the effectiveness; meanwhile we will also apply SSHLDA to other media forms, such as images, to solve related problems in these areas. [sent-344, score-0.459]

97 Hierarchical topic models and the nested Chinese restaurant process. [sent-389, score-0.325]

98 Text modeling using unsupervised topic models and concept hierarchies. [sent-412, score-0.285]

99 Prototype hierarchy based clustering for the categorization and navigation of web collections. [sent-480, score-0.253]

100 Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. [sent-512, score-0.266]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('sshlda', 0.514), ('hlda', 0.403), ('hllda', 0.292), ('topic', 0.236), ('hierarchy', 0.213), ('hierarchical', 0.194), ('topics', 0.185), ('height', 0.159), ('fscore', 0.151), ('blei', 0.12), ('dirichlet', 0.106), ('topical', 0.103), ('cm', 0.103), ('lda', 0.09), ('latent', 0.089), ('allocation', 0.086), ('cem', 0.084), ('chemudugunta', 0.084), ('bt', 0.084), ('ramage', 0.083), ('document', 0.078), ('labels', 0.077), ('petinot', 0.069), ('computers', 0.064), ('home', 0.064), ('perplexity', 0.064), ('nodes', 0.063), ('internet', 0.062), ('crp', 0.06), ('ans', 0.06), ('hlth', 0.056), ('wem', 0.056), ('restaurant', 0.055), ('cr', 0.053), ('label', 0.049), ('arxiv', 0.048), ('rubin', 0.048), ('documents', 0.047), ('ci', 0.046), ('observed', 0.045), ('mth', 0.044), ('smyth', 0.043), ('lth', 0.043), ('zem', 0.043), ('pachinko', 0.042), ('clustering', 0.04), ('node', 0.04), ('gibbs', 0.04), ('draw', 0.039), ('cl', 0.039), ('shaded', 0.037), ('datasets', 0.036), ('odp', 0.036), ('mcauliffe', 0.036), ('preprint', 0.036), ('distribution', 0.035), ('leaf', 0.034), ('nested', 0.034), ('sampling', 0.033), ('tree', 0.033), ('infinite', 0.032), ('yahoo', 0.031), ('supervised', 0.03), ('website', 0.03), ('drawbacks', 0.03), ('obtain', 0.03), ('labeled', 0.029), ('hierarchies', 0.028), ('allocations', 0.028), ('dtest', 0.028), ('erarchy', 0.028), ('hslda', 0.028), ('pam', 0.028), ('perotte', 0.028), ('wdm', 0.028), ('ce', 0.027), ('customers', 0.027), ('crawled', 0.027), ('com', 0.025), ('imagine', 0.025), ('unsupervised', 0.025), ('generative', 0.025), ('tables', 0.025), ('formula', 0.024), ('modeling', 0.024), ('cluto', 0.024), ('sits', 0.024), ('famous', 0.024), ('ctm', 0.024), ('plsa', 0.024), ('sub', 0.023), ('customer', 0.023), ('tm', 0.023), ('paths', 0.022), ('proceeding', 0.022), ('ii', 0.022), ('path', 0.022), ('mixing', 0.022), ('mult', 0.022), ('semisupervised', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 115 emnlp-2012-SSHLDA: A Semi-Supervised Hierarchical Topic Model

Author: Xian-Ling Mao ; Zhao-Yan Ming ; Tat-Seng Chua ; Si Li ; Hongfei Yan ; Xiaoming Li

Abstract: Supervised hierarchical topic modeling and unsupervised hierarchical topic modeling are usually used to obtain hierarchical topics, such as hLLDA and hLDA. Supervised hierarchical topic modeling makes heavy use of the information from observed hierarchical labels, but cannot explore new topics; while unsupervised hierarchical topic modeling is able to detect automatically new topics in the data space, but does not make use of any information from hierarchical labels. In this paper, we propose a semi-supervised hierarchical topic model which aims to explore new topics automatically in the data space while incorporating the information from observed hierarchical labels into the modeling process, called Semi-Supervised Hierarchical Latent Dirichlet Allocation (SSHLDA). We also prove that hLDA and hLLDA are special cases of SSHLDA. We conduct experiments on Yahoo! Answers and ODP datasets, and assess the performance in terms of perplexity and clustering. The experimental results show that the predictive ability of SSHLDA is better than that of baselines, and SSHLDA can also achieve significant improvement over baselines for clustering on the FScore measure.

2 0.20314194 90 emnlp-2012-Modelling Sequential Text with an Adaptive Topic Model

Author: Lan Du ; Wray Buntine ; Huidong Jin

Abstract: Topic models are increasingly being used for text analysis tasks, often times replacing earlier semantic techniques such as latent semantic analysis. In this paper, we develop a novel adaptive topic model with the ability to adapt topics from both the previous segment and the parent document. For this proposed model, a Gibbs sampler is developed for doing posterior inference. Experimental results show that with topic adaptation, our model significantly improves over existing approaches in terms of perplexity, and is able to uncover clear sequential structure on, for example, Herman Melville’s book “Moby Dick”.

3 0.20072901 8 emnlp-2012-A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes

Author: Robert Lindsey ; William Headden ; Michael Stipicevic

Abstract: Topic models traditionally rely on the bagof-words assumption. In data mining applications, this often results in end-users being presented with inscrutable lists of topical unigrams, single words inferred as representative of their topics. In this article, we present a hierarchical generative probabilistic model of topical phrases. The model simultaneously infers the location, length, and topic of phrases within a corpus and relaxes the bagof-words assumption within phrases by using a hierarchy of Pitman-Yor processes. We use Markov chain Monte Carlo techniques for approximate inference in the model and perform slice sampling to learn its hyperparameters. We show via an experiment on human subjects that our model finds substantially better, more interpretable topical phrases than do competing models.

4 0.18125322 49 emnlp-2012-Exploring Topic Coherence over Many Models and Many Topics

Author: Keith Stevens ; Philip Kegelmeyer ; David Andrzejewski ; David Buttler

Abstract: We apply two new automated semantic evaluations to three distinct latent topic models. Both metrics have been shown to align with human evaluations and provide a balance between internal measures of information gain and comparisons to human ratings of coherent topics. We improve upon the measures by introducing new aggregate measures that allows for comparing complete topic models. We further compare the automated measures to other metrics for topic models, comparison to manually crafted semantic tests and document classification. Our experiments reveal that LDA and LSA each have different strengths; LDA best learns descriptive topics while LSA is best at creating a compact semantic representation ofdocuments and words in a corpus.

5 0.12740651 89 emnlp-2012-Mixed Membership Markov Models for Unsupervised Conversation Modeling

Author: Michael J. Paul

Abstract: Recent work has explored the use of hidden Markov models for unsupervised discourse and conversation modeling, where each segment or block of text such as a message in a conversation is associated with a hidden state in a sequence. We extend this approach to allow each block of text to be a mixture of multiple classes. Under our model, the probability of a class in a text block is a log-linear function of the classes in the previous block. We show that this model performs well at predictive tasks on two conversation data sets, improving thread reconstruction accuracy by up to 15 percentage points over a standard HMM. Additionally, we show quantitatively that the induced word clusters correspond to speech acts more closely than baseline models.

6 0.11094415 19 emnlp-2012-An Entity-Topic Model for Entity Linking

7 0.1029784 47 emnlp-2012-Explore Person Specific Evidence in Web Person Name Disambiguation

8 0.085378662 29 emnlp-2012-Concurrent Acquisition of Word Meaning and Lexical Categories

9 0.082304746 20 emnlp-2012-Answering Opinion Questions on Products by Exploiting Hierarchical Organization of Consumer Reviews

10 0.063941643 48 emnlp-2012-Exploring Adaptor Grammars for Native Language Identification

11 0.058935832 33 emnlp-2012-Discovering Diverse and Salient Threads in Document Collections

12 0.053935558 60 emnlp-2012-Generative Goal-Driven User Simulation for Dialog Management

13 0.04449863 27 emnlp-2012-Characterizing Stylistic Elements in Syntactic Structure

14 0.043409038 114 emnlp-2012-Revisiting the Predictability of Language: Response Completion in Social Media

15 0.040286947 93 emnlp-2012-Multi-instance Multi-label Learning for Relation Extraction

16 0.039970703 94 emnlp-2012-Multiple Aspect Summarization Using Integer Linear Programming

17 0.03977691 32 emnlp-2012-Detecting Subgroups in Online Discussions by Modeling Positive and Negative Relations among Participants

18 0.039303605 24 emnlp-2012-Biased Representation Learning for Domain Adaptation

19 0.036551956 78 emnlp-2012-Learning Lexicon Models from Search Logs for Query Expansion

20 0.036193863 1 emnlp-2012-A Bayesian Model for Learning SCFGs with Discontiguous Rules


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.167), (1, 0.1), (2, 0.094), (3, 0.189), (4, -0.318), (5, 0.148), (6, -0.071), (7, -0.096), (8, -0.129), (9, 0.024), (10, -0.042), (11, -0.01), (12, 0.124), (13, 0.078), (14, 0.083), (15, 0.016), (16, -0.037), (17, 0.033), (18, 0.077), (19, -0.043), (20, 0.045), (21, -0.009), (22, 0.02), (23, 0.026), (24, 0.031), (25, -0.091), (26, 0.017), (27, 0.058), (28, 0.016), (29, 0.082), (30, -0.128), (31, 0.002), (32, -0.009), (33, 0.016), (34, -0.015), (35, -0.038), (36, 0.031), (37, -0.078), (38, 0.072), (39, -0.036), (40, 0.054), (41, 0.053), (42, 0.0), (43, -0.039), (44, -0.108), (45, -0.019), (46, 0.093), (47, 0.066), (48, -0.035), (49, -0.04)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96943086 115 emnlp-2012-SSHLDA: A Semi-Supervised Hierarchical Topic Model

Author: Xian-Ling Mao ; Zhao-Yan Ming ; Tat-Seng Chua ; Si Li ; Hongfei Yan ; Xiaoming Li

Abstract: Supervised hierarchical topic modeling and unsupervised hierarchical topic modeling are usually used to obtain hierarchical topics, such as hLLDA and hLDA. Supervised hierarchical topic modeling makes heavy use of the information from observed hierarchical labels, but cannot explore new topics; while unsupervised hierarchical topic modeling is able to detect automatically new topics in the data space, but does not make use of any information from hierarchical labels. In this paper, we propose a semi-supervised hierarchical topic model which aims to explore new topics automatically in the data space while incorporating the information from observed hierarchical labels into the modeling process, called Semi-Supervised Hierarchical Latent Dirichlet Allocation (SSHLDA). We also prove that hLDA and hLLDA are special cases of SSHLDA. We conduct experiments on Yahoo! Answers and ODP datasets, and assess the performance in terms of perplexity and clustering. The experimental results show that the predictive ability of SSHLDA is better than that of baselines, and SSHLDA can also achieve significant improvement over baselines for clustering on the FScore measure.

2 0.87545073 90 emnlp-2012-Modelling Sequential Text with an Adaptive Topic Model

Author: Lan Du ; Wray Buntine ; Huidong Jin

Abstract: Topic models are increasingly being used for text analysis tasks, often times replacing earlier semantic techniques such as latent semantic analysis. In this paper, we develop a novel adaptive topic model with the ability to adapt topics from both the previous segment and the parent document. For this proposed model, a Gibbs sampler is developed for doing posterior inference. Experimental results show that with topic adaptation, our model significantly improves over existing approaches in terms of perplexity, and is able to uncover clear sequential structure on, for example, Herman Melville’s book “Moby Dick”.

3 0.82287687 8 emnlp-2012-A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes

Author: Robert Lindsey ; William Headden ; Michael Stipicevic

Abstract: Topic models traditionally rely on the bagof-words assumption. In data mining applications, this often results in end-users being presented with inscrutable lists of topical unigrams, single words inferred as representative of their topics. In this article, we present a hierarchical generative probabilistic model of topical phrases. The model simultaneously infers the location, length, and topic of phrases within a corpus and relaxes the bagof-words assumption within phrases by using a hierarchy of Pitman-Yor processes. We use Markov chain Monte Carlo techniques for approximate inference in the model and perform slice sampling to learn its hyperparameters. We show via an experiment on human subjects that our model finds substantially better, more interpretable topical phrases than do competing models.

4 0.80745453 49 emnlp-2012-Exploring Topic Coherence over Many Models and Many Topics

Author: Keith Stevens ; Philip Kegelmeyer ; David Andrzejewski ; David Buttler

Abstract: We apply two new automated semantic evaluations to three distinct latent topic models. Both metrics have been shown to align with human evaluations and provide a balance between internal measures of information gain and comparisons to human ratings of coherent topics. We improve upon the measures by introducing new aggregate measures that allows for comparing complete topic models. We further compare the automated measures to other metrics for topic models, comparison to manually crafted semantic tests and document classification. Our experiments reveal that LDA and LSA each have different strengths; LDA best learns descriptive topics while LSA is best at creating a compact semantic representation ofdocuments and words in a corpus.

5 0.56774402 89 emnlp-2012-Mixed Membership Markov Models for Unsupervised Conversation Modeling

Author: Michael J. Paul

Abstract: Recent work has explored the use of hidden Markov models for unsupervised discourse and conversation modeling, where each segment or block of text such as a message in a conversation is associated with a hidden state in a sequence. We extend this approach to allow each block of text to be a mixture of multiple classes. Under our model, the probability of a class in a text block is a log-linear function of the classes in the previous block. We show that this model performs well at predictive tasks on two conversation data sets, improving thread reconstruction accuracy by up to 15 percentage points over a standard HMM. Additionally, we show quantitatively that the induced word clusters correspond to speech acts more closely than baseline models.

6 0.47647393 47 emnlp-2012-Explore Person Specific Evidence in Web Person Name Disambiguation

7 0.4312956 19 emnlp-2012-An Entity-Topic Model for Entity Linking

8 0.4139522 48 emnlp-2012-Exploring Adaptor Grammars for Native Language Identification

9 0.39375865 29 emnlp-2012-Concurrent Acquisition of Word Meaning and Lexical Categories

10 0.3746171 33 emnlp-2012-Discovering Diverse and Salient Threads in Document Collections

11 0.27266607 60 emnlp-2012-Generative Goal-Driven User Simulation for Dialog Management

12 0.24080268 23 emnlp-2012-Besting the Quiz Master: Crowdsourcing Incremental Classification Games

13 0.24019393 121 emnlp-2012-Supervised Text-based Geolocation Using Language Models on an Adaptive Grid

14 0.23746189 30 emnlp-2012-Constructing Task-Specific Taxonomies for Document Collection Browsing

15 0.23648734 20 emnlp-2012-Answering Opinion Questions on Products by Exploiting Hierarchical Organization of Consumer Reviews

16 0.20642781 137 emnlp-2012-Why Question Answering using Sentiment Analysis and Word Classes

17 0.19874689 27 emnlp-2012-Characterizing Stylistic Elements in Syntactic Structure

18 0.19290236 110 emnlp-2012-Reading The Web with Learned Syntactic-Semantic Inference Rules

19 0.18272537 119 emnlp-2012-Spectral Dependency Parsing with Latent Variables

20 0.17752878 107 emnlp-2012-Polarity Inducing Latent Semantic Analysis


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(16, 0.019), (25, 0.013), (34, 0.063), (60, 0.057), (63, 0.121), (64, 0.012), (65, 0.035), (70, 0.02), (73, 0.396), (74, 0.038), (76, 0.028), (80, 0.019), (86, 0.029), (95, 0.04)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.84868759 125 emnlp-2012-Towards Efficient Named-Entity Rule Induction for Customizability

Author: Ajay Nagesh ; Ganesh Ramakrishnan ; Laura Chiticariu ; Rajasekar Krishnamurthy ; Ankush Dharkar ; Pushpak Bhattacharyya

Abstract: Generic rule-based systems for Information Extraction (IE) have been shown to work reasonably well out-of-the-box, and achieve state-of-the-art accuracy with further domain customization. However, it is generally recognized that manually building and customizing rules is a complex and labor intensive process. In this paper, we discuss an approach that facilitates the process of building customizable rules for Named-Entity Recognition (NER) tasks via rule induction, in the Annotation Query Language (AQL). Given a set of basic features and an annotated document collection, our goal is to generate an initial set of rules with reasonable accuracy, that are interpretable and thus can be easily refined by a human developer. We present an efficient rule induction process, modeled on a fourstage manual rule development process and present initial promising results with our system. We also propose a simple notion of extractor complexity as a first step to quantify the interpretability of an extractor, and study the effect of induction bias and customization ofbasic features on the accuracy and complexity of induced rules. We demonstrate through experiments that the induced rules have good accuracy and low complexity according to our complexity measure.

2 0.82618189 112 emnlp-2012-Resolving Complex Cases of Definite Pronouns: The Winograd Schema Challenge

Author: Altaf Rahman ; Vincent Ng

Abstract: We examine the task of resolving complex cases of definite pronouns, specifically those for which traditional linguistic constraints on coreference (e.g., Binding Constraints, gender and number agreement) as well as commonly-used resolution heuristics (e.g., string-matching facilities, syntactic salience) are not useful. Being able to solve this task has broader implications in artificial intelligence: a restricted version of it, sometimes referred to as the Winograd Schema Challenge, has been suggested as a conceptually and practically appealing alternative to the Turing Test. We employ a knowledge-rich approach to this task, which yields a pronoun resolver that outperforms state-of-the-art resolvers by nearly 18 points in accuracy on our dataset.

same-paper 3 0.80153489 115 emnlp-2012-SSHLDA: A Semi-Supervised Hierarchical Topic Model

Author: Xian-Ling Mao ; Zhao-Yan Ming ; Tat-Seng Chua ; Si Li ; Hongfei Yan ; Xiaoming Li

Abstract: Supervised hierarchical topic modeling and unsupervised hierarchical topic modeling are usually used to obtain hierarchical topics, such as hLLDA and hLDA. Supervised hierarchical topic modeling makes heavy use of the information from observed hierarchical labels, but cannot explore new topics; while unsupervised hierarchical topic modeling is able to detect automatically new topics in the data space, but does not make use of any information from hierarchical labels. In this paper, we propose a semi-supervised hierarchical topic model which aims to explore new topics automatically in the data space while incorporating the information from observed hierarchical labels into the modeling process, called SemiSupervised Hierarchical Latent Dirichlet Allocation (SSHLDA). We also prove that hLDA and hLLDA are special cases of SSHLDA. We . conduct experiments on Yahoo! Answers and ODP datasets, and assess the performance in terms of perplexity and clustering. The experimental results show that predictive ability of SSHLDA is better than that of baselines, and SSHLDA can also achieve significant improvement over baselines for clustering on the FScore measure.

4 0.41938913 97 emnlp-2012-Natural Language Questions for the Web of Data

Author: Mohamed Yahya ; Klaus Berberich ; Shady Elbassuoni ; Maya Ramanath ; Volker Tresp ; Gerhard Weikum

Abstract: The Linked Data initiative comprises structured databases in the Semantic-Web data model RDF. Exploring this heterogeneous data by structured query languages is tedious and error-prone even for skilled users. To ease the task, this paper presents a methodology for translating natural language questions into structured SPARQL queries over linked-data sources. Our method is based on an integer linear program to solve several disambiguation tasks jointly: the segmentation of questions into phrases; the mapping of phrases to semantic entities, classes, and relations; and the construction of SPARQL triple patterns. Our solution harnesses the rich type system provided by knowledge bases in the web of linked data, to constrain our semantic-coherence objective function. We present experiments on both the . in question translation and the resulting query answering.

5 0.41831174 8 emnlp-2012-A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes

Author: Robert Lindsey ; William Headden ; Michael Stipicevic

Abstract: Topic models traditionally rely on the bagof-words assumption. In data mining applications, this often results in end-users being presented with inscrutable lists of topical unigrams, single words inferred as representative of their topics. In this article, we present a hierarchical generative probabilistic model of topical phrases. The model simultaneously infers the location, length, and topic of phrases within a corpus and relaxes the bagof-words assumption within phrases by using a hierarchy of Pitman-Yor processes. We use Markov chain Monte Carlo techniques for approximate inference in the model and perform slice sampling to learn its hyperparameters. We show via an experiment on human subjects that our model finds substantially better, more interpretable topical phrases than do competing models.

6 0.41750333 20 emnlp-2012-Answering Opinion Questions on Products by Exploiting Hierarchical Organization of Consumer Reviews

7 0.40720388 124 emnlp-2012-Three Dependency-and-Boundary Models for Grammar Induction

8 0.40481013 100 emnlp-2012-Open Language Learning for Information Extraction

9 0.40436375 23 emnlp-2012-Besting the Quiz Master: Crowdsourcing Incremental Classification Games

10 0.39658028 71 emnlp-2012-Joint Entity and Event Coreference Resolution across Documents

11 0.39101428 27 emnlp-2012-Characterizing Stylistic Elements in Syntactic Structure

12 0.39021578 89 emnlp-2012-Mixed Membership Markov Models for Unsupervised Conversation Modeling

13 0.38742128 94 emnlp-2012-Multiple Aspect Summarization Using Integer Linear Programming

14 0.3862077 93 emnlp-2012-Multi-instance Multi-label Learning for Relation Extraction

15 0.37909919 14 emnlp-2012-A Weakly Supervised Model for Sentence-Level Semantic Orientation Analysis with Multiple Experts

16 0.37738493 17 emnlp-2012-An "AI readability" Formula for French as a Foreign Language

17 0.37166736 90 emnlp-2012-Modelling Sequential Text with an Adaptive Topic Model

18 0.37058988 110 emnlp-2012-Reading The Web with Learned Syntactic-Semantic Inference Rules

19 0.37011039 33 emnlp-2012-Discovering Diverse and Salient Threads in Document Collections

20 0.36983749 47 emnlp-2012-Explore Person Specific Evidence in Web Person Name Disambiguation