emnlp emnlp2012 emnlp2012-8 knowledge-graph by maker-knowledge-mining

8 emnlp-2012-A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes


Source: pdf

Author: Robert Lindsey ; William Headden ; Michael Stipicevic

Abstract: Topic models traditionally rely on the bag-of-words assumption. In data mining applications, this often results in end-users being presented with inscrutable lists of topical unigrams, single words inferred as representative of their topics. In this article, we present a hierarchical generative probabilistic model of topical phrases. The model simultaneously infers the location, length, and topic of phrases within a corpus and relaxes the bag-of-words assumption within phrases by using a hierarchy of Pitman-Yor processes. We use Markov chain Monte Carlo techniques for approximate inference in the model and perform slice sampling to learn its hyperparameters. We show via an experiment on human subjects that our model finds substantially better, more interpretable topical phrases than do competing models.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In data mining applications, this often results in end-users being presented with inscrutable lists of topical unigrams, single words inferred as representative of their topics. [sent-11, score-0.304]

2 In this article, we present a hierarchical generative probabilistic model of topical phrases. [sent-12, score-0.303]

3 The model simultaneously infers the location, length, and topic of phrases within a corpus and relaxes the bagof-words assumption within phrases by using a hierarchy of Pitman-Yor processes. [sent-13, score-0.432]

4 We show via an experiment on human subjects that our model finds substantially better, more interpretable topical phrases than do competing models. [sent-15, score-0.428]

5 Probabilistic topic models have been the focus of intense study in recent years. [sent-16, score-0.326]

6 The archetypal topic model, Latent Dirichlet Allocation (LDA), posits that words within a document are conditionally independent given their topic (Blei et al., 2003). [sent-17, score-0.652]

7 When an end-user runs a topic model, the output he or she is often interested in is a list of topical unigrams, words probable in a topic (hence, representative of it). [sent-20, score-0.902]

8 In many situations, such as during the use of the topic model for the analysis of a new or ill-understood corpus, these lists can be insufficiently informative. [sent-21, score-0.38]

9 For instance, if a layperson ran LDA on the NIPS corpus, he would likely get a topic whose most prominent words include policy, value, and reward. [sent-22, score-0.368]

10 Most situations where a topic model is actually useful for data exploration require a model whose output is rich enough to dispel the need for the user’s extensive prior knowledge of the data. [sent-25, score-0.326]

11 Furthermore, lists of topical unigrams are often made only marginally interpretable by virtue of their non-compositionality, the principle that a collocation’s meaning typically is not derivable from its constituent words (Schone and Jurafsky, 2001). [sent-26, score-0.43]

12 For example, the meaning of compact disc as a music medium comes from neither the unigram compact nor the unigram disc, but emerges from the bigram as a whole. [sent-27, score-0.2]

13 Moreover, non-compositionality is topic dependent; compact disc should be interpreted as a music medium in a music topic, and as a small region bounded by a circle in a mathematical topic. [sent-28, score-0.452]

14 LDA is prone to decompose collocations into different topics and violate the principle of non-compositionality, and its unigram lists are harder to interpret as a result. [sent-29, score-0.168]

15 TNG is a topic model which satisfies the first desideratum by producing lists of representative, topically cohesive n-grams of the form shown in Figure 1. [sent-35, score-0.47]

16 We diverge from TNG by addressing the second desideratum, and we do so through a more straightforward and intuitive definition of what constitutes a phrase and its topic. [sent-36, score-0.113]

17 In the furtherance of our goals, we employ a hierarchical method of modeling phrases that uses dependent Pitman-Yor processes to ameliorate overfitting. [sent-37, score-0.139]

18 We then provide details of our inference procedures and evaluate our model against competing models on a subset of the TREC AP corpus (Harman, 1992) in an experiment on human subjects which assesses the interpretability of topical n-gram lists. [sent-40, score-0.339]

19 The experiment is premised on the notion that topic models should be evaluated through a real-world task instead of through information-theoretic measures which often negatively correlate with topic quality (Chang et al. [sent-41, score-0.652]

20 Each word w in a corpus w is drawn from a distribution φ indexed by a topic z, where z is drawn from a distribution θ indexed by its document d. [sent-44, score-0.582]
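
Sentence 20 summarizes LDA's bag-of-words generative process. As a concrete illustration, here is a minimal sketch of that process; the toy dimensions and symmetric Dirichlet hyperparameters below are assumptions for illustration, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
T, V, D, N = 4, 1000, 10, 50                      # topics, vocabulary, documents, tokens per doc (assumed)
alpha, beta = 0.1, 0.01                           # symmetric Dirichlet hyperparameters (assumed)

phi = rng.dirichlet(np.full(V, beta), size=T)     # phi[z]: word distribution for topic z
theta = rng.dirichlet(np.full(T, alpha), size=D)  # theta[d]: topic distribution for document d

corpus = []
for d in range(D):
    z = rng.choice(T, size=N, p=theta[d])                    # z_i ~ Discrete(theta_d)
    w = np.array([rng.choice(V, p=phi[zi]) for zi in z])     # w_i ~ Discrete(phi_{z_i})
    corpus.append((z, w))
```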

21 Here and throughout the article, we use a bold font for vector notation: for example, z is the vector of all topic assignments, and its ith entry, zi, corresponds to the topic assignment of the ith token in the corpus. [sent-47, score-0.856]

22 It does this by representing a joint distribution P(z, c|w), where each ci is a Boolean variable that signals the start of a new n-gram beginning at the ith token. [sent-49, score-0.259]

23 When ci = 0, word wi is joined into a topic-specific bigram with wi−1. [sent-52, score-0.358]

24 When ci = 1, wi is drawn from a topic-specific unigram distribution and is the start of a new n-gram. [sent-53, score-0.478]
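
Sentences 22–24 describe TNG's per-token switch between unigram and bigram draws. The sketch below illustrates only that switching logic; theta_d, phi, sigma, and pi are made-up placeholder parameters, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(1)
T, V = 3, 50                                            # toy sizes (assumed)
theta_d = rng.dirichlet(np.full(T, 0.1))                # document's topic distribution
phi = rng.dirichlet(np.full(V, 0.01), size=T)           # topic-specific unigram distributions phi_z
sigma = rng.dirichlet(np.full(V, 0.01), size=(T, V))    # topic-specific bigram distributions sigma_{z,w}
pi = rng.beta(1.0, 1.0, size=(T, V))                    # P(c_i = 1 | z_{i-1}, w_{i-1})

def tng_sentence(n_tokens):
    words, topics, c = [], [], []
    prev_z, prev_w = 0, 0                               # dummy start state
    for i in range(n_tokens):
        ci = 1 if i == 0 else int(rng.random() < pi[prev_z, prev_w])
        zi = rng.choice(T, p=theta_d)                   # in TNG every word draws its own topic
        if ci == 1:
            wi = rng.choice(V, p=phi[zi])               # start of an n-gram: unigram draw
        else:
            wi = rng.choice(V, p=sigma[zi, prev_w])     # continuation: bigram draw given w_{i-1}
        words.append(wi); topics.append(zi); c.append(ci)
        prev_z, prev_w = zi, wi
    return words, topics, c
```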

25 An unusual feature of TNG is that words within a topical n-gram, a sequence of words delineated by c, do not share the same topic. [sent-54, score-0.25]

26 Wang et al. (2007) analyze each topical n-gram post hoc as if the topic of the final word in the n-gram was the topic assignment of the entire n-gram. [sent-56, score-0.953]

27 Though this design simplifies inference, we perceive it as a shortcoming since the aforementioned principle of non-compositionality supports the intuitive idea that each collocation ought to be drawn from a single topic. [sent-57, score-0.156]

28 Another potential drawback of TNG is that the topic-specific bigram distributions σzw share no probability mass between each other or with the unigram distributions φz. [sent-58, score-0.132]

29 Hence, observing a bigram under one topic does not make it more likely under another topic or make its constituent unigrams more probable. [sent-59, score-0.755]

30 To be more concrete, in TNG, observing space shuttle under a topic z (or under two topics, one for each word) regrettably does not make space shuttle more likely under a topic z′ ≠ z, nor does it make observing shuttle more likely under any topic. [sent-60, score-0.841]

31 Each column within a box shows the top fifteen phrases for a topic and is restricted to phrases of a minimum length of one, two, or three words, respectively. [sent-62, score-0.432]

32 In these situations, the observed number of bigrams in a given topic will necessarily be very small and thus not support strong inferences. [sent-67, score-0.326]

33 A more natural definition of a topical phrase, one which meets our second desideratum, is to have each phrase possess a single topic. [sent-68, score-0.306]

34 It can also be understood through the lens of Bayesian changepoint detection. [sent-70, score-0.189]

35 Viewing a sentence as a time series of words, we posit that the generative parameter, the topic, changes periodically in accordance with the changepoint indicators c. [sent-72, score-0.189]

36 Because there is no restriction on the number of words between changepoints, topical phrases can be arbitrarily long but will always have a single topic drawn from θd. [sent-73, score-0.699]

37 The full definition of PDLDA is given by: wi | u ∼ Discrete(Gu); Gu ∼ PYP(a|u|, b|u|, Gπ(u)); G∅ ∼ PYP(a0, b0, H); zi | d, zi−1, θd, ci ∼ δzi−1 if ci = 0 and Discrete(θd) if ci = 1; ci | wi−1, zi−1, π ∼ Bernoulli(πzi−1,wi−1). [sent-74, score-0.163]

38 Like TNG, PDLDA assumes that the probability of a changepoint ci+1 after the ith token depends on the current topic zi and word wi. [sent-79, score-0.877]

39 This causes the length of a phrase to depend on its topic and constituent words. [sent-80, score-0.382]

40 The changepoints explicitly model which words tend to start and end phrases in each document. [sent-81, score-0.116]

41 Depending on ci, zi is either set deterministically to the preceding topic (when ci = 0) or is drawn anew from θd (when ci = 1). [sent-82, score-0.959]

42 In this way, each topical phrase has a single topic drawn from its document’s topic distribution. [sent-83, score-1.028]

43 Let u be a context vector consisting of the phrase topic and the past m words: u = ⟨zi, wi−1, wi−2, ..., wi−m⟩. [sent-85, score-0.382]

44 For example, the first word wi of a phrase beginning at a position i necessarily has ci = 1; consequently, all the preceding words wi−j in the context vector are treated as start symbols so that wi is effectively drawn from a topic-specific unigram distribution. [sent-93, score-0.639]
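
Sentences 37–44 define PDLDA's per-token draws. A minimal sketch of that process for m = 2 follows; theta_d, change_prob, and draw_word are hypothetical stand-ins (the latter for a draw from G_u, backed by the HPYP described next), and rng is assumed to be a numpy Generator — this is not the paper's implementation.

```python
START = "<s>"                                           # start symbol filling the context at phrase boundaries
M = 2                                                   # number of previous words kept in the context (assumed)

def pdlda_tokens(n_tokens, theta_d, change_prob, draw_word, rng):
    """Generate tokens under PDLDA's switch: a changepoint starts a new phrase with a fresh topic."""
    words, topics, changepoints = [], [], []
    prev_words, prev_z = [START] * M, None
    for i in range(n_tokens):
        if i == 0 or rng.random() < change_prob(prev_z, prev_words[0]):
            ci = 1
            zi = rng.choice(len(theta_d), p=theta_d)    # new phrase: topic drawn anew from theta_d
            prev_words = [START] * M                    # preceding context treated as start symbols
        else:
            ci, zi = 0, prev_z                          # phrase continues: topic carried over
        u = (zi, *prev_words)                           # context u = <z_i, w_{i-1}, ..., w_{i-m}>
        wi = draw_word(u)                               # w_i ~ Discrete(G_u)
        words.append(wi); topics.append(zi); changepoints.append(ci)
        prev_words = [wi] + prev_words[:-1]
        prev_z = zi
    return words, topics, changepoints
```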

45 In PDLDA, each token is drawn from a distribution conditioned on its context u. [sent-94, score-0.223]

46 Each node Gu in the tree is a Pitman-Yor process whose base distribution is its parent node, and H is a uniform distribution over V. [sent-97, score-0.116]

47 The next section explains this hierarchical distribution in more detail. [sent-101, score-0.111]

48 When words w are drawn iid from a PYP-distributed G, one can analytically marginalize G and consider the resulting conditional distribution of w given its parameters a, b, and base distribution φ. [sent-106, score-0.25]

49 This marginal can best be understood by considering the distribution of any wi | w1, ..., wi−1. [sent-107, score-0.221]

50 In the CRP metaphor, one imagines a restaurant with an unbounded number of tables, where each table has one shared dish (a draw from φ) and can seat an unlimited number of customers. [sent-111, score-0.116]

51 The CRP specifies a process by which customers entering the restaurant choose a table to sit at and, consequently, the dish they eat. [sent-112, score-0.232]

52 Subsequent customers sit at an occupied table k with probability proportional to ck − a and choose a new unoccupied table with probability proportional to b + ta, where ck is the number of customers seated at table k and t is the number of occupied tables in G. [sent-114, score-0.268]
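
Sentences 50–52 describe the Pitman-Yor Chinese restaurant process. A small sketch of that seating rule, assuming a single restaurant with discount a and strength b:

```python
import random

def crp_seat(table_counts, a, b, rng=random):
    """Pick a table for a new customer: occupied table k with weight c_k - a,
    a new table with weight b + t*a (t = number of occupied tables)."""
    t = len(table_counts)
    weights = [c_k - a for c_k in table_counts] + [b + t * a]
    u = rng.random() * sum(weights)
    for k, w in enumerate(weights):
        u -= w
        if u <= 0:
            return k                                   # k == t means "open a new table"
    return t

# Seat ten customers with a = 0.5, b = 1.0; each new table's dish is a fresh draw from the base.
counts = []
for _ in range(10):
    k = crp_seat(counts, a=0.5, b=1.0)
    if k == len(counts):
        counts.append(1)
    else:
        counts[k] += 1
print(counts)
```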

53 The hierarchical PYP (HPYP) is an intuitive recursive formulation of the PYP in which the base distribution φ is itself PYP-distributed. [sent-116, score-0.168]

54 This smooths each context’s distribution like the Bayesian n-gram model of Teh (2006), which is a Bayesian version of interpolated Kneser-Ney smoothing (Chen and Goodman, 1998). [sent-120, score-0.09]

55 One ramification of this setup is that if a word occurs in a context u, the sharing makes it more likely in other contexts that have something in common with u, such as a shared topic or word. [sent-121, score-0.326]

56 cuw· = Σk cuwk, where cuwk is the number of customers eating w in context u at table k. [sent-125, score-0.206]
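
Sentences 53–56 describe how the HPYP shares probability mass between contexts. The recursion below is a sketch of the resulting predictive probability; the seating statistics are reduced to customer counts c[u][w] and table counts t[u][w], and the per-level discounts and strengths are collapsed to scalar a and b for brevity (the model ties them by context length |u|).

```python
from collections import defaultdict

c = defaultdict(lambda: defaultdict(int))   # c[u][w]: customers eating w in restaurant u
t = defaultdict(lambda: defaultdict(int))   # t[u][w]: tables serving w in restaurant u

def hpyp_prob(w, u, a, b, V):
    """P(w | u): interpolate the counts in restaurant u with the parent context pi(u),
    bottoming out in the uniform base distribution H over a vocabulary of size V."""
    if u is None:
        return 1.0 / V                                  # base distribution H
    parent = u[:-1] if u else None                      # pi(u): drop the most distant context item
    back_off = hpyp_prob(w, parent, a, b, V)
    c_u, t_u = sum(c[u].values()), sum(t[u].values())
    if c_u == 0:
        return back_off                                 # empty restaurant falls through to its parent
    return (c[u][w] - a * t[u][w] + (b + a * t_u) * back_off) / (b + c_u)

# Example: probability of word 7 in context (topic 2, previous words 5 and 9), toy V = 1000.
print(hpyp_prob(7, (2, 5, 9), a=0.5, b=1.0, V=1000))
```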

57 In this section, we describe Markov chain Monte Carlo procedures to sample from P(z, c, τ|w, U), the posterior distribution over topic assignments z, phrase boundaries c, and seating arrangements τ given an observed corpus w. [sent-135, score-0.447]

58 However, rather than onerously storing the table assignment of every token in w, we store only the counts of how many tables there are in a restaurant and how many customers are sitting at each table in that restaurant. [sent-140, score-0.271]

59 Our sampling strategy for a given token i in document d is to jointly propose changes to the changepoint ci and topic assignment zi, and then to the seating arrangement τ. [sent-142, score-0.694]

60 Recall that according to the model, if ci = 0, zi = zi−1 ; otherwise zi is generated from the topic distribution for document d. [sent-143, score-1.051]

61 Since the topic assignment remains the same until a new changepoint at a position i′ is reached, each token wj for j from position i until i′ − 1 will depend on zi because, for these j, zj = zi. [sent-144, score-0.95]

62 We call this set of tokens the phrase suffix of the ith token and denote it s(i). [sent-145, score-0.161]
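
Sentences 61–62 define the phrase suffix s(i). As a concrete sketch, s(i) can be read directly off the changepoint vector c:

```python
def phrase_suffix(i, c):
    """s(i): the positions from i up to (but not including) the next changepoint after i,
    i.e. the tokens whose topic is tied to z_i while the current phrase continues."""
    j = i + 1
    while j < len(c) and c[j] == 0:
        j += 1
    return list(range(i, j))

c = [1, 0, 0, 1, 0, 1, 0]          # phrases start at positions 0, 3, and 5
print(phrase_suffix(1, c))         # [1, 2]
print(phrase_suffix(3, c))         # [3, 4]
```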

63 In addition to the words in the suffix s(i), the changepoint indicator variables cj for j in x(i) are also conditioned on zi. [sent-149, score-0.297]

64 The variables that depend directly on zi and ci are zs(i), ws(i), and cx(i). [sent-151, score-0.153]

65 The proposal distribution first draws from a multinomial over T + 1 options: one option for ci = 0, zi = zi−1; and one for ci = 1 paired with each possible zi = z ∈ {1, ..., T}. [sent-152, score-1.182]
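
Sentence 65 describes the T + 1-way proposal. The sketch below enumerates those options and samples one; the score function is a placeholder for the unnormalized weights the sampler computes from count statistics, not the paper's exact expression, and rng is assumed to be a numpy Generator.

```python
import numpy as np

def propose_ci_zi(z_prev, T, score, rng):
    """Sample (c_i, z_i) from the proposal: one option continues the phrase
    (c_i = 0, z_i = z_{i-1}); the other T options start a new phrase with each topic."""
    options = [(0, z_prev)] + [(1, z) for z in range(T)]
    weights = np.array([score(ci, zi) for ci, zi in options], dtype=float)
    probs = weights / weights.sum()
    idx = rng.choice(len(options), p=probs)
    return options[idx], probs[idx]                    # the proposal probability feeds into Q
```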

66 After drawing a proposal for ci, zs(i) for token i, the sampler adds a customer eating wi to a table serving wi in restaurant ui. [sent-157, score-0.631]

67 We accept the proposal with probability min(A, 1), where A = [P̂(z′s(i), c′i, τ′s(i)) Q(zs(i), ci, τs(i))] / [P̂(zs(i), ci, τs(i)) Q(z′s(i), c′i, τ′s(i))], where Q is the proposal distribution and P̂ is the true unnormalized distribution. [sent-160, score-0.152]

68 P̂ differs from Q in that the probability of each word wj and the seating arrangement depends only on ¬s(j), as opposed to the simplification of using ¬s(i). [sent-161, score-0.165]
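
Sentences 67–68 give the Metropolis-Hastings correction. In log space, the accept/reject step looks like the sketch below; the four log terms are assumed to be computed elsewhere from the seating counts.

```python
import math, random

def mh_accept(log_p_new, log_p_old, log_q_new, log_q_old, rng=random):
    """Accept with probability min(A, 1), where A = [P_hat(new) * Q(old)] / [P_hat(old) * Q(new)]."""
    log_A = (log_p_new + log_q_old) - (log_p_old + log_q_new)
    return math.log(rng.random()) < min(log_A, 0.0)
```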

69 We then interleave a slice sampling algorithm (Neal, 2000) between sweeps of the Metropolis-Hastings sampler to learn these parameters. [sent-167, score-0.094]
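
Sentence 69 mentions interleaved slice sampling for the hyperparameters. Below is a generic univariate stepping-out/shrinkage slice sampler in the style of Neal; log_f stands in for the log of the (unnormalized) posterior of one hyperparameter with everything else held fixed, and bounded parameters are assumed to return -inf outside their support.

```python
import math, random

def slice_sample(x0, log_f, w=1.0, max_steps=50, rng=random):
    """One slice-sampling update for a scalar parameter."""
    log_y = log_f(x0) + math.log(rng.random())     # slice level under the density at x0
    left = x0 - w * rng.random()                   # randomly position an initial bracket of width w
    right = left + w
    for _ in range(max_steps):                     # step out until the bracket leaves the slice
        if log_f(left) <= log_y:
            break
        left -= w
    for _ in range(max_steps):
        if log_f(right) <= log_y:
            break
        right += w
    while True:                                    # shrink until a point inside the slice is found
        x1 = left + (right - left) * rng.random()
        if log_f(x1) > log_y:
            return x1
        if x1 < x0:
            left = x1
        else:
            right = x1
```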

70 An integral part of modeling topical phrases is the relaxation of the bag-of-words assumption in LDA. [sent-169, score-0.303]

71 Among them, Griffiths and Steyvers (2005) present a model in which words are generated either conditioned on a topic or conditioned on the previous word in a bigram, but not both. [sent-171, score-0.402]

72 Her model uses a hierarchical Dirichlet to share parameters across bigrams in a topic in a manner similar to our use of PYPs, but it lacks a notion of the topic being shared between the words in an n-gram. [sent-174, score-0.705]

73 These models are unconcerned with topical n-grams and thus do not model phrases. [sent-178, score-0.25]

74 Johnson (2010) presents an Adaptor Grammar model of topical phrases. [sent-179, score-0.25]

75 In Johnson’s model, subtrees corresponding to common phrases for a topic are memoized, resulting in a model in which each topic is associated with a distribution over whole phrases. [sent-181, score-0.763]

76 While it is a theoretically elegant method for finding topical phrases, for large corpora we found inference to be impractically slow. [sent-182, score-0.25]

77 Figure 4: Experimental setup of the phrase intrusion experiment in which subjects must click on the n-gram that does not belong. [sent-185, score-0.235]

78 We follow Chang et al. (2009) to create a “phrase intrusion” task that quantitatively compares the quality of the topical n-gram lists produced by our model against those of other models. [sent-189, score-0.304]

79 Each of 48 subjects underwent 80 trials of a web-based experiment on Amazon Mechanical Turk, a reliable (Paolacci et al. [sent-190, score-0.197]

80 Each subject’s task is to select the intruder phrase, a spurious n-gram not belonging with the others in the list. [sent-194, score-0.147]

81 If, other than the intruder, the items in the list are all on the same topic, then subjects can easily identify the intruder because the list is semantically cohesive and makes sense. [sent-195, score-0.272]

82 If the list is incohesive and has no discernible topic, subjects must guess arbitrarily and performance is at random. [sent-196, score-0.089]

83 To construct each trial’s list, we chose two topics z and z′ (z ≠ z′), then selected the three most probable n-grams from z and the intruder phrase, an n-gram probable in z′ and improbable in z. [sent-197, score-0.198]

84 This design ensures that the intruder is not identifiable due solely to its being rare. [sent-198, score-0.201]
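
Sentences 83–84 describe how each trial's list is assembled. Below is a sketch of that construction; using absence from the other topic's top list as a stand-in for "improbable in z" is an assumption, since full per-topic probabilities are not shown here.

```python
import random

def build_intrusion_trial(top_ngrams, rng=random):
    """top_ngrams[z]: n-grams for topic z, ordered from most to least probable."""
    z, z_prime = rng.sample(range(len(top_ngrams)), 2)      # two distinct topics z != z'
    in_topic = list(top_ngrams[z][:3])                      # three most probable n-grams of z
    # Intruder: probable in z' but not among the top n-grams of z.
    intruder = next(g for g in top_ngrams[z_prime] if g not in top_ngrams[z])
    items = in_topic + [intruder]
    rng.shuffle(items)
    return items, intruder
```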

85 Interspersed among the phrase intrusion trials were several simple screening trials intended to affirm that subjects possessed a minimal level of attentiveness and reading comprehension. [sent-199, score-0.493]

86 For example, one such screening trial presented subjects with the list banana, apple, television, orange. [sent-200, score-0.191]

87 Subjects who got any of these screening trials wrong were excluded. Figure 5: An across-subject measure of the ability to detect intruders as a function of n-gram size and model. [sent-201, score-0.108]

88 Excluding trials with repeated words does not qualitatively affect the results. [sent-202, score-0.108]

89 Each subject was presented with trials constructed from the output of PDLDA and TNG for unigrams, bigrams, and trigrams. [sent-204, score-0.108]

90 The topical phrases found by TNG and PDLDA often revolve around a central n-gram, with other words pre- or post-appended to it. [sent-209, score-0.303]

91 In this intrusion experiment, any n-gram not containing the central word or phrase may be trivially identifiable, regardless of its relevance to the topic. [sent-210, score-0.146]

92 For example, the intruder in Trial 4 of Figure 4 is easily identifiable even if a subject does not understand English. [sent-211, score-0.201]

93 For all models, we treated certain punctuation as the start of a phrase by setting cj = 1 for all tokens j immediately following periods, commas, semicolons, and exclamation and question marks. [sent-219, score-0.126]
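
Sentence 93 describes a preprocessing constraint on c. A sketch of that step follows; holding the clamped indicators fixed during sampling is an assumption, not something stated in this summary.

```python
PHRASE_BREAKERS = {".", ",", ";", "!", "?"}

def clamp_punctuation_changepoints(tokens, c):
    """Set c_j = 1 for every token immediately following clause-ending punctuation."""
    fixed = [False] * len(tokens)
    for j in range(1, len(tokens)):
        if tokens[j - 1] in PHRASE_BREAKERS:
            c[j] = 1
            fixed[j] = True          # assumed: these indicators are held fixed, not resampled
    return c, fixed
```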

94 It is defined as MP_k^{m,n} = Σ_s 1(i_{k,s}^{m,n} = ω_{k,s}^{m,n}) / S, where ω_{k,s}^{m,n} is the index of the intruding n-gram for subject s among the words generated from the kth topic of model m, i_{k,s}^{m,n} is the intruder selected by s, and S is the number of subjects. [sent-225, score-0.473]
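
Sentence 94 defines model precision. Computationally it is just the fraction of subjects who picked the true intruder, as in this sketch:

```python
def model_precision(selected, true_intruder):
    """MP for one (model, n-gram size, topic): fraction of the S subjects whose selection matches omega."""
    return sum(1 for i_ks in selected if i_ks == true_intruder) / len(selected)

print(model_precision([3, 3, 1, 3, 3], true_intruder=3))   # 0.8
```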

95 Figure 5b demonstrates that the outcome of the experiment does not depend strongly on whether the topical n-gram lists have repeated words. [sent-229, score-0.304]

96 We presented a topic model which simultaneously segments a corpus into phrases of varying lengths and assigns topics to them. [sent-230, score-0.43]

97 The topical phrases found by PDLDA are much richer sources of information than the topical unigrams typically produced in topic modeling. [sent-231, score-0.94]

98 As evidenced by the phrase-intrusion experiment, the topical n-gram lists that PDLDA finds are much more interpretable than those found by TNG. [sent-232, score-0.34]

99 The formalism of Bayesian changepoint detection arose naturally from the intuitive assumption that the topic of a sequence of tokens changes periodically, and that the tokens in between changepoints comprise a phrase. [sent-233, score-0.635]

100 Topical n-grams: Phrase and topic discovery, with an application to information retrieval. [sent-348, score-0.326]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('tng', 0.378), ('pdlda', 0.336), ('topic', 0.326), ('zi', 0.257), ('topical', 0.25), ('changepoint', 0.189), ('lda', 0.177), ('wi', 0.163), ('ci', 0.153), ('intruder', 0.147), ('zw', 0.144), ('zs', 0.114), ('trials', 0.108), ('intrusion', 0.09), ('subjects', 0.089), ('pyps', 0.084), ('restaurant', 0.083), ('pyp', 0.082), ('customers', 0.08), ('drawn', 0.07), ('cj', 0.07), ('seating', 0.065), ('changepoints', 0.063), ('cuwk', 0.063), ('harman', 0.063), ('shuttle', 0.063), ('dirichlet', 0.061), ('unigrams', 0.061), ('trial', 0.06), ('distribution', 0.058), ('token', 0.057), ('intuitive', 0.057), ('phrase', 0.056), ('lists', 0.054), ('desideratum', 0.054), ('disc', 0.054), ('gu', 0.054), ('identifiable', 0.054), ('trec', 0.053), ('phrases', 0.053), ('hierarchical', 0.053), ('assignment', 0.051), ('topics', 0.051), ('blei', 0.05), ('pitman', 0.049), ('bayesian', 0.049), ('sampler', 0.049), ('ith', 0.048), ('proposal', 0.047), ('slice', 0.045), ('mallet', 0.045), ('crp', 0.045), ('griffiths', 0.043), ('zj', 0.042), ('arrangement', 0.042), ('cuw', 0.042), ('hpyp', 0.042), ('htmm', 0.042), ('layperson', 0.042), ('memoized', 0.042), ('paolacci', 0.042), ('schone', 0.042), ('screening', 0.042), ('bigram', 0.042), ('adaptor', 0.04), ('joshua', 0.038), ('conditioned', 0.038), ('music', 0.036), ('adams', 0.036), ('analytically', 0.036), ('cohesive', 0.036), ('darkened', 0.036), ('eating', 0.036), ('honda', 0.036), ('occupied', 0.036), ('periodically', 0.036), ('sit', 0.036), ('interpretable', 0.036), ('unigram', 0.034), ('teh', 0.034), ('steyvers', 0.034), ('processes', 0.033), ('bernoulli', 0.033), ('serving', 0.033), ('ap', 0.033), ('beta', 0.033), ('conjugate', 0.033), ('dish', 0.033), ('gruber', 0.033), ('perplexity', 0.032), ('smoothing', 0.032), ('discrete', 0.031), ('donna', 0.03), ('simplification', 0.03), ('chang', 0.029), ('principle', 0.029), ('marginalize', 0.028), ('wallach', 0.028), ('distributions', 0.028), ('wj', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999958 8 emnlp-2012-A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes

Author: Robert Lindsey ; William Headden ; Michael Stipicevic

Abstract: Topic models traditionally rely on the bag-of-words assumption. In data mining applications, this often results in end-users being presented with inscrutable lists of topical unigrams, single words inferred as representative of their topics. In this article, we present a hierarchical generative probabilistic model of topical phrases. The model simultaneously infers the location, length, and topic of phrases within a corpus and relaxes the bag-of-words assumption within phrases by using a hierarchy of Pitman-Yor processes. We use Markov chain Monte Carlo techniques for approximate inference in the model and perform slice sampling to learn its hyperparameters. We show via an experiment on human subjects that our model finds substantially better, more interpretable topical phrases than do competing models.

2 0.24646086 90 emnlp-2012-Modelling Sequential Text with an Adaptive Topic Model

Author: Lan Du ; Wray Buntine ; Huidong Jin

Abstract: Topic models are increasingly being used for text analysis tasks, often times replacing earlier semantic techniques such as latent semantic analysis. In this paper, we develop a novel adaptive topic model with the ability to adapt topics from both the previous segment and the parent document. For this proposed model, a Gibbs sampler is developed for doing posterior inference. Experimental results show that with topic adaptation, our model significantly improves over existing approaches in terms of perplexity, and is able to uncover clear sequential structure on, for example, Herman Melville’s book “Moby Dick”.

3 0.20072901 115 emnlp-2012-SSHLDA: A Semi-Supervised Hierarchical Topic Model

Author: Xian-Ling Mao ; Zhao-Yan Ming ; Tat-Seng Chua ; Si Li ; Hongfei Yan ; Xiaoming Li

Abstract: Supervised hierarchical topic modeling and unsupervised hierarchical topic modeling are usually used to obtain hierarchical topics, such as hLLDA and hLDA. Supervised hierarchical topic modeling makes heavy use of the information from observed hierarchical labels, but cannot explore new topics; while unsupervised hierarchical topic modeling is able to detect automatically new topics in the data space, but does not make use of any information from hierarchical labels. In this paper, we propose a semi-supervised hierarchical topic model which aims to explore new topics automatically in the data space while incorporating the information from observed hierarchical labels into the modeling process, called Semi-Supervised Hierarchical Latent Dirichlet Allocation (SSHLDA). We also prove that hLDA and hLLDA are special cases of SSHLDA. We conduct experiments on Yahoo! Answers and ODP datasets, and assess the performance in terms of perplexity and clustering. The experimental results show that the predictive ability of SSHLDA is better than that of baselines, and SSHLDA can also achieve significant improvement over baselines for clustering on the FScore measure.

4 0.17697023 49 emnlp-2012-Exploring Topic Coherence over Many Models and Many Topics

Author: Keith Stevens ; Philip Kegelmeyer ; David Andrzejewski ; David Buttler

Abstract: We apply two new automated semantic evaluations to three distinct latent topic models. Both metrics have been shown to align with human evaluations and provide a balance between internal measures of information gain and comparisons to human ratings of coherent topics. We improve upon the measures by introducing new aggregate measures that allow for comparing complete topic models. We further compare the automated measures to other metrics for topic models, comparison to manually crafted semantic tests and document classification. Our experiments reveal that LDA and LSA each have different strengths; LDA best learns descriptive topics while LSA is best at creating a compact semantic representation of documents and words in a corpus.

5 0.14670415 89 emnlp-2012-Mixed Membership Markov Models for Unsupervised Conversation Modeling

Author: Michael J. Paul

Abstract: Recent work has explored the use of hidden Markov models for unsupervised discourse and conversation modeling, where each segment or block of text such as a message in a conversation is associated with a hidden state in a sequence. We extend this approach to allow each block of text to be a mixture of multiple classes. Under our model, the probability of a class in a text block is a log-linear function of the classes in the previous block. We show that this model performs well at predictive tasks on two conversation data sets, improving thread reconstruction accuracy by up to 15 percentage points over a standard HMM. Additionally, we show quantitatively that the induced word clusters correspond to speech acts more closely than baseline models.

6 0.13884684 130 emnlp-2012-Unambiguity Regularization for Unsupervised Learning of Probabilistic Grammars

7 0.13243994 19 emnlp-2012-An Entity-Topic Model for Entity Linking

8 0.12328219 48 emnlp-2012-Exploring Adaptor Grammars for Native Language Identification

9 0.096447915 29 emnlp-2012-Concurrent Acquisition of Word Meaning and Lexical Categories

10 0.091956809 47 emnlp-2012-Explore Person Specific Evidence in Web Person Name Disambiguation

11 0.083853908 60 emnlp-2012-Generative Goal-Driven User Simulation for Dialog Management

12 0.066854015 1 emnlp-2012-A Bayesian Model for Learning SCFGs with Discontiguous Rules

13 0.064624295 93 emnlp-2012-Multi-instance Multi-label Learning for Relation Extraction

14 0.062451191 33 emnlp-2012-Discovering Diverse and Salient Threads in Document Collections

15 0.060926717 114 emnlp-2012-Revisiting the Predictability of Language: Response Completion in Social Media

16 0.055426769 94 emnlp-2012-Multiple Aspect Summarization Using Integer Linear Programming

17 0.054874346 96 emnlp-2012-Name Phylogeny: A Generative Model of String Variation

18 0.050077293 3 emnlp-2012-A Coherence Model Based on Syntactic Patterns

19 0.045525622 104 emnlp-2012-Parse, Price and Cut-Delayed Column and Row Generation for Graph Based Parsers

20 0.04409343 75 emnlp-2012-Large Scale Decipherment for Out-of-Domain Machine Translation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.198), (1, 0.074), (2, 0.097), (3, 0.189), (4, -0.393), (5, 0.17), (6, -0.069), (7, -0.109), (8, -0.132), (9, 0.028), (10, -0.027), (11, 0.021), (12, 0.136), (13, 0.047), (14, 0.001), (15, -0.02), (16, -0.077), (17, 0.062), (18, -0.008), (19, -0.132), (20, 0.013), (21, 0.029), (22, -0.012), (23, 0.006), (24, -0.099), (25, -0.06), (26, 0.056), (27, 0.142), (28, -0.066), (29, 0.055), (30, 0.009), (31, -0.001), (32, -0.107), (33, 0.051), (34, -0.02), (35, 0.023), (36, -0.051), (37, -0.038), (38, 0.015), (39, -0.024), (40, 0.036), (41, -0.068), (42, -0.009), (43, -0.01), (44, 0.003), (45, -0.024), (46, 0.022), (47, 0.016), (48, 0.004), (49, -0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96772587 8 emnlp-2012-A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes

Author: Robert Lindsey ; William Headden ; Michael Stipicevic

Abstract: Topic models traditionally rely on the bag-of-words assumption. In data mining applications, this often results in end-users being presented with inscrutable lists of topical unigrams, single words inferred as representative of their topics. In this article, we present a hierarchical generative probabilistic model of topical phrases. The model simultaneously infers the location, length, and topic of phrases within a corpus and relaxes the bag-of-words assumption within phrases by using a hierarchy of Pitman-Yor processes. We use Markov chain Monte Carlo techniques for approximate inference in the model and perform slice sampling to learn its hyperparameters. We show via an experiment on human subjects that our model finds substantially better, more interpretable topical phrases than do competing models.

2 0.86975002 90 emnlp-2012-Modelling Sequential Text with an Adaptive Topic Model

Author: Lan Du ; Wray Buntine ; Huidong Jin

Abstract: Topic models are increasingly being used for text analysis tasks, often times replacing earlier semantic techniques such as latent semantic analysis. In this paper, we develop a novel adaptive topic model with the ability to adapt topics from both the previous segment and the parent document. For this proposed model, a Gibbs sampler is developed for doing posterior inference. Experimental results show that with topic adaptation, our model significantly improves over existing approaches in terms of perplexity, and is able to uncover clear sequential structure on, for example, Herman Melville’s book “Moby Dick”.

3 0.82804358 115 emnlp-2012-SSHLDA: A Semi-Supervised Hierarchical Topic Model

Author: Xian-Ling Mao ; Zhao-Yan Ming ; Tat-Seng Chua ; Si Li ; Hongfei Yan ; Xiaoming Li

Abstract: Supervised hierarchical topic modeling and unsupervised hierarchical topic modeling are usually used to obtain hierarchical topics, such as hLLDA and hLDA. Supervised hierarchical topic modeling makes heavy use of the information from observed hierarchical labels, but cannot explore new topics; while unsupervised hierarchical topic modeling is able to detect automatically new topics in the data space, but does not make use of any information from hierarchical labels. In this paper, we propose a semi-supervised hierarchical topic model which aims to explore new topics automatically in the data space while incorporating the information from observed hierarchical labels into the modeling process, called Semi-Supervised Hierarchical Latent Dirichlet Allocation (SSHLDA). We also prove that hLDA and hLLDA are special cases of SSHLDA. We conduct experiments on Yahoo! Answers and ODP datasets, and assess the performance in terms of perplexity and clustering. The experimental results show that the predictive ability of SSHLDA is better than that of baselines, and SSHLDA can also achieve significant improvement over baselines for clustering on the FScore measure.

4 0.74087709 49 emnlp-2012-Exploring Topic Coherence over Many Models and Many Topics

Author: Keith Stevens ; Philip Kegelmeyer ; David Andrzejewski ; David Buttler

Abstract: We apply two new automated semantic evaluations to three distinct latent topic models. Both metrics have been shown to align with human evaluations and provide a balance between internal measures of information gain and comparisons to human ratings of coherent topics. We improve upon the measures by introducing new aggregate measures that allow for comparing complete topic models. We further compare the automated measures to other metrics for topic models, comparison to manually crafted semantic tests and document classification. Our experiments reveal that LDA and LSA each have different strengths; LDA best learns descriptive topics while LSA is best at creating a compact semantic representation of documents and words in a corpus.

5 0.57331318 48 emnlp-2012-Exploring Adaptor Grammars for Native Language Identification

Author: Sze-Meng Jojo Wong ; Mark Dras ; Mark Johnson

Abstract: The task of inferring the native language of an author based on texts written in a second language has generally been tackled as a classification problem, typically using as features a mix of n-grams over characters and part of speech tags (for small and fixed n) and unigram function words. To capture arbitrarily long n-grams that syntax-based approaches have suggested are useful, adaptor grammars have some promise. In this work we investigate their extension to identifying n-gram collocations of arbitrary length over a mix of PoS tags and words, using both maxent and induced syntactic language model approaches to classification. After presenting a new, simple baseline, we show that learned collocations used as features in a maxent model perform better still, but that the story is more mixed for the syntactic language model.

6 0.56029433 89 emnlp-2012-Mixed Membership Markov Models for Unsupervised Conversation Modeling

7 0.47286591 130 emnlp-2012-Unambiguity Regularization for Unsupervised Learning of Probabilistic Grammars

8 0.46574411 19 emnlp-2012-An Entity-Topic Model for Entity Linking

9 0.38305047 29 emnlp-2012-Concurrent Acquisition of Word Meaning and Lexical Categories

10 0.35684058 60 emnlp-2012-Generative Goal-Driven User Simulation for Dialog Management

11 0.35374194 47 emnlp-2012-Explore Person Specific Evidence in Web Person Name Disambiguation

12 0.32082054 33 emnlp-2012-Discovering Diverse and Salient Threads in Document Collections

13 0.29263937 114 emnlp-2012-Revisiting the Predictability of Language: Response Completion in Social Media

14 0.2833721 1 emnlp-2012-A Bayesian Model for Learning SCFGs with Discontiguous Rules

15 0.26098537 93 emnlp-2012-Multi-instance Multi-label Learning for Relation Extraction

16 0.23149447 124 emnlp-2012-Three Dependency-and-Boundary Models for Grammar Induction

17 0.22911797 75 emnlp-2012-Large Scale Decipherment for Out-of-Domain Machine Translation

18 0.22550035 96 emnlp-2012-Name Phylogeny: A Generative Model of String Variation

19 0.21185063 94 emnlp-2012-Multiple Aspect Summarization Using Integer Linear Programming

20 0.20159544 74 emnlp-2012-Language Model Rest Costs and Space-Efficient Storage


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.029), (16, 0.035), (22, 0.294), (25, 0.027), (34, 0.076), (60, 0.065), (63, 0.102), (64, 0.013), (65, 0.014), (70, 0.017), (73, 0.022), (74, 0.079), (76, 0.039), (80, 0.019), (86, 0.03), (95, 0.048)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.79208803 8 emnlp-2012-A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes

Author: Robert Lindsey ; William Headden ; Michael Stipicevic

Abstract: Topic models traditionally rely on the bag-of-words assumption. In data mining applications, this often results in end-users being presented with inscrutable lists of topical unigrams, single words inferred as representative of their topics. In this article, we present a hierarchical generative probabilistic model of topical phrases. The model simultaneously infers the location, length, and topic of phrases within a corpus and relaxes the bag-of-words assumption within phrases by using a hierarchy of Pitman-Yor processes. We use Markov chain Monte Carlo techniques for approximate inference in the model and perform slice sampling to learn its hyperparameters. We show via an experiment on human subjects that our model finds substantially better, more interpretable topical phrases than do competing models.

2 0.4801279 20 emnlp-2012-Answering Opinion Questions on Products by Exploiting Hierarchical Organization of Consumer Reviews

Author: Jianxing Yu ; Zheng-Jun Zha ; Tat-Seng Chua

Abstract: This paper proposes to generate appropriate answers for opinion questions about products by exploiting the hierarchical organization of consumer reviews. The hierarchy organizes product aspects as nodes following their parent-child relations. For each aspect, the reviews and corresponding opinions on this aspect are stored. We develop a new framework for opinion Questions Answering, which enables accurate question analysis and effective answer generation by making use of the hierarchy. In particular, we first identify the (explicit/implicit) product aspects asked in the questions and their sub-aspects by referring to the hierarchy. We then retrieve the corresponding review fragments relevant to the aspects from the hierarchy. In order to generate appropriate answers from the review fragments, we develop a multi-criteria optimization approach for answer generation by simultaneously taking into account review salience, coherence, diversity, and parent-child relations among the aspects. We conduct evaluations on 11 popular products in four domains. The evaluated corpus contains 70,359 consumer reviews and 220 questions on these products. Experimental results demonstrate the effectiveness of our approach.

3 0.47663835 14 emnlp-2012-A Weakly Supervised Model for Sentence-Level Semantic Orientation Analysis with Multiple Experts

Author: Lizhen Qu ; Rainer Gemulla ; Gerhard Weikum

Abstract: We propose the weakly supervised MultiExperts Model (MEM) for analyzing the semantic orientation of opinions expressed in natural language reviews. In contrast to most prior work, MEM predicts both opinion polarity and opinion strength at the level of individual sentences; such fine-grained analysis helps to understand better why users like or dislike the entity under review. A key challenge in this setting is that it is hard to obtain sentence-level training data for both polarity and strength. For this reason, MEM is weakly supervised: It starts with potentially noisy indicators obtained from coarse-grained training data (i.e., document-level ratings), a small set of diverse base predictors, and, if available, small amounts of fine-grained training data. We integrate these noisy indicators into a unified probabilistic framework using ideas from ensemble learning and graph-based semi-supervised learning. Our experiments indicate that MEM outperforms state-of-the-art methods by a significant margin.

4 0.47560734 17 emnlp-2012-An "AI readability" Formula for French as a Foreign Language

Author: Thomas Francois ; Cedrick Fairon

Abstract: This paper presents a new readability formula for French as a foreign language (FFL), which relies on 46 textual features representative of the lexical, syntactic, and semantic levels as well as some of the specificities of the FFL context. We report comparisons between several techniques for feature selection and various learning algorithms. Our best model, based on support vector machines (SVM), significantly outperforms previous FFL formulas. We also found that semantic features behave poorly in our case, in contrast with some previous readability studies on English as a first language.

5 0.47541755 97 emnlp-2012-Natural Language Questions for the Web of Data

Author: Mohamed Yahya ; Klaus Berberich ; Shady Elbassuoni ; Maya Ramanath ; Volker Tresp ; Gerhard Weikum

Abstract: The Linked Data initiative comprises structured databases in the Semantic-Web data model RDF. Exploring this heterogeneous data by structured query languages is tedious and error-prone even for skilled users. To ease the task, this paper presents a methodology for translating natural language questions into structured SPARQL queries over linked-data sources. Our method is based on an integer linear program to solve several disambiguation tasks jointly: the segmentation of questions into phrases; the mapping of phrases to semantic entities, classes, and relations; and the construction of SPARQL triple patterns. Our solution harnesses the rich type system provided by knowledge bases in the web of linked data, to constrain our semantic-coherence objective function. We present experiments on both the question translation and the resulting query answering.

6 0.47226357 124 emnlp-2012-Three Dependency-and-Boundary Models for Grammar Induction

7 0.47028574 136 emnlp-2012-Weakly Supervised Training of Semantic Parsers

8 0.46659204 89 emnlp-2012-Mixed Membership Markov Models for Unsupervised Conversation Modeling

9 0.46105996 42 emnlp-2012-Entropy-based Pruning for Phrase-based Machine Translation

10 0.4590919 115 emnlp-2012-SSHLDA: A Semi-Supervised Hierarchical Topic Model

11 0.45905897 3 emnlp-2012-A Coherence Model Based on Syntactic Patterns

12 0.45845556 100 emnlp-2012-Open Language Learning for Information Extraction

13 0.45842731 82 emnlp-2012-Left-to-Right Tree-to-String Decoding with Prediction

14 0.45797929 7 emnlp-2012-A Novel Discriminative Framework for Sentence-Level Discourse Analysis

15 0.45774275 51 emnlp-2012-Extracting Opinion Expressions with semi-Markov Conditional Random Fields

16 0.45713237 109 emnlp-2012-Re-training Monolingual Parser Bilingually for Syntactic SMT

17 0.45696792 90 emnlp-2012-Modelling Sequential Text with an Adaptive Topic Model

18 0.45520228 122 emnlp-2012-Syntactic Surprisal Affects Spoken Word Duration in Conversational Contexts

19 0.45427087 23 emnlp-2012-Besting the Quiz Master: Crowdsourcing Incremental Classification Games

20 0.45312288 114 emnlp-2012-Revisiting the Predictability of Language: Response Completion in Social Media