nips nips2012 nips2012-332 knowledge-graph by maker-knowledge-mining

332 nips-2012-Symmetric Correspondence Topic Models for Multilingual Text Analysis


Source: pdf

Author: Kosuke Fukumasu, Koji Eguchi, Eric P. Xing

Abstract: Topic modeling is a widely used approach to analyzing large text collections. A small number of multilingual topic models have recently been explored to discover latent topics among parallel or comparable documents, such as in Wikipedia. Other topic models that were originally proposed for structured data are also applicable to multilingual documents. Correspondence Latent Dirichlet Allocation (CorrLDA) is one such model; however, it requires a pivot language to be specified in advance. We propose a new topic model, Symmetric Correspondence LDA (SymCorrLDA), that incorporates a hidden variable to control a pivot language, in an extension of CorrLDA. We experimented with two multilingual comparable datasets extracted from Wikipedia and demonstrate that SymCorrLDA is more effective than some other existing multilingual topic models. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 A small number of multilingual topic models have recently been explored to discover latent topics among parallel or comparable documents, such as in Wikipedia. [sent-12, score-0.753]

2 Other topic models that were originally proposed for structured data are also applicable to multilingual documents. [sent-13, score-0.535]

3 Correspondence Latent Dirichlet Allocation (CorrLDA) is one such model; however, it requires a pivot language to be specified in advance. [sent-14, score-0.738]

4 We propose a new topic model, Symmetric Correspondence LDA (SymCorrLDA), that incorporates a hidden variable to control a pivot language, in an extension of CorrLDA. [sent-15, score-0.67]

5 We experimented with two multilingual comparable datasets extracted from Wikipedia and demonstrate that SymCorrLDA is more effective than some other existing multilingual topic models. [sent-16, score-0.947]

6 In topic modeling, each document is represented as a mixture of topics, where each topic is represented as a word distribution. [sent-18, score-0.497]

7 Most topic models assume that texts are monolingual; however, some can capture statistical dependencies between multiple classes of representations and can be used for multilingual parallel or comparable documents. [sent-20, score-0.629]

8 Here, a parallel document is a merged document consisting of multiple language parts that are translations from one language to another, sometimes including sentence-to-sentence or word-to-word alignments. [sent-21, score-0.691]

9 A comparable document is a merged document consisting of multiple language parts that are not translations of each other but instead describe similar concepts and events. [sent-22, score-0.477]

10 Recently published multilingual topic models [3, 4], which are the equivalent of Conditionally Independent LDA (CI-LDA) [5, 6], can discover latent topics among parallel or comparable documents. [sent-23, score-0.753]

11 SwitchLDA [6] can control the proportions of languages in each multilingual topic. [sent-25, score-0.588]

12 However, both CI-LDA and SwitchLDA preserve dependencies between languages only by sharing per-document multinomial distributions over latent topics, and accordingly the resulting dependencies are relatively weak. [sent-26, score-0.346]

13 In this sense, visual features can be said to be the pivot in modeling annotated image data. [sent-31, score-0.528]

14 The quality of the multilingual topics estimated with CorrLDA is sensitive to which pivot language is selected. [sent-34, score-1.278]

15 For example, a translation of a Japanese book into English would presumably have the Japanese original as its pivot, but a set of international news stories would have pivots that differ based on the country an article is about. [sent-35, score-0.621]

16 It is often difficult to appropriately select the pivot language. [sent-36, score-0.521]

17 To address this problem, which we call the pivot problem, we propose a new topic model, Symmetric Correspondence LDA (SymCorrLDA), that incorporates a hidden variable to control the pivot language, in an extension of CorrLDA. [sent-37, score-1.176]

18 Our SymCorrLDA addresses the problem of CorrLDA and can select an appropriate pivot language by inference from the data. [sent-38, score-0.738]

19 We evaluate the multilingual topic models discussed above, i.e., CI-LDA, SwitchLDA, CorrLDA, and our SymCorrLDA, as well as LDA, using comparable articles in different languages (English, Japanese, and Spanish) extracted from Wikipedia. [sent-41, score-0.313]

20 We first demonstrate through experiments that CorrLDA outperforms the other existing multilingual topic models mentioned, and then show that our SymCorrLDA works more effectively than CorrLDA in any case of selecting a pivot language. [sent-42, score-1.041]

21 2 Multilingual Topic Models with Multilingual Comparable Documents Bilingual topic models for bilingual parallel documents that have word-to-word alignments have been developed, such as those by [8]. [sent-43, score-0.439]

22 In contrast, we focus on analyzing dependencies among languages by modeling multilingual comparable documents, each of which consists of multiple language parts that are not translations of each other but instead describe similar concepts and events. [sent-45, score-0.909]

23 The target documents can be parallel documents, but word-to-word alignments are not taken into account in the topic modeling. [sent-46, score-0.299]

24 Some other researchers explored different types of multilingual topic models that are based on the premise of using multilingual dictionaries or WordNet [9, 10, 11]. [sent-47, score-0.906]

25 In contrast, CI-LDA and SwitchLDA only require multilingual comparable documents that can be easily obtained, such as from Wikipedia, when we use those models for multilingual text analysis. [sent-48, score-0.888]

26 Below, we introduce LDA-style topic models that handle multiple classes and can be applied to multilingual comparable documents for the above-mentioned purposes. [sent-50, score-0.669]

27 The CI-LDA framework was used to model multilingual parallel or comparable documents by [3] and [4]. [sent-53, score-0.528]

28 D, T, and N_d^(·) respectively indicate the number of documents, the number of topics, and the number of word tokens that appear in a specific language part of a document d. [sent-55, score-0.423]

29 The superscript ‘(·)’ indicates the variables corresponding to a specific language part in a document d. [sent-56, score-0.333]

30 For all D documents, sample θ_d ∼ Dirichlet(α). For all T topics and for all L languages, sample ϕ_t^(ℓ) ∼ Dirichlet(β^(ℓ)). For each of the N_d^(ℓ) words w_i^(ℓ) in language ℓ (ℓ ∈ {1, · · · , L}) of document d: a. [sent-61, score-0.518]

31 CI-LDA preserves dependencies between languages only by sharing the multinomial distributions with parameters θd . [sent-64, score-0.287]

32 Accordingly, there are substantial chances that some topics are assigned only to a specific language part in each document, and the resulting dependencies are relatively weak. [sent-65, score-0.417]
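
To make the CI-LDA generative process above concrete, here is a minimal, illustrative sketch in Python (NumPy); the function and variable names are our own, and the symmetric Dirichlet hyperparameters are assumptions rather than values from the paper:

```python
import numpy as np

def generate_cilda(D, T, L, vocab_sizes, doc_lengths, alpha=0.1, beta=0.01, seed=0):
    """Illustrative sketch of the CI-LDA generative process (not the authors' code)."""
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet([alpha] * T, size=D)   # theta_d: shared across all language parts
    phi = [rng.dirichlet([beta] * vocab_sizes[l], size=T) for l in range(L)]  # phi_t^(l)
    docs = []
    for d in range(D):
        parts = []
        for l in range(L):
            words = []
            for _ in range(doc_lengths[d][l]):
                z = rng.choice(T, p=theta[d])                          # topic from the shared theta_d
                words.append(rng.choice(vocab_sizes[l], p=phi[l][z]))  # word from language l's topics
            parts.append(words)
        docs.append(parts)
    return docs
```

Note how the single θ_d is the only element shared across the language parts, which is why the resulting cross-language dependencies are relatively weak.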

33 2 SwitchLDA Similarly to CI-LDA, SwitchLDA [6] can be applied to multilingual comparable documents. [sent-67, score-0.412]

34 However, different from CI-LDA, SwitchLDA can adjust the proportions of multiple different languages for each topic, according to a binomial distribution for bilingual data or a multinomial distribution for data in three or more languages. [sent-68, score-0.429]

35 Sample a word w_i ∼ Multinomial(ϕ_{z_i}^(s_i)). Here, ψ_t indicates a multinomial parameter to adjust the proportions of L different languages for topic t. [sent-79, score-0.573]

36 Therefore, SwitchLDA may represent multilingual topics more flexibly; however, it still has the drawback that the dependencies between languages are relatively weak. [sent-82, score-0.738]
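
A similarly hedged sketch of SwitchLDA, in which each topic additionally carries a language-proportion parameter ψ_t used to switch languages per word (names and hyperparameters are illustrative assumptions):

```python
import numpy as np

def generate_switchlda(D, T, L, vocab_sizes, doc_lengths, alpha=0.1, beta=0.01, gamma=1.0, seed=0):
    """Illustrative sketch of SwitchLDA: each topic carries a language-proportion parameter psi_t."""
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet([alpha] * T, size=D)   # per-document topic proportions
    psi = rng.dirichlet([gamma] * L, size=T)     # psi_t: per-topic proportions over languages
    phi = [rng.dirichlet([beta] * vocab_sizes[l], size=T) for l in range(L)]
    docs = []
    for d in range(D):
        words = []
        for _ in range(doc_lengths[d]):
            z = rng.choice(T, p=theta[d])        # topic
            s = rng.choice(L, p=psi[z])          # language switch drawn from psi_z
            words.append((s, rng.choice(vocab_sizes[s], p=phi[s][z])))
        docs.append(words)
    return docs
```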

37 3 Correspondence LDA (CorrLDA) CorrLDA [7] can also be applied to multilingual comparable documents. [sent-84, score-0.412]

38 In the multilingual setting, this model first generates topics for one language part of a document. [sent-85, score-0.755]

39 For the other languages, the model then uses the topics that were already generated in the pivot language. [sent-87, score-0.644]

40 Figure 2(a) shows a graphical model representation of CorrLDA assuming L languages, where p is the pivot language that is specified in advance. [sent-88, score-0.751]

41 Here, N_d^(ℓ) (ℓ ∈ {p, 2, · · · , L}) denotes the number of words in language ℓ in document d. [sent-89, score-0.35]

42 For all D documents' pivot language parts, sample θ_d^(p) ∼ Dirichlet(α^(p)). For all T topics and for all L languages (including the pivot language), sample ϕ_t^(ℓ) ∼ Dirichlet(β^(ℓ)). For each of the N_d^(p) words w_i^(p) in the pivot language p of document d: a. [sent-93, score-2.467]

43 For each of the N_d^(ℓ) words w_i^(ℓ) in language ℓ (ℓ ∈ {2, · · · , L}) of document d: a. [sent-96, score-0.35]

44 Sample a word w_i^(ℓ) ∼ Multinomial(ϕ_{y_i^(ℓ)}^(ℓ)). This model can capture more direct dependencies between languages, due to the constraints that topics have to be selected from the topics selected in the pivot language parts. [sent-98, score-1.161]

45 However, when CorrLDA is applied to multilingual documents, a pivot language must be specified in advance. [sent-99, score-1.109]

46 Moreover, the quality of the multilingual topics estimated with CorrLDA is sensitive to which pivot language is selected. [sent-100, score-1.278]
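
The CorrLDA process in items 42-44 can be sketched as follows; this is an illustrative simplification with assumed names, in which each non-pivot word reuses a topic assignment drawn uniformly from the pivot part:

```python
import numpy as np

def generate_corrlda(D, T, L, pivot, vocab_sizes, doc_lengths, alpha=0.1, beta=0.01, seed=0):
    """Illustrative sketch of CorrLDA with a fixed pivot language `pivot`."""
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet([alpha] * T, size=D)   # theta_d^(p): defined for the pivot parts only
    phi = [rng.dirichlet([beta] * vocab_sizes[l], size=T) for l in range(L)]
    docs = []
    for d in range(D):
        parts = [None] * L
        # Pivot language part: topics drawn directly from theta_d^(p).
        z_pivot = [rng.choice(T, p=theta[d]) for _ in range(doc_lengths[d][pivot])]
        parts[pivot] = [rng.choice(vocab_sizes[pivot], p=phi[pivot][z]) for z in z_pivot]
        # Non-pivot parts: each word reuses a topic y drawn uniformly from the pivot assignments.
        for l in range(L):
            if l == pivot:
                continue
            words = []
            for _ in range(doc_lengths[d][l]):
                y = z_pivot[rng.integers(len(z_pivot))]   # assumes a non-empty pivot part
                words.append(rng.choice(vocab_sizes[l], p=phi[l][y]))
            parts[l] = words
        docs.append(parts)
    return docs
```

The dependence of every non-pivot word on the pivot part's topic assignments is precisely what makes the choice of pivot language matter.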

47 3 Symmetric Correspondence Topic Models When CorrLDA is applied to parallel or comparable documents, this model first generates topics for one language part of a document; we refer to this language as the pivot language. [sent-101, score-1.186]

48 For the other languages, the model then uses the topics that were already generated in the pivot language. [sent-102, score-0.644]

49 Since the pivot language may differ based on the subject, such as the country a document is about, it is often difficult to appropriately select the pivot language. [sent-104, score-1.356]

50 This model generates a flag that specifies a pivot language for each word, adjusting the probability that each language part of a document acts as the pivot according to a binomial distribution for bilingual data or a multinomial distribution for data in three or more languages. [sent-106, score-1.985]

51 In other words, SymCorrLDA estimates from the data the best pivot language at the word level in each document. [sent-107, score-0.823]

52 The pivot language flags are expected to be assigned to the words in the originally written portions in each language, since the original portions are likely to be described confidently and with a rich vocabulary. [sent-108, score-0.789]

53 For all D documents: for all L languages, sample θ_d^(ℓ) ∼ Dirichlet(α^(ℓ)), and sample π_d ∼ Dirichlet(γ). For all T topics and for all L languages, sample ϕ_t^(ℓ) ∼ Dirichlet(β^(ℓ)). For each of the N_d^(ℓ) words w_i^(ℓ) in language ℓ (ℓ ∈ {1, · · · , L}) of document d: a. [sent-111, score-0.518]

54 Sample a pivot language flag x_i^(ℓ) ∼ Multinomial(π_d). b. [sent-112, score-0.738]

55 The pivot language flag x_i^(ℓ) = ℓ for an arbitrary language ℓ indicates that the pivot language for the word w_i^(ℓ) is its own language ℓ, and x_i^(ℓ) = m indicates that the pivot language for w_i^(ℓ) is another language m different from its own language ℓ. [sent-119, score-3.28]
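
As a rough sketch of the idea behind SymCorrLDA, the following follows the simpler SymCorrLDA-alt style described in item 60: each word draws a pivot flag x_i from π_d and then a topic from the flagged language's document-topic distribution θ_d^(m), rather than reusing the pivot part's already-generated topic assignments as the full SymCorrLDA does. All names and hyperparameters are illustrative assumptions:

```python
import numpy as np

def generate_symcorrlda_alt(D, T, L, vocab_sizes, doc_lengths, alpha=0.1, beta=0.01, gamma=1.0, seed=0):
    """Illustrative sketch in the spirit of SymCorrLDA-alt: a per-word pivot flag x_i selects which
    language's document-topic distribution the topic is drawn from."""
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet([alpha] * T, size=(D, L))   # theta_d^(l): one per document and language
    pi = rng.dirichlet([gamma] * L, size=D)           # pi_d: per-document pivot-language proportions
    phi = [rng.dirichlet([beta] * vocab_sizes[l], size=T) for l in range(L)]
    docs = []
    for d in range(D):
        parts = []
        for l in range(L):
            words = []
            for _ in range(doc_lengths[d][l]):
                x = rng.choice(L, p=pi[d])                     # pivot language flag for this word
                z = rng.choice(T, p=theta[d, x])               # topic from the pivot language's theta
                words.append((x, rng.choice(vocab_sizes[l], p=phi[l][z])))
            parts.append(words)
        docs.append(parts)
    return docs
```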

56 W^(·) and N_d^(·) respectively indicate the total number of vocabulary words (word types) in the specified language, and the number of word tokens that appear in the specified language part of document d. [sent-131, score-0.457]

57 n_dℓ and n_dm are the number of times, for an arbitrary word i ∈ {1, · · · , N_d^(·)} in an arbitrary language j ∈ {1, · · · , L} of document d, the flags x_i^(j) = ℓ and x_i^(j) = m respectively are allocated to document d. [sent-132, score-0.546]

58 C^TD_td indicates the (t, d) element of a T × D topic-document count matrix, meaning the number of times topic t is allocated to document d's language part specified in parentheses. [sent-133, score-0.514]

59 C^WT_wt indicates the (w, t) element of a W^(·) × T word-topic count matrix, meaning the number of times topic t is allocated to word w in the language specified in parentheses. [sent-134, score-0.515]

60 In this variant (SymCorrLDA-alt), non-pivot topics are dependent on the distribution behind the pivot topics, not dependent directly on the pivot topics as in the original SymCorrLDA. [sent-138, score-1.288]

61 We will show through experiments how the modification affects the quality of the estimated multilingual topics, in the following section. [sent-143, score-0.386]

62 4 Experiments In this section, we demonstrate some examples with SymCorrLDA, and then we compare multilingual topic models using various evaluation methods. [sent-144, score-0.535]

63 For the evaluation, we use held-out log-likelihood on two datasets, the task of finding an English article that is on the same topic as a given Japanese article, and the same task with the languages reversed. [sent-145, score-0.426]

64 1 Settings The datasets used in this work are two collections of Wikipedia articles: one is in English and Japanese, the other is in English, Japanese, and Spanish, and articles in each collection are connected across languages via inter-language links, as of November 2, 2009. [sent-147, score-0.272]

65 [Figure: top topic words per language, e.g., rui (species), shu (species), karada (body), konchū (insect), dobutsu (animal); Topic 13: japan, osaka, kyoto, japanese, shi (city), nen (year), kobe] [sent-166, score-0.397]

66 We treat each set of articles connected via inter-language links in two (or three) languages as a comparable document that consists of two (or three) language parts. [sent-170, score-0.272]

67 To carry out the evaluation in the task of finding counterpart articles that we will describe later, we randomly divided the Wikipedia document collection at the document level into 80% training documents and 20% test documents. [sent-171, score-0.362]

68 In addition, we estimated a special implementation of SymCorrLDA, setting πd in a simple way for comparison, where the pivot language flag for each word is randomly selected according to the proportion of the length of each language part (‘SymCorrLDA-rand’). [sent-174, score-1.086]

69 2 Pivot assignments Figure 3 demonstrates how the frequency distribution of the pivot language-flag (binomial) parameter π_d,1 for the Japanese language with the bilingual dataset in SymCorrLDA changes over the iterations of collapsed Gibbs sampling. [sent-184, score-0.901]

70 This figure shows that the pivot language flag is randomly assigned at the initial state, and then it converges to an appropriate bias for each document as the iterations proceed. [sent-185, score-0.839]

71 We next demonstrate how the pivot language flags are assigned to each document. [sent-186, score-0.755]

72 Figure 4(a) shows the titles of eight documents and the corresponding πd when using the bilingual data (T = 500). [sent-187, score-0.255]

73 Therefore, a pivot is assigned considering the language biases of the articles. [sent-190, score-0.755]

74 Here, πd,1 , πd,2 , and πd,3 respectively indicate the pivot language-flag (multinomial) parameters corresponding to Japanese, English, and Spanish parts in each document. [sent-259, score-0.528]

75 We further demonstrate the proportions of pivot assignments at the topic level. [sent-260, score-0.688]

76 Figure 5 shows the content of 6 topics through 10 words with the highest probability for each language and for each topic when using the bilingual data (T = 500), some of which are biased to Japanese (Topics 13 and 59) or English (Topics 201 and 251), while the others have almost no bias. [sent-261, score-0.708]

77 It can be seen that the pivot bias to specific languages can be interpreted. [sent-262, score-0.705]

78 In this work, we estimated multilingual topic models with the training set and computed the log-likelihood of generating the held-out set that was mentioned in Section 4. [sent-266, score-0.55]

79 Table 3 shows the held-out log-likelihood of each multilingual topic model estimated with the bilingual dataset when T = 500 and 1000. [sent-268, score-0.69]

80 Hereafter, CorrLDA1 refers to the CorrLDA model that was estimated when Japanese was the pivot language. [sent-272, score-0.521]

81 As described in Section 2.3, the CorrLDA model first generates topics for the pivot language part of a document, and for the other language parts of the document, the model then uses the topics that were already generated in the pivot language. [sent-274, score-1.788]

82 CorrLDA2 refers to the CorrLDA model when English was the pivot language. [sent-275, score-0.506]

83 This is because CorrLDA can capture direct dependencies between languages, due to the constraints that topics have to be selected from the topics selected in the pivot language parts. [sent-277, score-1.076]

84 This is probably because SymCorrLDA estimates the pivot language appropriately adjusted for each word in each document. [sent-284, score-0.838]

85 This is because the constraints in SymCorrLDA-alt are relaxed so that topics do not always have to be selected from the topics selected for the words with the pivot language flags. [sent-287, score-1.08]

86 These results reflect the fact that the performance of SymCorrLDA in its full form is inherently affected by the nature of the language biases in the multilingual comparable documents, rather than merely being affected by the language part length. [sent-291, score-0.876]

87 Here, CorrLDA3 refers to the CorrLDA model that was estimated when Spanish was the pivot language. [sent-293, score-0.521]

88 SymCorrLDA can estimate the pivot language appropriately adjusted for each word in each document in the trilingual data, as with the bilingual data. [sent-295, score-1.124]

89 In terms of clock time, SymCorrLDA does pay some overhead for allocating the pivot language flags: roughly an additional 40% of CorrLDA's running time in the case of the bilingual data. [sent-299, score-0.878]

90 4 Finding counterpart articles Given an article, we can find its unseen counterpart articles in other languages using a multilingual topic model. [sent-301, score-0.936]

91 We estimated document-topic distributions of test documents for each language, using the topic-word distributions that were estimated by each multilingual topic model with training documents. [sent-303, score-0.658]

92 For estimating the document-topic distributions of test documents, we used re-sampling of LDA using the topic-word distribution estimated beforehand by each multilingual topic model [15]. [sent-305, score-0.55]

93 Each held-out English-Japanese article pair connected via an inter-language link is considered to be on the same topic; therefore, JS divergence of such an article pair is expected to be small if the latent topic estimation is accurate. [sent-307, score-0.306]

94 We first assumed each held-out Japanese article to be a query and the corresponding English article to be relevant, and evaluated the ranking of all the test articles of English in ascending order of the JS divergence; then we conducted the task with the languages reversed. [sent-308, score-0.398]
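
The counterpart-finding evaluation just described can be sketched as follows: compute the JS divergence between the query article's document-topic distribution and every candidate article's distribution in the other language, rank candidates in ascending order, and score with reciprocal rank (MRR appears among this paper's terms). Function names are illustrative assumptions:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two document-topic distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def rank_candidates(query_theta, candidate_thetas):
    """Rank candidate articles by ascending JS divergence from the query article."""
    divs = [js_divergence(query_theta, th) for th in candidate_thetas]
    return np.argsort(divs)   # best match first

def mean_reciprocal_rank(query_thetas, candidate_thetas, relevant_index):
    """MRR over queries; relevant_index[i] is the index of query i's true counterpart article."""
    rr = []
    for i, q in enumerate(query_thetas):
        ranking = rank_candidates(q, candidate_thetas)
        rank = int(np.where(ranking == relevant_index[i])[0][0]) + 1
        rr.append(1.0 / rank)
    return float(np.mean(rr))
```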

95 Therefore, it is clear that SymCorrLDA estimates multilingual topics the most successfully in this experiment. [sent-339, score-0.509]

96 5 Conclusions In this paper, we compared the performance of various topic models that can be applied to multilingual documents, not using multilingual dictionaries, in terms of held-out log-likelihood and in the task of cross-lingual link detection. [sent-340, score-0.906]

97 We demonstrated through experiments that CorrLDA works significantly more effectively than CI-LDA, which was used in prior work on multilingual topic models. [sent-341, score-0.535]

98 Furthermore, we proposed a new topic model, SymCorrLDA, that incorporates a hidden variable to control a pivot language, in an extension of CorrLDA. [sent-342, score-0.67]

99 SymCorrLDA has an advantage in that it does not require a pivot language to be specified in advance, while CorrLDA does. [sent-343, score-0.738]

100 We demonstrated that SymCorrLDA is more effective than CorrLDA and the other topic models, through experiments with Wikipedia datasets using held-out log-likelihood and in the task of finding counterpart articles in other languages. [sent-344, score-0.265]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pivot', 0.506), ('multilingual', 0.371), ('symcorrlda', 0.353), ('corrlda', 0.291), ('japanese', 0.291), ('language', 0.232), ('languages', 0.199), ('switchlda', 0.185), ('topic', 0.164), ('english', 0.145), ('bilingual', 0.14), ('topics', 0.138), ('documents', 0.093), ('word', 0.085), ('document', 0.084), ('lda', 0.075), ('articles', 0.073), ('article', 0.063), ('trilingual', 0.062), ('multinomial', 0.058), ('spanish', 0.049), ('dirichlet', 0.048), ('ndm', 0.044), ('comparable', 0.041), ('wikipedia', 0.039), ('ags', 0.039), ('di', 0.034), ('ag', 0.034), ('words', 0.034), ('erent', 0.033), ('dependencies', 0.03), ('correspondence', 0.029), ('counterpart', 0.028), ('ctdd', 0.026), ('kobe', 0.026), ('collapsed', 0.023), ('parallel', 0.023), ('md', 0.023), ('annotated', 0.022), ('parts', 0.022), ('ectively', 0.022), ('nen', 0.022), ('titles', 0.022), ('tokens', 0.022), ('translation', 0.021), ('js', 0.02), ('ireland', 0.02), ('david', 0.019), ('alignments', 0.019), ('zi', 0.019), ('proportions', 0.018), ('castle', 0.018), ('eguchi', 0.018), ('fukumasu', 0.018), ('hideyoshi', 0.018), ('horyu', 0.018), ('league', 0.018), ('nobunaga', 0.018), ('oda', 0.018), ('pivots', 0.018), ('truck', 0.018), ('vocab', 0.018), ('generative', 0.017), ('symmetric', 0.017), ('allocated', 0.017), ('assigned', 0.017), ('year', 0.017), ('indicates', 0.017), ('selected', 0.016), ('retrieval', 0.016), ('latent', 0.016), ('boldface', 0.016), ('osaka', 0.016), ('estimated', 0.015), ('appropriately', 0.015), ('sample', 0.015), ('species', 0.015), ('mrr', 0.014), ('dublin', 0.014), ('orm', 0.014), ('unaligned', 0.014), ('hanna', 0.014), ('generates', 0.014), ('removed', 0.014), ('translations', 0.014), ('binomial', 0.014), ('gibbs', 0.014), ('ective', 0.014), ('scotland', 0.014), ('graphical', 0.013), ('accordingly', 0.013), ('country', 0.013), ('kyoto', 0.013), ('wilcoxon', 0.013), ('wi', 0.013), ('wallach', 0.012), ('montreal', 0.012), ('canada', 0.012), ('nd', 0.012), ('text', 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 332 nips-2012-Symmetric Correspondence Topic Models for Multilingual Text Analysis

Author: Kosuke Fukumasu, Koji Eguchi, Eric P. Xing

Abstract: Topic modeling is a widely used approach to analyzing large text collections. A small number of multilingual topic models have recently been explored to discover latent topics among parallel or comparable documents, such as in Wikipedia. Other topic models that were originally proposed for structured data are also applicable to multilingual documents. Correspondence Latent Dirichlet Allocation (CorrLDA) is one such model; however, it requires a pivot language to be specified in advance. We propose a new topic model, Symmetric Correspondence LDA (SymCorrLDA), that incorporates a hidden variable to control a pivot language, in an extension of CorrLDA. We experimented with two multilingual comparable datasets extracted from Wikipedia and demonstrate that SymCorrLDA is more effective than some other existing multilingual topic models. 1

2 0.19865397 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification

Author: Elad Hazan, Zohar Karnin

Abstract: We present a simplex algorithm for linear programming in a linear classification formulation. The paramount complexity parameter in linear classification problems is called the margin. We prove that for margin values of practical interest our simplex variant performs a polylogarithmic number of pivot steps in the worst case, and its overall running time is near linear. This is in contrast to general linear programming, for which no sub-polynomial pivot rule is known. 1

3 0.16326125 124 nips-2012-Factorial LDA: Sparse Multi-Dimensional Text Models

Author: Michael Paul, Mark Dredze

Abstract: Latent variable models can be enriched with a multi-dimensional structure to consider the many latent factors in a text corpus, such as topic, author perspective and sentiment. We introduce factorial LDA, a multi-dimensional model in which a document is influenced by K different factors, and each word token depends on a K-dimensional vector of latent variables. Our model incorporates structured word priors and learns a sparse product of factors. Experiments on research abstracts show that our model can learn latent factors such as research topic, scientific discipline, and focus (methods vs. applications). Our modeling improvements reduce test perplexity and improve human interpretability of the discovered factors. 1

4 0.11071597 12 nips-2012-A Neural Autoregressive Topic Model

Author: Hugo Larochelle, Stanislas Lauly

Abstract: We describe a new model for learning meaningful representations of text documents from an unlabeled collection of documents. This model is inspired by the recently proposed Replicated Softmax, an undirected graphical model of word counts that was shown to learn a better generative model and more meaningful document representations. Specifically, we take inspiration from the conditional mean-field recursive equations of the Replicated Softmax in order to define a neural network architecture that estimates the probability of observing a new word in a given document given the previously observed words. This paradigm also allows us to replace the expensive softmax distribution over words with a hierarchical distribution over paths in a binary tree of words. The end result is a model whose training complexity scales logarithmically with the vocabulary size instead of linearly as in the Replicated Softmax. Our experiments show that our model is competitive both as a generative model of documents and as a document representation learning algorithm. 1

5 0.11024596 19 nips-2012-A Spectral Algorithm for Latent Dirichlet Allocation

Author: Anima Anandkumar, Yi-kai Liu, Daniel J. Hsu, Dean P. Foster, Sham M. Kakade

Abstract: Topic modeling is a generalization of clustering that posits that observations (words in a document) are generated by multiple latent factors (topics), as opposed to just one. This increased representational power comes at the cost of a more challenging unsupervised learning problem of estimating the topic-word distributions when only words are observed, and the topics are hidden. This work provides a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of topic models, including Latent Dirichlet Allocation (LDA). For LDA, the procedure correctly recovers both the topic-word distributions and the parameters of the Dirichlet prior over the topic mixtures, using only trigram statistics (i.e., third order moments, which may be estimated with documents containing just three words). The method, called Excess Correlation Analysis, is based on a spectral decomposition of low-order moments via two singular value decompositions (SVDs). Moreover, the algorithm is scalable, since the SVDs are carried out only on k × k matrices, where k is the number of latent factors (topics) and is typically much smaller than the dimension of the observation (word) space. 1

6 0.10628379 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

7 0.079519533 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

8 0.077918015 166 nips-2012-Joint Modeling of a Matrix with Associated Text via Latent Binary Features

9 0.076358959 220 nips-2012-Monte Carlo Methods for Maximum Margin Supervised Topic Models

10 0.074232228 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models

11 0.07380303 274 nips-2012-Priors for Diversity in Generative Latent Variable Models

12 0.060159974 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

13 0.058514208 126 nips-2012-FastEx: Hash Clustering with Exponential Families

14 0.054834507 14 nips-2012-A P300 BCI for the Masses: Prior Information Enables Instant Unsupervised Spelling

15 0.054242909 258 nips-2012-Online L1-Dictionary Learning with Application to Novel Document Detection

16 0.050536133 345 nips-2012-Topic-Partitioned Multinetwork Embeddings

17 0.045535196 47 nips-2012-Augment-and-Conquer Negative Binomial Processes

18 0.043054126 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

19 0.034734752 356 nips-2012-Unsupervised Structure Discovery for Semantic Analysis of Audio

20 0.029990604 22 nips-2012-A latent factor model for highly multi-relational data


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.078), (1, 0.038), (2, -0.03), (3, -0.011), (4, -0.122), (5, -0.034), (6, -0.012), (7, 0.007), (8, 0.069), (9, -0.02), (10, 0.182), (11, 0.15), (12, 0.038), (13, 0.034), (14, 0.048), (15, 0.055), (16, 0.033), (17, 0.035), (18, 0.022), (19, 0.111), (20, 0.006), (21, 0.039), (22, -0.06), (23, -0.05), (24, 0.006), (25, -0.021), (26, -0.004), (27, 0.068), (28, 0.074), (29, -0.043), (30, 0.105), (31, -0.006), (32, 0.05), (33, 0.055), (34, 0.034), (35, -0.004), (36, -0.011), (37, 0.087), (38, -0.04), (39, -0.039), (40, -0.007), (41, 0.07), (42, 0.051), (43, -0.026), (44, -0.073), (45, 0.033), (46, 0.055), (47, -0.017), (48, 0.031), (49, -0.127)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94721872 332 nips-2012-Symmetric Correspondence Topic Models for Multilingual Text Analysis

Author: Kosuke Fukumasu, Koji Eguchi, Eric P. Xing

Abstract: Topic modeling is a widely used approach to analyzing large text collections. A small number of multilingual topic models have recently been explored to discover latent topics among parallel or comparable documents, such as in Wikipedia. Other topic models that were originally proposed for structured data are also applicable to multilingual documents. Correspondence Latent Dirichlet Allocation (CorrLDA) is one such model; however, it requires a pivot language to be specified in advance. We propose a new topic model, Symmetric Correspondence LDA (SymCorrLDA), that incorporates a hidden variable to control a pivot language, in an extension of CorrLDA. We experimented with two multilingual comparable datasets extracted from Wikipedia and demonstrate that SymCorrLDA is more effective than some other existing multilingual topic models. 1

2 0.76891208 12 nips-2012-A Neural Autoregressive Topic Model

Author: Hugo Larochelle, Stanislas Lauly

Abstract: We describe a new model for learning meaningful representations of text documents from an unlabeled collection of documents. This model is inspired by the recently proposed Replicated Softmax, an undirected graphical model of word counts that was shown to learn a better generative model and more meaningful document representations. Specifically, we take inspiration from the conditional mean-field recursive equations of the Replicated Softmax in order to define a neural network architecture that estimates the probability of observing a new word in a given document given the previously observed words. This paradigm also allows us to replace the expensive softmax distribution over words with a hierarchical distribution over paths in a binary tree of words. The end result is a model whose training complexity scales logarithmically with the vocabulary size instead of linearly as in the Replicated Softmax. Our experiments show that our model is competitive both as a generative model of documents and as a document representation learning algorithm. 1

3 0.73711765 124 nips-2012-Factorial LDA: Sparse Multi-Dimensional Text Models

Author: Michael Paul, Mark Dredze

Abstract: Latent variable models can be enriched with a multi-dimensional structure to consider the many latent factors in a text corpus, such as topic, author perspective and sentiment. We introduce factorial LDA, a multi-dimensional model in which a document is influenced by K different factors, and each word token depends on a K-dimensional vector of latent variables. Our model incorporates structured word priors and learns a sparse product of factors. Experiments on research abstracts show that our model can learn latent factors such as research topic, scientific discipline, and focus (methods vs. applications). Our modeling improvements reduce test perplexity and improve human interpretability of the discovered factors. 1

4 0.66859102 345 nips-2012-Topic-Partitioned Multinetwork Embeddings

Author: Peter Krafft, Juston Moore, Bruce Desmarais, Hanna M. Wallach

Abstract: We introduce a new Bayesian admixture model intended for exploratory analysis of communication networks—specifically, the discovery and visualization of topic-specific subnetworks in email data sets. Our model produces principled visualizations of email networks, i.e., visualizations that have precise mathematical interpretations in terms of our model and its relationship to the observed data. We validate our modeling assumptions by demonstrating that our model achieves better link prediction performance than three state-of-the-art network models and exhibits topic coherence comparable to that of latent Dirichlet allocation. We showcase our model’s ability to discover and visualize topic-specific communication patterns using a new email data set: the New Hanover County email network. We provide an extensive analysis of these communication patterns, leading us to recommend our model for any exploratory analysis of email networks or other similarly-structured communication data. Finally, we advocate for principled visualization as a primary objective in the development of new network models. 1

5 0.64647514 19 nips-2012-A Spectral Algorithm for Latent Dirichlet Allocation

Author: Anima Anandkumar, Yi-kai Liu, Daniel J. Hsu, Dean P. Foster, Sham M. Kakade

Abstract: Topic modeling is a generalization of clustering that posits that observations (words in a document) are generated by multiple latent factors (topics), as opposed to just one. This increased representational power comes at the cost of a more challenging unsupervised learning problem of estimating the topic-word distributions when only words are observed, and the topics are hidden. This work provides a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of topic models, including Latent Dirichlet Allocation (LDA). For LDA, the procedure correctly recovers both the topic-word distributions and the parameters of the Dirichlet prior over the topic mixtures, using only trigram statistics (i.e., third order moments, which may be estimated with documents containing just three words). The method, called Excess Correlation Analysis, is based on a spectral decomposition of low-order moments via two singular value decompositions (SVDs). Moreover, the algorithm is scalable, since the SVDs are carried out only on k × k matrices, where k is the number of latent factors (topics) and is typically much smaller than the dimension of the observation (word) space. 1

6 0.57856441 166 nips-2012-Joint Modeling of a Matrix with Associated Text via Latent Binary Features

7 0.56996864 274 nips-2012-Priors for Diversity in Generative Latent Variable Models

8 0.56367731 220 nips-2012-Monte Carlo Methods for Maximum Margin Supervised Topic Models

9 0.47591004 154 nips-2012-How They Vote: Issue-Adjusted Models of Legislative Behavior

10 0.45159775 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models

11 0.44829911 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

12 0.4075571 47 nips-2012-Augment-and-Conquer Negative Binomial Processes

13 0.39497003 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification

14 0.32879558 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

15 0.31497157 258 nips-2012-Online L1-Dictionary Learning with Application to Novel Document Detection

16 0.3083415 89 nips-2012-Coupling Nonparametric Mixtures via Latent Dirichlet Processes

17 0.30226409 192 nips-2012-Learning the Dependency Structure of Latent Factors

18 0.29167071 253 nips-2012-On Triangular versus Edge Representations --- Towards Scalable Modeling of Networks

19 0.26832029 52 nips-2012-Bayesian Nonparametric Modeling of Suicide Attempts

20 0.26032871 14 nips-2012-A P300 BCI for the Masses: Prior Information Enables Instant Unsupervised Spelling


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.134), (21, 0.017), (38, 0.05), (39, 0.014), (42, 0.021), (54, 0.017), (55, 0.014), (68, 0.013), (74, 0.038), (76, 0.074), (80, 0.081), (92, 0.026), (93, 0.368)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.73954368 332 nips-2012-Symmetric Correspondence Topic Models for Multilingual Text Analysis

Author: Kosuke Fukumasu, Koji Eguchi, Eric P. Xing

Abstract: Topic modeling is a widely used approach to analyzing large text collections. A small number of multilingual topic models have recently been explored to discover latent topics among parallel or comparable documents, such as in Wikipedia. Other topic models that were originally proposed for structured data are also applicable to multilingual documents. Correspondence Latent Dirichlet Allocation (CorrLDA) is one such model; however, it requires a pivot language to be specified in advance. We propose a new topic model, Symmetric Correspondence LDA (SymCorrLDA), that incorporates a hidden variable to control a pivot language, in an extension of CorrLDA. We experimented with two multilingual comparable datasets extracted from Wikipedia and demonstrate that SymCorrLDA is more effective than some other existing multilingual topic models. 1

2 0.44502926 191 nips-2012-Learning the Architecture of Sum-Product Networks Using Clustering on Variables

Author: Aaron Dennis, Dan Ventura

Abstract: The sum-product network (SPN) is a recently-proposed deep model consisting of a network of sum and product nodes, and has been shown to be competitive with state-of-the-art deep models on certain difficult tasks such as image completion. Designing an SPN network architecture that is suitable for the task at hand is an open question. We propose an algorithm for learning the SPN architecture from data. The idea is to cluster variables (as opposed to data instances) in order to identify variable subsets that strongly interact with one another. Nodes in the SPN network are then allocated towards explaining these interactions. Experimental evidence shows that learning the SPN architecture significantly improves its performance compared to using a previously-proposed static architecture. 1

3 0.43447369 124 nips-2012-Factorial LDA: Sparse Multi-Dimensional Text Models

Author: Michael Paul, Mark Dredze

Abstract: Latent variable models can be enriched with a multi-dimensional structure to consider the many latent factors in a text corpus, such as topic, author perspective and sentiment. We introduce factorial LDA, a multi-dimensional model in which a document is influenced by K different factors, and each word token depends on a K-dimensional vector of latent variables. Our model incorporates structured word priors and learns a sparse product of factors. Experiments on research abstracts show that our model can learn latent factors such as research topic, scientific discipline, and focus (methods vs. applications). Our modeling improvements reduce test perplexity and improve human interpretability of the discovered factors. 1

4 0.42594886 270 nips-2012-Phoneme Classification using Constrained Variational Gaussian Process Dynamical System

Author: Hyunsin Park, Sungrack Yun, Sanghyuk Park, Jongmin Kim, Chang D. Yoo

Abstract: For phoneme classification, this paper describes an acoustic model based on the variational Gaussian process dynamical system (VGPDS). The nonlinear and nonparametric acoustic model is adopted to overcome the limitations of classical hidden Markov models (HMMs) in modeling speech. The Gaussian process prior on the dynamics and emission functions respectively enable the complex dynamic structure and long-range dependency of speech to be better represented than that by an HMM. In addition, a variance constraint in the VGPDS is introduced to eliminate the sparse approximation error in the kernel matrix. The effectiveness of the proposed model is demonstrated with three experimental results, including parameter estimation and classification performance, on the synthetic and benchmark datasets. 1

5 0.42504603 233 nips-2012-Multiresolution Gaussian Processes

Author: David B. Dunson, Emily B. Fox

Abstract: We propose a multiresolution Gaussian process to capture long-range, nonMarkovian dependencies while allowing for abrupt changes and non-stationarity. The multiresolution GP hierarchically couples a collection of smooth GPs, each defined over an element of a random nested partition. Long-range dependencies are captured by the top-level GP while the partition points define the abrupt changes. Due to the inherent conjugacy of the GPs, one can analytically marginalize the GPs and compute the marginal likelihood of the observations given the partition tree. This property allows for efficient inference of the partition itself, for which we employ graph-theoretic techniques. We apply the multiresolution GP to the analysis of magnetoencephalography (MEG) recordings of brain activity.

6 0.42426962 192 nips-2012-Learning the Dependency Structure of Latent Factors

7 0.40584752 282 nips-2012-Proximal Newton-type methods for convex optimization

8 0.394835 18 nips-2012-A Simple and Practical Algorithm for Differentially Private Data Release

9 0.391325 12 nips-2012-A Neural Autoregressive Topic Model

10 0.38834655 7 nips-2012-A Divide-and-Conquer Method for Sparse Inverse Covariance Estimation

11 0.38393113 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models

12 0.37323079 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization

13 0.36812684 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

14 0.36682534 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

15 0.36466733 72 nips-2012-Cocktail Party Processing via Structured Prediction

16 0.36362553 47 nips-2012-Augment-and-Conquer Negative Binomial Processes

17 0.36331999 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

18 0.36173055 19 nips-2012-A Spectral Algorithm for Latent Dirichlet Allocation

19 0.36169437 274 nips-2012-Priors for Diversity in Generative Latent Variable Models

20 0.36095884 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models