nips nips2009 nips2009-204 knowledge-graph by maker-knowledge-mining

204 nips-2009-Replicated Softmax: an Undirected Topic Model


Source: pdf

Author: Geoffrey E. Hinton, Ruslan Salakhutdinov

Abstract: We introduce a two-layer undirected graphical model, called a “Replicated Softmax”, that can be used to model and automatically extract low-dimensional latent semantic representations from a large unstructured collection of documents. We present efficient learning and inference algorithms for this model, and show how a Monte-Carlo based method, Annealed Importance Sampling, can be used to produce an accurate estimate of the log-probability the model assigns to test data. This allows us to demonstrate that the proposed model is able to generalize much better compared to Latent Dirichlet Allocation in terms of both the log-probability of held-out documents and the retrieval accuracy.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We introduce a two-layer undirected graphical model, called a “Replicated Softmax”, that can be used to model and automatically extract low-dimensional latent semantic representations from a large unstructured collection of documents. [sent-4, score-0.26]

2 We present efficient learning and inference algorithms for this model, and show how a Monte-Carlo based method, Annealed Importance Sampling, can be used to produce an accurate estimate of the log-probability the model assigns to test data. [sent-5, score-0.068]

3 This allows us to demonstrate that the proposed model is able to generalize much better compared to Latent Dirichlet Allocation in terms of both the log-probability of held-out documents and the retrieval accuracy. [sent-6, score-0.284]

4 1 Introduction Probabilistic topic models [2, 9, 6] are often used to analyze and extract semantic topics from large text collections. [sent-7, score-0.406]

5 Many of the existing topic models are based on the assumption that each document is represented as a mixture of topics, where each topic defines a probability distribution over words. [sent-8, score-0.609]

6 The mixing proportions of the topics are document specific, but the probability distribution over words, defined by each topic, is the same across all documents. [sent-9, score-0.24]

7 All these models can be viewed as graphical models in which latent topic variables have directed connections to observed variables that represent words in a document. [sent-10, score-0.44]

8 A second major drawback, that is shared by all mixture models, is that these models can never make predictions for words that are sharper than the distributions predicted by any of the individual topics. [sent-12, score-0.105]

9 They are unable to capture the essential idea of distributed representations which is that the distributions predicted by individual active features get multiplied together (and renormalized) to give the distribution predicted by a whole set of active features. [sent-13, score-0.098]

10 For example, distributed representations allow the topics “government”, “mafia” and “playboy” to combine to give very high probability to a word “Berlusconi” that is not predicted nearly as strongly by each topic alone. [sent-15, score-0.393]

11 To date, there has been very little work on developing topic models using undirected graphical models. [sent-16, score-0.328]

12 Several authors [4, 17] used two-layer undirected graphical models, called Restricted Boltzmann Machines (RBMs), in which word-count vectors are modeled as a Poisson distribution. [sent-17, score-0.087]

13 While these models are able to produce distributed representations of the input and perform well in terms of retrieval accuracy, they are unable to properly deal with documents of different lengths, which makes learning very unstable and hard. [sent-18, score-0.292]

14 For undirected models marginalizing over unobserved variables is generally a non-trivial operation, which makes learning far more difficult. [sent-21, score-0.11]

15 Recently, [13] attempted to fix this problem by proposing a Constrained Poisson model that would ensure that the mean Poisson rates across all words sum up to the length of the document. [sent-22, score-0.065]

16 While the parameter learning has been shown to be stable, the introduced model no longer defines a proper probability distribution over the word counts. [sent-23, score-0.071]

17 The model can be efficiently trained using Contrastive Divergence, it has a better way of dealing with documents of different lengths, and computing the posterior distribution over the latent topic values is easy. [sent-25, score-0.501]

18 We will also demonstrate that the proposed model is able to generalize much better compared to a popular Bayesian mixture model, Latent Dirichlet Allocation (LDA) [2], in terms of both the log-probability on previously unseen documents and the retrieval accuracy. [sent-26, score-0.284]

19 2 Replicated Softmax: A Generative Model of Word Counts Consider modeling discrete visible units v using a restricted Boltzmann machine that has a two-layer architecture as shown in Fig. 1. [sent-27, score-0.189]

20 Let v ∈ {1, . . . , K}^D, where K is the dictionary size and D is the document size, and let h ∈ {0, 1}^F be binary stochastic hidden topic features. [sent-32, score-0.441]

21 Let V be a K × D observed binary matrix with v_i^k = 1 if visible unit i takes on the k-th value. [sent-33, score-0.178]

22 The probability that the model assigns to a visible binary matrix V is: $P(\mathbf{V}) = \frac{1}{Z}\sum_{\mathbf{h}} \exp(-E(\mathbf{V}, \mathbf{h}))$, $Z = \sum_{\mathbf{V}}\sum_{\mathbf{h}} \exp(-E(\mathbf{V}, \mathbf{h}))$, (2) where Z is known as the partition function or normalizing constant. [sent-36, score-0.269]

23 The conditional distributions are given by softmax and logistic functions: $p(v_i^k = 1 \mid \mathbf{h}) = \frac{\exp\big(b_i^k + \sum_{j=1}^F h_j W_{ij}^k\big)}{\sum_{q=1}^K \exp\big(b_i^q + \sum_{j=1}^F h_j W_{ij}^q\big)}$, (3) $p(h_j = 1 \mid \mathbf{V}) = \sigma\big(a_j + \sum_{i=1}^D \sum_{k=1}^K v_i^k W_{ij}^k\big)$, (4) where $\sigma(x) = 1/(1 + \exp(-x))$ is the logistic function. [sent-37, score-0.957]
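
To make Eqs. 3-4 concrete, here is a minimal NumPy sketch of the two conditionals in the weight-sharing setting the paper introduces next; the array shapes and the names `counts`, `W` (K x F), `a`, `b` are our own conventions, not code from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(counts, W, a):
    # Eq. 4 with tied weights: `counts` is the K-dim word-count vector v_hat,
    # W is K x F, a is the F-dim hidden bias, scaled by the document length D
    # as the paper discusses below.
    D = counts.sum()
    return sigmoid(D * a + counts @ W)

def p_v_given_h(h, W, b):
    # Eq. 3 with tied weights: a single softmax over the K dictionary words,
    # shared by every word position in the document.
    logits = b + W @ h
    logits = logits - logits.max()   # numerical stability
    p = np.exp(logits)
    return p / p.sum()
```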

24 Now suppose that for each document we create a separate RBM with as many softmax units as there are words in the document. [sent-38, score-0.897]

25 Assuming we can ignore the order of the words, all of these softmax units can share the same set of weights, connecting them to binary hidden units. [sent-39, score-0.777]

26 In this case, we define the energy of the state {V, h} to be: $E(\mathbf{V}, \mathbf{h}) = -\sum_{j=1}^F \sum_{k=1}^K W_j^k h_j \hat{v}^k - \sum_{k=1}^K b^k \hat{v}^k - D \sum_{j=1}^F a_j h_j$, (5) where $\hat{v}^k = \sum_{i=1}^D v_i^k$ denotes the count for the k-th word. [sent-41, score-0.259]
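
A direct transcription of the energy in Eq. 5, continuing the sketch above (same assumed shapes; not the authors' code).

```python
def energy(counts, h, W, b, a):
    # Eq. 5: `counts` is the count vector v_hat (K,), h the hidden state (F,),
    # W is K x F, b the K visible biases, a the F hidden biases.
    D = counts.sum()
    return -(counts @ W @ h) - counts @ b - D * (a @ h)
```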

27 Observe that the bias terms of the hidden units are scaled up by the length of the document. [sent-42, score-0.164]

28 This scaling is crucial and allows hidden topic units to behave sensibly when dealing with documents of different lengths. [sent-43, score-0.527]

29 The top layer represents a vector h of stochastic, binary topic features and the bottom layer represents softmax visible units v. [sent-45, score-1.117]

30 All visible units share the same set of weights, connecting them to binary hidden units. [sent-46, score-0.262]

31 Left: The model for a document containing two and three words. [sent-47, score-0.174]

32 Right: A different interpretation of the Replicated Softmax model, in which D softmax units with identical weights are replaced by a single multinomial unit which is sampled D times. [sent-48, score-0.758]

33 The special bipartite structure of RBM’s allows for quite an efficient Gibbs sampler that alternates between sampling the states of the hidden units independently given the states of the visible units, and vice versa (see Eqs. 3, 4). [sent-53, score-0.299]

34 The weights can now be shared by the whole family of different-sized RBM’s that are created for documents of different lengths (see Fig. 1). [sent-56, score-0.203]

35 Computing the gradients (Eq. 6) for a document that contains 100 words is computationally not much more expensive than computing the gradients for a document that contains only one word. [sent-60, score-0.363]

36 A key observation is that using D softmax units with identical weights is equivalent to having a single multinomial unit which is sampled D times, as shown in Fig. 1 (right). [sent-61, score-0.758]

37 If instead of sampling, we use real-valued softmax probabilities multiplied by D, we exactly recover the learning algorithm of a Constrained Poisson model [13], except for the scaling of the hidden biases with D. [sent-63, score-0.701]
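
As a hedged illustration of how such a model could be trained with Contrastive Divergence under the multinomial view just described, the following CD-1 sketch reuses `p_h_given_v` and `p_v_given_h` from the earlier block; the single-document update, learning rate, and absence of mini-batches or momentum are our simplifications, not the paper's actual training setup.

```python
import numpy as np

def cd1_update(counts, W, b, a, lr=0.01):
    # One Contrastive Divergence (CD-1) step for a single document, using the
    # "single multinomial unit sampled D times" view; updates W, b, a in place.
    D = int(counts.sum())

    # Positive phase: hidden probabilities given the observed counts (Eq. 4).
    h_pos = p_h_given_v(counts, W, a)

    # Negative phase: sample binary hidden states, reconstruct the document by
    # drawing D words from the shared softmax (Eq. 3), and recompute the hiddens.
    h_sample = (np.random.rand(h_pos.size) < h_pos).astype(float)
    v_probs = p_v_given_h(h_sample, W, b)
    counts_neg = np.random.multinomial(D, v_probs).astype(float)
    h_neg = p_h_given_v(counts_neg, W, a)

    # Approximate log-likelihood gradient: data statistics minus model statistics.
    W += lr * (np.outer(counts, h_pos) - np.outer(counts_neg, h_neg))
    b += lr * (counts - counts_neg)
    a += lr * D * (h_pos - h_neg)   # hidden biases are scaled by D in the energy
```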

38 3 Evaluating Replicated Softmax as a Generative Model Assessing the generalization performance of probabilistic topic models plays an important role in model selection. [sent-64, score-0.264]

39 Much of the existing literature, particularly for undirected topic models [4, 17], uses extremely indirect performance measures, such as information retrieval or document classification. [sent-65, score-0.527]

40 More broadly, however, the ability of the model to generalize can be evaluated by computing the probability that the model assigns to the previously unseen documents, which is independent of any specific application. [sent-66, score-0.114]

41 For undirected models, computing the probability of held-out documents exactly is intractable, since computing the global normalization constant requires enumeration over an exponential number of terms. [sent-67, score-0.27]

42 Evaluating the same probability for directed topic models is also difficult, because there are an exponential number of possible topic assignments for the words. [sent-68, score-0.494]

43 One way of estimating the ratio of normalizing constants is to use a simple importance sampling method: $\frac{Z_B}{Z_A} = \sum_{\mathbf{x}} \frac{p_B^*(\mathbf{x})}{p_A^*(\mathbf{x})}\, p_A(\mathbf{x}) = \mathbb{E}_{p_A}\!\left[\frac{p_B^*(\mathbf{x})}{p_A^*(\mathbf{x})}\right] \approx \frac{1}{N}\sum_{i=1}^N \frac{p_B^*(\mathbf{x}^{(i)})}{p_A^*(\mathbf{x}^{(i)})}$, (7) where $\mathbf{x}^{(i)} \sim p_A$. [sent-84, score-0.132]
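
Eq. 7 in code, written in the log domain for numerical stability; a sketch assuming samples from p_A and callables for the two unnormalized log-probabilities are already available (names are ours).

```python
import numpy as np

def log_mean_exp(x):
    m = x.max()
    return m + np.log(np.mean(np.exp(x - m)))

def simple_is_log_ratio(samples, log_p_star_B, log_p_star_A):
    # Eq. 7 in the log domain: estimate log(Z_B / Z_A) from samples x^(i) ~ p_A.
    log_w = np.array([log_p_star_B(x) - log_p_star_A(x) for x in samples])
    return log_mean_exp(log_w)
```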

44 In high-dimensional spaces, the variance of the importance sampling estimator will be very large, or possibly infinite, unless pA is a near-perfect approximation to pB . [sent-86, score-0.099]

45 Annealed Importance Sampling can be viewed as simple importance sampling defined on a much higher dimensional state space. [sent-87, score-0.099]

46 For each intermediate distribution, a Markov chain transition operator Tk (x′ ; x) that leaves pk (x) invariant must also be defined. [sent-97, score-0.075]

47 Using Eq. 5, the joint distribution over {V, h} is defined as: $p(\mathbf{V}, \mathbf{h}) = \frac{1}{Z} \exp\Big(\sum_{j=1}^F \sum_{k=1}^K W_j^k h_j \hat{v}^k\Big)$, (9) where $\hat{v}^k = \sum_{i=1}^D v_i^k$ denotes the count for the k-th word (bias terms are omitted for clarity). [sent-101, score-0.193]

48 By explicitly summing out the latent topic units h we can easily evaluate an unnormalized probability p*(V). [sent-102, score-0.391]

49 The sequence of intermediate distributions, parameterized by β, can now be defined as follows: $p_s(\mathbf{V}) = \frac{1}{Z_s} p_s^*(\mathbf{V}) = \frac{1}{Z_s} \sum_{\mathbf{h}} p_s^*(\mathbf{V}, \mathbf{h}) = \frac{1}{Z_s} \prod_{j=1}^F \Big(1 + \exp\big(\beta_s \sum_{k=1}^K W_j^k \hat{v}^k\big)\Big)$. (10) [sent-103, score-0.144]

50 Note that for s = 0 we have β_s = 0, and so p_0 represents a uniform distribution, whose partition function evaluates to Z_0 = 2^F, where F is the number of hidden units. [sent-104, score-0.104]
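
Once the hidden units are summed out, the unnormalized intermediate distributions of Eq. 10 are cheap to evaluate; a small sketch (our shapes, bias terms omitted as in the text).

```python
import numpy as np

def log_p_star_s(counts, W, beta_s):
    # Unnormalized log-probability of Eq. 10:
    # log prod_j (1 + exp(beta_s * sum_k W_j^k * v_hat^k)).
    activations = beta_s * (counts @ W)            # one value per hidden unit
    return np.sum(np.logaddexp(0.0, activations))  # log(1 + exp(x)), computed stably
```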

51 Using Eqs. 3 and 4, it is also straightforward to derive an efficient Gibbs transition operator that leaves p_s(V) invariant. [sent-108, score-0.104]

52 It starts by first sampling from a simple uniform distribution p_0(V) and then applying a series of transition operators T_1, T_2, . . . , T_{S−1} [sent-110, score-0.095]

53 that “move” the sample through the intermediate distributions p_s(V) towards the target distribution p_S(V). [sent-113, score-0.16]

54 Note that there is no need to compute the normalizing constants of any intermediate distributions. [sent-114, score-0.08]

55 After performing M runs of AIS, the importance weights $w_{\mathrm{AIS}}^{(i)}$ can be used to obtain an unbiased estimate of our model’s partition function $Z_S$: $\frac{Z_S}{Z_0} \approx \frac{1}{M} \sum_{i=1}^M w_{\mathrm{AIS}}^{(i)}$, (11) where $Z_0 = 2^F$. [sent-115, score-0.095]
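
Putting Eqs. 10-11 together, one AIS run per document size might look like the sketch below, which reuses the helpers defined earlier (`sigmoid`, `log_p_star_s`, `log_mean_exp`); the uniform initial sample, the simple Gibbs operator, and the temperature schedule are our own simplifications rather than the authors' implementation.

```python
import numpy as np

def ais_log_ratio(W, K, D, betas, n_runs=100):
    # Minimal AIS sketch estimating log(Z_S / Z_0) for documents of length D.
    # `betas` is an increasing sequence from 0 to 1; the text gives log Z_0 = F log 2.
    F = W.shape[1]
    log_weights = []
    for _ in range(n_runs):
        # Sample from the uniform p_0: D words drawn uniformly from the dictionary.
        counts = np.random.multinomial(D, np.ones(K) / K).astype(float)
        log_w = 0.0
        for s in range(1, len(betas)):
            # Accumulate the importance weight: log p*_s - log p*_{s-1} at the current state.
            log_w += log_p_star_s(counts, W, betas[s]) - log_p_star_s(counts, W, betas[s - 1])
            # One Gibbs transition that leaves p_s invariant (biases omitted).
            h_prob = sigmoid(betas[s] * (counts @ W))
            h = (np.random.rand(F) < h_prob).astype(float)
            v_logits = betas[s] * (W @ h)
            v_probs = np.exp(v_logits - v_logits.max())
            v_probs /= v_probs.sum()
            counts = np.random.multinomial(D, v_probs).astype(float)
        log_weights.append(log_w)
    return log_mean_exp(np.array(log_weights))
```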

56 In particular, if we were to choose dumb transition operators that do nothing, $T_s(\mathbf{V}' \leftarrow \mathbf{V}) = \delta(\mathbf{V}' - \mathbf{V})$ for all s, we simply recover the simple importance sampling procedure of Eq. 7. [sent-117, score-0.154]

57 When evaluating the probability of a collection of several documents, we need to perform a separate AIS run per document, if those documents are of different lengths. [sent-119, score-0.164]

58 This is because each different-sized document can be represented as a separate RBM that has its own global normalizing constant. [sent-120, score-0.184]

59 We used the first 1690 documents as training data and the remaining 50 documents as test. [sent-124, score-0.346]

60 The dataset was already preprocessed, where each document was represented as a vector containing 13,649 word counts. [sent-125, score-0.221]

61 The data was split by date into 11,314 training and 7,531 test articles, so the training and test sets were separated in time. [sent-128, score-0.107]

62 We further preprocessed the data by removing common stopwords, stemming, and then only considering the 2000 most frequent words in the training dataset. [sent-129, score-0.109]

63 The topic classes form a tree which is typically of depth 3. [sent-133, score-0.217]

64 For this dataset, we define the relevance of one document to another to be the fraction of the topic labels that agree on the two paths from the root to the two documents. [sent-134, score-0.368]

65 The available data was already in the preprocessed format, where common stopwords were removed and all documents were stemmed. [sent-136, score-0.227]

66 We again only considered the 10,000 most frequent words in the training dataset. [sent-137, score-0.078]

67 For all datasets, each word count wi was replaced by log(1 + wi ), rounded to the nearest integer, which slightly improved retrieval performance of both models. [sent-138, score-0.141]
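
The count transform is a one-liner; a sketch for completeness (NumPy, our naming).

```python
import numpy as np

def preprocess_counts(raw_counts):
    # Replace each word count w_i by log(1 + w_i), rounded to the nearest integer.
    return np.rint(np.log1p(raw_counts)).astype(int)
```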

68 Table: average test perplexity per word (in nats), comparing LDA-50, LDA-200, and the Replicated Softmax model. [sent-167, score-0.185]

69 K is the vocabulary size, D̄ is the mean document length. [sent-169, score-0.151]

70 The average test perplexity per word was then estimated as $\exp\big(-\frac{1}{N} \sum_{n=1}^N \frac{1}{D_n} \log p(\mathbf{v}_n)\big)$, where N is the total number of documents, and $D_n$ and $\mathbf{v}_n$ are the total number of words and the observed word-count vector for document n. [sent-181, score-0.476]
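
The perplexity estimator above as a small helper, assuming per-document log-probability estimates (e.g. from AIS) are already available.

```python
import numpy as np

def avg_test_perplexity(log_probs, doc_lengths):
    # exp(-(1/N) * sum_n (1/D_n) * log p(v_n)), given log p(v_n) and lengths D_n.
    log_probs = np.asarray(log_probs, dtype=float)
    doc_lengths = np.asarray(doc_lengths, dtype=float)
    return np.exp(-np.mean(log_probs / doc_lengths))
```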

71 For the NIPS dataset, the undirected model achieves an average test perplexity of 3405, improving upon LDA’s perplexity of 3576. [sent-183, score-0.391]

72 The LDA with 200 topics performed much better on this dataset compared to the LDA-50, but its performance only slightly improved upon the 50-dimensional Replicated Softmax model. [sent-184, score-0.111]

73 For the 20-newsgroups dataset, even with 200 topics, the LDA could not match the perplexity of the Replicated Softmax model with 50 topic units. [sent-185, score-0.377]

74 LDA achieves an average test perplexity of 1437, substantially reducing it from 2208, achieved by a simple smoothed unigram model. [sent-187, score-0.211]

75 The Replicated Softmax further reduces the perplexity down to 986, which is comparable in magnitude to the improvement produced by the LDA over the unigram model. [sent-188, score-0.185]

76 LDA with 200 topics does improve upon LDA-50, achieving a perplexity of 1142. [sent-189, score-0.226]

77 For the 20-newsgroups and Reuters datasets, the 50 held-out documents were randomly sampled from the test sets. [sent-195, score-0.19]

78 Figure 3: Precision-Recall curves for the 20-newsgroups and Reuters datasets, when a query document from the test set is used to retrieve similar documents from the training corpus. [sent-209, score-0.359]

79 Figure 2 further shows three scatter plots of the average test perplexity per document. [sent-211, score-0.163]

80 Observe that for almost all test documents, the Replicated Softmax achieves a better perplexity compared to the corresponding LDA model. [sent-212, score-0.163]

81 For the Reuters dataset, as expected, there are many documents that are modeled much better by the undirected model than an LDA. [sent-213, score-0.255]

82 4 Document Retrieval We used the 20-newsgroups and Reuters datasets to evaluate model performance on a document retrieval task. [sent-216, score-0.28]

83 To decide whether a retrieved document is relevant to the query document, we simply check if they have the same class label. [sent-217, score-0.151]

84 For the Replicated Softmax, the mapping from a word-count vector to the values of the latent topic features is fast, requiring only a single matrix multiplication followed by a componentwise sigmoid non-linearity. [sent-219, score-0.295]

85 For the LDA, we used 1000 Gibbs sweeps per test document in order to get an approximate posterior over the topics. [sent-220, score-0.177]

86 Figure 3 shows that when we use the cosine of the angle between two topic vectors to measure their similarity, the Replicated Softmax significantly outperforms LDA, particularly when retrieving the top few documents. [sent-221, score-0.217]
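
As a concrete reading of this retrieval setup, a sketch that reuses `p_h_given_v` from the earlier block to map query and training documents to topic vectors and ranks by cosine similarity; `top_k` and the input format are our assumptions.

```python
import numpy as np

def retrieve(query_counts, corpus_counts, W, a, top_k=10):
    # Rank training documents by the cosine of the angle between topic vectors.
    q = p_h_given_v(query_counts, W, a)
    docs = np.array([p_h_given_v(c, W, a) for c in corpus_counts])
    sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q) + 1e-12)
    return np.argsort(-sims)[:top_k]
```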

87 5 Conclusions and Extensions We have presented a simple two-layer undirected topic model that can be used to model and automatically extract distributed semantic representations from large collections of text corpora. [sent-222, score-0.426]

88 The proposed model has several key advantages: the learning is easy and stable, it can model documents of different lengths, and computing the posterior distribution over the latent topic values is easy. [sent-224, score-0.524]

89 Furthermore, using stochastic gradient descent, scaling up learning to billions of documents would not be particularly difficult. [sent-225, score-0.164]

90 This is in contrast to directed topic models, where most of the existing inference algorithms are designed to be run in a batch mode. [sent-226, score-0.253]

91 We have also demonstrated that the proposed model is able to generalize much better than LDA in terms of both the log-probability on held-out documents and the retrieval accuracy. [sent-228, score-0.284]

92 In this paper we have only considered the simplest possible topic model, but the proposed model can be extended in several ways. [sent-229, score-0.24]

93 For example, similar to supervised LDA [1], the proposed Replicated Softmax can be easily extended to modeling the joint distribution over words and a document label, as shown in Fig. 4 (left). [sent-230, score-0.193]

94 Recently, [11] introduced a Dirichlet-multinomial regression model, where a prior on the document-specific topic distributions was modeled as a function of observed metadata of the document. [sent-232, score-0.3]

95 Figure 4: Left: A Replicated Softmax model that models the joint distribution of words and document label. [sent-234, score-0.24]

96 Right: Conditional Replicated Softmax model where the observed document-specific metadata affects binary states of the hidden topic units. [sent-235, score-0.377]

97 Such observed metadata can be used to influence the states of the latent topic units, as shown in Fig. 4 (right). [sent-236, score-0.295]

98 Finally, as argued by [13], a single layer of binary features may not be the best way to capture the complex structure in the count data. [sent-238, score-0.071]

99 Once the Replicated Softmax has been trained, we can add more layers to create a Deep Belief Network [8], which could potentially produce a better generative model and further improve retrieval accuracy. [sent-239, score-0.107]

100 The Rate Adapting Poisson (RAP) model for information retrieval and object recognition. [sent-262, score-0.09]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('softmax', 0.608), ('replicated', 0.458), ('topic', 0.217), ('lda', 0.186), ('reuters', 0.166), ('documents', 0.164), ('ais', 0.161), ('document', 0.151), ('perplexity', 0.137), ('hj', 0.105), ('units', 0.096), ('visible', 0.093), ('topics', 0.089), ('annealed', 0.083), ('pb', 0.083), ('latent', 0.078), ('ps', 0.076), ('rbm', 0.075), ('wjk', 0.069), ('undirected', 0.068), ('retrieval', 0.067), ('pa', 0.066), ('metadata', 0.064), ('importance', 0.059), ('zs', 0.056), ('epdata', 0.055), ('epmodel', 0.055), ('pdata', 0.055), ('wais', 0.055), ('vs', 0.052), ('vn', 0.051), ('hidden', 0.05), ('bk', 0.05), ('unigram', 0.048), ('contrastive', 0.048), ('word', 0.048), ('intermediate', 0.047), ('wij', 0.044), ('words', 0.042), ('salakhutdinov', 0.041), ('vi', 0.041), ('gibbs', 0.041), ('sampling', 0.04), ('datasets', 0.039), ('lengths', 0.039), ('poisson', 0.038), ('aj', 0.037), ('partition', 0.036), ('directed', 0.036), ('semantic', 0.035), ('normalizing', 0.033), ('multinomial', 0.033), ('stopwords', 0.032), ('preprocessed', 0.031), ('generalize', 0.03), ('transition', 0.028), ('corpus', 0.028), ('deep', 0.028), ('dirichlet', 0.028), ('za', 0.028), ('temperatures', 0.028), ('operators', 0.027), ('test', 0.026), ('count', 0.026), ('hinton', 0.025), ('nips', 0.025), ('models', 0.024), ('allocation', 0.024), ('volume', 0.023), ('text', 0.023), ('model', 0.023), ('binary', 0.023), ('dataset', 0.022), ('layer', 0.022), ('unit', 0.021), ('boltzmann', 0.021), ('grif', 0.021), ('exp', 0.021), ('assessing', 0.02), ('bipartite', 0.02), ('cd', 0.02), ('predicted', 0.02), ('multiplied', 0.02), ('assigns', 0.019), ('date', 0.019), ('computing', 0.019), ('distributions', 0.019), ('representations', 0.019), ('graphical', 0.019), ('training', 0.018), ('properly', 0.018), ('unobserved', 0.018), ('proceedings', 0.018), ('bias', 0.018), ('intractable', 0.018), ('frequent', 0.018), ('extract', 0.018), ('represents', 0.018), ('target', 0.018), ('generative', 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 204 nips-2009-Replicated Softmax: an Undirected Topic Model

Author: Geoffrey E. Hinton, Ruslan Salakhutdinov

Abstract: We introduce a two-layer undirected graphical model, called a “Replicated Softmax”, that can be used to model and automatically extract low-dimensional latent semantic representations from a large unstructured collection of documents. We present efficient learning and inference algorithms for this model, and show how a Monte-Carlo based method, Annealed Importance Sampling, can be used to produce an accurate estimate of the log-probability the model assigns to test data. This allows us to demonstrate that the proposed model is able to generalize much better compared to Latent Dirichlet Allocation in terms of both the log-probability of held-out documents and the retrieval accuracy.

2 0.20984726 205 nips-2009-Rethinking LDA: Why Priors Matter

Author: Andrew McCallum, David M. Mimno, Hanna M. Wallach

Abstract: Implementations of topic models typically use symmetric Dirichlet priors with fixed concentration parameters, with the implicit assumption that such “smoothing parameters” have little practical effect. In this paper, we explore several classes of structured priors for topic models. We find that an asymmetric Dirichlet prior over the document–topic distributions has substantial advantages over a symmetric prior, while an asymmetric prior over the topic–word distributions provides no real benefit. Approximation of this prior structure through simple, efficient hyperparameter optimization steps is sufficient to achieve these performance gains. The prior structure we advocate substantially increases the robustness of topic models to variations in the number of topics and to the highly skewed word frequency distributions common in natural language. Since this prior structure can be implemented using efficient algorithms that add negligible cost beyond standard inference techniques, we recommend it as a new standard for topic modeling. 1

3 0.18972979 65 nips-2009-Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process

Author: Chong Wang, David M. Blei

Abstract: We present a nonparametric hierarchical Bayesian model of document collections that decouples sparsity and smoothness in the component distributions (i.e., the “topics”). In the sparse topic model (sparseTM), each topic is represented by a bank of selector variables that determine which terms appear in the topic. Thus each topic is associated with a subset of the vocabulary, and topic smoothness is modeled on this subset. We develop an efficient Gibbs sampler for the sparseTM that includes a general-purpose method for sampling from a Dirichlet mixture with a combinatorial number of components. We demonstrate the sparseTM on four real-world datasets. Compared to traditional approaches, the empirical results will show that sparseTMs give better predictive performance with simpler inferred models. 1

4 0.14398326 2 nips-2009-3D Object Recognition with Deep Belief Nets

Author: Vinod Nair, Geoffrey E. Hinton

Abstract: We introduce a new type of top-level model for Deep Belief Nets and evaluate it on a 3D object recognition task. The top-level model is a third-order Boltzmann machine, trained using a hybrid algorithm that combines both generative and discriminative gradients. Performance is evaluated on the NORB database (normalized-uniform version), which contains stereo-pair images of objects under different lighting conditions and viewpoints. Our model achieves 6.5% error on the test set, which is close to the best published result for NORB (5.9%) using a convolutional neural net that has built-in knowledge of translation invariance. It substantially outperforms shallow models such as SVMs (11.6%). DBNs are especially suited for semi-supervised learning, and to demonstrate this we consider a modified version of the NORB recognition task in which additional unlabeled images are created by applying small translations to the images in the database. With the extra unlabeled data (and the same amount of labeled data as before), our model achieves 5.2% error. 1

5 0.13596439 4 nips-2009-A Bayesian Analysis of Dynamics in Free Recall

Author: Richard Socher, Samuel Gershman, Per Sederberg, Kenneth Norman, Adler J. Perotte, David M. Blei

Abstract: We develop a probabilistic model of human memory performance in free recall experiments. In these experiments, a subject first studies a list of words and then tries to recall them. To model these data, we draw on both previous psychological research and statistical topic models of text documents. We assume that memories are formed by assimilating the semantic meaning of studied words (represented as a distribution over topics) into a slowly changing latent context (represented in the same space). During recall, this context is reinstated and used as a cue for retrieving studied words. By conceptualizing memory retrieval as a dynamic latent variable model, we are able to use Bayesian inference to represent uncertainty and reason about the cognitive processes underlying memory. We present a particle filter algorithm for performing approximate posterior inference, and evaluate our model on the prediction of recalled words in experimental data. By specifying the model hierarchically, we are also able to capture inter-subject variability. 1

6 0.1272562 190 nips-2009-Polynomial Semantic Indexing

7 0.11638097 153 nips-2009-Modeling Social Annotation Data with Content Relevance using a Topic Model

8 0.11614585 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models

9 0.11358142 96 nips-2009-Filtering Abstract Senses From Image Search Results

10 0.10315154 186 nips-2009-Parallel Inference for Latent Dirichlet Allocation on Graphics Processing Units

11 0.08834479 226 nips-2009-Spatial Normalized Gamma Processes

12 0.087885782 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions

13 0.08678212 28 nips-2009-An Additive Latent Feature Model for Transparent Object Recognition

14 0.081377201 68 nips-2009-Dirichlet-Bernoulli Alignment: A Generative Model for Multi-Class Multi-Label Multi-Instance Corpora

15 0.075857021 47 nips-2009-Boosting with Spatial Regularization

16 0.071447454 87 nips-2009-Exponential Family Graph Matching and Ranking

17 0.061510451 255 nips-2009-Variational Inference for the Nested Chinese Restaurant Process

18 0.058502689 97 nips-2009-Free energy score space

19 0.057700109 260 nips-2009-Zero-shot Learning with Semantic Output Codes

20 0.05300862 151 nips-2009-Measuring Invariances in Deep Networks


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.165), (1, -0.104), (2, -0.101), (3, -0.171), (4, 0.101), (5, -0.179), (6, -0.089), (7, 0.034), (8, -0.08), (9, 0.206), (10, -0.029), (11, 0.025), (12, -0.093), (13, 0.157), (14, 0.121), (15, -0.042), (16, -0.136), (17, -0.099), (18, -0.073), (19, -0.018), (20, -0.007), (21, 0.028), (22, 0.01), (23, -0.093), (24, -0.038), (25, -0.045), (26, 0.02), (27, -0.013), (28, 0.009), (29, -0.023), (30, 0.033), (31, 0.07), (32, 0.031), (33, -0.003), (34, 0.02), (35, 0.03), (36, -0.024), (37, 0.028), (38, 0.006), (39, -0.01), (40, -0.062), (41, 0.054), (42, -0.117), (43, -0.027), (44, -0.067), (45, -0.065), (46, 0.019), (47, 0.04), (48, -0.016), (49, 0.002)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95328695 204 nips-2009-Replicated Softmax: an Undirected Topic Model

Author: Geoffrey E. Hinton, Ruslan Salakhutdinov

Abstract: We introduce a two-layer undirected graphical model, called a “Replicated Softmax”, that can be used to model and automatically extract low-dimensional latent semantic representations from a large unstructured collection of documents. We present efficient learning and inference algorithms for this model, and show how a Monte-Carlo based method, Annealed Importance Sampling, can be used to produce an accurate estimate of the log-probability the model assigns to test data. This allows us to demonstrate that the proposed model is able to generalize much better compared to Latent Dirichlet Allocation in terms of both the log-probability of held-out documents and the retrieval accuracy.

2 0.83698463 65 nips-2009-Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process

Author: Chong Wang, David M. Blei

Abstract: We present a nonparametric hierarchical Bayesian model of document collections that decouples sparsity and smoothness in the component distributions (i.e., the “topics”). In the sparse topic model (sparseTM), each topic is represented by a bank of selector variables that determine which terms appear in the topic. Thus each topic is associated with a subset of the vocabulary, and topic smoothness is modeled on this subset. We develop an efficient Gibbs sampler for the sparseTM that includes a general-purpose method for sampling from a Dirichlet mixture with a combinatorial number of components. We demonstrate the sparseTM on four real-world datasets. Compared to traditional approaches, the empirical results will show that sparseTMs give better predictive performance with simpler inferred models. 1

3 0.82088 205 nips-2009-Rethinking LDA: Why Priors Matter

Author: Andrew McCallum, David M. Mimno, Hanna M. Wallach

Abstract: Implementations of topic models typically use symmetric Dirichlet priors with fixed concentration parameters, with the implicit assumption that such “smoothing parameters” have little practical effect. In this paper, we explore several classes of structured priors for topic models. We find that an asymmetric Dirichlet prior over the document–topic distributions has substantial advantages over a symmetric prior, while an asymmetric prior over the topic–word distributions provides no real benefit. Approximation of this prior structure through simple, efficient hyperparameter optimization steps is sufficient to achieve these performance gains. The prior structure we advocate substantially increases the robustness of topic models to variations in the number of topics and to the highly skewed word frequency distributions common in natural language. Since this prior structure can be implemented using efficient algorithms that add negligible cost beyond standard inference techniques, we recommend it as a new standard for topic modeling. 1

4 0.77264661 153 nips-2009-Modeling Social Annotation Data with Content Relevance using a Topic Model

Author: Tomoharu Iwata, Takeshi Yamada, Naonori Ueda

Abstract: We propose a probabilistic topic model for analyzing and extracting contentrelated annotations from noisy annotated discrete data such as web pages stored in social bookmarking services. In these services, since users can attach annotations freely, some annotations do not describe the semantics of the content, thus they are noisy, i.e. not content-related. The extraction of content-related annotations can be used as a preprocessing step in machine learning tasks such as text classification and image recognition, or can improve information retrieval performance. The proposed model is a generative model for content and annotations, in which the annotations are assumed to originate either from topics that generated the content or from a general distribution unrelated to the content. We demonstrate the effectiveness of the proposed method by using synthetic data and real social annotation data for text and images.

5 0.67234564 68 nips-2009-Dirichlet-Bernoulli Alignment: A Generative Model for Multi-Class Multi-Label Multi-Instance Corpora

Author: Shuang-hong Yang, Hongyuan Zha, Bao-gang Hu

Abstract: We propose Dirichlet-Bernoulli Alignment (DBA), a generative model for corpora in which each pattern (e.g., a document) contains a set of instances (e.g., paragraphs in the document) and belongs to multiple classes. By casting predefined classes as latent Dirichlet variables (i.e., instance level labels), and modeling the multi-label of each pattern as Bernoulli variables conditioned on the weighted empirical average of topic assignments, DBA automatically aligns the latent topics discovered from data to human-defined classes. DBA is useful for both pattern classification and instance disambiguation, which are tested on text classification and named entity disambiguation in web search queries respectively.

6 0.63089752 186 nips-2009-Parallel Inference for Latent Dirichlet Allocation on Graphics Processing Units

7 0.57339215 4 nips-2009-A Bayesian Analysis of Dynamics in Free Recall

8 0.4955866 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models

9 0.47672054 226 nips-2009-Spatial Normalized Gamma Processes

10 0.46835354 96 nips-2009-Filtering Abstract Senses From Image Search Results

11 0.46356493 190 nips-2009-Polynomial Semantic Indexing

12 0.4568274 28 nips-2009-An Additive Latent Feature Model for Transparent Object Recognition

13 0.41497344 255 nips-2009-Variational Inference for the Nested Chinese Restaurant Process

14 0.41272786 2 nips-2009-3D Object Recognition with Deep Belief Nets

15 0.40022498 143 nips-2009-Localizing Bugs in Program Executions with Graphical Models

16 0.40019819 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions

17 0.37888345 97 nips-2009-Free energy score space

18 0.33244163 90 nips-2009-Factor Modeling for Advertisement Targeting

19 0.32931027 171 nips-2009-Nonparametric Bayesian Models for Unsupervised Event Coreference Resolution

20 0.30638817 260 nips-2009-Zero-shot Learning with Semantic Output Codes


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(12, 0.017), (18, 0.214), (21, 0.01), (24, 0.039), (25, 0.048), (35, 0.049), (36, 0.098), (39, 0.069), (58, 0.05), (61, 0.013), (71, 0.135), (81, 0.02), (86, 0.131), (91, 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.86329108 204 nips-2009-Replicated Softmax: an Undirected Topic Model

Author: Geoffrey E. Hinton, Ruslan Salakhutdinov

Abstract: We introduce a two-layer undirected graphical model, called a “Replicated Softmax”, that can be used to model and automatically extract low-dimensional latent semantic representations from a large unstructured collection of documents. We present efficient learning and inference algorithms for this model, and show how a Monte-Carlo based method, Annealed Importance Sampling, can be used to produce an accurate estimate of the log-probability the model assigns to test data. This allows us to demonstrate that the proposed model is able to generalize much better compared to Latent Dirichlet Allocation in terms of both the log-probability of held-out documents and the retrieval accuracy.

2 0.82018483 112 nips-2009-Human Rademacher Complexity

Author: Xiaojin Zhu, Bryan R. Gibson, Timothy T. Rogers

Abstract: We propose to use Rademacher complexity, originally developed in computational learning theory, as a measure of human learning capacity. Rademacher complexity measures a learner’s ability to fit random labels, and can be used to bound the learner’s true error based on the observed training sample error. We first review the definition of Rademacher complexity and its generalization bound. We then describe a “learning the noise” procedure to experimentally measure human Rademacher complexities. The results from empirical studies showed that: (i) human Rademacher complexity can be successfully measured, (ii) the complexity depends on the domain and training sample size in intuitive ways, (iii) human learning respects the generalization bounds, (iv) the bounds can be useful in predicting the danger of overfitting in human learning. Finally, we discuss the potential applications of human Rademacher complexity in cognitive science. 1

3 0.72914183 128 nips-2009-Learning Non-Linear Combinations of Kernels

Author: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh

Abstract: This paper studies the general problem of learning kernels based on a polynomial combination of base kernels. We analyze this problem in the case of regression and the kernel ridge regression algorithm. We examine the corresponding learning kernel optimization problem, show how that minimax problem can be reduced to a simpler minimization problem, and prove that the global solution of this problem always lies on the boundary. We give a projection-based gradient descent algorithm for solving the optimization problem, shown empirically to converge in few iterations. Finally, we report the results of extensive experiments with this algorithm using several publicly available datasets demonstrating the effectiveness of our technique.

4 0.71518302 56 nips-2009-Conditional Neural Fields

Author: Jian Peng, Liefeng Bo, Jinbo Xu

Abstract: Conditional random fields (CRF) are widely used for sequence labeling such as natural language processing and biological sequence analysis. Most CRF models use a linear potential function to represent the relationship between input features and output. However, in many real-world applications such as protein structure prediction and handwriting recognition, the relationship between input features and output is highly complex and nonlinear, which cannot be accurately modeled by a linear function. To model the nonlinear relationship between input and output we propose a new conditional probabilistic graphical model, Conditional Neural Fields (CNF), for sequence labeling. CNF extends CRF by adding one (or possibly more) middle layer between input and output. The middle layer consists of a number of gate functions, each acting as a local neuron or feature extractor to capture the nonlinear relationship between input and output. Therefore, conceptually CNF is much more expressive than CRF. Experiments on two widely-used benchmarks indicate that CNF performs significantly better than a number of popular methods. In particular, CNF is the best among approximately 10 machine learning methods for protein secondary structure prediction and also among a few of the best methods for handwriting recognition.

5 0.71258664 130 nips-2009-Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization

Author: Massih Amini, Nicolas Usunier, Cyril Goutte

Abstract: We address the problem of learning classifiers when observations have multiple views, some of which may not be observed for all examples. We assume the existence of view generating functions which may complete the missing views in an approximate way. This situation corresponds for example to learning text classifiers from multilingual collections where documents are not available in all languages. In that case, Machine Translation (MT) systems may be used to translate each document in the missing languages. We derive a generalization error bound for classifiers learned on examples with multiple artificially created views. Our result uncovers a trade-off between the size of the training set, the number of views, and the quality of the view generating functions. As a consequence, we identify situations where it is more interesting to use multiple views for learning instead of classical single view learning. An extension of this framework is a natural way to leverage unlabeled multi-view data in semi-supervised learning. Experimental results on a subset of the Reuters RCV1/RCV2 collections support our findings by showing that additional views obtained from MT may significantly improve the classification performance in the cases identified by our trade-off. 1

6 0.70727664 96 nips-2009-Filtering Abstract Senses From Image Search Results

7 0.6974144 260 nips-2009-Zero-shot Learning with Semantic Output Codes

8 0.69102865 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions

9 0.6900112 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs

10 0.68884575 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

11 0.68650651 205 nips-2009-Rethinking LDA: Why Priors Matter

12 0.67847836 111 nips-2009-Hierarchical Modeling of Local Image Features through $L p$-Nested Symmetric Distributions

13 0.67820233 196 nips-2009-Quantification and the language of thought

14 0.67683828 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability

15 0.67572391 154 nips-2009-Modeling the spacing effect in sequential category learning

16 0.67568523 226 nips-2009-Spatial Normalized Gamma Processes

17 0.6707201 2 nips-2009-3D Object Recognition with Deep Belief Nets

18 0.67008412 90 nips-2009-Factor Modeling for Advertisement Targeting

19 0.66960561 65 nips-2009-Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process

20 0.66772503 188 nips-2009-Perceptual Multistability as Markov Chain Monte Carlo Inference