nips nips2012 nips2012-274 knowledge-graph by maker-knowledge-mining

274 nips-2012-Priors for Diversity in Generative Latent Variable Models


Source: pdf

Author: James T. Kwok, Ryan P. Adams

Abstract: Probabilistic latent variable models are one of the cornerstones of machine learning. They offer a convenient and coherent way to specify prior distributions over unobserved structure in data, so that these unknown properties can be inferred via posterior inference. Such models are useful for exploratory analysis and visualization, for building density models of data, and for providing features that can be used for later discriminative tasks. A significant limitation of these models, however, is that draws from the prior are often highly redundant due to i.i.d. assumptions on internal parameters. For example, there is no preference in the prior of a mixture model to make components non-overlapping, or in a topic model to ensure that co-occurring words only appear in a small number of topics. In this work, we revisit these independence assumptions for probabilistic latent variable models, replacing the underlying i.i.d. prior with a determinantal point process (DPP). The DPP allows us to specify a preference for diversity in our latent variables using a positive definite kernel function. Using a kernel between probability distributions, we are able to define a DPP on probability measures. We show how to perform MAP inference with DPP priors in latent Dirichlet allocation and in mixture models, leading to better intuition for the latent variable representation and quantitatively improved unsupervised feature extraction, without compromising the generative aspects of the model.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Probabilistic latent variable models are one of the cornerstones of machine learning. [sent-7, score-0.141]

2 They offer a convenient and coherent way to specify prior distributions over unobserved structure in data, so that these unknown properties can be inferred via posterior inference. [sent-8, score-0.183]

3 For example, there is no preference in the prior of a mixture model to make components non-overlapping, or in a topic model to ensure that co-occurring words only appear in a small number of topics. [sent-14, score-0.249]

4 In this work, we revisit these independence assumptions for probabilistic latent variable models, replacing the underlying i.i.d. prior with a determinantal point process (DPP). [sent-15, score-0.165]

5 The DPP allows us to specify a preference for diversity in our latent variables using a positive definite kernel function. [sent-19, score-0.48]

6 We show how to perform MAP inference with DPP priors in latent Dirichlet allocation and in mixture models, leading to better intuition for the latent variable representation and quantitatively improved unsupervised feature extraction, without compromising the generative aspects of the model. [sent-21, score-0.559]

7 1 Introduction The probabilistic generative model is an important tool for statistical learning because it enables rich data to be explained in terms of simpler latent structure. [sent-22, score-0.276]

8 In the latter case, we might think of the inferred latent structure as providing a feature representation that summarizes complex high-dimensional interaction into a simpler form. [sent-24, score-0.254]

9 The core assumption behind the use of latent variables as features, however, is that the salient statistical properties discovered by unsupervised learning will be useful for discriminative tasks. [sent-25, score-0.141]

10 Most often, one builds a model where the feature representations are independent a priori, with the hope that a good fit to the data will require employing a variety of latent variables. [sent-28, score-0.169]

11 For example, in a generative clustering model based on a mixture distribution, multiple mixture components will often be used for a single “intuitive group” in the data, simply because the shape of the component’s density is not a close fit to the group’s distribution. [sent-30, score-0.247]

12 A generative mixture model will happily use many of its components to closely fit the density of a single group of data, leading to a highly redundant feature representation. [sent-31, score-0.278]

13 Similarly, when applied to a text corpus, a topic model such as latent Dirichlet allocation [1] will place large probability mass on the same stop words across many topics, in order to fine-tune the probability assigned to the common case. [sent-32, score-0.34]

14 In both of these situations, we would like the latent groupings to uniquely correspond to characteristics of the data: that a group of data should be explained by one mixture component, and that common stop words should be one category of words among many. [sent-33, score-0.388]

15 This intuition expresses a need for diversity in the latent parameters of the model that goes beyond what is highly likely under the posterior distribution implied by an independent prior. [sent-34, score-0.357]

16 In this paper, we propose a modular approach to diversity in generative probabilistic models by replacing the independent prior on latent parameters with a determinantal point process (DPP). [sent-35, score-1.114]

17 The determinantal point process enables a modeler to specify a notion of similarity on the space of interest, which in this case is a space of possible latent distributions, via a positive definite kernel. [sent-36, score-0.753]

18 This construction naturally leads to a generative latent variable model in which diverse sets of latent parameters are preferred over redundant sets. [sent-38, score-0.459]

19 The determinantal point process is a convenient statistical tool for constructing a tractable point process with repulsive interaction. [sent-39, score-0.7]

20 [5] provides a useful survey of probabilistic properties of the determinantal point process, and for statistical properties, see, e. [sent-44, score-0.56]

21 2 Diversity in Generative Latent Variable Models. In this paper we consider generic directed probabilistic latent variable models that produce distributions over a set of N data, denoted {x_n}_{n=1}^N, which live in a sample space X. [sent-51, score-0.207]

22 Each of these data has a latent discrete label z_n, which takes a value in {1, 2, · · · , J}. [sent-52, score-0.252]

23 The latent label indexes into a set of parameters {θ_j}_{j=1}^J. [sent-53, score-0.141]

24 The parameters determined by z_n then produce the data according to a distribution f(x_n | θ_{z_n}). [sent-54, score-0.111]

25 Typically we use independent priors for the θj , here denoted by π(·), but the distribution over the latent indices zn may be more structured. [sent-55, score-0.297]

26 Taken together this leads to the generic joint distribution: p({x_n, z_n}_{n=1}^N, {θ_j}_{j=1}^J) = p({z_n}_{n=1}^N) ∏_{n=1}^N f(x_n | θ_{z_n}) ∏_{j=1}^J π(θ_j). [sent-56, score-0.111]
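To make the independence assumption concrete, here is a minimal NumPy sketch of the log of this joint distribution for a toy one-dimensional Gaussian mixture; the unit-variance likelihood and the N(0, 10^2) prior are illustrative choices, not ones taken from the paper.

```python
import numpy as np
from scipy.stats import norm

def log_joint(x, z, theta, log_p_z):
    """log p({x_n, z_n}, {theta_j}) for a toy 1-D Gaussian mixture with i.i.d. priors pi(theta_j)."""
    log_lik = norm.logpdf(x, loc=theta[z], scale=1.0).sum()    # sum_n log f(x_n | theta_{z_n})
    log_prior = norm.logpdf(theta, loc=0.0, scale=10.0).sum()  # sum_j log pi(theta_j): independent terms
    return log_p_z(z) + log_lik + log_prior
```

The DPP construction described below replaces only the independent sum over log π(θ_j); the likelihood and the prior over the labels z_n are left untouched.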

27 For example, in a typical mixture model, the zn are drawn independently from a multinomial distribution and the θj are the component-specific parameters. [sent-58, score-0.19]

28 In an admixture model such as latent Dirichlet allocation (LDA) [1], the θj may be “topics”, or distributions over words. [sent-59, score-0.292]

29 In an admixture, the zn may share structure based on, e. [sent-60, score-0.111]

30 At training time, one either finds the maximum of the posterior distribution p({θ_j}_{j=1}^J | {x_n}_{n=1}^N) or collects samples from it, while integrating out the data-specific latent variables z_n. [sent-64, score-0.286]

31 Then when presented with a test case x′, one can construct a conditional distribution over the corresponding unknown variable z′, which is now a “feature” that might usefully summarize many related aspects of x′. [sent-65, score-0.102]

32 However, this interpretation of the model is suspect; we have not asked the model to make the zn variables explanatory, except as a byproduct of improving the training likelihood. [sent-66, score-0.145]

33 Figure 1: Illustrations of the determinantal point process prior. Panels contrast independent draws with DPP draws for (a, b) points, (c, d) Gaussians, and (e, f) multinomials. [sent-68, score-0.583]

34 2.1 Measure-Valued Determinantal Point Process. In this work we propose an alternative to the independence assumption of the standard latent variable model. [sent-71, score-0.141]

35 Rather than specifying p({θ_j}_{j=1}^J) = ∏_j π(θ_j), we will construct a determinantal point process on sets of component-specific distributions {f(x | θ_j)}_{j=1}^J. [sent-72, score-0.66]

36 Via the DPP, it will be possible for us to specify a preference for sets of distributions that have minimal overlap, as determined via a positive-definite kernel function between distributions. [sent-73, score-0.192]

37 In the case of the simple parametric families for f(·) that we consider here, it is appropriate to think of the DPP as providing a “diverse” set of parameters θ = {θ_j}_{j=1}^J, where the notion of diversity is expressed entirely in terms of the resulting probability measure on the sample space X. [sent-74, score-0.239]

38 To construct a determinantal point process, we first define a positive definite kernel on Θ, which we denote K : Θ × Θ → R. [sent-78, score-0.655]

39 Such kernels will always induce repulsion of the points so that diverse subsets of Θ will tend to have higher probability under the prior. [sent-86, score-0.109]
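As a rough illustration of this construction (not code from the paper), the unnormalized log-probability a DPP assigns to a finite set of parameters is the log-determinant of the kernel Gram matrix; the squared-exponential kernel below is only a placeholder for whatever positive definite kernel is chosen on Θ.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    # Placeholder positive definite kernel between rows of A and rows of B.
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * sq / lengthscale**2)

def dpp_log_score(theta, kernel=rbf_kernel):
    """Unnormalized log DPP probability of the parameter set theta (one row per parameter)."""
    _, logdet = np.linalg.slogdet(kernel(theta, theta))
    return logdet

# Well-separated parameters receive a higher score than near-duplicates:
spread = np.array([[0.0], [2.0], [4.0]])
clumped = np.array([[0.0], [0.1], [0.2]])
assert dpp_log_score(spread) > dpp_log_score(clumped)
```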

40 2.2 Kernels for Probability Distributions. The determinantal point process framework allows us to construct a generative model for repulsion, but as with other kernel-based priors, we must define what “repulsion” means. [sent-90, score-0.706]

41 A variety of positive definite functions on probability measures have been defined, but in this work we will use the probability product kernel [10]. [sent-91, score-0.114]

42 This kernel is a natural generalization of the inner product for probability distributions. [sent-92, score-0.114]

43 This kernel has convenient closed forms for several distributions of interest, which makes it an ideal building block for the present model. [sent-95, score-0.155]
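The probability product kernel of [10] is K_ρ(f, f′) = ∫ f(x)^ρ f′(x)^ρ dx; for discrete distributions such as LDA topics this reduces to a sum over the vocabulary, with ρ = 1/2 giving the Bhattacharyya kernel. A small sketch of that special case (my own illustration, not the paper's code):

```python
import numpy as np

def product_kernel_discrete(beta1, beta2, rho=0.5):
    """Probability product kernel between two distributions over the same vocabulary."""
    return np.sum((beta1 * beta2) ** rho)

topic_a = np.array([0.7, 0.2, 0.1])
topic_b = np.array([0.1, 0.2, 0.7])
print(product_kernel_discrete(topic_a, topic_a))  # 1.0 for rho = 0.5 (self-similarity)
print(product_kernel_discrete(topic_a, topic_b))  # smaller for topics with little overlap
```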

44 That is, under the interpretation of a Bayesian prior as “inferences from previously-seen data”, we would like to be able to imagine an arbitrary amount of such data and construct a highly-informative prior when appropriate. [sent-98, score-0.131]

45 Unfortunately, the standard determinantal point process does not provide a knob to turn to increase its strength arbitrarily. [sent-99, score-0.607]

46 The objective is to have a scaling parameter that can cause the determinantal prior to be arbitrarily strong relative to the likelihood terms. [sent-104, score-0.543]

47 To address this issue, we augment the determinantal point process with an additional parameter λ > 0, so that the probability of a finite subset θ ⊂ Θ becomes p(θ ⊂ Θ) ∝ |K_θ|^λ (8). [sent-108, score-0.583]

48 For integer λ, it can be viewed as a set of λ identical “replicated realizations” from determinantal point processes, leaving our generative view intact. [sent-109, score-0.624]

49 This maps well onto the view of a prior as pseudo-data; our replicated DPP asserts that we have seen λ previous such data sets. [sent-111, score-0.133]

50 As in other pseudo-count priors, we do not require in practice that λ be an integer, and under a penalized log likelihood view of MAP inference, it can be interpreted as a parameter for increasing the effect of the determinantal penalty. [sent-112, score-0.495]

51 In addition to acting as a prior over distributions in the generative setting, we can also view the DPP as a new type of “diversity” regularizer on learning. [sent-115, score-0.204]

52 The goal is to solve θ = argmin_{θ⊂Θ} { L(θ; {x_n}_{n=1}^N) − λ ln |K_θ| } (9). Figure 2: Schematic of DPP-LDA. [sent-116, score-0.26]
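Under this penalized-likelihood reading of Eqn. (9), the DPP contributes nothing more than an extra log-determinant term added to whatever loss is already being minimized. A generic sketch, with the loss and kernel left as caller-supplied placeholders:

```python
import numpy as np

def diversity_regularized_loss(theta, X, loss_fn, kernel_fn, lam):
    """L(theta; X) - lam * log|K_theta|: small when theta both fits the data and is diverse."""
    _, logdet = np.linalg.slogdet(kernel_fn(theta, theta))  # Gram matrix over the latent parameters
    return loss_fn(theta, X) - lam * logdet
```

Setting lam = 0 recovers the usual estimator; increasing it trades data fit for repulsion among the latent parameters.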

53 The schematic replaces the i.i.d. topics in LDA with a “double-struck plate” to indicate a determinantal point process. [sent-119, score-0.705]

54 In the following sections, we give empirical evidence that this diversity improves generalization performance. [sent-126, score-0.189]

55 Here we examine two illustrative examples of generative latent variable models into which we can directly plug our DPP-based prior. [sent-131, score-0.229]

56 Diversified Latent Dirichlet Allocation. Latent Dirichlet allocation (LDA) [1] is an immensely popular admixture model for text and, increasingly, for other kinds of data that can be treated as a “bag of words”. [sent-132, score-0.109]

57 LDA constructs a set of topics — distributions over the vocabulary — and asserts that each word in the corpus is explained by one of these topics. [sent-133, score-0.333]

58 The topic-word assignments are unobserved, but LDA attempts to find structure by requiring that only a small number of topics be represented in any given document. [sent-134, score-0.169]

59 In the standard LDA formulation, the topics are K discrete distributions βk over a vocabulary of size V , where βkv is the probability of topic k generating word v. [sent-135, score-0.281]

60 Document m has a latent multinomial distribution over topics, denoted θm and each word in the document wmn has a topic index zmn drawn from θm . [sent-137, score-0.466]

61 While classical LDA uses independent Dirichlet priors for the βk , here we “diversify” latent Dirichlet allocation by replacing this prior with a DPP. [sent-138, score-0.296]

62 φm is an N × K matrix in which the nth row, denoted φmn , is a multinomial distribution over topics for word wmn . [sent-152, score-0.338]

63 The inclusion of the determinantal point process prior does, however, affect the maximization step. [sent-157, score-0.631]

64 The diversity prior introduces an additional penalty on β, so that the M-step requires solving β = argmax_β ∑_{m=1}^M ∑_{n=1}^{N_m} ∑_{k=1}^K ∑_{v=1}^V φ_mnk w_mn^(v) ln β_kv + λ ln |K_β| (13), subject to the constraint that each row of β sums to 1. [sent-158, score-0.553]

65 For λ = 0, this optimization procedure yields the standard update for vanilla LDA, β_kv ∝ ∑_{m=1}^M ∑_{n=1}^{N_m} φ_mnk w_mn^(v). [sent-159, score-0.157]
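A sketch of this M-step in NumPy: for λ = 0 the update is the usual normalized soft count, while for λ > 0 one instead maximizes the penalized objective of Eqn. (13) over the simplex (the choice of constrained optimizer is left open here; this is an illustration, not the authors' implementation).

```python
import numpy as np

def vanilla_lda_mstep(phi, w_counts):
    """lambda = 0 closed form: beta_kv proportional to expected topic-word counts."""
    counts = np.einsum('mnk,mnv->kv', phi, w_counts)  # phi: (M, N, K), w_counts: (M, N, V) one-hot
    return counts / counts.sum(axis=1, keepdims=True)

def dpp_lda_mstep_objective(beta, phi, w_counts, lam, kernel_fn):
    """Eqn. (13): expected log-likelihood of the words plus lam * log|K_beta|."""
    counts = np.einsum('mnk,mnv->kv', phi, w_counts)
    log_lik = np.sum(counts * np.log(beta + 1e-12))
    _, logdet = np.linalg.slogdet(kernel_fn(beta, beta))
    return log_lik + lam * logdet
```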

66 Diversified Gaussian Mixture Model. The mixture model is a popular model for generative clustering and density estimation. [sent-161, score-0.169]

67 Here we examine determinantal point process priors for the θk in the case where the components are Gaussian. [sent-164, score-0.652]

68 Let f_1 = N(μ_1, Σ_1) and f_2 = N(μ_2, Σ_2) be two Gaussians; the product kernel is K(f_1, f_2) = (2π)^{(1−2ρ)D/2} ρ^{−D/2} |Σ̂|^{1/2} (|Σ_1||Σ_2|)^{−ρ/2} exp(−(ρ/2)(μ_1^T Σ_1^{−1} μ_1 + μ_2^T Σ_2^{−1} μ_2 − μ̂^T Σ̂ μ̂)), where Σ̂ = (Σ_1^{−1} + Σ_2^{−1})^{−1} and μ̂ = Σ_1^{−1} μ_1 + Σ_2^{−1} μ_2. [sent-167, score-0.114]
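A direct transcription of this closed form (as reconstructed above) into NumPy; treat it as a sketch to be checked against [10] rather than as the authors' code.

```python
import numpy as np

def gaussian_product_kernel(mu1, Sigma1, mu2, Sigma2, rho=0.5):
    """Probability product kernel between N(mu1, Sigma1) and N(mu2, Sigma2)."""
    D = mu1.shape[0]
    S1inv, S2inv = np.linalg.inv(Sigma1), np.linalg.inv(Sigma2)
    Sigma_hat = np.linalg.inv(S1inv + S2inv)
    mu_hat = S1inv @ mu1 + S2inv @ mu2
    quad = mu1 @ S1inv @ mu1 + mu2 @ S2inv @ mu2 - mu_hat @ Sigma_hat @ mu_hat
    return ((2.0 * np.pi) ** ((1.0 - 2.0 * rho) * D / 2.0)
            * rho ** (-D / 2.0)
            * np.sqrt(np.linalg.det(Sigma_hat))
            * (np.linalg.det(Sigma1) * np.linalg.det(Sigma2)) ** (-rho / 2.0)
            * np.exp(-0.5 * rho * quad))
```

As a sanity check, for rho = 1 and two identical standard normals in one dimension this returns 1/(2√π), matching ∫ N(x; 0, 1)² dx.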

69 In the standard EM algorithm for Gaussian mixtures, one typically introduces latent binary variables znj , which indicate that datum n belongs to component j. [sent-169, score-0.314]

70 The difference between this procedure and the standard EM approach is that the M-step for the DPP-GMM optimizes the objective function (summarizing {μ_j, Σ_j}_{j=1}^J by θ for clarity): θ = argmax_θ { ∑_{n=1}^N ∑_{j=1}^J γ(z_nj) [ln χ_j + ln N(x_n | μ_j, Σ_j)] + λ ln |K_θ| }. [sent-173, score-0.203]

71 Figure 4: Effect of centroid distance on test error (test error plotted against the % increase in inter-centroid distance). [sent-188, score-0.132]

72 The kernel acts on the set of centroids as in Eqn. [sent-190, score-0.169]

73 Let θ = {μ_j} and z_nj be the hard assignment indicator; the maximization step is θ = argmax_θ { −∑_{n=1}^N ∑_{j=1}^J z_nj ||x_n − μ_j||^2 + λ ln |K_θ| }. [sent-192, score-0.394]
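One way to implement this M-step is to keep the usual hard assignment step and hand the penalized centroid objective to a generic optimizer; the L-BFGS call below is my choice of optimizer, not necessarily the one used in the paper.

```python
import numpy as np
from scipy.optimize import minimize

def assign(X, mu):
    # Standard hard assignment: index of the nearest centroid for each datum.
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def dpp_kmeans_mstep(X, z, mu_init, lam, kernel_fn):
    """Maximize -sum_n ||x_n - mu_{z_n}||^2 + lam * log|K_mu| over the centroids."""
    J, D = mu_init.shape

    def neg_objective(mu_flat):
        mu = mu_flat.reshape(J, D)
        fit = np.sum((X - mu[z]) ** 2)
        _, logdet = np.linalg.slogdet(kernel_fn(mu, mu))
        return fit - lam * logdet

    res = minimize(neg_objective, mu_init.ravel(), method='L-BFGS-B')
    return res.x.reshape(J, D)
```

With lam = 0 this reduces to ordinary K-means, since the unconstrained optimum of the fit term is the per-cluster mean.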

74 A common frustration with vanilla LDA is that applying LDA to unfiltered data returns topics that are dominated by stop-words. [sent-197, score-0.213]

75 This frustrating phenomenon occurs even as the number of topics is varied from K = 5 to K = 50. [sent-198, score-0.169]

76 The first two columns of Table 1 show the ten most frequent words from two representative topics learned by LDA using K = 25 . [sent-199, score-0.25]

77 However, the topics inferred by regular LDA are still dominated by secondary stop-words that are not informative. [sent-202, score-0.23]

78 By finding stop-word-specific topics, the majority of the remaining topics are available for more informative words. [sent-204, score-0.2]

79 Table 1 shows a sample of topics learned by DPP-LDA on the unfiltered 20 Newsgroup corpus (K = 25, λ = 104 ). [sent-205, score-0.207]

80 High frequency words that are common across many topics significantly increase the similarity between the topics, as measured by the product kernel on the β distributions. [sent-207, score-0.354]

81 This similarity incurs a large penalty in DPP and so the objective actively pushes the parameters of LDA away from regions where stop words occupy large probability mass across many topics. [sent-208, score-0.132]

82 It is common to use the γm , the document specific posterior distribution over topics, as feature vectors in document classification. [sent-210, score-0.142]

83 We inferred {γm,train } on training documents from DPP-LDA variational EM, and then trained a support vector machine (SVM) classifier on {γm,train } with the true topic labels from 20 Newsgroups. [sent-211, score-0.177]

84 On test documents, we fixed the parameters α and β to the values inferred from the training set, and used variational EM to find MAP estimates of {γm,test }. [sent-212, score-0.13]

85 Each patch is then represented by a binary K-dimensional feature vector: the kth entry is one if the patch is closer to centroid k than its average distance to the centroids. [sent-249, score-0.159]
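A sketch of this encoding step (patch preprocessing such as whitening is omitted; the thresholding rule follows the sentence above):

```python
import numpy as np

def encode_patches(patches, centroids):
    """Binary K-dim code per patch: 1 where the distance to centroid k is below that
    patch's average distance over all centroids."""
    d = np.linalg.norm(patches[:, None, :] - centroids[None, :, :], axis=-1)  # (num_patches, K)
    return (d < d.mean(axis=1, keepdims=True)).astype(np.float32)
```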

86 We reason that DPP-K-means may produce more informative features since the cluster centroids will repel each other into more distinct positions in pixel space. [sent-253, score-0.139]

87 For each training set size, we ran regular K-means for a range of values of K and selected the K that gives the best test accuracy for K-means. [sent-259, score-0.125]

88 For up to 10000 images in the training set, DPP-K-means leads to better test classification accuracy compared to the simple K-means. [sent-261, score-0.123]

89 The comparisons are performed on matched settings: for a given randomly sampled training set and a centroid initialization, we generate the centroids from both K-means and DPP-K-means. [sent-262, score-0.194]

90 The two sets of centroids were used to extract features and train classifiers, which are then tested on the same test set of images. [sent-263, score-0.141]

91 Next we ask if there is a pattern between how far the DPP pushes apart the centroids and classification accuracy on the test set. [sent-269, score-0.184]

92 Focusing on 1000 training images and k = 30, for each randomly sampled training set and centroid initialization, we compute the mean inter-centroid distance for Kmeans and DPP-K-means. [sent-270, score-0.167]
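For reference, the mean inter-centroid distance used in this comparison can be computed as the average pairwise Euclidean distance between distinct centroids (a straightforward reading of the text; the paper may average slightly differently):

```python
import numpy as np

def mean_intercentroid_distance(centroids):
    d = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    iu = np.triu_indices(len(centroids), k=1)  # upper triangle: each distinct pair once
    return d[iu].mean()
```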

93 We have introduced a general approach to including a preference for diversity into generative probabilistic models. [sent-278, score-0.338]

94 We showed how a determinantal point process can be integrated as a modular component into existing learning algorithms, and discussed its general role as a diversity regularizer. [sent-279, score-0.813]

95 We investigated two settings where diversity can be useful: learning topics from documents, and clustering image patches. [sent-280, score-0.358]

96 Plugging a DPP into latent Dirichlet allocation allows LDA to automatically group stop-words into a few categories, enabling more informative topics in other categories. [sent-281, score-0.428]

97 In both document and image classification tasks, there exists an intermediate regime of diversity (as controlled by the hyperparameter λ) that leads to consistent improvement in accuracy when compared to standard i. [sent-282, score-0.278]

98 A computational bottleneck can come from inverting the M × M kernel matrix K, where M is the number of latent distributions. [sent-286, score-0.225]

99 We expect that there are many other settings where DPP-based diversity can be usefully introduced into a generative probabilistic model: in the emission parameters of HMMs and more general time series models, and as a mechanism for transfer learning. [sent-288, score-0.335]

100 Statistical properties of determinantal point processes in high-dimensional Euclidean spaces. [sent-315, score-0.536]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('dpp', 0.577), ('determinantal', 0.495), ('lda', 0.191), ('diversity', 0.189), ('topics', 0.169), ('latent', 0.141), ('znj', 0.139), ('kv', 0.119), ('wmn', 0.113), ('zn', 0.111), ('generative', 0.088), ('ln', 0.087), ('centroids', 0.085), ('kernel', 0.084), ('dirichlet', 0.079), ('centroid', 0.075), ('coates', 0.064), ('allocation', 0.062), ('xn', 0.062), ('diversi', 0.06), ('zmn', 0.06), ('document', 0.057), ('diverse', 0.057), ('replicated', 0.055), ('mixture', 0.054), ('em', 0.054), ('repulsion', 0.052), ('stop', 0.051), ('prior', 0.048), ('nm', 0.048), ('words', 0.047), ('admixture', 0.047), ('process', 0.047), ('plate', 0.045), ('jesper', 0.045), ('lavancier', 0.045), ('scardicchio', 0.045), ('priors', 0.045), ('vanilla', 0.044), ('ben', 0.044), ('distributions', 0.042), ('modular', 0.041), ('documents', 0.041), ('point', 0.041), ('newsgroup', 0.04), ('topic', 0.039), ('ltered', 0.039), ('corpus', 0.038), ('preference', 0.037), ('mnk', 0.037), ('multinomials', 0.037), ('inferred', 0.035), ('construct', 0.035), ('un', 0.034), ('pushes', 0.034), ('datum', 0.034), ('usefully', 0.034), ('training', 0.034), ('ten', 0.034), ('illustrations', 0.033), ('gaussians', 0.033), ('test', 0.033), ('redundant', 0.032), ('accuracy', 0.032), ('hough', 0.031), ('word', 0.031), ('classi', 0.031), ('informative', 0.031), ('asserts', 0.03), ('replicate', 0.03), ('product', 0.03), ('convenient', 0.029), ('harvard', 0.029), ('kulesza', 0.029), ('explanatory', 0.029), ('argmax', 0.029), ('specify', 0.029), ('variational', 0.028), ('feature', 0.028), ('patch', 0.028), ('map', 0.028), ('density', 0.027), ('patches', 0.027), ('expresses', 0.027), ('gram', 0.027), ('ller', 0.027), ('think', 0.026), ('regular', 0.026), ('regularizer', 0.026), ('poisson', 0.025), ('recovers', 0.025), ('multinomial', 0.025), ('group', 0.025), ('increase', 0.024), ('providing', 0.024), ('probabilistic', 0.024), ('images', 0.024), ('components', 0.024), ('explained', 0.023), ('features', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999952 274 nips-2012-Priors for Diversity in Generative Latent Variable Models

Author: James T. Kwok, Ryan P. Adams

Abstract: Probabilistic latent variable models are one of the cornerstones of machine learning. They offer a convenient and coherent way to specify prior distributions over unobserved structure in data, so that these unknown properties can be inferred via posterior inference. Such models are useful for exploratory analysis and visualization, for building density models of data, and for providing features that can be used for later discriminative tasks. A significant limitation of these models, however, is that draws from the prior are often highly redundant due to i.i.d. assumptions on internal parameters. For example, there is no preference in the prior of a mixture model to make components non-overlapping, or in a topic model to ensure that co-occurring words only appear in a small number of topics. In this work, we revisit these independence assumptions for probabilistic latent variable models, replacing the underlying i.i.d. prior with a determinantal point process (DPP). The DPP allows us to specify a preference for diversity in our latent variables using a positive definite kernel function. Using a kernel between probability distributions, we are able to define a DPP on probability measures. We show how to perform MAP inference with DPP priors in latent Dirichlet allocation and in mixture models, leading to better intuition for the latent variable representation and quantitatively improved unsupervised feature extraction, without compromising the generative aspects of the model.

2 0.37983236 236 nips-2012-Near-Optimal MAP Inference for Determinantal Point Processes

Author: Jennifer Gillenwater, Alex Kulesza, Ben Taskar

Abstract: Determinantal point processes (DPPs) have recently been proposed as computationally efficient probabilistic models of diverse sets for a variety of applications, including document summarization, image search, and pose estimation. Many DPP inference operations, including normalization and sampling, are tractable; however, finding the most likely configuration (MAP), which is often required in practice for decoding, is NP-hard, so we must resort to approximate inference. This optimization problem, which also arises in experimental design and sensor placement, involves finding the largest principal minor of a positive semidefinite matrix. Because the objective is log-submodular, greedy algorithms have been used in the past with some empirical success; however, these methods only give approximation guarantees in the special case of monotone objectives, which correspond to a restricted class of DPPs. In this paper we propose a new algorithm for approximating the MAP problem based on continuous techniques for submodular function maximization. Our method involves a novel continuous relaxation of the log-probability function, which, in contrast to the multilinear extension used for general submodular functions, can be evaluated and differentiated exactly and efficiently. We obtain a practical algorithm with a 1/4-approximation guarantee for a more general class of non-monotone DPPs; our algorithm also extends to MAP inference under complex polytope constraints, making it possible to combine DPPs with Markov random fields, weighted matchings, and other models. We demonstrate that our approach outperforms standard and recent methods on both synthetic and real-world data. 1

3 0.16471276 124 nips-2012-Factorial LDA: Sparse Multi-Dimensional Text Models

Author: Michael Paul, Mark Dredze

Abstract: Latent variable models can be enriched with a multi-dimensional structure to consider the many latent factors in a text corpus, such as topic, author perspective and sentiment. We introduce factorial LDA, a multi-dimensional model in which a document is influenced by K different factors, and each word token depends on a K-dimensional vector of latent variables. Our model incorporates structured word priors and learns a sparse product of factors. Experiments on research abstracts show that our model can learn latent factors such as research topic, scientific discipline, and focus (methods vs. applications). Our modeling improvements reduce test perplexity and improve human interpretability of the discovered factors. 1

4 0.13876833 19 nips-2012-A Spectral Algorithm for Latent Dirichlet Allocation

Author: Anima Anandkumar, Yi-kai Liu, Daniel J. Hsu, Dean P. Foster, Sham M. Kakade

Abstract: Topic modeling is a generalization of clustering that posits that observations (words in a document) are generated by multiple latent factors (topics), as opposed to just one. This increased representational power comes at the cost of a more challenging unsupervised learning problem of estimating the topic-word distributions when only words are observed, and the topics are hidden. This work provides a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of topic models, including Latent Dirichlet Allocation (LDA). For LDA, the procedure correctly recovers both the topic-word distributions and the parameters of the Dirichlet prior over the topic mixtures, using only trigram statistics (i.e., third order moments, which may be estimated with documents containing just three words). The method, called Excess Correlation Analysis, is based on a spectral decomposition of low-order moments via two singular value decompositions (SVDs). Moreover, the algorithm is scalable, since the SVDs are carried out only on k × k matrices, where k is the number of latent factors (topics) and is typically much smaller than the dimension of the observation (word) space. 1

5 0.1387569 12 nips-2012-A Neural Autoregressive Topic Model

Author: Hugo Larochelle, Stanislas Lauly

Abstract: We describe a new model for learning meaningful representations of text documents from an unlabeled collection of documents. This model is inspired by the recently proposed Replicated Softmax, an undirected graphical model of word counts that was shown to learn a better generative model and more meaningful document representations. Specifically, we take inspiration from the conditional mean-field recursive equations of the Replicated Softmax in order to define a neural network architecture that estimates the probability of observing a new word in a given document given the previously observed words. This paradigm also allows us to replace the expensive softmax distribution over words with a hierarchical distribution over paths in a binary tree of words. The end result is a model whose training complexity scales logarithmically with the vocabulary size instead of linearly as in the Replicated Softmax. Our experiments show that our model is competitive both as a generative model of documents and as a document representation learning algorithm. 1

6 0.11492036 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models

7 0.10852841 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

8 0.10432018 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

9 0.085993536 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

10 0.085049711 220 nips-2012-Monte Carlo Methods for Maximum Margin Supervised Topic Models

11 0.083859451 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

12 0.083247669 304 nips-2012-Selecting Diverse Features via Spectral Regularization

13 0.081606217 166 nips-2012-Joint Modeling of a Matrix with Associated Text via Latent Binary Features

14 0.079936214 294 nips-2012-Repulsive Mixtures

15 0.078785881 141 nips-2012-GenDeR: A Generic Diversified Ranking Algorithm

16 0.075220555 89 nips-2012-Coupling Nonparametric Mixtures via Latent Dirichlet Processes

17 0.07380303 332 nips-2012-Symmetric Correspondence Topic Models for Multilingual Text Analysis

18 0.071737967 168 nips-2012-Kernel Latent SVM for Visual Recognition

19 0.070197262 200 nips-2012-Local Supervised Learning through Space Partitioning

20 0.069713049 126 nips-2012-FastEx: Hash Clustering with Exponential Families


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.192), (1, 0.077), (2, -0.057), (3, -0.016), (4, -0.142), (5, -0.066), (6, -0.004), (7, 0.026), (8, 0.075), (9, -0.087), (10, 0.169), (11, 0.156), (12, 0.04), (13, 0.022), (14, 0.116), (15, 0.033), (16, 0.045), (17, 0.084), (18, -0.01), (19, 0.051), (20, 0.112), (21, 0.042), (22, 0.113), (23, -0.05), (24, 0.248), (25, 0.071), (26, -0.065), (27, 0.027), (28, 0.038), (29, -0.012), (30, 0.035), (31, -0.004), (32, 0.074), (33, 0.032), (34, -0.023), (35, 0.102), (36, -0.178), (37, 0.045), (38, 0.001), (39, 0.04), (40, -0.008), (41, 0.001), (42, -0.02), (43, -0.071), (44, -0.002), (45, 0.011), (46, -0.011), (47, -0.075), (48, 0.033), (49, -0.164)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.88556105 274 nips-2012-Priors for Diversity in Generative Latent Variable Models

Author: James T. Kwok, Ryan P. Adams

Abstract: Probabilistic latent variable models are one of the cornerstones of machine learning. They offer a convenient and coherent way to specify prior distributions over unobserved structure in data, so that these unknown properties can be inferred via posterior inference. Such models are useful for exploratory analysis and visualization, for building density models of data, and for providing features that can be used for later discriminative tasks. A significant limitation of these models, however, is that draws from the prior are often highly redundant due to i.i.d. assumptions on internal parameters. For example, there is no preference in the prior of a mixture model to make components non-overlapping, or in topic model to ensure that co-occurring words only appear in a small number of topics. In this work, we revisit these independence assumptions for probabilistic latent variable models, replacing the underlying i.i.d. prior with a determinantal point process (DPP). The DPP allows us to specify a preference for diversity in our latent variables using a positive definite kernel function. Using a kernel between probability distributions, we are able to define a DPP on probability measures. We show how to perform MAP inference with DPP priors in latent Dirichlet allocation and in mixture models, leading to better intuition for the latent variable representation and quantitatively improved unsupervised feature extraction, without compromising the generative aspects of the model. 1

2 0.71584755 236 nips-2012-Near-Optimal MAP Inference for Determinantal Point Processes

Author: Jennifer Gillenwater, Alex Kulesza, Ben Taskar

Abstract: Determinantal point processes (DPPs) have recently been proposed as computationally efficient probabilistic models of diverse sets for a variety of applications, including document summarization, image search, and pose estimation. Many DPP inference operations, including normalization and sampling, are tractable; however, finding the most likely configuration (MAP), which is often required in practice for decoding, is NP-hard, so we must resort to approximate inference. This optimization problem, which also arises in experimental design and sensor placement, involves finding the largest principal minor of a positive semidefinite matrix. Because the objective is log-submodular, greedy algorithms have been used in the past with some empirical success; however, these methods only give approximation guarantees in the special case of monotone objectives, which correspond to a restricted class of DPPs. In this paper we propose a new algorithm for approximating the MAP problem based on continuous techniques for submodular function maximization. Our method involves a novel continuous relaxation of the log-probability function, which, in contrast to the multilinear extension used for general submodular functions, can be evaluated and differentiated exactly and efficiently. We obtain a practical algorithm with a 1/4-approximation guarantee for a more general class of non-monotone DPPs; our algorithm also extends to MAP inference under complex polytope constraints, making it possible to combine DPPs with Markov random fields, weighted matchings, and other models. We demonstrate that our approach outperforms standard and recent methods on both synthetic and real-world data. 1

3 0.67650789 332 nips-2012-Symmetric Correspondence Topic Models for Multilingual Text Analysis

Author: Kosuke Fukumasu, Koji Eguchi, Eric P. Xing

Abstract: Topic modeling is a widely used approach to analyzing large text collections. A small number of multilingual topic models have recently been explored to discover latent topics among parallel or comparable documents, such as in Wikipedia. Other topic models that were originally proposed for structured data are also applicable to multilingual documents. Correspondence Latent Dirichlet Allocation (CorrLDA) is one such model; however, it requires a pivot language to be specified in advance. We propose a new topic model, Symmetric Correspondence LDA (SymCorrLDA), that incorporates a hidden variable to control a pivot language, in an extension of CorrLDA. We experimented with two multilingual comparable datasets extracted from Wikipedia and demonstrate that SymCorrLDA is more effective than some other existing multilingual topic models. 1

4 0.67430621 12 nips-2012-A Neural Autoregressive Topic Model

Author: Hugo Larochelle, Stanislas Lauly

Abstract: We describe a new model for learning meaningful representations of text documents from an unlabeled collection of documents. This model is inspired by the recently proposed Replicated Softmax, an undirected graphical model of word counts that was shown to learn a better generative model and more meaningful document representations. Specifically, we take inspiration from the conditional mean-field recursive equations of the Replicated Softmax in order to define a neural network architecture that estimates the probability of observing a new word in a given document given the previously observed words. This paradigm also allows us to replace the expensive softmax distribution over words with a hierarchical distribution over paths in a binary tree of words. The end result is a model whose training complexity scales logarithmically with the vocabulary size instead of linearly as in the Replicated Softmax. Our experiments show that our model is competitive both as a generative model of documents and as a document representation learning algorithm. 1

5 0.64540929 124 nips-2012-Factorial LDA: Sparse Multi-Dimensional Text Models

Author: Michael Paul, Mark Dredze

Abstract: Latent variable models can be enriched with a multi-dimensional structure to consider the many latent factors in a text corpus, such as topic, author perspective and sentiment. We introduce factorial LDA, a multi-dimensional model in which a document is influenced by K different factors, and each word token depends on a K-dimensional vector of latent variables. Our model incorporates structured word priors and learns a sparse product of factors. Experiments on research abstracts show that our model can learn latent factors such as research topic, scientific discipline, and focus (methods vs. applications). Our modeling improvements reduce test perplexity and improve human interpretability of the discovered factors. 1

6 0.62022132 19 nips-2012-A Spectral Algorithm for Latent Dirichlet Allocation

7 0.52758718 166 nips-2012-Joint Modeling of a Matrix with Associated Text via Latent Binary Features

8 0.52436495 304 nips-2012-Selecting Diverse Features via Spectral Regularization

9 0.52296764 220 nips-2012-Monte Carlo Methods for Maximum Margin Supervised Topic Models

10 0.51672673 345 nips-2012-Topic-Partitioned Multinetwork Embeddings

11 0.48422903 154 nips-2012-How They Vote: Issue-Adjusted Models of Legislative Behavior

12 0.46071997 47 nips-2012-Augment-and-Conquer Negative Binomial Processes

13 0.45265523 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models

14 0.44308692 52 nips-2012-Bayesian Nonparametric Modeling of Suicide Attempts

15 0.41009608 89 nips-2012-Coupling Nonparametric Mixtures via Latent Dirichlet Processes

16 0.40795958 328 nips-2012-Submodular-Bregman and the Lovász-Bregman Divergences with Applications

17 0.40301341 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

18 0.36032659 192 nips-2012-Learning the Dependency Structure of Latent Factors

19 0.35563636 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

20 0.35439545 137 nips-2012-From Deformations to Parts: Motion-based Segmentation of 3D Objects


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.087), (21, 0.013), (38, 0.089), (39, 0.02), (42, 0.022), (53, 0.011), (54, 0.018), (55, 0.038), (74, 0.334), (76, 0.116), (80, 0.1), (92, 0.039)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.9516632 202 nips-2012-Locally Uniform Comparison Image Descriptor

Author: Andrew Ziegler, Eric Christiansen, David Kriegman, Serge J. Belongie

Abstract: Keypoint matching between pairs of images using popular descriptors like SIFT or a faster variant called SURF is at the heart of many computer vision algorithms including recognition, mosaicing, and structure from motion. However, SIFT and SURF do not perform well for real-time or mobile applications. As an alternative very fast binary descriptors like BRIEF and related methods use pairwise comparisons of pixel intensities in an image patch. We present an analysis of BRIEF and related approaches revealing that they are hashing schemes on the ordinal correlation metric Kendall’s tau. Here, we introduce Locally Uniform Comparison Image Descriptor (LUCID), a simple description method based on linear time permutation distances between the ordering of RGB values of two image patches. LUCID is computable in linear time with respect to the number of pixels and does not require floating point computation. 1

2 0.95010382 360 nips-2012-Visual Recognition using Embedded Feature Selection for Curvature Self-Similarity

Author: Angela Eigenstetter, Bjorn Ommer

Abstract: Category-level object detection has a crucial need for informative object representations. This demand has led to feature descriptors of ever increasing dimensionality like co-occurrence statistics and self-similarity. In this paper we propose a new object representation based on curvature self-similarity that goes beyond the currently popular approximation of objects using straight lines. However, like all descriptors using second order statistics, ours also exhibits a high dimensionality. Although improving discriminability, the high dimensionality becomes a critical issue due to lack of generalization ability and curse of dimensionality. Given only a limited amount of training data, even sophisticated learning algorithms such as the popular kernel methods are not able to suppress noisy or superfluous dimensions of such high-dimensional data. Consequently, there is a natural need for feature selection when using present-day informative features and, particularly, curvature self-similarity. We therefore suggest an embedded feature selection method for SVMs that reduces complexity and improves generalization capability of object models. By successfully integrating the proposed curvature self-similarity representation together with the embedded feature selection in a widely used state-of-the-art object detection framework we show the general pertinence of the approach. 1

3 0.94864184 40 nips-2012-Analyzing 3D Objects in Cluttered Images

Author: Mohsen Hejrati, Deva Ramanan

Abstract: We present an approach to detecting and analyzing the 3D configuration of objects in real-world images with heavy occlusion and clutter. We focus on the application of finding and analyzing cars. We do so with a two-stage model; the first stage reasons about 2D shape and appearance variation due to within-class variation (station wagons look different than sedans) and changes in viewpoint. Rather than using a view-based model, we describe a compositional representation that models a large number of effective views and shapes using a small number of local view-based templates. We use this model to propose candidate detections and 2D estimates of shape. These estimates are then refined by our second stage, using an explicit 3D model of shape and viewpoint. We use a morphable model to capture 3D within-class variation, and use a weak-perspective camera model to capture viewpoint. We learn all model parameters from 2D annotations. We demonstrate state-of-the-art accuracy for detection, viewpoint estimation, and 3D shape reconstruction on challenging images from the PASCAL VOC 2011 dataset. 1

4 0.91554981 339 nips-2012-The Time-Marginalized Coalescent Prior for Hierarchical Clustering

Author: Levi Boyles, Max Welling

Abstract: We introduce a new prior for use in Nonparametric Bayesian Hierarchical Clustering. The prior is constructed by marginalizing out the time information of Kingman’s coalescent, providing a prior over tree structures which we call the Time-Marginalized Coalescent (TMC). This allows for models which factorize the tree structure and times, providing two benefits: more flexible priors may be constructed and more efficient Gibbs type inference can be used. We demonstrate this on an example model for density estimation and show the TMC achieves competitive experimental results. 1

5 0.90562975 337 nips-2012-The Lovász ϑ function, SVMs and finding large dense subgraphs

Author: Vinay Jethava, Anders Martinsson, Chiranjib Bhattacharyya, Devdatt Dubhashi

Abstract: The Lov´ sz ϑ function of a graph, a fundamental tool in combinatorial optimizaa tion and approximation algorithms, is computed by solving a SDP. In this paper we establish that the Lov´ sz ϑ function is equivalent to a kernel learning problem a related to one class SVM. This interesting connection opens up many opportunities bridging graph theoretic algorithms and machine learning. We show that there exist graphs, which we call SVM − ϑ graphs, on which the Lov´ sz ϑ function a can be approximated well by a one-class SVM. This leads to novel use of SVM techniques for solving algorithmic problems in large graphs e.g. identifying a √ 1 planted clique of size Θ( n) in a random graph G(n, 2 ). A classic approach for this problem involves computing the ϑ function, however it is not scalable due to SDP computation. We show that the random graph with a planted clique is an example of SVM − ϑ graph. As a consequence a SVM based approach easily identifies the clique in large graphs and is competitive with the state-of-the-art. We introduce the notion of common orthogonal labelling and show that it can be computed by solving a Multiple Kernel learning problem. It is further shown that such a labelling is extremely useful in identifying a large common dense subgraph in multiple graphs, which is known to be a computationally difficult problem. The proposed algorithm achieves an order of magnitude scalability compared to state of the art methods. 1

6 0.86971509 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

same-paper 7 0.84749246 274 nips-2012-Priors for Diversity in Generative Latent Variable Models

8 0.8136124 357 nips-2012-Unsupervised Template Learning for Fine-Grained Object Recognition

9 0.79998183 185 nips-2012-Learning about Canonical Views from Internet Image Collections

10 0.7988776 201 nips-2012-Localizing 3D cuboids in single-view images

11 0.78828788 176 nips-2012-Learning Image Descriptors with the Boosting-Trick

12 0.7829892 210 nips-2012-Memorability of Image Regions

13 0.78292197 8 nips-2012-A Generative Model for Parts-based Object Segmentation

14 0.76356333 101 nips-2012-Discriminatively Trained Sparse Code Gradients for Contour Detection

15 0.75390816 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves

16 0.75368679 106 nips-2012-Dynamical And-Or Graph Learning for Object Shape Modeling and Detection

17 0.74819785 168 nips-2012-Kernel Latent SVM for Visual Recognition

18 0.74134821 303 nips-2012-Searching for objects driven by context

19 0.72371185 146 nips-2012-Graphical Gaussian Vector for Image Categorization

20 0.71938103 193 nips-2012-Learning to Align from Scratch