nips nips2005 nips2005-52 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: John D. Lafferty, David M. Blei
Abstract: Topic models, such as latent Dirichlet allocation (LDA), can be useful tools for the statistical analysis of document collections and other discrete data. The LDA model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. A limitation of LDA is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than x-ray astronomy. This limitation stems from the use of the Dirichlet distribution to model the variability among the topic proportions. In this paper we develop the correlated topic model (CTM), where the topic proportions exhibit correlation via the logistic normal distribution [1]. We derive a mean-field variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. The CTM gives a better fit than LDA on a collection of OCRed articles from the journal Science. Furthermore, the CTM provides a natural way of visualizing and exploring this and other unstructured data sets. 1
Reference: text
sentIndex sentText sentNum sentScore
1 John D. Lafferty, School of Computer Science, Carnegie Mellon University. Abstract: Topic models, such as latent Dirichlet allocation (LDA), can be useful tools for the statistical analysis of document collections and other discrete data. [sent-3, score-0.344]
2 The LDA model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. [sent-4, score-0.306]
3 A limitation of LDA is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than x-ray astronomy. [sent-5, score-0.736]
4 This limitation stems from the use of the Dirichlet distribution to model the variability among the topic proportions. [sent-6, score-0.455]
5 In this paper we develop the correlated topic model (CTM), where the topic proportions exhibit correlation via the logistic normal distribution [1]. [sent-7, score-1.239]
6 We derive a mean-field variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. [sent-8, score-0.613]
7 The CTM gives a better fit than LDA on a collection of OCRed articles from the journal Science. [sent-9, score-0.228]
8 Furthermore, the CTM provides a natural way of visualizing and exploring this and other unstructured data sets. [sent-10, score-0.123]
9 1 Introduction The availability and use of unstructured historical collections of documents is rapidly growing. [sent-11, score-0.183]
10 JSTOR (www.jstor.org) is a not-for-profit organization that maintains a large online scholarly journal archive obtained by running an optical character recognition engine over the original printed journals. [sent-14, score-0.163]
11 This provides an extremely useful service to the scholarly community, with the collection comprising nearly three million published articles in a variety of fields. [sent-16, score-0.289]
12 The sheer size of this unstructured and noisy archive naturally suggests opportunities for the use of statistical modeling. [sent-17, score-0.137]
13 Alerted to the existence of this new related topic, the researcher could browse the collection in a topic-guided manner to begin to investigate connections to a previously unrecognized body of work. [sent-19, score-0.15]
14 Since the archive comprises millions of articles spanning centuries of scholarly work, automated analysis is essential. [sent-20, score-0.285]
15 Several statistical models have recently been developed for automatically extracting the topical structure of large document collections. [sent-21, score-0.172]
16 In technical terms, a topic model is a generative probabilistic model that uses a small number of distributions over a vocabulary to describe a document collection. [sent-22, score-0.593]
17 LDA assumes that the words of each document arise from a mixture of topics. [sent-25, score-0.256]
18 The topics are shared by all documents in the collection; the topic proportions are document-specific and randomly drawn from a Dirichlet distribution. [sent-26, score-0.775]
19 LDA allows each document to exhibit multiple topics with different proportions, and it can thus capture the heterogeneity in grouped data that exhibit multiple latent patterns. [sent-27, score-0.593]
20 Recent work has used LDA in more complicated document models [9, 11, 7], and in a variety of settings such as image processing [12], collaborative filtering [8], and the modeling of sequential data and user profiles [6]. [sent-28, score-0.212]
21 Our goal in this paper is to address a limitation of the topic models proposed to date: they fail to directly model correlation between topics. [sent-30, score-0.443]
22 In many—indeed most—text corpora, it is natural to expect that subsets of the underlying latent topics will be highly correlated. [sent-31, score-0.365]
23 In a corpus of scientific articles, for instance, an article about genetics may be likely to also be about health and disease, but unlikely to also be about x-ray astronomy. [sent-32, score-0.134]
24 For the LDA model, this limitation stems from the independence assumptions implicit in the Dirichlet distribution on the topic proportions. [sent-33, score-0.43]
25 Under a Dirichlet, the components of the proportions vector are nearly independent; this leads to the strong and unrealistic modeling assumption that the presence of one topic is not correlated with the presence of another. [sent-34, score-0.553]
26 In this paper we present the correlated topic model (CTM). [sent-35, score-0.457]
27 The CTM uses an alternative, more flexible distribution for the topic proportions that allows for covariance structure among the components. [sent-36, score-0.529]
28 This gives a more realistic model of latent topic structure where the presence of one latent topic may be correlated with the presence of another. [sent-37, score-1.026]
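To make this difference concrete, the following NumPy sketch (an illustration of mine, not from the paper; the dimensions, Dirichlet parameter, and covariance entries are arbitrary) draws proportion vectors from a Dirichlet and from a logistic normal with a positive off-diagonal covariance term, and compares the empirical correlation between two components.

```python
import numpy as np

rng = np.random.default_rng(0)
K, S = 4, 100_000  # number of topics and number of samples (arbitrary)

# Dirichlet: apart from the negative coupling forced by the sum-to-one
# constraint, the components of the proportion vector are uncorrelated.
theta_dir = rng.dirichlet(alpha=np.ones(K), size=S)

# Logistic normal: draw eta ~ N(mu, Sigma), then map to the simplex with
# the softmax f(eta)_i = exp(eta_i) / sum_j exp(eta_j).
mu = np.zeros(K)
Sigma = np.eye(K)
Sigma[0, 1] = Sigma[1, 0] = 0.8          # make topics 0 and 1 tend to co-occur
eta = rng.multivariate_normal(mu, Sigma, size=S)
theta_ln = np.exp(eta - eta.max(axis=1, keepdims=True))
theta_ln /= theta_ln.sum(axis=1, keepdims=True)

print("Dirichlet       corr(theta_0, theta_1):",
      np.corrcoef(theta_dir[:, 0], theta_dir[:, 1])[0, 1])
print("logistic normal corr(theta_0, theta_1):",
      np.corrcoef(theta_ln[:, 0], theta_ln[:, 1])[0, 1])
```

Under the Dirichlet the sample correlation between any two components is mildly negative, whereas the covariance term in the logistic normal induces a clearly positive correlation between the chosen pair.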
29 We fit the model to a portion of the JSTOR archive of the journal Science. [sent-39, score-0.134]
30 We demonstrate that the model gives a better fit than LDA, as measured by the accuracy of the predictive distributions over held out documents. [sent-40, score-0.122]
31 Furthermore, we demonstrate qualitatively that the correlated topic model provides a natural way of visualizing and exploring such an unstructured collection of textual data. [sent-41, score-0.654]
32 2 The Correlated Topic Model The key to the correlated topic model we propose is the logistic normal distribution [1]. [sent-42, score-0.706]
33 The logistic normal is a distribution on the simplex that allows for a general pattern of variability between the components by transforming a multivariate normal random variable. [sent-43, score-0.397]
34 Figure 1: Top: Graphical model representation of the correlated topic model. [sent-54, score-0.457]
35 The logistic normal distribution, used to model the latent topic proportions of a document, can represent correlations between topics that are impossible to capture using a single Dirichlet. [sent-55, score-1.08]
36 Bottom: Example densities of the logistic normal on the 2-simplex. [sent-56, score-0.224]
37 The logistic normal distribution assumes that η is normally distributed and then mapped to the simplex with the inverse of the mapping given in equation (3); that is, f(ηi) = exp{ηi} / ∑_{j} exp{ηj}. [sent-58, score-0.447]
38 The logistic normal models correlations between components of the simplicial random variable through the covariance matrix of the normal distribution. [sent-59, score-0.406]
39 The logistic normal was originally studied in the context of analyzing observed compositional data such as the proportions of minerals in geological samples. [sent-60, score-0.375]
40 In this work, we extend its use to a hierarchical model where it describes the latent composition of topics associated with each document. [sent-61, score-0.417]
41 Let {µ, Σ} be a K-dimensional mean and covariance matrix, and let topics β1:K be K multinomials over a fixed word vocabulary. [sent-62, score-0.321]
42 The correlated topic model assumes that an N-word document arises from the following generative process: 1. Draw η | {µ, Σ} ∼ N(µ, Σ). [sent-63, score-0.629]
43 2. For n ∈ {1, . . . , N}: (a) Draw topic assignment Zn | η from Mult(f(η)); (b) draw word Wn | {zn, β1:K} from Mult(βzn). [sent-69, score-0.345]
44 This process is identical to the generative process of LDA except that the topic proportions are drawn from a logistic normal rather than a Dirichlet. [sent-71, score-0.69]
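A minimal simulation of this generative process is sketched below; the vocabulary size, topics β, and logistic normal parameters {µ, Σ} are toy values invented for the example, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(eta):
    e = np.exp(eta - eta.max())
    return e / e.sum()

def generate_document(mu, Sigma, beta, N):
    """Draw one N-word document from the CTM generative process."""
    K, V = beta.shape
    eta = rng.multivariate_normal(mu, Sigma)   # 1. draw eta ~ N(mu, Sigma)
    theta = softmax(eta)                       #    map to topic proportions f(eta)
    words = np.empty(N, dtype=int)
    for n in range(N):                         # 2. for each word position
        z = rng.choice(K, p=theta)             #    (a) draw a topic assignment
        words[n] = rng.choice(V, p=beta[z])    #    (b) draw a word from that topic
    return words

# Toy setup: 3 topics over a 10-word vocabulary.
K, V = 3, 10
beta = rng.dirichlet(np.ones(V), size=K)       # rows are topics
mu, Sigma = np.zeros(K), np.eye(K)
print(generate_document(mu, Sigma, beta, N=20))
```

Replacing the first step with a Dirichlet draw recovers the LDA generative process, which is exactly the difference the text describes.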
45 The strong independence assumption imposed by the Dirichlet in LDA is not realistic when analyzing document collections, where one may find strong correlations between topics. [sent-74, score-0.172]
46 The covariance matrix of the logistic normal in the CTM is introduced to model such correlations. [sent-75, score-0.287]
47 An LDA model will predict words based on the latent topics that the observations suggest, but the CTM has the ability to predict items associated with additional topics that are correlated with the conditionally probable topics. [sent-79, score-0.844]
48 2.1 Posterior inference and parameter estimation Posterior inference is the central challenge to using the CTM. [sent-81, score-0.12]
49 The posterior distribution of the latent variables conditional on a document, p(η, z1:N | w1:N ), is intractable to compute; once conditioned on some observations, the topic assignments z1:N and log proportions η are dependent. [sent-82, score-0.715]
50 We make use of mean-field variational methods to efficiently obtain an approximation of this posterior distribution. [sent-83, score-0.269]
51 In brief, the strategy employed by mean-field variational methods is to form a factorized distribution of the latent variables, parameterized by free variables which are called the variational parameters. [sent-84, score-0.651]
52 The tradeoff is that variational methods do not come with the same theoretical guarantees as simulation methods. [sent-87, score-0.242]
53 See [13] for a modern review of variational methods for statistical inference. [sent-88, score-0.242]
54 In graphical models composed of conjugate-exponential family pairs and mixtures, the variational inference algorithm can be automatically derived from general principles [2, 14]. [sent-89, score-0.302]
55 In the CTM, however, the logistic normal is not conjugate to the multinomial. [sent-90, score-0.224]
56 We will therefore derive a variational inference algorithm by taking into account the special structure and distributions used by our model. [sent-91, score-0.302]
57 The variational distribution is fully factorized, q(η, z | λ, ν^2, φ) = ∏_{i=1}^{K} q(ηi | λi, νi^2) ∏_{n=1}^{N} q(zn | φn). (5) The variational distributions of the discrete variables z1:N are specified by the K-dimensional multinomial parameters φ1:N. [sent-94, score-0.273]
58 The variational distributions of the continuous variables η1:K are K independent univariate Gaussians with means and variances {λi, νi^2}. [sent-95, score-0.267]
59 Since the variational parameters are fit using a single observed document w1:N , there is no advantage in introducing a non-diagonal variational covariance matrix. [sent-96, score-0.694]
60 The nonconjugacy of the logistic normal leads to difficulty in computing the expected log probability of a topic assignment: Eq[log p(zn | η)] = Eq[η^T zn] − Eq[log(∑_{i=1}^{K} exp{ηi})]. (6) [sent-97, score-0.759]
61 To preserve the lower bound on the log probability, we upper bound the log normalizer with a Taylor expansion, Eq[log ∑_{i=1}^{K} exp{ηi}] ≤ ζ^{-1} (∑_{i=1}^{K} Eq[exp{ηi}]) − 1 + log(ζ), (7) where we have introduced a new variational parameter ζ. [sent-98, score-0.613]
62 The expectation Eq[exp{ηi}] is the mean of a log normal distribution with mean and variance obtained from the variational parameters {λi, νi^2}; thus, Eq[exp{ηi}] = exp{λi + νi^2/2} for i ∈ {1, . . . , K}. [sent-99, score-0.461]
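As a quick numerical check on this bound (my own illustration, not the authors' code), one can evaluate the right-hand side of equation (7) at the ζ that makes it tightest and compare it with a Monte Carlo estimate of Eq[log ∑_{i} exp{ηi}] under the factorized Gaussian q:

```python
import numpy as np

rng = np.random.default_rng(2)
K = 5
lam = rng.normal(size=K)             # variational means lambda_i (arbitrary)
nu2 = rng.uniform(0.1, 1.0, size=K)  # variational variances nu_i^2 (arbitrary)

def upper_bound(lam, nu2, zeta):
    # zeta^{-1} * sum_i E_q[exp(eta_i)] - 1 + log(zeta), as in equation (7),
    # with E_q[exp(eta_i)] = exp(lambda_i + nu_i^2 / 2) (the log-normal mean).
    return np.sum(np.exp(lam + nu2 / 2)) / zeta - 1 + np.log(zeta)

# The bound is tightest at zeta = sum_i exp(lambda_i + nu_i^2 / 2)
# (set its derivative with respect to zeta to zero).
zeta_star = np.sum(np.exp(lam + nu2 / 2))

# Monte Carlo estimate of E_q[log sum_i exp(eta_i)] under q.
eta = lam + np.sqrt(nu2) * rng.standard_normal((200_000, K))
mc_estimate = np.mean(np.log(np.sum(np.exp(eta), axis=1)))

print("Monte Carlo estimate:", mc_estimate)
print("upper bound, eq. (7):", upper_bound(lam, nu2, zeta_star))
```

The printed bound should always be at least as large as the Monte Carlo estimate, with the gap reflecting the slack in the approximation.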
63 Given a model {β1:K , µ, Σ} and a document w1:N , the variational inference algorithm optimizes equation (4) with respect to the variational parameters {λ1:K , ν1:K , φ1:N , ζ}. [sent-109, score-0.825]
64 In variational inference for LDA, each coordinate can be optimized analytically. [sent-111, score-0.358]
65 Given a collection of documents, we carry out parameter estimation in the correlated topic model by attempting to maximize the likelihood of a corpus of documents as a function of the topics β1:K and the multivariate Gaussian parameters {µ, Σ}. [sent-114, score-0.957]
66 We use variational expectation-maximization (EM), where we maximize the bound on the log probability of a collection given by summing equation (4) over the documents. [sent-115, score-0.563]
67 In the E-step, we maximize the bound with respect to the variational parameters by performing variational inference for each document. [sent-116, score-0.696]
68 In the M-step, we maximize the bound with respect to the model parameters. [sent-117, score-0.177]
69 This is maximum likelihood estimation of the topics and multivariate Gaussian using expected sufficient statistics, where the expectation is taken with respect to the variational distributions computed in the E-step. [sent-118, score-0.562]
70 In the experiments reported below, we run variational inference until the relative change in the probability bound of equation (4) is less than 10^{-6}, and run variational EM until the relative change in the likelihood bound is less than 10^{-5}. [sent-120, score-0.737]
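The M-step described above (maximum likelihood from expected sufficient statistics) is not spelled out in equations in this extraction; the sketch below is an assumption-laden reconstruction of it, not the authors' code, using Eq[ηd] = λd and diag(νd^2) as the expected sufficient statistics for {µ, Σ} and an LDA-style expected-count update for the topics.

```python
import numpy as np

def m_step(lam, nu2, phi_list, docs, V):
    """One M-step from expected sufficient statistics (a sketch).

    lam, nu2 : D x K arrays of per-document variational means and variances
    phi_list : list of N_d x K arrays of word-level topic responsibilities
    docs     : list of integer arrays of word indices
    V        : vocabulary size
    """
    D, K = lam.shape
    # Multivariate Gaussian: E_q[eta_d] = lambda_d, and E_q[eta_d eta_d^T]
    # contributes diag(nu_d^2) + lambda_d lambda_d^T.
    mu = lam.mean(axis=0)
    centered = lam - mu
    Sigma = (centered.T @ centered + np.diag(nu2.sum(axis=0))) / D
    # Topics: expected word counts under the variational distributions.
    beta = np.zeros((K, V))
    for phi, doc in zip(phi_list, docs):
        for n, w in enumerate(doc):
            beta[:, w] += phi[n]
    beta /= beta.sum(axis=1, keepdims=True)
    return mu, Sigma, beta
```

In practice one would likely smooth the topic counts before normalizing; no smoothing is shown here.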
71 3 Examples and Empirical Results: Modeling Science In order to test and illustrate the correlated topic model, we estimated a 100-topic CTM on 16,351 Science articles spanning 1990 to 1999. [sent-121, score-0.586]
72 We constructed a graph of the latent topics and the connections among them by examining the most probable words from each topic and the between-topic correlations. [sent-122, score-0.824]
73 Positive numbers indicate a better fit by the correlated topic model. [sent-129, score-0.432]
74 edu/˜lemur/science/ to interactively explore this model, including the topics, their connections, and the articles that exhibit them. [sent-134, score-0.182]
75 We compared the CTM to LDA by fitting a smaller collection of articles to models of varying numbers of topics. [sent-135, score-0.228]
76 This collection contains the 1,452 documents from 1960; we used a vocabulary of 5,612 words after pruning common function words and terms that occur once in the collection. [sent-136, score-0.324]
77 A better model of the document collection will assign higher probability to the held out data. [sent-138, score-0.319]
78 To avoid comparing bounds, we used importance sampling to compute the log probability of a document where the fitted variational distribution is the proposal. [sent-139, score-0.524]
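A sketch of such an estimator is given below (my reconstruction of the idea, not the authors' code): sample η from the fitted variational Gaussian, sum out the topic assignments analytically, and weight each sample by p(w1:N, η)/q(η).

```python
import numpy as np
from scipy.stats import multivariate_normal, norm
from scipy.special import logsumexp

def heldout_log_prob(doc, lam, nu2, mu, Sigma, beta, n_samples=1000, seed=0):
    """Importance-sampling estimate of log p(w_1:N) for one document, using
    the fitted variational Gaussian q(eta) = N(lambda, diag(nu^2)) as proposal."""
    rng = np.random.default_rng(seed)
    K = len(mu)
    eta = lam + np.sqrt(nu2) * rng.standard_normal((n_samples, K))
    # log p(eta | mu, Sigma) and log q(eta) for each sample
    log_p_eta = multivariate_normal.logpdf(eta, mean=mu, cov=Sigma)
    log_q_eta = norm.logpdf(eta, loc=lam, scale=np.sqrt(nu2)).sum(axis=1)
    # log p(w | eta), with topic assignments z_n summed out analytically
    theta = np.exp(eta - logsumexp(eta, axis=1, keepdims=True))  # softmax rows
    log_p_w = np.log(theta @ beta[:, doc]).sum(axis=1)
    # log of the average importance weight
    log_w = log_p_eta + log_p_w - log_q_eta
    return logsumexp(log_w) - np.log(n_samples)
```

Averaging this quantity over the held-out documents gives the average held-out log probability compared in Figure 3.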
79 Figure 3 illustrates the average held out log probability for each model and the average difference between them. [sent-140, score-0.158]
80 The CTM provides a better fit than LDA and supports more topics; the likelihood for LDA peaks near 30 topics while the likelihood for the CTM peaks close to 90 topics. [sent-141, score-0.34]
81 Another quantitative evaluation of the relative strengths of LDA and the CTM is how well the models predict the remaining words after observing a portion of the document. [sent-143, score-0.123]
82 Suppose we observe words w1:P from a document and are interested in which model provides a better predictive distribution p(w | w1:P ) of the remaining words. [sent-144, score-0.355]
83 Mathematically, the perplexity of a word distribution is defined as the inverse of the per-word geometric average of the probability of the observations, Perp(Φ) = ( ∏_{d=1}^{D} ∏_{i=P+1}^{Nd} p(wi | Φ, w1:P) )^{-1/∑_{d=1}^{D}(Nd − P)}, where Φ denotes the model parameters of an LDA or CTM model. [sent-146, score-0.186]
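Given per-word predictive log probabilities log p(wi | Φ, w1:P) from a fitted model (their computation is not shown here), this definition reduces to a short calculation; the helper below is a small illustration, not code from the paper.

```python
import numpy as np

def predictive_perplexity(log_probs, num_remaining):
    """Perplexity of the remaining words, following the definition above.

    log_probs     : list with one array per document holding
                    log p(w_i | Phi, w_{1:P}) for i = P+1, ..., N_d
    num_remaining : array of the corresponding counts N_d - P
    """
    total_log_prob = sum(lp.sum() for lp in log_probs)
    return np.exp(-total_log_prob / np.sum(num_remaining))
```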
84 The plot in Figure 4 compares the predictive perplexity under LDA and the CTM. [sent-148, score-0.155]
85 (Right) Predictive perplexity for partially observed held-out documents from the 1960 Science corpus. [sent-150, score-0.162]
86 When a small number of words have been observed, there is less uncertainty about the remaining words under the CTM than under LDA; the perplexity is reduced by nearly 200 words, or roughly 10%. [sent-151, score-0.274]
87 The reason is that after seeing a few words in one topic, the CTM uses topic correlation to infer that words in a related topic may also be probable. [sent-152, score-0.897]
88 In contrast, LDA cannot predict the remaining words as well until a large portion of the document has been observed so that all of its topics are represented. [sent-153, score-0.548]
89 Grade of membership and latent structure models with application to disability survey data. [sent-181, score-0.147]
90 The author-recipient-topic model for topic and role discovery in social networks. [sent-203, score-0.37]
91 A generalized mean field algorithm for variational inference in exponential families. [sent-235, score-0.302]
92 A Variational Inference We describe a coordinate ascent optimization algorithm for the likelihood bound in equation (4) with respect to the variational parameters. [sent-237, score-0.516]
93 The first term of equation (4) is Eq[log p(η | µ, Σ)] = (1/2) log |Σ^{-1}| − (K/2) log 2π − (1/2) Eq[(η − µ)^T Σ^{-1} (η − µ)], (8) where Eq[(η − µ)^T Σ^{-1} (η − µ)] = Tr(diag(ν^2) Σ^{-1}) + (λ − µ)^T Σ^{-1} (λ − µ). (9) [sent-238, score-0.217]
94 The second term of equation (4), using the additional bound in equation (7), is Eq[log p(zn | η)] = ∑_{i=1}^{K} λi φn,i − ζ^{-1} ∑_{i=1}^{K} exp{λi + νi^2/2} + 1 − log ζ. (10) [sent-239, score-0.237]
95 The third term of equation (4) is Eq[log p(wn | zn, β)] = ∑_{i=1}^{K} φn,i log βi,wn. (11) [sent-240, score-0.237]
96 Finally, the fourth term is the entropy of the variational distribution: ∑_{i=1}^{K} (1/2)(log νi^2 + log 2π + 1) − ∑_{n=1}^{N} ∑_{i=1}^{K} φn,i log φn,i. (12) [sent-241, score-0.412]
97 We maximize the bound in equation (4) with respect to the variational parameters λ1:K, ν1:K, φ1:N, and ζ. [sent-242, score-0.441]
98 We use a coordinate ascent algorithm, iteratively maximizing the bound with respect to each parameter. [sent-243, score-0.197]
99 First, we maximize equation (4) with respect to ζ, using the second bound in equation (7). [sent-244, score-0.246]
100 Iterating between these optimizations defines a coordinate ascent algorithm on equation (4). [sent-256, score-0.149]
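Putting the appendix together, the per-document bound and the coordinate updates can be sketched as follows. The individual update equations (13)-(17) are missing from this extraction, so the φ and ζ updates below are re-derived by setting derivatives of the bound to zero, and λ (and likewise ν^2) is optimized numerically with BFGS; treat this as an illustrative reconstruction rather than the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def ctm_bound(lam, nu2, phi, zeta, doc, mu, Sigma, beta):
    """Per-document lower bound of equation (4), assembled from equations (8)-(12).

    doc  : array of word indices w_1, ..., w_N
    phi  : N x K matrix of word-level topic responsibilities (all entries > 0)
    beta : K x V matrix of topics (rows are distributions over the vocabulary)
    """
    K, N = len(mu), len(doc)
    Sinv = np.linalg.inv(Sigma)
    diff = lam - mu
    # Equations (8)-(9): E_q[log p(eta | mu, Sigma)]
    t1 = (0.5 * np.linalg.slogdet(Sinv)[1]
          - 0.5 * K * np.log(2 * np.pi)
          - 0.5 * (np.sum(nu2 * np.diag(Sinv)) + diff @ Sinv @ diff))
    # Equation (10), summed over n: E_q[log p(z_n | eta)] with the zeta bound
    t2 = np.sum(phi @ lam) - N * (np.sum(np.exp(lam + nu2 / 2)) / zeta - 1 + np.log(zeta))
    # Equation (11), summed over n: E_q[log p(w_n | z_n, beta)]
    t3 = np.sum(phi * np.log(beta[:, doc].T))
    # Equation (12): entropy of the variational distribution
    t4 = 0.5 * np.sum(np.log(nu2) + np.log(2 * np.pi) + 1) - np.sum(phi * np.log(phi))
    return t1 + t2 + t3 + t4

def update_zeta(lam, nu2):
    # Maximizing the bound over zeta gives zeta = sum_i exp(lambda_i + nu_i^2 / 2).
    return np.sum(np.exp(lam + nu2 / 2))

def update_phi(lam, doc, beta):
    # Maximizing over phi_{n,i} under the normalization constraint gives
    # phi_{n,i} proportional to exp(lambda_i) * beta_{i, w_n}.
    phi = np.exp(lam)[None, :] * beta[:, doc].T
    return phi / phi.sum(axis=1, keepdims=True)

def update_lambda(lam0, nu2, phi, zeta, doc, mu, Sigma, beta):
    # No analytic update for lambda (or nu^2); optimize the bound numerically.
    neg = lambda lam: -ctm_bound(lam, nu2, phi, zeta, doc, mu, Sigma, beta)
    return minimize(neg, lam0, method="BFGS").x
```

Cycling update_phi, update_zeta, and the numerical updates for λ and ν^2 until the bound stops improving implements the coordinate ascent described above.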
wordName wordTfidf (topN-words)
[('ctm', 0.486), ('topic', 0.345), ('lda', 0.297), ('topics', 0.253), ('variational', 0.242), ('eq', 0.241), ('document', 0.172), ('articles', 0.154), ('proportions', 0.121), ('logistic', 0.115), ('latent', 0.112), ('normal', 0.109), ('perplexity', 0.106), ('zn', 0.105), ('dirichlet', 0.089), ('correlated', 0.087), ('log', 0.085), ('words', 0.084), ('collection', 0.074), ('archive', 0.07), ('unstructured', 0.067), ('genetics', 0.064), ('jstor', 0.061), ('scholarly', 0.061), ('inference', 0.06), ('collections', 0.06), ('bound', 0.058), ('maximize', 0.057), ('disease', 0.057), ('blei', 0.057), ('exp', 0.056), ('coordinate', 0.056), ('documents', 0.056), ('brain', 0.054), ('predictive', 0.049), ('amino', 0.048), ('intracellular', 0.048), ('held', 0.048), ('equation', 0.047), ('ascent', 0.046), ('browse', 0.041), ('climate', 0.041), ('fossil', 0.041), ('gtp', 0.041), ('inattentional', 0.041), ('isotopic', 0.041), ('mantle', 0.041), ('phrases', 0.041), ('triphosphate', 0.041), ('article', 0.04), ('collaborative', 0.04), ('correlation', 0.039), ('portion', 0.039), ('simplex', 0.039), ('covariance', 0.038), ('respect', 0.037), ('ago', 0.035), ('acid', 0.035), ('atp', 0.035), ('carbon', 0.035), ('disability', 0.035), ('researcher', 0.035), ('simplicial', 0.035), ('steyvers', 0.035), ('limitation', 0.034), ('memory', 0.034), ('wn', 0.033), ('turbo', 0.032), ('engine', 0.032), ('mutation', 0.032), ('mutations', 0.032), ('multinomial', 0.031), ('parameterization', 0.031), ('likelihood', 0.03), ('factorized', 0.03), ('compositional', 0.03), ('mult', 0.03), ('genes', 0.03), ('probable', 0.03), ('word', 0.03), ('corpus', 0.03), ('exploring', 0.028), ('cognitive', 0.028), ('exhibit', 0.028), ('visualizing', 0.028), ('calcium', 0.028), ('ltp', 0.028), ('draw', 0.027), ('posterior', 0.027), ('composition', 0.027), ('evolutionary', 0.027), ('supports', 0.027), ('release', 0.026), ('stems', 0.026), ('grif', 0.026), ('ths', 0.026), ('vocabulary', 0.026), ('developmental', 0.026), ('distribution', 0.025), ('model', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 52 nips-2005-Correlated Topic Models
Author: John D. Lafferty, David M. Blei
Abstract: Topic models, such as latent Dirichlet allocation (LDA), can be useful tools for the statistical analysis of document collections and other discrete data. The LDA model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. A limitation of LDA is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than x-ray astronomy. This limitation stems from the use of the Dirichlet distribution to model the variability among the topic proportions. In this paper we develop the correlated topic model (CTM), where the topic proportions exhibit correlation via the logistic normal distribution [1]. We derive a mean-field variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. The CTM gives a better fit than LDA on a collection of OCRed articles from the journal Science. Furthermore, the CTM provides a natural way of visualizing and exploring this and other unstructured data sets. 1
2 0.23616709 48 nips-2005-Context as Filtering
Author: Daichi Mochihashi, Yuji Matsumoto
Abstract: Long-distance language modeling is important not only in speech recognition and machine translation, but also in high-dimensional discrete sequence modeling in general. However, the problem of context length has almost been neglected so far and a na¨ve bag-of-words history has been ı employed in natural language processing. In contrast, in this paper we view topic shifts within a text as a latent stochastic process to give an explicit probabilistic generative model that has partial exchangeability. We propose an online inference algorithm using particle filters to recognize topic shifts to employ the most appropriate length of context automatically. Experiments on the BNC corpus showed consistent improvement over previous methods involving no chronological order. 1
3 0.19139709 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models
Author: Jason Palmer, Kenneth Kreutz-Delgado, Bhaskar D. Rao, David P. Wipf
Abstract: We consider criteria for variational representations of non-Gaussian latent variables, and derive variational EM algorithms in general form. We establish a general equivalence among convex bounding methods, evidence based methods, and ensemble learning/Variational Bayes methods, which has previously been demonstrated only for particular cases.
4 0.15955485 89 nips-2005-Group and Topic Discovery from Relations and Their Attributes
Author: Xuerui Wang, Natasha Mohanty, Andrew McCallum
Abstract: We present a probabilistic generative model of entity relationships and their attributes that simultaneously discovers groups among the entities and topics among the corresponding textual attributes. Block-models of relationship data have been studied in social network analysis for some time. Here we simultaneously cluster in several modalities at once, incorporating the attributes (here, words) associated with certain relationships. Significantly, joint inference allows the discovery of topics to be guided by the emerging groups, and vice-versa. We present experimental results on two large data sets: sixteen years of bills put before the U.S. Senate, comprising their corresponding text and voting records, and thirteen years of similar data from the United Nations. We show that in comparison with traditional, separate latent-variable models for words, or Blockstructures for votes, the Group-Topic model’s joint inference discovers more cohesive groups and improved topics. 1
5 0.14156447 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models
Author: Kazuho Watanabe, Sumio Watanabe
Abstract: The Variational Bayesian framework has been widely used to approximate the Bayesian learning. In various applications, it has provided computational tractability and good generalization performance. In this paper, we discuss the Variational Bayesian learning of the mixture of exponential families and provide some additional theoretical support by deriving the asymptotic form of the stochastic complexity. The stochastic complexity, which corresponds to the minimum free energy and a lower bound of the marginal likelihood, is a key quantity for model selection. It also enables us to discuss the effect of hyperparameters and the accuracy of the Variational Bayesian approach as an approximation of the true Bayesian learning. 1
6 0.12241279 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis
7 0.096638918 55 nips-2005-Describing Visual Scenes using Transformed Dirichlet Processes
8 0.089062214 166 nips-2005-Robust Fisher Discriminant Analysis
9 0.081942163 115 nips-2005-Learning Shared Latent Structure for Image Synthesis and Robotic Imitation
10 0.075715221 63 nips-2005-Efficient Unsupervised Learning for Localization and Detection in Object Categories
11 0.070461266 105 nips-2005-Large-Scale Multiclass Transduction
12 0.070406683 65 nips-2005-Estimating the wrong Markov random field: Benefits in the computation-limited setting
13 0.066849925 30 nips-2005-Assessing Approximations for Gaussian Process Classification
14 0.063399024 195 nips-2005-Transfer learning for text classification
15 0.063104376 197 nips-2005-Unbiased Estimator of Shape Parameter for Spiking Irregularities under Changing Environments
16 0.062940747 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
17 0.062543601 100 nips-2005-Interpolating between types and tokens by estimating power-law generators
18 0.06196681 98 nips-2005-Infinite latent feature models and the Indian buffet process
19 0.061478686 80 nips-2005-Gaussian Process Dynamical Models
20 0.059111409 171 nips-2005-Searching for Character Models
topicId topicWeight
[(0, 0.193), (1, 0.02), (2, -0.015), (3, 0.109), (4, 0.118), (5, -0.292), (6, -0.001), (7, 0.067), (8, -0.135), (9, -0.137), (10, -0.062), (11, -0.093), (12, 0.004), (13, -0.177), (14, -0.212), (15, -0.115), (16, -0.067), (17, 0.003), (18, -0.057), (19, -0.004), (20, 0.07), (21, -0.066), (22, -0.095), (23, -0.015), (24, -0.002), (25, -0.055), (26, -0.059), (27, -0.007), (28, -0.037), (29, 0.054), (30, 0.075), (31, 0.002), (32, 0.02), (33, -0.002), (34, -0.096), (35, -0.041), (36, -0.082), (37, 0.06), (38, 0.115), (39, 0.095), (40, -0.134), (41, 0.028), (42, -0.024), (43, -0.07), (44, -0.083), (45, 0.075), (46, -0.038), (47, -0.075), (48, -0.029), (49, -0.074)]
simIndex simValue paperId paperTitle
same-paper 1 0.9603563 52 nips-2005-Correlated Topic Models
Author: John D. Lafferty, David M. Blei
Abstract: Topic models, such as latent Dirichlet allocation (LDA), can be useful tools for the statistical analysis of document collections and other discrete data. The LDA model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. A limitation of LDA is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than x-ray astronomy. This limitation stems from the use of the Dirichlet distribution to model the variability among the topic proportions. In this paper we develop the correlated topic model (CTM), where the topic proportions exhibit correlation via the logistic normal distribution [1]. We derive a mean-field variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. The CTM gives a better fit than LDA on a collection of OCRed articles from the journal Science. Furthermore, the CTM provides a natural way of visualizing and exploring this and other unstructured data sets. 1
2 0.68657368 48 nips-2005-Context as Filtering
Author: Daichi Mochihashi, Yuji Matsumoto
Abstract: Long-distance language modeling is important not only in speech recognition and machine translation, but also in high-dimensional discrete sequence modeling in general. However, the problem of context length has almost been neglected so far and a na¨ve bag-of-words history has been ı employed in natural language processing. In contrast, in this paper we view topic shifts within a text as a latent stochastic process to give an explicit probabilistic generative model that has partial exchangeability. We propose an online inference algorithm using particle filters to recognize topic shifts to employ the most appropriate length of context automatically. Experiments on the BNC corpus showed consistent improvement over previous methods involving no chronological order. 1
3 0.62368935 89 nips-2005-Group and Topic Discovery from Relations and Their Attributes
Author: Xuerui Wang, Natasha Mohanty, Andrew McCallum
Abstract: We present a probabilistic generative model of entity relationships and their attributes that simultaneously discovers groups among the entities and topics among the corresponding textual attributes. Block-models of relationship data have been studied in social network analysis for some time. Here we simultaneously cluster in several modalities at once, incorporating the attributes (here, words) associated with certain relationships. Significantly, joint inference allows the discovery of topics to be guided by the emerging groups, and vice-versa. We present experimental results on two large data sets: sixteen years of bills put before the U.S. Senate, comprising their corresponding text and voting records, and thirteen years of similar data from the United Nations. We show that in comparison with traditional, separate latent-variable models for words, or Blockstructures for votes, the Group-Topic model’s joint inference discovers more cohesive groups and improved topics. 1
4 0.58763427 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models
Author: Kazuho Watanabe, Sumio Watanabe
Abstract: The Variational Bayesian framework has been widely used to approximate the Bayesian learning. In various applications, it has provided computational tractability and good generalization performance. In this paper, we discuss the Variational Bayesian learning of the mixture of exponential families and provide some additional theoretical support by deriving the asymptotic form of the stochastic complexity. The stochastic complexity, which corresponds to the minimum free energy and a lower bound of the marginal likelihood, is a key quantity for model selection. It also enables us to discuss the effect of hyperparameters and the accuracy of the Variational Bayesian approach as an approximation of the true Bayesian learning. 1
5 0.57839119 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models
Author: Jason Palmer, Kenneth Kreutz-Delgado, Bhaskar D. Rao, David P. Wipf
Abstract: We consider criteria for variational representations of non-Gaussian latent variables, and derive variational EM algorithms in general form. We establish a general equivalence among convex bounding methods, evidence based methods, and ensemble learning/Variational Bayes methods, which has previously been demonstrated only for particular cases.
6 0.47301787 100 nips-2005-Interpolating between types and tokens by estimating power-law generators
7 0.42464566 55 nips-2005-Describing Visual Scenes using Transformed Dirichlet Processes
8 0.37217438 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis
9 0.37079266 197 nips-2005-Unbiased Estimator of Shape Parameter for Spiking Irregularities under Changing Environments
10 0.35090104 33 nips-2005-Bayesian Sets
11 0.34420776 180 nips-2005-Spectral Bounds for Sparse PCA: Exact and Greedy Algorithms
12 0.3424038 205 nips-2005-Worst-Case Bounds for Gaussian Process Models
13 0.33624679 171 nips-2005-Searching for Character Models
14 0.32626009 185 nips-2005-Subsequence Kernels for Relation Extraction
15 0.32467777 63 nips-2005-Efficient Unsupervised Learning for Localization and Detection in Object Categories
16 0.32430649 166 nips-2005-Robust Fisher Discriminant Analysis
17 0.30399287 105 nips-2005-Large-Scale Multiclass Transduction
18 0.2991713 65 nips-2005-Estimating the wrong Markov random field: Benefits in the computation-limited setting
19 0.29215783 62 nips-2005-Efficient Estimation of OOMs
20 0.28003767 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs
topicId topicWeight
[(3, 0.034), (10, 0.043), (18, 0.011), (27, 0.046), (31, 0.047), (34, 0.048), (39, 0.023), (41, 0.027), (55, 0.025), (57, 0.021), (60, 0.011), (65, 0.014), (69, 0.053), (73, 0.046), (80, 0.285), (88, 0.103), (91, 0.073)]
simIndex simValue paperId paperTitle
1 0.95982337 89 nips-2005-Group and Topic Discovery from Relations and Their Attributes
Author: Xuerui Wang, Natasha Mohanty, Andrew McCallum
Abstract: We present a probabilistic generative model of entity relationships and their attributes that simultaneously discovers groups among the entities and topics among the corresponding textual attributes. Block-models of relationship data have been studied in social network analysis for some time. Here we simultaneously cluster in several modalities at once, incorporating the attributes (here, words) associated with certain relationships. Significantly, joint inference allows the discovery of topics to be guided by the emerging groups, and vice-versa. We present experimental results on two large data sets: sixteen years of bills put before the U.S. Senate, comprising their corresponding text and voting records, and thirteen years of similar data from the United Nations. We show that in comparison with traditional, separate latent-variable models for words, or Blockstructures for votes, the Group-Topic model’s joint inference discovers more cohesive groups and improved topics. 1
same-paper 2 0.81585461 52 nips-2005-Correlated Topic Models
Author: John D. Lafferty, David M. Blei
Abstract: Topic models, such as latent Dirichlet allocation (LDA), can be useful tools for the statistical analysis of document collections and other discrete data. The LDA model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. A limitation of LDA is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than x-ray astronomy. This limitation stems from the use of the Dirichlet distribution to model the variability among the topic proportions. In this paper we develop the correlated topic model (CTM), where the topic proportions exhibit correlation via the logistic normal distribution [1]. We derive a mean-field variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. The CTM gives a better fit than LDA on a collection of OCRed articles from the journal Science. Furthermore, the CTM provides a natural way of visualizing and exploring this and other unstructured data sets. 1
3 0.76818812 35 nips-2005-Bayesian model learning in human visual perception
Author: Gergő Orbán, Jozsef Fiser, Richard N. Aslin, Máté Lengyel
Abstract: Humans make optimal perceptual decisions in noisy and ambiguous conditions. Computations underlying such optimal behavior have been shown to rely on probabilistic inference according to generative models whose structure is usually taken to be known a priori. We argue that Bayesian model selection is ideal for inferring similar and even more complex model structures from experience. We find in experiments that humans learn subtle statistical properties of visual scenes in a completely unsupervised manner. We show that these findings are well captured by Bayesian model learning within a class of models that seek to explain observed variables by independent hidden causes. 1
4 0.53560311 48 nips-2005-Context as Filtering
Author: Daichi Mochihashi, Yuji Matsumoto
Abstract: Long-distance language modeling is important not only in speech recognition and machine translation, but also in high-dimensional discrete sequence modeling in general. However, the problem of context length has almost been neglected so far and a na¨ve bag-of-words history has been ı employed in natural language processing. In contrast, in this paper we view topic shifts within a text as a latent stochastic process to give an explicit probabilistic generative model that has partial exchangeability. We propose an online inference algorithm using particle filters to recognize topic shifts to employ the most appropriate length of context automatically. Experiments on the BNC corpus showed consistent improvement over previous methods involving no chronological order. 1
5 0.49797982 30 nips-2005-Assessing Approximations for Gaussian Process Classification
Author: Malte Kuss, Carl E. Rasmussen
Abstract: Gaussian processes are attractive models for probabilistic classification but unfortunately exact inference is analytically intractable. We compare Laplace’s method and Expectation Propagation (EP) focusing on marginal likelihood estimates and predictive performance. We explain theoretically and corroborate empirically that EP is superior to Laplace. We also compare to a sophisticated MCMC scheme and show that EP is surprisingly accurate. In recent years models based on Gaussian process (GP) priors have attracted much attention in the machine learning community. Whereas inference in the GP regression model with Gaussian noise can be done analytically, probabilistic classification using GPs is analytically intractable. Several approaches to approximate Bayesian inference have been suggested, including Laplace’s approximation, Expectation Propagation (EP), variational approximations and Markov chain Monte Carlo (MCMC) sampling, some of these in conjunction with generalisation bounds, online learning schemes and sparse approximations. Despite the abundance of recent work on probabilistic GP classifiers, most experimental studies provide only anecdotal evidence, and no clear picture has yet emerged, as to when and why which algorithm should be preferred. Thus, from a practitioners point of view probabilistic GP classification remains a jungle. In this paper, we set out to understand and compare two of the most wide-spread approximations: Laplace’s method and Expectation Propagation (EP). We also compare to a sophisticated, but computationally demanding MCMC scheme to examine how close the approximations are to ground truth. We examine two aspects of the approximation schemes: Firstly the accuracy of approximations to the marginal likelihood which is of central importance for model selection and model comparison. In any practical application of GPs in classification (usually multiple) parameters of the covariance function (hyperparameters) have to be handled. Bayesian model selection provides a consistent framework for setting such parameters. Therefore, it is essential to evaluate the accuracy of the marginal likelihood approximations as a function of the hyperparameters, in order to assess the practical usefulness of the approach Secondly, we need to assess the quality of the approximate probabilistic predictions. In the past, the probabilistic nature of the GP predictions have not received much attention, the focus being mostly on classification error rates. This unfortunate state of affairs is caused primarily by typical benchmarking problems being considered outside of a realistic context. The ability of a classifier to produce class probabilities or confidences, have obvious relevance in most areas of application, eg. medical diagnosis. We evaluate the predictive distributions of the approximate methods, and compare to the MCMC gold standard. 1 The Gaussian Process Model for Binary Classification Let y ∈ {−1, 1} denote the class label of an input x. Gaussian process classification (GPC) is discriminative in modelling p(y|x) for given x by a Bernoulli distribution. The probability of success p(y = 1|x) is related to an unconstrained latent function f (x) which is mapped to the unit interval by a sigmoid transformation, eg. the logit or the probit. For reasons of analytic convenience we exclusively use the probit model p(y = 1|x) = Φ(f (x)), where Φ denotes the cumulative density function of the standard Normal distribution. 
In the GPC model Bayesian inference is performed about the latent function f in the light of observed data D = {(yi , xi )|i = 1, . . . , m}. Let fi = f (xi ) and f = [f1 , . . . , fm ] be shorthand for the values of the latent function and y = [y1 , . . . , ym ] and X = [x1 , . . . , xm ] collect the class labels and inputs respectively. Given the latent function the class labels are independent Bernoulli variables, so the joint likelihood factories: m m p(yi |fi ) = p(y|f ) = i=1 Φ(yi fi ), i=1 and depends on f only through its value at the observed inputs. We use a zero-mean Gaussian process prior over the latent function f with a covariance function k(x, x |θ), which may depend on hyperparameters θ [1]. The functional form and parameters of the covariance function encodes assumptions about the latent function, and adaptation of these is part of the inference. The posterior distribution over latent function values f at the observed X for given hyperparameters θ becomes: m p(f |D, θ) = N (f |0, K) Φ(yi fi ), p(D|θ) i=1 where p(D|θ) = p(y|f )p(f |X, θ)df , denotes the marginal likelihood. Unfortunately neither the marginal likelihood, nor the posterior itself, or predictions can be computed analytically, so approximations are needed. 2 Approximate Bayesian Inference For the GPC model approximations are either based on a Gaussian approximation to the posterior p(f |D, θ) ≈ q(f |D, θ) = N (f |m, A) or involve Markov chain Monte Carlo (MCMC) sampling [2]. We compare Laplace’s method and Expectation Propagation (EP) which are two alternative approaches to finding parameters m and A of the Gaussian q(f |D, θ). Both methods also allow approximate evaluation of the marginal likelihood, which is useful for ML-II hyperparameter optimisation. Laplace’s approximation (LA) is found by making a second order Taylor approximation of the (un-normalised) log posterior [3]. The mean m is placed at the mode (MAP) and the covariance A equals the negative inverse Hessian of the log posterior density at m. The EP approximation [4] also gives a Gaussian approximation to the posterior. The parameters m and A are found in an iterative scheme by matching the approximate marginal moments of p(fi |D, θ) by the marginals of the approximation N (fi |mi , Aii ). Although we cannot prove the convergence of EP, we conjecture that it always converges for GPC with probit likelihood, and have never encountered an exception. A key insight is that a Gaussian approximation to the GPC posterior is equivalent to a GP approximation to the posterior distribution over latent functions. For a test input x∗ the fi 1 0.16 0.14 0.8 0.6 0.1 fj p(y|f) p(f|y) 0.12 Likelihood p(y|f) Prior p(f) Posterior p(f|y) Laplace q(f|y) EP q(f|y) 0.08 0.4 0.06 0.04 0.2 0.02 0 −4 0 4 8 0 f . (a) (b) Figure 1: Panel (a) provides a one-dimensional illustration of the approximations. The prior N (f |0, 52 ) combined with the probit likelihood (y = 1) results in a skewed posterior. The likelihood uses the right axis, all other curves use the left axis. Laplace’s approximation peaks at the posterior mode, but places far too much mass over negative values of f and too little at large positive values. The EP approximation matches the first two posterior moments, which results in a larger mean and a more accurate placement of probability mass compared to Laplace’s approximation. In Panel (b) we caricature a high dimensional zeromean Gaussian prior as an ellipse. 
The gray shadow indicates that for a high dimensional Gaussian most of the mass lies in a thin shell. For large latent signals (large entries in K), the likelihood essentially cuts off regions which are incompatible with the training labels (hatched area), leaving the upper right orthant as the posterior. The dot represents the mode of the posterior, which remains close to the origin. approximate predictive latent and class probabilities are: 2 q(f∗ |D, θ, x∗ ) = N (µ∗ , σ∗ ), and 2 q(y∗ = 1|D, x∗ ) = Φ(µ∗ / 1 + σ∗ ), 2 where µ∗ = k∗ K−1 m and σ∗ = k(x∗ , x∗ )−k∗ (K−1 − K−1 AK−1 )k∗ , where the vector k∗ = [k(x1 , x∗ ), . . . , k(xm , x∗ )] collects covariances between x∗ and training inputs X. MCMC sampling has the advantage that it becomes exact in the limit of long runs and so provides a gold standard by which to measure the two analytic methods described above. Although MCMC methods can in principle be used to do inference over f and θ jointly [5], we compare to methods using ML-II optimisation over θ, thus we use MCMC to integrate over f only. Good marginal likelihood estimates are notoriously difficult to obtain; in our experiments we use Annealed Importance Sampling (AIS) [6], combining several Thermodynamic Integration runs into a single (unbiased) estimate of the marginal likelihood. Both analytic approximations have a computational complexity which is cubic O(m3 ) as common among non-sparse GP models due to inversions m × m matrices. In our implementations LA and EP need similar running times, on the order of a few minutes for several hundred data-points. Making AIS work efficiently requires some fine-tuning and a single estimate of p(D|θ) can take several hours for data sets of a few hundred examples, but this could conceivably be improved upon. 3 Structural Properties of the Posterior and its Approximations Structural properties of the posterior can best be understood by examining its construction. The prior is a correlated m-dimensional Gaussian N (f |0, K) centred at the origin. Each likelihood term p(yi |fi ) softly truncates the half-space from the prior that is incompatible with the observed label, see Figure 1. The resulting posterior is unimodal and skewed, similar to a multivariate Gaussian truncated to the orthant containing y. The mode of the posterior remains close to the origin, while the mass is placed in accordance with the observed class labels. Additionally, high dimensional Gaussian distributions exhibit the property that most probability mass is contained in a thin ellipsoidal shell – depending on the covariance structure – away from the mean [7, ch. 29.2]. Intuitively this occurs since in high dimensions the volume grows extremely rapidly with the radius. As an effect the mode becomes less representative (typical) for the prior distribution as the dimension increases. For the GPC posterior this property persists: the mode of the posterior distribution stays relatively close to the origin, still being unrepresentative for the posterior distribution, while the mean moves to the mass of the posterior making mean and mode differ significantly. We cannot generally assume the posterior to be close to Gaussian, as in the often studied limit of low-dimensional parametric models with large amounts of data. Therefore in GPC we must be aware of making a Gaussian approximation to a non-Gaussian posterior. 
From the properties of the posterior it can be expected that Laplace’s method places m in the right orthant but too close to the origin, such that the approximation will overlap with regions having practically zero posterior mass. As an effect the amplitude of the approximate latent posterior GP will be underestimated systematically, leading to overly cautious predictive distributions. The EP approximation does not rely on a local expansion, but assumes that the marginal distributions can be well approximated by Gaussians. This assumption will be examined empirically below. 4 Experiments In this section we compare and inspect approximations for GPC using various benchmark data sets. The primary focus is not to optimise the absolute performance of GPC models but to compare the relative accuracy of approximations and to validate the arguments given in the previous section. In all experiments we use a covariance function of the form: k(x, x |θ) = σ 2 exp − 1 x − x 2 2 / 2 , (1) such that θ = [σ, ]. We refer to σ 2 as the signal variance and to as the characteristic length-scale. Note that for many classification tasks it may be reasonable to use an individual length scale parameter for every input dimension (ARD) or a different kind of covariance function. Nevertheless, for the sake of presentability we use the above covariance function and we believe the conclusions about the accuracy of approximations to be independent of this choice, since it relies on arguments which are independent of the form of the covariance function. As measure of the accuracy of predictive probabilities we use the average information in bits of the predictions about the test targets in excess of that of random guessing. Let p∗ = p(y∗ = 1|D, θ, x∗ ) be the model’s prediction, then we average: I(p∗ , yi ) = i yi +1 2 log2 (p∗ ) + i 1−yi 2 log2 (1 − p∗ ) + H i (2) over all test cases, where H is the entropy of the training labels. The error rate E is equal to the percentage of erroneous class assignments if prediction is understood as a decision problem with symmetric costs. For the first set of experiments presented here the well-known USPS digits and the Ionosphere data set were used. A binary sub-problem from the USPS digits is defined by only considering 3’s vs. 5’s (which is probably the hardest of the binary sub-problems) and dividing the data into 767 cases for training and 773 for testing. The Ionosphere data is split into 200 training and 151 test cases. We do an exhaustive investigation on a fine regular grid of values for the log hyperparameters. For each θ on the grid we compute the approximated log marginal likelihood by LA, EP and AIS. Additionally we compute the respective predictive performance (2) on the test set. Results are shown in Figure 2. 
Log marginal likelihood −150 −130 −200 Log marginal likelihood 5 −115 −105 −95 4 −115 −105 3 −130 −100 −150 2 1 log magnitude, log(σf) log magnitude, log(σf) 4 Log marginal likelihood 5 −160 4 −100 3 −130 −92 −160 2 −105 −160 −105 −200 −115 1 log magnitude, log(σf) 5 −92 −95 3 −100 −105 2−200 −115 −160 −130 −200 1 −200 0 0 0 −200 3 4 log lengthscale, log(l) 5 2 3 4 log lengthscale, log(l) (1a) 4 0.84 4 0.8 0.8 0.25 3 0.8 0.84 2 0.7 0.7 1 0.5 log magnitude, log(σf) 0.86 5 0.86 0.8 0.89 0.88 0.7 1 0.5 3 4 log lengthscale, log(l) 2 3 4 log lengthscale, log(l) (2a) Log marginal likelihood −90 −70 −100 −120 −120 0 −70 −75 −120 1 −100 1 2 3 log lengthscale, log(l) 4 0 −70 −90 −65 2 −100 −100 1 −120 −80 1 2 3 log lengthscale, log(l) 4 −1 −1 5 5 f 0.1 0.2 0.55 0 1 0.4 1 2 3 log lengthscale, log(l) 5 0.5 0.1 0 0.3 0.4 0.6 0.55 0.3 0.2 0.2 0.1 1 0 0.2 4 5 −1 −1 0.4 0.2 0.6 2 0.3 10 0 0.1 0.2 0.1 0 0 0.5 1 2 3 log lengthscale, log(l) 0.5 0.5 0.55 3 0 0.1 0 1 2 3 log lengthscale, log(l) 0.5 0.3 0.5 4 2 5 (3c) 0.5 3 4 Information about test targets in bits 4 log magnitude, log(σf) 4 2 0 (3b) Information about test targets in bits 0.3 log magnitude, log(σ ) −75 0 −1 −1 5 5 0 −120 3 −120 (3a) −1 −1 −90 −80 −65 −100 2 Information about test targets in bits 0 −75 4 0 3 5 Log marginal likelihood −90 3 −100 0 0.25 3 4 log lengthscale, log(l) 5 log magnitude, log(σf) log magnitude, log(σf) f log magnitude, log(σ ) −80 3 0.5 (2c) −75 −90 0.7 0.8 2 4 −75 −1 −1 0.86 0.84 Log marginal likelihood 4 1 0.7 1 5 5 −150 2 (2b) 5 2 0.88 3 0 5 0.84 0.89 0.25 0 0.7 0.25 0 0.86 4 0.84 3 2 5 Information about test targets in bits log magnitude, log(σf) log magnitude, log(σf) 5 −200 3 4 log lengthscale, log(l) (1c) Information about test targets in bits 5 2 2 (1b) Information about test targets in bits 0.5 5 log magnitude, log(σf) 2 4 5 −1 −1 0 1 2 3 log lengthscale, log(l) 4 5 (4a) (4b) (4c) Figure 2: Comparison of marginal likelihood approximations and predictive performances of different approximation techniques for USPS 3s vs. 5s (upper half) and the Ionosphere data (lower half). The columns correspond to LA (a), EP (b), and MCMC (c). The rows show estimates of the log marginal likelihood (rows 1 & 3) and the corresponding predictive performance (2) on the test set (rows 2 & 4) respectively. MCMC samples Laplace p(f|D) EP p(f|D) 0.2 0.15 0.45 0.1 0.4 0.05 0.3 −16 −14 −12 −10 −8 −6 f −4 −2 0 2 4 p(xi) 0 0.35 (a) 0.06 0.25 0.2 0.15 MCMC samples Laplace p(f|D) EP p(f|D) 0.1 0.05 0.04 0 0 2 0.02 xi 4 6 (c) 0 −40 −35 −30 −25 −20 −15 −10 −5 0 5 10 15 f (b) Figure 3: Panel (a) and (b) show two marginal distributions p(fi |D, θ) from a GPC posterior and its approximations. The true posterior is approximated by a normalised histogram of 9000 samples of fi obtained by MCMC sampling. Panel (c) shows a histogram of samples of a marginal distribution of a truncated high-dimensional Gaussian. The line describes a Gaussian with mean and variance estimated from the samples. For all three approximation techniques we see an agreement between marginal likelihood estimates and test performance, which justifies the use of ML-II parameter estimation. But the shape of the contours and the values differ between the methods. The contours for Laplace’s method appear to be slanted compared to EP. The marginal likelihood estimates of EP and AIS agree surprisingly well1 , given that the marginal likelihood comes as a 767 respectively 200 dimensional integral. 
The EP predictions contain as much information about the test cases as the MCMC predictions and significantly more than for LA. Note that for small signal variances (roughly ln(σ 2 ) < 1) LA and EP give very similar results. A possible explanation is that for small signal variances the likelihood does not truncate the prior but only down-weights the tail that disagrees with the observation. As an effect the posterior will be less skewed and both approximations will lead to similar results. For the USPS 3’s vs. 5’s we now inspect the marginal distributions p(fi |D, θ) of single latent function values under the posterior approximations for a given value of θ. We have chosen the values ln(σ) = 3.35 and ln( ) = 2.85 which are between the ML-II estimates of EP and LA. Hybrid MCMC was used to generate 9000 samples from the posterior p(f |D, θ). For LA and EP the approximate marginals are q(fi |D, θ) = N (fi |mi , Aii ) where m and A are found by the respective approximation techniques. In general we observe that the marginal distributions of MCMC samples agree very well with the respective marginal distributions of the EP approximation. For Laplace’s approximation we find the mean to be underestimated and the marginal distributions to overlap with zero far more than the EP approximations. Figure (3a) displays the marginal distribution and its approximations for which the MCMC samples show maximal skewness. Figure (3b) shows a typical example where the EP approximation agrees very well with the MCMC samples. We show this particular example because under the EP approximation p(yi = 1|D, θ) < 0.1% but LA gives a wrong p(yi = 1|D, θ) ≈ 18%. In the experiment we saw that the marginal distributions of the posterior often agree very 1 Note that the agreement between the two seems to be limited by the accuracy of the MCMC runs, as judged by the regularity of the contour lines; the tolerance is less than one unit on a (natural) log scale. well with a Gaussian approximation. This seems to contradict the description given in the previous section were we argued that the posterior is skewed by construction. In order to inspect the marginals of a truncated high-dimensional multivariate Gaussian distribution we made an additional synthetic experiment. We constructed a 767 dimensional Gaussian N (x|0, C) with a covariance matrix having one eigenvalue of 100 with eigenvector 1, and all other eigenvalues are 1. We then truncate this distribution such that all xi ≥ 0. Note that the mode of the truncated Gaussian is still at zero, whereas the mean moves towards the remaining mass. Figure (3c) shows a normalised histogram of samples from a marginal distribution of one xi . The samples agree very well with a Gaussian approximation. In the previous section we described the somewhat surprising property, that for a truncated high-dimensional Gaussian, resembling the posterior, the mode (used by LA) may not be particularly representative of the distribution. Although the marginal is also truncated, it is still exceptionally well modelled by a Gaussian – however, the Laplace approximation centred on the origin would be completely inappropriate. In a second set of experiments we compare the predictive performance of LA and EP for GPC on several well known benchmark problems. Each data set is randomly split into 10 folds of which one at a time is left out as a test set to measure the predictive performance of a model trained (or selected) on the remaining nine folds. All performance measures are averages over the 10 folds. 
For GPC we implement model selection by ML-II hyperparameter estimation, reporting results for the θ that maximised the respective approximate marginal likelihood p(D|θ). In order to get a better picture of the absolute performance we also compare to results obtained by C-SVM classification. The kernel we used is equivalent to the covariance function (1) without the signal variance parameter. For each fold the parameters C and ℓ are found in an inner loop of 5-fold cross-validation, in which the parameter grids are refined until the performance stabilises. Predictive probabilities for test cases are obtained by mapping the unthresholded output of the SVM to [0, 1] using a sigmoid function [8].

Results are summarised in Table 1. Comparing Laplace's method to EP, the latter proves more accurate in terms of both error rate and information. While the error rates are relatively similar, the predictive distributions obtained by EP are more informative about the test targets. Note that for GPC the error rate depends only on the sign of the mean µ∗ of the approximate posterior over latent functions, and not on the entire posterior predictive distribution. As expected, the length of the mean vector ‖m‖ is much larger for the EP approximations. Comparing EP and SVMs, the results are mixed. For the Crabs data set all methods show the same error rate, but the information content of the predictive distributions differs dramatically; for some test cases the SVM predicts the wrong class with large certainty.

                            Laplace                      EP                          SVM
  Data Set        m    n    E%     I       ‖m‖          E%     I       ‖m‖          E%     I
  Ionosphere    351   34    8.84   0.591    49.96        7.99   0.661    124.94      5.69   0.681
  Wisconsin     683    9    3.21   0.804    62.62        3.21   0.805     84.95      3.21   0.795
  Pima Indians  768    8   22.77   0.252    29.05       22.63   0.253     47.49     23.01   0.232
  Crabs         200    7    2.0    0.682   112.34        2.0    0.908   2552.97      2.0    0.047
  Sonar         208   60   15.36   0.439    26.86       13.85   0.537  15678.55     11.14   0.567
  USPS 3 vs 5  1540  256    2.27   0.849   163.05        2.21   0.902  22011.70      2.01   0.918

Table 1: Results for benchmark data sets. The first three columns give the name of the data set, the number of observations m and the dimension of the inputs n. For Laplace's method and EP the table reports the average error rate E%, the average information I (2) and the average length ‖m‖ of the mean vector of the Gaussian approximation. For SVMs the error rate and the average information about the test targets are reported. Note that for the Crabs data set we use the sex (not the colour) of the crabs as class label.

5 Summary & Conclusions

Our experiments reveal serious differences between Laplace's method and EP when used in GPC models. From the structural properties of the posterior we described why LA systematically underestimates the mean m. The resulting posterior GP over latent functions has too small an amplitude, although the sign of the mean function is mostly correct. As a consequence, LA gives over-conservative predictive probabilities and diminished information about the test labels. This effect has been shown empirically on several real-world examples. Large discrepancies in the actual posterior probabilities were found, even at the training locations, which renders the predictive class probabilities produced under this approximation grossly inaccurate. Note that the difference becomes less dramatic if we only consider the classification error rates obtained by thresholding p∗ at 1/2. For this particular task, we have seen that the sign of the latent function tends to be correct (at least at the training locations); the short sketch below illustrates why the hard decisions can agree even when the predictive probabilities differ substantially.
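The following is a minimal numerical illustration, assuming the cumulative-Gaussian (probit) likelihood, for which the predictive integral ∫ Φ(f) N(f | µ∗, σ∗²) df has the closed form Φ(µ∗ / √(1 + σ∗²)); the specific values of µ∗ and σ∗² are illustrative and not taken from the experiments.

```python
import numpy as np
from scipy.stats import norm

def predictive_prob(mu_star, var_star):
    """p(y* = +1 | D) under a Gaussian approximation N(mu_star, var_star) to the
    latent f*, for the probit likelihood: Phi(mu / sqrt(1 + s^2))."""
    return norm.cdf(mu_star / np.sqrt(1.0 + var_star))

var_star = 2.0                     # illustrative latent variance (assumed)
for mu_star in (3.0, 1.0, 0.3):    # shrinking the latent mean, as LA tends to do
    p = predictive_prob(mu_star, var_star)
    label = +1 if p > 0.5 else -1
    print(f"mu* = {mu_star:4.1f}  ->  p(y=+1) = {p:.3f}, hard decision: {label:+d}")
# The hard decision is +1 in every case (it depends only on the sign of mu*),
# but the probability collapses towards 1/2 as the mean amplitude shrinks,
# i.e. over-conservative predictive probabilities with an unchanged error rate.
```

This is the pattern visible in Table 1: error rates that are close between LA and EP, while LA's much shorter mean vectors go hand in hand with lower information scores.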
The EP approximation has been shown to give results very close to MCMC, both in terms of predictive distributions and marginal likelihood estimates. We have shown and explained why the marginal distributions of the posterior can be well approximated by Gaussians. Further, the marginal likelihood values obtained by LA and EP differ systematically, which will lead to different results of ML-II hyperparameter estimation; the discrepancies are similar across tasks. Using AIS we were able to assess the accuracy of the marginal likelihood estimates, which to the best of our knowledge had not been done before. In summary, we found that EP is the method of choice for approximate inference in binary GPC models when the computational cost of MCMC is prohibitive. In contrast, the Laplace approximation is so inaccurate that we advise against its use, especially when predictive probabilities are to be taken seriously. Further experiments and a detailed description of the approximation schemes can be found in [2].

Acknowledgements

Both authors acknowledge support by the German Research Foundation (DFG) through grant RA 1030/1. This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors' views.

References

[1] C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, NIPS 8, pages 514–520. MIT Press, 1996.
[2] M. Kuss and C. E. Rasmussen. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6:1679–1704, 2005.
[3] C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.
[4] T. P. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Department of Electrical Engineering and Computer Science, MIT, 2001.
[5] R. M. Neal. Regression and classification using Gaussian process priors. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 6, pages 475–501. Oxford University Press, 1998.
[6] R. M. Neal. Annealed importance sampling. Statistics and Computing, 11:125–139, 2001.
[7] D. J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
[8] J. C. Platt. Probabilities for SV machines. In Advances in Large Margin Classifiers, pages 61–73. MIT Press, 2000.
6 0.49123505 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity
7 0.48571184 136 nips-2005-Noise and the two-thirds power Law
8 0.48305064 179 nips-2005-Sparse Gaussian Processes using Pseudo-inputs
9 0.48188955 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs
10 0.47947747 63 nips-2005-Efficient Unsupervised Learning for Localization and Detection in Object Categories
11 0.47894377 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models
12 0.47728932 157 nips-2005-Principles of real-time computing with feedback applied to cortical microcircuit models
13 0.47610852 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation
14 0.47580802 100 nips-2005-Interpolating between types and tokens by estimating power-law generators
15 0.47479489 74 nips-2005-Faster Rates in Regression via Active Learning
16 0.4726584 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
17 0.47259849 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts
18 0.47253838 45 nips-2005-Conditional Visual Tracking in Kernel Space
19 0.47196275 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction
20 0.47153941 124 nips-2005-Measuring Shared Information and Coordinated Activity in Neuronal Networks