nips nips2004 nips2004-169 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yee W. Teh, Michael I. Jordan, Matthew J. Beal, David M. Blei
Abstract: We propose the hierarchical Dirichlet process (HDP), a nonparametric Bayesian model for clustering problems involving multiple groups of data. Each group of data is modeled with a mixture, with the number of components being open-ended and inferred automatically by the model. Further, components can be shared across groups, allowing dependencies across groups to be modeled effectively as well as conferring generalization to new groups. Such grouped clustering problems occur often in practice, e.g. in the problem of topic discovery in document corpora. We report experimental results on three text corpora showing the effective and superior performance of the HDP over previous models.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We propose the hierarchical Dirichlet process (HDP), a nonparametric Bayesian model for clustering problems involving multiple groups of data. [sent-11, score-0.428]
2 Further, components can be shared across groups, allowing dependencies across groups to be modeled effectively as well as conferring generalization to new groups. [sent-13, score-0.199]
3 Such grouped clustering problems occur often in practice, e.g. in the problem of topic discovery in document corpora. [sent-14, score-0.083]
4 1 Introduction One of the most significant conceptual and practical tools in the Bayesian paradigm is the notion of a hierarchical model. [sent-18, score-0.153]
5 Building on the notion that a parameter is a random variable, hierarchical models have applications to a variety of forms of grouped or relational data and to general problems involving “multi-task learning” or “learning to learn.” [sent-19, score-0.216]
6 Here we consider the application of hierarchical Bayesian ideas to a problem in “multi-task learning” in which the “tasks” are clustering problems, and our goal is to share clusters among multiple, related clustering problems. [sent-23, score-0.314]
7 We are motivated by the task of discovering topics in document corpora [1]. [sent-24, score-0.168]
8 A topic (i.e., a cluster) is a distribution across words, while documents are viewed as distributions across topics. [sent-27, score-0.293]
9 We want to discover topics that are common across multiple documents in the same corpus, as well as across multiple corpora. [sent-28, score-0.371]
10 Our work is based on a tool from nonparametric Bayesian analysis known as the Dirichlet process (DP) mixture model [2, 3]. [sent-29, score-0.305]
11 Indeed, at each step of generating data points, a DP mixture model can either assign the data point to a previously-generated cluster or can start a new cluster. [sent-31, score-0.188]
12 Extending the DP mixture model framework to the setting of multiple related clustering problems, we will be able to make the (realistic) assumption that we do not know the number of clusters a priori in any of the problems, nor do we know how clusters should be shared among the problems. [sent-33, score-0.339]
13 When generating a new cluster, a DP mixture model selects the parameters for the cluster (e. [sent-34, score-0.188]
14 Unfortunately, if we now wish to extend DP mixtures to groups of clustering problems, the assumption that G0 is smooth conflicts with the goal of sharing clusters among groups. [sent-40, score-0.209]
15 That is, even if each group shares the same underlying base distribution G0 , the smoothness of G0 implies that they will generate distinct cluster parameters (with probability one). [sent-41, score-0.15]
16 We will show that this problem can be resolved by taking a hierarchical Bayesian approach. [sent-42, score-0.13]
17 We present a notion of a hierarchical Dirichlet process (HDP) in which the base distribution G0 for a set of DPs is itself a draw from a DP. [sent-43, score-0.286]
18 This turns out to provide an elegant and simple solution to the problem of sharing clusters among multiple clustering problems. [sent-44, score-0.13]
19 In Section 2, we provide the basic technical definition of DPs and discuss related representations involving stick-breaking processes and Chinese restaurant processes. [sent-46, score-0.22]
20 As for the DP, we present analogous stick-breaking and Chinese restaurant representations for the HDP. [sent-48, score-0.165]
21 We present empirical results on a number of text corpora in Section 5, demonstrating various aspects of the HDP including its nonparametric nature, hierarchical nature, and the ease with which the framework can be applied to other realms such as hidden Markov models. [sent-49, score-0.268]
22 2 Dirichlet Processes The Dirichlet process (DP) and the DP mixture model are mainstays of nonparametric Bayesian statistics (see, e. [sent-50, score-0.305]
23 The DP can be used in the mixture model setting in the following way. [sent-74, score-0.164]
24 Given a draw G ∼ DP(α0 , G0 ), independently draw n latent factors from G: φi ∼ G. [sent-79, score-0.12]
25 If the factors φi were all distinct, then this setup would yield an (uninteresting) mixture model with n components. [sent-85, score-0.164]
26 Rather, the number of distinct values grows as O(log n), and it is this that defines the random number of mixture components. [sent-87, score-0.192]
27 In this paper we will refer to two: the Chinese restaurant process (CRP) and the stick-breaking process. [sent-89, score-0.198]
28 The CRP is a distribution on partitions that directly captures the clustering of draws from a DP via a metaphor in which customers share tables in a Chinese restaurant [5]. [sent-90, score-0.514]
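This seating metaphor is easy to simulate directly. The following sketch (plain NumPy; the function name and parameter defaults are ours, not from the paper) draws a CRP partition and illustrates that the number of occupied tables grows slowly with the number of customers, roughly like α0 log n.

```python
import numpy as np

def sample_crp_partition(n, alpha0, rng=None):
    """Simulate table assignments for n customers under a CRP(alpha0).

    Each customer joins an occupied table with probability proportional to
    its current occupancy, or opens a new table with probability
    proportional to alpha0.
    """
    rng = np.random.default_rng() if rng is None else rng
    counts = []        # counts[t] = number of customers at table t
    assignments = []
    for _ in range(n):
        probs = np.array(counts + [alpha0], dtype=float)
        t = rng.choice(len(probs), p=probs / probs.sum())
        if t == len(counts):      # open a new table
            counts.append(1)
        else:
            counts[t] += 1
        assignments.append(t)
    return assignments, counts

assignments, counts = sample_crp_partition(n=1000, alpha0=1.0)
print(len(counts))   # number of occupied tables, typically O(alpha0 * log n)
```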
29 The stick-breaking construction shows that DP mixture models can be viewed as mixture models with a countably infinite number of components. [sent-102, score-0.386]
30 To see this, identify each θk as the parameter of the k-th mixture component, with mixing proportion given by βk . [sent-103, score-0.216]
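As a concrete illustration of the stick-breaking weights, here is a minimal sketch (our own code; the truncation level K is an approximation, since the true representation has countably many components).

```python
import numpy as np

def stick_breaking_weights(alpha0, K, rng=None):
    """Truncated stick-breaking construction of DP mixing proportions.

    beta'_k ~ Beta(1, alpha0) and beta_k = beta'_k * prod_{l<k} (1 - beta'_l).
    Truncating at K components approximates the countably infinite case.
    """
    rng = np.random.default_rng() if rng is None else rng
    b = rng.beta(1.0, alpha0, size=K)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - b)[:-1]))
    return b * remaining

beta = stick_breaking_weights(alpha0=1.0, K=50)
print(beta[:5], beta.sum())   # weights sum to just under 1 for finite K
```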
31 3 Hierarchical Dirichlet Processes We will introduce the hierarchical Dirichlet process (HDP) in this section. [sent-104, score-0.163]
32 We assume that we have J groups of data, each consisting of nj data points (xj1 , . . . , xjnj ). [sent-106, score-0.103]
33 We assume that the data points in each group are exchangeable, and are to be modeled with a mixture model. [sent-110, score-0.222]
34 While each mixture model has mixing proportions specific to the group, we require that the different groups share the same set of mixture components. [sent-111, score-0.555]
35 The idea is that while different groups have different characteristics given by a different combination of mixing proportions, using the same set of mixture components allows statistical strength to be shared across groups, and allows generalization to new groups. [sent-112, score-0.4]
36 The HDP is a nonparametric prior which allows the mixture models to share components. [sent-113, score-0.316]
37 To complete the description of the HDP mixture model, we associate each xji with a factor φji , with distributions given by F (φji ) and Gj respectively. [sent-116, score-0.321]
38 The overall model is given in Figure 1 left, with conditional distributions: G0 | γ, H ∼ DP(γ, H) and Gj | α0 , G0 ∼ DP(α0 , G0 ), (3) φji | Gj ∼ Gj and xji | φji ∼ F (φji ). (4) [sent-117, score-0.132]
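To make the two-level structure concrete, here is a hedged generative sketch based on a truncated stick-breaking representation. The Gaussian choices for F and H, the truncation levels K and T, and all function names are illustrative assumptions, not the paper's specification.

```python
import numpy as np

def sample_hdp_groups(J, n_j, gamma, alpha0, K=50, T=30, rng=None):
    """Truncated generative sketch of an HDP mixture with Gaussian F and H.

    G0 is approximated by K atoms theta_k ~ H with stick-breaking weights
    beta (concentration gamma).  Each group-level Gj draws its T atoms from
    the atomic G0 and has its own stick-breaking weights (concentration
    alpha0), so the groups automatically share the global atoms.
    """
    rng = np.random.default_rng() if rng is None else rng

    def stick(conc, m):
        b = rng.beta(1.0, conc, size=m)
        return b * np.concatenate(([1.0], np.cumprod(1.0 - b)[:-1]))

    beta = stick(gamma, K)
    beta /= beta.sum()                        # renormalize the truncation
    theta = rng.normal(0.0, 10.0, size=K)     # atoms of G0, drawn from a broad Gaussian H

    data, labels = [], []
    for j in range(J):
        pi_j = stick(alpha0, T)
        pi_j /= pi_j.sum()
        group_atoms = rng.choice(K, size=T, p=beta)   # draws psi_jt from the atomic G0
        t = rng.choice(T, size=n_j, p=pi_j)           # table of each data point
        k = group_atoms[t]                            # global component of phi_ji
        data.append(rng.normal(theta[k], 1.0))        # x_ji ~ F(phi_ji), unit-variance Gaussian
        labels.append(k)
    return data, labels, theta
```

Because the group-level atoms are themselves draws from the atomic G0, the same θk values recur in different groups, which is exactly the sharing mechanism discussed next.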
39 The stick-breaking construction (2) shows that a draw of G0 can be expressed as a weighted sum of point masses: G0 = Σ_{k=1}^∞ βk δθk . [sent-118, score-0.085]
40 The fact that G0 is atomic plays an important role in ensuring that mixture components are shared across different groups. [sent-119, score-0.245]
41 Since G0 is the base distribution for the individual Gj ’s, (2) again shows that the atoms of the individual Gj are samples from G0 . [sent-120, score-0.079]
42 In particular, since G0 places non-zero mass only on the atoms θ = (θk )_{k=1}^∞ , the atoms of Gj must also come from θ, hence we may write: G0 = Σ_{k=1}^∞ βk δθk and Gj = Σ_{k=1}^∞ πjk δθk . (5) [sent-121, score-0.078]
43 Identifying θk as the parameters of the k-th mixture component, we see that the submodels corresponding to distinct groups share the same set of mixture components, but have differing mixing proportions πj = (πjk )_{k=1}^∞ . [sent-122, score-0.531]
44 Finally, it is useful to explicitly describe the relationships between the mixing proportions β and (πj )_{j=1}^J . [sent-123, score-0.104]
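One standard way to express this relationship in HDP stick-breaking constructions is πj ∼ DP(α0 , β); under a truncation to K components this becomes a Dirichlet draw with parameters α0 β. This is brought in from the wider HDP literature rather than stated in this summary, so treat the sketch below as illustrative only.

```python
import numpy as np

def group_weights_from_global(beta, alpha0, J, rng=None):
    """Sample group-level mixing proportions pi_j given global weights beta.

    Assumes pi_j ~ DP(alpha0, beta); with a truncation to K = len(beta)
    components this is Dirichlet(alpha0 * beta).  Larger alpha0 keeps each
    pi_j close to beta; smaller alpha0 lets groups deviate more.
    """
    rng = np.random.default_rng() if rng is None else rng
    return rng.dirichlet(alpha0 * np.asarray(beta, dtype=float), size=J)

beta = np.array([0.5, 0.3, 0.15, 0.05])
print(group_weights_from_global(beta, alpha0=10.0, J=3))   # rows stay near beta
print(group_weights_from_global(beta, alpha0=0.5, J=3))    # rows deviate strongly
```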
45 Each of the 3 restaurants has customers sitting around tables, and each table is served a dish (which corresponds to customers in the Chinese restaurant for the global DP). [sent-129, score-0.565]
46 This view directly leads to an efficient Gibbs sampler for HDP mixture models, which is detailed in the appendix. [sent-141, score-0.164]
47 Imagine nj customers (each corresponds to a φji ) at a Chinese restaurant with an unbounded number of tables. [sent-150, score-0.323]
48 A subsequent customer sits at an occupied table with probability proportional to the number of customers already there, or at the next unoccupied table with probability proportional to α0 . [sent-152, score-0.282]
49 tji | tj1 , . . . , tj,i−1 , α0 ∼ Σ_t njt / (Σ_{t′} njt′ + α0 ) δt + α0 / (Σ_{t′} njt′ + α0 ) δ_{t^new} , (7) where njt is the number of customers currently at table t. [sent-157, score-0.431]
50 Once all customers have sat down, the seating plan corresponds to a partition of φj1 , . . . , φjnj . [sent-158, score-0.184]
51 This is an exchangeable process in that the probability of a partition does not depend on the order in which customers sit down. [sent-162, score-0.275]
52 Now we associate with table t a draw ψjt from G0 , and assign φji = ψjtji . [sent-163, score-0.093]
53 Performing this process independently for each group j, we have now integrated out all the Gj ’s, and have an assignment of each φji to a sample ψjtji from G0 , with the partition structures given by CRPs. [sent-164, score-0.117]
54 Let the customer associated with ψjt sit at table kjt . [sent-169, score-0.329]
55 Right: histogram of the number of topics the HDP mixture used over 100 posterior samples. [sent-179, score-0.327]
56 Finally we associate with table k a draw θk from H and assign ψjt = θkjt . [sent-180, score-0.093]
57 We call this generative process the Chinese restaurant franchise (CRF). [sent-182, score-0.231]
58 The metaphor is as follows: we have J restaurants, each with nj customers (φji ’s), who sit at tables (ψjt ’s). [sent-183, score-0.276]
59 Now each table is served a dish (θk ’s) from a menu common to all restaurants. [sent-184, score-0.103]
60 The customers are sociable, preferring large tables with many customers present, and they also prefer popular dishes. [sent-185, score-0.304]
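The franchise metaphor translates almost line-for-line into a simulation. The sketch below (our own NumPy code, with illustrative names) seats customers by a local CRP in each restaurant and lets every new table order a dish from a global CRP over the shared menu, so dishes, i.e. mixture components, end up shared across groups.

```python
import numpy as np

def sample_crf(J, n_j, alpha0, gamma, rng=None):
    """Simulate the Chinese restaurant franchise.

    Each restaurant j seats its customers by a local CRP(alpha0); every new
    table orders a dish from a global CRP(gamma) over the shared menu.
    Returns, per restaurant, the dish index of each customer, plus the
    global table counts per dish (m_k).
    """
    rng = np.random.default_rng() if rng is None else rng
    dish_counts = []                     # m_k: number of tables serving dish k
    all_dishes = []
    for j in range(J):
        table_counts = []                # n_jt: customers at table t of restaurant j
        table_dish = []                  # k_jt: dish served at table t
        cust_dishes = []
        for _ in range(n_j):
            p = np.array(table_counts + [alpha0], dtype=float)
            t = rng.choice(len(p), p=p / p.sum())
            if t == len(table_counts):   # new table: order a dish from the global menu
                q = np.array(dish_counts + [gamma], dtype=float)
                k = rng.choice(len(q), p=q / q.sum())
                if k == len(dish_counts):
                    dish_counts.append(1)
                else:
                    dish_counts[k] += 1
                table_counts.append(1)
                table_dish.append(k)
            else:
                table_counts[t] += 1
            cust_dishes.append(table_dish[t])
        all_dishes.append(cust_dishes)
    return all_dishes, dish_counts

dishes, dish_counts = sample_crf(J=3, n_j=50, alpha0=1.0, gamma=1.0)
print(len(dish_counts), "dishes shared across 3 restaurants")
```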
61 5 Experiments We describe 3 experiments in this section to highlight the various aspects of the HDP: its nonparametric nature; its hierarchical nature; and the ease with which we can apply the framework to other models, specifically the HMM. [sent-186, score-0.238]
62 To demonstrate the strength of the nonparametric approach as exemplified by the HDP mixture, we compared it against latent Dirichlet allocation (LDA), which is a parametric model similar in structure to the HDP [1]. [sent-188, score-0.132]
63 In particular, we applied both models to a corpus of nematode biology abstracts, evaluating the perplexity of both models on held-out abstracts. [sent-189, score-0.179]
64 Here abstracts correspond to groups, words correspond to observations, and topics correspond to mixture components; exchangeability corresponds to the typical bag-of-words assumption. [sent-190, score-0.41]
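Held-out perplexity drives the comparison in this experiment. The exact held-out protocol, i.e. how the predictive word probabilities are computed (for example by averaging over posterior samples), is not spelled out in this summary, so the sketch below only shows the standard per-word perplexity computation given those probabilities; all names are illustrative.

```python
import numpy as np

def per_word_perplexity(heldout_docs, word_probs):
    """Per-word perplexity = exp(-average held-out log-likelihood).

    heldout_docs: list of documents, each a list of word ids.
    word_probs[d][w]: the model's predictive probability of word w in
    held-out document d (model-specific; assumed given here).
    """
    log_lik, n_words = 0.0, 0
    for d, doc in enumerate(heldout_docs):
        for w in doc:
            log_lik += np.log(word_probs[d][w])
            n_words += 1
    return float(np.exp(-log_lik / n_words))
```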
65 In order to study specifically the nonparametric nature of the HDP, we used the same experimental setup for both models, except that in LDA we had to vary the number of topics used between 10 and 120, while the HDP obtained posterior samples over this automatically. [sent-191, score-0.271]
66 Further, the posterior over the number of topics used by HDP is consistent with this range. [sent-194, score-0.163]
67 Notice however that the HDP infers the number of topics automatically, while LDA requires some method of model selection. [sent-195, score-0.138]
68 We applied HDP mixture models to a dataset of NIPS 1-12 papers organized into sections. [sent-197, score-0.164]
69 After removing standard stop words and words appearing fewer than 10 times, we are left with 476441 words in total and a vocabulary size of 5699. [sent-202, score-0.105]
70 Right: perplexity of test VS documents given LT, AA and AP documents respectively, using M3, averaged over 5 runs. [sent-209, score-0.456]
71 The aim is to show improvements to the modeling of a section when the model is also given documents from another section. [sent-211, score-0.155]
72 The training set always consists of 80 documents from the other section (so that larger sections like AA (algorithms and architectures) do not get an unfair advantage), plus between 0 and 80 documents from VS. [sent-213, score-0.369]
73 The first model (M1) simply ignores documents from the additional section, and uses an HDP to model the VS documents. [sent-216, score-0.155]
74 The second model (M2) uses an HDP mixture model with one group per document, lumping together training documents from both sections. [sent-218, score-0.377]
75 The third model (M3) takes a hierarchical approach and models each section separately using an HDP mixture model, and places another DP prior over the common base distributions for both submodels. [sent-219, score-0.359]
76 As we see in Figure 3 left, the more hierarchical approach of M3 performs best, with perplexity decreasing drastically with modest values of N , while M1 does worst for small N . [sent-220, score-0.276]
77 This is because M2 lumps all the documents together, so it is not able to differentiate between the sections; as a result, the influence of documents from the other section is unduly strong. [sent-222, score-0.31]
78 This result confirms that the hierarchical approach to the transfer-of-learning problem is a useful one, as it allows useful information to be transfered to a new task (here the modeling of a new section), without the data from the previous tasks overwhelming those in the new task. [sent-223, score-0.13]
79 We also looked at the performance of the M3 model on VS documents given specific other sections. [sent-224, score-0.155]
80 In Table 1 we show the topics pertinent to VS discovered by the M3 model. [sent-227, score-0.138]
81 First we trained the model on all documents from the other section. [sent-228, score-0.155]
82 Then, keeping the assignments of words to topics fixed in the other section, we introduced VS documents; the model decided to reuse some topics from the other section, as well as to create new ones. [sent-229, score-0.466]
83 The topics reused by VS documents conform to our expectations of the overlap between VS and other sections. [sent-230, score-0.293]
84 As sections differ over the years, we assigned by hand the various sections to one of 9 prototypical sections: CS, NS, LT, AA, IM, SP, VS, AP and CN. [sent-232, score-0.118]
85 Shown are the two topics with the most VS words, but also with significant numbers of words from the other section. [sent-239, score-0.173]
86 The infinite hidden Markov model (iHMM) is a nonparametric model for sequential data where the number of hidden states is open-ended and inferred from data [11]. [sent-241, score-0.108]
87 In [10] we show that the HDP framework can be applied to obtain a cleaner formulation of the iHMM, providing effective new inference algorithms and potentially hierarchical extensions. [sent-242, score-0.13]
88 In fact the original iHMM paper [11] served as inspiration for this work and first coined the term “hierarchical Dirichlet processes”—though their model is not hierarchical in the Bayesian sense, involving priors upon priors, but is rather a set of coupled urn models similar to the CRF. [sent-243, score-0.196]
89 6 Discussion We have described the hierarchical Dirichlet process, a hierarchical, nonparametric model for clustering problems involving multiple groups of data. [sent-252, score-0.395]
90 HDP mixture models are able to determine the appropriate number of mixture components automatically, and they share statistical strength across groups by allowing the groups to share mixture components. [sent-253, score-0.59]
91 We have described the HDP as a distribution over distributions, using both the stick-breaking construction and the Chinese restaurant franchise. [sent-254, score-0.19]
92 In [10] we also describe a fourth perspective based on the infinite limit of finite mixture models, and give detail for how the HDP can be applied to the iHMM. [sent-255, score-0.164]
93 Direct extensions of the model include use of nonparametric priors other than the DP, building higher level hierarchies as in our NIPS experiment, as well as hierarchical extensions to the iHMM. [sent-256, score-0.238]
94 We describe an inference procedure for the HDP mixture model based on Gibbs sampling t, k and θ given data items x. [sent-258, score-0.164]
95 Let f (·|θ) and h be the density functions for F (θ) and H respectively, let n_{jt}^{−i} be the number of tji′ 's (i′ ≠ i) equal to t, and let m_k^{−jt} be the number of kj′t′ 's equal to k excluding kjt . [sent-260, score-0.636]
96 The conditional probability for tji given the other variables is proportional to the product of a prior and likelihood term. [sent-261, score-0.172]
97 The prior term is given by (7) where, by exchangeability, we can take tji to be the last one assigned. [sent-262, score-0.172]
98 Similarly the conditional distributions for kjt and θk are: p(kjt = k | t, k\kjt , θ, x) ∝ γ Π_{i: tji = t} f (xji |θk ) if k = k^new, and ∝ m_k^{−jt} Π_{i: tji = t} f (xji |θk ) if k is currently used. [sent-265, score-0.348]
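The prior-times-likelihood structure described above can be sketched as a single Gibbs step for a table assignment tji. This is a hedged reconstruction, not the paper's appendix: it assumes a conjugate model so that the marginal likelihood of xji under a brand-new dish (θ ∼ H) is available as a function, and all variable names are illustrative.

```python
import numpy as np

def resample_tji(x_ji, n_jt, table_dish, theta, m_k, alpha0, gamma, f, f_marg_h, rng=None):
    """One Gibbs step for a table assignment t_ji in restaurant j.

    n_jt:        occupancy of each existing table (customer i already removed).
    table_dish:  dish index k_jt served at each existing table.
    theta, m_k:  current dish parameters and table counts per dish (all groups).
    f(x, th):    likelihood of x under dish parameter th.
    f_marg_h(x): marginal likelihood of x under a new dish theta ~ H.
    Returns an index; the last index means "open a new table".
    """
    rng = np.random.default_rng() if rng is None else rng
    # Likelihood of sitting at each existing table = likelihood under its dish.
    lik_existing = np.array([f(x_ji, theta[k]) for k in table_dish])
    # Likelihood of a new table: mix over existing dishes and a brand-new dish.
    m_k = np.asarray(m_k, dtype=float)
    w = np.append(m_k, gamma) / (m_k.sum() + gamma)
    lik_new = w[:-1] @ np.array([f(x_ji, th) for th in theta]) + w[-1] * f_marg_h(x_ji)
    # Prior term (7) times likelihood; in practice one would work in log space.
    probs = np.append(np.asarray(n_jt, dtype=float) * lik_existing, alpha0 * lik_new)
    return int(rng.choice(len(probs), p=probs / probs.sum()))
```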
99 Markov chain sampling methods for Dirichlet process mixture models. [sent-306, score-0.197]
100 Hierarchical topic models and the nested Chinese restaurant process. [sent-322, score-0.197]
wordName wordTfidf (topN-words)
[('hdp', 0.609), ('dp', 0.229), ('kjt', 0.198), ('dirichlet', 0.172), ('tji', 0.172), ('restaurant', 0.165), ('gj', 0.165), ('mixture', 0.164), ('documents', 0.155), ('perplexity', 0.146), ('chinese', 0.143), ('topics', 0.138), ('vs', 0.138), ('customers', 0.134), ('xji', 0.132), ('hierarchical', 0.13), ('ji', 0.116), ('ihmm', 0.115), ('nonparametric', 0.108), ('jt', 0.094), ('vb', 0.086), ('groups', 0.079), ('lda', 0.072), ('aa', 0.07), ('beal', 0.066), ('crp', 0.066), ('dps', 0.066), ('njt', 0.066), ('ap', 0.063), ('jk', 0.061), ('draw', 0.06), ('sections', 0.059), ('group', 0.058), ('crf', 0.055), ('lt', 0.054), ('proportions', 0.052), ('mixing', 0.052), ('alice', 0.049), ('exchangeability', 0.049), ('jnj', 0.049), ('sit', 0.049), ('bayesian', 0.049), ('clustering', 0.049), ('customer', 0.049), ('share', 0.044), ('shared', 0.042), ('clusters', 0.042), ('base', 0.04), ('across', 0.039), ('atoms', 0.039), ('sharing', 0.039), ('served', 0.037), ('ar', 0.036), ('concentration', 0.036), ('tables', 0.036), ('words', 0.035), ('marginalize', 0.034), ('grouped', 0.034), ('nips', 0.034), ('table', 0.033), ('countably', 0.033), ('dish', 0.033), ('exchangeable', 0.033), ('franchise', 0.033), ('jtji', 0.033), ('metaphor', 0.033), ('nematode', 0.033), ('sits', 0.033), ('tnew', 0.033), ('blei', 0.033), ('process', 0.033), ('ml', 0.032), ('topic', 0.032), ('layer', 0.032), ('mk', 0.031), ('gibbs', 0.031), ('corpora', 0.03), ('involving', 0.029), ('sentences', 0.029), ('beta', 0.029), ('restaurants', 0.029), ('knew', 0.029), ('distinct', 0.028), ('draws', 0.027), ('partitions', 0.026), ('processes', 0.026), ('partition', 0.026), ('construction', 0.025), ('bars', 0.025), ('distributions', 0.025), ('posterior', 0.025), ('sat', 0.024), ('abstracts', 0.024), ('strength', 0.024), ('cluster', 0.024), ('nj', 0.024), ('notion', 0.023), ('nite', 0.023), ('teh', 0.023), ('ns', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 169 nips-2004-Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes
Author: Yee W. Teh, Michael I. Jordan, Matthew J. Beal, David M. Blei
Abstract: We propose the hierarchical Dirichlet process (HDP), a nonparametric Bayesian model for clustering problems involving multiple groups of data. Each group of data is modeled with a mixture, with the number of components being open-ended and inferred automatically by the model. Further, components can be shared across groups, allowing dependencies across groups to be modeled effectively as well as conferring generalization to new groups. Such grouped clustering problems occur often in practice, e.g. in the problem of topic discovery in document corpora. We report experimental results on three text corpora showing the effective and superior performance of the HDP over previous models.
2 0.15651901 77 nips-2004-Hierarchical Clustering of a Mixture Model
Author: Jacob Goldberger, Sam T. Roweis
Abstract: In this paper we propose an efficient algorithm for reducing a large mixture of Gaussians into a smaller mixture while still preserving the component structure of the original model; this is achieved by clustering (grouping) the components. The method minimizes a new, easily computed distance measure between two Gaussian mixtures that can be motivated from a suitable stochastic model and the iterations of the algorithm use only the model parameters, avoiding the need for explicit resampling of datapoints. We demonstrate the method by performing hierarchical clustering of scenery images and handwritten digits. 1
3 0.13175412 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss
Author: Liam Paninski
Abstract: We develop a family of upper and lower bounds on the worst-case expected KL loss for estimating a discrete distribution on a finite number m of points, given N i.i.d. samples. Our upper bounds are approximationtheoretic, similar to recent bounds for estimating discrete entropy; the lower bounds are Bayesian, based on averages of the KL loss under Dirichlet distributions. The upper bounds are convex in their parameters and thus can be minimized by descent methods to provide estimators with low worst-case error; the lower bounds are indexed by a one-dimensional parameter and are thus easily maximized. Asymptotic analysis of the bounds demonstrates the uniform KL-consistency of a wide class of estimators as c = N/m → ∞ (no matter how slowly), and shows that no estimator is consistent for c bounded (in contrast to entropy estimation). Moreover, the bounds are asymptotically tight as c → 0 or ∞, and are shown numerically to be tight within a factor of two for all c. Finally, in the sparse-data limit c → 0, we find that the Dirichlet-Bayes (add-constant) estimator with parameter scaling like −c log(c) optimizes both the upper and lower bounds, suggesting an optimal choice of the “add-constant” parameter in this regime.
4 0.10732338 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection
Author: Jian Zhang, Zoubin Ghahramani, Yiming Yang
Abstract: In this paper we propose a probabilistic model for online document clustering. We use non-parametric Dirichlet process prior to model the growing number of clusters, and use a prior of general English language model as the base distribution to handle the generation of novel clusters. Furthermore, cluster uncertainty is modeled with a Bayesian Dirichletmultinomial distribution. We use empirical Bayes method to estimate hyperparameters based on a historical dataset. Our probabilistic model is applied to the novelty detection task in Topic Detection and Tracking (TDT) and compared with existing approaches in the literature. 1
5 0.097905397 87 nips-2004-Integrating Topics and Syntax
Author: Thomas L. Griffiths, Mark Steyvers, David M. Blei, Joshua B. Tenenbaum
Abstract: Statistical approaches to language learning typically focus on either short-range syntactic dependencies or long-range semantic dependencies between words. We present a generative model that uses both kinds of dependencies, and can be used to simultaneously find syntactic classes and semantic topics despite having no representation of syntax or semantics beyond statistical dependency. This model is competitive on tasks like part-of-speech tagging and document classification with models that exclusively use short- and long-range dependencies respectively. 1
6 0.092077948 163 nips-2004-Semi-parametric Exponential Family PCA
7 0.087714665 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
8 0.087564625 62 nips-2004-Euclidean Embedding of Co-Occurrence Data
9 0.074099064 145 nips-2004-Parametric Embedding for Class Visualization
10 0.065173581 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning
11 0.064488895 197 nips-2004-Two-Dimensional Linear Discriminant Analysis
12 0.061063491 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution
13 0.059274744 161 nips-2004-Self-Tuning Spectral Clustering
14 0.057230987 98 nips-2004-Learning Gaussian Process Kernels via Hierarchical Bayes
15 0.052837167 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning
16 0.052170258 82 nips-2004-Incremental Algorithms for Hierarchical Classification
17 0.05162216 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition
18 0.050247464 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling
19 0.048940077 66 nips-2004-Exponential Family Harmoniums with an Application to Information Retrieval
20 0.04794715 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment
topicId topicWeight
[(0, -0.161), (1, 0.016), (2, -0.038), (3, -0.092), (4, -0.035), (5, 0.052), (6, -0.125), (7, 0.173), (8, 0.045), (9, 0.078), (10, 0.05), (11, 0.076), (12, -0.002), (13, -0.08), (14, 0.056), (15, -0.127), (16, 0.036), (17, 0.017), (18, -0.008), (19, -0.01), (20, 0.011), (21, -0.195), (22, 0.004), (23, 0.0), (24, -0.012), (25, -0.115), (26, -0.05), (27, -0.125), (28, 0.018), (29, 0.001), (30, 0.001), (31, -0.216), (32, 0.125), (33, -0.019), (34, -0.02), (35, -0.02), (36, -0.008), (37, -0.037), (38, -0.154), (39, -0.037), (40, 0.01), (41, -0.01), (42, -0.136), (43, 0.029), (44, -0.037), (45, 0.036), (46, -0.049), (47, 0.038), (48, -0.07), (49, 0.062)]
simIndex simValue paperId paperTitle
same-paper 1 0.9419021 169 nips-2004-Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes
Author: Yee W. Teh, Michael I. Jordan, Matthew J. Beal, David M. Blei
Abstract: We propose the hierarchical Dirichlet process (HDP), a nonparametric Bayesian model for clustering problems involving multiple groups of data. Each group of data is modeled with a mixture, with the number of components being open-ended and inferred automatically by the model. Further, components can be shared across groups, allowing dependencies across groups to be modeled effectively as well as conferring generalization to new groups. Such grouped clustering problems occur often in practice, e.g. in the problem of topic discovery in document corpora. We report experimental results on three text corpora showing the effective and superior performance of the HDP over previous models.
2 0.69968933 77 nips-2004-Hierarchical Clustering of a Mixture Model
Author: Jacob Goldberger, Sam T. Roweis
Abstract: In this paper we propose an efficient algorithm for reducing a large mixture of Gaussians into a smaller mixture while still preserving the component structure of the original model; this is achieved by clustering (grouping) the components. The method minimizes a new, easily computed distance measure between two Gaussian mixtures that can be motivated from a suitable stochastic model and the iterations of the algorithm use only the model parameters, avoiding the need for explicit resampling of datapoints. We demonstrate the method by performing hierarchical clustering of scenery images and handwritten digits. 1
3 0.59340703 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection
Author: Jian Zhang, Zoubin Ghahramani, Yiming Yang
Abstract: In this paper we propose a probabilistic model for online document clustering. We use non-parametric Dirichlet process prior to model the growing number of clusters, and use a prior of general English language model as the base distribution to handle the generation of novel clusters. Furthermore, cluster uncertainty is modeled with a Bayesian Dirichletmultinomial distribution. We use empirical Bayes method to estimate hyperparameters based on a historical dataset. Our probabilistic model is applied to the novelty detection task in Topic Detection and Tracking (TDT) and compared with existing approaches in the literature. 1
4 0.53958875 87 nips-2004-Integrating Topics and Syntax
Author: Thomas L. Griffiths, Mark Steyvers, David M. Blei, Joshua B. Tenenbaum
Abstract: Statistical approaches to language learning typically focus on either short-range syntactic dependencies or long-range semantic dependencies between words. We present a generative model that uses both kinds of dependencies, and can be used to simultaneously find syntactic classes and semantic topics despite having no representation of syntax or semantics beyond statistical dependency. This model is competitive on tasks like part-of-speech tagging and document classification with models that exclusively use short- and long-range dependencies respectively. 1
5 0.53276533 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss
Author: Liam Paninski
Abstract: We develop a family of upper and lower bounds on the worst-case expected KL loss for estimating a discrete distribution on a finite number m of points, given N i.i.d. samples. Our upper bounds are approximationtheoretic, similar to recent bounds for estimating discrete entropy; the lower bounds are Bayesian, based on averages of the KL loss under Dirichlet distributions. The upper bounds are convex in their parameters and thus can be minimized by descent methods to provide estimators with low worst-case error; the lower bounds are indexed by a one-dimensional parameter and are thus easily maximized. Asymptotic analysis of the bounds demonstrates the uniform KL-consistency of a wide class of estimators as c = N/m → ∞ (no matter how slowly), and shows that no estimator is consistent for c bounded (in contrast to entropy estimation). Moreover, the bounds are asymptotically tight as c → 0 or ∞, and are shown numerically to be tight within a factor of two for all c. Finally, in the sparse-data limit c → 0, we find that the Dirichlet-Bayes (add-constant) estimator with parameter scaling like −c log(c) optimizes both the upper and lower bounds, suggesting an optimal choice of the “add-constant” parameter in this regime.
6 0.47614056 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
7 0.39558521 158 nips-2004-Sampling Methods for Unsupervised Learning
8 0.35693517 163 nips-2004-Semi-parametric Exponential Family PCA
9 0.34327453 62 nips-2004-Euclidean Embedding of Co-Occurrence Data
10 0.30918431 66 nips-2004-Exponential Family Harmoniums with an Application to Information Retrieval
11 0.30908835 122 nips-2004-Modelling Uncertainty in the Game of Go
13 0.30247906 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning
14 0.30153164 145 nips-2004-Parametric Embedding for Class Visualization
15 0.28998476 107 nips-2004-Making Latin Manuscripts Searchable using gHMM's
16 0.28717965 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling
17 0.28032553 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment
18 0.2800397 130 nips-2004-Newscast EM
19 0.27851397 82 nips-2004-Incremental Algorithms for Hierarchical Classification
20 0.27462432 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images
topicId topicWeight
[(2, 0.355), (13, 0.064), (15, 0.096), (17, 0.01), (26, 0.039), (31, 0.07), (32, 0.011), (33, 0.152), (35, 0.02), (39, 0.024), (50, 0.041)]
simIndex simValue paperId paperTitle
same-paper 1 0.75110328 169 nips-2004-Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes
Author: Yee W. Teh, Michael I. Jordan, Matthew J. Beal, David M. Blei
Abstract: We propose the hierarchical Dirichlet process (HDP), a nonparametric Bayesian model for clustering problems involving multiple groups of data. Each group of data is modeled with a mixture, with the number of components being open-ended and inferred automatically by the model. Further, components can be shared across groups, allowing dependencies across groups to be modeled effectively as well as conferring generalization to new groups. Such grouped clustering problems occur often in practice, e.g. in the problem of topic discovery in document corpora. We report experimental results on three text corpora showing the effective and superior performance of the HDP over previous models.
2 0.71684569 159 nips-2004-Schema Learning: Experience-Based Construction of Predictive Action Models
Author: Michael P. Holmes, Charles Jr.
Abstract: Schema learning is a way to discover probabilistic, constructivist, predictive action models (schemas) from experience. It includes methods for finding and using hidden state to make predictions more accurate. We extend the original schema mechanism [1] to handle arbitrary discrete-valued sensors, improve the original learning criteria to handle POMDP domains, and better maintain hidden state by using schema predictions. These extensions show large improvement over the original schema mechanism in several rewardless POMDPs, and achieve very low prediction error in a difficult speech modeling task. Further, we compare extended schema learning to the recently introduced predictive state representations [2], and find their predictions of next-step action effects to be approximately equal in accuracy. This work lays the foundation for a schema-based system of integrated learning and planning. 1
3 0.65069515 182 nips-2004-Synergistic Face Detection and Pose Estimation with Energy-Based Models
Author: Margarita Osadchy, Matthew L. Miller, Yann L. Cun
Abstract: We describe a novel method for real-time, simultaneous multi-view face detection and facial pose estimation. The method employs a convolutional network to map face images to points on a manifold, parametrized by pose, and non-face images to points far from that manifold. This network is trained by optimizing a loss function of three variables: image, pose, and face/non-face label. We test the resulting system, in a single configuration, on three standard data sets – one for frontal pose, one for rotated faces, and one for profiles – and find that its performance on each set is comparable to previous multi-view face detectors that can only handle one form of pose variation. We also show experimentally that the system’s accuracy on both face detection and pose estimation is improved by training for the two tasks together.
4 0.50800979 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection
Author: Jian Zhang, Zoubin Ghahramani, Yiming Yang
Abstract: In this paper we propose a probabilistic model for online document clustering. We use non-parametric Dirichlet process prior to model the growing number of clusters, and use a prior of general English language model as the base distribution to handle the generation of novel clusters. Furthermore, cluster uncertainty is modeled with a Bayesian Dirichletmultinomial distribution. We use empirical Bayes method to estimate hyperparameters based on a historical dataset. Our probabilistic model is applied to the novelty detection task in Topic Detection and Tracking (TDT) and compared with existing approaches in the literature. 1
5 0.50392956 66 nips-2004-Exponential Family Harmoniums with an Application to Information Retrieval
Author: Max Welling, Michal Rosen-zvi, Geoffrey E. Hinton
Abstract: Directed graphical models with one layer of observed random variables and one or more layers of hidden random variables have been the dominant modelling paradigm in many research fields. Although this approach has met with considerable success, the causal semantics of these models can make it difficult to infer the posterior distribution over the hidden variables. In this paper we propose an alternative two-layer model based on exponential family distributions and the semantics of undirected models. Inference in these “exponential family harmoniums” is fast while learning is performed by minimizing contrastive divergence. A member of this family is then studied as an alternative probabilistic model for latent semantic indexing. In experiments it is shown that they perform well on document retrieval tasks and provide an elegant solution to searching with keywords.
6 0.49952632 69 nips-2004-Fast Rates to Bayes for Kernel Machines
7 0.49508265 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
8 0.49317899 164 nips-2004-Semi-supervised Learning by Entropy Minimization
9 0.49289301 23 nips-2004-Analysis of a greedy active learning strategy
10 0.4927156 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
11 0.4922556 98 nips-2004-Learning Gaussian Process Kernels via Hierarchical Bayes
12 0.49198285 77 nips-2004-Hierarchical Clustering of a Mixture Model
13 0.49114665 131 nips-2004-Non-Local Manifold Tangent Learning
14 0.49054527 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling
15 0.49031442 137 nips-2004-On the Adaptive Properties of Decision Trees
16 0.49003133 163 nips-2004-Semi-parametric Exponential Family PCA
17 0.48885006 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
18 0.48880631 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve
19 0.48850444 127 nips-2004-Neighbourhood Components Analysis
20 0.48824936 158 nips-2004-Sampling Methods for Unsupervised Learning