nips nips2010 nips2010-276 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Zoubin Ghahramani, Michael I. Jordan, Ryan P. Adams
Abstract: Many data are naturally modeled by an unobserved hierarchical structure. In this paper we propose a flexible nonparametric prior over unknown data hierarchies. The approach uses nested stick-breaking processes to allow for trees of unbounded width and depth, where data can live at any node and are infinitely exchangeable. One can view our model as providing infinite mixtures where the components have a dependency structure corresponding to an evolutionary diffusion down a tree. By using a stick-breaking approach, we can apply Markov chain Monte Carlo methods based on slice sampling to perform Bayesian inference and simulate from the posterior distribution on trees. We apply our method to hierarchical clustering of images and topic modeling of text data. 1
Reference: text
sentIndex sentText sentNum sentScore
1 The approach uses nested stick-breaking processes to allow for trees of unbounded width and depth, where data can live at any node and are infinitely exchangeable. [sent-7, score-0.665]
2 By using a stick-breaking approach, we can apply Markov chain Monte Carlo methods based on slice sampling to perform Bayesian inference and simulate from the posterior distribution on trees. [sent-9, score-0.486]
3 We apply our method to hierarchical clustering of images and topic modeling of text data. [sent-10, score-0.284]
4 We take a direct nonparametric Bayesian approach, constructing a prior on tree-structured partitions of data that provides for unbounded width and depth while still allowing tractable posterior inference. [sent-14, score-0.422]
5 Probabilistic approaches to latent hierarchies have been explored in a variety of domains. [sent-15, score-0.135]
6 Unsupervised learning of densities and nested mixtures has received particular attention via finite-depth trees [1], diffusive branching processes [2] and hierarchical clustering [3, 4]. [sent-16, score-0.404]
7 Bayesian approaches to learning latent hierarchies have also been useful for semi-supervised learning [5], relational learning [6] and multi-task learning [7]. [sent-17, score-0.135]
8 These hierarchies have unbounded width and depth and the data may live at internal nodes on the tree. [sent-20, score-0.52]
9 As the process is defined in terms of a distribution over probability measures and not as a distribution over data per se, data from this model are infinitely exchangeable; the probability of any set of data is not dependent on its ordering. [sent-21, score-0.159]
10 Unlike other infinitely exchangeable models [2, 4], a pseudo-time process is not required to describe the distribution on trees and it can be understood in terms of other popular Bayesian nonparametric models. [sent-22, score-0.347]
11 Our new approach allows the components of an infinite mixture model to be interpreted as part of a diffusive evolutionary process. [sent-23, score-0.158]
12 Figure 1: (a) Dirichlet process stick-breaking procedure, with a linear partitioning; (b) tree-structured stick breaking. [sent-30, score-0.498]
13 Two desiderata guide the construction: 1) data (e.g., “prototypes”) should be able to live at internal nodes in the tree, and 2) as the ancestor/descendant relationships are not known a priori, the data should be infinitely exchangeable. [sent-34, score-0.282]
14 2 A Tree-Structured Stick-Breaking Process Stick-breaking processes based on the beta distribution have played a prominent role in the development of Bayesian nonparametric methods, most significantly with the constructive approach to the Dirichlet process (DP) due to Sethuraman [10]. [sent-35, score-0.321]
15 A random probability measure G can be drawn from a DP with base measure αH using a sequence of beta variates via: G = Σ_{i=1}^∞ πi δ_{θi}, where πi = νi ∏_{i'<i} (1 − νi'), π1 = ν1, θi ∼ H, and νi ∼ Be(1, α). (1) [sent-36, score-0.143]
16 We can view this as taking a stick of unit length and breaking it at a random location. [sent-37, score-0.186]
17 We call the left side of the stick π1 and then break the right side at a new place, calling the left side of this new break π2 . [sent-38, score-0.208]
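To make Eq. (1) concrete, the following is a minimal NumPy sketch of the truncated stick-breaking construction; the truncation level K and the standard-normal base measure are illustrative assumptions rather than anything from the paper.

```python
import numpy as np

def dp_stick_breaking(alpha, K, base_sampler, rng=np.random.default_rng(0)):
    """Truncated stick-breaking draw from a DP: G = sum_i pi_i * delta_{theta_i}."""
    nu = rng.beta(1.0, alpha, size=K)                        # nu_i ~ Be(1, alpha)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - nu)[:-1]))
    pi = nu * remaining                                      # pi_i = nu_i * prod_{i'<i} (1 - nu_{i'})
    theta = np.array([base_sampler(rng) for _ in range(K)])  # atoms theta_i ~ H
    return pi, theta

# Example usage: fifty atoms drawn from a standard-normal base measure H.
pi, theta = dp_stick_breaking(alpha=2.0, K=50, base_sampler=lambda r: r.normal())
```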
18 The distribution over the sequence (π1 , π2 , · · · ) is a case of the GEM distribution [11], which also includes the Pitman-Yor process [12]. [sent-41, score-0.159]
19 The GEM construction provides a distribution over infinite partitions of the unit interval, with natural numbers as the index set as in Fig. [sent-47, score-0.153]
20 In this paper, we extend this idea to create a distribution over infinite partitions that also possess a hierarchical graph topology. [sent-49, score-0.246]
21 Borrowing notation from the Pólya tree (PT) construction [13], let ε = (ε1, ε2, · · · , εK) denote a length-K sequence of positive integers, i.e., εk ∈ {1, 2, · · · }. [sent-51, score-0.304]
22 These strings will index the nodes in the tree and |ε| will then be the depth of node ε. [sent-55, score-0.673]
23 The first has beta variates νε ∼ Be(1, α(|ε|)) which determine the size of a given node’s partition as a function of depth. [sent-58, score-0.404]
24 The second has beta variates ψε ∼ Be(1, γ), which determine the branching probabilities. [sent-59, score-0.143]
25 When viewing these strings as identifying nodes on a tree, {εi : i ∈ {1, 2, · · · }} are the children of ε and {ε' : ε' ≺ ε} are the ancestors of ε. [sent-62, score-0.323]
26 Eq. (2) can be seen as products of several decisions on how to allocate mass to nodes and branches in the tree: the {ϕε} determine the probability of a particular sequence of children and the νε and (1−νε) terms determine the proportion of mass allotted to ε versus nodes that are descendants of ε. [sent-64, score-0.301]
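The sketch below computes the node masses implied by interleaving the ν- and ψ-sticks for a depth- and branching-truncated tree; the truncation and the depth-decaying α(·) used in the last line are illustrative assumptions.

```python
import numpy as np

def tssb_weights(alpha, gamma, max_depth, max_branch, rng=np.random.default_rng(0)):
    """Node masses for a truncated tree-structured stick-breaking process.
    Nodes are tuples of child indices; () is the root; alpha maps depth to the
    concentration of the nu-sticks."""
    pi = {}

    def recurse(eps, mass):
        nu = rng.beta(1.0, alpha(len(eps)))          # nu_eps ~ Be(1, alpha(|eps|)): how much mass stops here
        pi[eps] = mass * nu                          # mass retained by node eps
        if len(eps) == max_depth:
            return
        descend_mass = mass * (1.0 - nu)             # mass passed to descendants of eps
        psi = rng.beta(1.0, gamma, size=max_branch)  # psi_{eps i} ~ Be(1, gamma): branching sticks
        stick = 1.0
        for i in range(max_branch):
            phi = psi[i] * stick                     # phi_{eps i} = psi_{eps i} * prod_{j<i} (1 - psi_{eps j})
            recurse(eps + (i,), descend_mass * phi)
            stick *= 1.0 - psi[i]

    recurse((), 1.0)
    return pi

weights = tssb_weights(alpha=lambda d: 1.0 * 0.5 ** d, gamma=1.0, max_depth=3, max_branch=3)
```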
27 Note that some of the sampled trees have represented nodes with no data associated with them and that the branch ordering does not correspond to a size-biased permutation. [sent-67, score-0.264]
28 An Urn-based View When a Bayesian nonparametric model induces partitions over data, it is sometimes possible to construct a Blackwell-MacQueen [15] type urn scheme that corresponds to sequentially generating data, while integrating out the underlying random measure. [sent-74, score-0.327]
29 The urn process can be seen as a path-reinforcing Bernoulli trip down the tree where each datum starts at the root and descends into children until it stops at some node. [sent-77, score-0.726]
30 The first datum lives at the root node with probability 1/(α(0)+1), otherwise it descends and instantiates a new child. [sent-78, score-0.538]
31 It stays at this new child with probability 1/(α(1)+1) or descends again and so on. [sent-79, score-0.233]
32 A later datum stays at node ε with probability (Nε+1)/(Nε+Nε·+α(|ε|)+1), where Nε is the number of previous data that stopped at ε and Nε· is the number of previous data that came down this path of the tree but did not stop at ε, i.e., that descended into the subtree beneath ε. [sent-80, score-0.565]
33 If a datum descends to ε but does not stop then it chooses which child to descend to according to a Chinese restaurant process where the previous customers are only those data who have also descended to this point. [sent-83, score-0.503]
34 That is, if it has reached node ε but will not stay there, it descends to existing child εi with probability (Nεi+Nεi·)/(Nε·+γ) and instantiates a new child with probability γ/(Nε·+γ). [sent-84, score-0.612]
35 Note that a node can be a part of a popular path without having any data of its own. [sent-86, score-0.234]
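A sketch of the urn scheme as a sequential generative routine is shown below; the tuple-based node representation, the depth-decaying α(·) in the usage line, and the bookkeeping details are illustrative choices, not the paper's code.

```python
import numpy as np
from collections import defaultdict

def tssb_urn(n_data, alpha, gamma, rng=np.random.default_rng(0)):
    """Assign data to tree nodes sequentially via the path-reinforcing urn scheme.
    Nodes are tuples of child indices; () is the root; alpha maps depth to a concentration."""
    stopped = defaultdict(int)      # N_eps: previous data that stopped at eps
    visits = defaultdict(int)       # N_eps + N_eps. : previous data whose path reached eps
    n_children = defaultdict(int)   # number of instantiated children of eps
    assignments = []
    for _ in range(n_data):
        eps = ()
        while True:
            # stop at eps with probability (N_eps + 1) / (N_eps + N_eps. + alpha(|eps|) + 1)
            if rng.random() < (stopped[eps] + 1.0) / (visits[eps] + alpha(len(eps)) + 1.0):
                break
            # otherwise descend: CRP over children among data that also descended past eps
            w = np.array([visits[eps + (i,)] for i in range(n_children[eps])] + [gamma])
            i = rng.choice(len(w), p=w / w.sum())
            if i == n_children[eps]:
                n_children[eps] += 1                 # new child, prob gamma / (N_eps. + gamma)
            eps = eps + (i,)
        assignments.append(eps)
        stopped[eps] += 1
        node = eps
        while True:                                  # reinforce every node on this datum's path
            visits[node] += 1
            if not node:
                break
            node = node[:-1]
    return assignments

paths = tssb_urn(n_data=20, alpha=lambda d: 1.0 * 0.5 ** d, gamma=0.2)
```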
36 Note that the branch ordering in a realization of the urn scheme will not necessarily be the same as that of the size-biased ordering [16] of the partitions in Fig. [sent-89, score-0.371]
37 Fig. 1b: the former is a tree over a finite set of data and the latter is over a random infinite partition. [sent-90, score-0.209]
38 The urn view allows us to compare this model to other priors on infinite trees. [sent-91, score-0.175]
39 One contribution of this model is that the data can live at internal nodes in the tree, but are nevertheless infinitely exchangeable. [sent-92, score-0.282]
40 The nested Chinese restaurant process (nCRP) [17] provides a distribution over trees of unbounded width and depth, but data correspond to paths of infinite length, requiring an additional distribution over depths that is not path-dependent. [sent-94, score-0.514]
41 The Pólya tree [13] uses a recursive stick-breaking process to specify a distribution over nested partitions in a binary tree; however, the data live at infinitely-deep leaf nodes. [sent-95, score-0.721]
42 The marginal distribution on the topology of a Dirichlet diffusion tree [2] (and the clustering variant of Kingman’s coalescent [4]) provides path-reinforcement and infinite exchangeability, however it requires a pseudo-time hazard process and data do not live at internal nodes. [sent-96, score-0.592]
43 Rather, the distribution over the parameters at node ε, denoted θε, should depend in an interesting way on those of its ancestors {θε' : ε' ≺ ε}. [sent-106, score-0.363]
44 Such a kernel captures the simple idea that the child’s parameters are noisy versions of the parent’s, as specified by the covariance matrix Λ, while η ensures that all parameters in the tree have a finite marginal variance. [sent-113, score-0.209]
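A minimal sketch of one such Gaussian parent-to-child diffusion is shown below; taking the child mean to be η times the parent and supplying Λ through a Cholesky-style factor are assumptions consistent with the stated roles of η and Λ (setting η = 1 recovers the "parent's value as the mean" choice used in the image experiment later in the text).

```python
import numpy as np

def diffuse_parameters(nodes, dim, eta, lam_chol, rng=np.random.default_rng(0)):
    """Draw node parameters by Gaussian diffusion down a tree.
    nodes: tuples of child indices, with () as the root; sorting by depth
    guarantees each parent is drawn before its children."""
    theta = {}
    for eps in sorted(nodes, key=len):
        noise = lam_chol @ rng.standard_normal(dim)             # perturbation with covariance Lambda
        parent_mean = eta * theta[eps[:-1]] if eps else np.zeros(dim)
        theta[eps] = parent_mean + noise                        # child is a noisy, shrunk copy of its parent
    return theta

nodes = [(), (0,), (1,), (0, 0), (0, 1)]
theta = diffuse_parameters(nodes, dim=2, eta=0.9, lam_chol=0.5 * np.eye(2))
```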
45 Chained Dirichlet-Multinomial Distributions If each datum is a set of counts over M discrete outcomes, as in many finite topic models, a multinomial model for f (x | θ) may be appropriate. [sent-116, score-0.324]
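The dump omits the chained construction itself; one plausible sketch, consistent with the name and with the concentration parameter κ mentioned later for this construction, centres each child's multinomial on its parent via a Dirichlet draw. The exact functional form here is an assumption for illustration, not necessarily the paper's.

```python
import numpy as np

def chained_dirichlet_multinomial(nodes, M, kappa, rng=np.random.default_rng(0)):
    """Sketch of a chained Dirichlet prior over per-node multinomials on M outcomes:
    the root gets a symmetric Dirichlet, and each child is Dirichlet-distributed
    around its parent's distribution with concentration kappa (assumed form)."""
    theta = {}
    for eps in sorted(nodes, key=len):              # parents before children
        base = np.full(M, 1.0 / M) if not eps else theta[eps[:-1]]
        theta[eps] = rng.dirichlet(kappa * base)
    return theta

theta = chained_dirichlet_multinomial([(), (0,), (0, 0)], M=5, kappa=50.0)
```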
46 In our case, one flexible approach would be to model the data at node ε with a distribution Gε as in Eq. [sent-122, score-0.282]
47 The hierarchical Dirichlet process (HDP) [18] provides a natural parent-to-child transition kernel for the tree-structured model, again with concentration parameter κ: Thdp(Gεi ← Gε) = DP(κGε). [sent-125, score-0.156]
48 4 Inference via Markov chain Monte Carlo We have so far defined a model for data that are generated from the parameters associated with the nodes of a random tree. [sent-133, score-0.185]
49 As in most complex probabilistic models, closed form inference is impossible and we instead generate posterior samples via Markov chain Monte Carlo (MCMC). [sent-135, score-0.133]
50 To operate efficiently over a variety of regimes without tuning, we use slice sampling [19] extensively. [sent-136, score-0.305]
51 The primary data structure in our Markov chain is the set of N strings describing the current assignments of data to nodes, which we denote {εn} for n = 1, . . . , N. [sent-138, score-0.177]
52 We also represent all ψ-sticks in the “hull” of the tree that contains the data: if at some node ε one of the N data paths passes through child εi, then we represent all the ψ-sticks in the set {ψεj : j ≤ i}. [sent-142, score-0.541]
53 To avoid this difficulty, we use a slice sampling approach that can be viewed as a combination of the Dirichlet slice sampler of Walker [20] and the retrospective sampler of Papaspiliopoulos and Roberts [21]. [sent-144, score-0.585]
54 An alternative method is to draw a uniform variate u on (0, 1) and break sticks until we know which πε the u fell into. [sent-146, score-0.232]
55 We would draw the sticks and parameters from the prior, as needed, conditioning on the state instantiated from any previous draws and with parent-to-child transitions enforcing the prior downwards in the tree. [sent-149, score-0.233]
56 This representation leads to a slice sampling scheme on u that does not require any tuning parameters. [sent-151, score-0.305]
57 To slice sample the assignment of the nth datum, currently assigned to εn, we initialize our slice sampling bounds to (0, 1). [sent-152, score-0.538]
58 We shrink the slice sampling bounds appropriately, depending on the comparison, until we find a u that satisfies the slice. [sent-156, score-0.305]
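A generic sketch of this tuning-free shrinkage move is given below; `node_of_u` and `loglik` are assumed helpers (the lazy stick-breaking map from u to a node, and the data likelihood at a node), and whereas the paper shrinks by comparing tree positions, this simplified version shrinks directly on u.

```python
import numpy as np

def slice_resample_assignment(x_n, u_cur, node_of_u, loglik, rng=np.random.default_rng(0)):
    """One shrinkage slice-sampling move for a single datum's node assignment.
    u_cur in (0, 1) maps to the currently assigned node under node_of_u."""
    log_slice = loglik(x_n, node_of_u(u_cur)) + np.log(rng.random())  # slice height under the current node
    lo, hi = 0.0, 1.0
    while True:
        u = lo + (hi - lo) * rng.random()             # propose uniformly within the current bounds
        node = node_of_u(u)                           # lazily instantiates sticks as needed
        if loglik(x_n, node) > log_slice:
            return u, node                            # u satisfies the slice: accept
        if u < u_cur:                                 # otherwise shrink the bound on the rejected side
            lo = u
        else:
            hi = u
```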
59 We iterate over each instantiated node and perform a Gibbs update of the ordering of its immediate children using its invariance under size-biased permutation (SBP) [16]. [sent-162, score-0.418]
60 We repeatedly draw without replacement from the discrete distribution implied by the weights and keep the ordering that results. [sent-164, score-0.165]
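A minimal sketch of drawing such a size-biased permutation of the child weights (illustrative only):

```python
import numpy as np

def size_biased_permutation(weights, rng=np.random.default_rng(0)):
    """Repeatedly pick an index with probability proportional to its weight among
    those not yet chosen; the resulting order is a size-biased permutation."""
    w = np.asarray(weights, dtype=float)
    remaining = list(range(len(w)))
    order = []
    while remaining:
        p = w[remaining] / w[remaining].sum()
        k = remaining[rng.choice(len(remaining), p=p)]
        order.append(k)
        remaining.remove(k)
    return order

print(size_biased_permutation([0.5, 0.3, 0.2]))
```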
61 We initialize the bounds of the slice sampler with the bounds of the top-hat prior. [sent-167, score-0.233]
62 Figure 3: These figures show a subset of the tree learned from the 50,000 CIFAR-100 images. [sent-168, score-0.209]
63 The top tree only shows nodes for which there were at least 250 images. [sent-169, score-0.326]
64 The ten shown at each node are those with the highest probability under the node’s distribution. [sent-170, score-0.234]
65 Selecting a Single Tree We have so far described a procedure for generating posterior samples from the tree structures and associated stick-breaking processes. [sent-173, score-0.274]
66 Following [17], we report a best single tree structure over the data by choosing the sample from our Markov chain that has the highest complete-data likelihood p({xn, εn} | {νε}, {ψε}, α0, λ, γ). [sent-175, score-0.277]
67 The prior over the parameters of a child node was Gaussian with its parent’s value as the mean. [sent-185, score-0.332]
68 To efficiently learn the node parameters, we used Hamiltonian (hybrid) Monte Carlo (HMC) [24], taking 25 leapfrog HMC steps, with a randomized step size. [sent-189, score-0.234]
69 We occasionally interleaved a slice sampling move for robustness. [sent-190, score-0.305]
70 Each node shows three aggregated statistics: the five most common author names, the five most common words, and a histogram over the years of proceedings. [sent-197, score-0.234]
71 Using Python on a single core of a modern workstation each MCMC iteration of the entire model (including slice sampled reassignment of all 50,000 images) requires approximately three minutes. [sent-201, score-0.233]
72 Fig. 3 represents a part of the tree with the best complete-data log likelihood after 4000 such iterations. [sent-203, score-0.209]
73 The tree provides a useful visualization of the data set, capturing broad variations in color at the higher levels of the tree, with lower branches varying in texture and shape. [sent-204, score-0.209]
74 A larger version of this tree is provided in the supplementary material. [sent-205, score-0.209]
75 6 Hierarchical Modeling of Document Topics We also used our approach in a bag-of-words topic model, applying it to 1740 papers from NIPS 1–12. [sent-206, score-0.145]
76 As in latent Dirichlet allocation (LDA) [25], we consider a topic to be a distribution over words and each document to be described by a distribution over topics. [sent-207, score-0.298]
77 In LDA, each document has a unique topic distribution. [sent-208, score-0.145]
78 In our model, however, each document lives at a node and that node has a unique topic distribution. [sent-209, score-0.613]
79 Thus multiple documents share a distribution over topics if they inhabit the same node. [sent-210, score-0.27]
80 Each node’s topic distribution is from a chained Dirichlet-multinomial as described in Section 3. [sent-211, score-0.274]
81 The topics each have symmetric Dirichlet priors over their word distributions. [sent-212, score-0.241]
82 This results in a different kind of topic model than that provided by the nested Chinese restaurant process. [sent-213, score-0.316]
83 In the nCRP, each node corresponds to a topic and documents are spread across infinitely-long paths down the tree. [sent-214, score-0.469]
84 a) Mean improvement in perplexity per word over Laplace-smoothed multinomial, as a function of topics (larger is better). [sent-226, score-0.307]
85 b) Best predictive perplexity per word for each fold (smaller is better). [sent-228, score-0.223]
86 An expanded version of this tree is provided in the supplementary material. [sent-230, score-0.209]
87 We varied the number of topics (up to 100) and evaluated the predictive perplexity of the held-out data using an empirical likelihood estimate taken from a mixture of multinomials (pseudo-documents of infinite length). [sent-236, score-0.21]
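A sketch of the per-word perplexity computation from such an empirical-likelihood estimate follows; the pseudo-document word distributions are assumed to have been sampled from the model beforehand, and the small smoothing constant is an illustrative choice.

```python
import numpy as np

def empirical_likelihood_perplexity(heldout_counts, pseudo_word_dists):
    """Predictive perplexity per word: heldout_counts is an (N, V) matrix of word
    counts and pseudo_word_dists an (S, V) matrix of word distributions implied by
    S pseudo-documents sampled from the trained model."""
    log_p = np.log(pseudo_word_dists + 1e-12)                 # (S, V)
    total_ll, total_words = 0.0, 0
    for counts in heldout_counts:
        doc_ll = counts @ log_p.T                             # log-likelihood under each pseudo-document
        total_ll += np.logaddexp.reduce(doc_ll) - np.log(len(doc_ll))   # log of the averaged (mixture) likelihood
        total_words += counts.sum()
    return float(np.exp(-total_ll / total_words))
```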
88 This improvement appears to be due to the constraints on possible topic distributions that are imposed by the diffusion. [sent-241, score-0.145]
89 In absolute terms, more topics did not appear to improve predictive performance for LDA or the tree-structured model. [sent-243, score-0.18]
90 Both models performed best with fewer than fifty topics and the best tree model outperformed the best LDA model on all folds, as shown in Fig. [sent-244, score-0.341]
91 The MCMC inference procedure we used to train our model was as follows: first, we ran Gibbs sampling of a standard LDA topic model for 1000 iterations. [sent-246, score-0.217]
92 We then burned in the tree inference for 500 iterations with fixed word-topic associations. [sent-247, score-0.29]
93 For the comparison, we burned in LDA for 1000 iterations and then drew 5000 samples from the posterior [27]. [sent-249, score-0.146]
94 The mixing of the topic model seems to be somewhat sensitive to the initialization of the κ parameter in the chained Dirichlet-multinomial and we initialized this parameter to be the same as the number of topics. [sent-251, score-0.226]
95 Our approach is novel in that it combines infinite exchangeability with a representation that allows data to live at internal nodes on the tree, without a hazard rate process. [sent-253, score-0.282]
96 In this light, our notion of evolutionary diffusion down a tree sits within the larger class of models that construct dependencies between distributions on random measures [28, 29, 18]. [sent-256, score-0.323]
97 Learning stick-figure models using nonparametric Bayesian priors over trees. [sent-296, score-0.137]
98 The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. [sent-332, score-0.471]
99 Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models. [sent-350, score-0.224]
100 80 million tiny images: A large data set for nonparametric object and scene recognition. [sent-354, score-0.141]
wordName wordTfidf (topN-words)
[('dirichlet', 0.248), ('node', 0.234), ('slice', 0.233), ('tree', 0.209), ('lda', 0.148), ('topic', 0.145), ('descends', 0.135), ('sticks', 0.135), ('topics', 0.132), ('urn', 0.13), ('datum', 0.122), ('nodes', 0.117), ('live', 0.115), ('stick', 0.112), ('perplexity', 0.111), ('gem', 0.108), ('tssb', 0.108), ('partitions', 0.105), ('child', 0.098), ('lya', 0.095), ('hierarchical', 0.093), ('nitely', 0.092), ('nonparametric', 0.092), ('documents', 0.09), ('fty', 0.087), ('nested', 0.086), ('restaurant', 0.085), ('ancestors', 0.081), ('sejnowski', 0.081), ('burned', 0.081), ('chained', 0.081), ('giles', 0.081), ('ncrp', 0.081), ('umax', 0.081), ('umin', 0.081), ('trees', 0.079), ('hierarchies', 0.078), ('mcmc', 0.075), ('breaking', 0.074), ('chinese', 0.072), ('beta', 0.072), ('sampling', 0.072), ('jim', 0.071), ('koch', 0.071), ('variates', 0.071), ('uni', 0.068), ('monte', 0.068), ('ordering', 0.068), ('chain', 0.068), ('children', 0.067), ('find', 0.067), ('exchangeable', 0.065), ('radford', 0.065), ('posterior', 0.065), ('word', 0.064), ('bayesian', 0.063), ('process', 0.063), ('diffusion', 0.061), ('mozer', 0.061), ('carlo', 0.059), ('gibbs', 0.058), ('zemel', 0.058), ('strings', 0.058), ('multinomial', 0.057), ('latent', 0.057), ('unbounded', 0.056), ('whye', 0.055), ('yee', 0.055), ('depth', 0.055), ('diffusive', 0.054), ('nats', 0.054), ('perm', 0.054), ('pslice', 0.054), ('samp', 0.054), ('thdp', 0.054), ('evolutionary', 0.053), ('mixture', 0.051), ('assignments', 0.051), ('internal', 0.05), ('dp', 0.05), ('tiny', 0.049), ('murray', 0.049), ('instantiated', 0.049), ('width', 0.049), ('draw', 0.049), ('predictive', 0.048), ('break', 0.048), ('distribution', 0.048), ('instantiates', 0.047), ('interleaving', 0.047), ('retrospective', 0.047), ('williamson', 0.047), ('markov', 0.047), ('folds', 0.047), ('clustering', 0.046), ('processes', 0.046), ('david', 0.045), ('nite', 0.045), ('priors', 0.045), ('ths', 0.044)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data
Author: Zoubin Ghahramani, Michael I. Jordan, Ryan P. Adams
Abstract: Many data are naturally modeled by an unobserved hierarchical structure. In this paper we propose a flexible nonparametric prior over unknown data hierarchies. The approach uses nested stick-breaking processes to allow for trees of unbounded width and depth, where data can live at any node and are infinitely exchangeable. One can view our model as providing infinite mixtures where the components have a dependency structure corresponding to an evolutionary diffusion down a tree. By using a stick-breaking approach, we can apply Markov chain Monte Carlo methods based on slice sampling to perform Bayesian inference and simulate from the posterior distribution on trees. We apply our method to hierarchical clustering of images and topic modeling of text data. 1
2 0.25583079 150 nips-2010-Learning concept graphs from text with stick-breaking priors
Author: America Chambers, Padhraic Smyth, Mark Steyvers
Abstract: We present a generative probabilistic model for learning general graph structures, which we term concept graphs, from text. Concept graphs provide a visual summary of the thematic content of a collection of documents—a task that is difficult to accomplish using only keyword search. The proposed model can learn different types of concept graph structures and is capable of utilizing partial prior knowledge about graph structure as well as labeled documents. We describe a generative model that is based on a stick-breaking process for graphs, and a Markov Chain Monte Carlo inference procedure. Experiments on simulated data show that the model can recover known graph structure when learning in both unsupervised and semi-supervised modes. We also show that the proposed model is competitive in terms of empirical log likelihood with existing structure-based topic models (hPAM and hLDA) on real-world text data sets. Finally, we illustrate the application of the model to the problem of updating Wikipedia category graphs. 1
3 0.23474653 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
Author: Dahua Lin, Eric Grimson, John W. Fisher
Abstract: We present a novel method for constructing dependent Dirichlet processes. The approach exploits the intrinsic relationship between Dirichlet and Poisson processes in order to create a Markov chain of Dirichlet processes suitable for use as a prior over evolving mixture models. The method allows for the creation, removal, and location variation of component models over time while maintaining the property that the random measures are marginally DP distributed. Additionally, we derive a Gibbs sampling algorithm for model inference and test it on both synthetic and real data. Empirical results demonstrate that the approach is effective in estimating dynamically varying mixture models. 1
4 0.22214349 194 nips-2010-Online Learning for Latent Dirichlet Allocation
Author: Matthew Hoffman, Francis R. Bach, David M. Blei
Abstract: We develop an online variational Bayes (VB) algorithm for Latent Dirichlet Allocation (LDA). Online LDA is based on online stochastic optimization with a natural gradient step, which we show converges to a local optimum of the VB objective function. It can handily analyze massive document collections, including those arriving in a stream. We study the performance of online LDA in several ways, including by fitting a 100-topic topic model to 3.3M articles from Wikipedia in a single pass. We demonstrate that online LDA finds topic models as good or better than those found with batch VB, and in a fraction of the time. 1
5 0.20562257 242 nips-2010-Slice sampling covariance hyperparameters of latent Gaussian models
Author: Iain Murray, Ryan P. Adams
Abstract: The Gaussian process (GP) is a popular way to specify dependencies between random variables in a probabilistic model. In the Bayesian framework the covariance structure can be specified using unknown hyperparameters. Integrating over these hyperparameters considers different possible explanations for the data when making predictions. This integration is often performed using Markov chain Monte Carlo (MCMC) sampling. However, with non-Gaussian observations standard hyperparameter sampling approaches require careful tuning and may converge slowly. In this paper we present a slice sampling approach that requires little tuning while mixing well in both strong- and weak-data regimes. 1
6 0.20294636 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks
7 0.20046912 286 nips-2010-Word Features for Latent Dirichlet Allocation
8 0.18810776 60 nips-2010-Deterministic Single-Pass Algorithm for LDA
9 0.166026 131 nips-2010-Joint Analysis of Time-Evolving Binary Matrices and Associated Documents
10 0.15641764 237 nips-2010-Shadow Dirichlet for Restricted Probability Modeling
11 0.13788775 185 nips-2010-Nonparametric Density Estimation for Stochastic Optimization with an Observable State Variable
12 0.12917629 170 nips-2010-Moreau-Yosida Regularization for Grouped Tree Structure Learning
13 0.12025722 144 nips-2010-Learning Efficient Markov Networks
14 0.11751692 88 nips-2010-Extensions of Generalized Binary Search to Group Identification and Exponential Costs
15 0.095253378 101 nips-2010-Gaussian sampling by local perturbations
16 0.089927785 264 nips-2010-Synergies in learning words and their referents
17 0.086380906 55 nips-2010-Cross Species Expression Analysis using a Dirichlet Process Mixture Model with Latent Matchings
18 0.08470694 178 nips-2010-Multivariate Dyadic Regression Trees for Sparse Learning Problems
19 0.08345256 128 nips-2010-Infinite Relational Modeling of Functional Connectivity in Resting State fMRI
20 0.082320273 190 nips-2010-On the Convexity of Latent Social Network Inference
topicId topicWeight
[(0, 0.265), (1, 0.063), (2, -0.026), (3, 0.014), (4, -0.439), (5, 0.104), (6, 0.219), (7, 0.043), (8, 0.014), (9, 0.043), (10, 0.073), (11, 0.054), (12, -0.127), (13, 0.053), (14, -0.006), (15, -0.043), (16, 0.011), (17, 0.005), (18, -0.064), (19, -0.049), (20, -0.077), (21, -0.026), (22, 0.0), (23, -0.082), (24, -0.093), (25, 0.186), (26, -0.037), (27, 0.101), (28, 0.035), (29, -0.063), (30, -0.046), (31, 0.098), (32, 0.01), (33, 0.009), (34, 0.029), (35, -0.029), (36, -0.014), (37, -0.087), (38, -0.055), (39, -0.019), (40, 0.114), (41, -0.075), (42, -0.085), (43, -0.075), (44, -0.066), (45, 0.042), (46, -0.007), (47, -0.067), (48, 0.049), (49, 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.96077132 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data
Author: Zoubin Ghahramani, Michael I. Jordan, Ryan P. Adams
Abstract: Many data are naturally modeled by an unobserved hierarchical structure. In this paper we propose a flexible nonparametric prior over unknown data hierarchies. The approach uses nested stick-breaking processes to allow for trees of unbounded width and depth, where data can live at any node and are infinitely exchangeable. One can view our model as providing infinite mixtures where the components have a dependency structure corresponding to an evolutionary diffusion down a tree. By using a stick-breaking approach, we can apply Markov chain Monte Carlo methods based on slice sampling to perform Bayesian inference and simulate from the posterior distribution on trees. We apply our method to hierarchical clustering of images and topic modeling of text data. 1
2 0.73176646 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
Author: Dahua Lin, Eric Grimson, John W. Fisher
Abstract: We present a novel method for constructing dependent Dirichlet processes. The approach exploits the intrinsic relationship between Dirichlet and Poisson processes in order to create a Markov chain of Dirichlet processes suitable for use as a prior over evolving mixture models. The method allows for the creation, removal, and location variation of component models over time while maintaining the property that the random measures are marginally DP distributed. Additionally, we derive a Gibbs sampling algorithm for model inference and test it on both synthetic and real data. Empirical results demonstrate that the approach is effective in estimating dynamically varying mixture models. 1
3 0.70194674 150 nips-2010-Learning concept graphs from text with stick-breaking priors
Author: America Chambers, Padhraic Smyth, Mark Steyvers
Abstract: We present a generative probabilistic model for learning general graph structures, which we term concept graphs, from text. Concept graphs provide a visual summary of the thematic content of a collection of documents—a task that is difficult to accomplish using only keyword search. The proposed model can learn different types of concept graph structures and is capable of utilizing partial prior knowledge about graph structure as well as labeled documents. We describe a generative model that is based on a stick-breaking process for graphs, and a Markov Chain Monte Carlo inference procedure. Experiments on simulated data show that the model can recover known graph structure when learning in both unsupervised and semi-supervised modes. We also show that the proposed model is competitive in terms of empirical log likelihood with existing structure-based topic models (hPAM and hLDA) on real-world text data sets. Finally, we illustrate the application of the model to the problem of updating Wikipedia category graphs. 1
4 0.68598408 237 nips-2010-Shadow Dirichlet for Restricted Probability Modeling
Author: Bela Frigyik, Maya Gupta, Yihua Chen
Abstract: Although the Dirichlet distribution is widely used, the independence structure of its components limits its accuracy as a model. The proposed shadow Dirichlet distribution manipulates the support in order to model probability mass functions (pmfs) with dependencies or constraints that often arise in real world problems, such as regularized pmfs, monotonic pmfs, and pmfs with bounded variation. We describe some properties of this new class of distributions, provide maximum entropy constructions, give an expectation-maximization method for estimating the mean parameter, and illustrate with real data. 1 Modeling Probabilities for Machine Learning Modeling probability mass functions (pmfs) as random is useful in solving many real-world problems. A common random model for pmfs is the Dirichlet distribution [1]. The Dirichlet is conjugate to the multinomial and hence mathematically convenient for Bayesian inference, and the number of parameters is conveniently linear in the size of the sample space. However, the Dirichlet is a distribution over the entire probability simplex, and for many problems this is simply the wrong domain if there is application-specific prior knowledge that the pmfs come from a restricted subset of the simplex. For example, in natural language modeling, it is common to regularize a pmf over n-grams by some generic language model distribution q0 , that is, the pmf to be modeled is assumed to have the form θ = λq + (1 − λ)q0 for some q in the simplex, λ ∈ (0, 1) and a fixed generic model q0 [2]. But once q0 and λ are fixed, the pmf θ can only come from a subset of the simplex. Another natural language processing example is modeling the probability of keywords in a dictionary where some words are related, such as espresso and latte, and evidence for the one is to some extent evidence for the other. This relationship can be captured with a bounded variation model that would constrain the modeled probability of espresso to be within some of the modeled probability of latte. We show that such bounds on the variation between pmf components also restrict the domain of the pmf to a subset of the simplex. As a third example of restricting the domain, the similarity discriminant analysis classifier estimates class-conditional pmfs that are constrained to be monotonically increasing over an ordered sample space of discrete similarity values [3]. In this paper we propose a simple variant of the Dirichlet whose support is a subset of the simplex, explore its properties, and show how to learn the model from data. We first discuss the alternative solution of renormalizing the Dirichlet over the desired subset of the simplex, and other related work. Then we propose the shadow Dirichlet distribution; explain how to construct a shadow Dirichlet for three types of restricted domains: the regularized pmf case, bounded variation between pmf components, and monotonic pmfs; and discuss the most general case. We show how to use the expectation-maximization (EM) algorithm to estimate the shadow Dirichlet parameter α, and present simulation results for the estimation. 1 Dirichlet Shadow Dirichlet Renormalized Dirichlet Figure 1: Dirichlet, shadow Dirichlet, and renormalized Dirichlet for α = [3.94 2.25 2.81]. 2 Related Work One solution to modeling pmfs on only a subset of the simplex is to simply restrict the support of ˜ ˜ the Dirichlet to the desired support S, and renormalize the Dirichlet over S (see Fig. 1 for an example). 
This renormalized Dirichlet has the advantage that it is still a conjugate distribution for the multinomial. Nallapati et al.considered the renormalized Dirichlet for language modeling, but found it difficult to use because the density requires numerical integration to compute the normalizer [4] . In addition, there is no closed form solution for the mean, covariance, or peak of the renormalized Dirichlet, making it difficult to work with. Table 1 summarizes these properties. Additionally, generating samples from the renormalized Dirichlet is inefficient: one draws samples from the stan˜ dard Dirichlet, then rejects realizations that are outside S. For high-dimensional sample spaces, this could greatly increase the time to generate samples. Although the Dirichlet is a classic and popular distribution on the simplex, Aitchison warns it “is totally inadequate for the description of the variability of compositional data,” because of its “implied independence structure and so the Dirichlet class is unlikely to be of any great use for describing compositions whose components have even weak forms of dependence” [5]. Aitchison instead championed a logistic normal distribution with more parameters to control covariance between components. A number of variants of the Dirichlet that can capture more dependence have been proposed and analyzed. For example, the scaled Dirichlet enables a more flexible shape for the distribution [5], but does not change the support. The original Dirichlet(α1 , α2 , . . . αd ) can be derived as Yj / j Yj where Yj ∼ Γ(αj , β), whereas the scaled Dirichlet is derived from Yj ∼ Γ(αj , βj ), resulting in α density p(θ) = γ j ( α −1 βj j θ j j βi θi )α1 +···+αd i , where β, α ∈ Rd are parameters, and γ is the normalizer. + Another variant is the generalized Dirichlet [6] which also has parameters β, α ∈ Rd , and allows + greater control of the covariance structure, again without changing the support. As perhaps first noted by Karl Pearson [7] and expounded upon by Aitchison [5], correlations of proportional data can be very misleading. Many Dirichlet variants have been generalizations of the Connor-Mossiman variant, Dirichlet process variants, other compound Dirichlet models, and hierarchical Dirichlet models. Ongaro et al. [8] propose the flexible Dirichlet distribution by forming a re-parameterized mixture of Dirichlet distributions. Rayens and Srinivasan [9] considered the dependence structure for the general Dirichlet family called the generalized Liouville distributions. In contrast to prior efforts, the shadow Dirichlet manipulates the support to achieve various kinds of dependence that arise frequently in machine learning problems. 3 Shadow Dirichlet Distribution We introduce a new distribution that we call the shadow Dirichlet distribution. Let S be the prob˜ ability (d − 1)-simplex, and let Θ ∈ S be a random pmf drawn from a Dirichlet distribution with density pD and unnormalized parameter α ∈ Rd . Then we say the random pmf Θ ∈ S is distributed + ˜ according to a shadow Dirichlet distribution if Θ = M Θ for some fixed d × d left-stochastic (that ˜ is, each column of M sums to 1) full-rank (and hence invertible) matrix M , and we call Θ the gen2 erating Dirichlet of Θ, or Θ’s Dirichlet shadow. Because M is a left-stochastic linear map between finite-dimensional spaces, it is a continuous map from the convex and compact S to a convex and compact subset of S that we denote SM . 
The shadow Dirichlet has two parameters: the generating Dirichlet’s parameter α ∈ Rd , and the + d × d matrix M . Both α and M can be estimated from data. However, as we show in the following subsections, the matrix M can be profitably used as a design parameter that is chosen based on application-specific knowledge or side-information to specify the restricted domain SM , and in that way impose dependency between the components of the random pmfs. The shadow Dirichlet density p(θ) is the normalized pushforward of the Dirichlet density, that is, it is the composition of the Dirichlet density and M −1 with the Jacobian: 1 α −1 p(θ) = (M −1 θ)j j , (1) B(α) |det(M )| j Γ(αj ) d j is the standard Dirichlet normalizer, and α0 = j=1 αj is the standard where B(α) Γ(α0 ) Dirichlet precision factor. Table 1 summarizes the basic properties of the shadow Dirichlet. Fig. 1 shows an example shadow Dirichlet distribution. Generating samples from the shadow Dirichlet is trivial: generate samples from its generating Dirichlet (for example, using stick-breaking or urn-drawing) and multiply each sample by M to create the corresponding shadow Dirichlet sample. Table 1: Table compares and summarizes the Dirichlet, renormalized Dirichlet, and shadow Dirichlet distributions. Dirichlet(α) Density p(θ) Mean 1 B(α) d j=1 α −1 θj j Shadow Dirichlet (α, M ) 1 B(α)|det(M )| α α0 Renormalized ˜ Dirichlet (α, S) d −1 αj −1 θ)j j=1 (M 1 ˜ S M d j=1 α α0 αj −1 qj dq d j=1 α −1 θj j ˜ θp(θ)dθ S ¯ ¯ − θ)(θ − θ)T p(θ)dθ Cov(Θ) M Cov(Θ)M T αj −1 α0 −d j M α0 −d max p(θ) stick-breaking, urn-drawing draw from Dirichlet(α), multiply by M draw from Dirichlet(α), ˜ reject if not in S ML Estimate iterative (simple functions) iterative (simple functions) unknown complexity ML Compound Estimate iterative (simple functions) iterative (numerical integration) unknown complexity Covariance Mode (if α > 1) How to Sample 3.1 α −1 ˜ (θ S ˜ θ∈S Example: Regularized Pmfs The shadow Dirichlet can be designed to specify a distribution over a set of regularized pmfs SM = ˜ ˘ ˜ ˘ ˘ {θ θ = λθ + (1 − λ)θ, θ ∈ S}, for specific values of λ and θ. In general, for a given λ and θ ∈ S, the following d × d matrix M will change the support to the desired subset SM by mapping the extreme points of S to the extreme points of SM : ˘ M = (1 − λ)θ1T + λI, (2) where I is the d × d identity matrix. In Section 4 we show that the M given in (2) is optimal in a maximum entropy sense. 3 3.2 Example: Bounded Variation Pmfs We describe how to use the shadow Dirichlet to model a random pmf that has bounded variation such that |θk − θl | ≤ k,l for any k, ∈ {1, 2, . . . , d} and k,l > 0. To construct specified bounds on the variation, we first analyze the variation for a given M . For any d × d left stochastic matrix T d d ˜ ˜ ˜ M, θ = Mθ = M1j θj . . . Mdj θj , so the difference between any two entries is j=1 j=1 ˜ |Mkj − Mlj | θj . ˜ (Mkj − Mlj )θj ≤ |θk − θl | = (3) j j Thus, to obtain a distribution over pmfs with bounded |θk − θ | ≤ k,l for any k, components, it is sufficient to choose components of the matrix M such that |Mkj − Mlj | ≤ k,l for all j = 1, . . . , d ˜ because θ in (3) sums to 1. One way to create such an M is using the regularization strategy described in Section 3.1. For this ˜ ˜ ˘ case, the jth component of θ is θj = M θ = λθj + (1 − λ)θj , and thus the variation between the j ith and jth component of any pmf in SM is: ˜ ˘ ˜ ˘ ˜ ˜ ˘ ˘ |θi − θj | = λθi + (1 − λ)θi − λθj − (1 − λ)θj ≤ λ θi − θj + (1 − λ) θi − θj ˘ ˘ ≤ λ + (1 − λ) max θi − θj . 
(4) i,j ˘ Thus by choosing an appropriate λ and regularizing pmf θ, one can impose the bounded variation ˘ to be the uniform pmf, and choose any λ ∈ (0, 1), then the matrix given by (4). For example, set θ M given by (2) will guarantee that the difference between any two entries of any pmf drawn from the shadow Dirichlet (M, α) will be less than or equal to λ. 3.3 Example: Monotonic Pmfs For pmfs over ordered components, it may be desirable to restrict the support of the random pmf distribution to only monotonically increasing pmfs (or to only monotonically decreasing pmfs). A d × d left-stochastic matrix M that will result in a shadow Dirichlet that generates only monotonically increasing d × 1 pmfs has kth column [0 . . . 0 1/(d − k + 1) . . . 1/(d − k + 1)]T , we call this the monotonic M . It is easy to see that with this M only monotonic θ’s can be produced, 1˜ 1˜ 1 ˜ because θ1 = d θ1 which is less than or equal to θ2 = d θ1 + d−1 θ2 and so on. In Section 4 we show that the monotonic M is optimal in a maximum entropy sense. Note that to provide support over both monotonically increasing and decreasing pmfs with one distribution is not achievable with a shadow Dirichlet, but could be achieved by a mixture of two shadow Dirichlets. 3.4 What Restricted Subsets are Possible? Above we have described solutions to construct M for three kinds of dependence that arise in machine learning applications. Here we consider the more general question: What subsets of the simplex can be the support of the shadow Dirichlet, and how to design a shadow Dirichlet for a particular support? For any matrix M , by the Krein-Milman theorem [10], SM = M S is the convex hull of its extreme points. If M is injective, the extreme points of SM are easy to specify, as a d × d matrix M will have d extreme points that occur for the d choices of θ that have only one nonzero component, as the rest of the θ will create a non-trivial convex combination of the columns of M , and therefore cannot result in extreme points of SM by definition. That is, the extreme points of SM are the d columns of M , and one can design any SM with d extreme points by setting the columns of M to be those extreme pmfs. However, if one wants the new support to be a polytope in the probability (d − 1)-simplex with m > d extreme points, then one must use a fat M with d × m entries. Let S m denote the probability 4 (m − 1)-simplex, then the domain of the shadow Dirichlet will be M S m , which is the convex hull of the m columns of M and forms a convex polytope in S with at most m vertices. In this case M cannot be injective, and hence it is not bijective between S m and M S m . However, a density on M S m can be defined as: p(θ) = 1 B(α) ˜ {θ ˜α −1 ˜ θj j dθ. ˜ M θ=θ} j (5) On the other hand, if one wants the support to be a low-dimensional polytope subset of a higherdimensional probability simplex, then a thin d × m matrix M , where m < d, can be used to implement this. If M is injective, then it has a left inverse M ∗ that is a matrix of dimension m × d, and the normalized pushforward of the original density can be used as a density on the image M S m : p(θ) = 1 α −1 1/2 B(α) |det(M T M )| (M ∗ θ)j j , j If M is not injective then one way to determine a density is to use (5). 4 Information-theoretic Properties In this section we note two information-theoretic properties of the shadow Dirichlet. Let Θ be drawn ˜ from shadow Dirichlet density pM , and let its generating Dirichlet Θ be drawn from pD . 
Then the differential entropy of the shadow Dirichlet is h(pM ) = log |det(M )| + h(pD ), where h(pD ) is the differential entropy of its generating Dirichlet. In fact, the shadow Dirichlet always has less entropy than its Dirichlet shadow because log |det(M )| ≤ 0, which can be shown as a corollary to the following lemma (proof not included due to lack of space): Lemma 4.1. Let {x1 , . . . , xn } and {y1 , . . . , yn } be column vectors in Rn . If each yj is a convex n n combination of the xi ’s, i.e. yj = i=1 γji xi , i=1 γji = 1, γjk ≥ 0, ∀j, k ∈ {1, . . . , n} then |det[y1 , . . . , yn ]| ≤ |det[x1 , . . . , xn ]|. It follows from Lemma 4.1 that the constructive solutions for M given in (2) and the monotonic M are optimal in the sense of maximizing entropy: Corollary 4.1. Let Mreg be the set of left-stochastic matrices M that parameterize shadow Dirichlet ˜ ˘ ˜ ˘ distributions with support in {θ θ = λθ + (1 − λ)θ, θ ∈ S}, for a specific choice of λ and θ. Then the M given in (2) results in the shadow Dirichlet with maximum entropy, that is, (2) solves arg maxM ∈Mreg h(pM ). Corollary 4.2. Let Mmono be the set of left-stochastic matrices M that parameterize shadow Dirichlet distributions that generate only monotonic pmfs. Then the monotonic M given in Section 3.3 results in the shadow Dirichlet with maximum entropy, that is, the monotonic M solves arg maxM ∈Mmono h(pM ). 5 Estimating the Distribution from Data In this section, we discuss the estimation of α for the shadow Dirichlet and compound shadow Dirichlet, and the estimation of M . 5.1 Estimating α for the Shadow Dirichlet Let matrix M be specified (for example, as described in the subsections of Section 3), and let q be a d × N matrix where the ith column qi is the ith sample pmf for i = 1 . . . N , and let (qi )j be the jth component of the ith sample pmf for j = 1, . . . , d. Then finding the maximum likelihood estimate 5 of α for the shadow Dirichlet is straightforward: N 1 arg max log + log p(qi |α) ≡ arg max log B(α) |det(M )| α∈Rk α∈Rk + + i=1 1 αj −1 (˜i )j q , ≡ arg max log B(α)N i j α∈Rk + N α −1 (M −1 qi )j j i j (6) where q = M −1 q. Note (6) is the maximum likelihood estimation problem for the Dirichlet dis˜ tribution given the matrix q , and can be solved using the standard methods for that problem (see ˜ e.g. [11, 12]). 5.2 Estimating α for the Compound Shadow Dirichlet For many machine learning applications the given data are modeled as samples from realizations of a random pmf, and given these samples one must estimate the random pmf model’s parameters. We refer to this case as the compound shadow Dirichlet, analogous to the compound Dirichlet (also called the multivariate P´ lya distribution). Assuming one has already specified M , we first discuss o method of moments estimation, and then describe an expectation-maximization (EM) method for computing the maximum likelihood estimate α. ˘ One can form an estimate of α by the method of moments. For the standard compound Dirichlet, one treats the samples of the realizations as normalized empirical histograms, sets the normalized α parameter equal to the empirical mean of the normalized histograms, and uses the empirical variances to determine the precision α0 . By definition, this estimate will be less likely than the maximum likelihood estimate, but may be a practical short-cut in some cases. For the compound shadow Dirichlet, we believe the method of moments estimator will be a poorer estimate in general. 
The problem is that if one draws samples from a pmf θ from a restricted subset SM of the simplex, ˘ then the normalized empirical histogram θ of those samples may not be in SM . For example given a monotonic pmf, the histogram of five samples drawn from it may not be monotonic. Then the empirical mean of such normalized empirical histograms may not be in SM , and so setting the shadow Dirichlet mean M α equal to the empirical mean may lead to an infeasible estimate (one that is outside SM ). A heuristic solution is to project the empirical mean into SM first, for example, by finding the nearest pmf in SM in squared error or relative entropy. As with the compound Dirichlet, this may still be a useful approach in practice for some problems. Next we state an EM method to find the maximum likelihood estimate α. Let s be a d × N matrix ˘ of sample histograms from different experiments, such that the ith column si is the ith histogram for i = 1, . . . , N , and (si )j is the number of times we have observed the jth event from the ith pmf vi . Then the maximum log-likelihood estimate of α solves arg max log p(s|α) for α ∈ Rk . + If the random pmfs are drawn from a Dirichlet distribution, then finding this maximum likelihood estimate requires an iterative procedure, and can be done in several ways including a gradient descent (ascent) approach. However, if the random pmfs are drawn from a shadow Dirichlet distribution, then a direct gradient descent approach is highly inconvenient as it requires taking derivatives of numerical integrals. However, it is practical to apply the expectation-maximization (EM) algorithm [13][14], as we describe in the rest of this section. Code to perform the EM estimation of α can be downloaded from idl.ee.washington.edu/publications.php. We assume that the experiments are independent and therefore p(s|α) = p({si }|α) = and hence arg maxα∈Rk log p(s|α) = arg maxα∈Rk i log p(si |α). + + i p(si |α) To apply the EM method, we consider the complete data to be the sample histograms s and the pmfs that generated them (s, v1 , v2 , . . . , vN ), whose expected log-likelihood will be maximized. Specifically, because of the assumed independence of the {vi }, the EM method requires one to repeatedly maximize the Q-function such that the estimate of α at the (m + 1)th iteration is: N α(m+1) = arg max α∈Rk + Evi |si ,α(m) [log p(vi |α)] . i=1 6 (7) Like the compound Dirichlet likelihood, the compound shadow Dirichlet likelihood is not necessarily concave. However, note that the Q-function given in (7) is concave, because log p(vi |α) = − log |det(M )| + log pD,α M −1 vi , where pD,α is the Dirichlet distribution with parameter α, and by a theorem of Ronning [11], log pD,α is a concave function, and adding a constant does not change the concavity. The Q-function is a finite integration of such concave functions and hence also concave [15]. We simplify (7) without destroying the concavity to yield the equivalent problem α(m+1) = d d arg max g(α) for α ∈ Rk , where g(α) = log Γ(α0 ) − j=1 log Γ(αj ) + j=1 βj αj , and + βj = N tij i=1 zi , 1 N where tij and zi are integrals we compute with Monte Carlo integration: d (s ) log(M −1 vi )j γi tij = SM (vi )k i k pM (vi |α(m) )dvi k=1 d zi = (vi )j k(si )k pM (vi |α(m) )dvi , γi SM k=1 where γi is the normalization constant for the multinomial with histogram si . We apply the Newton method [16] to maximize g(α), where the gradient g(α) has kth component ψ0 (α0 ) − ψ0 (α1 ) + β1 , where ψ0 denotes the digamma function. 
Let ψ1 denote the trigamma function, then the Hessian matrix of g(α) is: H = ψ1 (α0 )11T − diag (ψ1 (α1 ), . . . , ψ1 (αd )) . Note that because H has a very simple structure, the inversion of H required by the Newton step is greatly simplified by using the Woodbury identity [17]: H −1 = − diag(ξ1 , . . . , ξd ) − 1 1 1 [ξi ξj ]d×d , where ξ0 = ψ1 (α0 ) and ξj = ψ1 (αj ) , j = 1, . . . , d. ξ − d ξ 0 5.3 j=1 j Estimating M for the Shadow Dirichlet Thus far we have discussed how to construct M to achieve certain desired properties and how to interpret a given M ’s effect on the support. In some cases it may be useful to estimate M directly from data, for example, finding the maximum likelihood M . In general, this is a non-convex problem because the set of rank d − 1 matrices is not convex. However, we offer two approximations. First, note that as in estimating the support of a uniform distribution, the maximum likelihood M will correspond to a support that is no larger than needed to contain the convex hull of sample pmfs. Second, the mean of the empirical pmfs will be in the support, and thus a heuristic is to set the kth column of M (which corresponds to the kth vertex of the support) to be a convex combination of the kth vertex of the standard probability simplex and the empirical mean pmf. We provide code that finds the d optimal such convex combinations such that a specificed percentage of the sample pmfs are within the support, which reduces the non-convex problem of finding the maximum likelihood d × d matrix M to a d-dimensional convex relaxation. 6 Demonstrations It is reasonable to believe that if the shadow Dirichlet better matches the problem’s statistics, it will perform better in practice, but an open question is how much better? To motivate the reader to investigate this question further in applications, we provide two small demonstrations. 6.1 Verifying the EM Estimation We used a broad suite of simulations to test and verify the EM estimation. Here we include a simple visual confirmation that the EM estimation works: we drew 100 i.i.d. pmfs from a shadow Dirichlet with monotonic M for d = 3 and α = [3.94 2.25 2.81] (used in [18]). From each of the 100 pmfs, we drew 100 i.i.d. samples. Then we applied the EM algorithm to find the α for both the standard compound Dirichlet, and the compound shadow Dirichlet with the correct M . Fig. 2 shows the true distribution and the two estimated distributions. 7 True Distribution (Shadow Dirichlet) Estimated Shadow Dirichlet Estimated Dirichlet Figure 2: Samples were drawn from the true distribution and the given EM method was applied to form the estimated distributions. 6.2 Estimating Proportions from Sales Manufacturers often have constrained manufacturing resources, such as equipment, inventory of raw materials, and employee time, with which to produce multiple products. The manufacturer must decide how to proportionally allocate such constrained resources across their product line based on their estimate of proportional sales. Manufacturer Artifact Puzzles gave us their past retail sales data for the 20 puzzles they sold during July 2009 through Dec 2009, which we used to predict the proportion of sales expected for each puzzle. These estimates were then tested on the next five months of sales data, for January 2010 through April 2010. The company also provided a similarity between puzzles S, where S(A, B) is the proportion of times an order during the six training months included both puzzle A and B if it included puzzle A. 
We compared treating each of the six training months of sales data as a sample from a compound Dirichlet versus or a compound shadow Dirichlet. For the shadow Dirichlet, we normalized each column of the similarity matrix S to sum to one so that it was left-stochastic, and used that as the M matrix; this forces puzzles that are often bought together to have closer estimated proportions. We estimated each α parameter by EM to maximize the likelihood of the past sales data, and then estimated the future sales proportions to be the mean of the estimated Dirichlet or shadow Dirichlet distribution. We also compared with treating all six months of sales data as coming from one multinomial which we estimated as the maximum likelihood multinomial, and to taking the mean of the six empirical pmfs. Table 2: Squared errors between estimates and actual proportional sales. Jan. Feb. Mar. Apr. 7 Multinomial .0129 .0185 .0231 .0240 Mean Pmf .0106 .0206 .0222 .0260 Dirichlet .0109 .0172 .0227 .0235 Shadow Dirichlet .0093 .0164 .0197 .0222 Summary In this paper we have proposed a variant of the Dirichlet distribution that naturally captures some of the dependent structure that arises often in machine learning applications. We have discussed some of its theoretical properties, and shown how to specify the distribution for regularized pmfs, bounded variation pmfs, monotonic pmfs, and for any desired convex polytopal domain. We have derived the EM method and made available code to estimate both the shadow Dirichlet and compound shadow Dirichlet from data. Experimental results demonstrate that the EM method can estimate the shadow Dirichlet effectively, and that the shadow Dirichlet may provide worthwhile advantages in practice. 8 References [1] B. Frigyik, A. Kapila, and M. R. Gupta, “Introduction to the Dirichlet distribution and related processes,” Tech. Rep., University of Washington, 2010. [2] C. Zhai and J. Lafferty, “A study of smoothing methods for language models applied to information retrieval,” ACM Trans. on Information Systems, vol. 22, no. 2, pp. 179–214, 2004. [3] Y. Chen, E. K. Garcia, M. R. Gupta, A. Rahimi, and L. Cazzanti, “Similarity-based classification: Concepts and algorithms,” Journal of Machine Learning Research, vol. 10, pp. 747–776, March 2009. [4] R. Nallapati, T. Minka, and S. Robertson, “The smoothed-Dirichlet distribution: a building block for generative topic models,” Tech. Rep., Microsoft Research, Cambridge, 2007. [5] Aitchison, Statistical Analysis of Compositional Data, Chapman Hall, New York, 1986. [6] R. J. Connor and J. E. Mosiman, “Concepts of independence for proportions with a generalization of the Dirichlet distibution,” Journal of the American Statistical Association, vol. 64, pp. 194–206, 1969. [7] K. Pearson, “Mathematical contributions to the theory of evolution–on a form of spurious correlation which may arise when indices are used in the measurement of organs,” Proc. Royal Society of London, vol. 60, pp. 489–498, 1897. [8] A. Ongaro, S. Migliorati, and G. S. Monti, “A new distribution on the simplex containing the Dirichlet family,” Proc. 3rd Compositional Data Analysis Workshop, 2008. [9] W. S. Rayens and C. Srinivasan, “Dependence properties of generalized Liouville distributions on the simplex,” Journal of the American Statistical Association, vol. 89, no. 428, pp. 1465– 1470, 1994. [10] Walter Rudin, Functional Analysis, McGraw-Hill, New York, 1991. [11] G. 
7 Summary
In this paper we have proposed a variant of the Dirichlet distribution that naturally captures some of the dependent structure that often arises in machine learning applications. We have discussed some of its theoretical properties, and shown how to specify the distribution for regularized pmfs, bounded-variation pmfs, monotonic pmfs, and for any desired convex polytopal domain. We have derived the EM method and made available code to estimate both the shadow Dirichlet and the compound shadow Dirichlet from data. Experimental results demonstrate that the EM method can estimate the shadow Dirichlet effectively, and that the shadow Dirichlet may provide worthwhile advantages in practice.

8 References
[1] B. Frigyik, A. Kapila, and M. R. Gupta, “Introduction to the Dirichlet distribution and related processes,” Tech. Rep., University of Washington, 2010.
[2] C. Zhai and J. Lafferty, “A study of smoothing methods for language models applied to information retrieval,” ACM Trans. on Information Systems, vol. 22, no. 2, pp. 179–214, 2004.
[3] Y. Chen, E. K. Garcia, M. R. Gupta, A. Rahimi, and L. Cazzanti, “Similarity-based classification: Concepts and algorithms,” Journal of Machine Learning Research, vol. 10, pp. 747–776, March 2009.
[4] R. Nallapati, T. Minka, and S. Robertson, “The smoothed-Dirichlet distribution: a building block for generative topic models,” Tech. Rep., Microsoft Research, Cambridge, 2007.
[5] J. Aitchison, Statistical Analysis of Compositional Data, Chapman & Hall, New York, 1986.
[6] R. J. Connor and J. E. Mosimann, “Concepts of independence for proportions with a generalization of the Dirichlet distribution,” Journal of the American Statistical Association, vol. 64, pp. 194–206, 1969.
[7] K. Pearson, “Mathematical contributions to the theory of evolution–on a form of spurious correlation which may arise when indices are used in the measurement of organs,” Proc. Royal Society of London, vol. 60, pp. 489–498, 1897.
[8] A. Ongaro, S. Migliorati, and G. S. Monti, “A new distribution on the simplex containing the Dirichlet family,” Proc. 3rd Compositional Data Analysis Workshop, 2008.
[9] W. S. Rayens and C. Srinivasan, “Dependence properties of generalized Liouville distributions on the simplex,” Journal of the American Statistical Association, vol. 89, no. 428, pp. 1465–1470, 1994.
[10] W. Rudin, Functional Analysis, McGraw-Hill, New York, 1991.
[11] G. Ronning, “Maximum likelihood estimation of Dirichlet distributions,” Journal of Statistical Computation and Simulation, vol. 34, no. 4, pp. 215–221, 1989.
[12] T. Minka, “Estimating a Dirichlet distribution,” Tech. Rep., Microsoft Research, Cambridge, 2009.
[13] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 39, no. 1, pp. 1–38, 1977.
[14] M. R. Gupta and Y. Chen, Theory and Use of the EM Method, Foundations and Trends in Signal Processing, Hanover, MA, 2010.
[15] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970.
[16] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, Cambridge, 2004.
[17] K. B. Petersen and M. S. Pedersen, The Matrix Cookbook, 2009, available at matrixcookbook.com.
[18] R. E. Madsen, D. Kauchak, and C. Elkan, “Modeling word burstiness using the Dirichlet distribution,” in Proc. Intl. Conf. Machine Learning, 2005.
5 0.651591 60 nips-2010-Deterministic Single-Pass Algorithm for LDA
Author: Issei Sato, Kenichi Kurihara, Hiroshi Nakagawa
Abstract: We develop a deterministic single-pass algorithm for latent Dirichlet allocation (LDA) in order to process received documents one at a time and then discard them in an excess text stream. Our algorithm does not need to store old statistics for all data. The proposed algorithm is much faster than a batch algorithm and is comparable to the batch algorithm in terms of perplexity in experiments.
6 0.6122812 131 nips-2010-Joint Analysis of Time-Evolving Binary Matrices and Associated Documents
7 0.60849029 286 nips-2010-Word Features for Latent Dirichlet Allocation
8 0.59402561 120 nips-2010-Improvements to the Sequence Memoizer
9 0.55749947 194 nips-2010-Online Learning for Latent Dirichlet Allocation
10 0.51199442 215 nips-2010-Probabilistic Deterministic Infinite Automata
11 0.49373046 242 nips-2010-Slice sampling covariance hyperparameters of latent Gaussian models
12 0.46373174 84 nips-2010-Exact inference and learning for cumulative distribution functions on loopy graphs
13 0.46149531 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks
14 0.4554885 178 nips-2010-Multivariate Dyadic Regression Trees for Sparse Learning Problems
15 0.45050326 185 nips-2010-Nonparametric Density Estimation for Stochastic Optimization with an Observable State Variable
16 0.44052839 144 nips-2010-Learning Efficient Markov Networks
17 0.431858 170 nips-2010-Moreau-Yosida Regularization for Grouped Tree Structure Learning
18 0.42377609 55 nips-2010-Cross Species Expression Analysis using a Dirichlet Process Mixture Model with Latent Matchings
19 0.4078342 168 nips-2010-Monte-Carlo Planning in Large POMDPs
20 0.40360597 213 nips-2010-Predictive Subspace Learning for Multi-view Data: a Large Margin Approach
topicId topicWeight
[(13, 0.053), (22, 0.267), (27, 0.12), (30, 0.076), (35, 0.025), (45, 0.186), (50, 0.067), (52, 0.034), (60, 0.041), (77, 0.028), (90, 0.032)]
simIndex simValue paperId paperTitle
1 0.90493709 8 nips-2010-A Log-Domain Implementation of the Diffusion Network in Very Large Scale Integration
Author: Yi-da Wu, Shi-jie Lin, Hsin Chen
Abstract: The Diffusion Network(DN) is a stochastic recurrent network which has been shown capable of modeling the distributions of continuous-valued, continuoustime paths. However, the dynamics of the DN are governed by stochastic differential equations, making the DN unfavourable for simulation in a digital computer. This paper presents the implementation of the DN in analogue Very Large Scale Integration, enabling the DN to be simulated in real time. Moreover, the logdomain representation is applied to the DN, allowing the supply voltage and thus the power consumption to be reduced without limiting the dynamic ranges for diffusion processes. A VLSI chip containing a DN with two stochastic units has been designed and fabricated. The design of component circuits will be described, so will the simulation of the full system be presented. The simulation results demonstrate that the DN in VLSI is able to regenerate various types of continuous paths in real-time. 1
same-paper 2 0.79830587 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data
Author: Zoubin Ghahramani, Michael I. Jordan, Ryan P. Adams
Abstract: Many data are naturally modeled by an unobserved hierarchical structure. In this paper we propose a flexible nonparametric prior over unknown data hierarchies. The approach uses nested stick-breaking processes to allow for trees of unbounded width and depth, where data can live at any node and are infinitely exchangeable. One can view our model as providing infinite mixtures where the components have a dependency structure corresponding to an evolutionary diffusion down a tree. By using a stick-breaking approach, we can apply Markov chain Monte Carlo methods based on slice sampling to perform Bayesian inference and simulate from the posterior distribution on trees. We apply our method to hierarchical clustering of images and topic modeling of text data. 1
3 0.69670677 127 nips-2010-Inferring Stimulus Selectivity from the Spatial Structure of Neural Network Dynamics
Author: Kanaka Rajan, L Abbott, Haim Sompolinsky
Abstract: How are the spatial patterns of spontaneous and evoked population responses related? We study the impact of connectivity on the spatial pattern of fluctuations in the input-generated response, by comparing the distribution of evoked and intrinsically generated activity across the different units of a neural network. We develop a complementary approach to principal component analysis in which separate high-variance directions are derived for each input condition. We analyze subspace angles to compute the difference between the shapes of trajectories corresponding to different network states, and the orientation of the low-dimensional subspaces that driven trajectories occupy within the full space of neuronal activity. In addition to revealing how the spatiotemporal structure of spontaneous activity affects input-evoked responses, these methods can be used to infer input selectivity induced by network dynamics from experimentally accessible measures of spontaneous activity (e.g. from voltage- or calcium-sensitive optical imaging experiments). We conclude that the absence of a detailed spatial map of afferent inputs and cortical connectivity does not limit our ability to design spatially extended stimuli that evoke strong responses. 1 1 Motivation Stimulus selectivity in neural networks was historically measured directly from input-driven responses [1], and only later were similar selectivity patterns observed in spontaneous activity across the cortical surface [2, 3]. We argue that it is possible to work in the reverse order, and show that analyzing the distribution of spontaneous activity across the different units in the network can inform us about the selectivity of evoked responses to stimulus features, even when no apparent sensory map exists. Sensory-evoked responses are typically divided into a signal component generated by the stimulus and a noise component corresponding to ongoing activity that is not directly related to the stimulus. Subsequent effort focuses on understanding how the signal depends on properties of the stimulus, while the remaining, irregular part of the response is treated as additive noise. The distinction between external stochastic processes and the noise generated deterministically as a function of intrinsic recurrence has been previously studied in chaotic neural networks [4]. It has also been suggested that internally generated noise is not additive and can be more sensitive to the frequency and amplitude of the input, compared to the signal component of the response [5 - 8]. In this paper, we demonstrate that the interaction between deterministic intrinsic noise and the spatial properties of the external stimulus is also complex and nonlinear. We study the impact of network connectivity on the spatial pattern of input-driven responses by comparing the structure of evoked and spontaneous activity, and show how the unique signature of these dynamics determines the selectivity of networks to spatial features of the stimuli driving them. 2 Model description In this section, we describe the network model and the methods we use to analyze its dynamics. Subsequent sections explore how the spatial patterns of spontaneous and evoked responses are related in terms of the distribution of the activity across the network. Finally, we show how the stimulus selectivity of the network can be inferred from its spontaneous activity patterns. 
2.1 Network elements We build a firing rate model of N interconnected units characterized by a statistical description of the underlying circuitry (as N → ∞, the system “self averages”, making the description independent of a specific network architecture; see also [11, 12]). Each unit is characterized by an activation variable xi ∀ i = 1, 2, . . . N, and a nonlinear response function ri which relates to xi through ri = R0 + φ(xi), where φ(x) = R0 tanh(x/R0) for x ≤ 0 and φ(x) = (Rmax − R0) tanh(x/(Rmax − R0)) otherwise. (1) Eq. 1 allows us to independently set the maximum firing rate Rmax and the background rate R0 to biologically reasonable values, while retaining a maximum gradient at x = 0 to guarantee the smoothness of the transition to chaos [4]. We introduce a recurrent weight matrix with element Jij equivalent to the strength of the synapse from unit j → unit i. The individual weights are chosen independently and randomly from a Gaussian distribution with mean and variance given by [Jij]_J = 0 and [Jij^2]_J = g^2/N, where square brackets are ensemble averages [9–11, 13]. The control parameter g, which scales as the variance of the synaptic weights, is particularly important in determining whether or not the network produces spontaneous activity with non-trivial dynamics (specifically, g = 0 corresponds to a completely uncoupled network and a network with g = 1 generates non-trivial spontaneous activity [4, 9, 10]). The activation variable for each unit xi is therefore determined by the relation τr dxi/dt = −xi + g Σ_{j=1}^{N} Jij rj + Ii, (2) with the time scale of the network set by the single-neuron time constant τr of 10 ms. The amplitude I of an oscillatory external input of frequency f is always the same for each unit, but in some examples shown in this paper, we introduce a neuron-specific phase factor θi, chosen randomly from a uniform distribution between 0 and 2π, such that Ii = I cos(2πf t + θi) ∀ i = 1, 2, . . . N. (3) In visually responsive neurons, this mimics a population of simple cells driven by a drifting grating of temporal frequency f, with the different phases arising from offsets in spatial receptive field locations. The randomly assigned phases in our model ensure that the spatial pattern of input is not correlated with the pattern of recurrent connectivity. In our selectivity analysis, however (Fig. 3), we replace the random phases with spatial input patterns that are aligned with network connectivity.
2.2 PCA redux Principal component analysis (PCA) has been applied profitably to neuronal recordings (see for example [14]), but these analyses often plot activity trajectories corresponding to different network states using the fixed principal component coordinates derived from combined activities under all stimulus conditions. Our analysis offers a complementary approach whereby separate principal components are derived for each stimulus condition, and the resulting principal angles reveal not only the difference between the shapes of trajectories corresponding to different network states, but also the orientation of the low-dimensional subspaces these trajectories occupy within the full N-dimensional space of neuronal activity. The instantaneous network state can be described by a point in an N-dimensional space with coordinates equal to the firing rates of the N units. Over time, the network activity traverses a trajectory in this N-dimensional space and PCA can be used to delineate the subspace in which this trajectory lies.
The analysis is done by diagonalizing the equal-time cross-correlation matrix of network firing rates given by Dij = ⟨(ri(t) − ⟨ri⟩)(rj(t) − ⟨rj⟩)⟩, (4) where
4 0.67607582 194 nips-2010-Online Learning for Latent Dirichlet Allocation
Author: Matthew Hoffman, Francis R. Bach, David M. Blei
Abstract: We develop an online variational Bayes (VB) algorithm for Latent Dirichlet Allocation (LDA). Online LDA is based on online stochastic optimization with a natural gradient step, which we show converges to a local optimum of the VB objective function. It can handily analyze massive document collections, including those arriving in a stream. We study the performance of online LDA in several ways, including by fitting a 100-topic topic model to 3.3M articles from Wikipedia in a single pass. We demonstrate that online LDA finds topic models as good or better than those found with batch VB, and in a fraction of the time. 1
5 0.673585 161 nips-2010-Linear readout from a neural population with partial correlation data
Author: Adrien Wohrer, Ranulfo Romo, Christian K. Machens
Abstract: How much information does a neural population convey about a stimulus? Answers to this question are known to strongly depend on the correlation of response variability in neural populations. These noise correlations, however, are essentially immeasurable as the number of parameters in a noise correlation matrix grows quadratically with population size. Here, we suggest to bypass this problem by imposing a parametric model on a noise correlation matrix. Our basic assumption is that noise correlations arise due to common inputs between neurons. On average, noise correlations will therefore reflect signal correlations, which can be measured in neural populations. We suggest an explicit parametric dependency between signal and noise correlations. We show how this dependency can be used to ”fill the gaps” in noise correlations matrices using an iterative application of the Wishart distribution over positive definitive matrices. We apply our method to data from the primary somatosensory cortex of monkeys performing a two-alternativeforced choice task. We compare the discrimination thresholds read out from the population of recorded neurons with the discrimination threshold of the monkey and show that our method predicts different results than simpler, average schemes of noise correlations. 1
6 0.67118466 268 nips-2010-The Neural Costs of Optimal Control
7 0.67079735 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
8 0.67005706 98 nips-2010-Functional form of motion priors in human motion perception
9 0.66951466 21 nips-2010-Accounting for network effects in neuronal responses using L1 regularized point process models
10 0.66292739 44 nips-2010-Brain covariance selection: better individual functional connectivity models using population prior
11 0.66272861 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
12 0.66092944 55 nips-2010-Cross Species Expression Analysis using a Dirichlet Process Mixture Model with Latent Matchings
13 0.66027492 155 nips-2010-Learning the context of a category
14 0.66005385 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior
15 0.65880913 81 nips-2010-Evaluating neuronal codes for inference using Fisher information
16 0.65773231 123 nips-2010-Individualized ROI Optimization via Maximization of Group-wise Consistency of Structural and Functional Profiles
17 0.65771002 17 nips-2010-A biologically plausible network for the computation of orientation dominance
18 0.6570611 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
19 0.65704131 117 nips-2010-Identifying graph-structured activation patterns in networks
20 0.65673715 6 nips-2010-A Discriminative Latent Model of Image Region and Object Tag Correspondence