nips nips2010 nips2010-51 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Dahua Lin, Eric Grimson, John W. Fisher
Abstract: We present a novel method for constructing dependent Dirichlet processes. The approach exploits the intrinsic relationship between Dirichlet and Poisson processes in order to create a Markov chain of Dirichlet processes suitable for use as a prior over evolving mixture models. The method allows for the creation, removal, and location variation of component models over time while maintaining the property that the random measures are marginally DP distributed. Additionally, we derive a Gibbs sampling algorithm for model inference and test it on both synthetic and real data. Empirical results demonstrate that the approach is effective in estimating dynamically varying mixture models. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We present a novel method for constructing dependent Dirichlet processes. [sent-6, score-0.095]
2 The approach exploits the intrinsic relationship between Dirichlet and Poisson processes in order to create a Markov chain of Dirichlet processes suitable for use as a prior over evolving mixture models. [sent-7, score-0.553]
3 1 Introduction As the cornerstone of Bayesian nonparametric modeling, Dirichlet processes (DP) [22] have been applied to a wide variety of inference and estimation problems [3, 10, 20] with Dirichlet process mixtures (DPMs) [15, 17] being one of the most successful. [sent-11, score-0.31]
4 DPMs are a generalization of finite mixture models that allow an indefinite number of mixture components. [sent-12, score-0.168]
5 Consider a document model where each document is generated under a particular topic and each topic is characterized by a distribution over words. [sent-19, score-0.11]
6 Over time, topics change: some old topics fade while new ones emerge. [sent-20, score-0.184]
7 A natural approach to model such topics is to use a Markov chain of DPs as a prior, such that the DP at each time is generated by varying the previous one in three possible ways: creating a new topic, removing an existing topic, and changing the word distribution of a topic. [sent-22, score-0.156]
8 Since MacEachern introduced the notion of dependent Dirichlet processes (DDP) [12], a variety of DDP constructions have been developed, which are based on either weighted mixtures of DPs [6, 14, 18], generalized Chinese restaurant processes [4, 21, 24], or the stick breaking construction [5, 7]. [sent-23, score-0.715]
9 Here, we propose a fundamentally different approach, taking advantage of the intrinsic relationship between Dirichlet processes and Poisson processes: a Dirichlet process is a normalized Gamma process, while a Gamma process is essentially a compound Poisson process. [sent-24, score-0.423]
10 The key idea is motivated by the following observation: operations that preserve complete randomness, when applied to Poisson processes, result in a new process that remains Poisson. [sent-25, score-0.311]
11 Consequently, one can obtain a Dirichlet process which is dependent on other DPs by applying such operations to their underlying compound Poisson processes. [sent-26, score-0.302]
12 We develop a Markov chain of DPs by combining these operations, leading to a framework that allows creation, removal, and location variation of particles. [sent-28, score-0.136]
13 Our approach relates to previous efforts in constructing dependent DPs while overcoming inherent limitations. [sent-30, score-0.095]
14 2 Poisson, Gamma, and Dirichlet Processes Our construction of dependent Dirichlet processes rests upon the connection between Poisson, Gamma, and Dirichlet processes, as well as the concept of complete randomness. [sent-32, score-0.33]
15 Let (Ω, FΩ ) be a measurable space, and Π be a random point process on Ω. [sent-34, score-0.152]
16 A Poisson process Π on Ω with mean measure µ, denoted Π ∼ PoissonP(µ), is defined to be a point process such that NΠ(A) has a Poisson distribution with mean µ(A) and that for any disjoint measurable sets A1, . . . , An, the counts NΠ(A1), . . . , NΠ(An) are independent. [sent-37, score-0.246]
17 Poisson processes are the only point processes that satisfy this property [9]: Theorem 1. [sent-45, score-0.279]
18 A random point process Π on a regular measure space is a Poisson process if and only if NΠ is completely random. [sent-46, score-0.188]
19 Note that Σ∗ is also a completely random measure (but not a point process in general), and is essentially a generalization of the compound Poisson process. [sent-50, score-0.144]
20 A random measure of the form in (1) is called a Gamma process with base measure µ, denoted by G ∼ ΓP(µ). [sent-52, score-0.128]
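To make the normalized-Gamma view concrete, the following sketch draws a finite-truncation approximation of G ∼ ΓP(αH) (independent Gamma(α/N, 1) weights at i.i.d. atoms from a base distribution H) and normalizes it into an approximate DP sample. The truncation level, the standard-normal base H, and all variable names are illustrative assumptions, not part of the paper.

```python
import numpy as np

def truncated_gamma_process(alpha, base_sampler, n_atoms, rng):
    """Finite approximation of a Gamma process G = sum_i w_i * delta_{theta_i}
    with base measure alpha * H: independent Gamma(alpha/N, 1) weights at atoms
    theta_i ~ H. Normalizing G gives an (approximate) draw from DP(alpha * H)."""
    weights = rng.gamma(alpha / n_atoms, 1.0, size=n_atoms)   # unnormalized Gamma-process weights
    atoms = base_sampler(n_atoms, rng)                        # particle locations theta_i
    return weights, atoms

rng = np.random.default_rng(0)
w, theta = truncated_gamma_process(alpha=5.0,
                                   base_sampler=lambda n, r: r.normal(size=n),
                                   n_atoms=500, rng=rng)
d = w / w.sum()                                               # D = G / G(Omega)
print(d.sum(), theta[:3])
```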
21 3 Construction of Dependent Dirichlet Processes Motivated by the relationship between Poisson and Dirichlet processes, we develop a new approach for constructing dependent Dirichlet processes (DDPs). [sent-55, score-0.28]
22 Our approach can be described as follows: given a collection of Dirichlet processes, one can apply operations that preserve the complete randomness of their underlying Poisson processes. [sent-56, score-0.095]
23 This yields a new Poisson process (due to theorem 1) and a related DP which depends on the source. [sent-57, score-0.13]
24 Superposition of Poisson processes: Combining a set of independent Poisson processes yields a Poisson process whose mean measure is the sum of mean measures of the individual ones. [sent-59, score-0.279]
25 Let Π1, . . . , Πm be independent Poisson processes on Ω with Πk ∼ PoissonP(µk); then their union satisfies Π1 ∪ · · · ∪ Πm ∼ PoissonP(µ1 + · · · + µm). [sent-64, score-0.185]
26 (4) Given a collection of independent Gamma processes G1, . . . , Gm, we can combine them by superposing their underlying Poisson processes. [sent-65, score-0.185]
27 (5) Due to the relationship between Gamma processes and their underlying Poisson processes, such a combination is equivalent to the direct superposition of the Gamma processes themselves, as G′ := G1 + · · · + Gm ∼ ΓP(µ1 + · · · + µm). [sent-73, score-0.482]
28 (6) Let Dk = Gk/Gk(Ω) and gk = Gk(Ω); then Dk is independent of gk, and thus D′ := G′/G′(Ω) = (g1 D1 + · · · + gm Dm)/(g1 + · · · + gm) = c1 D1 + · · · + cm Dm. [sent-74, score-0.282]
29 (7) Here, ck = gk / (g1 + · · · + gm), which has (c1, . . . , cm) ∼ Dir(µ1(Ω), . . . , µm(Ω)). [sent-75, score-0.139]
30 Let D1, . . . , Dm be independent Dirichlet processes on Ω with Dk ∼ DP(µk), and let (c1, . . . , cm) ∼ Dir(µ1(Ω), . . . , µm(Ω)); then D = c1 D1 ⊕ · · · ⊕ cm Dm ∼ DP(µ1 + · · · + µm). [sent-87, score-0.185]
31 (8) Here, we use the symbol ⊕ to indicate superposition via a random convex combination. [sent-97, score-0.112]
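As a sketch of the superposition operation, the snippet below combines m independent truncated DP draws with mixing weights (c1, . . . , cm) ∼ Dir(µ1(Ω), . . . , µm(Ω)). The finite truncation, the Gaussian base measures, and the concentration values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def truncated_dp(alpha, base_sampler, n_atoms, rng):
    """Crude finite approximation of DP(alpha * H): normalized Gamma(alpha/N, 1) weights."""
    w = rng.gamma(alpha / n_atoms, 1.0, size=n_atoms)
    return w / w.sum(), base_sampler(n_atoms, rng)

def superpose(dps, masses, rng):
    """D = c1*D1 (+) ... (+) cm*Dm with (c1, ..., cm) ~ Dir(mu_1(Omega), ..., mu_m(Omega))."""
    c = rng.dirichlet(np.asarray(masses, dtype=float))
    weights = np.concatenate([ck * wk for ck, (wk, _) in zip(c, dps)])
    atoms = np.concatenate([ak for _, ak in dps])
    return weights, atoms

masses = [2.0, 5.0]                                # total masses mu_k(Omega), i.e. the concentrations
dps = [truncated_dp(a, lambda n, r, k=k: r.normal(loc=float(k), size=n), 200, rng)
       for k, a in enumerate(masses)]
w, theta = superpose(dps, masses, rng)
print(w.sum())                                     # ~1.0: a single combined DP draw
```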
32 Subsampling Poisson processes: Random subsampling of a Poisson process via independent Bernoulli trials yields a new Poisson process. [sent-99, score-0.277]
33 Let Π ∼ PoissonP(µ) be a Poisson process on the space Ω, and q : Ω → [0, 1] be a measurable function. [sent-101, score-0.152]
34 If we independently draw zθ ∈ {0, 1} for each θ ∈ Π with P(zθ = 1) = q(θ), and let Πk = {θ ∈ Π : zθ = k} for k = 0, 1, then Π0 and Π1 are independent Poisson processes on Ω, with Π0 ∼ PoissonP((1 − q)µ) and Π1 ∼ PoissonP(qµ). [sent-102, score-0.247]
35 We emphasize that subsampling is via independent Bernoulli trials rather than choosing a fixed number of particles. [sent-103, score-0.183]
36 Note that subsampling the underlying Poisson process of a Gamma process G is equivalent to subsampling the terms of G. [sent-105, score-0.554]
37 Let G = Σ_{i=1}^∞ wi δθi, and for each i, draw zi with P(zi = 1) = q(θi). [sent-106, score-0.114]
38 Then G′ = Σ_{i:zi=1} wi δθi (10). Let D be a Dirichlet process given by D = G/G(Ω); then we can construct a new Dirichlet process D′ = G′/G′(Ω) by subsampling the terms of D and renormalizing their coefficients. [sent-108, score-0.371]
39 Let D ∼ DP(µ) be represented by D = Σ_{i=1}^∞ ri δθi, and let q : Ω → [0, 1] be a measurable function. [sent-111, score-0.109]
40 For each i we independently draw zi with P(zi = 1) = q(θi); then D′ = Sq(D) := Σ_{i:zi=1} r′i δθi ∼ DP(qµ), (11) where r′i := ri / Σ_{j:zj=1} rj are the re-normalized coefficients for those i with zi = 1. [sent-112, score-0.319]
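A minimal sketch of this subsampling operation Sq(D): thin the atoms of a truncated DP draw with independent Bernoulli trials of acceptance probability q(θ), then renormalize the surviving weights. The constant acceptance function and the Gaussian atoms are illustrative choices only.

```python
import numpy as np

def subsample_dp(weights, atoms, q, rng):
    """S_q(D): keep atom theta_i with probability q(theta_i), then renormalize the kept weights."""
    keep = rng.random(len(weights)) < q(atoms)
    if not np.any(keep):                       # degenerate case: nothing survived the thinning
        return np.array([]), np.array([])
    w = weights[keep]
    return w / w.sum(), atoms[keep]

rng = np.random.default_rng(2)
w = rng.dirichlet(np.full(100, 0.05))          # a crude truncated DP draw
theta = rng.normal(size=100)
w_sub, theta_sub = subsample_dp(w, theta, q=lambda t: np.full(len(t), 0.7), rng=rng)
print(len(theta_sub), w_sub.sum())
```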
41 As a consequence, we can derive a Gamma process and thus a Dirichlet process by applying the probabilistic transition to the location of each term, leading to the following: Theorem 7. [sent-121, score-0.282]
42 Let D = Σ_{i=1}^∞ ri δθi ∼ DP(µ) be a Dirichlet process on Ω; then Σ_{i=1}^∞ ri δT(θi) ∼ DP(Tµ). [sent-122, score-0.196]
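And a sketch of the point-transition operation: each atom location is resampled from a transition kernel T(θi, ·) while its weight is left unchanged. The Gaussian random-walk kernel is an assumed example, not prescribed by the paper.

```python
import numpy as np

def transition_dp(weights, atoms, transition_sampler, rng):
    """T(D) = sum_i r_i * delta_{T(theta_i)}: weights kept, locations moved through T(theta_i, .)."""
    new_atoms = np.array([transition_sampler(theta, rng) for theta in atoms])
    return weights.copy(), new_atoms

rng = np.random.default_rng(3)
w = rng.dirichlet(np.full(20, 0.5))
theta = rng.normal(size=20)
# Example kernel T(theta, .): a small Gaussian random walk around the current location.
w_new, theta_new = transition_dp(w, theta, lambda t, r: t + 0.1 * r.normal(), rng)
print(np.abs(theta_new - theta).mean())
```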
43 The model, Dt = T(Sq(Dt−1)) ⊕ Ht (16), can be explained as follows: given Dt−1, we choose a subset of terms by subsampling, then move their locations via a probabilistic transition T, and finally superimpose a new DP Ht on the resultant process to form Dt. [sent-128, score-0.159]
44 Hence, creating new particles, removing existing particles, and varying particle locations are all allowed, respectively, via superposition, subsampling, and point transition. [sent-129, score-0.139]
45 Note that while these operations are defined on the underlying Poisson processes, Theorems 3, 5, and 7 allow us to operate directly on the DPs, without explicitly instantiating the associated Poisson or Gamma processes. [sent-130, score-0.248]
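Putting the three operations together, here is a hedged sketch of one step Dt = T(Sq(Dt−1)) ⊕ Ht of the chain, working directly on the atom/weight representation. The relative masses used for the superposition weights (q·α for the retained part, αν for the innovation DP), the constant q, the random-walk T, and the truncation are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def truncated_dp(alpha, n_atoms, rng):
    """Crude finite approximation of DP(alpha * N(0, 1))."""
    w = rng.gamma(alpha / n_atoms, 1.0, size=n_atoms)
    return w / w.sum(), rng.normal(size=n_atoms)

def ddp_step(weights, atoms, q, transition, mass_old, mass_new, rng, n_new=200):
    """One step D_t = T(S_q(D_{t-1})) (+) H_t of the DDP Markov chain (atom/weight sketch)."""
    # 1. Subsampling: keep each atom with probability q(theta), renormalize.
    #    (For brevity we assume at least one atom survives the thinning.)
    keep = rng.random(len(weights)) < q(atoms)
    w_old, th_old = weights[keep], atoms[keep]
    w_old = w_old / w_old.sum()
    # 2. Point transition: move the retained atoms through T.
    th_old = np.array([transition(t, rng) for t in th_old])
    # 3. Superposition with an innovation DP H_t; the mixing weights follow a
    #    Dirichlet over the (assumed) total masses of the two parts.
    w_new, th_new = truncated_dp(mass_new, n_new, rng)
    c_old, c_new = rng.dirichlet([mass_old, mass_new])
    return np.concatenate([c_old * w_old, c_new * w_new]), np.concatenate([th_old, th_new])

alpha, q_const, alpha_nu = 5.0, 0.8, 1.0
w, theta = truncated_dp(alpha, 200, rng)                        # D_{t-1}
w, theta = ddp_step(w, theta,
                    q=lambda t: np.full(len(t), q_const),
                    transition=lambda t, r: t + 0.05 * r.normal(),
                    mass_old=q_const * alpha, mass_new=alpha_nu, rng=rng)
print(len(theta), round(w.sum(), 6))
```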
46 We aim to use the Markov chain of DPs as a prior of evolving mixture models. [sent-140, score-0.183]
47 Müller et al. [14] formulated each DP as a weighted mixture of a common DP and an independent DP. [sent-144, score-0.126]
48 Caron et al. [4] developed a generalized Pólya urn scheme, while Ahmed and Xing [1] developed the recurrent Chinese restaurant process (CRP). [sent-150, score-0.136]
49 In particular, [4] supports innovation and deletion of particles, but does not support variation of locations. [sent-153, score-0.099]
50 It can be considered a special case of the proposed framework in which the subsampling operation is not incorporated. [sent-156, score-0.183]
51 Griffin and Steel [7] present the πDDP based on the stick breaking construction [19], reordering the stick breaking ratios for each time so as to obtain different distributions over the particles. [sent-158, score-0.317]
52 This work is further extended [8] to generic stick breaking processes. [sent-159, score-0.117]
53 Rather than reordering the stick breaking ratios, they regroup them locally such that dependent DPs can be constructed over a general covariate space. [sent-161, score-0.245]
54 They construct a universal Gamma process in an auxiliary space and obtain dependent DPs by normalizing it within overlapping local regions. [sent-165, score-0.218]
55 In [16], the dependency is established through region overlap, while in our work it is accomplished by explicitly transferring particles from one DP to another. [sent-167, score-0.13]
56 In addition, this work does not support location variation, as it relies on a universal particle pool that is fixed over time. [sent-168, score-0.139]
57 The key idea is to derive sampling steps by exploiting the fact that our construction maintains the property of being marginally DP via connections to the underlying Poisson processes. [sent-170, score-0.137]
58 Given Φ ∼ D, we have D′|Φ ∼ DP(αν pν + α0 pqµ + Σ_{k=1}^m qk ck T(φk, ·)). [sent-176, score-0.205]
59 Marginalizing over D′, we get θ1|Φ ∼ (αν/α1) pν + (α0/α1) pqµ + (1/α1) Σ_{k=1}^m qk ck T(φk, ·), with α1 = αν + α0 + Σ_{k=1}^m qk ck. [sent-179, score-0.41]
60 (20) k=1 Thus we sample θ1 from three types of sources: the innovation distribution pν , the q-subsampled base distribution pqµ , and the transition distribution T (φk , ·). [sent-180, score-0.155]
61 When u1 = −1 or 0, θ1 is a new particle, and we have D′|θ1, {u1 ≤ 0} ∼ DP(αν pν + α0 pqµ + Σ_{k=1}^m qk ck T(φk, ·) + δθ1). [sent-187, score-0.205]
62 (21) k=1 If u1 = l > 0, we know that the particle φl is retained in the subsampling process (i. [sent-188, score-0.416]
63 Hence, D′|θ1, {u1 = l > 0} ∼ DP(αν pν + α0 pqµ + Σ_{k≠l} qk ck T(φk, ·) + (cl + 1)δθ1). [sent-191, score-0.205]
64 We use the Markov chain of DPs as the prior of evolving mixture models. [sent-195, score-0.183]
65 (1) Let m denote the number of particles, which is initialized to m̃ and will increase as we draw new particles from pν or pqµ. [sent-215, score-0.192]
66 Particularly, we set wk = qk ck for k > 0, w−1 = αν , and w0 = α0 . [sent-217, score-0.245]
67 (3) Let ψk denote the particles, whose values are decided when a new particle or the transited version of a previous one is sampled. [sent-218, score-0.18]
68 (4) The label li indicates which particle θi corresponds to, and the counter rk records the number of times that ψk has been sampled (initialized to 0). [sent-219, score-0.259]
69 For i = 1, . . . , n, we first draw the indicator ui with probability P(ui = k) ∝ wk F(k, i). [sent-226, score-0.153]
70 (1) If ui = −1 or 0, we draw θi from pν |xi or pqµ |xi , respectively, and then add it as a new particle. [sent-229, score-0.113]
71 If rk = 0, then it is the first time we have drawn ui = k. [sent-233, score-0.097]
72 If rk > 0, the k-th particle has been sampled before. [sent-235, score-0.185]
73 In both cases, we set the label li = k, increase the weight wk and the counter rk by 1, and update F(k, i) to f(xi|ψk) for each i. [sent-237, score-0.156]
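For concreteness, here is a sketch of the core of this bootstrapping pass for a 1-D Gaussian emission model: each indicator is drawn with probability proportional to wk·F(k, i) = wk·f(xi|ψk), and the weight and counter of the chosen particle are incremented. The Gaussian likelihood with fixed noise, the toy prior weights, and the omission of the new-particle branches (ui = −1 or 0) are simplifying assumptions.

```python
import numpy as np

def gauss_pdf(x, mean, sigma):
    """f(x | psi_k) for a 1-D Gaussian emission model with fixed noise sigma."""
    return np.exp(-0.5 * ((x - mean) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

def bootstrap_assign(x, psi, w, rng, sigma=1.0):
    """Indicator draws with P(u_i = k) proportional to w_k * f(x_i | psi_k) (existing particles only)."""
    labels = np.empty(len(x), dtype=int)
    counters = np.zeros(len(psi))                  # r_k
    w = w.astype(float).copy()
    for i, xi in enumerate(x):
        p = w * gauss_pdf(xi, psi, sigma)          # w_k * F(k, i)
        k = rng.choice(len(psi), p=p / p.sum())
        labels[i] = k
        w[k] += 1.0                                # increase the weight ...
        counters[k] += 1.0                         # ... and the counter by 1
    return labels, w, counters

rng = np.random.default_rng(6)
psi = np.array([-3.0, 0.0, 4.0])                   # current particles
w0 = np.array([0.9, 1.2, 0.6])                     # toy prior weights q_k * c_k
x = np.concatenate([rng.normal(-3, 1, 30), rng.normal(4, 1, 20)])
labels, w, r = bootstrap_assign(x, psi, w0, rng)
print(np.bincount(labels, minlength=len(psi)))
```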
74 Note that this procedure is inefficient in that it samples each particle φk merely based on the first observation with label k. [sent-238, score-0.175]
75 Therefore, we use this procedure for bootstrapping, and then run a Gibbs sampling scheme that iterates between parameter update and label update. [sent-239, score-0.111]
76 (Parameter update): We resample each particle ψk from its source distribution conditioned on all samples with label k. [sent-240, score-0.175]
77 In particular, for k ∈ [1, m̃] with rk > 0, we draw ψk ∼ T(φk, ·)|{xi : li = k}, and for k ∈ [m̃ + 1, m], we draw ψk ∼ p|{xi : li = k}, where p = pqµ or pν, depending on which source ψk was initially sampled from. [sent-241, score-0.17]
78 The only difference is that when we update a label from k to k′, we need to decrease the weight and counter for k. [sent-244, score-0.105]
79 If rk decreases to zero, we remove ψk and reset wk to qk ck when k ≤ m̃. [sent-245, score-0.291]
80 At the end of each phase t, we sample ψk ∼ T (φk , ·) for each k with rk = 0. [sent-246, score-0.106]
81 In addition, for each such particle, we update the acceptance probability as qk ← qk · q(φk), which is the prior probability that the particle φk will survive in the next phase. [sent-247, score-0.518]
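Finally, a sketch of this end-of-phase bookkeeping: particles that received no observations are propagated through T, and their acceptance probabilities are multiplied by q(φk), the prior probability of surviving into the next phase. The constant q and the Gaussian random-walk T are assumed placeholders.

```python
import numpy as np

def end_of_phase(psi, counters, q_accept, q_fn, transition, rng):
    """Propagate unused particles through T and update q_k <- q_k * q(phi_k) for them."""
    psi, q_accept = psi.copy(), q_accept.copy()
    for k in range(len(psi)):
        if counters[k] == 0:                       # particle drew no observations in this phase
            phi_k = psi[k]
            psi[k] = transition(phi_k, rng)        # psi_k ~ T(phi_k, .)
            q_accept[k] *= q_fn(phi_k)             # prior survival probability for the next phase
    return psi, q_accept

rng = np.random.default_rng(7)
psi = np.array([-3.0, 0.0, 4.0])
counters = np.array([12, 0, 8])                    # r_k from the current phase
q0 = np.ones(3)
psi_next, q_next = end_of_phase(psi, counters, q0,
                                q_fn=lambda t: 0.9,
                                transition=lambda t, r: t + 0.1 * r.normal(), rng=rng)
print(psi_next, q_next)
```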
82 In the synthetic case, we compare our method with dynamic FMM in modeling mixtures of Gaussians whose number and centers evolve over time. [sent-252, score-0.1]
83 For real data, we test the approach in modeling the motion of people in crowded scenes and the trends of research topics reflected in index terms. [sent-253, score-0.282]
84 We initialized the model with two Gaussian components, and added new components following a temporal Poisson process (one per 20 phases). [sent-256, score-0.196]
85 The particles of the last iteration at each phase were incorporated into the model as a prior for sampling in the next phase. [sent-288, score-0.234]
86 D-FMM with K = 2 or 3 at phases 30−50 and 66−76); when K is greater than the actual number, samples from the same component are divided into multiple groups and assigned to different components (e. [sent-295, score-0.131]
87 With q = 0.1, the model creates new components rather than inheriting from previous phases, leading to poor performance when the number of samples is limited. [sent-301, score-0.101]
88 With q = 0.9, the components in previous phases have a higher survival rate, resulting in more reliable estimation of the component parameters from multiple phases. [sent-303, score-0.102]
89 It was observed [11] that the majority of people walking in crowded areas such as a rail station tend to follow motion flows. [sent-311, score-0.154]
90 Figure 2(b): left, the timelines of the top 10 topics; right, the two leading keywords for these topics. [sent-321, score-0.111]
91 We divide the sequence into 60 phases (one minute each), extract location-velocity pairs from all tracks, and randomly choose 3000 pairs for each phase for model inference. [sent-323, score-0.131]
92 Consequently, with a much smaller number of components (12 active components on average), our method attains modeling accuracy similar to that of a D-FMM with 50 components. [sent-334, score-0.095]
93 Figure 2(b) shows the timelines of the top 10 topics, together with the top two index terms for each of them. [sent-344, score-0.118]
94 7 Conclusion and Future Directions We developed a principled framework for constructing dependent Dirichlet processes. [sent-346, score-0.095]
95 In contrast to most DP-based approaches, our construction is motivated by the intrinsic relation between Dirichlet processes and compound Poisson processes. [sent-347, score-0.285]
96 We further combined these operations to derive a Markov chain of DPs, leading to a prior over mixture models that allows creation, removal, and location variation of component models under a unified formulation. [sent-349, score-0.283]
97 The simulations on synthetic data and the experiments on modeling people flows and paper topics clearly demonstrate that the proposed method is effective in estimating mixture models that evolve over time. [sent-351, score-0.292]
98 The fact that each completely random point process is a Poisson process suggests that any operation that preserves the complete randomness can be applied to obtain dependent Poisson processes, and thus dependent DPs. [sent-353, score-0.41]
99 For example, random merging and random splitting of particles also possess this property, which would lead to an extended framework that allows merging and splitting of component models. [sent-355, score-0.13]
100 We believe that, as a starting point, this paper will stimulate further efforts to exploit the relation between Poisson processes and Dirichlet processes. [sent-358, score-0.185]
wordName wordTfidf (topN-words)
[('poisson', 0.371), ('dirichlet', 0.339), ('dps', 0.329), ('dp', 0.276), ('poissonp', 0.247), ('processes', 0.185), ('subsampling', 0.183), ('pq', 0.166), ('particle', 0.139), ('qk', 0.135), ('particles', 0.13), ('gamma', 0.127), ('superposition', 0.112), ('dependent', 0.095), ('process', 0.094), ('topics', 0.092), ('ddp', 0.09), ('ows', 0.085), ('mixture', 0.084), ('fmm', 0.082), ('timelines', 0.082), ('acceptance', 0.078), ('phases', 0.071), ('ck', 0.07), ('gk', 0.069), ('sq', 0.066), ('dm', 0.066), ('transition', 0.065), ('chain', 0.064), ('var', 0.063), ('operations', 0.063), ('draw', 0.062), ('stick', 0.061), ('phase', 0.06), ('measurable', 0.058), ('dk', 0.057), ('breaking', 0.056), ('innovation', 0.056), ('topic', 0.055), ('dt', 0.054), ('restaurant', 0.052), ('zi', 0.052), ('ri', 0.051), ('ui', 0.051), ('gm', 0.05), ('compound', 0.05), ('construction', 0.05), ('removal', 0.047), ('people', 0.047), ('rk', 0.046), ('median', 0.044), ('sampling', 0.044), ('qd', 0.044), ('cm', 0.044), ('chinese', 0.044), ('variation', 0.043), ('marginally', 0.043), ('al', 0.042), ('csail', 0.042), ('caron', 0.041), ('ddps', 0.041), ('dpms', 0.041), ('grimson', 0.041), ('inheriting', 0.041), ('transited', 0.041), ('motion', 0.041), ('creation', 0.041), ('wk', 0.04), ('markov', 0.039), ('counter', 0.038), ('cov', 0.037), ('maceachern', 0.036), ('polya', 0.036), ('label', 0.036), ('index', 0.036), ('theorem', 0.036), ('realization', 0.036), ('evolve', 0.036), ('evolving', 0.035), ('base', 0.034), ('modeling', 0.033), ('reordering', 0.033), ('crowded', 0.033), ('crp', 0.033), ('station', 0.033), ('urn', 0.033), ('teh', 0.033), ('randomness', 0.032), ('ht', 0.032), ('dunson', 0.031), ('hdp', 0.031), ('ren', 0.031), ('update', 0.031), ('components', 0.031), ('mixtures', 0.031), ('actual', 0.029), ('gibbs', 0.029), ('overlapped', 0.029), ('trait', 0.029), ('leading', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
Author: Dahua Lin, Eric Grimson, John W. Fisher
Abstract: We present a novel method for constructing dependent Dirichlet processes. The approach exploits the intrinsic relationship between Dirichlet and Poisson processes in order to create a Markov chain of Dirichlet processes suitable for use as a prior over evolving mixture models. The method allows for the creation, removal, and location variation of component models over time while maintaining the property that the random measures are marginally DP distributed. Additionally, we derive a Gibbs sampling algorithm for model inference and test it on both synthetic and real data. Empirical results demonstrate that the approach is effective in estimating dynamically varying mixture models. 1
2 0.23474653 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data
Author: Zoubin Ghahramani, Michael I. Jordan, Ryan P. Adams
Abstract: Many data are naturally modeled by an unobserved hierarchical structure. In this paper we propose a flexible nonparametric prior over unknown data hierarchies. The approach uses nested stick-breaking processes to allow for trees of unbounded width and depth, where data can live at any node and are infinitely exchangeable. One can view our model as providing infinite mixtures where the components have a dependency structure corresponding to an evolutionary diffusion down a tree. By using a stick-breaking approach, we can apply Markov chain Monte Carlo methods based on slice sampling to perform Bayesian inference and simulate from the posterior distribution on trees. We apply our method to hierarchical clustering of images and topic modeling of text data. 1
3 0.21139517 237 nips-2010-Shadow Dirichlet for Restricted Probability Modeling
Author: Bela Frigyik, Maya Gupta, Yihua Chen
Abstract: Although the Dirichlet distribution is widely used, the independence structure of its components limits its accuracy as a model. The proposed shadow Dirichlet distribution manipulates the support in order to model probability mass functions (pmfs) with dependencies or constraints that often arise in real world problems, such as regularized pmfs, monotonic pmfs, and pmfs with bounded variation. We describe some properties of this new class of distributions, provide maximum entropy constructions, give an expectation-maximization method for estimating the mean parameter, and illustrate with real data. 1 Modeling Probabilities for Machine Learning Modeling probability mass functions (pmfs) as random is useful in solving many real-world problems. A common random model for pmfs is the Dirichlet distribution [1]. The Dirichlet is conjugate to the multinomial and hence mathematically convenient for Bayesian inference, and the number of parameters is conveniently linear in the size of the sample space. However, the Dirichlet is a distribution over the entire probability simplex, and for many problems this is simply the wrong domain if there is application-specific prior knowledge that the pmfs come from a restricted subset of the simplex. For example, in natural language modeling, it is common to regularize a pmf over n-grams by some generic language model distribution q0 , that is, the pmf to be modeled is assumed to have the form θ = λq + (1 − λ)q0 for some q in the simplex, λ ∈ (0, 1) and a fixed generic model q0 [2]. But once q0 and λ are fixed, the pmf θ can only come from a subset of the simplex. Another natural language processing example is modeling the probability of keywords in a dictionary where some words are related, such as espresso and latte, and evidence for the one is to some extent evidence for the other. This relationship can be captured with a bounded variation model that would constrain the modeled probability of espresso to be within some of the modeled probability of latte. We show that such bounds on the variation between pmf components also restrict the domain of the pmf to a subset of the simplex. As a third example of restricting the domain, the similarity discriminant analysis classifier estimates class-conditional pmfs that are constrained to be monotonically increasing over an ordered sample space of discrete similarity values [3]. In this paper we propose a simple variant of the Dirichlet whose support is a subset of the simplex, explore its properties, and show how to learn the model from data. We first discuss the alternative solution of renormalizing the Dirichlet over the desired subset of the simplex, and other related work. Then we propose the shadow Dirichlet distribution; explain how to construct a shadow Dirichlet for three types of restricted domains: the regularized pmf case, bounded variation between pmf components, and monotonic pmfs; and discuss the most general case. We show how to use the expectation-maximization (EM) algorithm to estimate the shadow Dirichlet parameter α, and present simulation results for the estimation. 1 Dirichlet Shadow Dirichlet Renormalized Dirichlet Figure 1: Dirichlet, shadow Dirichlet, and renormalized Dirichlet for α = [3.94 2.25 2.81]. 2 Related Work One solution to modeling pmfs on only a subset of the simplex is to simply restrict the support of ˜ ˜ the Dirichlet to the desired support S, and renormalize the Dirichlet over S (see Fig. 1 for an example). 
This renormalized Dirichlet has the advantage that it is still a conjugate distribution for the multinomial. Nallapati et al.considered the renormalized Dirichlet for language modeling, but found it difficult to use because the density requires numerical integration to compute the normalizer [4] . In addition, there is no closed form solution for the mean, covariance, or peak of the renormalized Dirichlet, making it difficult to work with. Table 1 summarizes these properties. Additionally, generating samples from the renormalized Dirichlet is inefficient: one draws samples from the stan˜ dard Dirichlet, then rejects realizations that are outside S. For high-dimensional sample spaces, this could greatly increase the time to generate samples. Although the Dirichlet is a classic and popular distribution on the simplex, Aitchison warns it “is totally inadequate for the description of the variability of compositional data,” because of its “implied independence structure and so the Dirichlet class is unlikely to be of any great use for describing compositions whose components have even weak forms of dependence” [5]. Aitchison instead championed a logistic normal distribution with more parameters to control covariance between components. A number of variants of the Dirichlet that can capture more dependence have been proposed and analyzed. For example, the scaled Dirichlet enables a more flexible shape for the distribution [5], but does not change the support. The original Dirichlet(α1 , α2 , . . . αd ) can be derived as Yj / j Yj where Yj ∼ Γ(αj , β), whereas the scaled Dirichlet is derived from Yj ∼ Γ(αj , βj ), resulting in α density p(θ) = γ j ( α −1 βj j θ j j βi θi )α1 +···+αd i , where β, α ∈ Rd are parameters, and γ is the normalizer. + Another variant is the generalized Dirichlet [6] which also has parameters β, α ∈ Rd , and allows + greater control of the covariance structure, again without changing the support. As perhaps first noted by Karl Pearson [7] and expounded upon by Aitchison [5], correlations of proportional data can be very misleading. Many Dirichlet variants have been generalizations of the Connor-Mossiman variant, Dirichlet process variants, other compound Dirichlet models, and hierarchical Dirichlet models. Ongaro et al. [8] propose the flexible Dirichlet distribution by forming a re-parameterized mixture of Dirichlet distributions. Rayens and Srinivasan [9] considered the dependence structure for the general Dirichlet family called the generalized Liouville distributions. In contrast to prior efforts, the shadow Dirichlet manipulates the support to achieve various kinds of dependence that arise frequently in machine learning problems. 3 Shadow Dirichlet Distribution We introduce a new distribution that we call the shadow Dirichlet distribution. Let S be the prob˜ ability (d − 1)-simplex, and let Θ ∈ S be a random pmf drawn from a Dirichlet distribution with density pD and unnormalized parameter α ∈ Rd . Then we say the random pmf Θ ∈ S is distributed + ˜ according to a shadow Dirichlet distribution if Θ = M Θ for some fixed d × d left-stochastic (that ˜ is, each column of M sums to 1) full-rank (and hence invertible) matrix M , and we call Θ the gen2 erating Dirichlet of Θ, or Θ’s Dirichlet shadow. Because M is a left-stochastic linear map between finite-dimensional spaces, it is a continuous map from the convex and compact S to a convex and compact subset of S that we denote SM . 
The shadow Dirichlet has two parameters: the generating Dirichlet’s parameter α ∈ Rd , and the + d × d matrix M . Both α and M can be estimated from data. However, as we show in the following subsections, the matrix M can be profitably used as a design parameter that is chosen based on application-specific knowledge or side-information to specify the restricted domain SM , and in that way impose dependency between the components of the random pmfs. The shadow Dirichlet density p(θ) is the normalized pushforward of the Dirichlet density, that is, it is the composition of the Dirichlet density and M −1 with the Jacobian: 1 α −1 p(θ) = (M −1 θ)j j , (1) B(α) |det(M )| j Γ(αj ) d j is the standard Dirichlet normalizer, and α0 = j=1 αj is the standard where B(α) Γ(α0 ) Dirichlet precision factor. Table 1 summarizes the basic properties of the shadow Dirichlet. Fig. 1 shows an example shadow Dirichlet distribution. Generating samples from the shadow Dirichlet is trivial: generate samples from its generating Dirichlet (for example, using stick-breaking or urn-drawing) and multiply each sample by M to create the corresponding shadow Dirichlet sample. Table 1: Table compares and summarizes the Dirichlet, renormalized Dirichlet, and shadow Dirichlet distributions. Dirichlet(α) Density p(θ) Mean 1 B(α) d j=1 α −1 θj j Shadow Dirichlet (α, M ) 1 B(α)|det(M )| α α0 Renormalized ˜ Dirichlet (α, S) d −1 αj −1 θ)j j=1 (M 1 ˜ S M d j=1 α α0 αj −1 qj dq d j=1 α −1 θj j ˜ θp(θ)dθ S ¯ ¯ − θ)(θ − θ)T p(θ)dθ Cov(Θ) M Cov(Θ)M T αj −1 α0 −d j M α0 −d max p(θ) stick-breaking, urn-drawing draw from Dirichlet(α), multiply by M draw from Dirichlet(α), ˜ reject if not in S ML Estimate iterative (simple functions) iterative (simple functions) unknown complexity ML Compound Estimate iterative (simple functions) iterative (numerical integration) unknown complexity Covariance Mode (if α > 1) How to Sample 3.1 α −1 ˜ (θ S ˜ θ∈S Example: Regularized Pmfs The shadow Dirichlet can be designed to specify a distribution over a set of regularized pmfs SM = ˜ ˘ ˜ ˘ ˘ {θ θ = λθ + (1 − λ)θ, θ ∈ S}, for specific values of λ and θ. In general, for a given λ and θ ∈ S, the following d × d matrix M will change the support to the desired subset SM by mapping the extreme points of S to the extreme points of SM : ˘ M = (1 − λ)θ1T + λI, (2) where I is the d × d identity matrix. In Section 4 we show that the M given in (2) is optimal in a maximum entropy sense. 3 3.2 Example: Bounded Variation Pmfs We describe how to use the shadow Dirichlet to model a random pmf that has bounded variation such that |θk − θl | ≤ k,l for any k, ∈ {1, 2, . . . , d} and k,l > 0. To construct specified bounds on the variation, we first analyze the variation for a given M . For any d × d left stochastic matrix T d d ˜ ˜ ˜ M, θ = Mθ = M1j θj . . . Mdj θj , so the difference between any two entries is j=1 j=1 ˜ |Mkj − Mlj | θj . ˜ (Mkj − Mlj )θj ≤ |θk − θl | = (3) j j Thus, to obtain a distribution over pmfs with bounded |θk − θ | ≤ k,l for any k, components, it is sufficient to choose components of the matrix M such that |Mkj − Mlj | ≤ k,l for all j = 1, . . . , d ˜ because θ in (3) sums to 1. One way to create such an M is using the regularization strategy described in Section 3.1. For this ˜ ˜ ˘ case, the jth component of θ is θj = M θ = λθj + (1 − λ)θj , and thus the variation between the j ith and jth component of any pmf in SM is: ˜ ˘ ˜ ˘ ˜ ˜ ˘ ˘ |θi − θj | = λθi + (1 − λ)θi − λθj − (1 − λ)θj ≤ λ θi − θj + (1 − λ) θi − θj ˘ ˘ ≤ λ + (1 − λ) max θi − θj . 
(4) i,j ˘ Thus by choosing an appropriate λ and regularizing pmf θ, one can impose the bounded variation ˘ to be the uniform pmf, and choose any λ ∈ (0, 1), then the matrix given by (4). For example, set θ M given by (2) will guarantee that the difference between any two entries of any pmf drawn from the shadow Dirichlet (M, α) will be less than or equal to λ. 3.3 Example: Monotonic Pmfs For pmfs over ordered components, it may be desirable to restrict the support of the random pmf distribution to only monotonically increasing pmfs (or to only monotonically decreasing pmfs). A d × d left-stochastic matrix M that will result in a shadow Dirichlet that generates only monotonically increasing d × 1 pmfs has kth column [0 . . . 0 1/(d − k + 1) . . . 1/(d − k + 1)]T , we call this the monotonic M . It is easy to see that with this M only monotonic θ’s can be produced, 1˜ 1˜ 1 ˜ because θ1 = d θ1 which is less than or equal to θ2 = d θ1 + d−1 θ2 and so on. In Section 4 we show that the monotonic M is optimal in a maximum entropy sense. Note that to provide support over both monotonically increasing and decreasing pmfs with one distribution is not achievable with a shadow Dirichlet, but could be achieved by a mixture of two shadow Dirichlets. 3.4 What Restricted Subsets are Possible? Above we have described solutions to construct M for three kinds of dependence that arise in machine learning applications. Here we consider the more general question: What subsets of the simplex can be the support of the shadow Dirichlet, and how to design a shadow Dirichlet for a particular support? For any matrix M , by the Krein-Milman theorem [10], SM = M S is the convex hull of its extreme points. If M is injective, the extreme points of SM are easy to specify, as a d × d matrix M will have d extreme points that occur for the d choices of θ that have only one nonzero component, as the rest of the θ will create a non-trivial convex combination of the columns of M , and therefore cannot result in extreme points of SM by definition. That is, the extreme points of SM are the d columns of M , and one can design any SM with d extreme points by setting the columns of M to be those extreme pmfs. However, if one wants the new support to be a polytope in the probability (d − 1)-simplex with m > d extreme points, then one must use a fat M with d × m entries. Let S m denote the probability 4 (m − 1)-simplex, then the domain of the shadow Dirichlet will be M S m , which is the convex hull of the m columns of M and forms a convex polytope in S with at most m vertices. In this case M cannot be injective, and hence it is not bijective between S m and M S m . However, a density on M S m can be defined as: p(θ) = 1 B(α) ˜ {θ ˜α −1 ˜ θj j dθ. ˜ M θ=θ} j (5) On the other hand, if one wants the support to be a low-dimensional polytope subset of a higherdimensional probability simplex, then a thin d × m matrix M , where m < d, can be used to implement this. If M is injective, then it has a left inverse M ∗ that is a matrix of dimension m × d, and the normalized pushforward of the original density can be used as a density on the image M S m : p(θ) = 1 α −1 1/2 B(α) |det(M T M )| (M ∗ θ)j j , j If M is not injective then one way to determine a density is to use (5). 4 Information-theoretic Properties In this section we note two information-theoretic properties of the shadow Dirichlet. Let Θ be drawn ˜ from shadow Dirichlet density pM , and let its generating Dirichlet Θ be drawn from pD . 
Then the differential entropy of the shadow Dirichlet is h(pM ) = log |det(M )| + h(pD ), where h(pD ) is the differential entropy of its generating Dirichlet. In fact, the shadow Dirichlet always has less entropy than its Dirichlet shadow because log |det(M )| ≤ 0, which can be shown as a corollary to the following lemma (proof not included due to lack of space): Lemma 4.1. Let {x1 , . . . , xn } and {y1 , . . . , yn } be column vectors in Rn . If each yj is a convex n n combination of the xi ’s, i.e. yj = i=1 γji xi , i=1 γji = 1, γjk ≥ 0, ∀j, k ∈ {1, . . . , n} then |det[y1 , . . . , yn ]| ≤ |det[x1 , . . . , xn ]|. It follows from Lemma 4.1 that the constructive solutions for M given in (2) and the monotonic M are optimal in the sense of maximizing entropy: Corollary 4.1. Let Mreg be the set of left-stochastic matrices M that parameterize shadow Dirichlet ˜ ˘ ˜ ˘ distributions with support in {θ θ = λθ + (1 − λ)θ, θ ∈ S}, for a specific choice of λ and θ. Then the M given in (2) results in the shadow Dirichlet with maximum entropy, that is, (2) solves arg maxM ∈Mreg h(pM ). Corollary 4.2. Let Mmono be the set of left-stochastic matrices M that parameterize shadow Dirichlet distributions that generate only monotonic pmfs. Then the monotonic M given in Section 3.3 results in the shadow Dirichlet with maximum entropy, that is, the monotonic M solves arg maxM ∈Mmono h(pM ). 5 Estimating the Distribution from Data In this section, we discuss the estimation of α for the shadow Dirichlet and compound shadow Dirichlet, and the estimation of M . 5.1 Estimating α for the Shadow Dirichlet Let matrix M be specified (for example, as described in the subsections of Section 3), and let q be a d × N matrix where the ith column qi is the ith sample pmf for i = 1 . . . N , and let (qi )j be the jth component of the ith sample pmf for j = 1, . . . , d. Then finding the maximum likelihood estimate 5 of α for the shadow Dirichlet is straightforward: N 1 arg max log + log p(qi |α) ≡ arg max log B(α) |det(M )| α∈Rk α∈Rk + + i=1 1 αj −1 (˜i )j q , ≡ arg max log B(α)N i j α∈Rk + N α −1 (M −1 qi )j j i j (6) where q = M −1 q. Note (6) is the maximum likelihood estimation problem for the Dirichlet dis˜ tribution given the matrix q , and can be solved using the standard methods for that problem (see ˜ e.g. [11, 12]). 5.2 Estimating α for the Compound Shadow Dirichlet For many machine learning applications the given data are modeled as samples from realizations of a random pmf, and given these samples one must estimate the random pmf model’s parameters. We refer to this case as the compound shadow Dirichlet, analogous to the compound Dirichlet (also called the multivariate P´ lya distribution). Assuming one has already specified M , we first discuss o method of moments estimation, and then describe an expectation-maximization (EM) method for computing the maximum likelihood estimate α. ˘ One can form an estimate of α by the method of moments. For the standard compound Dirichlet, one treats the samples of the realizations as normalized empirical histograms, sets the normalized α parameter equal to the empirical mean of the normalized histograms, and uses the empirical variances to determine the precision α0 . By definition, this estimate will be less likely than the maximum likelihood estimate, but may be a practical short-cut in some cases. For the compound shadow Dirichlet, we believe the method of moments estimator will be a poorer estimate in general. 
The problem is that if one draws samples from a pmf θ from a restricted subset SM of the simplex, ˘ then the normalized empirical histogram θ of those samples may not be in SM . For example given a monotonic pmf, the histogram of five samples drawn from it may not be monotonic. Then the empirical mean of such normalized empirical histograms may not be in SM , and so setting the shadow Dirichlet mean M α equal to the empirical mean may lead to an infeasible estimate (one that is outside SM ). A heuristic solution is to project the empirical mean into SM first, for example, by finding the nearest pmf in SM in squared error or relative entropy. As with the compound Dirichlet, this may still be a useful approach in practice for some problems. Next we state an EM method to find the maximum likelihood estimate α. Let s be a d × N matrix ˘ of sample histograms from different experiments, such that the ith column si is the ith histogram for i = 1, . . . , N , and (si )j is the number of times we have observed the jth event from the ith pmf vi . Then the maximum log-likelihood estimate of α solves arg max log p(s|α) for α ∈ Rk . + If the random pmfs are drawn from a Dirichlet distribution, then finding this maximum likelihood estimate requires an iterative procedure, and can be done in several ways including a gradient descent (ascent) approach. However, if the random pmfs are drawn from a shadow Dirichlet distribution, then a direct gradient descent approach is highly inconvenient as it requires taking derivatives of numerical integrals. However, it is practical to apply the expectation-maximization (EM) algorithm [13][14], as we describe in the rest of this section. Code to perform the EM estimation of α can be downloaded from idl.ee.washington.edu/publications.php. We assume that the experiments are independent and therefore p(s|α) = p({si }|α) = and hence arg maxα∈Rk log p(s|α) = arg maxα∈Rk i log p(si |α). + + i p(si |α) To apply the EM method, we consider the complete data to be the sample histograms s and the pmfs that generated them (s, v1 , v2 , . . . , vN ), whose expected log-likelihood will be maximized. Specifically, because of the assumed independence of the {vi }, the EM method requires one to repeatedly maximize the Q-function such that the estimate of α at the (m + 1)th iteration is: N α(m+1) = arg max α∈Rk + Evi |si ,α(m) [log p(vi |α)] . i=1 6 (7) Like the compound Dirichlet likelihood, the compound shadow Dirichlet likelihood is not necessarily concave. However, note that the Q-function given in (7) is concave, because log p(vi |α) = − log |det(M )| + log pD,α M −1 vi , where pD,α is the Dirichlet distribution with parameter α, and by a theorem of Ronning [11], log pD,α is a concave function, and adding a constant does not change the concavity. The Q-function is a finite integration of such concave functions and hence also concave [15]. We simplify (7) without destroying the concavity to yield the equivalent problem α(m+1) = d d arg max g(α) for α ∈ Rk , where g(α) = log Γ(α0 ) − j=1 log Γ(αj ) + j=1 βj αj , and + βj = N tij i=1 zi , 1 N where tij and zi are integrals we compute with Monte Carlo integration: d (s ) log(M −1 vi )j γi tij = SM (vi )k i k pM (vi |α(m) )dvi k=1 d zi = (vi )j k(si )k pM (vi |α(m) )dvi , γi SM k=1 where γi is the normalization constant for the multinomial with histogram si . We apply the Newton method [16] to maximize g(α), where the gradient g(α) has kth component ψ0 (α0 ) − ψ0 (α1 ) + β1 , where ψ0 denotes the digamma function. 
Let ψ1 denote the trigamma function, then the Hessian matrix of g(α) is: H = ψ1 (α0 )11T − diag (ψ1 (α1 ), . . . , ψ1 (αd )) . Note that because H has a very simple structure, the inversion of H required by the Newton step is greatly simplified by using the Woodbury identity [17]: H −1 = − diag(ξ1 , . . . , ξd ) − 1 1 1 [ξi ξj ]d×d , where ξ0 = ψ1 (α0 ) and ξj = ψ1 (αj ) , j = 1, . . . , d. ξ − d ξ 0 5.3 j=1 j Estimating M for the Shadow Dirichlet Thus far we have discussed how to construct M to achieve certain desired properties and how to interpret a given M ’s effect on the support. In some cases it may be useful to estimate M directly from data, for example, finding the maximum likelihood M . In general, this is a non-convex problem because the set of rank d − 1 matrices is not convex. However, we offer two approximations. First, note that as in estimating the support of a uniform distribution, the maximum likelihood M will correspond to a support that is no larger than needed to contain the convex hull of sample pmfs. Second, the mean of the empirical pmfs will be in the support, and thus a heuristic is to set the kth column of M (which corresponds to the kth vertex of the support) to be a convex combination of the kth vertex of the standard probability simplex and the empirical mean pmf. We provide code that finds the d optimal such convex combinations such that a specificed percentage of the sample pmfs are within the support, which reduces the non-convex problem of finding the maximum likelihood d × d matrix M to a d-dimensional convex relaxation. 6 Demonstrations It is reasonable to believe that if the shadow Dirichlet better matches the problem’s statistics, it will perform better in practice, but an open question is how much better? To motivate the reader to investigate this question further in applications, we provide two small demonstrations. 6.1 Verifying the EM Estimation We used a broad suite of simulations to test and verify the EM estimation. Here we include a simple visual confirmation that the EM estimation works: we drew 100 i.i.d. pmfs from a shadow Dirichlet with monotonic M for d = 3 and α = [3.94 2.25 2.81] (used in [18]). From each of the 100 pmfs, we drew 100 i.i.d. samples. Then we applied the EM algorithm to find the α for both the standard compound Dirichlet, and the compound shadow Dirichlet with the correct M . Fig. 2 shows the true distribution and the two estimated distributions. 7 True Distribution (Shadow Dirichlet) Estimated Shadow Dirichlet Estimated Dirichlet Figure 2: Samples were drawn from the true distribution and the given EM method was applied to form the estimated distributions. 6.2 Estimating Proportions from Sales Manufacturers often have constrained manufacturing resources, such as equipment, inventory of raw materials, and employee time, with which to produce multiple products. The manufacturer must decide how to proportionally allocate such constrained resources across their product line based on their estimate of proportional sales. Manufacturer Artifact Puzzles gave us their past retail sales data for the 20 puzzles they sold during July 2009 through Dec 2009, which we used to predict the proportion of sales expected for each puzzle. These estimates were then tested on the next five months of sales data, for January 2010 through April 2010. The company also provided a similarity between puzzles S, where S(A, B) is the proportion of times an order during the six training months included both puzzle A and B if it included puzzle A. 
We compared treating each of the six training months of sales data as a sample from a compound Dirichlet versus or a compound shadow Dirichlet. For the shadow Dirichlet, we normalized each column of the similarity matrix S to sum to one so that it was left-stochastic, and used that as the M matrix; this forces puzzles that are often bought together to have closer estimated proportions. We estimated each α parameter by EM to maximize the likelihood of the past sales data, and then estimated the future sales proportions to be the mean of the estimated Dirichlet or shadow Dirichlet distribution. We also compared with treating all six months of sales data as coming from one multinomial which we estimated as the maximum likelihood multinomial, and to taking the mean of the six empirical pmfs. Table 2: Squared errors between estimates and actual proportional sales. Jan. Feb. Mar. Apr. 7 Multinomial .0129 .0185 .0231 .0240 Mean Pmf .0106 .0206 .0222 .0260 Dirichlet .0109 .0172 .0227 .0235 Shadow Dirichlet .0093 .0164 .0197 .0222 Summary In this paper we have proposed a variant of the Dirichlet distribution that naturally captures some of the dependent structure that arises often in machine learning applications. We have discussed some of its theoretical properties, and shown how to specify the distribution for regularized pmfs, bounded variation pmfs, monotonic pmfs, and for any desired convex polytopal domain. We have derived the EM method and made available code to estimate both the shadow Dirichlet and compound shadow Dirichlet from data. Experimental results demonstrate that the EM method can estimate the shadow Dirichlet effectively, and that the shadow Dirichlet may provide worthwhile advantages in practice. 8 References [1] B. Frigyik, A. Kapila, and M. R. Gupta, “Introduction to the Dirichlet distribution and related processes,” Tech. Rep., University of Washington, 2010. [2] C. Zhai and J. Lafferty, “A study of smoothing methods for language models applied to information retrieval,” ACM Trans. on Information Systems, vol. 22, no. 2, pp. 179–214, 2004. [3] Y. Chen, E. K. Garcia, M. R. Gupta, A. Rahimi, and L. Cazzanti, “Similarity-based classification: Concepts and algorithms,” Journal of Machine Learning Research, vol. 10, pp. 747–776, March 2009. [4] R. Nallapati, T. Minka, and S. Robertson, “The smoothed-Dirichlet distribution: a building block for generative topic models,” Tech. Rep., Microsoft Research, Cambridge, 2007. [5] Aitchison, Statistical Analysis of Compositional Data, Chapman Hall, New York, 1986. [6] R. J. Connor and J. E. Mosiman, “Concepts of independence for proportions with a generalization of the Dirichlet distibution,” Journal of the American Statistical Association, vol. 64, pp. 194–206, 1969. [7] K. Pearson, “Mathematical contributions to the theory of evolution–on a form of spurious correlation which may arise when indices are used in the measurement of organs,” Proc. Royal Society of London, vol. 60, pp. 489–498, 1897. [8] A. Ongaro, S. Migliorati, and G. S. Monti, “A new distribution on the simplex containing the Dirichlet family,” Proc. 3rd Compositional Data Analysis Workshop, 2008. [9] W. S. Rayens and C. Srinivasan, “Dependence properties of generalized Liouville distributions on the simplex,” Journal of the American Statistical Association, vol. 89, no. 428, pp. 1465– 1470, 1994. [10] Walter Rudin, Functional Analysis, McGraw-Hill, New York, 1991. [11] G. 
Ronning, “Maximum likelihood estimation of Dirichlet distributions,” Journal of Statistical Computation and Simulation, vol. 34, no. 4, pp. 215221, 1989. [12] T. Minka, “Estimating a Dirichlet distribution,” Tech. Rep., Microsoft Research, Cambridge, 2009. [13] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 39, no. 1, pp. 1–38, 1977. [14] M. R. Gupta and Y. Chen, Theory and Use of the EM Method, Foundations and Trends in Signal Processing, Hanover, MA, 2010. [15] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970. [16] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, Cambridge, 2004. [17] K. B. Petersen and M. S. Pedersen, Matrix Cookbook, 2009, Available at matrixcookbook.com. [18] R. E. Madsen, D. Kauchak, and C. Elkan, “Modeling word burstiness using the Dirichlet distribution,” in Proc. Intl. Conf. Machine Learning, 2005. 9
Author: Felipe Gerhard, Wulfram Gerstner
Abstract: Generalized Linear Models (GLMs) are an increasingly popular framework for modeling neural spike trains. They have been linked to the theory of stochastic point processes and researchers have used this relation to assess goodness-of-fit using methods from point-process theory, e.g. the time-rescaling theorem. However, high neural firing rates or coarse discretization lead to a breakdown of the assumptions necessary for this connection. Here, we show how goodness-of-fit tests from point-process theory can still be applied to GLMs by constructing equivalent surrogate point processes out of time-series observations. Furthermore, two additional tests based on thinning and complementing point processes are introduced. They augment the instruments available for checking model adequacy of point processes as well as discretized models. 1
5 0.14947087 185 nips-2010-Nonparametric Density Estimation for Stochastic Optimization with an Observable State Variable
Author: Lauren Hannah, Warren Powell, David M. Blei
Abstract: In this paper we study convex stochastic optimization problems where a noisy objective function value is observed after a decision is made. There are many stochastic optimization problems whose behavior depends on an exogenous state variable which affects the shape of the objective function. Currently, there is no general purpose algorithm to solve this class of problems. We use nonparametric density estimation to take observations from the joint state-outcome distribution and use them to infer the optimal decision for a given query state s. We propose two solution methods that depend on the problem characteristics: function-based and gradient-based optimization. We examine two weighting schemes, kernel based weights and Dirichlet process based weights, for use with the solution methods. The weights and solution methods are tested on a synthetic multi-product newsvendor problem and the hour ahead wind commitment problem. Our results show that in some cases Dirichlet process weights offer substantial benefits over kernel based weights and more generally that nonparametric estimation methods provide good solutions to otherwise intractable problems. 1
6 0.10410295 131 nips-2010-Joint Analysis of Time-Evolving Binary Matrices and Associated Documents
7 0.10382982 55 nips-2010-Cross Species Expression Analysis using a Dirichlet Process Mixture Model with Latent Matchings
8 0.096720472 60 nips-2010-Deterministic Single-Pass Algorithm for LDA
9 0.085589945 18 nips-2010-A novel family of non-parametric cumulative based divergences for point processes
10 0.080859125 242 nips-2010-Slice sampling covariance hyperparameters of latent Gaussian models
11 0.080531381 286 nips-2010-Word Features for Latent Dirichlet Allocation
12 0.079608671 194 nips-2010-Online Learning for Latent Dirichlet Allocation
13 0.077786632 150 nips-2010-Learning concept graphs from text with stick-breaking priors
14 0.072307296 269 nips-2010-Throttling Poisson Processes
15 0.070631295 33 nips-2010-Approximate inference in continuous time Gaussian-Jump processes
16 0.069449589 120 nips-2010-Improvements to the Sequence Memoizer
17 0.068523705 78 nips-2010-Error Propagation for Approximate Policy and Value Iteration
18 0.066446796 85 nips-2010-Exact learning curves for Gaussian process regression on large random graphs
19 0.064037994 67 nips-2010-Dynamic Infinite Relational Model for Time-varying Relational Data Analysis
20 0.057248753 148 nips-2010-Learning Networks of Stochastic Differential Equations
topicId topicWeight
[(0, 0.192), (1, 0.021), (2, -0.023), (3, 0.071), (4, -0.209), (5, 0.11), (6, 0.13), (7, 0.055), (8, 0.029), (9, -0.012), (10, 0.074), (11, -0.036), (12, -0.08), (13, 0.051), (14, 0.027), (15, 0.034), (16, 0.039), (17, 0.1), (18, 0.091), (19, -0.037), (20, -0.113), (21, 0.071), (22, -0.028), (23, 0.024), (24, -0.077), (25, 0.237), (26, -0.069), (27, -0.023), (28, 0.044), (29, 0.009), (30, -0.137), (31, 0.224), (32, -0.049), (33, -0.03), (34, -0.091), (35, -0.093), (36, -0.027), (37, -0.089), (38, 0.046), (39, -0.019), (40, 0.033), (41, -0.168), (42, -0.128), (43, -0.109), (44, 0.045), (45, 0.025), (46, 0.083), (47, -0.078), (48, 0.009), (49, -0.0)]
simIndex simValue paperId paperTitle
same-paper 1 0.96661443 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
Author: Dahua Lin, Eric Grimson, John W. Fisher
Abstract: We present a novel method for constructing dependent Dirichlet processes. The approach exploits the intrinsic relationship between Dirichlet and Poisson processes in order to create a Markov chain of Dirichlet processes suitable for use as a prior over evolving mixture models. The method allows for the creation, removal, and location variation of component models over time while maintaining the property that the random measures are marginally DP distributed. Additionally, we derive a Gibbs sampling algorithm for model inference and test it on both synthetic and real data. Empirical results demonstrate that the approach is effective in estimating dynamically varying mixture models. 1
2 0.86679554 237 nips-2010-Shadow Dirichlet for Restricted Probability Modeling
Author: Bela Frigyik, Maya Gupta, Yihua Chen
Abstract: Although the Dirichlet distribution is widely used, the independence structure of its components limits its accuracy as a model. The proposed shadow Dirichlet distribution manipulates the support in order to model probability mass functions (pmfs) with dependencies or constraints that often arise in real world problems, such as regularized pmfs, monotonic pmfs, and pmfs with bounded variation. We describe some properties of this new class of distributions, provide maximum entropy constructions, give an expectation-maximization method for estimating the mean parameter, and illustrate with real data. 1 Modeling Probabilities for Machine Learning Modeling probability mass functions (pmfs) as random is useful in solving many real-world problems. A common random model for pmfs is the Dirichlet distribution [1]. The Dirichlet is conjugate to the multinomial and hence mathematically convenient for Bayesian inference, and the number of parameters is conveniently linear in the size of the sample space. However, the Dirichlet is a distribution over the entire probability simplex, and for many problems this is simply the wrong domain if there is application-specific prior knowledge that the pmfs come from a restricted subset of the simplex. For example, in natural language modeling, it is common to regularize a pmf over n-grams by some generic language model distribution q0 , that is, the pmf to be modeled is assumed to have the form θ = λq + (1 − λ)q0 for some q in the simplex, λ ∈ (0, 1) and a fixed generic model q0 [2]. But once q0 and λ are fixed, the pmf θ can only come from a subset of the simplex. Another natural language processing example is modeling the probability of keywords in a dictionary where some words are related, such as espresso and latte, and evidence for the one is to some extent evidence for the other. This relationship can be captured with a bounded variation model that would constrain the modeled probability of espresso to be within some of the modeled probability of latte. We show that such bounds on the variation between pmf components also restrict the domain of the pmf to a subset of the simplex. As a third example of restricting the domain, the similarity discriminant analysis classifier estimates class-conditional pmfs that are constrained to be monotonically increasing over an ordered sample space of discrete similarity values [3]. In this paper we propose a simple variant of the Dirichlet whose support is a subset of the simplex, explore its properties, and show how to learn the model from data. We first discuss the alternative solution of renormalizing the Dirichlet over the desired subset of the simplex, and other related work. Then we propose the shadow Dirichlet distribution; explain how to construct a shadow Dirichlet for three types of restricted domains: the regularized pmf case, bounded variation between pmf components, and monotonic pmfs; and discuss the most general case. We show how to use the expectation-maximization (EM) algorithm to estimate the shadow Dirichlet parameter α, and present simulation results for the estimation. 1 Dirichlet Shadow Dirichlet Renormalized Dirichlet Figure 1: Dirichlet, shadow Dirichlet, and renormalized Dirichlet for α = [3.94 2.25 2.81]. 2 Related Work One solution to modeling pmfs on only a subset of the simplex is to simply restrict the support of ˜ ˜ the Dirichlet to the desired support S, and renormalize the Dirichlet over S (see Fig. 1 for an example). 
This renormalized Dirichlet has the advantage that it is still a conjugate distribution for the multinomial. Nallapati et al. considered the renormalized Dirichlet for language modeling, but found it difficult to use because the density requires numerical integration to compute the normalizer [4]. In addition, there is no closed-form solution for the mean, covariance, or peak of the renormalized Dirichlet, making it difficult to work with. Table 1 summarizes these properties. Additionally, generating samples from the renormalized Dirichlet is inefficient: one draws samples from the standard Dirichlet, then rejects realizations that are outside S̃. For high-dimensional sample spaces, this can greatly increase the time to generate samples. Although the Dirichlet is a classic and popular distribution on the simplex, Aitchison warns it “is totally inadequate for the description of the variability of compositional data,” because of its “implied independence structure and so the Dirichlet class is unlikely to be of any great use for describing compositions whose components have even weak forms of dependence” [5]. Aitchison instead championed a logistic normal distribution with more parameters to control covariance between components. A number of variants of the Dirichlet that can capture more dependence have been proposed and analyzed. For example, the scaled Dirichlet enables a more flexible shape for the distribution [5], but does not change the support. The original Dirichlet(α_1, α_2, …, α_d) can be derived as Y_j / Σ_j Y_j where Y_j ∼ Γ(α_j, β), whereas the scaled Dirichlet is derived from Y_j ∼ Γ(α_j, β_j), resulting in the density p(θ) = γ ∏_j β_j^{α_j} θ_j^{α_j − 1} / (Σ_i β_i θ_i)^{α_1 + ··· + α_d}, where β, α ∈ R^d_+ are parameters and γ is the normalizer. Another variant is the generalized Dirichlet [6], which also has parameters β, α ∈ R^d_+ and allows greater control of the covariance structure, again without changing the support. As perhaps first noted by Karl Pearson [7] and expounded upon by Aitchison [5], correlations of proportional data can be very misleading. Many Dirichlet variants have been generalizations of the Connor-Mosimann variant, Dirichlet process variants, other compound Dirichlet models, and hierarchical Dirichlet models. Ongaro et al. [8] propose the flexible Dirichlet distribution by forming a re-parameterized mixture of Dirichlet distributions. Rayens and Srinivasan [9] considered the dependence structure for the general Dirichlet family called the generalized Liouville distributions. In contrast to prior efforts, the shadow Dirichlet manipulates the support to achieve various kinds of dependence that arise frequently in machine learning problems.
3 Shadow Dirichlet Distribution
We introduce a new distribution that we call the shadow Dirichlet distribution. Let S be the probability (d − 1)-simplex, and let Θ̃ ∈ S be a random pmf drawn from a Dirichlet distribution with density pD and unnormalized parameter α ∈ R^d_+. Then we say the random pmf Θ ∈ S is distributed according to a shadow Dirichlet distribution if Θ = MΘ̃ for some fixed d × d left-stochastic (that is, each column of M sums to 1), full-rank (and hence invertible) matrix M, and we call Θ̃ the generating Dirichlet of Θ, or Θ's Dirichlet shadow. Because M is a left-stochastic linear map between finite-dimensional spaces, it is a continuous map from the convex and compact S to a convex and compact subset of S that we denote SM.
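To make the definition concrete, the following is a minimal sampling sketch (not taken from the paper): it assumes an arbitrary 3 × 3 left-stochastic M — which here happens to coincide with the regularized-pmf construction described later, with λ = 0.7 and a uniform regularizing pmf — and pushes standard Dirichlet draws through it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed example inputs: a Dirichlet parameter and an arbitrary
# left-stochastic (columns sum to 1), full-rank 3 x 3 matrix M.
alpha = np.array([3.94, 2.25, 2.81])
M = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
assert np.allclose(M.sum(axis=0), 1.0)  # left-stochastic check

# Draw generating Dirichlet pmfs (rows of theta_tilde) and map them through M.
theta_tilde = rng.dirichlet(alpha, size=1000)   # each row lies in S
theta = theta_tilde @ M.T                       # each row is M @ theta_tilde_i, in S_M

# Every shadow sample is still a pmf (nonnegative, sums to 1).
assert np.allclose(theta.sum(axis=1), 1.0)
```

Because the columns of M sum to one, each mapped sample remains on the simplex; the restriction of the support to S_M is entirely determined by the choice of M.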
The shadow Dirichlet has two parameters: the generating Dirichlet's parameter α ∈ R^d_+, and the d × d matrix M. Both α and M can be estimated from data. However, as we show in the following subsections, the matrix M can be profitably used as a design parameter that is chosen based on application-specific knowledge or side-information to specify the restricted domain SM, and in that way impose dependency between the components of the random pmfs. The shadow Dirichlet density p(θ) is the normalized pushforward of the Dirichlet density, that is, it is the composition of the Dirichlet density and M^{-1} with the Jacobian:
p(θ) = (1 / (B(α) |det(M)|)) ∏_{j=1}^d (M^{-1}θ)_j^{α_j − 1},   (1)
where B(α) = ∏_{j=1}^d Γ(α_j) / Γ(α_0) is the standard Dirichlet normalizer, and α_0 = Σ_{j=1}^d α_j is the standard Dirichlet precision factor. Table 1 summarizes the basic properties of the shadow Dirichlet. Fig. 1 shows an example shadow Dirichlet distribution. Generating samples from the shadow Dirichlet is trivial: generate samples from its generating Dirichlet (for example, using stick-breaking or urn-drawing) and multiply each sample by M to create the corresponding shadow Dirichlet sample.
Table 1: Comparison of the Dirichlet, shadow Dirichlet, and renormalized Dirichlet distributions.
Density p(θ): Dirichlet(α): (1/B(α)) ∏_j θ_j^{α_j − 1} | shadow Dirichlet (α, M): (1/(B(α)|det(M)|)) ∏_j (M^{-1}θ)_j^{α_j − 1} | renormalized Dirichlet (α, S̃): ∏_j θ_j^{α_j − 1} / ∫_{S̃} ∏_j q_j^{α_j − 1} dq
Mean: α/α_0 | Mα/α_0 | ∫_{S̃} θ p(θ) dθ
Covariance: Cov(Θ̃) | M Cov(Θ̃) M^T | ∫_{S̃} (θ − θ̄)(θ − θ̄)^T p(θ) dθ
Mode (if α > 1): (α − 1)/(α_0 − d) | M(α − 1)/(α_0 − d) | arg max_{θ ∈ S̃} p(θ)
How to sample: stick-breaking, urn-drawing | draw from Dirichlet(α), multiply by M | draw from Dirichlet(α), reject if not in S̃
ML estimate: iterative (simple functions) | iterative (simple functions) | unknown complexity
ML compound estimate: iterative (simple functions) | iterative (numerical integration) | unknown complexity
3.1 Example: Regularized Pmfs
The shadow Dirichlet can be designed to specify a distribution over a set of regularized pmfs SM = {θ : θ = λθ̃ + (1 − λ)θ̆, θ̃ ∈ S}, for specific values of λ and θ̆. In general, for a given λ and θ̆ ∈ S, the following d × d matrix M will change the support to the desired subset SM by mapping the extreme points of S to the extreme points of SM:
M = (1 − λ)θ̆1^T + λI,   (2)
where I is the d × d identity matrix. In Section 4 we show that the M given in (2) is optimal in a maximum entropy sense.
3.2 Example: Bounded Variation Pmfs
We describe how to use the shadow Dirichlet to model a random pmf that has bounded variation such that |θ_k − θ_l| ≤ ε_{k,l} for any k, l ∈ {1, 2, …, d} and ε_{k,l} > 0. To construct specified bounds on the variation, we first analyze the variation for a given M. For any d × d left-stochastic matrix M, θ = Mθ̃ = [Σ_j M_{1j}θ̃_j, …, Σ_j M_{dj}θ̃_j]^T, so the difference between any two entries is
|θ_k − θ_l| = |Σ_j (M_{kj} − M_{lj})θ̃_j| ≤ Σ_j |M_{kj} − M_{lj}| θ̃_j.   (3)
Thus, to obtain a distribution over pmfs with bounded |θ_k − θ_l| ≤ ε_{k,l} for any components k, l, it is sufficient to choose the components of the matrix M such that |M_{kj} − M_{lj}| ≤ ε_{k,l} for all j = 1, …, d, because θ̃ in (3) sums to 1. One way to create such an M is using the regularization strategy described in Section 3.1. For this case, the jth component of θ is θ_j = (Mθ̃)_j = λθ̃_j + (1 − λ)θ̆_j, and thus the variation between the ith and jth components of any pmf in SM is:
|θ_i − θ_j| = |λθ̃_i + (1 − λ)θ̆_i − λθ̃_j − (1 − λ)θ̆_j| ≤ λ|θ̃_i − θ̃_j| + (1 − λ)|θ̆_i − θ̆_j|
≤ λ + (1 − λ) max_{i,j} |θ̆_i − θ̆_j|.   (4)
Thus by choosing an appropriate λ and regularizing pmf θ̆, one can impose the bounded variation given by (4). For example, set θ̆ to be the uniform pmf and choose any λ ∈ (0, 1); then the matrix M given by (2) will guarantee that the difference between any two entries of any pmf drawn from the shadow Dirichlet (M, α) will be less than or equal to λ.
3.3 Example: Monotonic Pmfs
For pmfs over ordered components, it may be desirable to restrict the support of the random pmf distribution to only monotonically increasing pmfs (or to only monotonically decreasing pmfs). A d × d left-stochastic matrix M that will result in a shadow Dirichlet that generates only monotonically increasing d × 1 pmfs has kth column [0 … 0 1/(d − k + 1) … 1/(d − k + 1)]^T; we call this the monotonic M. It is easy to see that with this M only monotonic θ's can be produced, because θ_1 = (1/d)θ̃_1, which is less than or equal to θ_2 = (1/d)θ̃_1 + (1/(d − 1))θ̃_2, and so on. In Section 4 we show that the monotonic M is optimal in a maximum entropy sense. Note that to provide support over both monotonically increasing and decreasing pmfs with one distribution is not achievable with a shadow Dirichlet, but could be achieved by a mixture of two shadow Dirichlets.
3.4 What Restricted Subsets are Possible?
Above we have described solutions to construct M for three kinds of dependence that arise in machine learning applications. Here we consider the more general question: what subsets of the simplex can be the support of the shadow Dirichlet, and how does one design a shadow Dirichlet for a particular support? For any matrix M, by the Krein-Milman theorem [10], SM = MS is the convex hull of its extreme points. If M is injective, the extreme points of SM are easy to specify, as a d × d matrix M will have d extreme points that occur for the d choices of θ̃ that have only one nonzero component; the rest of the θ̃ create non-trivial convex combinations of the columns of M, and therefore cannot result in extreme points of SM by definition. That is, the extreme points of SM are the d columns of M, and one can design any SM with d extreme points by setting the columns of M to be those extreme pmfs. However, if one wants the new support to be a polytope in the probability (d − 1)-simplex with m > d extreme points, then one must use a fat M with d × m entries. Let S^m denote the probability (m − 1)-simplex; then the domain of the shadow Dirichlet will be MS^m, which is the convex hull of the m columns of M and forms a convex polytope in S with at most m vertices. In this case M cannot be injective, and hence it is not bijective between S^m and MS^m. However, a density on MS^m can be defined as:
p(θ) = (1/B(α)) ∫_{θ̃ : Mθ̃ = θ} ∏_j θ̃_j^{α_j − 1} dθ̃.   (5)
On the other hand, if one wants the support to be a low-dimensional polytope subset of a higher-dimensional probability simplex, then a thin d × m matrix M, where m < d, can be used to implement this. If M is injective, then it has a left inverse M* that is a matrix of dimension m × d, and the normalized pushforward of the original density can be used as a density on the image MS^m:
p(θ) = (1 / (B(α) |det(M^T M)|^{1/2})) ∏_j (M*θ)_j^{α_j − 1}.
If M is not injective, then one way to determine a density is to use (5).
4 Information-theoretic Properties
In this section we note two information-theoretic properties of the shadow Dirichlet. Let Θ be drawn from the shadow Dirichlet density pM, and let its generating Dirichlet Θ̃ be drawn from pD.
Then the differential entropy of the shadow Dirichlet is h(pM) = log|det(M)| + h(pD), where h(pD) is the differential entropy of its generating Dirichlet. In fact, the shadow Dirichlet always has less entropy than its Dirichlet shadow because log|det(M)| ≤ 0, which can be shown as a corollary to the following lemma (proof not included due to lack of space):
Lemma 4.1. Let {x_1, …, x_n} and {y_1, …, y_n} be column vectors in R^n. If each y_j is a convex combination of the x_i's, i.e. y_j = Σ_{i=1}^n γ_{ji} x_i with Σ_{i=1}^n γ_{ji} = 1 and γ_{jk} ≥ 0 for all j, k ∈ {1, …, n}, then |det[y_1, …, y_n]| ≤ |det[x_1, …, x_n]|.
It follows from Lemma 4.1 that the constructive solutions for M given in (2) and the monotonic M are optimal in the sense of maximizing entropy:
Corollary 4.1. Let Mreg be the set of left-stochastic matrices M that parameterize shadow Dirichlet distributions with support in {θ : θ = λθ̃ + (1 − λ)θ̆, θ̃ ∈ S}, for a specific choice of λ and θ̆. Then the M given in (2) results in the shadow Dirichlet with maximum entropy, that is, (2) solves arg max_{M ∈ Mreg} h(pM).
Corollary 4.2. Let Mmono be the set of left-stochastic matrices M that parameterize shadow Dirichlet distributions that generate only monotonic pmfs. Then the monotonic M given in Section 3.3 results in the shadow Dirichlet with maximum entropy, that is, the monotonic M solves arg max_{M ∈ Mmono} h(pM).
5 Estimating the Distribution from Data
In this section, we discuss the estimation of α for the shadow Dirichlet and compound shadow Dirichlet, and the estimation of M.
5.1 Estimating α for the Shadow Dirichlet
Let matrix M be specified (for example, as described in the subsections of Section 3), let q be a d × N matrix where the ith column q_i is the ith sample pmf for i = 1, …, N, and let (q_i)_j be the jth component of the ith sample pmf for j = 1, …, d. Then finding the maximum likelihood estimate of α for the shadow Dirichlet is straightforward:
arg max_{α ∈ R^d_+} log ∏_{i=1}^N p(q_i | α) ≡ arg max_{α ∈ R^d_+} log [ (1/(B(α)^N |det(M)|^N)) ∏_i ∏_j (M^{-1}q_i)_j^{α_j − 1} ] ≡ arg max_{α ∈ R^d_+} log [ (1/B(α)^N) ∏_i ∏_j (q̃_i)_j^{α_j − 1} ],   (6)
where q̃ = M^{-1}q. Note (6) is the maximum likelihood estimation problem for the Dirichlet distribution given the matrix q̃, and can be solved using the standard methods for that problem (see e.g. [11, 12]).
5.2 Estimating α for the Compound Shadow Dirichlet
For many machine learning applications the given data are modeled as samples from realizations of a random pmf, and given these samples one must estimate the random pmf model's parameters. We refer to this case as the compound shadow Dirichlet, analogous to the compound Dirichlet (also called the multivariate Pólya distribution). Assuming one has already specified M, we first discuss method of moments estimation, and then describe an expectation-maximization (EM) method for computing the maximum likelihood estimate of α. One can form an estimate of α by the method of moments. For the standard compound Dirichlet, one treats the samples of the realizations as normalized empirical histograms, sets the normalized α parameter equal to the empirical mean of the normalized histograms, and uses the empirical variances to determine the precision α_0. By definition, this estimate will be less likely than the maximum likelihood estimate, but may be a practical short-cut in some cases. For the compound shadow Dirichlet, we believe the method of moments estimator will be a poorer estimate in general.
The problem is that if one draws samples from a pmf θ from a restricted subset SM of the simplex, then the normalized empirical histogram of those samples may not be in SM. For example, given a monotonic pmf, the histogram of five samples drawn from it may not be monotonic. Then the empirical mean of such normalized empirical histograms may not be in SM, and so setting the shadow Dirichlet mean Mα/α_0 equal to the empirical mean may lead to an infeasible estimate (one that is outside SM). A heuristic solution is to project the empirical mean into SM first, for example, by finding the nearest pmf in SM in squared error or relative entropy. As with the compound Dirichlet, this may still be a useful approach in practice for some problems. Next we state an EM method to find the maximum likelihood estimate of α. Let s be a d × N matrix of sample histograms from different experiments, such that the ith column s_i is the ith histogram for i = 1, …, N, and (s_i)_j is the number of times we have observed the jth event from the ith pmf v_i. Then the maximum log-likelihood estimate of α solves arg max_{α ∈ R^d_+} log p(s | α). If the random pmfs are drawn from a Dirichlet distribution, then finding this maximum likelihood estimate requires an iterative procedure, and can be done in several ways including a gradient descent (ascent) approach. However, if the random pmfs are drawn from a shadow Dirichlet distribution, then a direct gradient approach is highly inconvenient as it requires taking derivatives of numerical integrals. It is practical, though, to apply the expectation-maximization (EM) algorithm [13, 14], as we describe in the rest of this section. Code to perform the EM estimation of α can be downloaded from idl.ee.washington.edu/publications.php. We assume that the experiments are independent and therefore p(s | α) = p({s_i} | α) = ∏_i p(s_i | α), and hence arg max_{α ∈ R^d_+} log p(s | α) = arg max_{α ∈ R^d_+} Σ_i log p(s_i | α). To apply the EM method, we consider the complete data to be the sample histograms s and the pmfs that generated them, (s, v_1, v_2, …, v_N), whose expected log-likelihood will be maximized. Specifically, because of the assumed independence of the {v_i}, the EM method requires one to repeatedly maximize the Q-function such that the estimate of α at the (m + 1)th iteration is:
α^{(m+1)} = arg max_{α ∈ R^d_+} Σ_{i=1}^N E_{v_i | s_i, α^{(m)}} [log p(v_i | α)].   (7)
Like the compound Dirichlet likelihood, the compound shadow Dirichlet likelihood is not necessarily concave. However, note that the Q-function given in (7) is concave, because log p(v_i | α) = −log|det(M)| + log p_{D,α}(M^{-1}v_i), where p_{D,α} is the Dirichlet distribution with parameter α; by a theorem of Ronning [11], log p_{D,α} is a concave function, and adding a constant does not change the concavity. The Q-function is a finite integration of such concave functions and hence also concave [15]. We simplify (7) without destroying the concavity to yield the equivalent problem α^{(m+1)} = arg max_{α ∈ R^d_+} g(α), where
g(α) = log Γ(α_0) − Σ_{j=1}^d log Γ(α_j) + Σ_{j=1}^d β_j α_j,  with  β_j = (1/N) Σ_{i=1}^N t_{ij} / z_i,
where t_{ij} and z_i are integrals we compute with Monte Carlo integration:
t_{ij} = ∫_{SM} γ_i log((M^{-1}v_i)_j) ∏_{k=1}^d (v_i)_k^{(s_i)_k} p_M(v_i | α^{(m)}) dv_i,   z_i = ∫_{SM} γ_i ∏_{k=1}^d (v_i)_k^{(s_i)_k} p_M(v_i | α^{(m)}) dv_i,
where γ_i is the normalization constant for the multinomial with histogram s_i. We apply the Newton method [16] to maximize g(α), where the gradient of g(α) has kth component ψ_0(α_0) − ψ_0(α_k) + β_k, with ψ_0 denoting the digamma function.
Let ψ_1 denote the trigamma function; then the Hessian matrix of g(α) is:
H = ψ_1(α_0) 1 1^T − diag(ψ_1(α_1), …, ψ_1(α_d)).
Note that because H has a very simple structure, the inversion of H required by the Newton step is greatly simplified by using the Woodbury identity [17]:
H^{-1} = −diag(ξ_1, …, ξ_d) − (1 / (ξ_0 − Σ_{j=1}^d ξ_j)) [ξ_i ξ_j]_{d×d},  where ξ_0 = 1/ψ_1(α_0) and ξ_j = 1/ψ_1(α_j), j = 1, …, d.
5.3 Estimating M for the Shadow Dirichlet
Thus far we have discussed how to construct M to achieve certain desired properties and how to interpret a given M's effect on the support. In some cases it may be useful to estimate M directly from data, for example, finding the maximum likelihood M. In general, this is a non-convex problem because the set of rank d − 1 matrices is not convex. However, we offer two approximations. First, note that as in estimating the support of a uniform distribution, the maximum likelihood M will correspond to a support that is no larger than needed to contain the convex hull of the sample pmfs. Second, the mean of the empirical pmfs will be in the support, and thus a heuristic is to set the kth column of M (which corresponds to the kth vertex of the support) to be a convex combination of the kth vertex of the standard probability simplex and the empirical mean pmf. We provide code that finds the d optimal such convex combinations such that a specified percentage of the sample pmfs are within the support, which reduces the non-convex problem of finding the maximum likelihood d × d matrix M to a d-dimensional convex relaxation.
6 Demonstrations
It is reasonable to believe that if the shadow Dirichlet better matches the problem's statistics, it will perform better in practice, but an open question is how much better. To motivate the reader to investigate this question further in applications, we provide two small demonstrations.
6.1 Verifying the EM Estimation
We used a broad suite of simulations to test and verify the EM estimation. Here we include a simple visual confirmation that the EM estimation works: we drew 100 i.i.d. pmfs from a shadow Dirichlet with monotonic M for d = 3 and α = [3.94 2.25 2.81] (used in [18]). From each of the 100 pmfs, we drew 100 i.i.d. samples. Then we applied the EM algorithm to find the α for both the standard compound Dirichlet and the compound shadow Dirichlet with the correct M. Fig. 2 shows the true distribution and the two estimated distributions.
Figure 2 (panels: true distribution (shadow Dirichlet), estimated shadow Dirichlet, estimated Dirichlet): Samples were drawn from the true distribution and the given EM method was applied to form the estimated distributions.
6.2 Estimating Proportions from Sales
Manufacturers often have constrained manufacturing resources, such as equipment, inventory of raw materials, and employee time, with which to produce multiple products. The manufacturer must decide how to proportionally allocate such constrained resources across their product line based on their estimate of proportional sales. Manufacturer Artifact Puzzles gave us their past retail sales data for the 20 puzzles they sold during July 2009 through December 2009, which we used to predict the proportion of sales expected for each puzzle. These estimates were then tested on the subsequent months of sales data, January 2010 through April 2010. The company also provided a similarity between puzzles S, where S(A, B) is the proportion of times an order during the six training months included both puzzle A and puzzle B, given that it included puzzle A.
We compared treating each of the six training months of sales data as a sample from a compound Dirichlet versus a compound shadow Dirichlet. For the shadow Dirichlet, we normalized each column of the similarity matrix S to sum to one so that it was left-stochastic, and used that as the M matrix; this forces puzzles that are often bought together to have closer estimated proportions. We estimated each α parameter by EM to maximize the likelihood of the past sales data, and then estimated the future sales proportions to be the mean of the estimated Dirichlet or shadow Dirichlet distribution. We also compared with treating all six months of sales data as coming from one multinomial, which we estimated as the maximum likelihood multinomial, and with taking the mean of the six empirical pmfs.
Table 2: Squared errors between estimates and actual proportional sales.
Multinomial: Jan. .0129 | Feb. .0185 | Mar. .0231 | Apr. .0240
Mean Pmf: Jan. .0106 | Feb. .0206 | Mar. .0222 | Apr. .0260
Dirichlet: Jan. .0109 | Feb. .0172 | Mar. .0227 | Apr. .0235
Shadow Dirichlet: Jan. .0093 | Feb. .0164 | Mar. .0197 | Apr. .0222
7 Summary
In this paper we have proposed a variant of the Dirichlet distribution that naturally captures some of the dependent structure that arises often in machine learning applications. We have discussed some of its theoretical properties, and shown how to specify the distribution for regularized pmfs, bounded variation pmfs, monotonic pmfs, and for any desired convex polytopal domain. We have derived the EM method and made available code to estimate both the shadow Dirichlet and compound shadow Dirichlet from data. Experimental results demonstrate that the EM method can estimate the shadow Dirichlet effectively, and that the shadow Dirichlet may provide worthwhile advantages in practice.
References
[1] B. Frigyik, A. Kapila, and M. R. Gupta, “Introduction to the Dirichlet distribution and related processes,” Tech. Rep., University of Washington, 2010.
[2] C. Zhai and J. Lafferty, “A study of smoothing methods for language models applied to information retrieval,” ACM Trans. on Information Systems, vol. 22, no. 2, pp. 179–214, 2004.
[3] Y. Chen, E. K. Garcia, M. R. Gupta, A. Rahimi, and L. Cazzanti, “Similarity-based classification: Concepts and algorithms,” Journal of Machine Learning Research, vol. 10, pp. 747–776, March 2009.
[4] R. Nallapati, T. Minka, and S. Robertson, “The smoothed-Dirichlet distribution: a building block for generative topic models,” Tech. Rep., Microsoft Research, Cambridge, 2007.
[5] J. Aitchison, Statistical Analysis of Compositional Data, Chapman & Hall, New York, 1986.
[6] R. J. Connor and J. E. Mosimann, “Concepts of independence for proportions with a generalization of the Dirichlet distribution,” Journal of the American Statistical Association, vol. 64, pp. 194–206, 1969.
[7] K. Pearson, “Mathematical contributions to the theory of evolution—on a form of spurious correlation which may arise when indices are used in the measurement of organs,” Proc. Royal Society of London, vol. 60, pp. 489–498, 1897.
[8] A. Ongaro, S. Migliorati, and G. S. Monti, “A new distribution on the simplex containing the Dirichlet family,” Proc. 3rd Compositional Data Analysis Workshop, 2008.
[9] W. S. Rayens and C. Srinivasan, “Dependence properties of generalized Liouville distributions on the simplex,” Journal of the American Statistical Association, vol. 89, no. 428, pp. 1465–1470, 1994.
[10] W. Rudin, Functional Analysis, McGraw-Hill, New York, 1991.
[11] G.
Ronning, “Maximum likelihood estimation of Dirichlet distributions,” Journal of Statistical Computation and Simulation, vol. 34, no. 4, pp. 215–221, 1989.
[12] T. Minka, “Estimating a Dirichlet distribution,” Tech. Rep., Microsoft Research, Cambridge, 2009.
[13] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 39, no. 1, pp. 1–38, 1977.
[14] M. R. Gupta and Y. Chen, Theory and Use of the EM Method, Foundations and Trends in Signal Processing, Hanover, MA, 2010.
[15] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970.
[16] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, Cambridge, 2004.
[17] K. B. Petersen and M. S. Pedersen, The Matrix Cookbook, 2009, available at matrixcookbook.com.
[18] R. E. Madsen, D. Kauchak, and C. Elkan, “Modeling word burstiness using the Dirichlet distribution,” in Proc. Intl. Conf. Machine Learning, 2005.
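Returning to Section 5.1 of the shadow Dirichlet paper above: since (6) reduces shadow Dirichlet maximum likelihood estimation to standard Dirichlet estimation on the transformed sample q̃ = M^{-1}q, a minimal sketch is given below. It assumes the observed pmfs lie strictly in the interior of S_M, that SciPy is available, and that the fixed-point Dirichlet fit follows the standard approach of [11, 12]; it is not the authors' released code.

```python
import numpy as np
from scipy.special import digamma, polygamma

def inv_digamma(y, iters=20):
    # Newton inversion of the digamma function (initialization follows the
    # usual recipe from Minka's note on Dirichlet estimation -- an assumption).
    x = np.where(y >= -2.22, np.exp(y) + 0.5, -1.0 / (y - digamma(1.0)))
    for _ in range(iters):
        x = x - (digamma(x) - y) / polygamma(1, x)
    return x

def fit_dirichlet(p, iters=200):
    # Fixed-point ML fit of Dirichlet(alpha) to pmfs p (one pmf per row), via
    # psi(alpha_j) = psi(alpha_0) + mean_i log p_ij.  Entries must be positive.
    logp_bar = np.log(p).mean(axis=0)
    alpha = np.ones(p.shape[1])
    for _ in range(iters):
        alpha = inv_digamma(digamma(alpha.sum()) + logp_bar)
    return alpha

def fit_shadow_dirichlet(q, M):
    # Equation (6): map observed pmfs back through M^{-1}, then fit a
    # standard Dirichlet to the transformed sample.
    q_tilde = np.linalg.solve(M, q.T).T
    return fit_dirichlet(q_tilde)
```

For example, applying fit_shadow_dirichlet to samples drawn as in the earlier sampling sketch should recover an estimate close to the generating α when N is large.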
3 0.68259794 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data
Author: Zoubin Ghahramani, Michael I. Jordan, Ryan P. Adams
Abstract: Many data are naturally modeled by an unobserved hierarchical structure. In this paper we propose a flexible nonparametric prior over unknown data hierarchies. The approach uses nested stick-breaking processes to allow for trees of unbounded width and depth, where data can live at any node and are infinitely exchangeable. One can view our model as providing infinite mixtures where the components have a dependency structure corresponding to an evolutionary diffusion down a tree. By using a stick-breaking approach, we can apply Markov chain Monte Carlo methods based on slice sampling to perform Bayesian inference and simulate from the posterior distribution on trees. We apply our method to hierarchical clustering of images and topic modeling of text data. 1
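For readers unfamiliar with the building block named in this abstract, the following is a generic one-level stick-breaking (GEM) sketch; it is background only, not the paper's nested, tree-structured construction, and the concentration value and truncation level are illustrative assumptions.

```python
import numpy as np

def stick_breaking_weights(alpha, n, rng=np.random.default_rng(0)):
    # One level of stick-breaking: beta_k ~ Beta(1, alpha),
    # pi_k = beta_k * prod_{j<k} (1 - beta_j), truncated at n atoms.
    betas = rng.beta(1.0, alpha, size=n)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    return betas * remaining

pi = stick_breaking_weights(alpha=2.0, n=50)
print(pi.sum())  # close to 1 for large n; the leftover mass is the unbroken stick
```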
4 0.61612076 185 nips-2010-Nonparametric Density Estimation for Stochastic Optimization with an Observable State Variable
Author: Lauren Hannah, Warren Powell, David M. Blei
Abstract: In this paper we study convex stochastic optimization problems where a noisy objective function value is observed after a decision is made. There are many stochastic optimization problems whose behavior depends on an exogenous state variable which affects the shape of the objective function. Currently, there is no general purpose algorithm to solve this class of problems. We use nonparametric density estimation to take observations from the joint state-outcome distribution and use them to infer the optimal decision for a given query state s. We propose two solution methods that depend on the problem characteristics: function-based and gradient-based optimization. We examine two weighting schemes, kernel based weights and Dirichlet process based weights, for use with the solution methods. The weights and solution methods are tested on a synthetic multi-product newsvendor problem and the hour ahead wind commitment problem. Our results show that in some cases Dirichlet process weights offer substantial benefits over kernel based weights and more generally that nonparametric estimation methods provide good solutions to otherwise intractable problems. 1
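As a generic illustration of the kernel-based weighting scheme mentioned in this abstract (not the authors' algorithm), one can weight each historical (state, outcome) observation by a Gaussian kernel on its distance to the query state and then optimize the weighted empirical objective; the bandwidth and the toy discrete decision set below are assumptions made purely for illustration.

```python
import numpy as np

def kernel_weights(states, query, bandwidth=0.5):
    # Gaussian kernel weights on state distance, normalized to sum to 1.
    d2 = np.sum((states - query) ** 2, axis=1)
    w = np.exp(-0.5 * d2 / bandwidth ** 2)
    return w / w.sum()

# Toy use: pick the candidate decision minimizing the weighted average loss.
rng = np.random.default_rng(1)
states = rng.normal(size=(200, 2))      # exogenous states s_i
losses = rng.normal(size=(200, 5))      # observed loss of 5 candidate decisions
w = kernel_weights(states, query=np.zeros(2))
best_decision = np.argmin(w @ losses)
```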
5 0.60598344 120 nips-2010-Improvements to the Sequence Memoizer
Author: Jan Gasthaus, Yee W. Teh
Abstract: The sequence memoizer is a model for sequence data with state-of-the-art performance on language modeling and compression. We propose a number of improvements to the model and inference algorithm, including an enlarged range of hyperparameters, a memory-efficient representation, and inference algorithms operating on the new representation. Our derivations are based on precise definitions of the various processes that will also allow us to provide an elementary proof of the “mysterious” coagulation and fragmentation properties used in the original paper on the sequence memoizer by Wood et al. (2009). We present some experimental results supporting our improvements. 1
7 0.55809575 55 nips-2010-Cross Species Expression Analysis using a Dirichlet Process Mixture Model with Latent Matchings
8 0.54264235 215 nips-2010-Probabilistic Deterministic Infinite Automata
9 0.44084775 131 nips-2010-Joint Analysis of Time-Evolving Binary Matrices and Associated Documents
10 0.38271552 18 nips-2010-A novel family of non-parametric cumulative based divergences for point processes
11 0.38211188 60 nips-2010-Deterministic Single-Pass Algorithm for LDA
12 0.37912557 35 nips-2010-Auto-Regressive HMM Inference with Incomplete Data for Short-Horizon Wind Forecasting
13 0.37806755 242 nips-2010-Slice sampling covariance hyperparameters of latent Gaussian models
14 0.36071503 150 nips-2010-Learning concept graphs from text with stick-breaking priors
15 0.35483676 155 nips-2010-Learning the context of a category
16 0.35103852 67 nips-2010-Dynamic Infinite Relational Model for Time-varying Relational Data Analysis
17 0.34909672 82 nips-2010-Evaluation of Rarity of Fingerprints in Forensics
18 0.34682229 129 nips-2010-Inter-time segment information sharing for non-homogeneous dynamic Bayesian networks
19 0.3419362 33 nips-2010-Approximate inference in continuous time Gaussian-Jump processes
20 0.3296563 122 nips-2010-Improving the Asymptotic Performance of Markov Chain Monte-Carlo by Inserting Vortices
topicId topicWeight
[(13, 0.035), (17, 0.013), (22, 0.013), (27, 0.099), (30, 0.08), (35, 0.03), (45, 0.162), (50, 0.103), (52, 0.074), (60, 0.032), (77, 0.049), (90, 0.054), (93, 0.17)]
simIndex simValue paperId paperTitle
same-paper 1 0.85378921 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
Author: Dahua Lin, Eric Grimson, John W. Fisher
Abstract: We present a novel method for constructing dependent Dirichlet processes. The approach exploits the intrinsic relationship between Dirichlet and Poisson processes in order to create a Markov chain of Dirichlet processes suitable for use as a prior over evolving mixture models. The method allows for the creation, removal, and location variation of component models over time while maintaining the property that the random measures are marginally DP distributed. Additionally, we derive a Gibbs sampling algorithm for model inference and test it on both synthetic and real data. Empirical results demonstrate that the approach is effective in estimating dynamically varying mixture models. 1
Author: Achintya Kundu, Vikram Tankasali, Chiranjib Bhattacharyya, Aharon Ben-tal
Abstract: In this paper we consider the problem of learning an n × n kernel matrix from m (≥ 1) similarity matrices under a general convex loss. Past research has extensively studied the m = 1 case and has derived several algorithms which require sophisticated techniques like ACCP, SOCP, etc. The existing algorithms do not apply if one uses arbitrary losses and often cannot handle the m > 1 case. We present several provably convergent iterative algorithms, where each iteration requires either an SVM or a Multiple Kernel Learning (MKL) solver for the m > 1 case. One of the major contributions of the paper is to extend the well-known Mirror Descent (MD) framework to handle a Cartesian product of psd matrices. This novel extension leads to an algorithm, called EMKL, which solves the problem in O(m^2 log n / ε^2) iterations; each iteration solves an MKL problem involving m kernels and performs m eigen-decompositions of n × n matrices. By suitably defining a restriction on the objective function, a faster version of EMKL is proposed, called REKL, which avoids the eigen-decomposition. An alternative to both EMKL and REKL is also suggested which requires only an SVM solver. Experimental results on a real-world protein data set involving several similarity matrices illustrate the efficacy of the proposed algorithms.
3 0.78011078 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
Author: Surya Ganguli, Haim Sompolinsky
Abstract: Recent proposals suggest that large, generic neuronal networks could store memory traces of past input sequences in their instantaneous state. Such a proposal raises important theoretical questions about the duration of these memory traces and their dependence on network size, connectivity and signal statistics. Prior work, in the case of gaussian input sequences and linear neuronal networks, shows that the duration of memory traces in a network cannot exceed the number of neurons (in units of the neuronal time constant), and that no network can out-perform an equivalent feedforward network. However a more ethologically relevant scenario is that of sparse input sequences. In this scenario, we show how linear neural networks can essentially perform compressed sensing (CS) of past inputs, thereby attaining a memory capacity that exceeds the number of neurons. This enhanced capacity is achieved by a class of “orthogonal” recurrent networks and not by feedforward networks or generic recurrent networks. We exploit techniques from the statistical physics of disordered systems to analytically compute the decay of memory traces in such networks as a function of network size, signal sparsity and integration time. Alternately, viewed purely from the perspective of CS, this work introduces a new ensemble of measurement matrices derived from dynamical systems, and provides a theoretical analysis of their asymptotic performance. 1
4 0.77556741 22 nips-2010-Active Estimation of F-Measures
Author: Christoph Sawade, Niels Landwehr, Tobias Scheffer
Abstract: We address the problem of estimating the Fα -measure of a given model as accurately as possible on a fixed labeling budget. This problem occurs whenever an estimate cannot be obtained from held-out training data; for instance, when data that have been used to train the model are held back for reasons of privacy or do not reflect the test distribution. In this case, new test instances have to be drawn and labeled at a cost. An active estimation procedure selects instances according to an instrumental sampling distribution. An analysis of the sources of estimation error leads to an optimal sampling distribution that minimizes estimator variance. We explore conditions under which active estimates of Fα -measures are more accurate than estimates based on instances sampled from the test distribution. 1
5 0.77109647 117 nips-2010-Identifying graph-structured activation patterns in networks
Author: James Sharpnack, Aarti Singh
Abstract: We consider the problem of identifying an activation pattern in a complex, largescale network that is embedded in very noisy measurements. This problem is relevant to several applications, such as identifying traces of a biochemical spread by a sensor network, expression levels of genes, and anomalous activity or congestion in the Internet. Extracting such patterns is a challenging task specially if the network is large (pattern is very high-dimensional) and the noise is so excessive that it masks the activity at any single node. However, typically there are statistical dependencies in the network activation process that can be leveraged to fuse the measurements of multiple nodes and enable reliable extraction of highdimensional noisy patterns. In this paper, we analyze an estimator based on the graph Laplacian eigenbasis, and establish the limits of mean square error recovery of noisy patterns arising from a probabilistic (Gaussian or Ising) model based on an arbitrary graph structure. We consider both deterministic and probabilistic network evolution models, and our results indicate that by leveraging the network interaction structure, it is possible to consistently recover high-dimensional patterns even when the noise variance increases with network size. 1
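As a generic illustration of an estimator built on the graph Laplacian eigenbasis, as mentioned in this abstract (the truncation level k and the hard-truncation rule below are illustrative assumptions, not the paper's exact estimator or thresholds), one can project the noisy node measurements onto the k smoothest eigenvectors of L = D − A:

```python
import numpy as np

def laplacian_eigenbasis_denoise(adjacency, y, k):
    # Project noisy node measurements y onto the k smoothest eigenvectors
    # of the (unnormalized) graph Laplacian L = D - A, then reconstruct.
    degrees = adjacency.sum(axis=1)
    laplacian = np.diag(degrees) - adjacency
    eigvals, eigvecs = np.linalg.eigh(laplacian)   # eigenvalues in ascending order
    basis = eigvecs[:, :k]                         # smooth (low-frequency) basis
    return basis @ (basis.T @ y)

# Toy use: a chain graph carrying a smooth activation pattern plus noise.
n = 100
A = np.zeros((n, n))
idx = np.arange(n - 1)
A[idx, idx + 1] = A[idx + 1, idx] = 1.0
signal = np.sin(np.linspace(0, np.pi, n))
y = signal + 0.5 * np.random.default_rng(2).normal(size=n)
estimate = laplacian_eigenbasis_denoise(A, y, k=10)
```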
6 0.76601398 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior
7 0.76486552 21 nips-2010-Accounting for network effects in neuronal responses using L1 regularized point process models
8 0.76085281 96 nips-2010-Fractionally Predictive Spiking Neurons
9 0.75478041 161 nips-2010-Linear readout from a neural population with partial correlation data
10 0.75319165 18 nips-2010-A novel family of non-parametric cumulative based divergences for point processes
11 0.75266707 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
12 0.75033087 155 nips-2010-Learning the context of a category
13 0.75019711 200 nips-2010-Over-complete representations on recurrent neural networks can support persistent percepts
14 0.74994826 268 nips-2010-The Neural Costs of Optimal Control
15 0.74731004 55 nips-2010-Cross Species Expression Analysis using a Dirichlet Process Mixture Model with Latent Matchings
16 0.74717098 148 nips-2010-Learning Networks of Stochastic Differential Equations
17 0.74677205 56 nips-2010-Deciphering subsampled data: adaptive compressive sampling as a principle of brain communication
18 0.74499673 33 nips-2010-Approximate inference in continuous time Gaussian-Jump processes
19 0.74496907 17 nips-2010-A biologically plausible network for the computation of orientation dominance
20 0.74487692 194 nips-2010-Online Learning for Latent Dirichlet Allocation