nips nips2012 nips2012-57 knowledge-graph by maker-knowledge-mining

57 nips-2012-Bayesian estimation of discrete entropy with mixtures of stick-breaking priors


Source: pdf

Author: Evan Archer, Il M. Park, Jonathan W. Pillow

Abstract: We consider the problem of estimating Shannon’s entropy H in the under-sampled regime, where the number of possible symbols may be unknown or countably infinite. Dirichlet and Pitman-Yor processes provide tractable prior distributions over the space of countably infinite discrete distributions, and have found major applications in Bayesian non-parametric statistics and machine learning. Here we show that they provide natural priors for Bayesian entropy estimation, due to the analytic tractability of the moments of the induced posterior distribution over entropy H. We derive formulas for the posterior mean and variance of H given data. However, we show that a fixed Dirichlet or Pitman-Yor process prior implies a narrow prior on H, meaning the prior strongly determines the estimate in the under-sampled regime. We therefore define a family of continuous mixing measures such that the resulting mixture of Dirichlet or Pitman-Yor processes produces an approximately flat prior over H. We explore the theoretical properties of the resulting estimators and show that they perform well on data sampled from both exponential and power-law tailed distributions. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Bayesian estimation of discrete entropy with mixtures of stick-breaking priors Evan Archer*124 , Il Memming Park*234 , & Jonathan W. [sent-1, score-0.5]

2 Division of Statistics & Scientific Computation The University of Texas at Austin Abstract We consider the problem of estimating Shannon’s entropy H in the under-sampled regime, where the number of possible symbols may be unknown or countably infinite. [sent-7, score-0.543]

3 Dirichlet and Pitman-Yor processes provide tractable prior distributions over the space of countably infinite discrete distributions, and have found major applications in Bayesian non-parametric statistics and machine learning. [sent-8, score-0.448]

4 Here we show that they provide natural priors for Bayesian entropy estimation, due to the analytic tractability of the moments of the induced posterior distribution over entropy H. [sent-9, score-0.995]

5 We derive formulas for the posterior mean and variance of H given data. [sent-10, score-0.125]

6 However, we show that a fixed Dirichlet or Pitman-Yor process prior implies a narrow prior on H, meaning the prior strongly determines the estimate in the under-sampled regime. [sent-11, score-0.63]

7 We therefore define a family of continuous mixing measures such that the resulting mixture of Dirichlet or Pitman-Yor processes produces an approximately flat prior over H. [sent-12, score-0.337]

8 1 Introduction An important statistical problem in the study of natural systems is to estimate the entropy of an unknown discrete distribution on the basis of an observed sample. [sent-14, score-0.462]

9 This is often much easier than the problem of estimating the distribution itself; in many cases, entropy can be accurately estimated with fewer samples than the number of distinct symbols. [sent-15, score-0.398]

10 Entropy estimation remains a difficult problem, however, as there is no unbiased estimator for entropy, and the maximum likelihood estimator exhibits severe bias for small datasets. [sent-16, score-0.186]

11 The basic idea is to place a prior over the space of probability distributions that might have generated the data, and then perform inference using the induced posterior distribution over entropy. [sent-19, score-0.396]

12 We focus on the setting where our data are a finite sample from an unknown, or possibly even countably infinite, number of symbols. [sent-22, score-0.104]

13 However, we show that a fixed PYP prior imposes a narrow ... (*These authors contributed equally.) [sent-26, score-0.258]

14 Figure 1: Graphical model illustrating the ingredients for Bayesian entropy estimation. [sent-30, score-0.337]

15 Arrows indicate conditional dependencies between variables, and the gray “plate” denotes multiple copies of a random variable (with the number of copies N indicated at bottom). [sent-31, score-0.096]

16 For entropy estimation, the joint probability distribution over entropy H, data x = {x_j}, discrete distribution π = {π_i}, and parameter θ factorizes as: p(H, x, π, θ) = p(H|π) p(x|π) p(π|θ) p(θ). [sent-32, score-0.784]

17 prior over entropy, leading to severe bias and overly narrow credible intervals for small datasets. [sent-34, score-0.48]

18 We address this shortcoming by introducing a set of mixing measures such that the resulting Pitman-Yor Mixture (PYM) prior provides an approximately non-informative (i.e., flat) prior over entropy. [sent-35, score-0.242]

19 In Section 2, we introduce the entropy estimation problem and review prior work. [sent-39, score-0.54]

20 In Section 4, we introduce a novel entropy estimator based on PYM priors and derive several of its theoretical properties. [sent-41, score-0.483]

21 2 Entropy Estimation Consider samples x := {x_j}_{j=1}^N drawn iid from an unknown discrete distribution π := {π_i}_{i=1}^A on a finite or (countably) infinite alphabet X. [sent-43, score-0.205]

22 We wish to estimate the entropy of π, H(π) = −∑_{i=1}^A π_i log π_i, (1) where we identify X = {1, 2, . [sent-44, score-0.359]

23 , A} as the set of alphabets without loss of generality (where the alphabet size A may be infinite), and π_i > 0 denotes the probability of observing symbol i. [sent-47, score-0.098]
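
As a concrete illustration of eq. (1) and of why the under-sampled regime is hard, here is a short Python sketch (not from the paper; the alphabet size A, sample size N, and seed are illustrative choices) computing the maximum-likelihood "plug-in" estimate, which is severely biased downward when N << A:

```python
import numpy as np

def plugin_entropy(counts):
    """Maximum-likelihood ('plug-in') entropy estimate in nats: -sum_i p_i log p_i."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

# Illustrative setup: uniform distribution over A symbols, sampled only N << A times.
rng = np.random.default_rng(0)
A, N = 1000, 100
x = rng.integers(0, A, size=N)          # iid sample of symbol indices
counts = np.bincount(x, minlength=A)
print(plugin_entropy(counts))           # noticeably below the true entropy...
print(np.log(A))                        # ...which is log(A) for the uniform distribution
```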

24 1 Bayesian entropy estimation The Bayesian approach to entropy estimation involves formulating a prior over distributions π, and then turning the crank of Bayesian inference to infer H using the posterior distribution. [sent-55, score-1.072]

25 Bayes’ least squares (BLS) estimators take the form: Ĥ(x) = E[H|x] = ∫ H(π) p(π|x) dπ, (2) where p(π|x) is the posterior over π under some prior p(π) and categorical likelihood p(x|π) = ∏_j p(x_j|π), where p(x_j = i) = π_i. [sent-56, score-0.335]

26 To the extent that p(π) expresses our true prior uncertainty over the unknown distribution that generated the data, this estimate is optimal in a least-squares sense, and the corresponding credible intervals capture our uncertainty about H given the data. [sent-58, score-0.436]

27 For distributions with known finite alphabet size A, the Dirichlet distribution provides an obvious choice of prior due to its conjugacy to the discrete (or multinomial) likelihood. [sent-59, score-0.401]

28 We compare samples from the DP (red) and PYP (blue) priors for two datasets with heavy tails (black). [sent-61, score-0.212]

29 2e6 neural spike words from 27 simultaneously-recorded retinal ganglion cells, binarized and binned at 10 ms [18]. [sent-65, score-0.134]

30 Many previously proposed estimators can be viewed as Bayesian estimators with a particular fixed choice of α. [sent-68, score-0.136]

31 2 Nemenman-Shafee-Bialek (NSB) estimator In a seminal paper, Nemenman et al. [6] showed that Dirichlet priors impose a narrow prior over entropy. [sent-71, score-0.429]

32 In the under-sampled regime, Bayesian estimates using a fixed Dirichlet prior are severely biased, and have small credible intervals (i. [sent-72, score-0.384]

33 To address this problem, [6] suggested a mixture-of-Dirichlets prior: p(π) = ∫ p_Dir(π|α) p(α) dα, (3) where p_Dir(π|α) denotes a Dir(α) prior on π. [sent-76, score-0.202]

34 To construct an approximately flat prior on entropy, [6] proposed the mixing weights on α given by p(α) ∝ (d/dα) E[H|α] = A ψ₁(Aα + 1) − ψ₁(α + 1), (4) where E[H|α] denotes the expected value of H under a Dir(α) prior, and ψ₁(·) denotes the trigamma function. [sent-77, score-0.296]

35 The BLS estimator under the NSB prior can then be written as Ĥ_nsb = E[H|x] = ∬ H(π) p(π|x, α) p(α|x) dπ dα = ∫ E[H|x, α] (p(x|α) p(α) / p(x)) dα, (5) where E[H|x, α] is the posterior mean under a Dir(α) prior, and p(x|α) denotes the evidence, which has a Polya distribution. [sent-80, score-0.355]
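
A minimal sketch of this construction for a known, finite alphabet size A (my own illustration, assuming numpy/scipy; the α grid and the discrete weighting scheme are convenience choices, not the paper's implementation). E[H|x, α] is the Dirichlet posterior mean of H, p(x|α) is the Polya evidence, and the mixing weights follow eq. (4):

```python
import numpy as np
from scipy.special import gammaln, digamma, polygamma

def dir_posterior_mean_H(counts, a):
    """E[H | x, alpha] under a symmetric Dir(alpha) prior with known alphabet size A."""
    counts = np.asarray(counts, dtype=float)
    A, N = counts.size, counts.sum()
    ni = counts + a
    return digamma(N + A * a + 1) - np.sum(ni / (N + A * a) * digamma(ni + 1))

def log_evidence(counts, a):
    """log p(x | alpha): Polya (Dirichlet-multinomial) evidence, up to an alpha-independent term."""
    counts = np.asarray(counts, dtype=float)
    A, N = counts.size, counts.sum()
    return gammaln(A * a) - gammaln(N + A * a) + np.sum(gammaln(counts + a)) - A * gammaln(a)

def nsb_estimate(counts, alphas=np.logspace(-4, 2, 400)):
    """NSB-style estimate: average E[H|x,alpha] under weights p(x|alpha) p(alpha) on an alpha grid."""
    A = len(counts)                      # assumes A > 1
    # eq. (4): p(alpha) proportional to d/dalpha E[H|alpha] = A*psi1(A*alpha+1) - psi1(alpha+1)
    log_p_alpha = np.log(A * polygamma(1, A * alphas + 1) - polygamma(1, alphas + 1))
    log_w = log_p_alpha + np.array([log_evidence(counts, a) for a in alphas])
    w = np.exp(log_w - log_w.max())
    w /= w.sum()                         # treat the grid as a discrete set of candidate alphas
    return float(np.dot(w, [dir_posterior_mean_H(counts, a) for a in alphas]))
```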

36 Nemenman et al. proposed one such extension using an approximation to Ĥ_nsb [15, 16]. [sent-84, score-0.195]

37 3 Stick-Breaking Priors To construct a prior over countably infinite discrete distributions we employ a class of distributions from nonparametric Bayesian statistics known as stick-breaking processes [19]. [sent-87, score-0.523]

38 Both are stochastic processes whose samples are discrete probability distributions [7, 20]. [sent-89, score-0.2]

39 A sample from a DP or PYP may be written as ∑_{i=1}^∞ π_i δ_{φ_i}, where π = {π_i} denotes a countably infinite set of ‘weights’ on a set of atoms {φ_i} drawn from some base probability measure, and δ_{φ_i} denotes a delta function on the atom φ_i. [sent-90, score-0.18]

40 1 The prior distribution over π under the DP and PYP is technically called the GEM distribution or the two-parameter Poisson-Dirichlet distribution, but we will abuse terminology and refer to it more simply using the notation DP or PY. [sent-91, score-0.235]

41 The DP weight distribution DP(α) may be described as a limit of the finite Dirichlet distributions where the alphabet size grows and the concentration parameter shrinks, A → ∞.

42 Asymptotically, the tails of a (sorted) sample from DP(α) decay exponentially, while for PY(d, α) with d ≠ 0 the tails approximately follow a power law: π_i ∝ i^(−1/d) ([7], pp.
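
The stick-breaking construction referred to in this section is easy to simulate. A sketch (my own illustration, using the standard Beta(1 − d, α + i·d) stick-breaking representation of PY(d, α); d = 0 recovers DP(α)):

```python
import numpy as np

def py_stick_breaking(d, alpha, n_weights=10_000, rng=None):
    """Truncated stick-breaking draw of PY(d, alpha) weights:
    v_i ~ Beta(1 - d, alpha + i*d),  pi_i = v_i * prod_{j<i} (1 - v_j)."""
    rng = np.random.default_rng() if rng is None else rng
    i = np.arange(1, n_weights + 1)
    v = rng.beta(1.0 - d, alpha + i * d)
    log_stick_left = np.concatenate(([0.0], np.cumsum(np.log1p(-v))[:-1]))
    return np.exp(np.log(v) + log_stick_left)

# Sorted weights: the DP sample decays roughly exponentially, the PY (d > 0) sample like i**(-1/d).
w_dp = np.sort(py_stick_breaking(0.0, alpha=10.0))[::-1]
w_py = np.sort(py_stick_breaking(0.5, alpha=10.0))[::-1]
```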

43 These expectations are required for computing the mean and variance of H under the prior (or posterior) over ⇡. [sent-110, score-0.239]

44 The direct consequence of this lemma is that first two moments of H(⇡) under the DP and PY priors have closed forms , which can be obtained using (from eq. [sent-116, score-0.131]

45 This allows us to ignore the base measure, making entropy of the distribution equal to the entropy of the weights ⇡. [sent-119, score-0.704]

46 Figure 3, panel titles and axes: Prior Mean (expected entropy, nats) and Prior Uncertainty (standard deviation, nats), shown for several values of d. [sent-121, score-0.337]

47 Figure 3: Prior mean and standard deviation over entropy H under a fixed PY prior, as a function of α and d. [sent-131, score-0.337]

48 Note that expected entropy is approximately linear in log ↵. [sent-132, score-0.382]

49 Small prior standard deviations (right) indicate that p(H(⇡)|d, ↵) is highly concentrated around the prior mean (left). [sent-133, score-0.35]

50 2 Posterior distribution over weights A second desirable property of the PY distribution is that the posterior p(π_post|x, d, α) takes the form of a (finite) Dirichlet mixture of point masses and a PY distribution [8]. [sent-135, score-0.233]

51 This makes it possible to apply the above results to the posterior mean and variance of H. [sent-136, score-0.125]

52 Then let α_i = n_i − d, N = ∑_i n_i, and A = ∑_i α_i = ∑_i n_i − Kd = N − Kd, where K = ∑_{i=1}^A 1{n_i > 0} is the number of unique symbols observed. [sent-138, score-0.325]

53 Given data, the posterior over (countably infinite) discrete distributions, written as ⇡post = (p1 , p2 , p3 , . [sent-139, score-0.142]

54 … (1 − d, α + Kd). (9) Bayesian entropy inference with PY priors. Fixed PY priors. Using the results of the previous section (eqs. [sent-153, score-0.507]

55 7 and 8), we can derive the prior mean and variance of H under a PY(d, α) prior on π: E[H(π)|d, α] = ψ₀(1 + α) − ψ₀(1 − d), (10) and var[H(π)|d, α] = (α + d)/((1 + α)²(1 − d)) + ((1 − d)/(1 + α)) ψ₁(2 − d) − ψ₁(2 + α), (11) where ψ_n is the polygamma function of n-th order (i.e., ψ₀ is the digamma and ψ₁ the trigamma function). [sent-154, score-0.383]
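
A direct transcription of eqs. (10)-(11) as written above; treat it as a sketch to be checked against the paper rather than a definitive implementation. ψ₀ and ψ₁ are scipy's digamma and polygamma(1, ·):

```python
from scipy.special import digamma, polygamma

def py_prior_mean_H(d, alpha):
    """E[H(pi) | d, alpha] under a PY(d, alpha) prior (eq. 10 as given above)."""
    return digamma(1 + alpha) - digamma(1 - d)

def py_prior_var_H(d, alpha):
    """var[H(pi) | d, alpha] under a PY(d, alpha) prior (eq. 11 as given above)."""
    return ((alpha + d) / ((1 + alpha) ** 2 * (1 - d))
            + (1 - d) / (1 + alpha) * polygamma(1, 2 - d)
            - polygamma(1, 2 + alpha))

# A fixed (d, alpha) pins down H tightly: large mean, small standard deviation (cf. Figure 3).
print(py_prior_mean_H(0.5, 100.0), py_prior_var_H(0.5, 100.0) ** 0.5)
```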

56 These reveal the same phenomenon that [6] observed for finite Dirichlet distributions: a PY prior with fixed (d, ↵) induces a narrow prior over H. [sent-159, score-0.433]

57 In the undersampled regime, Bayesian estimates under PY priors will therefore be strongly determined by the choice of (d, ↵), and posterior credible intervals will be unrealistically narrow. [sent-160, score-0.429]

58 2 Pitman-Yor process mixture (PYM) prior The narrow prior on H induced by fixed PY priors suggests a strategy for constructing a non-informative prior: mix together a family of PY distributions with some hyper-prior p(d, α) selected to yield an approximately flat prior on H. [sent-162, score-0.888]

59 There, one can obtain arbitrarily large prior variance over H for given mean. [sent-167, score-0.208]

60 However, such priors have very heavy tails and seem poorly suited to data with finite or exponential tails; we do not explore them further here. [sent-168, score-0.181]

61 Figure 4: Expected entropy under Pitman-Yor and Pitman-Yor Mixture priors (x-axis: Entropy H). [sent-177, score-0.337]

62 (A) Left: expected entropy as a function of the natural parameters (d, α). [sent-178, score-0.337]

63 Right: expected entropy as a function of the transformed parameters (h, γ). [sent-179, score-0.337]

64 Note that the implied prior over H is approximately flat. [sent-181, score-0.22]

65 prior entropies can arise either from large values of α (as in the DP) or from values of d near 1. [sent-182, score-0.206]

66 We can explicitly control this trade-off by reparametrizing the PY distribution, letting h = ψ₀(1 + α) − ψ₀(1 − d) and γ = (ψ₀(1) − ψ₀(1 − d)) / (ψ₀(1 + α) − ψ₀(1 − d)), (12) where h > 0 is equal to the expected entropy of the prior (eq. [sent-185, score-0.512]

67 10) and γ > 0 captures prior beliefs about tail behavior of π. [sent-186, score-0.175]

68 We can construct an (approximately) flat improper prior over H on [0, ∞) by setting p(h, γ) = q(γ), where q is any density on [0, ∞). [sent-191, score-0.215]
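
The reparametrization can be inverted numerically. The sketch below is my own, assumes eq. (12) in the form given above, and restricts to h > 0 and 0 ≤ γ < 1 (the range relevant for the mixing density q(γ) used below):

```python
from scipy.special import digamma
from scipy.optimize import brentq

def hgamma_to_dalpha(h, gamma):
    """Invert eq. (12) as given above:
        h     = psi0(1 + alpha) - psi0(1 - d)
        gamma = (psi0(1) - psi0(1 - d)) / (psi0(1 + alpha) - psi0(1 - d))
    Assumes h > 0 and 0 <= gamma < 1."""
    target = digamma(1.0) - gamma * h          # desired value of psi0(1 - d)
    d = 0.0 if gamma == 0.0 else brentq(lambda t: digamma(1.0 - t) - target, 0.0, 1.0 - 1e-12)
    # psi0(1 + alpha) = h + psi0(1 - d); digamma is increasing, so bracket the root and solve.
    f = lambda a: digamma(1.0 + a) - (h + digamma(1.0 - d))
    hi = 1.0
    while f(hi) < 0.0:
        hi *= 10.0
    return d, brentq(f, 0.0, hi)

d, alpha = hgamma_to_dalpha(h=3.0, gamma=0.5)   # illustrative values
```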

69 The induced prior on entropy is thus: p(H) = ∬ p(H|π) p_PY(π|γ, h) p(γ, h) dγ dh, (13) where p_PY(π|γ, h) denotes a PY distribution on π with parameters γ, h. [sent-192, score-0.593]

70 4B shows samples from this prior under three different choices of q(γ), for h uniform on [0, 3]. [sent-194, score-0.206]

71 We refer to the resulting prior distribution over π as the Pitman-Yor mixture (PYM) prior. [sent-195, score-0.256]

72 All results in the figures are generated using the prior q(γ) ∝ max(1 − γ, 0). [sent-196, score-0.175]

73 Just as in the case of the prior mean, the posterior mean E[H|x, d, α] is given by a convenient analytic form (derived in the Appendix): E[H|α, d, x] = ψ₀(α + N + 1) − (1/(α + N)) [ (α + Kd) ψ₀(1 − d) + ∑_{i=1}^K (n_i − d) ψ₀(n_i − d + 1) ]. [sent-199, score-0.311]
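
A transcription of the posterior-mean expression above, as a sketch to be checked against the paper's appendix; it needs only the nonzero counts n_i, their total N, and the number of distinct observed symbols K:

```python
import numpy as np
from scipy.special import digamma

def py_posterior_mean_H(counts, d, alpha):
    """E[H | alpha, d, x] under a fixed PY(d, alpha) prior, using the formula given above."""
    n = np.asarray(counts, dtype=float)
    n = n[n > 0]
    N, K = n.sum(), n.size
    return digamma(alpha + N + 1) - ((alpha + K * d) * digamma(1 - d)
                                     + np.sum((n - d) * digamma(n - d + 1))) / (alpha + N)

counts = np.array([5, 1, 1, 1, 2, 2, 7])        # illustrative count vector
print(py_posterior_mean_H(counts, d=0.1, alpha=1.0))
```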

74 We can obtain confidence regions for Ĥ_PYM by computing the posterior variance E[(H − Ĥ_PYM)²|x]. [sent-201, score-0.125]
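
Both conditional moments can also be sanity-checked by Monte Carlo, drawing H(π_post) from the posterior decomposition described earlier in this section (a sketch under the assumption that the posterior is the stated mixture of a finite Dirichlet and a PY tail; the truncation level and parameter values are illustrative):

```python
import numpy as np

def sample_posterior_entropy(counts, d, alpha, n_tail=20_000, rng=None):
    """One posterior draw of H(pi_post): (p_1..p_K, p*) ~ Dir(n_1-d, ..., n_K-d, alpha+K*d),
    with the leftover mass p* spread over a truncated PY(d, alpha+K*d) stick-breaking draw."""
    rng = np.random.default_rng() if rng is None else rng
    n = np.asarray(counts, dtype=float)
    n = n[n > 0]
    K = n.size
    w = rng.dirichlet(np.concatenate([n - d, [alpha + K * d]]))
    p_obs, p_rest = w[:-1], w[-1]
    i = np.arange(1, n_tail + 1)
    v = rng.beta(1.0 - d, (alpha + K * d) + i * d)           # PY(d, alpha + K*d) sticks
    tail = np.exp(np.log(v) + np.concatenate(([0.0], np.cumsum(np.log1p(-v))[:-1])))
    pi = np.concatenate([p_obs, p_rest * tail])
    pi = pi[pi > 0]
    return -np.sum(pi * np.log(pi))

counts = np.array([5, 1, 1, 1, 2, 2, 7])
draws = [sample_posterior_entropy(counts, d=0.1, alpha=1.0) for _ in range(200)]
print(np.mean(draws), np.var(draws))   # compare with the analytic posterior mean and variance
```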

75 Computation of the integrand can be carried out more efficiently using a representation in terms of multiplicities (also known as the empirical histogram distribution function [4]), the number of symbols that have occurred with a given frequency in the sample. [sent-207, score-0.215]

76 Letting m_k = |{i : n_i = k}| denote the total number of symbols with exactly k observations in the sample gives the compressed statistic m = [m_0, m_1, …, m_{n_max}]ᵀ. [sent-208, score-0.161]

77 The multiplicities representation significantly reduces the time and space complexity of our computations for most datasets, as we need only compute sums and products involving the number of symbols with distinct frequencies (at most n_max), rather than the total number of symbols K. [sent-216, score-0.279]
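
For example, any per-symbol sum of the form ∑_i f(n_i) collapses to ∑_k m_k f(k). A small illustration (my own, not the paper's code):

```python
import numpy as np

def multiplicities(counts):
    """Return (ks, ms): distinct nonzero frequencies k and the number m_k of symbols seen exactly k times."""
    counts = np.asarray(counts, dtype=int)
    return np.unique(counts[counts > 0], return_counts=True)

counts = np.array([5, 1, 1, 1, 2, 2, 7])
ks, ms = multiplicities(counts)          # ks = [1, 2, 5, 7], ms = [3, 2, 1, 1]
# A per-symbol sum computed over at most n_max distinct frequencies instead of all K symbols:
assert np.isclose(np.sum(np.log(counts + 1.0)), np.sum(ms * np.log(ks + 1.0)))
```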

78 5 Existence of posterior mean Given that the PYM prior with p(h) ∝ 1 on [0, ∞) is improper, the prior expectation E[H] does not exist. [sent-226, score-0.442]

79 Given a fixed dataset x of N samples and any bounded (potentially improper) prior p(γ, h), Ĥ_PYM < ∞ when N − K ≥ 2. [sent-230, score-0.206]

80 This result says that the BLS entropy estimate is finite whenever there are at least two “coincidences”, i. [sent-231, score-0.359]

81 , two fewer unique symbols than samples, even though the prior expectation is infinite. [sent-233, score-0.254]
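
The coincidence condition is easy to check before trusting a point estimate (a trivial helper, not from the paper):

```python
import numpy as np

def has_enough_coincidences(counts):
    """True when N - K >= 2, i.e. at least two repeated observations, so the posterior mean is finite."""
    counts = np.asarray(counts, dtype=int)
    return int(counts.sum()) - int(np.count_nonzero(counts)) >= 2

print(has_enough_coincidences([1, 1, 1, 1]))   # False: every symbol seen exactly once
print(has_enough_coincidences([3, 1, 1, 2]))   # True: N - K = 7 - 4 = 3
```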

82 5 Results We compare PYM to other proposed entropy estimators using four example datasets in Fig. [sent-234, score-0.405]

83 The Miller-Maddow estimator is a well-known method for bias correction based on a first-order Taylor expansion of the entropy functional. [sent-236, score-0.434]

84 The CAE (“Coverage Adjusted Estimator”) addresses bias by combining the Horvitz-Thompson estimator with a nonparametric estimate of the proportion of total probability mass (the “coverage”) accounted for by the observed data x [17, 25]. [sent-237, score-0.119]

85 5 arise from direct computation of the posterior variance of the entropy. [sent-244, score-0.125]

86 6 Discussion In this paper we introduced PYM, a novel entropy estimator for distributions with unknown support. [sent-245, score-0.496]

87 We derived analytic forms for the conditional mean and variance of entropy under a DP and PY prior for fixed parameters. [sent-246, score-0.589]

88 Inspired by the work of [6], we defined a novel PY mixture prior, PYM, which implies an approximately flat prior on entropy. [sent-247, score-0.271]

89 Figure 5: Convergence of entropy estimators with sample size, on two simulated and two real datasets (x-axis: # of samples). [sent-272, score-0.467]

90 Credible intervals are indicated by two standard deviations of the posterior for DPM and PYM. [sent-275, score-0.171]

91 We have shown that PYM performs well in comparison to other entropy estimators, and indicated its practicality in example applications to data. [sent-283, score-0.362]

92 There may be specific distributions for which the PYM estimate is so heavily biased that the credible intervals fail to bracket the true entropy. [sent-285, score-0.283]

93 This reflects a general state of affairs for entropy estimation on countable distributions: any convergence rate result must depend on restricting to a subclass of distributions [26]. [sent-286, score-0.465]

94 Rather than working within some analytically-defined subclass of discrete distributions (such as, for instance, those with finite “entropy variance” [17]), we work within the space of distributions parametrized by PY which spans both the exponential and power-law tail distributions. [sent-287, score-0.225]

95 We feel this is a key feature for small-sample inference, where the choice of prior is most relevant. [sent-289, score-0.175]

96 In conclusion, we have defined the PYM prior through a reparametrization that assures an approximately flat prior on entropy. [sent-291, score-0.395]

97 Moreover, although parametrized over the space of countably-infinite discrete distributions, the computation of PYM depends primarily on the first two conditional moments of entropy under PY. [sent-292, score-0.433]

98 We derive closed-form expressions for these moments that are fast to compute, and allow the efficient computation of both the PYM estimate and its posterior credible interval. [sent-293, score-0.32]

99 Generalized weighted chinese restaurant processes for species sampling mixture models. [sent-349, score-0.119]

100 When do finite sample effects significantly affect entropy estimates? [sent-368, score-0.337]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pym', 0.511), ('py', 0.376), ('entropy', 0.337), ('pyp', 0.255), ('dp', 0.216), ('prior', 0.175), ('nsb', 0.17), ('credible', 0.132), ('nemenman', 0.106), ('countably', 0.104), ('dirichlet', 0.102), ('tails', 0.096), ('posterior', 0.092), ('moby', 0.085), ('wordcount', 0.085), ('priors', 0.085), ('narrow', 0.083), ('ni', 0.082), ('nats', 0.081), ('symbols', 0.079), ('distributions', 0.075), ('alphabet', 0.071), ('estimators', 0.068), ('multiplicities', 0.065), ('bls', 0.064), ('dick', 0.064), ('hnsb', 0.064), ('hpym', 0.064), ('estimator', 0.061), ('kd', 0.06), ('dpm', 0.057), ('nmax', 0.056), ('dir', 0.055), ('intervals', 0.054), ('mixture', 0.051), ('discrete', 0.05), ('ishwaran', 0.049), ('retinal', 0.048), ('bayesian', 0.047), ('spike', 0.047), ('moments', 0.046), ('approximately', 0.045), ('analytic', 0.044), ('processes', 0.044), ('cae', 0.043), ('confidence', 0.043), ('mima', 0.043), ('pdir', 0.043), ('ppy', 0.043), ('undersampled', 0.043), ('frequency', 0.041), ('improper', 0.04), ('ganglion', 0.039), ('nite', 0.038), ('beta', 0.038), ('coincidences', 0.038), ('unboundedly', 0.038), ('bias', 0.036), ('litke', 0.035), ('params', 0.035), ('plugin', 0.035), ('shlens', 0.035), ('var', 0.033), ('chichilnisky', 0.033), ('variance', 0.033), ('physical', 0.032), ('samples', 0.031), ('pitman', 0.031), ('digamma', 0.031), ('entropies', 0.031), ('expectations', 0.031), ('distribution', 0.03), ('coverage', 0.03), ('sher', 0.03), ('simplex', 0.029), ('zz', 0.029), ('estimation', 0.028), ('appendix', 0.028), ('expressions', 0.028), ('nk', 0.027), ('denotes', 0.027), ('regime', 0.026), ('indicated', 0.025), ('ld', 0.025), ('exponent', 0.025), ('al', 0.025), ('subclass', 0.025), ('induced', 0.024), ('species', 0.024), ('post', 0.024), ('shannon', 0.024), ('estimates', 0.023), ('mm', 0.023), ('unknown', 0.023), ('clarity', 0.022), ('atoms', 0.022), ('copies', 0.022), ('estimate', 0.022), ('word', 0.022), ('mixing', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999952 57 nips-2012-Bayesian estimation of discrete entropy with mixtures of stick-breaking priors

Author: Evan Archer, Il M. Park, Jonathan W. Pillow

Abstract: We consider the problem of estimating Shannon’s entropy H in the under-sampled regime, where the number of possible symbols may be unknown or countably infinite. Dirichlet and Pitman-Yor processes provide tractable prior distributions over the space of countably infinite discrete distributions, and have found major applications in Bayesian non-parametric statistics and machine learning. Here we show that they provide natural priors for Bayesian entropy estimation, due to the analytic tractability of the moments of the induced posterior distribution over entropy H. We derive formulas for the posterior mean and variance of H given data. However, we show that a fixed Dirichlet or Pitman-Yor process prior implies a narrow prior on H, meaning the prior strongly determines the estimate in the under-sampled regime. We therefore define a family of continuous mixing measures such that the resulting mixture of Dirichlet or Pitman-Yor processes produces an approximately flat prior over H. We explore the theoretical properties of the resulting estimators and show that they perform well on data sampled from both exponential and power-law tailed distributions. 1

2 0.14283173 119 nips-2012-Entropy Estimations Using Correlated Symmetric Stable Random Projections

Author: Ping Li, Cun-hui Zhang

Abstract: Methods for efficiently estimating Shannon entropy of data streams have important applications in learning, data mining, and network anomaly detections (e.g., the DDoS attacks). For nonnegative data streams, the method of Compressed Counting (CC) [11, 13] based on maximally-skewed stable random projections can provide accurate estimates of the Shannon entropy using small storage. However, CC is no longer applicable when entries of data streams can be below zero, which is a common scenario when comparing two streams. In this paper, we propose an algorithm for entropy estimation in general data streams which allow negative entries. In our method, the Shannon entropy is approximated by the finite difference of two correlated frequency moments estimated from correlated samples of symmetric stable random variables. Interestingly, the estimator for the moment we recommend for entropy estimation barely has bounded variance itself, whereas the common geometric mean estimator (which has bounded higher-order moments) is not sufficient for entropy estimation. Our experiments confirm that this method is able to well approximate the Shannon entropy using small storage.

3 0.14044088 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation

Author: Kumar Sricharan, Alfred O. Hero

Abstract: The problem of estimation of entropy functionals of probability densities has received much attention in the information theory, machine learning and statistics communities. Kernel density plug-in estimators are simple, easy to implement and widely used for estimation of entropy. However, for large feature dimension d, kernel plug-in estimators suffer from the curse of dimensionality: the MSE rate of convergence is glacially slow - of order O(T −γ/d ), where T is the number of samples, and γ > 0 is a rate parameter. In this paper, it is shown that for sufficiently smooth densities, an ensemble of kernel plug-in estimators can be combined via a weighted convex combination, such that the resulting weighted estimator has a superior parametric MSE rate of convergence of order O(T −1 ). Furthermore, it is shown that these optimal weights can be determined by solving a convex optimization problem which does not require training data or knowledge of the underlying density, and therefore can be performed offline. This novel result is remarkable in that, while each of the individual kernel plug-in estimators belonging to the ensemble suffer from the curse of dimensionality, by appropriate ensemble averaging we can achieve parametric convergence rates. 1

4 0.12828776 89 nips-2012-Coupling Nonparametric Mixtures via Latent Dirichlet Processes

Author: Dahua Lin, John W. Fisher

Abstract: Mixture distributions are often used to model complex data. In this paper, we develop a new method that jointly estimates mixture models over multiple data sets by exploiting the statistical dependencies between them. Specifically, we introduce a set of latent Dirichlet processes as sources of component models (atoms), and for each data set, we construct a nonparametric mixture model by combining sub-sampled versions of the latent DPs. Each mixture model may acquire atoms from different latent DPs, while each atom may be shared by multiple mixtures. This multi-to-multi association distinguishes the proposed method from previous ones that require the model structure to be a tree or a chain, allowing more flexible designs. We also derive a sampling algorithm that jointly infers the model parameters and present experiments on both document analysis and image modeling. 1

5 0.11181091 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

Author: Ke Jiang, Brian Kulis, Michael I. Jordan

Abstract: Sampling and variational inference techniques are two standard methods for inference in probabilistic models, but for many problems, neither approach scales effectively to large-scale data. An alternative is to relax the probabilistic model into a non-probabilistic formulation which has a scalable associated algorithm. This can often be fulfilled by performing small-variance asymptotics, i.e., letting the variance of particular distributions in the model go to zero. For instance, in the context of clustering, such an approach yields connections between the kmeans and EM algorithms. In this paper, we explore small-variance asymptotics for exponential family Dirichlet process (DP) and hierarchical Dirichlet process (HDP) mixture models. Utilizing connections between exponential family distributions and Bregman divergences, we derive novel clustering algorithms from the asymptotic limit of the DP and HDP mixtures that features the scalability of existing hard clustering methods as well as the flexibility of Bayesian nonparametric models. We focus on special cases of our analysis for discrete-data problems, including topic modeling, and we demonstrate the utility of our results by applying variants of our algorithms to problems arising in vision and document analysis. 1

6 0.098422132 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization

7 0.093545564 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

8 0.092053413 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking

9 0.085443117 43 nips-2012-Approximate Message Passing with Consistent Parameter Estimation and Applications to Sparse Learning

10 0.082415983 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

11 0.074426949 294 nips-2012-Repulsive Mixtures

12 0.074357592 339 nips-2012-The Time-Marginalized Coalescent Prior for Hierarchical Clustering

13 0.073847085 114 nips-2012-Efficient coding provides a direct link between prior and likelihood in perceptual Bayesian inference

14 0.072463155 124 nips-2012-Factorial LDA: Sparse Multi-Dimensional Text Models

15 0.069515169 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

16 0.065326832 343 nips-2012-Tight Bounds on Profile Redundancy and Distinguishability

17 0.063663073 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions

18 0.063012317 126 nips-2012-FastEx: Hash Clustering with Exponential Families

19 0.062651761 33 nips-2012-Active Learning of Model Evidence Using Bayesian Quadrature

20 0.060952332 279 nips-2012-Projection Retrieval for Classification


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.158), (1, 0.03), (2, 0.002), (3, 0.043), (4, -0.146), (5, 0.015), (6, 0.007), (7, 0.064), (8, 0.035), (9, -0.054), (10, 0.011), (11, 0.014), (12, 0.02), (13, -0.115), (14, 0.026), (15, -0.121), (16, 0.065), (17, -0.069), (18, -0.082), (19, 0.012), (20, 0.04), (21, 0.018), (22, 0.145), (23, 0.052), (24, -0.008), (25, 0.045), (26, 0.096), (27, -0.022), (28, 0.12), (29, -0.003), (30, 0.062), (31, -0.001), (32, 0.084), (33, -0.047), (34, -0.034), (35, -0.084), (36, 0.144), (37, 0.076), (38, -0.039), (39, -0.034), (40, -0.08), (41, -0.009), (42, 0.03), (43, -0.039), (44, 0.059), (45, -0.106), (46, -0.036), (47, 0.048), (48, 0.049), (49, -0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94694263 57 nips-2012-Bayesian estimation of discrete entropy with mixtures of stick-breaking priors

Author: Evan Archer, Il M. Park, Jonathan W. Pillow

Abstract: We consider the problem of estimating Shannon’s entropy H in the under-sampled regime, where the number of possible symbols may be unknown or countably infinite. Dirichlet and Pitman-Yor processes provide tractable prior distributions over the space of countably infinite discrete distributions, and have found major applications in Bayesian non-parametric statistics and machine learning. Here we show that they provide natural priors for Bayesian entropy estimation, due to the analytic tractability of the moments of the induced posterior distribution over entropy H. We derive formulas for the posterior mean and variance of H given data. However, we show that a fixed Dirichlet or Pitman-Yor process prior implies a narrow prior on H, meaning the prior strongly determines the estimate in the under-sampled regime. We therefore define a family of continuous mixing measures such that the resulting mixture of Dirichlet or Pitman-Yor processes produces an approximately flat prior over H. We explore the theoretical properties of the resulting estimators and show that they perform well on data sampled from both exponential and power-law tailed distributions. 1

2 0.77732313 119 nips-2012-Entropy Estimations Using Correlated Symmetric Stable Random Projections

Author: Ping Li, Cun-hui Zhang

Abstract: Methods for efficiently estimating Shannon entropy of data streams have important applications in learning, data mining, and network anomaly detections (e.g., the DDoS attacks). For nonnegative data streams, the method of Compressed Counting (CC) [11, 13] based on maximally-skewed stable random projections can provide accurate estimates of the Shannon entropy using small storage. However, CC is no longer applicable when entries of data streams can be below zero, which is a common scenario when comparing two streams. In this paper, we propose an algorithm for entropy estimation in general data streams which allow negative entries. In our method, the Shannon entropy is approximated by the finite difference of two correlated frequency moments estimated from correlated samples of symmetric stable random variables. Interestingly, the estimator for the moment we recommend for entropy estimation barely has bounded variance itself, whereas the common geometric mean estimator (which has bounded higher-order moments) is not sufficient for entropy estimation. Our experiments confirm that this method is able to well approximate the Shannon entropy using small storage.

3 0.68242157 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation

Author: Kumar Sricharan, Alfred O. Hero

Abstract: The problem of estimation of entropy functionals of probability densities has received much attention in the information theory, machine learning and statistics communities. Kernel density plug-in estimators are simple, easy to implement and widely used for estimation of entropy. However, for large feature dimension d, kernel plug-in estimators suffer from the curse of dimensionality: the MSE rate of convergence is glacially slow - of order O(T −γ/d ), where T is the number of samples, and γ > 0 is a rate parameter. In this paper, it is shown that for sufficiently smooth densities, an ensemble of kernel plug-in estimators can be combined via a weighted convex combination, such that the resulting weighted estimator has a superior parametric MSE rate of convergence of order O(T −1 ). Furthermore, it is shown that these optimal weights can be determined by solving a convex optimization problem which does not require training data or knowledge of the underlying density, and therefore can be performed offline. This novel result is remarkable in that, while each of the individual kernel plug-in estimators belonging to the ensemble suffer from the curse of dimensionality, by appropriate ensemble averaging we can achieve parametric convergence rates. 1

4 0.62487936 123 nips-2012-Exponential Concentration for Mutual Information Estimation with Application to Forests

Author: Han Liu, Larry Wasserman, John D. Lafferty

Abstract: We prove a new exponential concentration inequality for a plug-in estimator of the Shannon mutual information. Previous results on mutual information estimation only bounded expected error. The advantage of having the exponential inequality is that, combined with the union bound, we can guarantee accurate estimators of the mutual information for many pairs of random variables simultaneously. As an application, we show how to use such a result to optimally estimate the density function and graph of a distribution which is Markov to a forest graph. 1

5 0.54839635 294 nips-2012-Repulsive Mixtures

Author: Francesca Petralia, Vinayak Rao, David B. Dunson

Abstract: Discrete mixtures are used routinely in broad sweeping applications ranging from unsupervised settings to fully supervised multi-task learning. Indeed, finite mixtures and infinite mixtures, relying on Dirichlet processes and modifications, have become a standard tool. One important issue that arises in using discrete mixtures is low separation in the components; in particular, different components can be introduced that are very similar and hence redundant. Such redundancy leads to too many clusters that are too similar, degrading performance in unsupervised learning and leading to computational problems and an unnecessarily complex model in supervised settings. Redundancy can arise in the absence of a penalty on components placed close together even when a Bayesian approach is used to learn the number of components. To solve this problem, we propose a novel prior that generates components from a repulsive process, automatically penalizing redundant components. We characterize this repulsive prior theoretically and propose a Markov chain Monte Carlo sampling algorithm for posterior computation. The methods are illustrated using synthetic examples and an iris data set. Key Words: Bayesian nonparametrics; Dirichlet process; Gaussian mixture model; Model-based clustering; Repulsive point process; Well separated mixture. 1

6 0.53758073 95 nips-2012-Density-Difference Estimation

7 0.53181094 43 nips-2012-Approximate Message Passing with Consistent Parameter Estimation and Applications to Sparse Learning

8 0.4897058 89 nips-2012-Coupling Nonparametric Mixtures via Latent Dirichlet Processes

9 0.48012984 223 nips-2012-Multi-criteria Anomaly Detection using Pareto Depth Analysis

10 0.47599143 189 nips-2012-Learning from the Wisdom of Crowds by Minimax Entropy

11 0.45333102 47 nips-2012-Augment-and-Conquer Negative Binomial Processes

12 0.44910014 269 nips-2012-Persistent Homology for Learning Densities with Bounded Support

13 0.44639495 359 nips-2012-Variational Inference for Crowdsourcing

14 0.44616368 175 nips-2012-Learning High-Density Regions for a Generalized Kolmogorov-Smirnov Test in High-Dimensional Data

15 0.44563556 222 nips-2012-Multi-Task Averaging

16 0.44214791 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

17 0.4409349 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

18 0.43954259 315 nips-2012-Slice sampling normalized kernel-weighted completely random measure mixture models

19 0.43296257 73 nips-2012-Coding efficiency and detectability of rate fluctuations with non-Poisson neuronal firing

20 0.4202835 60 nips-2012-Bayesian nonparametric models for ranked data


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.038), (17, 0.012), (21, 0.042), (38, 0.113), (42, 0.027), (54, 0.017), (55, 0.025), (74, 0.054), (76, 0.137), (80, 0.113), (86, 0.292), (92, 0.04)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.80571342 137 nips-2012-From Deformations to Parts: Motion-based Segmentation of 3D Objects

Author: Soumya Ghosh, Matthew Loper, Erik B. Sudderth, Michael J. Black

Abstract: We develop a method for discovering the parts of an articulated object from aligned meshes of the object in various three-dimensional poses. We adapt the distance dependent Chinese restaurant process (ddCRP) to allow nonparametric discovery of a potentially unbounded number of parts, while simultaneously guaranteeing a spatially connected segmentation. To allow analysis of datasets in which object instances have varying 3D shapes, we model part variability across poses via affine transformations. By placing a matrix normal-inverse-Wishart prior on these affine transformations, we develop a ddCRP Gibbs sampler which tractably marginalizes over transformation uncertainty. Analyzing a dataset of humans captured in dozens of poses, we infer parts which provide quantitatively better deformation predictions than conventional clustering methods.

same-paper 2 0.75572616 57 nips-2012-Bayesian estimation of discrete entropy with mixtures of stick-breaking priors

Author: Evan Archer, Il M. Park, Jonathan W. Pillow

Abstract: We consider the problem of estimating Shannon’s entropy H in the under-sampled regime, where the number of possible symbols may be unknown or countably infinite. Dirichlet and Pitman-Yor processes provide tractable prior distributions over the space of countably infinite discrete distributions, and have found major applications in Bayesian non-parametric statistics and machine learning. Here we show that they provide natural priors for Bayesian entropy estimation, due to the analytic tractability of the moments of the induced posterior distribution over entropy H. We derive formulas for the posterior mean and variance of H given data. However, we show that a fixed Dirichlet or Pitman-Yor process prior implies a narrow prior on H, meaning the prior strongly determines the estimate in the under-sampled regime. We therefore define a family of continuous mixing measures such that the resulting mixture of Dirichlet or Pitman-Yor processes produces an approximately flat prior over H. We explore the theoretical properties of the resulting estimators and show that they perform well on data sampled from both exponential and power-law tailed distributions. 1

3 0.74098212 303 nips-2012-Searching for objects driven by context

Author: Bogdan Alexe, Nicolas Heess, Yee W. Teh, Vittorio Ferrari

Abstract: The dominant visual search paradigm for object class detection is sliding windows. Although simple and effective, it is also wasteful, unnatural and rigidly hardwired. We propose strategies to search for objects which intelligently explore the space of windows by making sequential observations at locations decided based on previous observations. Our strategies adapt to the class being searched and to the content of a particular test image, exploiting context as the statistical relation between the appearance of a window and its location relative to the object, as observed in the training set. In addition to being more elegant than sliding windows, we demonstrate experimentally on the PASCAL VOC 2010 dataset that our strategies evaluate two orders of magnitude fewer windows while achieving higher object detection performance. 1

4 0.68177849 74 nips-2012-Collaborative Gaussian Processes for Preference Learning

Author: Neil Houlsby, Ferenc Huszar, Zoubin Ghahramani, Jose M. Hernández-lobato

Abstract: We present a new model based on Gaussian processes (GPs) for learning pairwise preferences expressed by multiple users. Inference is simplified by using a preference kernel for GPs which allows us to combine supervised GP learning of user preferences with unsupervised dimensionality reduction for multi-user systems. The model not only exploits collaborative information from the shared structure in user behavior, but may also incorporate user features if they are available. Approximate inference is implemented using a combination of expectation propagation and variational Bayes. Finally, we present an efficient active learning strategy for querying preferences. The proposed technique performs favorably on real-world data against state-of-the-art multi-user preference learning algorithms. 1

5 0.65750933 275 nips-2012-Privacy Aware Learning

Author: Martin J. Wainwright, Michael I. Jordan, John C. Duchi

Abstract: We study statistical risk minimization problems under a version of privacy in which the data is kept confidential even from the learner. In this local privacy framework, we establish sharp upper and lower bounds on the convergence rates of statistical estimation procedures. As a consequence, we exhibit a precise tradeoff between the amount of privacy the data preserves and the utility, measured by convergence rate, of any statistical estimator. 1

6 0.61010396 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

7 0.60922486 197 nips-2012-Learning with Recursive Perceptual Representations

8 0.60826635 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

9 0.60741341 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

10 0.60724282 168 nips-2012-Kernel Latent SVM for Visual Recognition

11 0.60689241 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

12 0.60664338 279 nips-2012-Projection Retrieval for Classification

13 0.60659081 200 nips-2012-Local Supervised Learning through Space Partitioning

14 0.60525775 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

15 0.6052469 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization

16 0.605169 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

17 0.60395253 65 nips-2012-Cardinality Restricted Boltzmann Machines

18 0.60338491 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models

19 0.60272521 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

20 0.60256326 232 nips-2012-Multiplicative Forests for Continuous-Time Processes