nips nips2012 nips2012-294 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Francesca Petralia, Vinayak Rao, David B. Dunson
Abstract: Discrete mixtures are used routinely in broad sweeping applications ranging from unsupervised settings to fully supervised multi-task learning. Indeed, finite mixtures and infinite mixtures, relying on Dirichlet processes and modifications, have become a standard tool. One important issue that arises in using discrete mixtures is low separation in the components; in particular, different components can be introduced that are very similar and hence redundant. Such redundancy leads to too many clusters that are too similar, degrading performance in unsupervised learning and leading to computational problems and an unnecessarily complex model in supervised settings. Redundancy can arise in the absence of a penalty on components placed close together even when a Bayesian approach is used to learn the number of components. To solve this problem, we propose a novel prior that generates components from a repulsive process, automatically penalizing redundant components. We characterize this repulsive prior theoretically and propose a Markov chain Monte Carlo sampling algorithm for posterior computation. The methods are illustrated using synthetic examples and an iris data set. Key Words: Bayesian nonparametrics; Dirichlet process; Gaussian mixture model; Model-based clustering; Repulsive point process; Well separated mixture. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Discrete mixtures are used routinely in broad sweeping applications ranging from unsupervised settings to fully supervised multi-task learning. [sent-8, score-0.204]
2 Indeed, finite mixtures and infinite mixtures, relying on Dirichlet processes and modifications, have become a standard tool. [sent-9, score-0.168]
3 One important issue that arises in using discrete mixtures is low separation in the components; in particular, different components can be introduced that are very similar and hence redundant. [sent-10, score-0.451]
4 Such redundancy leads to too many clusters that are too similar, degrading performance in unsupervised learning and leading to computational problems and an unnecessarily complex model in supervised settings. [sent-11, score-0.222]
5 Redundancy can arise in the absence of a penalty on components placed close together even when a Bayesian approach is used to learn the number of components. [sent-12, score-0.252]
6 To solve this problem, we propose a novel prior that generates components from a repulsive process, automatically penalizing redundant components. [sent-13, score-1.148]
7 We characterize this repulsive prior theoretically and propose a Markov chain Monte Carlo sampling algorithm for posterior computation. [sent-14, score-1.025]
8 Key Words: Bayesian nonparametrics; Dirichlet process; Gaussian mixture model; Model-based clustering; Repulsive point process; Well separated mixture. [sent-16, score-0.295]
9 1 Introduction Discrete mixture models characterize the density of y ∈ Y ⊂ ℝ^m as f(y) = Σ_{h=1}^{k} p_h φ(y; γ_h), (1) where p = (p_1, . . . , p_k) is a vector of probability weights summing to one and φ(·; γ_h) is a kernel with component-specific parameters γ_h. [sent-17, score-0.292]
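To make expression (1) concrete, the following minimal Python sketch evaluates a finite location-scale mixture of Gaussians; the Gaussian kernel choice and all parameter values below are illustrative assumptions, not taken from the paper.

    import numpy as np
    from scipy.stats import multivariate_normal

    def mixture_density(y, weights, means, covs):
        # f(y) = sum_h p_h * phi(y; gamma_h), here with Gaussian kernels phi
        return sum(p * multivariate_normal.pdf(y, mean=m, cov=c)
                   for p, m, c in zip(weights, means, covs))

    # toy two-component example; all values are illustrative only
    weights = [0.6, 0.4]
    means = [np.zeros(2), np.array([3.0, 3.0])]
    covs = [np.eye(2), np.eye(2)]
    print(mixture_density(np.array([0.5, 0.5]), weights, means, covs))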
10 In analyses of finite mixture models, a common concern is over-fitting in which redundant mixture components located close together are introduced. [sent-21, score-0.709]
11 In particular, introducing components located close together can lead to splitting of well separated clusters into a larger number of closely overlapping clusters. [sent-23, score-0.601]
12 Recently, [1] studied the asymptotic behavior of the posterior distribution in over-fitted Bayesian mixture models having more components than needed. [sent-26, score-0.502]
13 They showed that a carefully chosen prior will lead to asymptotic emptying of the redundant components. [sent-27, score-0.256]
14 For example, if we consider a finite location-scale mixture of multivariate Gaussians, one may choose P0 to be multivariate Gaussian-inverse Wishart. [sent-30, score-0.217]
15 In addition, drawing the component-specific parameters from a common prior tends to favor components located close together unless the variance is high. [sent-32, score-0.492]
16 For finite samples, the weight assigned to redundant components is often substantial. [sent-34, score-0.266]
17 Each mixture component can potentially be split into multiple components having the same parameters. [sent-36, score-0.365]
18 Even if exact equivalence is ruled out, it can be difficult to distinguish between models having different degrees of splitting of well-separated components into components located close together. [sent-37, score-0.494]
19 This issue can lead to an unnecessarily complex model, and creates difficulties in estimating the number of components and component-specific parameters. [sent-38, score-0.308]
20 Existing strategies, such as the incorporation of order constraints, do not adequately address this issue, since it is difficult to choose reasonable constraints in multivariate problems and even with constraints, the components can be close together. [sent-39, score-0.284]
21 The problem of separating components has been studied for Gaussian mixture models ([2]; [3]). [sent-40, score-0.365]
22 Two Gaussians can be separated by placing an arbitrarily chosen lower bound on the distance between their means. [sent-41, score-0.171]
23 To avoid such arbitrary hard separation thresholds, we instead propose a repulsive prior that smoothly pushes components apart. [sent-44, score-1.254]
24 In contrast to the vast majority of the recent Bayesian literature on discrete mixture models, instead of drawing the component-specific parameters {γh} independently from a common prior P0, we propose a joint prior for {γ1, . . . , γk}. [sent-45, score-0.363]
25 This joint prior is chosen to assign low density to γh's located close together. [sent-48, score-0.252]
26 The deviation from independence is specified a priori by a pair of repulsion parameters. [sent-49, score-0.166]
27 The proposed class of repulsive mixture models will only place components close together if it results in a substantial gain in model fit. [sent-50, score-1.21]
28 As we illustrate, the prior will favor a more parsimonious representation of densities, while improving practical performance in unsupervised learning. [sent-51, score-0.174]
29 We provide strong theoretical results on rates of posterior convergence and develop Markov chain Monte Carlo algorithms for posterior computation. [sent-52, score-0.256]
30 1 Background on Bayesian mixture modeling Considering the finite mixture model in expression (1), a Bayesian specification is completed by choosing priors for the number of components k, the probability weights p, and the component-specific parameters γ = (γ1, . . . , γk). [sent-54, score-0.621]
31 Posterior computation can proceed via a reversible jump Markov chain Monte Carlo algorithm involving moves for adding or deleting mixture components. [sent-62, score-0.271]
32 It has become popular to use over-fitted mixture models in which k is chosen as a conservative upper bound on the number of components under the expectation that only relatively few of the components will be occupied by subjects in the sample. [sent-65, score-0.608]
33 From a practical perspective, the success of over-fitted mixture models has been largely due to ease in computation. [sent-66, score-0.171]
34 Choosing a symmetric Dirichlet prior Dir(c/k, . . . , c/k) on the weights, for h = 1, . . . , k and a constant c > 0, leads to an approximation to a Dirichlet process mixture model for the density of y, which is obtained in the limit as k approaches infinity. [sent-70, score-0.292]
35 An alternative finite approximation to a Dirichlet process mixture is obtained by truncating the stick-breaking representation of [6], leading to a similarly simple Gibbs sampling algorithm [7]. [sent-71, score-0.171]
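A rough sketch of these two finite approximations — symmetric Dirichlet weights Dir(c/k, . . . , c/k) on an over-fitted mixture and a truncated stick-breaking construction — follows; the standard Gaussian base measure P0 and all settings are assumptions made for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    k, c = 20, 1.0                             # conservative upper bound k and concentration c

    # over-fitted mixture: symmetric Dirichlet weights Dir(c/k, ..., c/k)
    p_dir = rng.dirichlet(np.full(k, c / k))

    # truncated stick-breaking: v_h ~ Beta(1, c), p_h = v_h * prod_{l<h} (1 - v_l)
    v = rng.beta(1.0, c, size=k)
    v[-1] = 1.0                                # close the stick so the k weights sum to one
    p_stick = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))

    # atoms drawn i.i.d. from an assumed N(0, 1) base measure P0
    atoms = rng.normal(0.0, 1.0, size=k)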
36 2 Repulsive densities We seek a prior on the component parameters in (1) that automatically favors spread out components near the support of the data. [sent-74, score-0.418]
37 Instead of generating the atoms γh independently from P0 , one could generate them from a repulsive process that automatically pushes the atoms apart. [sent-75, score-0.87]
38 This idea is conceptually related to the literature on repulsive point processes [8]. [sent-76, score-0.811]
39 In the spatial statistics literature, a variety of repulsive processes have been proposed. [sent-77, score-0.811]
40 One such model assumes that points are clustered spatially, with the cluster centers having a Strauss density [9], that is p(k, γ) ∝ β^k ρ^{r(γ)}, where k is the number of clusters, β > 0, 0 < ρ ≤ 1 and r(γ) is the number of pairs of centers that lie within a pre-specified distance r of each other. [sent-78, score-0.229]
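A minimal sketch of this unnormalized Strauss density (β, ρ and r below are illustrative choices):

    import numpy as np

    def strauss_unnormalized(centers, beta=2.0, rho=0.5, r=1.0):
        # p(k, gamma) is proportional to beta^k * rho^{r(gamma)}, where r(gamma)
        # counts pairs of centers lying within distance r of each other
        k = len(centers)
        close_pairs = sum(np.linalg.norm(centers[s] - centers[j]) < r
                          for s in range(k) for j in range(s + 1, k))
        return beta ** k * rho ** close_pairs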
41 A possibly unappealing feature is that repulsion is not directly dependent on the pairwise distances between the clusters. [sent-79, score-0.206]
42 We propose an alternative class of priors, which smoothly push apart components based on pairwise distances. [sent-80, score-0.329]
43 Definition 1. A density h(γ) is repulsive if for any δ > 0 there is a corresponding ε > 0 such that h(γ) < δ for all γ ∈ Γ^k \ G_ε, where G_ε = {γ : d(γs, γj) > ε; s = 1, . . . , k; j ≠ s}. [sent-82, score-0.908]
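A small helper making the set G_ε in definition 1 concrete; this is only a sketch, and the Euclidean metric used as the default d is an assumption.

    import numpy as np

    def in_G_eps(gamma, eps, d=lambda a, b: np.linalg.norm(a - b)):
        # gamma lies in G_eps iff d(gamma_s, gamma_j) > eps for every pair s != j
        k = len(gamma)
        return all(d(gamma[s], gamma[j]) > eps
                   for s in range(k) for j in range(s + 1, k))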
44 Depending on the specification of the metric d(γs , γj ), a prior satisfying definition 1 may limit overfitting or favor well separated clusters. [sent-86, score-0.295]
45 When d(γs, γj) is the distance between the sub-vectors of γs and γj corresponding only to the location parameters, the proposed prior favors well separated clusters. [sent-87, score-0.281]
46 Instead, when d(γs , γj ) is the distance between the sth and jth kernel, a prior satisfying definition 1 limits overfitting in density estimation. [sent-88, score-0.264]
47 As a convenient class of repulsive priors which smoothly push components apart, we propose π(γ) = c1 [Π_{h=1}^{k} g0(γh)] h(γ), (2) with c1 being the normalizing constant that depends on the number of components k. [sent-90, score-1.32]
48 The proposed prior is related to a class of point processes from the statistical physics and spatial statistics literature referred to as Gibbs processes [10]. [sent-91, score-0.17]
49 We assume g0 : Γ → ℝ^+ and h : Γ^k → [0, ∞) are continuous with respect to Lebesgue measure, and h is bounded above by a positive constant c2 and is repulsive according to definition 1. [sent-92, score-0.787]
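A sketch of the unnormalized prior in (2); the standard Gaussian choice for g0 is an illustrative assumption, and h is left as a plug-in repulsion function satisfying definition 1.

    import numpy as np
    from scipy.stats import multivariate_normal

    def repulsive_prior_unnormalized(gamma, h):
        # pi(gamma) is proportional to [prod_h g0(gamma_h)] * h(gamma), as in (2);
        # g0 is taken to be a standard Gaussian purely for illustration
        g0 = np.prod([multivariate_normal.pdf(g, mean=np.zeros(len(g))) for g in gamma])
        return g0 * h(gamma)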
50 A special hardcore repulsion is produced if the repulsion function is zero when at least one pairwise distance is smaller than a pre-specified threshold. [sent-94, score-0.346]
51 Such a density implies choosing a minimal separation level between the atoms. [sent-95, score-0.233]
52 As mentioned in the introduction, we avoid such arbitrary hard separation thresholds by considering repulsive priors that smoothly push components apart. [sent-96, score-1.238]
53 The two alternative repulsion functions differ in their dependence on the relative distances between components, with all the pairwise distances playing a role in (3), while (4) only depends on the minimal separation. [sent-102, score-0.251]
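The repulsion functions (3) and (4) and the function g in (5) are not reproduced in this excerpt; the sketch below assumes a product over all pairwise repulsion terms versus a term based only on the minimal separation, with a smooth g(d) = exp(−τ d^{−ν}) standing in for (5). These specific functional forms are assumptions.

    import numpy as np

    def g(d, tau=1.0, nu=2.0):
        # assumed smooth repulsion g(d) = exp(-tau * d^(-nu)): tends to 0 as d -> 0
        # (at a rate governed by nu) and to 1 as d grows
        return np.exp(-tau * d ** (-nu))

    def pairwise_dists(gamma):
        k = len(gamma)
        return [np.linalg.norm(gamma[s] - gamma[j])
                for s in range(k) for j in range(s + 1, k)]

    def h_all_pairs(gamma, tau=1.0, nu=2.0):
        # analogue of (3): every pairwise distance contributes
        return np.prod([g(d, tau, nu) for d in pairwise_dists(gamma)])

    def h_min_separation(gamma, tau=1.0, nu=2.0):
        # analogue of (4): only the minimal separation matters
        return g(min(pairwise_dists(gamma)), tau, nu)

Either function can play the role of the repulsion term h(γ) in the sketch of (2) above.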
54 Figure 1 shows contour plots of the prior π(γ1 , γ2 ) defined as (2) with g0 being the standard normal density, the repulsive function defined as (3) or (4) and g defined as (5) for different values of (τ, ν). [sent-104, score-0.909]
55 As τ and ν increase, the prior increasingly favors well separated components. [sent-105, score-0.259]
56 Figure 1: Contour plots of the repulsive prior π(γ1, γ2) under (3), either (4) or (5), and (6), with hyperparameters (τ, ν) equal to (I) (1, 2), (II) (1, 4), (III) (5, 2) and (IV) (5, 4). [sent-106, score-0.883]
57 The next lemma provides sufficient conditions under which the true density is in the K-L support of the prior. [sent-114, score-0.18]
58 The repulsive density in (2) with h defined as either (3) or (4) satisfies condition A4 in lemma 1. [sent-128, score-0.998]
59 The next lemma formalizes the posterior rate of concentration for univariate location mixtures of Gaussians. [sent-129, score-0.445]
60 As motivated above, when the number of mixture components is chosen to be unnecessarily large, it is appealing for the posterior distribution of the weights of the extra components to be concentrated near zero. [sent-141, score-0.902]
61 One of the main assumptions required in theorem 1 is that the posterior rate of convergence relative to the L1 metric is δn = n^{-1/2}(log n)^q with q ≥ 0. [sent-143, score-0.176]
62 We provided the contraction rate, under the proposed prior specification and univariate Gaussian kernel, in lemma 3. [sent-144, score-0.155]
63 However, theorem 1 is a more general statement and it applies to multivariate mixture density of any kernel. [sent-145, score-0.344]
64 Theorem 1 is a modification of theorem 1 in [1] to the proposed repulsive mixture model. [sent-157, score-0.987]
65 Theorem 1 implies that the posterior expectation of the weights of the extra components is of order O{n^{-1/2}(log n)^{q(1+s(k0,α)/sr2)}}. [sent-158, score-0.443]
66 When the number of components is unknown, with only an upper bound known, the posterior rate of convergence is equivalent to the parametric rate n^{-1/2} [12]. [sent-160, score-0.398]
67 In this case, the rate in theorem 1 is n^{-1/2} under usual priors or the repulsive prior. [sent-161, score-0.924]
68 However, in our experience using usual priors, the sum of the extra components can be substantial in small to moderate sample sizes, and often has high variability. [sent-162, score-0.319]
69 As we show in Section 3, for repulsive priors the sum of the extra component weights is close to zero and has small variance for small as well as large sample sizes. [sent-163, score-1.009]
70 On the other hand, when an upper bound on the number of components is unknown, the posterior rate of concentration is n^{-1/2}(log n)^q with q > 0. [sent-164, score-0.388]
71 In this case, theorem 1 shows that, under the proposed prior specification, the logarithmic factor appearing in theorem 1 of [1] can be improved. [sent-165, score-0.154]
72 4 Parameter calibration and posterior computation The parameters involved in the repulsion function h are chosen such that a priori, with high probability, the clusters will be adequately separated. [sent-167, score-0.413]
73 Here, it is natural to relate the separation of two densities to the distance between their location parameters. [sent-169, score-0.243]
74 A mixture of k densities is c-separated if all pairs of densities are c-separated. [sent-174, score-0.349]
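The pairwise c-separation condition for Gaussians is not spelled out in this excerpt; the sketch below assumes a common Dasgupta-style convention in which two Gaussians are c-separated when the distance between their means exceeds c times the square root of the dimension times the larger maximal covariance eigenvalue.

    import numpy as np

    def c_separated(mu1, cov1, mu2, cov2, c):
        # assumed convention: ||mu1 - mu2|| >= c * sqrt(m * max(lambda_max(cov1), lambda_max(cov2)))
        m = len(mu1)
        lam = max(np.linalg.eigvalsh(cov1).max(), np.linalg.eigvalsh(cov2).max())
        return np.linalg.norm(np.asarray(mu1) - np.asarray(mu2)) >= c * np.sqrt(m * lam)

    def mixture_c_separated(means, covs, c):
        # a mixture of k densities is c-separated if all pairs are c-separated
        k = len(means)
        return all(c_separated(means[s], covs[s], means[j], covs[j], c)
                   for s in range(k) for j in range(s + 1, k))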
75 This is because ν controls the rate at which the density tends to zero as two components approach each other, but not the separation level between them. [sent-193, score-0.461]
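One way to check whether a candidate (τ, ν) gives adequate a-priori separation is self-normalized importance sampling: draw atom configurations i.i.d. from g0, weight each by h(γ), and estimate the prior probability that the minimal pairwise distance exceeds a chosen level. This is only a sketch of a possible calibration check, not necessarily the procedure used in the paper, and the standard-normal g0 is again an assumption.

    import numpy as np

    rng = np.random.default_rng(0)

    def prior_prob_separated(k, dim, h, level, n_draws=5000):
        # self-normalized importance sampling: proposal g0 = N(0, I) per atom, weight h(gamma)
        num = den = 0.0
        for _ in range(n_draws):
            gamma = [rng.normal(size=dim) for _ in range(k)]
            w = h(gamma)
            dmin = min(np.linalg.norm(gamma[s] - gamma[j])
                       for s in range(k) for j in range(s + 1, k))
            num += w * (dmin > level)
            den += w
        return num / den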
76 A possible issue with the proposed repulsive mixture prior is that the full conditionals are nonstandard, complicating posterior computation. [sent-195, score-1.191]
77 It will be interesting in future work to develop fast approximations to MCMC for implementation of repulsive mixture models, such as variational methods for approximating the full posterior and optimization methods for obtaining a maximum a posteriori estimate. [sent-199, score-1.072]
78 The latter approach would provide an alternative to the usual maximum likelihood estimation via the EM algorithm, while providing a penalty on components located close together. [sent-200, score-0.323]
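Because the full conditionals of the atoms are nonstandard, one generic option — a sketch, not necessarily the sampler actually used by the authors — is a Metropolis-within-Gibbs random-walk update of a single atom; the normalizing constant c1 in (2) cancels from the acceptance ratio. Here loglik, log_g0 and log_h are assumed callables returning, respectively, the data log-likelihood contribution of component h as a function of its parameters, log g0, and log h.

    import numpy as np

    rng = np.random.default_rng(0)

    def update_atom(h_idx, gamma, loglik, log_g0, log_h, step=0.1):
        # random-walk Metropolis update of atom gamma[h_idx] under pi(gamma) in (2)
        prop = [g.copy() for g in gamma]
        prop[h_idx] = gamma[h_idx] + rng.normal(scale=step, size=gamma[h_idx].shape)
        log_acc = (loglik(prop[h_idx]) + log_g0(prop[h_idx]) + log_h(prop)
                   - loglik(gamma[h_idx]) - log_g0(gamma[h_idx]) - log_h(gamma))
        return prop if np.log(rng.uniform()) < log_acc else gamma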
79 3 Synthetic examples Synthetic toy examples were considered to assess the performance of the repulsive prior in density estimation, classification and emptying the extra components. [sent-201, score-1.152]
80 For each synthetic dataset, repulsive and non-repulsive mixture models were compared considering a fixed upper bound on the number of components; extra components should be assigned small probabilities and hence effectively excluded. [sent-203, score-1.339]
81 Details on parameters involved in the true densities and choice of prior distributions can be found in the supplementary material. [sent-207, score-0.185]
82 Table 1 shows summary statistics of the K-L divergence, the misclassification error and the sum of extra weights under repulsive and non-repulsive mixtures with six mixture components as the upper bound. [sent-208, score-1.48]
83 In practice, observations drawn from the same mixture component were considered as belonging to the same category and for each dataset a similarity matrix was constructed. [sent-210, score-0.171]
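A sketch of this similarity-matrix construction (the array layout below is an assumption): for each pair of observations, record the fraction of posterior draws in which they are assigned to the same mixture component.

    import numpy as np

    def posterior_similarity(assignments):
        # assignments: array of shape (n_draws, n_obs) holding component labels per MCMC draw
        draws = np.asarray(assignments)
        n_obs = draws.shape[1]
        sim = np.zeros((n_obs, n_obs))
        for z in draws:
            sim += (z[:, None] == z[None, :])
        return sim / len(draws)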
84 As shown in table 1, the K-L divergences under repulsive and non-repulsive mixtures become more similar as the sample size increases. [sent-212, score-0.931]
85 For smaller sample sizes, the results are more similar when components are very well separated. [sent-213, score-0.194]
86 Since a repulsive prior tends to discourage overlapping mixture components, a repulsive model might not estimate the density quite as accurately when a mixture of closely overlapping components is needed. [sent-214, score-2.412]
87 However, as the sample size increases, the fitted density approaches the true density regardless of the degree of closeness among clusters. [sent-215, score-0.242]
88 Again, though repulsive and non-repulsive mixtures perform similarly in estimating the true density, repulsive mixtures place considerably less probability on extra components leading to more interpretable clusters. [sent-216, score-2.158]
89 In terms of misclassification error, the repulsive model outperforms the other two approaches while, in most cases, the worst performance was obtained by the non-repulsive model. [sent-217, score-0.787]
90 Potentially, one may favor fewer clusters, and hence possibly better separated clusters, by penalizing the introduction of new clusters more through modifying the precision in the Dirichlet prior for the weights; in the supplemental materials, we demonstrate that this cannot solve the problem. [sent-218, score-0.398]
91 Repulsive and non-repulsive mixtures were fitted under different choices of upper bound on the number of components. [sent-225, score-0.168]
92 Posterior means and standard deviations of the three highest weights were (0·30, 0·23, 0·13) and (0·05, 0·04, 0·04) for non-repulsive and (0·60, 0·30, 0·04) and (0·04, 0·03, 0·02) for repulsive under six components. [sent-227, score-0.845]
93 Clearly, repulsive priors lead to a posterior more concentrated on two components, and assign low probability to more than three components. [sent-228, score-0.975]
94 Figure 3: Posterior density of the total probability weight assigned to more than three components in the Iris data under a max of 6 or 10 components for non-repulsive (6: solid, 10: dash-dot) and repulsive (6: dash, 10: dot) mixtures. [sent-236, score-1.324]
95 Figure 3 shows the density of the total probability assigned to the extra components. [sent-237, score-0.251]
96 According to figure 3, our repulsive prior specification leads to extra component weights very close to zero regardless of the upper bound on the number of components. [sent-239, score-1.077]
97 Non-repulsive mixtures assign large weight to extra components, with posterior uncertainty increasing considerably as the number of components increases. [sent-241, score-0.554]
98 Discussions We have proposed a new repulsive mixture modeling framework, which should lead to substantially improved unsupervised learning (clustering) performance in general applications. [sent-242, score-1.008]
99 A key aspect is soft penalization of components located close together to favor, without sharply enforcing, well separated clusters that should be more likely to correspond to the true missing labels. [sent-243, score-0.548]
100 We have focused on Bayesian MCMC-based methods, but there are numerous interesting directions for ongoing research, including fast optimization-based approaches for learning mixture models with repulsive penalties. [sent-244, score-0.958]
wordName wordTfidf (topN-words)
[('repulsive', 0.787), ('components', 0.194), ('mixture', 0.171), ('mixtures', 0.144), ('repulsion', 0.141), ('separated', 0.124), ('density', 0.121), ('posterior', 0.114), ('extra', 0.102), ('clusters', 0.101), ('prior', 0.096), ('separation', 0.09), ('densities', 0.089), ('located', 0.071), ('misclassi', 0.071), ('iib', 0.069), ('iiib', 0.069), ('unnecessarily', 0.069), ('dirichlet', 0.067), ('iv', 0.064), ('iiia', 0.061), ('iia', 0.061), ('lemma', 0.059), ('tted', 0.054), ('ishwaran', 0.053), ('priors', 0.052), ('smoothly', 0.052), ('iris', 0.05), ('favor', 0.05), ('bayesian', 0.048), ('reversible', 0.046), ('emptying', 0.046), ('redundant', 0.044), ('pairwise', 0.042), ('location', 0.042), ('push', 0.041), ('favors', 0.039), ('satis', 0.039), ('species', 0.038), ('pushes', 0.035), ('close', 0.035), ('divergence', 0.035), ('gaussians', 0.033), ('weights', 0.033), ('rate', 0.033), ('synthetic', 0.033), ('adequately', 0.032), ('routinely', 0.032), ('condition', 0.031), ('overlapping', 0.031), ('iii', 0.03), ('clustering', 0.03), ('nition', 0.03), ('formalizes', 0.03), ('theorem', 0.029), ('unsupervised', 0.028), ('gibbs', 0.028), ('sinica', 0.028), ('statistica', 0.028), ('assigned', 0.028), ('chain', 0.028), ('auxiliary', 0.027), ('carlo', 0.027), ('penalizing', 0.027), ('dunson', 0.027), ('health', 0.027), ('referred', 0.026), ('tting', 0.026), ('solid', 0.026), ('contour', 0.026), ('jump', 0.026), ('monte', 0.026), ('duke', 0.025), ('priori', 0.025), ('six', 0.025), ('finite', 0.025), ('chosen', 0.025), ('satisfying', 0.025), ('redundancy', 0.024), ('atoms', 0.024), ('switching', 0.024), ('upper', 0.024), ('processes', 0.024), ('together', 0.023), ('usual', 0.023), ('tends', 0.023), ('distances', 0.023), ('asymptotic', 0.023), ('issue', 0.023), ('ii', 0.023), ('concentration', 0.023), ('multivariate', 0.023), ('minimal', 0.022), ('distance', 0.022), ('kernel', 0.022), ('cation', 0.022), ('fk', 0.022), ('thresholds', 0.022), ('lead', 0.022), ('centers', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 294 nips-2012-Repulsive Mixtures
Author: Francesca Petralia, Vinayak Rao, David B. Dunson
Abstract: Discrete mixtures are used routinely in broad sweeping applications ranging from unsupervised settings to fully supervised multi-task learning. Indeed, finite mixtures and infinite mixtures, relying on Dirichlet processes and modifications, have become a standard tool. One important issue that arises in using discrete mixtures is low separation in the components; in particular, different components can be introduced that are very similar and hence redundant. Such redundancy leads to too many clusters that are too similar, degrading performance in unsupervised learning and leading to computational problems and an unnecessarily complex model in supervised settings. Redundancy can arise in the absence of a penalty on components placed close together even when a Bayesian approach is used to learn the number of components. To solve this problem, we propose a novel prior that generates components from a repulsive process, automatically penalizing redundant components. We characterize this repulsive prior theoretically and propose a Markov chain Monte Carlo sampling algorithm for posterior computation. The methods are illustrated using synthetic examples and an iris data set. Key Words: Bayesian nonparametrics; Dirichlet process; Gaussian mixture model; Model-based clustering; Repulsive point process; Well separated mixture. 1
2 0.15404171 114 nips-2012-Efficient coding provides a direct link between prior and likelihood in perceptual Bayesian inference
Author: Xue-xin Wei, Alan Stocker
Abstract: A common challenge for Bayesian models of perception is the fact that the two fundamental Bayesian components, the prior distribution and the likelihood function, are formally unconstrained. Here we argue that a neural system that emulates Bayesian inference is naturally constrained by the way it represents sensory information in populations of neurons. More specifically, we show that an efficient coding principle creates a direct link between prior and likelihood based on the underlying stimulus distribution. The resulting Bayesian estimates can show biases away from the peaks of the prior distribution, a behavior seemingly at odds with the traditional view of Bayesian estimation, yet one that has been reported in human perception. We demonstrate that our framework correctly accounts for the repulsive biases previously reported for the perception of visual orientation, and show that the predicted tuning characteristics of the model neurons match the reported orientation tuning properties of neurons in primary visual cortex. Our results suggest that efficient coding is a promising hypothesis in constraining Bayesian models of perceptual inference.
3 0.13927853 180 nips-2012-Learning Mixtures of Tree Graphical Models
Author: Anima Anandkumar, Furong Huang, Daniel J. Hsu, Sham M. Kakade
Abstract: We consider unsupervised estimation of mixtures of discrete graphical models, where the class variable is hidden and each mixture component can have a potentially different Markov graph structure and parameters over the observed variables. We propose a novel method for estimating the mixture components with provable guarantees. Our output is a tree-mixture model which serves as a good approximation to the underlying graphical model mixture. The sample and computational requirements for our method scale as poly(p, r), for an r-component mixture of pvariate graphical models, for a wide class of models which includes tree mixtures and mixtures over bounded degree graphs. Keywords: Graphical models, mixture models, spectral methods, tree approximation.
4 0.12331355 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models
Author: Ke Jiang, Brian Kulis, Michael I. Jordan
Abstract: Sampling and variational inference techniques are two standard methods for inference in probabilistic models, but for many problems, neither approach scales effectively to large-scale data. An alternative is to relax the probabilistic model into a non-probabilistic formulation which has a scalable associated algorithm. This can often be fulfilled by performing small-variance asymptotics, i.e., letting the variance of particular distributions in the model go to zero. For instance, in the context of clustering, such an approach yields connections between the kmeans and EM algorithms. In this paper, we explore small-variance asymptotics for exponential family Dirichlet process (DP) and hierarchical Dirichlet process (HDP) mixture models. Utilizing connections between exponential family distributions and Bregman divergences, we derive novel clustering algorithms from the asymptotic limit of the DP and HDP mixtures that features the scalability of existing hard clustering methods as well as the flexibility of Bayesian nonparametric models. We focus on special cases of our analysis for discrete-data problems, including topic modeling, and we demonstrate the utility of our results by applying variants of our algorithms to problems arising in vision and document analysis. 1
5 0.10201128 126 nips-2012-FastEx: Hash Clustering with Exponential Families
Author: Amr Ahmed, Sujith Ravi, Alex J. Smola, Shravan M. Narayanamurthy
Abstract: Clustering is a key component in any data analysis toolbox. Despite its importance, scalable algorithms often eschew rich statistical models in favor of simpler descriptions such as k-means clustering. In this paper we present a sampler, capable of estimating mixtures of exponential families. At its heart lies a novel proposal distribution using random projections to achieve high throughput in generating proposals, which is crucial for clustering models with large numbers of clusters. 1
6 0.079936214 274 nips-2012-Priors for Diversity in Generative Latent Variable Models
7 0.077040307 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function
8 0.075712174 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models
9 0.074426949 57 nips-2012-Bayesian estimation of discrete entropy with mixtures of stick-breaking priors
10 0.070280463 40 nips-2012-Analyzing 3D Objects in Cluttered Images
11 0.068167157 315 nips-2012-Slice sampling normalized kernel-weighted completely random measure mixture models
12 0.066767424 19 nips-2012-A Spectral Algorithm for Latent Dirichlet Allocation
13 0.066146232 339 nips-2012-The Time-Marginalized Coalescent Prior for Hierarchical Clustering
14 0.066062935 69 nips-2012-Clustering Sparse Graphs
15 0.065528035 37 nips-2012-Affine Independent Variational Inference
16 0.061642535 349 nips-2012-Training sparse natural image models with a fast Gibbs sampler of an extended state space
17 0.06158565 124 nips-2012-Factorial LDA: Sparse Multi-Dimensional Text Models
18 0.060391724 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization
19 0.0600763 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models
20 0.058176987 82 nips-2012-Continuous Relaxations for Discrete Hamiltonian Monte Carlo
topicId topicWeight
[(0, 0.169), (1, 0.05), (2, 0.003), (3, 0.009), (4, -0.127), (5, -0.038), (6, -0.007), (7, 0.013), (8, 0.058), (9, -0.04), (10, 0.049), (11, -0.056), (12, 0.01), (13, -0.094), (14, -0.002), (15, -0.073), (16, -0.003), (17, -0.033), (18, -0.032), (19, -0.012), (20, -0.036), (21, 0.011), (22, 0.031), (23, 0.119), (24, -0.004), (25, 0.054), (26, 0.02), (27, -0.033), (28, 0.072), (29, -0.007), (30, -0.034), (31, -0.037), (32, -0.02), (33, -0.067), (34, 0.074), (35, -0.001), (36, 0.022), (37, -0.091), (38, 0.076), (39, -0.009), (40, 0.008), (41, -0.061), (42, 0.034), (43, -0.012), (44, -0.078), (45, 0.002), (46, 0.015), (47, 0.055), (48, 0.115), (49, -0.11)]
simIndex simValue paperId paperTitle
same-paper 1 0.93933451 294 nips-2012-Repulsive Mixtures
Author: Francesca Petralia, Vinayak Rao, David B. Dunson
Abstract: Discrete mixtures are used routinely in broad sweeping applications ranging from unsupervised settings to fully supervised multi-task learning. Indeed, finite mixtures and infinite mixtures, relying on Dirichlet processes and modifications, have become a standard tool. One important issue that arises in using discrete mixtures is low separation in the components; in particular, different components can be introduced that are very similar and hence redundant. Such redundancy leads to too many clusters that are too similar, degrading performance in unsupervised learning and leading to computational problems and an unnecessarily complex model in supervised settings. Redundancy can arise in the absence of a penalty on components placed close together even when a Bayesian approach is used to learn the number of components. To solve this problem, we propose a novel prior that generates components from a repulsive process, automatically penalizing redundant components. We characterize this repulsive prior theoretically and propose a Markov chain Monte Carlo sampling algorithm for posterior computation. The methods are illustrated using synthetic examples and an iris data set. Key Words: Bayesian nonparametrics; Dirichlet process; Gaussian mixture model; Model-based clustering; Repulsive point process; Well separated mixture. 1
2 0.66911352 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models
Author: Ke Jiang, Brian Kulis, Michael I. Jordan
Abstract: Sampling and variational inference techniques are two standard methods for inference in probabilistic models, but for many problems, neither approach scales effectively to large-scale data. An alternative is to relax the probabilistic model into a non-probabilistic formulation which has a scalable associated algorithm. This can often be accomplished by performing small-variance asymptotics, i.e., letting the variance of particular distributions in the model go to zero. For instance, in the context of clustering, such an approach yields connections between the k-means and EM algorithms. In this paper, we explore small-variance asymptotics for exponential family Dirichlet process (DP) and hierarchical Dirichlet process (HDP) mixture models. Utilizing connections between exponential family distributions and Bregman divergences, we derive novel clustering algorithms from the asymptotic limit of the DP and HDP mixtures that feature the scalability of existing hard clustering methods as well as the flexibility of Bayesian nonparametric models. We focus on special cases of our analysis for discrete-data problems, including topic modeling, and we demonstrate the utility of our results by applying variants of our algorithms to problems arising in vision and document analysis. 1
3 0.62604612 126 nips-2012-FastEx: Hash Clustering with Exponential Families
Author: Amr Ahmed, Sujith Ravi, Alex J. Smola, Shravan M. Narayanamurthy
Abstract: Clustering is a key component in any data analysis toolbox. Despite its importance, scalable algorithms often eschew rich statistical models in favor of simpler descriptions such as k-means clustering. In this paper we present a sampler capable of estimating mixtures of exponential families. At its heart lies a novel proposal distribution using random projections to achieve high throughput in generating proposals, which is crucial for clustering models with large numbers of clusters. 1
4 0.61010683 57 nips-2012-Bayesian estimation of discrete entropy with mixtures of stick-breaking priors
Author: Evan Archer, Il M. Park, Jonathan W. Pillow
Abstract: We consider the problem of estimating Shannon’s entropy H in the under-sampled regime, where the number of possible symbols may be unknown or countably infinite. Dirichlet and Pitman-Yor processes provide tractable prior distributions over the space of countably infinite discrete distributions, and have found major applications in Bayesian non-parametric statistics and machine learning. Here we show that they provide natural priors for Bayesian entropy estimation, due to the analytic tractability of the moments of the induced posterior distribution over entropy H. We derive formulas for the posterior mean and variance of H given data. However, we show that a fixed Dirichlet or Pitman-Yor process prior implies a narrow prior on H, meaning the prior strongly determines the estimate in the under-sampled regime. We therefore define a family of continuous mixing measures such that the resulting mixture of Dirichlet or Pitman-Yor processes produces an approximately flat prior over H. We explore the theoretical properties of the resulting estimators and show that they perform well on data sampled from both exponential and power-law tailed distributions. 1
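The "narrow prior" point is easiest to see in the fixed-Dirichlet special case with a known number of symbols, where the posterior mean of H has a closed form in terms of digamma functions. The sketch below uses only that standard fixed-Dirichlet formula; the stick-breaking mixture priors the paper actually proposes are not reproduced here.

    # Posterior mean of Shannon entropy (in nats) under a fixed symmetric
    # Dirichlet(alpha) prior over K known symbols (illustrative special case).
    import numpy as np
    from scipy.special import digamma

    def dirichlet_posterior_mean_entropy(counts, alpha):
        a = np.asarray(counts, dtype=float) + alpha   # posterior Dirichlet parameters
        A = a.sum()
        return digamma(A + 1.0) - np.sum((a / A) * digamma(a + 1.0))

    counts = [10, 5, 1, 0, 0, 0, 0, 0]                # an under-sampled histogram
    for alpha in [0.01, 0.1, 1.0]:
        print(alpha, round(dirichlet_posterior_mean_entropy(counts, alpha), 3))
    # The strong dependence of the estimate on alpha in the under-sampled regime
    # is the sensitivity that mixing over the concentration parameter relieves.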
5 0.60903651 26 nips-2012-A nonparametric variable clustering model
Author: Konstantina Palla, Zoubin Ghahramani, David A. Knowles
Abstract: Factor analysis models effectively summarise the covariance structure of high-dimensional data, but the solutions are typically hard to interpret. This motivates attempting to find a disjoint partition, i.e. a simple clustering, of observed variables into highly correlated subsets. We introduce a Bayesian non-parametric approach to this problem, and demonstrate advantages over heuristic methods proposed to date. Our Dirichlet process variable clustering (DPVC) model can discover block-diagonal covariance structures in data. We evaluate our method on both synthetic and gene expression analysis problems. 1
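For intuition, a crude stand-in for the heuristic methods the abstract contrasts with is to cluster variables directly by a correlation-based distance; on data with block-diagonal covariance this already recovers the blocks. The sketch below is only that baseline, on synthetic data; the DPVC model itself (a Dirichlet process over variable partitions) is not shown.

    # Heuristic variable clustering by correlation distance on synthetic
    # block-structured data (a simple baseline, not the DPVC model).
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    rng = np.random.default_rng(1)
    z1, z2 = rng.normal(size=(2, 500, 1))            # two latent block factors
    X = np.hstack([z1 + 0.3 * rng.normal(size=(500, 3)),
                   z2 + 0.3 * rng.normal(size=(500, 3))])

    D = 1.0 - np.abs(np.corrcoef(X, rowvar=False))   # correlation distance between variables
    labels = fcluster(linkage(squareform(D, checks=False), method="average"),
                      t=2, criterion="maxclust")
    print(labels)                                    # e.g. [1 1 1 2 2 2]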
6 0.58996069 114 nips-2012-Efficient coding provides a direct link between prior and likelihood in perceptual Bayesian inference
7 0.55977261 349 nips-2012-Training sparse natural image models with a fast Gibbs sampler of an extended state space
8 0.55418533 365 nips-2012-Why MCA? Nonlinear sparse coding with spike-and-slab prior for neurally plausible image encoding
9 0.53772861 269 nips-2012-Persistent Homology for Learning Densities with Bounded Support
10 0.53325093 99 nips-2012-Dip-means: an incremental clustering method for estimating the number of clusters
11 0.53293574 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves
12 0.51939952 315 nips-2012-Slice sampling normalized kernel-weighted completely random measure mixture models
13 0.51751882 89 nips-2012-Coupling Nonparametric Mixtures via Latent Dirichlet Processes
14 0.51666296 339 nips-2012-The Time-Marginalized Coalescent Prior for Hierarchical Clustering
15 0.51316285 359 nips-2012-Variational Inference for Crowdsourcing
16 0.50932902 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models
17 0.50686115 244 nips-2012-Nonconvex Penalization Using Laplace Exponents and Concave Conjugates
18 0.50427562 180 nips-2012-Learning Mixtures of Tree Graphical Models
19 0.48395514 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models
20 0.48319009 286 nips-2012-Random Utility Theory for Social Choice
topicId topicWeight
[(0, 0.07), (21, 0.033), (38, 0.11), (39, 0.014), (42, 0.06), (55, 0.03), (59, 0.176), (74, 0.066), (76, 0.17), (80, 0.112), (92, 0.047)]
simIndex simValue paperId paperTitle
1 0.93126744 253 nips-2012-On Triangular versus Edge Representations --- Towards Scalable Modeling of Networks
Author: Qirong Ho, Junming Yin, Eric P. Xing
Abstract: In this paper, we argue for representing networks as a bag of triangular motifs, particularly for important network problems that current model-based approaches handle poorly due to computational bottlenecks incurred by using edge representations. Such approaches require both 1-edges and 0-edges (missing edges) to be provided as input, and as a consequence, approximate inference algorithms for these models usually require Ω(N²) time per iteration, precluding their application to larger real-world networks. In contrast, triangular modeling requires less computation, while providing equivalent or better inference quality. A triangular motif is a vertex triple containing 2 or 3 edges, and the number of such motifs is Θ(∑i Di²) (where Di is the degree of vertex i), which is much smaller than N² for low-maximum-degree networks. Using this representation, we develop a novel mixed-membership network model and approximate inference algorithm suitable for large networks with low max-degree. For networks with high maximum degree, the triangular motifs can be naturally subsampled in a node-centric fashion, allowing for much faster inference at a small cost in accuracy. Empirically, we demonstrate that our approach, when compared to that of an edge-based model, has faster runtime and improved accuracy for mixed-membership community detection. We conclude with a large-scale demonstration on an N ≈ 280,000-node network, which is infeasible for network models with Ω(N²) inference cost. 1
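The motif-count claim can be checked directly on a small graph: every vertex triple with two edges is a length-2 path centered at a single vertex, and every triangle is counted three times among such paths, so the number of triangular motifs is at most ∑i C(Di, 2), which is Θ(∑i Di²). A minimal brute-force check (using networkx and a random graph purely for illustration):

    # Enumerate triangular motifs (vertex triples with 2 or 3 edges) and compare
    # against the sum_i C(D_i, 2) bound (brute-force check on a small random graph).
    from itertools import combinations
    import networkx as nx

    G = nx.erdos_renyi_graph(n=60, p=0.05, seed=0)

    motifs = [t for t in combinations(G.nodes, 3)
              if sum(G.has_edge(u, v) for u, v in combinations(t, 2)) >= 2]

    bound = sum(d * (d - 1) // 2 for _, d in G.degree())   # sum_i C(D_i, 2)
    print(len(motifs), "motifs;", "bound holds:", len(motifs) <= bound)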
2 0.89225757 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves
Author: Daniel Zoran, Yair Weiss
Abstract: Simple Gaussian Mixture Models (GMMs) learned from pixels of natural image patches have been recently shown to be surprisingly strong performers in modeling the statistics of natural images. Here we provide an in-depth analysis of this simple yet rich model. We show that such a GMM model is able to compete with even the most successful models of natural images in log-likelihood scores, denoising performance and sample quality. We provide an analysis of what such a model learns from natural images as a function of the number of mixture components, including covariance structure, contrast variation and intricate structures such as textures, boundaries and more. Finally, we show that the salient properties of the GMM learned from natural images can be derived from a simplified Dead Leaves model which explicitly models occlusion, explaining its surprising success relative to other models. 1 GMMs and natural image statistics models Many models for the statistics of natural image patches have been suggested in recent years. Finding good models for natural images is important to many different research areas - computer vision, biological vision and neuroscience among others. Recently, there has been a growing interest in comparing different aspects of models for natural images such as log-likelihood and multi-information reduction performance, and much progress has been achieved [1, 2, 3, 4, 5, 6]. Out of these results there is one which is particularly interesting: simple, unconstrained Gaussian Mixture Models (GMMs) with a relatively small number of mixture components learned from image patches are extraordinarily good in modeling image statistics [6, 4]. This is a surprising result due to the simplicity of GMMs and their ubiquity. Another surprising aspect of this result is that many of the current models may be thought of as GMMs with an exponential or infinite number of components, having different constraints on the covariance structure of the mixture components. In this work we study the nature of GMMs learned from natural image patches. We start with a thorough comparison to some popular and cutting-edge image models. We show that indeed, GMMs are excellent performers in modeling natural image patches. We then analyze what properties of natural images these GMMs capture, their dependence on the number of components in the mixture and their relation to the structure of the world around us. Finally, we show that the learned GMM suggests a strong connection between natural image statistics and a simple variant of the dead leaves model [7, 8], explicitly modeling occlusions and explaining some of the success of GMMs in modeling natural images.
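The basic experiment described here is compact to set up: extract DC-removed patches, fit an unconstrained full-covariance GMM, and score held-out patches by average log-likelihood. The sketch below uses scikit-learn, with random arrays standing in for real natural images (which are not bundled here); the patch size and number of components are arbitrary illustrative choices.

    # Fit an unconstrained GMM to image patches and report held-out mean
    # log-likelihood per patch (random data stands in for natural images).
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def extract_patches(images, size=8, n_patches=20000, seed=0):
        rng = np.random.default_rng(seed)
        H, W = images.shape[1:]
        idx = rng.integers(0, len(images), n_patches)
        rows = rng.integers(0, H - size, n_patches)
        cols = rng.integers(0, W - size, n_patches)
        P = np.stack([images[i, r:r + size, c:c + size].ravel()
                      for i, r, c in zip(idx, rows, cols)])
        return P - P.mean(axis=1, keepdims=True)     # remove the DC component

    images = np.random.rand(10, 128, 128)            # placeholder "images"
    patches = extract_patches(images)
    train, test = patches[:15000], patches[15000:]

    gmm = GaussianMixture(n_components=20, covariance_type="full", max_iter=200)
    gmm.fit(train)
    print("held-out mean log-likelihood per patch:", gmm.score(test))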
3 0.88324803 265 nips-2012-Parametric Local Metric Learning for Nearest Neighbor Classification
Author: Jun Wang, Alexandros Kalousis, Adam Woznica
Abstract: We study the problem of learning local metrics for nearest neighbor classification. Most previous works on local metric learning learn a number of unrelated local metrics. While this “independence” approach delivers increased flexibility, its downside is the considerable risk of overfitting. We present a new parametric local metric learning method in which we learn a smooth metric matrix function over the data manifold. Using an approximation error bound of the metric matrix function, we learn local metrics as linear combinations of basis metrics defined on anchor points over different regions of the instance space. We constrain the metric matrix function by imposing manifold regularization on the linear combinations, which makes the learned metric matrix function vary smoothly along the geodesics of the data manifold. Our metric learning method has excellent performance both in terms of predictive power and scalability. We experimented with several large-scale classification problems with tens of thousands of instances, and compared it with several state-of-the-art metric learning methods, both global and local, as well as to SVM with automatic kernel selection, all of which it outperforms significantly. 1
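The core construction, a point-dependent metric built as a smooth non-negative combination of basis metrics attached to anchor points, can be sketched in a few lines. The softmax-style weights below are only a stand-in for the manifold-regularized weights the paper actually learns, and all names are illustrative.

    # A local Mahalanobis metric as a weighted combination of basis metrics
    # anchored at a few points (sketch; the weights here are a simple stand-in).
    import numpy as np

    def local_metric(x, anchors, basis_metrics, tau=1.0):
        d2 = np.sum((anchors - x) ** 2, axis=1)      # distance to each anchor
        w = np.exp(-d2 / tau)
        w /= w.sum()                                 # smooth, non-negative weights
        return sum(wi * Mi for wi, Mi in zip(w, basis_metrics))   # PSD combination

    def local_distance(x, y, anchors, basis_metrics):
        M = local_metric(x, anchors, basis_metrics)  # metric evaluated at the query point
        diff = x - y
        return float(np.sqrt(diff @ M @ diff))

    rng = np.random.default_rng(0)
    anchors = rng.normal(size=(4, 3))
    basis_metrics = [np.eye(3) + np.outer(v, v) for v in rng.normal(size=(4, 3))]
    print(local_distance(rng.normal(size=3), rng.normal(size=3), anchors, basis_metrics))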
same-paper 4 0.86659282 294 nips-2012-Repulsive Mixtures
Author: Francesca Petralia, Vinayak Rao, David B. Dunson
Abstract: Discrete mixtures are used routinely in broad sweeping applications ranging from unsupervised settings to fully supervised multi-task learning. Indeed, finite mixtures and infinite mixtures, relying on Dirichlet processes and modifications, have become a standard tool. One important issue that arises in using discrete mixtures is low separation in the components; in particular, different components can be introduced that are very similar and hence redundant. Such redundancy leads to too many clusters that are too similar, degrading performance in unsupervised learning and leading to computational problems and an unnecessarily complex model in supervised settings. Redundancy can arise in the absence of a penalty on components placed close together even when a Bayesian approach is used to learn the number of components. To solve this problem, we propose a novel prior that generates components from a repulsive process, automatically penalizing redundant components. We characterize this repulsive prior theoretically and propose a Markov chain Monte Carlo sampling algorithm for posterior computation. The methods are illustrated using synthetic examples and an iris data set. Key Words: Bayesian nonparametrics; Dirichlet process; Gaussian mixture model; Model-based clustering; Repulsive point process; Well separated mixture. 1
5 0.86274213 272 nips-2012-Practical Bayesian Optimization of Machine Learning Algorithms
Author: Jasper Snoek, Hugo Larochelle, Ryan P. Adams
Abstract: The use of machine learning algorithms frequently involves careful tuning of learning parameters and model hyperparameters. Unfortunately, this tuning is often a “black art” requiring expert experience, rules of thumb, or sometimes brute-force search. There is therefore great appeal for automatic approaches that can optimize the performance of any given learning algorithm to the problem at hand. In this work, we consider this problem through the framework of Bayesian optimization, in which a learning algorithm’s generalization performance is modeled as a sample from a Gaussian process (GP). We show that certain choices for the nature of the GP, such as the type of kernel and the treatment of its hyperparameters, can play a crucial role in obtaining a good optimizer that can achieve expert-level performance. We describe new algorithms that take into account the variable cost (duration) of learning algorithm experiments and that can leverage the presence of multiple cores for parallel experimentation. We show that these proposed algorithms improve on previous automatic procedures and can reach or surpass human expert-level optimization for many algorithms including latent Dirichlet allocation, structured SVMs and convolutional neural networks. 1
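One step of the loop described here, with a Matern-kernel GP surrogate and an expected-improvement acquisition, can be sketched with standard libraries. The one-dimensional objective is a toy stand-in for a learning algorithm's validation error, and the sketch omits the variable-cost and parallel extensions the paper develops.

    # One step of GP-based Bayesian optimization with expected improvement
    # (minimal sketch; the objective is a toy stand-in for validation error).
    import numpy as np
    from scipy.stats import norm
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import Matern

    def objective(x):                     # pretend this is an expensive training run
        return np.sin(3 * x) + 0.1 * x ** 2

    X = np.array([[0.2], [1.5], [2.8]])   # hyperparameter settings tried so far
    y = objective(X).ravel()

    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X, y)

    grid = np.linspace(0.0, 3.0, 300)[:, None]
    mu, sd = gp.predict(grid, return_std=True)
    best = y.min()
    z = (best - mu) / np.maximum(sd, 1e-9)
    ei = (best - mu) * norm.cdf(z) + sd * norm.pdf(z)   # expected improvement (minimizing)
    print("next point to evaluate:", grid[np.argmax(ei)])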
6 0.83989054 338 nips-2012-The Perturbed Variation
7 0.83517224 298 nips-2012-Scalable Inference of Overlapping Communities
8 0.80582863 188 nips-2012-Learning from Distributions via Support Measure Machines
9 0.80544132 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs
10 0.80522627 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models
11 0.80419892 197 nips-2012-Learning with Recursive Perceptual Representations
12 0.80392289 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models
13 0.80232239 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model
14 0.80177218 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes
15 0.80125463 168 nips-2012-Kernel Latent SVM for Visual Recognition
16 0.79667008 200 nips-2012-Local Supervised Learning through Space Partitioning
17 0.79482919 277 nips-2012-Probabilistic Low-Rank Subspace Clustering
19 0.79401457 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines
20 0.79339856 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models