nips-2009-59: Construction of Nonparametric Bayesian Models from Parametric Bayes Equations
Source: pdf
Author: Peter Orbanz
Abstract: We consider the general problem of constructing nonparametric Bayesian models on infinite-dimensional random objects, such as functions, infinite graphs or infinite permutations. The problem has generated much interest in machine learning, where it is treated heuristically, but has not been studied in full generality in nonparametric Bayesian statistics, which tends to focus on models over probability distributions. Our approach applies a standard tool of stochastic process theory, the construction of stochastic processes from their finite-dimensional marginal distributions. The main contribution of the paper is a generalization of the classic Kolmogorov extension theorem to conditional probabilities. This extension allows a rigorous construction of nonparametric Bayesian models from systems of finite-dimensional, parametric Bayes equations. Using this approach, we show (i) how existence of a conjugate posterior for the nonparametric model can be guaranteed by choosing conjugate finite-dimensional models in the construction, (ii) how the mapping to the posterior parameters of the nonparametric model can be explicitly determined, and (iii) that the construction of conjugate models in essence requires the finite-dimensional models to be in the exponential family. As an application of our constructive framework, we derive a model on infinite permutations, the nonparametric Bayesian analogue of a model recently proposed for the analysis of rank data.
Common models, in particular the Gaussian process (GP) and the Dirichlet process (DP), were originally imported from statistics, but the nonparametric Bayesian idea has since been adapted to the needs of machine learning. We then ask whether interesting statistical properties of the finite-dimensional models used in the constructions, such as conjugacy of priors and posteriors, carry over to the stochastic process model. In general, the term nonparametric Bayesian model refers to a Bayesian model on an infinite-dimensional parameter space. To accommodate a variable and asymptotically unbounded number of parameters within a single parameter space, the dimension of the space has to be infinite, and nonparametric models can be defined as statistical models with infinite-dimensional parameter spaces [17]. A Bayesian nonparametric model places a prior distribution on the infinite-dimensional parameter space.
Many nonparametric Bayesian models are defined in terms of their finite-dimensional marginals: for example, the Gaussian process and the Dirichlet process are characterized by the fact that their finite-dimensional marginals are, respectively, Gaussian and Dirichlet distributions [11, 5]. The probability-theoretic construction result underlying such definitions is the Kolmogorov extension theorem [1], described in Sec. In stochastic process theory, the theorem is used to study the properties of a process in terms of its marginals, and hence by studying the properties of finite-dimensional distributions.
For example, can a nonparametric Bayesian model be guaranteed to be conjugate if the marginals used in its construction are conjugate? The treatment of the statistical properties of nonparametric Bayesian models in terms of finite-dimensional Bayes equations therefore requires an extension result similar to the Kolmogorov theorem that is applicable to conditional distributions. We present an analogue of the Kolmogorov theorem for conditional probabilities, which permits the direct construction of conditional stochastic process models on countable-dimensional spaces from finite-dimensional conditional probabilities. Application to conjugate models shows how a conjugate nonparametric Bayesian model can be constructed from conjugate finite-dimensional Bayes equations, including the mapping to the posterior parameters. The converse is also true: to construct a conjugate nonparametric Bayesian model, the finite-dimensional models used in the construction all have to be conjugate. The construction of stochastic process models from exponential family marginals is almost generic: the model is completely described by the mapping to the posterior parameters, which has a generic form as a function of the infinite-dimensional counterpart of the model's sufficient statistic. We discuss how existing models fit into the framework, and derive the nonparametric Bayesian version of a model on infinite permutations suggested by [9]. By essentially providing a construction recipe for conjugate models of countable dimension, our theoretical results have clear practical implications for the derivation of novel nonparametric Bayesian models.
The concepts required in this paper are measures on product spaces and abstract conditional probabilities. Since densities are not generally applicable in infinite-dimensional spaces, the formulation of Bayesian models on such spaces draws on the abstract conditional probabilities of measure-theoretic probability, which are derived from Kolmogorov's implicit formulation of conditional expectations [3]. For the reader familiar with the theory, we note that all spaces considered here are Borel spaces, such that regular versions of conditionals always exist; we hence assume all conditionals to be regular conditional probabilities (Markov kernels).
Construction of Stochastic Processes from their Marginals

Suppose that a family $P_X^I$ of probability measures are the finite-dimensional marginals of an infinite-dimensional measure $P_X^E$ (a "stochastic process"). As marginals of one and the same measure, the measures must be marginals of each other as well:

    $P_X^I = P_X^J \circ \pi_{JI}^{-1}$   whenever $I \subset J$.    (1)

Any family of probability measures satisfying (1) is called a projective family. The marginals of a stochastic process measure are always projective. A famous theorem by Kolmogorov states that the converse is also true: any projective family on the finite-dimensional subspaces of an infinite-dimensional product space $\Omega^E$ uniquely defines a stochastic process on the space $\Omega^E$ [1].
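As a small numerical illustration of projectivity (our own check, not part of the paper): for a zero-mean Gaussian process with covariance function $k$, the marginal on a finite index set is a multivariate normal, and condition (1) reduces to the fact that selecting from $K_J$ the rows and columns indexed by $I \subset J$ reproduces $K_I$. A minimal Python sketch, with an arbitrary kernel and arbitrary index sets:

import numpy as np

# squared-exponential covariance function (arbitrary illustrative choice)
def k(s, t, ell=1.0):
    return np.exp(-0.5 * (s - t) ** 2 / ell ** 2)

def cov(points):
    return np.array([[k(s, t) for t in points] for s in points])

J = [0.1, 0.7, 1.3, 2.0]       # a finite index set J
I_pos = [0, 2]                 # positions inside J of the subset I = {0.1, 1.3}
I = [J[i] for i in I_pos]

K_J = cov(J)
K_I_direct = cov(I)                         # marginal specified directly on I
K_I_projected = K_J[np.ix_(I_pos, I_pos)]   # marginal obtained from J by projection

# Projectivity in the sense of Eq. (1): both constructions agree.
assert np.allclose(K_I_direct, K_I_projected)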
Examples include Euclidean spaces, separable Banach or Hilbert spaces, countable discrete spaces, and countable products of spaces that are themselves Polish. Let $\Omega_x$ be a Polish space, and let $\{P_X^I \mid I \in \mathcal{F}(E)\}$ be a family of probability measures on the spaces $(\Omega_x^I, \mathcal{B}_x^I)$. If the family is projective, there exists a uniquely defined probability measure $P_X^E$ on $\Omega_x^E$ with the measures $P_X^I$ as its marginals. The infinite-dimensional measure $P_X^E$ constructed in Theorem 1 is called the projective limit of the family $P_X^I$. Intuitively, the theorem is a regularity result: the marginals determine the values of $P_X^E$ on a subset of events (namely on those events involving only a finite subset of the random variables, which are just the cylinder sets with finite-dimensional base). The theorem then states that a probability measure is such a regular object that knowledge of these values determines the measure completely, in a similar manner as continuous functions on the line are completely determined by their values on a countable dense subset. The statement of the Kolmogorov theorem is deceptive in its generality: it holds for any index set $E$, but if $E$ is not countable, the constructed measure $P_X^E$ is essentially useless, even though the theorem still holds and the measure is still uniquely defined.
Application of the extension theorem to each value of the condition need not yield a proper conditional distribution on the infinite-dimensional space, as it disregards the properties of the second argument. In order to analyze the properties of an infinite-dimensional Bayesian model in terms of finite-dimensional marginals, we need a theorem that establishes a correspondence between the finite-dimensional and infinite-dimensional conditional distributions. Though a number of extension theorems based on conditional distributions are available in the literature, these results focus on the construction of sequential stochastic processes from a sequence of conditionals (see [10] for an overview). To formulate the result, the projector used to define the marginals has to be generalized from measures to conditionals. The natural way to do so is the following: if $P_X^J(X^J \mid \Theta^J)$ is a conditional probability on the product space $\Omega^J$, and $I \subset J$, define the conditional projection $[\pi_{JI} P_X^J](\,\cdot \mid \cdot\,)$ by applying the preimage $\pi_{JI}^{-1}$ in the first argument.
As with projective families of measures, we then define projective families of conditional probabilities. Let $P_X^I(X^I \mid \Theta^I)$ be a family of regular conditional probabilities on product spaces $\Omega_x^I$, for all $I \in \mathcal{F}(E)$. The family will be called conditionally projective if $[\pi_{JI} P_X^J]$ coincides almost everywhere with $P_X^I$ whenever $I \subset J$. Theorem 2 states that a conditional probability on a countably-dimensional product space is uniquely defined (up to almost-everywhere equivalence) by its conditional marginals. In particular, if we can define a parametric model on each finite-dimensional space $\Omega_x^I$ for $I \in \mathcal{F}(E)$ such that these models are conditionally projective, the models determine an infinite-dimensional parametric model (a "nonparametric" model) on the overall space $\Omega^E$. Let $P_X^I(X^I \mid \Theta^I)$ be a family of regular conditional probabilities on the product space $\Omega_x^I$. Then if the family is conditionally projective, there exists a regular conditional probability $P_X^E(X^E \mid C^E)$ on the infinite-dimensional space $\Omega^E$ with the $P_X^I(X^I \mid \Theta^I)$ as its conditional marginals. However, for any conditionally projective family, there is a set $N \subset \Omega$ of possible exceptions (for which projectiveness need not hold), due to the fact that conditional probabilities and conditional projections are only unique almost everywhere. Using the countability of the dimension set $E$, we can argue that $N$ is always a null set; the resulting set of constructed infinite-dimensional measures is still a valid candidate for a regular conditional probability.
4 Conjugacy

The posterior of a Dirichlet process is again a Dirichlet process, and the posterior parameters can be computed as a function of the data and the prior parameters.
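For concreteness, the standard Dirichlet process update reads as follows (a well-known fact, stated here for reference rather than taken from this extraction): if $G \sim \mathrm{DP}(\alpha, G_0)$ and $x_1, \dots, x_n \mid G \sim G$ i.i.d., then

$$
G \mid x_{1:n} \;\sim\; \mathrm{DP}\!\left(\alpha + n,\ \frac{\alpha G_0 + \sum_{i=1}^{n} \delta_{x_i}}{\alpha + n}\right),
$$

so the data and the prior parameters $(\alpha, G_0)$ are mapped to the posterior parameters by an explicit update rule.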
Virtually all known nonparametric Bayesian models, including Gaussian processes, Pólya trees, and neutral-to-the-right processes, are conjugate [16]. In the Bayesian and exponential family literature, conjugacy is often defined as "closure under sampling", i.e. the requirement that prior and posterior belong to the same family of distributions. This definition does not imply tractability of the posterior: in particular, the set of all probability measures (used as priors) is conjugate for any possible likelihood, but obviously this does not facilitate computation of the posterior. In the following, we call a prior and a likelihood of a Bayesian model conjugate if the posterior (i) is parameterized and (ii) there is a measurable mapping $T$ from the data $x$ and the prior parameter $\psi$ to the parameter $\psi' = T(x, \psi)$ which specifies the corresponding posterior. Model and prior are called conjugate if there exists a regular conditional probability $k : \mathcal{B}_\theta \times \Omega_t \to [0,1]$, parameterized on a measurable Polish space $(\Omega_t, \mathcal{B}_t)$, and a measurable map $T : \Omega_x \times \Omega_\psi \to \Omega_t$, such that

    $P_\Theta(A \mid X = x, \Psi = \psi) = k(A, T(x, \psi))$   for all $A \in \mathcal{B}_\theta$.    (3)

The mapping $T$ is called the posterior index of the model.
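A minimal sketch of this definition in the simplest conjugate pair, a Bernoulli likelihood with a Beta prior (the example, including the function names below, is ours and only illustrates the roles of $T$ and $k$):

from scipy import stats

def posterior_index(x, psi):
    # T: data x in {0, 1} and prior parameter psi = (a, b) -> posterior parameter
    a, b = psi
    return (a + x, b + 1 - x)

def kernel(a_lo, a_hi, t):
    # k(A, t): probability of the interval A = (a_lo, a_hi) under a Beta(t) distribution
    a, b = t
    dist = stats.beta(a, b)
    return dist.cdf(a_hi) - dist.cdf(a_lo)

psi = (1.0, 1.0)               # uniform prior on the success probability
for x in [1, 0, 1, 1]:         # observations arrive one at a time
    psi = posterior_index(x, psi)

print(psi)                     # -> (4.0, 2.0): the posterior index after four observations
print(kernel(0.5, 1.0, psi))   # posterior probability that the parameter exceeds 1/2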
This carries over to the projective limit model: if the finite-dimensional marginals admit a tractable posterior index, then so does the projective limit model. Choose $P_\Theta(\Theta \mid \Psi)$ as the "natural conjugate prior" with parameters $\psi = (\alpha, y)$. The posterior $P_\Theta(\Theta \mid X, \Psi)$ is conjugate in the sense of Def. The probability kernel $k$ is given by $k(A, (t_1, t_2)) := \int_A q(\theta \mid t_1, t_2)\, d\nu_\Theta(\theta)$, and the posterior index is $T(x, (\alpha, y)) := (\alpha + 1,\ y + S(x))$.
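A hedged sketch of this generic exponential-family update, $T(x, (\alpha, y)) = (\alpha + 1, y + S(x))$; the Gaussian sufficient statistic below is our own illustrative choice and not tied to any particular model in the paper:

import numpy as np

def posterior_index(x, psi, S):
    # generic conjugate update T(x, (alpha, y)) = (alpha + 1, y + S(x))
    alpha, y = psi
    return (alpha + 1.0, y + S(x))

# Example: Gaussian likelihood with known variance; sufficient statistic S(x) = x.
S = lambda x: np.asarray(x, dtype=float)

psi = (1.0, np.zeros(1))            # prior parameters (alpha, y), chosen arbitrarily
for x in [1.2, -0.3, 0.8]:
    psi = posterior_index(np.array([x]), psi, S)

alpha, y = psi
print(alpha, y)    # alpha counts observations, y accumulates sufficient statistics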
Both extension theorems discussed so far require a projection condition on the measures and models involved. A similar condition is now required for the mappings $T^I$: the preimages $T^{I,-1}$ of the posterior indices $T^I$ must commute with the preimage under projection,

    $(\pi_{EI} \circ T^E)^{-1} = (T^I \circ \pi_{EI})^{-1}$   for all $I \in \mathcal{F}(E)$.    (4)

The posterior indices of all well-known exponential family models, such as Gaussians and Dirichlets, satisfy this condition.
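A small sketch of what condition (4) means in practice, under the simplifying assumption (ours, for illustration only) that the sufficient statistic acts coordinatewise: updating over a larger index set $E$ and then restricting to $I \subset E$ gives the same result as restricting first and then updating on $I$.

import numpy as np

E = list(range(6))                  # a finite stand-in for a countable index set E
I = [1, 3, 4]                       # a finite subset I of E

def S(x):                           # coordinatewise sufficient statistic (assumption)
    return x

def T(x, psi):                      # posterior index T(x, (alpha, y)) = (alpha + 1, y + S(x))
    alpha, y = psi
    return (alpha + 1.0, y + S(x))

def project(v, idx):                # coordinate projection pi_{EI}
    return v[idx]

rng = np.random.default_rng(0)
x_E = rng.normal(size=len(E))
psi_E = (2.0, rng.normal(size=len(E)))

# update on E, then project ...
alpha_E, y_E = T(x_E, psi_E)
lhs = (alpha_E, project(y_E, I))
# ... versus project, then update on I
psi_I = (psi_E[0], project(psi_E[1], I))
rhs = T(project(x_E, I), psi_I)

assert lhs[0] == rhs[0] and np.allclose(lhs[1], rhs[1])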
The following theorem states that (i) stochastic process Bayesian models that are constructed from conjugate marginals are conjugate if the projection equation (4) is satisfied, and that (ii) such conjugate models can only be constructed from conjugate marginals. Let $E$ be a countable index set and $\Omega_x^E$ and $\Omega_\theta^E$ be Polish product spaces. Assume that there is a Bayesian model on each finite-dimensional subspace $\Omega_x^I$, such that the families of all priors, all observation models and all posteriors are conditionally projective. Let $P_\Theta^E(\Theta^E)$, $P_X^E(X^E \mid \Theta^E)$ and $P_\Theta^E(\Theta^E \mid X^E)$ denote the respective projective limits. Then $P_\Theta^E(\Theta^E \mid X^E)$ is a posterior for the infinite-dimensional Bayesian model defined by $P_X^E(X^E \mid \Theta^E)$ with prior $P_\Theta^E(\Theta^E)$, and the following holds:

(i) Assume that each finite-dimensional posterior $P_\Theta^I(\Theta^I \mid X^I)$ is conjugate w.r.t. its respective Bayesian model, with posterior index $T^I$ and probability kernel $k^I$. Then if there is a measurable mapping $T^E : \Omega_x^E \to \Omega_t^E$ satisfying the projection condition (4), the projective limit posterior $P_\Theta^E(\Theta^E \mid X^E)$ is conjugate with posterior index $T^E$.

(ii) Conversely, if the infinite-dimensional posterior $P_\Theta^E(\Theta^E \mid X^E)$ is conjugate with posterior index $T^E$ and probability kernel $k^E$, then each marginal posterior $P_\Theta^I(\Theta^I \mid X^I)$ is conjugate, with posterior index $T^I := \pi_{EI} \circ T^E \circ \pi_{EI}^{-1}$.

Part (i): we define a candidate for the probability kernel $k^E$ representing the projective limit posterior, and then verify that it makes the model conjugate when combined with the mapping $T^E$ given by assumption. To do so, we first construct the conditional probabilities $P_\Theta^I(\Theta^I \mid T^I)$, show that they form a conditionally projective family, and take their conditional projective limit using Theorem 2. This projective limit is used as a candidate for $k^E$.
5 Construction of Nonparametric Bayesian Models

Theorem 3(ii) states that conjugate models have conjugate marginals. Addition commutes with projection, and hence the posterior indices $T^I$ of a family of such models over all dimensions $I \in \mathcal{F}(E)$ satisfy the projection condition (4) if and only if the same condition is satisfied by the sufficient statistics $S^I$ of the marginals. If that is the case, $T^E(x^E, (\alpha, y^E)) := (\alpha + 1,\ y^E + S^E(x^E))$ is a posterior index for the infinite-dimensional projective limit model.
The following table summarizes some stochastic process models from the conjugate extension point of view. [Table: finite-dimensional marginals (d-dim), e.g. Bernoulli/Beta and multinomial, and the corresponding stochastic process models.]
We construct a nonparametric Bayesian model on bijections, with a likelihood component $P_X^E(X^E \mid \Theta^E)$ equivalent to the model of Meila and Bao. The finite-dimensional marginals are probability models of rankings of a finite number of items, introduced by Fligner and Verducci [6]. For permutations $\tau \in S_r$ of length $r$, the model is defined by the exponential family density $p(\tau \mid \sigma, \theta) := Z(\theta)^{-1} \exp(\langle S(\tau\sigma^{-1}), \theta \rangle)$, where the sufficient statistic is the vector $S^r(\tau) := (S_1(\tau), \dots)$.
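The components $S_j$ are not reproduced above; purely for illustration, the sketch below uses per-position inversion counts, a standard choice in this family of ranking models, so treat the exact form of $S_j$ as an assumption. The density is normalized by brute force over $S_r$, which is only feasible for small $r$:

import numpy as np
from itertools import permutations

def suff_stat(tau):
    # S_j(tau) = #{l > j : tau[l] < tau[j]}, an inversion count per position
    # (illustrative choice; the precise S_j of Fligner and Verducci may differ)
    r = len(tau)
    return np.array([sum(tau[l] < tau[j] for l in range(j + 1, r))
                     for j in range(r - 1)], dtype=float)

def log_p_unnorm(tau, sigma_inv, theta):
    # unnormalised log p(tau | sigma, theta) = <S(tau o sigma^{-1}), theta>
    composed = [tau[s] for s in sigma_inv]      # the permutation tau o sigma^{-1}
    return float(np.dot(suff_stat(composed), theta))

r = 4
theta = np.array([-0.8, -0.5, -0.2])            # one natural parameter per position
sigma_inv = list(range(r))                      # reference permutation sigma = identity
Z = sum(np.exp(log_p_unnorm(list(p), sigma_inv, theta)) for p in permutations(range(r)))
tau = [2, 0, 1, 3]
print(np.exp(log_p_unnorm(tau, sigma_inv, theta)) / Z)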
We will define a nonparametric Bayesian model that puts a prior on the infinite-dimensional analogue. (Footnote 3: Mixtures of conjugate priors are conjugate in the sense of closure under sampling [4], but the posterior index in Def.) An example of a conjugate model not in the exponential family is the uniform distribution on $[0, \theta]$ with a Pareto prior [12]. The corresponding natural conjugate prior on $\theta^I$ has density $q^I(\theta^I \mid \alpha, y^I) \propto \exp(\langle \theta^I, y^I \rangle - \alpha \log Z^I(\theta^I))$. Since the model is an exponential family model, the posterior index is of the form $T^I((\alpha, y^I), \tau^I) = (\alpha + 1,\ y^I + S^I(\tau^I))$, and since $S^I$ is projective in the sense of Eq.
It is reasonably straightforward to show that both families are conditionally projective, and so is the family of the corresponding posteriors. Each therefore has a projective limit, and the projective limit of the posteriors is the posterior of the projective limit $P^E(X^E \mid \Theta^E)$ under the prior $P^E(\Theta^E)$. By Theorem 3, this makes $T^E$ a posterior index for the projective limit model.
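In an implementation, the infinite-dimensional posterior index never needs to be represented explicitly: after finitely many observations, only finitely many coordinates of $y^E$ have been touched. The following rough sketch of such a lazily maintained index is entirely our own illustration; the coordinate convention and the way observations are encoded are hypothetical:

from collections import defaultdict

class LazyPosteriorIndex:
    # lazily maintained posterior index (alpha, y^E) for the projective limit model
    def __init__(self):
        self.alpha = 0.0
        self.y = defaultdict(float)   # only finitely many coordinates ever become nonzero

    def update(self, suffstat):
        # observe one finite ranking, given as a dict {j: S_j(tau)} of sufficient statistics
        self.alpha += 1.0
        for j, s in suffstat.items():
            self.y[j] += s

post = LazyPosteriorIndex()
post.update({0: 2.0, 1: 1.0})           # hypothetical S of a ranking of a few items
post.update({0: 0.0, 1: 2.0, 2: 1.0})
print(post.alpha, dict(post.y))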
6 Discussion and Conclusion

We have shown how nonparametric Bayesian models can be constructed from finite-dimensional Bayes equations, and how conjugacy properties of the finite-dimensional models carry over to the infinite-dimensional, nonparametric case. We have also argued that conjugate nonparametric Bayesian models arise from exponential families. A number of interesting questions could not be addressed within the scope of this paper, including (1) the extension to model properties other than conjugacy and (2) the generalization to uncountable dimensions. In this case, we would ask whether the existence of sufficient statistics for the finite-dimensional marginals implies the existence of a sufficient statistic for the nonparametric Bayesian model, and whether the infinite-dimensional sufficient statistic can be explicitly constructed. On the other hand, uncountable product space constructions are subject to all the subtleties of stochastic process theory, many of which do not occur in countable dimensions. The application of construction methods to conditional probabilities also becomes more complicated (roughly speaking, the point-wise application of the Kolmogorov theorem in the proof of Theorem 2 is not possible if the dimension is uncountable). Product space constructions are by no means the only way to define nonparametric Bayesian models. A Pólya tree model [7], for example, is much more intuitive to construct by means of a binary partition argument than from marginals in product space. However, the marginals may then not be the marginals in terms of which we "naturally" think about the model. Nonetheless, we have hopefully demonstrated that the theoretical results are applicable to the construction of an interesting and practical range of nonparametric Bayesian models.
2 0.13325304 112 nips-2009-Human Rademacher Complexity
Author: Xiaojin Zhu, Bryan R. Gibson, Timothy T. Rogers
Abstract: We propose to use Rademacher complexity, originally developed in computational learning theory, as a measure of human learning capacity. Rademacher complexity measures a learner’s ability to fit random labels, and can be used to bound the learner’s true error based on the observed training sample error. We first review the definition of Rademacher complexity and its generalization bound. We then describe a “learning the noise” procedure to experimentally measure human Rademacher complexities. The results from empirical studies showed that: (i) human Rademacher complexity can be successfully measured, (ii) the complexity depends on the domain and training sample size in intuitive ways, (iii) human learning respects the generalization bounds, (iv) the bounds can be useful in predicting the danger of overfitting in human learning. Finally, we discuss the potential applications of human Rademacher complexity in cognitive science. 1
3 0.093420826 62 nips-2009-Correlation Coefficients are Insufficient for Analyzing Spike Count Dependencies
Author: Arno Onken, Steffen Grünewälder, Klaus Obermayer
Abstract: The linear correlation coefficient is typically used to characterize and analyze dependencies of neural spike counts. Here, we show that the correlation coefficient is in general insufficient to characterize these dependencies. We construct two neuron spike count models with Poisson-like marginals and vary their dependence structure using copulas. To this end, we construct a copula that allows to keep the spike counts uncorrelated while varying their dependence strength. Moreover, we employ a network of leaky integrate-and-fire neurons to investigate whether weakly correlated spike counts with strong dependencies are likely to occur in real networks. We find that the entropy of uncorrelated but dependent spike count distributions can deviate from the corresponding distribution with independent components by more than 25 % and that weakly correlated but strongly dependent spike counts are very likely to occur in biological networks. Finally, we introduce a test for deciding whether the dependence structure of distributions with Poissonlike marginals is well characterized by the linear correlation coefficient and verify it for different copula-based models. 1
4 0.085907623 226 nips-2009-Spatial Normalized Gamma Processes
Author: Vinayak Rao, Yee W. Teh
Abstract: Dependent Dirichlet processes (DPs) are dependent sets of random measures, each being marginally DP distributed. They are used in Bayesian nonparametric models when the usual exchangeability assumption does not hold. We propose a simple and general framework to construct dependent DPs by marginalizing and normalizing a single gamma process over an extended space. The result is a set of DPs, each associated with a point in a space such that neighbouring DPs are more dependent. We describe Markov chain Monte Carlo inference involving Gibbs sampling and three different Metropolis-Hastings proposals to speed up convergence. We report an empirical study of convergence on a synthetic dataset and demonstrate an application of the model to topic modeling through time. 1
5 0.08342506 69 nips-2009-Discrete MDL Predicts in Total Variation
Author: Marcus Hutter
Abstract: The Minimum Description Length (MDL) principle selects the model that has the shortest code for data plus model. We show that for a countable class of models, MDL predictions are close to the true distribution in a strong sense. The result is completely general. No independence, ergodicity, stationarity, identifiability, or other assumption on the model class need to be made. More formally, we show that for any countable class of models, the distributions selected by MDL (or MAP) asymptotically predict (merge with) the true measure in the class in total variation distance. Implications for non-i.i.d. domains like time-series forecasting, discriminative learning, and reinforcement learning are discussed. 1
6 0.075302757 78 nips-2009-Efficient Moments-based Permutation Tests
7 0.07517305 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs
8 0.070804521 213 nips-2009-Semi-supervised Learning using Sparse Eigenfunction Bases
9 0.070702553 100 nips-2009-Gaussian process regression with Student-t likelihood
10 0.069894381 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes
11 0.069641314 123 nips-2009-Large Scale Nonparametric Bayesian Inference: Data Parallelisation in the Indian Buffet Process
12 0.064018309 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction
13 0.062484909 114 nips-2009-Indian Buffet Processes with Power-law Behavior
14 0.060889266 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs
15 0.059862737 171 nips-2009-Nonparametric Bayesian Models for Unsupervised Event Coreference Resolution
16 0.058271855 255 nips-2009-Variational Inference for the Nested Chinese Restaurant Process
17 0.057480175 11 nips-2009-A General Projection Property for Distribution Families
18 0.057124883 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models
19 0.056738537 158 nips-2009-Multi-Label Prediction via Sparse Infinite CCA
20 0.056548454 97 nips-2009-Free energy score space
topicId topicWeight
[(0, -0.186), (1, 0.009), (2, 0.014), (3, -0.057), (4, 0.042), (5, -0.16), (6, 0.058), (7, -0.041), (8, -0.038), (9, -0.07), (10, 0.017), (11, -0.009), (12, 0.022), (13, -0.075), (14, -0.008), (15, 0.006), (16, 0.046), (17, 0.077), (18, 0.042), (19, 0.048), (20, -0.004), (21, 0.035), (22, -0.1), (23, -0.007), (24, -0.097), (25, -0.02), (26, -0.016), (27, -0.021), (28, -0.071), (29, 0.071), (30, -0.042), (31, 0.056), (32, -0.081), (33, -0.023), (34, -0.033), (35, 0.065), (36, 0.028), (37, 0.005), (38, -0.046), (39, -0.022), (40, 0.077), (41, 0.02), (42, 0.085), (43, 0.065), (44, 0.119), (45, 0.053), (46, 0.093), (47, -0.087), (48, -0.049), (49, 0.091)]
simIndex simValue paperId paperTitle
same-paper 1 0.95438987 59 nips-2009-Construction of Nonparametric Bayesian Models from Parametric Bayes Equations
Author: Peter Orbanz
Abstract: We consider the general problem of constructing nonparametric Bayesian models on infinite-dimensional random objects, such as functions, infinite graphs or infinite permutations. The problem has generated much interest in machine learning, where it is treated heuristically, but has not been studied in full generality in nonparametric Bayesian statistics, which tends to focus on models over probability distributions. Our approach applies a standard tool of stochastic process theory, the construction of stochastic processes from their finite-dimensional marginal distributions. The main contribution of the paper is a generalization of the classic Kolmogorov extension theorem to conditional probabilities. This extension allows a rigorous construction of nonparametric Bayesian models from systems of finitedimensional, parametric Bayes equations. Using this approach, we show (i) how existence of a conjugate posterior for the nonparametric model can be guaranteed by choosing conjugate finite-dimensional models in the construction, (ii) how the mapping to the posterior parameters of the nonparametric model can be explicitly determined, and (iii) that the construction of conjugate models in essence requires the finite-dimensional models to be in the exponential family. As an application of our constructive framework, we derive a model on infinite permutations, the nonparametric Bayesian analogue of a model recently proposed for the analysis of rank data. 1
2 0.608105 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs
Author: Francois Caron, Arnaud Doucet
Abstract: Over recent years Dirichlet processes and the associated Chinese restaurant process (CRP) have found many applications in clustering while the Indian buffet process (IBP) is increasingly used to describe latent feature models. These models are attractive because they ensure exchangeability (over samples). We propose here extensions of these models where the dependency between samples is given by a known decomposable graph. These models have appealing properties and can be easily learned using Monte Carlo techniques. 1 Motivation The CRP and IBP have found numerous applications in machine learning over recent years [5, 10]. We consider here the case where the data we are interested in are ‘locally’ dependent; these dependencies being represented by a known graph G where each data point/object is associated to a vertex. These local dependencies can correspond to any conceptual or real (e.g. space, time) metric. For example, in the context of clustering, we might want to propose a prior distribution on partitions enforcing that data which are ‘close’ in the graph are more likely to be in the same cluster. Similarly, in the context of latent feature models, we might be interested in a prior distribution on features enforcing that data which are ‘close’ in the graph are more likely to possess similar features. The ‘standard’ CRP and IBP correspond to the case where the graph G is complete; that is it is fully connected. In this paper, we generalize the CRP and IBP to decomposable graphs. The resulting generalized versions of the CRP and IBP enjoy attractive properties. Each clique of the graph follows marginally a CRP or an IBP process and explicit expressions for the joint prior distribution on the graph is available. It makes it easy to learn those models using straightforward generalizations of Markov chain Monte Carlo (MCMC) or Sequential Monte Carlo (SMC) algorithms proposed to perform inference for the CRP and IBP [5, 10, 14]. The rest of the paper is organized as follows. In Section 2, we review the popular Dirichlet multinomial allocation model and the Dirichlet Process (DP) partition distribution. We propose an extension of these two models to decomposable graphical models. In Section 3 we discuss nonparametric latent feature models, reviewing briefly the construction in [5] and extending it to decomposable graphs. We demonstrate these models in Section 4 on two applications: an alternative to the hierarchical DP model [12] and a time-varying matrix factorization problem. 2 Prior distributions for partitions on decomposable graphs Assume we have n observations. When performing clustering, we associate to each of this observation an allocation variable zi ∈ [K] = {1, . . . , K}. Let Πn be the partition of [n] = {1, . . . , n} defined by the equivalence relation i ↔ j ⇔ zi = zj . The resulting partition Πn = {A1 , . . . , An(Πn ) } 1 is an unordered collection of disjoint non-empty subsets Aj of [n], j = 1, . . . , n(Πn ), where ∪j Aj = [n] and n(Πn ) is the number of subsets for partition Πn . We also denote by Pn be the set of all partitions of [n] and let nj , j = 1, . . . , n(Πn ), be the size of the subset Aj . Each allocation variable zi is associated to a vertex/site of an undirected graph G, which is assumed to be known. In the standard case where the graph G is complete, we first review briefly here two popular prior distributions on z1:n , equivalently on Πn . 
We then extend these models to undirected decomposable graphs; see [2, 8] for an introduction to decomposable graphs. Finally we briefly discuss the directed case. Note that the models proposed here are completely different from the hyper multinomial-Dirichlet in [2] and its recent DP extension [6]. 2.1 Dirichlet multinomial allocation model and DP partition distribution Assume for the time being that K is finite. When the graph is complete, a popular choice for the allocation variables is to consider a Dirichlet multinomial allocation model [11] θ θ , . . . , ), zi |π ∼ π (1) K K where D is the standard Dirichlet distribution and θ > 0. Integrating out π, we obtain the following Dirichlet multinomial prior distribution π ∼ D( Pr(z1:n ) = K j=1 Γ(θ) Γ(nj + θ K) (2) θ Γ(θ + n)Γ( K )K and then, using the straightforward equality Pr(Πn ) = PK where PK = {Πn ∈ Pn |n(Πn ) ≤ K}, we obtain K! (K−n(Πn ))! Pr(z1:n ) valid for for all Πn ∈ n(Π ) Pr(Πn ) = θ Γ(θ) j=1n Γ(nj + K ) K! . θ (K − n(Πn ))! Γ(θ + n)Γ( K )n(Πn ) (3) DP may be seen as a generalization of the Dirichlet multinomial model when the number of components K → ∞; see for example [10]. In this case the distribution over the partition Πn of [n] is given by [11] n(Π ) θn(Πn ) j=1n Γ(nj ) . (4) Pr(Πn ) = n i=1 (θ + i − 1) Let Π−k = {A1,−k , . . . , An(Π−k ),−k } be the partition induced by removing item k to Πn and nj,−k be the size of cluster j for j = 1, . . . , n(Π−k ). It follows from (4) that an item k is assigned to an existing cluster j, j = 1, . . . , n(Π−k ), with probability proportional to nj,−k / (n − 1 + θ) and forms a new cluster with probability θ/ (n − 1 + θ). This property is the basis of the CRP. We now extend the Dirichlet multinomial allocation and the DP partition distribution models to decomposable graphs. 2.2 Markov combination of Dirichlet multinomial and DP partition distributions Let G be a decomposable undirected graph, C = {C1 , . . . , Cp } a perfect ordering of the cliques and S = {S2 , . . . , Cp } the associated separators. It can be easily checked that if the marginal distribution of zC for each clique C ∈ C is defined by (2) then these distributions are consistent as they yield the same distribution (2) over the separators. Therefore, the unique Markov distribution over G with Dirichlet multinomial distribution over the cliques is defined by [8] Pr(zC ) S∈S Pr(zS ) C∈C Pr(z1:n ) = (5) where for each complete set B ⊆ G, we have Pr(zB ) given by (2). It follows that we have for any Πn ∈ PK Γ(θ) K! Pr(Πn ) = (K − n(Πn ))! C∈C Γ(θ) S∈S 2 K j=1 θ Γ(nj,C + K ) θ Γ(θ+nC )Γ( K )K K j=1 θ Γ(nj,S + K ) θ Γ(θ+nS )Γ( K )K (6) where for each complete set B ⊆ G, nj,B is the number of items associated to cluster j, j = 1, . . . , K in B and nB is the total number of items in B. Within each complete set B, the allocation variables define a partition distributed according to the Dirichlet-multinomial distribution. We now extend this approach to DP partition distributions; that is we derive a joint distribution over Πn such that the distribution of ΠB over each complete set B of the graph is given by (4) with θ > 0. Such a distribution satisfies the consistency condition over the separators as the restriction of any partition distributed according to (4) still follows (4) [7]. G Proposition. Let Pn be the set of partitions Πn ∈ Pn such that for each decomposition A, B, and any (i, j) ∈ A × B, i ↔ j ⇒ ∃k ∈ A ∩ B such that k ↔ i ↔ j. 
As K → ∞, the prior distribution G over partitions (6) is given for each Πn ∈ Pn by Pr(Πn ) = θn(Πn ) n(ΠC ) Γ(nj,C ) j=1 nC i=1 (θ+i−1) n(ΠS ) Γ(nj,S ) j=1 nS (θ+i−1) i=1 C∈C S∈S (7) where n(ΠB ) is the number of clusters in the complete set B. Proof. From (6), we have θ n(ΠC ) K(K − 1) . . . (K − n(Πn ) + 1) Pr(Πn ) = K C∈C n(ΠC )− S∈S n(ΠS ) C∈C θ n(ΠS ) S∈S n(ΠC ) θ Γ(nj,C + K ) j=1 nC (θ+i−1) i=1 n(ΠS ) θ Γ(nj,S + K ) j=1 nS (θ+i−1) i=1 Thus when K → ∞, we obtain (7) if n(Πn ) = C∈C n(ΠC ) − S∈S n(ΠS ) and 0 otherwise. We have n(Πn ) ≤ C∈C n(ΠC ) − S∈S n(ΠS ) for any Πn ∈ Pn and the subset of Pn verifying G n(Πn ) = C∈C n(ΠC ) − S∈S n(ΠS ) corresponds to the set Pn . Example. Let the notation i ∼ j (resp. i j) indicates an edge (resp. no edge) between two sites. Let n = 3 and G be the decomposable graph defined by the relations 1 ∼ 2, 2 ∼ 3 and 1 3. G The set P3 is then equal to {{{1, 2, 3}}; {{1, 2}, {3}}; {{1}, {2, 3}}; {{1}, {2}, {3}}}. Note that G the partition {{1, 3}, {2}} does not belong to P3 . Indeed, as there is no edge between 1 and 3, they cannot be in the same cluster if 2 is in another cluster. The cliques are C1 = {1, 2} and C2 = {2, 3} Pr(ΠC1 ) Pr(ΠC2 ) hence we can and the separator is S2 = {2}. The distribution is given by Pr(Π3 ) = Pr(ΠS ) 2 check that we obtain Pr({1, 2, 3}) = (θ + 1)−2 , Pr({1, 2}, {3}) = Pr({1, 2}, {3}) = θ(θ + 1)−2 and Pr({1}, {2}, {3}) = θ2 (θ + 1)−2 . Let now define the full conditional distributions. Based on (7) the conditional assignment of an item k is proportional to the conditional over the cliques divided by the conditional over the separators. G Let denote G−k the undirected graph obtained by removing vertex k from G. Suppose that Πn ∈ Pn . G−k If Π−k ∈ Pn−1 , then do not change the value of item k. Otherwise, item k is assigned to cluster j / where j = 1, . . . , n(Π−k ) with probability proportional to {C∈C|n−k,j,C >0} n−k,j,C {S∈S|n−k,j,S >0} n−k,j,S (8) and to a new cluster with probability proportional to θ, where n−k,j,C is the number of items in the set C \ {k} belonging to cluster j. The updating process is illustrated by the Chinese wedding party process1 in Fig. 1. The results of this section can be extended to the Pitman-Yor process, and more generally to species sampling models. Example (continuing). Given Π−2 = {A1 = {1}, A2 = {3}}, we have −1 Pr( item 2 assigned to A1 = {1}| Π−2 ) = Pr( item 2 assigned to A2 = {3}| Π−2 ) = (θ + 2) −1 and Pr( item 2 assigned to new cluster A3 | Π−2 ) = θ (θ + 2) . Given Π−2 = {A1 = {1, 3}}, item 2 is assigned to A1 with probability 1. 1 Note that this representation describes the full conditionals while the CRP represents the sequential updat- ing. 3 (a) (b) (d) (c) (e) Figure 1: Chinese wedding party. Consider a group of n guests attending a wedding party. Each of the n guests may belong to one or several cliques, i.e. maximal groups of people such that everybody knows everybody. The belonging of each guest to the different cliques is represented by color patches on the figures, and the graphical representation of the relationship between the guests is represented by the graphical model (e). (a) Suppose that the guests are already seated such that two guests cannot be together at the same table is they are not part of the same clique, or if there does not exist a group of other guests such that they are related (“Any friend of yours is a friend of mine”). 
(b) The guest number k leaves his table and either (c) joins a table where there are guests from the same clique as him, with probability proportional to the product of the number of guests from each clique over the product of the number of guests belonging to several cliques on that table or (d) he joins a new table with probability proportional to θ. 2.3 Monte Carlo inference 2.3.1 MCMC algorithm Using the full conditionals, a single site Gibbs sampler can easily be designed to approximate the posterior distribution Pr(Πn |z1:n ). Given a partition Πn , an item k is taken out of the partition. If G−k Π−k ∈ Pn−1 , item k keeps the same value. Otherwise, the item will be assigned to a cluster j, / j = 1, . . . , n(Π−k ), with probability proportional to p(z{k}∪Aj,−k ) × p(zAj,−k ) {C∈C|n−k,j,C >0} n−k,j,C {S∈S|n−k,j,S >0} n−k,j,S (9) and the item will be assigned to a new cluster with probability proportional to p(z{k} ) × θ. Similarly to [3], we can also define a procedure to sample from p(θ|n(Πn ) = k)). We assume that θ ∼ G(a, b) and use p auxiliary variables x1 , . . . , xp . The procedure is as follows. • For j = 1, . . . , p, sample xj |k, θ ∼ Beta(θ + nSj , nCj − nSj ) • Sample θ|k, x1:p ∼ G(a + k, b − j log xj ) 2.3.2 Sequential Monte Carlo We have so far only treated the case of an undirected decomposable graph G. We can formulate a sequential updating rule for the corresponding perfect directed version D of G. Indeed, let (a1 , . . . a|V | ) be a perfect ordering and pa(ak ) be the set of parents of ak which is by definition complete. Let Πk−1 = {A1,k−1 , . . . , An(Πk−1 ),k−1 } denote the partition of the first k−1 vertices a1:k−1 and let nj,pa(ak ) be the number of elements with value j in the set pa(ak ), j = 1, . . . , n(Πk−1 ). Then the vertex ak joins the set j with probability nj,pa(ak ) / θ + cluster with probability θ/ θ + q q nq,pa(ak ) and creates a new nq,pa(ak ) . One can then design a particle filter/SMC method in a similar fashion as [4]. Consider a set of (i) (i) (i) (i) N N particles Πk−1 with weights wk−1 ∝ Pr(Πk−1 , z1:k−1 ) ( i=1 wk−1 = 1) that approximate (i) the posterior distribution Pr(Πk−1 |z1:k−1 ). For each particle i, there are n(Πk−1 ) + 1 possible 4 (i,j) allocations for component ak . We denote Πk the partition obtained by associating component ak (i,j) to cluster j. The weight associated to Πk is given by nj,pa(ak ) (i) if j = 1, . . . , n(Πk−1 ) θ+ q nq,pa(ak ) (i,j) (i) p(z{ak }∪Aj,k−1 ) wk−1 = wk−1 × (10) (i) θ θ+ n p(zAj,k−1 ) if j = n(Πk−1 ) + 1 q q,pa(ak ) (i,j) Then we can perform a deterministic resampling step by keeping the N particles Πk with highest (i,j) (i) (i) weights wk−1 . Let Πk be the resampled particles and wk the associated normalized weights. 3 Prior distributions for infinite binary matrices on decomposable graphs Assume we have n objects; each of these objects being associated to the vertex of a graph G. To K each object is associated a K-dimensional binary vector zn = (zn,1 , . . . , zn,K ) ∈ {0, 1} where zn,i = 1 if object n possesses feature i and zn,i = 0 otherwise. These vectors zt form a binary n × K matrix denoted Z1:n . We denote by ξ1:n the associated equivalence class of left-ordered matrices and let EK be the set of left-ordered matrices with at most K features. In the standard case where the graph G is complete, we review briefly here two popular prior distributions on Z1:n , equivalently on ξ1:n : the Beta-Bernoulli model and the IBP [5]. We then extend these models to undirected decomposable graphs. 
This can be used for example to define a time-varying IBP as illustrated in Section 4. 3.1 Beta-Bernoulli and IBP distributions The Beta-Bernoulli distribution over the allocation Z1:n is K Pr(Z1:n ) = α + K )Γ(n − nj + 1) α Γ(n + 1 + K ) α K Γ(nj j=1 (11) where nj is the number of objects having feature j. It follows that Pr(ξ1:n ) = K K! 2n −1 h=0 α K Γ(nj α + K )Γ(n − nj + 1) α Γ(n + 1 + K ) Kh ! j=1 (12) where Kh is the number of features possessing the history h (see [5] for details). The nonparametric model is obtained by taking the limit when K → ∞ Pr(ξ1:n ) = αK K+ + 2n −1 h=1 Kh ! exp(−αHn ) where K + is the total number of features and Hn = 3.2 (n − nj )!(nj − 1)! n! j=1 n 1 k=1 k . (13) The IBP follows from (13). Markov combination of Beta-Bernoulli and IBP distributions Let G be a decomposable undirected graph, C = {C1 , . . . , Cp } a perfect ordering of the cliques and S = {S2 , . . . , Cp } the associated separators. As in the Dirichlet-multinomial case, it is easily seen that if for each clique C ∈ C, the marginal distribution is defined by (11), then these distributions are consistent as they yield the same distribution (11) over the separators. Therefore, the unique Markov distribution over G with Beta-Bernoulli distribution over the cliques is defined by [8] Pr(ZC ) S∈S Pr(ZS ) C∈C Pr(Z1:n ) = (14) where Pr(ZB ) given by (11) for each complete set B ⊆ G. The prior over ξ1:n is thus given, for ξ1:n ∈ EK , by Pr(ξ1:n ) = K! 2n −1 h=0 Kh ! α K α Γ(nj,C + K )Γ(nC −nj,C +1) α Γ(nC +1+ K ) α α Γ(nj,S + K )Γ(nS −nj,S +1) K K α j=1 Γ(nS +1+ K ) K j=1 C∈C S∈S 5 (15) where for each complete set B ⊆ G, nj,B is the number of items having feature j, j = 1, . . . , K in the set B and nB is the whole set of objects in set B. Taking the limit when K → ∞, we obtain after a few calculations Pr(ξ1:n ) = α + K[n] exp [−α ( C HnC − 2n −1 h=1 Kh ! HnS )] × C∈C + KC (nC −nj,C )!(nj,C −1)! j=1 nC ! S∈S S + KS (nS −nj,S )!(nj,S −1)! j=1 nS ! + + + + if K[n] = C KC − S KS and 0 otherwise, where KB is the number of different features possessed by objects in B. G Let En be the subset of En such that for each decomposition A, B and any (u, v) ∈ A × B: {u and v possess feature j} ⇒ ∃k ∈ A ∩ B such that {k possesses feature j}. Let ξ−k be the left-ordered + matrix obtained by removing object k from ξn and K−k be the total number of different features in G−k + ξ−k . For each feature j = 1, . . . , K−k , if ξ−k ∈ En−1 then we have b C∈C nj,C if i = 1 S∈C nj,S Pr(ξk,j = i) = (16) b C∈C (nC −nj,C ) if i = 0 (nS −nj,S ) S∈C nS where b is the appropriate normalizing constant then the customer k tries Poisson α {S∈S|k∈S} nC {C∈C|k∈C} new dishes. We can easily generalize this construction to a directed version D of G using arguments similar to those presented in Section 2; see Section 4 for an application to time-varying matrix factorization. 4 4.1 Applications Sharing clusters among relative groups: An alternative to HDP Consider that we are given d groups with nj data yi,j in each group, i = 1, . . . , nj , j = 1, . . . , d. We consider latent cluster variables zi,j that define the partition of the data. We will use alternatively the notation θi,j = Uzi,j in the following. Hierarchical Dirichlet Process [12] (HDP) is a very popular model for sharing clusters among related groups. It is based on a hierarchy of DPs G0 ∼ DP (γ, H), Gj |G0 ∼ DP (α, G0 ) j = 1, . . . d θi,j |Gj ∼ Gj , yi,j |θi,j ∼ f (θi,j ) i = 1, . . . , nj . 
Under conjugacy assumptions, G0 , Gj and U can be integrated out and we can approximate the marginal posterior of (zi,j ) given y = (yi,j ) with Gibbs sampling using the Chinese restaurant franchise to sample from the full conditional p(zi,j |z−{i,j} , y). Using the graph formulation defined in Section 2, we propose an alternative to HDP. Let θ0,1 , . . . , θ0,N be N auxiliary variables belonging to what we call group 0. We define each clique Cj (j = 1, . . . , d) to be composed of elements from group j and elements from group 0. This defines a decomposable graphical model whose separator is given by the elements of group 0. We can rewrite the model in a way quite similar to HDP G0 ∼ DP (α, H), θ0,i |G0 ∼ G0 i = 1, ..., N α α Gj |θ0,1 , . . . , θ0,N ∼ DP (α + N, α+N H + α+N θi,j |Gj ∼ Gj , yi,j |θi,j ∼ f (θi,j ) i = 1, . . . , nj N i=1 δθ0,i ) j = 1, . . . d, N For any subset A and j = k ∈ {1, . . . , p} we have corr(Gj (A), Gk (A)) = α+N . Again, under conjugacy conditions, we can integrate out G0 , Gj and U and approximate the marginal posterior distribution over the partition using the Chinese wedding party process defined in Section 2. Note that for latent variables zi,j , j = 1, . . . , d, associated to data, this is the usual CRP update. As in HDP, multiple layers can be added to the model. Figures 2 (a) and (b) resp. give the graphical DP alternative to HDP and 2-layer HDP. 6 z0 root z0 root corpora docs z1 z2 z1 z2 z3 z1,1 z1,2 z2,1 z2,2 z2,3 docs (a) Graphical DP alternative to HDP (b) Graphical DP alternative to 2-layer HDP Figure 2: Hierarchical Graphs of dependency with (a) one layer and (b) two layers of hierarchy. If N = 0, then Gj ∼ DP (α, H) for all j and this is equivalent to setting γ → ∞ in HDP. If N → ∞ then Gj = G0 for all j, G0 ∼ DP (α, H). This is equivalent to setting α → ∞ in the HDP. One interesting feature of the model is that, contrary to HDP, the marginal distribution of Gj at any layer of the tree is DP (α, H). As a consequence, the total number of clusters scales logarithmically (as in the usual DP) with the size of each group, whereas it scales doubly logarithmically in HDP. Contrary to HDP, there are at most N clusters shared between different groups. Our model is in that sense reminiscent of [9] where only a limited number of clusters can be shared. Note however that contrary to [9] we have a simple CRP-like process. The proposed methodology can be straightforwardly extended to the infinite HMM [12]. The main issue of the proposed model is the setting of the number N of auxiliary parameters. Another issue is that to achieve high correlation, we need a large number of auxiliary variables. Nonetheless, the computational time used to sample from auxiliary variables is negligible compared to the time used for latent variables associated to data. Moreover, it can be easily parallelized. The model proposed offers a far richer framework and ensures that at each level of the tree, the marginal distribution of the partition is given by a DP partition model. 4.2 Time-varying matrix factorization Let X1:n be an observed matrix of dimension n × D. We want to find a representation of this matrix in terms of two latent matrices Z1:n of dimension n × K and Y of dimension K × D. Here Z1:n 2 is a binary matrix whereas Y is a matrix of latent features. 
By assuming that Y ∼ N 0, σY IK×D and 2 X1:n = Z1:n Y + σX εn where εn ∼ N 0, σX In×D , we obtain p(X1:n |Z1:n ) ∝ −D/2 2 2 + Z+T Z+ + σX /σY IKn 1:n 1:n + (n−Kn )D σX exp − + Kn D σY 2 2 + where Σ−1 = I − Z+ Z+T Z+ + σX /σY IKn n 1:n 1:n 1:n −1 1 T −1 2 tr X1:n Σn X1:n 2σX (17) + Z+T , Kn the number of non-zero columns of 1:n + Z1:n and Z+ is the first Kn columns of Z1:n . To avoid having to set K, [5, 14] assume that Z1:n 1:n follows an IBP. The resulting posterior distribution p(Z1:n |X1:n ) can be estimated through MCMC [5] or SMC [14]. We consider here a different model where the object Xt is assumed to arrive at time index t and we want a prior distribution on Z1:n ensuring that objects close in time are more likely to possess similar features. To achieve this, we consider the simple directed graphical model D of Fig. 3 where the site numbering corresponds to a time index in that case and a perfect numbering of D is (1, 2, . . .). The set of parents pa(t) is composed of the r preceding sites {{t − r}, . . . , {t − 1}}. The time-varying IBP to sample from p(Z1:n ) associated to this directed graph follows from (16) and proceeds as follows. At time t = 1 + new new • Sample K1 ∼Poisson(α), set z1,i = 1 for i = 1, ..., K1 and set K1 = Knew . At times t = 2, . . . , r n + new ∼Poisson( α ). • For k = 1, . . . Kt , sample zt,k ∼ Ber( 1:t−1,k ) and Kt t t 7 ? ? - t−r - t−r+1 - . . . - t−1 - t - t+1 6 6 Figure 3: Directed graph. At times t = r + 1, . . . , n n + α new ∼Poisson( r+1 ). • For k = 1, . . . Kt , sample zt,k ∼ Ber( t−r:t−1,k ) and Kt r+1 + Here Kt is the total number of features appearing from time max(1, t − r) to t − 1 and nt−r:t−1,k the restriction of n1:t−1 to the r last customers. Using (17) and the prior distribution of Z1:n which can be sampled using the time-varying IBP described above, we can easily design an SMC method to sample from p(Z1:n |X1:n ). We do not detail it here. Note that contrary to [14], our algorithm does not require inverting a matrix whose dimension grows linearly with the size of the data but only a matrix of dimension r × r. In order to illustrate the model and SMC algorithm, we create 200 6 × 6 images using a ground truth Y consisting of 4 different 6 × 6 latent images. The 200 × 4 binary matrix was generated from Pr(zt,k = 1) = πt,k , where πt = ( .6 .5 0 0 ) if t = 1, . . . , 30, πt = ( .4 .8 .4 0 ) if t = 31, . . . , 50 and πt = ( 0 .3 .6 .6 ) if t = 51, . . . , 200. The order of the model is set to r = 50. The feature occurences Z1:n and true features Y and their estimates are represented in Figure 4. Two spurious features are detected by the model (features 2 and 5 on Fig. 3(c)) but quickly discarded (Fig. 4(d)). The algorithm is able to correctly estimate the varying prior occurences of the features over time. Feature1 Feature2 Feature1 Feature2 Feature3 20 20 40 40 60 60 Feature4 80 100 Feature4 Feature5 Feature6 Time Feature3 Time 80 100 120 120 140 140 160 160 180 200 180 1 2 3 200 4 Feature (a) 1 2 3 4 5 6 Feature (b) (c) (d) Figure 4: (a) True features, (b) True features occurences, (c) MAP estimate ZM AP and (d) associated E[Y|ZM AP ] t=20 t=50 t=20 t=50 t=100 t=200 t=100 t=200 (a) (b) Figure 5: (a) E[Xt |πt , Y] and (b) E[Xt |X1:t−1 ] at t = 20, 50, 100, 200. 5 Related work and Discussion The fixed-lag version of the time-varying DP of Caron et al. [1] is a special case of the proposed model when G is given by Fig. 3. The bivariate DP of Walker and Muliere [13] is also a special case when G has only two cliques. 
5 Related work and Discussion

The fixed-lag version of the time-varying DP of Caron et al. [1] is a special case of the proposed model when G is given by Fig. 3. The bivariate DP of Walker and Muliere [13] is also a special case when G has only two cliques.

In this paper, we have assumed that the structure of the graph was known beforehand and we have shown that many flexible models arise from this framework. It would be interesting in the future to investigate the case where the graphical structure is unknown and must be estimated from the data.

Acknowledgment
The authors thank the reviewers for their comments, which helped to improve the writing of the paper.

References
[1] F. Caron, M. Davy, and A. Doucet. Generalized Polya urn for time-varying Dirichlet process mixtures. In Uncertainty in Artificial Intelligence, 2007.
[2] A.P. Dawid and S.L. Lauritzen. Hyper Markov laws in the statistical analysis of decomposable graphical models. The Annals of Statistics, 21:1272–1317, 1993.
[3] M.D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90:577–588, 1995.
[4] P. Fearnhead. Particle filters for mixture models with an unknown number of components. Statistics and Computing, 14:11–21, 2004.
[5] T.L. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, 2006.
[6] D. Heinz. Building hyper Dirichlet processes for graphical models. Electronic Journal of Statistics, 3:290–315, 2009.
[7] J.F.C. Kingman. Random partitions in population genetics. Proceedings of the Royal Society of London, 361:1–20, 1978.
[8] S.L. Lauritzen. Graphical Models. Oxford University Press, 1996.
[9] P. Müller, F. Quintana, and G. Rosner. A method for combining inference across related nonparametric Bayesian models. Journal of the Royal Statistical Society B, 66:735–749, 2004.
[10] R.M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249–265, 2000.
[11] J. Pitman. Exchangeable and partially exchangeable random partitions. Probability Theory and Related Fields, 102:145–158, 1995.
[12] Y.W. Teh, M.I. Jordan, M.J. Beal, and D.M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101:1566–1581, 2006.
[13] S. Walker and P. Muliere. A bivariate Dirichlet process. Statistics and Probability Letters, 64:1–7, 2003.
[14] F. Wood and T.L. Griffiths. Particle filtering for nonparametric Bayesian matrix factorization. In Advances in Neural Information Processing Systems, 2007.
3 0.59780717 114 nips-2009-Indian Buffet Processes with Power-law Behavior
Author: Yee W. Teh, Dilan Gorur
Abstract: The Indian buffet process (IBP) is an exchangeable distribution over binary matrices used in Bayesian nonparametric featural models. In this paper we propose a three-parameter generalization of the IBP exhibiting power-law behavior. We achieve this by generalizing the beta process (the de Finetti measure of the IBP) to the stable-beta process and deriving the IBP corresponding to it. We find interesting relationships between the stable-beta process and the Pitman-Yor process (another stochastic process used in Bayesian nonparametric models with interesting power-law properties). We derive a stick-breaking construction for the stable-beta process, and find that our power-law IBP is a good model for word occurrences in document corpora. 1
4 0.59456176 69 nips-2009-Discrete MDL Predicts in Total Variation
Author: Marcus Hutter
Abstract: The Minimum Description Length (MDL) principle selects the model that has the shortest code for data plus model. We show that for a countable class of models, MDL predictions are close to the true distribution in a strong sense. The result is completely general. No independence, ergodicity, stationarity, identifiability, or other assumption on the model class needs to be made. More formally, we show that for any countable class of models, the distributions selected by MDL (or MAP) asymptotically predict (merge with) the true measure in the class in total variation distance. Implications for non-i.i.d. domains like time-series forecasting, discriminative learning, and reinforcement learning are discussed. 1
5 0.58512181 78 nips-2009-Efficient Moments-based Permutation Tests
Author: Chunxiao Zhou, Huixia J. Wang, Yongmei M. Wang
Abstract: In this paper, we develop an efficient moments-based permutation test approach to improve the test's computational efficiency by approximating the permutation distribution of the test statistic with Pearson distribution series. This approach involves the calculation of the first four moments of the permutation distribution. We propose a novel recursive method to derive these moments theoretically and analytically without any permutation. Experimental results using different test statistics are demonstrated using simulated data and real data. The proposed strategy takes advantage of nonparametric permutation tests and parametric Pearson distribution approximation to achieve both accuracy and efficiency.

1 Introduction

Permutation tests are flexible nonparametric alternatives to parametric tests in small samples, or when the distribution of a test statistic is unknown or mathematically intractable. In permutation tests, no statistical assumptions beyond exchangeability are required. The p-values can be obtained by using the permutation distribution. Permutation tests are appealing in many biomedical studies, which often have limited observations with unknown distribution. They have been used successfully in structural MR image analysis [1, 2, 3], in functional MR image analysis [4], and in 3D face analysis [5].

There are three common approaches to constructing the permutation distribution [6, 7, 8]: (1) exact permutation, enumerating all possible arrangements; (2) approximate permutation, based on random sampling from all possible permutations; (3) approximate permutation, using the analytical moments of the exact permutation distribution under the null hypothesis. The main disadvantage of exact permutation is the computational cost, due to the factorial increase in the number of permutations with the number of subjects. The second technique often gives inflated type I errors caused by random sampling. When a large number of repeated tests are needed, the random permutation strategy is also computationally expensive to achieve satisfactory accuracy. Regarding the third approach, the exact permutation distribution may not have moments, or its moments may be intractable. In most applications, it is not the existence but the derivation of the moments that limits the third approach. To the best of our knowledge, there is no systematic and efficient way to derive the moments of the permutation distribution. Recently, Zhou [3] proposed a solution by converting the permutation of the data to that of the statistic coefficients, which are symmetric to the permutation. Since the test statistic coefficients usually have simple presentations, it is easier to track the permutation of the test statistic coefficients than that of the data. However, this method requires the derivation of the permutation for each specific test statistic, which is not accessible to practical users.

In this paper, we propose a novel strategy: we employ a general theoretical method to derive the moments of the permutation distribution of any weighted v-statistic, for both univariate and multivariate data. We note that any moment of the permutation distribution for weighted v-statistics [9] can be considered as a summation of the product of a data function term and an index function term over a high-dimensional index set and all possible permutations.
Our key idea is to divide the whole index set into several permutation-equivalent (see Definition 2) index subsets such that the summation of the data/index function term over all permutations is invariant within each subset and can be calculated without conducting any permutation. We can then obtain the moments by summing up these subtotals. The proposed method can be extended to equivalent weighted v-statistics by replacing them with monotonic weighted v-statistics. This is due to the fact that only the order of the test statistics over all permutations matters for obtaining the p-values, so a monotonic weighted v-statistic shares the same p-value with the original test statistic. Given the first four moments, the permutation distribution can be well fitted by Pearson distribution series. The p-values are then obtained without conducting any real permutation.

For multiple comparisons of a two-group difference, given sample sizes n1 = 21 and n2 = 21 and a number of tests m = 2,000, we would need to conduct m × (n1 + n2)!/(n1! n2!) ≈ 1.1 × 10^15 permutations for the exact permutation test. Even with 20,000 random permutations per test, we still need m × 20,000 ≈ 4 × 10^7 permutations. Alternatively, our moments-based permutation method using the Pearson distribution approximation only involves the calculation of the first four analytically derived moments of the exact permutation distribution to achieve high accuracy (see Section 3). Instead of calculating test statistics at factorial scale with exact permutation, our moments-based permutation only requires computation of polynomial order. For example, the computational costs for the univariate mean difference test statistic and the modified multivariate Hotelling's T² test statistic [8] are O(n) and O(n³), respectively, where n = n1 + n2.

2 Methodology

In this section, we shall mainly discuss how to calculate the moments of the permutation distribution for weighted v-statistics. For other test statistics, a possible solution is to replace them with equivalent weighted v-statistics via monotonic transforms. A detailed discussion of equivalent test statistics can be found in [7, 8, 10].

2.1 Computational challenge

Let us first look at a toy example. Suppose we have two-group univariate data x = (x_1, ..., x_{n1}, x_{n1+1}, ..., x_{n1+n2}), where the first n1 elements are in group A and the remaining n2 are in group B. For comparison of the two groups, the hypothesis is typically constructed as H_0: μ_A = μ_B vs. H_a: μ_A ≠ μ_B, where μ_A, μ_B are the population means of groups A and B, respectively. Define x̄_A = Σ_{i=1}^{n1} x_i / n1 and x̄_B = Σ_{i=n1+1}^{n} x_i / n2 as the sample means of the two groups, where n = n1 + n2. We choose the univariate group mean difference as the test statistic, i.e.,

T(x) = x̄_A − x̄_B = Σ_{i=1}^{n} w(i) x_i,

where the index function w(i) = 1/n1 if i ∈ {1, ..., n1} and w(i) = −1/n2 if i ∈ {n1 + 1, ..., n}. The total number of all possible permutations of {1, ..., n} is n!. To calculate the fourth moment of the permutation distribution,

E_π(T^4(x)) = (1/n!) Σ_{π ∈ S_n} ( Σ_{i=1}^{n} w(i) x_{π(i)} )^4 = (1/n!) Σ_{π ∈ S_n} Σ_{i1=1}^{n} Σ_{i2=1}^{n} Σ_{i3=1}^{n} Σ_{i4=1}^{n} w(i1) w(i2) w(i3) w(i4) x_{π(i1)} x_{π(i2)} x_{π(i3)} x_{π(i4)},

where π denotes the permutation operator and the symmetric group S_n [11] includes all distinct permutations. The above example shows that the moment calculation can be considered as a summation over all possible permutations and a large index set.
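A small illustration of what "moments without permutation" buys: for this mean-difference statistic, the first two permutation moments have closed forms, E_π(T) = 0 and Var_π(T) = s²(1/n1 + 1/n2), where s² is the unbiased pooled sample variance. This is a standard finite-population calculation, not the paper's recursive method, which also delivers the third and fourth moments for general weighted v-statistics. The sketch below checks these closed forms against brute-force enumeration of all n! permutations on a tiny sample; the data and group sizes are arbitrary assumptions for the illustration.

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(2)
n1, n2 = 4, 3                       # tiny groups so that n! = 5040 is enumerable
n = n1 + n2
x = rng.normal(size=n)
w = np.concatenate([np.full(n1, 1.0 / n1), np.full(n2, -1.0 / n2)])

# exact permutation distribution of T by brute force: n! evaluations of the statistic
T_perm = np.array([w @ x[list(p)] for p in permutations(range(n))])
print("enumerated permutations:", T_perm.size)

# first two moments from enumeration ...
m1_enum, m2_enum = T_perm.mean(), T_perm.var()

# ... and the same moments without any permutation (finite-population formulas)
s2 = x.var(ddof=1)                  # unbiased pooled sample variance
print("mean:    ", m1_enum, "vs analytic", 0.0)
print("variance:", m2_enum, "vs analytic", s2 * (1.0 / n1 + 1.0 / n2))
```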
It is noticeable that the computational challenge here is to go through the factorial-level permutations and the polynomial-level indices.

2.2 Partition the index set

In this paper, we assume that the test statistic T can be expressed as a weighted v-statistic of degree d [9], that is,

T(x) = Σ_{i1=1}^{n} ··· Σ_{id=1}^{n} w(i1, ..., id) h(x_{i1}, ..., x_{id}),

where x = (x_1, x_2, ..., x_n)^T is a data vector with n observations, w is a symmetric index function, and h is a symmetric data function, i.e., invariant under permutation of (i1, ..., id). Though the symmetry property is not required for our method, it helps reduce the computational cost. Here, each observation x_k can be either univariate or multivariate. In the above toy example, d = 1 and h is the identity function.

Therefore, the r-th moment of the test statistic computed from the permuted data is

E_π(T^r(x)) = E_π[ ( Σ_{i1,...,id} w(i1, ..., id) h(x_{π(i1)}, ..., x_{π(id)}) )^r ]
            = E_π[ Σ_{i1^(1),...,id^(1), ..., i1^(r),...,id^(r)} { Π_{k=1}^{r} w(i1^(k), ..., id^(k)) Π_{k=1}^{r} h(x_{π(i1^(k))}, ..., x_{π(id^(k))}) } ].

Then we can exchange the order of the summation over permutations and the summation over indices:

E_π(T^r(x)) = Σ_{i1^(1),...,id^(1), ..., i1^(r),...,id^(r)} { ( Π_{k=1}^{r} w(i1^(k), ..., id^(k)) ) E_π( Π_{k=1}^{r} h(x_{π(i1^(k))}, ..., x_{π(id^(k))}) ) }.

Thus any moment of the permutation distribution can be considered as a summation of the product of a data function term and an index function term over a high-dimensional index set and all possible permutations. Since all possible permutations map any index value between 1 and n to all possible index values from 1 to n with equal probability, the data function term E_π( Π_{k=1}^{r} h(x_{π(i1^(k))}, ..., x_{π(id^(k))}) ), i.e., the average of the data function over all permutations, depends only on the equal/unequal relationships among the indices. It is therefore natural to divide the whole index set U = {i1, ..., id}^r = {(i1^(1), ..., id^(1)), ..., (i1^(r), ..., id^(r))} into a union of disjoint index subsets on which E_π( Π_{k=1}^{r} h(x_{π(i1^(k))}, ..., x_{π(id^(k))}) ) is invariant.

Definition 1. Since h is a symmetric function, two index elements (i1, ..., id) and (j1, ..., jd) are said to be equivalent if they are the same up to order. For example, for d = 3, (1, 4, 5) = (1, 5, 4) = (4, 1, 5) = (4, 5, 1) = (5, 1, 4) = (5, 4, 1).

Definition 2. Two indices {(i1^(1), ..., id^(1)), ..., (i1^(r), ..., id^(r))} and {(j1^(1), ..., jd^(1)), ..., (j1^(r), ..., jd^(r))} are said to be permutation equivalent if there exists a permutation π ∈ S_n such that {(π(i1^(1)), ..., π(id^(1))), ..., (π(i1^(r)), ..., π(id^(r)))} = {(j1^(1), ..., jd^(1)), ..., (j1^(r), ..., jd^(r))}.
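The partition idea can be made concrete for the toy statistic (d = 1, r = 2): the index pairs (i1, i2) fall into two permutation-equivalence patterns, i1 = i2 and i1 ≠ i2, and within each pattern the average of the data function over permutations is a single closed-form quantity. The sketch below, an illustration rather than the paper's general recursive algorithm, computes E_π(T²) this way, with no permutation at all, and verifies it by enumeration on a tiny sample.

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(3)
n1, n2 = 4, 3
n = n1 + n2
x = rng.normal(size=n)
w = np.concatenate([np.full(n1, 1.0 / n1), np.full(n2, -1.0 / n2)])

# --- second moment via the index-set partition (no permutations at all) ---
# pattern i1 == i2: data term E_pi[x_{pi(i)}^2] is the same for every i
data_eq = np.mean(x ** 2)
index_eq = np.sum(w ** 2)
# pattern i1 != i2: data term E_pi[x_{pi(i1)} x_{pi(i2)}] is the same for every unequal pair
data_neq = (np.sum(x) ** 2 - np.sum(x ** 2)) / (n * (n - 1))
index_neq = np.sum(w) ** 2 - np.sum(w ** 2)
m2_partition = index_eq * data_eq + index_neq * data_neq

# --- same moment by brute-force averaging over all n! permutations ---
m2_enum = np.mean([(w @ x[list(p)]) ** 2 for p in permutations(range(n))])

print(m2_partition, m2_enum)   # agree up to floating-point error
```

The paper's contribution is the general version of this bookkeeping: a recursive derivation of the per-pattern index and data subtotals for arbitrary degree d and moment order r, which is then combined with a Pearson-series fit to approximate the permutation p-value.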
6 0.57924461 7 nips-2009-A Data-Driven Approach to Modeling Choice
7 0.57123595 206 nips-2009-Riffled Independence for Ranked Data
8 0.51800269 100 nips-2009-Gaussian process regression with Student-t likelihood
9 0.49111521 29 nips-2009-An Infinite Factor Model Hierarchy Via a Noisy-Or Mechanism
10 0.4793542 152 nips-2009-Measuring model complexity with the prior predictive
11 0.46942657 11 nips-2009-A General Projection Property for Distribution Families
12 0.46135858 244 nips-2009-The Wisdom of Crowds in the Recollection of Order Information
13 0.45454025 112 nips-2009-Human Rademacher Complexity
14 0.45087403 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes
15 0.44613448 123 nips-2009-Large Scale Nonparametric Bayesian Inference: Data Parallelisation in the Indian Buffet Process
16 0.44600084 25 nips-2009-Adaptive Design Optimization in Experiments with People
17 0.44473544 42 nips-2009-Bayesian Sparse Factor Models and DAGs Inference and Comparison
18 0.44028515 8 nips-2009-A Fast, Consistent Kernel Two-Sample Test
19 0.43364 226 nips-2009-Spatial Normalized Gamma Processes
20 0.43119776 255 nips-2009-Variational Inference for the Nested Chinese Restaurant Process
topicId topicWeight
[(24, 0.063), (25, 0.078), (35, 0.056), (36, 0.097), (39, 0.047), (58, 0.093), (61, 0.023), (66, 0.011), (71, 0.085), (81, 0.011), (86, 0.081), (90, 0.249), (91, 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.81240904 59 nips-2009-Construction of Nonparametric Bayesian Models from Parametric Bayes Equations
Author: Peter Orbanz
Abstract: We consider the general problem of constructing nonparametric Bayesian models on infinite-dimensional random objects, such as functions, infinite graphs or infinite permutations. The problem has generated much interest in machine learning, where it is treated heuristically, but has not been studied in full generality in nonparametric Bayesian statistics, which tends to focus on models over probability distributions. Our approach applies a standard tool of stochastic process theory, the construction of stochastic processes from their finite-dimensional marginal distributions. The main contribution of the paper is a generalization of the classic Kolmogorov extension theorem to conditional probabilities. This extension allows a rigorous construction of nonparametric Bayesian models from systems of finitedimensional, parametric Bayes equations. Using this approach, we show (i) how existence of a conjugate posterior for the nonparametric model can be guaranteed by choosing conjugate finite-dimensional models in the construction, (ii) how the mapping to the posterior parameters of the nonparametric model can be explicitly determined, and (iii) that the construction of conjugate models in essence requires the finite-dimensional models to be in the exponential family. As an application of our constructive framework, we derive a model on infinite permutations, the nonparametric Bayesian analogue of a model recently proposed for the analysis of rank data. 1
2 0.64460999 260 nips-2009-Zero-shot Learning with Semantic Output Codes
Author: Mark Palatucci, Dean Pomerleau, Geoffrey E. Hinton, Tom M. Mitchell
Abstract: We consider the problem of zero-shot learning, where the goal is to learn a classifier f : X → Y that must predict novel values of Y that were omitted from the training set. To achieve this, we define the notion of a semantic output code classifier (SOC) which utilizes a knowledge base of semantic properties of Y to extrapolate to novel classes. We provide a formalism for this type of classifier and study its theoretical properties in a PAC framework, showing conditions under which the classifier can accurately predict novel classes. As a case study, we build a SOC classifier for a neural decoding task and show that it can often predict words that people are thinking about from functional magnetic resonance images (fMRI) of their neural activity, even without training examples for those words. 1
3 0.64409202 169 nips-2009-Nonlinear Learning using Local Coordinate Coding
Author: Kai Yu, Tong Zhang, Yihong Gong
Abstract: This paper introduces a new method for semi-supervised learning on high dimensional nonlinear manifolds, which includes a phase of unsupervised basis learning and a phase of supervised function learning. The learned bases provide a set of anchor points to form a local coordinate system, such that each data point x on the manifold can be locally approximated by a linear combination of its nearby anchor points, and the linear weights become its local coordinate coding. We show that a high dimensional nonlinear function can be approximated by a global linear function with respect to this coding scheme, and the approximation quality is ensured by the locality of such coding. The method turns a difficult nonlinear learning problem into a simple global linear learning problem, which overcomes some drawbacks of traditional local learning methods. 1
4 0.64259875 97 nips-2009-Free energy score space
Author: Alessandro Perina, Marco Cristani, Umberto Castellani, Vittorio Murino, Nebojsa Jojic
Abstract: A score function induced by a generative model of the data can provide a feature vector of a fixed dimension for each data sample. Data samples themselves may be of differing lengths (e.g., speech segments, or other sequence data), but as a score function is based on the properties of the data generation process, it produces a fixed-length vector in a highly informative space, typically referred to as a “score space”. Discriminative classifiers have been shown to achieve higher performance in appropriately chosen score spaces than is achievable by either the corresponding generative likelihood-based classifiers, or the discriminative classifiers using standard feature extractors. In this paper, we present a novel score space that exploits the free energy associated with a generative model. The resulting free energy score space (FESS) takes into account latent structure of the data at various levels, and can be trivially shown to lead to classification performance that at least matches the performance of the free energy classifier based on the same generative model, and the same factorization of the posterior. We also show that in several typical vision and computational biology applications the classifiers optimized in FESS outperform the corresponding pure generative approaches, as well as a number of previous approaches to combining discriminating and generative models.
5 0.64128739 158 nips-2009-Multi-Label Prediction via Sparse Infinite CCA
Author: Piyush Rai, Hal Daume
Abstract: Canonical Correlation Analysis (CCA) is a useful technique for modeling dependencies between two (or more) sets of variables. Building upon the recently suggested probabilistic interpretation of CCA, we propose a nonparametric, fully Bayesian framework that can automatically select the number of correlation components, and effectively capture the sparsity underlying the projections. In addition, given (partially) labeled data, our algorithm can also be used as a (semi)supervised dimensionality reduction technique, and can be applied to learn useful predictive features in the context of learning a set of related tasks. Experimental results demonstrate the efficacy of the proposed approach for both CCA as a stand-alone problem, and when applied to multi-label prediction. 1
6 0.64074427 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference
7 0.64072746 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction
8 0.64050215 113 nips-2009-Improving Existing Fault Recovery Policies
9 0.63990158 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions
10 0.63863814 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs
11 0.63749486 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models
12 0.63747674 19 nips-2009-A joint maximum-entropy model for binary neural population patterns and continuous signals
13 0.63641977 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior
14 0.63526082 215 nips-2009-Sensitivity analysis in HMMs with application to likelihood maximization
15 0.63473952 3 nips-2009-AUC optimization and the two-sample problem
16 0.6327765 71 nips-2009-Distribution-Calibrated Hierarchical Classification
17 0.63257676 112 nips-2009-Human Rademacher Complexity
18 0.63193071 226 nips-2009-Spatial Normalized Gamma Processes
19 0.63076788 90 nips-2009-Factor Modeling for Advertisement Targeting
20 0.63021243 254 nips-2009-Variational Gaussian-process factor analysis for modeling spatio-temporal data