nips nips2013 nips2013-43 knowledge-graph by maker-knowledge-mining

43 nips-2013-Auxiliary-variable Exact Hamiltonian Monte Carlo Samplers for Binary Distributions

Source: pdf

Author: Ari Pakman, Liam Paninski

Abstract: We present a new approach to sample from generic binary distributions, based on an exact Hamiltonian Monte Carlo algorithm applied to a piecewise continuous augmentation of the binary distribution of interest. An extension of this idea to distributions over mixtures of binary and possibly-truncated Gaussian or exponential variables allows us to sample from posteriors of linear and probit regression models with spike-and-slab priors and truncated parameters. We illustrate the advantages of these algorithms in several examples in which they outperform the Metropolis or Gibbs samplers.

Reference: text

Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 An extension of this idea to distributions over mixtures of binary and possibly-truncated Gaussian or exponential variables allows us to sample from posteriors of linear and probit regression models with spike-and-slab priors and truncated parameters. [sent-2, score-0.338]

2 1 Introduction Mapping a problem involving discrete variables into continuous variables often results in a more tractable formulation. [sent-4, score-0.099]

3 The HMC method is a Markov Chain Monte Carlo algorithm that usually performs better than Metropolis or Gibbs samplers, because it manages to propose transitions in the Markov chain that lie far apart in the sampling space while maintaining a reasonable acceptance rate for these proposals. [sent-6, score-0.141]

4 The algorithms we present in this work are special because the Hamiltonian equations of motion can be integrated exactly, so there is no need for tuning a step-size parameter and the Markov chain always accepts the proposed moves. [sent-8, score-0.149]

5 Similar ideas have been used recently to sample from truncated Gaussian multivariate distributions [2], allowing much faster sampling than other methods. [sent-9, score-0.205]

6 Since the method we present transforms a binary sampling problem into a continuous one, it is natural to extend it to distributions deﬁned over mixtures of binary and Gaussian or exponential variables, transforming them into purely continuous distributions. [sent-11, score-0.298]

7 In particular, we show how to sample from the posterior of linear and probit regression models with spike-and-slab priors, while also imposing truncations in the parameter space (e. [sent-13, score-0.19]

8 The method we use to map binary to continuous variables consists in simply identifying a binary variable with the sign of a continuous one. [sent-16, score-0.289]

9 An alternative relaxation of binary to continuous variables, known in statistical physics as the “Gaussian integral trick” [3], has been used recently to apply HMC methods to binary distributions [4], but the details of that method are different from ours. [sent-17, score-0.202]

10 2 Binary distributions We are interested in sampling from a probability distribution $p(s)$ defined over d-dimensional binary vectors $s \in \{-1, +1\}^d$, and given in terms of a function $f(s)$ as $p(s) = \frac{1}{Z} f(s)$. [sent-19, score-0.12]

11 Let us augment the distribution $p(s)$ with continuous variables $y \in \mathbb{R}^d$ as $p(s, y) = p(s)\,p(y|s)$ (2) where $p(y|s)$ is non-zero only in the orthant defined by $s_i = \mathrm{sign}(y_i)$, i = 1, . [sent-21, score-0.294]

12 In the second line we have made explicit that for each y, only one term in the sum in (4) is non-zero, so that p(y) is piecewise deﬁned in each orthant. [sent-26, score-0.065]

13 The idea is to define a potential energy function $U(y) = -\log p(y|s) - \log f(s)$ (6), introduce momentum variables $q_i$, and consider the piecewise continuous Hamiltonian $H(y, q) = U(y) + \frac{q \cdot q}{2}$ (7), whose value is identified with the energy of a particle moving in a d-dimensional space. [sent-28, score-0.52]

14 In each iteration of the sampling algorithm, we sample initial values $q(0)$ for the momenta from a standard Gaussian distribution and let the particle move during a time $T$ according to the equations of motion $\dot{y}(t) = \frac{\partial H}{\partial q(t)}$, $\dot{q}(t) = -\frac{\partial H}{\partial y(t)}$ (8). [sent-30, score-0.337]

15 The final coordinates, $y(T)$, belong to a Markov chain with invariant distribution $p(y)$, and are used as the initial coordinates of the next iteration. [sent-31, score-0.118]

16 The detailed balance condition follows directly from the conservation of energy and (y, q)-volume along the trajectory dictated by (8), see [1, 2] for details. [sent-32, score-0.163]

17 As the particle moves, the potential energy $U(y)$ and the kinetic energy $\frac{q \cdot q}{2}$ change in tandem, keeping the value of the Hamiltonian (7) constant. [sent-34, score-0.27]

18 But this smooth interchange gets interrupted when any coordinate reaches zero. [sent-35, score-0.154]

19 Suppose this ﬁrst happens at time tj for coordinate yj , and assume that yj < 0 for t < tj . [sent-36, score-0.734]

20 Conservation of energy imposes now a jump on the momentum qj as a result of the discontinuity in U (y). [sent-37, score-0.338]

21 Let us call $q_j(t_j^-)$ and $q_j(t_j^+)$ the values of the momentum $q_j$ just before and after the coordinate hits $y_j = 0$. [sent-38, score-1.017]

22 In order to enforce conservation of energy, we equate the Hamiltonian at both sides of the $y_j = 0$ wall, giving $\frac{q_j^2(t_j^+)}{2} = \Delta_j + \frac{q_j^2(t_j^-)}{2}$ (9) with $\Delta_j = U(y_j = 0, s_j = -1) - U(y_j = 0, s_j = +1)$ (10). [sent-39, score-0.697]

23 If eq. (9) gives a positive value for $q_j^2(t_j^+)$, the coordinate $y_j$ crosses the boundary and continues its trajectory in the new orthant. [sent-40, score-0.671]

24 (9) gives a negative value for $q_j^2(t_j^+)$, the particle is reflected from the $y_j = 0$ wall and continues its trajectory with $q_j(t_j^+) = -q_j(t_j^-)$. [sent-42, score-0.925]

25 When $\Delta_j < 0$, the situation can be understood as the limit of a scenario in which the particle faces an upward hill in the potential energy, causing it to diminish its velocity until it either reaches the top of the hill with a lower velocity or stops and then reverses. [sent-43, score-0.396]

26 In the limit in which the hill has ﬁnite height but inﬁnite slope, the velocity change occurs discontinuously at one instant. [sent-44, score-0.11]

27 Note that we used in (9) that the momenta $q_{i \neq j}$ are continuous, since this sudden infinite slope hill is only seen by the $y_j$ coordinate. [sent-45, score-0.313]

28 Regardless of whether the particle bounces or crosses the yj = 0 wall, the other coordinates move unperturbed until the next boundary hit, where a similar crossing or reﬂection occurs, and so on, until the ﬁnal position y(T ). [sent-46, score-0.525]
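
The cross-or-reflect rule of eqs. (9) and (10) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `momentum_after_wall` and its arguments are hypothetical, and `delta` is assumed to be the drop in potential energy seen in the direction of travel (it equals $\Delta_j$ of eq. (10) when the coordinate arrives from $y_j < 0$).

```python
import math


def momentum_after_wall(q_minus: float, delta: float):
    """Momentum of one coordinate just after it hits its y_j = 0 wall.

    q_minus : momentum just before the hit, q_j(t_j^-)
    delta   : U(old side) - U(new side) at y_j = 0 (assumed sign convention)
    Returns (q_plus, crossed).
    """
    # Eq. (9) multiplied by two: q_plus^2 = q_minus^2 + 2 * delta.
    q_plus_sq = q_minus * q_minus + 2.0 * delta
    if q_plus_sq > 0.0:
        # Enough kinetic energy: cross the wall, keep the direction of motion.
        return math.copysign(math.sqrt(q_plus_sq), q_minus), True
    # Not enough kinetic energy: reflect, the speed is unchanged.
    return -q_minus, False
```

A positive `delta` (moving to a more probable orthant) always leads to a crossing with a larger speed, while a negative `delta` leads either to a slower crossing or to a reflection.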

29 The framework we presented above is very general and in order to implement a particular sampler we need to select the distributions p(y|s). [sent-47, score-0.189]

30 1 Gaussian augmentation Let us consider first for $p(y|s)$ the truncated Gaussians $p(y|s) = (2/\pi)^{d/2} e^{-\frac{y \cdot y}{2}}$ for $\mathrm{sign}(y_i) = s_i$, $i = 1, \ldots, d$, and $0$ otherwise (11). [sent-50, score-0.325]

31 The equations of motion (8) lead to $\ddot{y}(t) = -y(t)$, $\ddot{q}(t) = -q(t)$, and have a solution $y_i(t) = y_i(0)\cos(t) + q_i(0)\sin(t)$ (12) $= u_i \sin(\phi_i + t)$ (13), $q_i(t) = -y_i(0)\sin(t) + q_i(0)\cos(t)$ (14) $= u_i \cos(\phi_i + t)$ (15). [sent-53, score-0.407]

32 This setting is similar to the case studied in [2] and from $\phi_i = \tan^{-1}(y_i(0)/q_i(0))$ the boundary hit times $t_i$ are easily obtained. [sent-54, score-0.139]

33 When a boundary is reached, say $y_j = 0$, the coordinate $y_j$ changes its trajectory for $t > t_j$ as $y_j(t) = q_j(t_j^+) \sin(t - t_j)$ (16), with the value of $q_j(t_j^+)$ obtained as described above. [sent-55, score-1.423]

34 Choosing an appropriate value for the travel time $T$ is crucial when using HMC algorithms [5]. [sent-56, score-0.075]

35 As is clear from (13), if we let the particle travel during a time $T > \pi$, each coordinate reaches zero at least once, and the hitting times can be ordered as $0 < t_{j_1} \le t_{j_2} \le \cdots \le t_{j_d} < \pi$ (17). [sent-57, score-0.37]

36 Moreover, regardless of whether a coordinate crosses zero or gets reflected, it follows from (16) that the successive hits occur at $t_i + n\pi$, $n = 1, 2, \ldots$ (18) [sent-58, score-0.317]

37 and therefore the hitting times only need to be computed once for each coordinate in every iteration. [sent-61, score-0.119]

38 If we let the particle move during a time T = nπ, each coordinate reaches zero n times, in the cyclical order (17), with a computational cost of O(nd) from wall hits. [sent-62, score-0.429]

39 As we just showed, between yj (0) and yj (π) the coordinate touches the boundary yj = 0 once, and if yj gets reﬂected off the boundary, it is easy to check that we have yj (π) = yj (0). [sent-64, score-1.427]

40 If we take T = nπ and the particle gets reﬂected all the n times it hits the boundary, we get yj (T ) = yj (0) and the coordinate yj does not move at all. [sent-65, score-1.009]

41 To avoid these singular situations, a good choice is $T = (n + \frac{1}{2})\pi$, which generalizes the recommended travel time $T = \pi/2$ for truncated Gaussians in [2]. [sent-66, score-0.162]

42 With $T = (n + \frac{1}{2})\pi$, the total cost of each sample is $O((n + 1/2)d)$ on average from wall hits, plus $O(d)$ from the sampling of $q(0)$ and from the $d$ inverse trigonometric functions to obtain the hit times $t_i$. [sent-68, score-0.263]

43 But in complex distributions, the computational cost is dominated by the evaluation of $\Delta_i$ in (10) at each wall hit. [sent-69, score-0.146]

44 Interestingly, the rate at which wall yi = 0 is crossed coincides with the acceptance rate in a Metropolis algorithm that samples uniformly a value for i and makes a proposal of ﬂipping the binary variable si . [sent-70, score-0.491]

45 Of course, this does not mean that the HMC algorithm is the same as Metropolis, because in HMC the order in which the walls are hit is fixed given the initial velocity, and the values of $q_i^2$ at successive hits of $y_i = 0$ within the same iteration are not independent. [sent-72, score-0.386]
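
Putting the pieces of the Gaussian augmentation together, the sampler described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `gaussian_hmc_binary` and `log_f` are hypothetical, `log_f(s)` is assumed to return $\log f(s)$ for a vector of ±1 entries, and $\log f$ is re-evaluated from scratch at every wall hit for clarity, whereas a practical implementation would compute the difference in (10) incrementally. Each coordinate is stored as $y_j(t) = s_j u_j \sin(t - t^0_j)$, with $t^0_j$ the time of its last zero, so that its next hit is at $t^0_j + \pi$.

```python
import heapq

import numpy as np


def gaussian_hmc_binary(log_f, y0, n_iter, T=1.5 * np.pi, seed=0):
    """Sketch of exact HMC with the Gaussian augmentation (hypothetical API).

    log_f : callable mapping a +/-1 integer vector s to log f(s)
    y0    : initial continuous state, all entries non-zero
    Returns an (n_iter, d) array with the binary samples s = sign(y(T)).
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y0, dtype=float).copy()
    d = y.size
    samples = np.empty((n_iter, d), dtype=int)
    for it in range(n_iter):
        q = rng.standard_normal(d)          # fresh momenta q(0)
        s = np.where(y >= 0, 1, -1)         # current orthant
        u = np.hypot(y, q)                  # amplitudes u_i of eq. (13)
        phi = np.arctan2(y, q)              # phases phi_i
        t_hit = np.mod(-phi, np.pi)         # first zero of sin(phi_i + t)
        t_hit[t_hit == 0.0] = np.pi
        t0 = t_hit - np.pi                  # (virtual) time of the previous zero
        log_f_s = log_f(s)
        heap = [(t_hit[j], j) for j in range(d)]
        heapq.heapify(heap)
        while heap[0][0] < T:
            t, j = heapq.heappop(heap)
            s_new = s.copy()
            s_new[j] = -s[j]
            log_f_new = log_f(s_new)
            # Eq. (9) times two; |q_j| equals u_j when y_j = 0.
            q_plus_sq = u[j] ** 2 + 2.0 * (log_f_new - log_f_s)
            if q_plus_sq > 0.0:             # cross into the new orthant
                s, log_f_s = s_new, log_f_new
                u[j] = np.sqrt(q_plus_sq)
            # Otherwise the coordinate is reflected: s and u are unchanged.
            t0[j] = t
            heapq.heappush(heap, (t + np.pi, j))   # next hit, eq. (18)
        y = s * u * np.sin(T - t0)          # position at time T, eq. (16)
        samples[it] = s
    return samples
```

For example, with independent spins, `log_f = lambda s: float(h @ s)` for some hypothetical bias vector `h`, the fraction of samples with $s_i = +1$ should approach $1/(1 + e^{-2h_i})$.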

46 2 Exponential and other augmentations Another distribution that allows an exact solution of the equations of motion (8) is $p(y|s) = e^{-\sum_{i=1}^{d} |y_i|}$ for $\mathrm{sign}(y_i) = s_i$, $i = 1, \ldots, d$, and $0$ otherwise (19), [sent-74, score-0.294]

47 which leads to the equations of motion $\ddot{y}_i(t) = -s_i$, with solutions of the form $y_i(t) = y_i(0) + q_i(0)\,t - \frac{s_i t^2}{2}$ (20). [sent-77, score-0.316]

48 In this case, the initial hit time for every coordinate is the solution of the quadratic equation $y_i(t) = 0$. [sent-78, score-0.225]

49 But, unlike the case of the Gaussian augmentation, the order of successive hits is not fixed. [sent-79, score-0.155]

50 Indeed, if coordinate $y_j$ hits zero at time $t_j$, it continues its trajectory as $y_j(t > t_j) = q_j(t_j^+)(t - t_j) - \frac{s_j}{2}(t - t_j)^2$ (21), so the next wall hit $y_j = 0$ will occur at a time $t_j'$ given by $(t_j' - t_j) = 2|q_j(t_j^+)|$ (22), where we used $s_j = \mathrm{sign}(q_j(t_j^+))$. [sent-80, score-2.067]

51 So we see that the time between successive hits of the same coordinate depends only on its momentum after the last hit. [sent-81, score-0.32]

52 Moreover, since the value of $|q_j(t_j^+)|$ is smaller than $|q_j(t_j^-)|$ if the coordinate crosses to an orthant of lower probability, equation (22) implies that the particle moves away faster from areas of lower probability. [sent-82, score-0.351]
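
The hit times of the exponential augmentation follow directly from the quadratic trajectory (20) and from (22). The helpers below are a small sketch with hypothetical names (`position`, `first_hit_time`, `time_to_next_hit`), written for a single coordinate with non-zero initial value.

```python
import math


def position(y0: float, q0: float, t: float) -> float:
    """Trajectory of eq. (20) for one coordinate, valid until its first hit."""
    s = 1.0 if y0 > 0 else -1.0
    return y0 + q0 * t - 0.5 * s * t * t


def first_hit_time(y0: float, q0: float) -> float:
    """Smallest positive root of eq. (20), i.e. the first time y(t) = 0."""
    s = 1.0 if y0 > 0 else -1.0
    return s * q0 + math.sqrt(q0 * q0 + 2.0 * abs(y0))


def time_to_next_hit(q_plus: float) -> float:
    """Eq. (22): time until the same coordinate hits its wall again."""
    return 2.0 * abs(q_plus)
```

Because `time_to_next_hit` depends on the momentum after the last hit, the order of the wall hits has to be tracked dynamically (for instance with a priority queue), unlike in the Gaussian case where the cyclic order is fixed.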

53 This is unlike the Gaussian augmentation, where a coordinate ‘waits in line’ until all the other coordinates touch their wall before hitting its wall again. [sent-83, score-0.448]

54 One could also define f (y|s) as a uniform distribution in a box such that the computation of the times for wall hits would become purely linear and we get a classical ‘billiards’ dynamics. [sent-85, score-0.294]

55 More generally, one could consider different augmentations in different orthants and potentially tailor the algorithm to mix faster in complex and multimodal distributions. [sent-86, score-0.129]

56 3 Spike-and-slab regression with truncated parameters The subject of Bayesian sparse regression has seen a lot of work during the last decade. [sent-87, score-0.181]

57 In this section we will show how the HMC binary sampler can be extended to sample from the posterior of these models. [sent-90, score-0.272]

58 The latter is a distribution over a set of binary and continuous variables, with the binary variables determining whether each coefﬁcient should be included in the model or not. [sent-91, score-0.196]

59 The idea is to map these indicator binary variables into continuous variables as we did above, obtaining a distribution from which we can sample again using exact HMC methods. [sent-92, score-0.185]

60 1 Linear regression Consider a regression problem with a log-likelihood that depends quadratically on its coefficients, such as $\log p(D|w) = -\frac{1}{2} w^\top M w + r \cdot w + \mathrm{const}$. [sent-95, score-0.094]

61 (24) Each binary variable $s_i = \pm 1$ has a Bernoulli prior $p(s_i|a) = a^{\frac{1+s_i}{2}} (1 - a)^{\frac{1-s_i}{2}}$ and determines whether the coefficient $w_i$ is included in the model. [sent-99, score-0.325]

62 In principle we could integrate out the weights w and obtain a collapsed distribution for s, but we are interested in cases in which the space of w is truncated and therefore the integration is not feasible. [sent-103, score-0.087]

63 An example would be when a non-negativity constraint wi ≥ 0 is imposed. [sent-104, score-0.09]

64 In these cases, one possible approach is to sample from (28) with a block Gibbs sampler over the pairs {wi , si }, as proposed in [10]. [sent-105, score-0.353]

65 Using the Gaussian augmentation (11), this gives a distribution $p(w, y|D, a, \tau^2) \propto e^{-\frac{1}{2} w_+^\top (M_+ + \tau^{-2}) w_+ + r_+ \cdot w_+} \, e^{-\frac{w_- \cdot w_-}{2\tau^2}} \, e^{-\frac{y \cdot y}{2}} \, a^{|s^+|} (1 - a)^{|s^-|}$ (30) where the values of $s$ in the rhs are obtained from the signs of $y$. [sent-110, score-0.065]

66 This is a piecewise Gaussian, different in each orthant of y, and possibly truncated in the w space. [sent-111, score-0.201]

67 Sampling from (30) gives us samples from the original distribution (28) using a simple rule: each pair $(w_i, y_i)$ becomes $(w_i, s_i = +1)$ if $y_i \ge 0$ and $(w_i = 0, s_i = -1)$ if $y_i < 0$. [sent-113, score-0.571]

68 This undoes the steps we took to transform (28) into (30): the identiﬁcation si = sign(yi ) takes us from p(w, y|D, a, τ 2 ) to p(w, s|D, a, τ 2 ), and setting wi = 0 when si = −1 undoes the replacement in (29). [sent-114, score-0.516]
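
The rule that turns a sample of the augmented distribution (30) into a spike-and-slab sample can be written in a couple of lines. This is only a sketch of that post-processing step; the function name `to_spike_and_slab` is hypothetical.

```python
import numpy as np


def to_spike_and_slab(w, y):
    """Map augmented samples (w, y) to (w, s): s_i = sign(y_i), w_i = 0 if s_i = -1."""
    w = np.asarray(w, dtype=float)
    y = np.asarray(y, dtype=float)
    s = np.where(y >= 0, 1, -1)            # s_i = +1 if y_i >= 0, else -1
    w_out = np.where(s == 1, w, 0.0)       # excluded coefficients are set to zero
    return w_out, s
```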

69 Since (30) is a piecewise Gaussian distribution, we can sample from it again using the methods of [2]. [sent-115, score-0.089]

70 As in that work, the possible truncations for w are given as gn (w) ≥ 0 for n = 1, . [sent-116, score-0.056]

71 The details are a simple extension of the purely binary case and are not very illuminating, so we leave them for the Appendix. [sent-120, score-0.088]

72 Each zi is truncated according to the sign of bi and we can also truncate the w space if we so desire. [sent-124, score-0.217]

73 1 1D Ising model We consider a 1D periodic Ising model, with $p(s) \propto e^{-\beta E[s]}$, where the energy is $E[s] = -\sum_{i=1}^{d} s_i s_{i+1}$, with $s_{d+1} = s_1$ and $\beta$ is the inverse temperature. [sent-128, score-0.278]

74 Figure 1 shows the first 1000 iterations of both the Gaussian HMC and the Metropolis sampler on a model with d = 400 and β = 0. [sent-129, score-0.156]

75 5π and, for the sake of comparable computational costs, for the Metropolis sampler we recorded the value of s every d × 12. [sent-132, score-0.179]

76 The plot shows clearly that the Markov chain mixes much faster with HMC than with Metropolis. [sent-134, score-0.163]

77 A useful variable that summarizes the behavior of the Markov chain is the magnetization $m = \frac{1}{d} \sum_{i=1}^{d} s_i$, whose expected value is $\langle m \rangle = 0$. [sent-135, score-0.334]
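
For the 1D periodic Ising model, the quantities needed by either sampler are cheap to compute. The sketch below uses hypothetical helper names and shows the energy, the magnetization, and the change in $\log f(s) = -\beta E[s]$ caused by flipping a single spin, which is the only model-specific quantity required at a wall hit (or in a Metropolis proposal) and depends only on the two neighbours of the flipped spin.

```python
import numpy as np


def ising_energy(s) -> float:
    """E[s] = -sum_i s_i s_{i+1} with periodic boundary s_{d+1} = s_1."""
    s = np.asarray(s)
    return -float(np.sum(s * np.roll(s, -1)))


def magnetization(s) -> float:
    """m = (1/d) sum_i s_i."""
    return float(np.mean(s))


def flip_delta_log_f(s, j: int, beta: float) -> float:
    """log f(s with spin j flipped) - log f(s), where log f(s) = -beta * E[s]."""
    d = len(s)
    return -2.0 * beta * s[j] * (s[(j - 1) % d] + s[(j + 1) % d])
```

Using the local difference avoids re-evaluating the full energy, so each wall hit costs O(1) in this model.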

78 The oscillations of both samplers around this value illustrate the superiority of the HMC sampler. [sent-136, score-0.096]

79 In the Appendix we present a more detailed comparison of the HMC Gaussian and exponential and the Metropolis samplers, showing that the Gaussian HMC sampler is the most efﬁcient among the three. [sent-137, score-0.156]

80 2 2D Ising model We consider now a 2D Ising model on a square lattice of size L × L with periodic boundary conditions below the critical temperature. [sent-139, score-0.13]

81 Starting from a completely disordered state, we compare the time it takes for the sampler to reach one of the two low energy states with magnetization $m \approx \pm 1$. [sent-140, score-0.306]

82 5π and a Metropolis sampler recording values of s every 2. [sent-144, score-0.156]

83 In general we see that the HMC sampler reaches higher likelihood regions faster. [sent-146, score-0.202]

84 [14]), for binary distributions, the Metropolis sampler that chooses a random spin and makes a proposal of ﬂipping its value, is more efﬁcient than the Gibbs sampler. [sent-149, score-0.218]

85 First 1000 iterations of Gaussian HMC and Metropolis samplers on a model with d = 400 and β = 0. [sent-151, score-0.061]

86 42, initialized with all spins si = 1 (black dots). [sent-152, score-0.208]

87 5π and in the Metropolis sampler we recorded the state of the Markov chain once every d × 12. [sent-154, score-0.26]

88 The plots show clearly that the HMC sampler mixes faster than Metropolis in this model. [sent-157, score-0.082]

89 First samples from 20 simulations in a 2D Ising model in a square lattice of side length L = 100 with periodic boundary conditions and inverse temperature β = 0. [sent-169, score-0.165]

90 5π and for Metropolis we recorded the state of the chain every 2. [sent-174, score-0.104]

91 The plots illustrate that in general HMC reaches equilibrium faster than Metropolis in this model. [sent-176, score-0.148]

92 Figure 1 shows that the HMC sampler explores the sampled space faster once the chain has reached its equilibrium distribution, while Figure 2 shows that the HMC sampler is faster in reaching the equilibrium distribution. [sent-178, score-0.527]

93 Comparison of the proposed HMC method with the Gibbs sampler of [10] for the posterior of a linear regression model with spike-and-slab prior, with a positivity constraint on the coefficients. [sent-185, score-0.27]

94 The plots show clearly that HMC mixes much faster than Gibbs and is more consistent in exploring areas of high probability. [sent-190, score-0.082]

95 3 Spike-and-slab linear regression with positive coefﬁcients We consider a linear regression model z = Xw + ε with the following synthetic data. [sent-192, score-0.094]

96 Figure 3 compares the results of the proposed HMC method versus the Gibbs sampler used in [10]. [sent-199, score-0.156]

97 For the HMC sampler we use a travel time T = π/2. [sent-201, score-0.231]

98 This results in a number of wall hits (both for w and y variables) of 150, which makes the computational cost of every HMC and Gibbs sample similar. [sent-202, score-0.292]

99 This impressive difference in the efﬁciency of HMC versus Gibbs is similar to the case of truncated multivariate Gaussians studied in [2]. [sent-204, score-0.087]

100 Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. [sent-277, score-0.114]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('hmc', 0.688), ('metropolis', 0.233), ('yj', 0.208), ('qj', 0.185), ('si', 0.173), ('sampler', 0.156), ('hamiltonian', 0.153), ('wall', 0.146), ('particle', 0.13), ('hits', 0.122), ('tj', 0.118), ('ising', 0.097), ('wi', 0.09), ('gibbs', 0.09), ('truncated', 0.087), ('coordinate', 0.082), ('chain', 0.081), ('magnetization', 0.08), ('travel', 0.075), ('yi', 0.075), ('boundary', 0.071), ('energy', 0.07), ('monte', 0.068), ('hit', 0.068), ('piecewise', 0.065), ('hill', 0.065), ('augmentation', 0.065), ('qi', 0.063), ('binary', 0.062), ('samplers', 0.061), ('probit', 0.058), ('carlo', 0.056), ('coef', 0.056), ('crosses', 0.054), ('augmentations', 0.053), ('conservation', 0.053), ('momentum', 0.05), ('orthant', 0.049), ('sign', 0.048), ('regression', 0.047), ('mixes', 0.046), ('reversible', 0.046), ('reaches', 0.046), ('continuous', 0.045), ('velocity', 0.045), ('gaussian', 0.044), ('markov', 0.044), ('bi', 0.043), ('ected', 0.041), ('acf', 0.04), ('momenta', 0.04), ('orthants', 0.04), ('undoes', 0.04), ('trajectory', 0.04), ('zi', 0.039), ('motion', 0.038), ('coordinates', 0.037), ('ip', 0.037), ('hitting', 0.037), ('positivity', 0.037), ('sin', 0.036), ('faster', 0.036), ('illustrate', 0.035), ('spins', 0.035), ('periodic', 0.035), ('acceptance', 0.035), ('temperature', 0.035), ('jump', 0.033), ('sj', 0.033), ('successive', 0.033), ('distributions', 0.033), ('mw', 0.033), ('ari', 0.033), ('coefficient', 0.033), ('equilibrium', 0.031), ('truncations', 0.031), ('pakman', 0.031), ('slab', 0.031), ('continues', 0.031), ('posterior', 0.03), ('equations', 0.03), ('gaussians', 0.03), ('ipping', 0.029), ('arxiv', 0.029), ('cients', 0.029), ('variables', 0.027), ('cos', 0.027), ('liam', 0.026), ('army', 0.026), ('xw', 0.026), ('purely', 0.026), ('gets', 0.026), ('move', 0.025), ('gn', 0.025), ('sampling', 0.025), ('bayesian', 0.025), ('iteration', 0.025), ('sample', 0.024), ('lattice', 0.024), ('recorded', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 43 nips-2013-Auxiliary-variable Exact Hamiltonian Monte Carlo Samplers for Binary Distributions

Author: Ari Pakman, Liam Paninski

2 0.5300414 123 nips-2013-Flexible sampling of discrete data correlations without the marginal distributions

Author: Alfredo Kalaitzis, Ricardo Silva

Abstract: Learning the joint dependence of discrete variables is a fundamental problem in machine learning, with many applications including prediction, clustering and dimensionality reduction. More recently, the framework of copula modeling has gained popularity due to its modular parameterization of joint distributions. Among other properties, copulas provide a recipe for combining ﬂexible models for univariate marginal distributions with parametric families suitable for potentially high dimensional dependence structures. More radically, the extended rank likelihood approach of Hoff (2007) bypasses learning marginal models completely when such information is ancillary to the learning task at hand as in, e.g., standard dimensionality reduction problems or copula parameter estimation. The main idea is to represent data by their observable rank statistics, ignoring any other information from the marginals. Inference is typically done in a Bayesian framework with Gaussian copulas, and it is complicated by the fact this implies sampling within a space where the number of constraints increases quadratically with the number of data points. The result is slow mixing when using off-the-shelf Gibbs sampling. We present an efﬁcient algorithm based on recent advances on constrained Hamiltonian Markov chain Monte Carlo that is simple to implement and does not require paying for a quadratic cost in sample size. 1 Contribution There are many ways of constructing multivariate discrete distributions: from full contingency tables in the small dimensional case [1], to structured models given by sparsity constraints [11] and (hierarchies of) latent variable models [6]. More recently, the idea of copula modeling [16] has been combined with such standard building blocks. Our contribution is a novel algorithm for efﬁcient Markov chain Monte Carlo (MCMC) for the copula framework introduced by [7], extending algorithmic ideas introduced by [17]. 
A copula is a continuous cumulative distribution function (CDF) with uniformly distributed univariate marginals in the unit interval [0, 1]. It complements graphical models and other formalisms that provide a modular parameterization of joint distributions. The core idea is simple and given by the following observation: suppose we are given a (say) bivariate CDF $F(y_1, y_2)$ with marginals $F_1(y_1)$ and $F_2(y_2)$. This CDF can then be rewritten as $F(F_1^{-1}(F_1(y_1)), F_2^{-1}(F_2(y_2)))$. The function $C(\cdot, \cdot)$ given by $F(F_1^{-1}(\cdot), F_2^{-1}(\cdot))$ is a copula. For discrete distributions, this decomposition is not unique but still well-defined [16]. Copulas have found numerous applications in statistics and machine learning since they provide a way of constructing flexible multivariate distributions by mix-and-matching different copulas with different univariate marginals. For instance, one can combine flexible univariate marginals $F_i(\cdot)$ with useful but more constrained high-dimensional copulas. We will not further motivate the use of copula models, which has been discussed at length in recent machine learning publications and conference workshops, and for which comprehensive textbooks exist [e.g., 9]. For a recent discussion on the applications of copulas from a machine learning perspective, [4] provides an overview. [10] is an early reference in machine learning. The core idea dates back at least to the 1950s [16]. In the discrete case, copulas can be difficult to apply: transforming a copula CDF into a probability mass function (PMF) is computationally intractable in general. For the continuous case, a common trick goes as follows: transform variables by defining $a_i \equiv \hat{F}_i(y_i)$ for an estimate of $F_i(\cdot)$ and then fit a copula density $c(\cdot, \ldots, \cdot)$ to the resulting $a_i$ [e.g. 9]. It is not hard to check this breaks down in the discrete case [7]. An alternative is to represent the CDF to PMF transformation for each data point by a continuous integral on a bounded space. 
Sampling methods can then be used. This trick has allowed many applications of the Gaussian copula to discrete domains. Readers familiar with probit models will recognize the similarities to models where an underlying latent Gaussian field is discretized into observable integers as in Gaussian process classifiers and ordinal regression [18]. Such models can be indirectly interpreted as special cases of the Gaussian copula. In what follows, we describe in Section 2 the Gaussian copula and the general framework for constructing Bayesian estimators of Gaussian copulas by [7], the extended rank likelihood framework. This framework entails computational issues which are discussed. A recent general approach for MCMC in constrained Gaussian fields by [17] can in principle be directly applied to this problem as a blackbox, but at a cost that scales quadratically in sample size and as such it is not practical in general. Our key contribution is given in Section 4. An application experiment on the Bayesian Gaussian copula factor model is performed in Section 5. Conclusions are discussed in the final section. 2 Gaussian copulas and the extended rank likelihood It is not hard to see that any multivariate Gaussian copula is fully defined by a correlation matrix $C$, since marginal distributions have no free parameters. In practice, the following equivalent generative model is used to define a sample $U$ according to a Gaussian copula $GC(C)$: 1. Sample $Z$ from a zero mean Gaussian with covariance matrix $C$; 2. For each $Z_j$, set $U_j = \Phi(z_j)$, where $\Phi(\cdot)$ is the CDF of the standard Gaussian. It is clear that each $U_j$ follows a uniform distribution in [0, 1]. To obtain a model for variables $\{y_1, y_2, \ldots, y_p\}$ with marginal distributions $F_j(\cdot)$ and copula $GC(C)$, one can add the deterministic step $y_j = F_j^{-1}(u_j)$. 
Now, given $n$ samples of observed data $Y \equiv \{y_1^{(1)}, \ldots, y_p^{(1)}, y_1^{(2)}, \ldots, y_p^{(n)}\}$, one is interested in inferring $C$ via a Bayesian approach and the posterior distribution $p(C, \theta_F | Y) \propto p_{GC}(Y | C, \theta_F)\,\pi(C, \theta_F)$ where $\pi(\cdot)$ is a prior distribution, $\theta_F$ are marginal parameters for each $F_j(\cdot)$, which in general might need to be marginalized since they will be unknown, and $p_{GC}(\cdot)$ is the PMF of a (here discrete) distribution with a Gaussian copula and marginals given by $\theta_F$. Let $Z$ be the underlying latent Gaussians of the corresponding copula for dataset $Y$. Although $Y$ is a deterministic function of $Z$, this mapping is not invertible due to the discreteness of the distribution: each marginal $F_j(\cdot)$ has jumps. Instead, the reverse mapping only enforces the constraints where $y_j^{(i_1)} < y_j^{(i_2)}$ implies $z_j^{(i_1)} < z_j^{(i_2)}$. Based on this observation, [7] considers the event $Z \in D(y)$, where $D(y)$ is the set of values of $Z$ in $\mathbb{R}^{n \times p}$ obeying those constraints, that is $D(y) \equiv \{ Z \in \mathbb{R}^{n \times p} : \max\{ z_j^{(k)} \ \text{s.t.}\ y_j^{(k)} < y_j^{(i)} \} < z_j^{(i)} < \min\{ z_j^{(k)} \ \text{s.t.}\ y_j^{(i)} < y_j^{(k)} \} \}$. Since $\{Y = y\} \Rightarrow Z(y) \in D(y)$, we have $p_{GC}(Y | C, \theta_F) = p_{GC}(Z \in D(y), Y | C, \theta_F) = p_N(Z \in D(y) | C) \times p_{GC}(Y | Z \in D(y), C, \theta_F)$ (1), the first factor of the last line being that of a zero-mean Gaussian density function marginalized over $D(y)$. The extended rank likelihood is defined by the first factor of (1). With this likelihood, inference for $C$ is given simply by marginalizing $p(C, Z | Y) \propto I(Z \in D(y))\, p_N(Z | C)\, \pi(C)$ (2), the first factor of the right-hand side being the usual binary indicator function. Strictly speaking, this is not a fully Bayesian method since partial information on the marginals is ignored. Nevertheless, it is possible to show that under some mild conditions there is information in the extended rank likelihood to consistently estimate $C$ [13]. 
It has two important properties: first, in many applications where marginal distributions are nuisance parameters, this sidesteps any major assumptions about the shape of $\{F_i(\cdot)\}$ – applications include learning the degree of dependence among variables (e.g., to understand relationships between social indicators as in [7] and [13]) and copula-based dimensionality reduction (a generalization of correlation-based principal component analysis, e.g., [5]); second, MCMC inference in the extended rank likelihood is conceptually simpler than with the joint likelihood, since dropping marginal models will remove complicated entanglements between $C$ and $\theta_F$. Therefore, even if $\theta_F$ is necessary (when, for instance, predicting missing values of $Y$), an estimate of $C$ can be computed separately and will not depend on the choice of estimator for $\{F_i(\cdot)\}$. The standard model with a full correlation matrix $C$ can be further refined to take into account structure implied by sparse inverse correlation matrices [2] or low rank decompositions via higher-order latent variable models [13], among others. We explore the latter case in section 5. An off-the-shelf algorithm for sampling from (2) is full Gibbs sampling: first, given $Z$, the (full or structured) correlation matrix $C$ can be sampled by standard methods. More to the point, sampling $Z$ is straightforward if for each variable $j$ and data point $i$ we sample $Z_j^{(i)}$ conditioned on all other variables. The corresponding distribution is a univariate truncated Gaussian. This is the approach used originally by Hoff. However, mixing can be severely compromised by the sampling of $Z$, and that is where novel sampling methods can facilitate inference. 3 Exact HMC for truncated Gaussian distributions Hoff’s algorithm modifies the positions of all $Z_j^{(i)}$ associated with a particular discrete value of $Y_j$, conditioned on the remaining points. 
As the number of data points increases, the spread of the hard boundaries on $Z_j^{(i)}$, given by data points of $Z_j$ associated with other levels of $Y_j$, increases. This reduces the space in which variables $Z_j^{(i)}$ can move at a time. To improve the mixing, we aim to sample from the joint Gaussian distribution of all latent variables $Z_j^{(i)}$, $i = 1 \ldots n$, conditioned on other columns of the data, such that the constraints between them are satisfied and thus the ordering in the observation level is conserved. Standard Gibbs approaches for sampling from truncated Gaussians reduce the problem to sampling from univariate truncated Gaussians. Even though each step is computationally simple, mixing can be slow when strong correlations are induced by very tight truncation bounds. In the following, we briefly describe the methodology recently introduced by [17] that deals with the problem of sampling from $\log p(x) \propto -\frac{1}{2} x^\top M x + r^\top x$, where $x, r \in \mathbb{R}^n$ and $M$ is positive definite, with linear constraints of the form $f_j^\top x \le g_j$, where $f_j \in \mathbb{R}^n$, $j = 1 \ldots m$, is the normal vector to some linear boundary in the sample space. Later in this section we shall describe how this framework can be applied to the Gaussian copula extended rank likelihood model. More importantly, the observed rank statistics impose only linear constraints of the form $x_{i_1} \le x_{i_2}$. We shall describe how this special structure can be exploited to reduce the runtime complexity of the constrained sampler from $O(n^2)$ (in the number of observations) to $O(n)$ in practice. 3.1 Hamiltonian Monte Carlo for the Gaussian distribution Hamiltonian Monte Carlo (HMC) [15] is a MCMC method that extends the sampling space with auxiliary variables so that (ideally) deterministic moves in the joint space bring the sampler to potentially far places in the original variable space. Deterministic moves cannot in general be done, but this is possible in the Gaussian case. 
The form of the Hamiltonian for the general d-dimensional Gaussian case with mean $\mu$ and precision matrix $M$ is:

$$H = \frac{1}{2} x^\top M x - r^\top x + \frac{1}{2} s^\top M^{-1} s , \qquad (3)$$

where $M$ is also known in the present context as the mass matrix, $r = M\mu$ and $s$ is the velocity. Both $x$ and $s$ are Gaussian distributed, so this Hamiltonian can be seen (up to a constant) as the negative log of the product of the densities of two independent Gaussian random variables. The physical interpretation is that of a sum of potential and kinetic energy terms, where the total energy of the system is conserved. In a system where this Hamiltonian function is constant, we can exactly compute its evolution through the pair of differential equations:

$$\dot{x} = \nabla_s H = M^{-1} s , \qquad \dot{s} = -\nabla_x H = -M x + r . \qquad (4)$$

These are solved exactly by $x(t) = \mu + a \sin(t) + b \cos(t)$, where $a$ and $b$ can be identified at initial conditions ($t = 0$):

$$a = \dot{x}(0) = M^{-1} s , \qquad b = x(0) - \mu . \qquad (5)$$

Therefore, the exact HMC algorithm can be summarised as follows:

• Initialise the allowed travel time $T$ and some initial position $x_0$.
• Repeat for HMC samples $k = 1 \dots N$:
  1. Sample $s_k \sim \mathcal{N}(0, M)$.
  2. Use $s_k$ and $x_k$ to update $a$ and $b$ and store the new position at the end of the trajectory, $x_{k+1} = x(T)$, as an HMC sample.

It can be easily shown that the Markov chain of sampled positions has the desired equilibrium distribution $\mathcal{N}(\mu, M^{-1})$ [17].

3.2 Sampling with linear constraints

Sampling from multivariate Gaussians does not require any method as sophisticated as HMC, but the plot thickens when the target distribution is truncated by linear constraints of the form $F x \le g$. Here, $F \in \mathbb{R}^{m \times n}$ is a constraint matrix whose every row is the normal vector to a linear boundary in the sample space. This is equivalent to sampling from a Gaussian that is confined in the (not necessarily bounded) convex polyhedron $\{x : F x \le g\}$. In general, to remain within the boundaries of each wall, once a new velocity has been sampled one must compute all possible collision times with the walls.
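
Stepping back to the unconstrained sampler of Section 3.1 for a moment, its two-step update can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `exact_hmc_gaussian` and its arguments are hypothetical, and the velocity $s \sim \mathcal{N}(0, M)$ is drawn through a Cholesky factor of $M$.

```python
import numpy as np

def exact_hmc_gaussian(mu, M, x0, T, n_samples, rng):
    """Exact HMC for N(mu, M^{-1}), where M is the precision (mass) matrix.

    Hypothetical helper for illustration; follows Eq. (5):
    x(t) = mu + a sin(t) + b cos(t), a = M^{-1} s, b = x(0) - mu.
    """
    mu = np.asarray(mu, dtype=float)
    x = np.asarray(x0, dtype=float)
    chol = np.linalg.cholesky(M)              # M = chol @ chol.T
    samples = np.empty((n_samples, mu.size))
    for k in range(n_samples):
        s = chol @ rng.standard_normal(mu.size)   # step 1: s ~ N(0, M)
        a = np.linalg.solve(M, s)                 # a = M^{-1} s
        b = x - mu                                # b = x(0) - mu
        x = mu + a * np.sin(T) + b * np.cos(T)    # step 2: x_{k+1} = x(T)
        samples[k] = x
    return samples
```

With travel time $T = \pi/2$ the cosine term vanishes, so each new position is $\mu + a$ and successive samples are independent of the previous state; smaller $T$ keeps more of the previous position.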
The smallest of all collision times signifies the wall that the particle should bounce from at that collision time. Figure 1 illustrates the concept with two simple examples in 2 and 3 dimensions. The collision times can be computed analytically and their equations can be found in the supplementary material. We also point the reader to [17] for a more detailed discussion of this implementation. Once the wall to be hit has been found, the position and velocity at impact time are computed and the velocity is reflected about the boundary normal (footnote 1).

Footnote 1: Also equivalent to transforming the velocity with a Householder reflection matrix about the bounding hyperplane.

Figure 1: Left: Trajectories of the first 40 iterations of the exact HMC sampler on a 2D truncated Gaussian. A reflection of the velocity can clearly be seen when the particle meets wall #2. Here, the constraint matrix $F$ is a 4 × 2 matrix. Center: The same example after 40000 samples. The coloring of each sample indicates its density value. Right: The anatomy of a 3D Gaussian. The walls are now planes and in this case $F$ is a 2 × 3 matrix. Figure best seen in color.

The constrained HMC sampler is summarized as follows:

• Initialise the allowed travel time $T$ and some initial position $x_0$.
• Repeat for HMC samples $k = 1 \dots N$:
  1. Sample $s_k \sim \mathcal{N}(0, M)$.
  2. Use $s_k$ and $x_k$ to update $a$ and $b$.
  3. Reset remaining travel time $T_{\text{left}} \leftarrow T$. Until no travel time is left or no walls can be reached (no solutions exist), do:
     (a) Compute impact times with all walls and pick the smallest one, $t_h$ (if a solution exists).
     (b) Compute the velocity at impact, $v(t_h)$, and reflect it about the hyperplane $f_h$. This is the updated velocity after impact. The updated position is $x(t_h)$.
     (c) $T_{\text{left}} \leftarrow T_{\text{left}} - t_h$.
  4. Store the new position at the end of the trajectory, $x_{k+1}$, as an HMC sample.

In general, all walls are candidates for impact, so the runtime of the sampler is linear in $m$, the number of constraints.
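
The bounce loop in step 3 can be sketched as follows. Since the collision-time equations are deferred to the supplementary material, the root formula below is our own derivation, not a transcription: along $x(t) = \mu + a\sin t + b\cos t$, the wall $f^\top x = g$ is reached when $A\sin t + B\cos t = c$ with $A = f^\top a$, $B = f^\top b$, $c = g - f^\top\mu$, and the outward crossing is $t = \varphi - \arccos(c/u)$ with $u = \sqrt{A^2 + B^2}$, $\varphi = \operatorname{atan2}(A, B)$. The sketch also assumes an isotropic target $\mathcal{N}(\mu, \sigma^2 I)$, so that a plain reflection about the wall normal is appropriate; the function names and the `max_bounces` guard are hypothetical.

```python
import numpy as np

def first_hit_time(f, g, mu, a, b, eps=1e-10):
    """Earliest t > 0 at which f.x(t) rises to g, or None if never reached."""
    A, B, c = f @ a, f @ b, g - f @ mu
    u = np.hypot(A, B)
    if u <= abs(c):                      # trajectory never crosses this wall
        return None
    phi = np.arctan2(A, B)               # A sin t + B cos t = u cos(t - phi)
    t = (phi - np.arccos(c / u)) % (2.0 * np.pi)   # outward crossing
    return t if t > eps else t + 2.0 * np.pi

def constrained_hmc_step(x, F, g, mu, sigma, T, rng, max_bounces=1000):
    """One exact HMC move for N(mu, sigma^2 I) truncated to F x <= g."""
    a = sigma * rng.standard_normal(x.size)   # initial time-derivative of x
    b = x - mu
    t_left = T
    for _ in range(max_bounces):              # guard; leftover time ignores walls
        hits = [(first_hit_time(F[j], g[j], mu, a, b), j) for j in range(len(g))]
        hits = [(t, j) for t, j in hits if t is not None and t < t_left]
        if not hits:
            break
        t_h, h = min(hits)                    # step 3(a): earliest wall
        pos = a * np.sin(t_h) + b * np.cos(t_h)   # x(t_h) - mu
        vel = a * np.cos(t_h) - b * np.sin(t_h)   # velocity at impact
        f = F[h]
        vel = vel - 2.0 * (f @ vel) / (f @ f) * f  # step 3(b): reflect
        a, b = vel, pos                       # restart trajectory at the wall
        t_left -= t_h                         # step 3(c)
    return mu + a * np.sin(t_left) + b * np.cos(t_left)
```

Each pass over the walls costs $O(m)$, which is the per-bounce cost discussed in the surrounding text.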
Because every wall must be checked before each bounce, the computational load is concentrated in step 3(a). Another consideration is that of the allocated travel time $T$. Depending on the shape of the bounding polyhedron and the number of walls, a very large travel time can induce many more bounces, thus requiring more computations per sample. On the other hand, a very small travel time explores the distribution more locally, so the mixing of the chain can suffer. Whether a given travel time counts as “large” or “small” depends on the dimensionality, the number of constraints and the structure of the constraints.

Due to the nature of our problem, the number of constraints, when explicitly expressed as linear functions, is $O(n^2)$. Clearly, this restricts any direct application of the HMC framework for Gaussian copula estimation to datasets with a small sample size $n$. More importantly, we show how to exploit the structure of the constraints to reduce the number of candidate walls (prior to each bounce) to $O(n)$.

4 HMC for the Gaussian Copula extended rank likelihood model

Given some discrete data $Y \in \mathbb{R}^{n \times p}$, the task is to infer the correlation matrix of the underlying Gaussian copula. Hoff's sampling algorithm proceeds by alternating between sampling the continuous latent representation $Z_j^{(i)}$ of each $Y_j^{(i)}$, for $i = 1 \dots n$, $j = 1 \dots p$, and sampling a covariance matrix from an inverse-Wishart distribution conditioned on the sampled matrix $Z \in \mathbb{R}^{n \times p}$, which is then renormalized as a correlation matrix.

From here on, we use matrix notation for the samples, as opposed to the random variables, with $Z_{i,j}$ replacing $Z_j^{(i)}$, $Z_{:,j}$ being a column of $Z$, and $Z_{:,\setminus j}$ being the submatrix of $Z$ without the $j$-th column. In a similar vein to Hoff's sampling algorithm, we replace the successive sampling of each $Z_{i,j}$ conditioned on $Z_{i,\setminus j}$ (a conditional univariate truncated Gaussian) with the simultaneous sampling of $Z_{:,j}$ conditioned on $Z_{:,\setminus j}$.
This is done through an HMC step from a conditional multivariate truncated Gaussian. The added benefit of this HMC step over the standard Gibbs approach is a handle for regulating the trade-off between exploration and runtime via the allocated travel time $T$. Larger travel times potentially allow for larger moves in the sample space, but this comes at a cost, as explained in the sequel.

4.1 The Hough envelope algorithm

The special structure of constraints. Recall that the number of constraints is quadratic in the dimension of the distribution. This is because every $Z$ sample must satisfy the conditions of the event $Z \in D(y)$ of the extended rank likelihood (see Section 2). In other words, for any column $Z_{:,j}$, all entries are organised into a partition $L^{(j)}$ of $|L^{(j)}|$ levels, the number of unique values observed for the discrete or ordinal variable $Y^{(j)}$. Thereby, for any two adjacent levels $l_k, l_{k+1} \in L^{(j)}$ and any pair $i_1 \in l_k$, $i_2 \in l_{k+1}$, it must be true that $Z_{i_1,j} < Z_{i_2,j}$. Equivalently, a constraint $f$ exists where $f_{i_1} = 1$, $f_{i_2} = -1$ and $g = 0$. It is easy to see that $O(n^2)$ such constraints are induced by the order statistics of the $j$-th variable. To deal with this boundary explosion, we developed the Hough Envelope algorithm to search efficiently, within all pairs in $\{Z_{:,j}\}$, in practically linear time.

Recall in HMC (section 3.2) that the trajectory of the particle, $x(t)$, is decomposed as

$$x_i(t) = a_i \sin(t) + b_i \cos(t) + \mu_i , \qquad (6)$$

and there are $n$ such functions, grouped into a partition of levels as described above. The Hough envelope (footnote 2) is found for every pair of adjacent levels. We illustrate this with an example of 10 dimensions and two levels in Figure 2, without loss of generality to any number of levels or dimensions. Assume we represent trajectories for points in level $l_k$ with blue curves, and points in $l_{k+1}$ with red curves. Assuming we start with a valid state, at time $t = 0$ all red curves are above all blue curves.
The goal is to find the smallest $t$ where a blue curve meets a red curve. This will be our collision time where a bounce will be necessary.

Figure 2: The trajectories $x_j(t)$ of each component are sinusoid functions (horizontal axis: $t$). The right-most green dot signifies the wall and the time $t_h$ of the earliest bounce, where the first inter-level pair (that is, any two components respectively from the blue and red level) becomes equal, in this case the constraint activated being $x_{\text{blue}_2} = x_{\text{red}_2}$.

1. First we find the largest component $\text{blue}_{\max}$ of the blue level at $t = 0$. This takes $O(n)$ time. Clearly, this will be the largest component until its sinusoid intersects that of any other component.
2. To find the next largest component, compute the roots of $x_{\text{blue}_{\max}}(t) - x_i(t) = 0$ for all components and pick the smallest (earliest) one (represented by a green dot). This also takes $O(n)$ time.
3. Repeat this procedure until a red sinusoid crosses the highest running blue sinusoid. When this happens, the time of earliest bounce and its constraint are found.

In the worst-case scenario, $n$ such repetitions have to be made, but in practice we can safely assume a fixed upper bound $h$ on the number of blue crossings before an inter-level crossing occurs. In experiments, we found $h \ll n$, no more than 10 in simulations with hundreds of thousands of curves. Thus, this search strategy takes $O(n)$ time in practice to complete, mirroring the analysis of other output-sensitive algorithms such as the gift wrapping algorithm for computing convex hulls [8]. Our HMC sampling approach is summarized in Algorithm 1.

Footnote 2: The name is inspired by the fact that each $x_i(t)$ is the sinusoid representation, in angle-distance space, of all lines that pass through the point $(a_i, b_i)$ in $a$-$b$ space, a representation known in image processing as the Hough transform [3].
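
The three-step search of Section 4.1 can be sketched for a single pair of adjacent levels. This is a hedged illustration rather than the authors' code: the names `next_crossings` and `earliest_interlevel_hit` are hypothetical, the root formula is our own derivation from Eq. (6), and ties or grazing contacts between curves are not handled. For a blue curve below the running maximum the relevant root is the one where it rises above the maximum; for a red curve it is the one where it descends onto the maximum.

```python
import numpy as np

def next_crossings(a, b, mu, m, t_now, is_red):
    """First time after t_now at which each curve meets the running blue max m.

    x_i(t) - x_m(t) = A sin t + B cos t - c = u cos(t - phi) - c.
    Blue curves cross upward (phi - arccos), red curves downward (phi + arccos).
    """
    A, B, c = a - a[m], b - b[m], mu[m] - mu
    u = np.hypot(A, B)
    t = np.full(a.shape, np.inf)              # inf: never meets curve m
    ok = u > np.abs(c)                        # also excludes m itself (u = 0)
    phi = np.arctan2(A[ok], B[ok])
    sign = np.where(is_red[ok], 1.0, -1.0)
    root = phi + sign * np.arccos(c[ok] / u[ok])
    t[ok] = t_now + np.mod(root - t_now, 2.0 * np.pi)
    return t

def earliest_interlevel_hit(a, b, mu, is_red, horizon):
    """Return (t_h, blue_index, red_index) of the first blue/red contact, or None."""
    x0 = mu + b                               # positions at t = 0
    blue = np.flatnonzero(~is_red)
    m = int(blue[np.argmax(x0[blue])])        # step 1: largest blue at t = 0
    t_now = 0.0
    for _ in range(10 * a.size):              # guard against degenerate loops
        t = next_crossings(a, b, mu, m, t_now, is_red)   # step 2: O(n) roots
        i = int(np.argmin(t))
        if not np.isfinite(t[i]) or t[i] >= horizon:
            return None                       # no bounce within the horizon
        if is_red[i]:
            return float(t[i]), m, i          # step 3: red meets the blue max
        m, t_now = i, float(t[i])             # another blue takes over as max
    return None
```

Each pass is a vectorised $O(n)$ scan, so the total cost is $O(hn)$, where $h$ is the number of times the running blue maximum changes before the first inter-level contact.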
Algorithm 1 HMC for GCERL

  # Notation: $\mathcal{TMN}(\mu, C, F)$ is a truncated multivariate normal with location vector $\mu$, scale matrix $C$ and constraints encoded by $F$ and $g = 0$.
  # $\mathcal{IW}(\mathrm{df}, V_0)$ is an inverse-Wishart prior with degrees of freedom $\mathrm{df}$ and scale matrix $V_0$.
  Input: $Y \in \mathbb{R}^{n \times p}$, allocated travel time $T$, a starting $Z$ and variable covariance $V \in \mathbb{R}^{p \times p}$, $\mathrm{df} = p + 2$, $V_0 = \mathrm{df}\, I_p$ and chain size $N$.
  Generate constraints $F^{(j)}$ from $Y_{:,j}$, for $j = 1 \dots p$.
  for samples $k = 1 \dots N$ do
    # Resample Z as follows:
    for variables $j = 1 \dots p$ do
      Compute parameters: $\sigma_j^2 = V_{jj} - V_{j,\setminus j} V_{\setminus j,\setminus j}^{-1} V_{\setminus j,j}$, $\mu_j = Z_{:,\setminus j} V_{\setminus j,\setminus j}^{-1} V_{\setminus j,j}$.
      Get one sample $Z_{:,j} \sim \mathcal{TMN}(\mu_j, \sigma_j^2 I, F^{(j)})$ efficiently by using the Hough Envelope algorithm, see section 4.1.
    end for
    Resample $V \sim \mathcal{IW}(\mathrm{df} + n, V_0 + Z^\top Z)$.
    Compute correlation matrix $C$, s.t. $C_{i,j} = V_{i,j} / \sqrt{V_{i,i} V_{j,j}}$, and store sample, $C^{(k)} \leftarrow C$.
  end for

5 An application on the Bayesian Gaussian copula factor model

In this section we describe an experiment that highlights the benefits of our HMC treatment, compared to a state-of-the-art parameter expansion (PX) sampling scheme. During this experiment we ask the important question: “How do the two schemes compare when we exploit the full advantage of the HMC machinery to jointly sample parameters and the augmented data Z, in a model of latent variables and structured correlations?” We argue that under such circumstances the superior convergence speed and mixing of HMC undeniably compensate for its computational overhead.

Experimental setup. In this section we provide results from an application on the Gaussian copula latent factor model of [13] (Hoff's model [7] for low-rank structured correlation matrices). We modify the parameter expansion (PX) algorithm used by [13] by replacing two of its Gibbs steps with a single HMC step. We show a much faster convergence to the true mode with considerable support in its vicinity.
We show that, unlike HMC, the PX algorithm falls short of properly exploring the posterior in any reasonable finite amount of time, even for small models and small samples. Worse, PX fails in ways one cannot easily detect.

Concretely, our HMC step samples each row of the factor loadings matrix $\Lambda$ jointly with the corresponding column of the augmented data matrix $Z$, conditioning on the higher-order latent factors. This step is analogous to Pakman and Paninski's [17, sec. 3.1] use of HMC in the context of a binary probit model (the extension to many levels in the discrete marginal is straightforward with direct application of the constraint matrix $F$ and the Hough envelope algorithm). The sampling of the higher level latent factors remains identical to [13]. Our scheme involves no parameter expansion. We do however interweave the Gibbs step for the $Z$ matrix similarly to Hoff. This has the added benefit of exploring the $Z$ sample space within their current boundaries, complementing the joint $(\lambda, z)$ sampling which moves the boundaries jointly. The value of such “interweaving” schemes has been addressed in [19].

Results. We perform simulations of 10000 iterations, $n = 1000$ observations (rows of $Y$), travel time $\pi/2$ for HMC with the setups listed in the following table, along with the elapsed times of each sampling scheme. These experiments were run on Intel Core i7 desktops with 4 cores and 8GB of RAM. Both methods were parallelized across the observed variables ($p$).

Figure   p (vars)   k (latent factors)   M (ordinal levels)   elapsed (mins): HMC   PX
3(a)     20         5                    2                    115                   8
3(b)     10         3                    2                    80                    6
3(c)     10         3                    5                    203                   16

Many functionals of the loadings matrix $\Lambda$ can be assessed. We focus on reconstructing the true (low-rank) correlation matrix of the Gaussian copula.
In particular, we summarize the algorithm's outcome with the root mean squared error (RMSE) of the differences between entries of the ground-truth correlation matrix and the implied correlation matrix at each iteration of an MCMC scheme (so the following plots look like a time series of 10000 time points); see Figures 3(a), 3(b) and 3(c).

Figure 3: Reconstruction (RMSE per iteration) of the low-rank structured correlation matrix of the Gaussian copula and its histogram (along the left side). (a) Simulation setup: 20 variables, 5 factors, 5 levels. HMC (blue) reaches a better mode faster (in iterations/CPU-time) than PX (red). Even more importantly the RMSE posterior samples of PX are concentrated in a much smaller region compared to HMC, even after 10000 iterations. This illustrates that PX poorly explores the true distribution. (b) Simulation setup: 10 vars, 3 factors, 2 levels. We observe behaviors similar to Figure 3(a). Note that the histogram counts RMSEs after the burn-in period of PX (iteration #500). (c) Simulation setup: 10 vars, 3 factors, 5 levels. We observe behaviors similar to Figures 3(a) and 3(b) but with a thinner tail for HMC. Note that the histogram counts RMSEs after the burn-in period of PX (iteration #2000).

Main message. HMC reaches a better mode faster (iterations/CPU-time). Even more importantly the RMSE posterior samples of PX are concentrated in a much smaller region compared to HMC, even after 10000 iterations. This illustrates that PX poorly explores the true distribution. As an analogous situation we refer to the top and bottom panels of Figure 14 of Radford Neal's slice sampler paper [14]. If there were no comparison against HMC, there would be no evidence from the PX plot alone that the algorithm is performing poorly. This mirrors Radford Neal's statement opening Section 8 of his paper: “a wrong answer is obtained without any obvious indication that something is amiss”.
The concentration of PX on the posterior mode in these simulations is misleading. PX might seem a bit simpler to implement, but it seems one cannot avoid using complex algorithms for complex models. We urge practitioners to revisit their past work with this model to find out by how much credible intervals of functionals of interest have been overconfident. Whether the overconfidence is trivial or severe, our algorithm offers the first principled approach for checking this.

6 Conclusion

Sampling large random vectors simultaneously in order to improve mixing is in general a very hard problem, and this is why clever methods such as HMC or elliptical slice sampling [12] are necessary. We expect that the method developed here is useful not only for those with data analysis problems within the large family of Gaussian copula extended rank likelihood models; the method itself and its behaviour might also provide some new insights on MCMC sampling in constrained spaces in general. Another direction of future work consists of exploring methods for elliptical copulas, and related possible extensions of general HMC for non-Gaussian copula models.

Acknowledgements

The quality of this work has benefited largely from comments by our anonymous reviewers and useful discussions with Simon Byrne and Vassilios Stathopoulos. Research was supported by EPSRC grant EP/J013293/1.

References

[1] Y. Bishop, S. Fienberg, and P. Holland. Discrete Multivariate Analysis: Theory and Practice. MIT Press, 1975.
[2] A. Dobra and A. Lenkoski. Copula Gaussian graphical models and their application to modeling functional disability data. Annals of Applied Statistics, 5:969–993, 2011.
[3] R. O. Duda and P. E. Hart. Use of the Hough transformation to detect lines and curves in pictures. Communications of the ACM, 15(1):11–15, 1972.
[4] G. Elidan. Copulas and machine learning. Proceedings of the Copulae in Mathematical and Quantitative Finance workshop, to appear, 2013.
[5] F. Han and H. Liu. Semiparametric principal component analysis. Advances in Neural Information Processing Systems, 25:171–179, 2012.
[6] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[7] P. Hoff. Extending the rank likelihood for semiparametric copula estimation. Annals of Applied Statistics, 1:265–283, 2007.
[8] R. Jarvis. On the identification of the convex hull of a finite set of points in the plane. Information Processing Letters, 2(1):18–21, 1973.
[9] H. Joe. Multivariate Models and Dependence Concepts. Chapman-Hall, 1997.
[10] S. Kirshner. Learning with tree-averaged densities and distributions. Neural Information Processing Systems, 2007.
[11] S. Lauritzen. Graphical Models. Oxford University Press, 1996.
[12] I. Murray, R. Adams, and D. MacKay. Elliptical slice sampling. JMLR Workshop and Conference Proceedings: AISTATS 2010, 9:541–548, 2010.
[13] J. Murray, D. Dunson, L. Carin, and J. Lucas. Bayesian Gaussian copula factor models for mixed data. Journal of the American Statistical Association, to appear, 2013.
[14] R. Neal. Slice sampling. The Annals of Statistics, 31:705–767, 2003.
[15] R. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, pages 113–162, 2010.
[16] R. Nelsen. An Introduction to Copulas. Springer-Verlag, 2007.
[17] A. Pakman and L. Paninski. Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. arXiv:1208.4118, 2012.
[18] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[19] Y. Yu and X. L. Meng. To center or not to center: That is not the question - An ancillarity-sufficiency interweaving strategy (ASIS) for boosting MCMC efficiency. Journal of Computational and Graphical Statistics, 20(3):531–570, 2011.

3 0.11363042 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits

Author: Jason Chang, John W. Fisher III

Abstract: We present an MCMC sampler for Dirichlet process mixture models that can be parallelized to achieve significant computational gains. We combine a nonergodic, restricted Gibbs iteration with split/merge proposals in a manner that produces an ergodic Markov chain. Each cluster is augmented with two subclusters to construct likely split moves. Unlike some previous parallel samplers, the proposed sampler enforces the correct stationary distribution of the Markov chain without the need for finite approximations. Empirical results illustrate that the new sampler exhibits better convergence properties than current methods.

4 0.1102808 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC

Author: Roger Frigola, Fredrik Lindsten, Thomas B. Schon, Carl Rasmussen

Abstract: State-space models are successfully used in many areas of science, engineering and economics to model time series and dynamical systems. We present a fully Bayesian approach to inference and learning (i.e. state estimation and system identification) in nonlinear nonparametric state-space models. We place a Gaussian process prior over the state transition dynamics, resulting in a flexible model able to capture complex dynamical phenomena. To enable efficient inference, we marginalize over the transition dynamics function and, instead, infer directly the joint smoothing distribution using specially tailored Particle Markov Chain Monte Carlo samplers. Once a sample from the smoothing distribution is computed, the state transition predictive distribution can be formulated analytically. Our approach preserves the full nonparametric expressivity of the model and can make use of sparse Gaussian processes to greatly reduce computational complexity.

5 0.10482301 218 nips-2013-On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations

Author: Tamir Hazan, Subhransu Maji, Tommi Jaakkola

Abstract: In this paper we describe how MAP inference can be used to sample efficiently from Gibbs distributions. Specifically, we provide means for drawing either approximate or unbiased samples from Gibbs distributions by introducing low dimensional perturbations and solving the corresponding MAP assignments. Our approach also leads to new ways to derive lower bounds on partition functions. We demonstrate empirically that our method excels in the typical “high signal high coupling” regime. The setting results in ragged energy landscapes that are challenging for alternative approaches to sampling and/or lower bounds.

6 0.099257931 256 nips-2013-Probabilistic Principal Geodesic Analysis

7 0.09145809 258 nips-2013-Projecting Ising Model Parameters for Fast Mixing

8 0.089024402 35 nips-2013-Analyzing the Harmonic Structure in Graph-Based Learning

9 0.086438417 221 nips-2013-On the Expressive Power of Restricted Boltzmann Machines

10 0.065747648 66 nips-2013-Computing the Stationary Distribution Locally

11 0.065739155 357 nips-2013-k-Prototype Learning for 3D Rigid Structures

12 0.064375319 34 nips-2013-Analyzing Hogwild Parallel Gaussian Gibbs Sampling

13 0.059865918 46 nips-2013-Bayesian Estimation of Latently-grouped Parameters in Undirected Graphical Models

14 0.058620352 161 nips-2013-Learning Stochastic Inverses

15 0.058502283 135 nips-2013-Heterogeneous-Neighborhood-based Multi-Task Local Learning Algorithms

16 0.057817094 16 nips-2013-A message-passing algorithm for multi-agent trajectory planning

17 0.053628664 155 nips-2013-Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition

18 0.050837189 245 nips-2013-Pass-efficient unsupervised feature selection

19 0.048823357 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

20 0.048348747 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.148), (1, 0.048), (2, -0.022), (3, 0.01), (4, 0.0), (5, 0.141), (6, 0.128), (7, 0.005), (8, 0.069), (9, -0.072), (10, 0.062), (11, -0.01), (12, -0.035), (13, 0.045), (14, -0.0), (15, 0.13), (16, -0.127), (17, -0.103), (18, -0.049), (19, 0.149), (20, -0.288), (21, 0.052), (22, 0.023), (23, 0.148), (24, 0.297), (25, 0.043), (26, 0.162), (27, 0.041), (28, -0.029), (29, -0.206), (30, -0.174), (31, -0.089), (32, -0.053), (33, -0.033), (34, -0.013), (35, 0.047), (36, 0.018), (37, 0.013), (38, -0.055), (39, -0.041), (40, 0.093), (41, 0.102), (42, 0.036), (43, 0.12), (44, 0.009), (45, -0.075), (46, -0.066), (47, 0.037), (48, 0.068), (49, -0.05)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94293284 43 nips-2013-Auxiliary-variable Exact Hamiltonian Monte Carlo Samplers for Binary Distributions

Author: Ari Pakman, Liam Paninski

2 0.84926313 123 nips-2013-Flexible sampling of discrete data correlations without the marginal distributions

Author: Alfredo Kalaitzis, Ricardo Silva

3 0.63665551 126 nips-2013-Gaussian Process Conditional Copulas with Applications to Financial Time Series

Author: José Miguel Hernández-Lobato, James R. Lloyd, Daniel Hernández-Lobato

Abstract: The estimation of dependencies between multiple variables is a central problem in the analysis of financial time series. A common approach is to express these dependencies in terms of a copula function. Typically the copula function is assumed to be constant but this may be inaccurate when there are covariates that could have a large influence on the dependence structure of the data. To account for this, a Bayesian framework for the estimation of conditional copulas is proposed. In this framework the parameters of a copula are non-linearly related to some arbitrary conditioning variables. We evaluate the ability of our method to predict time-varying dependencies on several equities and currencies and observe consistent performance gains compared to static copula models and other time-varying copula methods.

4 0.41158104 327 nips-2013-The Randomized Dependence Coefficient

Author: David Lopez-Paz, Philipp Hennig, Bernhard Schölkopf

Abstract: We introduce the Randomized Dependence Coefficient (RDC), a measure of nonlinear dependence between random variables of arbitrary dimension based on the Hirschfeld-Gebelein-Rényi Maximum Correlation Coefficient. RDC is defined in terms of correlation of random non-linear copula projections; it is invariant with respect to marginal distribution transformations, has low computational cost and is easy to implement: just five lines of R code, included at the end of the paper.

5 0.35571429 256 nips-2013-Probabilistic Principal Geodesic Analysis

Author: Miaomiao Zhang, P.T. Fletcher

Abstract: Principal geodesic analysis (PGA) is a generalization of principal component analysis (PCA) for dimensionality reduction of data on a Riemannian manifold. Currently PGA is defined as a geometric fit to the data, rather than as a probabilistic model. Inspired by probabilistic PCA, we present a latent variable model for PGA that provides a probabilistic framework for factor analysis on manifolds. To compute maximum likelihood estimates of the parameters in our model, we develop a Monte Carlo Expectation Maximization algorithm, where the expectation is approximated by Hamiltonian Monte Carlo sampling of the latent variables. We demonstrate the ability of our method to recover the ground truth parameters in simulated sphere data, as well as its effectiveness in analyzing shape variability of a corpus callosum data set from human brain images.

6 0.35436803 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits

7 0.34420508 46 nips-2013-Bayesian Estimation of Latently-grouped Parameters in Undirected Graphical Models

8 0.32826421 178 nips-2013-Locally Adaptive Bayesian Multivariate Time Series

9 0.32744831 258 nips-2013-Projecting Ising Model Parameters for Fast Mixing

10 0.3230876 245 nips-2013-Pass-efficient unsupervised feature selection

11 0.32187009 299 nips-2013-Solving inverse problem of Markov chain with partial observations

12 0.32169467 135 nips-2013-Heterogeneous-Neighborhood-based Multi-Task Local Learning Algorithms

13 0.31903911 107 nips-2013-Embed and Project: Discrete Sampling with Universal Hashing

14 0.3130509 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions

15 0.31083882 34 nips-2013-Analyzing Hogwild Parallel Gaussian Gibbs Sampling

16 0.30425176 218 nips-2013-On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations

17 0.29641017 161 nips-2013-Learning Stochastic Inverses

18 0.29449955 35 nips-2013-Analyzing the Harmonic Structure in Graph-Based Learning

19 0.29150525 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC

20 0.28498411 293 nips-2013-Sign Cauchy Projections and Chi-Square Kernel

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.029), (12, 0.191), (16, 0.074), (33, 0.142), (34, 0.159), (41, 0.033), (49, 0.054), (56, 0.08), (70, 0.052), (85, 0.027), (89, 0.03), (93, 0.044)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.85782158 350 nips-2013-Wavelets on Graphs via Deep Learning

Author: Raif Rustamov, Leonidas Guibas

Abstract: An increasing number of applications require processing of signals defined on weighted graphs. While wavelets provide a flexible tool for signal processing in the classical setting of regular domains, the existing graph wavelet constructions are less flexible: they are guided solely by the structure of the underlying graph and do not take directly into consideration the particular class of signals to be processed. This paper introduces a machine learning framework for constructing graph wavelets that can sparsely represent a given class of signals. Our construction uses the lifting scheme, and is based on the observation that the recurrent nature of the lifting scheme gives rise to a structure resembling a deep auto-encoder network. Particular properties that the resulting wavelets must satisfy determine the training objective and the structure of the involved neural networks. The training is unsupervised, and is conducted similarly to the greedy pre-training of a stack of auto-encoders. After training is completed, we obtain a linear wavelet transform that can be applied to any graph signal in time and memory linear in the size of the graph. Improved sparsity of our wavelet transform for the test signals is confirmed via experiments both on synthetic and real data.

same-paper 2 0.84800756 43 nips-2013-Auxiliary-variable Exact Hamiltonian Monte Carlo Samplers for Binary Distributions

Author: Ari Pakman, Liam Paninski

3 0.84383988 340 nips-2013-Understanding variable importances in forests of randomized trees

Author: Gilles Louppe, Louis Wehenkel, Antonio Sutera, Pierre Geurts

Abstract: Despite growing interest and practical use in various scientific areas, variable importances derived from tree-based ensemble methods are not well understood from a theoretical point of view. In this work we characterize the Mean Decrease Impurity (MDI) variable importances as measured by an ensemble of totally randomized trees in asymptotic sample and ensemble size conditions. We derive a three-level decomposition of the information jointly provided by all input variables about the output in terms of i) the MDI importance of each input variable, ii) the degree of interaction of a given input variable with the other input variables, iii) the different interaction terms of a given degree. We then show that this MDI importance of a variable is equal to zero if and only if the variable is irrelevant and that the MDI importance of a relevant variable is invariant with respect to the removal or the addition of irrelevant variables. We illustrate these properties on a simple example and discuss how they may change in the case of non-totally randomized trees such as Random Forests and Extra-Trees.

1 Motivation

An important task in many scientific fields is the prediction of a response variable based on a set of predictor variables. In many situations though, the aim is not only to make the most accurate predictions of the response but also to identify which predictor variables are the most important to make these predictions, e.g. in order to understand the underlying process. Because of their applicability to a wide range of problems and their capability to both build accurate models and, at the same time, to provide variable importance measures, Random Forests (Breiman, 2001) and variants such as Extra-Trees (Geurts et al., 2006) have become a major data analysis tool used with success in various scientific areas.
Despite their extensive use in applied research, only a couple of works have studied the theoretical properties and statistical mechanisms of these algorithms. Zhao (2000), Breiman (2004), Biau et al. (2008), Biau (2012), Meinshausen (2006) and Lin and Jeon (2006) investigated simplified to very realistic variants of these algorithms and proved the consistency of those variants. Little is known however regarding the variable importances computed by Random Forest-like algorithms, and, as far as we know, the work of Ishwaran (2007) is indeed the only theoretical study of tree-based variable importance measures. In this work, we aim at filling this gap and present a theoretical analysis of the Mean Decrease Impurity importance derived from ensembles of randomized trees. The rest of the paper is organized as follows: in section 2, we provide the background about ensembles of randomized trees and recall how variable importances can be derived from them; in section 3, we then derive a characterization in asymptotic conditions and show how variable importances derived from totally randomized trees offer a three-level decomposition of the information jointly contained in the input variables about the output; section 4 shows that this characterization only depends on the relevant variables and section 5 discusses these ideas in the context of variants closer to the Random Forest algorithm; section 6 then illustrates all these ideas on an artificial problem; finally, section 7 includes our conclusions and proposes directions for future work.

2 Background

In this section, we first describe decision trees, as well as forests of randomized trees. Then, we describe the two major variable importance measures derived from them, including the Mean Decrease Impurity (MDI) importance that we will study in the subsequent sections.

2.1 Single classification and regression trees and random forests

A binary classification (resp.
regression) tree (Breiman et al., 1984) is an input-output model represented by a tree structure $T$, from a random input vector $(X_1, \dots, X_p)$ taking its values in $\mathcal{X}_1 \times \dots \times \mathcal{X}_p = \mathcal{X}$ to a random output variable $Y \in \mathcal{Y}$. Any node $t$ in the tree represents a subset of the space $\mathcal{X}$, with the root node being $\mathcal{X}$ itself. Internal nodes $t$ are labeled with a binary test (or split) $s_t = (X_m < c)$ dividing their subset into two subsets (footnote 1) corresponding to their two children $t_L$ and $t_R$, while the terminal nodes (or leaves) are labeled with a best guess value of the output variable (footnote 2). The predicted output $\hat{Y}$ for a new instance is the label of the leaf reached by the instance when it is propagated through the tree. A tree is built from a learning sample of size $N$ drawn from $P(X_1, \dots, X_p, Y)$ using a recursive procedure which identifies at each node $t$ the split $s_t = s^*$ for which the partition of the $N_t$ node samples into $t_L$ and $t_R$ maximizes the decrease

\[ \Delta i(s, t) = i(t) - p_L\, i(t_L) - p_R\, i(t_R) \tag{1} \]

of some impurity measure $i(t)$ (e.g., the Gini index, the Shannon entropy, or the variance of $Y$), and where $p_L = N_{t_L}/N_t$ and $p_R = N_{t_R}/N_t$. The construction of the tree stops, e.g., when nodes become pure in terms of $Y$ or when all variables $X_i$ are locally constant.

Single trees typically suffer from high variance, which makes them not competitive in terms of accuracy. A very efficient and simple way to address this flaw is to use them in the context of randomization-based ensemble methods. Specifically, the core principle is to introduce random perturbations into the learning procedure in order to produce several different decision trees from a single learning set and to use some aggregation technique to combine the predictions of all these trees. In Bagging (Breiman, 1996), trees are built on random bootstrap copies of the original data, hence producing different decision trees.
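The impurity decrease of Equation 1 can be sketched in a few lines. This is a minimal illustration, assuming the Shannon entropy as the impurity measure $i(t)$ and a split given as a boolean mask; the function names and the four-sample toy labels are hypothetical and do not come from the paper.

```python
from collections import Counter
from math import log2


def shannon_entropy(labels):
    """Shannon entropy (in bits) of the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def impurity_decrease(labels, goes_left):
    """Equation 1: i(t) - p_L * i(t_L) - p_R * i(t_R) for one binary split.

    `goes_left[j]` says whether sample j is sent to the left child.
    """
    left = [y for y, is_left in zip(labels, goes_left) if is_left]
    right = [y for y, is_left in zip(labels, goes_left) if not is_left]
    n = len(labels)
    decrease = shannon_entropy(labels)
    for child in (left, right):
        if child:  # an empty child has weight 0 and contributes nothing
            decrease -= (len(child) / n) * shannon_entropy(child)
    return decrease


# Hypothetical node with two classes in equal proportion.
labels = ["a", "a", "b", "b"]
perfect = impurity_decrease(labels, [True, True, False, False])  # pure children
useless = impurity_decrease(labels, [True, False, True, False])  # no information
print(perfect, useless)
```

A split that separates the two classes removes all of the node's entropy, while a split whose children keep the parent's class proportions removes none.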
In Random Forests (Breiman, 2001), Bagging is extended and combined with a randomization of the input variables that are used when considering candidate variables to split internal nodes $t$. In particular, instead of looking for the best split $s^*$ among all variables, the Random Forest algorithm selects, at each node, a random subset of $K$ variables and then determines the best split over these latter variables only.

2.2 Variable importances

In the context of ensembles of randomized trees, Breiman (2001, 2002) proposed to evaluate the importance of a variable $X_m$ for predicting $Y$ by adding up the weighted impurity decreases $p(t)\,\Delta i(s_t, t)$ for all nodes $t$ where $X_m$ is used, averaged over all $N_T$ trees in the forest:

\[ \mathrm{Imp}(X_m) = \frac{1}{N_T} \sum_{T} \sum_{t \in T : v(s_t) = X_m} p(t)\, \Delta i(s_t, t) \tag{2} \]

where $p(t)$ is the proportion $N_t/N$ of samples reaching $t$ and $v(s_t)$ is the variable used in split $s_t$. When using the Gini index as impurity function, this measure is known as the Gini importance or Mean Decrease Gini. However, since it can be defined for any impurity measure $i(t)$, we will refer to Equation 2 as the Mean Decrease Impurity importance (MDI), no matter the impurity measure $i(t)$. We will characterize and derive results for this measure in the rest of this text.

Footnote 1: More generally, splits are defined by a (not necessarily binary) partition of the range $\mathcal{X}_m$ of possible values of a single variable $X_m$.
Footnote 2: E.g., determined as the majority class $j(t)$ (resp., the average value $\bar{y}(t)$) within the subset of the leaf $t$.

In addition to MDI, Breiman (2001, 2002) also proposed to evaluate the importance of a variable $X_m$ by measuring the Mean Decrease Accuracy (MDA) of the forest when the values of $X_m$ are randomly permuted in the out-of-bag samples. For that reason, this latter measure is also known as the permutation importance.
Thanks to popular machine learning software (Breiman, 2002; Liaw and Wiener, 2002; Pedregosa et al., 2011), both of these variable importance measures have shown their practical utility in an increasing number of experimental studies. Little is known however regarding their inner workings. Strobl et al. (2007) compare both MDI and MDA and show experimentally that the former is biased towards some predictor variables. As explained by White and Liu (1994) in the case of single decision trees, this bias stems from an unfair advantage given by the usual impurity functions $i(t)$ towards predictors with a large number of values. Strobl et al. (2008) later showed that MDA is biased as well, and that it overestimates the importance of correlated variables, although this effect was not confirmed in a later experimental study by Genuer et al. (2010). From a theoretical point of view, Ishwaran (2007) provides a detailed theoretical development of a simplified version of MDA, giving key insights for the understanding of the actual MDA.

3 Variable importances derived from totally randomized tree ensembles

Let us now consider the MDI importance as defined by Equation 2, and let us assume a set $V = \{X_1, \dots, X_p\}$ of categorical input variables and a categorical output $Y$. For the sake of simplicity we will use the Shannon entropy as impurity measure, and focus on totally randomized trees; later on we will discuss other impurity measures and tree construction algorithms. Given a training sample $L$ of $N$ joint observations of $X_1, \dots, X_p, Y$ independently drawn from the joint distribution $P(X_1, \dots, X_p, Y)$, let us assume that we infer from it an infinitely large ensemble of totally randomized and fully developed trees.
In this setting, a totally randomized and fully developed tree is defined as a decision tree in which each node $t$ is partitioned using a variable $X_i$ picked uniformly at random among those not yet used at the parent nodes of $t$, and where each $t$ is split into $|\mathcal{X}_i|$ sub-trees, i.e., one for each possible value of $X_i$, and where the recursive construction process halts only when all $p$ variables have been used along the current branch. Hence, in such a tree, leaves are all at the same depth $p$, and the set of leaves of a fully developed tree is in bijection with the set $\mathcal{X}$ of all possible joint configurations of the $p$ input variables. For example, if the input variables are all binary, the resulting tree will have exactly $2^p$ leaves.

Theorem 1. The MDI importance of $X_m \in V$ for $Y$ as computed with an infinite ensemble of fully developed totally randomized trees and an infinitely large training sample is:

\[ \mathrm{Imp}(X_m) = \sum_{k=0}^{p-1} \frac{1}{C_p^k} \frac{1}{p-k} \sum_{B \in \mathcal{P}_k(V^{-m})} I(X_m; Y \mid B), \tag{3} \]

where $V^{-m}$ denotes the subset $V \setminus \{X_m\}$, $\mathcal{P}_k(V^{-m})$ is the set of subsets of $V^{-m}$ of cardinality $k$, and $I(X_m; Y \mid B)$ is the conditional mutual information of $X_m$ and $Y$ given the variables in $B$.

Proof. See Appendix B.

Theorem 2. For any ensemble of fully developed trees in asymptotic learning sample size conditions (e.g., in the same conditions as those of Theorem 1), we have that

\[ \sum_{m=1}^{p} \mathrm{Imp}(X_m) = I(X_1, \dots, X_p; Y). \tag{4} \]

Proof. See Appendix C.

Together, theorems 1 and 2 show that variable importances derived from totally randomized trees in asymptotic conditions provide a three-level decomposition of the information $I(X_1, \dots, X_p; Y)$ contained in the set of input variables about the output variable.
The first level is a decomposition among input variables (see Equation 4 of Theorem 2), the second level is a decomposition along the degrees $k$ of interaction terms of a variable with the other ones (see the outer sum in Equation 3 of Theorem 1), and the third level is a decomposition along the combinations $B$ of interaction terms of fixed size $k$ of possible interacting variables (see the inner sum in Equation 3). We observe that the decomposition includes, for each variable, each and every interaction term of each and every degree weighted in a fashion resulting only from the combinatorics of possible interaction terms. In particular, since all $I(X_m; Y \mid B)$ terms are at most equal to $H(Y)$, the prior entropy of $Y$, the $p$ terms of the outer sum of Equation 3 are each upper bounded by

\[ \frac{1}{C_p^k} \frac{1}{p-k} \sum_{B \in \mathcal{P}_k(V^{-m})} H(Y) = \frac{1}{C_p^k} \frac{1}{p-k}\, C_{p-1}^k\, H(Y) = \frac{1}{p} H(Y). \]

As such, the second level decomposition resulting from totally randomized trees makes the $p$ sub-importance terms $\frac{1}{C_p^k} \frac{1}{p-k} \sum_{B \in \mathcal{P}_k(V^{-m})} I(X_m; Y \mid B)$ contribute equally (at most) to the total importance, even though they each include a combinatorially different number of terms.

4 Importances of relevant and irrelevant variables

Following Kohavi and John (1997), let us define as relevant to $Y$ with respect to $V$ a variable $X_m$ for which there exists at least one subset $B \subseteq V$ (possibly empty) such that $I(X_m; Y \mid B) > 0$ (footnote 3). Thus we define as irrelevant to $Y$ with respect to $V$ a variable $X_i$ for which, for all $B \subseteq V$, $I(X_i; Y \mid B) = 0$. Note that if $X_i$ is irrelevant to $Y$ with respect to $V$, then by definition it is also irrelevant to $Y$ with respect to any subset of $V$.

Theorem 3. $X_i \in V$ is irrelevant to $Y$ with respect to $V$ if and only if its infinite sample size importance as computed with an infinite ensemble of fully developed totally randomized trees built on $V$ for $Y$ is 0.

Proof. See Appendix D.

Lemma 4. Let $X_i \notin V$ be an irrelevant variable for $Y$ with respect to $V$.
The infinite sample size importance of $X_m \in V$ as computed with an infinite ensemble of fully developed totally randomized trees built on $V$ for $Y$ is the same as the importance derived when using $V \cup \{X_i\}$ to build the ensemble of trees for $Y$.

Proof. See Appendix E.

Theorem 5. Let $V_R \subseteq V$ be the subset of all variables in $V$ that are relevant to $Y$ with respect to $V$. The infinite sample size importance of any variable $X_m \in V_R$ as computed with an infinite ensemble of fully developed totally randomized trees built on $V_R$ for $Y$ is the same as its importance computed in the same conditions by using all variables in $V$. That is:

\[ \mathrm{Imp}(X_m) = \sum_{k=0}^{p-1} \frac{1}{C_p^k} \frac{1}{p-k} \sum_{B \in \mathcal{P}_k(V^{-m})} I(X_m; Y \mid B) = \sum_{l=0}^{r-1} \frac{1}{C_r^l} \frac{1}{r-l} \sum_{B \in \mathcal{P}_l(V_R^{-m})} I(X_m; Y \mid B) \tag{5} \]

where $r$ is the number of relevant variables in $V_R$.

Proof. See Appendix F.

Theorems 3 and 5 show that the importances computed with an ensemble of totally randomized trees depend only on the relevant variables. Irrelevant variables have a zero importance and do not affect the importance of relevant variables. Practically, we believe that such properties are desirable conditions for a sound criterion assessing the importance of a variable. Indeed, noise should not be credited with any importance and should not make any other variable more (or less) important.

Footnote 3: Among the relevant variables, we have the marginally relevant ones, for which $I(X_m; Y) > 0$, the strongly relevant ones, for which $I(X_m; Y \mid V^{-m}) > 0$, and the weakly relevant variables, which are relevant but not strongly relevant.

5 Random Forest variants

In this section, we consider and discuss variable importances as computed with other types of ensembles of randomized trees. We first show how our results extend to any other impurity measure, and then analyze importances computed by depth-pruned ensembles of randomized trees and those computed by randomized trees built on random subspaces of fixed size. Finally, we discuss the case of non-totally randomized trees.
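The statements of Theorems 1, 2, 3 and 5 can be checked numerically on a small case. The sketch below evaluates Equation 3 directly from a known joint distribution. The toy distribution (three fair coins with $Y = X_1 \oplus X_2$, so that $X_3$ is irrelevant) and all function names are illustrative assumptions, not taken from the paper.

```python
from itertools import combinations, product
from math import comb, log2

# Hypothetical toy distribution (not from the paper): X1, X2, X3 are fair
# independent coins and Y = X1 XOR X2, so X3 is irrelevant to Y.
# Each outcome is a tuple (x1, x2, x3, y) mapped to its probability.
JOINT = {(x1, x2, x3, x1 ^ x2): 1 / 8
         for x1, x2, x3 in product((0, 1), repeat=3)}
Y = 3  # position of the output inside each outcome tuple


def entropy(coords):
    """Shannon entropy (bits) of the marginal over the given coordinates."""
    marginal = {}
    for outcome, prob in JOINT.items():
        key = tuple(outcome[c] for c in coords)
        marginal[key] = marginal.get(key, 0.0) + prob
    return -sum(q * log2(q) for q in marginal.values() if q > 0)


def cond_mutual_info(m, cond):
    """I(X_m; Y | B) = H(X_m, B) + H(Y, B) - H(X_m, Y, B) - H(B)."""
    b = tuple(cond)
    return (entropy((m,) + b) + entropy((Y,) + b)
            - entropy((m, Y) + b) - entropy(b))


def importance(m, variables):
    """Equation 3 for X_m when the trees are built on `variables`."""
    p = len(variables)
    others = [v for v in variables if v != m]
    return sum(
        cond_mutual_info(m, subset) / (comb(p, k) * (p - k))
        for k in range(p)
        for subset in combinations(others, k)
    )


full = {m: importance(m, (0, 1, 2)) for m in (0, 1, 2)}    # built on V
relevant_only = {m: importance(m, (0, 1)) for m in (0, 1)}  # built on V_R
print(full, relevant_only)
```

In this toy case the irrelevant coin receives zero importance, the two relevant variables keep the same importance whether or not the irrelevant one is included, and the importances add up to $H(Y) = 1$ bit, which is what the theorems predict.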
5.1 Generalization to other impurity measures

Although our characterization in sections 3 and 4 uses Shannon entropy as the impurity measure, we show in Appendix I that theorems 1, 3 and 5 hold for other impurity measures, simply replacing conditional mutual information with conditional impurity reduction in the different formulas and in the definition of irrelevant variables. In particular, our results thus hold for the Gini index in classification and can be extended to regression problems using variance as the impurity measure.

5.2 Pruning and random subspaces

In sections 3 and 4, we considered totally randomized trees that were fully developed, i.e. until all $p$ variables were used within each branch. When totally randomized trees are developed only up to some smaller depth $q \le p$, we show in Proposition 6 that the variable importances as computed by these trees are limited to the first $q$ terms of Equation 3. We then show in Proposition 7 that these latter importances are actually the same as when each tree of the ensemble is fully developed over a random subspace (Ho, 1998) of $q$ variables drawn prior to its construction.

Proposition 6. The importance of $X_m \in V$ for $Y$ as computed with an infinite ensemble of pruned totally randomized trees built up to depth $q \le p$ and an infinitely large training sample is:

\[ \mathrm{Imp}(X_m) = \sum_{k=0}^{q-1} \frac{1}{C_p^k} \frac{1}{p-k} \sum_{B \in \mathcal{P}_k(V^{-m})} I(X_m; Y \mid B) \tag{6} \]

Proof. See Appendix G.

Proposition 7. The importance of $X_m \in V$ for $Y$ as computed with an infinite ensemble of pruned totally randomized trees built up to depth $q \le p$ and an infinitely large training sample is identical to the importance as computed for $Y$ with an infinite ensemble of fully developed totally randomized trees built on random subspaces of $q$ variables drawn from $V$.

Proof. See Appendix H.
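As a quick sanity check on Proposition 6 (a worked special case added for illustration, not taken from the paper's appendices), consider stumps, i.e. $q = 1$. Only the $k = 0$ term survives, and the only conditioning set is $B = \emptyset$:

\[ \mathrm{Imp}(X_m) = \frac{1}{C_p^0} \frac{1}{p-0}\, I(X_m; Y \mid \emptyset) = \frac{1}{p}\, I(X_m; Y). \]

This agrees with the tree-level picture: at the root, $X_m$ is picked with probability $1/p$ and, when picked, reduces the entropy of $Y$ by $I(X_m; Y)$. At the other extreme, $q = p$ recovers Equation 3 term by term.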
As long as $q \ge r$ (where $r$ denotes the number of relevant variables in $V$), it can easily be shown that all relevant variables will still obtain a strictly positive importance, which will however differ in general from the importances computed by fully grown totally randomized trees built over all variables. Also, each irrelevant variable of course keeps an importance equal to zero, which means that, in asymptotic conditions, these pruning and random subspace methods would still allow us to identify the relevant variables, as long as we have a good upper bound $q$ on $r$.

5.3 Non-totally randomized trees

In our analysis in the previous sections, trees are built totally at random and hence do not directly relate to those built in Random Forests (Breiman, 2001) or in Extra-Trees (Geurts et al., 2006). To better understand the importances as computed by those algorithms, let us consider a close variant of totally randomized trees: at each node $t$, let us instead draw uniformly at random $1 \le K \le p$ variables and let us choose the one that maximizes $\Delta i(t)$. Notice that, for $K = 1$, this procedure amounts to building ensembles of totally randomized trees as defined before, while, for $K = p$, it amounts to building classical single trees in a deterministic way.

First, the importance of $X_m \in V$ as computed with an infinite ensemble of such randomized trees is not the same as Equation 3. For $K > 1$, masking effects indeed appear: at $t$, some variables are never selected because there always is some other variable for which $\Delta i(t)$ is larger. Such effects tend to pull the best variables to the top of the trees and to push the others to the leaves. As a result, the importance of a variable no longer decomposes into a sum including all $I(X_m; Y \mid B)$ terms. The importance of the best variables decomposes into a sum of their mutual information alone or conditioned only on the best others, but not conditioned on all variables since they no longer ever appear at the bottom of trees.
By contrast, the importance of the least promising variables now decomposes into a sum of their mutual information conditioned only on all variables, but not alone or conditioned on a couple of others since they no longer ever appear at the top of trees. In other words, because of the guided structure of the trees, the importance of $X_m$ now takes into account only some of the conditioning sets $B$, which may over- or underestimate its overall relevance.

To make things clearer, let us consider a simple example. Let $X_1$ perfectly explain $Y$ and let $X_2$ be a slightly noisy copy of $X_1$ (i.e., $I(X_1; Y) \approx I(X_2; Y)$, $I(X_1; Y \mid X_2) = \epsilon$ and $I(X_2; Y \mid X_1) = 0$). Using totally randomized trees, the importances of $X_1$ and $X_2$ are nearly equal, the importance of $X_1$ being slightly higher than the importance of $X_2$:

\[ \mathrm{Imp}(X_1) = \frac{1}{2} I(X_1; Y) + \frac{1}{2} I(X_1; Y \mid X_2) = \frac{1}{2} I(X_1; Y) + \frac{\epsilon}{2} \]
\[ \mathrm{Imp}(X_2) = \frac{1}{2} I(X_2; Y) + \frac{1}{2} I(X_2; Y \mid X_1) = \frac{1}{2} I(X_2; Y) + 0 \]

In non-totally randomized trees, for $K = 2$, $X_1$ is always selected at the root node and $X_2$ is always used in its children. Also, since $X_1$ perfectly explains $Y$, all its children are pure and the reduction of entropy when splitting on $X_2$ is null. As a result, $\mathrm{Imp}_{K=2}(X_1) = I(X_1; Y)$ and $\mathrm{Imp}_{K=2}(X_2) = I(X_2; Y \mid X_1) = 0$. Masking effects are here clearly visible: the true importance of $X_2$ is masked by $X_1$ as if $X_2$ were irrelevant, while it is only a bit less informative than $X_1$.

As a direct consequence of the example above, for $K > 1$, it is no longer true that a variable is irrelevant if and only if its importance is zero. In the same way, it can also be shown that the importances become dependent on the number of irrelevant variables. Let us indeed consider the following counter-example: let us add in the previous example an irrelevant variable $X_i$ with respect to $\{X_1, X_2\}$ and let us keep $K = 2$.
The probability of selecting $X_2$ at the root node now becomes positive, which means that $\mathrm{Imp}_{K=2}(X_2)$ now includes $I(X_2; Y) > 0$ and is therefore strictly larger than the importance computed before. For $K$ fixed, adding irrelevant variables dampens masking effects, which thereby makes importances indirectly dependent on the number of irrelevant variables.

In conclusion, the importances as computed with totally randomized trees exhibit properties that, by extension, neither Random Forests nor Extra-Trees possess. With totally randomized trees, the importance of $X_m$ only depends on the relevant variables and is 0 if and only if $X_m$ is irrelevant. As we have shown, it may no longer be the case for $K > 1$. Asymptotically, the use of totally randomized trees for assessing the importance of a variable may therefore be more appropriate. In a finite setting (i.e., a limited number of samples and a limited number of trees), guiding the choice of the splitting variables remains however a sound strategy. In such a case, $I(X_m; Y \mid B)$ can be measured neither for all $X_m$ nor for all $B$. It is therefore pragmatic to promote those that look the most promising, even if the resulting importances may be biased.

6 Illustration on a digit recognition problem

In this section, we consider the digit recognition problem of Breiman et al. (1984) for illustrating variable importances as computed with totally randomized trees. We verify that they match our theoretical developments and that they decompose as foretold. We also compare these importances with those computed by an ensemble of non-totally randomized trees, as discussed in section 5.3, for increasing values of $K$. Let us consider a seven-segment indicator displaying numerals using horizontal and vertical lights in on-off combinations, as illustrated in Figure 1.
Let $Y$ be a random variable taking its value in $\{0, 1, \dots, 9\}$ with equal probability and let $X_1, \dots, X_7$ be binary variables whose values are each determined univocally given the corresponding value of $Y$ in Table 1. Since Table 1 perfectly defines the data distribution, and given the small dimensionality of the problem, it is practicable to directly apply Equation 3 to compute variable importances.

Figure 1: 7-segment display (segments labeled X1 to X7).

Table 1: Values of Y, X1, ..., X7

y    x1  x2  x3  x4  x5  x6  x7
0     1   1   1   0   1   1   1
1     0   0   1   0   0   1   0
2     1   0   1   1   1   0   1
3     1   0   1   1   0   1   1
4     0   1   1   1   0   1   0
5     1   1   0   1   0   1   1
6     1   1   0   1   1   1   1
7     1   0   1   0   0   1   0
8     1   1   1   1   1   1   1
9     1   1   1   1   0   1   1

Table 2: Variable importances as computed with an ensemble of randomized trees, for increasing values of K. Importances at K = 1 follow their theoretical values, as predicted by Equation 3 in Theorem 1. However, as K increases, importances diverge due to masking effects. In accordance with Theorem 2, their sum is also always equal to I(X1, ..., X7; Y) = H(Y) = log2(10) = 3.321 since the inputs allow the output to be predicted perfectly.

       Eqn. 3   K=1     K=2     K=3     K=4     K=5     K=6     K=7
X1     0.412    0.414   0.362   0.327   0.309   0.304   0.305   0.306
X2     0.581    0.583   0.663   0.715   0.757   0.787   0.801   0.799
X3     0.531    0.532   0.512   0.496   0.489   0.483   0.475   0.475
X4     0.542    0.543   0.525   0.484   0.445   0.414   0.409   0.412
X5     0.656    0.658   0.731   0.778   0.810   0.827   0.831   0.835
X6     0.225    0.221   0.140   0.126   0.122   0.122   0.121   0.120
X7     0.372    0.368   0.385   0.392   0.387   0.382   0.375   0.372
Sum    3.321    3.321   3.321   3.321   3.321   3.321   3.321   3.321

To verify our theoretical developments, we then compare in Table 2 variable importances as computed by Equation 3 and those yielded by an ensemble of 10000 totally randomized trees ($K = 1$). Note that given the known structure of the problem, building trees on a sample of finite size that perfectly follows the data distribution amounts to building them on a sample of infinite size.
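The direct application of Equation 3 to the seven-segment problem can be sketched as follows. The segment patterns are transcribed from Table 1 and the reference values from the "Eqn. 3" entries of Table 2; the function and variable names are illustrative assumptions, and the conditional mutual information is computed from joint entropies over the ten equiprobable digits.

```python
from collections import Counter
from itertools import combinations
from math import comb, log2

# Table 1: segments (x1, ..., x7) lit for each digit y = 0, ..., 9.
SEGMENTS = {
    0: (1, 1, 1, 0, 1, 1, 1),
    1: (0, 0, 1, 0, 0, 1, 0),
    2: (1, 0, 1, 1, 1, 0, 1),
    3: (1, 0, 1, 1, 0, 1, 1),
    4: (0, 1, 1, 1, 0, 1, 0),
    5: (1, 1, 0, 1, 0, 1, 1),
    6: (1, 1, 0, 1, 1, 1, 1),
    7: (1, 0, 1, 0, 0, 1, 0),
    8: (1, 1, 1, 1, 1, 1, 1),
    9: (1, 1, 1, 1, 0, 1, 1),
}

# One equiprobable row per digit: (x1, ..., x7, y).
ROWS = [segments + (digit,) for digit, segments in SEGMENTS.items()]
Y = 7  # column index of the output in each row


def entropy(columns):
    """Shannon entropy (bits) of the selected columns, rows being equiprobable."""
    counts = Counter(tuple(row[col] for col in columns) for row in ROWS)
    n = len(ROWS)
    return -sum((cnt / n) * log2(cnt / n) for cnt in counts.values())


def cond_mutual_info(m, cond):
    """I(X_m; Y | B) = H(X_m, B) + H(Y, B) - H(X_m, Y, B) - H(B)."""
    b = tuple(cond)
    return (entropy((m,) + b) + entropy((Y,) + b)
            - entropy((m, Y) + b) - entropy(b))


def importance(m, p=7):
    """Equation 3: weighted sum of I(X_m; Y | B) over all B not containing X_m."""
    others = [v for v in range(p) if v != m]
    return sum(
        cond_mutual_info(m, subset) / (comb(p, k) * (p - k))
        for k in range(p)
        for subset in combinations(others, k)
    )


importances = [importance(m) for m in range(7)]
# "Eqn. 3" entries of Table 2, for side-by-side comparison only.
PAPER_EQ3 = (0.412, 0.581, 0.531, 0.542, 0.656, 0.225, 0.372)
for m, (ours, paper) in enumerate(zip(importances, PAPER_EQ3), start=1):
    print(f"X{m}: computed {ours:.3f}, reported {paper:.3f}")
print(f"sum = {sum(importances):.3f}, log2(10) = {log2(10):.3f}")
```

Because every segment is a deterministic function of the digit, each $I(X_m; Y \mid B)$ here reduces to $H(X_m \mid B)$, and the printed values should land close to the reported ones, up to rounding in the table.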
At best, trees can thus be built on a 10-sample dataset, containing exactly one sample for each of the equiprobable outcomes of $Y$. As the table illustrates, the importances yielded by totally randomized trees match those computed by Equation 3, which confirms Theorem 1. Small differences stem from the fact that a finite number of trees were built in our simulations, but those discrepancies should disappear as the size of the ensemble grows towards infinity. It also shows that importances indeed add up to $I(X_1, \dots, X_7; Y)$, which confirms Theorem 2. Regarding the actual importances, they indicate that $X_5$ is stronger than all others, followed, in that order, by $X_2$, $X_4$ and $X_3$, which also show large importances. $X_1$, $X_7$ and $X_6$ appear to be the least informative. The table also reports importances for increasing values of $K$. As discussed before, we see that a large value of $K$ yields importances that can be either overestimated (e.g., at $K = 7$, the importances of $X_2$ and $X_5$ are larger than at $K = 1$) or underestimated due to masking effects (e.g., at $K = 7$, the importances of $X_1$, $X_3$, $X_4$ and $X_6$ are smaller than at $K = 1$, as if they were less important). It can also be observed that masking effects may even induce changes in the variable rankings (e.g., compare the rankings at $K = 1$ and at $K = 7$), which thus confirms that importances are differently affected.

To better understand why a variable is important, it is also insightful to look at its decomposition into its $p$ sub-importance terms, as shown in Figure 2. Each row in the plots of the figure corresponds to one of the $p = 7$ variables and each column to a size $k$ of conditioning sets. As such, the value at row $m$ and column $k$ corresponds to the importance of $X_m$ when conditioned on $k$ other variables (e.g., to the term $\frac{1}{C_p^k} \frac{1}{p-k} \sum_{B \in \mathcal{P}_k(V^{-m})} I(X_m; Y \mid B)$ in Equation 3 in the case of totally randomized
In the left plot, for K = 1, the ﬁgure ﬁrst illustrates how importances yielded by totally randomized trees decomposes along the degrees k of interactions terms. We can observe that they each equally contribute (at most) the total importance of a variable. The plot also illustrates why X5 is important: it is informative either alone or conditioned with any combination of the other variables (all of its terms are signiﬁcantly larger than 0). By contrast, it also clearly shows why 7 K=1 0.5 K=7 X1 X1 X2 X3 X4 X4 X5 X5 X6 X6 X7 0.375 X2 X3 X7 0 1 2 3 4 5 6 0.25 0.125 0 1 2 3 4 5 6 0.0 Figure 2: Decomposition of variable importances along the degrees k of interactions of one variable with the other ones. At K = 1, all I(Xm ; Y |B) are accounted for in the total importance, while at K = 7 only some of them are taken into account due to masking effects. X6 is not important: neither alone nor combined with others X6 seems to be very informative (all of its terms are close to 0). More interestingly, this ﬁgure also highlights redundancies: X7 is informative alone or conditioned with a couple of others (the ﬁrst terms are signiﬁcantly larger than 0), but becomes uninformative when conditioned with many others (the last terms are closer to 0). The right plot, for K = 7, illustrates the decomposition of importances when variables are chosen in a deterministic way. The ﬁrst thing to notice is masking effects. Some of the I(Xm ; Y |B) terms are indeed clearly never encountered and their contribution is therefore reduced to 0 in the total importance. For instance, for k = 0, the sub-importances of X2 and X5 are positive, while all others are null, which means that only those two variables are ever selected at the root node, hence masking the others. As a consequence, this also means that the importances of the remaining variables is biased and that it actually only accounts of their relevance when conditioned to X2 or X5 , but not of their relevance in other contexts. 
At $k = 0$, masking effects also amplify the contribution of $I(X_2; Y)$ (resp. $I(X_5; Y)$) since $X_2$ (resp. $X_5$) appears more frequently at the root node than in totally randomized trees. In addition, because nodes become pure before reaching depth $p$, conditioning sets of size $k \ge 4$ are never actually encountered, which means that we can no longer know whether variables are still informative when conditioned on many others. All in all, this figure thus indeed confirms that importances as computed with non-totally randomized trees take into account only some of the conditioning sets $B$, hence biasing the measured importances.

7 Conclusions

In this work, we made a first step towards understanding variable importances as computed with a forest of randomized trees. In particular, we derived a theoretical characterization of the Mean Decrease Impurity importances as computed by totally randomized trees in asymptotic conditions. We showed that they offer a three-level decomposition of the information jointly provided by all input variables about the output (Section 3). We then demonstrated (Section 4) that MDI importances as computed by totally randomized trees exhibit desirable properties for assessing the relevance of a variable: the importance is equal to zero if and only if the variable is irrelevant and it depends only on the relevant variables. We discussed the case of Random Forests and Extra-Trees (Section 5) and finally illustrated our developments on an artificial but insightful example (Section 6).

There remain several limitations to our framework that we would like to address in the future. First, our results should be adapted to binary splits as used within an actual Random Forest-like algorithm. In this setting, any node $t$ is split in only two subsets, which means that any variable may then appear one or several times within a branch, and thus should make variable importances now dependent on the cardinalities of the input variables.
In the same direction, our framework should also be extended to the case of continuous variables. Finally, results presented in this work are valid in an asymptotic setting only. An important direction of future work includes the characterization of the distribution of variable importances in a finite setting.

Acknowledgements. Gilles Louppe is a research fellow of the FNRS (Belgium) and acknowledges its financial support. This work is supported by PASCAL2 and the IUAP DYSCO, initiated by the Belgian State, Science Policy Office.

References

Biau, G. (2012). Analysis of a random forests model. The Journal of Machine Learning Research, 13:1063–1095.
Biau, G., Devroye, L., and Lugosi, G. (2008). Consistency of random forests and other averaging classifiers. The Journal of Machine Learning Research, 9:2015–2033.
Breiman, L. (1996). Bagging predictors. Machine learning, 24(2):123–140.
Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32.
Breiman, L. (2002). Manual on setting up, using, and understanding random forests v3.1. Statistics Department University of California Berkeley, CA, USA.
Breiman, L. (2004). Consistency for a simple model of random forests. Technical report, UC Berkeley.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and regression trees.
Genuer, R., Poggi, J.-M., and Tuleau-Malot, C. (2010). Variable selection using random forests. Pattern Recognition Letters, 31(14):2225–2236.
Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1):3–42.
Ho, T. (1998). The random subspace method for constructing decision forests. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20(8):832–844.
Ishwaran, H. (2007). Variable importance in binary regression trees and forests. Electronic Journal of Statistics, 1:519–537.
Kohavi, R. and John, G. H. (1997). Wrappers for feature subset selection. Artificial intelligence, 97(1):273–324.
Liaw, A. and Wiener, M. (2002).
Classification and regression by randomForest. R news, 2(3):18–22.
Lin, Y. and Jeon, Y. (2006). Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101(474):578–590.
Meinshausen, N. (2006). Quantile regression forests. The Journal of Machine Learning Research, 7:983–999.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in python. The Journal of Machine Learning Research, 12:2825–2830.
Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., and Zeileis, A. (2008). Conditional variable importance for random forests. BMC bioinformatics, 9(1):307.
Strobl, C., Boulesteix, A.-L., Zeileis, A., and Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC bioinformatics, 8(1):25.
White, A. P. and Liu, W. Z. (1994). Technical note: Bias in information-based measures in decision tree induction. Machine Learning, 15(3):321–329.
Zhao, G. (2000). A new perspective on classification. PhD thesis, Utah State University, Department of Mathematics and Statistics.

4 0.82324755 304 nips-2013-Sparse nonnegative deconvolution for compressive calcium imaging: algorithms and phase transitions

Author: Eftychios A. Pnevmatikakis, Liam Paninski

Abstract: We propose a compressed sensing (CS) calcium imaging framework for monitoring large neuronal populations, where we image randomized projections of the spatial calcium concentration at each timestep, instead of measuring the concentration at individual locations. We develop scalable nonnegative deconvolution methods for extracting the neuronal spike time series from such observations. We also address the problem of demixing the spatial locations of the neurons using rank-penalized matrix factorization methods. By exploiting the sparsity of neural spiking we demonstrate that the number of measurements needed per timestep is significantly smaller than the total number of neurons, a result that can potentially enable imaging of larger populations at considerably faster rates compared to traditional raster-scanning techniques. Unlike traditional CS setups, our problem involves a block-diagonal sensing matrix and a non-orthogonal sparse basis that spans multiple timesteps. We provide tight approximations to the number of measurements needed for perfect deconvolution for certain classes of spiking processes, and show that this number undergoes a “phase transition,” which we characterize using modern tools relating conic geometry to compressed sensing.

5 0.80334616 159 nips-2013-Learning Prices for Repeated Auctions with Strategic Buyers

Author: Kareem Amin, Afshin Rostamizadeh, Umar Syed

Abstract: Inspired by real-time ad exchanges for online display advertising, we consider the problem of inferring a buyer’s value distribution for a good when the buyer is repeatedly interacting with a seller through a posted-price mechanism. We model the buyer as a strategic agent, whose goal is to maximize her long-term surplus, and we are interested in mechanisms that maximize the seller’s long-term revenue. We define the natural notion of strategic regret: the lost revenue as measured against a truthful (non-strategic) buyer. We present seller algorithms that are no(strategic)-regret when the buyer discounts her future surplus, i.e., the buyer prefers showing advertisements to users sooner rather than later. We also give a lower bound on strategic regret that increases as the buyer’s discounting weakens and shows, in particular, that any seller algorithm will suffer linear strategic regret if there is no discounting.

6 0.76796967 86 nips-2013-Demixing odors - fast inference in olfaction

7 0.76570296 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking

8 0.76374364 77 nips-2013-Correlations strike back (again): the case of associative memory retrieval

9 0.7624166 173 nips-2013-Least Informative Dimensions

10 0.76143909 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles

11 0.75887358 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits

12 0.75838023 141 nips-2013-Inferring neural population dynamics from multiple partial recordings of the same neural circuit

13 0.75831228 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning

14 0.75823164 201 nips-2013-Multi-Task Bayesian Optimization

15 0.75783902 148 nips-2013-Latent Maximum Margin Clustering

16 0.75709027 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation

17 0.75575519 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models

18 0.75556642 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models

19 0.75433856 5 nips-2013-A Deep Architecture for Matching Short Texts

20 0.7525661 121 nips-2013-Firing rate predictions in optimal balanced networks