nips nips2013 nips2013-126 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: José Miguel Hernández-Lobato, James R. Lloyd, Daniel Hernández-Lobato
Abstract: The estimation of dependencies between multiple variables is a central problem in the analysis of financial time series. A common approach is to express these dependencies in terms of a copula function. Typically the copula function is assumed to be constant but this may be inaccurate when there are covariates that could have a large influence on the dependence structure of the data. To account for this, a Bayesian framework for the estimation of conditional copulas is proposed. In this framework the parameters of a copula are non-linearly related to some arbitrary conditioning variables. We evaluate the ability of our method to predict time-varying dependencies on several equities and currencies and observe consistent performance gains compared to static copula models and other timevarying copula methods. 1
Reference: text
sentIndex sentText sentNum sentScore
1 A common approach is to express these dependencies in terms of a copula function. [sent-8, score-0.716]
2 Typically the copula function is assumed to be constant but this may be inaccurate when there are covariates that could have a large influence on the dependence structure of the data. [sent-9, score-0.654]
3 To account for this, a Bayesian framework for the estimation of conditional copulas is proposed. [sent-10, score-0.349]
4 In this framework the parameters of a copula are non-linearly related to some arbitrary conditioning variables. [sent-11, score-0.709]
5 We evaluate the ability of our method to predict time-varying dependencies on several equities and currencies and observe consistent performance gains compared to static copula models and other timevarying copula methods. [sent-12, score-1.505]
6 1 Introduction Understanding dependencies within multivariate data is a central problem in the analysis of financial time series, underpinning common tasks such as portfolio construction and calculation of value-atrisk. [sent-13, score-0.105]
7 Classical methods estimate these dependencies in terms of a covariance matrix (possibly time varying) which is induced from the data [4, 5, 7, 1]. [sent-14, score-0.089]
8 However, a more general approach is to use copula functions to model dependencies [6]. [sent-15, score-0.716]
9 The use of standard copula methods to estimate dependencies is likely to be inaccurate when the actual dependencies are strongly influenced by other covariates. [sent-17, score-0.778]
10 Specifically we assume parametric copulas whose parameters are specified by unknown non-linear functions of arbitrary conditioning variables. [sent-21, score-0.392]
11 GPs have previously been used to model conditional copulas in [12] but that work only applies to copulas specified by a single parameter. [sent-23, score-0.655]
12 We extend this work to accommodate copulas with multiple parameters. [sent-24, score-0.306]
13 This is an important improvement since it allows the use of a richer set of copulas including Student’s t and asymmetric copulas. [sent-25, score-0.337]
14 We demonstrate our method by choosing the conditioning variables to be time and evaluating its ability to estimate time-varying dependencies 1 0. [sent-26, score-0.117]
15 8 Figure 1: Left, Gaussian copula density for τ = 0. [sent-50, score-0.654]
16 Right, symmetrized Joe Clayton copula density for τ U = 0. [sent-54, score-0.705]
17 The latter copula model is asymmetric along the main diagonal of the unit square. [sent-57, score-0.685]
18 Our method achieves consistently superior predictive performance compared to static copula models and other dynamic copula methods. [sent-59, score-1.434]
19 regime switching models [11] and methods proposing GARCH-style updates to copula parameters [20, 11]. [sent-62, score-0.654]
20 2 Copulas and Conditional Copulas Copulas provide a powerful framework for the construction of multivariate probabilistic models by separating the modeling of univariate marginal distributions from the modeling of dependencies between variables [6]. [sent-63, score-0.144]
21 We focus on bivariate copulas since higher dimensional copulas are typically constructed using bivariate copulas as building blocks [e. [sent-64, score-1.064]
22 Sklar’s theorem [18] states that given two one-dimensional random variables, X and Y , with continuous marginal cumulative distribution functions (cdfs) FX (X) and FY (Y ), we can express their joint cdf FX,Y as FX,Y (x, y) = CX,Y [FX (x), FY (y)], where CX,Y is the unique copula for X and Y . [sent-66, score-0.693]
23 Figure 1 shows plots of the copula densities for three parametric copula models: Gaussian, Student’s t and the symmetrized Joe Clayton (SJC) copulas. [sent-68, score-1.39]
24 1 Conditional Copulas When one has access to a covariate vector Z, one may wish to estimate a conditional version of a copula model i. [sent-75, score-0.697]
25 The estimation of the marginals FX|Z and FY |Z can be implemented using standard methods for univariate conditional distribution estimation. [sent-79, score-0.082]
26 We propose a general Bayesian non-parametric framework for the estimation of conditional copulas based on GPs and an alternating expectation propagation (EP) algorithm for efficient approximate inference. [sent-81, score-0.349]
27 i=1 i=1 We assume that CX,Y |Z is a parametric copula model Cpar [u, v|θ1 (z), . [sent-83, score-0.685]
28 Let θi (z) = σi [fi (z)], 2 where fi is an arbitrary real function and σi is a function that maps the real line to a set Θi of valid configurations for θi . [sent-90, score-0.158]
29 The prior distribution for fi given DZ is p(fi |DZ ) = N (fi |mi , Ki ), where mi = (mi (z1 ), . [sent-106, score-0.207]
30 , mi (zn ))T for some mean function mi (z) and Ki is an n × n covariance matrix generated by the squared exponential covariance function, i. [sent-109, score-0.152]
31 [Ki ]jk = Cov[fi (zj ), fi (zk )] = βi exp −(zj − zk )T diag(λi )(zj − zk ) + γi , (2) where λi is a vector of inverse length-scales and βi , γi are amplitude and noise parameters. [sent-111, score-0.204]
32 , fk |DU,V , DZ ) = n i=1 k i=1 cpar ui , vi |σ1 [f1 (zi )] , . [sent-118, score-0.278]
33 , σk [fk (zi )] N (fi |mi , Ki ) p(DU,V |DZ ) , (3) where cpar is the density of the parametric copula model and p(DU,V |DZ ) is a normalization constant often called the model evidence. [sent-121, score-0.801]
34 Given a particular value of Z denoted by z , we can make predictions about the conditional distribution of U and V using the standard GP prediction formula p(u , v |z ) = cpar (u , v |σ1 [f1 ], . [sent-122, score-0.193]
35 , fk |DU,V , DZ ) df1 · · · dfk df , (4) k where f = (f1 , . [sent-131, score-0.136]
36 , fk , z , Dz ) = i=1 p(fi |fi , z , Dz ), fi = fi (z ), p(fi |fi , z , Dz ) = N (fi |mi (z ) + kT K−1 (fi − mi ), ki − kT K−1 ki ), ki = Cov[fi (z ), fi (z )] i i i i and ki = (Cov[fi (z ), fi (z1 )], . [sent-137, score-1.309]
37 , fk and DU,V given DZ can be written as a product of n + k factors: n p(f1 , . [sent-146, score-0.136]
38 , fki , ) i=1 hi (fi ) , (5) i=1 where fji = fj (zi ), hi (fi ) = N (fi |mi , Ki ) and gi (f1i , . [sent-152, score-0.479]
39 EP approximates each factor gi with an approximate Gaussian factor gi that may not integrate to one, ˜ k 2 i. [sent-159, score-0.296]
40 , fki ) = si j=1 exp −(fji − mji ) /[2˜ji ] , where si > 0, mji and vji are param˜ ˜ v ˜ ˜ eters to be calculated by EP. [sent-164, score-0.304]
41 Since all the gi and hi are Gaussian, their product is, up to a normalization con˜ stant, a multivariate Gaussian distribution q(f1 , . [sent-166, score-0.201]
42 , fk ) which approximates the exact posterior (3) and factorizes across f1 , . [sent-169, score-0.164]
43 Finally, (4) is approximated by Monte-Carlo by sampling from q and then averaging cpar (u , v |σ1 [f1 ], . [sent-181, score-0.116]
44 EP iteratively updates each gi until convergence by first computing q \i ∝ q/˜i and then minimizing ˜ g the Kullback-Leibler divergence [3] between gi q \i and gi q \i . [sent-185, score-0.402]
45 This involves updating gi so that the ˜ ˜ first and second marginal moments of gi q \i and gi q \i match. [sent-186, score-0.426]
46 However, it is not possible to compute ˜ the moments of gi q \i analytically due to the complicated form of gi . [sent-187, score-0.292]
47 Instead we perform an additional approximation when computing the marginal moments of fji with respect to gi q \i . [sent-190, score-0.209]
48 Without loss of 3 generality, assume that we want to compute the expectation of f1i with respect to gi q \i . [sent-191, score-0.134]
49 We make the following approximation: f1i gi (f1i , . [sent-192, score-0.134]
50 , fki with respect to q \i , and C is a constant that approximates the width of the integrand around its maximum in all dimensions except f1i . [sent-213, score-0.3]
51 , fki ) is very flat when compared to q \i and the maximizer of q \i approximates well the maximizer of gi (f1i , . [sent-231, score-0.468]
52 , fk (as well as q \i ), our implementation of EP decouples into k EP sub-routines among which we alternate; the j-th sub-routine approximates the posterior distribution of fj using as input the means of q \i generated by the other EP sub-routines. [sent-241, score-0.164]
53 In the j-th EP sub-routine, the i-th factor is given by gi (f1i , . [sent-243, score-0.134]
54 , fki } \ {fji } is kept fixed to the current mean of q \i , as estimated by the other EP sub-routines. [sent-249, score-0.246]
55 4 Related Work The model proposed here is an extension of the conditional copula model of [12]. [sent-256, score-0.697]
56 In the case of bivariate data and a copula based on one parameter the models are identical. [sent-257, score-0.727]
57 We have extended the approximate inference for this model to accommodate copulas with multiple parameters; previously computationally infeasible due to requiring the numerical calculation of multidimensional integrals within an inner loop of EP inference. [sent-258, score-0.306]
58 We have also demonstrated that one can use this model to produce excellent predictive results on financial time series by conditioning the copula on time. [sent-259, score-0.791]
59 1 Dynamic Copula Models In [11] a dynamic copula model is proposed based on a two-state hidden Markov model (HMM) (St ∈ {0, 1}) that assumes that the data generating process changes between two regimes of low/high correlation. [sent-261, score-0.702]
60 At any time t the copula density is Student’s t with different parameters for the two values of the hidden state St . [sent-262, score-0.654]
61 Maximum likelihood estimation of the copula parameters and transition probabilities is performed using an EM algorithm [e. [sent-263, score-0.654]
62 A time-varying correlation (TVC) model based on the Student’s t copula is described in [20, 11]. [sent-266, score-0.698]
63 The correlation parameter1 of a Student’s t copula is assumed to satisfy ρt = (1 − α − β)ρ + αεt−1 + βρt−1 , where εt−1 is the empirical correlation of the previous 10 observations and ρ, α and β satisfy −1 ≤ ρ ≤ 1, 0 ≤ α, β ≤ 1 and α + β ≤ 1. [sent-267, score-0.742]
64 In [15] a dynamic copula based on the SJC copula (DSJCC) is introduced. [sent-271, score-1.356]
65 In this method, the parameters τ U and τ L of an SJC copula are assumed to depend on time according to τ U (t) = 0. [sent-272, score-0.654]
66 98Λ ωL + αL εt−1 + βL τ (t − 1) , (7) (8) 10 1 where Λ[·] is the logistic function, εt−1 = 10 j=1 |ut−j − vt−j |, (ut , vt ) is a copula sample at time t and the constants are used to avoid numerical instabilities. [sent-276, score-0.678]
67 We go beyond this prior work by allowing copula parameters to depend on an arbitrary conditioning variables rather than time alone. [sent-279, score-0.709]
68 Also, the models above either assume Markov independence or GARCH-like updates to copula parameters. [sent-280, score-0.654]
69 Let x(t) be a multivariate time series assumed to satisfy x(t) ∼ N (0, Σ(t)). [sent-284, score-0.086]
70 5 Experiments We evaluate the proposed Gaussian process conditional copula models (GPCC) on a one-step-ahead prediction task with synthetic data and financial time series. [sent-295, score-0.724]
71 We use time as the conditioning variable and consider three parametric copula families; Gaussian (GPCC-G), Student’s t (GPCC-T) and symmetrized Joe Clayton (GPCC-SJC). [sent-296, score-0.791]
72 The parameters of these copulas are presented in Table 1 along with the transformations used to model them. [sent-297, score-0.306]
73 Figure 1 shows plots of the densities of these three parametric copula models. [sent-298, score-0.685]
74 The three variants of GPCC were compared against three dynamic copula methods and three constant copula models. [sent-318, score-1.356]
75 The three constant copula models use Gaussian, Student’s t and SJC copulas with parameter values that do not change with time (CONST-G, CONST-T and CONST-SJC). [sent-320, score-0.96]
76 We perform a one-step-ahead rolling-window prediction task on bivariate time series {(ut , vt )}. [sent-321, score-0.14]
77 The methods are then compared by average predictive log-likelihood; an appropriate performance measure for copula estimation since copulas are probability distributions. [sent-324, score-0.999]
78 1 Synthetic Data We generated three synthetic datasets of length 5001 from copula models (Gaussian, Student’s t, SJC) whose parameters vary as periodic functions of time, as specified in Table 1. [sent-326, score-0.681]
79 This technique successfully captures the two regimes of low/high correlation corresponding to the peaks and troughs of the sinusoid that maps time t to correlation τ . [sent-333, score-0.088]
80 However, HMM performs significantly worse in the Student’s t and SJC time series since the different periods for the different copula parameter functions cannot be captured by a two state model. [sent-335, score-0.697]
81 Finally, Table 2 also shows that the static copula methods CONST-[G,T,SJC] are usually outperformed by all dynamic techniques GPCC-[G,T,SJC], DSJCC, TVC and HMM. [sent-340, score-0.741]
82 We evaluated the methods on eight bivariate time series, pairing each currency pair with the Swiss franc (CHF). [sent-346, score-0.152]
83 We first process our data using an asymmetric AR(1)-GARCH(1,1) process with non-parametric innovations [9] to estimate the univariate marginal cdfs at all time points. [sent-349, score-0.112]
84 We train this GARCH model on nW = 2016 data points and then predict the cdf of the next data point; subsequent cdfs are predicted by shifting the training window by one data point in a rolling-window methodology. [sent-350, score-0.081]
85 The cdf estimates are used to transform the raw logarithmic returns (xt , yt ) into a pseudo-sample of the underlying copula (ut , vt ) as described in Section 2. [sent-351, score-0.717]
86 We note that any method for predicting univariate cdfs could have been used to produce pseudo-samples from the copula. [sent-352, score-0.081]
87 The dynamic copula methods GPCC-[G,T,SJC], HMM, and TVC outperform the static methods CONST-[G,T,SJC] in all the analyzed series. [sent-355, score-0.741]
88 The dynamic method DSJCC occasionally performed poorly; worse than the static methods for 3 experiments. [sent-356, score-0.087]
89 In the left plot, we observe a reduction in ν(t) at the onset of the 2008-2012 global recession indicating that the return series became more prone to outliers. [sent-486, score-0.104]
90 All the copula densities in Figure 1 take large values in the proximity of the points (0,0) and (1,1) i. [sent-492, score-0.654]
91 However, the Student’s t copula is the only one of these three copulas which can take high values in the proximity of the points (0,1) and (1,0) i. [sent-495, score-0.96]
92 The equities were chosen to include pairs with both high correlation (e. [sent-501, score-0.073]
93 We observe low values of ν during 2010 suggesting that a Gaussian copula would be a bad fit to the data. [sent-582, score-0.654]
94 6 Conclusions and Future Work We have proposed an inference scheme to fit a conditional copula model to multivariate data where the copula is specified by multiple parameters. [sent-584, score-1.394]
95 The copula parameters are modeled as unknown nonlinear functions of arbitrary conditioning variables. [sent-585, score-0.709]
96 We evaluated this framework by estimating timevarying copula parameters for bivariate financial time series. [sent-586, score-0.751]
97 Our method consistently outperforms static copula models and other dynamic copula models. [sent-587, score-1.395]
98 Higher dimensional copulas are typically constructed using bivariate copulas as building blocks [2, 12]. [sent-589, score-0.685]
99 For example, including a prediction of univariate volatility as a conditioning variable would allow copula parameters to change in response to changing volatility. [sent-593, score-0.748]
100 A multivariate generalized autoregressive conditional heteroscedasticity model with time-varying correlations. [sent-730, score-0.086]
wordName wordTfidf (topN-words)
[('copula', 0.654), ('copulas', 0.306), ('fki', 0.246), ('student', 0.199), ('fi', 0.158), ('dz', 0.141), ('fk', 0.136), ('gi', 0.134), ('sjc', 0.13), ('ki', 0.123), ('cpar', 0.116), ('ep', 0.115), ('chf', 0.101), ('dsjcc', 0.101), ('gpcc', 0.101), ('tvc', 0.101), ('hmm', 0.082), ('garch', 0.077), ('aug', 0.074), ('bivariate', 0.073), ('eur', 0.072), ('oct', 0.071), ('hern', 0.071), ('fy', 0.07), ('dependencies', 0.062), ('fx', 0.06), ('barc', 0.058), ('rbs', 0.058), ('vech', 0.058), ('apr', 0.055), ('currency', 0.055), ('conditioning', 0.055), ('mar', 0.054), ('dollar', 0.051), ('equity', 0.051), ('fji', 0.051), ('symmetrized', 0.051), ('nancial', 0.051), ('mi', 0.049), ('dynamic', 0.048), ('clayton', 0.047), ('nw', 0.044), ('correlation', 0.044), ('series', 0.043), ('currencies', 0.043), ('conditional', 0.043), ('multivariate', 0.043), ('joe', 0.042), ('cdfs', 0.042), ('static', 0.039), ('univariate', 0.039), ('cdf', 0.039), ('nov', 0.039), ('predictive', 0.039), ('gaussian', 0.038), ('gp', 0.038), ('jun', 0.037), ('onset', 0.035), ('predictions', 0.034), ('cos', 0.032), ('parametric', 0.031), ('cov', 0.031), ('asymmetric', 0.031), ('maximizer', 0.03), ('jan', 0.029), ('aud', 0.029), ('cad', 0.029), ('equities', 0.029), ('gbp', 0.029), ('hsbc', 0.029), ('jpy', 0.029), ('mji', 0.029), ('nok', 0.029), ('nzd', 0.029), ('sek', 0.029), ('finance', 0.028), ('approximates', 0.028), ('synthetic', 0.027), ('covariance', 0.027), ('wishart', 0.026), ('ui', 0.026), ('integrand', 0.026), ('axp', 0.026), ('bekk', 0.026), ('engle', 0.026), ('infosys', 0.026), ('recession', 0.026), ('ut', 0.025), ('moments', 0.024), ('editors', 0.024), ('vt', 0.024), ('hi', 0.024), ('vec', 0.024), ('haven', 0.024), ('timevarying', 0.024), ('franc', 0.024), ('swiss', 0.024), ('zi', 0.024), ('zk', 0.023), ('table', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 126 nips-2013-Gaussian Process Conditional Copulas with Applications to Financial Time Series
Author: José Miguel Hernández-Lobato, James R. Lloyd, Daniel Hernández-Lobato
Abstract: The estimation of dependencies between multiple variables is a central problem in the analysis of financial time series. A common approach is to express these dependencies in terms of a copula function. Typically the copula function is assumed to be constant but this may be inaccurate when there are covariates that could have a large influence on the dependence structure of the data. To account for this, a Bayesian framework for the estimation of conditional copulas is proposed. In this framework the parameters of a copula are non-linearly related to some arbitrary conditioning variables. We evaluate the ability of our method to predict time-varying dependencies on several equities and currencies and observe consistent performance gains compared to static copula models and other timevarying copula methods. 1
2 0.38314149 123 nips-2013-Flexible sampling of discrete data correlations without the marginal distributions
Author: Alfredo Kalaitzis, Ricardo Silva
Abstract: Learning the joint dependence of discrete variables is a fundamental problem in machine learning, with many applications including prediction, clustering and dimensionality reduction. More recently, the framework of copula modeling has gained popularity due to its modular parameterization of joint distributions. Among other properties, copulas provide a recipe for combining flexible models for univariate marginal distributions with parametric families suitable for potentially high dimensional dependence structures. More radically, the extended rank likelihood approach of Hoff (2007) bypasses learning marginal models completely when such information is ancillary to the learning task at hand as in, e.g., standard dimensionality reduction problems or copula parameter estimation. The main idea is to represent data by their observable rank statistics, ignoring any other information from the marginals. Inference is typically done in a Bayesian framework with Gaussian copulas, and it is complicated by the fact this implies sampling within a space where the number of constraints increases quadratically with the number of data points. The result is slow mixing when using off-the-shelf Gibbs sampling. We present an efficient algorithm based on recent advances on constrained Hamiltonian Markov chain Monte Carlo that is simple to implement and does not require paying for a quadratic cost in sample size. 1 Contribution There are many ways of constructing multivariate discrete distributions: from full contingency tables in the small dimensional case [1], to structured models given by sparsity constraints [11] and (hierarchies of) latent variable models [6]. More recently, the idea of copula modeling [16] has been combined with such standard building blocks. Our contribution is a novel algorithm for efficient Markov chain Monte Carlo (MCMC) for the copula framework introduced by [7], extending algorithmic ideas introduced by [17]. A copula is a continuous cumulative distribution function (CDF) with uniformly distributed univariate marginals in the unit interval [0, 1]. It complements graphical models and other formalisms that provide a modular parameterization of joint distributions. The core idea is simple and given by the following observation: suppose we are given a (say) bivariate CDF F (y1 , y2 ) with marginals −1 −1 F1 (y1 ) and F2 (y2 ). This CDF can then be rewritten as F (F1 (F1 (y1 )), F2 (F2 (y2 ))). The func−1 −1 tion C(·, ·) given by F (F1 (·), F2 (·)) is a copula. For discrete distributions, this decomposition is not unique but still well-defined [16]. Copulas have found numerous applications in statistics and machine learning since they provide a way of constructing flexible multivariate distributions by mix-and-matching different copulas with different univariate marginals. For instance, one can combine flexible univariate marginals Fi (·) with useful but more constrained high-dimensional copulas. We will not further motivate the use of copula models, which has been discussed at length in recent 1 machine learning publications and conference workshops, and for which comprehensive textbooks exist [e.g., 9]. For a recent discussion on the applications of copulas from a machine learning perspective, [4] provides an overview. [10] is an early reference in machine learning. The core idea dates back at least to the 1950s [16]. In the discrete case, copulas can be difficult to apply: transforming a copula CDF into a probability mass function (PMF) is computationally intractable in general. For the continuous case, a common ˆ trick goes as follows: transform variables by defining ai ≡ Fi (yi ) for an estimate of Fi (·) and then fit a copula density c(·, . . . , ·) to the resulting ai [e.g. 9]. It is not hard to check this breaks down in the discrete case [7]. An alternative is to represent the CDF to PMF transformation for each data point by a continuous integral on a bounded space. Sampling methods can then be used. This trick has allowed many applications of the Gaussian copula to discrete domains. Readers familiar with probit models will recognize the similarities to models where an underlying latent Gaussian field is discretized into observable integers as in Gaussian process classifiers and ordinal regression [18]. Such models can be indirectly interpreted as special cases of the Gaussian copula. In what follows, we describe in Section 2 the Gaussian copula and the general framework for constructing Bayesian estimators of Gaussian copulas by [7], the extended rank likelihood framework. This framework entails computational issues which are discussed. A recent general approach for MCMC in constrained Gaussian fields by [17] can in principle be directly applied to this problem as a blackbox, but at a cost that scales quadratically in sample size and as such it is not practical in general. Our key contribution is given in Section 4. An application experiment on the Bayesian Gaussian copula factor model is performed in Section 5. Conclusions are discussed in the final section. 2 Gaussian copulas and the extended rank likelihood It is not hard to see that any multivariate Gaussian copula is fully defined by a correlation matrix C, since marginal distributions have no free parameters. In practice, the following equivalent generative model is used to define a sample U according to a Gaussian copula GC(C): 1. Sample Z from a zero mean Gaussian with covariance matrix C 2. For each Zj , set Uj = Φ(zj ), where Φ(·) is the CDF of the standard Gaussian It is clear that each Uj follows a uniform distribution in [0, 1]. To obtain a model for variables {y1 , y2 , . . . , yp } with marginal distributions Fj (·) and copula GC(C), one can add the deterministic (n) (1) (1) (2) step yj = Fj−1 (uj ). Now, given n samples of observed data Y ≡ {y1 , . . . , yp , y1 , . . . , yp }, one is interested on inferring C via a Bayesian approach and the posterior distribution p(C, θF | Y) ∝ pGC (Y | C, θF )π(C, θF ) where π(·) is a prior distribution, θF are marginal parameters for each Fj (·), which in general might need to be marginalized since they will be unknown, and pGC (·) is the PMF of a (here discrete) distribution with a Gaussian copula and marginals given by θF . Let Z be the underlying latent Gaussians of the corresponding copula for dataset Y. Although Y is a deterministic function of Z, this mapping is not invertible due to the discreteness of the distribution: each marginal Fj (·) has jumps. Instead, the reverse mapping only enforces the constraints where (i ) (i ) (i ) (i ) yj 1 < yj 2 implies zj 1 < zj 2 . Based on this observation, [7] considers the event Z ∈ D(y), where D(y) is the set of values of Z in Rn×p obeying those constraints, that is (k) (k) D(y) ≡ Z ∈ Rn×p : max zj s.t. yj (i) < yj (i) (k) (i) (k) < zj < min zj s.t. yj < yj . Since {Y = y} ⇒ Z(y) ∈ D(y), we have pGC (Y | C, θF ) = pGC (Z ∈ D(y), Y | C, θF ) = pN (Z ∈ D(y) | C) × pGC (Y| Z ∈ D(y), C, θF ), (1) the first factor of the last line being that of a zero-mean a Gaussian density function marginalized over D(y). 2 The extended rank likelihood is defined by the first factor of (1). With this likelihood, inference for C is given simply by marginalizing p(C, Z | Y) ∝ I(Z ∈ D(y)) pN (Z| C) π(C), (2) the first factor of the right-hand side being the usual binary indicator function. Strictly speaking, this is not a fully Bayesian method since partial information on the marginals is ignored. Nevertheless, it is possible to show that under some mild conditions there is information in the extended rank likelihood to consistently estimate C [13]. It has two important properties: first, in many applications where marginal distributions are nuisance parameters, this sidesteps any major assumptions about the shape of {Fi (·)} – applications include learning the degree of dependence among variables (e.g., to understand relationships between social indicators as in [7] and [13]) and copula-based dimensionality reduction (a generalization of correlation-based principal component analysis, e.g., [5]); second, MCMC inference in the extended rank likelihood is conceptually simpler than with the joint likelihood, since dropping marginal models will remove complicated entanglements between C and θF . Therefore, even if θF is necessary (when, for instance, predicting missing values of Y), an estimate of C can be computed separately and will not depend on the choice of estimator for {Fi (·)}. The standard model with a full correlation matrix C can be further refined to take into account structure implied by sparse inverse correlation matrices [2] or low rank decompositions via higher-order latent variable models [13], among others. We explore the latter case in section 5. An off-the-shelf algorithm for sampling from (2) is full Gibbs sampling: first, given Z, the (full or structured) correlation matrix C can be sampled by standard methods. More to the point, sampling (i) Z is straightforward if for each variable j and data point i we sample Zj conditioned on all other variables. The corresponding distribution is an univariate truncated Gaussian. This is the approach used originally by Hoff. However, mixing can be severely compromised by the sampling of Z, and that is where novel sampling methods can facilitate inference. 3 Exact HMC for truncated Gaussian distributions (i) Hoff’s algorithm modifies the positions of all Zj associated with a particular discrete value of Yj , conditioned on the remaining points. As the number of data points increases, the spread of the hard (i) boundaries on Zj , given by data points of Zj associated with other levels of Yj , increases. This (i) reduces the space in which variables Zj can move at a time. To improve the mixing, we aim to sample from the joint Gaussian distribution of all latent variables (i) Zj , i = 1 . . . n , conditioned on other columns of the data, such that the constraints between them are satisfied and thus the ordering in the observation level is conserved. Standard Gibbs approaches for sampling from truncated Gaussians reduce the problem to sampling from univariate truncated Gaussians. Even though each step is computationally simple, mixing can be slow when strong correlations are induced by very tight truncation bounds. In the following, we briefly describe the methodology recently introduced by [17] that deals with the problem of sampling from log p(x) ∝ − 1 x Mx + r x , where x, r ∈ Rn and M is positive 2 definite, with linear constraints of the form fj x ≤ gj , where fj ∈ Rn , j = 1 . . . m, is the normal vector to some linear boundary in the sample space. Later in this section we shall describe how this framework can be applied to the Gaussian copula extended rank likelihood model. More importantly, the observed rank statistics impose only linear constraints of the form xi1 ≤ xi2 . We shall describe how this special structure can be exploited to reduce the runtime complexity of the constrained sampler from O(n2 ) (in the number of observations) to O(n) in practice. 3.1 Hamiltonian Monte Carlo for the Gaussian distribution Hamiltonian Monte Carlo (HMC) [15] is a MCMC method that extends the sampling space with auxiliary variables so that (ideally) deterministic moves in the joint space brings the sampler to 3 potentially far places in the original variable space. Deterministic moves cannot in general be done, but this is possible in the Gaussian case. The form of the Hamiltonian for the general d-dimensional Gaussian case with mean µ and precision matrix M is: 1 1 H = x Mx − r x + s M−1 s , (3) 2 2 where M is also known in the present context as the mass matrix, r = Mµ and s is the velocity. Both x and s are Gaussian distributed so this Hamiltonian can be seen (up to a constant) as the negative log of the product of two independent Gaussian random variables. The physical interpretation is that of a sum of potential and kinetic energy terms, where the total energy of the system is conserved. In a system where this Hamiltonian function is constant, we can exactly compute its evolution through the pair of differential equations: ˙ x= sH = M−1 s , ˙ s=− xH = −Mx + r . (4) These are solved exactly by x(t) = µ + a sin(t) + b cos(t) , where a and b can be identified at initial conditions (t = 0) : ˙ a = x(0) = M−1 s , b = x(0) − µ . (5) Therefore, the exact HMC algorithm can be summarised as follows: • Initialise the allowed travel time T and some initial position x0 . • Repeat for HMC samples k = 1 . . . N 1. Sample sk ∼ N (0, M) 2. Use sk and xk to update a and b and store the new position at the end of the trajectory xk+1 = x(T ) as an HMC sample. It can be easily shown that the Markov chain of sampled positions has the desired equilibrium distribution N µ, M−1 [17]. 3.2 Sampling with linear constraints Sampling from multivariate Gaussians does not require any method as sophisticated as HMC, but the plot thickens when the target distribution is truncated by linear constraints of the form Fx ≤ g . Here, F ∈ Rm×n is a constraint matrix whose every row is the normal vector to a linear boundary in the sample space. This is equivalent to sampling from a Gaussian that is confined in the (not necessarily bounded) convex polyhedron {x : Fx ≤ g}. In general, to remain within the boundaries of each wall, once a new velocity has been sampled one must compute all possible collision times with the walls. The smallest of all collision times signifies the wall that the particle should bounce from at that collision time. Figure 1 illustrates the concept with two simple examples on 2 and 3 dimensions. The collision times can be computed analytically and their equations can be found in the supplementary material. We also point the reader to [17] for a more detailed discussion of this implementation. Once the wall to be hit has been found, then position and velocity at impact time are computed and the velocity is reflected about the boundary normal1 . The constrained HMC sampler is summarized follows: • Initialise the allowed travel time T and some initial position x0 . • Repeat for HMC samples k = 1 . . . N 1. Sample sk ∼ N (0, M) 2. Use sk and xk to update a and b . 1 Also equivalent to transforming the velocity with a Householder reflection matrix about the bounding hyperplane. 4 1 2 3 4 1 2 3 4 Figure 1: Left: Trajectories of the first 40 iterations of the exact HMC sampler on a 2D truncated Gaussian. A reflection of the velocity can clearly be seen when the particle meets wall #2 . Here, the constraint matrix F is a 4 × 2 matrix. Center: The same example after 40000 samples. The coloring of each sample indicates its density value. Right: The anatomy of a 3D Gaussian. The walls are now planes and in this case F is a 2 × 3 matrix. Figure best seen in color. 3. Reset remaining travel time Tleft ← T . Until no travel time is left or no walls can be reached (no solutions exist), do: (a) Compute impact times with all walls and pick the smallest one, th (if a solution exists). (b) Compute v(th ) and reflect it about the hyperplane fh . This is the updated velocity after impact. The updated position is x(th ) . (c) Tleft ← Tleft − th 4. Store the new position at the end of the trajectory xk+1 as an HMC sample. In general, all walls are candidates for impact, so the runtime of the sampler is linear in m , the number of constraints. This means that the computational load is concentrated in step 3(a). Another consideration is that of the allocated travel time T . Depending on the shape of the bounding polyhedron and the number of walls, a very large travel time can induce many more bounces thus requiring more computations per sample. On the other hand, a very small travel time explores the distribution more locally so the mixing of the chain can suffer. What constitutes a given travel time “large” or “small” is relative to the dimensionality, the number of constraints and the structure of the constraints. Due to the nature of our problem, the number of constraints, when explicitly expressed as linear functions, is O(n2 ) . Clearly, this restricts any direct application of the HMC framework for Gaussian copula estimation to small-sample (n) datasets. More importantly, we show how to exploit the structure of the constraints to reduce the number of candidate walls (prior to each bounce) to O(n) . 4 HMC for the Gaussian Copula extended rank likelihood model Given some discrete data Y ∈ Rn×p , the task is to infer the correlation matrix of the underlying Gaussian copula. Hoff’s sampling algorithm proceeds by alternating between sampling the continu(i) (i) ous latent representation Zj of each Yj , for i = 1 . . . n, j = 1 . . . p , and sampling a covariance matrix from an inverse-Wishart distribution conditioned on the sampled matrix Z ∈ Rn×p , which is then renormalized as a correlation matrix. From here on, we use matrix notation for the samples, as opposed to the random variables – with (i) Zi,j replacing Zj , Z:,j being a column of Z, and Z:,\j being the submatrix of Z without the j-th column. In a similar vein to Hoff’s sampling algorithm, we replace the successive sampling of each Zi,j conditioned on Zi,\j (a conditional univariate truncated Gaussian) with the simultaneous sampling of Z:,j conditioned on Z:,\j . This is done through an HMC step from a conditional multivariate truncated Gaussian. The added benefit of this HMC step over the standard Gibbs approach, is that of a handle for regulating the trade-off between exploration and runtime via the allocated travel time T . Larger travel times potentially allow for larger moves in the sample space, but it comes at a cost as explained in the sequel. 5 4.1 The Hough envelope algorithm The special structure of constraints. Recall that the number of constraints is quadratic in the dimension of the distribution. This is because every Z sample must satisfy the conditions of the event Z ∈ D(y) of the extended rank likelihood (see Section 2). In other words, for any column Z:,j , all entries are organised into a partition L(j) of |L(j) | levels, the number of unique values observed for the discrete or ordinal variable Y (j) . Thereby, for any two adjacent levels lk , lk+1 ∈ L(j) and any pair i1 ∈ lk , i2 ∈ lk+1 , it must be true that Zli ,j < Zli+1 ,j . Equivalently, a constraint f exists where fi1 = 1, fi2 = −1 and g = 0 . It is easy to see that O(n2 ) of such constraints are induced by the order statistics of the j-th variable. To deal with this boundary explosion, we developed the Hough Envelope algorithm to search efficiently, within all pairs in {Z:,j }, in practically linear time. Recall in HMC (section 3.2) that the trajectory of the particle, x(t), is decomposed as xi (t) = ai sin(t) + bi cos(t) + µi , (6) and there are n such functions, grouped into a partition of levels as described above. The Hough envelope2 is found for every pair of adjacent levels. We illustrate this with an example of 10 dimensions and two levels in Figure 2, without loss of generalization to any number of levels or dimensions. Assume we represent trajectories for points in level lk with blue curves, and points in lk+1 with red curves. Assuming we start with a valid state, at time t = 0 all red curves are above all blue curves. The goal is to find the smallest t where a blue curve meets a red curve. This will be our collision time where a bounce will be necessary. 5 3 1 2 Figure 2: The trajectories xj (t) of each component are sinusoid functions. The right-most green dot signifies the wall and the time th of the earliest bounce, where the first inter-level pair (that is, any two components respectively from the blue and red level) becomes equal, in this case the constraint activated being xblue2 = xred2 . 4 4 5 1 2 3 0.2 0.4 0.6 t 0.8 1 1.2 1.4 1. First we find the largest component bluemax of the blue level at t = 0. This takes O(n) time. Clearly, this will be the largest component until its sinusoid intersects that of any other component. 2. To find the next largest component, compute the roots of xbluemax (t) − xi (t) = 0 for all components and pick the smallest (earliest) one (represented by a green dot). This also takes O(n) time. 3. Repeat this procedure until a red sinusoid crosses the highest running blue sinusoid. When this happens, the time of earliest bounce and its constraint are found. In the worst-case scenario, n such repetitions have to be made, but in practice we can safely assume an fixed upper bound h on the number of blue crossings before a inter-level crossing occurs. In experiments, we found h << n, no more than 10 in simulations with hundreds of thousands of curves. Thus, this search strategy takes O(n) time in practice to complete, mirroring the analysis of other output-sensitive algorithms such as the gift wrapping algorithm for computing convex hulls [8]. Our HMC sampling approach is summarized in Algorithm 1. 2 The name is inspired from the fact that each xi (t) is the sinusoid representation, in angle-distance space, of all lines that pass from the (ai , bi ) point in a − b space. A representation known in image processing as the Hough transform [3]. 6 Algorithm 1 HMC for GCERL # Notation: T MN (µ, C, F) is a truncated multivariate normal with location vector µ, scale matrix C and constraints encoded by F and g = 0 . # IW(df, V0 ) is an inverse-Wishart prior with degrees of freedom df and scale matrix V0 . Input: Y ∈ Rn×p , allocated travel time T , a starting Z and variable covariance V ∈ Rp×p , df = p + 2, V0 = df Ip and chain size N . Generate constraints F(j) from Y:,j , for j = 1 . . . p . for samples k = 1 . . . N do # Resample Z as follows: for variables j = 1 . . . p do −1 −1 2 Compute parameters: σj = Vjj − Vj,\j V\j,\j V\j,j , µj = Z:,\j V\j,\j V\j,j . 2 Get one sample Z:,j ∼ T MN µj , σj I, F(j) efficiently by using the Hough Envelope algorithm, see section 4.1. end for Resample V ∼ IW(df + n, V0 + Z Z) . Compute correlation matrix C, s.t. Ci,j = Vi,j / Vi,i Vj,j and store sample, C(k) ← C . end for 5 An application on the Bayesian Gausian copula factor model In this section we describe an experiment that highlights the benefits of our HMC treatment, compared to a state-of-the-art parameter expansion (PX) sampling scheme. During this experiment we ask the important question: “How do the two schemes compare when we exploit the full-advantage of the HMC machinery to jointly sample parameters and the augmented data Z, in a model of latent variables and structured correlations?” We argue that under such circumstances the superior convergence speed and mixing of HMC undeniably compensate for its computational overhead. Experimental setup In this section we provide results from an application on the Gaussian copula latent factor model of [13] (Hoff’s model [7] for low-rank structured correlation matrices). We modify the parameter expansion (PX) algorithm used by [13] by replacing two of its Gibbs steps with a single HMC step. We show a much faster convergence to the true mode with considerable support on its vicinity. We show that unlike the HMC, the PX algorithm falls short of properly exploring the posterior in any reasonable finite amount of time, even for small models, even for small samples. Worse, PX fails in ways one cannot easily detect. Namely, we sample each row of the factor loadings matrix Λ jointly with the corresponding column of the augmented data matrix Z, conditioning on the higher-order latent factors. This step is analogous to Pakman and Paninski’s [17, sec.3.1] use of HMC in the context of a binary probit model (the extension to many levels in the discrete marginal is straightforward with direct application of the constraint matrix F and the Hough envelope algorithm). The sampling of the higher level latent factors remains identical to [13]. Our scheme involves no parameter expansion. We do however interweave the Gibbs step for the Z matrix similarly to Hoff. This has the added benefit of exploring the Z sample space within their current boundaries, complementing the joint (λ, z) sampling which moves the boundaries jointly. The value of such ”interweaving” schemes has been addressed in [19]. Results We perform simulations of 10000 iterations, n = 1000 observations (rows of Y), travel time π/2 for HMC with the setups listed in the following table, along with the elapsed times of each sampling scheme. These experiments were run on Intel COREi7 desktops with 4 cores and 8GB of RAM. Both methods were parallelized across the observed variables (p). Figure p (vars) k (latent factors) M (ordinal levels) elapsed (mins): HMC PX 3(a) : 20 5 2 115 8 3(b) : 10 3 2 80 6 10 3 5 203 16 3(c) : Many functionals of the loadings matrix Λ can be assessed. We focus on reconstructing the true (low-rank) correlation matrix of the Gaussian copula. In particular, we summarize the algorithm’s 7 outcome with the root mean squared error (RMSE) of the differences between entries of the ground-truth correlation matrix and the implied correlation matrix at each iteration of a MCMC scheme (so the following plots looks like a time-series of 10000 timepoints), see Figures 3(a), 3(b) and 3(c) . (a) (b) (c) Figure 3: Reconstruction (RMSE per iteration) of the low-rank structured correlation matrix of the Gaussian copula and its histogram (along the left side). (a) Simulation setup: 20 variables, 5 factors, 5 levels. HMC (blue) reaches a better mode faster (in iterations/CPU-time) than PX (red). Even more importantly the RMSE posterior samples of PX are concentrated in a much smaller region compared to HMC, even after 10000 iterations. This illustrates that PX poorly explores the true distribution. (b) Simulation setup: 10 vars, 3 factors, 2 levels. We observe behaviors similar to Figure 3(a). Note that the histogram counts RMSEs after the burn-in period of PX (iteration #500). (c) Simulation setup: 10 vars, 3 factors, 5 levels. We observe behaviors similar to Figures 3(a) and 3(b) but with a thinner tail for HMC. Note that the histogram counts RMSEs after the burn-in period of PX (iteration #2000). Main message HMC reaches a better mode faster (iterations/CPUtime). Even more importantly the RMSE posterior samples of PX are concentrated in a much smaller region compared to HMC, even after 10000 iterations. This illustrates that PX poorly explores the true distribution. As an analogous situation we refer to the top and bottom panels of Figure 14 of Radford Neal’s slice sampler paper [14]. If there was no comparison against HMC, there would be no evidence from the PX plot alone that the algorithm is performing poorly. This mirrors Radford Neal’s statement opening Section 8 of his paper: “a wrong answer is obtained without any obvious indication that something is amiss”. The concentration on the posterior mode of PX in these simulations is misleading of the truth. PX might seen a bit simpler to implement, but it seems one cannot avoid using complex algorithms for complex models. We urge practitioners to revisit their past work with this model to find out by how much credible intervals of functionals of interest have been overconfident. Whether trivially or severely, our algorithm offers the first principled approach for checking this out. 6 Conclusion Sampling large random vectors simultaneously in order to improve mixing is in general a very hard problem, and this is why clever methods such as HMC or elliptical slice sampling [12] are necessary. We expect that the method here developed is useful not only for those with data analysis problems within the large family of Gaussian copula extended rank likelihood models, but the method itself and its behaviour might provide some new insights on MCMC sampling in constrained spaces in general. Another direction of future work consists of exploring methods for elliptical copulas, and related possible extensions of general HMC for non-Gaussian copula models. Acknowledgements The quality of this work has benefited largely from comments by our anonymous reviewers and useful discussions with Simon Byrne and Vassilios Stathopoulos. Research was supported by EPSRC grant EP/J013293/1. 8 References [1] Y. Bishop, S. Fienberg, and P. Holland. Discrete Multivariate Analysis: Theory and Practice. MIT Press, 1975. [2] A. Dobra and A. Lenkoski. Copula Gaussian graphical models and their application to modeling functional disability data. Annals of Applied Statistics, 5:969–993, 2011. [3] R. O. Duda and P. E. Hart. Use of the Hough transformation to detect lines and curves in pictures. Communications of the ACM, 15(1):11–15, 1972. [4] G. Elidan. Copulas and machine learning. Proceedings of the Copulae in Mathematical and Quantitative Finance workshop, to appear, 2013. [5] F. Han and H. Liu. Semiparametric principal component analysis. Advances in Neural Information Processing Systems, 25:171–179, 2012. [6] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006. [7] P. Hoff. Extending the rank likelihood for semiparametric copula estimation. Annals of Applied Statistics, 1:265–283, 2007. [8] R. Jarvis. On the identification of the convex hull of a finite set of points in the plane. Information Processing Letters, 2(1):18–21, 1973. [9] H. Joe. Multivariate Models and Dependence Concepts. Chapman-Hall, 1997. [10] S. Kirshner. Learning with tree-averaged densities and distributions. Neural Information Processing Systems, 2007. [11] S. Lauritzen. Graphical Models. Oxford University Press, 1996. [12] I. Murray, R. Adams, and D. MacKay. Elliptical slice sampling. JMLR Workshop and Conference Proceedings: AISTATS 2010, 9:541–548, 2010. [13] J. Murray, D. Dunson, L. Carin, and J. Lucas. Bayesian Gaussian copula factor models for mixed data. Journal of the American Statistical Association, to appear, 2013. [14] R. Neal. Slice sampling. The Annals of Statistics, 31:705–767, 2003. [15] R. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, pages 113–162, 2010. [16] R. Nelsen. An Introduction to Copulas. Springer-Verlag, 2007. [17] A. Pakman and L. Paninski. Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. arXiv:1208.4118, 2012. [18] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006. [19] Y. Yu and X. L. Meng. To center or not to center: That is not the question — An ancillaritysufficiency interweaving strategy (ASIS) for boosting MCMC efficiency. Journal of Computational and Graphical Statistics, 20(3):531–570, 2011. 9
3 0.15047716 327 nips-2013-The Randomized Dependence Coefficient
Author: David Lopez-Paz, Philipp Hennig, Bernhard Schölkopf
Abstract: We introduce the Randomized Dependence Coefficient (RDC), a measure of nonlinear dependence between random variables of arbitrary dimension based on the Hirschfeld-Gebelein-R´ nyi Maximum Correlation Coefficient. RDC is defined in e terms of correlation of random non-linear copula projections; it is invariant with respect to marginal distribution transformations, has low computational cost and is easy to implement: just five lines of R code, included at the end of the paper. 1
4 0.080245428 217 nips-2013-On Poisson Graphical Models
Author: Eunho Yang, Pradeep Ravikumar, Genevera I. Allen, Zhandong Liu
Abstract: Undirected graphical models, such as Gaussian graphical models, Ising, and multinomial/categorical graphical models, are widely used in a variety of applications for modeling distributions over a large number of variables. These standard instances, however, are ill-suited to modeling count data, which are increasingly ubiquitous in big-data settings such as genomic sequencing data, user-ratings data, spatial incidence data, climate studies, and site visits. Existing classes of Poisson graphical models, which arise as the joint distributions that correspond to Poisson distributed node-conditional distributions, have a major drawback: they can only model negative conditional dependencies for reasons of normalizability given its infinite domain. In this paper, our objective is to modify the Poisson graphical model distribution so that it can capture a rich dependence structure between count-valued variables. We begin by discussing two strategies for truncating the Poisson distribution and show that only one of these leads to a valid joint distribution. While this model can accommodate a wider range of conditional dependencies, some limitations still remain. To address this, we investigate two additional novel variants of the Poisson distribution and their corresponding joint graphical model distributions. Our three novel approaches provide classes of Poisson-like graphical models that can capture both positive and negative conditional dependencies between count-valued variables. One can learn the graph structure of our models via penalized neighborhood selection, and we demonstrate the performance of our methods by learning simulated networks as well as a network from microRNA-sequencing data. 1
5 0.077109009 153 nips-2013-Learning Feature Selection Dependencies in Multi-task Learning
Author: Daniel Hernández-Lobato, José Miguel Hernández-Lobato
Abstract: A probabilistic model based on the horseshoe prior is proposed for learning dependencies in the process of identifying relevant features for prediction. Exact inference is intractable in this model. However, expectation propagation offers an approximate alternative. Because the process of estimating feature selection dependencies may suffer from over-fitting in the model proposed, additional data from a multi-task learning scenario are considered for induction. The same model can be used in this setting with few modifications. Furthermore, the assumptions made are less restrictive than in other multi-task methods: The different tasks must share feature selection dependencies, but can have different relevant features and model coefficients. Experiments with real and synthetic data show that this model performs better than other multi-task alternatives from the literature. The experiments also show that the model is able to induce suitable feature selection dependencies for the problems considered, only from the training data. 1
6 0.06770473 178 nips-2013-Locally Adaptive Bayesian Multivariate Time Series
7 0.062023684 193 nips-2013-Mixed Optimization for Smooth Functions
8 0.0579684 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions
9 0.055118907 298 nips-2013-Small-Variance Asymptotics for Hidden Markov Models
10 0.053143274 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC
11 0.05094824 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting
12 0.050418012 145 nips-2013-It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals
13 0.047407046 168 nips-2013-Learning to Pass Expectation Propagation Messages
14 0.0455334 280 nips-2013-Robust Data-Driven Dynamic Programming
15 0.04189365 155 nips-2013-Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition
16 0.040893584 318 nips-2013-Structured Learning via Logistic Regression
17 0.040354673 346 nips-2013-Variational Inference for Mahalanobis Distance Metrics in Gaussian Process Regression
18 0.039412364 310 nips-2013-Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.
19 0.036634304 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
20 0.036632646 201 nips-2013-Multi-Task Bayesian Optimization
topicId topicWeight
[(0, 0.114), (1, 0.03), (2, 0.002), (3, 0.017), (4, -0.019), (5, 0.077), (6, 0.058), (7, 0.023), (8, 0.049), (9, -0.026), (10, 0.003), (11, -0.024), (12, -0.095), (13, -0.001), (14, -0.008), (15, 0.121), (16, -0.078), (17, -0.03), (18, -0.107), (19, 0.039), (20, -0.156), (21, 0.069), (22, -0.014), (23, 0.076), (24, 0.265), (25, -0.019), (26, 0.108), (27, 0.081), (28, 0.029), (29, -0.184), (30, -0.091), (31, -0.082), (32, -0.072), (33, 0.032), (34, -0.096), (35, 0.048), (36, 0.029), (37, 0.102), (38, -0.088), (39, 0.073), (40, -0.02), (41, -0.085), (42, 0.075), (43, -0.026), (44, 0.046), (45, 0.163), (46, 0.026), (47, -0.027), (48, 0.003), (49, 0.094)]
simIndex simValue paperId paperTitle
same-paper 1 0.92962712 126 nips-2013-Gaussian Process Conditional Copulas with Applications to Financial Time Series
Author: José Miguel Hernández-Lobato, James R. Lloyd, Daniel Hernández-Lobato
Abstract: The estimation of dependencies between multiple variables is a central problem in the analysis of financial time series. A common approach is to express these dependencies in terms of a copula function. Typically the copula function is assumed to be constant but this may be inaccurate when there are covariates that could have a large influence on the dependence structure of the data. To account for this, a Bayesian framework for the estimation of conditional copulas is proposed. In this framework the parameters of a copula are non-linearly related to some arbitrary conditioning variables. We evaluate the ability of our method to predict time-varying dependencies on several equities and currencies and observe consistent performance gains compared to static copula models and other timevarying copula methods. 1
2 0.79956931 123 nips-2013-Flexible sampling of discrete data correlations without the marginal distributions
Author: Alfredo Kalaitzis, Ricardo Silva
Abstract: Learning the joint dependence of discrete variables is a fundamental problem in machine learning, with many applications including prediction, clustering and dimensionality reduction. More recently, the framework of copula modeling has gained popularity due to its modular parameterization of joint distributions. Among other properties, copulas provide a recipe for combining flexible models for univariate marginal distributions with parametric families suitable for potentially high dimensional dependence structures. More radically, the extended rank likelihood approach of Hoff (2007) bypasses learning marginal models completely when such information is ancillary to the learning task at hand as in, e.g., standard dimensionality reduction problems or copula parameter estimation. The main idea is to represent data by their observable rank statistics, ignoring any other information from the marginals. Inference is typically done in a Bayesian framework with Gaussian copulas, and it is complicated by the fact this implies sampling within a space where the number of constraints increases quadratically with the number of data points. The result is slow mixing when using off-the-shelf Gibbs sampling. We present an efficient algorithm based on recent advances on constrained Hamiltonian Markov chain Monte Carlo that is simple to implement and does not require paying for a quadratic cost in sample size. 1 Contribution There are many ways of constructing multivariate discrete distributions: from full contingency tables in the small dimensional case [1], to structured models given by sparsity constraints [11] and (hierarchies of) latent variable models [6]. More recently, the idea of copula modeling [16] has been combined with such standard building blocks. Our contribution is a novel algorithm for efficient Markov chain Monte Carlo (MCMC) for the copula framework introduced by [7], extending algorithmic ideas introduced by [17]. A copula is a continuous cumulative distribution function (CDF) with uniformly distributed univariate marginals in the unit interval [0, 1]. It complements graphical models and other formalisms that provide a modular parameterization of joint distributions. The core idea is simple and given by the following observation: suppose we are given a (say) bivariate CDF F (y1 , y2 ) with marginals −1 −1 F1 (y1 ) and F2 (y2 ). This CDF can then be rewritten as F (F1 (F1 (y1 )), F2 (F2 (y2 ))). The func−1 −1 tion C(·, ·) given by F (F1 (·), F2 (·)) is a copula. For discrete distributions, this decomposition is not unique but still well-defined [16]. Copulas have found numerous applications in statistics and machine learning since they provide a way of constructing flexible multivariate distributions by mix-and-matching different copulas with different univariate marginals. For instance, one can combine flexible univariate marginals Fi (·) with useful but more constrained high-dimensional copulas. We will not further motivate the use of copula models, which has been discussed at length in recent 1 machine learning publications and conference workshops, and for which comprehensive textbooks exist [e.g., 9]. For a recent discussion on the applications of copulas from a machine learning perspective, [4] provides an overview. [10] is an early reference in machine learning. The core idea dates back at least to the 1950s [16]. In the discrete case, copulas can be difficult to apply: transforming a copula CDF into a probability mass function (PMF) is computationally intractable in general. For the continuous case, a common ˆ trick goes as follows: transform variables by defining ai ≡ Fi (yi ) for an estimate of Fi (·) and then fit a copula density c(·, . . . , ·) to the resulting ai [e.g. 9]. It is not hard to check this breaks down in the discrete case [7]. An alternative is to represent the CDF to PMF transformation for each data point by a continuous integral on a bounded space. Sampling methods can then be used. This trick has allowed many applications of the Gaussian copula to discrete domains. Readers familiar with probit models will recognize the similarities to models where an underlying latent Gaussian field is discretized into observable integers as in Gaussian process classifiers and ordinal regression [18]. Such models can be indirectly interpreted as special cases of the Gaussian copula. In what follows, we describe in Section 2 the Gaussian copula and the general framework for constructing Bayesian estimators of Gaussian copulas by [7], the extended rank likelihood framework. This framework entails computational issues which are discussed. A recent general approach for MCMC in constrained Gaussian fields by [17] can in principle be directly applied to this problem as a blackbox, but at a cost that scales quadratically in sample size and as such it is not practical in general. Our key contribution is given in Section 4. An application experiment on the Bayesian Gaussian copula factor model is performed in Section 5. Conclusions are discussed in the final section. 2 Gaussian copulas and the extended rank likelihood It is not hard to see that any multivariate Gaussian copula is fully defined by a correlation matrix C, since marginal distributions have no free parameters. In practice, the following equivalent generative model is used to define a sample U according to a Gaussian copula GC(C): 1. Sample Z from a zero mean Gaussian with covariance matrix C 2. For each Zj , set Uj = Φ(zj ), where Φ(·) is the CDF of the standard Gaussian It is clear that each Uj follows a uniform distribution in [0, 1]. To obtain a model for variables {y1 , y2 , . . . , yp } with marginal distributions Fj (·) and copula GC(C), one can add the deterministic (n) (1) (1) (2) step yj = Fj−1 (uj ). Now, given n samples of observed data Y ≡ {y1 , . . . , yp , y1 , . . . , yp }, one is interested on inferring C via a Bayesian approach and the posterior distribution p(C, θF | Y) ∝ pGC (Y | C, θF )π(C, θF ) where π(·) is a prior distribution, θF are marginal parameters for each Fj (·), which in general might need to be marginalized since they will be unknown, and pGC (·) is the PMF of a (here discrete) distribution with a Gaussian copula and marginals given by θF . Let Z be the underlying latent Gaussians of the corresponding copula for dataset Y. Although Y is a deterministic function of Z, this mapping is not invertible due to the discreteness of the distribution: each marginal Fj (·) has jumps. Instead, the reverse mapping only enforces the constraints where (i ) (i ) (i ) (i ) yj 1 < yj 2 implies zj 1 < zj 2 . Based on this observation, [7] considers the event Z ∈ D(y), where D(y) is the set of values of Z in Rn×p obeying those constraints, that is (k) (k) D(y) ≡ Z ∈ Rn×p : max zj s.t. yj (i) < yj (i) (k) (i) (k) < zj < min zj s.t. yj < yj . Since {Y = y} ⇒ Z(y) ∈ D(y), we have pGC (Y | C, θF ) = pGC (Z ∈ D(y), Y | C, θF ) = pN (Z ∈ D(y) | C) × pGC (Y| Z ∈ D(y), C, θF ), (1) the first factor of the last line being that of a zero-mean a Gaussian density function marginalized over D(y). 2 The extended rank likelihood is defined by the first factor of (1). With this likelihood, inference for C is given simply by marginalizing p(C, Z | Y) ∝ I(Z ∈ D(y)) pN (Z| C) π(C), (2) the first factor of the right-hand side being the usual binary indicator function. Strictly speaking, this is not a fully Bayesian method since partial information on the marginals is ignored. Nevertheless, it is possible to show that under some mild conditions there is information in the extended rank likelihood to consistently estimate C [13]. It has two important properties: first, in many applications where marginal distributions are nuisance parameters, this sidesteps any major assumptions about the shape of {Fi (·)} – applications include learning the degree of dependence among variables (e.g., to understand relationships between social indicators as in [7] and [13]) and copula-based dimensionality reduction (a generalization of correlation-based principal component analysis, e.g., [5]); second, MCMC inference in the extended rank likelihood is conceptually simpler than with the joint likelihood, since dropping marginal models will remove complicated entanglements between C and θF . Therefore, even if θF is necessary (when, for instance, predicting missing values of Y), an estimate of C can be computed separately and will not depend on the choice of estimator for {Fi (·)}. The standard model with a full correlation matrix C can be further refined to take into account structure implied by sparse inverse correlation matrices [2] or low rank decompositions via higher-order latent variable models [13], among others. We explore the latter case in section 5. An off-the-shelf algorithm for sampling from (2) is full Gibbs sampling: first, given Z, the (full or structured) correlation matrix C can be sampled by standard methods. More to the point, sampling (i) Z is straightforward if for each variable j and data point i we sample Zj conditioned on all other variables. The corresponding distribution is an univariate truncated Gaussian. This is the approach used originally by Hoff. However, mixing can be severely compromised by the sampling of Z, and that is where novel sampling methods can facilitate inference. 3 Exact HMC for truncated Gaussian distributions (i) Hoff’s algorithm modifies the positions of all Zj associated with a particular discrete value of Yj , conditioned on the remaining points. As the number of data points increases, the spread of the hard (i) boundaries on Zj , given by data points of Zj associated with other levels of Yj , increases. This (i) reduces the space in which variables Zj can move at a time. To improve the mixing, we aim to sample from the joint Gaussian distribution of all latent variables (i) Zj , i = 1 . . . n , conditioned on other columns of the data, such that the constraints between them are satisfied and thus the ordering in the observation level is conserved. Standard Gibbs approaches for sampling from truncated Gaussians reduce the problem to sampling from univariate truncated Gaussians. Even though each step is computationally simple, mixing can be slow when strong correlations are induced by very tight truncation bounds. In the following, we briefly describe the methodology recently introduced by [17] that deals with the problem of sampling from log p(x) ∝ − 1 x Mx + r x , where x, r ∈ Rn and M is positive 2 definite, with linear constraints of the form fj x ≤ gj , where fj ∈ Rn , j = 1 . . . m, is the normal vector to some linear boundary in the sample space. Later in this section we shall describe how this framework can be applied to the Gaussian copula extended rank likelihood model. More importantly, the observed rank statistics impose only linear constraints of the form xi1 ≤ xi2 . We shall describe how this special structure can be exploited to reduce the runtime complexity of the constrained sampler from O(n2 ) (in the number of observations) to O(n) in practice. 3.1 Hamiltonian Monte Carlo for the Gaussian distribution Hamiltonian Monte Carlo (HMC) [15] is a MCMC method that extends the sampling space with auxiliary variables so that (ideally) deterministic moves in the joint space brings the sampler to 3 potentially far places in the original variable space. Deterministic moves cannot in general be done, but this is possible in the Gaussian case. The form of the Hamiltonian for the general d-dimensional Gaussian case with mean µ and precision matrix M is: 1 1 H = x Mx − r x + s M−1 s , (3) 2 2 where M is also known in the present context as the mass matrix, r = Mµ and s is the velocity. Both x and s are Gaussian distributed so this Hamiltonian can be seen (up to a constant) as the negative log of the product of two independent Gaussian random variables. The physical interpretation is that of a sum of potential and kinetic energy terms, where the total energy of the system is conserved. In a system where this Hamiltonian function is constant, we can exactly compute its evolution through the pair of differential equations: ˙ x= sH = M−1 s , ˙ s=− xH = −Mx + r . (4) These are solved exactly by x(t) = µ + a sin(t) + b cos(t) , where a and b can be identified at initial conditions (t = 0) : ˙ a = x(0) = M−1 s , b = x(0) − µ . (5) Therefore, the exact HMC algorithm can be summarised as follows: • Initialise the allowed travel time T and some initial position x0 . • Repeat for HMC samples k = 1 . . . N 1. Sample sk ∼ N (0, M) 2. Use sk and xk to update a and b and store the new position at the end of the trajectory xk+1 = x(T ) as an HMC sample. It can be easily shown that the Markov chain of sampled positions has the desired equilibrium distribution N µ, M−1 [17]. 3.2 Sampling with linear constraints Sampling from multivariate Gaussians does not require any method as sophisticated as HMC, but the plot thickens when the target distribution is truncated by linear constraints of the form Fx ≤ g . Here, F ∈ Rm×n is a constraint matrix whose every row is the normal vector to a linear boundary in the sample space. This is equivalent to sampling from a Gaussian that is confined in the (not necessarily bounded) convex polyhedron {x : Fx ≤ g}. In general, to remain within the boundaries of each wall, once a new velocity has been sampled one must compute all possible collision times with the walls. The smallest of all collision times signifies the wall that the particle should bounce from at that collision time. Figure 1 illustrates the concept with two simple examples on 2 and 3 dimensions. The collision times can be computed analytically and their equations can be found in the supplementary material. We also point the reader to [17] for a more detailed discussion of this implementation. Once the wall to be hit has been found, then position and velocity at impact time are computed and the velocity is reflected about the boundary normal1 . The constrained HMC sampler is summarized follows: • Initialise the allowed travel time T and some initial position x0 . • Repeat for HMC samples k = 1 . . . N 1. Sample sk ∼ N (0, M) 2. Use sk and xk to update a and b . 1 Also equivalent to transforming the velocity with a Householder reflection matrix about the bounding hyperplane. 4 1 2 3 4 1 2 3 4 Figure 1: Left: Trajectories of the first 40 iterations of the exact HMC sampler on a 2D truncated Gaussian. A reflection of the velocity can clearly be seen when the particle meets wall #2 . Here, the constraint matrix F is a 4 × 2 matrix. Center: The same example after 40000 samples. The coloring of each sample indicates its density value. Right: The anatomy of a 3D Gaussian. The walls are now planes and in this case F is a 2 × 3 matrix. Figure best seen in color. 3. Reset remaining travel time Tleft ← T . Until no travel time is left or no walls can be reached (no solutions exist), do: (a) Compute impact times with all walls and pick the smallest one, th (if a solution exists). (b) Compute v(th ) and reflect it about the hyperplane fh . This is the updated velocity after impact. The updated position is x(th ) . (c) Tleft ← Tleft − th 4. Store the new position at the end of the trajectory xk+1 as an HMC sample. In general, all walls are candidates for impact, so the runtime of the sampler is linear in m , the number of constraints. This means that the computational load is concentrated in step 3(a). Another consideration is that of the allocated travel time T . Depending on the shape of the bounding polyhedron and the number of walls, a very large travel time can induce many more bounces thus requiring more computations per sample. On the other hand, a very small travel time explores the distribution more locally so the mixing of the chain can suffer. What constitutes a given travel time “large” or “small” is relative to the dimensionality, the number of constraints and the structure of the constraints. Due to the nature of our problem, the number of constraints, when explicitly expressed as linear functions, is O(n2 ) . Clearly, this restricts any direct application of the HMC framework for Gaussian copula estimation to small-sample (n) datasets. More importantly, we show how to exploit the structure of the constraints to reduce the number of candidate walls (prior to each bounce) to O(n) . 4 HMC for the Gaussian Copula extended rank likelihood model Given some discrete data Y ∈ Rn×p , the task is to infer the correlation matrix of the underlying Gaussian copula. Hoff’s sampling algorithm proceeds by alternating between sampling the continu(i) (i) ous latent representation Zj of each Yj , for i = 1 . . . n, j = 1 . . . p , and sampling a covariance matrix from an inverse-Wishart distribution conditioned on the sampled matrix Z ∈ Rn×p , which is then renormalized as a correlation matrix. From here on, we use matrix notation for the samples, as opposed to the random variables – with (i) Zi,j replacing Zj , Z:,j being a column of Z, and Z:,\j being the submatrix of Z without the j-th column. In a similar vein to Hoff’s sampling algorithm, we replace the successive sampling of each Zi,j conditioned on Zi,\j (a conditional univariate truncated Gaussian) with the simultaneous sampling of Z:,j conditioned on Z:,\j . This is done through an HMC step from a conditional multivariate truncated Gaussian. The added benefit of this HMC step over the standard Gibbs approach, is that of a handle for regulating the trade-off between exploration and runtime via the allocated travel time T . Larger travel times potentially allow for larger moves in the sample space, but it comes at a cost as explained in the sequel. 5 4.1 The Hough envelope algorithm The special structure of constraints. Recall that the number of constraints is quadratic in the dimension of the distribution. This is because every Z sample must satisfy the conditions of the event Z ∈ D(y) of the extended rank likelihood (see Section 2). In other words, for any column Z:,j , all entries are organised into a partition L(j) of |L(j) | levels, the number of unique values observed for the discrete or ordinal variable Y (j) . Thereby, for any two adjacent levels lk , lk+1 ∈ L(j) and any pair i1 ∈ lk , i2 ∈ lk+1 , it must be true that Zli ,j < Zli+1 ,j . Equivalently, a constraint f exists where fi1 = 1, fi2 = −1 and g = 0 . It is easy to see that O(n2 ) of such constraints are induced by the order statistics of the j-th variable. To deal with this boundary explosion, we developed the Hough Envelope algorithm to search efficiently, within all pairs in {Z:,j }, in practically linear time. Recall in HMC (section 3.2) that the trajectory of the particle, x(t), is decomposed as xi (t) = ai sin(t) + bi cos(t) + µi , (6) and there are n such functions, grouped into a partition of levels as described above. The Hough envelope2 is found for every pair of adjacent levels. We illustrate this with an example of 10 dimensions and two levels in Figure 2, without loss of generalization to any number of levels or dimensions. Assume we represent trajectories for points in level lk with blue curves, and points in lk+1 with red curves. Assuming we start with a valid state, at time t = 0 all red curves are above all blue curves. The goal is to find the smallest t where a blue curve meets a red curve. This will be our collision time where a bounce will be necessary. 5 3 1 2 Figure 2: The trajectories xj (t) of each component are sinusoid functions. The right-most green dot signifies the wall and the time th of the earliest bounce, where the first inter-level pair (that is, any two components respectively from the blue and red level) becomes equal, in this case the constraint activated being xblue2 = xred2 . 4 4 5 1 2 3 0.2 0.4 0.6 t 0.8 1 1.2 1.4 1. First we find the largest component bluemax of the blue level at t = 0. This takes O(n) time. Clearly, this will be the largest component until its sinusoid intersects that of any other component. 2. To find the next largest component, compute the roots of xbluemax (t) − xi (t) = 0 for all components and pick the smallest (earliest) one (represented by a green dot). This also takes O(n) time. 3. Repeat this procedure until a red sinusoid crosses the highest running blue sinusoid. When this happens, the time of earliest bounce and its constraint are found. In the worst-case scenario, n such repetitions have to be made, but in practice we can safely assume an fixed upper bound h on the number of blue crossings before a inter-level crossing occurs. In experiments, we found h << n, no more than 10 in simulations with hundreds of thousands of curves. Thus, this search strategy takes O(n) time in practice to complete, mirroring the analysis of other output-sensitive algorithms such as the gift wrapping algorithm for computing convex hulls [8]. Our HMC sampling approach is summarized in Algorithm 1. 2 The name is inspired from the fact that each xi (t) is the sinusoid representation, in angle-distance space, of all lines that pass from the (ai , bi ) point in a − b space. A representation known in image processing as the Hough transform [3]. 6 Algorithm 1 HMC for GCERL # Notation: T MN (µ, C, F) is a truncated multivariate normal with location vector µ, scale matrix C and constraints encoded by F and g = 0 . # IW(df, V0 ) is an inverse-Wishart prior with degrees of freedom df and scale matrix V0 . Input: Y ∈ Rn×p , allocated travel time T , a starting Z and variable covariance V ∈ Rp×p , df = p + 2, V0 = df Ip and chain size N . Generate constraints F(j) from Y:,j , for j = 1 . . . p . for samples k = 1 . . . N do # Resample Z as follows: for variables j = 1 . . . p do −1 −1 2 Compute parameters: σj = Vjj − Vj,\j V\j,\j V\j,j , µj = Z:,\j V\j,\j V\j,j . 2 Get one sample Z:,j ∼ T MN µj , σj I, F(j) efficiently by using the Hough Envelope algorithm, see section 4.1. end for Resample V ∼ IW(df + n, V0 + Z Z) . Compute correlation matrix C, s.t. Ci,j = Vi,j / Vi,i Vj,j and store sample, C(k) ← C . end for 5 An application on the Bayesian Gausian copula factor model In this section we describe an experiment that highlights the benefits of our HMC treatment, compared to a state-of-the-art parameter expansion (PX) sampling scheme. During this experiment we ask the important question: “How do the two schemes compare when we exploit the full-advantage of the HMC machinery to jointly sample parameters and the augmented data Z, in a model of latent variables and structured correlations?” We argue that under such circumstances the superior convergence speed and mixing of HMC undeniably compensate for its computational overhead. Experimental setup In this section we provide results from an application on the Gaussian copula latent factor model of [13] (Hoff’s model [7] for low-rank structured correlation matrices). We modify the parameter expansion (PX) algorithm used by [13] by replacing two of its Gibbs steps with a single HMC step. We show a much faster convergence to the true mode with considerable support on its vicinity. We show that unlike the HMC, the PX algorithm falls short of properly exploring the posterior in any reasonable finite amount of time, even for small models, even for small samples. Worse, PX fails in ways one cannot easily detect. Namely, we sample each row of the factor loadings matrix Λ jointly with the corresponding column of the augmented data matrix Z, conditioning on the higher-order latent factors. This step is analogous to Pakman and Paninski’s [17, sec.3.1] use of HMC in the context of a binary probit model (the extension to many levels in the discrete marginal is straightforward with direct application of the constraint matrix F and the Hough envelope algorithm). The sampling of the higher level latent factors remains identical to [13]. Our scheme involves no parameter expansion. We do however interweave the Gibbs step for the Z matrix similarly to Hoff. This has the added benefit of exploring the Z sample space within their current boundaries, complementing the joint (λ, z) sampling which moves the boundaries jointly. The value of such ”interweaving” schemes has been addressed in [19]. Results We perform simulations of 10000 iterations, n = 1000 observations (rows of Y), travel time π/2 for HMC with the setups listed in the following table, along with the elapsed times of each sampling scheme. These experiments were run on Intel COREi7 desktops with 4 cores and 8GB of RAM. Both methods were parallelized across the observed variables (p). Figure p (vars) k (latent factors) M (ordinal levels) elapsed (mins): HMC PX 3(a) : 20 5 2 115 8 3(b) : 10 3 2 80 6 10 3 5 203 16 3(c) : Many functionals of the loadings matrix Λ can be assessed. We focus on reconstructing the true (low-rank) correlation matrix of the Gaussian copula. In particular, we summarize the algorithm’s 7 outcome with the root mean squared error (RMSE) of the differences between entries of the ground-truth correlation matrix and the implied correlation matrix at each iteration of a MCMC scheme (so the following plots looks like a time-series of 10000 timepoints), see Figures 3(a), 3(b) and 3(c) . (a) (b) (c) Figure 3: Reconstruction (RMSE per iteration) of the low-rank structured correlation matrix of the Gaussian copula and its histogram (along the left side). (a) Simulation setup: 20 variables, 5 factors, 5 levels. HMC (blue) reaches a better mode faster (in iterations/CPU-time) than PX (red). Even more importantly the RMSE posterior samples of PX are concentrated in a much smaller region compared to HMC, even after 10000 iterations. This illustrates that PX poorly explores the true distribution. (b) Simulation setup: 10 vars, 3 factors, 2 levels. We observe behaviors similar to Figure 3(a). Note that the histogram counts RMSEs after the burn-in period of PX (iteration #500). (c) Simulation setup: 10 vars, 3 factors, 5 levels. We observe behaviors similar to Figures 3(a) and 3(b) but with a thinner tail for HMC. Note that the histogram counts RMSEs after the burn-in period of PX (iteration #2000). Main message HMC reaches a better mode faster (iterations/CPUtime). Even more importantly the RMSE posterior samples of PX are concentrated in a much smaller region compared to HMC, even after 10000 iterations. This illustrates that PX poorly explores the true distribution. As an analogous situation we refer to the top and bottom panels of Figure 14 of Radford Neal’s slice sampler paper [14]. If there was no comparison against HMC, there would be no evidence from the PX plot alone that the algorithm is performing poorly. This mirrors Radford Neal’s statement opening Section 8 of his paper: “a wrong answer is obtained without any obvious indication that something is amiss”. The concentration on the posterior mode of PX in these simulations is misleading of the truth. PX might seen a bit simpler to implement, but it seems one cannot avoid using complex algorithms for complex models. We urge practitioners to revisit their past work with this model to find out by how much credible intervals of functionals of interest have been overconfident. Whether trivially or severely, our algorithm offers the first principled approach for checking this out. 6 Conclusion Sampling large random vectors simultaneously in order to improve mixing is in general a very hard problem, and this is why clever methods such as HMC or elliptical slice sampling [12] are necessary. We expect that the method here developed is useful not only for those with data analysis problems within the large family of Gaussian copula extended rank likelihood models, but the method itself and its behaviour might provide some new insights on MCMC sampling in constrained spaces in general. Another direction of future work consists of exploring methods for elliptical copulas, and related possible extensions of general HMC for non-Gaussian copula models. Acknowledgements The quality of this work has benefited largely from comments by our anonymous reviewers and useful discussions with Simon Byrne and Vassilios Stathopoulos. Research was supported by EPSRC grant EP/J013293/1. 8 References [1] Y. Bishop, S. Fienberg, and P. Holland. Discrete Multivariate Analysis: Theory and Practice. MIT Press, 1975. [2] A. Dobra and A. Lenkoski. Copula Gaussian graphical models and their application to modeling functional disability data. Annals of Applied Statistics, 5:969–993, 2011. [3] R. O. Duda and P. E. Hart. Use of the Hough transformation to detect lines and curves in pictures. Communications of the ACM, 15(1):11–15, 1972. [4] G. Elidan. Copulas and machine learning. Proceedings of the Copulae in Mathematical and Quantitative Finance workshop, to appear, 2013. [5] F. Han and H. Liu. Semiparametric principal component analysis. Advances in Neural Information Processing Systems, 25:171–179, 2012. [6] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006. [7] P. Hoff. Extending the rank likelihood for semiparametric copula estimation. Annals of Applied Statistics, 1:265–283, 2007. [8] R. Jarvis. On the identification of the convex hull of a finite set of points in the plane. Information Processing Letters, 2(1):18–21, 1973. [9] H. Joe. Multivariate Models and Dependence Concepts. Chapman-Hall, 1997. [10] S. Kirshner. Learning with tree-averaged densities and distributions. Neural Information Processing Systems, 2007. [11] S. Lauritzen. Graphical Models. Oxford University Press, 1996. [12] I. Murray, R. Adams, and D. MacKay. Elliptical slice sampling. JMLR Workshop and Conference Proceedings: AISTATS 2010, 9:541–548, 2010. [13] J. Murray, D. Dunson, L. Carin, and J. Lucas. Bayesian Gaussian copula factor models for mixed data. Journal of the American Statistical Association, to appear, 2013. [14] R. Neal. Slice sampling. The Annals of Statistics, 31:705–767, 2003. [15] R. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, pages 113–162, 2010. [16] R. Nelsen. An Introduction to Copulas. Springer-Verlag, 2007. [17] A. Pakman and L. Paninski. Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. arXiv:1208.4118, 2012. [18] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006. [19] Y. Yu and X. L. Meng. To center or not to center: That is not the question — An ancillaritysufficiency interweaving strategy (ASIS) for boosting MCMC efficiency. Journal of Computational and Graphical Statistics, 20(3):531–570, 2011. 9
3 0.63805932 327 nips-2013-The Randomized Dependence Coefficient
Author: David Lopez-Paz, Philipp Hennig, Bernhard Schölkopf
Abstract: We introduce the Randomized Dependence Coefficient (RDC), a measure of nonlinear dependence between random variables of arbitrary dimension based on the Hirschfeld-Gebelein-R´ nyi Maximum Correlation Coefficient. RDC is defined in e terms of correlation of random non-linear copula projections; it is invariant with respect to marginal distribution transformations, has low computational cost and is easy to implement: just five lines of R code, included at the end of the paper. 1
4 0.60449475 43 nips-2013-Auxiliary-variable Exact Hamiltonian Monte Carlo Samplers for Binary Distributions
Author: Ari Pakman, Liam Paninski
Abstract: We present a new approach to sample from generic binary distributions, based on an exact Hamiltonian Monte Carlo algorithm applied to a piecewise continuous augmentation of the binary distribution of interest. An extension of this idea to distributions over mixtures of binary and possibly-truncated Gaussian or exponential variables allows us to sample from posteriors of linear and probit regression models with spike-and-slab priors and truncated parameters. We illustrate the advantages of these algorithms in several examples in which they outperform the Metropolis or Gibbs samplers. 1
5 0.38098013 153 nips-2013-Learning Feature Selection Dependencies in Multi-task Learning
Author: Daniel Hernández-Lobato, José Miguel Hernández-Lobato
Abstract: A probabilistic model based on the horseshoe prior is proposed for learning dependencies in the process of identifying relevant features for prediction. Exact inference is intractable in this model. However, expectation propagation offers an approximate alternative. Because the process of estimating feature selection dependencies may suffer from over-fitting in the model proposed, additional data from a multi-task learning scenario are considered for induction. The same model can be used in this setting with few modifications. Furthermore, the assumptions made are less restrictive than in other multi-task methods: The different tasks must share feature selection dependencies, but can have different relevant features and model coefficients. Experiments with real and synthetic data show that this model performs better than other multi-task alternatives from the literature. The experiments also show that the model is able to induce suitable feature selection dependencies for the problems considered, only from the training data. 1
6 0.37183008 145 nips-2013-It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals
7 0.36299029 217 nips-2013-On Poisson Graphical Models
8 0.35137975 178 nips-2013-Locally Adaptive Bayesian Multivariate Time Series
9 0.28279227 117 nips-2013-Fast Algorithms for Gaussian Noise Invariant Independent Component Analysis
10 0.28127694 280 nips-2013-Robust Data-Driven Dynamic Programming
11 0.28011414 168 nips-2013-Learning to Pass Expectation Propagation Messages
12 0.27710807 67 nips-2013-Conditional Random Fields via Univariate Exponential Families
13 0.27657709 352 nips-2013-What do row and column marginals reveal about your dataset?
14 0.27388334 252 nips-2013-Predictive PAC Learning and Process Decompositions
15 0.26459098 302 nips-2013-Sparse Inverse Covariance Estimation with Calibration
16 0.25940195 53 nips-2013-Bayesian inference for low rank spatiotemporal neural receptive fields
17 0.25912291 110 nips-2013-Estimating the Unseen: Improved Estimators for Entropy and other Properties
18 0.24991181 256 nips-2013-Probabilistic Principal Geodesic Analysis
19 0.2484123 130 nips-2013-Generalizing Analytic Shrinkage for Arbitrary Covariance Structures
20 0.23912349 189 nips-2013-Message Passing Inference with Chemical Reaction Networks
topicId topicWeight
[(2, 0.046), (16, 0.049), (33, 0.094), (34, 0.106), (41, 0.045), (49, 0.021), (56, 0.073), (70, 0.022), (71, 0.017), (72, 0.335), (84, 0.021), (85, 0.042), (89, 0.021), (93, 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.71577626 126 nips-2013-Gaussian Process Conditional Copulas with Applications to Financial Time Series
Author: José Miguel Hernández-Lobato, James R. Lloyd, Daniel Hernández-Lobato
Abstract: The estimation of dependencies between multiple variables is a central problem in the analysis of financial time series. A common approach is to express these dependencies in terms of a copula function. Typically the copula function is assumed to be constant but this may be inaccurate when there are covariates that could have a large influence on the dependence structure of the data. To account for this, a Bayesian framework for the estimation of conditional copulas is proposed. In this framework the parameters of a copula are non-linearly related to some arbitrary conditioning variables. We evaluate the ability of our method to predict time-varying dependencies on several equities and currencies and observe consistent performance gains compared to static copula models and other timevarying copula methods. 1
2 0.6659838 263 nips-2013-Reasoning With Neural Tensor Networks for Knowledge Base Completion
Author: Richard Socher, Danqi Chen, Christopher D. Manning, Andrew Ng
Abstract: Knowledge bases are an important resource for question answering and other tasks but often suffer from incompleteness and lack of ability to reason over their discrete entities and relationships. In this paper we introduce an expressive neural tensor network suitable for reasoning over relationships between two entities. Previous work represented entities as either discrete atomic units or with a single entity vector representation. We show that performance can be improved when entities are represented as an average of their constituting word vectors. This allows sharing of statistical strength between, for instance, facts involving the “Sumatran tiger” and “Bengal tiger.” Lastly, we demonstrate that all models improve when these word vectors are initialized with vectors learned from unsupervised large corpora. We assess the model by considering the problem of predicting additional true relations between entities given a subset of the knowledge base. Our model outperforms previous models and can classify unseen relationships in WordNet and FreeBase with an accuracy of 86.2% and 90.0%, respectively. 1
3 0.60458356 244 nips-2013-Parametric Task Learning
Author: Ichiro Takeuchi, Tatsuya Hongo, Masashi Sugiyama, Shinichi Nakajima
Abstract: We introduce an extended formulation of multi-task learning (MTL) called parametric task learning (PTL) that can systematically handle infinitely many tasks parameterized by a continuous parameter. Our key finding is that, for a certain class of PTL problems, the path of the optimal task-wise solutions can be represented as piecewise-linear functions of the continuous task parameter. Based on this fact, we employ a parametric programming technique to obtain the common shared representation across all the continuously parameterized tasks. We show that our PTL formulation is useful in various scenarios such as learning under non-stationarity, cost-sensitive learning, and quantile regression. We demonstrate the advantage of our approach in these scenarios.
4 0.59125626 167 nips-2013-Learning the Local Statistics of Optical Flow
Author: Dan Rosenbaum, Daniel Zoran, Yair Weiss
Abstract: Motivated by recent progress in natural image statistics, we use newly available datasets with ground truth optical flow to learn the local statistics of optical flow and compare the learned models to prior models assumed by computer vision researchers. We find that a Gaussian mixture model (GMM) with 64 components provides a significantly better model for local flow statistics when compared to commonly used models. We investigate the source of the GMM’s success and show it is related to an explicit representation of flow boundaries. We also learn a model that jointly models the local intensity pattern and the local optical flow. In accordance with the assumptions often made in computer vision, the model learns that flow boundaries are more likely at intensity boundaries. However, when evaluated on a large dataset, this dependency is very weak and the benefit of conditioning flow estimation on the local intensity pattern is marginal. 1
5 0.56782788 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking
Author: David Carlson, Vinayak Rao, Joshua T. Vogelstein, Lawrence Carin
Abstract: With simultaneous measurements from ever increasing populations of neurons, there is a growing need for sophisticated tools to recover signals from individual neurons. In electrophysiology experiments, this classically proceeds in a two-step process: (i) threshold the waveforms to detect putative spikes and (ii) cluster the waveforms into single units (neurons). We extend previous Bayesian nonparametric models of neural spiking to jointly detect and cluster neurons using a Gamma process model. Importantly, we develop an online approximate inference scheme enabling real-time analysis, with performance exceeding the previous state-of-theart. Via exploratory data analysis—using data with partial ground truth as well as two novel data sets—we find several features of our model collectively contribute to our improved performance including: (i) accounting for colored noise, (ii) detecting overlapping spikes, (iii) tracking waveform dynamics, and (iv) using multiple channels. We hope to enable novel experiments simultaneously measuring many thousands of neurons and possibly adapting stimuli dynamically to probe ever deeper into the mysteries of the brain. 1
6 0.52045119 336 nips-2013-Translating Embeddings for Modeling Multi-relational Data
7 0.46032447 123 nips-2013-Flexible sampling of discrete data correlations without the marginal distributions
8 0.45994169 248 nips-2013-Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs
9 0.45891094 182 nips-2013-Manifold-based Similarity Adaptation for Label Propagation
10 0.45748609 77 nips-2013-Correlations strike back (again): the case of associative memory retrieval
11 0.4563767 173 nips-2013-Least Informative Dimensions
12 0.4559294 178 nips-2013-Locally Adaptive Bayesian Multivariate Time Series
13 0.45284104 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning
14 0.45281503 86 nips-2013-Demixing odors - fast inference in olfaction
15 0.45240971 350 nips-2013-Wavelets on Graphs via Deep Learning
16 0.45215765 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
17 0.45084766 310 nips-2013-Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.
18 0.45081669 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations
19 0.45079705 79 nips-2013-DESPOT: Online POMDP Planning with Regularization
20 0.45028397 201 nips-2013-Multi-Task Bayesian Optimization