nips nips2010 nips2010-54 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Andrew Wilson, Zoubin Ghahramani
Abstract: We define a copula process which describes the dependencies between arbitrarily many random variables independently of their marginal distributions. As an example, we develop a stochastic volatility model, Gaussian Copula Process Volatility (GCPV), to predict the latent standard deviations of a sequence of random variables. To make predictions we use Bayesian inference, with the Laplace approximation, and with Markov chain Monte Carlo as an alternative. We find our model can outperform GARCH on simulated and financial data. And unlike GARCH, GCPV can easily handle missing data, incorporate covariates other than time, and model a rich class of covariance structures. Imagine measuring the distance of a rocket as it leaves Earth, and wanting to know how these measurements correlate with one another. How much does the value of the measurement at fifteen minutes depend on the measurement at five minutes? Once we’ve learned this correlation structure, suppose we want to compare it to the dependence between measurements of the rocket’s velocity. To do this, it is convenient to separate dependence from the marginal distributions of our measurements. At any given time, a rocket’s distance from Earth could have a Gamma distribution, while its velocity could have a Gaussian distribution. And separating dependence from marginal distributions is precisely what a copula function does. While copulas have recently become popular, especially in financial applications [1, 2], as Nelsen [3] writes, “the study of copulas and the role they play in probability, statistics, and stochastic processes is a subject still in its infancy. There are many open problems. . . ” Typically only bivariate (and recently trivariate) copulas are being used and studied. In our introductory example, we are interested in learning the correlations in different stochastic processes, and comparing them. It would therefore be useful to have a copula process, which can describe the dependencies between arbitrarily many random variables independently of their marginal distributions. We define such a process. And as an example, we develop a stochastic volatility model, Gaussian Copula Process Volatility (GCPV). In doing so, we provide a Bayesian framework for the learning the marginal distributions and dependency structure of what we call a Gaussian copula process. The volatility of a random variable is its standard deviation. Stochastic volatility models are used to predict the volatilities of a heteroscedastic sequence – a sequence of random variables with different variances, like distance measurements of a rocket as it leaves the Earth. As the rocket gets further away, the variance on the measurements increases. Heteroscedasticity is especially important in econometrics; the returns on equity indices, like the S&P; 500, or on currency exchanges, are heteroscedastic. Indeed, in 2003, Robert Engle won the Nobel Prize in economics “for methods of analyzing economic time series with time-varying volatility”. GARCH [4], a generalized version of Engle’s ARCH, is arguably unsurpassed for predicting the volatility of returns on equity indices and currency exchanges [5, 6, 7]. GCPV can outperform GARCH, and is competitive on financial data that especially suits GARCH [8, 9, 10]. Before discussing GCPV, we first introduce copulas and the copula process. For a review of Gaussian processes, see Rasmussen and Williams [11]. ∗ † http://mlg.eng.cam.ac.uk/andrew Also at the machine learning department at Carnegie Mellon University. 1 1 Copulas Copulas are important because they separate the dependency structure between random variables from their marginal distributions. Intuitively, we can describe the dependency structure of any multivariate joint distribution H(x1 , . . . , xn ) = P (X1 ≤ x1 , . . . Xn ≤ xn ) through a two step process. First we take each univariate random variable Xi and transform it through its cumulative distribution function (cdf) Fi to get Ui = Fi (Xi ), a uniform random variable. We then express the dependencies between these transformed variables through the n-copula C(u1 , . . . , un ). Formally, an n-copula C : [0, 1]n → [0, 1] is a multivariate cdf with uniform univariate marginals: C(u1 , u2 , . . . , un ) = P (U1 ≤ u1 , U2 ≤ u2 , . . . , Un ≤ un ), where U1 , U2 , . . . , Un are standard uniform random variables. Sklar [12] precisely expressed our intuition in the theorem below. Theorem 1.1. Sklar’s theorem Let H be an n-dimensional distribution function with marginal distribution functions F1 , F2 , . . . , Fn . Then there exists an n-copula C such that for all (x1 , x2 , . . . , xn ) ∈ [−∞, ∞]n , H(x1 , x2 , . . . , xn ) = C(F1 (x1 ), F2 (x2 ), . . . , Fn (xn )) = C(u1 , u2 , . . . , un ). (1) If F1 , F2 , . . . , Fn are all continuous then C is unique; otherwise C is uniquely determined on Range F1 × Range F2 × · · · × Range Fn . Conversely, if C is an n-copula and F1 , F2 , . . . , Fn are distribution functions, then the function H is an n-dimensional distribution function with marginal distribution functions F1 , F2 , . . . , Fn . (−1) As a corollary, if Fi (u) = inf{x : F (x) ≥ u}, the quasi-inverse of Fi , then for all u1 , u2 , . . . , un ∈ [0, 1]n , (−1) C(u1 , u2 , . . . , un ) = H(F1 (−1) (u1 ), F2 (−1) (u2 ), . . . , Fn (un )). (2) In other words, (2) can be used to construct a copula. For example, the bivariate Gaussian copula is defined as C(u, v) = Φρ (Φ−1 (u), Φ−1 (v)), (3) where Φρ is a bivariate Gaussian cdf with correlation coefficient ρ, and Φ is the standard univariate Gaussian cdf. Li [2] popularised the bivariate Gaussian copula, by showing how it could be used to study financial risk and default correlation, using credit derivatives as an example. By substituting F (x) for u and G(y) for v in equation (3), we have a bivariate distribution H(x, y), with a Gaussian dependency structure, and marginals F and G. Regardless of F and G, the resulting H(x, y) can still be uniquely expressed as a Gaussian copula, so long as F and G are continuous. It is then a copula itself that captures the underlying dependencies between random variables, regardless of their marginal distributions. For this reason, copulas have been called dependence functions [13, 14]. Nelsen [3] contains an extensive discussion of copulas. 2 Copula Processes Imagine choosing a covariance function, and then drawing a sample function at some finite number of points from a Gaussian process. The result is a sample from a collection of Gaussian random variables, with a dependency structure encoded by the specified covariance function. Now, suppose we transform each of these values through a univariate Gaussian cdf, such that we have a sample from a collection of uniform random variables. These uniform random variables also have this underlying Gaussian process dependency structure. One might call the resulting values a draw from a Gaussian-Uniform Process. We could subsequently put these values through an inverse beta cdf, to obtain a draw from what could be called a Gaussian-Beta Process: the values would be a sample from beta random variables, again with an underlying Gaussian process dependency structure. We could also transform the uniform values with different inverse cdfs, which would give a sample from different random variables, with dependencies encoded by the Gaussian process. The above procedure is a means to generate samples from arbitrarily many random variables, with arbitrary marginal distributions, and desired dependencies. It is an example of how to use what we call a copula process – in this case, a Gaussian copula process, since a Gaussian copula describes the dependency structure of a finite number of samples. We now formally define a copula process. 2 Definition 2.1. Copula Process Let {Wt } be a collection of random variables indexed by t ∈ T , with marginal distribution functions Ft , and let Qt = Ft (Wt ). Further, let µ be a stochastic process measure with marginal distribution functions Gt , and joint distribution function H. Then Wt is copula process distributed with base measure µ, or Wt ∼ CP(µ), if and only if for all n ∈ N, ai ∈ R, n P( (−1) {Gti (Qti ) ≤ ai }) = Ht1 ,t2 ,...,tn (a1 , a2 , . . . , an ). (4) i=1 (−1) Each Qti ∼ Uniform(0, 1), and Gti is the quasi-inverse of Gti , as previously defined. Definition 2.2. Gaussian Copula Process Wt is Gaussian copula process distributed if it is copula process distributed and the base measure µ is a Gaussian process. If there is a mapping Ψ such that Ψ(Wt ) ∼ GP(m(t), k(t, t )), then we write Wt ∼ GCP(Ψ, m(t), k(t, t )). For example, if we have Wt ∼ GCP with m(t) = 0 and k(t, t) = 1, then in the definition of a copula process, Gt = Φ, the standard univariate Gaussian cdf, and H is the usual GP joint distribution function. Supposing this GCP is a Gaussian-Beta process, then Ψ = Φ−1 ◦ FB , where FB is a univariate Beta cdf. One could similarly define other copula processes. We described generally how a copula process can be used to generate samples of arbitrarily many random variables with desired marginals and dependencies. We now develop a specific and practical application of this framework. We introduce a stochastic volatility model, Gaussian Copula Process Volatility (GCPV), as an example of how to learn the joint distribution of arbitrarily many random variables, the marginals of these random variables, and to make predictions. To do this, we fit a Gaussian copula process by using a type of Warped Gaussian Process [15]. However, our methodology varies substantially from Snelson et al. [15], since we are doing inference on latent variables as opposed to observations, which is a much greater undertaking that involves approximations, and we are doing so in a different context. 3 Gaussian Copula Process Volatility Assume we have a sequence of observations y = (y1 , . . . , yn ) at times t = (t1 , . . . , tn ) . The observations are random variables with different latent standard deviations. We therefore have n unobserved standard deviations, σ1 , . . . , σn , and want to learn the correlation structure between these standard deviations, and also to predict the distribution of σ∗ at some unrealised time t∗ . We model the standard deviation function as a Gaussian copula process: σt ∼ GCP(g −1 , 0, k(t, t )). (5) f (t) ∼ GP(m(t) = 0, k(t, t )) σ(t) = g(f (t), ω) (6) (7) y(t) ∼ N (0, σ 2 (t)), (8) Specifically, where g is a monotonic warping function, parametrized by ω. For each of the observations y = (y1 , . . . , yn ) we have corresponding GP latent function values f = (f1 , . . . , fn ) , where σ(ti ) = g(fi , ω), using the shorthand fi to mean f (ti ). σt ∼ GCP, because any finite sequence (σ1 , . . . , σp ) is distributed as a Gaussian copula: P (σ1 ≤ a1 , . . . , σp ≤ ap ) = P (g −1 (σ1 ) ≤ g −1 (a1 ), . . . , g −1 (σp ) ≤ g −1 (ap )) = ΦΓ (g −1 (a1 ), . . . , g −1 −1 (ap )) = ΦΓ (Φ = ΦΓ (Φ −1 (F (a1 )), . . . , Φ (u1 ), . . . , Φ −1 −1 (9) (F (ap ))) (up )) = C(u1 , . . . , up ), where Φ is the standard univariate Gaussian cdf (supposing k(t, t) = 1), ΦΓ is a multivariate Gaussian cdf with covariance matrix Γij = cov(g −1 (σi ), g −1 (σj )), and F is the marginal distribution of 3 each σi . In (5), we have Ψ = g −1 , because it is g −1 which maps σt to a GP. The specification in (5) is equivalently expressed by (6) and (7). With GCPV, the form of g is learned so that g −1 (σt ) is best modelled by a GP. By learning g, we learn the marginal of each σ: F (a) = Φ(g −1 (a)) for a ∈ R. Recently, a different sort of ‘kernel copula process’ has been used, where the marginals of the variables being modelled are not learned [16].1 Further, we also consider a more subtle and flexible form of our model, where the function g itself is indexed by time: g = gt (f (t), ω). We only assume that the marginal distributions of σt are stationary over ‘small’ time periods, and for each of these time periods (5)-(7) hold true. We return to this in the final discussion section. Here we have assumed that each observation, conditioned on knowing its variance, is normally distributed with zero mean. This is a common assumption in heteroscedastic models. The zero mean and normality assumptions can be relaxed and are not central to this paper. 4 Predictions with GCPV Ultimately, we wish to infer p(σ(t∗ )|y, z), where z = {θ, ω}, and θ are the hyperparameters of the GP covariance function. To do this, we sample from p(f∗ |y, z) = p(f∗ |f , θ)p(f |y, z)df (10) and then transform these samples by g. Letting (Cf )ij = δij g(fi , ω)2 , where δij is the Kronecker delta, Kij = k(ti , tj ), (k∗ )i = k(t∗ , ti ), we have p(f |y, z) = N (f ; 0, K)N (y; 0, Cf )/p(y|z), p(f∗ |f , θ) = N (k∗ K −1 f , k(t∗ , t∗ ) − k∗ K −1 (11) k∗ ). (12) We also wish to learn z, which we can do by finding the z that maximizes the marginal likelihood, ˆ p(y|z) = p(y|f , ω)p(f |θ)df . (13) Unfortunately, for many functions g, (10) and (13) are intractable. Our methods of dealing with this can be used in very general circumstances, where one has a Gaussian process prior, but an (optionally parametrized) non-Gaussian likelihood. We use the Laplace approximation to estimate p(f |y, z) as a Gaussian. Then we can integrate (10) for a Gaussian approximation to p(f∗ |y, z), which we sample from to make predictions of σ∗ . Using Laplace, we can also find an expression for an approximate marginal likelihood, which we maximize to determine z. Once we have found z with Laplace, we use Markov chain Monte Carlo to sample from p(f∗ |y, z), and compare that to using Laplace to sample from p(f∗ |y, z). In the supplement we relate this discussion to (9). 4.1 Laplace Approximation The goal is to approximate (11) with a Gaussian, so that we can evaluate (10) and (13) and make predictions. In doing so, we follow Rasmussen and Williams [11] in their treatment of Gaussian process classification, except we use a parametrized likelihood, and modify Newton’s method. First, consider as an objective function the logarithm of an unnormalized (11): s(f |y, z) = log p(y|f , ω) + log p(f |θ). (14) ˆ The Laplace approximation uses a second order Taylor expansion about the f which maximizes ˆ, for which we use (14), to find an approximate objective s(f |y, z). So the first step is to find f ˜ Newton’s method. The Newton update is f new = f − ( s(f ))−1 s(f ). Differentiating (14), s(f |y, z) = s(f |y, z) = where W is the diagonal matrix − 1 log p(y|f , ω) − K −1 f log p(y|f , ω) − K −1 = −W − K −1 , log p(y|f , ω). Note added in proof : Also, for a very recent related model, see Rodr´guez et al. [17]. ı 4 (15) (16) If the likelihood function p(y|f , ω) is not log concave, then W may have negative entries. Vanhatalo et al. [18] found this to be problematic when doing Gaussian process regression with a Student-t ˆ likelihood. They instead use an expectation-maximization (EM) algorithm for finding f , and iterate ordered rank one Cholesky updates to evaluate the Laplace approximate marginal likelihood. But EM can converge slowly, especially near a local optimum, and each of the rank one updates is vulnerable to numerical instability. With a small modification of Newton’s method, we often get close to ˆ quadratic convergence for finding f , and can evaluate the Laplace approximate marginal likelihood in a numerically stable fashion, with no approximate Cholesky factors, and optimal computational requirements. Some comments are in the supplementary material but, in short, we use an approximate negative Hessian, − s ≈ M + K −1 , which is guaranteed to be positive definite, since M is formed on each iteration by zeroing the negative entries of W . For stability, we reformulate our 1 1 1 1 optimization in terms of B = I + M 2 KM 2 , and let Q = M 2 B −1 M 2 , b = M f + log p(y|f ), a = b − QKb. Since (K −1 + M )−1 = K − KQK, the Newton update becomes f new = Ka. ˆ With these updates we find f and get an expression for s which we use to approximate (13) and ˜ (11). The approximate marginal likelihood q(y|z) is given by exp(˜)df . Taking its logarithm, s 1ˆ 1 ˆ log q(y|z) = − f af + log p(y|f ) − log |Bf |, (17) ˆ ˆ 2 2 ˆ ˆ where Bf is B evaluated at f , and af is a numerically stable evaluation of K −1 f . ˆ ˆ To learn the parameters z, we use conjugate gradient descent to maximize (17) with respect to z. ˆ ˆ Since f is a function of z, we initialize z, and update f every time we vary z. Once we have found an optimum z , we can make predictions. By exponentiating s, we find a Gaussian approximation to ˆ ˜ ˆ the posterior (11), q(f |y, z) = N (f , K − KQK). The product of this approximate posterior with p(f∗ |f ) is Gaussian. Integrating this product, we approximate p(f∗ |y, z) as ˆ q(f∗ |y, z) = N (k∗ log p(y|f ), k(t∗ , t∗ ) − k∗ Qk∗ ). (18) Given n training observations, the cost of each Newton iteration is dominated by computing the cholesky decomposition of B, which takes O(n3 ) operations. The objective function typically changes by less than 10−6 after 3 iterations. Once Newton’s method has converged, it takes only O(1) operations to draw from q(f∗ |y, z) and make predictions. 4.2 Markov chain Monte Carlo We use Markov chain Monte Carlo (MCMC) to sample from (11), so that we can later sample from p(σ∗ |y, z) to make predictions. Sampling from (11) is difficult, because the variables f are strongly coupled by a Gaussian process prior. We use a new technique, Elliptical Slice Sampling [19], and find it extremely effective for this purpose. It was specifically designed to sample from posteriors with correlated Gaussian priors. It has no free parameters, and jointly updates every element of f . For our setting, it is over 100 times as fast as axis aligned slice sampling with univariate updates. To make predictions, we take J samples of p(f |y, z), {f 1 , . . . , f J }, and then approximate (10) as a mixture of J Gaussians: J 1 p(f∗ |f i , θ). (19) p(f∗ |y, z) ≈ J i=1 Each of the Gaussians in this mixture have equal weight. So for each sample of f∗ |y, we uniformly choose a random p(f∗ |f i , θ) and draw a sample. In the limit J → ∞, we are sampling from the exact p(f∗ |y, z). Mapping these samples through g gives samples from p(σ∗ |y, z). After one O(n3 ) and one O(J) operation, a draw from (19) takes O(1) operations. 4.3 Warping Function The warping function, g, maps fi , a GP function value, to σi , a standard deviation. Since fi can take any value in R, and σi can take any non-negative real value, g : R → R+ . For each fi to correspond to a unique deviation, g must also be one-to-one. We use K g(x, ω) = aj log[exp[bj (x + cj )] + 1], j=1 5 aj , bj > 0. (20) This is monotonic, positive, infinitely differentiable, asymptotic towards zero as x → −∞, and K tends to ( j=1 aj bj )x as x → ∞. In practice, it is useful to add a small constant to (20), to avoid rare situations where the parameters ω are trained to make g extremely small for certain inputs, at the expense of a good overall fit; this can happen when the parameters ω are learned by optimizing a likelihood. A suitable constant could be one tenth the absolute value of the smallest nonzero observation. By inferring the parameters of the warping function, or distributions of these parameters, we are learning a transformation which will best model σt with a Gaussian process. The more flexible the warping function, the more potential there is to improve the GCPV fit – in other words, the better we can estimate the ‘perfect’ transformation. To test the importance of this flexibility, we also try a simple unparametrized warping function, g(x) = ex . In related work, Goldberg et al. [20] place a GP prior on the log noise level in a standard GP regression model on observations, except for inference they use Gibbs sampling, and a high level of ‘jitter’ for conditioning. Once g is trained, we can infer the marginal distribution of each σ: F (a) = Φ(g −1 (a)), for a ∈ R. This suggests an alternate way to initialize g: we can initialize F as a mixture of Gaussians, and then map through Φ−1 to find g −1 . Since mixtures of Gaussians are dense in the set of probability distributions, we could in principle find the ‘perfect’ g using an infinite mixture of Gaussians [21]. 5 Experiments In our experiments, we predict the latent standard deviations σ of observations y at times t, and also σ∗ at unobserved times t∗ . To do this, we use two versions of GCPV. The first variant, which we simply refer to as GCPV, uses the warping function (20) with K = 1, and squared exponential covariance function, k(t, t ) = A exp(−(t−t )2 /l2 ), with A = 1. The second variant, which we call GP-EXP, uses the unparametrized warping function ex , and the same covariance function, except the amplitude A is a trained hyperparameter. The other hyperparameter l is called the lengthscale of the covariance function. The greater l, the greater the covariance between σt and σt+a for a ∈ R. We train hyperparameters by maximizing the Laplace approximate log marginal likelihood (17). We then sample from p(f∗ |y) using the Laplace approximation (18). We also do this using MCMC (19) with J = 10000, after discarding a previous 10000 samples of p(f |y) as burn-in. We pass 2 these samples of f∗ |y through g and g 2 to draw from p(σ∗ |y) and p(σ∗ |y), and compute the sample mean and variance of σ∗ |y. We use the sample mean as a point predictor, and the sample variance for error bounds on these predictions, and we use 10000 samples to compute these quantities. For GCPV we use Laplace and MCMC for inference, but for GP-EXP we only use Laplace. We compare predictions to GARCH(1,1), which has been shown in extensive and recent reviews to be competitive with other GARCH variants, and more sophisticated models [5, 6, 7]. GARCH(p,q) specifies y(t) ∼ p 2 2 N (0, σ 2 (t)), and lets the variance be a deterministic function of the past: σt = a0 + i=1 ai yt−i + q 2 j=1 bj σt−j . We use the Matlab Econometrics Toolbox implementation of GARCH, where the parameters a0 , ai and bj are estimated using a constrained maximum likelihood. We make forecasts of volatility, and we predict historical volatility. By ‘historical volatility’ we mean the volatility at observed time points, or between these points. Uncovering historical volatility is important. It could, for instance, be used to study what causes fluctuations in the stock market, or to understand physical systems. To evaluate our model, we use the Mean Squared Error (MSE) between the true variance, or proxy for the truth, and the predicted variance. Although likelihood has advantages, we are limited in space, and we wish to harmonize with the econometrics literature, and other assessments of volatility models, where MSE is the standard. In a similar assessment of volatility models, Brownlees et al. [7] found that MSE and quasi-likelihood rankings were comparable. When the true variance is unknown we follow Brownlees et al. [7] and use squared observations as a proxy for the truth, to compare our model to GARCH.2 The more observations, the more reliable these performance estimates will be. However, not many observations (e.g. 100) are needed for a stable ranking of competing models; in Brownlees et al. [7], the rankings derived from high frequency squared observations are similar to those derived using daily squared observations. 2 Since each observation y is assumed to have zero mean and variance σ 2 , E[y 2 ] = σ 2 . 6 5.1 Simulations We simulate observations from N (0, σ 2 (t)), using σ(t) = sin(t) cos(t2 ) + 1, at t = (0, 0.02, 0.04, . . . , 4) . We call this data set TRIG. We also simulate using a standard deviation that jumps from 0.1 to 7 and back, at times t = (0, 0.1, 0.2, . . . , 6) . We call this data set JUMP. To forecast, we use all observations up until the current time point, and make 1, 7, and 30 step ahead predictions. So, for example, in TRIG we start by observing t = 0, and make forecasts at t = 0.02, 0.14, 0.60. Then we observe t = 0, 0.02 and make forecasts at t = 0.04, 0.16, 0.62, and so on, until all data points have been observed. For historical volatility, we predict the latent σt at the observation times, which is safe since we are comparing to the true volatility, which is not used in training; the results are similar if we interpolate. Figure 1 panels a) and b) show the true volatility for TRIG and JUMP respectively, alongside GCPV Laplace, GCPV MCMC, GP-EXP Laplace, and GARCH(1,1) predictions of historical volatility. Table 1 shows the results for forecasting and historical volatility. In panel a) we see that GCPV more accurately captures the dependencies between σ at different times points than GARCH: if we manually decrease the lengthscale in the GCPV covariance function, we can replicate the erratic GARCH behaviour, which inaccurately suggests that the covariance between σt and σt+a decreases quickly with increases in a. We also see that GCPV with an unparametrized exponential warping function tends to overestimates peaks and underestimate troughs. In panel b), the volatility is extremely difficult to reconstruct or forecast – with no warning it will immediately and dramatically increase or decrease. This behaviour is not suited to a smooth squared exponential covariance function. Nevertheless, GCPV outperforms GARCH, especially in regions of low volatility. We also see this in panel a) for t ∈ (1.5, 2). GARCH is known to respond slowly to large returns, and to overpredict volatility [22]. In JUMP, the greater the peaks, and the smaller the troughs, the more GARCH suffers, while GCPV is mostly robust to these changes. 5.2 Financial Data The returns on the daily exchange rate between the Deutschmark (DM) and the Great Britain Pound (GBP) from 1984 to 1992 have become a benchmark for assessing the performance of GARCH models [8, 9, 10]. This exchange data, which we refer to as DMGBP, can be obtained from www.datastream.com, and the returns are calculated as rt = log(Pt+1 /Pt ), where Pt is the number of DM to GBP on day t. The returns are assumed to have a zero mean function. We use a rolling window of the previous 120 days of returns to make 1, 7, and 30 day ahead volatility forecasts, starting at the beginning of January 1988, and ending at the beginning of January 1992 (659 trading days). Every 7 days, we retrain the parameters of GCPV and GARCH. Every time we retrain parameters, we predict historical volatility over the past 120 days. The average MSE for these historical predictions is given in Table 1, although they should be observed with caution; unlike with the simulations, the DMGBP historical predictions are trained using the same data they are assessed on. In Figure 1c), we see that the GARCH one day ahead forecasts are lifted above the GCPV forecasts, but unlike in the simulations, they are now operating on a similar lengthscale. This suggests that GARCH could still be overpredicting volatility, but that GCPV has adapted its estimation of how σt and σt+a correlate with one another. Since GARCH is suited to this financial data set, it is reassuring that GCPV predictions have a similar time varying structure. Overall, GCPV and GARCH are competitive with one another for forecasting currency exchange returns, as seen in Table 1. Moreover, a learned warping function g outperforms an unparametrized one, and a full Laplace solution is comparable to using MCMC for inference, in accuracy and speed. This is also true for the simulations. Therefore we recommend whichever is more convenient to implement. 6 Discussion We defined a copula process, and as an example, developed a stochastic volatility model, GCPV, which can outperform GARCH. With GCPV, the volatility σt is distributed as a Gaussian Copula Process, which separates the modelling of the dependencies between volatilities at different times from their marginal distributions – arguably the most useful property of a copula. Further, GCPV fits the marginals in the Gaussian copula process by learning a warping function. If we had simply chosen an unparametrized exponential warping function, we would incorrectly be assuming that the log 7 Table 1: MSE for predicting volatility. Data set Model Historical 1 step 7 step 30 step TRIG GCPV (LA) GCPV (MCMC) GP-EXP GARCH 0.0953 0.0760 0.193 0.938 0.588 0.622 0.646 1.04 0.951 0.979 1.36 1.79 1.71 1.76 1.15 5.12 JUMP GCPV (LA) GCPV (MCMC) GP-EXP GARCH 0.588 1.21 1.43 1.88 0.891 0.951 1.76 1.58 1.38 1.37 6.95 3.43 1.35 1.35 14.7 5.65 GCPV (LA) GCPV (MCMC) GP-EXP GARCH 2.43 2.39 2.52 2.83 3.00 3.00 3.20 3.03 3.08 3.08 3.46 3.12 3.17 3.17 5.14 3.32 ×103 DMGBP ×10−9 TRIG JUMP DMGBP 20 DMGBP 0.015 600 Probability Density 3 1 Volatility Volatility Volatility 15 2 10 0.01 0.005 5 0 0 1 2 Time (a) 3 4 0 0 2 4 0 6 Time (b) 0 200 400 Days (c) 600 400 200 0 0 0.005 σ (d) 0.01 Figure 1: Predicting volatility and learning its marginal pdf. For a) and b), the true volatility, and GCPV (MCMC), GCPV (LA), GP-EXP, and GARCH predictions, are shown respectively by a thick green line, a dashed thick blue line, a dashed black line, a cyan line, and a red line. a) shows predictions of historical volatility for TRIG, where the shade is a 95% confidence interval about GCPV (MCMC) predictions. b) shows predictions of historical volatility for JUMP. In c), a black line and a dashed red line respectively show GCPV (LA) and GARCH one day ahead volatility forecasts for DMGBP. In d), a black line and a dashed blue line respectively show the GCPV learned marginal pdf of σt in DMGBP and a Gamma(4.15,0.00045) pdf. volatilities are marginally Gaussian distributed. Indeed, for the DMGBP data, we trained the warping function g over a 120 day period, and mapped its inverse through the univariate standard Gaussian cdf Φ, and differenced, to estimate the marginal probability density function (pdf) of σt over this period. The learned marginal pdf, shown in Figure 1d), is similar to a Gamma(4.15,0.00045) distribution. However, in using a rolling window to retrain the parameters of g, we do not assume that the marginals of σt are stationary; we have a time changing warping function. While GARCH is successful, and its simplicity is attractive, our model is also simple and has a number of advantages. We can effortlessly handle missing data, we can easily incorporate covariates other than time (like interest rates) in our covariance function, and we can choose from a rich class of covariance functions – squared exponential, Brownian motion, Mat´ rn, periodic, etc. In fact, the e volatility of high frequency intradaily returns on equity indices and currency exchanges is cyclical [23], and GCPV with a periodic covariance function is uniquely well suited to this data. And the parameters of GCPV, like the covariance function lengthscale, or the learned warping function, provide insight into the underlying source of volatility, unlike the parameters of GARCH. Finally, copulas are rapidly becoming popular in applications, but often only bivariate copulas are being used. With our copula process one can learn the dependencies between arbitrarily many random variables independently of their marginal distributions. We hope the Gaussian Copula Process Volatility model will encourage other applications of copula processes. More generally, we hope our work will help bring together the machine learning and econometrics communities. Acknowledgments: Thanks to Carl Edward Rasmussen and Ferenc Husz´ r for helpful conversaa tions. AGW is supported by an NSERC grant. 8 References [1] Paul Embrechts, Alexander McNeil, and Daniel Straumann. Correlation and dependence in risk management: Properties and pitfalls. In Risk Management: Value at risk and beyond, pages 176–223. Cambridge University Press, 1999. [2] David X. Li. On default correlation: A copula function approach. Journal of Fixed Income, 9(4):43–54, 2000. [3] Roger B. Nelsen. An Introduction to Copulas. Springer Series in Statistics, second edition, 2006. [4] Tim Bollerslev. Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31 (3):307–327, 1986. [5] Ser-Huang Poon and Clive W.J. Granger. Practical issues in forecasting volatility. Financial Analysts Journal, 61(1):45–56, 2005. [6] Peter Reinhard Hansen and Asger Lunde. A forecast comparison of volatility models: Does anything beat a GARCH(1,1). Journal of Applied Econometrics, 20(7):873–889, 2005. [7] Christian T. Brownlees, Robert F. Engle, and Bryan T. Kelly. A practical guide to volatility forecasting through calm and storm, 2009. Available at SSRN: http://ssrn.com/abstract=1502915. [8] T. Bollerslev and E. Ghysels. Periodic autoregressive conditional heteroscedasticity. Journal of Business and Economic Statistics, 14:139–151, 1996. [9] B.D. McCullough and C.G. Renfro. Benchmarks and software standards: A case study of GARCH procedures. Journal of Economic and Social Measurement, 25:59–71, 1998. [10] C. Brooks, S.P. Burke, and G. Persand. Benchmarks and the accuracy of GARCH model estimation. International Journal of Forecasting, 17:45–56, 2001. [11] Carl Edward Rasmussen and Christopher K.I. Williams. Gaussian processes for Machine Learning. The MIT Press, 2006. ` [12] Abe Sklar. Fonctions de r´ partition a n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris, 8: e 229–231, 1959. [13] P Deheuvels. Caract´ isation compl` te des lois extrˆ mes multivari´ s et de la convergence des types e e e e extrˆ mes. Publications de l’Institut de Statistique de l’Universit´ de Paris, 23:1–36, 1978. e e [14] G Kimeldorf and A Sampson. Uniform representations of bivariate distributions. Communications in Statistics, 4:617–627, 1982. [15] Edward Snelson, Carl Edward Rasmussen, and Zoubin Ghahramani. Warped Gaussian Processes. In NIPS, 2003. [16] Sebastian Jaimungal and Eddie K.H. Ng. Kernel-based Copula processes. In ECML PKDD, 2009. [17] A. Rodr´guez, D.B. Dunson, and A.E. Gelfand. Latent stick-breaking processes. Journal of the American ı Statistical Association, 105(490):647–659, 2010. [18] Jarno Vanhatalo, Pasi Jylanki, and Aki Vehtari. Gaussian process regression with Student-t likelihood. In NIPS, 2009. [19] Iain Murray, Ryan Prescott Adams, and David J.C. MacKay. Elliptical Slice Sampling. In AISTATS, 2010. [20] Paul W. Goldberg, Christopher K.I. Williams, and Christopher M. Bishop. Regression with inputdependent noise: A Gaussian process treatment. In NIPS, 1998. [21] Carl Edward Rasmussen. The Infinite Gaussian Mixture Model. In NIPS, 2000. [22] Ruey S. Tsay. Analysis of Financial Time Series. John Wiley & Sons, 2002. [23] Torben G. Andersen and Tim Bollerslev. Intraday periodicity and volatility persistence in financial markets. Journal of Empirical Finance, 4(2-3):115–158, 1997. 9
Reference: text
sentIndex sentText sentNum sentScore
1 uk Abstract We define a copula process which describes the dependencies between arbitrarily many random variables independently of their marginal distributions. [sent-6, score-0.6]
2 As an example, we develop a stochastic volatility model, Gaussian Copula Process Volatility (GCPV), to predict the latent standard deviations of a sequence of random variables. [sent-7, score-0.516]
3 To do this, it is convenient to separate dependence from the marginal distributions of our measurements. [sent-14, score-0.101]
4 And separating dependence from marginal distributions is precisely what a copula function does. [sent-16, score-0.465]
5 While copulas have recently become popular, especially in financial applications [1, 2], as Nelsen [3] writes, “the study of copulas and the role they play in probability, statistics, and stochastic processes is a subject still in its infancy. [sent-17, score-0.168]
6 ” Typically only bivariate (and recently trivariate) copulas are being used and studied. [sent-21, score-0.149]
7 It would therefore be useful to have a copula process, which can describe the dependencies between arbitrarily many random variables independently of their marginal distributions. [sent-23, score-0.553]
8 And as an example, we develop a stochastic volatility model, Gaussian Copula Process Volatility (GCPV). [sent-25, score-0.451]
9 In doing so, we provide a Bayesian framework for the learning the marginal distributions and dependency structure of what we call a Gaussian copula process. [sent-26, score-0.505]
10 The volatility of a random variable is its standard deviation. [sent-27, score-0.451]
11 Stochastic volatility models are used to predict the volatilities of a heteroscedastic sequence – a sequence of random variables with different variances, like distance measurements of a rocket as it leaves the Earth. [sent-28, score-0.631]
12 Heteroscedasticity is especially important in econometrics; the returns on equity indices, like the S&P; 500, or on currency exchanges, are heteroscedastic. [sent-30, score-0.133]
13 GARCH [4], a generalized version of Engle’s ARCH, is arguably unsurpassed for predicting the volatility of returns on equity indices and currency exchanges [5, 6, 7]. [sent-32, score-0.625]
14 Before discussing GCPV, we first introduce copulas and the copula process. [sent-34, score-0.448]
15 1 1 Copulas Copulas are important because they separate the dependency structure between random variables from their marginal distributions. [sent-41, score-0.166]
16 First we take each univariate random variable Xi and transform it through its cumulative distribution function (cdf) Fi to get Ui = Fi (Xi ), a uniform random variable. [sent-49, score-0.092]
17 Formally, an n-copula C : [0, 1]n → [0, 1] is a multivariate cdf with uniform univariate marginals: C(u1 , u2 , . [sent-54, score-0.167]
18 Sklar’s theorem Let H be an n-dimensional distribution function with marginal distribution functions F1 , F2 , . [sent-67, score-0.101]
19 , Fn are distribution functions, then the function H is an n-dimensional distribution function with marginal distribution functions F1 , F2 , . [sent-91, score-0.101]
20 For example, the bivariate Gaussian copula is defined as C(u, v) = Φρ (Φ−1 (u), Φ−1 (v)), (3) where Φρ is a bivariate Gaussian cdf with correlation coefficient ρ, and Φ is the standard univariate Gaussian cdf. [sent-106, score-0.661]
21 By substituting F (x) for u and G(y) for v in equation (3), we have a bivariate distribution H(x, y), with a Gaussian dependency structure, and marginals F and G. [sent-108, score-0.148]
22 It is then a copula itself that captures the underlying dependencies between random variables, regardless of their marginal distributions. [sent-110, score-0.5]
23 For this reason, copulas have been called dependence functions [13, 14]. [sent-111, score-0.084]
24 The result is a sample from a collection of Gaussian random variables, with a dependency structure encoded by the specified covariance function. [sent-114, score-0.11]
25 Now, suppose we transform each of these values through a univariate Gaussian cdf, such that we have a sample from a collection of uniform random variables. [sent-115, score-0.114]
26 These uniform random variables also have this underlying Gaussian process dependency structure. [sent-116, score-0.133]
27 We could subsequently put these values through an inverse beta cdf, to obtain a draw from what could be called a Gaussian-Beta Process: the values would be a sample from beta random variables, again with an underlying Gaussian process dependency structure. [sent-118, score-0.178]
28 We could also transform the uniform values with different inverse cdfs, which would give a sample from different random variables, with dependencies encoded by the Gaussian process. [sent-119, score-0.078]
29 The above procedure is a means to generate samples from arbitrarily many random variables, with arbitrary marginal distributions, and desired dependencies. [sent-120, score-0.129]
30 It is an example of how to use what we call a copula process – in this case, a Gaussian copula process, since a Gaussian copula describes the dependency structure of a finite number of samples. [sent-121, score-1.179]
31 Copula Process Let {Wt } be a collection of random variables indexed by t ∈ T , with marginal distribution functions Ft , and let Qt = Ft (Wt ). [sent-125, score-0.126]
32 Further, let µ be a stochastic process measure with marginal distribution functions Gt , and joint distribution function H. [sent-126, score-0.148]
33 Then Wt is copula process distributed with base measure µ, or Wt ∼ CP(µ), if and only if for all n ∈ N, ai ∈ R, n P( (−1) {Gti (Qti ) ≤ ai }) = Ht1 ,t2 ,. [sent-127, score-0.411]
34 Gaussian Copula Process Wt is Gaussian copula process distributed if it is copula process distributed and the base measure µ is a Gaussian process. [sent-137, score-0.822]
35 For example, if we have Wt ∼ GCP with m(t) = 0 and k(t, t) = 1, then in the definition of a copula process, Gt = Φ, the standard univariate Gaussian cdf, and H is the usual GP joint distribution function. [sent-139, score-0.435]
36 We described generally how a copula process can be used to generate samples of arbitrarily many random variables with desired marginals and dependencies. [sent-142, score-0.507]
37 We introduce a stochastic volatility model, Gaussian Copula Process Volatility (GCPV), as an example of how to learn the joint distribution of arbitrarily many random variables, the marginals of these random variables, and to make predictions. [sent-144, score-0.522]
38 To do this, we fit a Gaussian copula process by using a type of Warped Gaussian Process [15]. [sent-145, score-0.411]
39 The observations are random variables with different latent standard deviations. [sent-155, score-0.081]
40 We model the standard deviation function as a Gaussian copula process: σt ∼ GCP(g −1 , 0, k(t, t )). [sent-160, score-0.364]
41 (5) f (t) ∼ GP(m(t) = 0, k(t, t )) σ(t) = g(f (t), ω) (6) (7) y(t) ∼ N (0, σ 2 (t)), (8) Specifically, where g is a monotonic warping function, parametrized by ω. [sent-161, score-0.13]
42 , fn ) , where σ(ti ) = g(fi , ω), using the shorthand fi to mean f (ti ). [sent-168, score-0.105]
43 , up ), where Φ is the standard univariate Gaussian cdf (supposing k(t, t) = 1), ΦΓ is a multivariate Gaussian cdf with covariance matrix Γij = cov(g −1 (σi ), g −1 (σj )), and F is the marginal distribution of 3 each σi . [sent-190, score-0.37]
44 By learning g, we learn the marginal of each σ: F (a) = Φ(g −1 (a)) for a ∈ R. [sent-194, score-0.101]
45 Recently, a different sort of ‘kernel copula process’ has been used, where the marginals of the variables being modelled are not learned [16]. [sent-195, score-0.432]
46 We only assume that the marginal distributions of σt are stationary over ‘small’ time periods, and for each of these time periods (5)-(7) hold true. [sent-197, score-0.101]
47 (12) We also wish to learn z, which we can do by finding the z that maximizes the marginal likelihood, ˆ p(y|z) = p(y|f , ω)p(f |θ)df . [sent-205, score-0.101]
48 Using Laplace, we can also find an expression for an approximate marginal likelihood, which we maximize to determine z. [sent-210, score-0.101]
49 They instead use an expectation-maximization (EM) algorithm for finding f , and iterate ordered rank one Cholesky updates to evaluate the Laplace approximate marginal likelihood. [sent-226, score-0.101]
50 With a small modification of Newton’s method, we often get close to ˆ quadratic convergence for finding f , and can evaluate the Laplace approximate marginal likelihood in a numerically stable fashion, with no approximate Cholesky factors, and optimal computational requirements. [sent-228, score-0.101]
51 The approximate marginal likelihood q(y|z) is given by exp(˜)df . [sent-233, score-0.101]
52 For our setting, it is over 100 times as fast as axis aligned slice sampling with univariate updates. [sent-250, score-0.095]
53 3 Warping Function The warping function, g, maps fi , a GP function value, to σi , a standard deviation. [sent-261, score-0.182]
54 By inferring the parameters of the warping function, or distributions of these parameters, we are learning a transformation which will best model σt with a Gaussian process. [sent-268, score-0.13]
55 The more flexible the warping function, the more potential there is to improve the GCPV fit – in other words, the better we can estimate the ‘perfect’ transformation. [sent-269, score-0.13]
56 To test the importance of this flexibility, we also try a simple unparametrized warping function, g(x) = ex . [sent-270, score-0.198]
57 Once g is trained, we can infer the marginal distribution of each σ: F (a) = Φ(g −1 (a)), for a ∈ R. [sent-273, score-0.101]
58 5 Experiments In our experiments, we predict the latent standard deviations σ of observations y at times t, and also σ∗ at unobserved times t∗ . [sent-276, score-0.099]
59 The first variant, which we simply refer to as GCPV, uses the warping function (20) with K = 1, and squared exponential covariance function, k(t, t ) = A exp(−(t−t )2 /l2 ), with A = 1. [sent-278, score-0.204]
60 The second variant, which we call GP-EXP, uses the unparametrized warping function ex , and the same covariance function, except the amplitude A is a trained hyperparameter. [sent-279, score-0.246]
61 The other hyperparameter l is called the lengthscale of the covariance function. [sent-280, score-0.084]
62 We train hyperparameters by maximizing the Laplace approximate log marginal likelihood (17). [sent-282, score-0.101]
63 We make forecasts of volatility, and we predict historical volatility. [sent-291, score-0.223]
64 By ‘historical volatility’ we mean the volatility at observed time points, or between these points. [sent-292, score-0.451]
65 Although likelihood has advantages, we are limited in space, and we wish to harmonize with the econometrics literature, and other assessments of volatility models, where MSE is the standard. [sent-296, score-0.507]
66 In a similar assessment of volatility models, Brownlees et al. [sent-297, score-0.451]
67 [7], the rankings derived from high frequency squared observations are similar to those derived using daily squared observations. [sent-305, score-0.086]
68 So, for example, in TRIG we start by observing t = 0, and make forecasts at t = 0. [sent-324, score-0.084]
69 For historical volatility, we predict the latent σt at the observation times, which is safe since we are comparing to the true volatility, which is not used in training; the results are similar if we interpolate. [sent-333, score-0.161]
70 Figure 1 panels a) and b) show the true volatility for TRIG and JUMP respectively, alongside GCPV Laplace, GCPV MCMC, GP-EXP Laplace, and GARCH(1,1) predictions of historical volatility. [sent-334, score-0.613]
71 Table 1 shows the results for forecasting and historical volatility. [sent-335, score-0.169]
72 We also see that GCPV with an unparametrized exponential warping function tends to overestimates peaks and underestimate troughs. [sent-337, score-0.198]
73 In panel b), the volatility is extremely difficult to reconstruct or forecast – with no warning it will immediately and dramatically increase or decrease. [sent-338, score-0.482]
74 GARCH is known to respond slowly to large returns, and to overpredict volatility [22]. [sent-343, score-0.451]
75 com, and the returns are calculated as rt = log(Pt+1 /Pt ), where Pt is the number of DM to GBP on day t. [sent-349, score-0.081]
76 We use a rolling window of the previous 120 days of returns to make 1, 7, and 30 day ahead volatility forecasts, starting at the beginning of January 1988, and ending at the beginning of January 1992 (659 trading days). [sent-351, score-0.621]
77 Every time we retrain parameters, we predict historical volatility over the past 120 days. [sent-353, score-0.626]
78 The average MSE for these historical predictions is given in Table 1, although they should be observed with caution; unlike with the simulations, the DMGBP historical predictions are trained using the same data they are assessed on. [sent-354, score-0.324]
79 In Figure 1c), we see that the GARCH one day ahead forecasts are lifted above the GCPV forecasts, but unlike in the simulations, they are now operating on a similar lengthscale. [sent-355, score-0.156]
80 Overall, GCPV and GARCH are competitive with one another for forecasting currency exchange returns, as seen in Table 1. [sent-358, score-0.127]
81 Moreover, a learned warping function g outperforms an unparametrized one, and a full Laplace solution is comparable to using MCMC for inference, in accuracy and speed. [sent-359, score-0.198]
82 6 Discussion We defined a copula process, and as an example, developed a stochastic volatility model, GCPV, which can outperform GARCH. [sent-362, score-0.815]
83 With GCPV, the volatility σt is distributed as a Gaussian Copula Process, which separates the modelling of the dependencies between volatilities at different times from their marginal distributions – arguably the most useful property of a copula. [sent-363, score-0.628]
84 Further, GCPV fits the marginals in the Gaussian copula process by learning a warping function. [sent-364, score-0.584]
85 If we had simply chosen an unparametrized exponential warping function, we would incorrectly be assuming that the log 7 Table 1: MSE for predicting volatility. [sent-365, score-0.198]
86 01 Figure 1: Predicting volatility and learning its marginal pdf. [sent-419, score-0.552]
87 a) shows predictions of historical volatility for TRIG, where the shade is a 95% confidence interval about GCPV (MCMC) predictions. [sent-421, score-0.613]
88 In c), a black line and a dashed red line respectively show GCPV (LA) and GARCH one day ahead volatility forecasts for DMGBP. [sent-423, score-0.607]
89 In d), a black line and a dashed blue line respectively show the GCPV learned marginal pdf of σt in DMGBP and a Gamma(4. [sent-424, score-0.126]
90 Indeed, for the DMGBP data, we trained the warping function g over a 120 day period, and mapped its inverse through the univariate standard Gaussian cdf Φ, and differenced, to estimate the marginal probability density function (pdf) of σt over this period. [sent-428, score-0.414]
91 The learned marginal pdf, shown in Figure 1d), is similar to a Gamma(4. [sent-429, score-0.101]
92 However, in using a rolling window to retrain the parameters of g, we do not assume that the marginals of σt are stationary; we have a time changing warping function. [sent-432, score-0.236]
93 We can effortlessly handle missing data, we can easily incorporate covariates other than time (like interest rates) in our covariance function, and we can choose from a rich class of covariance functions – squared exponential, Brownian motion, Mat´ rn, periodic, etc. [sent-434, score-0.122]
94 In fact, the e volatility of high frequency intradaily returns on equity indices and currency exchanges is cyclical [23], and GCPV with a periodic covariance function is uniquely well suited to this data. [sent-435, score-0.697]
95 And the parameters of GCPV, like the covariance function lengthscale, or the learned warping function, provide insight into the underlying source of volatility, unlike the parameters of GARCH. [sent-436, score-0.178]
96 Finally, copulas are rapidly becoming popular in applications, but often only bivariate copulas are being used. [sent-437, score-0.233]
97 With our copula process one can learn the dependencies between arbitrarily many random variables independently of their marginal distributions. [sent-438, score-0.6]
98 We hope the Gaussian Copula Process Volatility model will encourage other applications of copula processes. [sent-439, score-0.364]
99 A forecast comparison of volatility models: Does anything beat a GARCH(1,1). [sent-464, score-0.482]
100 A practical guide to volatility forecasting through calm and storm, 2009. [sent-470, score-0.503]
wordName wordTfidf (topN-words)
[('gcpv', 0.533), ('volatility', 0.451), ('copula', 0.364), ('garch', 0.348), ('warping', 0.13), ('historical', 0.117), ('marginal', 0.101), ('laplace', 0.1), ('dmgbp', 0.096), ('copulas', 0.084), ('forecasts', 0.084), ('cdf', 0.075), ('univariate', 0.071), ('un', 0.071), ('gaussian', 0.07), ('gcp', 0.068), ('rocket', 0.068), ('trig', 0.068), ('unparametrized', 0.068), ('bivariate', 0.065), ('mcmc', 0.063), ('econometrics', 0.056), ('gp', 0.056), ('brownlees', 0.055), ('fn', 0.053), ('fi', 0.052), ('forecasting', 0.052), ('currency', 0.048), ('covariance', 0.048), ('newton', 0.047), ('process', 0.047), ('nancial', 0.046), ('predictions', 0.045), ('returns', 0.044), ('marginals', 0.043), ('edward', 0.042), ('engle', 0.041), ('equity', 0.041), ('exchanges', 0.041), ('gti', 0.041), ('volatilities', 0.041), ('mse', 0.041), ('wt', 0.04), ('dependency', 0.04), ('la', 0.038), ('day', 0.037), ('financial', 0.036), ('lengthscale', 0.036), ('retrain', 0.036), ('rasmussen', 0.036), ('dependencies', 0.035), ('ahead', 0.035), ('observations', 0.034), ('bj', 0.033), ('carl', 0.032), ('forecast', 0.031), ('jump', 0.03), ('ap', 0.028), ('arbitrarily', 0.028), ('days', 0.027), ('extr', 0.027), ('gbp', 0.027), ('guez', 0.027), ('kqk', 0.027), ('nelsen', 0.027), ('qti', 0.027), ('rolling', 0.027), ('cholesky', 0.027), ('exchange', 0.027), ('squared', 0.026), ('gaussians', 0.026), ('variables', 0.025), ('pdf', 0.025), ('draw', 0.025), ('periodic', 0.024), ('heteroscedastic', 0.024), ('rodr', 0.024), ('snelson', 0.024), ('tim', 0.024), ('vanhatalo', 0.024), ('slice', 0.024), ('df', 0.023), ('zoubin', 0.022), ('latent', 0.022), ('sample', 0.022), ('sklar', 0.022), ('elliptical', 0.022), ('bf', 0.022), ('fb', 0.022), ('supposing', 0.022), ('beta', 0.022), ('predict', 0.022), ('aj', 0.021), ('economic', 0.021), ('gt', 0.021), ('ti', 0.021), ('deviations', 0.021), ('correlation', 0.021), ('uniform', 0.021), ('earth', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.000001 54 nips-2010-Copula Processes
Author: Andrew Wilson, Zoubin Ghahramani
Abstract: We define a copula process which describes the dependencies between arbitrarily many random variables independently of their marginal distributions. As an example, we develop a stochastic volatility model, Gaussian Copula Process Volatility (GCPV), to predict the latent standard deviations of a sequence of random variables. To make predictions we use Bayesian inference, with the Laplace approximation, and with Markov chain Monte Carlo as an alternative. We find our model can outperform GARCH on simulated and financial data. And unlike GARCH, GCPV can easily handle missing data, incorporate covariates other than time, and model a rich class of covariance structures. Imagine measuring the distance of a rocket as it leaves Earth, and wanting to know how these measurements correlate with one another. How much does the value of the measurement at fifteen minutes depend on the measurement at five minutes? Once we’ve learned this correlation structure, suppose we want to compare it to the dependence between measurements of the rocket’s velocity. To do this, it is convenient to separate dependence from the marginal distributions of our measurements. At any given time, a rocket’s distance from Earth could have a Gamma distribution, while its velocity could have a Gaussian distribution. And separating dependence from marginal distributions is precisely what a copula function does. While copulas have recently become popular, especially in financial applications [1, 2], as Nelsen [3] writes, “the study of copulas and the role they play in probability, statistics, and stochastic processes is a subject still in its infancy. There are many open problems. . . ” Typically only bivariate (and recently trivariate) copulas are being used and studied. In our introductory example, we are interested in learning the correlations in different stochastic processes, and comparing them. It would therefore be useful to have a copula process, which can describe the dependencies between arbitrarily many random variables independently of their marginal distributions. We define such a process. And as an example, we develop a stochastic volatility model, Gaussian Copula Process Volatility (GCPV). In doing so, we provide a Bayesian framework for the learning the marginal distributions and dependency structure of what we call a Gaussian copula process. The volatility of a random variable is its standard deviation. Stochastic volatility models are used to predict the volatilities of a heteroscedastic sequence – a sequence of random variables with different variances, like distance measurements of a rocket as it leaves the Earth. As the rocket gets further away, the variance on the measurements increases. Heteroscedasticity is especially important in econometrics; the returns on equity indices, like the S&P; 500, or on currency exchanges, are heteroscedastic. Indeed, in 2003, Robert Engle won the Nobel Prize in economics “for methods of analyzing economic time series with time-varying volatility”. GARCH [4], a generalized version of Engle’s ARCH, is arguably unsurpassed for predicting the volatility of returns on equity indices and currency exchanges [5, 6, 7]. GCPV can outperform GARCH, and is competitive on financial data that especially suits GARCH [8, 9, 10]. Before discussing GCPV, we first introduce copulas and the copula process. For a review of Gaussian processes, see Rasmussen and Williams [11]. ∗ † http://mlg.eng.cam.ac.uk/andrew Also at the machine learning department at Carnegie Mellon University. 1 1 Copulas Copulas are important because they separate the dependency structure between random variables from their marginal distributions. Intuitively, we can describe the dependency structure of any multivariate joint distribution H(x1 , . . . , xn ) = P (X1 ≤ x1 , . . . Xn ≤ xn ) through a two step process. First we take each univariate random variable Xi and transform it through its cumulative distribution function (cdf) Fi to get Ui = Fi (Xi ), a uniform random variable. We then express the dependencies between these transformed variables through the n-copula C(u1 , . . . , un ). Formally, an n-copula C : [0, 1]n → [0, 1] is a multivariate cdf with uniform univariate marginals: C(u1 , u2 , . . . , un ) = P (U1 ≤ u1 , U2 ≤ u2 , . . . , Un ≤ un ), where U1 , U2 , . . . , Un are standard uniform random variables. Sklar [12] precisely expressed our intuition in the theorem below. Theorem 1.1. Sklar’s theorem Let H be an n-dimensional distribution function with marginal distribution functions F1 , F2 , . . . , Fn . Then there exists an n-copula C such that for all (x1 , x2 , . . . , xn ) ∈ [−∞, ∞]n , H(x1 , x2 , . . . , xn ) = C(F1 (x1 ), F2 (x2 ), . . . , Fn (xn )) = C(u1 , u2 , . . . , un ). (1) If F1 , F2 , . . . , Fn are all continuous then C is unique; otherwise C is uniquely determined on Range F1 × Range F2 × · · · × Range Fn . Conversely, if C is an n-copula and F1 , F2 , . . . , Fn are distribution functions, then the function H is an n-dimensional distribution function with marginal distribution functions F1 , F2 , . . . , Fn . (−1) As a corollary, if Fi (u) = inf{x : F (x) ≥ u}, the quasi-inverse of Fi , then for all u1 , u2 , . . . , un ∈ [0, 1]n , (−1) C(u1 , u2 , . . . , un ) = H(F1 (−1) (u1 ), F2 (−1) (u2 ), . . . , Fn (un )). (2) In other words, (2) can be used to construct a copula. For example, the bivariate Gaussian copula is defined as C(u, v) = Φρ (Φ−1 (u), Φ−1 (v)), (3) where Φρ is a bivariate Gaussian cdf with correlation coefficient ρ, and Φ is the standard univariate Gaussian cdf. Li [2] popularised the bivariate Gaussian copula, by showing how it could be used to study financial risk and default correlation, using credit derivatives as an example. By substituting F (x) for u and G(y) for v in equation (3), we have a bivariate distribution H(x, y), with a Gaussian dependency structure, and marginals F and G. Regardless of F and G, the resulting H(x, y) can still be uniquely expressed as a Gaussian copula, so long as F and G are continuous. It is then a copula itself that captures the underlying dependencies between random variables, regardless of their marginal distributions. For this reason, copulas have been called dependence functions [13, 14]. Nelsen [3] contains an extensive discussion of copulas. 2 Copula Processes Imagine choosing a covariance function, and then drawing a sample function at some finite number of points from a Gaussian process. The result is a sample from a collection of Gaussian random variables, with a dependency structure encoded by the specified covariance function. Now, suppose we transform each of these values through a univariate Gaussian cdf, such that we have a sample from a collection of uniform random variables. These uniform random variables also have this underlying Gaussian process dependency structure. One might call the resulting values a draw from a Gaussian-Uniform Process. We could subsequently put these values through an inverse beta cdf, to obtain a draw from what could be called a Gaussian-Beta Process: the values would be a sample from beta random variables, again with an underlying Gaussian process dependency structure. We could also transform the uniform values with different inverse cdfs, which would give a sample from different random variables, with dependencies encoded by the Gaussian process. The above procedure is a means to generate samples from arbitrarily many random variables, with arbitrary marginal distributions, and desired dependencies. It is an example of how to use what we call a copula process – in this case, a Gaussian copula process, since a Gaussian copula describes the dependency structure of a finite number of samples. We now formally define a copula process. 2 Definition 2.1. Copula Process Let {Wt } be a collection of random variables indexed by t ∈ T , with marginal distribution functions Ft , and let Qt = Ft (Wt ). Further, let µ be a stochastic process measure with marginal distribution functions Gt , and joint distribution function H. Then Wt is copula process distributed with base measure µ, or Wt ∼ CP(µ), if and only if for all n ∈ N, ai ∈ R, n P( (−1) {Gti (Qti ) ≤ ai }) = Ht1 ,t2 ,...,tn (a1 , a2 , . . . , an ). (4) i=1 (−1) Each Qti ∼ Uniform(0, 1), and Gti is the quasi-inverse of Gti , as previously defined. Definition 2.2. Gaussian Copula Process Wt is Gaussian copula process distributed if it is copula process distributed and the base measure µ is a Gaussian process. If there is a mapping Ψ such that Ψ(Wt ) ∼ GP(m(t), k(t, t )), then we write Wt ∼ GCP(Ψ, m(t), k(t, t )). For example, if we have Wt ∼ GCP with m(t) = 0 and k(t, t) = 1, then in the definition of a copula process, Gt = Φ, the standard univariate Gaussian cdf, and H is the usual GP joint distribution function. Supposing this GCP is a Gaussian-Beta process, then Ψ = Φ−1 ◦ FB , where FB is a univariate Beta cdf. One could similarly define other copula processes. We described generally how a copula process can be used to generate samples of arbitrarily many random variables with desired marginals and dependencies. We now develop a specific and practical application of this framework. We introduce a stochastic volatility model, Gaussian Copula Process Volatility (GCPV), as an example of how to learn the joint distribution of arbitrarily many random variables, the marginals of these random variables, and to make predictions. To do this, we fit a Gaussian copula process by using a type of Warped Gaussian Process [15]. However, our methodology varies substantially from Snelson et al. [15], since we are doing inference on latent variables as opposed to observations, which is a much greater undertaking that involves approximations, and we are doing so in a different context. 3 Gaussian Copula Process Volatility Assume we have a sequence of observations y = (y1 , . . . , yn ) at times t = (t1 , . . . , tn ) . The observations are random variables with different latent standard deviations. We therefore have n unobserved standard deviations, σ1 , . . . , σn , and want to learn the correlation structure between these standard deviations, and also to predict the distribution of σ∗ at some unrealised time t∗ . We model the standard deviation function as a Gaussian copula process: σt ∼ GCP(g −1 , 0, k(t, t )). (5) f (t) ∼ GP(m(t) = 0, k(t, t )) σ(t) = g(f (t), ω) (6) (7) y(t) ∼ N (0, σ 2 (t)), (8) Specifically, where g is a monotonic warping function, parametrized by ω. For each of the observations y = (y1 , . . . , yn ) we have corresponding GP latent function values f = (f1 , . . . , fn ) , where σ(ti ) = g(fi , ω), using the shorthand fi to mean f (ti ). σt ∼ GCP, because any finite sequence (σ1 , . . . , σp ) is distributed as a Gaussian copula: P (σ1 ≤ a1 , . . . , σp ≤ ap ) = P (g −1 (σ1 ) ≤ g −1 (a1 ), . . . , g −1 (σp ) ≤ g −1 (ap )) = ΦΓ (g −1 (a1 ), . . . , g −1 −1 (ap )) = ΦΓ (Φ = ΦΓ (Φ −1 (F (a1 )), . . . , Φ (u1 ), . . . , Φ −1 −1 (9) (F (ap ))) (up )) = C(u1 , . . . , up ), where Φ is the standard univariate Gaussian cdf (supposing k(t, t) = 1), ΦΓ is a multivariate Gaussian cdf with covariance matrix Γij = cov(g −1 (σi ), g −1 (σj )), and F is the marginal distribution of 3 each σi . In (5), we have Ψ = g −1 , because it is g −1 which maps σt to a GP. The specification in (5) is equivalently expressed by (6) and (7). With GCPV, the form of g is learned so that g −1 (σt ) is best modelled by a GP. By learning g, we learn the marginal of each σ: F (a) = Φ(g −1 (a)) for a ∈ R. Recently, a different sort of ‘kernel copula process’ has been used, where the marginals of the variables being modelled are not learned [16].1 Further, we also consider a more subtle and flexible form of our model, where the function g itself is indexed by time: g = gt (f (t), ω). We only assume that the marginal distributions of σt are stationary over ‘small’ time periods, and for each of these time periods (5)-(7) hold true. We return to this in the final discussion section. Here we have assumed that each observation, conditioned on knowing its variance, is normally distributed with zero mean. This is a common assumption in heteroscedastic models. The zero mean and normality assumptions can be relaxed and are not central to this paper. 4 Predictions with GCPV Ultimately, we wish to infer p(σ(t∗ )|y, z), where z = {θ, ω}, and θ are the hyperparameters of the GP covariance function. To do this, we sample from p(f∗ |y, z) = p(f∗ |f , θ)p(f |y, z)df (10) and then transform these samples by g. Letting (Cf )ij = δij g(fi , ω)2 , where δij is the Kronecker delta, Kij = k(ti , tj ), (k∗ )i = k(t∗ , ti ), we have p(f |y, z) = N (f ; 0, K)N (y; 0, Cf )/p(y|z), p(f∗ |f , θ) = N (k∗ K −1 f , k(t∗ , t∗ ) − k∗ K −1 (11) k∗ ). (12) We also wish to learn z, which we can do by finding the z that maximizes the marginal likelihood, ˆ p(y|z) = p(y|f , ω)p(f |θ)df . (13) Unfortunately, for many functions g, (10) and (13) are intractable. Our methods of dealing with this can be used in very general circumstances, where one has a Gaussian process prior, but an (optionally parametrized) non-Gaussian likelihood. We use the Laplace approximation to estimate p(f |y, z) as a Gaussian. Then we can integrate (10) for a Gaussian approximation to p(f∗ |y, z), which we sample from to make predictions of σ∗ . Using Laplace, we can also find an expression for an approximate marginal likelihood, which we maximize to determine z. Once we have found z with Laplace, we use Markov chain Monte Carlo to sample from p(f∗ |y, z), and compare that to using Laplace to sample from p(f∗ |y, z). In the supplement we relate this discussion to (9). 4.1 Laplace Approximation The goal is to approximate (11) with a Gaussian, so that we can evaluate (10) and (13) and make predictions. In doing so, we follow Rasmussen and Williams [11] in their treatment of Gaussian process classification, except we use a parametrized likelihood, and modify Newton’s method. First, consider as an objective function the logarithm of an unnormalized (11): s(f |y, z) = log p(y|f , ω) + log p(f |θ). (14) ˆ The Laplace approximation uses a second order Taylor expansion about the f which maximizes ˆ, for which we use (14), to find an approximate objective s(f |y, z). So the first step is to find f ˜ Newton’s method. The Newton update is f new = f − ( s(f ))−1 s(f ). Differentiating (14), s(f |y, z) = s(f |y, z) = where W is the diagonal matrix − 1 log p(y|f , ω) − K −1 f log p(y|f , ω) − K −1 = −W − K −1 , log p(y|f , ω). Note added in proof : Also, for a very recent related model, see Rodr´guez et al. [17]. ı 4 (15) (16) If the likelihood function p(y|f , ω) is not log concave, then W may have negative entries. Vanhatalo et al. [18] found this to be problematic when doing Gaussian process regression with a Student-t ˆ likelihood. They instead use an expectation-maximization (EM) algorithm for finding f , and iterate ordered rank one Cholesky updates to evaluate the Laplace approximate marginal likelihood. But EM can converge slowly, especially near a local optimum, and each of the rank one updates is vulnerable to numerical instability. With a small modification of Newton’s method, we often get close to ˆ quadratic convergence for finding f , and can evaluate the Laplace approximate marginal likelihood in a numerically stable fashion, with no approximate Cholesky factors, and optimal computational requirements. Some comments are in the supplementary material but, in short, we use an approximate negative Hessian, − s ≈ M + K −1 , which is guaranteed to be positive definite, since M is formed on each iteration by zeroing the negative entries of W . For stability, we reformulate our 1 1 1 1 optimization in terms of B = I + M 2 KM 2 , and let Q = M 2 B −1 M 2 , b = M f + log p(y|f ), a = b − QKb. Since (K −1 + M )−1 = K − KQK, the Newton update becomes f new = Ka. ˆ With these updates we find f and get an expression for s which we use to approximate (13) and ˜ (11). The approximate marginal likelihood q(y|z) is given by exp(˜)df . Taking its logarithm, s 1ˆ 1 ˆ log q(y|z) = − f af + log p(y|f ) − log |Bf |, (17) ˆ ˆ 2 2 ˆ ˆ where Bf is B evaluated at f , and af is a numerically stable evaluation of K −1 f . ˆ ˆ To learn the parameters z, we use conjugate gradient descent to maximize (17) with respect to z. ˆ ˆ Since f is a function of z, we initialize z, and update f every time we vary z. Once we have found an optimum z , we can make predictions. By exponentiating s, we find a Gaussian approximation to ˆ ˜ ˆ the posterior (11), q(f |y, z) = N (f , K − KQK). The product of this approximate posterior with p(f∗ |f ) is Gaussian. Integrating this product, we approximate p(f∗ |y, z) as ˆ q(f∗ |y, z) = N (k∗ log p(y|f ), k(t∗ , t∗ ) − k∗ Qk∗ ). (18) Given n training observations, the cost of each Newton iteration is dominated by computing the cholesky decomposition of B, which takes O(n3 ) operations. The objective function typically changes by less than 10−6 after 3 iterations. Once Newton’s method has converged, it takes only O(1) operations to draw from q(f∗ |y, z) and make predictions. 4.2 Markov chain Monte Carlo We use Markov chain Monte Carlo (MCMC) to sample from (11), so that we can later sample from p(σ∗ |y, z) to make predictions. Sampling from (11) is difficult, because the variables f are strongly coupled by a Gaussian process prior. We use a new technique, Elliptical Slice Sampling [19], and find it extremely effective for this purpose. It was specifically designed to sample from posteriors with correlated Gaussian priors. It has no free parameters, and jointly updates every element of f . For our setting, it is over 100 times as fast as axis aligned slice sampling with univariate updates. To make predictions, we take J samples of p(f |y, z), {f 1 , . . . , f J }, and then approximate (10) as a mixture of J Gaussians: J 1 p(f∗ |f i , θ). (19) p(f∗ |y, z) ≈ J i=1 Each of the Gaussians in this mixture have equal weight. So for each sample of f∗ |y, we uniformly choose a random p(f∗ |f i , θ) and draw a sample. In the limit J → ∞, we are sampling from the exact p(f∗ |y, z). Mapping these samples through g gives samples from p(σ∗ |y, z). After one O(n3 ) and one O(J) operation, a draw from (19) takes O(1) operations. 4.3 Warping Function The warping function, g, maps fi , a GP function value, to σi , a standard deviation. Since fi can take any value in R, and σi can take any non-negative real value, g : R → R+ . For each fi to correspond to a unique deviation, g must also be one-to-one. We use K g(x, ω) = aj log[exp[bj (x + cj )] + 1], j=1 5 aj , bj > 0. (20) This is monotonic, positive, infinitely differentiable, asymptotic towards zero as x → −∞, and K tends to ( j=1 aj bj )x as x → ∞. In practice, it is useful to add a small constant to (20), to avoid rare situations where the parameters ω are trained to make g extremely small for certain inputs, at the expense of a good overall fit; this can happen when the parameters ω are learned by optimizing a likelihood. A suitable constant could be one tenth the absolute value of the smallest nonzero observation. By inferring the parameters of the warping function, or distributions of these parameters, we are learning a transformation which will best model σt with a Gaussian process. The more flexible the warping function, the more potential there is to improve the GCPV fit – in other words, the better we can estimate the ‘perfect’ transformation. To test the importance of this flexibility, we also try a simple unparametrized warping function, g(x) = ex . In related work, Goldberg et al. [20] place a GP prior on the log noise level in a standard GP regression model on observations, except for inference they use Gibbs sampling, and a high level of ‘jitter’ for conditioning. Once g is trained, we can infer the marginal distribution of each σ: F (a) = Φ(g −1 (a)), for a ∈ R. This suggests an alternate way to initialize g: we can initialize F as a mixture of Gaussians, and then map through Φ−1 to find g −1 . Since mixtures of Gaussians are dense in the set of probability distributions, we could in principle find the ‘perfect’ g using an infinite mixture of Gaussians [21]. 5 Experiments In our experiments, we predict the latent standard deviations σ of observations y at times t, and also σ∗ at unobserved times t∗ . To do this, we use two versions of GCPV. The first variant, which we simply refer to as GCPV, uses the warping function (20) with K = 1, and squared exponential covariance function, k(t, t ) = A exp(−(t−t )2 /l2 ), with A = 1. The second variant, which we call GP-EXP, uses the unparametrized warping function ex , and the same covariance function, except the amplitude A is a trained hyperparameter. The other hyperparameter l is called the lengthscale of the covariance function. The greater l, the greater the covariance between σt and σt+a for a ∈ R. We train hyperparameters by maximizing the Laplace approximate log marginal likelihood (17). We then sample from p(f∗ |y) using the Laplace approximation (18). We also do this using MCMC (19) with J = 10000, after discarding a previous 10000 samples of p(f |y) as burn-in. We pass 2 these samples of f∗ |y through g and g 2 to draw from p(σ∗ |y) and p(σ∗ |y), and compute the sample mean and variance of σ∗ |y. We use the sample mean as a point predictor, and the sample variance for error bounds on these predictions, and we use 10000 samples to compute these quantities. For GCPV we use Laplace and MCMC for inference, but for GP-EXP we only use Laplace. We compare predictions to GARCH(1,1), which has been shown in extensive and recent reviews to be competitive with other GARCH variants, and more sophisticated models [5, 6, 7]. GARCH(p,q) specifies y(t) ∼ p 2 2 N (0, σ 2 (t)), and lets the variance be a deterministic function of the past: σt = a0 + i=1 ai yt−i + q 2 j=1 bj σt−j . We use the Matlab Econometrics Toolbox implementation of GARCH, where the parameters a0 , ai and bj are estimated using a constrained maximum likelihood. We make forecasts of volatility, and we predict historical volatility. By ‘historical volatility’ we mean the volatility at observed time points, or between these points. Uncovering historical volatility is important. It could, for instance, be used to study what causes fluctuations in the stock market, or to understand physical systems. To evaluate our model, we use the Mean Squared Error (MSE) between the true variance, or proxy for the truth, and the predicted variance. Although likelihood has advantages, we are limited in space, and we wish to harmonize with the econometrics literature, and other assessments of volatility models, where MSE is the standard. In a similar assessment of volatility models, Brownlees et al. [7] found that MSE and quasi-likelihood rankings were comparable. When the true variance is unknown we follow Brownlees et al. [7] and use squared observations as a proxy for the truth, to compare our model to GARCH.2 The more observations, the more reliable these performance estimates will be. However, not many observations (e.g. 100) are needed for a stable ranking of competing models; in Brownlees et al. [7], the rankings derived from high frequency squared observations are similar to those derived using daily squared observations. 2 Since each observation y is assumed to have zero mean and variance σ 2 , E[y 2 ] = σ 2 . 6 5.1 Simulations We simulate observations from N (0, σ 2 (t)), using σ(t) = sin(t) cos(t2 ) + 1, at t = (0, 0.02, 0.04, . . . , 4) . We call this data set TRIG. We also simulate using a standard deviation that jumps from 0.1 to 7 and back, at times t = (0, 0.1, 0.2, . . . , 6) . We call this data set JUMP. To forecast, we use all observations up until the current time point, and make 1, 7, and 30 step ahead predictions. So, for example, in TRIG we start by observing t = 0, and make forecasts at t = 0.02, 0.14, 0.60. Then we observe t = 0, 0.02 and make forecasts at t = 0.04, 0.16, 0.62, and so on, until all data points have been observed. For historical volatility, we predict the latent σt at the observation times, which is safe since we are comparing to the true volatility, which is not used in training; the results are similar if we interpolate. Figure 1 panels a) and b) show the true volatility for TRIG and JUMP respectively, alongside GCPV Laplace, GCPV MCMC, GP-EXP Laplace, and GARCH(1,1) predictions of historical volatility. Table 1 shows the results for forecasting and historical volatility. In panel a) we see that GCPV more accurately captures the dependencies between σ at different times points than GARCH: if we manually decrease the lengthscale in the GCPV covariance function, we can replicate the erratic GARCH behaviour, which inaccurately suggests that the covariance between σt and σt+a decreases quickly with increases in a. We also see that GCPV with an unparametrized exponential warping function tends to overestimates peaks and underestimate troughs. In panel b), the volatility is extremely difficult to reconstruct or forecast – with no warning it will immediately and dramatically increase or decrease. This behaviour is not suited to a smooth squared exponential covariance function. Nevertheless, GCPV outperforms GARCH, especially in regions of low volatility. We also see this in panel a) for t ∈ (1.5, 2). GARCH is known to respond slowly to large returns, and to overpredict volatility [22]. In JUMP, the greater the peaks, and the smaller the troughs, the more GARCH suffers, while GCPV is mostly robust to these changes. 5.2 Financial Data The returns on the daily exchange rate between the Deutschmark (DM) and the Great Britain Pound (GBP) from 1984 to 1992 have become a benchmark for assessing the performance of GARCH models [8, 9, 10]. This exchange data, which we refer to as DMGBP, can be obtained from www.datastream.com, and the returns are calculated as rt = log(Pt+1 /Pt ), where Pt is the number of DM to GBP on day t. The returns are assumed to have a zero mean function. We use a rolling window of the previous 120 days of returns to make 1, 7, and 30 day ahead volatility forecasts, starting at the beginning of January 1988, and ending at the beginning of January 1992 (659 trading days). Every 7 days, we retrain the parameters of GCPV and GARCH. Every time we retrain parameters, we predict historical volatility over the past 120 days. The average MSE for these historical predictions is given in Table 1, although they should be observed with caution; unlike with the simulations, the DMGBP historical predictions are trained using the same data they are assessed on. In Figure 1c), we see that the GARCH one day ahead forecasts are lifted above the GCPV forecasts, but unlike in the simulations, they are now operating on a similar lengthscale. This suggests that GARCH could still be overpredicting volatility, but that GCPV has adapted its estimation of how σt and σt+a correlate with one another. Since GARCH is suited to this financial data set, it is reassuring that GCPV predictions have a similar time varying structure. Overall, GCPV and GARCH are competitive with one another for forecasting currency exchange returns, as seen in Table 1. Moreover, a learned warping function g outperforms an unparametrized one, and a full Laplace solution is comparable to using MCMC for inference, in accuracy and speed. This is also true for the simulations. Therefore we recommend whichever is more convenient to implement. 6 Discussion We defined a copula process, and as an example, developed a stochastic volatility model, GCPV, which can outperform GARCH. With GCPV, the volatility σt is distributed as a Gaussian Copula Process, which separates the modelling of the dependencies between volatilities at different times from their marginal distributions – arguably the most useful property of a copula. Further, GCPV fits the marginals in the Gaussian copula process by learning a warping function. If we had simply chosen an unparametrized exponential warping function, we would incorrectly be assuming that the log 7 Table 1: MSE for predicting volatility. Data set Model Historical 1 step 7 step 30 step TRIG GCPV (LA) GCPV (MCMC) GP-EXP GARCH 0.0953 0.0760 0.193 0.938 0.588 0.622 0.646 1.04 0.951 0.979 1.36 1.79 1.71 1.76 1.15 5.12 JUMP GCPV (LA) GCPV (MCMC) GP-EXP GARCH 0.588 1.21 1.43 1.88 0.891 0.951 1.76 1.58 1.38 1.37 6.95 3.43 1.35 1.35 14.7 5.65 GCPV (LA) GCPV (MCMC) GP-EXP GARCH 2.43 2.39 2.52 2.83 3.00 3.00 3.20 3.03 3.08 3.08 3.46 3.12 3.17 3.17 5.14 3.32 ×103 DMGBP ×10−9 TRIG JUMP DMGBP 20 DMGBP 0.015 600 Probability Density 3 1 Volatility Volatility Volatility 15 2 10 0.01 0.005 5 0 0 1 2 Time (a) 3 4 0 0 2 4 0 6 Time (b) 0 200 400 Days (c) 600 400 200 0 0 0.005 σ (d) 0.01 Figure 1: Predicting volatility and learning its marginal pdf. For a) and b), the true volatility, and GCPV (MCMC), GCPV (LA), GP-EXP, and GARCH predictions, are shown respectively by a thick green line, a dashed thick blue line, a dashed black line, a cyan line, and a red line. a) shows predictions of historical volatility for TRIG, where the shade is a 95% confidence interval about GCPV (MCMC) predictions. b) shows predictions of historical volatility for JUMP. In c), a black line and a dashed red line respectively show GCPV (LA) and GARCH one day ahead volatility forecasts for DMGBP. In d), a black line and a dashed blue line respectively show the GCPV learned marginal pdf of σt in DMGBP and a Gamma(4.15,0.00045) pdf. volatilities are marginally Gaussian distributed. Indeed, for the DMGBP data, we trained the warping function g over a 120 day period, and mapped its inverse through the univariate standard Gaussian cdf Φ, and differenced, to estimate the marginal probability density function (pdf) of σt over this period. The learned marginal pdf, shown in Figure 1d), is similar to a Gamma(4.15,0.00045) distribution. However, in using a rolling window to retrain the parameters of g, we do not assume that the marginals of σt are stationary; we have a time changing warping function. While GARCH is successful, and its simplicity is attractive, our model is also simple and has a number of advantages. We can effortlessly handle missing data, we can easily incorporate covariates other than time (like interest rates) in our covariance function, and we can choose from a rich class of covariance functions – squared exponential, Brownian motion, Mat´ rn, periodic, etc. In fact, the e volatility of high frequency intradaily returns on equity indices and currency exchanges is cyclical [23], and GCPV with a periodic covariance function is uniquely well suited to this data. And the parameters of GCPV, like the covariance function lengthscale, or the learned warping function, provide insight into the underlying source of volatility, unlike the parameters of GARCH. Finally, copulas are rapidly becoming popular in applications, but often only bivariate copulas are being used. With our copula process one can learn the dependencies between arbitrarily many random variables independently of their marginal distributions. We hope the Gaussian Copula Process Volatility model will encourage other applications of copula processes. More generally, we hope our work will help bring together the machine learning and econometrics communities. Acknowledgments: Thanks to Carl Edward Rasmussen and Ferenc Husz´ r for helpful conversaa tions. AGW is supported by an NSERC grant. 8 References [1] Paul Embrechts, Alexander McNeil, and Daniel Straumann. Correlation and dependence in risk management: Properties and pitfalls. In Risk Management: Value at risk and beyond, pages 176–223. Cambridge University Press, 1999. [2] David X. Li. On default correlation: A copula function approach. Journal of Fixed Income, 9(4):43–54, 2000. [3] Roger B. Nelsen. An Introduction to Copulas. Springer Series in Statistics, second edition, 2006. [4] Tim Bollerslev. Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31 (3):307–327, 1986. [5] Ser-Huang Poon and Clive W.J. Granger. Practical issues in forecasting volatility. Financial Analysts Journal, 61(1):45–56, 2005. [6] Peter Reinhard Hansen and Asger Lunde. A forecast comparison of volatility models: Does anything beat a GARCH(1,1). Journal of Applied Econometrics, 20(7):873–889, 2005. [7] Christian T. Brownlees, Robert F. Engle, and Bryan T. Kelly. A practical guide to volatility forecasting through calm and storm, 2009. Available at SSRN: http://ssrn.com/abstract=1502915. [8] T. Bollerslev and E. Ghysels. Periodic autoregressive conditional heteroscedasticity. Journal of Business and Economic Statistics, 14:139–151, 1996. [9] B.D. McCullough and C.G. Renfro. Benchmarks and software standards: A case study of GARCH procedures. Journal of Economic and Social Measurement, 25:59–71, 1998. [10] C. Brooks, S.P. Burke, and G. Persand. Benchmarks and the accuracy of GARCH model estimation. International Journal of Forecasting, 17:45–56, 2001. [11] Carl Edward Rasmussen and Christopher K.I. Williams. Gaussian processes for Machine Learning. The MIT Press, 2006. ` [12] Abe Sklar. Fonctions de r´ partition a n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris, 8: e 229–231, 1959. [13] P Deheuvels. Caract´ isation compl` te des lois extrˆ mes multivari´ s et de la convergence des types e e e e extrˆ mes. Publications de l’Institut de Statistique de l’Universit´ de Paris, 23:1–36, 1978. e e [14] G Kimeldorf and A Sampson. Uniform representations of bivariate distributions. Communications in Statistics, 4:617–627, 1982. [15] Edward Snelson, Carl Edward Rasmussen, and Zoubin Ghahramani. Warped Gaussian Processes. In NIPS, 2003. [16] Sebastian Jaimungal and Eddie K.H. Ng. Kernel-based Copula processes. In ECML PKDD, 2009. [17] A. Rodr´guez, D.B. Dunson, and A.E. Gelfand. Latent stick-breaking processes. Journal of the American ı Statistical Association, 105(490):647–659, 2010. [18] Jarno Vanhatalo, Pasi Jylanki, and Aki Vehtari. Gaussian process regression with Student-t likelihood. In NIPS, 2009. [19] Iain Murray, Ryan Prescott Adams, and David J.C. MacKay. Elliptical Slice Sampling. In AISTATS, 2010. [20] Paul W. Goldberg, Christopher K.I. Williams, and Christopher M. Bishop. Regression with inputdependent noise: A Gaussian process treatment. In NIPS, 1998. [21] Carl Edward Rasmussen. The Infinite Gaussian Mixture Model. In NIPS, 2000. [22] Ruey S. Tsay. Analysis of Financial Time Series. John Wiley & Sons, 2002. [23] Torben G. Andersen and Tim Bollerslev. Intraday periodicity and volatility persistence in financial markets. Journal of Empirical Finance, 4(2-3):115–158, 1997. 9
2 0.31212968 53 nips-2010-Copula Bayesian Networks
Author: Gal Elidan
Abstract: We present the Copula Bayesian Network model for representing multivariate continuous distributions, while taking advantage of the relative ease of estimating univariate distributions. Using a novel copula-based reparameterization of a conditional density, joined with a graph that encodes independencies, our model offers great flexibility in modeling high-dimensional densities, while maintaining control over the form of the univariate marginals. We demonstrate the advantage of our framework for generalization over standard Bayesian networks as well as tree structured copula models for varied real-life domains that are of substantially higher dimension than those typically considered in the copula literature. 1
3 0.12763101 113 nips-2010-Heavy-Tailed Process Priors for Selective Shrinkage
Author: Fabian L. Wauthier, Michael I. Jordan
Abstract: Heavy-tailed distributions are often used to enhance the robustness of regression and classification methods to outliers in output space. Often, however, we are confronted with “outliers” in input space, which are isolated observations in sparsely populated regions. We show that heavy-tailed stochastic processes (which we construct from Gaussian processes via a copula), can be used to improve robustness of regression and classification estimators to such outliers by selectively shrinking them more strongly in sparse regions than in dense regions. We carry out a theoretical analysis to show that selective shrinkage occurs when the marginals of the heavy-tailed process have sufficiently heavy tails. The analysis is complemented by experiments on biological data which indicate significant improvements of estimates in sparse regions while producing competitive results in dense regions. 1
4 0.097128019 80 nips-2010-Estimation of Renyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs
Author: Barnabás Póczos, Csaba Szepesvári, David Tax
Abstract: We present simple and computationally efficient nonparametric estimators of R´ nyi entropy and mutual information based on an i.i.d. sample drawn from an e unknown, absolutely continuous distribution over Rd . The estimators are calculated as the sum of p-th powers of the Euclidean lengths of the edges of the ‘generalized nearest-neighbor’ graph of the sample and the empirical copula of the sample respectively. For the first time, we prove the almost sure consistency of these estimators and upper bounds on their rates of convergence, the latter of which under the assumption that the density underlying the sample is Lipschitz continuous. Experiments demonstrate their usefulness in independent subspace analysis. 1
5 0.086870939 242 nips-2010-Slice sampling covariance hyperparameters of latent Gaussian models
Author: Iain Murray, Ryan P. Adams
Abstract: The Gaussian process (GP) is a popular way to specify dependencies between random variables in a probabilistic model. In the Bayesian framework the covariance structure can be specified using unknown hyperparameters. Integrating over these hyperparameters considers different possible explanations for the data when making predictions. This integration is often performed using Markov chain Monte Carlo (MCMC) sampling. However, with non-Gaussian observations standard hyperparameter sampling approaches require careful tuning and may converge slowly. In this paper we present a slice sampling approach that requires little tuning while mixing well in both strong- and weak-data regimes. 1
6 0.066435337 33 nips-2010-Approximate inference in continuous time Gaussian-Jump processes
7 0.058731094 126 nips-2010-Inference with Multivariate Heavy-Tails in Linear Models
8 0.053414099 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
9 0.053370234 35 nips-2010-Auto-Regressive HMM Inference with Incomplete Data for Short-Horizon Wind Forecasting
10 0.05274836 118 nips-2010-Implicit Differentiation by Perturbation
11 0.048921537 101 nips-2010-Gaussian sampling by local perturbations
12 0.048081439 84 nips-2010-Exact inference and learning for cumulative distribution functions on loopy graphs
13 0.046535466 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data
14 0.04403203 185 nips-2010-Nonparametric Density Estimation for Stochastic Optimization with an Observable State Variable
15 0.041579712 85 nips-2010-Exact learning curves for Gaussian process regression on large random graphs
16 0.040979065 100 nips-2010-Gaussian Process Preference Elicitation
17 0.040421162 218 nips-2010-Probabilistic latent variable models for distinguishing between cause and effect
18 0.038991373 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars
19 0.036641423 233 nips-2010-Scrambled Objects for Least-Squares Regression
20 0.035469353 212 nips-2010-Predictive State Temporal Difference Learning
topicId topicWeight
[(0, 0.115), (1, 0.017), (2, 0.012), (3, 0.056), (4, -0.065), (5, 0.034), (6, 0.04), (7, 0.026), (8, -0.028), (9, -0.011), (10, -0.149), (11, -0.071), (12, -0.004), (13, -0.083), (14, -0.034), (15, 0.042), (16, 0.053), (17, 0.065), (18, 0.118), (19, 0.101), (20, -0.187), (21, -0.05), (22, -0.051), (23, -0.093), (24, -0.127), (25, -0.156), (26, 0.25), (27, -0.054), (28, 0.123), (29, 0.032), (30, -0.038), (31, 0.071), (32, 0.109), (33, 0.219), (34, -0.087), (35, -0.062), (36, 0.108), (37, 0.119), (38, 0.036), (39, 0.073), (40, 0.044), (41, 0.009), (42, -0.061), (43, 0.02), (44, -0.106), (45, -0.091), (46, 0.12), (47, -0.014), (48, -0.014), (49, 0.009)]
simIndex simValue paperId paperTitle
same-paper 1 0.92521399 54 nips-2010-Copula Processes
Author: Andrew Wilson, Zoubin Ghahramani
Abstract: We define a copula process which describes the dependencies between arbitrarily many random variables independently of their marginal distributions. As an example, we develop a stochastic volatility model, Gaussian Copula Process Volatility (GCPV), to predict the latent standard deviations of a sequence of random variables. To make predictions we use Bayesian inference, with the Laplace approximation, and with Markov chain Monte Carlo as an alternative. We find our model can outperform GARCH on simulated and financial data. And unlike GARCH, GCPV can easily handle missing data, incorporate covariates other than time, and model a rich class of covariance structures. Imagine measuring the distance of a rocket as it leaves Earth, and wanting to know how these measurements correlate with one another. How much does the value of the measurement at fifteen minutes depend on the measurement at five minutes? Once we’ve learned this correlation structure, suppose we want to compare it to the dependence between measurements of the rocket’s velocity. To do this, it is convenient to separate dependence from the marginal distributions of our measurements. At any given time, a rocket’s distance from Earth could have a Gamma distribution, while its velocity could have a Gaussian distribution. And separating dependence from marginal distributions is precisely what a copula function does. While copulas have recently become popular, especially in financial applications [1, 2], as Nelsen [3] writes, “the study of copulas and the role they play in probability, statistics, and stochastic processes is a subject still in its infancy. There are many open problems. . . ” Typically only bivariate (and recently trivariate) copulas are being used and studied. In our introductory example, we are interested in learning the correlations in different stochastic processes, and comparing them. It would therefore be useful to have a copula process, which can describe the dependencies between arbitrarily many random variables independently of their marginal distributions. We define such a process. And as an example, we develop a stochastic volatility model, Gaussian Copula Process Volatility (GCPV). In doing so, we provide a Bayesian framework for the learning the marginal distributions and dependency structure of what we call a Gaussian copula process. The volatility of a random variable is its standard deviation. Stochastic volatility models are used to predict the volatilities of a heteroscedastic sequence – a sequence of random variables with different variances, like distance measurements of a rocket as it leaves the Earth. As the rocket gets further away, the variance on the measurements increases. Heteroscedasticity is especially important in econometrics; the returns on equity indices, like the S&P; 500, or on currency exchanges, are heteroscedastic. Indeed, in 2003, Robert Engle won the Nobel Prize in economics “for methods of analyzing economic time series with time-varying volatility”. GARCH [4], a generalized version of Engle’s ARCH, is arguably unsurpassed for predicting the volatility of returns on equity indices and currency exchanges [5, 6, 7]. GCPV can outperform GARCH, and is competitive on financial data that especially suits GARCH [8, 9, 10]. Before discussing GCPV, we first introduce copulas and the copula process. For a review of Gaussian processes, see Rasmussen and Williams [11]. ∗ † http://mlg.eng.cam.ac.uk/andrew Also at the machine learning department at Carnegie Mellon University. 1 1 Copulas Copulas are important because they separate the dependency structure between random variables from their marginal distributions. Intuitively, we can describe the dependency structure of any multivariate joint distribution H(x1 , . . . , xn ) = P (X1 ≤ x1 , . . . Xn ≤ xn ) through a two step process. First we take each univariate random variable Xi and transform it through its cumulative distribution function (cdf) Fi to get Ui = Fi (Xi ), a uniform random variable. We then express the dependencies between these transformed variables through the n-copula C(u1 , . . . , un ). Formally, an n-copula C : [0, 1]n → [0, 1] is a multivariate cdf with uniform univariate marginals: C(u1 , u2 , . . . , un ) = P (U1 ≤ u1 , U2 ≤ u2 , . . . , Un ≤ un ), where U1 , U2 , . . . , Un are standard uniform random variables. Sklar [12] precisely expressed our intuition in the theorem below. Theorem 1.1. Sklar’s theorem Let H be an n-dimensional distribution function with marginal distribution functions F1 , F2 , . . . , Fn . Then there exists an n-copula C such that for all (x1 , x2 , . . . , xn ) ∈ [−∞, ∞]n , H(x1 , x2 , . . . , xn ) = C(F1 (x1 ), F2 (x2 ), . . . , Fn (xn )) = C(u1 , u2 , . . . , un ). (1) If F1 , F2 , . . . , Fn are all continuous then C is unique; otherwise C is uniquely determined on Range F1 × Range F2 × · · · × Range Fn . Conversely, if C is an n-copula and F1 , F2 , . . . , Fn are distribution functions, then the function H is an n-dimensional distribution function with marginal distribution functions F1 , F2 , . . . , Fn . (−1) As a corollary, if Fi (u) = inf{x : F (x) ≥ u}, the quasi-inverse of Fi , then for all u1 , u2 , . . . , un ∈ [0, 1]n , (−1) C(u1 , u2 , . . . , un ) = H(F1 (−1) (u1 ), F2 (−1) (u2 ), . . . , Fn (un )). (2) In other words, (2) can be used to construct a copula. For example, the bivariate Gaussian copula is defined as C(u, v) = Φρ (Φ−1 (u), Φ−1 (v)), (3) where Φρ is a bivariate Gaussian cdf with correlation coefficient ρ, and Φ is the standard univariate Gaussian cdf. Li [2] popularised the bivariate Gaussian copula, by showing how it could be used to study financial risk and default correlation, using credit derivatives as an example. By substituting F (x) for u and G(y) for v in equation (3), we have a bivariate distribution H(x, y), with a Gaussian dependency structure, and marginals F and G. Regardless of F and G, the resulting H(x, y) can still be uniquely expressed as a Gaussian copula, so long as F and G are continuous. It is then a copula itself that captures the underlying dependencies between random variables, regardless of their marginal distributions. For this reason, copulas have been called dependence functions [13, 14]. Nelsen [3] contains an extensive discussion of copulas. 2 Copula Processes Imagine choosing a covariance function, and then drawing a sample function at some finite number of points from a Gaussian process. The result is a sample from a collection of Gaussian random variables, with a dependency structure encoded by the specified covariance function. Now, suppose we transform each of these values through a univariate Gaussian cdf, such that we have a sample from a collection of uniform random variables. These uniform random variables also have this underlying Gaussian process dependency structure. One might call the resulting values a draw from a Gaussian-Uniform Process. We could subsequently put these values through an inverse beta cdf, to obtain a draw from what could be called a Gaussian-Beta Process: the values would be a sample from beta random variables, again with an underlying Gaussian process dependency structure. We could also transform the uniform values with different inverse cdfs, which would give a sample from different random variables, with dependencies encoded by the Gaussian process. The above procedure is a means to generate samples from arbitrarily many random variables, with arbitrary marginal distributions, and desired dependencies. It is an example of how to use what we call a copula process – in this case, a Gaussian copula process, since a Gaussian copula describes the dependency structure of a finite number of samples. We now formally define a copula process. 2 Definition 2.1. Copula Process Let {Wt } be a collection of random variables indexed by t ∈ T , with marginal distribution functions Ft , and let Qt = Ft (Wt ). Further, let µ be a stochastic process measure with marginal distribution functions Gt , and joint distribution function H. Then Wt is copula process distributed with base measure µ, or Wt ∼ CP(µ), if and only if for all n ∈ N, ai ∈ R, n P( (−1) {Gti (Qti ) ≤ ai }) = Ht1 ,t2 ,...,tn (a1 , a2 , . . . , an ). (4) i=1 (−1) Each Qti ∼ Uniform(0, 1), and Gti is the quasi-inverse of Gti , as previously defined. Definition 2.2. Gaussian Copula Process Wt is Gaussian copula process distributed if it is copula process distributed and the base measure µ is a Gaussian process. If there is a mapping Ψ such that Ψ(Wt ) ∼ GP(m(t), k(t, t )), then we write Wt ∼ GCP(Ψ, m(t), k(t, t )). For example, if we have Wt ∼ GCP with m(t) = 0 and k(t, t) = 1, then in the definition of a copula process, Gt = Φ, the standard univariate Gaussian cdf, and H is the usual GP joint distribution function. Supposing this GCP is a Gaussian-Beta process, then Ψ = Φ−1 ◦ FB , where FB is a univariate Beta cdf. One could similarly define other copula processes. We described generally how a copula process can be used to generate samples of arbitrarily many random variables with desired marginals and dependencies. We now develop a specific and practical application of this framework. We introduce a stochastic volatility model, Gaussian Copula Process Volatility (GCPV), as an example of how to learn the joint distribution of arbitrarily many random variables, the marginals of these random variables, and to make predictions. To do this, we fit a Gaussian copula process by using a type of Warped Gaussian Process [15]. However, our methodology varies substantially from Snelson et al. [15], since we are doing inference on latent variables as opposed to observations, which is a much greater undertaking that involves approximations, and we are doing so in a different context. 3 Gaussian Copula Process Volatility Assume we have a sequence of observations y = (y1 , . . . , yn ) at times t = (t1 , . . . , tn ) . The observations are random variables with different latent standard deviations. We therefore have n unobserved standard deviations, σ1 , . . . , σn , and want to learn the correlation structure between these standard deviations, and also to predict the distribution of σ∗ at some unrealised time t∗ . We model the standard deviation function as a Gaussian copula process: σt ∼ GCP(g −1 , 0, k(t, t )). (5) f (t) ∼ GP(m(t) = 0, k(t, t )) σ(t) = g(f (t), ω) (6) (7) y(t) ∼ N (0, σ 2 (t)), (8) Specifically, where g is a monotonic warping function, parametrized by ω. For each of the observations y = (y1 , . . . , yn ) we have corresponding GP latent function values f = (f1 , . . . , fn ) , where σ(ti ) = g(fi , ω), using the shorthand fi to mean f (ti ). σt ∼ GCP, because any finite sequence (σ1 , . . . , σp ) is distributed as a Gaussian copula: P (σ1 ≤ a1 , . . . , σp ≤ ap ) = P (g −1 (σ1 ) ≤ g −1 (a1 ), . . . , g −1 (σp ) ≤ g −1 (ap )) = ΦΓ (g −1 (a1 ), . . . , g −1 −1 (ap )) = ΦΓ (Φ = ΦΓ (Φ −1 (F (a1 )), . . . , Φ (u1 ), . . . , Φ −1 −1 (9) (F (ap ))) (up )) = C(u1 , . . . , up ), where Φ is the standard univariate Gaussian cdf (supposing k(t, t) = 1), ΦΓ is a multivariate Gaussian cdf with covariance matrix Γij = cov(g −1 (σi ), g −1 (σj )), and F is the marginal distribution of 3 each σi . In (5), we have Ψ = g −1 , because it is g −1 which maps σt to a GP. The specification in (5) is equivalently expressed by (6) and (7). With GCPV, the form of g is learned so that g −1 (σt ) is best modelled by a GP. By learning g, we learn the marginal of each σ: F (a) = Φ(g −1 (a)) for a ∈ R. Recently, a different sort of ‘kernel copula process’ has been used, where the marginals of the variables being modelled are not learned [16].1 Further, we also consider a more subtle and flexible form of our model, where the function g itself is indexed by time: g = gt (f (t), ω). We only assume that the marginal distributions of σt are stationary over ‘small’ time periods, and for each of these time periods (5)-(7) hold true. We return to this in the final discussion section. Here we have assumed that each observation, conditioned on knowing its variance, is normally distributed with zero mean. This is a common assumption in heteroscedastic models. The zero mean and normality assumptions can be relaxed and are not central to this paper. 4 Predictions with GCPV Ultimately, we wish to infer p(σ(t∗ )|y, z), where z = {θ, ω}, and θ are the hyperparameters of the GP covariance function. To do this, we sample from p(f∗ |y, z) = p(f∗ |f , θ)p(f |y, z)df (10) and then transform these samples by g. Letting (Cf )ij = δij g(fi , ω)2 , where δij is the Kronecker delta, Kij = k(ti , tj ), (k∗ )i = k(t∗ , ti ), we have p(f |y, z) = N (f ; 0, K)N (y; 0, Cf )/p(y|z), p(f∗ |f , θ) = N (k∗ K −1 f , k(t∗ , t∗ ) − k∗ K −1 (11) k∗ ). (12) We also wish to learn z, which we can do by finding the z that maximizes the marginal likelihood, ˆ p(y|z) = p(y|f , ω)p(f |θ)df . (13) Unfortunately, for many functions g, (10) and (13) are intractable. Our methods of dealing with this can be used in very general circumstances, where one has a Gaussian process prior, but an (optionally parametrized) non-Gaussian likelihood. We use the Laplace approximation to estimate p(f |y, z) as a Gaussian. Then we can integrate (10) for a Gaussian approximation to p(f∗ |y, z), which we sample from to make predictions of σ∗ . Using Laplace, we can also find an expression for an approximate marginal likelihood, which we maximize to determine z. Once we have found z with Laplace, we use Markov chain Monte Carlo to sample from p(f∗ |y, z), and compare that to using Laplace to sample from p(f∗ |y, z). In the supplement we relate this discussion to (9). 4.1 Laplace Approximation The goal is to approximate (11) with a Gaussian, so that we can evaluate (10) and (13) and make predictions. In doing so, we follow Rasmussen and Williams [11] in their treatment of Gaussian process classification, except we use a parametrized likelihood, and modify Newton’s method. First, consider as an objective function the logarithm of an unnormalized (11): s(f |y, z) = log p(y|f , ω) + log p(f |θ). (14) ˆ The Laplace approximation uses a second order Taylor expansion about the f which maximizes ˆ, for which we use (14), to find an approximate objective s(f |y, z). So the first step is to find f ˜ Newton’s method. The Newton update is f new = f − ( s(f ))−1 s(f ). Differentiating (14), s(f |y, z) = s(f |y, z) = where W is the diagonal matrix − 1 log p(y|f , ω) − K −1 f log p(y|f , ω) − K −1 = −W − K −1 , log p(y|f , ω). Note added in proof : Also, for a very recent related model, see Rodr´guez et al. [17]. ı 4 (15) (16) If the likelihood function p(y|f , ω) is not log concave, then W may have negative entries. Vanhatalo et al. [18] found this to be problematic when doing Gaussian process regression with a Student-t ˆ likelihood. They instead use an expectation-maximization (EM) algorithm for finding f , and iterate ordered rank one Cholesky updates to evaluate the Laplace approximate marginal likelihood. But EM can converge slowly, especially near a local optimum, and each of the rank one updates is vulnerable to numerical instability. With a small modification of Newton’s method, we often get close to ˆ quadratic convergence for finding f , and can evaluate the Laplace approximate marginal likelihood in a numerically stable fashion, with no approximate Cholesky factors, and optimal computational requirements. Some comments are in the supplementary material but, in short, we use an approximate negative Hessian, − s ≈ M + K −1 , which is guaranteed to be positive definite, since M is formed on each iteration by zeroing the negative entries of W . For stability, we reformulate our 1 1 1 1 optimization in terms of B = I + M 2 KM 2 , and let Q = M 2 B −1 M 2 , b = M f + log p(y|f ), a = b − QKb. Since (K −1 + M )−1 = K − KQK, the Newton update becomes f new = Ka. ˆ With these updates we find f and get an expression for s which we use to approximate (13) and ˜ (11). The approximate marginal likelihood q(y|z) is given by exp(˜)df . Taking its logarithm, s 1ˆ 1 ˆ log q(y|z) = − f af + log p(y|f ) − log |Bf |, (17) ˆ ˆ 2 2 ˆ ˆ where Bf is B evaluated at f , and af is a numerically stable evaluation of K −1 f . ˆ ˆ To learn the parameters z, we use conjugate gradient descent to maximize (17) with respect to z. ˆ ˆ Since f is a function of z, we initialize z, and update f every time we vary z. Once we have found an optimum z , we can make predictions. By exponentiating s, we find a Gaussian approximation to ˆ ˜ ˆ the posterior (11), q(f |y, z) = N (f , K − KQK). The product of this approximate posterior with p(f∗ |f ) is Gaussian. Integrating this product, we approximate p(f∗ |y, z) as ˆ q(f∗ |y, z) = N (k∗ log p(y|f ), k(t∗ , t∗ ) − k∗ Qk∗ ). (18) Given n training observations, the cost of each Newton iteration is dominated by computing the cholesky decomposition of B, which takes O(n3 ) operations. The objective function typically changes by less than 10−6 after 3 iterations. Once Newton’s method has converged, it takes only O(1) operations to draw from q(f∗ |y, z) and make predictions. 4.2 Markov chain Monte Carlo We use Markov chain Monte Carlo (MCMC) to sample from (11), so that we can later sample from p(σ∗ |y, z) to make predictions. Sampling from (11) is difficult, because the variables f are strongly coupled by a Gaussian process prior. We use a new technique, Elliptical Slice Sampling [19], and find it extremely effective for this purpose. It was specifically designed to sample from posteriors with correlated Gaussian priors. It has no free parameters, and jointly updates every element of f . For our setting, it is over 100 times as fast as axis aligned slice sampling with univariate updates. To make predictions, we take J samples of p(f |y, z), {f 1 , . . . , f J }, and then approximate (10) as a mixture of J Gaussians: J 1 p(f∗ |f i , θ). (19) p(f∗ |y, z) ≈ J i=1 Each of the Gaussians in this mixture have equal weight. So for each sample of f∗ |y, we uniformly choose a random p(f∗ |f i , θ) and draw a sample. In the limit J → ∞, we are sampling from the exact p(f∗ |y, z). Mapping these samples through g gives samples from p(σ∗ |y, z). After one O(n3 ) and one O(J) operation, a draw from (19) takes O(1) operations. 4.3 Warping Function The warping function, g, maps fi , a GP function value, to σi , a standard deviation. Since fi can take any value in R, and σi can take any non-negative real value, g : R → R+ . For each fi to correspond to a unique deviation, g must also be one-to-one. We use K g(x, ω) = aj log[exp[bj (x + cj )] + 1], j=1 5 aj , bj > 0. (20) This is monotonic, positive, infinitely differentiable, asymptotic towards zero as x → −∞, and K tends to ( j=1 aj bj )x as x → ∞. In practice, it is useful to add a small constant to (20), to avoid rare situations where the parameters ω are trained to make g extremely small for certain inputs, at the expense of a good overall fit; this can happen when the parameters ω are learned by optimizing a likelihood. A suitable constant could be one tenth the absolute value of the smallest nonzero observation. By inferring the parameters of the warping function, or distributions of these parameters, we are learning a transformation which will best model σt with a Gaussian process. The more flexible the warping function, the more potential there is to improve the GCPV fit – in other words, the better we can estimate the ‘perfect’ transformation. To test the importance of this flexibility, we also try a simple unparametrized warping function, g(x) = ex . In related work, Goldberg et al. [20] place a GP prior on the log noise level in a standard GP regression model on observations, except for inference they use Gibbs sampling, and a high level of ‘jitter’ for conditioning. Once g is trained, we can infer the marginal distribution of each σ: F (a) = Φ(g −1 (a)), for a ∈ R. This suggests an alternate way to initialize g: we can initialize F as a mixture of Gaussians, and then map through Φ−1 to find g −1 . Since mixtures of Gaussians are dense in the set of probability distributions, we could in principle find the ‘perfect’ g using an infinite mixture of Gaussians [21]. 5 Experiments In our experiments, we predict the latent standard deviations σ of observations y at times t, and also σ∗ at unobserved times t∗ . To do this, we use two versions of GCPV. The first variant, which we simply refer to as GCPV, uses the warping function (20) with K = 1, and squared exponential covariance function, k(t, t ) = A exp(−(t−t )2 /l2 ), with A = 1. The second variant, which we call GP-EXP, uses the unparametrized warping function ex , and the same covariance function, except the amplitude A is a trained hyperparameter. The other hyperparameter l is called the lengthscale of the covariance function. The greater l, the greater the covariance between σt and σt+a for a ∈ R. We train hyperparameters by maximizing the Laplace approximate log marginal likelihood (17). We then sample from p(f∗ |y) using the Laplace approximation (18). We also do this using MCMC (19) with J = 10000, after discarding a previous 10000 samples of p(f |y) as burn-in. We pass 2 these samples of f∗ |y through g and g 2 to draw from p(σ∗ |y) and p(σ∗ |y), and compute the sample mean and variance of σ∗ |y. We use the sample mean as a point predictor, and the sample variance for error bounds on these predictions, and we use 10000 samples to compute these quantities. For GCPV we use Laplace and MCMC for inference, but for GP-EXP we only use Laplace. We compare predictions to GARCH(1,1), which has been shown in extensive and recent reviews to be competitive with other GARCH variants, and more sophisticated models [5, 6, 7]. GARCH(p,q) specifies y(t) ∼ p 2 2 N (0, σ 2 (t)), and lets the variance be a deterministic function of the past: σt = a0 + i=1 ai yt−i + q 2 j=1 bj σt−j . We use the Matlab Econometrics Toolbox implementation of GARCH, where the parameters a0 , ai and bj are estimated using a constrained maximum likelihood. We make forecasts of volatility, and we predict historical volatility. By ‘historical volatility’ we mean the volatility at observed time points, or between these points. Uncovering historical volatility is important. It could, for instance, be used to study what causes fluctuations in the stock market, or to understand physical systems. To evaluate our model, we use the Mean Squared Error (MSE) between the true variance, or proxy for the truth, and the predicted variance. Although likelihood has advantages, we are limited in space, and we wish to harmonize with the econometrics literature, and other assessments of volatility models, where MSE is the standard. In a similar assessment of volatility models, Brownlees et al. [7] found that MSE and quasi-likelihood rankings were comparable. When the true variance is unknown we follow Brownlees et al. [7] and use squared observations as a proxy for the truth, to compare our model to GARCH.2 The more observations, the more reliable these performance estimates will be. However, not many observations (e.g. 100) are needed for a stable ranking of competing models; in Brownlees et al. [7], the rankings derived from high frequency squared observations are similar to those derived using daily squared observations. 2 Since each observation y is assumed to have zero mean and variance σ 2 , E[y 2 ] = σ 2 . 6 5.1 Simulations We simulate observations from N (0, σ 2 (t)), using σ(t) = sin(t) cos(t2 ) + 1, at t = (0, 0.02, 0.04, . . . , 4) . We call this data set TRIG. We also simulate using a standard deviation that jumps from 0.1 to 7 and back, at times t = (0, 0.1, 0.2, . . . , 6) . We call this data set JUMP. To forecast, we use all observations up until the current time point, and make 1, 7, and 30 step ahead predictions. So, for example, in TRIG we start by observing t = 0, and make forecasts at t = 0.02, 0.14, 0.60. Then we observe t = 0, 0.02 and make forecasts at t = 0.04, 0.16, 0.62, and so on, until all data points have been observed. For historical volatility, we predict the latent σt at the observation times, which is safe since we are comparing to the true volatility, which is not used in training; the results are similar if we interpolate. Figure 1 panels a) and b) show the true volatility for TRIG and JUMP respectively, alongside GCPV Laplace, GCPV MCMC, GP-EXP Laplace, and GARCH(1,1) predictions of historical volatility. Table 1 shows the results for forecasting and historical volatility. In panel a) we see that GCPV more accurately captures the dependencies between σ at different times points than GARCH: if we manually decrease the lengthscale in the GCPV covariance function, we can replicate the erratic GARCH behaviour, which inaccurately suggests that the covariance between σt and σt+a decreases quickly with increases in a. We also see that GCPV with an unparametrized exponential warping function tends to overestimates peaks and underestimate troughs. In panel b), the volatility is extremely difficult to reconstruct or forecast – with no warning it will immediately and dramatically increase or decrease. This behaviour is not suited to a smooth squared exponential covariance function. Nevertheless, GCPV outperforms GARCH, especially in regions of low volatility. We also see this in panel a) for t ∈ (1.5, 2). GARCH is known to respond slowly to large returns, and to overpredict volatility [22]. In JUMP, the greater the peaks, and the smaller the troughs, the more GARCH suffers, while GCPV is mostly robust to these changes. 5.2 Financial Data The returns on the daily exchange rate between the Deutschmark (DM) and the Great Britain Pound (GBP) from 1984 to 1992 have become a benchmark for assessing the performance of GARCH models [8, 9, 10]. This exchange data, which we refer to as DMGBP, can be obtained from www.datastream.com, and the returns are calculated as rt = log(Pt+1 /Pt ), where Pt is the number of DM to GBP on day t. The returns are assumed to have a zero mean function. We use a rolling window of the previous 120 days of returns to make 1, 7, and 30 day ahead volatility forecasts, starting at the beginning of January 1988, and ending at the beginning of January 1992 (659 trading days). Every 7 days, we retrain the parameters of GCPV and GARCH. Every time we retrain parameters, we predict historical volatility over the past 120 days. The average MSE for these historical predictions is given in Table 1, although they should be observed with caution; unlike with the simulations, the DMGBP historical predictions are trained using the same data they are assessed on. In Figure 1c), we see that the GARCH one day ahead forecasts are lifted above the GCPV forecasts, but unlike in the simulations, they are now operating on a similar lengthscale. This suggests that GARCH could still be overpredicting volatility, but that GCPV has adapted its estimation of how σt and σt+a correlate with one another. Since GARCH is suited to this financial data set, it is reassuring that GCPV predictions have a similar time varying structure. Overall, GCPV and GARCH are competitive with one another for forecasting currency exchange returns, as seen in Table 1. Moreover, a learned warping function g outperforms an unparametrized one, and a full Laplace solution is comparable to using MCMC for inference, in accuracy and speed. This is also true for the simulations. Therefore we recommend whichever is more convenient to implement. 6 Discussion We defined a copula process, and as an example, developed a stochastic volatility model, GCPV, which can outperform GARCH. With GCPV, the volatility σt is distributed as a Gaussian Copula Process, which separates the modelling of the dependencies between volatilities at different times from their marginal distributions – arguably the most useful property of a copula. Further, GCPV fits the marginals in the Gaussian copula process by learning a warping function. If we had simply chosen an unparametrized exponential warping function, we would incorrectly be assuming that the log 7 Table 1: MSE for predicting volatility. Data set Model Historical 1 step 7 step 30 step TRIG GCPV (LA) GCPV (MCMC) GP-EXP GARCH 0.0953 0.0760 0.193 0.938 0.588 0.622 0.646 1.04 0.951 0.979 1.36 1.79 1.71 1.76 1.15 5.12 JUMP GCPV (LA) GCPV (MCMC) GP-EXP GARCH 0.588 1.21 1.43 1.88 0.891 0.951 1.76 1.58 1.38 1.37 6.95 3.43 1.35 1.35 14.7 5.65 GCPV (LA) GCPV (MCMC) GP-EXP GARCH 2.43 2.39 2.52 2.83 3.00 3.00 3.20 3.03 3.08 3.08 3.46 3.12 3.17 3.17 5.14 3.32 ×103 DMGBP ×10−9 TRIG JUMP DMGBP 20 DMGBP 0.015 600 Probability Density 3 1 Volatility Volatility Volatility 15 2 10 0.01 0.005 5 0 0 1 2 Time (a) 3 4 0 0 2 4 0 6 Time (b) 0 200 400 Days (c) 600 400 200 0 0 0.005 σ (d) 0.01 Figure 1: Predicting volatility and learning its marginal pdf. For a) and b), the true volatility, and GCPV (MCMC), GCPV (LA), GP-EXP, and GARCH predictions, are shown respectively by a thick green line, a dashed thick blue line, a dashed black line, a cyan line, and a red line. a) shows predictions of historical volatility for TRIG, where the shade is a 95% confidence interval about GCPV (MCMC) predictions. b) shows predictions of historical volatility for JUMP. In c), a black line and a dashed red line respectively show GCPV (LA) and GARCH one day ahead volatility forecasts for DMGBP. In d), a black line and a dashed blue line respectively show the GCPV learned marginal pdf of σt in DMGBP and a Gamma(4.15,0.00045) pdf. volatilities are marginally Gaussian distributed. Indeed, for the DMGBP data, we trained the warping function g over a 120 day period, and mapped its inverse through the univariate standard Gaussian cdf Φ, and differenced, to estimate the marginal probability density function (pdf) of σt over this period. The learned marginal pdf, shown in Figure 1d), is similar to a Gamma(4.15,0.00045) distribution. However, in using a rolling window to retrain the parameters of g, we do not assume that the marginals of σt are stationary; we have a time changing warping function. While GARCH is successful, and its simplicity is attractive, our model is also simple and has a number of advantages. We can effortlessly handle missing data, we can easily incorporate covariates other than time (like interest rates) in our covariance function, and we can choose from a rich class of covariance functions – squared exponential, Brownian motion, Mat´ rn, periodic, etc. In fact, the e volatility of high frequency intradaily returns on equity indices and currency exchanges is cyclical [23], and GCPV with a periodic covariance function is uniquely well suited to this data. And the parameters of GCPV, like the covariance function lengthscale, or the learned warping function, provide insight into the underlying source of volatility, unlike the parameters of GARCH. Finally, copulas are rapidly becoming popular in applications, but often only bivariate copulas are being used. With our copula process one can learn the dependencies between arbitrarily many random variables independently of their marginal distributions. We hope the Gaussian Copula Process Volatility model will encourage other applications of copula processes. More generally, we hope our work will help bring together the machine learning and econometrics communities. Acknowledgments: Thanks to Carl Edward Rasmussen and Ferenc Husz´ r for helpful conversaa tions. AGW is supported by an NSERC grant. 8 References [1] Paul Embrechts, Alexander McNeil, and Daniel Straumann. Correlation and dependence in risk management: Properties and pitfalls. In Risk Management: Value at risk and beyond, pages 176–223. Cambridge University Press, 1999. [2] David X. Li. On default correlation: A copula function approach. Journal of Fixed Income, 9(4):43–54, 2000. [3] Roger B. Nelsen. An Introduction to Copulas. Springer Series in Statistics, second edition, 2006. [4] Tim Bollerslev. Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31 (3):307–327, 1986. [5] Ser-Huang Poon and Clive W.J. Granger. Practical issues in forecasting volatility. Financial Analysts Journal, 61(1):45–56, 2005. [6] Peter Reinhard Hansen and Asger Lunde. A forecast comparison of volatility models: Does anything beat a GARCH(1,1). Journal of Applied Econometrics, 20(7):873–889, 2005. [7] Christian T. Brownlees, Robert F. Engle, and Bryan T. Kelly. A practical guide to volatility forecasting through calm and storm, 2009. Available at SSRN: http://ssrn.com/abstract=1502915. [8] T. Bollerslev and E. Ghysels. Periodic autoregressive conditional heteroscedasticity. Journal of Business and Economic Statistics, 14:139–151, 1996. [9] B.D. McCullough and C.G. Renfro. Benchmarks and software standards: A case study of GARCH procedures. Journal of Economic and Social Measurement, 25:59–71, 1998. [10] C. Brooks, S.P. Burke, and G. Persand. Benchmarks and the accuracy of GARCH model estimation. International Journal of Forecasting, 17:45–56, 2001. [11] Carl Edward Rasmussen and Christopher K.I. Williams. Gaussian processes for Machine Learning. The MIT Press, 2006. ` [12] Abe Sklar. Fonctions de r´ partition a n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris, 8: e 229–231, 1959. [13] P Deheuvels. Caract´ isation compl` te des lois extrˆ mes multivari´ s et de la convergence des types e e e e extrˆ mes. Publications de l’Institut de Statistique de l’Universit´ de Paris, 23:1–36, 1978. e e [14] G Kimeldorf and A Sampson. Uniform representations of bivariate distributions. Communications in Statistics, 4:617–627, 1982. [15] Edward Snelson, Carl Edward Rasmussen, and Zoubin Ghahramani. Warped Gaussian Processes. In NIPS, 2003. [16] Sebastian Jaimungal and Eddie K.H. Ng. Kernel-based Copula processes. In ECML PKDD, 2009. [17] A. Rodr´guez, D.B. Dunson, and A.E. Gelfand. Latent stick-breaking processes. Journal of the American ı Statistical Association, 105(490):647–659, 2010. [18] Jarno Vanhatalo, Pasi Jylanki, and Aki Vehtari. Gaussian process regression with Student-t likelihood. In NIPS, 2009. [19] Iain Murray, Ryan Prescott Adams, and David J.C. MacKay. Elliptical Slice Sampling. In AISTATS, 2010. [20] Paul W. Goldberg, Christopher K.I. Williams, and Christopher M. Bishop. Regression with inputdependent noise: A Gaussian process treatment. In NIPS, 1998. [21] Carl Edward Rasmussen. The Infinite Gaussian Mixture Model. In NIPS, 2000. [22] Ruey S. Tsay. Analysis of Financial Time Series. John Wiley & Sons, 2002. [23] Torben G. Andersen and Tim Bollerslev. Intraday periodicity and volatility persistence in financial markets. Journal of Empirical Finance, 4(2-3):115–158, 1997. 9
2 0.90316761 53 nips-2010-Copula Bayesian Networks
Author: Gal Elidan
Abstract: We present the Copula Bayesian Network model for representing multivariate continuous distributions, while taking advantage of the relative ease of estimating univariate distributions. Using a novel copula-based reparameterization of a conditional density, joined with a graph that encodes independencies, our model offers great flexibility in modeling high-dimensional densities, while maintaining control over the form of the univariate marginals. We demonstrate the advantage of our framework for generalization over standard Bayesian networks as well as tree structured copula models for varied real-life domains that are of substantially higher dimension than those typically considered in the copula literature. 1
3 0.70913416 113 nips-2010-Heavy-Tailed Process Priors for Selective Shrinkage
Author: Fabian L. Wauthier, Michael I. Jordan
Abstract: Heavy-tailed distributions are often used to enhance the robustness of regression and classification methods to outliers in output space. Often, however, we are confronted with “outliers” in input space, which are isolated observations in sparsely populated regions. We show that heavy-tailed stochastic processes (which we construct from Gaussian processes via a copula), can be used to improve robustness of regression and classification estimators to such outliers by selectively shrinking them more strongly in sparse regions than in dense regions. We carry out a theoretical analysis to show that selective shrinkage occurs when the marginals of the heavy-tailed process have sufficiently heavy tails. The analysis is complemented by experiments on biological data which indicate significant improvements of estimates in sparse regions while producing competitive results in dense regions. 1
4 0.55809009 80 nips-2010-Estimation of Renyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs
Author: Barnabás Póczos, Csaba Szepesvári, David Tax
Abstract: We present simple and computationally efficient nonparametric estimators of R´ nyi entropy and mutual information based on an i.i.d. sample drawn from an e unknown, absolutely continuous distribution over Rd . The estimators are calculated as the sum of p-th powers of the Euclidean lengths of the edges of the ‘generalized nearest-neighbor’ graph of the sample and the empirical copula of the sample respectively. For the first time, we prove the almost sure consistency of these estimators and upper bounds on their rates of convergence, the latter of which under the assumption that the density underlying the sample is Lipschitz continuous. Experiments demonstrate their usefulness in independent subspace analysis. 1
5 0.41767472 126 nips-2010-Inference with Multivariate Heavy-Tails in Linear Models
Author: Danny Bickson, Carlos Guestrin
Abstract: Heavy-tailed distributions naturally occur in many real life problems. Unfortunately, it is typically not possible to compute inference in closed-form in graphical models which involve such heavy-tailed distributions. In this work, we propose a novel simple linear graphical model for independent latent random variables, called linear characteristic model (LCM), defined in the characteristic function domain. Using stable distributions, a heavy-tailed family of distributions which is a generalization of Cauchy, L´ vy and Gaussian distrie butions, we show for the first time, how to compute both exact and approximate inference in such a linear multivariate graphical model. LCMs are not limited to stable distributions, in fact LCMs are always defined for any random variables (discrete, continuous or a mixture of both). We provide a realistic problem from the field of computer networks to demonstrate the applicability of our construction. Other potential application is iterative decoding of linear channels with non-Gaussian noise. 1
6 0.3861509 84 nips-2010-Exact inference and learning for cumulative distribution functions on loopy graphs
7 0.31836644 242 nips-2010-Slice sampling covariance hyperparameters of latent Gaussian models
8 0.31612679 118 nips-2010-Implicit Differentiation by Perturbation
9 0.31460217 262 nips-2010-Switched Latent Force Models for Movement Segmentation
10 0.30752662 33 nips-2010-Approximate inference in continuous time Gaussian-Jump processes
11 0.28522509 82 nips-2010-Evaluation of Rarity of Fingerprints in Forensics
12 0.27540064 101 nips-2010-Gaussian sampling by local perturbations
13 0.26878658 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
14 0.26814264 154 nips-2010-Learning sparse dynamic linear systems using stable spline kernels and exponential hyperpriors
15 0.24806185 233 nips-2010-Scrambled Objects for Least-Squares Regression
16 0.24262814 65 nips-2010-Divisive Normalization: Justification and Effectiveness as Efficient Coding Transform
17 0.24079971 85 nips-2010-Exact learning curves for Gaussian process regression on large random graphs
18 0.23331505 32 nips-2010-Approximate Inference by Compilation to Arithmetic Circuits
19 0.23325677 213 nips-2010-Predictive Subspace Learning for Multi-view Data: a Large Margin Approach
20 0.22199701 35 nips-2010-Auto-Regressive HMM Inference with Incomplete Data for Short-Horizon Wind Forecasting
topicId topicWeight
[(0, 0.01), (13, 0.028), (17, 0.077), (27, 0.056), (28, 0.295), (30, 0.044), (35, 0.017), (45, 0.159), (50, 0.091), (52, 0.049), (60, 0.019), (77, 0.032), (78, 0.013), (90, 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.72364163 54 nips-2010-Copula Processes
Author: Andrew Wilson, Zoubin Ghahramani
Abstract: We define a copula process which describes the dependencies between arbitrarily many random variables independently of their marginal distributions. As an example, we develop a stochastic volatility model, Gaussian Copula Process Volatility (GCPV), to predict the latent standard deviations of a sequence of random variables. To make predictions we use Bayesian inference, with the Laplace approximation, and with Markov chain Monte Carlo as an alternative. We find our model can outperform GARCH on simulated and financial data. And unlike GARCH, GCPV can easily handle missing data, incorporate covariates other than time, and model a rich class of covariance structures. Imagine measuring the distance of a rocket as it leaves Earth, and wanting to know how these measurements correlate with one another. How much does the value of the measurement at fifteen minutes depend on the measurement at five minutes? Once we’ve learned this correlation structure, suppose we want to compare it to the dependence between measurements of the rocket’s velocity. To do this, it is convenient to separate dependence from the marginal distributions of our measurements. At any given time, a rocket’s distance from Earth could have a Gamma distribution, while its velocity could have a Gaussian distribution. And separating dependence from marginal distributions is precisely what a copula function does. While copulas have recently become popular, especially in financial applications [1, 2], as Nelsen [3] writes, “the study of copulas and the role they play in probability, statistics, and stochastic processes is a subject still in its infancy. There are many open problems. . . ” Typically only bivariate (and recently trivariate) copulas are being used and studied. In our introductory example, we are interested in learning the correlations in different stochastic processes, and comparing them. It would therefore be useful to have a copula process, which can describe the dependencies between arbitrarily many random variables independently of their marginal distributions. We define such a process. And as an example, we develop a stochastic volatility model, Gaussian Copula Process Volatility (GCPV). In doing so, we provide a Bayesian framework for the learning the marginal distributions and dependency structure of what we call a Gaussian copula process. The volatility of a random variable is its standard deviation. Stochastic volatility models are used to predict the volatilities of a heteroscedastic sequence – a sequence of random variables with different variances, like distance measurements of a rocket as it leaves the Earth. As the rocket gets further away, the variance on the measurements increases. Heteroscedasticity is especially important in econometrics; the returns on equity indices, like the S&P; 500, or on currency exchanges, are heteroscedastic. Indeed, in 2003, Robert Engle won the Nobel Prize in economics “for methods of analyzing economic time series with time-varying volatility”. GARCH [4], a generalized version of Engle’s ARCH, is arguably unsurpassed for predicting the volatility of returns on equity indices and currency exchanges [5, 6, 7]. GCPV can outperform GARCH, and is competitive on financial data that especially suits GARCH [8, 9, 10]. Before discussing GCPV, we first introduce copulas and the copula process. For a review of Gaussian processes, see Rasmussen and Williams [11]. ∗ † http://mlg.eng.cam.ac.uk/andrew Also at the machine learning department at Carnegie Mellon University. 1 1 Copulas Copulas are important because they separate the dependency structure between random variables from their marginal distributions. Intuitively, we can describe the dependency structure of any multivariate joint distribution H(x1 , . . . , xn ) = P (X1 ≤ x1 , . . . Xn ≤ xn ) through a two step process. First we take each univariate random variable Xi and transform it through its cumulative distribution function (cdf) Fi to get Ui = Fi (Xi ), a uniform random variable. We then express the dependencies between these transformed variables through the n-copula C(u1 , . . . , un ). Formally, an n-copula C : [0, 1]n → [0, 1] is a multivariate cdf with uniform univariate marginals: C(u1 , u2 , . . . , un ) = P (U1 ≤ u1 , U2 ≤ u2 , . . . , Un ≤ un ), where U1 , U2 , . . . , Un are standard uniform random variables. Sklar [12] precisely expressed our intuition in the theorem below. Theorem 1.1. Sklar’s theorem Let H be an n-dimensional distribution function with marginal distribution functions F1 , F2 , . . . , Fn . Then there exists an n-copula C such that for all (x1 , x2 , . . . , xn ) ∈ [−∞, ∞]n , H(x1 , x2 , . . . , xn ) = C(F1 (x1 ), F2 (x2 ), . . . , Fn (xn )) = C(u1 , u2 , . . . , un ). (1) If F1 , F2 , . . . , Fn are all continuous then C is unique; otherwise C is uniquely determined on Range F1 × Range F2 × · · · × Range Fn . Conversely, if C is an n-copula and F1 , F2 , . . . , Fn are distribution functions, then the function H is an n-dimensional distribution function with marginal distribution functions F1 , F2 , . . . , Fn . (−1) As a corollary, if Fi (u) = inf{x : F (x) ≥ u}, the quasi-inverse of Fi , then for all u1 , u2 , . . . , un ∈ [0, 1]n , (−1) C(u1 , u2 , . . . , un ) = H(F1 (−1) (u1 ), F2 (−1) (u2 ), . . . , Fn (un )). (2) In other words, (2) can be used to construct a copula. For example, the bivariate Gaussian copula is defined as C(u, v) = Φρ (Φ−1 (u), Φ−1 (v)), (3) where Φρ is a bivariate Gaussian cdf with correlation coefficient ρ, and Φ is the standard univariate Gaussian cdf. Li [2] popularised the bivariate Gaussian copula, by showing how it could be used to study financial risk and default correlation, using credit derivatives as an example. By substituting F (x) for u and G(y) for v in equation (3), we have a bivariate distribution H(x, y), with a Gaussian dependency structure, and marginals F and G. Regardless of F and G, the resulting H(x, y) can still be uniquely expressed as a Gaussian copula, so long as F and G are continuous. It is then a copula itself that captures the underlying dependencies between random variables, regardless of their marginal distributions. For this reason, copulas have been called dependence functions [13, 14]. Nelsen [3] contains an extensive discussion of copulas. 2 Copula Processes Imagine choosing a covariance function, and then drawing a sample function at some finite number of points from a Gaussian process. The result is a sample from a collection of Gaussian random variables, with a dependency structure encoded by the specified covariance function. Now, suppose we transform each of these values through a univariate Gaussian cdf, such that we have a sample from a collection of uniform random variables. These uniform random variables also have this underlying Gaussian process dependency structure. One might call the resulting values a draw from a Gaussian-Uniform Process. We could subsequently put these values through an inverse beta cdf, to obtain a draw from what could be called a Gaussian-Beta Process: the values would be a sample from beta random variables, again with an underlying Gaussian process dependency structure. We could also transform the uniform values with different inverse cdfs, which would give a sample from different random variables, with dependencies encoded by the Gaussian process. The above procedure is a means to generate samples from arbitrarily many random variables, with arbitrary marginal distributions, and desired dependencies. It is an example of how to use what we call a copula process – in this case, a Gaussian copula process, since a Gaussian copula describes the dependency structure of a finite number of samples. We now formally define a copula process. 2 Definition 2.1. Copula Process Let {Wt } be a collection of random variables indexed by t ∈ T , with marginal distribution functions Ft , and let Qt = Ft (Wt ). Further, let µ be a stochastic process measure with marginal distribution functions Gt , and joint distribution function H. Then Wt is copula process distributed with base measure µ, or Wt ∼ CP(µ), if and only if for all n ∈ N, ai ∈ R, n P( (−1) {Gti (Qti ) ≤ ai }) = Ht1 ,t2 ,...,tn (a1 , a2 , . . . , an ). (4) i=1 (−1) Each Qti ∼ Uniform(0, 1), and Gti is the quasi-inverse of Gti , as previously defined. Definition 2.2. Gaussian Copula Process Wt is Gaussian copula process distributed if it is copula process distributed and the base measure µ is a Gaussian process. If there is a mapping Ψ such that Ψ(Wt ) ∼ GP(m(t), k(t, t )), then we write Wt ∼ GCP(Ψ, m(t), k(t, t )). For example, if we have Wt ∼ GCP with m(t) = 0 and k(t, t) = 1, then in the definition of a copula process, Gt = Φ, the standard univariate Gaussian cdf, and H is the usual GP joint distribution function. Supposing this GCP is a Gaussian-Beta process, then Ψ = Φ−1 ◦ FB , where FB is a univariate Beta cdf. One could similarly define other copula processes. We described generally how a copula process can be used to generate samples of arbitrarily many random variables with desired marginals and dependencies. We now develop a specific and practical application of this framework. We introduce a stochastic volatility model, Gaussian Copula Process Volatility (GCPV), as an example of how to learn the joint distribution of arbitrarily many random variables, the marginals of these random variables, and to make predictions. To do this, we fit a Gaussian copula process by using a type of Warped Gaussian Process [15]. However, our methodology varies substantially from Snelson et al. [15], since we are doing inference on latent variables as opposed to observations, which is a much greater undertaking that involves approximations, and we are doing so in a different context. 3 Gaussian Copula Process Volatility Assume we have a sequence of observations y = (y1 , . . . , yn ) at times t = (t1 , . . . , tn ) . The observations are random variables with different latent standard deviations. We therefore have n unobserved standard deviations, σ1 , . . . , σn , and want to learn the correlation structure between these standard deviations, and also to predict the distribution of σ∗ at some unrealised time t∗ . We model the standard deviation function as a Gaussian copula process: σt ∼ GCP(g −1 , 0, k(t, t )). (5) f (t) ∼ GP(m(t) = 0, k(t, t )) σ(t) = g(f (t), ω) (6) (7) y(t) ∼ N (0, σ 2 (t)), (8) Specifically, where g is a monotonic warping function, parametrized by ω. For each of the observations y = (y1 , . . . , yn ) we have corresponding GP latent function values f = (f1 , . . . , fn ) , where σ(ti ) = g(fi , ω), using the shorthand fi to mean f (ti ). σt ∼ GCP, because any finite sequence (σ1 , . . . , σp ) is distributed as a Gaussian copula: P (σ1 ≤ a1 , . . . , σp ≤ ap ) = P (g −1 (σ1 ) ≤ g −1 (a1 ), . . . , g −1 (σp ) ≤ g −1 (ap )) = ΦΓ (g −1 (a1 ), . . . , g −1 −1 (ap )) = ΦΓ (Φ = ΦΓ (Φ −1 (F (a1 )), . . . , Φ (u1 ), . . . , Φ −1 −1 (9) (F (ap ))) (up )) = C(u1 , . . . , up ), where Φ is the standard univariate Gaussian cdf (supposing k(t, t) = 1), ΦΓ is a multivariate Gaussian cdf with covariance matrix Γij = cov(g −1 (σi ), g −1 (σj )), and F is the marginal distribution of 3 each σi . In (5), we have Ψ = g −1 , because it is g −1 which maps σt to a GP. The specification in (5) is equivalently expressed by (6) and (7). With GCPV, the form of g is learned so that g −1 (σt ) is best modelled by a GP. By learning g, we learn the marginal of each σ: F (a) = Φ(g −1 (a)) for a ∈ R. Recently, a different sort of ‘kernel copula process’ has been used, where the marginals of the variables being modelled are not learned [16].1 Further, we also consider a more subtle and flexible form of our model, where the function g itself is indexed by time: g = gt (f (t), ω). We only assume that the marginal distributions of σt are stationary over ‘small’ time periods, and for each of these time periods (5)-(7) hold true. We return to this in the final discussion section. Here we have assumed that each observation, conditioned on knowing its variance, is normally distributed with zero mean. This is a common assumption in heteroscedastic models. The zero mean and normality assumptions can be relaxed and are not central to this paper. 4 Predictions with GCPV Ultimately, we wish to infer p(σ(t∗ )|y, z), where z = {θ, ω}, and θ are the hyperparameters of the GP covariance function. To do this, we sample from p(f∗ |y, z) = p(f∗ |f , θ)p(f |y, z)df (10) and then transform these samples by g. Letting (Cf )ij = δij g(fi , ω)2 , where δij is the Kronecker delta, Kij = k(ti , tj ), (k∗ )i = k(t∗ , ti ), we have p(f |y, z) = N (f ; 0, K)N (y; 0, Cf )/p(y|z), p(f∗ |f , θ) = N (k∗ K −1 f , k(t∗ , t∗ ) − k∗ K −1 (11) k∗ ). (12) We also wish to learn z, which we can do by finding the z that maximizes the marginal likelihood, ˆ p(y|z) = p(y|f , ω)p(f |θ)df . (13) Unfortunately, for many functions g, (10) and (13) are intractable. Our methods of dealing with this can be used in very general circumstances, where one has a Gaussian process prior, but an (optionally parametrized) non-Gaussian likelihood. We use the Laplace approximation to estimate p(f |y, z) as a Gaussian. Then we can integrate (10) for a Gaussian approximation to p(f∗ |y, z), which we sample from to make predictions of σ∗ . Using Laplace, we can also find an expression for an approximate marginal likelihood, which we maximize to determine z. Once we have found z with Laplace, we use Markov chain Monte Carlo to sample from p(f∗ |y, z), and compare that to using Laplace to sample from p(f∗ |y, z). In the supplement we relate this discussion to (9). 4.1 Laplace Approximation The goal is to approximate (11) with a Gaussian, so that we can evaluate (10) and (13) and make predictions. In doing so, we follow Rasmussen and Williams [11] in their treatment of Gaussian process classification, except we use a parametrized likelihood, and modify Newton’s method. First, consider as an objective function the logarithm of an unnormalized (11): s(f |y, z) = log p(y|f , ω) + log p(f |θ). (14) ˆ The Laplace approximation uses a second order Taylor expansion about the f which maximizes ˆ, for which we use (14), to find an approximate objective s(f |y, z). So the first step is to find f ˜ Newton’s method. The Newton update is f new = f − ( s(f ))−1 s(f ). Differentiating (14), s(f |y, z) = s(f |y, z) = where W is the diagonal matrix − 1 log p(y|f , ω) − K −1 f log p(y|f , ω) − K −1 = −W − K −1 , log p(y|f , ω). Note added in proof : Also, for a very recent related model, see Rodr´guez et al. [17]. ı 4 (15) (16) If the likelihood function p(y|f , ω) is not log concave, then W may have negative entries. Vanhatalo et al. [18] found this to be problematic when doing Gaussian process regression with a Student-t ˆ likelihood. They instead use an expectation-maximization (EM) algorithm for finding f , and iterate ordered rank one Cholesky updates to evaluate the Laplace approximate marginal likelihood. But EM can converge slowly, especially near a local optimum, and each of the rank one updates is vulnerable to numerical instability. With a small modification of Newton’s method, we often get close to ˆ quadratic convergence for finding f , and can evaluate the Laplace approximate marginal likelihood in a numerically stable fashion, with no approximate Cholesky factors, and optimal computational requirements. Some comments are in the supplementary material but, in short, we use an approximate negative Hessian, − s ≈ M + K −1 , which is guaranteed to be positive definite, since M is formed on each iteration by zeroing the negative entries of W . For stability, we reformulate our 1 1 1 1 optimization in terms of B = I + M 2 KM 2 , and let Q = M 2 B −1 M 2 , b = M f + log p(y|f ), a = b − QKb. Since (K −1 + M )−1 = K − KQK, the Newton update becomes f new = Ka. ˆ With these updates we find f and get an expression for s which we use to approximate (13) and ˜ (11). The approximate marginal likelihood q(y|z) is given by exp(˜)df . Taking its logarithm, s 1ˆ 1 ˆ log q(y|z) = − f af + log p(y|f ) − log |Bf |, (17) ˆ ˆ 2 2 ˆ ˆ where Bf is B evaluated at f , and af is a numerically stable evaluation of K −1 f . ˆ ˆ To learn the parameters z, we use conjugate gradient descent to maximize (17) with respect to z. ˆ ˆ Since f is a function of z, we initialize z, and update f every time we vary z. Once we have found an optimum z , we can make predictions. By exponentiating s, we find a Gaussian approximation to ˆ ˜ ˆ the posterior (11), q(f |y, z) = N (f , K − KQK). The product of this approximate posterior with p(f∗ |f ) is Gaussian. Integrating this product, we approximate p(f∗ |y, z) as ˆ q(f∗ |y, z) = N (k∗ log p(y|f ), k(t∗ , t∗ ) − k∗ Qk∗ ). (18) Given n training observations, the cost of each Newton iteration is dominated by computing the cholesky decomposition of B, which takes O(n3 ) operations. The objective function typically changes by less than 10−6 after 3 iterations. Once Newton’s method has converged, it takes only O(1) operations to draw from q(f∗ |y, z) and make predictions. 4.2 Markov chain Monte Carlo We use Markov chain Monte Carlo (MCMC) to sample from (11), so that we can later sample from p(σ∗ |y, z) to make predictions. Sampling from (11) is difficult, because the variables f are strongly coupled by a Gaussian process prior. We use a new technique, Elliptical Slice Sampling [19], and find it extremely effective for this purpose. It was specifically designed to sample from posteriors with correlated Gaussian priors. It has no free parameters, and jointly updates every element of f . For our setting, it is over 100 times as fast as axis aligned slice sampling with univariate updates. To make predictions, we take J samples of p(f |y, z), {f 1 , . . . , f J }, and then approximate (10) as a mixture of J Gaussians: J 1 p(f∗ |f i , θ). (19) p(f∗ |y, z) ≈ J i=1 Each of the Gaussians in this mixture have equal weight. So for each sample of f∗ |y, we uniformly choose a random p(f∗ |f i , θ) and draw a sample. In the limit J → ∞, we are sampling from the exact p(f∗ |y, z). Mapping these samples through g gives samples from p(σ∗ |y, z). After one O(n3 ) and one O(J) operation, a draw from (19) takes O(1) operations. 4.3 Warping Function The warping function, g, maps fi , a GP function value, to σi , a standard deviation. Since fi can take any value in R, and σi can take any non-negative real value, g : R → R+ . For each fi to correspond to a unique deviation, g must also be one-to-one. We use K g(x, ω) = aj log[exp[bj (x + cj )] + 1], j=1 5 aj , bj > 0. (20) This is monotonic, positive, infinitely differentiable, asymptotic towards zero as x → −∞, and K tends to ( j=1 aj bj )x as x → ∞. In practice, it is useful to add a small constant to (20), to avoid rare situations where the parameters ω are trained to make g extremely small for certain inputs, at the expense of a good overall fit; this can happen when the parameters ω are learned by optimizing a likelihood. A suitable constant could be one tenth the absolute value of the smallest nonzero observation. By inferring the parameters of the warping function, or distributions of these parameters, we are learning a transformation which will best model σt with a Gaussian process. The more flexible the warping function, the more potential there is to improve the GCPV fit – in other words, the better we can estimate the ‘perfect’ transformation. To test the importance of this flexibility, we also try a simple unparametrized warping function, g(x) = ex . In related work, Goldberg et al. [20] place a GP prior on the log noise level in a standard GP regression model on observations, except for inference they use Gibbs sampling, and a high level of ‘jitter’ for conditioning. Once g is trained, we can infer the marginal distribution of each σ: F (a) = Φ(g −1 (a)), for a ∈ R. This suggests an alternate way to initialize g: we can initialize F as a mixture of Gaussians, and then map through Φ−1 to find g −1 . Since mixtures of Gaussians are dense in the set of probability distributions, we could in principle find the ‘perfect’ g using an infinite mixture of Gaussians [21]. 5 Experiments In our experiments, we predict the latent standard deviations σ of observations y at times t, and also σ∗ at unobserved times t∗ . To do this, we use two versions of GCPV. The first variant, which we simply refer to as GCPV, uses the warping function (20) with K = 1, and squared exponential covariance function, k(t, t ) = A exp(−(t−t )2 /l2 ), with A = 1. The second variant, which we call GP-EXP, uses the unparametrized warping function ex , and the same covariance function, except the amplitude A is a trained hyperparameter. The other hyperparameter l is called the lengthscale of the covariance function. The greater l, the greater the covariance between σt and σt+a for a ∈ R. We train hyperparameters by maximizing the Laplace approximate log marginal likelihood (17). We then sample from p(f∗ |y) using the Laplace approximation (18). We also do this using MCMC (19) with J = 10000, after discarding a previous 10000 samples of p(f |y) as burn-in. We pass 2 these samples of f∗ |y through g and g 2 to draw from p(σ∗ |y) and p(σ∗ |y), and compute the sample mean and variance of σ∗ |y. We use the sample mean as a point predictor, and the sample variance for error bounds on these predictions, and we use 10000 samples to compute these quantities. For GCPV we use Laplace and MCMC for inference, but for GP-EXP we only use Laplace. We compare predictions to GARCH(1,1), which has been shown in extensive and recent reviews to be competitive with other GARCH variants, and more sophisticated models [5, 6, 7]. GARCH(p,q) specifies y(t) ∼ p 2 2 N (0, σ 2 (t)), and lets the variance be a deterministic function of the past: σt = a0 + i=1 ai yt−i + q 2 j=1 bj σt−j . We use the Matlab Econometrics Toolbox implementation of GARCH, where the parameters a0 , ai and bj are estimated using a constrained maximum likelihood. We make forecasts of volatility, and we predict historical volatility. By ‘historical volatility’ we mean the volatility at observed time points, or between these points. Uncovering historical volatility is important. It could, for instance, be used to study what causes fluctuations in the stock market, or to understand physical systems. To evaluate our model, we use the Mean Squared Error (MSE) between the true variance, or proxy for the truth, and the predicted variance. Although likelihood has advantages, we are limited in space, and we wish to harmonize with the econometrics literature, and other assessments of volatility models, where MSE is the standard. In a similar assessment of volatility models, Brownlees et al. [7] found that MSE and quasi-likelihood rankings were comparable. When the true variance is unknown we follow Brownlees et al. [7] and use squared observations as a proxy for the truth, to compare our model to GARCH.2 The more observations, the more reliable these performance estimates will be. However, not many observations (e.g. 100) are needed for a stable ranking of competing models; in Brownlees et al. [7], the rankings derived from high frequency squared observations are similar to those derived using daily squared observations. 2 Since each observation y is assumed to have zero mean and variance σ 2 , E[y 2 ] = σ 2 . 6 5.1 Simulations We simulate observations from N (0, σ 2 (t)), using σ(t) = sin(t) cos(t2 ) + 1, at t = (0, 0.02, 0.04, . . . , 4) . We call this data set TRIG. We also simulate using a standard deviation that jumps from 0.1 to 7 and back, at times t = (0, 0.1, 0.2, . . . , 6) . We call this data set JUMP. To forecast, we use all observations up until the current time point, and make 1, 7, and 30 step ahead predictions. So, for example, in TRIG we start by observing t = 0, and make forecasts at t = 0.02, 0.14, 0.60. Then we observe t = 0, 0.02 and make forecasts at t = 0.04, 0.16, 0.62, and so on, until all data points have been observed. For historical volatility, we predict the latent σt at the observation times, which is safe since we are comparing to the true volatility, which is not used in training; the results are similar if we interpolate. Figure 1 panels a) and b) show the true volatility for TRIG and JUMP respectively, alongside GCPV Laplace, GCPV MCMC, GP-EXP Laplace, and GARCH(1,1) predictions of historical volatility. Table 1 shows the results for forecasting and historical volatility. In panel a) we see that GCPV more accurately captures the dependencies between σ at different times points than GARCH: if we manually decrease the lengthscale in the GCPV covariance function, we can replicate the erratic GARCH behaviour, which inaccurately suggests that the covariance between σt and σt+a decreases quickly with increases in a. We also see that GCPV with an unparametrized exponential warping function tends to overestimates peaks and underestimate troughs. In panel b), the volatility is extremely difficult to reconstruct or forecast – with no warning it will immediately and dramatically increase or decrease. This behaviour is not suited to a smooth squared exponential covariance function. Nevertheless, GCPV outperforms GARCH, especially in regions of low volatility. We also see this in panel a) for t ∈ (1.5, 2). GARCH is known to respond slowly to large returns, and to overpredict volatility [22]. In JUMP, the greater the peaks, and the smaller the troughs, the more GARCH suffers, while GCPV is mostly robust to these changes. 5.2 Financial Data The returns on the daily exchange rate between the Deutschmark (DM) and the Great Britain Pound (GBP) from 1984 to 1992 have become a benchmark for assessing the performance of GARCH models [8, 9, 10]. This exchange data, which we refer to as DMGBP, can be obtained from www.datastream.com, and the returns are calculated as rt = log(Pt+1 /Pt ), where Pt is the number of DM to GBP on day t. The returns are assumed to have a zero mean function. We use a rolling window of the previous 120 days of returns to make 1, 7, and 30 day ahead volatility forecasts, starting at the beginning of January 1988, and ending at the beginning of January 1992 (659 trading days). Every 7 days, we retrain the parameters of GCPV and GARCH. Every time we retrain parameters, we predict historical volatility over the past 120 days. The average MSE for these historical predictions is given in Table 1, although they should be observed with caution; unlike with the simulations, the DMGBP historical predictions are trained using the same data they are assessed on. In Figure 1c), we see that the GARCH one day ahead forecasts are lifted above the GCPV forecasts, but unlike in the simulations, they are now operating on a similar lengthscale. This suggests that GARCH could still be overpredicting volatility, but that GCPV has adapted its estimation of how σt and σt+a correlate with one another. Since GARCH is suited to this financial data set, it is reassuring that GCPV predictions have a similar time varying structure. Overall, GCPV and GARCH are competitive with one another for forecasting currency exchange returns, as seen in Table 1. Moreover, a learned warping function g outperforms an unparametrized one, and a full Laplace solution is comparable to using MCMC for inference, in accuracy and speed. This is also true for the simulations. Therefore we recommend whichever is more convenient to implement. 6 Discussion We defined a copula process, and as an example, developed a stochastic volatility model, GCPV, which can outperform GARCH. With GCPV, the volatility σt is distributed as a Gaussian Copula Process, which separates the modelling of the dependencies between volatilities at different times from their marginal distributions – arguably the most useful property of a copula. Further, GCPV fits the marginals in the Gaussian copula process by learning a warping function. If we had simply chosen an unparametrized exponential warping function, we would incorrectly be assuming that the log 7 Table 1: MSE for predicting volatility. Data set Model Historical 1 step 7 step 30 step TRIG GCPV (LA) GCPV (MCMC) GP-EXP GARCH 0.0953 0.0760 0.193 0.938 0.588 0.622 0.646 1.04 0.951 0.979 1.36 1.79 1.71 1.76 1.15 5.12 JUMP GCPV (LA) GCPV (MCMC) GP-EXP GARCH 0.588 1.21 1.43 1.88 0.891 0.951 1.76 1.58 1.38 1.37 6.95 3.43 1.35 1.35 14.7 5.65 GCPV (LA) GCPV (MCMC) GP-EXP GARCH 2.43 2.39 2.52 2.83 3.00 3.00 3.20 3.03 3.08 3.08 3.46 3.12 3.17 3.17 5.14 3.32 ×103 DMGBP ×10−9 TRIG JUMP DMGBP 20 DMGBP 0.015 600 Probability Density 3 1 Volatility Volatility Volatility 15 2 10 0.01 0.005 5 0 0 1 2 Time (a) 3 4 0 0 2 4 0 6 Time (b) 0 200 400 Days (c) 600 400 200 0 0 0.005 σ (d) 0.01 Figure 1: Predicting volatility and learning its marginal pdf. For a) and b), the true volatility, and GCPV (MCMC), GCPV (LA), GP-EXP, and GARCH predictions, are shown respectively by a thick green line, a dashed thick blue line, a dashed black line, a cyan line, and a red line. a) shows predictions of historical volatility for TRIG, where the shade is a 95% confidence interval about GCPV (MCMC) predictions. b) shows predictions of historical volatility for JUMP. In c), a black line and a dashed red line respectively show GCPV (LA) and GARCH one day ahead volatility forecasts for DMGBP. In d), a black line and a dashed blue line respectively show the GCPV learned marginal pdf of σt in DMGBP and a Gamma(4.15,0.00045) pdf. volatilities are marginally Gaussian distributed. Indeed, for the DMGBP data, we trained the warping function g over a 120 day period, and mapped its inverse through the univariate standard Gaussian cdf Φ, and differenced, to estimate the marginal probability density function (pdf) of σt over this period. The learned marginal pdf, shown in Figure 1d), is similar to a Gamma(4.15,0.00045) distribution. However, in using a rolling window to retrain the parameters of g, we do not assume that the marginals of σt are stationary; we have a time changing warping function. While GARCH is successful, and its simplicity is attractive, our model is also simple and has a number of advantages. We can effortlessly handle missing data, we can easily incorporate covariates other than time (like interest rates) in our covariance function, and we can choose from a rich class of covariance functions – squared exponential, Brownian motion, Mat´ rn, periodic, etc. In fact, the e volatility of high frequency intradaily returns on equity indices and currency exchanges is cyclical [23], and GCPV with a periodic covariance function is uniquely well suited to this data. And the parameters of GCPV, like the covariance function lengthscale, or the learned warping function, provide insight into the underlying source of volatility, unlike the parameters of GARCH. Finally, copulas are rapidly becoming popular in applications, but often only bivariate copulas are being used. With our copula process one can learn the dependencies between arbitrarily many random variables independently of their marginal distributions. We hope the Gaussian Copula Process Volatility model will encourage other applications of copula processes. More generally, we hope our work will help bring together the machine learning and econometrics communities. Acknowledgments: Thanks to Carl Edward Rasmussen and Ferenc Husz´ r for helpful conversaa tions. AGW is supported by an NSERC grant. 8 References [1] Paul Embrechts, Alexander McNeil, and Daniel Straumann. Correlation and dependence in risk management: Properties and pitfalls. In Risk Management: Value at risk and beyond, pages 176–223. Cambridge University Press, 1999. [2] David X. Li. On default correlation: A copula function approach. Journal of Fixed Income, 9(4):43–54, 2000. [3] Roger B. Nelsen. An Introduction to Copulas. Springer Series in Statistics, second edition, 2006. [4] Tim Bollerslev. Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31 (3):307–327, 1986. [5] Ser-Huang Poon and Clive W.J. Granger. Practical issues in forecasting volatility. Financial Analysts Journal, 61(1):45–56, 2005. [6] Peter Reinhard Hansen and Asger Lunde. A forecast comparison of volatility models: Does anything beat a GARCH(1,1). Journal of Applied Econometrics, 20(7):873–889, 2005. [7] Christian T. Brownlees, Robert F. Engle, and Bryan T. Kelly. A practical guide to volatility forecasting through calm and storm, 2009. Available at SSRN: http://ssrn.com/abstract=1502915. [8] T. Bollerslev and E. Ghysels. Periodic autoregressive conditional heteroscedasticity. Journal of Business and Economic Statistics, 14:139–151, 1996. [9] B.D. McCullough and C.G. Renfro. Benchmarks and software standards: A case study of GARCH procedures. Journal of Economic and Social Measurement, 25:59–71, 1998. [10] C. Brooks, S.P. Burke, and G. Persand. Benchmarks and the accuracy of GARCH model estimation. International Journal of Forecasting, 17:45–56, 2001. [11] Carl Edward Rasmussen and Christopher K.I. Williams. Gaussian processes for Machine Learning. The MIT Press, 2006. ` [12] Abe Sklar. Fonctions de r´ partition a n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris, 8: e 229–231, 1959. [13] P Deheuvels. Caract´ isation compl` te des lois extrˆ mes multivari´ s et de la convergence des types e e e e extrˆ mes. Publications de l’Institut de Statistique de l’Universit´ de Paris, 23:1–36, 1978. e e [14] G Kimeldorf and A Sampson. Uniform representations of bivariate distributions. Communications in Statistics, 4:617–627, 1982. [15] Edward Snelson, Carl Edward Rasmussen, and Zoubin Ghahramani. Warped Gaussian Processes. In NIPS, 2003. [16] Sebastian Jaimungal and Eddie K.H. Ng. Kernel-based Copula processes. In ECML PKDD, 2009. [17] A. Rodr´guez, D.B. Dunson, and A.E. Gelfand. Latent stick-breaking processes. Journal of the American ı Statistical Association, 105(490):647–659, 2010. [18] Jarno Vanhatalo, Pasi Jylanki, and Aki Vehtari. Gaussian process regression with Student-t likelihood. In NIPS, 2009. [19] Iain Murray, Ryan Prescott Adams, and David J.C. MacKay. Elliptical Slice Sampling. In AISTATS, 2010. [20] Paul W. Goldberg, Christopher K.I. Williams, and Christopher M. Bishop. Regression with inputdependent noise: A Gaussian process treatment. In NIPS, 1998. [21] Carl Edward Rasmussen. The Infinite Gaussian Mixture Model. In NIPS, 2000. [22] Ruey S. Tsay. Analysis of Financial Time Series. John Wiley & Sons, 2002. [23] Torben G. Andersen and Tim Bollerslev. Intraday periodicity and volatility persistence in financial markets. Journal of Empirical Finance, 4(2-3):115–158, 1997. 9
2 0.72042948 281 nips-2010-Using body-anchored priors for identifying actions in single images
Author: Leonid Karlinsky, Michael Dinerstein, Shimon Ullman
Abstract: This paper presents an approach to the visual recognition of human actions using only single images as input. The task is easy for humans but difficult for current approaches to object recognition, because instances of different actions may be similar in terms of body pose, and often require detailed examination of relations between participating objects and body parts in order to be recognized. The proposed approach applies a two-stage interpretation procedure to each training and test image. The first stage produces accurate detection of the relevant body parts of the actor, forming a prior for the local evidence needed to be considered for identifying the action. The second stage extracts features that are anchored to the detected body parts, and uses these features and their feature-to-part relations in order to recognize the action. The body anchored priors we propose apply to a large range of human actions. These priors allow focusing on the relevant regions and relations, thereby significantly simplifying the learning process and increasing recognition performance. 1
3 0.64558381 81 nips-2010-Evaluating neuronal codes for inference using Fisher information
Author: Haefner Ralf, Matthias Bethge
Abstract: Many studies have explored the impact of response variability on the quality of sensory codes. The source of this variability is almost always assumed to be intrinsic to the brain. However, when inferring a particular stimulus property, variability associated with other stimulus attributes also effectively act as noise. Here we study the impact of such stimulus-induced response variability for the case of binocular disparity inference. We characterize the response distribution for the binocular energy model in response to random dot stereograms and find it to be very different from the Poisson-like noise usually assumed. We then compute the Fisher information with respect to binocular disparity, present in the monocular inputs to the standard model of early binocular processing, and thereby obtain an upper bound on how much information a model could theoretically extract from them. Then we analyze the information loss incurred by the different ways of combining those inputs to produce a scalar single-neuron response. We find that in the case of depth inference, monocular stimulus variability places a greater limit on the extractable information than intrinsic neuronal noise for typical spike counts. Furthermore, the largest loss of information is incurred by the standard model for position disparity neurons (tuned-excitatory), that are the most ubiquitous in monkey primary visual cortex, while more information from the inputs is preserved in phase-disparity neurons (tuned-near or tuned-far) primarily found in higher cortical regions. 1
4 0.59255999 195 nips-2010-Online Learning in The Manifold of Low-Rank Matrices
Author: Uri Shalit, Daphna Weinshall, Gal Chechik
Abstract: When learning models that are represented in matrix forms, enforcing a low-rank constraint can dramatically improve the memory and run time complexity, while providing a natural regularization of the model. However, naive approaches for minimizing functions over the set of low-rank matrices are either prohibitively time consuming (repeated singular value decomposition of the matrix) or numerically unstable (optimizing a factored representation of the low rank matrix). We build on recent advances in optimization over manifolds, and describe an iterative online learning procedure, consisting of a gradient step, followed by a second-order retraction back to the manifold. While the ideal retraction is hard to compute, and so is the projection operator that approximates it, we describe another second-order retraction that can be computed efficiently, with run time and memory complexity of O ((n + m)k) for a rank-k matrix of dimension m × n, given rank-one gradients. We use this algorithm, LORETA, to learn a matrixform similarity measure over pairs of documents represented as high dimensional vectors. LORETA improves the mean average precision over a passive- aggressive approach in a factorized model, and also improves over a full model trained over pre-selected features using the same memory requirements. LORETA also showed consistent improvement over standard methods in a large (1600 classes) multi-label image classification task. 1
5 0.56736356 91 nips-2010-Fast detection of multiple change-points shared by many signals using group LARS
Author: Jean-philippe Vert, Kevin Bleakley
Abstract: We present a fast algorithm for the detection of multiple change-points when each is frequently shared by members of a set of co-occurring one-dimensional signals. We give conditions on consistency of the method when the number of signals increases, and provide empirical evidence to support the consistency results. 1
6 0.56399173 96 nips-2010-Fractionally Predictive Spiking Neurons
7 0.56288671 53 nips-2010-Copula Bayesian Networks
8 0.56156683 21 nips-2010-Accounting for network effects in neuronal responses using L1 regularized point process models
9 0.55377752 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior
10 0.55196106 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
11 0.55142719 33 nips-2010-Approximate inference in continuous time Gaussian-Jump processes
12 0.55039984 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
13 0.54895246 143 nips-2010-Learning Convolutional Feature Hierarchies for Visual Recognition
14 0.54099602 228 nips-2010-Reverse Multi-Label Learning
15 0.54090452 113 nips-2010-Heavy-Tailed Process Priors for Selective Shrinkage
16 0.53943449 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
17 0.53808331 155 nips-2010-Learning the context of a category
18 0.53735679 257 nips-2010-Structured Determinantal Point Processes
19 0.53678602 42 nips-2010-Boosting Classifier Cascades
20 0.53661197 56 nips-2010-Deciphering subsampled data: adaptive compressive sampling as a principle of brain communication