nips nips2004 nips2004-50 knowledge-graph by maker-knowledge-mining

50 nips-2004-Dependent Gaussian Processes


Source: pdf

Author: Phillip Boyle, Marcus Frean

Abstract: Gaussian processes are usually parameterised in terms of their covariance functions. However, this makes it difficult to deal with multiple outputs, because ensuring that the covariance matrix is positive definite is problematic. An alternative formulation is to treat Gaussian processes as white noise sources convolved with smoothing kernels, and to parameterise the kernel instead. Using this, we extend Gaussian processes to handle multiple, coupled outputs. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: Gaussian processes are usually parameterised in terms of their covariance functions. [sent-4, score-0.443]

2 However, this makes it difficult to deal with multiple outputs, because ensuring that the covariance matrix is positive definite is problematic. [sent-5, score-0.261]

3 An alternative formulation is to treat Gaussian processes as white noise sources convolved with smoothing kernels, and to parameterise the kernel instead. [sent-6, score-0.647]

4 Using this, we extend Gaussian processes to handle multiple, coupled outputs. [sent-7, score-0.407]

5 A Gaussian process (GP), as a set of jointly Gaussian random variables, is completely characterised by a covariance matrix with entries determined by a covariance function. [sent-12, score-0.361]

6 Traditionally, such models have been specified by parameterising the covariance function (i.e. [sent-13, score-0.281]

7 a function specifying the covariance of output values given any two input vectors). [sent-15, score-0.423]

8 In general this needs to be a positive definite function to ensure positive definiteness of the covariance matrix. [sent-16, score-0.226]

9 Most GP implementations model only a single output variable. [sent-17, score-0.284]

10 Attempts to handle multiple outputs generally involve using an independent model for each output - a method known as multi-kriging [18] - but such models cannot capture the structure in outputs that covary. [sent-18, score-1.004]

11 As an example, consider the two tightly coupled outputs shown at the top of Figure 2, in which one output is simply a shifted version of the other. [sent-19, score-0.689]

12 Here we have detailed knowledge of output 1, but sampling of output 2 is sparse. [sent-20, score-0.484]

13 A model that treats the outputs as independent cannot exploit their obvious similarity - intuitively, we should make predictions about output 2 using what we learn from both output 1 and 2. [sent-21, score-0.905]

14 Methods that model multiple outputs jointly do exist (e.g. co-kriging [3]) but are problematic in that it is not clear how covariance functions should be defined [5]. [sent-24, score-0.182]

15 Although there are many known positive definite autocovariance functions (e.g. [sent-25, score-0.155]

16 Gaussians and many others [1, 9]), it is difficult to define cross-covariance functions that result in positive definite covariance matrices. [sent-27, score-0.22]

17 Contrast this to neural network modelling, where the handling of multiple outputs is routine. [sent-28, score-0.341]

18 An alternative to directly parameterising covariance functions is to treat GPs as the outputs of stable linear filters. [sent-29, score-0.626]

19 For a linear filter, the output in response to an input x(t) is y(t) = h(t) ∗ x(t) = ∫_{−∞}^{∞} h(t − τ) x(τ) dτ, where h(t) defines the impulse response of the filter and ∗ denotes convolution. [sent-30, score-0.35]

20 Provided the linear filter is stable and x(t) is Gaussian white noise, the output process y(t) is necessarily a Gaussian process. [sent-31, score-0.451]

21 It is also possible to characterise p-dimensional stable linear filters, with M inputs and N outputs, by a set of M × N impulse responses. [sent-32, score-0.153]

22 In general, the resulting N outputs are dependent Gaussian processes. [sent-33, score-0.621]

23 Now we can model multiple dependent outputs by parameterising the set of impulse responses for a multiple output linear filter, and inferring the parameter values from data that we observe. [sent-34, score-1.336]

24 Instead of specifying and parameterising positive definite covariance functions, we now specify and parameterise impulse responses. [sent-35, score-0.589]

25 The only restriction is that the filter be linear and stable, and this is achieved by requiring the impulse responses to be absolutely integrable. [sent-36, score-0.193]

26 Constructing GPs by stimulating linear filters with Gaussian noise is equivalent to constructing GPs through kernel convolutions. [sent-37, score-0.176]

27 A Gaussian process V(s) can be constructed over a region S by convolving a continuous white noise process X(s) with a smoothing kernel h(s), V(s) = h(s) ∗ X(s) for s ∈ S [7]. [sent-38, score-0.397]

28 To this can be added a second white noise source, representing measurement uncertainty, and together this gives a model for observations Y . [sent-39, score-0.288]
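To make the kernel-convolution construction concrete, here is a minimal numerical sketch (not from the paper; the grid, kernel width, and noise level are illustrative assumptions) that draws a discretised white noise process, convolves it with a Gaussian smoothing kernel h(s) to give V(s), and adds measurement noise to give the observations Y.

```python
import numpy as np

# Sketch of V(s) = h(s) * X(s): convolve discretised white noise with a
# Gaussian smoothing kernel, then add measurement noise to obtain Y(s).
# Grid spacing, kernel width, and noise level are illustrative choices.
rng = np.random.default_rng(0)
s = np.linspace(-1.0, 1.0, 401)                        # input locations over S
ds = s[1] - s[0]

X = rng.normal(0.0, 1.0 / np.sqrt(ds), size=s.shape)   # discretised white noise
h = np.exp(-0.5 * (s / 0.05) ** 2)                     # Gaussian smoothing kernel h(s)

V = ds * np.convolve(X, h, mode="same")                # smooth Gaussian process V(s)
Y = V + rng.normal(0.0, 0.1, size=s.shape)             # noisy observations Y(s)
```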

29 The convolution approach has been used to formulate flexible nonstationary covariance functions [13, 12]. [sent-41, score-0.332]

30 Furthermore, this idea can be extended to model multiple dependent output processes by assuming a single common latent process [7]. [sent-42, score-1.098]

31 For example, two dependent processes V1(s) and V2(s) are constructed from a shared dependence on X(s) for s ∈ S0, as follows: V1(s) = ∫_{S0∪S1} h1(s − λ) X(λ) dλ and V2(s) = ∫_{S0∪S2} h2(s − λ) X(λ) dλ, where S = S0 ∪ S1 ∪ S2 is a union of disjoint subspaces. [sent-43, score-0.612]

32 V1 (s) is dependent on X(s), s ∈ S1 but not X(s), s ∈ S2 . [sent-44, score-0.353]

33 Similarly, V2 (s) is dependent on X(s), s ∈ S2 but not X(s), s ∈ S1 . [sent-45, score-0.353]

34 This allows V1 (s) and V2 (s) to possess independent components. [sent-46, score-0.123]

35 In this paper, we model multiple outputs somewhat differently to [7]. [sent-47, score-0.383]

36 Instead of assuming a single latent process defined over a union of subspaces, we assume multiple latent processes, each defined over ℝ^p. [sent-48, score-0.285]

37 Some outputs may be dependent through a shared reliance on common latent processes, and some outputs may possess unique, independent features through a connection to a latent process that affects no other output. [sent-49, score-1.193]
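A rough sketch of this shared/unique latent-source idea (hypothetical kernel widths, not the paper's settings): both outputs are built from a common white noise source X0 plus a private source each, so they covary without being identical.

```python
import numpy as np

# Two dependent processes from a shared source X0 plus private sources X1, X2,
# each convolved with its own Gaussian kernel (1-d sketch, arbitrary widths).
rng = np.random.default_rng(1)
s = np.linspace(-1.0, 1.0, 401)
ds = s[1] - s[0]

def smooth(noise, width):
    """Convolve discretised white noise with a Gaussian kernel of the given width."""
    h = np.exp(-0.5 * (s / width) ** 2)
    return ds * np.convolve(noise, h, mode="same")

X0, X1, X2 = (rng.normal(0.0, 1.0 / np.sqrt(ds), size=s.shape) for _ in range(3))
Y1 = smooth(X0, 0.08) + smooth(X1, 0.05)   # shared component + private component
Y2 = smooth(X0, 0.08) + smooth(X2, 0.05)   # covaries with Y1 through X0 only
```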

38 2 Two Dependent Outputs Consider two outputs Y1(s) and Y2(s) over a region ℝ^p, where s ∈ ℝ^p. [sent-50, score-0.301]

39 We have N1 observations of output 1 and N2 observations of output 2, giving us data D1 = {s_{1,i}, y_{1,i}} for i = 1..N1 and D2 = {s_{2,i}, y_{2,i}} for i = 1..N2. [sent-51, score-0.566]

40 As shown in Figure 1(b), we can model each output as the linear sum of three stationary Gaussian processes. [sent-53, score-0.324]

41 One of these (V ) arises from a noise source unique to that output, under convolution with a kernel h. [sent-54, score-0.29]

42 A second (U ) is similar, but arises from a separate noise source X0 that influences both outputs (although via different kernels, k). [sent-55, score-0.436]

43 The output Y is the sum of two Gaussian white noise processes, one of which has been convolved (∗) with a kernel (h). [sent-59, score-0.53]

44 (b) The model for two dependent outputs Y1 and Y2 . [sent-60, score-0.663]

45 All of X0 , X1 , X2 and the “noise” contributions are independent Gaussian white noise sources. [sent-61, score-0.285]

46 Notice that if X0 is forced to zero, Y1 and Y2 become independent processes as in (a) - we use this as a control model. [sent-62, score-0.355]

47 The k1, k2, h1, h2 are parameterised Gaussian kernels, where k1(s) = v1 exp(−(1/2) s^T A1 s), k2(s) = v2 exp(−(1/2) (s − µ)^T A2 (s − µ)), and hi(s) = wi exp(−(1/2) s^T Bi s). [sent-63, score-0.436]

48 Note that k2 (s) is offset from zero by µ to allow modelling of outputs that are coupled and translated relative to one another. [sent-64, score-0.516]
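The kernels above translate directly into code; the sketch below (1-d inputs, made-up parameter values) also evaluates the covariance contributed by the shared source as the kernel correlation ∫ k_i(s) k_j(s + d) ds by simple quadrature. For Gaussian kernels this integral has a closed form, but quadrature keeps the sketch generic.

```python
import numpy as np

# Parameterised Gaussian kernels for 1-d inputs (illustrative parameter values).
v1, v2, w1, w2 = 1.0, 1.0, 0.3, 0.3
A1, A2, B1, B2 = 40.0, 40.0, 60.0, 60.0   # inverse squared length-scales
mu = 0.5                                   # translation between the coupled outputs

def k1(s): return v1 * np.exp(-0.5 * A1 * s ** 2)
def k2(s): return v2 * np.exp(-0.5 * A2 * (s - mu) ** 2)
def h1(s): return w1 * np.exp(-0.5 * B1 * s ** 2)
def h2(s): return w2 * np.exp(-0.5 * B2 * s ** 2)

def kernel_corr(ka, kb, d, lo=-5.0, hi=5.0, n=2001):
    """Evaluate the integral of ka(s) * kb(s + d) ds by trapezoidal quadrature."""
    s = np.linspace(lo, hi, n)
    return np.trapz(ka(s) * kb(s + d), s)

# Auto- and cross-covariance contributions of the shared source at separation d:
d = 0.2
C11_U = kernel_corr(k1, k1, d)
C12_U = kernel_corr(k1, k2, d)
```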

49 We wish to derive the set of functions C^Y_ij(d) that define the autocovariance (i = j) and cross-covariance (i ≠ j) between the outputs i and j, for a given separation d between arbitrary inputs s_a and s_b. [sent-65, score-0.385]

50 Given C^Y_ij(d) then, we can construct the covariance matrices C11, C12, C21, and C22, where entry (m, n) of C_ij is C^Y_ij(s_{i,m} − s_{j,n}). [sent-67, score-0.15]

51 Together these define the positive definite symmetric covariance matrix C for the combined output data D: C = [C11 C12; C21 C22] (2). We define a set of hyperparameters Θ that parameterise {v1, v2, w1, w2, A1, A2, B1, B2, µ, σ1, σ2}. [sent-76, score-0.609]
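Assembling C from the covariance functions is mechanical; here is a sketch with hypothetical helper names (cov_fns[i][j] is assumed to be a vectorised C^Y_ij, and inputs[i] the locations for output i).

```python
import numpy as np

def block_cov(cov_fns, inputs):
    """Build the joint covariance matrix C from per-pair covariance functions
    cov_fns[i][j](d) and per-output input locations inputs[i] (1-d inputs)."""
    blocks = [[None, None], [None, None]]
    for i in range(2):
        for j in range(2):
            si, sj = inputs[i][:, None], inputs[j][None, :]
            blocks[i][j] = cov_fns[i][j](si - sj)   # entry (m, n) = C_ij(s_im - s_jn)
    return np.block(blocks)

# Usage sketch (names are placeholders):
# C = block_cov([[C11_fn, C12_fn], [C21_fn, C22_fn]], [s1_locs, s2_locs])
```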

52 Learning a model now corresponds to either maximising the likelihood L, or maximising the posterior probability P (Θ | D). [sent-78, score-0.146]
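For a zero-mean joint Gaussian model, maximising the likelihood means maximising the usual Gaussian log marginal likelihood of the stacked data; a standard sketch (Θ enters only through C):

```python
import numpy as np

def log_likelihood(C, y):
    """log p(y | Theta) for a zero-mean Gaussian with covariance C over the
    stacked output data y; uses a Cholesky factor for numerical stability."""
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * len(y) * np.log(2.0 * np.pi))
```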

53 The predictive distribution at a point s on output i, given Θ and D, is Gaussian with mean ŷ = k^T C^{-1} y and variance σ̂_y^2 = κ − k^T C^{-1} k, where κ = C^Y_ii(0) = v_i^2 + w_i^2 + σ_i^2 and k = [C^Y_i1(s − s_{1,1}) … C^Y_i2(s − s_{2,N2})]^T. [sent-80, score-0.356]
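The predictive equations above map directly onto a few linear solves; a sketch assuming C, the stacked targets y, the vector k of covariances between the test input and all training inputs, and the prior variance κ are already computed:

```python
import numpy as np

def predict(C, y, k, kappa):
    """Predictive mean y_hat = k^T C^{-1} y and variance kappa - k^T C^{-1} k."""
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    v = np.linalg.solve(L, k)
    return k @ alpha, kappa - v @ v
```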

54 Example 1 - Strongly dependent outputs over 1d input space Consider two outputs, observed over a 1d input space. [sent-87, score-0.621]

55 We generated N = 48 data points by taking N1 = 32 samples from output 1 and N2 = 16 samples from output 2. [sent-91, score-0.624]

56 The samples from output 1 were linearly spaced in the interval [−1, 1] and those from output 2 were uniformly spaced in the region [−1, −0. [sent-92, score-0.661]

57 All samples were taken under additive Gaussian noise, σ = 0. [sent-95, score-0.108]

58 The resulting dependent model is shown in Figure 2 along with an independent (control) model with no coupling (see Figure 1). [sent-98, score-0.602]

59 Observe that the dependent model has learned the coupling and translation between the outputs, and has filled in output 2 where samples are missing. [sent-99, score-0.792]

60 The control model cannot achieve such infilling as it consists of two independent Gaussian processes. [sent-100, score-0.169]

61 Example 2 - Strongly dependent outputs over 2d input space Consider two outputs, observed over a 2d input space. [sent-102, score-0.621]

62 (Figure 2 legend: True function, Model mean; panels show outputs 1 and 2 under the independent model.) [sent-107, score-0.244]

63 Figure 2: Strongly dependent outputs where output 2 is simply a translated version of output 1, with independent Gaussian noise, σ = 0. [sent-167, score-1.253]

64 We generated 117 data points by taking 81 samples from output 1 and 36 samples from output 2. [sent-173, score-0.624]

65 Both sets of samples formed uniform lattices over the region [−0. [sent-174, score-0.103]

66 The dependent model is shown in Figure 3 along with an independent control model. [sent-181, score-0.522]

67 The dependent model has filled in output 2 where samples are missing. [sent-182, score-0.707]

68 Again, the control model cannot achieve such in-filling as it consists of two independent Gaussian processes. [sent-183, score-0.169]

69 3 Time Series Forecasting Consider the observation of multiple time series, where some of the series lead or predict the others. [sent-184, score-0.338]

70 We simulated a set of three time series for 100 steps each (figure 4), where series 3 was positively coupled to a lagged version of series 1 (lag = 0.5) [sent-185, score-1.035]

71 and negatively coupled to a lagged version of series 2 (lag = 0. [sent-186, score-0.564]
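A rough sketch of such a generator (the smoothing, lags, and noise level are made-up choices, not the paper's actual simulation): series 3 is built from a positively weighted lagged copy of series 1 and a negatively weighted lagged copy of series 2.

```python
import numpy as np

# Simulate three coupled series: series 3 depends on lagged series 1 and 2.
rng = np.random.default_rng(2)
T = 100

def smooth_series(width=5.0):
    """Smooth random signal standing in for an observed series."""
    x = rng.normal(size=T + 20)
    w = np.exp(-0.5 * (np.arange(-10, 11) / width) ** 2)
    return np.convolve(x, w / w.sum(), mode="same")[10:T + 10]

lag1, lag2 = 7, 4                        # lags in time steps (illustrative)
s1, s2 = smooth_series(), smooth_series()
s3 = np.zeros(T)
for i in range(T):
    a = s1[i - lag1] if i >= lag1 else 0.0
    b = s2[i - lag2] if i >= lag2 else 0.0
    s3[i] = a - b + 0.05 * rng.normal()  # +coupling to series 1, -coupling to series 2
```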

72 Given the 300 observations, we built a dependent GP model of the three time series and compared it with independent GP models. [sent-188, score-0.705]

73 The dependent GP model incorporated a prior belief that series 3 was coupled to series 1 and 2, with the lags unknown. [sent-189, score-1.003]

74 The independent GP model assumed no coupling between its outputs, and consisted of three independent GP models. [sent-190, score-0.287]

75 We queried the models for forecasts of the future 10 values of series 3. [sent-191, score-0.23]

76 It is clear from figure 4 that the dependent GP model does a far better job at forecasting the dependent series 3. [sent-192, score-1.189]

77 The independent model becomes inaccurate after just a few time steps into the future. [sent-193, score-0.122]

78 This inaccuracy is expected as knowledge of series 1 and 2 is required to accurately predict series 3. [sent-194, score-0.495]

79 Figure 3: Strongly dependent outputs where output 2 is simply a copy of output 1, with independent Gaussian noise; output 2 is modelled well only by the dependent model. [sent-195, score-1.185]

80 The dependent GP model performs well as it has learned that series 3 is positively coupled to a lagged version of series 1 and negatively coupled to a lagged version of series 2. [sent-199, score-2.194]

81 4 Multiple Outputs and Non-stationary Kernels The convolution framework described here for constructing GPs can be extended to build models capable of modelling N outputs, each defined over a p-dimensional input space. [sent-200, score-0.256]

82 In general, we can define a model where we assume M independent Gaussian white noise processes X1(s), …, XM(s). [sent-201, score-0.475]

83 The autocovariance (i = j) and cross-covariance (i ≠ j) functions between output processes i and j become C^U_ij(d) = Σ_{m=1}^{M} ∫_{ℝ^p} k_mi(s) k_mj(s + d) ds (3), and the matrix defined by equation 2 is extended in the obvious way. [sent-208, score-0.626]
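For arbitrary absolutely integrable kernels, equation 3 can be evaluated numerically; a 1-d quadrature sketch where kernels[m][i] is assumed to be the (vectorised) kernel connecting latent source m to output i:

```python
import numpy as np

def cov_U(kernels, i, j, d, lo=-5.0, hi=5.0, n=2001):
    """C^U_ij(d) = sum over sources m of the integral of k_mi(s) * k_mj(s + d) ds,
    evaluated by trapezoidal quadrature for a 1-d input space (sketch only)."""
    s = np.linspace(lo, hi, n)
    return sum(np.trapz(k[i](s) * k[j](s + d), s) for k in kernels)
```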

84 We require kernels that are absolutely integrable, ∫_{ℝ^p} |k(s)| ds < ∞. [sent-210, score-0.162]

85 It would seem that an absolutely integrable kernel would be easier to define and parameterise than a positive definite function. [sent-215, score-0.356]

86 Figure 4: Three coupled time series, where series 1 and series 2 predict series 3. [sent-217, score-0.873]

87 Forecasting for series 3 begins after 100 time steps where t = 7. [sent-218, score-0.23]

88 The dependent model forecast is shown with a solid line, and the independent (control) forecast is shown with a broken line. [sent-220, score-0.671]

89 The dependent model does a far better job at forecasting the next 10 steps of series 3 (black dots). [sent-221, score-0.836]

90 5 Conclusion We have shown how the Gaussian Process framework can be extended to multiple output variables without assuming them to be independent. [sent-222, score-0.354]

91 Multiple processes can be handled by inferring convolution kernels instead of covariance functions. [sent-223, score-0.586]

92 This makes it easy to construct the required positive definite covariance matrices for covarying outputs. [sent-224, score-0.188]

93 However the framework developed here is more general than this, as it can model data that arises from multiple sources, only some of which are shared. [sent-226, score-0.15]

94 Our examples show the infilling of sparsely sampled regions that becomes possible in a model that permits coupling between outputs. [sent-227, score-0.127]

95 Another application is the forecasting of dependent time series. [sent-228, score-0.523]

96 Our example shows how learning couplings between multiple time series may aid in forecasting, particularly when the series to be forecast is dependent on previous or current values of other series. [sent-229, score-0.984]

97 Dependent Gaussian processes should be particularly valuable in cases where one output is expensive to sample, but covaries strongly with a second that is cheap. [sent-230, score-0.533]

98 By inferring both the coupling and the independent aspects of the data, the cheap observations can be used as a proxy for the expensive ones. [sent-231, score-0.252]

99 A review of Gaussian random fields and correlation functions. [sent-233, score-0.198]

100 Monte Carlo implementation of Gaussian process models for Bayesian regression and classification. [sent-293, score-0.372]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('dependent', 0.353), ('outputs', 0.268), ('output', 0.242), ('cij', 0.242), ('series', 0.23), ('processes', 0.228), ('gaussian', 0.198), ('gp', 0.195), ('forecasting', 0.17), ('covariance', 0.15), ('coupled', 0.148), ('gps', 0.142), ('parameterise', 0.131), ('parameterising', 0.131), ('lagged', 0.114), ('impulse', 0.108), ('white', 0.103), ('noise', 0.102), ('asmussen', 0.098), ('forecast', 0.098), ('ibbs', 0.098), ('wellington', 0.098), ('ac', 0.091), ('absolutely', 0.085), ('autocovariance', 0.085), ('coupling', 0.085), ('convolution', 0.085), ('independent', 0.08), ('kernels', 0.077), ('ay', 0.077), ('multiple', 0.073), ('bi', 0.072), ('samples', 0.07), ('wi', 0.066), ('cii', 0.065), ('eal', 0.065), ('illiams', 0.065), ('integrable', 0.065), ('paciorek', 0.065), ('parameterised', 0.065), ('victoria', 0.065), ('nonstationary', 0.065), ('lling', 0.065), ('exp', 0.064), ('strongly', 0.063), ('modelling', 0.063), ('lter', 0.061), ('process', 0.061), ('latent', 0.06), ('maximised', 0.057), ('maximising', 0.052), ('positively', 0.052), ('lag', 0.048), ('hyperparameters', 0.048), ('vi', 0.048), ('control', 0.047), ('inferring', 0.046), ('convolved', 0.046), ('stable', 0.045), ('carlo', 0.044), ('monte', 0.044), ('toronto', 0.044), ('possess', 0.043), ('model', 0.042), ('lled', 0.041), ('job', 0.041), ('negatively', 0.041), ('nite', 0.041), ('observations', 0.041), ('ai', 0.04), ('stationary', 0.04), ('extended', 0.039), ('kt', 0.038), ('additive', 0.038), ('positive', 0.038), ('constructing', 0.037), ('bayesian', 0.037), ('spaced', 0.037), ('translated', 0.037), ('dots', 0.037), ('kernel', 0.037), ('phd', 0.037), ('hi', 0.036), ('arises', 0.035), ('predict', 0.035), ('modelled', 0.035), ('region', 0.033), ('functions', 0.032), ('build', 0.032), ('regression', 0.032), ('thesis', 0.032), ('version', 0.031), ('handle', 0.031), ('source', 0.031), ('predictions', 0.031), ('union', 0.031), ('specifying', 0.031), ('ab', 0.031), ('station', 0.028), ('barnett', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999946 50 nips-2004-Dependent Gaussian Processes

Author: Phillip Boyle, Marcus Frean

Abstract: Gaussian processes are usually parameterised in terms of their covariance functions. However, this makes it difficult to deal with multiple outputs, because ensuring that the covariance matrix is positive definite is problematic. An alternative formulation is to treat Gaussian processes as white noise sources convolved with smoothing kernels, and to parameterise the kernel instead. Using this, we extend Gaussian processes to handle multiple, coupled outputs. 1

2 0.20160221 98 nips-2004-Learning Gaussian Process Kernels via Hierarchical Bayes

Author: Anton Schwaighofer, Volker Tresp, Kai Yu

Abstract: We present a novel method for learning with Gaussian process regression in a hierarchical Bayesian framework. In a first step, kernel matrices on a fixed set of input points are learned from data using a simple and efficient EM algorithm. This step is nonparametric, in that it does not require a parametric form of covariance function. In a second step, kernel functions are fitted to approximate the learned covariance matrix using a generalized Nystr¨ m method, which results in a complex, data o driven kernel. We evaluate our approach as a recommendation engine for art images, where the proposed hierarchical Bayesian method leads to excellent prediction performance. 1

3 0.11801826 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning

Author: Liam Paninski

Abstract: Log-concavity is an important property in the context of optimization, Laplace approximation, and sampling; Bayesian methods based on Gaussian process priors have become quite popular recently for classification, regression, density estimation, and point process intensity estimation. Here we prove that the predictive densities corresponding to each of these applications are log-concave, given any observed data. We also prove that the likelihood is log-concave in the hyperparameters controlling the mean function of the Gaussian prior in the density and point process intensity estimation cases, and the mean, covariance, and observation noise parameters in the classification and regression cases; this result leads to a useful parameterization of these hyperparameters, indicating a suitably large class of priors for which the corresponding maximum a posteriori problem is log-concave.

4 0.11543468 201 nips-2004-Using the Equivalent Kernel to Understand Gaussian Process Regression

Author: Peter Sollich, Christopher Williams

Abstract: The equivalent kernel [1] is a way of understanding how Gaussian process regression works for large sample sizes based on a continuum limit. In this paper we show (1) how to approximate the equivalent kernel of the widely-used squared exponential (or Gaussian) kernel and related kernels, and (2) how analysis using the equivalent kernel helps to understand the learning curves for Gaussian processes.

5 0.10770315 124 nips-2004-Multiple Alignment of Continuous Time Series

Author: Jennifer Listgarten, Radford M. Neal, Sam T. Roweis, Andrew Emili

Abstract: Multiple realizations of continuous-valued time series from a stochastic process often contain systematic variations in rate and amplitude. To leverage the information contained in such noisy replicate sets, we need to align them in an appropriate way (for example, to allow the data to be properly combined by adaptive averaging). We present the Continuous Profile Model (CPM), a generative model in which each observed time series is a non-uniformly subsampled version of a single latent trace, to which local rescaling and additive noise are applied. After unsupervised training, the learned trace represents a canonical, high resolution fusion of all the replicates. As well, an alignment in time and scale of each observation to this trace can be found by inference in the model. We apply CPM to successfully align speech signals from multiple speakers and sets of Liquid Chromatography-Mass Spectrometry proteomic data. 1

6 0.083940603 5 nips-2004-A Harmonic Excitation State-Space Approach to Blind Separation of Speech

7 0.080153689 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

8 0.079591349 148 nips-2004-Probabilistic Computation in Spiking Populations

9 0.077018902 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment

10 0.073825024 168 nips-2004-Semigroup Kernels on Finite Sets

11 0.073548004 27 nips-2004-Bayesian Regularization and Nonnegative Deconvolution for Time Delay Estimation

12 0.070934139 198 nips-2004-Unsupervised Variational Bayesian Learning of Nonlinear Models

13 0.066503696 163 nips-2004-Semi-parametric Exponential Family PCA

14 0.063801073 28 nips-2004-Bayesian inference in spiking neurons

15 0.062032439 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

16 0.060270198 166 nips-2004-Semi-supervised Learning via Gaussian Processes

17 0.059245437 175 nips-2004-Stable adaptive control with online learning

18 0.056824327 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants

19 0.055664368 136 nips-2004-On Semi-Supervised Classification

20 0.054625943 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.202), (1, -0.041), (2, -0.029), (3, -0.011), (4, -0.088), (5, -0.081), (6, -0.056), (7, 0.013), (8, 0.101), (9, -0.024), (10, 0.125), (11, 0.003), (12, 0.11), (13, 0.032), (14, 0.075), (15, -0.047), (16, 0.003), (17, -0.02), (18, 0.243), (19, -0.142), (20, -0.144), (21, 0.045), (22, -0.104), (23, 0.002), (24, 0.058), (25, -0.015), (26, 0.189), (27, 0.017), (28, 0.002), (29, -0.045), (30, 0.055), (31, 0.035), (32, 0.012), (33, 0.007), (34, 0.06), (35, -0.023), (36, -0.162), (37, -0.132), (38, 0.004), (39, -0.072), (40, -0.0), (41, 0.074), (42, -0.068), (43, 0.02), (44, -0.05), (45, -0.027), (46, 0.051), (47, -0.113), (48, 0.038), (49, -0.087)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98182184 50 nips-2004-Dependent Gaussian Processes

Author: Phillip Boyle, Marcus Frean

Abstract: Gaussian processes are usually parameterised in terms of their covariance functions. However, this makes it difficult to deal with multiple outputs, because ensuring that the covariance matrix is positive definite is problematic. An alternative formulation is to treat Gaussian processes as white noise sources convolved with smoothing kernels, and to parameterise the kernel instead. Using this, we extend Gaussian processes to handle multiple, coupled outputs. 1
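As a toy numerical illustration of the convolution formulation in this abstract (a hedged sketch, not the authors' parameterisation), one can discretise a single white noise source and convolve it with two different smoothing kernels; because both outputs are driven by the same noise, they are strongly coupled. The Gaussian kernel shape, widths, and the shift below are arbitrary illustrative choices.

```python
# A toy, discretised illustration of the convolution construction: one shared white
# noise source convolved with two (assumed Gaussian) smoothing kernels gives two
# coupled outputs. Kernel shape, widths and the shift are illustrative choices only.
import numpy as np

rng = np.random.default_rng(0)
dx = 0.01
x = np.arange(-5.0, 5.0, dx)
noise = rng.standard_normal(x.size) / np.sqrt(dx)   # discretised white noise source

def smoothing_kernel(width, shift=0.0):
    k = np.exp(-0.5 * ((x - shift) / width) ** 2)
    return k / np.sqrt((k ** 2).sum() * dx)          # normalise for comparable output scales

h1 = smoothing_kernel(width=0.3)
h2 = smoothing_kernel(width=0.3, shift=0.5)          # second kernel, displaced by 0.5

y1 = np.convolve(noise, h1, mode="same") * dx        # output 1
y2 = np.convolve(noise, h2, mode="same") * dx        # output 2, driven by the same noise

# Because both outputs come from the same underlying noise source, knowing one is
# highly informative about the other; here y2 is essentially y1 delayed by 0.5.
lag = int(round(0.5 / dx))
print(np.corrcoef(y1[:-lag], y2[lag:])[0, 1])        # close to 1
```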

2 0.77001941 98 nips-2004-Learning Gaussian Process Kernels via Hierarchical Bayes

Author: Anton Schwaighofer, Volker Tresp, Kai Yu

Abstract: We present a novel method for learning with Gaussian process regression in a hierarchical Bayesian framework. In a first step, kernel matrices on a fixed set of input points are learned from data using a simple and efficient EM algorithm. This step is nonparametric, in that it does not require a parametric form of covariance function. In a second step, kernel functions are fitted to approximate the learned covariance matrix using a generalized Nyström method, which results in a complex, data-driven kernel. We evaluate our approach as a recommendation engine for art images, where the proposed hierarchical Bayesian method leads to excellent prediction performance. 1
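For orientation, the following is a minimal sketch of the standard Nyström approximation that the "generalized Nyström method" mentioned above builds on: a full kernel matrix is approximated from a small set of landmark points. The RBF kernel, landmark count and all names are illustrative assumptions, not the construction used in the paper.

```python
# A minimal sketch of the standard Nystrom approximation: reconstruct an n x n
# kernel matrix from a small m-point landmark subset. Kernel and parameters are
# illustrative assumptions.
import numpy as np

def rbf(A, B, lengthscale=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 3))                    # all input points
idx = rng.choice(500, size=50, replace=False)        # landmark subset

K_nm = rbf(X, X[idx])                                # n x m cross-kernel block
K_mm = rbf(X[idx], X[idx])                           # m x m landmark block
K_approx = K_nm @ np.linalg.pinv(K_mm) @ K_nm.T      # rank-m Nystrom approximation

K_full = rbf(X, X)
# Relative Frobenius error of the rank-m approximation.
print(np.linalg.norm(K_full - K_approx) / np.linalg.norm(K_full))
```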

3 0.66444695 201 nips-2004-Using the Equivalent Kernel to Understand Gaussian Process Regression

Author: Peter Sollich, Christopher Williams

Abstract: The equivalent kernel [1] is a way of understanding how Gaussian process regression works for large sample sizes based on a continuum limit. In this paper we show (1) how to approximate the equivalent kernel of the widely-used squared exponential (or Gaussian) kernel and related kernels, and (2) how analysis using the equivalent kernel helps to understand the learning curves for Gaussian processes.
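A quick way to see the equivalent kernel numerically (a hedged sketch under standard GP-regression assumptions, not the paper's analytical approximations): on a dense uniform grid the GP posterior mean is a linear smoother, and the row of the smoother matrix for a test point approximates the equivalent kernel in the large-sample limit. The squared exponential kernel and all parameter values below are illustrative.

```python
# A hedged numerical sketch: for GP regression the posterior mean is a linear
# smoother, y_hat = K (K + sigma^2 I)^{-1} y, so on a dense uniform grid a row of
# the smoother matrix (rescaled by the grid spacing) approximates the equivalent
# kernel at that point. Kernel choice and parameter values are illustrative.
import numpy as np

ell, sigma2 = 0.1, 0.1                       # assumed SE lengthscale and noise variance
x = np.linspace(-1.5, 1.5, 601)              # dense grid standing in for the continuum
dx = x[1] - x[0]
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / ell ** 2)

smoother = K @ np.linalg.inv(K + sigma2 * np.eye(x.size))
centre = x.size // 2
ek = smoother[centre] / dx                   # approximate equivalent kernel at x = 0

# The weights concentrate around x = 0 and decay away from it, with mild
# oscillation; this is the shape the equivalent-kernel analysis characterises.
print(ek[centre], ek[centre + 30], ek[centre + 120])
```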

4 0.63295436 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning

Author: Liam Paninski

Abstract: Log-concavity is an important property in the context of optimization, Laplace approximation, and sampling; Bayesian methods based on Gaussian process priors have become quite popular recently for classification, regression, density estimation, and point process intensity estimation. Here we prove that the predictive densities corresponding to each of these applications are log-concave, given any observed data. We also prove that the likelihood is log-concave in the hyperparameters controlling the mean function of the Gaussian prior in the density and point process intensity estimation cases, and the mean, covariance, and observation noise parameters in the classification and regression cases; this result leads to a useful parameterization of these hyperparameters, indicating a suitably large class of priors for which the corresponding maximum a posteriori problem is log-concave.

5 0.51204777 166 nips-2004-Semi-supervised Learning via Gaussian Processes

Author: Neil D. Lawrence, Michael I. Jordan

Abstract: We present a probabilistic approach to learning a Gaussian Process classifier in the presence of unlabeled data. Our approach involves a “null category noise model” (NCNM) inspired by ordered categorical noise models. The noise model reflects an assumption that the data density is lower between the class-conditional densities. We illustrate our approach on a toy problem and present comparative results for the semi-supervised classification of handwritten digits. 1

6 0.47715029 198 nips-2004-Unsupervised Variational Bayesian Learning of Nonlinear Models

7 0.47524166 8 nips-2004-A Machine Learning Approach to Conjoint Analysis

8 0.43865103 180 nips-2004-Synchronization of neural networks by mutual learning and its application to cryptography

9 0.43424523 81 nips-2004-Implicit Wiener Series for Higher-Order Image Analysis

10 0.42301619 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment

11 0.4209719 158 nips-2004-Sampling Methods for Unsupervised Learning

12 0.42025998 148 nips-2004-Probabilistic Computation in Spiking Populations

13 0.41042089 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images

14 0.40449488 124 nips-2004-Multiple Alignment of Continuous Time Series

15 0.39327878 149 nips-2004-Probabilistic Inference of Alternative Splicing Events in Microarray Data

16 0.3762776 168 nips-2004-Semigroup Kernels on Finite Sets

17 0.36441106 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification

18 0.35554075 159 nips-2004-Schema Learning: Experience-Based Construction of Predictive Action Models

19 0.34083351 136 nips-2004-On Semi-Supervised Classification

20 0.33673239 27 nips-2004-Bayesian Regularization and Nonnegative Deconvolution for Time Delay Estimation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.053), (15, 0.715), (26, 0.027), (33, 0.079), (39, 0.02), (50, 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97511631 50 nips-2004-Dependent Gaussian Processes

Author: Phillip Boyle, Marcus Frean

Abstract: Gaussian processes are usually parameterised in terms of their covariance functions. However, this makes it difficult to deal with multiple outputs, because ensuring that the covariance matrix is positive definite is problematic. An alternative formulation is to treat Gaussian processes as white noise sources convolved with smoothing kernels, and to parameterise the kernel instead. Using this, we extend Gaussian processes to handle multiple, coupled outputs. 1

2 0.96483815 59 nips-2004-Efficient Kernel Discriminant Analysis via QR Decomposition

Author: Tao Xiong, Jieping Ye, Qi Li, Ravi Janardan, Vladimir Cherkassky

Abstract: Linear Discriminant Analysis (LDA) is a well-known method for feature extraction and dimension reduction. It has been used widely in many applications such as face recognition. Recently, a novel LDA algorithm based on QR Decomposition, namely LDA/QR, has been proposed, which is competitive in terms of classification accuracy with other LDA algorithms, but has much lower time and space costs. However, LDA/QR is based on linear projection, which may not be suitable for data with nonlinear structure. This paper first proposes an algorithm called KDA/QR, which extends the LDA/QR algorithm to deal with nonlinear data by using the kernel operator. Then an efficient approximation of KDA/QR called AKDA/QR is proposed. Experiments on face image data show that the classification accuracy of both KDA/QR and AKDA/QR is competitive with Generalized Discriminant Analysis (GDA), a general kernel discriminant analysis algorithm, while AKDA/QR has much lower time and space costs. 1
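For reference, here is a minimal sketch of classical LDA via the textbook scatter-matrix eigenproblem, the baseline that LDA/QR and KDA/QR are designed to make cheaper. This is not the authors' QR-based algorithm; the small ridge term and all names are illustrative assumptions.

```python
# A minimal sketch of classical linear discriminant analysis (the baseline the QR
# variants above accelerate): project onto generalized eigenvectors of the
# between-class and within-class scatter matrices. Not the LDA/QR algorithm itself.
import numpy as np

def lda_directions(X, y, n_components):
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))                    # within-class scatter
    Sb = np.zeros((d, d))                    # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean_all)[:, None]
        Sb += Xc.shape[0] * diff @ diff.T
    # Solve Sb v = lambda Sw v; a small ridge keeps Sw invertible (undersampled case).
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw + 1e-6 * np.eye(d), Sb))
    order = np.argsort(evals.real)[::-1]
    return evecs.real[:, order[:n_components]]

# Usage: project onto at most (number of classes - 1) discriminant directions.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 1.0, size=(50, 4)) for m in (0.0, 2.0, 4.0)])
y = np.repeat([0, 1, 2], 50)
W = lda_directions(X, y, n_components=2)
print((X @ W).shape)
```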

3 0.95543504 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition

Author: Le Lu, Gregory D. Hager, Laurent Younes

Abstract: Visual action recognition is an important problem in computer vision. In this paper, we propose a new method to probabilistically model and recognize actions of articulated objects, such as hand or body gestures, in image sequences. Our method consists of three levels of representation. At the low level, we first extract a feature vector invariant to scale and in-plane rotation by using the Fourier transform of a circular spatial histogram. Then, spectral partitioning [20] is utilized to obtain an initial clustering; this clustering is then refined using a temporal smoothness constraint. Gaussian mixture model (GMM) based clustering and density estimation in the subspace of linear discriminant analysis (LDA) are then applied to thousands of image feature vectors to obtain an intermediate level representation. Finally, at the high level we build a temporal multiresolution histogram model for each action by aggregating the clustering weights of sampled images belonging to that action. We discuss how this high level representation can be extended to achieve temporal scaling invariance and to include Bi-gram or Multi-gram transition information. Both image clustering and action recognition/segmentation results are given to show the validity of our three tiered representation.
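A small sketch of the rotation-invariance part of the low-level feature described above: a circular (angular) histogram of an image patch, made invariant to in-plane rotation by taking the magnitude of its discrete Fourier transform, since an in-plane rotation circularly shifts the angular histogram and the FFT magnitude is unchanged by circular shifts. The bin count and all names are illustrative assumptions, not the paper's settings, and scale invariance is not addressed here.

```python
# Rotation-invariant feature sketch: angular histogram of a patch, followed by the
# magnitude of its DFT (invariant to circular shifts, hence to in-plane rotation).
# Bin count and names are illustrative assumptions.
import numpy as np

def angular_histogram(patch, n_bins=16):
    h, w = patch.shape
    yy, xx = np.mgrid[0:h, 0:w]
    theta = np.arctan2(yy - (h - 1) / 2.0, xx - (w - 1) / 2.0)  # pixel angle about the centre
    bins = ((theta + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.bincount(bins.ravel(), weights=patch.ravel(), minlength=n_bins)
    return hist / (hist.sum() + 1e-12)

def rotation_invariant_feature(patch, n_bins=16):
    return np.abs(np.fft.fft(angular_histogram(patch, n_bins)))

rng = np.random.default_rng(3)
patch = rng.random((33, 33))
hist = angular_histogram(patch)
# Rotating the patch in-plane corresponds (up to binning effects) to circularly
# shifting its angular histogram; the Fourier magnitude is identical under such shifts.
print(np.allclose(np.abs(np.fft.fft(hist)), np.abs(np.fft.fft(np.roll(hist, 5)))))
```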

4 0.94681668 203 nips-2004-Validity Estimates for Loopy Belief Propagation on Binary Real-world Networks

Author: Joris M. Mooij, Hilbert J. Kappen

Abstract: We introduce a computationally efficient method to estimate the validity of the BP method as a function of graph topology, the connectivity strength, frustration and network size. We present numerical results that demonstrate the correctness of our estimates for the uniform random model and for a real-world network (“C. Elegans”). Although the method is restricted to pair-wise interactions, no local evidence (zero “biases”) and binary variables, we believe that its predictions correctly capture the limitations of BP for inference and MAP estimation on arbitrary graphical models. Using this approach, we find that BP always performs better than MF. Especially for large networks with broad degree distributions (such as scale-free networks) BP turns out to significantly outperform MF. 1

5 0.90868878 92 nips-2004-Kernel Methods for Implicit Surface Modeling

Author: Joachim Giesen, Simon Spalinger, Bernhard Schölkopf

Abstract: We describe methods for computing an implicit model of a hypersurface that is given only by a finite sampling. The methods work by mapping the sample points into a reproducing kernel Hilbert space and then determining regions in terms of hyperplanes. 1

6 0.90688384 197 nips-2004-Two-Dimensional Linear Discriminant Analysis

7 0.89681792 183 nips-2004-Temporal-Difference Networks

8 0.82015514 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

9 0.76425219 201 nips-2004-Using the Equivalent Kernel to Understand Gaussian Process Regression

10 0.75140738 148 nips-2004-Probabilistic Computation in Spiking Populations

11 0.74143142 168 nips-2004-Semigroup Kernels on Finite Sets

12 0.74124956 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods

13 0.73348457 188 nips-2004-The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space

14 0.70732218 30 nips-2004-Binet-Cauchy Kernels

15 0.70618582 98 nips-2004-Learning Gaussian Process Kernels via Hierarchical Bayes

16 0.70557177 178 nips-2004-Support Vector Classification with Input Data Uncertainty

17 0.70369512 94 nips-2004-Kernels for Multi--task Learning

18 0.70332938 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection

19 0.69425708 18 nips-2004-Algebraic Set Kernels with Application to Inference Over Local Image Representations

20 0.69026768 55 nips-2004-Distributed Occlusion Reasoning for Tracking with Nonparametric Belief Propagation