jmlr jmlr2011 jmlr2011-17 knowledge-graph by maker-knowledge-mining

17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes


Source: pdf

Author: Mauricio A. Álvarez, Neil D. Lawrence

Abstract: Recently there has been an increasing interest in regression methods that deal with multiple outputs. This has been motivated partly by frameworks like multitask learning, multisensor networks or structured output data. From a Gaussian processes perspective, the problem reduces to specifying an appropriate covariance function that, whilst being positive semi-definite, captures the dependencies between all the data points and across all the outputs. One approach to account for non-trivial correlations between outputs employs convolution processes. Under a latent function interpretation of the convolution transform we establish dependencies between output variables. The main drawbacks of this approach are the associated computational and storage demands. In this paper we address these issues. We present different efficient approximations for dependent output Gaussian processes constructed through the convolution formalism. We exploit the conditional independencies present naturally in the model. This leads to a form of the covariance similar in spirit to the so called PITC and FITC approximations for a single output. We show experimental results with synthetic and real data, in particular, we show results in school exams score prediction, pollution prediction and gene expression data. Keywords: Gaussian processes, convolution processes, efficient approximations, multitask learning, structured outputs, multivariate processes

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Under a latent function interpretation of the convolution transform we establish dependencies between output variables. [sent-12, score-0.276]

2 This approach is known in the geostatistics literature as the linear model of coregionalization (LMC) (Journel and Huijbregts, 1978; Goovaerts, 1997). [sent-44, score-0.327]

3 In the LMC, the covariance function is expressed as the sum of Kronecker products between coregionalization matrices and a set of underlying covariance functions. [sent-45, score-0.472]

4 The correlations across the outputs are expressed in the coregionalization matrices, while the underlying covariance functions express the correlation between different data points. [sent-46, score-0.487]

5 In the linear model of coregionalization each output can be thought of as an instantaneous mixing of the underlying signals/processes. [sent-51, score-0.459]

6 An alternative approach to constructing covariance functions for multiple outputs employs convolution processes (CP). [sent-52, score-0.35]

7 To obtain a CP in the single output case, the output of a given process is convolved with a smoothing kernel function. [sent-53, score-0.393]

8 For example, a white noise process may be convolved with a smoothing kernel to obtain a covariance function (Barry and Ver Hoef, 1996; Ver Hoef and Barry, 1998). [sent-54, score-0.293]

9 Ver Hoef and Barry (1998) and then Higdon (2002) noted that if a single input process was convolved with different smoothing kernels to produce different outputs, then correlation between the outputs could be expressed. [sent-55, score-0.331]

10 In this paper, we propose different efficient approximations for the full covariance matrix involved in the multiple output convolution process. [sent-62, score-0.372]

11 Many of the ideas for constructing covariance functions for multiple outputs have first appeared within the geostatistical literature, where they are known as linear models of coregionalization (LMC). [sent-81, score-0.517]

12 The Linear Model of Coregionalization. The term linear model of coregionalization refers to models in which the outputs are expressed as linear combinations of independent random functions. [sent-84, score-0.399]

13 In a LMC each output function, $f_d(x)$, is expressed as (Journel and Huijbregts, 1978) $f_d(x) = \sum_{q=1}^{Q} a_{d,q} u_q(x)$. [sent-87, score-0.455]

14 We can group together the base processes that share latent functions (Journel and Huijbregts, 1978; Goovaerts, 1997), allowing us to express a given output as $f_d(x) = \sum_{q=1}^{Q} \sum_{i=1}^{R_q} a^i_{d,q} u^i_q(x)$ (2), where the functions $u^i_q(x)$, $i = 1, \ldots$ [sent-90, score-0.391]

15 $\ldots, R_q$, represent the latent functions that share the same covariance function $k_q(x, x')$. [sent-93, score-0.358]

16 The cross covariance between any two functions $f_d(x)$ and $f_{d'}(x')$ is given in terms of the covariance functions for $u^i_q(x)$: $\mathrm{cov}[f_d(x), f_{d'}(x')] = \sum_{q=1}^{Q} \sum_{q'=1}^{Q} \sum_{i=1}^{R_q} \sum_{i'=1}^{R_{q'}} a^i_{d,q} a^{i'}_{d',q'} \mathrm{cov}[u^i_q(x), u^{i'}_{q'}(x')]$. [sent-98, score-0.332]

17 The covariance matrix for $f_d$ is obtained by expressing Equation (3) as $\mathrm{cov}[f_d, f_{d'}] = \sum_{q=1}^{Q} \sum_{i=1}^{R_q} a^i_{d,q} a^i_{d',q} K_q = \sum_{q=1}^{Q} b^q_{d,d'} K_q$, where $K_q \in \Re^{N \times N}$ has entries given by computing $k_q(x, x')$ for all combinations from $X$. [sent-104, score-0.516]

18 We can now write the covariance matrix for the joint process over $f$ as $K_{f,f} = \sum_{q=1}^{Q} A_q A_q^{\top} \otimes K_q = \sum_{q=1}^{Q} B_q \otimes K_q$ (4), where the symbol $\otimes$ denotes the Kronecker product, $A_q \in \Re^{D \times R_q}$ has entries $a^i_{d,q}$ and $B_q = A_q A_q^{\top} \in \Re^{D \times D}$ has entries $b^q_{d,d'}$ and is known as the coregionalization matrix. [sent-109, score-0.483]

19 The covariance matrix $K_{f,f}$ is positive semi-definite as long as the coregionalization matrices $B_q$ are positive semi-definite and $k_q(x, x')$ is a valid covariance function. [sent-110, score-0.645]
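
To make the Kronecker structure of Equation (4) concrete, the following is a minimal numerical sketch (not the authors' code; the rbf helper and all parameter values are illustrative assumptions) that assembles $K_{f,f} = \sum_q B_q \otimes K_q$ for D outputs observed at a common set of N inputs and checks the positive semi-definiteness condition just stated.

```python
import numpy as np

def rbf(X, X2, lengthscale=1.0):
    """Squared exponential covariance k_q(x, x') evaluated on two input sets."""
    d2 = np.sum((X[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def build_lmc_covariance(X, A_list, lengthscales):
    """K_{f,f} = sum_q (A_q A_q^T) kron K_q, with A_q of shape (D, R_q)."""
    D, N = A_list[0].shape[0], X.shape[0]
    Kff = np.zeros((D * N, D * N))
    for A_q, ell in zip(A_list, lengthscales):
        B_q = A_q @ A_q.T              # coregionalization matrix, PSD by construction
        K_q = rbf(X, X, ell)           # valid covariance on the inputs
        Kff += np.kron(B_q, K_q)       # one Kronecker term per group of latent functions
    return Kff

X = np.linspace(0, 1, 20)[:, None]
A_list = [np.random.randn(3, 2), np.random.randn(3, 1)]   # D = 3, R_1 = 2, R_2 = 1
Kff = build_lmc_covariance(X, A_list, lengthscales=[0.2, 0.5])
print(np.min(np.linalg.eigvalsh(Kff)) > -1e-8)            # K_{f,f} is (numerically) PSD
```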

20 By definition, coregionalization matrices Bq fulfill the positive semi-definiteness requirement. [sent-111, score-0.296]

21 The covariance functions for the latent processes, $k_q(x, x')$, can simply be chosen from the wide variety of covariance functions (reproducing kernels) that are available. [sent-112, score-0.446]

22 The linear model of coregionalization represents the covariance function as a product of the contributions of two covariance functions. [sent-116, score-0.472]

23 Firstly we sample a set of independent processes from the covariance functions given by kq (x, x′ ), taking Rq independent samples for each kq (x, x′ ). [sent-119, score-0.478]
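
The two-step sampling view just described can be mirrored in a few lines. The sketch below is an assumption-laden illustration (shapes, lengthscales and the rbf helper are my own choices, not the paper's): draw $R_q$ independent latent functions from each covariance $k_q$ and mix them instantaneously with weights $a^i_{d,q}$.

```python
import numpy as np

def rbf(X, lengthscale):
    """Squared exponential covariance on a single set of scalar inputs."""
    d2 = np.subtract.outer(X[:, 0], X[:, 0]) ** 2
    return np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(0)
X = np.linspace(0, 1, 50)[:, None]
lengthscales = [0.1, 0.4]                        # one covariance k_q per group, Q = 2
R = [2, 1]                                       # R_q independent samples share each k_q
D = 3
A = [rng.standard_normal((D, r)) for r in R]     # mixing weights a^i_{d,q}

# Latent draws u^i_q(x): for each q, an (R_q, N) array of samples from GP(0, k_q).
U = [rng.multivariate_normal(np.zeros(len(X)), rbf(X, ell) + 1e-8 * np.eye(len(X)), size=r)
     for ell, r in zip(lengthscales, R)]

# Instantaneous mixing: f_d(x) = sum_q sum_i a^i_{d,q} u^i_q(x), giving a (D, N) array.
F = sum(A_q @ U_q for A_q, U_q in zip(A, U))
```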

24 Intrinsic Coregionalization Model. A simplified version of the LMC, known as the intrinsic coregionalization model (ICM) (Goovaerts, 1997), assumes that the elements $b^q_{d,d'}$ of the coregionalization matrix $B_q$ can be written as $b^q_{d,d'} = \upsilon_{d,d'} b_q$. [sent-125, score-0.933]

25 In other words, as a scaled version of the elements bq which do not depend on the particular output functions fd (x). [sent-126, score-0.349]

26 Using this form for $b^q_{d,d'}$, Equation (3) can be expressed as $\mathrm{cov}[f_d(x), f_{d'}(x')] = \sum_{q=1}^{Q} \upsilon_{d,d'} b_q k_q(x, x') = \upsilon_{d,d'} \sum_{q=1}^{Q} b_q k_q(x, x')$. [sent-127, score-0.371]

27 The covariance matrix for $f$ takes the form $K_{f,f} = \Upsilon \otimes K$ (5), where $\Upsilon \in \Re^{D \times D}$, with entries $\upsilon_{d,d'}$, and $K = \sum_{q=1}^{Q} b_q K_q$ is an equivalent valid covariance function. [sent-128, score-0.547]

28 The intrinsic coregionalization model can also be seen as a linear model of coregionalization where we have Q = 1. [sent-129, score-0.636]

29 In such a case, Equation (4) takes the form $K_{f,f} = A_1 A_1^{\top} \otimes K_1 = B_1 \otimes K_1$ (6), where the coregionalization matrix $B_1$ has elements $b^1_{d,d'} = \sum_{i=1}^{R_1} a^i_{d,1} a^i_{d',1}$. [sent-130, score-0.296]
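
Under the same illustrative conventions, the ICM of Equation (6) pairs a single latent covariance $K_1$ with a coregionalization matrix $B_1 = A_1 A_1^{\top}$ whose rank is set by the number of columns $R_1$ of $A_1$; the short sketch below is only meant to show that structure.

```python
import numpy as np

def icm_covariance(K1, A1):
    """K_{f,f} = (A_1 A_1^T) kron K_1, with A_1 of shape (D, R_1)."""
    return np.kron(A1 @ A1.T, K1)

N, D = 20, 3
t = np.linspace(0, 1, N)
K1 = np.exp(-0.5 * (np.subtract.outer(t, t) / 0.2) ** 2)   # single shared covariance K_1
Kff_rank1 = icm_covariance(K1, np.random.randn(D, 1))      # R_1 = 1: rank-one B_1
Kff_rank2 = icm_covariance(K1, np.random.randn(D, 2))      # increasing R_1 raises the rank of B_1
```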

30 As pointed out by Goovaerts (1997), the ICM is much more restrictive than the LMC since it assumes that each basic covariance kq (x, x′ ) contributes equally to the construction of the autocovariances and cross covariances for the outputs. [sent-132, score-0.29]

31 The functions $u_q(x)$ are considered to be latent factors and the semiparametric name comes from the fact that it combines a nonparametric model, that is, a Gaussian process, with a parametric linear mixing of the functions $u_q(x)$. [sent-145, score-0.296]

32 The intrinsic coregionalization model has been employed in Bonilla et al. [sent-150, score-0.34]

33 It can be shown that if the outputs are considered to be noise-free, prediction using the intrinsic coregionalization model under an isotopic data case is equivalent to independent prediction over each output (Helterbrand and Cressie, 1994). [sent-163, score-0.537]

34 The intrinsic coregionalization model has been also used in Osborne et al. [sent-167, score-0.34]

35 Under the same independence assumptions used in the LMC, the covariance between $f_d(x)$ and $f_{d'}(x')$ follows $\mathrm{cov}[f_d(x), f_{d'}(x')] = \sum_{q=1}^{Q} \sum_{i=1}^{R_q} \int_{\mathcal{X}} G^i_{d,q}(x - z) \int_{\mathcal{X}} G^i_{d',q}(x' - z') k_q(z, z')\, dz'\, dz$ (8). [sent-215, score-0.44]

36 Specifying $G^i_{d,q}(x - z)$ and $k_q(z, z')$ in Equation (8), the covariance for the outputs $f_d(x)$ can be constructed indirectly. [sent-216, score-0.52]
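
For scalar inputs, the double integral in Equation (8) can be evaluated numerically to see how the covariance is constructed indirectly. The sketch below assumes a single latent function (Q = R_q = 1), Gaussian smoothing kernels with hand-picked widths and scipy quadrature; none of these choices come from the paper.

```python
import numpy as np
from scipy.integrate import dblquad

def G(d, x, widths={0: 0.1, 1: 0.3}):
    """Gaussian smoothing kernel for output d (unit area, output-specific width)."""
    w = widths[d]
    return np.exp(-0.5 * (x / w) ** 2) / np.sqrt(2 * np.pi * w ** 2)

def k_latent(z, zp, ell=0.2):
    """Covariance of the latent process u(z)."""
    return np.exp(-0.5 * ((z - zp) / ell) ** 2)

def cov_outputs(d, dp, x, xp):
    """cov[f_d(x), f_d'(x')] = integral of G_d(x - z) G_d'(x' - z') k(z, z') dz' dz."""
    integrand = lambda zp, z: G(d, x - z) * G(dp, xp - zp) * k_latent(z, zp)
    val, _ = dblquad(integrand, -3.0, 3.0, lambda z: -3.0, lambda z: 3.0)
    return val

print(cov_outputs(0, 1, 0.5, 0.6))   # cross covariance between the two outputs
```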

37 Note that if the smoothing kernels are taken to be the Dirac delta function, such that $G^i_{d,q}(x - z) = a^i_{d,q} \delta(x - z)$, where $\delta(\cdot)$ is the Dirac delta function, the double integral is easily solved and the linear model of coregionalization is recovered. [sent-217, score-0.319]

38 The traditional approach to convolution processes in statistics and signal processing is to assume that the latent functions $u_q(z)$ are independent white Gaussian noise processes, $k_q(z, z') = \sigma_q^2 \delta(z - z')$. [sent-220, score-0.448]

39 In general, though, we can consider any type of latent process; for example, we could assume GPs for the latent functions with general covariances $k_q(z, z')$. [sent-222, score-0.396]

40 As well as this covariance across outputs, the covariance between the latent function, $u^i_q(z)$, and any given output, $f_d(x)$, can be computed: $\mathrm{cov}[f_d(x), u^i_q(z)] = \int_{\mathcal{X}} G^i_{d,q}(x - z') k_q(z', z)\, dz'$ (9). [sent-223, score-0.625]

41 Additionally, we can corrupt each of the outputs of the convolutions with an independent process (which could also include a noise term), $w_d(x)$, to obtain $y_d(x) = f_d(x) + w_d(x)$. [sent-224, score-0.334]

42 This framework can be extended to the multiple output case, expressing the outputs as $f_d(x) = \int G_d(x, z)\, \gamma(dz)$. [sent-246, score-0.383]

43 A simple general purpose kernel for multiple outputs based on the convolution integral can be constructed by assuming that the kernel smoothing function, $G_{d,q}(x)$, and the covariance for the latent function, $k_q(x, x')$, both follow a Gaussian form. [sent-251, score-0.576]

44 Here $S_{d,q}$ is a variance coefficient that depends both on the output $d$ and the latent function $q$, and $P_d$ is the precision matrix associated with the particular output $d$. [sent-257, score-0.285]

45 The covariance function for the latent process is expressed as $k_q(x, x') = \mathcal{N}(x - x' \mid 0, \Lambda_q^{-1})$, with $\Lambda_q$ the precision matrix of the latent function $q$. [sent-258, score-0.455]
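
Because the convolution of Gaussians is again Gaussian, these choices lead to a closed-form cross covariance of the form $S_{d,q} S_{d',q}\, \mathcal{N}(x - x' \mid 0, P_d^{-1} + P_{d'}^{-1} + \Lambda_q^{-1})$ per latent function; the exact scaling convention in the sketch below is an assumption on my part, with the paper's Equation (12) fixing the precise factors.

```python
import numpy as np

def gaussian_density(tau, C):
    """N(tau | 0, C) for a vector tau and a covariance matrix C."""
    p = len(tau)
    norm = 1.0 / np.sqrt((2 * np.pi) ** p * np.linalg.det(C))
    return norm * np.exp(-0.5 * tau @ np.linalg.solve(C, tau))

def cmoc_term(x, xp, S_d, S_dp, P_d_inv, P_dp_inv, Lambda_q_inv):
    """One latent function's contribution to cov[f_d(x), f_d'(x')] (assumed scaling)."""
    C = P_d_inv + P_dp_inv + Lambda_q_inv
    return S_d * S_dp * gaussian_density(x - xp, C)

x, xp = np.array([0.3, 0.1]), np.array([0.5, 0.2])
P_inv = np.diag([0.05, 0.05])        # output-specific inverse precision (assumed values)
L_inv = np.diag([0.10, 0.10])        # latent-function inverse precision (assumed values)
print(cmoc_term(x, xp, 1.0, 0.8, P_inv, P_inv, L_inv))
```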

46 As we have mentioned before, the main focus of this paper is to present some efficient approximations for the multiple output convolved Gaussian Process. [sent-290, score-0.376]

47 We compare the performance of the intrinsic coregionalization model (Section 2. [sent-296, score-0.34]

48 We randomly select D = 50 genes from replica 1 for training a full multiple output GP model based on either the LMC framework or the convolved GP framework. [sent-309, score-0.634]

49 The corresponding 50 genes of replica 2 are used for testing and results are presented in terms of the standardized mean square error (SMSE) and the mean standardized log loss (MSLL) as defined in Rasmussen and Williams (2006). [sent-310, score-0.417]
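
For reference, here is a minimal sketch (my own helper functions, not the paper's code) of the two error measures cited from Rasmussen and Williams (2006): SMSE normalizes the squared error by the variance of the test targets, and MSLL subtracts from the negative log predictive density the loss of a trivial Gaussian fitted to the training targets, so negative values indicate an improvement over that baseline.

```python
import numpy as np

def smse(y_test, mu_pred):
    """Standardized mean square error: MSE divided by the variance of the test targets."""
    return np.mean((y_test - mu_pred) ** 2) / np.var(y_test)

def msll(y_test, mu_pred, var_pred, y_train):
    """Mean standardized log loss relative to a trivial Gaussian fit to the training data."""
    nlpd_model = 0.5 * np.log(2 * np.pi * var_pred) + (y_test - mu_pred) ** 2 / (2 * var_pred)
    m, v = np.mean(y_train), np.var(y_train)
    nlpd_trivial = 0.5 * np.log(2 * np.pi * v) + (y_test - m) ** 2 / (2 * v)
    return np.mean(nlpd_model - nlpd_trivial)
```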

50 We also repeated the experiment selecting the 50 genes for training from replica 2 and the corresponding 50 genes of replica 1 for testing. [sent-313, score-0.524]

51 We are interested in a reduced representation of the data, so we assume that Q = 1 and $R_q$ = 1 for the LMC and the convolved multiple output GP in Equations (2) and (8), respectively. [sent-317, score-0.329]

52 Following Bonilla et al. (2008), we assume an incomplete Cholesky decomposition for $B_1 = LL^{\top}$, where $L \in \Re^{50 \times 1}$, and as the basic covariance $k_q(x, x')$ we assume the squared exponential covariance function. [sent-319, score-0.349]

53 For the convolved multiple output GP we employ the covariance described in Section 3, Equation (12), with the appropriate scaling factors. [sent-321, score-0.417]

54 It can be seen that the convolved multiple output covariance (appearing as CMOC in the table), outperforms the LMC covariance both in terms of SMSE and MSLL. [sent-344, score-0.505]

55 Figures 1(a) and 1(c) show the response of the LMC and Figures 1(b) and 1(d) show the response of the convolved multiple output covariance. [sent-347, score-0.329]

56 The linear model of coregionalization is driven by a latent function with a length-scale that is shared across the outputs. [sent-349, score-0.393]

57 On the other hand, due to the non-instantaneous mixing of the latent function, the convolved multiple output framework allows the description of each output using its own length-scale, which gives added flexibility for describing the data. [sent-351, score-0.557]

58 CMOC outperforms the linear model of coregionalization for both genes in terms of SMSE and MSLL. [sent-353, score-0.378]

59 Again, CMOC outperforms the linear model of coregionalization for both genes in terms of SMSE and MSLL. [sent-357, score-0.378]

60 The training data comes from replica 1 and the testing data from replica 2. [sent-391, score-0.36]

61 For the intrinsic coregionalization model, we would fix the value of Q = 1 and increase the value of R1 . [sent-396, score-0.34]

62 Effectively, we would be increasing the rank of the coregionalization matrix B1 , meaning that more latent functions sampled from the same covariance function are being used to explain the data. [sent-397, score-0.481]

63 In practice, though, as we will see in the experimental section, both the linear model of coregionalization and the convolved multiple output GPs can perform equally well on some data sets. [sent-422, score-0.625]

64 However, the convolved covariance could offer an explanation of the data through a simpler model or converge to the LMC, if needed. [sent-423, score-0.293]

65 The difference with Figure 1 is that now the training data comes from replica 2 while the testing data comes from replica 1. [sent-473, score-0.36]

66 When possible, we first compare the convolved multiple output GP method against the intrinsic coregionalization model and the semiparametric latent factor model. [sent-690, score-0.83]

67 We generate N = 500 observation points for each output and use 200 observation points (per output) for training the full and the approximated multiple output GP and the remaining 300 observation points for testing. [sent-700, score-0.276]

68 From the multiple output point of view, each school represents one output and the exam score of each student a particular instantiation of that output, giving D = 139. [sent-848, score-0.394]

69 We compare the multi-task GP (Bonilla et al., 2008), the intrinsic coregionalization model, the semiparametric latent factor model and convolved multiple output GPs. [sent-924, score-0.83]

70 Explained variance (%) is reported for the multi-task GP (Bonilla et al., 2008), the intrinsic coregionalization model (R1 = 1, R1 = 2, R1 = 5), the semiparametric latent factor model (Q = 2, Q = 5), and convolved multiple output GPs (Q = 1, Rq = 1). [sent-927, score-1.082]

71 The value of R1 in the multi-task GP and in the intrinsic coregionalization model indicates the rank of the matrix B1 in Equation (6). [sent-946, score-0.34]

72 The value of Rq in the convolved multiple output GP refers to the number of latent functions that share the same number of parameters (see Equation 8). [sent-948, score-0.426]

73 For the semiparametric latent factor model, all the latent functions use covariance functions with Gaussian forms. [sent-956, score-0.346]

74 For the convolved multiple output covariance result, the kernel employed was introduced in Section 3. [sent-959, score-0.417]

75 Results for ICM with R1 = 1, SLFM with Q = 2 and the convolved covariance are similar within the standard deviations. [sent-963, score-0.293]

76 The convolved GP was able to recover the best performance using only one latent function (Q = 1). [sent-964, score-0.302]

77 With 1000 iterations DTC with 5 inducing points offers a speed up factor of 24 times over the ICM with R1 = 1 and a speed up factor of 137 over the full convolved multiple output method. [sent-986, score-0.477]
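
To illustrate what the DTC, FITC and PITC approximations do to the full covariance, here is a hedged sketch (helper names and the block layout are my assumptions) of the low-rank plus diagonal or block-diagonal structures they impose through a set of inducing points; in the multiple output setting the PITC blocks correspond to individual outputs, which is where the conditional independencies exploited in the paper enter.

```python
import numpy as np

def nystrom(Kfu, Kuu):
    """Low-rank term Q_{f,f} = K_{f,u} K_{u,u}^{-1} K_{u,f} induced by the inducing points."""
    return Kfu @ np.linalg.solve(Kuu, Kfu.T)

def dtc(Kff, Kfu, Kuu):
    return nystrom(Kfu, Kuu)

def fitc(Kff, Kfu, Kuu):
    Q = nystrom(Kfu, Kuu)
    return Q + np.diag(np.diag(Kff - Q))        # keep the exact marginal variances

def pitc(Kff, Kfu, Kuu, block_slices):
    Q = nystrom(Kfu, Kuu)
    R = np.zeros_like(Kff)
    for s in block_slices:                      # e.g., slice(d*N, (d+1)*N) for each output d
        R[s, s] = (Kff - Q)[s, s]               # keep the exact within-output covariance block
    return Q + R
```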

78 In the table, CMOGP stands for convolved multiple outputs GP. [sent-1012, score-0.371]

79 To summarize this example, we have shown that the convolved multiple output GP offers a similar performance to the ICM and SLFM methods. [sent-1016, score-0.329]

80 Moreover, this example involved a data set with relatively high input and output dimensionality, for which the convolved covariance has not been used before in the literature. [sent-1018, score-0.293]

81 We compare results of independent GPs, ordinary cokriging, the intrinsic coregionalization model, the semiparametric latent factor model and the convolved multiple output covariance. [sent-1036, score-0.83]

82 Different cokriging methods assume that each output can be decomposed as a sum of a residual component with zero mean and non-zero covariance function and a trend component. [sent-1042, score-0.321]

83 Whichever cokriging method is used, the values of the covariance for the residual component enter the prediction equations, making explicit the need for a positive semi-definite covariance function. [sent-1045, score-0.291]

84 Average mean absolute error is reported for ordinary cokriging (pp. 248–249, Goovaerts, 1997), the intrinsic coregionalization model (R1 = 2), the semiparametric latent factor model (Q = 2), and convolved multiple output GPs (Q = 2, Rq = 1). [sent-1047, score-0.393]

85 For the intrinsic coregionalization model and the semiparametric latent factor model we use a Gaussian covariance with different lengthscales along each input dimension. [sent-1059, score-0.589]

86 For the convolved multiple output covariance, we use the covariance described in Section 3. [sent-1060, score-0.417]

87 A common algorithm to fit the linear model of coregionalization minimizes some error measure between a sample or experimental covariance matrix obtained from the data and the particular matrix obtained from the form chosen for the linear model of coregionalization (Goulard and Voltz, 1992). [sent-1063, score-0.68]

88 Also in Table 7, we present results using the intrinsic coregionalization model with a rank-two (R1 = 2) matrix B1, the semiparametric latent factor model with two latent functions (Q = 2) and the convolved multiple output covariance with two latent functions (Q = 2 and Rq = 1). [sent-1069, score-1.112]

89 In fact, the linear model of coregionalization employed is constructed using variograms as basic tools that account for the dependencies in the input space. [sent-1083, score-0.296]

90 We also include in the figure the results for the convolved multiple output GP (CM2), semiparametric latent factor model (S2), intrinsic coregionalization model (IC2), ordinary cokriging (CO) and independent GPs (IND). [sent-1102, score-0.945]

91 In the table, CMOGP stands for convolved multiple outputs GP. [sent-1127, score-0.371]

92 The training times for DTC with 200 inducing points and PITC with 200 inducing points, which are the first methods that reach the performance of the full GP, are less than any of the times of the full GP methods. [sent-1150, score-0.326]

93 Table 9: Standardized mean square error (SMSE), mean standardized log loss (MSLL) and training time per iteration (TTPI) for the gene expression data for 1000 outputs using the efficient approximations for the convolved multiple output GP. [sent-1207, score-0.716]

94 This pattern repeats itself when the training data comes from replica 1 or from replica 2. [sent-1212, score-0.36]

95 Performances for DTC, FITC and PITC are shown in Table 10 (last six rows); they compare favourably with the performances for the linear model of coregionalization in Table 2 and are close to the performances for the CMOC. [sent-1220, score-0.366]

96 The training data comes from replica 1 and the testing data from replica 2. [sent-1266, score-0.36]

97 Then we illustrated how the linear model of coregionalization can be interpreted as an instantaneous mixing of latent functions, in contrast to a convolved multiple output framework, where the mixing is not instantaneous. [sent-1277, score-0.828]

98 The training data comes now from replica 2 and the testing data from replica 1. [sent-1327, score-0.36]

99 As a byproduct of seeing the linear model of coregionalization as a particular case of the convolved GPs, we can extend all the approximations to work under the linear model of coregionalization regime. [sent-1367, score-0.844]

100 Linear coregionalization model: Tools for estimation and choice of cross-variogram matrix. [sent-1555, score-0.296]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pitc', 0.313), ('coregionalization', 0.296), ('smse', 0.263), ('fitc', 0.258), ('msll', 0.252), ('dtc', 0.238), ('convolved', 0.205), ('lmc', 0.181), ('kq', 0.173), ('replica', 0.165), ('fd', 0.156), ('rq', 0.138), ('gp', 0.133), ('awrence', 0.121), ('inducing', 0.12), ('cmoc', 0.115), ('cokriging', 0.115), ('omputationally', 0.11), ('onvolved', 0.11), ('outputs', 0.103), ('bq', 0.099), ('icm', 0.098), ('lvarez', 0.098), ('latent', 0.097), ('gps', 0.094), ('output', 0.094), ('rocesses', 0.093), ('covariance', 0.088), ('bonilla', 0.088), ('convolution', 0.085), ('utput', 0.084), ('exam', 0.082), ('genes', 0.082), ('ultiple', 0.077), ('alvarez', 0.077), ('slfm', 0.077), ('fficient', 0.072), ('cadmium', 0.071), ('semiparametric', 0.064), ('standardized', 0.061), ('gene', 0.061), ('aussian', 0.061), ('goovaerts', 0.055), ('kf', 0.055), ('kfd', 0.055), ('qui', 0.05), ('uq', 0.049), ('neil', 0.047), ('approximations', 0.047), ('multitask', 0.046), ('intrinsic', 0.044), ('processes', 0.044), ('barry', 0.044), ('hoef', 0.044), ('ver', 0.042), ('rasmussen', 0.04), ('cov', 0.04), ('yy', 0.039), ('higdon', 0.038), ('mauricio', 0.038), ('figures', 0.038), ('mixing', 0.037), ('expression', 0.037), ('ten', 0.035), ('performances', 0.035), ('stands', 0.033), ('pollutant', 0.033), ('instantaneous', 0.032), ('likelihood', 0.031), ('geostatistics', 0.031), ('predictive', 0.03), ('training', 0.03), ('multiple', 0.03), ('lawrence', 0.03), ('gaussian', 0.03), ('covariances', 0.029), ('yd', 0.029), ('full', 0.028), ('cmogp', 0.027), ('shef', 0.027), ('snelson', 0.027), ('christopher', 0.027), ('editors', 0.025), ('locations', 0.025), ('mean', 0.024), ('conditional', 0.024), ('aq', 0.024), ('transcription', 0.024), ('diag', 0.023), ('cressie', 0.023), ('swiss', 0.023), ('titsias', 0.023), ('wd', 0.023), ('secs', 0.023), ('osborne', 0.023), ('kernels', 0.023), ('approximation', 0.023), ('spatial', 0.022), ('boyle', 0.022), ('calder', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes

Author: Mauricio A. Álvarez, Neil D. Lawrence

Abstract: Recently there has been an increasing interest in regression methods that deal with multiple outputs. This has been motivated partly by frameworks like multitask learning, multisensor networks or structured output data. From a Gaussian processes perspective, the problem reduces to specifying an appropriate covariance function that, whilst being positive semi-definite, captures the dependencies between all the data points and across all the outputs. One approach to account for non-trivial correlations between outputs employs convolution processes. Under a latent function interpretation of the convolution transform we establish dependencies between output variables. The main drawbacks of this approach are the associated computational and storage demands. In this paper we address these issues. We present different efficient approximations for dependent output Gaussian processes constructed through the convolution formalism. We exploit the conditional independencies present naturally in the model. This leads to a form of the covariance similar in spirit to the so called PITC and FITC approximations for a single output. We show experimental results with synthetic and real data, in particular, we show results in school exams score prediction, pollution prediction and gene expression data. Keywords: Gaussian processes, convolution processes, efficient approximations, multitask learning, structured outputs, multivariate processes

2 0.055334948 35 jmlr-2011-Forest Density Estimation

Author: Han Liu, Min Xu, Haijie Gu, Anupam Gupta, John Lafferty, Larry Wasserman

Abstract: We study graph estimation and density estimation in high dimensions, using a family of density estimators based on forest structured undirected graphical models. For density estimation, we do not assume the true distribution corresponds to a forest; rather, we form kernel density estimates of the bivariate and univariate marginals, and apply Kruskal’s algorithm to estimate the optimal forest on held out data. We prove an oracle inequality on the excess risk of the resulting estimator relative to the risk of the best forest. For graph estimation, we consider the problem of estimating forests with restricted tree sizes. We prove that finding a maximum weight spanning forest with restricted tree size is NP-hard, and develop an approximation algorithm for this problem. Viewing the tree size as a complexity parameter, we then select a forest using data splitting, and prove bounds on excess risk and structure selection consistency of the procedure. Experiments with simulated data and microarray data indicate that the methods are a practical alternative to Gaussian graphical models. Keywords: kernel density estimation, forest structured Markov network, high dimensional inference, risk consistency, structure selection consistency

3 0.055157959 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods

Author: Aad van der Vaart, Harry van Zanten

Abstract: We consider the quality of learning a response function by a nonparametric Bayesian approach using a Gaussian process (GP) prior on the response function. We upper bound the quadratic risk of the learning procedure, which in turn is an upper bound on the Kullback-Leibler information between the predictive and true data distribution. The upper bound is expressed in small ball probabilities and concentration measures of the GP prior. We illustrate the computation of the upper bound for the Matérn and squared exponential kernels. For these priors the risk, and hence the information criterion, tends to zero for all continuous response functions. However, the rate at which this happens depends on the combination of true response function and Gaussian prior, and is expressible in a certain concentration function. In particular, the results show that for good performance, the regularity of the GP prior should match the regularity of the unknown response function. Keywords: Bayesian learning, Gaussian prior, information rate, risk, Matérn kernel, squared exponential kernel

4 0.049859911 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

Author: Ricardo Henao, Ole Winther

Abstract: In this paper we consider sparse and identifiable linear latent variable (factor) and linear Bayesian network models for parsimonious analysis of multivariate data. We propose a computationally efficient method for joint parameter and model inference, and model comparison. It consists of a fully Bayesian hierarchy for sparse models using slab and spike priors (two-component δ-function and continuous mixtures), non-Gaussian latent factors and a stochastic search over the ordering of the variables. The framework, which we call SLIM (Sparse Linear Identifiable Multivariate modeling), is validated and bench-marked on artificial and real biological data sets. SLIM is closest in spirit to LiNGAM (Shimizu et al., 2006), but differs substantially in inference, Bayesian network structure learning and model comparison. Experimentally, SLIM performs equally well or better than LiNGAM with comparable computational complexity. We attribute this mainly to the stochastic search strategy used, and to parsimony (sparsity and identifiability), which is an explicit part of the model. We propose two extensions to the basic i.i.d. linear framework: non-linear dependence on observed variables, called SNIM (Sparse Non-linear Identifiable Multivariate modeling) and allowing for correlations between latent variables, called CSLIM (Correlated SLIM), for the temporal and/or spatial data. The source code and scripts are available from http://cogsys.imm.dtu.dk/slim/. Keywords: parsimony, sparsity, identifiability, factor models, linear Bayesian networks

5 0.048810355 27 jmlr-2011-Domain Decomposition Approach for Fast Gaussian Process Regression of Large Spatial Data Sets

Author: Chiwoo Park, Jianhua Z. Huang, Yu Ding

Abstract: Gaussian process regression is a flexible and powerful tool for machine learning, but the high computational complexity hinders its broader applications. In this paper, we propose a new approach for fast computation of Gaussian process regression with a focus on large spatial data sets. The approach decomposes the domain of a regression function into small subdomains and infers a local piece of the regression function for each subdomain. We explicitly address the mismatch problem of the local pieces on the boundaries of neighboring subdomains by imposing continuity constraints. The new approach has comparable or better computation complexity as other competing methods, but it is easier to be parallelized for faster computation. Moreover, the method can be adaptive to non-stationary features because of its local nature and, in particular, its use of different hyperparameters of the covariance function for different local regions. We illustrate application of the method and demonstrate its advantages over existing methods using two synthetic data sets and two real spatial data sets. Keywords: domain decomposition, boundary value problem, Gaussian process regression, parallel computation, spatial prediction

6 0.045646295 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models

7 0.045606814 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood

8 0.045208622 12 jmlr-2011-Bayesian Co-Training

9 0.042938638 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

10 0.04127143 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models

11 0.037923846 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review

12 0.037341654 54 jmlr-2011-Learning Latent Tree Graphical Models

13 0.036913268 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices

14 0.035425935 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods

15 0.031501208 105 jmlr-2011-lp-Norm Multiple Kernel Learning

16 0.029971862 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms

17 0.029319255 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models

18 0.028183358 55 jmlr-2011-Learning Multi-modal Similarity

19 0.026380431 66 jmlr-2011-Multiple Kernel Learning Algorithms

20 0.025816219 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.146), (1, -0.083), (2, -0.038), (3, -0.017), (4, 0.02), (5, -0.047), (6, -0.028), (7, 0.017), (8, -0.025), (9, 0.004), (10, 0.095), (11, 0.029), (12, 0.02), (13, -0.024), (14, 0.027), (15, 0.041), (16, 0.068), (17, -0.003), (18, 0.09), (19, 0.216), (20, -0.126), (21, 0.123), (22, 0.061), (23, -0.012), (24, -0.012), (25, 0.105), (26, -0.126), (27, 0.005), (28, 0.142), (29, 0.002), (30, -0.113), (31, 0.12), (32, -0.027), (33, -0.126), (34, -0.096), (35, -0.035), (36, -0.262), (37, 0.194), (38, 0.108), (39, 0.09), (40, 0.002), (41, -0.082), (42, 0.126), (43, 0.153), (44, 0.412), (45, -0.016), (46, -0.103), (47, -0.057), (48, 0.299), (49, 0.106)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94625658 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes

Author: Mauricio A. Álvarez, Neil D. Lawrence

Abstract: Recently there has been an increasing interest in regression methods that deal with multiple outputs. This has been motivated partly by frameworks like multitask learning, multisensor networks or structured output data. From a Gaussian processes perspective, the problem reduces to specifying an appropriate covariance function that, whilst being positive semi-definite, captures the dependencies between all the data points and across all the outputs. One approach to account for non-trivial correlations between outputs employs convolution processes. Under a latent function interpretation of the convolution transform we establish dependencies between output variables. The main drawbacks of this approach are the associated computational and storage demands. In this paper we address these issues. We present different efficient approximations for dependent output Gaussian processes constructed through the convolution formalism. We exploit the conditional independencies present naturally in the model. This leads to a form of the covariance similar in spirit to the so called PITC and FITC approximations for a single output. We show experimental results with synthetic and real data, in particular, we show results in school exams score prediction, pollution prediction and gene expression data. Keywords: Gaussian processes, convolution processes, efficient approximations, multitask learning, structured outputs, multivariate processes

2 0.34580168 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

Author: Tony Jebara

Abstract: A multitask learning framework is developed for discriminative classification and regression where multiple large-margin linear classifiers are estimated for different prediction problems. These classifiers operate in a common input space but are coupled as they recover an unknown shared representation. A maximum entropy discrimination (MED) framework is used to derive the multitask algorithm which involves only convex optimization problems that are straightforward to implement. Three multitask scenarios are described. The first multitask method produces multiple support vector machines that learn a shared sparse feature selection over the input space. The second multitask method produces multiple support vector machines that learn a shared conic kernel combination. The third multitask method produces a pooled classifier as well as adaptively specialized individual classifiers. Furthermore, extensions to regression, graphical model structure estimation and other sparse methods are discussed. The maximum entropy optimization problems are implemented via a sequential quadratic programming method which leverages recent progress in fast SVM solvers. Fast monotonic convergence bounds are provided by bounding the MED sparsifying cost function with a quadratic function and ensuring only a constant factor runtime increase above standard independent SVM solvers. Results are shown on multitask data sets and favor multitask learning over single-task or tabula rasa methods. Keywords: meta-learning, support vector machines, feature selection, kernel selection, maximum entropy, large margin, Bayesian methods, variational bounds, classification, regression, Lasso, graphical model structure estimation, quadratic programming, convex programming

3 0.26630387 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

Author: Ricardo Henao, Ole Winther

Abstract: In this paper we consider sparse and identifiable linear latent variable (factor) and linear Bayesian network models for parsimonious analysis of multivariate data. We propose a computationally efficient method for joint parameter and model inference, and model comparison. It consists of a fully Bayesian hierarchy for sparse models using slab and spike priors (two-component δ-function and continuous mixtures), non-Gaussian latent factors and a stochastic search over the ordering of the variables. The framework, which we call SLIM (Sparse Linear Identifiable Multivariate modeling), is validated and bench-marked on artificial and real biological data sets. SLIM is closest in spirit to LiNGAM (Shimizu et al., 2006), but differs substantially in inference, Bayesian network structure learning and model comparison. Experimentally, SLIM performs equally well or better than LiNGAM with comparable computational complexity. We attribute this mainly to the stochastic search strategy used, and to parsimony (sparsity and identifiability), which is an explicit part of the model. We propose two extensions to the basic i.i.d. linear framework: non-linear dependence on observed variables, called SNIM (Sparse Non-linear Identifiable Multivariate modeling) and allowing for correlations between latent variables, called CSLIM (Correlated SLIM), for the temporal and/or spatial data. The source code and scripts are available from http://cogsys.imm.dtu.dk/slim/. Keywords: parsimony, sparsity, identifiability, factor models, linear Bayesian networks

4 0.26580408 12 jmlr-2011-Bayesian Co-Training

Author: Shipeng Yu, Balaji Krishnapuram, Rómer Rosales, R. Bharat Rao

Abstract: Co-training (or more generally, co-regularization) has been a popular algorithm for semi-supervised learning in data with two feature representations (or views), but the fundamental assumptions underlying this type of models are still unclear. In this paper we propose a Bayesian undirected graphical model for co-training, or more generally for semi-supervised multi-view learning. This makes explicit the previously unstated assumptions of a large class of co-training type algorithms, and also clarifies the circumstances under which these assumptions fail. Building upon new insights from this model, we propose an improved method for co-training, which is a novel co-training kernel for Gaussian process classifiers. The resulting approach is convex and avoids local-maxima problems, and it can also automatically estimate how much each view should be trusted to accommodate noisy or unreliable views. The Bayesian co-training approach can also elegantly handle data samples with missing views, that is, some of the views are not available for some data points at learning time. This is further extended to an active sensing framework, in which the missing (sample, view) pairs are actively acquired to improve learning performance. The strength of active sensing model is that one actively sensed (sample, view) pair would improve the joint multi-view classification on all the samples. Experiments on toy data and several real world data sets illustrate the benefits of this approach. Keywords: co-training, multi-view learning, semi-supervised learning, Gaussian processes, undirected graphical models, active sensing

5 0.25871813 27 jmlr-2011-Domain Decomposition Approach for Fast Gaussian Process Regression of Large Spatial Data Sets

Author: Chiwoo Park, Jianhua Z. Huang, Yu Ding

Abstract: Gaussian process regression is a flexible and powerful tool for machine learning, but the high computational complexity hinders its broader applications. In this paper, we propose a new approach for fast computation of Gaussian process regression with a focus on large spatial data sets. The approach decomposes the domain of a regression function into small subdomains and infers a local piece of the regression function for each subdomain. We explicitly address the mismatch problem of the local pieces on the boundaries of neighboring subdomains by imposing continuity constraints. The new approach has comparable or better computation complexity as other competing methods, but it is easier to be parallelized for faster computation. Moreover, the method can be adaptive to non-stationary features because of its local nature and, in particular, its use of different hyperparameters of the covariance function for different local regions. We illustrate application of the method and demonstrate its advantages over existing methods using two synthetic data sets and two real spatial data sets. Keywords: domain decomposition, boundary value problem, Gaussian process regression, parallel computation, spatial prediction

6 0.20962481 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods

7 0.2078415 68 jmlr-2011-Natural Language Processing (Almost) from Scratch

8 0.19694068 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models

9 0.19235654 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices

10 0.18973503 10 jmlr-2011-Anechoic Blind Source Separation Using Wigner Marginals

11 0.17496409 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership

12 0.1703466 54 jmlr-2011-Learning Latent Tree Graphical Models

13 0.16907251 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review

14 0.1587313 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models

15 0.15472011 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood

16 0.14891048 35 jmlr-2011-Forest Density Estimation

17 0.13515964 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods

18 0.13390176 50 jmlr-2011-LPmade: Link Prediction Made Easy

19 0.13353063 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

20 0.13174303 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.035), (6, 0.028), (8, 0.305), (9, 0.025), (10, 0.023), (24, 0.04), (31, 0.07), (32, 0.026), (33, 0.011), (40, 0.099), (41, 0.017), (60, 0.013), (65, 0.018), (66, 0.012), (67, 0.011), (70, 0.026), (73, 0.063), (78, 0.058), (90, 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.6946646 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes

Author: Mauricio A. Álvarez, Neil D. Lawrence

Abstract: Recently there has been an increasing interest in regression methods that deal with multiple outputs. This has been motivated partly by frameworks like multitask learning, multisensor networks or structured output data. From a Gaussian processes perspective, the problem reduces to specifying an appropriate covariance function that, whilst being positive semi-definite, captures the dependencies between all the data points and across all the outputs. One approach to account for non-trivial correlations between outputs employs convolution processes. Under a latent function interpretation of the convolution transform we establish dependencies between output variables. The main drawbacks of this approach are the associated computational and storage demands. In this paper we address these issues. We present different efficient approximations for dependent output Gaussian processes constructed through the convolution formalism. We exploit the conditional independencies present naturally in the model. This leads to a form of the covariance similar in spirit to the so called PITC and FITC approximations for a single output. We show experimental results with synthetic and real data, in particular, we show results in school exams score prediction, pollution prediction and gene expression data. Keywords: Gaussian processes, convolution processes, efficient approximations, multitask learning, structured outputs, multivariate processes

2 0.57434732 91 jmlr-2011-The Sample Complexity of Dictionary Learning

Author: Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein

Abstract: A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L2 error in representation when the dictionary is used. For the case of l1 regularized coefficient selection we provide a generalization bound of the order of $O(\sqrt{np \ln(m\lambda)/m})$, where n is the dimension, p is the number of elements in the dictionary, λ is a bound on the l1 norm of the coefficient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order $O(\sqrt{np \ln(mk)/m})$ under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as 1/m, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements. Keywords: dictionary learning, generalization bound, sparse representation

3 0.43386233 7 jmlr-2011-Adaptive Exact Inference in Graphical Models

Author: Özgür Sümer, Umut A. Acar, Alexander T. Ihler, Ramgopal R. Mettu

Abstract: Many algorithms and applications involve repeatedly solving variations of the same inference problem, for example to introduce new evidence to the model or to change conditional dependencies. As the model is updated, the goal of adaptive inference is to take advantage of previously computed quantities to perform inference more rapidly than from scratch. In this paper, we present algorithms for adaptive exact inference on general graphs that can be used to efficiently compute marginals and update MAP configurations under arbitrary changes to the input factor graph and its associated elimination tree. After a linear time preprocessing step, our approach enables updates to the model and the computation of any marginal in time that is logarithmic in the size of the input model. Moreover, in contrast to max-product our approach can also be used to update MAP configurations in time that is roughly proportional to the number of updated entries, rather than the size of the input model. To evaluate the practical effectiveness of our algorithms, we implement and test them using synthetic data as well as for two real-world computational biology applications. Our experiments show that adaptive inference can achieve substantial speedups over performing complete inference as the model undergoes small changes over time. Keywords: exact inference, factor graphs, factor elimination, marginalization, dynamic programming, MAP computation, model updates, parallel tree contraction

4 0.35010442 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood

Author: Pasi Jylänki, Jarno Vanhatalo, Aki Vehtari

Abstract: This paper considers the robust and efficient implementation of Gaussian process regression with a Student-t observation model, which has a non-log-concave likelihood. The challenge with the Student-t model is the analytically intractable inference which is why several approximative methods have been proposed. Expectation propagation (EP) has been found to be a very accurate method in many empirical studies but the convergence of EP is known to be problematic with models containing non-log-concave site functions. In this paper we illustrate the situations where standard EP fails to converge and review different modifications and alternative algorithms for improving the convergence. We demonstrate that convergence problems may occur during the type-II maximum a posteriori (MAP) estimation of the hyperparameters and show that standard EP may not converge in the MAP values with some difficult data sets. We present a robust implementation which relies primarily on parallel EP updates and uses a moment-matching-based double-loop algorithm with adaptively selected step size in difficult cases. The predictive performance of EP is compared with Laplace, variational Bayes, and Markov chain Monte Carlo approximations. Keywords: Gaussian process, robust regression, Student-t distribution, approximate inference, expectation propagation

5 0.34864539 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods

Author: Aad van der Vaart, Harry van Zanten

Abstract: We consider the quality of learning a response function by a nonparametric Bayesian approach using a Gaussian process (GP) prior on the response function. We upper bound the quadratic risk of the learning procedure, which in turn is an upper bound on the Kullback-Leibler information between the predictive and true data distribution. The upper bound is expressed in small ball probabilities and concentration measures of the GP prior. We illustrate the computation of the upper bound for the Matérn and squared exponential kernels. For these priors the risk, and hence the information criterion, tends to zero for all continuous response functions. However, the rate at which this happens depends on the combination of true response function and Gaussian prior, and is expressible in a certain concentration function. In particular, the results show that for good performance, the regularity of the GP prior should match the regularity of the unknown response function. Keywords: Bayesian learning, Gaussian prior, information rate, risk, Matérn kernel, squared exponential kernel

6 0.34754491 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

7 0.34625292 35 jmlr-2011-Forest Density Estimation

8 0.34320718 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

9 0.34315091 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

10 0.33868212 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership

11 0.33526912 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

12 0.33493015 12 jmlr-2011-Bayesian Co-Training

13 0.33439696 48 jmlr-2011-Kernel Analysis of Deep Networks

14 0.33181581 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

15 0.33119699 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models

16 0.32947856 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

17 0.32894683 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

18 0.32859752 66 jmlr-2011-Multiple Kernel Learning Algorithms

19 0.32825229 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models

20 0.32802445 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization