nips nips2001 nips2001-95 knowledge-graph by maker-knowledge-mining

95 nips-2001-Infinite Mixtures of Gaussian Process Experts


Source: pdf

Author: Carl E. Rasmussen, Zoubin Ghahramani

Abstract: We present an extension to the Mixture of Experts (ME) model, where the individual experts are Gaussian Process (GP) regression models. Using an input-dependent adaptation of the Dirichlet Process, we implement a gating network for an infinite number of Experts. Inference in this model may be done efficiently using a Markov Chain relying on Gibbs sampling. The model allows the effective covariance function to vary with the inputs, and may handle large datasets – thus potentially overcoming two of the biggest hurdles with GP models. Simulations show the viability of this approach.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We present an extension to the Mixture of Experts (ME) model, where the individual experts are Gaussian Process (GP) regression models. [sent-8, score-0.477]

2 Using an input-dependent adaptation of the Dirichlet Process, we implement a gating network for an infinite number of Experts. [sent-9, score-0.425]

3 The model allows the effective covariance function to vary with the inputs, and may handle large datasets – thus potentially overcoming two of the biggest hurdles with GP models. [sent-11, score-0.103]

4 First, because inference requires inversion of an n x n matrix, where n is the number of training data points, they are computationally impractical for large datasets. [sent-16, score-0.156]
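
For context, a minimal sketch (not from the paper) of the standard GP evidence computation, where the Cholesky factorization of the n x n covariance matrix is the O(n^3) step that makes plain GPs impractical for large n; the kernel and data here are placeholders.

```python
import numpy as np

def gp_log_marginal_likelihood(X, y, kernel, noise_var):
    """Standard GP evidence; the Cholesky of the n x n matrix is the O(n^3) bottleneck."""
    n = len(y)
    K = kernel(X, X) + noise_var * np.eye(n)       # n x n covariance matrix
    L = np.linalg.cholesky(K)                      # O(n^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K^{-1} y via triangular solves
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))           # -0.5 * log det K
            - 0.5 * n * np.log(2 * np.pi))
```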

5 Second, the covariance function is commonly assumed to be stationary, limiting the modeling flexibility. [sent-17, score-0.103]

6 For example, if the noise variance is different in different parts of the input space, or if the function has a discontinuity, a stationary covariance function will not be adequate. [sent-18, score-0.277]

7 Goldberg et al [1998] discussed the case of input dependent noise variance. [sent-19, score-0.095]

8 These methods are based on selecting a projection of the covariance matrix onto a smaller subspace (e. [sent-21, score-0.103]

9 There have also been attempts at deriving more complex covariance functions [Gibbs 1997] although it can be difficult to decide a priori on a covariance function of sufficient complexity which guarantees positive definiteness. [sent-24, score-0.206]

10 In this paper we will simultaneously address both the problem of computational complexity and the deficiencies in covariance functions using a divide and conquer strategy inspired by the Mixture of Experts (ME) architecture [Jacobs et al, 1991]. [sent-25, score-0.131]

11 In this model the input space is (probabilistically) divided by a gating network into regions within which specific separate experts make predictions. [sent-26, score-0.86]

12 Of course, as in the ME, the learning of the experts and the gating network are intimately coupled. [sent-28, score-0.86]

13 Unfortunately, it may be (practically and statistically) difficult to infer the appropriate number of experts for a particular dataset. [sent-29, score-0.463]

14 In the current paper we sidestep this difficult problem by using an infinite number of experts and employing a gating network related to the Dirichlet Process, to specify a spatially varying Dirichlet Process. [sent-30, score-0.86]

15 An infinite number of experts may also in many cases be more faithful to our prior expectations about complex real-world datasets. [sent-31, score-0.473]

16 Integrating over the posterior distribution for the parameters is carried out using a Markov Chain Monte Carlo approach. [sent-32, score-0.085]

17 In his approach both the experts and the gating network were implemented with GPs; the gating network being a softmax of GPs. [sent-34, score-1.285]

18 2 Infinite GP mixtures The traditional ME likelihood does not apply when the experts are non-parametric. [sent-36, score-0.513]

19 where x and y are the inputs and outputs (boldface denotes vectors), θ are the parameters of the experts, φ are the parameters of the gating network, and c are the discrete indicator variables assigning data points to experts. [sent-38, score-0.808]

20 There is a joint distribution corresponding to every possible assignment of data points to experts; therefore the likelihood is a sum over (exponentially many) assignments: p(y | x, θ, φ) = Σ_c p(y | x, c, θ) p(c | x, φ). (1) [sent-40, score-0.175]

21 Given the configuration c, the distribution factors into the product, over experts, of the joint Gaussian distribution of all data points assigned to each expert. [sent-42, score-0.224]
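
Given a fixed configuration of indicators, this factorization can be sketched as a sum of independent GP evidences, one per expert, over the points assigned to it. The helper `gp_log_marginal_likelihood` (sketched above) and the `expert_hypers` structure are our placeholders, not the paper's code.

```python
import numpy as np

def log_likelihood_given_indicators(X, y, indicators, expert_hypers,
                                    gp_log_marginal_likelihood):
    """Factorized likelihood: independent GP evidence for each expert's assigned points."""
    total = 0.0
    for j in np.unique(indicators):
        mask = indicators == j
        kernel, noise_var = expert_hypers[j]    # per-expert covariance and noise level
        total += gp_log_marginal_likelihood(X[mask], y[mask], kernel, noise_var)
    return total
```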

22 Whereas the original ME formulation used expectations of assignment variables called responsibilities, this is inadequate for inference in the mixture of GP experts. [sent-43, score-0.144]

23 We defer discussion of the second term defining the gating network to the next section. [sent-46, score-0.425]

24 As discussed, the first term, the likelihood given the indicators, factors into independent terms for each expert. [sent-47, score-0.114]

25 Thus, for the GP expert, we compute the above conditional density by simply evaluating the GP on the data assigned to it. [sent-49, score-0.171]

26 Although this equation looks computationally expensive, we can keep track of the inverse covariance matrices and reuse them for consecutive Gibbs updates by performing rank one updates (since Gibbs sampling changes at most one indicator at a time). [sent-50, score-0.353]
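
A minimal sketch of the kind of incremental update this allows, assuming the inverse covariance of an expert is stored: when a Gibbs move assigns one more point to the expert, the inverse can be grown with the block-inverse (Schur complement) formula in O(m^2) rather than recomputed in O(m^3); removing a point reverses the same formula. This is our illustration of the incremental idea, not the authors' implementation.

```python
import numpy as np

def add_point_to_inverse(K_inv, k_new, kappa_new):
    """Grow a stored inverse covariance when one data point joins an expert.

    K_inv:     (m, m) inverse of the expert's current covariance matrix
    k_new:     (m,)   covariances between the new point and the m old points
    kappa_new: scalar covariance of the new point with itself (incl. noise)
    Returns the (m+1, m+1) inverse in O(m^2) via the block-inverse formula.
    """
    v = K_inv @ k_new
    s = 1.0 / (kappa_new - k_new @ v)          # Schur complement (scalar)
    new_inv = np.empty((K_inv.shape[0] + 1,) * 2)
    new_inv[:-1, :-1] = K_inv + s * np.outer(v, v)
    new_inv[:-1, -1] = -s * v
    new_inv[-1, :-1] = -s * v
    new_inv[-1, -1] = s
    return new_inv
```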

27 We are free to choose any valid covariance function for the experts. [sent-51, score-0.103]

28 In our simulations we employed the following Gaussian covariance function: Q(x_i, x_j) = v0 exp( -1/2 Σ_d (x_i^(d) - x_j^(d))^2 / w_d^2 ) + δ_ij v1, [sent-52, score-0.103]

29 with hyperparameters v0 controlling the signal variance, v1 controlling the noise variance, and w_d controlling the length scale or (inverse) relevance of the d-th dimension of the input. [sent-57, score-0.4]
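
A sketch of a covariance function of this form (squared-exponential with per-dimension length scales w_d, signal variance v0 and noise variance v1); the exact parameterization in the paper may differ, so treat the names below as assumptions.

```python
import numpy as np

def se_ard_covariance(X1, X2, signal_var, noise_var, length_scales):
    """Squared-exponential covariance with one length scale per input dimension.

    X1: (n1, D), X2: (n2, D); noise is added on the diagonal only when the
    two input sets are the same object (i.e. for the training covariance).
    """
    diff = (X1[:, None, :] - X2[None, :, :]) / length_scales   # (n1, n2, D)
    K = signal_var * np.exp(-0.5 * np.sum(diff ** 2, axis=-1))
    if X1 is X2:
        K = K + noise_var * np.eye(X1.shape[0])
    return K
```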

30 3 The Gating network The gating network assigns probability to different experts based entirely on the input. [sent-62, score-0.92]

31 We will derive a gating network based on the Dirichlet Process which can be defined as the limit of a Dirichlet distribution when the number of classes tends to infinity. [sent-63, score-0.456]

32 The standard Dirichlet Process is not input dependent, but we will modify it to serve as a gating mechanism. [sent-64, score-0.365]

33 We start from a symmetric Dirichlet distribution on the mixing proportions, (π_1, …, π_k) ~ Dirichlet(α/k, …, α/k), where α is the (positive) concentration parameter. [sent-65, score-0.124]

34 These length scales allow dimensions of space to be more or less relevant to the gating network classification. [sent-71, score-0.503]

35 We Gibbs sample the indicator variables by multiplying the input-dependent Dirichlet process prior by the likelihood. [sent-72, score-0.272]

36 Gibbs sampling in an infinite model requires that the indicator variables can take on values that no other indicator variable has already taken, thereby creating new experts. [sent-75, score-0.347]

37 In this approach hyperparameters for new experts are sampled from their prior and the likelihood is evaluated based on these. [sent-77, score-0.678]
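
A sketch of what the prior term of such a Gibbs step might look like, under our own simplified construction: each occupied expert is weighted by a kernel-smoothed count of the other points it owns near x_i, and a new expert gets weight proportional to the concentration parameter α. This illustrates the idea of an input-dependent occupation count; it is not the paper's exact normalization, and the function and argument names are ours.

```python
import numpy as np

def gating_prior_probs(i, X, indicators, alpha, kernel_widths):
    """Conditional prior over the indicator c_i for an input-dependent DP gating.

    Existing experts are weighted by kernel-smoothed counts of the points
    assigned to them (local occupancy near x_i); a new expert gets weight
    proportional to alpha.  Returns the expert labels and the probabilities,
    with the last entry corresponding to "create a new expert".
    """
    others = np.delete(np.arange(len(X)), i)
    d2 = np.sum(((X[others] - X[i]) / kernel_widths) ** 2, axis=1)
    w = np.exp(-0.5 * d2)                        # Gaussian kernel in input space
    experts = np.unique(indicators[others])
    counts = np.array([w[indicators[others] == j].sum() for j in experts])
    weights = np.concatenate([counts, [alpha]])  # last entry: new expert
    return experts, weights / weights.sum()
```

In the Gibbs step, these prior probabilities would then be multiplied by the GP likelihood of the point under each candidate expert (with hyperparameters drawn from the prior for the new-expert option, as described above) and renormalized before sampling the indicator.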

38 If all data points are assigned to a single GP, the likelihood calculation will still be cubic in the number of data points (per Gibbs sweep over all indicators). [sent-83, score-0.342]

39 We can reduce the computational complexity by introducing the constraint that no GP expert can have more than some maximum number of data points assigned to it. [sent-84, score-0.278]

40 The hyperparameter α controls the prior probability of assigning a data point to a new expert, and therefore influences the total number of experts used to model the data. [sent-86, score-0.507]

41 As in Rasmussen [2000], we give a vague inverse gamma prior to α, and sample from its posterior using Adaptive Rejection Sampling (ARS) [Gilks & Wild, 1992]. [sent-87, score-0.255]
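
The paper uses ARS; purely as an illustration, a simpler substitute is a Metropolis step on log α against the standard Dirichlet-process partition term α^k Γ(α)/Γ(n+α) (which ignores the input dependence) combined with an inverse-gamma prior whose shape and scale are placeholder hyper-hypers.

```python
import numpy as np
from scipy.special import gammaln

def log_post_alpha(alpha, k, n, prior_shape=1.0, prior_scale=1.0):
    """log p(alpha | k occupied experts, n points), up to an additive constant."""
    log_lik = k * np.log(alpha) + gammaln(alpha) - gammaln(n + alpha)
    log_prior = -(prior_shape + 1) * np.log(alpha) - prior_scale / alpha  # inverse gamma
    return log_lik + log_prior

def metropolis_alpha(alpha, k, n, step=0.5, rng=None):
    """One Metropolis update of alpha, random-walking on the log scale."""
    if rng is None:
        rng = np.random.default_rng()
    prop = alpha * np.exp(step * rng.standard_normal())
    log_ratio = (log_post_alpha(prop, k, n) + np.log(prop)      # Jacobian of the
                 - log_post_alpha(alpha, k, n) - np.log(alpha)) # log transform
    return prop if np.log(rng.random()) < log_ratio else alpha
```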

42 Finally we need to do inference for the parameters of the gating function. [sent-89, score-0.426]

43 Given a set of indicator variables one could use standard methods from kernel classification to optimize the kernel widths in different directions. [sent-90, score-0.322]

44 These methods typically optimize the leave-one-out pseudo-likelihood (i.e. the product of the conditionals), since computing the full likelihood in a model defined purely from conditional distributions is not straightforward. [sent-91, score-0.082]

45 4 The Algorithm The individual GP experts are given a stationary Gaussian covariance function, with a single length scale per dimension, a signal variance and a noise variance, i.e. [sent-94, score-0.785]

46 D + 2 (where D is the dimension of the input) hyperparameters per expert. [sent-96, score-0.197]

47 The signal and noise variances are given inverse gamma priors with hyper-hyperparameters (separately for the two variances). [sent-98, score-0.162]

48 This serves to couple the hyperparameters between experts, and allows the priors on the signal and noise variances (which are used when evaluating auxiliary classes) to adapt. [sent-99, score-0.166]

49 Finally we give vague independent log normal priors to the length scale parameters. [sent-100, score-0.119]

50 The algorithm for learning an infinite mixture of GP experts consists of the following steps: 1. [sent-103, score-0.49]

51 Initialize indicator variables to a single value (or a few values if individual GPs are to be kept small for computational reasons). [sent-104, score-0.19]

52 Do Hybrid Monte Carlo (HMC) [Duane et al, 1987] for the hyperparameters of the GP covariance function, for each expert in turn. [sent-108, score-0.413]

53 On the right hand plot we show samples from the posterior distribution for the iMGPE of the (noise free) function evaluated at intervals of 1 ms. [sent-118, score-0.195]

54 We have jittered the points in the plot along the time dimension by adding a small amount of uniform noise (in ms), so that the density can be seen more easily. [sent-119, score-0.225]

55 Sample the gating kernel widths; we use the Metropolis method to sample from the pseudo-posterior, with a Gaussian proposal fit at the current values. [sent-122, score-0.462]
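
Pulling the listed steps together, a high-level outline of one MCMC sweep; the four callables are stand-ins for the components described above (Gibbs update of an indicator, HMC for one expert's hyperparameters, the α update, and the Metropolis move on the gating kernel widths), so this is a structural sketch rather than the authors' code.

```python
import numpy as np

def imgpe_sweep(X, y, state, rng,
                sample_indicator, hmc_expert_hypers,
                sample_alpha, metropolis_kernel_widths):
    """One MCMC sweep for the infinite mixture of GP experts, as outlined above.

    `state` is a dict holding the indicators, per-expert GP hyperparameters,
    the Dirichlet concentration alpha, and the gating kernel widths.
    """
    n = len(y)
    # 1. Gibbs-sample each indicator in turn (possibly creating new experts)
    for i in rng.permutation(n):
        state = sample_indicator(i, X, y, state, rng)
    # 2. HMC for the covariance hyperparameters of each occupied expert
    for j in np.unique(state["indicators"]):
        state = hmc_expert_hypers(j, X, y, state, rng)
    # 3. Sample the Dirichlet concentration parameter
    state = sample_alpha(state, rng)
    # 4. Metropolis update of the gating kernel widths
    state = metropolis_kernel_widths(X, state, rng)
    return state
```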

56 5 Simulations on a simple real-world data set To illustrate our algorithm, we used the motorcycle dataset, fig. [sent-124, score-0.125]

57 We noticed that the raw data is discretized into bins of a fixed size (in units of g); accordingly we cut off the prior for the noise variance at a corresponding lower bound. [sent-127, score-0.182]

58 The model is able to capture the general shape of the function and also the input-dependent nature of the noise (fig. [sent-128, score-0.097]

59 Whereas the medians of the predictive distributions agree to a large extent (left hand plot), we see a huge difference in the full predictive distributions (right hand plot). [sent-134, score-0.177]

60 The homoscedastic GP cannot capture the very tight distribution at early times (in ms) offered by iMGPE. [sent-135, score-0.109]

61 Note that the predictive distribution of the function is multimodal, for example, around time 35 ms. [sent-137, score-0.097]

62 Multimodal predictive distributions could in principle be obtained from an ordinary GP by integrating over hyperparameters; however, in a mixture of GPs model they arise naturally. [sent-138, score-0.155]

63 The predictive distribution of the function appears to have significant mass around 0 g, which seems somewhat artifactual. [sent-139, score-0.165]

64 The Gaussian fit uses the derivative and Hessian of the log posterior with respect to the log length scales. [sent-141, score-0.203]

65 The data have been sorted from left-to-right according to the value of the time variable (since the data are not equally spaced in time, the axis of this matrix cannot be aligned with the plot in fig. [sent-145, score-0.133]

66 The right hand plot shows a histogram over the 100 samples of the number of GP experts used to model the data. [sent-147, score-0.545]

67 The Gaussian processes had zero mean a priori, which, coupled with the concentration of data around zero, may explain the posterior mass at zero. [sent-148, score-0.286]

68 It would be more natural to treat the GP means as separate hyperparameters controlled by a hyper-hyperparameter (centered at zero) and do inference on them, rather than fix them all at 0. [sent-149, score-0.227]

69 Although the scatter of data from the predictive distribution for iMGPE looks somewhat messy with multimodality etc, it is important to note that it assigns high density to regions that seem probable. [sent-150, score-0.168]

70 The motorcycle data appears to have roughly three regions: a flat low-noise region, followed by a curved region, and a flat high noise region. [sent-151, score-0.223]

71 The left-hand plot shows the number of times two data points were assigned to the same expert. [sent-155, score-0.134]

72 A clearly defined block captures the initial flat region and a few other points that lie near the 0 g line; the middle block captures the curved region, with a more gradual transition to the last flat region. [sent-156, score-0.182]

73 A histogram of the number of GP experts used shows that the posterior distribution of the number of needed GPs has a broad peak, where fewer than 3 occupied experts is very unlikely and larger numbers become progressively less likely. [sent-157, score-1.023]

74 Note that it never uses just a single GP to model the data which accords with the intuition that a single stationary covariance function would be inadequate. [sent-158, score-0.257]

75 We should point out that the model is not trying to do model selection between finite GP mixtures, but rather always assumes that there are infinitely many available, most of which contribute with small mass to a diffuse density in the background. [sent-159, score-0.105]

76 [Figure 3 axis labels only: frequency; auto-correlation coefficient; log number of occupied experts; log (base 10) gating kernel width; log (base 10) Dirichlet process concentration.]

78 Figure 3: The left hand plot shows the auto-correlation for various parameters of the model, based on the MCMC iterations. [sent-172, score-0.33]

79 The right hand plots show the distribution of the (log) kernel width and the (log) Dirichlet concentration parameter α, based on samples from the posterior. [sent-173, score-0.284]

80 These include the gating kernel width and the concentration parameter of the Dirichlet process. [sent-174, score-0.093]

81 The posterior kernel width lies in a narrow range; compared to the scale of the inputs these are quite short distances, corresponding to rapid transitions between experts (as opposed to lengthy intervals with multiple active experts). [sent-175, score-0.604]

82 There are four ways in which the infinite mixture of GP experts differs from, and we believe, improves upon the model presented by Tresp. [sent-178, score-0.49]

83 First, in his model, although a gating network divides up the input space, each GP expert predicts on the basis of all of the data. [sent-179, score-0.569]

84 Data that was not assigned to a GP expert can therefore spill over into predictions of a GP, which will lead to bias near region boundaries especially for experts with long length scales. [sent-180, score-0.728]

85 Second, if there are k experts, Tresp's model has 3k GPs (the k experts, k noise models, and k separate gating functions), each of which requires computations over the entire dataset of n points, resulting in O(k n^3) computations. [sent-181, score-0.499]

86 In our model, since the experts divide up the data points, if there are k experts equally dividing the n data points an iteration is much cheaper (each of the n Gibbs updates requires a rank-one computation for each of the k experts, and the Hybrid Monte Carlo takes k times the cost of a single GP on n/k points). [sent-182, score-1.401]

87 Inference for the gating length scale parameters is more expensive if the full Hessian is used, but the cost can be reduced with a diagonal approximation, or with Hybrid Monte Carlo if the input dimension is large. [sent-186, score-0.441]

88 Third, by going to the Dirichlet process infinite limit, we allow the model to infer the number of components required to capture the data. [sent-187, score-0.097]

89 Finally, in our model the GP hyperparameters are not fixed but are instead inferred from the data. [sent-188, score-0.166]

90 We have defined the gating network prior implicitly in terms of the conditional distribution of an indicator variable given all the other indicator variables. [sent-189, score-0.805]

91 Specifically, the distribution of this indicator variable is an input-dependent Dirichlet process with counts given by local estimates of the data density in each class. [sent-190, score-0.276]

92 If indeed there does not exist a single consistent joint distribution the Gibbs sampler may converge to different distributions depending on the order of sampling. [sent-193, score-0.087]

93 We are encouraged by the preliminary results obtained on the motorcycle data. [sent-194, score-0.091]

94 We have argued here that single iterations of the MCMC inference are computationally tractable even for large data sets; experiments will show whether mixing is sufficiently rapid to allow practical application. [sent-196, score-0.185]

95 We hope that the extra flexibility of the effective covariance function will turn out to improve performance. [sent-197, score-0.103]

96 Also, the automatic choice of the number of experts may make the model advantageous for practical modeling tasks. [sent-198, score-0.435]

97 The computational problem in doing inference and prediction using Gaussian Processes arises out of the unrealistic assumption that a single covariance function captures the behavior of the data over its entire range. [sent-200, score-0.291]

98 This leads to a cumbersome matrix inversion over the entire data set. [sent-201, score-0.1]

99 Instead we find that by making a more realistic assumption, that the data can be modeled by an infinite mixture of local Gaussian processes, the computational problem also decomposes into smaller matrix inversions. [sent-202, score-0.089]

100 Markov chain sampling methods for Dirichlet process mixture models. [sent-248, score-0.207]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('gp', 0.453), ('experts', 0.435), ('gating', 0.365), ('dirichlet', 0.199), ('gibbs', 0.199), ('hyperparameters', 0.166), ('expert', 0.144), ('indicator', 0.134), ('rasmussen', 0.126), ('gps', 0.108), ('covariance', 0.103), ('concentration', 0.093), ('imgpe', 0.091), ('motorcycle', 0.091), ('nite', 0.076), ('indicators', 0.075), ('mass', 0.068), ('occupied', 0.068), ('noise', 0.068), ('vague', 0.067), ('williams', 0.067), ('predictive', 0.066), ('plot', 0.065), ('kernel', 0.065), ('stationary', 0.064), ('inference', 0.061), ('chain', 0.061), ('network', 0.06), ('assigned', 0.057), ('mixture', 0.055), ('occupation', 0.054), ('posterior', 0.054), ('gaussian', 0.054), ('log', 0.052), ('carlo', 0.052), ('monte', 0.052), ('sampling', 0.051), ('width', 0.05), ('ms', 0.049), ('region', 0.047), ('hybrid', 0.047), ('nips', 0.046), ('duane', 0.046), ('hmc', 0.046), ('silverman', 0.046), ('length', 0.045), ('hand', 0.045), ('conditional', 0.043), ('points', 0.043), ('variance', 0.042), ('regression', 0.042), ('process', 0.04), ('gilks', 0.04), ('wild', 0.04), ('tresp', 0.039), ('mixtures', 0.039), ('likelihood', 0.039), ('prior', 0.038), ('processes', 0.037), ('density', 0.037), ('inverse', 0.036), ('conditionals', 0.036), ('multimodal', 0.036), ('acceleration', 0.036), ('xx', 0.035), ('integrating', 0.034), ('data', 0.034), ('entire', 0.034), ('rejection', 0.034), ('goldberg', 0.034), ('scales', 0.033), ('mixing', 0.033), ('sample', 0.032), ('dataset', 0.032), ('etc', 0.032), ('inversion', 0.032), ('cubic', 0.032), ('sweep', 0.032), ('captures', 0.031), ('distribution', 0.031), ('dimension', 0.031), ('controlling', 0.03), ('seeger', 0.03), ('curved', 0.03), ('jacobs', 0.03), ('widths', 0.03), ('variances', 0.03), ('cult', 0.029), ('capture', 0.029), ('computationally', 0.029), ('exibility', 0.029), ('dif', 0.028), ('infer', 0.028), ('single', 0.028), ('variables', 0.028), ('distances', 0.028), ('divide', 0.028), ('gamma', 0.028), ('joint', 0.028), ('al', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 95 nips-2001-Infinite Mixtures of Gaussian Process Experts

Author: Carl E. Rasmussen, Zoubin Ghahramani

Abstract: We present an extension to the Mixture of Experts (ME) model, where the individual experts are Gaussian Process (GP) regression models. Using an input-dependent adaptation of the Dirichlet Process, we implement a gating network for an infinite number of Experts. Inference in this model may be done efficiently using a Markov Chain relying on Gibbs sampling. The model allows the effective covariance function to vary with the inputs, and may handle large datasets – thus potentially overcoming two of the biggest hurdles with GP models. Simulations show the viability of this approach.

2 0.20862679 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems

Author: Ronan Collobert, Samy Bengio, Yoshua Bengio

Abstract: Support Vector Machines (SVMs) are currently the state-of-the-art models for many classification problems but they suffer from the complexity of their training algorithm which is at least quadratic with respect to the number of examples. Hence, it is hopeless to try to solve real-life problems having more than a few hundreds of thousands examples with SVMs. The present paper proposes a new mixture of SVMs that can be easily implemented in parallel and where each SVM is trained on a small subset of the whole dataset. Experiments on a large benchmark dataset (Forest) as well as a difficult speech database , yielded significant time improvement (time complexity appears empirically to locally grow linearly with the number of examples) . In addition, and that is a surprise, a significant improvement in generalization was observed on Forest. 1

3 0.17367113 183 nips-2001-The Infinite Hidden Markov Model

Author: Matthew J. Beal, Zoubin Ghahramani, Carl E. Rasmussen

Abstract: We show that it is possible to extend hidden Markov models to have a countably infinite number of hidden states. By using the theory of Dirichlet processes we can implicitly integrate out the infinitely many transition parameters, leaving only three hyperparameters which can be learned from data. These three hyperparameters define a hierarchical Dirichlet process capable of capturing a rich set of transition dynamics. The three hyperparameters control the time scale of the dynamics, the sparsity of the underlying state-transition matrix, and the expected number of distinct hidden states in a finite sequence. In this framework it is also natural to allow the alphabet of emitted symbols to be infinite— consider, for example, symbols being possible words appearing in English text.

4 0.15827315 178 nips-2001-TAP Gibbs Free Energy, Belief Propagation and Sparsity

Author: Lehel Csató, Manfred Opper, Ole Winther

Abstract: The adaptive TAP Gibbs free energy for a general densely connected probabilistic model with quadratic interactions and arbritary single site constraints is derived. We show how a specific sequential minimization of the free energy leads to a generalization of Minka’s expectation propagation. Lastly, we derive a sparse representation version of the sequential algorithm. The usefulness of the approach is demonstrated on classification and density estimation with Gaussian processes and on an independent component analysis problem.

5 0.15120786 154 nips-2001-Products of Gaussians

Author: Christopher Williams, Felix V. Agakov, Stephen N. Felderhof

Abstract: Recently Hinton (1999) has introduced the Products of Experts (PoE) model in which several individual probabilistic models for data are combined to provide an overall model of the data. Below we consider PoE models in which each expert is a Gaussian. Although the product of Gaussians is also a Gaussian, if each Gaussian has a simple structure the product can have a richer structure. We examine (1) Products of Gaussian pancakes which give rise to probabilistic Minor Components Analysis, (2) products of I-factor PPCA models and (3) a products of experts construction for an AR(l) process. Recently Hinton (1999) has introduced the Products of Experts (PoE) model in which several individual probabilistic models for data are combined to provide an overall model of the data. In this paper we consider PoE models in which each expert is a Gaussian. It is easy to see that in this case the product model will also be Gaussian. However, if each Gaussian has a simple structure, the product can have a richer structure. Using Gaussian experts is attractive as it permits a thorough analysis of the product architecture, which can be difficult with other models , e.g. models defined over discrete random variables. Below we examine three cases of the products of Gaussians construction: (1) Products of Gaussian pancakes (PoGP) which give rise to probabilistic Minor Components Analysis (MCA), providing a complementary result to probabilistic Principal Components Analysis (PPCA) obtained by Tipping and Bishop (1999); (2) Products of I-factor PPCA models; (3) A products of experts construction for an AR(l) process. Products of Gaussians If each expert is a Gaussian pi(xI8 i ) '

6 0.13543627 43 nips-2001-Bayesian time series classification

7 0.12104672 79 nips-2001-Gaussian Process Regression with Mismatched Models

8 0.11560779 58 nips-2001-Covariance Kernels from Bayesian Generative Models

9 0.10610874 35 nips-2001-Analysis of Sparse Bayesian Learning

10 0.09479022 21 nips-2001-A Variational Approach to Learning Curves

11 0.091702871 61 nips-2001-Distribution of Mutual Information

12 0.081749439 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior

13 0.080651924 129 nips-2001-Multiplicative Updates for Classification by Mixture Models

14 0.08043021 46 nips-2001-Categorization by Learning and Combining Object Parts

15 0.079195179 107 nips-2001-Latent Dirichlet Allocation

16 0.074726716 164 nips-2001-Sampling Techniques for Kernel Methods

17 0.070779637 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines

18 0.069727011 190 nips-2001-Thin Junction Trees

19 0.06907545 84 nips-2001-Global Coordination of Local Linear Models

20 0.065838419 12 nips-2001-A Model of the Phonological Loop: Generalization and Binding


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.234), (1, 0.026), (2, -0.038), (3, -0.099), (4, -0.134), (5, -0.035), (6, 0.119), (7, 0.024), (8, -0.01), (9, -0.006), (10, 0.068), (11, 0.072), (12, -0.023), (13, -0.311), (14, -0.035), (15, 0.014), (16, 0.042), (17, 0.035), (18, 0.147), (19, -0.053), (20, 0.039), (21, -0.117), (22, 0.039), (23, 0.081), (24, 0.024), (25, -0.017), (26, -0.146), (27, -0.178), (28, 0.188), (29, -0.006), (30, 0.085), (31, 0.032), (32, 0.136), (33, 0.003), (34, -0.091), (35, 0.098), (36, -0.071), (37, -0.018), (38, -0.0), (39, 0.103), (40, 0.08), (41, -0.119), (42, 0.128), (43, -0.029), (44, 0.049), (45, -0.011), (46, 0.095), (47, 0.12), (48, -0.018), (49, 0.077)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94228047 95 nips-2001-Infinite Mixtures of Gaussian Process Experts

Author: Carl E. Rasmussen, Zoubin Ghahramani

Abstract: We present an extension to the Mixture of Experts (ME) model, where the individual experts are Gaussian Process (GP) regression models. Using an input-dependent adaptation of the Dirichlet Process, we implement a gating network for an infinite number of Experts. Inference in this model may be done efficiently using a Markov Chain relying on Gibbs sampling. The model allows the effective covariance function to vary with the inputs, and may handle large datasets – thus potentially overcoming two of the biggest hurdles with GP models. Simulations show the viability of this approach.

2 0.76188481 154 nips-2001-Products of Gaussians

Author: Christopher Williams, Felix V. Agakov, Stephen N. Felderhof

Abstract: Recently Hinton (1999) has introduced the Products of Experts (PoE) model in which several individual probabilistic models for data are combined to provide an overall model of the data. Below we consider PoE models in which each expert is a Gaussian. Although the product of Gaussians is also a Gaussian, if each Gaussian has a simple structure the product can have a richer structure. We examine (1) Products of Gaussian pancakes which give rise to probabilistic Minor Components Analysis, (2) products of I-factor PPCA models and (3) a products of experts construction for an AR(l) process. Recently Hinton (1999) has introduced the Products of Experts (PoE) model in which several individual probabilistic models for data are combined to provide an overall model of the data. In this paper we consider PoE models in which each expert is a Gaussian. It is easy to see that in this case the product model will also be Gaussian. However, if each Gaussian has a simple structure, the product can have a richer structure. Using Gaussian experts is attractive as it permits a thorough analysis of the product architecture, which can be difficult with other models , e.g. models defined over discrete random variables. Below we examine three cases of the products of Gaussians construction: (1) Products of Gaussian pancakes (PoGP) which give rise to probabilistic Minor Components Analysis (MCA), providing a complementary result to probabilistic Principal Components Analysis (PPCA) obtained by Tipping and Bishop (1999); (2) Products of I-factor PPCA models; (3) A products of experts construction for an AR(l) process. Products of Gaussians If each expert is a Gaussian pi(xI8 i ) '

3 0.58057165 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems

Author: Ronan Collobert, Samy Bengio, Yoshua Bengio

Abstract: Support Vector Machines (SVMs) are currently the state-of-the-art models for many classification problems but they suffer from the complexity of their training algorithm which is at least quadratic with respect to the number of examples. Hence, it is hopeless to try to solve real-life problems having more than a few hundreds of thousands examples with SVMs. The present paper proposes a new mixture of SVMs that can be easily implemented in parallel and where each SVM is trained on a small subset of the whole dataset. Experiments on a large benchmark dataset (Forest) as well as a difficult speech database , yielded significant time improvement (time complexity appears empirically to locally grow linearly with the number of examples) . In addition, and that is a surprise, a significant improvement in generalization was observed on Forest. 1

4 0.55219311 178 nips-2001-TAP Gibbs Free Energy, Belief Propagation and Sparsity

Author: Lehel Csató, Manfred Opper, Ole Winther

Abstract: The adaptive TAP Gibbs free energy for a general densely connected probabilistic model with quadratic interactions and arbritary single site constraints is derived. We show how a specific sequential minimization of the free energy leads to a generalization of Minka’s expectation propagation. Lastly, we derive a sparse representation version of the sequential algorithm. The usefulness of the approach is demonstrated on classification and density estimation with Gaussian processes and on an independent component analysis problem.

5 0.53020346 43 nips-2001-Bayesian time series classification

Author: Peter Sykacek, Stephen J. Roberts

Abstract: This paper proposes an approach to classification of adjacent segments of a time series as being either of classes. We use a hierarchical model that consists of a feature extraction stage and a generative classifier which is built on top of these features. Such two stage approaches are often used in signal and image processing. The novel part of our work is that we link these stages probabilistically by using a latent feature space. To use one joint model is a Bayesian requirement, which has the advantage to fuse information according to its certainty. The classifier is implemented as hidden Markov model with Gaussian and Multinomial observation distributions defined on a suitably chosen representation of autoregressive models. The Markov dependency is motivated by the assumption that successive classifications will be correlated. Inference is done with Markov chain Monte Carlo (MCMC) techniques. We apply the proposed approach to synthetic data and to classification of EEG that was recorded while the subjects performed different cognitive tasks. All experiments show that using a latent feature space results in a significant improvement in generalization accuracy. Hence we expect that this idea generalizes well to other hierarchical models.

6 0.51732153 79 nips-2001-Gaussian Process Regression with Mismatched Models

7 0.47750363 35 nips-2001-Analysis of Sparse Bayesian Learning

8 0.47116274 70 nips-2001-Estimating Car Insurance Premia: a Case Study in High-Dimensional Data Inference

9 0.46564007 183 nips-2001-The Infinite Hidden Markov Model

10 0.45012343 61 nips-2001-Distribution of Mutual Information

11 0.4249545 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior

12 0.39720824 83 nips-2001-Geometrical Singularities in the Neuromanifold of Multilayer Perceptrons

13 0.38244671 129 nips-2001-Multiplicative Updates for Classification by Mixture Models

14 0.37853509 194 nips-2001-Using Vocabulary Knowledge in Bayesian Multinomial Estimation

15 0.35990041 21 nips-2001-A Variational Approach to Learning Curves

16 0.34269881 68 nips-2001-Entropy and Inference, Revisited

17 0.33581573 76 nips-2001-Fast Parameter Estimation Using Green's Functions

18 0.33078164 153 nips-2001-Product Analysis: Learning to Model Observations as Products of Hidden Variables

19 0.32069412 108 nips-2001-Learning Body Pose via Specialized Maps

20 0.31997526 107 nips-2001-Latent Dirichlet Allocation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(14, 0.052), (17, 0.026), (19, 0.032), (27, 0.122), (30, 0.061), (34, 0.2), (38, 0.029), (59, 0.074), (72, 0.072), (79, 0.069), (83, 0.044), (91, 0.132)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.89728999 176 nips-2001-Stochastic Mixed-Signal VLSI Architecture for High-Dimensional Kernel Machines

Author: Roman Genov, Gert Cauwenberghs

Abstract: A mixed-signal paradigm is presented for high-resolution parallel innerproduct computation in very high dimensions, suitable for efficient implementation of kernels in image processing. At the core of the externally digital architecture is a high-density, low-power analog array performing binary-binary partial matrix-vector multiplication. Full digital resolution is maintained even with low-resolution analog-to-digital conversion, owing to random statistics in the analog summation of binary products. A random modulation scheme produces near-Bernoulli statistics even for highly correlated inputs. The approach is validated with real image data, and with experimental results from a CID/DRAM analog array prototype in 0.5 m CMOS. ¢

same-paper 2 0.85793519 95 nips-2001-Infinite Mixtures of Gaussian Process Experts

Author: Carl E. Rasmussen, Zoubin Ghahramani

Abstract: We present an extension to the Mixture of Experts (ME) model, where the individual experts are Gaussian Process (GP) regression models. Using an input-dependent adaptation of the Dirichlet Process, we implement a gating network for an infinite number of Experts. Inference in this model may be done efficiently using a Markov Chain relying on Gibbs sampling. The model allows the effective covariance function to vary with the inputs, and may handle large datasets – thus potentially overcoming two of the biggest hurdles with GP models. Simulations show the viability of this approach.

3 0.72192067 13 nips-2001-A Natural Policy Gradient

Author: Sham M. Kakade

Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1

4 0.72019541 132 nips-2001-Novel iteration schemes for the Cluster Variation Method

Author: Hilbert J. Kappen, Wim Wiegerinck

Abstract: The Cluster Variation method is a class of approximation methods containing the Bethe and Kikuchi approximations as special cases. We derive two novel iteration schemes for the Cluster Variation Method. One is a fixed point iteration scheme which gives a significant improvement over loopy BP, mean field and TAP methods on directed graphical models. The other is a gradient based method, that is guaranteed to converge and is shown to give useful results on random graphs with mild frustration. We conclude that the methods are of significant practical value for large inference problems. 1

5 0.71594012 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes

Author: Si Wu, Shun-ichi Amari

Abstract: This study investigates a population decoding paradigm, in which the estimation of stimulus in the previous step is used as prior knowledge for consecutive decoding. We analyze the decoding accuracy of such a Bayesian decoder (Maximum a Posteriori Estimate), and show that it can be implemented by a biologically plausible recurrent network, where the prior knowledge of stimulus is conveyed by the change in recurrent interactions as a result of Hebbian learning. 1

6 0.7147885 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior

7 0.71459806 56 nips-2001-Convolution Kernels for Natural Language

8 0.71333396 127 nips-2001-Multi Dimensional ICA to Separate Correlated Sources

9 0.7125802 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

10 0.71203053 27 nips-2001-Activity Driven Adaptive Stochastic Resonance

11 0.71165502 58 nips-2001-Covariance Kernels from Bayesian Generative Models

12 0.7116468 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data

13 0.71053284 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding

14 0.70761555 44 nips-2001-Blind Source Separation via Multinode Sparse Representation

15 0.70696366 84 nips-2001-Global Coordination of Local Linear Models

16 0.70687246 121 nips-2001-Model-Free Least-Squares Policy Iteration

17 0.7068336 188 nips-2001-The Unified Propagation and Scaling Algorithm

18 0.70673966 155 nips-2001-Quantizing Density Estimators

19 0.70644885 8 nips-2001-A General Greedy Approximation Algorithm with Applications

20 0.70624614 183 nips-2001-The Infinite Hidden Markov Model