nips nips2008 nips2008-249 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Chao Yuan, Claus Neubauer
Abstract: Mixture of Gaussian processes models extended a single Gaussian process with the ability to model multi-modal data and to reduce training complexity. Previous inference algorithms for these models are mostly based on Gibbs sampling, which can be very slow, particularly for large-scale data sets. We present a new generative mixture of experts model. Each expert is still a Gaussian process but is reformulated as a linear model. This breaks the dependency among training outputs and enables us to use a much faster variational Bayesian algorithm for training. Our gating network is more flexible than previous generative approaches as inputs for each expert are modeled by a Gaussian mixture model. The number of experts and number of Gaussian components for an expert are inferred automatically. A variety of tests show the advantages of our method. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Mixture of Gaussian processes models extended a single Gaussian process with the ability to model multi-modal data and to reduce training complexity. [sent-4, score-0.153]
2 We present a new generative mixture of experts model. [sent-6, score-0.39]
3 Each expert is still a Gaussian process but is reformulated as a linear model. [sent-7, score-0.567]
4 This breaks the dependency among training outputs and enables us to use a much faster variational Bayesian algorithm for training. [sent-8, score-0.344]
5 Our gating network is more flexible than previous generative approaches as inputs for each expert are modeled by a Gaussian mixture model. [sent-9, score-0.916]
6 The number of experts and number of Gaussian components for an expert are inferred automatically. [sent-10, score-0.778]
7 Secondly, the cost of training is O(N^3), where N is the size of the training set, which can be too expensive for large data sets. [sent-17, score-0.15]
8 Mixture of GP experts models were proposed to tackle the above problems (Rasmussen & Ghahramani [1]; Meeds & Osindero [2]). [sent-18, score-0.248]
9 In this paper, we propose a new generative mixture of Gaussian processes model for regression problems and apply variational Bayesian methods to train it. [sent-24, score-0.363]
10 Each Gaussian process expert is described by a linear model, which breaks the dependency among training outputs and makes variational inference feasible. [sent-25, score-0.911]
11 The distribution of inputs for each expert is modeled by a Gaussian mixture model (GMM). [sent-26, score-0.717]
12 Thus, our gating network can handle missing inputs and is more flexible than single Gaussian-based gating models [2-4]. [sent-27, score-0.409]
13 The number of experts and the number of components for each GMM are automatically inferred. [sent-28, score-0.248]
14 Training using variational methods is much faster than using MCMC. [sent-29, score-0.15]
15 It elegantly models the dependency among data with a Gaussian distribution: P(Y) = N(Y | 0, K + σ_n^2 I). Figure 1: The graphical model representation for the proposed mixture of experts model. [sent-35, score-1.062]
16 It consists of a hyperparameter set Θ = {L, α_y, C, α_x, m_0, R_0, r, S, θ_1:L, I_1:L, a, b} and a parameter set Ψ = {p, q_l, m_lc, R_lc, v_l, γ_l | l = 1, 2, . [sent-36, score-0.667]
17 The local expert is a GP linear model to predict output y from input x; the gating network is a GMM for input x. [sent-43, score-0.796]
18 In step 3, to sample one data point (x, y), we sequentially sample the expert indicator t, the cluster indicator z, then x and y. [sent-47, score-0.703]
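As a concrete illustration of this ancestral sampling scheme, a minimal Python sketch follows (not the authors' code; the array shapes, rng handling and the sample_expert_output helper are illustrative assumptions):

```python
import numpy as np

def sample_point(p, q, means, covs, sample_expert_output, rng):
    """Ancestral sampling of one (x, y) pair: expert indicator t, cluster
    indicator z, then x from the chosen Gaussian and y from the chosen expert."""
    t = rng.choice(len(p), p=p)            # expert indicator t ~ Categorical(p)
    z = rng.choice(q.shape[1], p=q[t])     # cluster indicator z ~ Categorical(q_t)
    x = rng.multivariate_normal(means[t, z], covs[t, z])
    y = sample_expert_output(t, x, rng)    # e.g. a draw from N(v_t^T phi_t(x), 1/gamma_t)
    return t, z, x, y
```

With rng = np.random.default_rng(0), repeated calls generate an i.i.d. sample from the generative model.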
19 The mixture of experts (MoE) framework offers a natural solution for multi-modality problems (Jacobs et al. [sent-61, score-0.364]
20 Early MoE work used linear experts [3, 4, 11, 12] and some of them were neatly trained via variational methods [4, 11, 12]. [sent-63, score-0.422]
21 Tresp [13] proposed a mixture of GPs model that can be trained fast using the EM algorithm. [sent-65, score-0.14]
22 However, hyperparameters including the number of experts needed to be specified and the training complexity issue was not addressed. [sent-66, score-0.453]
23 By introducing the Dirichlet process mixture (DPM) prior, infinite mixture of GPs models are able to infer the number of experts, both hyperparameters and parameters via Gibbs sampling [1, 2]. [sent-67, score-0.399]
24 However, these models are trained by MCMC methods, which demand expensive training and testing time (as collected samples are usually combined to give predictive distributions). [sent-68, score-0.184]
25 It consists of the local expert part and gating network part, which are covered in Sections 3. [sent-72, score-0.726]
26 3, we describe how to perform variational inference of this model. [sent-76, score-0.15]
27 1 Local Gaussian process expert A local Gaussian process expert is specified by the following linear model given the expert indicator t = l (where l = 1 : L) and other related variables: P(y | x, t = l, v_l, θ_l, I_l, γ_l) = N(y | v_l^T φ_l(x), γ_l^{-1}). (1) [sent-78, score-2.072]
28 This linear model is expressed as the inner product of the weight vector v_l and a nonlinear feature vector φ_l(x). [sent-79, score-0.357]
29 φ_l(x) is a vector of kernel functions between a test input x and a subset of training inputs: [k_l(x, x_Il1), k_l(x, x_Il2), . [sent-80, score-0.186]
30 The active set I_l denotes the indices of the M selected training samples. [sent-84, score-0.209]
31 3; for now let us assume that we use the whole training set as the active set. [sent-86, score-0.209]
32 v_l has a Gaussian distribution N(v_l | 0, U_l^{-1}) with 0 mean and inverse covariance U_l. [sent-87, score-0.444]
33 U_l is set to K_l + σ_hl^2 I, where K_l is an M × M kernel matrix consisting of kernel functions between training samples in the active set. [sent-88, score-0.234]
34 If we set σ_hl^2 = 0 and γ_l = 1/σ_nl^2, the joint distribution of the training outputs Y, assuming they are from the same expert l, can be proved to be N(Y | 0, K_l + σ_nl^2 I). [sent-99, score-0.701]
35 P(y_1:N | x_1:N, t_1:N, v_1:L, θ_1:L, I_1:L, γ_1:L) = ∏_{n=1}^{N} P(y_n | x_n, t_n = l, v_l, θ_l, I_l, γ_l). [sent-103, score-0.431]
36 This makes the variational inference of the mixture of Gaussian processes feasible. [sent-104, score-0.307]
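To make the linear-model reformulation of each GP expert concrete, here is a minimal Python sketch under stated assumptions (a squared-exponential kernel with a single length scale; all function names are illustrative, not the authors' code):

```python
import numpy as np

def rbf_kernel(x, z, length_scale=1.0):
    """Squared-exponential kernel between two input vectors (an assumed choice of k_l)."""
    return np.exp(-0.5 * np.sum((np.asarray(x) - np.asarray(z)) ** 2) / length_scale ** 2)

def phi(x, active_inputs, length_scale=1.0):
    """Feature vector phi_l(x): kernel values between x and the M active-set inputs."""
    return np.array([rbf_kernel(x, z, length_scale) for z in active_inputs])

def expert_logpdf(y, x, v, active_inputs, gamma, length_scale=1.0):
    """log N(y | v^T phi_l(x), 1/gamma): one output under one linear-model expert."""
    mean = v @ phi(x, active_inputs, length_scale)
    var = 1.0 / gamma
    return -0.5 * (np.log(2.0 * np.pi * var) + (y - mean) ** 2 / var)
```

Because each output depends only on its own input and the shared weight vector, the log-likelihood of the training set is just a sum of such terms, which is what makes the variational treatment tractable.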
37 2 Gating network A gating network determines which expert to use based on input x. [sent-106, score-0.792]
38 We consider a generative gating network, where expert indicator t is generated by a categorical distribution P (t = l) = pl . [sent-107, score-0.87]
39 Given expert indicator t = l, we assume that x follows a Gaussian mixture model (GMM) with C components. [sent-115, score-0.697]
40 Each component (cluster) is modeled by a Gaussian distribution P(x | t = l, z = c, m_lc, R_lc) = N(x | m_lc, R_lc^{-1}). [sent-116, score-0.214]
41 z is the cluster indicator, which has a categorical distribution P(z = c | t = l, q_l) = q_lc. [sent-117, score-0.333]
42 In addition, we give m_lc a Gaussian prior N(m_lc | m_0, R_0^{-1}), R_lc a Wishart prior W(R_lc | r, S) and q_l a symmetric Dirichlet prior Dir(q_l | α_x/C, α_x/C, . [sent-118, score-0.31]
43 In previous generative gating networks [2-4], the expert indicator also acts as the cluster indicator (or t = z) such that inputs for an expert can only have one Gaussian distribution. [sent-122, score-1.472]
44 In comparison, our model is more flexible by modeling inputs x for each expert as a Gaussian mixture distribution. [sent-123, score-0.694]
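A hedged sketch of how such a gating network assigns an input to experts, assuming the mixture parameters are given (parameter names and array shapes are assumptions for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gating_probs(x, p, q, means, precisions):
    """P(t = l | x) proportional to p_l * sum_c q_lc * N(x | m_lc, R_lc^{-1}).
    Shapes: p (L,), q (L, C), means (L, C, d), precisions (L, C, d, d)."""
    L, C, _ = means.shape
    unnorm = np.zeros(L)
    for l in range(L):
        for c in range(C):
            cov = np.linalg.inv(precisions[l, c])   # R_lc^{-1}
            unnorm[l] += q[l, c] * multivariate_normal.pdf(x, mean=means[l, c], cov=cov)
        unnorm[l] *= p[l]
    return unnorm / unnorm.sum()
```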
45 3 Variational inference Variational EM algorithm Given a set of training data D = {(x_n, y_n) | n = 1 : N}, the task of learning is to estimate the unknown hyperparameters and infer the posterior distribution of the parameters. [sent-129, score-0.32]
46 This problem is nicely addressed by the variational EM algorithm. [sent-130, score-0.182]
47 Parameters Ψ, expert indicators T = {t1:N } and cluster indicators Z = {z1:N } are treated as hidden variables, denoted by Ω = {Ψ, T, Z}. [sent-132, score-0.749]
48 m0 and R0 are set to be the mean and inverse covariance of the training inputs, respectively. [sent-135, score-0.139]
49 To compute the distribution for a hidden variable ω_i, we need to compute the posterior mean of log P(D, Ω|Θ) over all hidden variables except ω_i: ⟨log P(D, Ω|Θ)⟩_{Ω/ω_i}. [sent-148, score-0.182]
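For completeness, the standard mean-field result that this quantity feeds into (a textbook identity, not quoted from the paper) is:

```latex
Q(\omega_i) \;\propto\; \exp\!\Big( \big\langle \log P(\mathcal{D}, \Omega \mid \Theta) \big\rangle_{\Omega \setminus \omega_i} \Big)
```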
50 During the iterations, if a cluster c for expert l does not have a single training sample supporting it (i.e., no n with Q(t_n = l, z_n = c) > 0), this cluster and its associated parameters m_lc and R_lc will be removed. [sent-154, score-0.962]
51 Similarly, we remove an expert l if no Q(tn = l) > 0. [sent-155, score-0.53]
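One way such pruning could be implemented on the responsibility array, shown as a hedged sketch (the threshold and array layout are assumptions):

```python
import numpy as np

def prune_support(Q_tz, eps=1e-12):
    """Identify clusters and experts with no supporting training sample.
    Q_tz: responsibilities Q(t_n = l, z_n = c), shape (N, L, C)."""
    support = Q_tz.max(axis=0)                 # (L, C): best support per (expert, cluster)
    keep_cluster = support > eps               # clusters backed by at least one sample
    keep_expert = keep_cluster.any(axis=1)     # experts retaining at least one cluster
    return keep_expert, keep_cluster
```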
52 For better efficiency, we do not select the active sets I1:L in each M-step; instead, we fix I1:L during the EM algorithm and only update I1:L once when the EM algorithm converges. [sent-160, score-0.16]
53 Initialization Without proper initialization, variational methods can easily be trapped in local optima. [sent-162, score-0.15]
54 Our method is based on the assumption that the combined data including x and y for an expert are usually distributed locally in the combined d + 1 dimensional space. [sent-165, score-0.53]
55 Therefore, clustering methods such as k-means can be used to cluster the data, one cluster per expert. [sent-166, score-0.142]
56 Secondly, we cluster all training data into two clusters and train one expert per cluster. [sent-169, score-0.706]
57 Different experts represent different local portions of training data in different scales. [sent-171, score-0.323]
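A minimal sketch of this initialization step, assuming scikit-learn's k-means is an acceptable stand-in for the clustering routine (the function name and seed handling are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def init_expert_assignments(X, y, n_experts=2, seed=0):
    """Cluster the joint (x, y) points in the combined (d+1)-dimensional space,
    one cluster per expert, following the locality assumption described above."""
    Z = np.hstack([X, np.asarray(y).reshape(-1, 1)])
    return KMeans(n_clusters=n_experts, n_init=10, random_state=seed).fit_predict(Z)
```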
58 Active set selection We now address the problem of selecting the active set I_l of size M used to define the feature vector φ_l for expert l. [sent-176, score-0.664]
59 The posterior distribution Q(v_l) can be proved to be Gaussian with inverse covariance U_l = γ_l Σ_n T_nl φ_l(x_n) φ_l(x_n)^T + K_l + σ_hl^2 I and mean µ_l = U_l^{-1} γ_l Σ_n T_nl y_n φ_l(x_n). [sent-177, score-0.179]
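A direct transcription of these two expressions as a sketch (here T_nl denotes the responsibility Q(t_n = l), and Phi stacks the feature vectors phi_l(x_n) as rows; variable names are assumptions):

```python
import numpy as np

def q_v_posterior(Phi, y, T_l, gamma_l, K_l, sigma_h2):
    """Inverse covariance U_l and mean mu_l of the Gaussian posterior Q(v_l)."""
    M = Phi.shape[1]
    U = gamma_l * (Phi * T_l[:, None]).T @ Phi + K_l + sigma_h2 * np.eye(M)   # U_l
    mu = np.linalg.solve(U, gamma_l * Phi.T @ (T_l * y))                      # mu_l = U_l^{-1} gamma_l sum_n T_nl y_n phi_l(x_n)
    return U, mu
```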
60 Thus, for small data sets, the active set can be set to the full training set (M = N ). [sent-180, score-0.209]
61 With Il fixed, we run the variational EM algorithm and obtain Q(Ω) and Θ. [sent-183, score-0.15]
62 Since Q(v_l) is Gaussian, v_l is always µ_l at the optimal point, and thus this optimization is equivalent to maximizing the determinant of the inverse covariance, |γ_l Σ_n T_nl φ_l(x_n) φ_l(x_n)^T + K_l + σ_hl^2 I|. [sent-188, score-0.396]
63 Looking for the global optimal active set with size M is not feasible. [sent-194, score-0.134]
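Since an exhaustive search is infeasible, one simple greedy approximation (an assumption for illustration; the paper's exact selection rule may differ) is to grow the active set one index at a time, scoring candidates with the log-determinant objective quoted above:

```python
def greedy_active_set(score, candidate_indices, M):
    """Greedily pick M indices, at each step adding the candidate that maximizes
    score(chosen + [j]), e.g. log|U_l| computed from the expressions above."""
    chosen, remaining = [], list(candidate_indices)
    for _ in range(M):
        best = max(remaining, key=lambda j: score(chosen + [j]))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```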
64 In summary, the variational EM algorithm with active set selection proceeds as follows. [sent-202, score-0.284]
65 During initialization, training data are clustered and assigned to each expert by the k-mean clustering algorithm noted above; the data assigned to each expert is used for randomly selecting the active set and then training the linear model. [sent-203, score-1.344]
66 During each iteration, we run variational EM to update parameters and hyperparameters; when the EM algorithm converges, we update the active set and Q(vl ) for each expert. [sent-204, score-0.284]
67 In this way, these pseudo-inputs X can be viewed as hyperparameters and can be optimized in the same variational EM algorithm without resorting to a separate update for active sets as we do. [sent-210, score-0.44]
68 (4) The first approximation uses the results from the variational inference. [sent-215, score-0.15]
69 Note that expert indicators T and cluster indicators Z are integrated out. [sent-216, score-0.711]
70 Equation (4) can be easily computed using the standard predictive algorithm for a mixture of linear experts. [sent-221, score-0.176]
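As an illustration of that step, a hedged sketch of the predictive mean under this approximation, with the gating function and expert objects treated as given (all names are assumptions):

```python
import numpy as np

def predictive_mean(x_star, gating_probs_fn, experts):
    """Predictive mean sum_l P(t = l | x*) * mu_l^T phi_l(x*); each expert dict
    carries its posterior weight mean 'mu' and feature map 'phi'."""
    w = gating_probs_fn(x_star)                                    # (L,) mixing weights
    per_expert = np.array([e["mu"] @ e["phi"](x_star) for e in experts])
    return float(w @ per_expert)
```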
71 Using these 400 points as training data, our method found two experts that fit the data nicely. [sent-231, score-0.323]
72 In general, expert one represents the last two functions while expert two represents the first two functions. [sent-234, score-1.06]
73 Note that the GP for expert one appears to fit the data of the first function comparably well to that of expert two. [sent-242, score-1.06]
74 However, the gating network does not support this: the means of the GMM for expert one do not cover the region of the first function. [sent-243, score-0.756]
75 We did not plot the mean of the predictive distribution as this data set has multiple modes in the output dimension. [sent-246, score-0.16]
76 Larger active sets did not give appreciably better results. [sent-248, score-0.16]
77 Our algorithm yielded two experts with the first expert modeling the majority of the points and the second expert only depicting the beginning part. [sent-251, score-1.308]
78 Figure 2: Test results for toy data (left) and motorcycle data (right); the panels plot, for each expert l, the data assigned to expert l, the GP fit for expert l and the GMM means (m) for expert l, together with the mean of experts and posterior samples. [sent-256, score-3.617]
79 Each data point is assigned to an expert l based on its posterior probability Q(tn = l) and is referred to as “data for expert l”. [sent-257, score-1.118]
80 The means of the GMM for each expert are also shown at the bottom as “m for expert l”. [sent-258, score-1.06]
81 In the right figure, the mean of the predictive distribution is shown as a solid line and samples drawn from the predictive distribution are shown as dots (100 samples for each of the 45 horizontal locations). [sent-259, score-0.241]
82 We also plot the mean of the predictive distribution (4) in Fig. [sent-260, score-0.137]
83 However, our results have more artifacts at input > 40 because that region shares the same std = 23. [sent-268, score-0.143]
84 However, we are interested in the inverse kinematics problem: given the end point position, we want to estimate the joint angles. [sent-280, score-0.156]
85 Since this problem involves predicting two correlated outputs at the same time, we used an independent set of local experts for each output but let these two outputs share the same gating network. [sent-283, score-0.499]
86 This is expected as we use more powerful GP experts vs. [sent-291, score-0.248]
87 We followed the standard DELVE testing framework: for the Boston data, there are two tests each using 128 training examples; for both Kin-8nm and Pumadyn-32nm data, there are four tests, each using 1024 training examples. [sent-300, score-0.187]
88 Left: illustration of the robot kinematics (adapted from [12]). [sent-311, score-0.137]
89 The mean of the Gaussian distribution with the highest probability was fed into the forward kinematics to obtain the estimated end point position. [sent-317, score-0.2]
90 Right: the second residue plot using the mean of the Gaussian distribution with the second highest probability only for region B. [sent-321, score-0.209]
91 Both residue plots are needed to check whether both modalities are detected correctly. [sent-324, score-0.147]
92 Our method (vmgp) is compared with a single Gaussian process trained using a maximum a posteriori method (gp), a bagged version of MARS (mars), a multi-layer perceptron trained using hybrid MCMC (mlp) and a committee of mixtures of linear experts (me) [11]. [sent-354, score-0.389]
93 set compromised the results, suggesting that for these high dimensional data sets, a large number of training examples are required; and for the present training sets, each training example carries information not represented by others. [sent-355, score-0.225]
94 We started with ten experts and found an average of 2, 1 and 2. [sent-356, score-0.248]
95 Finally, to test how our active set selection algorithm performs, we conducted a standard test for sparse GPs: 7168 samples from Pumadyn-32nm were used for training and the remaining 1024 were for testing. [sent-362, score-0.278]
96 5 Conclusions We present a new mixture of Gaussian processes model and apply a variational Bayesian method to train it. [sent-370, score-0.337]
97 This can be improved by using a smaller M for an expert with a smaller number of supporting training samples. [sent-380, score-0.605]
98 (A-1) The first term in (A-1) is the posterior probability for expert t∗ = l, and it is the sum over c of P(t∗ = l, z∗ = c | x∗) = P(x∗ | t∗ = l, z∗ = c) P(t∗ = l, z∗ = c) / Σ_{l′,c′} P(x∗ | t∗ = l′, z∗ = c′) P(t∗ = l′, z∗ = c′), (A-2) where P(t∗ = l, z∗ = c) = p_l q_lc. [sent-384, score-0.678]
99 The second term in (A-1) is the predictive probability for y ∗ given expert l, which is Gaussian. [sent-385, score-0.59]
100 Bayesian model search for mixture models based on optimizing variational bounds. [sent-411, score-0.266]
wordName wordTfidf (topN-words)
[('expert', 0.53), ('vl', 0.357), ('experts', 0.248), ('mlc', 0.191), ('il', 0.173), ('rlc', 0.17), ('ul', 0.167), ('gmm', 0.167), ('gating', 0.165), ('variational', 0.15), ('active', 0.134), ('hyperparameters', 0.13), ('hl', 0.13), ('ql', 0.119), ('mixture', 0.116), ('tnl', 0.106), ('residue', 0.102), ('gp', 0.101), ('kinematics', 0.093), ('gaussian', 0.092), ('em', 0.076), ('kl', 0.076), ('training', 0.075), ('delve', 0.074), ('tn', 0.074), ('cluster', 0.071), ('maxil', 0.064), ('predictive', 0.06), ('posterior', 0.058), ('mixtures', 0.056), ('mars', 0.056), ('indicators', 0.055), ('std', 0.053), ('indicator', 0.051), ('rasmussen', 0.05), ('gps', 0.05), ('inputs', 0.048), ('pl', 0.048), ('breaks', 0.045), ('motorcycle', 0.045), ('modalities', 0.045), ('robot', 0.044), ('sparse', 0.044), ('outputs', 0.043), ('qlc', 0.042), ('standardised', 0.042), ('vmgp', 0.042), ('processes', 0.041), ('xn', 0.041), ('inverse', 0.039), ('initialization', 0.038), ('hidden', 0.038), ('process', 0.037), ('mlp', 0.037), ('moe', 0.037), ('stds', 0.037), ('modality', 0.037), ('boston', 0.037), ('tests', 0.037), ('toy', 0.036), ('forward', 0.035), ('gibbs', 0.035), ('input', 0.035), ('angles', 0.034), ('svens', 0.034), ('jacobs', 0.034), ('mcmc', 0.034), ('yn', 0.034), ('arm', 0.033), ('meeds', 0.032), ('nicely', 0.032), ('wishart', 0.032), ('dependency', 0.031), ('network', 0.031), ('region', 0.03), ('nl', 0.03), ('train', 0.03), ('comparable', 0.029), ('plot', 0.029), ('snelson', 0.029), ('dir', 0.029), ('exible', 0.028), ('secondly', 0.028), ('williams', 0.028), ('categorical', 0.027), ('bishop', 0.027), ('advances', 0.026), ('mit', 0.026), ('sets', 0.026), ('generative', 0.026), ('samples', 0.025), ('artifacts', 0.025), ('mean', 0.025), ('trained', 0.024), ('end', 0.024), ('dirichlet', 0.024), ('zn', 0.024), ('factorized', 0.023), ('modes', 0.023), ('distribution', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 249 nips-2008-Variational Mixture of Gaussian Process Experts
Author: Chao Yuan, Claus Neubauer
Abstract: Mixture of Gaussian processes models extended a single Gaussian process with ability of modeling multi-modal data and reduction of training complexity. Previous inference algorithms for these models are mostly based on Gibbs sampling, which can be very slow, particularly for large-scale data sets. We present a new generative mixture of experts model. Each expert is still a Gaussian process but is reformulated by a linear model. This breaks the dependency among training outputs and enables us to use a much faster variational Bayesian algorithm for training. Our gating network is more flexible than previous generative approaches as inputs for each expert are modeled by a Gaussian mixture model. The number of experts and number of Gaussian components for an expert are inferred automatically. A variety of tests show the advantages of our method. 1
2 0.12140499 32 nips-2008-Bayesian Kernel Shaping for Learning Control
Author: Jo-anne Ting, Mrinal Kalakrishnan, Sethu Vijayakumar, Stefan Schaal
Abstract: In kernel-based regression learning, optimizing each kernel individually is useful when the data density, curvature of regression surfaces (or decision boundaries) or magnitude of output noise varies spatially. Previous work has suggested gradient descent techniques or complex statistical hypothesis methods for local kernel shaping, typically requiring some amount of manual tuning of meta parameters. We introduce a Bayesian formulation of nonparametric regression that, with the help of variational approximations, results in an EM-like algorithm for simultaneous estimation of regression and kernel parameters. The algorithm is computationally efficient, requires no sampling, automatically rejects outliers and has only one prior to be specified. It can be used for nonparametric regression with local polynomials or as a novel method to achieve nonstationary regression with Gaussian processes. Our methods are particularly useful for learning control, where reliable estimation of local tangent planes is essential for adaptive controllers and reinforcement learning. We evaluate our methods on several synthetic data sets and on an actual robot which learns a task-level control law. 1
3 0.11816907 138 nips-2008-Modeling human function learning with Gaussian processes
Author: Thomas L. Griffiths, Chris Lucas, Joseph Williams, Michael L. Kalish
Abstract: Accounts of how people learn functional relationships between continuous variables have tended to focus on two possibilities: that people are estimating explicit functions, or that they are performing associative learning supported by similarity. We provide a rational analysis of function learning, drawing on work on regression in machine learning and statistics. Using the equivalence of Bayesian linear regression and Gaussian processes, we show that learning explicit rules and using similarity can be seen as two views of one solution to this problem. We use this insight to define a Gaussian process model of human function learning that combines the strengths of both approaches. 1
4 0.11491076 12 nips-2008-Accelerating Bayesian Inference over Nonlinear Differential Equations with Gaussian Processes
Author: Ben Calderhead, Mark Girolami, Neil D. Lawrence
Abstract: Identification and comparison of nonlinear dynamical system models using noisy and sparse experimental data is a vital task in many fields, however current methods are computationally expensive and prone to error due in part to the nonlinear nature of the likelihood surfaces induced. We present an accelerated sampling procedure which enables Bayesian inference of parameters in nonlinear ordinary and delay differential equations via the novel use of Gaussian processes (GP). Our method involves GP regression over time-series data, and the resulting derivative and time delay estimates make parameter inference possible without solving the dynamical system explicitly, resulting in dramatic savings of computational time. We demonstrate the speed and statistical accuracy of our approach using examples of both ordinary and delay differential equations, and provide a comprehensive comparison with current state of the art methods. 1
5 0.08900921 53 nips-2008-Counting Solution Clusters in Graph Coloring Problems Using Belief Propagation
Author: Lukas Kroc, Ashish Sabharwal, Bart Selman
Abstract: We show that an important and computationally challenging solution space feature of the graph coloring problem (COL), namely the number of clusters of solutions, can be accurately estimated by a technique very similar to one for counting the number of solutions. This cluster counting approach can be naturally written in terms of a new factor graph derived from the factor graph representing the COL instance. Using a variant of the Belief Propagation inference framework, we can efficiently approximate cluster counts in random COL problems over a large range of graph densities. We illustrate the algorithm on instances with up to 100, 000 vertices. Moreover, we supply a methodology for computing the number of clusters exactly using advanced techniques from the knowledge compilation literature. This methodology scales up to several hundred variables. 1
6 0.087350838 121 nips-2008-Learning to Use Working Memory in Partially Observable Environments through Dopaminergic Reinforcement
7 0.08558242 216 nips-2008-Sparse probabilistic projections
8 0.084118657 71 nips-2008-Efficient Sampling for Gaussian Process Inference using Control Variables
9 0.08034271 247 nips-2008-Using Bayesian Dynamical Systems for Motion Template Libraries
10 0.079160191 103 nips-2008-Implicit Mixtures of Restricted Boltzmann Machines
11 0.075522944 146 nips-2008-Multi-task Gaussian Process Learning of Robot Inverse Dynamics
12 0.074882209 62 nips-2008-Differentiable Sparse Coding
13 0.072034225 21 nips-2008-An Homotopy Algorithm for the Lasso with Online Observations
14 0.069376074 125 nips-2008-Local Gaussian Process Regression for Real Time Online Model Learning
15 0.06729579 101 nips-2008-Human Active Learning
16 0.066274114 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features
17 0.064937547 127 nips-2008-Logistic Normal Priors for Unsupervised Probabilistic Grammar Induction
18 0.064123899 77 nips-2008-Evaluating probabilities under high-dimensional latent variable models
19 0.063041255 213 nips-2008-Sparse Convolved Gaussian Processes for Multi-output Regression
20 0.060792629 221 nips-2008-Stochastic Relational Models for Large-scale Dyadic Data using MCMC
topicId topicWeight
[(0, -0.183), (1, -0.009), (2, 0.053), (3, 0.025), (4, 0.06), (5, -0.08), (6, -0.015), (7, 0.14), (8, -0.032), (9, 0.037), (10, 0.048), (11, 0.059), (12, 0.092), (13, -0.043), (14, 0.024), (15, -0.033), (16, -0.143), (17, 0.017), (18, 0.032), (19, 0.0), (20, 0.043), (21, -0.13), (22, 0.09), (23, 0.077), (24, 0.073), (25, -0.029), (26, 0.071), (27, -0.038), (28, -0.094), (29, 0.032), (30, 0.001), (31, -0.04), (32, -0.041), (33, 0.042), (34, -0.01), (35, 0.076), (36, -0.08), (37, -0.008), (38, 0.023), (39, 0.077), (40, -0.094), (41, -0.11), (42, -0.047), (43, 0.017), (44, 0.08), (45, 0.014), (46, -0.031), (47, -0.052), (48, 0.084), (49, -0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.92885947 249 nips-2008-Variational Mixture of Gaussian Process Experts
Author: Chao Yuan, Claus Neubauer
Abstract: Mixture of Gaussian processes models extended a single Gaussian process with ability of modeling multi-modal data and reduction of training complexity. Previous inference algorithms for these models are mostly based on Gibbs sampling, which can be very slow, particularly for large-scale data sets. We present a new generative mixture of experts model. Each expert is still a Gaussian process but is reformulated by a linear model. This breaks the dependency among training outputs and enables us to use a much faster variational Bayesian algorithm for training. Our gating network is more flexible than previous generative approaches as inputs for each expert are modeled by a Gaussian mixture model. The number of experts and number of Gaussian components for an expert are inferred automatically. A variety of tests show the advantages of our method. 1
2 0.70750362 138 nips-2008-Modeling human function learning with Gaussian processes
Author: Thomas L. Griffiths, Chris Lucas, Joseph Williams, Michael L. Kalish
Abstract: Accounts of how people learn functional relationships between continuous variables have tended to focus on two possibilities: that people are estimating explicit functions, or that they are performing associative learning supported by similarity. We provide a rational analysis of function learning, drawing on work on regression in machine learning and statistics. Using the equivalence of Bayesian linear regression and Gaussian processes, we show that learning explicit rules and using similarity can be seen as two views of one solution to this problem. We use this insight to define a Gaussian process model of human function learning that combines the strengths of both approaches. 1
3 0.63444626 12 nips-2008-Accelerating Bayesian Inference over Nonlinear Differential Equations with Gaussian Processes
Author: Ben Calderhead, Mark Girolami, Neil D. Lawrence
Abstract: Identification and comparison of nonlinear dynamical system models using noisy and sparse experimental data is a vital task in many fields, however current methods are computationally expensive and prone to error due in part to the nonlinear nature of the likelihood surfaces induced. We present an accelerated sampling procedure which enables Bayesian inference of parameters in nonlinear ordinary and delay differential equations via the novel use of Gaussian processes (GP). Our method involves GP regression over time-series data, and the resulting derivative and time delay estimates make parameter inference possible without solving the dynamical system explicitly, resulting in dramatic savings of computational time. We demonstrate the speed and statistical accuracy of our approach using examples of both ordinary and delay differential equations, and provide a comprehensive comparison with current state of the art methods. 1
4 0.63138217 213 nips-2008-Sparse Convolved Gaussian Processes for Multi-output Regression
Author: Mauricio Alvarez, Neil D. Lawrence
Abstract: We present a sparse approximation approach for dependent output Gaussian processes (GP). Employing a latent function framework, we apply the convolution process formalism to establish dependencies between output variables, where each latent function is represented as a GP. Based on these latent functions, we establish an approximation scheme using a conditional independence assumption between the output processes, leading to an approximation of the full covariance which is determined by the locations at which the latent functions are evaluated. We show results of the proposed methodology for synthetic data and real world applications on pollution prediction and a sensor network. 1
5 0.63003618 32 nips-2008-Bayesian Kernel Shaping for Learning Control
Author: Jo-anne Ting, Mrinal Kalakrishnan, Sethu Vijayakumar, Stefan Schaal
Abstract: In kernel-based regression learning, optimizing each kernel individually is useful when the data density, curvature of regression surfaces (or decision boundaries) or magnitude of output noise varies spatially. Previous work has suggested gradient descent techniques or complex statistical hypothesis methods for local kernel shaping, typically requiring some amount of manual tuning of meta parameters. We introduce a Bayesian formulation of nonparametric regression that, with the help of variational approximations, results in an EM-like algorithm for simultaneous estimation of regression and kernel parameters. The algorithm is computationally efficient, requires no sampling, automatically rejects outliers and has only one prior to be specified. It can be used for nonparametric regression with local polynomials or as a novel method to achieve nonstationary regression with Gaussian processes. Our methods are particularly useful for learning control, where reliable estimation of local tangent planes is essential for adaptive controllers and reinforcement learning. We evaluate our methods on several synthetic data sets and on an actual robot which learns a task-level control law. 1
6 0.56930208 146 nips-2008-Multi-task Gaussian Process Learning of Robot Inverse Dynamics
7 0.54658592 233 nips-2008-The Gaussian Process Density Sampler
8 0.53004092 216 nips-2008-Sparse probabilistic projections
9 0.51231962 71 nips-2008-Efficient Sampling for Gaussian Process Inference using Control Variables
10 0.50512397 125 nips-2008-Local Gaussian Process Regression for Real Time Online Model Learning
11 0.46956536 221 nips-2008-Stochastic Relational Models for Large-scale Dyadic Data using MCMC
12 0.46757048 247 nips-2008-Using Bayesian Dynamical Systems for Motion Template Libraries
13 0.45807105 101 nips-2008-Human Active Learning
14 0.4574587 21 nips-2008-An Homotopy Algorithm for the Lasso with Online Observations
15 0.45521283 185 nips-2008-Privacy-preserving logistic regression
16 0.44500107 100 nips-2008-How memory biases affect information transmission: A rational analysis of serial reproduction
17 0.43873537 31 nips-2008-Bayesian Exponential Family PCA
18 0.43397921 105 nips-2008-Improving on Expectation Propagation
19 0.42814294 29 nips-2008-Automatic online tuning for fast Gaussian summation
20 0.42040235 53 nips-2008-Counting Solution Clusters in Graph Coloring Problems Using Belief Propagation
topicId topicWeight
[(6, 0.064), (7, 0.092), (12, 0.029), (28, 0.155), (57, 0.126), (59, 0.012), (60, 0.276), (63, 0.024), (77, 0.032), (78, 0.016), (83, 0.082)]
simIndex simValue paperId paperTitle
1 0.93973339 38 nips-2008-Bio-inspired Real Time Sensory Map Realignment in a Robotic Barn Owl
Author: Juan Huo, Zhijun Yang, Alan F. Murray
Abstract: The visual and auditory map alignment in the Superior Colliculus (SC) of barn owl is important for its accurate localization for prey behavior. Prism learning or Blindness may interfere this alignment and cause loss of the capability of accurate prey. However, juvenile barn owl could recover its sensory map alignment by shifting its auditory map. The adaptation of this map alignment is believed based on activity dependent axon developing in Inferior Colliculus (IC). A model is built to explore this mechanism. In this model, axon growing process is instructed by an inhibitory network in SC while the strength of the inhibition adjusted by Spike Timing Dependent Plasticity (STDP). We test and analyze this mechanism by application of the neural structures involved in spatial localization in a robotic system. 1
same-paper 2 0.7650609 249 nips-2008-Variational Mixture of Gaussian Process Experts
Author: Chao Yuan, Claus Neubauer
Abstract: Mixture of Gaussian processes models extended a single Gaussian process with ability of modeling multi-modal data and reduction of training complexity. Previous inference algorithms for these models are mostly based on Gibbs sampling, which can be very slow, particularly for large-scale data sets. We present a new generative mixture of experts model. Each expert is still a Gaussian process but is reformulated by a linear model. This breaks the dependency among training outputs and enables us to use a much faster variational Bayesian algorithm for training. Our gating network is more flexible than previous generative approaches as inputs for each expert are modeled by a Gaussian mixture model. The number of experts and number of Gaussian components for an expert are inferred automatically. A variety of tests show the advantages of our method. 1
3 0.72400331 118 nips-2008-Learning Transformational Invariants from Natural Movies
Author: Charles Cadieu, Bruno A. Olshausen
Abstract: We describe a hierarchical, probabilistic model that learns to extract complex motion from movies of the natural environment. The model consists of two hidden layers: the first layer produces a sparse representation of the image that is expressed in terms of local amplitude and phase variables. The second layer learns the higher-order structure among the time-varying phase variables. After training on natural movies, the top layer units discover the structure of phase-shifts within the first layer. We show that the top layer units encode transformational invariants: they are selective for the speed and direction of a moving pattern, but are invariant to its spatial structure (orientation/spatial-frequency). The diversity of units in both the intermediate and top layers of the model provides a set of testable predictions for representations that might be found in V1 and MT. In addition, the model demonstrates how feedback from higher levels can influence representations at lower levels as a by-product of inference in a graphical model. 1
4 0.64893502 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
Author: Xuming He, Richard S. Zemel
Abstract: Extensive labeled data for image annotation systems, which learn to assign class labels to image regions, is difficult to obtain. We explore a hybrid model framework for utilizing partially labeled data that integrates a generative topic model for image appearance with discriminative label prediction. We propose three alternative formulations for imposing a spatial smoothness prior on the image labels. Tests of the new models and some baseline approaches on three real image datasets demonstrate the effectiveness of incorporating the latent structure. 1
5 0.64239472 27 nips-2008-Artificial Olfactory Brain for Mixture Identification
Author: Mehmet K. Muezzinoglu, Alexander Vergara, Ramon Huerta, Thomas Nowotny, Nikolai Rulkov, Henry Abarbanel, Allen Selverston, Mikhail Rabinovich
Abstract: The odor transduction process has a large time constant and is susceptible to various types of noise. Therefore, the olfactory code at the sensor/receptor level is in general a slow and highly variable indicator of the input odor in both natural and artificial situations. Insects overcome this problem by using a neuronal device in their Antennal Lobe (AL), which transforms the identity code of olfactory receptors to a spatio-temporal code. This transformation improves the decision of the Mushroom Bodies (MBs), the subsequent classifier, in both speed and accuracy. Here we propose a rate model based on two intrinsic mechanisms in the insect AL, namely integration and inhibition. Then we present a MB classifier model that resembles the sparse and random structure of insect MB. A local Hebbian learning procedure governs the plasticity in the model. These formulations not only help to understand the signal conditioning and classification methods of insect olfactory systems, but also can be leveraged in synthetic problems. Among them, we consider here the discrimination of odor mixtures from pure odors. We show on a set of records from metal-oxide gas sensors that the cascade of these two new models facilitates fast and accurate discrimination of even highly imbalanced mixtures from pure odors. 1
6 0.63841903 100 nips-2008-How memory biases affect information transmission: A rational analysis of serial reproduction
7 0.63831264 208 nips-2008-Shared Segmentation of Natural Scenes Using Dependent Pitman-Yor Processes
8 0.63664687 200 nips-2008-Robust Kernel Principal Component Analysis
9 0.63475829 62 nips-2008-Differentiable Sparse Coding
10 0.63330889 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
11 0.63251114 176 nips-2008-Partially Observed Maximum Entropy Discrimination Markov Networks
12 0.63153881 197 nips-2008-Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation
13 0.63110542 42 nips-2008-Cascaded Classification Models: Combining Models for Holistic Scene Understanding
14 0.6310302 194 nips-2008-Regularized Learning with Networks of Features
15 0.62997812 248 nips-2008-Using matrices to model symbolic relationship
16 0.62965542 66 nips-2008-Dynamic visual attention: searching for coding length increments
17 0.62896681 64 nips-2008-DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification
18 0.62756509 192 nips-2008-Reducing statistical dependencies in natural signals using radial Gaussianization
19 0.62746197 26 nips-2008-Analyzing human feature learning as nonparametric Bayesian inference
20 0.62698877 95 nips-2008-Grouping Contours Via a Related Image