nips nips2002 nips2002-110 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Joaquin Quiñonero-candela, Ole Winther
Abstract: In this paper, we consider Tipping’s relevance vector machine (RVM) [1] and formalize an incremental training strategy as a variant of the expectation-maximization (EM) algorithm that we call Subspace EM (SSEM). Working with a subset of active basis functions, the sparsity of the RVM solution will ensure that the number of basis functions and thereby the computational complexity is kept low. We also introduce a mean field approach to the intractable classification model that is expected to give a very good approximation to exact Bayesian inference and contains the Laplace approximation as a special case. We test the algorithms on two large data sets with O(103 − 104 ) examples. The results indicate that Bayesian learning of large data sets, e.g. the MNIST database is realistic.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In this paper, we consider Tipping’s relevance vector machine (RVM) [1] and formalize an incremental training strategy as a variant of the expectation-maximization (EM) algorithm that we call Subspace EM (SSEM). [sent-5, score-0.319]
2 Working with a subset of active basis functions, the sparsity of the RVM solution will ensure that the number of basis functions and thereby the computational complexity is kept low. [sent-6, score-0.578]
3 We also introduce a mean field approach to the intractable classification model that is expected to give a very good approximation to exact Bayesian inference and contains the Laplace approximation as a special case. [sent-7, score-0.203]
4 1 Introduction Tipping’s relevance vector machine (RVM) achieves both a sparse solution like the support vector machine (SVM) [2, 3] and the probabilistic predictions of Bayesian kernel machines based upon Gaussian process (GP) priors over functions [4, 5, 6, 7, 8]. [sent-12, score-0.412]
5 Sparsity is interesting both with respect to fast training and prediction and with respect to ease of interpretation of the solution. [sent-13, score-0.12]
6 It seems that Tipping’s relevance vector machine takes the best of both worlds. [sent-17, score-0.131]
7 It is a GP with a covariance matrix spanned by a small number of basis functions, making the computationally expensive matrix inversion operation go from O(N³), where N is the number of training examples, to O(M²N) (M being the number of basis functions). [sent-18, score-0.734]
8 Simulation studies have shown very sparse solutions, M ≪ N, and good test performance [1]. [sent-19, score-0.093]
9 However, starting the RVM learning with as many basis functions as examples, i.e. [sent-20, score-0.259]
10 one basis function at each training input point, leads to the same complexity as for Gaussian processes (GP), since in the initial step no basis functions are removed. [sent-22, score-0.631]
11 This has led Tipping to propose in [1] an incremental learning strategy that starts with only a single basis function and adds basis functions along the iterations, and to formalize it very recently [9]. [sent-24, score-0.545]
12 The total number of basis functions is kept low because basis functions are also removed. [sent-25, score-0.518]
13 In this paper we formalize this strategy using straightforward expectation-maximization (EM) [10] arguments to prove that the scheme is guaranteed to converge to a local maximum of the likelihood of the model parameters. [sent-26, score-0.133]
14 This can be achieved by numerical approximations to matrix inversion [11] and suboptimal projections onto finite subspaces of basis functions without having an explicit parametric form of such basis functions [12, 13]. [sent-28, score-0.585]
15 The rest of the paper is organized as follows: In section 2 we present the extended linear models in a Bayesian perspective, the regression model and the standard EM approach. [sent-33, score-0.119]
16 Section 5 gives results for the Mackey-Glass time-series and preliminary results on the MNIST hand-written digits database. [sent-36, score-0.083]
17 2 Regression An extended linear model is built by transforming the input space with an arbitrary set of basis functions φj : R^D → R that perform a non-linear transformation of the D-dimensional input space. [sent-38, score-0.259]
18 A linear model is applied to the transformed space, whose dimension is equal to the number of basis functions M: y(xi) = Σ_{j=1}^M ωj φj(xi) = Φ(xi) · ω (1), where Φ(xi) ≡ [φ1(xi), … [sent-39, score-0.259]
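To make eq. (1) concrete, the following is a minimal sketch in Python with NumPy of building a design matrix and evaluating the model output; the function name design_matrix, the Gaussian form of the basis functions and the fixed width are illustrative assumptions (the paper treats the basis functions as given, and only the experiments in section 5 use Gaussian basis functions).

import numpy as np

def design_matrix(X, centers, width=10.0):
    # Phi[i, j] = phi_j(x_i): a Gaussian basis function centred on centers[j].
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * width))

# y(xi) = sum_j wj phi_j(xi) = Phi(xi) . w, as in eq. (1)
X = np.random.randn(50, 2)              # N = 50 inputs of dimension D = 2
centers = X.copy()                      # one basis function per training input
w = np.random.randn(centers.shape[0])   # weights
y = design_matrix(X, centers) @ w       # model outputs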
19 The output of the model is thus a linear superposition of completely general basis functions. [sent-46, score-0.205]
20 While it is possible to optimize the parameters of the basis functions for the problem at hand [1, 16], we will in this paper assume that they are given. [sent-47, score-0.259]
21 The simplest possible regression learning scenario can be described as follows: a set of N input-target training pairs {xi, ti}, i = 1, …, N, are assumed to be independent and contaminated with Gaussian noise of variance σ². [sent-48, score-0.283]
22 The likelihood of the parameters ω is given by p(t|ω, σ²) = (2πσ²)^(−N/2) exp(−‖t − Φω‖²/(2σ²)) (2), where t = (t1, … [sent-49, score-0.089]
23 In general, the implied prior over functions is a very complicated distribution. [sent-54, score-0.084]
24 However, choosing a Gaussian prior on the weights, the prior over functions also becomes Gaussian, i.e. a GP. [sent-55, score-0.142]
25 For the specific choice of a factorized distribution with variance αj⁻¹: p(ωj|αj) = √(αj/2π) exp(−αj ωj²/2) (3), the prior over functions p(y|α) is N(0, ΦA⁻¹Φᵀ), i.e. [sent-58, score-0.128]
26 a Gaussian process with covariance function given by Cov(xi, xj) = Σ_{k=1}^M (1/αk) φk(xi) φk(xj) (4), where α = (α0, … [sent-60, score-0.075]
27 We can now see how sparseness in terms of the basis vectors may arise: if αk⁻¹ = 0, the k-th basis vector Φk ≡ [φk(x1), … [sent-67, score-0.386]
28 Associating a basis function with each input point may thus lead to a model with a sparse representation in the inputs, i.e. [sent-73, score-0.228]
29 the solution is only spanned by a subset of all input points. [sent-75, score-0.081]
30 This is exactly the idea behind the relevance vector machine, introduced by Tipping [17]. [sent-76, score-0.1]
31 We will see in the following how this also leads to a lower computational complexity than using a regular Gaussian process kernel. [sent-77, score-0.087]
32 The posterior distribution over the weights, obtained through Bayes’ rule, is a Gaussian distribution: p(ω|t, α, σ²) = p(t|ω, σ²) p(ω|α) / p(t|α, σ²) = N(ω|µ, Σ) (5), where N(t|µ, Σ) is a Gaussian distribution with mean µ and covariance Σ evaluated at t. [sent-78, score-0.137]
33 The mean and covariance are given by µ = σ⁻²ΣΦᵀt (6) and Σ = (σ⁻²ΦᵀΦ + A)⁻¹ (7). The uncertainty about the optimal value of the weights captured by the posterior distribution (5) can be used to build probabilistic predictions. [sent-79, score-0.195]
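A minimal sketch of eqs. (6) and (7) in Python with NumPy, assuming the design matrix Phi, the targets t, the precisions alpha (the diagonal of A) and the noise variance sigma2 are given; a practical implementation would use a Cholesky factorization instead of the explicit inverse shown here.

import numpy as np

def posterior(Phi, t, alpha, sigma2):
    # Eq. (7): Sigma = (sigma^-2 Phi^T Phi + A)^-1 with A = diag(alpha).
    Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + np.diag(alpha))
    # Eq. (6): mu = sigma^-2 Sigma Phi^T t.
    mu = Sigma @ Phi.T @ t / sigma2
    return mu, Sigma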
34 The computational complexity of making predictions is thus O(M²P + M³ + M²N), where M is the number of selected basis functions (RVs) and P is the number of predictions. [sent-81, score-0.383]
35 The likelihood distribution over the training targets (2) can be “marginalized” with respect to the weights to obtain the marginal likelihood, which is also a Gaussian distribution: p(t|α, σ²) = ∫ p(t|ω, σ²) p(ω|α) dω = N(t|0, σ²I + ΦA⁻¹Φᵀ). [sent-84, score-0.347]
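The Gaussian marginal likelihood above can be evaluated directly; the sketch below illustrates its logarithm in Python with NumPy and is not taken from the paper's implementation.

import numpy as np

def log_marginal_likelihood(Phi, t, alpha, sigma2):
    # log N(t | 0, sigma^2 I + Phi A^-1 Phi^T)
    C = sigma2 * np.eye(len(t)) + Phi @ np.diag(1.0 / alpha) @ Phi.T
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (len(t) * np.log(2.0 * np.pi) + logdet + t @ np.linalg.solve(C, t))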
36 For regression, the E-step is exact (the lower bound on the marginal likelihood is made equal to the marginal likelihood) and consists in estimating the mean and variance (6) and (7) of the posterior distribution of the weights (5). [sent-90, score-0.56]
37 Although it is suboptimal in the EM sense, we have never observed it decrease the lower bound on the marginal log-likelihood. [sent-95, score-0.155]
38 In the M-step the noise variance is re-estimated as (σ²)new = ‖t − Φµ‖²/(N − Σj γj) (12). In the optimization process many αj grow to infinity, which effectively deletes the corresponding weight and basis function. [sent-98, score-0.208]
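A sketch of the M-step re-estimates in the same Python setting. The quantity gamma_j = 1 − alpha_j Sigma_jj and the update alpha_j ← gamma_j / mu_j² follow the standard relevance vector machine re-estimates of [1]; since eqs. (9)–(11) are not shown above, these exact forms should be read as assumptions, and the small floor on mu² is only there to keep the sketch numerically safe.

import numpy as np

def m_step(mu, Sigma, alpha, Phi, t):
    # gamma_j = 1 - alpha_j * Sigma_jj measures how well determined weight j is.
    gamma = 1.0 - alpha * np.diag(Sigma)
    # alpha_j -> gamma_j / mu_j^2: many alphas grow to infinity, pruning basis functions.
    alpha_new = gamma / np.maximum(mu ** 2, 1e-12)
    # Eq. (12): re-estimate of the noise variance.
    sigma2_new = np.sum((t - Phi @ mu) ** 2) / (len(t) - gamma.sum())
    return alpha_new, sigma2_new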
39 A serious limitation of the EM algorithm and variants for problems of this type is that the complexity of computing the covariance of the weights (7) in the E-step is O(M³ + M²N). [sent-101, score-0.154]
40 At least in the first iteration, where no basis functions have been deleted, M = N and we are facing the same kind of complexity explosion that limits the applicability of Gaussian processes to large training sets. [sent-102, score-0.486]
41 This has led Tipping [1] to consider a constructive or incremental training paradigm where one basis function is added before each E-step; since basis functions are removed in the M-step, it turns out in practice that the total number of basis functions and the complexity remain low [9]. [sent-103, score-0.891]
42 In the following section we introduce a new algorithm that formalizes this procedure and that can be proven to increase the marginal likelihood in each step. [sent-104, score-0.212]
43 3 Subspace EM We introduce an incremental approach to the EM algorithm, the Subspace EM (SSEM), that can be directly applied to training models like the RVM that rely on a linear superposition of completely general basis functions, both for classification and for regression. [sent-105, score-0.349]
44 Instead of starting with the full model, where all the basis functions are present with finite α values, we start with a fully pruned model with all αj set to infinity. [sent-108, score-0.259]
45 The active set at iteration n, Rn, contains the indices of the basis vectors with α less than infinity: R1 = {1}, Rn = {i | i ∈ Rn−1 ∧ αi ≤ L} ∪ {n} (13), where L is an arbitrarily defined, very large finite number. [sent-111, score-0.268]
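Eq. (13) amounts to a one-line update of the active set; the sketch below is an illustrative Python rendering, with the threshold L simply chosen very large as in the text.

def update_active_set(R_prev, alpha, n, L=1e12):
    # Eq. (13): keep previously active indices whose alpha is still at most L, then add candidate n.
    return [i for i in R_prev if alpha[i] <= L] + [n]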
46 At iteration n the E-step is taken only in the subspace spanned by the weights whose indexes are in Rn . [sent-115, score-0.299]
47 This helps reduce the computational complexity of the M-step to O(M³), where M is the number of relevance vectors. [sent-116, score-0.154]
48 Since the initial value of αj is infinity for all j, for regression the E-step always yields an equality between the log marginal likelihood and its lower bound. [sent-117, score-0.393]
49 At any step n, the posterior can be exactly projected onto the space spanned by the weights ωj such that j ∈ Rn, because αk = ∞ for all k not in Rn. [sent-118, score-0.176]
50 Hence in the regression case, the SSEM never decreases the log marginal likelihood. [sent-119, score-0.276]
51 (L is a very large number.) 1. Set n = 1. 2. Update the set of active indexes Rn. 3. Perform an E-step in the subspace of ωj such that j ∈ Rn. 4. Perform the M-step for all αj such that j ∈ Rn. 5. If all basis functions have been visited, end; else go to 2. [sent-122, score-0.4]
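Putting the steps together, the following is one possible Python rendering of the SSEM loop for regression. It reuses the posterior, m_step and update_active_set helpers sketched earlier (those definitions and the NumPy import are assumed to be in scope); the initial sigma2, the value of L and the number of final batch sweeps are illustrative choices rather than values prescribed by the paper.

def ssem_regression(Phi_full, t, sigma2=0.1, L=1e12, n_batch=50):
    M = Phi_full.shape[1]
    alpha = np.full(M, np.inf)        # fully pruned start: all alpha_j at infinity
    active = []
    for n in range(M):                # one candidate basis function per iteration
        active = update_active_set(active, alpha, n, L)
        alpha[n] = min(alpha[n], L)   # the new candidate enters with a finite (very large) alpha
        Phi = Phi_full[:, active]
        mu, Sigma = posterior(Phi, t, alpha[active], sigma2)               # E-step in the subspace
        alpha[active], sigma2 = m_step(mu, Sigma, alpha[active], Phi, t)   # M-step on the subspace
    for _ in range(n_batch):          # batch EM sweeps on the final active set
        active = [i for i in active if alpha[i] <= L]
        Phi = Phi_full[:, active]
        mu, Sigma = posterior(Phi, t, alpha[active], sigma2)
        alpha[active], sigma2 = m_step(mu, Sigma, alpha[active], Phi, t)
    return active, alpha, sigma2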
52 Log marginal likelihood as a function of the elapsed CPU time (left) and corresponding number of relevance vectors (right) for both SSEM and EM. [sent-132, score-0.312]
53 We perform one EM step for each time a new basis function is added to the active set. [sent-133, score-0.269]
54 Once all the examples have been visited, we switch to the batch EM algorithm on the active set until some convergence criterion has been satisfied, for example until the relative increase in the likelihood is smaller than a certain threshold. [sent-134, score-0.391]
55 In practice some 50 of these batch EM iterations are enough. [sent-135, score-0.201]
56 Here, we will discuss the adaptive TAP mean field approach, initially proposed for Gaussian processes [8], which is readily translated to RVMs. [sent-137, score-0.124]
57 The mean field approach has the appealing features that it retains the computational efficiency of RVMs, is exact for the regression and reduces to the Laplace approximation in the limit where all the variability comes from the prior distribution. [sent-138, score-0.25]
58 We consider binary t = ±1 classification using the probit likelihood with ’input’ noise σ²: p(t|y(x)) = erf(t y(x)/σ) (14), where Dz ≡ e^(−z²/2) dz/√(2π) and erf(x) ≡ ∫_{−∞}^x Dz is an error function (or cumulative Gaussian distribution). [sent-139, score-0.198]
59 The advantage of using this sigmoid rather than the commonly used 0/1-logistic is that, under the mean field approximation, we can derive an analytical expression for the predictive distribution p(t∗|x∗, t) = ∫ p(t∗|y) p(y|x∗, t) dy needed for making Bayesian predictions. [sent-140, score-0.196]
60 Both a variational and the advanced mean field approach (used here) make a Gaussian approximation for p(y|x∗, t) [8], with mean and variance given by the regression results y∗ and σ∗² − σ², and with y∗ and σ∗² given by eqs. [sent-141, score-0.324]
61 This leads to the following approximation for the predictive distribution: p(t∗|x∗, t) = ∫ erf(t∗ y/σ) p(y|x∗, t) dy = erf(t∗ y∗/σ∗). [sent-143, score-0.328]
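A small sketch of this predictive computation in Python, assuming SciPy's ndtr for the cumulative Gaussian that plays the role of erf(·) in eq. (14); y_star and var_y denote the mean and variance of the Gaussian approximation to p(y|x∗, t), so that σ∗² = var_y + σ².

import numpy as np
from scipy.special import ndtr   # cumulative Gaussian, i.e. the erf(.) of eq. (14)

def predictive_class_prob(t_star, y_star, var_y, sigma2):
    # p(t*|x*, t) = erf(t* y* / sigma_*), with sigma_*^2 = var_y + sigma^2
    return ndtr(t_star * y_star / np.sqrt(var_y + sigma2))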
62 The sufficient statistics of the weights are written in terms of a set of O(N) mean field parameters: µ = A⁻¹Φᵀτ (16) and Σ = (A + ΦᵀΩΦ)⁻¹ (17), where τi ≡ ∂/∂y_i^c ln Z(y_i^c, V_i^c + σ²), and Z(y_i^c, V_i^c + σ²) ≡ ∫ p(ti | y_i^c + z √(V_i^c + σ²)) Dz = erf(ti y_i^c / √(V_i^c + σ²)) (18). [sent-146, score-0.345]
63 The last equality holds for the likelihood eq. [sent-147, score-0.117]
64 (14), and y_i^c and V_i^c are the mean and variance of the so-called cavity field. [sent-148, score-0.223]
65 The distinction between the different approximation schemes is solely in the variance V_i^c: V_i^c = 0 is the Laplace approximation, V_i^c = (ΦA⁻¹Φᵀ)ii is the so-called naive mean field theory, and an improved estimate is available from the adaptive TAP mean field theory [8]. [sent-150, score-0.205]
66 Lastly, the diagonal matrix Ω, given in terms of the derivatives ∂τi, is the equivalent of the noise variance in the regression model (compare eqs. [sent-151, score-0.163]
67 5 Simulations We illustrate the performance of the SSEM for regression on the Mackey-Glass chaotic time series, which is well-known for its strong non-linearity. [sent-156, score-0.119]
68 We use Gaussian basis functions of fixed variance ν² = 10. [sent-162, score-0.303]
69 We perform prediction experiments for different sizes of the training set. [sent-164, score-0.148]
70 We perform in each case 10 repetitions with different partitions of the data sets into training and test. [sent-165, score-0.141]
71 We compare the test error, the number of RVs selected and the computer time needed for the batch and the SSEM method. [sent-166, score-0.241]
72 We present the results obtained with the growth method relative to the results obtained with the batch method in figure 3. [sent-167, score-0.248]
73 As expected, the relative [sent-168, score-0.301]
74 [Figure 3 plot area: left panel, Mackey-Glass data, Ete^growth/Ete^batch, Tcpu^growth/Tcpu^batch and NRV^growth/NRV^batch versus the number of training examples; right panel, classification on MNIST digits versus iteration.] [sent-175, score-0.115]
75 Figure 3: Left: Regression, mean values over 10 repetitions of relative test error, number of RVs and computer time for the Mackey-Glass data, up to 2400 training examples and 5804 test examples. [sent-176, score-0.286]
76 Right: Classification, Log marginal likelihood, test and training errors while training on one class against all the others, 60000 training and 10000 test examples. [sent-177, score-0.434]
77 computer time of the growth method compared with the batch method decreases with size of the training set. [sent-178, score-0.325]
78 For a few thousand examples the SSEM method is an order of magnitude faster than the batch method. [sent-179, score-0.266]
79 The batch method proved only to be faster for 100 training examples, and could not be used with data sets of thousands of examples on the machine on which we ran the experiments because of its high memory requirements. [sent-180, score-0.374]
80 This is the reason why we only ran the comparison for up to 2400 training examples for the Mackey-Glass data set. [sent-181, score-0.077]
81 Our experiments for classification are, at the time of sending this paper to press, very premature: we chose a very large data set, the MNIST database of handwritten digits [20], with 60000 training and 10000 test images. [sent-182, score-0.17]
82 We train on 13484 examples (the 6742 ones and another 6742 non-one digits selected at random from the rest) and we use 800 basis functions for both the batch and the Subspace EM. [sent-186, score-0.551]
83 In figure 3 we show the convergence of the SSEM in terms of the log marginal likelihood and the training and test probabilities of error. [sent-187, score-0.363]
84 Under the same conditions the SSEM needed 55 minutes to do the job, while the batch EM needed 186 minutes. [sent-191, score-0.201]
85 The SSEM gives a machine with 28 basis functions and the batch EM one with 31 basis functions. [sent-192, score-0.666]
86 6 Conclusion We have presented a new approach to Bayesian training of linear models, based on a subspace extension of the EM algorithm that we call Subspace EM (SSEM). [sent-193, score-0.169]
87 The new method iteratively builds models from a potentially big library of basis functions. [sent-194, score-0.175]
88 the solution is spanned by a small number M of basis functions, which is much smaller than N, the number of examples. [sent-197, score-0.256]
89 A prime example of this is Tipping’s relevance vector machine that typically produces solutions that are sparser than those of support vector machines. [sent-198, score-0.131]
90 For classification, we have presented a mean field approach that is expected to be a very good approximation to the exact inference and contains the widely used Laplace approximation as an extreme case. [sent-200, score-0.203]
91 We have applied the SSEM algorithm to both a large regression and a large classification data set. [sent-201, score-0.119]
92 Tipping, “Sparse Bayesian learning and the relevance vector machine,” Journal of Machine Learning Research, vol. [sent-207, score-0.18]
93 [9] Michael Tipping and Anita Faul, “Fast marginal likelihood maximisation for sparse Bayesian models,” in International Workshop on Artificial Intelligence and Statistics, 2003. [sent-236, score-0.345]
94 Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” J. [sent-243, score-0.089]
95 Bartlett, “Sparse greedy Gaussian process regression,” in Advances in Neural Information Processing Systems, 2001, number 13, pp. [sent-254, score-0.125]
96 [13] Lehel Csató and Manfred Opper, “Sparse representation for Gaussian process models,” in Advances in Neural Information Processing Systems, 2001, number 13, pp. [sent-256, score-0.125]
97 [14] Volker Tresp, “Mixtures of Gaussian processes,” in Advances in Neural Information Processing Systems, 2000, number 12, pp. [sent-258, score-0.092]
98 Rasmussen and Zoubin Ghahramani, “Infinite mixtures of Gaussian process experts,” in Advances in Neural Information Processing Systems, 2002, number 14. [sent-261, score-0.125]
99 [16] Joaquin Quiñonero-Candela and Lars Kai Hansen, “Time series prediction based on the relevance vector machine with adaptive kernels,” in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2002. [sent-262, score-0.106]
100 Tipping, “The relevance vector machine,” in Advances in Neural Information Processing Systems, 2000, number 12, pp. [sent-264, score-0.1]
wordName wordTfidf (topN-words)
[('ssem', 0.576), ('rvm', 0.219), ('vic', 0.219), ('batch', 0.201), ('em', 0.199), ('basis', 0.175), ('rn', 0.167), ('tipping', 0.161), ('rvs', 0.137), ('marginal', 0.123), ('regression', 0.119), ('erf', 0.109), ('relevance', 0.1), ('cpu', 0.096), ('subspace', 0.092), ('gaussian', 0.092), ('likelihood', 0.089), ('classi', 0.088), ('eld', 0.087), ('carl', 0.087), ('functions', 0.084), ('spanned', 0.081), ('bayesian', 0.08), ('training', 0.077), ('yi', 0.077), ('denmark', 0.077), ('rasmussen', 0.077), ('mnist', 0.073), ('lars', 0.072), ('incremental', 0.067), ('laplace', 0.067), ('processes', 0.066), ('hansen', 0.065), ('nity', 0.064), ('active', 0.063), ('xi', 0.061), ('dz', 0.061), ('cation', 0.059), ('weights', 0.058), ('mean', 0.058), ('winther', 0.055), ('complexity', 0.054), ('sparse', 0.053), ('digits', 0.053), ('mackay', 0.05), ('ole', 0.048), ('jqc', 0.048), ('growth', 0.047), ('approximation', 0.045), ('variance', 0.044), ('formalize', 0.044), ('kai', 0.044), ('lyngby', 0.044), ('tap', 0.044), ('joaquin', 0.044), ('cavity', 0.044), ('predictions', 0.043), ('ti', 0.043), ('covariance', 0.042), ('test', 0.04), ('gp', 0.04), ('prediction', 0.04), ('indexes', 0.038), ('opper', 0.038), ('examples', 0.038), ('kernel', 0.037), ('analytical', 0.037), ('posterior', 0.037), ('dy', 0.036), ('kth', 0.036), ('series', 0.035), ('chris', 0.035), ('percent', 0.035), ('inversion', 0.035), ('log', 0.034), ('repetitions', 0.033), ('informatics', 0.033), ('manfred', 0.033), ('process', 0.033), ('alex', 0.032), ('seconds', 0.032), ('visited', 0.032), ('jj', 0.032), ('suboptimal', 0.032), ('perform', 0.031), ('update', 0.031), ('machine', 0.031), ('preliminary', 0.03), ('superposition', 0.03), ('iteration', 0.03), ('predictive', 0.029), ('equality', 0.028), ('exact', 0.028), ('advances', 0.028), ('michael', 0.028), ('faster', 0.027), ('sparsity', 0.027), ('williams', 0.027), ('making', 0.027), ('inference', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 110 nips-2002-Incremental Gaussian Processes
Author: Joaquin Quiñonero-candela, Ole Winther
Abstract: In this paper, we consider Tipping’s relevance vector machine (RVM) [1] and formalize an incremental training strategy as a variant of the expectation-maximization (EM) algorithm that we call Subspace EM (SSEM). Working with a subset of active basis functions, the sparsity of the RVM solution will ensure that the number of basis functions and thereby the computational complexity is kept low. We also introduce a mean field approach to the intractable classification model that is expected to give a very good approximation to exact Bayesian inference and contains the Laplace approximation as a special case. We test the algorithms on two large data sets with O(103 − 104 ) examples. The results indicate that Bayesian learning of large data sets, e.g. the MNIST database is realistic.
2 0.1711008 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine
Author: Ralf Herbrich, Neil D. Lawrence, Matthias Seeger
Abstract: We present a framework for sparse Gaussian process (GP) methods which uses forward selection with criteria based on informationtheoretic principles, previously suggested for active learning. Our goal is not only to learn d–sparse predictors (which can be evaluated in O(d) rather than O(n), d n, n the number of training points), but also to perform training under strong restrictions on time and memory requirements. The scaling of our method is at most O(n · d2 ), and in large real-world classification experiments we show that it can match prediction performance of the popular support vector machine (SVM), yet can be significantly faster in training. In contrast to the SVM, our approximation produces estimates of predictive probabilities (‘error bars’), allows for Bayesian model selection and is less complex in implementation. 1
Author: Agathe Girard, Carl Edward Rasmussen, Joaquin Quiñonero Candela, Roderick Murray-Smith
Abstract: We consider the problem of multi-step ahead prediction in time series analysis using the non-parametric Gaussian process model. -step ahead forecasting of a discrete-time non-linear dynamic system can be performed by doing repeated one-step ahead predictions. For a state-space model of the form , the prediction of at time is based on the point estimates of the previous outputs. In this paper, we show how, using an analytical Gaussian approximation, we can formally incorporate the uncertainty about intermediate regressor values, thus updating the uncertainty on the current prediction.
4 0.11710223 10 nips-2002-A Model for Learning Variance Components of Natural Images
Author: Yan Karklin, Michael S. Lewicki
Abstract: We present a hierarchical Bayesian model for learning efficient codes of higher-order structure in natural images. The model, a non-linear generalization of independent component analysis, replaces the standard assumption of independence for the joint distribution of coefficients with a distribution that is adapted to the variance structure of the coefficients of an efficient image basis. This offers a novel description of higherorder image structure and provides a way to learn coarse-coded, sparsedistributed representations of abstract image properties such as object location, scale, and texture.
5 0.10988826 201 nips-2002-Transductive and Inductive Methods for Approximate Gaussian Process Regression
Author: Anton Schwaighofer, Volker Tresp
Abstract: Gaussian process regression allows a simple analytical treatment of exact Bayesian inference and has been found to provide good performance, yet scales badly with the number of training data. In this paper we compare several approaches towards scaling Gaussian processes regression to large data sets: the subset of representers method, the reduced rank approximation, online Gaussian processes, and the Bayesian committee machine. Furthermore we provide theoretical insight into some of our experimental results. We found that subset of representers methods can give good and particularly fast predictions for data sets with high and medium noise levels. On complex low noise data sets, the Bayesian committee machine achieves significantly better accuracy, yet at a higher computational cost.
6 0.10016929 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
7 0.097390942 41 nips-2002-Bayesian Monte Carlo
8 0.094272122 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
9 0.092823543 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
10 0.089375563 120 nips-2002-Kernel Design Using Boosting
11 0.085860729 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
12 0.084026329 45 nips-2002-Boosted Dyadic Kernel Discriminants
13 0.080519609 135 nips-2002-Learning with Multiple Labels
14 0.079900056 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
15 0.077732049 6 nips-2002-A Formulation for Minimax Probability Machine Regression
16 0.077359565 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization
17 0.07619039 90 nips-2002-Feature Selection in Mixture-Based Clustering
18 0.076128498 14 nips-2002-A Probabilistic Approach to Single Channel Blind Signal Separation
19 0.074065238 79 nips-2002-Evidence Optimization Techniques for Estimating Stimulus-Response Functions
20 0.071876176 196 nips-2002-The RA Scanner: Prediction of Rheumatoid Joint Inflammation Based on Laser Imaging
topicId topicWeight
[(0, -0.248), (1, -0.081), (2, 0.022), (3, 0.035), (4, 0.047), (5, 0.012), (6, -0.098), (7, 0.05), (8, 0.067), (9, 0.006), (10, 0.045), (11, -0.092), (12, 0.156), (13, 0.061), (14, 0.1), (15, -0.039), (16, -0.043), (17, 0.001), (18, 0.105), (19, 0.141), (20, 0.101), (21, 0.001), (22, -0.01), (23, -0.024), (24, 0.036), (25, -0.055), (26, -0.04), (27, 0.041), (28, 0.044), (29, 0.081), (30, 0.045), (31, 0.12), (32, 0.061), (33, 0.005), (34, -0.169), (35, -0.022), (36, -0.133), (37, -0.067), (38, 0.001), (39, -0.126), (40, 0.012), (41, -0.084), (42, -0.087), (43, -0.035), (44, -0.13), (45, 0.043), (46, -0.056), (47, -0.021), (48, 0.026), (49, 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.95099849 110 nips-2002-Incremental Gaussian Processes
Author: Joaquin Quiñonero-candela, Ole Winther
Abstract: In this paper, we consider Tipping’s relevance vector machine (RVM) [1] and formalize an incremental training strategy as a variant of the expectation-maximization (EM) algorithm that we call Subspace EM (SSEM). Working with a subset of active basis functions, the sparsity of the RVM solution will ensure that the number of basis functions and thereby the computational complexity is kept low. We also introduce a mean field approach to the intractable classification model that is expected to give a very good approximation to exact Bayesian inference and contains the Laplace approximation as a special case. We test the algorithms on two large data sets with O(103 − 104 ) examples. The results indicate that Bayesian learning of large data sets, e.g. the MNIST database is realistic.
2 0.8010788 201 nips-2002-Transductive and Inductive Methods for Approximate Gaussian Process Regression
Author: Anton Schwaighofer, Volker Tresp
Abstract: Gaussian process regression allows a simple analytical treatment of exact Bayesian inference and has been found to provide good performance, yet scales badly with the number of training data. In this paper we compare several approaches towards scaling Gaussian processes regression to large data sets: the subset of representers method, the reduced rank approximation, online Gaussian processes, and the Bayesian committee machine. Furthermore we provide theoretical insight into some of our experimental results. We found that subset of representers methods can give good and particularly fast predictions for data sets with high and medium noise levels. On complex low noise data sets, the Bayesian committee machine achieves significantly better accuracy, yet at a higher computational cost.
3 0.75975537 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine
Author: Ralf Herbrich, Neil D. Lawrence, Matthias Seeger
Abstract: We present a framework for sparse Gaussian process (GP) methods which uses forward selection with criteria based on informationtheoretic principles, previously suggested for active learning. Our goal is not only to learn d–sparse predictors (which can be evaluated in O(d) rather than O(n), d n, n the number of training points), but also to perform training under strong restrictions on time and memory requirements. The scaling of our method is at most O(n · d2 ), and in large real-world classification experiments we show that it can match prediction performance of the popular support vector machine (SVM), yet can be significantly faster in training. In contrast to the SVM, our approximation produces estimates of predictive probabilities (‘error bars’), allows for Bayesian model selection and is less complex in implementation. 1
4 0.71405768 95 nips-2002-Gaussian Process Priors with Uncertain Inputs Application to Multiple-Step Ahead Time Series Forecasting
Author: Agathe Girard, Carl Edward Rasmussen, Joaquin Quiñonero Candela, Roderick Murray-Smith
Abstract: We consider the problem of multi-step ahead prediction in time series analysis using the non-parametric Gaussian process model. -step ahead forecasting of a discrete-time non-linear dynamic system can be performed by doing repeated one-step ahead predictions. For a state-space model of the form , the prediction of at time is based on the point estimates of the previous outputs. In this paper, we show how, using an analytical Gaussian approximation, we can formally incorporate the uncertainty about intermediate regressor values, thus updating the uncertainty on the current prediction.
5 0.62910813 6 nips-2002-A Formulation for Minimax Probability Machine Regression
Author: Thomas Strohmann, Gregory Z. Grudic
Abstract: We formulate the regression problem as one of maximizing the minimum probability, symbolized by Ω, that future predicted outputs of the regression model will be within some ±ε bound of the true regression function. Our formulation is unique in that we obtain a direct estimate of this lower probability bound Ω. The proposed framework, minimax probability machine regression (MPMR), is based on the recently described minimax probability machine classification algorithm [Lanckriet et al.] and uses Mercer Kernels to obtain nonlinear regression models. MPMR is tested on both toy and real world data, verifying the accuracy of the Ω bound, and the efficacy of the regression models. 1
6 0.52715611 196 nips-2002-The RA Scanner: Prediction of Rheumatoid Joint Inflammation Based on Laser Imaging
7 0.51194513 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
8 0.47289506 41 nips-2002-Bayesian Monte Carlo
9 0.47007248 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
10 0.45802185 17 nips-2002-A Statistical Mechanics Approach to Approximate Analytical Bootstrap Averages
11 0.45548001 63 nips-2002-Critical Lines in Symmetry of Mixture Models and its Application to Component Splitting
12 0.45174631 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition
13 0.45123863 22 nips-2002-Adaptive Nonlinear System Identification with Echo State Networks
14 0.44791558 150 nips-2002-Multiple Cause Vector Quantization
15 0.44739279 138 nips-2002-Manifold Parzen Windows
16 0.44038737 87 nips-2002-Fast Transformation-Invariant Factor Analysis
17 0.43221447 115 nips-2002-Informed Projections
18 0.4288533 124 nips-2002-Learning Graphical Models with Mercer Kernels
19 0.42703962 38 nips-2002-Bayesian Estimation of Time-Frequency Coefficients for Audio Signal Enhancement
20 0.42002445 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
topicId topicWeight
[(11, 0.012), (23, 0.015), (42, 0.094), (51, 0.238), (54, 0.1), (55, 0.069), (57, 0.017), (67, 0.015), (68, 0.026), (74, 0.102), (92, 0.053), (98, 0.191)]
simIndex simValue paperId paperTitle
1 0.92410147 101 nips-2002-Handling Missing Data with Variational Bayesian Learning of ICA
Author: Kwokleung Chan, Te-Won Lee, Terrence J. Sejnowski
Abstract: Missing data is common in real-world datasets and is a problem for many estimation techniques. We have developed a variational Bayesian method to perform Independent Component Analysis (ICA) on high-dimensional data containing missing entries. Missing data are handled naturally in the Bayesian framework by integrating the generative density model. Modeling the distributions of the independent sources with mixture of Gaussians allows sources to be estimated with different kurtosis and skewness. The variational Bayesian method automatically determines the dimensionality of the data and yields an accurate density model for the observed data without overfitting problems. This allows direct probability estimation of missing values in the high dimensional space and avoids dimension reduction preprocessing which is not feasible with missing data.
2 0.86511356 23 nips-2002-Adaptive Quantization and Density Estimation in Silicon
Author: Seth Bridges, Miguel Figueroa, Chris Diorio, Daniel J. Hsu
Abstract: We present the bump mixture model, a statistical model for analog data where the probabilistic semantics, inference, and learning rules derive from low-level transistor behavior. The bump mixture model relies on translinear circuits to perform probabilistic inference, and floating-gate devices to perform adaptation. This system is low power, asynchronous, and fully parallel, and supports various on-chip learning algorithms. In addition, the mixture model can perform several tasks such as probability estimation, vector quantization, classification, and clustering. We tested a fabricated system on clustering, quantization, and classification of handwritten digits and show performance comparable to the E-M algorithm on mixtures of Gaussians. 1 I n trod u cti on Many system-on-a-chip applications, such as data compression and signal processing, use online adaptation to improve or tune performance. These applications can benefit from the low-power compact design that analog VLSI learning systems can offer. Analog VLSI learning systems can benefit immensely from flexible learning algorithms that take advantage of silicon device physics for compact layout, and that are capable of a variety of learning tasks. One learning paradigm that encompasses a wide variety of learning tasks is density estimation, learning the probability distribution over the input data. A silicon density estimator can provide a basic template for VLSI systems for feature extraction, classification, adaptive vector quantization, and more. In this paper, we describe the bump mixture model, a statistical model that describes the probability distribution function of analog variables using low-level transistor equations. We intend the bump mixture model to be the silicon version of mixture of Gaussians [1], one of the most widely used statistical methods for modeling the probability distribution of a collection of data. Mixtures of Gaussians appear in many contexts from radial basis functions [1] to hidden Markov models [2]. In the bump mixture model, probability computations derive from translinear circuits [3] and learning derives from floating-gate device equations [4]. The bump mixture model can perform different functions such as quantization, probability estimation, and classification. In addition this VLSI mixture model can implement multiple learning algorithms using different peripheral circuitry. Because the equations for system operation and learning derive from natural transistor behavior, we can build large bump mixture model with millions of parameters on a single chip. We have fabricated a bump mixture model, and tested it on clustering, classification, and vector quantization of handwritten digits. The results show that the fabricated system performs comparably to mixtures of Gaussians trained with the E-M algorithm [1]. Our work builds upon several trends of research in the VLSI community. The results in this paper are complement recent work on probability propagation in analog VLSI [5-7]. These previous systems, intended for decoding applications in communication systems, model special forms of probability distributions over discrete variables, and do not incorporate learning. In contrast, the bump mixture model performs inference and learning on probability distributions over continuous variables. The bump mixture model significantly extends previous results on floating-gate circuits [4]. 
Our system is a fully realized floating-gate learning algorithm that can be used for vector quantization, probability estimation, clustering, and classification. Finally, the mixture model’s architecture is similar to many previous VLSI vector quantizers [8, 9]. We can view the bump mixture model as a VLSI vector quantizer with well-defined probabilistic semantics. Computations such as probability estimation and maximum-likelihood classification have a natural statistical interpretation under the mixture model. In addition, because we rely on floating-gate devices, the mixture model does not require a refresh mechanism unlike previous learning VLSI quantizers. 2 T h e ad ap ti ve b u mp ci rcu i t The adaptive bump circuit [4], depicted in Fig.1(a-b), forms the basis of the bump mixture model. This circuit is slightly different from previous versions reported in the literature. Nevertheless, the high level functionality remains the same; the adaptive bump circuit computes the similarity between a stored variable and an input, and adapts to increase the similarity between the stored variable and input. Fig.1(a) shows the computation portion of the circuit. The bump circuit takes as input, a differential voltage signal (+Vin, −Vin) around a DC bias, and computes the similarity between Vin and a stored value, µ. We represent the stored memory µ as a voltage: µ= Vw- − Vw+ 2 (1) where Vw+ and Vw− are the gate-offset voltages stored on capacitors C1 and C2. Because C1 and C2 isolate the gates of transistors M1 and M2 respectively, these transistors are floating-gate devices. Consequently, the stored voltages Vw+ and Vw− are nonvolatile. We can express the floating-gate voltages Vfg1 and Vfg2 as Vfg1 =Vin +Vw+ and Vfg2 =Vw− −Vin, and the output of the bump circuit as [10]: I out = Ib cosh 2 ( ( 4κ / SU ) (V t fg 1 − V fg 2 ) ) = Ib cosh ( ( 8κ / SU t )(Vin − µ ) ) 2 (2) where Ib is the bias current, κ is the gate-coupling coefficient, Ut is the thermal voltage, and S depends on the transistor sizes. Fig.1(b) shows Iout for three different stored values of µ. As the data show, different µ’s shift the location of the peak response of the circuit. Vw+ V fg1 V in V fg2 Vb M1 −V in M2 I out Vw− C1 C2 V ca sc V2 V1 Vb V tun M6 V fg1 V2 V1 V in j (a) (b) bump circuit's transfer function for three µ's 10 Iout (nA) µ2 µ1 µ3 6 4 2 0 -0.4 -0.2 V fg2 M3 M4 V inj 8 V tun M5 0 V in (c) 0.2 0.4 Figure 1. (a-b) The adaptive bump circuit. (a) The original bump circuit augmented by capacitors C1 and C2, and cascode transistors (driven by Vcasc). (b) The adaptation subcircuit. M3 and M4 control injection on the floating-gates and M5 and M6 control tunneling. (b) Measured output current of a bump circuit for three programmed memories. Fig.1(b) shows the circuit that implements learning in the adaptive bump circuit. We implement learning through Fowler-Nordheim tunneling [11] on tunneling junctions M5-M6 and hot electron injection [12] on the floating-gate transistors M3-M4. Transistor M3 and M5 control injection and tunneling on M1’s floating-gate. Transistors M4 and M6 control injection and tunneling on M2’s floating-gate. We activate tunneling and injection by a high Vtun and low Vinj respectively. In the adaptive bump circuit, both processes increase the similarity between Vin and µ. In addition, the magnitude of the update does not depend on the sign of (Vin − µ) because the differential input provides common-mode rejection to the input differential pair. 
The similarity function, as seen in Fig.1(b), has a Gaussian-like shape. Consequently, we can equate the output current of the bump circuit with the probability of the input under a distribution parameterized by mean µ: P (Vin | µ ) = I out (3) In addition, increasing the similarity between Vin and µ is equivalent to increasing P(Vin |µ). Consequently, the adaptive bump circuit adapts to maximize the likelihood of the present input under the circuit’s probability distribution. 3 T h e b u mp mi xtu re mod el We now describe the computations and learning rule implemented by the bump mixture model. A mixture model is a general class of statistical models that approximates the probability of an analog input as the weighted sum of probability of the input under several simple distributions. The bump mixture model comprises a set of Gaussian-like probability density functions, each parameterized by a mean vector, µi. Denoting the j th dimension of the mean of the ith density as µij, we express the probability of an input vector x as: P ( x ) = (1/ N ) i P ( x | i ) = (1/ N ) i (∏ P ( x j j | µij ) ) (4) where N is the number of densities in the model and i denotes the ith density. P(x|i) is the product of one-dimensional densities P(xj|µij) that depend on the j th dimension of the ith mean, µij. We derive each one-dimensional probability distribution from the output current of a single bump circuit. The bump mixture model makes two assumptions: (1) the component densities are equally likely, and (2) within each component density, the input dimensions are independent and have equal variance. Despite these restrictions, this mixture model can, in principle, approximate any probability density function [1]. The bump mixture model adapts all µi to maximize the likelihood of the training data. Learning in the bump mixture model is based on the E-M algorithm, the standard algorithm for training Gaussian mixture models. The E-M algorithm comprises two steps. The E-step computes the conditional probability of each density given the input, P(i|x). The M-step updates the parameters of each distribution to increase the likelihood of the data, using P(i|x) to scale the magnitude of each parameter update. In the online setting, the learning rule is: ∆µij = η P (i | x ) ∂ log P ( x j | µij ) ∂µij =η P( x | i) k P( x | k) ∂ log P ( x j | µij ) ∂µij (5) where η is a learning rate and k denotes component densities. Because the adaptive bump circuit already adapts to increase the likelihood of the present input, we approximate E-M by modulating injection and tunneling in the adaptive bump circuit by the conditional probability: ∆µij = η P ( i | x ) f ( x j − µ ij ) (6) where f() is the parameter update implemented by the bump circuit. We can modulate the learning update in (6) with other competitive factors instead of the conditional probability to implement a variety of learning rules such as online K-means. 4 S i l i con i mp l emen tati on We now describe a VLSI system that implements the silicon mixture model. The high level organization of the system detailed in Fig.2, is similar to VLSI vector quantization systems. The heart of the mixture model is a matrix of adaptive bump circuits where the ith row of bump circuits corresponds to the ith component density. In addition, the periphery of the matrix comprises a set of inhibitory circuits for performing probability estimation, inference, quantization, and generating feedback for learning. We send each dimension of an input x down a single column. 
Unity-gain inverting amplifiers (not pictured) at the boundary of the matrix convert each single ended voltage input into a differential signal. Each bump circuit computes a current that represents (P(xj|µij))σ, where σ is the common variance of the one-dimensional densities. The mixture model computes P(x|i) along the ith row and inhibitory circuits perform inference, estimation, or quantization. We utilize translinear devices [3] to perform all of these computations. Translinear devices, such as the subthreshold MOSFET and bipolar transistor, exhibit an exponential relationship between the gate-voltage and source current. This property allows us to establish a power-law relationship between currents and probabilities (i.e. a linear relationship between gate voltages and log-probabilities). x1 x2 xn Vtun,Vinj P(x|µ11) P(x|µ12) Inh() P(x|µ1n) Output P(x|µ1) µ P(x|µ21) P(x|µ22) P(x|µ2n) Inh() P(x|µ2) µ Figure 2. Bump mixture model architecture. The system comprises a matrix of adaptive bump circuits where each row computes the probability P(x|µi). Inhibitory circuits transform the output of each row into system outputs. Spike generators also transform inhibitory circuit outputs into rate-coded feedback for learning. We compute the multiplication of the probabilities in each row of Fig.2 as addition in the log domain using the circuit in Fig.3 (a). This circuit first converts each bump circuit’s current into a voltage using a diode (e.g. M1). M2’s capacitive divider computes Vavg as the average of the scalar log probabilities, logP(xj|µij): Vavg = (σ / N ) j log P ( x j | µ ij ) (7) where σ is the variance, N is the number of input dimensions, and voltages are in units of κ/Ut (Ut is the thermal voltage and κ is the transistor-gate coupling coefficient). Transistors M2- M5 mirror Vavg to the gate of M5. We define the drain voltage of M5 as log P(x|i) (up to an additive constant) and compute: log ( P ( x | i ) ) = (C1 +C2 ) C1 Vavg = (C1 +C2 )σ C1 N j ( ) log P ( x j | µ ij ) + k (8) where k is a constant dependent on Vg (the control gate voltage on M5), and C1 and C2 are capacitances. From eq.8 we can derive the variance as: σ = NC1 / ( C1 + C2 ) (9) The system computes different output functions and feedback signals for learning by operating on the log probabilities of eq.8. Fig.3(b) demonstrates a circuit that computes P(i|x) for each distribution. The circuit is a k-input differential pair where the bias transistor M0 normalizes currents representing the probabilities P(x|i) at the ith leg. Fig.3(c) demonstrates a circuit that computes P(x). The ith transistor exponentiates logP(x|i), and a single wire sums the currents. We can also apply other inhibitory circuits to the log probabilities such as winner-take-all circuits (WTA) [13] and resistive networks [14]. In our fabricated chip, we implemented probability estimation,conditional probability computation, and WTA. The WTA outputs the index of the most likely component distribution for the present input, and can be used to implement vector quantization and to produce feedback for an online K-means learning rule. At each synapse, the system combines a feedback signal, such as the conditional probability P(i|x), computed at the matrix periphery, with the adaptive bump circuit to implement learning. We trigger adaptation at each bump circuit by a rate-coded spike signal generated from the inhibitory circuit’s current outputs. We generate this spike train with a current-to-spike converter based on Lazzaro’s low-powered spiking neuron [15]. 
This rate-coded signal toggles Vtun and Vinj at each bump circuit. Consequently, adaptation is proportional to the frequency of the spike train, which is in turn a linear function of the inhibitory feedback signal. The alternative to the rate code would be to transform the inhibitory circuit’s output directly into analog Vs M1 Vavg M2 M5 Vavg C2 ... P(xn|µin)σ P(x1|µi1)σ Vs Vg Vb C1 M4 M3 M0 ... ... log P(x|i) ... ... P(x) P(i|x) log P(x|i) (a) (b) (c) Figure 3. (a) Circuit for computing logP(x|i). (b) Circuit for computing P(i|x). The current through the ith leg represents P(i|x). (c) Circuit for computing P(x). Vtun and Vinj signals. Because injection and tunneling are highly nonlinear functions of Vinj and Vtun respectively, implementing updates that are linear in the inhibitory feedback signal is quite difficult using this approach. 5 E xp eri men tal Res u l ts an d Con cl u s i on s We fabricated an 8 x 8 mixture model (8 probability distribution functions with 8 dimensions each) in a TSMC 0.35µm CMOS process available through MOSIS, and tested the chip on synthetic data and a handwritten digits dataset. In our tests, we found that due to a design error, one of the input dimensions coupled to the other inputs. Consequently, we held that input fixed throughout the tests, effectively reducing the input to 7 dimensions. In addition, we found that the learning rule in eq.6 produced poor performance because the variance of the bump distributions was too large. Consequently, in our learning experiments, we used the hard winner-take-all circuit to control adaptation, resulting in a K-means learning rule. We trained the chip to perform different tasks on handwritten digits from the MNIST dataset [16]. To prepare the data, we first perform PCA to reduce the 784-pixel images to sevendimensional vectors, and then sent the data on-chip. We first tested the circuit on clustering handwritten digits. We trained the chip on 1000 examples of each of the digits 1-8. Fig.4(a) shows reconstructions of the eight means before and after training. We compute each reconstruction by multiplying the means by the seven principal eigenvectors of the dataset. The data shows that the means diverge to associate with different digits. The chip learns to associate most digits with a single probability distribution. The lone exception is digit 5 which doesn’t clearly associate with one distribution. We speculate that the reason is that 3’s, 5’s, and 8’s are very similar in our training data’s seven-dimensional representation. Gaussian mixture models trained with the E-M algorithm also demonstrate similar results, recovering only seven out of the eight digits. We next evaluated the same learned means on vector quantization of a set of test digits (4400 examples of each digit). We compare the chip’s learned means with means learned by the batch E-M algorithm on mixtures of Gaussians (with σ=0.01), a mismatch E-M algorithm that models chip nonidealities, and a non-adaptive baseline quantizer. The purpose of the mismatch E-M algorithm was to assess the effect of nonuniform injection and tunneling strengths in floating-gate transistors. Because tunneling and injection magnitudes can vary by a large amount on different floatinggate transistors, the adaptive bump circuits can learn a mean that is somewhat offcenter. We measured the offset of each bump circuit when adapting to a constant input and constructed the mismatch E-M algorithm by altering the learned means during the M-step by the measured offset. 
We constructed the baseline quantizer by selecting, at random, an example of each digit for the quantizer codebook. For each quantizer, we computed the reconstruction error on the digit’s seven-dimensional after average squared quantization error before E-M Probability under 7's model (µA) 7 + 9 o 1.5 1 0.5 1 1.5 2 Probability under 9's model (µA) 1 2 3 4 5 6 7 8 digit (b) 2 0.5 10 0 baseline chip E-M/mismatch (a) 2.5 20 2.5 Figure 4. (a) Reconstruction of chip means before and after training with handwritten digits. (b) Comparison of average quantization error on unseen handwritten digits, for the chip’s learned means and mixture models trained by standard algorithms. (c) Plot of probability of unseen examples of 7’s and 9’s under two bump mixture models trained solely on each digit. (c) representation when we represent each test digit by the closest mean. The results in Fig.4(b) show that for most of the digits the chip’s learned means perform as well as the E-M algorithm, and better than the baseline quantizer in all cases. The one digit where the chip’s performance is far from the E-M algorithm is the digit “1”. Upon examination of the E-M algorithm’s results, we found that it associated two means with the digit “1”, where the chip allocated two means for the digit “3”. Over all the digits, the E-M algorithm exhibited a quantization error of 9.98, mismatch E-M gives a quantization error of 10.9, the chip’s error was 11.6, and the baseline quantizer’s error was 15.97. The data show that mismatch is a significant factor in the difference between the bump mixture model’s performance and the E-M algorithm’s performance in quantization tasks. Finally, we use the mixture model to classify handwritten digits. If we train a separate mixture model for each class of data, we can classify an input by comparing the probabilities of the input under each model. In our experiment, we train two separate mixture models: one on examples of the digit 7, and the other on examples of the digit 9. We then apply both mixtures to a set of unseen examples of digits 7 and 9, and record the probability score of each unseen example under each mixture model. We plot the resulting data in Fig.4(c). Each axis represents the probability under a different class. The data show that the model probabilities provide a good metric for classification. Assigning each test example to the class model that outputs the highest probability results in an accuracy of 87% on 2000 unseen digits. Additional software experiments show that mixtures of Gaussians (σ=0.01) trained by the batch E-M algorithm provide an accuracy of 92.39% on this task. Our test results show that the bump mixture model’s performance on several learning tasks is comparable to standard mixtures of Gaussians trained by E-M. These experiments give further evidence that floating-gate circuits can be used to build effective learning systems even though their learning rules derive from silicon physics instead of statistical methods. The bump mixture model also represents a basic building block that we can use to build more complex silicon probability models over analog variables. This work can be extended in several ways. We can build distributions that have parameterized covariances in addition to means. In addition, we can build more complex, adaptive probability distributions in silicon by combining the bump mixture model with silicon probability models over discrete variables [5-7] and spike-based floating-gate learning circuits [4]. 
A c k n o w l e d g me n t s This work was supported by NSF under grants BES 9720353 and ECS 9733425, and Packard Foundation and Sloan Fellowships. References [1] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford, UK: Clarendon Press, 1995. [2] L. R. Rabiner,
same-paper 3 0.84861064 110 nips-2002-Incremental Gaussian Processes
Author: Joaquin Quiñonero-candela, Ole Winther
Abstract: In this paper, we consider Tipping’s relevance vector machine (RVM) [1] and formalize an incremental training strategy as a variant of the expectation-maximization (EM) algorithm that we call Subspace EM (SSEM). Working with a subset of active basis functions, the sparsity of the RVM solution will ensure that the number of basis functions and thereby the computational complexity is kept low. We also introduce a mean field approach to the intractable classification model that is expected to give a very good approximation to exact Bayesian inference and contains the Laplace approximation as a special case. We test the algorithms on two large data sets with O(103 − 104 ) examples. The results indicate that Bayesian learning of large data sets, e.g. the MNIST database is realistic.
4 0.80660546 191 nips-2002-String Kernels, Fisher Kernels and Finite State Automata
Author: Craig Saunders, Alexei Vinokourov, John S. Shawe-taylor
Abstract: In this paper we show how the generation of documents can be thought of as a k-stage Markov process, which leads to a Fisher kernel from which the n-gram and string kernels can be re-constructed. The Fisher kernel view gives a more flexible insight into the string kernel and suggests how it can be parametrised in a way that reflects the statistics of the training corpus. Furthermore, the probabilistic modelling approach suggests extending the Markov process to consider sub-sequences of varying length, rather than the standard fixed-length approach used in the string kernel. We give a procedure for determining which sub-sequences are informative features and hence generate a Finite State Machine model, which can again be used to obtain a Fisher kernel. By adjusting the parametrisation we can also influence the weighting received by the features . In this way we are able to obtain a logarithmic weighting in a Fisher kernel. Finally, experiments are reported comparing the different kernels using the standard Bag of Words kernel as a baseline. 1
5 0.71184695 41 nips-2002-Bayesian Monte Carlo
Author: Zoubin Ghahramani, Carl E. Rasmussen
Abstract: We investigate Bayesian alternatives to classical Monte Carlo methods for evaluating integrals. Bayesian Monte Carlo (BMC) allows the incorporation of prior knowledge, such as smoothness of the integrand, into the estimation. In a simple problem we show that this outperforms any classical importance sampling method. We also attempt more challenging multidimensional integrals involved in computing marginal likelihoods of statistical models (a.k.a. partition functions and model evidences). We find that Bayesian Monte Carlo outperformed Annealed Importance Sampling, although for very high dimensional problems or problems with massive multimodality BMC may be less adequate. One advantage of the Bayesian approach to Monte Carlo is that samples can be drawn from any distribution. This allows for the possibility of active design of sample points so as to maximise information gain.
6 0.70660537 10 nips-2002-A Model for Learning Variance Components of Natural Images
7 0.70580316 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
8 0.70486677 135 nips-2002-Learning with Multiple Labels
9 0.70454538 79 nips-2002-Evidence Optimization Techniques for Estimating Stimulus-Response Functions
10 0.70062238 46 nips-2002-Boosting Density Estimation
11 0.69737756 92 nips-2002-FloatBoost Learning for Classification
12 0.69726962 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
13 0.69636047 141 nips-2002-Maximally Informative Dimensions: Analyzing Neural Responses to Natural Signals
14 0.69573516 11 nips-2002-A Model for Real-Time Computation in Generic Neural Microcircuits
15 0.69391453 81 nips-2002-Expected and Unexpected Uncertainty: ACh and NE in the Neocortex
16 0.69343364 102 nips-2002-Hidden Markov Model of Cortical Synaptic Plasticity: Derivation of the Learning Rule
17 0.69297516 28 nips-2002-An Information Theoretic Approach to the Functional Classification of Neurons
18 0.69186485 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition
19 0.69178414 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
20 0.69177818 193 nips-2002-Temporal Coherence, Natural Image Sequences, and the Visual Cortex