nips nips2006 nips2006-64 knowledge-graph by maker-knowledge-mining

64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors


Source: pdf

Author: Mark Girolami, Mingjun Zhong

Abstract: By adopting Gaussian process priors a fully Bayesian solution to the problem of integrating possibly heterogeneous data sets within a classification setting is presented. Approximate inference schemes employing Variational & Expectation Propagation based methods are developed and rigorously assessed. We demonstrate our approach to integrating multiple data sets on a large scale protein fold prediction problem where we infer the optimal combinations of covariance functions and achieve state-of-the-art performance without resorting to any ad hoc parameter tuning and classifier combination. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: By adopting Gaussian process priors, a fully Bayesian solution to the problem of integrating possibly heterogeneous data sets within a classification setting is presented. [sent-5, score-0.127]

2 Approximate inference schemes employing Variational & Expectation Propagation based methods are developed and rigorously assessed. [sent-6, score-0.208]

3 We demonstrate our approach to integrating multiple data sets on a large scale protein fold prediction problem where we infer the optimal combinations of covariance functions and achieve state-of-the-art performance without resorting to any ad hoc parameter tuning and classifier combination. [sent-7, score-0.767]

4 1 Introduction Various emerging quantitative measurement technologies in the life sciences are producing genome, transcriptome and proteome-wide data collections, which have motivated the development of data integration methods within an inferential framework. [sent-8, score-0.044]

5 It has been demonstrated that for certain prediction tasks within computational biology synergistic improvements in performance can be obtained via the integration of a number of (possibly heterogeneous) data sources. [sent-9, score-0.139]

6 In [2] six different data representations of proteins were employed for fold recognition of proteins using Support Vector Machines (SVM). [sent-10, score-0.289]

7 It was observed that certain data combinations provided increased accuracy over the use of any single dataset. [sent-11, score-0.04]

8 Likewise in [9] a comprehensive experimental study observed improvements in SVM based gene function prediction when data from both microarray expression and phylogenetic profiles were manually combined. [sent-12, score-0.049]

9 More recently protein network inference was shown to be improved when various genomic data sources were integrated [16] and in [1] it was shown that superior prediction accuracy of protein-protein interactions was obtainable when a number of diverse data types were combined in an SVM. [sent-13, score-0.274]

10 Whilst all of these papers exploited the kernel method in providing a means of data fusion within SVM based classifiers, it was initially only in [5] that a means of estimating an optimal linear combination of the kernel functions was presented using semi-definite programming. [sent-14, score-0.085]

11 However, the methods developed in [5] are based on binary SVMs, whilst arguably the majority of important classification problems within computational biology are inherently multiclass. [sent-15, score-0.065]

12 In addition, the SVM is non-probabilistic and, whilst post hoc methods for obtaining predictive probabilities are available [10], these are not without problems such as overfitting. [sent-17, score-0.444]

13 On the other hand Gaussian Process (GP) methods [11], [8] for classification provide a very natural way to both integrate and infer optimal combinations of multiple heterogeneous datasets via composite covariance functions within the Bayesian framework, an idea first proposed in [8]. [sent-18, score-0.333]

14 In this paper it is shown that GPs can indeed be successfully employed on general classification problems, without recourse to ad hoc binary classification combination schemes, where there are multiple data sources which are also optimally combined employing full Bayesian inference. [sent-19, score-0.556]

15 A large scale example of protein fold prediction [2] is provided where state-of-the-art predictive performance is achieved in a straightforward manner without resorting to any extensive ad hoc engineering of the solution (see [2], [13]). [sent-20, score-0.821]

16 As an additional important by-product of this work, inference schemes employing Variational Bayesian (VB) and Expectation Propagation (EP) based approximations for GP classification over multiple classes are studied and assessed in detail. [sent-21, score-0.328]

17 In addition we see that there is no statistically significant practical advantage of EP based approximations over VB approximations in this particular setting. [sent-23, score-0.168]

18 2 Integrating Data with Gaussian Process Priors Let us denote each of J independent (possibly heterogeneous) feature representations, Fj (X), of an object X by xj ∀ j = 1 · · · J . [sent-24, score-0.072]

19 Each distinct data representation of X, F_j(X) = x_j, is nonlinearly transformed such that f_j(x_j) : F_j → R and a linear model is employed in this new space such that the overall nonlinear transformation is f(X) = Σ_j β_j f_j(x_j). [sent-26, score-0.283]

20 A GP functional prior over all possible responses (classes) is now available, where possibly heterogeneous data sources are integrated via the composite covariance function. [sent-31, score-0.293]
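
To make the composite covariance concrete, here is a minimal sketch (not the authors' implementation; the feature blocks, weights and length-scales below are illustrative placeholders) that builds C = Σ_j α_j C_j(θ_j) from one RBF kernel per data representation. In the paper the combination weights are inferred within the Bayesian framework rather than fixed by hand.

```python
import numpy as np

def rbf_kernel(X, Z, length_scale):
    """Squared-exponential covariance between the rows of X and Z."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * length_scale ** 2))

def composite_covariance(feature_blocks, alphas, length_scales):
    """C = sum_j alpha_j * C_j(theta_j): one kernel per (possibly heterogeneous) representation."""
    return sum(a * rbf_kernel(X, X, l)
               for X, a, l in zip(feature_blocks, alphas, length_scales))

# Toy example: two heterogeneous representations of the same N objects.
rng = np.random.default_rng(0)
N = 50
blocks = [rng.normal(size=(N, 20)), rng.normal(size=(N, 5))]  # e.g. composition and structure features
alphas = [0.6, 0.4]                                            # combination weights (placeholders)
lengths = [2.0, 1.0]                                           # per-kernel length-scales (placeholders)
C = composite_covariance(blocks, alphas, lengths)              # N x N composite covariance
```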

21 It is then, in principle, a straightforward matter to perform Bayesian inference with this model and no further recourse to ad hoc binary classifier combination methods or ancillary optimizations to obtain the data combination weights is required. [sent-32, score-0.416]

22 The Gibbs sampler is to be preferred over the Metropolis scheme as no tuning of a proposal distribution is required. [sent-35, score-0.166]

23 As in [3] the auxiliary variables y_{nk} = f_k(X_n) + ε_{nk}, ε_{nk} ∼ N(0, 1), are introduced and the N × 1 dimensional vector of target class values associated with each X_n is given as t, where each element t_n ∈ {1, · · · , K}. [sent-36, score-0.346]
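
As a toy illustration of this auxiliary-variable construction (assumed values, not code from the paper): each y_{nk} is the latent function value plus unit-variance Gaussian noise, and the observed class is the index of the largest auxiliary variable.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 6, 3
F = rng.normal(size=(N, K))        # latent GP function values f_k(X_n)
Y = F + rng.normal(size=(N, K))    # auxiliary variables y_nk = f_k(X_n) + eps_nk, eps_nk ~ N(0, 1)
t = Y.argmax(axis=1) + 1           # multinomial probit link: t_n = argmax_k y_nk, labels in {1, ..., K}
```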

24 The N × K matrix of GP random variables fk (Xn ) is denoted by F. [sent-37, score-0.065]

25 The N × K matrix of auxiliary variables ynk is represented as Y, where the N × 1 dimensional columns are denoted by Y·,k and the corresponding K × 1 dimensional vectors are obtained from the rows of Y as Yn,· . [sent-39, score-0.091]

26 3 MCMC Procedure Samples from the full posterior P (Y, F, Θ1···K , ϕ1···K |X1···N , t, a, b) can be obtained from the following Metropolis-within-Blocked-Gibbs Sampling scheme indexing over all n = 1 · · · N and k = 1 · · · K. [sent-43, score-0.164]

27 An accept-reject strategy can be employed in sampling from the conic truncated Gaussian; however, this will very quickly become inefficient for problems with moderately large numbers of classes, and as such a further Gibbs sampling scheme may be required. [sent-45, score-0.137]

28 Each Σ_k^{(i)} = C_k^{(i)} (I + C_k^{(i)})^{−1} and C_k^{(i)} = Σ_{j=1}^{J} α_{jk}^{(i)} C_{jk}(θ_{jk}^{(i)}), with the elements of C_{jk}(θ_{jk}^{(i)}) defined as C_j(x_m, x_n; θ_{jk}^{(i)}). [sent-46, score-0.045]
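
A brief numerical aside: Σ_k = C_k(I + C_k)^{−1} is better formed with a linear solve than with an explicit matrix inverse. A minimal sketch, assuming C_k has already been assembled from the weighted kernels:

```python
import numpy as np

def posterior_scale(C_k):
    """Sigma_k = C_k (I + C_k)^{-1}, computed via a solve rather than an explicit inverse.
    Since C_k and (I + C_k) commute, this equals (I + C_k)^{-1} C_k."""
    return np.linalg.solve(np.eye(C_k.shape[0]) + C_k, C_k)
```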

29 A Metropolis sub-sampler is required to obtain samples for the conditional distribution over the composite covariance function parameters P(Θ_k^{(i+1)}), and finally P(ϕ_k^{(i+1)}) is a simple product of Gamma distributions. [sent-47, score-0.293]

30 The predictive likelihood of a test sample X∗ is P(t∗ = k|X∗, X_{1···N}, t, a, b), which can be obtained by integrating over the posterior and predictive prior such that P(t∗ = k|X∗, X_{1···N}, t, a, b) = ∫∫ P(t∗ = k|f∗) p(f∗|Ω, X∗, X_{1···N}) p(Ω|X_{1···N}, t, a, b) df∗ dΩ (7) where Ω = {Y, Θ_{1···K}}. [sent-48, score-0.706]
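
The integral in (7) has no closed form and is estimated by Monte Carlo over posterior draws. The sketch below is self-contained but makes two assumptions of mine: it uses the standard 1-D reduction of the multinomial probit likelihood, P(t∗ = k|f∗) = E_u[Π_{j≠k} Φ(u + f∗_k − f∗_j)] with u ∼ N(0, 1), and the posterior f∗ draws are random stand-ins rather than genuine MCMC output.

```python
import numpy as np
from scipy.stats import norm

def probit_class_probs(f_star, n_u=1000, rng=None):
    """Multinomial-probit class probabilities P(t*=k | f*) for one test point,
    estimated as E_u[ prod_{j != k} Phi(u + f*_k - f*_j) ], u ~ N(0, 1)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    K = f_star.shape[0]
    u = rng.normal(size=(n_u, 1))                    # Monte Carlo samples of the shared noise term
    probs = np.empty(K)
    for k in range(K):
        diff = f_star[k] - np.delete(f_star, k)      # f*_k - f*_j for all j != k
        probs[k] = norm.cdf(u + diff).prod(axis=1).mean()
    return probs / probs.sum()                       # renormalise residual Monte Carlo error

# Averaging over posterior draws of f* (one per MCMC sample of Omega) approximates equation (7).
f_draws = np.random.default_rng(2).normal(size=(20, 27))         # stand-in posterior f* draws, K = 27
pred = np.mean([probit_class_probs(f) for f in f_draws], axis=0)
```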

31 A recent study has shown that EP is superior to the Laplace approximation for binary classification [4] and that for multi-class classification VB methods are superior to the Laplace approximation [3]. [sent-50, score-0.094]

32 However, the comparison between Variational and EP based approximations for the multi-class setting has not been considered in the literature and so we seek to address this issue in the following sections. [sent-51, score-0.084]

33 However given the excellent performance of EP on a number of approximate Bayesian inference problems it is incumbent on us to consider an EP solution here. [sent-54, score-0.098]

34 We should point out that only the top level inference on the GP variables is considered here and the composite covariance function parameters will be obtained using another appropriate type-II maximum likelihood optimization scheme if possible. [sent-55, score-0.379]

35 5 Expectation Propagation with Full Posterior Covariance The required posterior can also be approximated by EP [7]. [sent-57, score-0.12]

36 In this case the multinomial probit likelihood is approximated by a multivariate Gaussian such that p(F|t, X_{1···N}) ≈ Q(F) ∝ Π_k p(F_{·,k}|X_{1···N}) Π_n g_n(F_{n,·}), where g_n(F_{n,·}) = N_{F_{n,·}}(µ_n, Λ_n), µ_n is a K × 1 vector and Λ_n is a full K × K dimensional covariance matrix. [sent-58, score-0.471]

37 To proceed an analytic form for the partition function Zn is required. [sent-60, score-0.086]

38 Indeed for binary classification employing a binomial probit likelihood an elegant EP solution follows due to the analytic form of the partition function [4]. [sent-61, score-0.389]

39 However for the case of multiple classes with a multinomial probit likelihood the partition function no longer has a closed analytic form and further approximations are required to make any progress. [sent-62, score-0.477]

40 There are two strategies which we consider: the first retains the full posterior coupling in the covariance matrices Λ_n by employing Laplace Propagation (LP) [14], and the second assumes no posterior coupling in Λ_n by setting this as a diagonal covariance matrix. [sent-63, score-0.673]

41 The second form of approximation has been adopted in [12] when developing a multi-class version of the Informative Vector Machine (IVM) [6]. [sent-64, score-0.047]

42 In the first case where we employ LP an additional significant O(K^3 N^3) computational scaling will be incurred; however, it can be argued that the retention of the posterior coupling is important. [sent-65, score-0.203]

43 For the second case clearly we lose this explicit posterior coupling but, of course, do not incur the expensive computational overhead required of LP. [sent-66, score-0.165]

44 We observed in unreported experiments that there is little of statistical significance lost, in terms of predictive performance, when assuming a factorable form for each pn . [sent-67, score-0.387]

45 LP proceeds by propagating the approximate moments such that µ_n^{new} ≈ argmax_{F_{n,·}} log p̂_n and Λ_n^{new} ≈ −(∂² log p̂_n / ∂F_{n,·} ∂F_{n,·}^T)^{−1} (10) The required derivatives follow straightforwardly and details are included in the accompanying material. [sent-68, score-0.197]
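
A small sketch of the site update in (10): locate the mode of log p̂_n numerically and take the negative inverse Hessian at that mode, here via finite differences. The toy log-density stands in for the actual multinomial probit site term and is an assumption for illustration only.

```python
import numpy as np
from scipy.optimize import minimize

def lp_site_moments(log_p, f_init, eps=1e-5):
    """Laplace-propagation site update (eq. 10): mode of log_p and negative inverse
    Hessian at the mode, the latter by central finite differences."""
    mu = minimize(lambda f: -log_p(f), f_init, method="BFGS").x
    K = mu.size
    H = np.zeros((K, K))
    for i in range(K):
        for j in range(K):
            ei, ej = np.eye(K)[i] * eps, np.eye(K)[j] * eps
            H[i, j] = (log_p(mu + ei + ej) - log_p(mu + ei - ej)
                       - log_p(mu - ei + ej) + log_p(mu - ei - ej)) / (4.0 * eps ** 2)
    return mu, np.linalg.inv(-H)

# Toy site term: an unnormalised Gaussian log-density standing in for log p_hat_n.
toy_log_p = lambda f: -0.5 * np.sum((f - np.array([1.0, -0.5, 0.2])) ** 2)
mu_n, Lambda_n = lp_site_moments(toy_log_p, f_init=np.zeros(3))
```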

46 The approximate predictive distribution for a new data point x∗ requires a Monte Carlo estimate employing samples drawn from a K-dimensional multivariate Gaussian, for which details are given in the supplementary material. [sent-69, score-0.524]

47 6 Expectation Propagation with Diagonal Posterior Covariance By assuming a factorable approximate posterior, as in the variational approximation [3], a distinct simplification of the problem setting follows, where now we assume that g_n(F_{n,·}) = Π_k N_{F_{n,k}}(µ_{n,k}, λ_{n,k}). [sent-71, score-0.381]

48 Derivatives of this partition function follow in a straightforward way now allowing the required EP updates to proceed (details in supplementary material). [sent-81, score-0.121]

49 The approximate predictive distribution for a new data point X∗ in this case takes a similar form to that for the Variational approximation [3]. [sent-82, score-0.35]

50 So we have P(t∗ = k|X∗, X_{1···N}, t) = E_{p(u)p(v)} { Π_{j=1, j≠k}^{K} Φ( (u + v√λ∗_k + µ∗_k − µ∗_j) / √(1 + λ∗_j) ) } (12) where the predictive mean and variance follow in standard form. [sent-83, score-0.263]
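
Equation (12), as reconstructed above, is a two-dimensional expectation over independent standard normals u and v and is evaluated by Monte Carlo. A minimal sketch with illustrative predictive means µ∗ and variances λ∗ (the values are placeholders):

```python
import numpy as np
from scipy.stats import norm

def iep_predictive(mu_star, lam_star, n_mc=5000, rng=None):
    """Monte Carlo estimate of the IEP predictive class probabilities (eq. 12):
    a 2-D expectation over u, v ~ N(0, 1)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    K = mu_star.shape[0]
    u = rng.normal(size=(n_mc, 1))
    v = rng.normal(size=(n_mc, 1))
    probs = np.empty(K)
    for k in range(K):
        others = [j for j in range(K) if j != k]
        num = u + v * np.sqrt(lam_star[k]) + mu_star[k] - mu_star[others]
        probs[k] = norm.cdf(num / np.sqrt(1.0 + lam_star[others])).prod(axis=1).mean()
    return probs / probs.sum()

# Illustrative 4-class example.
p = iep_predictive(mu_star=np.array([0.3, -0.1, 0.8, 0.0]),
                   lam_star=np.array([0.5, 0.4, 0.6, 0.5]))
```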

51 The VB approximation [3], however, only requires a 1-D Monte Carlo integral rather than the 2-D one required here. [sent-85, score-0.092]

52 3 Experiments Before considering the main example of data integration within a large scale protein fold prediction problem we attempt to assess a number of approximate inference schemes for GP multi-class classification. [sent-86, score-0.434]

53 We provide a short comparative study of the Laplace, VB, and both possible EP approximations by employing the Gibbs sampler as the comparative gold standard. [sent-87, score-0.35]

54 For these experiments six multi-class data sets are employed. [sent-88, score-0.088]

55 A single radial basis covariance function with one length scale parameter is used in this comparative study. [sent-91, score-0.122]

56 Ten-fold cross validation (CV) was used to estimate the predictive log-likelihood and the percentage predictive error. [sent-92, score-0.581]

57 Within each of the ten folds a further 10-fold CV routine was employed to select the length-scale of the covariance function. [sent-93, score-0.224]
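
The nested cross-validation protocol can be sketched as follows. The classifier inside the loop (kernel ridge regression on one-hot targets) is only a stand-in so the example runs end-to-end; the experiments in the paper use the GP approximations instead.

```python
import numpy as np

def rbf(X, Z, ls):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * ls ** 2))

def fit_predict(Xtr, ttr, Xte, ls, jitter=1e-2):
    # Stand-in classifier: kernel ridge on one-hot targets (labels assumed to be 0, ..., K-1).
    Y = np.eye(int(ttr.max()) + 1)[ttr]
    A = np.linalg.solve(rbf(Xtr, Xtr, ls) + jitter * np.eye(len(Xtr)), Y)
    return (rbf(Xte, Xtr, ls) @ A).argmax(axis=1)

def cv_error(X, t, ls, folds):
    errs = []
    for f in range(len(folds)):
        te = folds[f]
        tr = np.hstack([folds[g] for g in range(len(folds)) if g != f])
        errs.append(np.mean(fit_predict(X[tr], t[tr], X[te], ls) != t[te]))
    return np.mean(errs)

def nested_cv(X, t, length_scales, n_outer=10, n_inner=10, seed=0):
    """Outer CV estimates predictive error; an inner CV on each training split picks the length-scale."""
    rng = np.random.default_rng(seed)
    outer = np.array_split(rng.permutation(len(X)), n_outer)
    errs = []
    for f in range(n_outer):
        te = outer[f]
        tr = np.hstack([outer[g] for g in range(n_outer) if g != f])
        inner = np.array_split(rng.permutation(len(tr)), n_inner)   # folds indexing into the training split
        best = min(length_scales, key=lambda ls: cv_error(X[tr], t[tr], ls, inner))
        errs.append(np.mean(fit_predict(X[tr], t[tr], X[te], best) != t[te]))
    return np.mean(errs), np.std(errs)

# Toy usage on synthetic 3-class data.
rng = np.random.default_rng(0)
X, t = rng.normal(size=(90, 5)), rng.integers(0, 3, size=90)
err_mean, err_std = nested_cv(X, t, length_scales=[0.5, 1.0, 2.0])
```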

58 For the Gibbs sampler, after a burn-in of 2000 samples, the following 3000 samples were used for inference, and the predictive error and likelihood were computed from the 3000 post-burn-in samples. [sent-94, score-0.356]

59 For each data set and each method the percentage predictive error and the predictive log-likelihood were estimated in this manner. [sent-95, score-0.581]

60 The summary results given as the mean and standard deviation over the ten folds are shown in Table 1. [sent-96, score-0.051]

61 From those results, we can see that across most data sets used, the predictive log-likelihood obtained from the Laplace approximation is lower than that of the three other methods. [sent-98, score-0.31]

62 In our observations, the predictive performance of VB and the IEP approximation is consistently indistinguishable from that achieved by the Gibbs sampler. [sent-99, score-0.349]

63 From the experiments conducted there is no evidence to suggest any difference in predictive performance between IEP & VB methods in the case of multi-way classification. [sent-100, score-0.263]

64 As there is no benefit in choosing an EP based approximation over the Variational one, we now select the Variational approximation, in which inference over the covariance parameters follows simply by obtaining posterior mean estimates using an importance sampler. [sent-101, score-0.349]
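
A generic sketch of the importance-sampling estimate of a posterior mean (the target and proposal below are toy placeholders; in the paper the target would be the conditional posterior over the composite covariance function parameters):

```python
import numpy as np

def importance_posterior_mean(log_target, sample_proposal, log_proposal, n=5000, rng=None):
    """Self-normalised importance-sampling estimate of E[theta | data].
    log_target is the unnormalised log posterior; the proposal can be sampled and scored."""
    rng = rng if rng is not None else np.random.default_rng(0)
    theta = sample_proposal(rng, n)                              # (n, D) draws from the proposal
    log_w = np.array([log_target(th) for th in theta]) - log_proposal(theta)
    w = np.exp(log_w - log_w.max())                              # stabilise before normalising
    w /= w.sum()
    return (w[:, None] * theta).sum(axis=0)

# Toy check: target proportional to N(2, 0.5^2), proposal N(0, 2^2); the estimate should be near 2.
post_mean = importance_posterior_mean(
    log_target=lambda th: -0.5 * ((th[0] - 2.0) / 0.5) ** 2,
    sample_proposal=lambda rng, n: rng.normal(0.0, 2.0, size=(n, 1)),
    log_proposal=lambda th: -0.5 * (th[:, 0] / 2.0) ** 2,
)
```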

65 We can compare the compute time taken to obtain reasonable predictions from the full MCMC and the approximate Variational scheme [3]. [sent-103, score-0.129]

66 Figure 1 (a) shows the samples of the covariance function parameters Θ drawn from the Metropolis sub-sampler and, overlaid in black, the corresponding approximate posterior mean estimates obtained from the variational scheme [3]. [sent-104, score-0.506]

67 Best results which are statistically indistinguishable from each other are highlighted in bold. [sent-110, score-0.039]

68 It is clear that after 100 calls to the sub-sampler the samples obtained reflect the relevance of the features; however, the deterministic steps taken in the variational routine achieve this in just over ten computational steps of equal cost to the Metropolis sub-sampler. [sent-213, score-0.219]

69 Figure 1 (b) shows the predictive error incurred by the classifier; under the MCMC scheme 30,000 CPU seconds are required to achieve the same level of predictive accuracy that the variational approximation obtains in 200 seconds (a factor of 150 times faster). [sent-214, score-0.897]

70 This is due, in part, to the additional level of sampling from the predictive prior which is required when using MCMC to obtain predictive posteriors. [sent-215, score-0.571]

71 Because of these results we now adopt the variational approximation for the following large scale experiment. [sent-216, score-0.231]

72 4 Protein Fold Prediction with GP Based Data Fusion To illustrate the proposed GP based method of data integration a substantial protein fold classification problem originally studied in [2] and more recently in [13] is considered. [sent-217, score-0.287]

73 The task is to devise a predictor of 27 distinct SCOP classes from a set (N = 314) of low homology protein sequences. [sent-218, score-0.158]

74 Six different data representations (each comprising around 20 features) are available, characterizing (1) Amino Acid composition (AA); (2) Hydrophobicity profile (HP); (3) Polarity (PT); (4) Polarizability (PY); (5) Secondary Structure (SS); (6) Van der Waals volume profile of the protein (VP). [sent-227, score-0.122]

75 In [2] a number of classifier and data combination strategies were employed in devising a multi-way classifier from a series of binary SVMs. [sent-228, score-0.133]

76 In the original work of [2] the best predictive accuracy obtained on an independent set (N = 385) of low sequence similarity proteins was 53%. [sent-229, score-0.303]

77 It was noted after extensive careful manual experimentation by the authors that a combination of Gaussian kernels each composed of the (AA), (SS) and (HP) datasets significantly improved predictive accuracy. [sent-230, score-0.305]

78 More recently in [13] a heavily tuned ad hoc ensemble combination of classifiers raised this performance to 62%, the best reported on this problem. [sent-231, score-0.248]

79 We employ the proposed GP based method (Variational approximation) in devising a classifier for this task, where we now use a composite covariance function (shared across all 27 classes), a linear combination of RBF functions for each data set. [sent-232, score-0.371]

80 Figure (2) shows the predictive performance of the GP classifier in terms of percentage prediction accuracy (a) and predictive likelihood on the independent test set (b). [sent-233, score-0.688]

81 We note a significant synergistic increase in performance when all data sets are combined and weighted (MA) where the overall performance accuracy achieved is 62%. [sent-234, score-0.046]

82 Although the 0-1 loss test error is the same for an equal weighting of the data sets (MF) and for the proposed inference procedure (MA), for (MA) there is an increase in predictive likelihood. [sent-235, score-0.416]

83 It is interesting to note that the weighting obtained (posterior mean for α), shown in Figure (2.c), weights the (AA) & (SS) data with equal importance whilst other data sets play less of a role in performance improvement. [sent-238, score-0.037] [sent-239, score-0.065]

85 5 Conclusions In this paper we have considered the problem of integrating data sets within a classification setting, a common scenario within many bioinformatics problems. [sent-240, score-0.105]

86 We have argued that the GP prior provides an elegant solution to this problem within the Bayesian inference framework. [sent-241, score-0.058]

87 To obtain a computationally practical solution, three approximate approaches to multi-class classification with GP priors, i.e. Laplace, Variational and EP based approximations, have been considered. [sent-242, score-0.04] [sent-244, score-0.084]

89 It is found that EP and Variational approximations approach the performance of a Gibbs sampler and indeed their predictive performances are indistinguishable at the 5% level of significance. [sent-245, score-0.502]

90 The full EP (FEP) approximation employing LP has an excessive computational cost and there is little to recommend it in terms of predictive performance over the independent assumption (IEP). [sent-246, score-0.499]

91 Likewise there is little to distinguish between IEP and VB approximations in terms of predictive performance in the multi-class classification setting, though further experiments on a larger number of data sets are desirable. [sent-247, score-0.347]

92 This is a highly practical solution to the problem of heterogeneous data fusion in the classification setting which employs Bayesian inferential semantics throughout in a consistent manner. [sent-249, score-0.043]

93 We note that on the fold prediction problem the best performance achieved is equaled without resorting to complex and ad hoc data and classifier weighting and combination schemes. [sent-250, score-0.515]

94 Multi-class protein fold recognition using support vector machines and neural networks. [sent-261, score-0.243]

95 Variational Bayesian multinomial probit regression with Gaussian process priors. [sent-264, score-0.168]

96 Protein network inference from multiple genomic data: a supervised approach. [sent-357, score-0.103]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('gp', 0.319), ('predictive', 0.263), ('ep', 0.229), ('vb', 0.219), ('fn', 0.207), ('jk', 0.202), ('variational', 0.184), ('iep', 0.183), ('laplace', 0.163), ('employing', 0.15), ('covariance', 0.122), ('protein', 0.122), ('fold', 0.121), ('sampler', 0.116), ('hoc', 0.116), ('gibbs', 0.115), ('hp', 0.114), ('pe', 0.111), ('pl', 0.111), ('metropolis', 0.109), ('probit', 0.095), ('classi', 0.093), ('tn', 0.092), ('epn', 0.091), ('ynk', 0.091), ('composite', 0.091), ('ad', 0.09), ('approximations', 0.084), ('mcmc', 0.081), ('fj', 0.08), ('ss', 0.08), ('cj', 0.08), ('heterogeneous', 0.08), ('cjk', 0.079), ('aa', 0.078), ('posterior', 0.075), ('multinomial', 0.073), ('xj', 0.072), ('epnk', 0.068), ('factorable', 0.068), ('recourse', 0.068), ('fk', 0.065), ('whilst', 0.065), ('vp', 0.064), ('bayesian', 0.061), ('mf', 0.06), ('ni', 0.06), ('resorting', 0.06), ('ck', 0.059), ('bioinformatics', 0.058), ('inference', 0.058), ('lp', 0.058), ('likelihood', 0.058), ('pn', 0.056), ('percentage', 0.055), ('girolami', 0.054), ('employed', 0.051), ('py', 0.051), ('folds', 0.051), ('scheme', 0.05), ('er', 0.05), ('prediction', 0.049), ('nk', 0.049), ('zn', 0.049), ('integrating', 0.047), ('approximation', 0.047), ('analytic', 0.046), ('soybean', 0.046), ('synergistic', 0.046), ('xn', 0.045), ('incurred', 0.045), ('genomic', 0.045), ('required', 0.045), ('pt', 0.045), ('coupling', 0.045), ('propagation', 0.044), ('integration', 0.044), ('fusion', 0.043), ('gn', 0.042), ('combination', 0.042), ('combinations', 0.04), ('proteins', 0.04), ('partition', 0.04), ('devising', 0.04), ('approximate', 0.04), ('full', 0.039), ('indistinguishable', 0.039), ('employ', 0.038), ('gaussian', 0.038), ('material', 0.038), ('weighting', 0.037), ('six', 0.037), ('supplementary', 0.036), ('abe', 0.036), ('svm', 0.036), ('classes', 0.036), ('samples', 0.035), ('carlo', 0.035), ('monte', 0.035), ('ma', 0.035)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors

Author: Mark Girolami, Mingjun Zhong

Abstract: By adopting Gaussian process priors a fully Bayesian solution to the problem of integrating possibly heterogeneous data sets within a classification setting is presented. Approximate inference schemes employing Variational & Expectation Propagation based methods are developed and rigorously assessed. We demonstrate our approach to integrating multiple data sets on a large scale protein fold prediction problem where we infer the optimal combinations of covariance functions and achieve state-of-the-art performance without resorting to any ad hoc parameter tuning and classifier combination. 1

2 0.25234532 2 nips-2006-A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation

Author: Yee W. Teh, David Newman, Max Welling

Abstract: Latent Dirichlet allocation (LDA) is a Bayesian network that has recently gained much popularity in applications ranging from document modeling to computer vision. Due to the large scale nature of these applications, current inference procedures like variational Bayes and Gibbs sampling have been found lacking. In this paper we propose the collapsed variational Bayesian inference algorithm for LDA, and show that it is computationally efficient, easy to implement and significantly more accurate than standard variational Bayesian inference for LDA.

3 0.23801225 159 nips-2006-Parameter Expanded Variational Bayesian Methods

Author: Tommi S. Jaakkola, Yuan Qi

Abstract: Bayesian inference has become increasingly important in statistical machine learning. Exact Bayesian calculations are often not feasible in practice, however. A number of approximate Bayesian methods have been proposed to make such calculations practical, among them the variational Bayesian (VB) approach. The VB approach, while useful, can nevertheless suffer from slow convergence to the approximate solution. To address this problem, we propose Parameter-eXpanded Variational Bayesian (PX-VB) methods to speed up VB. The new algorithm is inspired by parameter-expanded expectation maximization (PX-EM) and parameterexpanded data augmentation (PX-DA). Similar to PX-EM and -DA, PX-VB expands a model with auxiliary variables to reduce the coupling between variables in the original model. We analyze the convergence rates of VB and PX-VB and demonstrate the superior convergence rates of PX-VB in variational probit regression and automatic relevance determination. 1

4 0.14980027 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods

Author: Matthias Seeger

Abstract: We propose a highly efficient framework for kernel multi-class models with a large and structured set of classes. Kernel parameters are learned automatically by maximizing the cross-validation log likelihood, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical class structure, achieving state-of-the-art results in an order of magnitude less time than previous work. 1

5 0.14902875 30 nips-2006-An Oracle Inequality for Clipped Regularized Risk Minimizers

Author: Ingo Steinwart, Don Hush, Clint Scovel

Abstract: We establish a general oracle inequality for clipped approximate minimizers of regularized empirical risks and apply this inequality to support vector machine (SVM) type algorithms. We then show that for SVMs using Gaussian RBF kernels for classification this oracle inequality leads to learning rates that are faster than the ones established in [9]. Finally, we use our oracle inequality to show that a simple parameter selection approach based on a validation set can yield the same fast learning rates without knowing the noise exponents which were required to be known a-priori in [9]. 1

6 0.14093168 57 nips-2006-Conditional mean field

7 0.12896444 183 nips-2006-Stochastic Relational Models for Discriminative Link Prediction

8 0.12281594 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation

9 0.11026929 15 nips-2006-A Switched Gaussian Process for Estimating Disparity and Segmentation in Binocular Stereo

10 0.10611068 9 nips-2006-A Nonparametric Bayesian Method for Inferring Features From Similarity Judgments

11 0.10551544 169 nips-2006-Relational Learning with Gaussian Processes

12 0.092339568 161 nips-2006-Particle Filtering for Nonparametric Bayesian Matrix Factorization

13 0.088296615 43 nips-2006-Bayesian Model Scoring in Markov Random Fields

14 0.087152191 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures

15 0.085659727 124 nips-2006-Linearly-solvable Markov decision problems

16 0.082331128 116 nips-2006-Learning from Multiple Sources

17 0.081616431 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization

18 0.077391364 41 nips-2006-Bayesian Ensemble Learning

19 0.075319797 54 nips-2006-Comparative Gene Prediction using Conditional Random Fields

20 0.075251013 132 nips-2006-Modeling Dyadic Data with Binary Latent Factors


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.248), (1, 0.07), (2, 0.013), (3, -0.094), (4, 0.033), (5, 0.193), (6, 0.301), (7, 0.061), (8, -0.115), (9, 0.029), (10, 0.004), (11, 0.132), (12, 0.04), (13, -0.151), (14, -0.016), (15, 0.35), (16, -0.035), (17, -0.043), (18, -0.071), (19, -0.068), (20, -0.004), (21, 0.132), (22, -0.031), (23, -0.033), (24, -0.009), (25, -0.077), (26, -0.064), (27, -0.021), (28, 0.016), (29, 0.003), (30, 0.084), (31, 0.087), (32, 0.091), (33, 0.077), (34, -0.027), (35, 0.025), (36, 0.022), (37, -0.064), (38, 0.062), (39, 0.077), (40, 0.078), (41, -0.058), (42, 0.017), (43, -0.05), (44, -0.068), (45, 0.063), (46, -0.009), (47, -0.052), (48, 0.041), (49, -0.075)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94649947 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors

Author: Mark Girolami, Mingjun Zhong

Abstract: By adopting Gaussian process priors a fully Bayesian solution to the problem of integrating possibly heterogeneous data sets within a classification setting is presented. Approximate inference schemes employing Variational & Expectation Propagation based methods are developed and rigorously assessed. We demonstrate our approach to integrating multiple data sets on a large scale protein fold prediction problem where we infer the optimal combinations of covariance functions and achieve state-of-the-art performance without resorting to any ad hoc parameter tuning and classifier combination. 1

2 0.74593866 2 nips-2006-A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation

Author: Yee W. Teh, David Newman, Max Welling

Abstract: Latent Dirichlet allocation (LDA) is a Bayesian network that has recently gained much popularity in applications ranging from document modeling to computer vision. Due to the large scale nature of these applications, current inference procedures like variational Bayes and Gibbs sampling have been found lacking. In this paper we propose the collapsed variational Bayesian inference algorithm for LDA, and show that it is computationally efficient, easy to implement and significantly more accurate than standard variational Bayesian inference for LDA.

3 0.70595235 159 nips-2006-Parameter Expanded Variational Bayesian Methods

Author: Tommi S. Jaakkola, Yuan Qi

Abstract: Bayesian inference has become increasingly important in statistical machine learning. Exact Bayesian calculations are often not feasible in practice, however. A number of approximate Bayesian methods have been proposed to make such calculations practical, among them the variational Bayesian (VB) approach. The VB approach, while useful, can nevertheless suffer from slow convergence to the approximate solution. To address this problem, we propose Parameter-eXpanded Variational Bayesian (PX-VB) methods to speed up VB. The new algorithm is inspired by parameter-expanded expectation maximization (PX-EM) and parameterexpanded data augmentation (PX-DA). Similar to PX-EM and -DA, PX-VB expands a model with auxiliary variables to reduce the coupling between variables in the original model. We analyze the convergence rates of VB and PX-VB and demonstrate the superior convergence rates of PX-VB in variational probit regression and automatic relevance determination. 1

4 0.50906348 169 nips-2006-Relational Learning with Gaussian Processes

Author: Wei Chu, Vikas Sindhwani, Zoubin Ghahramani, S. S. Keerthi

Abstract: Correlation between instances is often modelled via a kernel function using input attributes of the instances. Relational knowledge can further reveal additional pairwise correlations between variables of interest. In this paper, we develop a class of models which incorporates both reciprocal relational information and input attributes using Gaussian process techniques. This approach provides a novel non-parametric Bayesian framework with a data-dependent covariance function for supervised learning tasks. We also apply this framework to semi-supervised learning. Experimental results on several real world data sets verify the usefulness of this algorithm. 1

5 0.48236781 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods

Author: Matthias Seeger

Abstract: We propose a highly efficient framework for kernel multi-class models with a large and structured set of classes. Kernel parameters are learned automatically by maximizing the cross-validation log likelihood, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical class structure, achieving state-of-the-art results in an order of magnitude less time than previous work. 1

6 0.46822339 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures

7 0.4635748 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation

8 0.45805022 183 nips-2006-Stochastic Relational Models for Discriminative Link Prediction

9 0.457827 30 nips-2006-An Oracle Inequality for Clipped Regularized Risk Minimizers

10 0.42995238 57 nips-2006-Conditional mean field

11 0.3693178 135 nips-2006-Modelling transcriptional regulation using Gaussian Processes

12 0.3586517 43 nips-2006-Bayesian Model Scoring in Markov Random Fields

13 0.35681054 124 nips-2006-Linearly-solvable Markov decision problems

14 0.35023442 161 nips-2006-Particle Filtering for Nonparametric Bayesian Matrix Factorization

15 0.34841558 9 nips-2006-A Nonparametric Bayesian Method for Inferring Features From Similarity Judgments

16 0.3372708 95 nips-2006-Implicit Surfaces with Globally Regularised and Compactly Supported Basis Functions

17 0.31837317 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure

18 0.31425828 15 nips-2006-A Switched Gaussian Process for Estimating Disparity and Segmentation in Binocular Stereo

19 0.31002173 41 nips-2006-Bayesian Ensemble Learning

20 0.30769563 132 nips-2006-Modeling Dyadic Data with Binary Latent Factors


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.148), (3, 0.017), (7, 0.058), (9, 0.046), (20, 0.054), (22, 0.057), (44, 0.091), (52, 0.292), (57, 0.07), (65, 0.038), (69, 0.029), (71, 0.019), (90, 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.86950409 13 nips-2006-A Scalable Machine Learning Approach to Go

Author: Lin Wu, Pierre F. Baldi

Abstract: Go is an ancient board game that poses unique opportunities and challenges for AI and machine learning. Here we develop a machine learning approach to Go, and related board games, focusing primarily on the problem of learning a good evaluation function in a scalable way. Scalability is essential at multiple levels, from the library of local tactical patterns, to the integration of patterns across the board, to the size of the board itself. The system we propose is capable of automatically learning the propensity of local patterns from a library of games. Propensity and other local tactical information are fed into a recursive neural network, derived from a Bayesian network architecture. The network integrates local information across the board and produces local outputs that represent local territory ownership probabilities. The aggregation of these probabilities provides an effective strategic evaluation function that is an estimate of the expected area at the end (or at other stages) of the game. Local area targets for training can be derived from datasets of human games. A system trained using only 9 × 9 amateur game data performs surprisingly well on a test set derived from 19 × 19 professional game data. Possible directions for further improvements are briefly discussed. 1

same-paper 2 0.84056848 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors

Author: Mark Girolami, Mingjun Zhong

Abstract: By adopting Gaussian process priors a fully Bayesian solution to the problem of integrating possibly heterogeneous data sets within a classification setting is presented. Approximate inference schemes employing Variational & Expectation Propagation based methods are developed and rigorously assessed. We demonstrate our approach to integrating multiple data sets on a large scale protein fold prediction problem where we infer the optimal combinations of covariance functions and achieve state-of-the-art performance without resorting to any ad hoc parameter tuning and classifier combination. 1

3 0.80874616 105 nips-2006-Large Margin Component Analysis

Author: Lorenzo Torresani, Kuang-chih Lee

Abstract: Metric learning has been shown to significantly improve the accuracy of k-nearest neighbor (kNN) classification. In problems involving thousands of features, distance learning algorithms cannot be used due to overfitting and high computational complexity. In such cases, previous work has relied on a two-step solution: first apply dimensionality reduction methods to the data, and then learn a metric in the resulting low-dimensional subspace. In this paper we show that better classification performance can be achieved by unifying the objectives of dimensionality reduction and metric learning. We propose a method that solves for the low-dimensional projection of the inputs, which minimizes a metric objective aimed at separating points in different classes by a large margin. This projection is defined by a significantly smaller number of parameters than metrics learned in input space, and thus our optimization reduces the risks of overfitting. Theory and results are presented for both a linear as well as a kernelized version of the algorithm. Overall, we achieve classification rates similar, and in several cases superior, to those of support vector machines. 1

4 0.58917958 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization

Author: Rey Ramírez, Jason Palmer, Scott Makeig, Bhaskar D. Rao, David P. Wipf

Abstract: The ill-posed nature of the MEG/EEG source localization problem requires the incorporation of prior assumptions when choosing an appropriate solution out of an infinite set of candidates. Bayesian methods are useful in this capacity because they allow these assumptions to be explicitly quantified. Recently, a number of empirical Bayesian approaches have been proposed that attempt a form of model selection by using the data to guide the search for an appropriate prior. While seemingly quite different in many respects, we apply a unifying framework based on automatic relevance determination (ARD) that elucidates various attributes of these methods and suggests directions for improvement. We also derive theoretical properties of this methodology related to convergence, local minima, and localization bias and explore connections with established algorithms. 1

5 0.58466232 65 nips-2006-Denoising and Dimension Reduction in Feature Space

Author: Mikio L. Braun, Klaus-Robert Müller, Joachim M. Buhmann

Abstract: We show that the relevant information about a classification problem in feature space is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem. Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data to perform classification. Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and to (3) de-noise in feature space in order to yield better classification results. 1

6 0.5813868 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation

7 0.57451046 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment

8 0.57276666 193 nips-2006-Tighter PAC-Bayes Bounds

9 0.57226545 175 nips-2006-Simplifying Mixture Models through Function Approximation

10 0.5721854 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation

11 0.57120669 20 nips-2006-Active learning for misspecified generalized linear models

12 0.56695855 179 nips-2006-Sparse Representation for Signal Classification

13 0.56650466 169 nips-2006-Relational Learning with Gaussian Processes

14 0.56619471 138 nips-2006-Multi-Task Feature Learning

15 0.56548208 167 nips-2006-Recursive ICA

16 0.56502318 117 nips-2006-Learning on Graph with Laplacian Regularization

17 0.56489199 100 nips-2006-Information Bottleneck for Non Co-Occurrence Data

18 0.56477851 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning

19 0.56418526 35 nips-2006-Approximate inference using planar graph decomposition

20 0.56373078 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures