nips nips2001 nips2001-58 knowledge-graph by maker-knowledge-mining

58 nips-2001-Covariance Kernels from Bayesian Generative Models


Source: pdf

Author: Matthias Seeger

Abstract: We propose the framework of mutual information kernels for learning covariance kernels, as used in Support Vector machines and Gaussian process classifiers, from unlabeled task data using Bayesian techniques. We describe an implementation of this framework which uses variational Bayesian mixtures of factor analyzers in order to attack classification problems in high-dimensional spaces where labeled data is sparse, but unlabeled data is abundant.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We propose the framework of mutual information kernels for learning covariance kernels, as used in Support Vector machines and Gaussian process classifiers, from unlabeled task data using Bayesian techniques. [sent-4, score-0.526]

2 We describe an implementation of this framework which uses variational Bayesian mixtures of factor analyzers in order to attack classification problems in high-dimensional spaces where labeled data is sparse, but unlabeled data is abundant. [sent-5, score-0.747]

3 They are based on the use of a kernel function which encodes task prior knowledge in a Bayesian manner. [sent-7, score-0.507]

4 In this paper, we propose the framework of mutual information (MI) kernels for learning covariance kernels from unlabeled task data using Bayesian techniques. [sent-8, score-0.628]

5 We also discuss some general ideas for discriminative semi-supervised learning and kernel design in this context. [sent-10, score-0.497]

6 We note that the Fisher kernel [4] is a special case of a MI kernel. [sent-12, score-0.437]

7 MI kernels for mixture models are discussed in detail. [sent-13, score-0.21]

8 In section 3, we describe an implementation of an MI kernel for variational Bayesian mixtures of factor analyzers (VB-MoFA) models and show results of preliminary experiments. [sent-14, score-0.788]

9 In the semi-supervised classification problem, a labeled dataset Dl = {(x1, t1), ..., (xm, tm)} [sent-15, score-0.246]

10 as well as an unlabeled set Du = {x_{m+1}, ..., x_{m+n}} are given for training, both i.i.d. [sent-18, score-0.165]

11 Our aim is to fit models to Du in a Bayesian way, thereby extracting (posterior) information, then use this information to build a covariance kernel K. [sent-25, score-0.582]

12 Afterwards, K will be plugged into a supervised kernel machine, which is trained on the labeled data Dl to perform the classification task. [sent-26, score-0.67]
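The two-stage recipe in the last two sentences can be made concrete with a short sketch. This is not the paper's VB-MoFA implementation: it fits a plain maximum-likelihood Gaussian mixture to Du (a point-estimate stand-in for the Bayesian posterior), uses the inner product of mixture responsibilities as a simple cluster-respecting kernel, and hands the resulting Gram matrices to an SVM with a precomputed kernel; all names and the toy data are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def responsibility_kernel(gmm, X1, X2):
    """Stand-in kernel: inner product of posterior component responsibilities,
    so points assigned to the same mixture component correlate strongly."""
    R1 = gmm.predict_proba(X1)   # shape (n1, n_components)
    R2 = gmm.predict_proba(X2)
    return R1 @ R2.T

# Toy data: Du is unlabeled, (Xl, tl) is a small labeled set.
rng = np.random.RandomState(0)
Du = np.vstack([rng.randn(250, 2) + [0, 0], rng.randn(250, 2) + [4, 4]])
Xl = np.vstack([rng.randn(10, 2) + [0, 0], rng.randn(10, 2) + [4, 4]])
tl = np.array([0] * 10 + [1] * 10)
Xtest = np.vstack([rng.randn(100, 2) + [0, 0], rng.randn(100, 2) + [4, 4]])

# Step 1: fit a generative model to the unlabeled data only.
gmm = GaussianMixture(n_components=2, random_state=0).fit(Du)

# Step 2: build the Gram matrices from the fitted model.
K_train = responsibility_kernel(gmm, Xl, Xl)
K_test = responsibility_kernel(gmm, Xtest, Xl)

# Step 3: plug K into a supervised kernel machine trained on the labeled data.
clf = SVC(kernel="precomputed").fit(K_train, tl)
print("test accuracy:", clf.score(K_test, np.array([0] * 100 + [1] * 100)))
```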

13 After having chosen a model family {P(x|θ)} and a prior distribution P(θ) over parameters θ, the posterior distribution P(θ|Du) ∝ P(Du|θ)P(θ), where P(Du|θ) = ∏_{i=m+1}^{m+n} P(x_i|θ), encodes all information that Du contains about the latent (i.e. unobserved) distribution. [sent-30, score-0.218]

14 The other learning scenario is supervised classification, using a kernel machine. [sent-33, score-0.512]

15 Such architectures model a smooth latent function y(x) ∈ ℝ as a random process, together with a classification noise model P(t|y). [sent-34, score-0.168]

16 The covariance kernel K specifies the prior distribution for this process: namely, a priori, y(x) is assumed to be a Gaussian process with zero mean and covariance function K. [sent-35, score-0.741]
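For illustration only, the following snippet draws a latent function from a zero-mean Gaussian process with a generic squared-exponential covariance and pushes it through a logistic classification noise model; neither this kernel nor this noise model is claimed to be the specific choice made in the paper.

```python
import numpy as np

def rbf_cov(X1, X2, beta=1.0):
    """Generic squared-exponential covariance K(x, x') = exp(-beta * ||x - x'||^2 / 2)."""
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * beta * d2)

rng = np.random.RandomState(1)
X = np.linspace(-3, 3, 50)[:, None]

# A priori, y(x) is a zero-mean Gaussian process with covariance K.
K = rbf_cov(X, X) + 1e-8 * np.eye(len(X))      # jitter for numerical stability
y = rng.multivariate_normal(np.zeros(len(X)), K)

# Classification noise model: P(t = +1 | y) via the logistic function.
p_plus = 1.0 / (1.0 + np.exp(-y))
t = np.where(rng.rand(len(X)) < p_plus, 1, -1)
print(t[:10])
```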

17 diag a is the matrix with diagonal a and 0 elsewhere. [sent-45, score-0.095]

18 Within the standard discriminative Bayesian classification scenario, unlabeled data cannot be used. [sent-49, score-0.335]

19 However, it is rather straightforward to modify this scenario by introducing the concept of conditional priors (see [6]). [sent-50, score-0.113]

20 If we have a discriminant model family {P(t|x; w)}, a conditional prior P(w|θ) allows us to encode prior knowledge and assumptions about how information about P(x) (i.e. [sent-51, score-0.243]

21 about θ) influences our assumptions about a-priori probabilities over discriminants w. [sent-53, score-0.087]

22 However, once such a prior is specified, we can in principle use the same powerful techniques for approximate Bayesian inference that have been developed for supervised discriminative settings. [sent-56, score-0.197]

23 Semi-supervised techniques that can be seen as employing conditional priors include co-training [1], feature selection based on clustering [7] and the Fisher kernel [4]. [sent-57, score-0.542]

24 For a probabilistic kernel technique, P(w|θ) is fully specified by a covariance function K(x(1), x(2)|θ) depending on θ. [sent-58, score-0.554]

25 The problem is therefore to find covariance kernels which (as GP priors) favour discriminants in some sense compatible with what we have learned about the input distribution P(x). [sent-59, score-0.438]

26 "close" under some distance), their labels (and latent outputs y) should be highly correlated. [sent-62, score-0.113]

27 Thus, one generic way of learning kernels from unlabeled data is to learn a distance between input points from the information about P( x). [sent-63, score-0.335]

28 A frequently used assumption about how classification labels may depend on P(x) is the cluster hypothesis: we assume discriminants whose decision boundaries lie between clusters in P(x) to be a-priori more likely than ones that label clusters inconsistently. [sent-64, score-0.547]

29 A general way of encoding this hypothesis is to learn a distance from P(x) which is consistent with clusters in P(x), i.e. [sent-65, score-0.153]

30 points within the same cluster are closer under this distance than points from different clusters. [sent-67, score-0.126]

31 Then, a natural kernel function would be K(x(1), x(2)) = exp(-β ||φ(x(1)) - φ(x(2))||²). [sent-71, score-0.437]
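A direct transcription of that formula, assuming some embedding function phi has already been learned from P(x); the identity map used in the usage line is only a placeholder.

```python
import numpy as np

def embedding_kernel(phi, X1, X2, beta=1.0):
    """K(x1, x2) = exp(-beta * ||phi(x1) - phi(x2)||^2) for a learned embedding phi."""
    E1, E2 = phi(X1), phi(X2)                         # rows are embedded points
    d2 = np.sum((E1[:, None, :] - E2[None, :, :]) ** 2, axis=-1)
    return np.exp(-beta * d2)

# Placeholder embedding (identity); a real phi would be learned from P(x).
phi = lambda X: X
X = np.random.RandomState(0).randn(5, 3)
print(embedding_kernel(phi, X, X).round(3))
```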

32 A natural choice for binary classification is to represent the log odds log(P(t = +1|x)/P(t = -1|x)) by y(x). [sent-73, score-0.16]

33 which immediately gives rise to a covariance kernel, without having to compute an approximate Euclidean embedding. [sent-74, score-0.117]

34 Remark: Our main aim in this paper is to construct kernels that can be learned from unlabeled data only. [sent-75, score-0.341]

35 In contrast to this, the task of learning a kernel from labeled data is somewhat simpler and can be approached in the following generic way: start with a parametric model family {y(x; w)}, with the interpretation that y(x; w) models the log odds log(P(t = +1|x)/P(t = -1|x)). [sent-76, score-0.631]

36 Fitting these models to labeled data Dl, we obtain a posterior P(w|Dl). [sent-77, score-0.146]

37 Now, a natural covariance kernel for our problem is simply K(x(1), x(2)) = ∫ y(x(1); w) y(x(2); w) Q(w) dw, where (say) Q(w) is the posterior P(w|Dl); then K^β is also a kernel for all β > 0 (see e.g. [sent-78, score-0.991]
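Such an integral can be approximated by simple Monte Carlo: draw w ~ Q(w) and average the products y(x(1); w) y(x(2); w). The estimate is positive semi-definite by construction, being an empirical second moment. The sketch below takes a linear discriminant y(x; w) = wᵀx and a Gaussian stand-in for Q(w) purely for illustration.

```python
import numpy as np

def mc_posterior_kernel(X1, X2, w_samples):
    """K(x1, x2) ~= (1/M) sum_m y(x1; w_m) y(x2; w_m), with y(x; w) = w^T x."""
    Y1 = X1 @ w_samples.T      # (n1, M): y(x; w_m) for every posterior sample
    Y2 = X2 @ w_samples.T
    return (Y1 @ Y2.T) / w_samples.shape[0]

rng = np.random.RandomState(0)
X = rng.randn(6, 3)
# Stand-in for Q(w), e.g. a Gaussian approximation to the posterior over discriminants.
w_samples = rng.multivariate_normal(np.zeros(3), np.eye(3), size=500)
K = mc_posterior_kernel(X, X, w_samples)
print(np.all(np.linalg.eigvalsh(K) > -1e-10))   # empirical check of positive semi-definiteness
```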

38 or the weighted inner product x(1)ᵀ V x(2) into the squared-exponential kernel (e.g. [sent-81, score-0.437]

39 It is easy to show that K in (3) is a valid covariance kernel, and we refer to it as the mutual information (MI) kernel. [sent-84, score-0.598]

40 Then, the MI kernel K is the RBF kernel (4) with β = 4/(p(4 + p/a)). [sent-86, score-0.874]

41 Thus, the RBF kernel is a special case of a MI kernel. [sent-87, score-0.437]
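Since the excerpt does not reproduce formula (3), the sketch below assumes the MI kernel normalises the base quantity Q(x(1), x(2)) = ∫ P(x(1)|θ) P(x(2)|θ) P_med(θ) dθ by its diagonal, K = Q(x(1), x(2)) / sqrt(Q(x(1), x(1)) Q(x(2), x(2))), which is the reading under which the Gaussian example above reduces to an RBF kernel with K(x, x) = 1. The Monte Carlo estimate over mediator samples is an illustration, not the paper's computation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mi_kernel_mc(X1, X2, theta_samples, p):
    """Monte Carlo estimate of Q(x1, x2) = E_med[ P(x1|theta) P(x2|theta) ],
    normalised as K = Q(x1, x2) / sqrt(Q(x1, x1) Q(x2, x2))  (assumed form of (3))."""
    def Q(A, B):
        out = np.zeros((len(A), len(B)))
        for th in theta_samples:
            pa = multivariate_normal.pdf(A, mean=th, cov=(p / 2) * np.eye(A.shape[1]))
            pb = multivariate_normal.pdf(B, mean=th, cov=(p / 2) * np.eye(B.shape[1]))
            out += np.outer(pa, pb)
        return out / len(theta_samples)
    Q12 = Q(X1, X2)
    q1, q2 = np.diag(Q(X1, X1)), np.diag(Q(X2, X2))
    return Q12 / np.sqrt(np.outer(q1, q2))

rng = np.random.RandomState(0)
p, a = 1.0, 4.0
# Mediator = Gaussian prior on the mean (lambda = 0 in the MTS notation below).
theta_samples = rng.multivariate_normal(np.zeros(2), a * np.eye(2), size=2000)
X = rng.randn(4, 2)
print(mi_kernel_mc(X, X, theta_samples, p).round(3))
```

Under this reading the diagonal of K is identically 1, matching the RBF special case quoted in the preceding sentences.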

42 The mediator distribution P_med(θ), motivated earlier in this section, should ideally encode information about the x generation process, just as the Bayesian posterior P(θ|Du). [sent-91, score-0.157]

43 On the other hand , we need to be able to control the influence that information from sources such as unlabeled data Du can have on the kernel (relying too much on such sources results in lack of robustness, see e. [sent-92, score-0.602]

44 Here, we propose model-trust scaling (MTS), by setting (5): P_med varies with λ from the (usually vague) prior P(θ) (λ = 0) towards the sharp posterior P(θ|Du) (λ = n), giving the Du information (via the model) more and more influence upon the kernel K. [sent-95, score-0.568]

45 The concrete effect of MTS on the kernel depends on the model family. [sent-96, score-0.437]

46 Example (continued): Again, P(x|θ) = N(x|θ, (p/2)I), with a flat prior P(θ) = 1 on the mean. [sent-97, score-0.105]

47 Thus, the MI kernel is again the RBF kernel (4) with β = 2/(p(2 + λ)). [sent-99, score-0.874]
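Taking the quoted β expression at face value, a few lines show how model-trust scaling changes the kernel in this running example; the parametrisation K = exp(-β ||x(1) - x(2)||²/2) for the RBF kernel (4) is an assumption, since (4) itself is not shown in the excerpt.

```python
import numpy as np

def rbf(x1, x2, beta):
    """Assumed form of the RBF kernel (4): K = exp(-beta * ||x1 - x2||^2 / 2)."""
    return np.exp(-0.5 * beta * np.sum((x1 - x2) ** 2))

p = 2.0                      # component variance is p/2 per dimension
x1, x2 = np.array([0.0, 0.0]), np.array([1.0, 1.0])

for lam in [0, 1, 5, 100]:   # model-trust parameter, from prior (0) towards posterior (n)
    beta = 2.0 / (p * (2.0 + lam))        # MTS beta quoted in the text
    print(f"lambda={lam:>3}  beta={beta:.4f}  K(x1,x2)={rbf(x1, x2, beta):.4f}")
```

With this parametrisation β shrinks as λ grows, i.e. the kernel's correlation length increases as more trust is placed in the model fitted to Du.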

48 For the more flexible model P(x|θ) = N(x|μ, Σ), θ = (μ, Σ), and the conjugate Jeffreys prior, the MI kernel is computed in [5]. [sent-100, score-0.437]

49 If the Bayesian analysis is done with conjugate prior-model pairs, the corresponding MI kernel can be computed easily, and for many of these cases, MTS has a very simple, analytic form (see [5]). [sent-101, score-0.474]

50 For example, applying the Laplace approximation to the computations on a model with flat prior P(θ) = 1 results in the Fisher kernel [4], see e.g. [sent-103, score-0.591]

51 However, in this paper we favour another approximation technique (see section 3). [sent-106, score-0.107]

52 2 Mutual Information Kernels for Mixture Models: If we apply the MI kernel framework to mixture models P(x|θ, π) = Σ_s π_s P(x|θ_s), we run into a problem. [sent-108, score-0.55]

53 As mentioned in section 1, we would like our kernel at least partly to encode the cluster hypothesis, i. [sent-109, score-0.564]

54 The fascinating idea of the Fisher kernel has indeed been the main motivation and inspiration for this paper. [sent-113, score-0.437]

55 Footnote 9: This does not mean that we (a-priori) believe they should have different labels, but only that the label (or better: the latent y(·)) at one of them should not depend strongly on y(·) at the other. [sent-114, score-0.09]

56 The MI kernel K is defined as before by (3) , based on the new Q. [sent-117, score-0.437]

57 If P_med(θ, π) = ∏_s P_med(θ_s) P_med(π) (which is true for the cases we will be interested in), we see that the original MI kernel arises as the special case W_{s1,s2} = E_{P_med}[π_{s1} π_{s2}]. Now, by choosing W = diag(E_{P_med}[π_s])_s, we arrive at an MI kernel K which (typically) behaves as expected w.r.t. [sent-118, score-0.912]

58 cluster separation (see figure 1), but does not exhibit long-range correlations between joined components. [sent-121, score-0.089]

59 In the present work, we restrict ourselves to this diagonal mixture kernel. [sent-122, score-0.134]

60 Note that this kernel can be seen as (normalized) mixture of MI kernels over the component models. [sent-123, score-0.707]
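A plug-in sketch of such a diagonal mixture kernel: each component contributes a term w_s p_s(x) p_s(x') built from a point-estimated component density p_s (the paper instead integrates over the component parameters under the mediator), with w_s playing the role of E[π_s], and the sum is normalised by its diagonal. Function and variable names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def diagonal_mixture_kernel(gmm, X1, X2):
    """Plug-in diagonal mixture kernel:
    Q(x, x') = sum_s w_s p_s(x) p_s(x'),  K = Q / sqrt(Q(x,x) Q(x',x'))."""
    def features(X):
        # phi_s(x) = sqrt(w_s) * N(x | mu_s, Sigma_s), one column per component
        cols = []
        for w, mu, cov in zip(gmm.weights_, gmm.means_, gmm.covariances_):
            cols.append(np.sqrt(w) * multivariate_normal.pdf(X, mean=mu, cov=cov))
        return np.column_stack(cols)
    F1, F2 = features(X1), features(X2)
    Q12 = F1 @ F2.T
    q1, q2 = np.sum(F1 ** 2, axis=1), np.sum(F2 ** 2, axis=1)
    return Q12 / np.sqrt(np.outer(q1, q2))

rng = np.random.RandomState(0)
Du = np.vstack([rng.randn(200, 2), rng.randn(200, 2) + [5, 0]])
gmm = GaussianMixture(n_components=2, random_state=0).fit(Du)
X = np.array([[0.0, 0.0], [0.5, 0.5], [5.0, 0.0]])
print(diagonal_mixture_kernel(gmm, X, X).round(3))
```

Points falling in the same fitted component get correlations near 1, points in different components near 0, which is the qualitative behaviour shown in figure 1.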

61 Figure 1: Kernel contours on a 2-cluster dataset (λ = 5, 100, 30). Figure 1 shows contour plots of the diagonal mixture kernel for VB-MoFA (see section 3), learned on a 500-case dataset sampled from two Gaussians with equal covariance (see subsection 3. [sent-124, score-0.979]

62 The left and middle plots have the lower cluster's centre as a, with λ = 5 and λ = 100 respectively; the right plot's a lies between the cluster centres, λ = 30. [sent-128, score-0.123]

63 The effect of MTS can be seen by comparing left and middle, note the different sharpness of the slopes towards the other cluster and the different sizes and shapes of the "high correlation" regions. [sent-129, score-0.118]

64 As seen on the right, points between clusters have highest correlation with other such inter-cluster points, a feature that may be very useful for successful discrimination. [sent-130, score-0.116]

65 3 Experiments with Mixtures of Factor Analyzers: In this section, we describe an implementation of an MI kernel, using variational Bayesian mixtures of factor analyzers (VB-MoFA) [2] as density models. [sent-131, score-0.788]

66 VB-MoFA is a variational approximation to Bayesian analysis on these models, able to deliver the posterior approximations we require for an MI kernel. [sent-133, score-0.258]

67 We employ the diagonal mixture kernel (see subsection 2. [sent-134, score-0.686]

68 Instead of implementing MTS analytically, we compute the VB approximation to the true posterior (i. [sent-136, score-0.11]

69 Plots using the one-step variational approximation (see 3) have a somewhat richer structure. [sent-142, score-0.227]

70 Our first idea was to approximate them by applying the VB technique once more, ending up with what we called one-step variational approximations. [sent-144, score-0.148]

71 Unfortunately, the MI kernel approximation based on these terms cannot be shown to be positive definite anymore! [sent-145, score-0.543]

72 In the remainder of this section we compare the VB-MoFA kernel with the RBF kernel (4) on two datasets, using a Laplace GP classifier (see [10]). [sent-147, score-0.874]

73 In each case we sample a training pool, a kernel dataset Du and a test set (mutually exclusive). [sent-148, score-0.518]

74 The VB-MoFA diagonal mixture kernel is learned on Du. [sent-149, score-0.614]

75 For a given training set size m, a run consists of sampling a training set Dl and a holdout set Dh (both of size m) from the training pool, tuning kernel parameters by validation on Dh, then testing on the test set. [sent-150, score-0.527]
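In schematic form, the protocol of the last three sentences looks as follows; an SVM with a precomputed Gram matrix stands in for the Laplace GP classifier used in the paper, and make_kernel is any hypothetical function returning a Gram matrix for a given parameter value.

```python
import numpy as np
from sklearn.svm import SVC

def run(make_kernel, pool_X, pool_t, test_X, test_t, m, params, rng):
    """One run: sample Dl and Dh (both size m) from the training pool, tune the
    kernel parameter by validation on Dh, then report error on the fixed test set."""
    idx = rng.permutation(len(pool_X))
    l_idx, h_idx = idx[:m], idx[m:2 * m]
    Xl, tl = pool_X[l_idx], pool_t[l_idx]
    Xh, th = pool_X[h_idx], pool_t[h_idx]

    best_param, best_acc = None, -np.inf
    for prm in params:                                   # validation on the holdout set Dh
        clf = SVC(kernel="precomputed").fit(make_kernel(Xl, Xl, prm), tl)
        acc = clf.score(make_kernel(Xh, Xl, prm), th)
        if acc > best_acc:
            best_param, best_acc = prm, acc

    clf = SVC(kernel="precomputed").fit(make_kernel(Xl, Xl, best_param), tl)
    return 1.0 - clf.score(make_kernel(test_X, Xl, best_param), test_t)

# Example usage with a plain RBF kernel as the tunable kernel family.
def rbf_kernel(X1, X2, beta):
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * beta * d2)

rng = np.random.RandomState(0)
pool_X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 3])
pool_t = np.array([0] * 50 + [1] * 50)
test_X = np.vstack([rng.randn(250, 2), rng.randn(250, 2) + 3])
test_t = np.array([0] * 250 + [1] * 250)
errors = [run(rbf_kernel, pool_X, pool_t, test_X, test_t, m=20,
              params=[0.1, 1.0, 10.0], rng=rng) for _ in range(5)]
print("mean test error over runs:", np.mean(errors))
```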

76 3.1 Two Gaussian clusters: The dataset is sampled from two 2-d Gaussians with the same non-spherical covariance (see figure 1), one for each class (the Bayes error is 2. [sent-155, score-0.255]

77 We use n = 500 points for Du, a training pool of 100 and a test set of 500 points. [sent-157, score-0.107]

78 The learning curves in figure 2 show that on this simple toy problem, on which the fitted VB-MoFA model represents the cluster structure in P(x) almost perfectly, the VB-MoFA MI kernel outperforms the RBF kernel for sample sizes n ≤ 40. [sent-158, score-1.058]

79 3.2 Handwritten Digits (MNIST): Twos against threes. We report results of preliminary experiments using the subset of twos and threes of the MNIST Handwritten Digits database. [sent-168, score-0.204]

80 Here, n = |Du| = 2000, the training pool contains 8089, and the test set 2000 cases. [sent-169, score-0.107]

81 Footnote 13: The estimates P(x|s) are obtained by integrating out the parameters θ_s using the variational posterior approximation. [sent-174, score-0.209]

82 The integral is not analytic, and we use a one-step variational approximation to it . [sent-175, score-0.197]

83 allows us to assess the "purity" of the component clusters w. [sent-176, score-0.118]

84 Furthermore, we show results for the one-step variational approximation to the MI kernel (MI-old). [sent-180, score-0.634]

85 Upper left: RBF kernel; upper middle: Baseline method; upper right: VB-MoFA MI kernel (first-order approx. [sent-209, score-0.437]

86 The fact that the first-order approximation to the MI kernel performs worse than the one-step variational approximation (although the latter may fail to be positive definite) indicates that the former is a poorer approximation. [sent-213, score-0.683]

87 We suspect this problem to be related to the high dimensionality of the input space, in which case probability densities tend to have a large dynamic range, and mixture component responsibility estimates tend to be very nonsmooth. [sent-215, score-0.108]

88 Thus, it seems to be necessary to extend the basic MI kernel framework by new scaling mechanisms in order to produce a smoother encoding of the prior assumptions. [sent-216, score-0.543]

89 Footnote 14: The baseline algorithm is based on the assumption that, given the component index s, the input point x and the label t are independent. [sent-217, score-0.114]

90 Only the conditional probabilities P(t|s) are learned, while P(x|s) and P(s) are obtained from the VB-MoFA model fitted to unlabeled data only. [sent-218, score-0.258]

91 Thus, success/failure of this method should be closely related to the degree of purity of the component clusters w. [sent-219, score-0.176]
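A compact sketch of the baseline described in footnote 14: under the stated independence assumption the class posterior is P(t|x) = Σ_s P(t|s) P(s|x), where P(s|x) are mixture responsibilities from a model fitted to unlabeled data only and P(t|s) is estimated from the few labeled points. The smoothing constant and all names below are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_baseline(gmm, Xl, tl, classes, alpha=1.0):
    """Estimate P(t|s) from labeled data, weighting each point by its responsibility P(s|x)."""
    R = gmm.predict_proba(Xl)                                       # (n_l, S) responsibilities
    counts = np.stack([R[tl == c].sum(axis=0) for c in classes])    # (C, S) soft counts
    return (counts + alpha) / (counts + alpha).sum(axis=0)          # columns: P(t|s)

def predict_baseline(gmm, p_t_given_s, X):
    """P(t|x) = sum_s P(t|s) P(s|x); the model of x never sees the labels."""
    R = gmm.predict_proba(X)                                        # (n, S)
    return R @ p_t_given_s.T                                        # (n, C)

# Usage: gmm fitted on unlabeled Du only, labels used solely for P(t|s).
rng = np.random.RandomState(0)
Du = np.vstack([rng.randn(300, 2), rng.randn(300, 2) + 4])
gmm = GaussianMixture(n_components=2, random_state=0).fit(Du)
Xl = np.vstack([rng.randn(5, 2), rng.randn(5, 2) + 4])
tl = np.array([0] * 5 + [1] * 5)
p_t_given_s = fit_baseline(gmm, Xl, tl, classes=[0, 1])
print(predict_baseline(gmm, p_t_given_s, np.array([[0.0, 0.0], [4.0, 4.0]])).round(3))
```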

92 Footnote 15: This is somewhat inconsistent, since we use a kernel function which might not be positive definite in a context (GP classification) which requires a covariance function. [sent-223, score-0.641]

93 Footnote 16: Note also that RBF kernel matrices can be evaluated significantly faster than those using the VB-MoFA MI kernel. [sent-224, score-0.437]

94 Discussion: The present work is probably most closely related to the Fisher kernel (see subsection 2. [sent-226, score-0.552]

95 Haussler [3] contains a wealth of material about kernel design for discrete objects x. [sent-230, score-0.437]

96 Watkins [9] mentions that expressions like Q in (1) are valid kernels for discrete x and countable parameter spaces. [sent-231, score-0.133]

97 Very recently we came across [11], which essentially describes a special case of the diagonal mixture kernel (see subsection 2. [sent-232, score-0.686]

98 He is interested in distance learning, does not apply his method to kernel machines and does not give a Bayesian interpretation. [sent-235, score-0.505]

99 We have presented a general framework for kernel learning from unlabeled data and described an approximate implementation using VB-MoFA models. [sent-236, score-0.674]

100 A straightforward application of this technique to high-dimensional real-world data did not prove successful, and in future work we will explore new ideas for extending the basic MI kernel framework in order to be able to deal with high-dimensional input spaces. [sent-237, score-0.473]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mi', 0.474), ('kernel', 0.437), ('mts', 0.174), ('prned', 0.174), ('unlabeled', 0.165), ('variational', 0.148), ('xio', 0.145), ('du', 0.14), ('kernels', 0.133), ('rbf', 0.13), ('bayesian', 0.12), ('covariance', 0.117), ('oidu', 0.116), ('subsection', 0.115), ('classification', 0.11), ('cluster', 0.089), ('clusters', 0.087), ('discriminants', 0.087), ('analyzers', 0.085), ('labeled', 0.085), ('dl', 0.081), ('mixture', 0.077), ('pool', 0.077), ('prn', 0.075), ('prior', 0.07), ('matthias', 0.069), ('fisher', 0.064), ('posterior', 0.061), ('discriminative', 0.06), ('duio', 0.058), ('epm', 0.058), ('favour', 0.058), ('idul', 0.058), ('ios', 0.058), ('mediator', 0.058), ('olx', 0.058), ('purity', 0.058), ('threes', 0.058), ('twos', 0.058), ('xls', 0.058), ('latent', 0.058), ('seeger', 0.057), ('definite', 0.057), ('fitted', 0.057), ('diagonal', 0.057), ('labels', 0.055), ('mnist', 0.055), ('baseline', 0.051), ('gp', 0.051), ('dataset', 0.051), ('odds', 0.05), ('mixtures', 0.05), ('approximation', 0.049), ('mutual', 0.044), ('rr', 0.043), ('edinburgh', 0.043), ('os', 0.043), ('yo', 0.043), ('learned', 0.043), ('vb', 0.04), ('priors', 0.04), ('supervised', 0.038), ('behaves', 0.038), ('diag', 0.038), ('curves', 0.038), ('encode', 0.038), ('distance', 0.037), ('scenario', 0.037), ('analytic', 0.037), ('tommi', 0.037), ('xm', 0.037), ('conditional', 0.036), ('implementation', 0.036), ('framework', 0.036), ('laplace', 0.035), ('flat', 0.035), ('ls', 0.035), ('chris', 0.035), ('ii', 0.035), ('middle', 0.034), ('david', 0.033), ('label', 0.032), ('factor', 0.032), ('handwritten', 0.032), ('component', 0.031), ('contour', 0.031), ('machines', 0.031), ('somewhat', 0.03), ('training', 0.03), ('ed', 0.03), ('report', 0.03), ('hypothesis', 0.029), ('nips', 0.029), ('fitting', 0.029), ('generative', 0.029), ('family', 0.029), ('powerful', 0.029), ('seen', 0.029), ('technical', 0.028), ('fit', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 58 nips-2001-Covariance Kernels from Bayesian Generative Models

Author: Matthias Seeger

Abstract: We propose the framework of mutual information kernels for learning covariance kernels, as used in Support Vector machines and Gaussian process classifiers, from unlabeled task data using Bayesian techniques. We describe an implementation of this framework which uses variational Bayesian mixtures of factor analyzers in order to attack classification problems in high-dimensional spaces where labeled data is sparse, but unlabeled data is abundant.

2 0.21451892 15 nips-2001-A New Discriminative Kernel From Probabilistic Models

Author: Koji Tsuda, Motoaki Kawanabe, Gunnar Rätsch, Sören Sonnenburg, Klaus-Robert Müller

Abstract: Recently, Jaakkola and Haussler proposed a method for constructing kernel functions from probabilistic models. Their so called

3 0.21365611 164 nips-2001-Sampling Techniques for Kernel Methods

Author: Dimitris Achlioptas, Frank Mcsherry, Bernhard Schölkopf

Abstract: We propose randomized techniques for speeding up Kernel Principal Component Analysis on three levels: sampling and quantization of the Gram matrix in training, randomized rounding in evaluating the kernel expansions, and random projections in evaluating the kernel itself. In all three cases, we give sharp bounds on the accuracy of the obtained approximations. Rather intriguingly, all three techniques can be viewed as instantiations of the following idea: replace the kernel function by a “randomized kernel” which behaves like the original kernel in expectation.

4 0.20489912 134 nips-2001-On Kernel-Target Alignment

Author: Nello Cristianini, John Shawe-Taylor, André Elisseeff, Jaz S. Kandola

Abstract: We introduce the notion of kernel-alignment, a measure of similarity between two kernel functions or between a kernel and a target function. This quantity captures the degree of agreement between a kernel and a given learning task, and has very natural interpretations in machine learning, leading also to simple algorithms for model selection and learning. We analyse its theoretical properties, proving that it is sharply concentrated around its expected value, and we discuss its relation with other standard measures of performance. Finally we describe some of the algorithms that can be obtained within this framework, giving experimental results showing that adapting the kernel to improve alignment on the labelled data significantly increases the alignment on the test set, giving improved classification accuracy. Hence, the approach provides a principled method of performing transduction. Keywords: Kernels, alignment, eigenvectors, eigenvalues, transduction 1

5 0.20284101 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines

Author: Manfred Opper, Robert Urbanczik

Abstract: Using methods of Statistical Physics, we investigate the role of model complexity in learning with support vector machines (SVMs). We show the advantages of using SVMs with kernels of infinite complexity on noisy target rules, which, in contrast to common theoretical beliefs, are found to achieve optimal generalization error although the training error does not converge to the generalization error. Moreover, we find a universal asymptotics of the learning curves which only depend on the target rule but not on the SVM kernel. 1

6 0.19749486 170 nips-2001-Spectral Kernel Methods for Clustering

7 0.16784793 21 nips-2001-A Variational Approach to Learning Curves

8 0.1594955 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimentions

9 0.15784353 64 nips-2001-EM-DD: An Improved Multiple-Instance Learning Technique

10 0.15211102 74 nips-2001-Face Recognition Using Kernel Methods

11 0.14729035 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine

12 0.13710293 144 nips-2001-Partially labeled classification with Markov random walks

13 0.12304697 56 nips-2001-Convolution Kernels for Natural Language

14 0.11952552 172 nips-2001-Speech Recognition using SVMs

15 0.11560779 95 nips-2001-Infinite Mixtures of Gaussian Process Experts

16 0.11547352 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior

17 0.11404213 103 nips-2001-Kernel Feature Spaces and Nonlinear Blind Souce Separation

18 0.11279272 153 nips-2001-Product Analysis: Learning to Model Observations as Products of Hidden Variables

19 0.11032802 60 nips-2001-Discriminative Direction for Kernel Classifiers

20 0.10977282 171 nips-2001-Spectral Relaxation for K-means Clustering


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.31), (1, 0.198), (2, -0.094), (3, -0.212), (4, 0.072), (5, 0.146), (6, 0.087), (7, 0.019), (8, -0.07), (9, 0.154), (10, -0.022), (11, -0.011), (12, -0.046), (13, -0.224), (14, -0.067), (15, 0.026), (16, 0.024), (17, 0.11), (18, -0.079), (19, 0.006), (20, 0.009), (21, 0.081), (22, -0.142), (23, 0.002), (24, 0.2), (25, 0.061), (26, 0.095), (27, 0.027), (28, -0.067), (29, -0.051), (30, -0.158), (31, 0.063), (32, -0.014), (33, -0.046), (34, -0.074), (35, 0.007), (36, -0.01), (37, -0.062), (38, -0.103), (39, -0.095), (40, -0.001), (41, 0.018), (42, -0.024), (43, 0.034), (44, 0.02), (45, 0.009), (46, -0.13), (47, 0.062), (48, 0.009), (49, -0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9684881 58 nips-2001-Covariance Kernels from Bayesian Generative Models

Author: Matthias Seeger

Abstract: We propose the framework of mutual information kernels for learning covariance kernels, as used in Support Vector machines and Gaussian process classifiers, from unlabeled task data using Bayesian techniques. We describe an implementation of this framework which uses variational Bayesian mixtures of factor analyzers in order to attack classification problems in high-dimensional spaces where labeled data is sparse, but unlabeled data is abundant. 1

2 0.7531814 15 nips-2001-A New Discriminative Kernel From Probabilistic Models

Author: Koji Tsuda, Motoaki Kawanabe, Gunnar Rätsch, Sören Sonnenburg, Klaus-Robert Müller

Abstract: Recently, Jaakkola and Haussler proposed a method for constructing kernel functions from probabilistic models. Their so called

3 0.64328712 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines

Author: Manfred Opper, Robert Urbanczik

Abstract: Using methods of Statistical Physics, we investigate the role of model complexity in learning with support vector machines (SVMs). We show the advantages of using SVMs with kernels of infinite complexity on noisy target rules, which, in contrast to common theoretical beliefs, are found to achieve optimal generalization error although the training error does not converge to the generalization error. Moreover, we find a universal asymptotics of the learning curves which only depend on the target rule but not on the SVM kernel. 1

4 0.62507206 164 nips-2001-Sampling Techniques for Kernel Methods

Author: Dimitris Achlioptas, Frank Mcsherry, Bernhard Schölkopf

Abstract: We propose randomized techniques for speeding up Kernel Principal Component Analysis on three levels: sampling and quantization of the Gram matrix in training, randomized rounding in evaluating the kernel expansions, and random projections in evaluating the kernel itself. In all three cases, we give sharp bounds on the accuracy of the obtained approximations. Rather intriguingly, all three techniques can be viewed as instantiations of the following idea: replace the kernel function by a “randomized kernel” which behaves like in expectation.

5 0.61643064 64 nips-2001-EM-DD: An Improved Multiple-Instance Learning Technique

Author: Qi Zhang, Sally A. Goldman

Abstract: We present a new multiple-instance (MI) learning technique (EM-DD) that combines EM with the diverse density (DD) algorithm. EM-DD is a general-purpose MI algorithm that can be applied with boolean or real-value labels and makes real-value predictions. On the boolean Musk benchmarks, the EM-DD algorithm without any tuning significantly outperforms all previous algorithms. EM-DD is relatively insensitive to the number of relevant attributes in the data set and scales up well to large bag sizes. Furthermore, EM-DD provides a new framework for MI learning, in which the MI problem is converted to a single-instance setting by using EM to estimate the instance responsible for the label of the bag. 1

6 0.60308188 134 nips-2001-On Kernel-Target Alignment

7 0.59853178 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimentions

8 0.5504092 74 nips-2001-Face Recognition Using Kernel Methods

9 0.54100889 170 nips-2001-Spectral Kernel Methods for Clustering

10 0.51233369 113 nips-2001-Learning a Gaussian Process Prior for Automatically Generating Music Playlists

11 0.50798225 155 nips-2001-Quantizing Density Estimators

12 0.4986833 21 nips-2001-A Variational Approach to Learning Curves

13 0.47118962 66 nips-2001-Efficiency versus Convergence of Boolean Kernels for On-Line Learning Algorithms

14 0.46481508 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines

15 0.46328735 178 nips-2001-TAP Gibbs Free Energy, Belief Propagation and Sparsity

16 0.4583675 56 nips-2001-Convolution Kernels for Natural Language

17 0.45602673 103 nips-2001-Kernel Feature Spaces and Nonlinear Blind Souce Separation

18 0.4337565 48 nips-2001-Characterizing Neural Gain Control using Spike-triggered Covariance

19 0.42419735 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine

20 0.42097539 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.23), (14, 0.079), (17, 0.037), (19, 0.033), (27, 0.132), (30, 0.043), (38, 0.03), (59, 0.039), (72, 0.094), (79, 0.053), (83, 0.017), (91, 0.134)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.89914083 32 nips-2001-An Efficient, Exact Algorithm for Solving Tree-Structured Graphical Games

Author: Michael L. Littman, Michael J. Kearns, Satinder P. Singh

Abstract: We describe a new algorithm for computing a Nash equilibrium in graphical games, a compact representation for multi-agent systems that we introduced in previous work. The algorithm is the first to compute equilibria both efficiently and exactly for a non-trivial class of graphical games. 1

same-paper 2 0.84740204 58 nips-2001-Covariance Kernels from Bayesian Generative Models

Author: Matthias Seeger

Abstract: We propose the framework of mutual information kernels for learning covariance kernels, as used in Support Vector machines and Gaussian process classifiers, from unlabeled task data using Bayesian techniques. We describe an implementation of this framework which uses variational Bayesian mixtures of factor analyzers in order to attack classification problems in high-dimensional spaces where labeled data is sparse, but unlabeled data is abundant. 1

3 0.81137824 84 nips-2001-Global Coordination of Local Linear Models

Author: Sam T. Roweis, Lawrence K. Saul, Geoffrey E. Hinton

Abstract: High dimensional data that lies on or near a low dimensional manifold can be described by a collection of local linear models. Such a description, however, does not provide a global parameterization of the manifold—arguably an important goal of unsupervised learning. In this paper, we show how to learn a collection of local linear models that solves this more difficult problem. Our local linear models are represented by a mixture of factor analyzers, and the “global coordination” of these models is achieved by adding a regularizing term to the standard maximum likelihood objective function. The regularizer breaks a degeneracy in the mixture model’s parameter space, favoring models whose internal coordinate systems are aligned in a consistent way. As a result, the internal coordinates change smoothly and continuously as one traverses a connected path on the manifold—even when the path crosses the domains of many different local models. The regularizer takes the form of a Kullback-Leibler divergence and illustrates an unexpected application of variational methods: not to perform approximate inference in intractable probabilistic models, but to learn more useful internal representations in tractable ones. 1 Manifold Learning Consider an ensemble of images, each of which contains a face against a neutral background. Each image can be represented by a point in the high dimensional vector space of pixel intensities. This representation, however, does not exploit the strong correlations between pixels of the same image, nor does it support many useful operations for reasoning about faces. If, for example, we select two images with faces in widely different locations and then average their pixel intensities, we do not obtain an image of a face at their average location. Images of faces lie on or near a low-dimensional, curved manifold, and we can represent them more usefully by the coordinates on this manifold than by pixel intensities. Using these “intrinsic coordinates”, the average of two faces is another face with the average of their locations, poses and expressions. To analyze and manipulate faces, it is helpful to imagine a “magic black box” with levers or dials corresponding to the intrinsic coordinates on this manifold. Given a setting of the levers and dials, the box generates an image of a face. Given an image of a face, the box deduces the appropriate setting of the levers and dials. In this paper, we describe a fairly general way to construct such a box automatically from an ensemble of high-dimensional vectors. We assume only that there exists an underlying manifold of low dimensionality and that the relationship between the raw data and the manifold coordinates is locally linear and smoothly varying. Thus our method applies not only to images of faces, but also to many other forms of highly distributed perceptual and scientific data (e.g., spectrograms of speech, robotic sensors, gene expression arrays, document collections). 2 Local Linear Models The global structure of perceptual manifolds (such as images of faces) tends to be highly nonlinear. Fortunately, despite their complicated global structure, we can usually characterize these manifolds as locally linear. Thus, to a good approximation, they can be represented by collections of simpler models, each of which describes a locally linear neighborhood[3, 6, 8]. For unsupervised learning tasks, a probabilistic model that nicely captures this intuition is a mixture of factor analyzers (MFA)[5]. 
The model is used to describe high dimensional data that lies on or near a lower dimensional manifold. MFAs parameterize a joint distribution over observed and hidden variables, where the observed variable represents the high dimensional data, the discrete hidden variable indexes different neighborhoods on the manifold, and the continuous hidden variables represent low dimensional local coordinates. The model assumes that data is sampled from different neighborhoods on the manifold with prior probabilities, and that within each neighborhood, the data's local coordinates are normally distributed.

4 0.70463139 13 nips-2001-A Natural Policy Gradient

Author: Sham M. Kakade

Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1

5 0.6988712 8 nips-2001-A General Greedy Approximation Algorithm with Applications

Author: T. Zhang

Abstract: Greedy approximation algorithms have been frequently used to obtain sparse solutions to learning problems. In this paper, we present a general greedy algorithm for solving a class of convex optimization problems. We derive a bound on the rate of approximation for this algorithm, and show that our algorithm includes a number of earlier studies as special cases.

6 0.69801456 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding

7 0.69581115 95 nips-2001-Infinite Mixtures of Gaussian Process Experts

8 0.69472218 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior

9 0.6938256 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines

10 0.69266462 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

11 0.69133914 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines

12 0.69130653 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines

13 0.69107825 197 nips-2001-Why Neuronal Dynamics Should Control Synaptic Learning Rules

14 0.69048357 132 nips-2001-Novel iteration schemes for the Cluster Variation Method

15 0.68998265 134 nips-2001-On Kernel-Target Alignment

16 0.68750525 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines

17 0.68547821 155 nips-2001-Quantizing Density Estimators

18 0.68512422 135 nips-2001-On Spectral Clustering: Analysis and an algorithm

19 0.68471473 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes

20 0.68393201 36 nips-2001-Approximate Dynamic Programming via Linear Programming