jmlr jmlr2008 jmlr2008-70 knowledge-graph by maker-knowledge-mining

70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces


Source: pdf

Author: Mikio L. Braun, Joachim M. Buhmann, Klaus-Robert Müller

Abstract: We show that the relevant information of a supervised learning problem is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem in the sense that it can asymptotically represent the function to be learned and is sufficiently smooth. Thus, kernels do not only transform data sets such that good generalization can be achieved using only linear discriminant functions, but this transformation is also performed in a manner which makes economical use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data for supervised learning problems. Practically, we propose an algorithm which enables us to recover the number of leading kernel PCA components relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to aid in model selection, and (3) to denoise in feature space in order to yield better classification results. Keywords: kernel methods, feature space, dimension reduction, effective dimensionality

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Practically, we propose an algorithm which enables us to recover the number of leading kernel PCA components relevant for good classification. [sent-14, score-0.713]

2 Keywords: kernel methods, feature space, dimension reduction, effective dimensionality 1. [sent-16, score-0.619]

3 This result is based on recent approximation bounds for the eigenvectors of the kernel matrix which show that if a function can be reconstructed using only a few kernel PCA components asymptotically, then the same already holds in a finite sample setting, even for small sample sizes. [sent-33, score-1.052]

4 This finding underlines the efficient use of data that is made by kernel machines if the kernel works well for the learning problem. [sent-35, score-0.832]

5 We can visualize the contributions of the individual kernel PCA components to the class membership by plotting the absolute values of scalar products between the labels and the kernel PCA components. [sent-41, score-1.002]
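
As a quick illustration of this kind of plot (a NumPy sketch, not the authors' code; the rbf kernel, its width, and the toy labels are arbitrary choices), the contributions are simply the absolute values |u_m^T Y| of the kernel PCA coefficients:

    import numpy as np

    def rbf(A, B, w=1.0):
        # Gaussian kernel k(a, b) = exp(-||a - b||^2 / (2 w^2)) between rows of A and B.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * w ** 2))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    Y = np.sign(X[:, 0] * X[:, 1]) + 0.3 * rng.normal(size=200)   # toy labels

    K = rbf(X, X)                          # kernel matrix
    l, U = np.linalg.eigh(K)
    l, U = l[::-1], U[:, ::-1]             # sort by decreasing principal value
    contributions = np.abs(U.T @ Y)        # |u_m^T Y| for each kernel PCA direction
    print(contributions[:10])              # typically concentrated in the leading directions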

6 We can observe that the contributions are concentrated in the leading kernel PCA directions, but a large fraction of the information is contained in the later components as well. [sent-43, score-0.659]

7 One first projects onto the subspace spanned by a number of leading kernel PCA components, trains a linear classifier (for example, by least squares regression) and then measures the prediction error on the test set. [sent-48, score-0.702]
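
A hedged sketch of this procedure on toy data (the rbf kernel, its width, the train/test split, and the range of d are placeholders; the out-of-sample evaluation of the kernel PCA components follows the formula for f_m discussed later in the text, using an unnormalized kernel matrix):

    import numpy as np

    def rbf(A, B, w=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * w ** 2))

    rng = np.random.default_rng(1)
    n = 200
    X = rng.normal(size=(2 * n, 2))
    Y = np.sign(X[:, 0] ** 2 + X[:, 1] ** 2 - 1.5)               # toy two-class labels
    Xtr, Ytr, Xte, Yte = X[:n], Y[:n], X[n:], Y[n:]

    K = rbf(Xtr, Xtr)
    l, U = np.linalg.eigh(K)
    l, U = l[::-1], U[:, ::-1]                                   # leading components first
    z = U.T @ Ytr                                                # kernel PCA coefficients of the labels

    d_max = 50                                                   # restrict d for numerical stability
    F_te = rbf(Xte, Xtr) @ U[:, :d_max] / l[:d_max]              # f_m evaluated on the test points
    test_err = [np.mean(np.sign(F_te[:, :d] @ z[:d]) != Yte) for d in range(1, d_max + 1)]
    print(int(np.argmin(test_err)) + 1, min(test_err))           # best d and its test error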

8 The test error is large either if the considered subspace did not capture all of the relevant information, or if it already contained too much noise leading to overfitting. [sent-49, score-0.571]

9 If the minimal test error is on par with a state-of-the-art method independently trained using the same kernel then the subspace has successfully captured all of the relevant information. [sent-50, score-0.717]

10 Therefore, we see that the later components only contain noise, and the relevant information is contained in the leading kernel PCA components. [sent-55, score-0.788]

11 Roughly, instead of the covariance matrix, one considers the eigenvalues and eigenvectors of the kernel matrix, which is built from all pairwise evaluations of the kernel function on the inputs. [sent-59, score-1.101]

12 Principal values (variances) are still given by the eigenvalues of the kernel matrix, but principal directions (which would be potentially infinite-dimensional vectors) are replaced by principal components, which are scalar products with the principal directions. [sent-60, score-0.803]

13 [figure axis ticks omitted] (b) Contributions of kernel PCA components. [sent-68, score-0.918]

14 [figure axis ticks omitted] (c) Training and test errors using only leading kernel PCA components (x-axis: number of kernel PCA components). [sent-74, score-1.037]

15 Our claim—that the relevant information about a learning problem is contained in the space spanned by the leading kernel PCA components—is similar to the idea that the information about the learning problem is contained in the kernel PCA components with the largest contributions. [sent-81, score-1.319]

16 Instead, we show that the leading kernel PCA components (sorted by corresponding principal value) contain the relevant information. [sent-83, score-0.761]

17 Roughly speaking, the relevant dimensionality of the data set corresponds to the complexity of the learning problem when viewed through the “lens” of the kernel function. [sent-89, score-0.657]

18 We summarize the main contributions of this paper: (1) We provide theoretical bounds showing that the relevant information (defined in Section 2) is actually contained in the leading projected kernel principal components under appropriate conditions. [sent-96, score-0.836]

19 (2) We propose an algorithm which estimates the relevant dimensionality and related quantities of the data set, and which permits us to analyze the appropriateness of a kernel for the data set and thus to perform model selection among different kernels. [sent-97, score-0.767]

20 (3) We validate the accuracy of the estimates experimentally by showing that non-regularized methods perform on par with state-of-the-art kernel methods on the reduced feature space. [sent-98, score-0.528]

21 Summarizing all the pairwise scalar products results in the (normalized) kernel matrix K with entries k(Xi , X j )/n. [sent-116, score-0.563]

22 ,Yn ) and the kernel PCA components which are introduced next. [sent-120, score-0.502]

23 with the vectors directly, the principal directions are represented using the points X_i of the data set: v_m = ∑_{i=1}^n α_i Φ(X_i), where α_i = [u_m]_i / l_m, [u_m]_i is the ith component of the mth eigenvector of the kernel matrix K, and l_m the corresponding eigenvalue. [sent-136, score-0.727]
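
A minimal NumPy sketch of these definitions (with my own normalization choice: the unnormalized kernel matrix is decomposed, which only rescales the eigenvalues; everything else follows the text):

    import numpy as np

    def rbf(A, B, w=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * w ** 2))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    K = rbf(X, X)                              # kernel matrix
    l, U = np.linalg.eigh(K)
    l, U = l[::-1], U[:, ::-1]                 # principal values l_m and eigenvectors u_m

    m = 2                                      # pick the third principal direction
    alpha = U[:, m] / l[m]                     # v_m = sum_i alpha_i Phi(X_i)
    f_m_on_sample = K @ alpha                  # f_m(X_j) = sum_i alpha_i k(X_j, X_i)
    print(np.allclose(f_m_on_sample, U[:, m])) # Lemma 1 below: equals the m-th eigenvector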

24 Still, v_m can usually not be computed explicitly, such that one instead works with the kernel PCA components f_m(x) = ⟨Φ(x), v_m⟩. [sent-137, score-0.622]

25 As we have seen in the introduction, it seems that only a finite number of leading kernel PCA components are necessary to represent the relevant information about the learning problem up to a small error. [sent-139, score-0.713]

26 Lemma 1 The mth kernel PCA component f_m evaluated on the X_i's is equal to the mth eigenvector of the kernel matrix K: (f_m(X_1), . [sent-153, score-1.091]

27 Consequently, the sample vectors are orthogonal, and the projection of a vector Y ∈ R^n onto the leading d kernel PCA components is given by π_d(Y) = ∑_{m=1}^d u_m u_m^T Y. [sent-157, score-0.751]

28 Proof The mth kernel PCA component for a point X_j in the training set is f_m(X_j) = ⟨Φ(X_j), v_m⟩ = (1/l_m) ∑_{i=1}^n ⟨Φ(X_j), Φ(X_i)⟩ [u_m]_i = (1/l_m) ∑_{i=1}^n k(X_j, X_i) [u_m]_i. [sent-158, score-0.642]

29 Since K is a symmetric matrix, its eigenvectors u_m are orthonormal, and the projection of Y onto the space spanned by the first d kernel PCA components is given by ∑_{m=1}^d u_m u_m^T Y. [sent-161, score-0.962]

30 Since the kernel PCA components are orthogonal, the coefficients of a vector Y ∈ R^n with respect to the basis u_1, . [sent-162, score-0.502]

31 the basis formed from the kernel PCA components the kernel PCA coefficients. [sent-169, score-0.918]

32 The projection of Y to a kernel PCA component can be thought of as the least squares regression of Y using only the direction along the kernel PCA component in feature space. [sent-171, score-0.899]

33 Using the kernel PCA coefficients, we can extend the projected labels to new points via Ŷ(x) = ∑_{m=1}^d z_m f_m(x), which amounts to the prediction of least squares regression on the reduced feature space. [sent-172, score-0.583]

34 The Label Vector and Kernel PCA Components In the introduction, we have discussed an example which suggests that a small number of leading kernel PCA components might suffice to capture the relevant information about the output variable. [sent-174, score-0.755]

35 In this setting, we are now interested in showing that G is contained in the leading kernel PCA components, such that projecting G onto the leading kernel PCA components leads to only negligible error. [sent-207, score-1.188]

36 However, as we will see below, unregularized methods perform on par with kernel methods with capacity control using the estimated relevant information vector G. [sent-214, score-0.623]

37 The location of G with respect to the kernel PCA components is characterized by scalar products with the eigenvectors of the kernel matrix. [sent-217, score-1.121]

38 As a consequence, the eigenvalues and eigenvectors of T̃_k, which are equal to those of the kernel matrix, converge to those of T_k (see Koltchinskii and Giné, 2000; Koltchinskii, 1998). [sent-221, score-0.646]

39 The decay rate of the eigenvalues depends on the interplay between the kernel and the distribution PX . [sent-232, score-0.671]

40 As a rule of thumb, the eigenvalues decay quickly when the kernel is smooth at the scale of the data. [sent-234, score-0.639]

41 As we will see, most of the information about g is then contained in a few kernel PCA components. [sent-236, score-0.491]

42 Intuitively speaking, (2) translates to asymptotic representability of the learning problem: As n → ∞, it becomes possible to represent the optimal labels using the kernel function k. [sent-241, score-0.532]

43 In view of our original concern, the bound shows that the relevant information vector G (as introduced in Section 2) is contained in a number of leading kernel PCA components up to a negligible error. [sent-265, score-0.818]

44 Since this rate is related to the smoothness of the kernel function, the dimension is small for smooth kernels whose leading eigenfunctions ψ i permit good approximation of g. [sent-267, score-0.65]

45 3 The Noise To study the relationship between the noise and the eigenvectors of the kernel matrix, no asymptotic arguments are necessary. [sent-269, score-0.675]

46 The key insight is that the eigenvectors are independent of the noise in the labels, such that the noise vector N is typically evenly distributed over all coefficients u_i^T N: Let U be the matrix whose ith column is equal to u_i. [sent-270, score-0.519]
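
This independence argument is easy to check numerically; the sketch below is an illustration under an assumed i.i.d. Gaussian noise model, with an arbitrary fixed orthogonal matrix standing in for the eigenvectors, and shows that the squared coefficients u_i^T N have roughly the same expected size in every direction:

    import numpy as np

    rng = np.random.default_rng(0)
    n, sigma, trials = 200, 0.5, 2000

    # Any orthogonal matrix that does not depend on the noise works here,
    # e.g., the eigenvectors of a kernel matrix built from the inputs only.
    U, _ = np.linalg.qr(rng.normal(size=(n, n)))

    coeff_var = np.zeros(n)
    for _ in range(trials):
        N = sigma * rng.normal(size=n)         # label noise
        coeff_var += (U.T @ N) ** 2
    coeff_var /= trials
    print(coeff_var.min(), coeff_var.max())    # both close to sigma^2 = 0.25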

47 The practical relevance of these observations is that the relevant information and the noise part have radically different properties with respect to the kernel PCA components, allowing us to practically estimate the number of relevant dimensions for a given kernel and data set. [sent-279, score-1.284]

48 For example, a kernel might fail to provide an efficient representation of the learning problem, leading to an embedding requiring many kernel PCA components to capture the information on Y . [sent-284, score-1.071]

49 Therefore, in order to make practical use of the presented insights, we need to devise a method to estimate the number of relevant kernel PCA components for a given concrete data set and choice of kernel. [sent-286, score-0.658]

50 1 Relevant Dimension Estimation (RDE) The most basic estimate is the number of relevant kernel PCA components. [sent-298, score-0.572]

51 This decomposition carries over to the kernel PCA coefficients z_i = u_i^T Y = u_i^T G + u_i^T N. [sent-303, score-0.83]

52 Recall that U is the matrix whose columns are the eigenvectors u_i of the kernel matrix, such that z = U^T Y = U^T G + U^T N = G̃ + Ñ. [sent-310, score-0.727]

53 d̂ = argmin_{1≤d≤n} (− log ℓ(d)) (Equation 4). Due to numerical instabilities of kernel PCA components corresponding to small eigenvalues, the choice of d should be restricted to 1 ≤ d ≤ n' < n: the coefficients z_i are computed by taking scalar products with the eigenvectors u_i. [sent-330, score-0.843]
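
The exact negative log-likelihood behind Equation (4) is not reproduced here, so the following sketch should be read as an assumption: it uses a two-variance model in which the leading d coefficients share one variance (signal plus noise) and the remaining ones another (pure noise), which is the model the surrounding text suggests, and it selects d by minimizing the resulting negative log-likelihood:

    import numpy as np

    def rde(z, d_max=None):
        # z: kernel PCA coefficients u_m^T Y, sorted by decreasing principal value.
        n = len(z)
        d_max = d_max or n - 1                 # restrict d for numerical stability (cf. the text)
        best_d, best_nll = 1, np.inf
        for d in range(1, d_max + 1):
            s1 = np.mean(z[:d] ** 2)           # variance of the leading ("relevant") coefficients
            s2 = np.mean(z[d:] ** 2)           # variance of the trailing ("noise") coefficients
            if s1 <= 0.0 or s2 <= 0.0:
                continue
            nll = 0.5 * (d * np.log(s1) + (n - d) * np.log(s2))
            if nll < best_nll:
                best_d, best_nll = d, nll
        return best_d

    # Toy check: 10 clearly relevant leading coefficients plus noise everywhere.
    rng = np.random.default_rng(0)
    z = 0.1 * rng.normal(size=400)
    z[:10] += rng.choice([-1.0, 1.0], size=10)
    print(rde(z, d_max=200))                   # typically close to 10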

54 This algorithm only depends on our theoretical results to the extent that it searches for subspaces spanned by leading kernel PCA components. [sent-340, score-0.538]

55 As stated in Lemma 1, the projection of Y onto the space spanned by the d leading kernel PCA components is given by ∑_{i=1}^d u_i u_i^T Y, where the u_i are the eigenvectors of the kernel matrix. [sent-342, score-1.613]

56 Denoising the Labels and Estimating the Noise Level: One direct application of the dimensionality estimate is the projection of Y onto the first d̂ kernel PCA components. [sent-354, score-0.619]

57 Note that this amounts to computing the in-sample fit using kernel principal component regression (kPCR) (Equation 5). [sent-357, score-0.493]
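
A hedged sketch of both steps, with U, z, and d_hat as computed in the earlier sketches; the noise-level proxy taken from the discarded coefficients is my choice and not necessarily the paper's exact estimator:

    import numpy as np

    def denoise_labels(U, z, d_hat):
        # In-sample kernel PCR fit: keep only the first d_hat kernel PCA coefficients.
        y_denoised = U[:, :d_hat] @ z[:d_hat]
        # Simple noise-level proxy: average squared size of the discarded coefficients.
        noise_var = float(np.mean(z[d_hat:] ** 2))
        return y_denoised, noise_var

    # Usage: y_hat, sigma2 = denoise_labels(U, z, d_hat)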

58 Basically, the estimation error is small if the first d̂ kernel PCA components capture most of G and d̂ is small, such that most of the noise is removed. [sent-360, score-0.692]

59 Applications to Model Selection: A highly relevant problem in the context of kernel methods is the selection of a kernel, from a number of possible candidates, which fits the problem best. [sent-387, score-0.961]

60 Let us first discuss how the relation between the scale of the kernel and the data set can affect the dimensionality of the embedding in feature space. [sent-394, score-0.591]

61 Figure 5 shows the dimension and noise level estimates for a classification data set (the “banana” data set), and a regression data set (the “noisy sinc function” with 100 data points for training, and 1000 data points for testing) over a range of kernel widths. [sent-396, score-0.761]
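
The same kind of scan is easy to reproduce in a few lines; this sketch reuses the rbf and rde helpers from the sketches above, and the grid of widths and the noise proxy are placeholders:

    import numpy as np

    def rde_over_widths(X, Y, widths):
        # Returns (width, estimated dimension, noise proxy) for each candidate kernel width.
        results = []
        for w in widths:
            K = rbf(X, X, w)
            _, U = np.linalg.eigh(K)
            U = U[:, ::-1]                     # leading kernel PCA directions first
            z = U.T @ Y
            d_hat = rde(z, d_max=len(Y) // 2)
            results.append((w, d_hat, float(np.mean(z[d_hat:] ** 2))))
        return results

    # for w, d_hat, noise in rde_over_widths(X, Y, np.logspace(-1, 1, 9)):
    #     print(f"width={w:.2f}  d_hat={d_hat:3d}  noise~{noise:.3f}")

A width that yields both a small estimated dimension and a small noise proxy is, in this picture, the one that best separates the relevant information from the noise.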

62 This quantity also measures how well the kernel can separate the noise from the relevant information, and is directly linked to the test error on an independent data set. [sent-430, score-0.73]

63 [figure axis ticks omitted] (c) Kernel PCA coefficients for the complex data set. [sent-502, score-0.502]

64 [figure axis ticks omitted] (d) Kernel PCA coefficients for the noisy data set. [sent-503, score-0.535]

65 Below, the kernel PCA coefficients (scalar products with eigenvectors of the kernel matrix) for the optimal kernel selected based on the RDE (TCM) estimates are plotted. [sent-510, score-1.422]

66 To assess the accuracy of the dimensionality estimate, we compare an unregularized least-squares fit in the reduced feature space (RDE+kPCR) with kernel ridge regression (KRR) and support vector machines (SVM) on the original data set. [sent-525, score-0.562]

67 We would like to use our dimensionality and noise level estimate as a tool to examine different kernel choices. [sent-543, score-0.698]

68 The resulting kernel matrix has a much smaller estimated dimension, and also a smaller error rate (see Table 5). [sent-550, score-0.493]

69 Table 4: The data sets by noise level and complexity. Low noise / low dimensional: banana, thyroid, waveform. Low noise / high dimensional: image, ringnorm. High noise / low dimensional: breast-cancer, diabetes, flare-solar, german, heart, titanic. High noise / high dimensional: splice. [sent-665, score-0.54]

70 The reason is that the weighted degree kernel weights longer consecutive matches on the DNA differently while the rbf kernel just compares individual matches. [sent-669, score-0.866]

71 With respect to practical ... kernel: rbf / rbf (binary) / wdk; RDE: 87 / 11 / 29; est. [sent-674, score-0.516]

72 Table 5: Different kernels for the splice data set (for fixed kernel width w = 50). [sent-687, score-0.598]

73 One can prove that under mild assumptions on the general fit between the kernel and the learning problem, the information about the labels is always contained in the (typically small) subspace also containing most of the variance about the object features. [sent-721, score-0.611]

74 In conjunction with a sensible embedding provided by a suitable choice of the kernel function, it ensures that learning focuses on the relevant information and prevents overfitting. [sent-725, score-0.574]

75 Interestingly, RKHS-type penalty terms automatically ensure that the learned function focuses on directions in which the data has large variance, leading to a concentration on the leading kernel PCA components. [sent-726, score-0.58]

76 In summary, using the RDE-based estimates as a diagnostic tool, it is possible to obtain more detailed insights into how well a kernel is adapted to the characteristic properties of a data set and its underlying distribution than by using integrative performance measures like test errors only. [sent-740, score-0.492]

77 The “True” Dimensionality of the Data: We estimate the number of leading kernel PCA components necessary to capture the relevant information contained in the learning problem. [sent-742, score-0.857]

78 In our dimensionality estimate, the basis was fixed and given by leading kernel PCA components. [sent-744, score-0.61]

79 In other words, the choice of a kernel can be interpreted as implicitly specifying an appropriate basis in feature space which is able to capture G using as few basis vectors as possible, and also using a subspace which contains as much of the variance of the data as possible. [sent-748, score-0.55]

80 Conclusion: Both in theory and on practical data sets, we have demonstrated that the relevant information in a supervised learning scenario is contained in the leading projected kernel PCA components if the kernel matches the learning problem and is sufficiently smooth. [sent-755, score-1.204]

81 An appropriately selected kernel (b) permits a dimension reduction step that discards some irrelevant projected kernel PCA directions and thus yields a regularized model. [sent-757, score-0.921]

82 These can also be used to automatically select a suitable kernel model for the data and to extract, as additional side information, an estimate of the effective dimension and of the expected error for the learning problem. [sent-759, score-0.577]

83 The kernel matrix given a kernel function k and an n-sample X1 , . [sent-803, score-0.871]

84 If k is a kernel function, k̃ is the kernel function whose expansion has been reduced to the first r terms, and K̃ is the corresponding kernel matrix. [sent-811, score-0.5]

85 The Main Result: The following five quantities occur in the bound: • c_i = |{1 ≤ j ≤ r | l_i/2 ≤ l̃_j ≤ 2 l_i}| is the number of eigenvalues of the truncated kernel matrix which are close to the eigenvalues of the normal kernel matrix. [sent-846, score-1.21]
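
For concreteness, two of these quantities are easy to compute given the spectra; in the sketch below, l and l_tilde stand for the eigenvalues of K and of the rank-r truncated matrix K_tilde, all of which are placeholders here:

    import numpy as np

    def bound_quantities(l, l_tilde, K, K_tilde):
        # c_i: number of eigenvalues of the truncated kernel matrix within a factor
        # of two of l_i (only the first r truncated eigenvalues should be passed in).
        c = np.array([np.sum((li / 2.0 <= l_tilde) & (l_tilde <= 2.0 * li)) for li in l])
        E_tilde = K - K_tilde                  # truncation error for the kernel matrix
        return c, np.linalg.norm(E_tilde, 2)   # its spectral norm enters the bound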

86 • Ẽ = K − K̃ is the truncation error for the kernel matrix. [sent-849, score-0.5]

87 However, note that these asymptotic rates are worst case rates over certain families of kernel functions. [sent-884, score-0.536]

88 Now, in particular the convergence of Ψ̃+ → 1 can be quite slow in the worst case, if the eigenvalues of the kernel matrix decay quickly (see the paper by Braun, 2006, for a more thorough discussion including an artificial example of a kernel function which achieves the described rate). [sent-894, score-1.094]

89 In principle, it is possible to further relate the decay rate of the eigenvalues of the kernel matrix li to the asymptotic eigenvalue λi , for example using bounds for individual eigenvalues (Braun, 2006), or tail-sums of eigenvalues (Blanchard et al. [sent-903, score-1.103]

90 , 2005) if we wish to explicitly control the component of the relevant information vector which is not contained in the leading kernel PCA directions. [sent-905, score-0.702]

91 Usually, one would start with some specific kernel, for example an rbf-kernel, train some kernel learning algorithm using this kernel, evaluate the kernel on some test data set, and start to select different parameters. [sent-913, score-0.869]

92 err = (1/n) ∑_{i=1}^n L(Y_i, Ŷ_i) (Algorithm, line 18). (1) There is no absolute measure of the goodness of a certain kernel choice, only comparisons to other kernels; (2) there exists some dependency on the kernel learning method employed. [sent-918, score-0.861]

93 [figure residue omitted; y-axis: kernel PCA coefficients, x-axis: kernel PCA components, legend includes .05 percentile] (a) Rbf-kernel on the original data. [sent-942, score-0.843]

94 [figure residue omitted; legend: naive, 4-bit, wdk; y-axis: kernel PCA coefficients (subsampled)] [sent-945, score-0.501]

95 [figure axis ticks omitted] (c) Using a weighted-degree kernel. [sent-958, score-0.546]

96 [figure axis ticks omitted] (d) The mean kernel PCA coefficients of all three kernels compared (coefficients clipped to the interval from 0 to 2). [sent-959, score-0.969]

97 .95 percentiles of the kernel PCA coefficients over the 20 resamples of the splice data set using the indicated kernels. [sent-963, score-0.569]

98 Incorporating domain knowledge in the encoding and finally switching to a special-purpose kernel shows that the true dimensionality of the data is in fact smaller, and that the noise level, which was initially quite high, could also be lowered significantly. [sent-971, score-0.668]

99 Accurate error bounds for the eigenvalues of the kernel matrix. [sent-989, score-0.589]

100 On the convergence of eigenspaces in kernel principal components analysis. [sent-1089, score-0.55]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('kernel', 0.416), ('pca', 0.334), ('rde', 0.285), ('tk', 0.191), ('raun', 0.18), ('uhmann', 0.18), ('elevant', 0.169), ('imensions', 0.169), ('uller', 0.144), ('paces', 0.143), ('ui', 0.138), ('eigenvalues', 0.135), ('relevant', 0.129), ('eature', 0.128), ('dimensionality', 0.112), ('noise', 0.11), ('ernel', 0.109), ('sinc', 0.106), ('um', 0.103), ('eigenvectors', 0.095), ('splice', 0.09), ('mth', 0.09), ('px', 0.088), ('decay', 0.088), ('components', 0.086), ('tcm', 0.085), ('cients', 0.084), ('leading', 0.082), ('coef', 0.078), ('contained', 0.075), ('li', 0.069), ('scalar', 0.068), ('braun', 0.066), ('kpcr', 0.063), ('resamples', 0.063), ('denoised', 0.063), ('labels', 0.062), ('banana', 0.059), ('subspace', 0.058), ('dimension', 0.057), ('lm', 0.055), ('asymptotic', 0.054), ('coefficients', 0.053), ('kernels', 0.051), ('principal', 0.048), ('blanchard', 0.048), ('truncation', 0.046), ('tsch', 0.045), ('median', 0.045), ('percentile', 0.044), ('eigenfunctions', 0.044), ('gilles', 0.042), ('fm', 0.042), ('capture', 0.042), ('dimensions', 0.042), ('width', 0.041), ('eigenvector', 0.04), ('mikio', 0.04), ('spanned', 0.04), ('products', 0.04), ('matrix', 0.039), ('estimated', 0.039), ('estimates', 0.039), ('vm', 0.039), ('par', 0.039), ('error', 0.038), ('widths', 0.037), ('bernhard', 0.037), ('test', 0.037), ('berlin', 0.036), ('feature', 0.034), ('rbf', 0.034), ('projection', 0.033), ('rates', 0.033), ('noisy', 0.033), ('level', 0.033), ('ud', 0.032), ('interplay', 0.032), ('sii', 0.032), ('gunnar', 0.032), ('permits', 0.032), ('krr', 0.032), ('wdk', 0.032), ('zwald', 0.032), ('eigenvalue', 0.032), ('onto', 0.031), ('bound', 0.03), ('encoding', 0.03), ('diabetes', 0.029), ('titanic', 0.029), ('german', 0.029), ('err', 0.029), ('hat', 0.029), ('amounts', 0.029), ('embedding', 0.029), ('sch', 0.028), ('dna', 0.027), ('evenly', 0.027), ('legend', 0.027), ('estimate', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.000001 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces

Author: Mikio L. Braun, Joachim M. Buhmann, Klaus-Robert Müller

Abstract: We show that the relevant information of a supervised learning problem is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem in the sense that it can asymptotically represent the function to be learned and is sufficiently smooth. Thus, kernels do not only transform data sets such that good generalization can be achieved using only linear discriminant functions, but this transformation is also performed in a manner which makes economical use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data for supervised learning problems. Practically, we propose an algorithm which enables us to recover the number of leading kernel PCA components relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to aid in model selection, and (3) to denoise in feature space in order to yield better classification results. Keywords: kernel methods, feature space, dimension reduction, effective dimensionality

2 0.17148563 17 jmlr-2008-Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes

Author: David C. Hoyle

Abstract: Bayesian inference from high-dimensional data involves the integration over a large number of model parameters. Accurate evaluation of such high-dimensional integrals raises a unique set of issues. These issues are illustrated using the exemplar of model selection for principal component analysis (PCA). A Bayesian model selection criterion, based on a Laplace approximation to the model evidence for determining the number of signal principal components present in a data set, has previously been show to perform well on various test data sets. Using simulated data we show that for d-dimensional data and small sample sizes, N, the accuracy of this model selection method is strongly affected by increasing values of d. By taking proper account of the contribution to the evidence from the large number of model parameters we show that model selection accuracy is substantially improved. The accuracy of the improved model evidence is studied in the asymptotic limit d → ∞ at fixed ratio α = N/d, with α < 1. In this limit, model selection based upon the improved model evidence agrees with a frequentist hypothesis testing approach. Keywords: PCA, Bayesian model selection, random matrix theory, high dimensional inference

3 0.16241233 92 jmlr-2008-Universal Multi-Task Kernels

Author: Andrea Caponnetto, Charles A. Micchelli, Massimiliano Pontil, Yiming Ying

Abstract: In this paper we are concerned with reproducing kernel Hilbert spaces HK of functions from an input space into a Hilbert space Y , an environment appropriate for multi-task learning. The reproducing kernel K associated to HK has its values as operators on Y . Our primary goal here is to derive conditions which ensure that the kernel K is universal. This means that on every compact subset of the input space, every continuous function with values in Y can be uniformly approximated by sections of the kernel. We provide various characterizations of universal kernels and highlight them with several concrete examples of some practical importance. Our analysis uses basic principles of functional analysis and especially the useful notion of vector measures which we describe in sufficient detail to clarify our results. Keywords: multi-task learning, multi-task kernels, universal approximation, vector-valued reproducing kernel Hilbert spaces

4 0.13842349 78 jmlr-2008-Randomized Online PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension

Author: Manfred K. Warmuth, Dima Kuzmin

Abstract: We design an online algorithm for Principal Component Analysis. In each trial the current instance is centered and projected into a probabilistically chosen low dimensional subspace. The regret of our online algorithm, that is, the total expected quadratic compression loss of the online algorithm minus the total quadratic compression loss of the batch algorithm, is bounded by a term whose dependence on the dimension of the instances is only logarithmic. We first develop our methodology in the expert setting of online learning by giving an algorithm for learning as well as the best subset of experts of a certain size. This algorithm is then lifted to the matrix setting where the subsets of experts correspond to subspaces. The algorithm represents the uncertainty over the best subspace as a density matrix whose eigenvalues are bounded. The running time is O(n2 ) per trial, where n is the dimension of the instances. Keywords: principal component analysis, online learning, density matrix, expert setting, quantum Bayes rule

5 0.13387015 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods

Author: Matthias W. Seeger

Abstract: We propose a highly efficient framework for penalized likelihood kernel methods applied to multiclass models with a large, structured set of classes. As opposed to many previous approaches which try to decompose the fitting problem into many smaller ones, we focus on a Newton optimization of the complete model, making use of model structure and linear conjugate gradients in order to approximate Newton search directions. Crucially, our learning method is based entirely on matrix-vector multiplication primitives with the kernel matrices and their derivatives, allowing straightforward specialization to new kernels, and focusing code optimization efforts to these primitives only. Kernel parameters are learned automatically, by maximizing the cross-validation log likelihood in a gradient-based way, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical structure on thousands of classes, achieving state-of-the-art results in an order of magnitude less time than previous work. Parts of this work appeared in the conference paper Seeger (2007). Keywords: multi-way classification, kernel logistic regression, hierarchical classification, cross validation optimization, Newton-Raphson optimization

6 0.12965958 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning

7 0.11369937 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming    (Special Topic on Model Selection)

8 0.1070251 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers

9 0.10623007 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis

10 0.10548657 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces

11 0.096545555 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers

12 0.085213639 86 jmlr-2008-SimpleMKL

13 0.076881371 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data

14 0.064879484 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine

15 0.062520787 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes

16 0.062178418 53 jmlr-2008-Learning to Combine Motor Primitives Via Greedy Additive Regression

17 0.061934453 64 jmlr-2008-Model Selection in Kernel Based Regression using the Influence Function    (Special Topic on Model Selection)

18 0.058678299 58 jmlr-2008-Max-margin Classification of Data with Absent Features

19 0.057803463 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace

20 0.05332632 96 jmlr-2008-Visualizing Data using t-SNE


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.316), (1, -0.176), (2, 0.025), (3, -0.073), (4, 0.269), (5, -0.099), (6, 0.038), (7, 0.205), (8, -0.113), (9, 0.098), (10, -0.074), (11, 0.063), (12, 0.204), (13, -0.02), (14, -0.013), (15, 0.159), (16, 0.07), (17, -0.005), (18, 0.069), (19, -0.152), (20, 0.137), (21, 0.059), (22, -0.099), (23, 0.038), (24, 0.102), (25, -0.043), (26, 0.119), (27, -0.176), (28, -0.069), (29, 0.011), (30, -0.042), (31, -0.044), (32, 0.041), (33, -0.099), (34, 0.039), (35, 0.002), (36, 0.128), (37, 0.06), (38, -0.062), (39, -0.01), (40, -0.026), (41, 0.049), (42, 0.129), (43, 0.006), (44, -0.111), (45, -0.126), (46, -0.058), (47, -0.015), (48, 0.058), (49, -0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96499288 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces

Author: Mikio L. Braun, Joachim M. Buhmann, Klaus-Robert Müller

Abstract: We show that the relevant information of a supervised learning problem is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem in the sense that it can asymptotically represent the function to be learned and is sufficiently smooth. Thus, kernels do not only transform data sets such that good generalization can be achieved using only linear discriminant functions, but this transformation is also performed in a manner which makes economical use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data for supervised learning problems. Practically, we propose an algorithm which enables us to recover the number of leading kernel PCA components relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to aid in model selection, and (3) to denoise in feature space in order to yield better classification results. Keywords: kernel methods, feature space, dimension reduction, effective dimensionality

2 0.67639399 92 jmlr-2008-Universal Multi-Task Kernels

Author: Andrea Caponnetto, Charles A. Micchelli, Massimiliano Pontil, Yiming Ying

Abstract: In this paper we are concerned with reproducing kernel Hilbert spaces HK of functions from an input space into a Hilbert space Y , an environment appropriate for multi-task learning. The reproducing kernel K associated to HK has its values as operators on Y . Our primary goal here is to derive conditions which ensure that the kernel K is universal. This means that on every compact subset of the input space, every continuous function with values in Y can be uniformly approximated by sections of the kernel. We provide various characterizations of universal kernels and highlight them with several concrete examples of some practical importance. Our analysis uses basic principles of functional analysis and especially the useful notion of vector measures which we describe in sufficient detail to clarify our results. Keywords: multi-task learning, multi-task kernels, universal approximation, vector-valued reproducing kernel Hilbert spaces

3 0.58447093 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning

Author: Hsuan-Tien Lin, Ling Li

Abstract: Ensemble learning algorithms such as boosting can achieve better performance by averaging over the predictions of some base hypotheses. Nevertheless, most existing algorithms are limited to combining only a finite number of hypotheses, and the generated ensemble is usually sparse. Thus, it is not clear whether we should construct an ensemble classifier with a larger or even an infinite number of hypotheses. In addition, constructing an infinite ensemble itself is a challenging task. In this paper, we formulate an infinite ensemble learning framework based on the support vector machine (SVM). The framework can output an infinite and nonsparse ensemble through embedding infinitely many hypotheses into an SVM kernel. We use the framework to derive two novel kernels, the stump kernel and the perceptron kernel. The stump kernel embodies infinitely many decision stumps, and the perceptron kernel embodies infinitely many perceptrons. We also show that the Laplacian radial basis function kernel embodies infinitely many decision trees, and can thus be explained through infinite ensemble learning. Experimental results show that SVM with these kernels is superior to boosting with the same base hypothesis set. In addition, SVM with the stump kernel or the perceptron kernel performs similarly to SVM with the Gaussian radial basis function kernel, but enjoys the benefit of faster parameter selection. These properties make the novel kernels favorable choices in practice. Keywords: ensemble learning, boosting, support vector machine, kernel

4 0.54029804 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods

Author: Matthias W. Seeger

Abstract: We propose a highly efficient framework for penalized likelihood kernel methods applied to multiclass models with a large, structured set of classes. As opposed to many previous approaches which try to decompose the fitting problem into many smaller ones, we focus on a Newton optimization of the complete model, making use of model structure and linear conjugate gradients in order to approximate Newton search directions. Crucially, our learning method is based entirely on matrix-vector multiplication primitives with the kernel matrices and their derivatives, allowing straightforward specialization to new kernels, and focusing code optimization efforts to these primitives only. Kernel parameters are learned automatically, by maximizing the cross-validation log likelihood in a gradient-based way, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical structure on thousands of classes, achieving state-of-the-art results in an order of magnitude less time than previous work. Parts of this work appeared in the conference paper Seeger (2007). Keywords: multi-way classification, kernel logistic regression, hierarchical classification, cross validation optimization, Newton-Raphson optimization

5 0.52453744 17 jmlr-2008-Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes

Author: David C. Hoyle

Abstract: Bayesian inference from high-dimensional data involves the integration over a large number of model parameters. Accurate evaluation of such high-dimensional integrals raises a unique set of issues. These issues are illustrated using the exemplar of model selection for principal component analysis (PCA). A Bayesian model selection criterion, based on a Laplace approximation to the model evidence for determining the number of signal principal components present in a data set, has previously been show to perform well on various test data sets. Using simulated data we show that for d-dimensional data and small sample sizes, N, the accuracy of this model selection method is strongly affected by increasing values of d. By taking proper account of the contribution to the evidence from the large number of model parameters we show that model selection accuracy is substantially improved. The accuracy of the improved model evidence is studied in the asymptotic limit d → ∞ at fixed ratio α = N/d, with α < 1. In this limit, model selection based upon the improved model evidence agrees with a frequentist hypothesis testing approach. Keywords: PCA, Bayesian model selection, random matrix theory, high dimensional inference

6 0.49681926 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming    (Special Topic on Model Selection)

7 0.4633455 78 jmlr-2008-Randomized Online PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension

8 0.37329689 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data

9 0.37010455 86 jmlr-2008-SimpleMKL

10 0.36423573 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers

11 0.36136711 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces

12 0.35503632 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers

13 0.31770065 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis

14 0.28493342 53 jmlr-2008-Learning to Combine Motor Primitives Via Greedy Additive Regression

15 0.28317425 64 jmlr-2008-Model Selection in Kernel Based Regression using the Influence Function    (Special Topic on Model Selection)

16 0.28037024 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes

17 0.26784822 96 jmlr-2008-Visualizing Data using t-SNE

18 0.26005629 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace

19 0.25672689 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine

20 0.24702859 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.03), (5, 0.017), (9, 0.013), (40, 0.503), (54, 0.025), (58, 0.049), (60, 0.012), (66, 0.053), (76, 0.014), (78, 0.024), (88, 0.064), (92, 0.036), (94, 0.046), (99, 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.97326106 54 jmlr-2008-Learning to Select Features using their Properties

Author: Eyal Krupka, Amir Navot, Naftali Tishby

Abstract: Feature selection is the task of choosing a small subset of features that is sufficient to predict the target labels well. Here, instead of trying to directly determine which features are better, we attempt to learn the properties of good features. For this purpose we assume that each feature is represented by a set of properties, referred to as meta-features. This approach enables prediction of the quality of features without measuring their value on the training instances. We use this ability to devise new selection algorithms that can efficiently search for new good features in the presence of a huge number of features, and to dramatically reduce the number of feature measurements needed. We demonstrate our algorithms on a handwritten digit recognition problem and a visual object category recognition problem. In addition, we show how this novel viewpoint enables derivation of better generalization bounds for the joint learning problem of selection and classification, and how it contributes to a better understanding of the problem. Specifically, in the context of object recognition, previous works showed that it is possible to find one set of features which fits most object categories (aka a universal dictionary). Here we use our framework to analyze one such universal dictionary and find that the quality of features in this dictionary can be predicted accurately by its meta-features. Keywords: feature selection, unobserved features, meta-features

same-paper 2 0.96178097 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces

Author: Mikio L. Braun, Joachim M. Buhmann, Klaus-Robert Müller

Abstract: We show that the relevant information of a supervised learning problem is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem in the sense that it can asymptotically represent the function to be learned and is sufficiently smooth. Thus, kernels do not only transform data sets such that good generalization can be achieved using only linear discriminant functions, but this transformation is also performed in a manner which makes economical use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data for supervised learning problems. Practically, we propose an algorithm which enables us to recover the number of leading kernel PCA components relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to aid in model selection, and (3) to denoise in feature space in order to yield better classification results. Keywords: kernel methods, feature space, dimension reduction, effective dimensionality

3 0.61676455 58 jmlr-2008-Max-margin Classification of Data with Absent Features

Author: Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, Daphne Koller

Abstract: We consider the problem of learning classifiers in structured domains, where some objects have a subset of features that are inherently absent due to complex relationships between the features. Unlike the case where a feature exists but its value is not observed, here we focus on the case where a feature may not even exist (structurally absent) for some of the samples. The common approach for handling missing features in discriminative models is to first complete their unknown values, and then use a standard classification procedure over the completed data. This paper focuses on features that are known to be non-existing, rather than have an unknown value. We show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate an objective function, based on the geometric interpretation of the margin, that aims to maximize the margin of each sample in its own relevant subspace. In this formulation, the linearly separable case can be transformed into a binary search over a series of second order cone programs (SOCP), a convex problem that can be solved efficiently. We also describe two approaches for optimizing the general case: an approximation that can be solved as a standard quadratic program (QP) and an iterative approach for solving the exact problem. By avoiding the pre-processing phase in which the data is completed, both of these approaches could offer considerable computational savings. More importantly, we show that the elegant handling of missing values by our approach allows it to both outperform other methods when the missing values have non-trivial structure, and be competitive with other methods when the values are missing at random. We demonstrate our results on several standard benchmarks and two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images. Keywords: max margin, missing features, network reconstruction, metabolic pa

4 0.59478158 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning

Author: Hsuan-Tien Lin, Ling Li

Abstract: Ensemble learning algorithms such as boosting can achieve better performance by averaging over the predictions of some base hypotheses. Nevertheless, most existing algorithms are limited to combining only a finite number of hypotheses, and the generated ensemble is usually sparse. Thus, it is not clear whether we should construct an ensemble classifier with a larger or even an infinite number of hypotheses. In addition, constructing an infinite ensemble itself is a challenging task. In this paper, we formulate an infinite ensemble learning framework based on the support vector machine (SVM). The framework can output an infinite and nonsparse ensemble through embedding infinitely many hypotheses into an SVM kernel. We use the framework to derive two novel kernels, the stump kernel and the perceptron kernel. The stump kernel embodies infinitely many decision stumps, and the perceptron kernel embodies infinitely many perceptrons. We also show that the Laplacian radial basis function kernel embodies infinitely many decision trees, and can thus be explained through infinite ensemble learning. Experimental results show that SVM with these kernels is superior to boosting with the same base hypothesis set. In addition, SVM with the stump kernel or the perceptron kernel performs similarly to SVM with the Gaussian radial basis function kernel, but enjoys the benefit of faster parameter selection. These properties make the novel kernels favorable choices in practice. Keywords: ensemble learning, boosting, support vector machine, kernel

5 0.56090927 38 jmlr-2008-Generalization from Observed to Unobserved Features by Clustering

Author: Eyal Krupka, Naftali Tishby

Abstract: We argue that when objects are characterized by many attributes, clustering them on the basis of a random subset of these attributes can capture information on the unobserved attributes as well. Moreover, we show that under mild technical conditions, clustering the objects on the basis of such a random subset performs almost as well as clustering with the full attribute set. We prove finite sample generalization theorems for this novel learning scheme that extends analogous results from the supervised learning setting. We use our framework to analyze generalization to unobserved features of two well-known clustering algorithms: k-means and the maximum likelihood multinomial mixture model. The scheme is demonstrated for collaborative filtering of users with movie ratings as attributes and document clustering with words as attributes. Keywords: clustering, unobserved features, learning theory, generalization in clustering, information bottleneck

6 0.54664147 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields

7 0.54207724 79 jmlr-2008-Ranking Categorical Features Using Generalization Properties

8 0.53638238 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes

9 0.53244334 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming    (Special Topic on Model Selection)

10 0.52993232 67 jmlr-2008-Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies

11 0.52353823 57 jmlr-2008-Manifold Learning: The Price of Normalization

12 0.51613981 17 jmlr-2008-Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes

13 0.51444995 94 jmlr-2008-Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management

14 0.51172376 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections

15 0.50977707 59 jmlr-2008-Maximal Causes for Non-linear Component Extraction

16 0.50937122 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection

17 0.50064617 14 jmlr-2008-An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons

18 0.49732861 86 jmlr-2008-SimpleMKL

19 0.49082515 15 jmlr-2008-An Information Criterion for Variable Selection in Support Vector Machines    (Special Topic on Model Selection)

20 0.48881182 56 jmlr-2008-Magic Moments for Structured Output Prediction