nips nips2006 nips2006-65 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mikio L. Braun, Klaus-Robert Müller, Joachim M. Buhmann
Abstract: We show that the relevant information about a classification problem in feature space is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem. Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data to perform classification. Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and to (3) de-noise in feature space in order to yield better classification results. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We show that the relevant information about a classification problem in feature space is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem. [sent-12, score-1.446]
2 Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. [sent-13, score-0.157]
3 Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classification. [sent-15, score-0.267]
4 Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and to (3) de-noise in feature space in order to yield better classification results. [sent-16, score-0.565]
5 1 Introduction Kernel machines use a kernel function as a non-linear mapping of the original data into a high-dimensional feature space; this mapping is often referred to as the empirical kernel map [6, 11, 8, 9]. [sent-17, score-0.957]
6 By virtue of the empirical kernel map, the data is ideally transformed such that a linear discriminative function can separate the classes with low generalization error, say via a canonical hyperplane with large margin. [sent-18, score-0.505]
7 This result is based on recent approximation bounds dealing with the eigenvectors of the kernel matrix which show that if a function can be reconstructed using only a few kernel PCA components asymptotically, then the same already holds in a finite sample setting, even for small sample sizes. [sent-22, score-1.164]
8 This finding underlines the efficient use of data that is made by kernel machines using a kernel suited to the problem. [sent-24, score-0.9]
9 While the number of data points stays constant for a given problem, a smart choice of kernel permits better use of the available data at a favorable “data points per effective dimension” ratio, even for infinite-dimensional feature spaces. [sent-25, score-0.662]
10 Figure 1 shows the first six kernel PCA components for an example data set. [sent-29, score-0.616]
11 Above each plot, the variance of the data along this direction and the contribution of this component to the class labels are plotted (normalized such that the maximal possible contribution is one). [sent-30, score-0.174] [sent-35, score-0.186]
12 Figure 1: Although the data set is embedded into a high-dimensional manifold, not all directions contain interesting information. Above, the first six kernel PCA components are plotted. [sent-31, score-0.576]
13 Of these, only the fourth is highly relevant for the learning problem. [sent-32, score-0.154]
15 The dimensionality of the data set in feature space is characteristic of the relation between a data set and a kernel. [sent-40, score-0.215]
16 Roughly speaking, the relevant dimensionality of the data set corresponds to the complexity of the learning problem when viewed through the lens of the kernel function. [sent-41, score-0.702]
17 This notion of complexity relates the number of data points required by the learning problem and the noise, as a small relevant dimensionality enables the de-noising of the data set to obtain an estimate of the true class labels, making the learning process much more stable. [sent-42, score-0.396]
18 This combination of dimension and noise estimate allows us to distinguish, among data sets showing weak performance, those that are genuinely complex from those that are merely noisy. [sent-43, score-0.303]
19 To summarize the main contributions of this paper: (1) We provide theoretical bounds showing that the relevant information (defined in section 2) is actually contained in the leading projected kernel principal components under appropriate conditions. [sent-44, score-1.074]
20 (2) We propose an algorithm which estimates the relevant dimensionality of the data set and permits analyzing the appropriateness of a kernel for the data set, and thus performing model selection among different kernels. [sent-45, score-0.838]
21 (3) We show how the dimension estimate can be used in conjunction with kernel PCA to perform effective de-noising. [sent-46, score-0.63]
22 Our contribution is to foster an understanding of a data set and to give better insight into whether a mediocre classification result is due to intrinsically high dimensionality of the data or to an overwhelming noise level. [sent-50, score-0.348]
23 In kernel methods, the data is non-linearly mapped into some feature space F via the feature map Φ. [sent-62, score-0.584]
24 Scalar products in F can be computed by the kernel k in closed form: $\langle \Phi(x), \Phi(x') \rangle = k(x, x')$. [sent-63, score-0.507]
25 Summarizing all the pairwise scalar products results in the (normalized) kernel matrix K with entries $k(X_i, X_j)/n$. [sent-64, score-0.627]
26 We wish to summarize the information contained in the class label vector $Y = (Y_1, \ldots, Y_n)$ by the vector $G = (\mathbb{E}(Y|X_1), \ldots, \mathbb{E}(Y|X_n))$ of conditional class expectations. [sent-65, score-0.16]
27 The idea is that since E(Y |X) = P (Y = 1|X) − P (Y = −1|X), the sign of G contains the relevant information on the true class membership by telling us which class is more probable. [sent-73, score-0.244]
28 The observed class label vector can be written as Y = G − N with N = G − Y denoting the noise in the class labels. [sent-74, score-0.235]
29 We want to study the relation of G with respect to the kernel PCA components. [sent-75, score-0.43]
30 The following lemma relates projections of G to the eigenvectors of the kernel matrix K. Lemma 1: The kth kernel PCA component $f_k$ evaluated on the $X_i$ is equal to the kth eigenvector of the kernel matrix K: $(f_k(X_1), \ldots, f_k(X_n)) = u_k$. [sent-76, score-1.802]
31 Consequently, the projection of a vector $Y \in \mathbb{R}^n$ onto the leading d kernel PCA components is given by $\pi_d(Y) = \sum_{k=1}^{d} u_k u_k^\top Y$. [sent-80, score-0.986]
32 Proof: The kernel PCA directions are given (see [10]) as $v_k = \sum_{i=1}^{n} \alpha_i \Phi(X_i)$, where $\alpha_i = [u_k]_i / l_k$, with $[u_k]_i$ denoting the ith component of $u_k$, and $l_k$, $u_k$ being the eigenvalues and eigenvectors of the kernel matrix K. [sent-81, score-1.77]
33 Thus, the kth PCA component for a point $X_j$ in the training set is $f_k(X_j) = \langle \Phi(X_j), v_k \rangle = \frac{1}{l_k} \sum_{i=1}^{n} \langle \Phi(X_j), \Phi(X_i) \rangle\, [u_k]_i = \frac{1}{l_k} \sum_{i=1}^{n} k(X_j, X_i)\, [u_k]_i$. [sent-82, score-0.461]
34 The sum computes the jth component of $K u_k = l_k u_k$, because $u_k$ is an eigenvector of K. [sent-83, score-0.561]
35 Since the $u_k$ are orthogonal (K is a symmetric matrix), the projection of Y onto the space spanned by the first d kernel PCA components is given by $\sum_{i=1}^{d} u_i u_i^\top Y$. [sent-85, score-1.335]
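The following minimal numpy sketch (not from the paper; the rbf width, toy data, and function names are illustrative assumptions) implements the projection from Lemma 1: build the normalized kernel matrix, take its eigendecomposition, and compute $\pi_d(Y) = \sum_{k=1}^{d} u_k u_k^\top Y$.

```python
import numpy as np

def rbf_kernel_matrix(X, width):
    # Normalized kernel matrix with entries k(X_i, X_j) / n, as in Section 2.
    sq = np.sum(X ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-dist2 / (2.0 * width ** 2)) / X.shape[0]

def project_labels(K, Y, d):
    # Eigenvectors of the symmetric kernel matrix, sorted by decreasing eigenvalue.
    eigvals, U = np.linalg.eigh(K)
    U = U[:, np.argsort(eigvals)[::-1]]
    Ud = U[:, :d]
    # pi_d(Y) = sum_{k<=d} u_k u_k^T Y  (Lemma 1).
    return Ud @ (Ud.T @ Y)

# Toy usage with random placeholder data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
Y = np.sign(rng.normal(size=100))
K = rbf_kernel_matrix(X, width=1.0)
Y_denoised = project_labels(K, Y, d=5)
```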
36 3 A Bound on the Contribution of Single Kernel PCA Components As we have just shown, the location of G is characterized by its scalar products with the eigenvectors of the kernel matrix. [sent-86, score-0.664]
37 In this section, we will apply results from [1, 2], which deal with the asymptotic convergence of spectral properties of the kernel matrix, to show that the decay rate of the scalar products is linked to the decay rate of the kernel PCA principal values. [sent-87, score-1.447]
38 It is clear that we cannot expect G to generally locate favorably with respect to the kernel PCA components, but only when there is some kind of match between G and the chosen kernel. [sent-88, score-0.43]
39 Kernel PCA is closely linked to the spectral properties of the kernel matrix, and it is known [3, 4] that the eigenvalues and the projections to eigenspaces converge. [sent-90, score-0.84]
40 Their asymptotic limits are given as the eigenvalues $\lambda_i$ and eigenfunctions $\psi_i$ of the integral operator $T_k f = \int k(\cdot, x) f(x)\, P_X(dx)$ defined on $L_2(P_X)$, where $P_X$ is the marginal measure of $P_{X \times Y}$, which generates our samples. [sent-91, score-0.396]
41 The eigenvalues and eigenfunctions also occur in the well-known Mercer formula: by Mercer's theorem, $k(x, x') = \sum_{i=1}^{\infty} \lambda_i \psi_i(x) \psi_i(x')$. [sent-92, score-0.308]
42 Under this condition, the scalar products decay as quickly as the eigenvalues, because $\langle g, \psi_i \rangle = \lambda_i \alpha_i = O(\lambda_i)$. [sent-96, score-0.238]
43 Because of the known convergence of spectral projections, we can expect the same behavior asymptotically. (Footnote 2: As usual, the eigenvectors are arranged in descending order by corresponding eigenvalue.) [sent-97, score-0.191]
44 A number of recent results on the spectral properties of the kernel matrix [1, 2] specifically deal with error bounds for small eigenvalues and their associated spectral projections. [sent-105, score-0.91]
45 Using these results, we obtain the following bound on $u_i^\top G$. [sent-106, score-0.3]
46 Typically, the bound initially scales with $l_i$ until the non-scaling part dominates the bound for larger i. [sent-114, score-0.167]
47 However, note that all terms which do not scale with $l_i$ will typically be small: for smooth kernels, the eigenvalues quickly decay to zero as $r \to \infty$. [sent-116, score-0.327]
48 Put differently, the bound shows that the relevant information vector G (as introduced in Section 2) is contained in a number of leading PCA components up to a negligible error. [sent-119, score-0.525]
49 The number of dimensions depends on the asymptotic coefficients αi and the decay rate of the asymptotic eigenvalues of k. [sent-120, score-0.473]
50 Since this rate is related to the smoothness of the kernel function, the dimension will be small for smooth kernels whose leading eigenfunctions ψi permit good approximation of g. [sent-121, score-0.846]
51 Since G is not known, we can only observe the contributions of the kernel PCA components to Y , which can be written as Y = G + N (see Section 2). [sent-124, score-0.604]
52 The contributions $u_i^\top Y$ will thus be formed as a superposition $u_i^\top Y = u_i^\top G + u_i^\top N$. [sent-125, score-1.074]
53 Therefore, the kernel PCA coefficients $s_i = u_i^\top Y$ will have the shape of an evenly distributed noise floor $u_i^\top N$ from which the coefficients $u_i^\top G$ of the relevant information protrude (see Figure 2(b) for an example). [sent-127, score-1.44]
54 We thus propose the following algorithm: given a fixed kernel k, we estimate the true dimension by fitting a two-component model to the coordinates of the label vector. [sent-128, score-0.694]
55 Figure 2(a) contains the kernel PCA component contributions (dots), and the training and test error obtained by projecting the data set onto the given number of leading kernel PCA components. [sent-139, score-1.214]
56 Figure 2(b) shows the negative log-likelihood of the two-component model used to estimate the dimensionality of the data. [sent-140, score-0.173]
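A sketch of this dimension estimate is given below. The exact form of the likelihood (Eq. (1) in the paper) is not recoverable from this extraction, so the two-component model here — zero-mean Gaussians with separate variances for the leading d and the remaining n−d kernel PCA coefficients of Y — is an assumption about its form.

```python
import numpy as np

def rde_dimension(K, Y):
    # Kernel PCA coefficients s_i = u_i^T Y of the label vector.
    eigvals, U = np.linalg.eigh(K)
    U = U[:, np.argsort(eigvals)[::-1]]
    s = U.T @ Y
    n = len(s)
    best_d, best_nll = 1, np.inf
    for d in range(1, n):
        v1 = np.mean(s[:d] ** 2) + 1e-12   # variance of the leading "signal" coefficients
        v2 = np.mean(s[d:] ** 2) + 1e-12   # variance of the trailing noise floor
        nll = 0.5 * (d * np.log(v1) + (n - d) * np.log(v2))  # assumed two-Gaussian model
        if nll < best_nll:
            best_d, best_nll = d, nll
    return best_d, best_nll
```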
57 Model Selection for Kernel Choice: For different kernels, we again use the likelihood and select the kernel which leads to the best fit in terms of the likelihood. [sent-143, score-0.43]
58 If the kernel width does not match the scale of the structure in the data set, the fit of the two-component model will be inferior: for very small or very large kernel widths, the kernel PCA coefficients of Y have no clear structure, so that the likelihood will be small. [sent-144, score-0.956]
59 For example, for Gaussian kernels, for very small kernel widths, noise is interpreted as relevant information, such that there appears to be no noise, only very high-dimensional data. [sent-145, score-0.684]
60 On the other hand, for very large kernel widths, any structure will be indistinguishable from noise such that the problem appears to be very noisy with almost no structure. [sent-146, score-0.53]
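A sketch of this width-selection loop is shown below, reusing the hypothetical rbf_kernel_matrix and rde_dimension helpers from the sketches above; the grid mirrors the 20 logarithmically spaced widths mentioned later in the experiments.

```python
import numpy as np

def select_kernel_width(X, Y, widths):
    # Pick the width whose two-component fit attains the lowest negative log-likelihood.
    best = None
    for w in widths:
        K = rbf_kernel_matrix(X, w)      # hypothetical helper from the sketch above
        d, nll = rde_dimension(K, Y)     # hypothetical helper from the sketch above
        if best is None or nll < best[2]:
            best = (w, d, nll)
    return best  # (selected width, estimated dimension, negative log-likelihood)

widths = np.logspace(-2, 4, 20)          # 20 log-spaced widths between 1e-2 and 1e4
```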
61 Experimental Error Estimation The estimated dimension can be used to estimate the noise level present in the data set. [sent-148, score-0.359]
62 The idea is to measure the error between the projected label vector $\hat G = \pi_d(Y)$, which approximates the true label information G, and the observed labels Y. [sent-149, score-0.219]
63 The resulting number $\widehat{\mathrm{err}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{[\hat G]_i \neq Y_i\}$ is an estimate of the fraction of misclassified examples in the training set, and therefore an estimate of the noise level in the class labels. [sent-150, score-0.223]
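A sketch of this noise-level estimate under the same placeholder assumptions as above: project Y onto the leading d components, take signs as de-noised labels, and count disagreements with the observed labels.

```python
import numpy as np

def estimated_noise_level(K, Y, d):
    eigvals, U = np.linalg.eigh(K)
    U = U[:, np.argsort(eigvals)[::-1]]
    Ud = U[:, :d]
    G_hat = Ud @ (Ud.T @ Y)              # pi_d(Y), the de-noised label vector
    return np.mean(np.sign(G_hat) != Y)  # fraction of training labels flipped by de-noising
```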
64 A Note on Consistency: Both the estimate $\hat G$ of G and the noise level estimate are consistent if the estimated dimension d scales sub-linearly with n. [sent-151, score-0.319]
65 The argument can be sketched as follows: since the kernel PCA components do not depend on Y, the noise N contained in Y is projected onto a random subspace of dimension d, so only a fraction of roughly d/n of the noise survives the projection. [sent-152, score-0.944]
66 Empirically, d was found to be rather stable, but in principle, the sub-linearity condition on d could even be enforced by adding a small sub-linear term to the estimated dimension d. [sent-154, score-0.18]
67 We can see that every kernel PCA component contributes to the observed class label vector. [sent-157, score-0.627]
68 For each of the data sets, we de-noise the data set using a family of rbf kernels by projecting the class labels to the estimated number of leading kernel PCA components. [sent-164, score-0.891]
69 The kernel width is also selected automatically using the achieved log-likelihood as described above. [sent-165, score-0.43]
70 The width of the rbf kernel is selected from 20 logarithmically spaced points between 10−2 and 104 for each data set. [sent-166, score-0.506]
71 For the dimension estimation task, we compare our RDE method to a dimensionality estimate based on cross-validation. [sent-167, score-0.241]
72 More concretely, the matrix $S = \sum_{i=1}^{d} u_i u_i^\top$ computes the projection onto the leading d kernel PCA components. [sent-168, score-1.073]
73 Interpreting the matrix S as a linear fit matrix, the leave-one-out cross-validation error can be computed in closed form (see [12]), since S is diagonal with respect to the eigenvector basis $u_i$. [sent-169, score-0.383]
74 Evaluating the cross-validation error for all dimensions and for a number of kernel parameters, one can select the best dimension and kernel parameter. [sent-170, score-1.077]
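A sketch of this cross-validation baseline, assuming the standard leave-one-out identity for a linear smoother $\hat Y = S Y$, namely residual$_i = (Y_i - \hat Y_i)/(1 - S_{ii})$ (the 2-norm form referenced via [12]):

```python
import numpy as np

def loo_error(K, Y, d):
    # S = sum_{k<=d} u_k u_k^T acts as a linear fit matrix, so the leave-one-out
    # residuals have the closed form (Y_i - Yhat_i) / (1 - S_ii).
    eigvals, U = np.linalg.eigh(K)
    U = U[:, np.argsort(eigvals)[::-1]]
    Ud = U[:, :d]
    Y_hat = Ud @ (Ud.T @ Y)
    S_diag = np.sum(Ud ** 2, axis=1)
    residuals = (Y - Y_hat) / (1.0 - S_diag + 1e-12)
    return np.mean(residuals ** 2)       # squared-error (2-norm) LOO estimate
```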
75 For the de-noising task, we have compared an (unregularized) least-squares fit in the reduced feature space (kPCR) against kernel ridge regression (KRR) and support vector machines (SVM) on the same data set. [sent-174, score-0.579]
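A rough sketch of what such a reduced-space least-squares fit could look like (this is a guess at the kPCR baseline, not the authors' code; it uses the out-of-sample formula from the proof of Lemma 1, and all names are placeholders):

```python
import numpy as np

def kpcr_fit_predict(K_train, Y_train, K_test_train, d):
    # Unregularized least squares on the leading d kernel PCA components.
    eigvals, U = np.linalg.eigh(K_train)
    order = np.argsort(eigvals)[::-1]
    eigvals, U = eigvals[order], U[:, order]
    Ud, ld = U[:, :d], np.maximum(eigvals[:d], 1e-12)
    # Training features are the eigenvector entries (Lemma 1), so the
    # least-squares weights reduce to w = Ud^T Y_train.
    w = Ud.T @ Y_train
    # Out-of-sample features: f_k(x) = (1/l_k) * sum_i k(x, X_i) [u_k]_i,
    # with K_test_train normalized consistently with K_train.
    F_test = (K_test_train @ Ud) / ld
    return np.sign(F_test @ w)
```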
76 Also note that the estimated error rates match the actually observed error rates quite well, although there is a tendency to under-estimate the true error. [sent-177, score-0.16]
77 Finally, inspecting the estimated dimension and noise level reveals that the data sets breast-cancer, diabetis, flare-solar, german, and titanic all have only moderately large dimensionalities. [sent-178, score-0.381]
78 On the other hand, the data set image seems to be particularly noise free, given that one can achieve a small error in spite of the large dimensionality. [sent-180, score-0.192]
79 6 Conclusion Both in theory and on practical data sets, we have shown that the relevant information in a supervised learning scenario is contained in the leading projected kernel PCA components if the kernel matches the learning problem. [sent-182, score-1.405]
80 The theory provides a consistent estimation for the expected class labels and the noise level. [sent-183, score-0.189]
81 Practically, our RDE algorithm automatically selects the appropriate kernel model for the data and extracts, as additional side information, an estimate of the effective dimension and of the expected error. Footnote 5: This applies only to the 2-norm. [sent-185, score-0.726]
82 However, as the performance of 2-norm based methods like kernel ridge regression on classification problems shows, the 2-norm is also informative about classification performance. [sent-186, score-0.482]
83 data set       dim   dim (cv)   est.
banana            24    26
breast-cancer      2     2
diabetis           9     9
flare-solar       10    10
german            12    12
heart              4     5
image            272   368
ringnorm          36    37
splice            92    89
thyroid           17    18
titanic            4     6
twonorm            2     2
waveform          14    23 [sent-187, score-0.359]
84 Figure 3: Estimated dimensions and error rates for the benchmark data sets from [7]. [sent-292, score-0.171]
85 The "est. error rate" is the estimated error rate on the training set, obtained by comparing the de-noised class labels to the true class labels. [sent-296, score-0.294]
86 An interesting future direction lies in combining these results with generalization bounds which are also based on the notion of an effective dimension, this time, however, with respect to some regularized hypothesis class (see, for example, [13]). [sent-302, score-0.2]
87 Linking the effective dimension of the data set with the dimension of a learning algorithm, one could obtain data dependent bounds in a natural way with the potential to be tighter than bounds which are based on the abstract capacity of a hypothesis class. [sent-303, score-0.511]
88 Accurate error bounds for the eigenvalues of the kernel matrix. [sent-312, score-0.727]
89 Learning bounds for kernel regression using effective data dimensionality. [sent-348, score-0.58]
90 On the convergence of eigenspaces in kernel principal components analysis. [sent-351, score-0.591]
91 A Proof of Theorem 1 First, let us collect the definitions concerning kernel functions. [sent-353, score-0.465]
92 Let k be a Mercer kernel with $k(x, x') = \sum_{i=1}^{\infty} \lambda_i \psi_i(x) \psi_i(x')$ and $k(x, x) \le K < \infty$. [sent-354, score-0.43]
93 Eigenvalues of K are li , and its eigenvectors are ui . [sent-359, score-0.403]
94 The kernel k is approximated by the truncated kernel $k_r(x, x') = \sum_{i=1}^{r} \lambda_i \psi_i(x) \psi_i(x')$, and its kernel matrix is denoted by $K_r$, whose eigenvalues are $m_i$. [sent-360, score-1.655]
95 We will measure the amount of clustering ci of the eigenvalues by the number of eigenvalues of Kr between li /2 and 2li . [sent-362, score-0.468]
96 The matrix containing the sample vectors of the first r eigenfunctions $\psi_i$ of k is given by $[\Psi_r]_{i\ell} = \psi_\ell(X_i)/\sqrt{n}$, for $1 \le i \le n$, $1 \le \ell \le r$. [sent-363, score-0.179]
97 Since the eigenfunctions are orthogonal asymptotically, we can expect that the pairwise scalar products of the sample vectors of the eigenfunctions converge to either 0 or 1. [sent-364, score-0.272]
98 For kernels with bounded diagonal, it holds that with probability larger than 1 − δ ([1], Lemma 3. [sent-376, score-0.43]
99 135 in [1] bounds $E_r$, from which we can derive the asymptotic $\Lambda_r + \sqrt{2K\Lambda_r/(n\delta)} = \Lambda_r + \sqrt{\Lambda_r}\, O(n^{-1/2})$, assuming that K will be reasonably small (for example, for rbf-kernels, K = 1). [sent-380, score-0.161]
100 Finally, we obtain $\frac{1}{\sqrt{n}}\,|u_i^\top f(X)| = 2 l_i a_r c_i\,(1 + O(rn^{-1/4})) + 2 r a_r \Lambda_r\,(1 + O(rn^{-1/4})) + r a_r \Lambda_r\, O(n^{-1/2})$. [sent-382, score-0.241]
wordName wordTfidf (topN-words)
[('kernel', 0.43), ('pca', 0.331), ('ui', 0.252), ('rde', 0.184), ('uk', 0.176), ('eigenvalues', 0.172), ('relevant', 0.154), ('eigenfunctions', 0.136), ('dimension', 0.124), ('rar', 0.123), ('lk', 0.117), ('components', 0.108), ('kr', 0.107), ('cr', 0.102), ('noise', 0.1), ('leading', 0.096), ('kpcr', 0.092), ('px', 0.09), ('asymptotic', 0.088), ('decay', 0.084), ('tr', 0.082), ('eigenvectors', 0.08), ('krr', 0.08), ('dim', 0.078), ('dimensionality', 0.078), ('scalar', 0.077), ('projected', 0.077), ('products', 0.077), ('fk', 0.073), ('bounds', 0.073), ('mika', 0.072), ('li', 0.071), ('spectral', 0.07), ('contained', 0.07), ('rn', 0.068), ('contributions', 0.066), ('ar', 0.065), ('linked', 0.064), ('atr', 0.061), ('diabetis', 0.061), ('titanic', 0.061), ('kth', 0.061), ('kernels', 0.06), ('toy', 0.059), ('permits', 0.058), ('feature', 0.057), ('component', 0.056), ('estimated', 0.056), ('tsch', 0.056), ('xj', 0.056), ('aj', 0.054), ('mikio', 0.053), ('eigenspaces', 0.053), ('ci', 0.053), ('directions', 0.053), ('tk', 0.052), ('mercer', 0.052), ('ridge', 0.052), ('error', 0.052), ('projections', 0.051), ('contributes', 0.051), ('ller', 0.049), ('negligible', 0.049), ('overwhelming', 0.049), ('bound', 0.048), ('label', 0.045), ('widths', 0.045), ('saying', 0.045), ('class', 0.045), ('lies', 0.045), ('lemma', 0.044), ('projecting', 0.044), ('labels', 0.044), ('coef', 0.043), ('matrix', 0.043), ('sch', 0.042), ('contribution', 0.041), ('asymptotically', 0.041), ('splice', 0.041), ('cients', 0.041), ('dimensions', 0.041), ('data', 0.04), ('classi', 0.04), ('cv', 0.039), ('estimate', 0.039), ('xn', 0.038), ('benchmark', 0.038), ('six', 0.038), ('analyze', 0.038), ('effective', 0.037), ('vk', 0.037), ('practically', 0.037), ('par', 0.037), ('rbf', 0.036), ('eigenvector', 0.036), ('xi', 0.036), ('lkopf', 0.035), ('subspace', 0.035), ('hyperplane', 0.035), ('concerning', 0.035)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 65 nips-2006-Denoising and Dimension Reduction in Feature Space
Author: Mikio L. Braun, Klaus-Robert Müller, Joachim M. Buhmann
Abstract: We show that the relevant information about a classification problem in feature space is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem. Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data to perform classification. Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and to (3) de-noise in feature space in order to yield better classification results. 1
2 0.30513659 102 nips-2006-Kernel Maximum Entropy Data Transformation and an Enhanced Spectral Clustering Algorithm
Author: Robert Jenssen, Torbjørn Eltoft, Mark Girolami, Deniz Erdogmus
Abstract: We propose a new kernel-based data transformation technique. It is founded on the principle of maximum entropy (MaxEnt) preservation, hence named kernel MaxEnt. The key measure is Renyi’s entropy estimated via Parzen windowing. We show that kernel MaxEnt is based on eigenvectors, and is in that sense similar to kernel PCA, but may produce strikingly different transformed data sets. An enhanced spectral clustering algorithm is proposed, by replacing kernel PCA by kernel MaxEnt as an intermediate step. This has a major impact on performance.
3 0.16579896 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis
Author: J.t. Lindgren, Aapo Hyvärinen
Abstract: In previous studies, quadratic modelling of natural images has resulted in cell models that react strongly to edges and bars. Here we apply quadratic Independent Component Analysis to natural image patches, and show that up to a small approximation error, the estimated components are computing conjunctions of two linear features. These conjunctive features appear to represent not only edges and bars, but also inherently two-dimensional stimuli, such as corners. In addition, we show that for many of the components, the underlying linear features have essentially V1 simple cell receptive field characteristics. Our results indicate that the development of the V2 cells preferring angles and corners may be partly explainable by the principle of unsupervised sparse coding of natural images. 1
4 0.16458993 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
Author: Hamed Valizadegan, Rong Jin
Abstract: Maximum margin clustering was proposed lately and has shown promising performance in recent studies [1, 2]. It extends the theory of support vector machine to unsupervised learning. Despite its good performance, there are three major problems with maximum margin clustering that question its efficiency for real-world applications. First, it is computationally expensive and difficult to scale to large-scale datasets because the number of parameters in maximum margin clustering is quadratic in the number of examples. Second, it requires data preprocessing to ensure that any clustering boundary will pass through the origins, which makes it unsuitable for clustering unbalanced dataset. Third, it is sensitive to the choice of kernel functions, and requires external procedure to determine the appropriate values for the parameters of kernel functions. In this paper, we propose “generalized maximum margin clustering” framework that addresses the above three problems simultaneously. The new framework generalizes the maximum margin clustering algorithm by allowing any clustering boundaries including those not passing through the origins. It significantly improves the computational efficiency by reducing the number of parameters. Furthermore, the new framework is able to automatically determine the appropriate kernel matrix without any labeled data. Finally, we show a formal connection between maximum margin clustering and spectral clustering. We demonstrate the efficiency of the generalized maximum margin clustering algorithm using both synthetic datasets and real datasets from the UCI repository. 1
5 0.16366933 79 nips-2006-Fast Iterative Kernel PCA
Author: Nicol N. Schraudolph, Simon Günter, S.v.n. Vishwanathan
Abstract: We introduce two methods to improve convergence of the Kernel Hebbian Algorithm (KHA) for iterative kernel PCA. KHA has a scalar gain parameter which is either held constant or decreased as 1/t, leading to slow convergence. Our KHA/et algorithm accelerates KHA by incorporating the reciprocal of the current estimated eigenvalues as a gain vector. We then derive and apply Stochastic MetaDescent (SMD) to KHA/et; this further speeds convergence by performing gain adaptation in RKHS. Experimental results for kernel PCA and spectral clustering of USPS digits as well as motion capture and image de-noising problems confirm that our methods converge substantially faster than conventional KHA. 1
6 0.15670447 138 nips-2006-Multi-Task Feature Learning
7 0.14999329 149 nips-2006-Nonnegative Sparse PCA
8 0.13796706 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
9 0.13146345 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space
10 0.13121384 103 nips-2006-Kernels on Structured Objects Through Nested Histograms
11 0.12924239 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods
12 0.12555599 7 nips-2006-A Local Learning Approach for Clustering
13 0.11707184 177 nips-2006-Sparse Kernel Orthonormalized PLS for feature extraction in large data sets
14 0.11202601 60 nips-2006-Convergence of Laplacian Eigenmaps
15 0.10332007 80 nips-2006-Fundamental Limitations of Spectral Clustering
16 0.10097577 82 nips-2006-Gaussian and Wishart Hyperkernels
17 0.094029598 110 nips-2006-Learning Dense 3D Correspondence
18 0.092377871 55 nips-2006-Computation of Similarity Measures for Sequential Data using Generalized Suffix Trees
19 0.09112075 77 nips-2006-Fast Computation of Graph Kernels
20 0.090826176 128 nips-2006-Manifold Denoising
topicId topicWeight
[(0, -0.321), (1, 0.118), (2, -0.007), (3, 0.241), (4, -0.058), (5, 0.018), (6, -0.039), (7, 0.035), (8, 0.089), (9, 0.109), (10, -0.15), (11, 0.051), (12, 0.088), (13, -0.358), (14, 0.094), (15, -0.135), (16, -0.082), (17, 0.063), (18, -0.118), (19, 0.076), (20, -0.018), (21, 0.055), (22, -0.02), (23, -0.056), (24, 0.002), (25, 0.019), (26, 0.016), (27, 0.0), (28, 0.035), (29, 0.062), (30, -0.203), (31, 0.051), (32, -0.126), (33, 0.034), (34, -0.072), (35, -0.046), (36, -0.073), (37, -0.004), (38, 0.015), (39, -0.013), (40, 0.039), (41, 0.065), (42, -0.076), (43, -0.05), (44, 0.118), (45, -0.06), (46, -0.017), (47, 0.079), (48, -0.007), (49, 0.004)]
simIndex simValue paperId paperTitle
same-paper 1 0.97713506 65 nips-2006-Denoising and Dimension Reduction in Feature Space
Author: Mikio L. Braun, Klaus-Robert Müller, Joachim M. Buhmann
Abstract: We show that the relevant information about a classification problem in feature space is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem. Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data to perform classification. Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and to (3) de-noise in feature space in order to yield better classification results. 1
2 0.88292587 102 nips-2006-Kernel Maximum Entropy Data Transformation and an Enhanced Spectral Clustering Algorithm
Author: Robert Jenssen, Torbjørn Eltoft, Mark Girolami, Deniz Erdogmus
Abstract: We propose a new kernel-based data transformation technique. It is founded on the principle of maximum entropy (MaxEnt) preservation, hence named kernel MaxEnt. The key measure is Renyi’s entropy estimated via Parzen windowing. We show that kernel MaxEnt is based on eigenvectors, and is in that sense similar to kernel PCA, but may produce strikingly different transformed data sets. An enhanced spectral clustering algorithm is proposed, by replacing kernel PCA by kernel MaxEnt as an intermediate step. This has a major impact on performance.
3 0.75729632 177 nips-2006-Sparse Kernel Orthonormalized PLS for feature extraction in large data sets
Author: Jerónimo Arenas-garcía, Kaare B. Petersen, Lars K. Hansen
Abstract: In this paper we are presenting a novel multivariate analysis method. Our scheme is based on a novel kernel orthonormalized partial least squares (PLS) variant for feature extraction, imposing sparsity constrains in the solution to improve scalability. The algorithm is tested on a benchmark of UCI data sets, and on the analysis of integrated short-time music features for genre prediction. The upshot is that the method has strong expressive power even with rather few features, is clearly outperforming the ordinary kernel PLS, and therefore is an appealing method for feature extraction of labelled data. 1
4 0.72385675 82 nips-2006-Gaussian and Wishart Hyperkernels
Author: Risi Kondor, Tony Jebara
Abstract: We propose a new method for constructing hyperkenels and define two promising special cases that can be computed in closed form. These we call the Gaussian and Wishart hyperkernels. The former is especially attractive in that it has an interpretable regularization scheme reminiscent of that of the Gaussian RBF kernel. We discuss how kernel learning can be used not just for improving the performance of classification and regression methods, but also as a stand-alone algorithm for dimensionality reduction and relational or metric learning. 1
5 0.69158739 149 nips-2006-Nonnegative Sparse PCA
Author: Ron Zass, Amnon Shashua
Abstract: We describe a nonnegative variant of the ”Sparse PCA” problem. The goal is to create a low dimensional representation from a collection of points which on the one hand maximizes the variance of the projected points and on the other uses only parts of the original coordinates, and thereby creating a sparse representation. What distinguishes our problem from other Sparse PCA formulations is that the projection involves only nonnegative weights of the original coordinates — a desired quality in various fields, including economics, bioinformatics and computer vision. Adding nonnegativity contributes to sparseness, where it enforces a partitioning of the original coordinates among the new axes. We describe a simple yet efficient iterative coordinate-descent type of scheme which converges to a local optimum of our optimization criteria, giving good results on large real world datasets. 1
6 0.69061947 79 nips-2006-Fast Iterative Kernel PCA
7 0.5744949 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis
8 0.56922787 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods
9 0.56108248 6 nips-2006-A Kernel Subspace Method by Stochastic Realization for Learning Nonlinear Dynamical Systems
10 0.51987255 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space
11 0.50717133 138 nips-2006-Multi-Task Feature Learning
12 0.50531399 103 nips-2006-Kernels on Structured Objects Through Nested Histograms
13 0.47870693 142 nips-2006-Mutagenetic tree Fisher kernel improves prediction of HIV drug resistance from viral genotype
14 0.47337607 77 nips-2006-Fast Computation of Graph Kernels
15 0.47168639 105 nips-2006-Large Margin Component Analysis
16 0.46604085 5 nips-2006-A Kernel Method for the Two-Sample-Problem
17 0.45097935 70 nips-2006-Doubly Stochastic Normalization for Spectral Clustering
18 0.4331367 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
19 0.41133264 55 nips-2006-Computation of Similarity Measures for Sequential Data using Generalized Suffix Trees
20 0.41132373 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
topicId topicWeight
[(1, 0.161), (3, 0.034), (7, 0.065), (9, 0.082), (12, 0.019), (20, 0.022), (22, 0.09), (44, 0.081), (57, 0.08), (65, 0.082), (69, 0.031), (70, 0.153), (71, 0.022), (90, 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.89308929 65 nips-2006-Denoising and Dimension Reduction in Feature Space
Author: Mikio L. Braun, Klaus-Robert Müller, Joachim M. Buhmann
Abstract: We show that the relevant information about a classification problem in feature space is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem. Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data to perform classification. Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and to (3) de-noise in feature space in order to yield better classification results. 1
2 0.89280891 160 nips-2006-Part-based Probabilistic Point Matching using Equivalence Constraints
Author: Graham Mcneill, Sethu Vijayakumar
Abstract: Correspondence algorithms typically struggle with shapes that display part-based variation. We present a probabilistic approach that matches shapes using independent part transformations, where the parts themselves are learnt during matching. Ideas from semi-supervised learning are used to bias the algorithm towards finding ‘perceptually valid’ part structures. Shapes are represented by unlabeled point sets of arbitrary size and a background component is used to handle occlusion, local dissimilarity and clutter. Thus, unlike many shape matching techniques, our approach can be applied to shapes extracted from real images. Model parameters are estimated using an EM algorithm that alternates between finding a soft correspondence and computing the optimal part transformations using Procrustes analysis.
3 0.83974463 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
Author: Hamed Valizadegan, Rong Jin
Abstract: Maximum margin clustering was proposed lately and has shown promising performance in recent studies [1, 2]. It extends the theory of support vector machine to unsupervised learning. Despite its good performance, there are three major problems with maximum margin clustering that question its efficiency for real-world applications. First, it is computationally expensive and difficult to scale to large-scale datasets because the number of parameters in maximum margin clustering is quadratic in the number of examples. Second, it requires data preprocessing to ensure that any clustering boundary will pass through the origins, which makes it unsuitable for clustering unbalanced dataset. Third, it is sensitive to the choice of kernel functions, and requires external procedure to determine the appropriate values for the parameters of kernel functions. In this paper, we propose “generalized maximum margin clustering” framework that addresses the above three problems simultaneously. The new framework generalizes the maximum margin clustering algorithm by allowing any clustering boundaries including those not passing through the origins. It significantly improves the computational efficiency by reducing the number of parameters. Furthermore, the new framework is able to automatically determine the appropriate kernel matrix without any labeled data. Finally, we show a formal connection between maximum margin clustering and spectral clustering. We demonstrate the efficiency of the generalized maximum margin clustering algorithm using both synthetic datasets and real datasets from the UCI repository. 1
4 0.82331008 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections
Author: Yonatan Amit, Shai Shalev-shwartz, Yoram Singer
Abstract: We describe and analyze an algorithmic framework for online classification where each online trial consists of multiple prediction tasks that are tied together. We tackle the problem of updating the online hypothesis by defining a projection problem in which each prediction task corresponds to a single linear constraint. These constraints are tied together through a single slack parameter. We then introduce a general method for approximately solving the problem by projecting simultaneously and independently on each constraint which corresponds to a prediction sub-problem, and then averaging the individual solutions. We show that this approach constitutes a feasible, albeit not necessarily optimal, solution for the original projection problem. We derive concrete simultaneous projection schemes and analyze them in the mistake bound model. We demonstrate the power of the proposed algorithm in experiments with online multiclass text categorization. Our experiments indicate that a combination of class-dependent features with the simultaneous projection method outperforms previously studied algorithms. 1
5 0.82238489 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
Author: Andrea Vedaldi, Stefano Soatto
Abstract: Image Congealing (IC) is a non-parametric method for the joint alignment of a collection of images affected by systematic and unwanted deformations. The method attempts to undo the deformations by minimizing a measure of complexity of the image ensemble, such as the averaged per-pixel entropy. This enables alignment without an explicit model of the aligned dataset as required by other methods (e.g. transformed component analysis). While IC is simple and general, it may introduce degenerate solutions when the transformations allow minimizing the complexity of the data by collapsing them to a constant. Such solutions need to be explicitly removed by regularization. In this paper we propose an alternative formulation which solves this regularization issue on a more principled ground. We make the simple observation that alignment should simplify the data while preserving the useful information carried by them. Therefore we trade off fidelity and complexity of the aligned ensemble rather than minimizing the complexity alone. This eliminates the need for an explicit regularization of the transformations, and has a number of other useful properties such as noise suppression. We show the modeling and computational benefits of the approach to the some of the problems on which IC has been demonstrated. 1
6 0.82078743 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
7 0.81894261 61 nips-2006-Convex Repeated Games and Fenchel Duality
8 0.81775099 20 nips-2006-Active learning for misspecified generalized linear models
9 0.81764781 175 nips-2006-Simplifying Mixture Models through Function Approximation
10 0.81527168 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
11 0.81217998 127 nips-2006-MLLE: Modified Locally Linear Embedding Using Multiple Weights
12 0.81151956 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space
13 0.81121546 79 nips-2006-Fast Iterative Kernel PCA
14 0.80934364 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis
15 0.8086434 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
16 0.80842763 117 nips-2006-Learning on Graph with Laplacian Regularization
17 0.80794436 167 nips-2006-Recursive ICA
18 0.80604327 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments
19 0.80538046 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
20 0.80435866 194 nips-2006-Towards a general independent subspace analysis