jmlr jmlr2011 jmlr2011-60 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Umut Ozertem, Deniz Erdogmus
Abstract: Principal curves are defined as self-consistent smooth curves passing through the middle of the data, and they have been used in many applications of machine learning as a generalization, dimensionality reduction and a feature extraction tool. We redefine principal curves and surfaces in terms of the gradient and the Hessian of the probability density estimate. This provides a geometric understanding of the principal curves and surfaces, as well as a unifying view for clustering, principal curve fitting and manifold learning by regarding those as principal manifolds of different intrinsic dimensionalities. The theory does not impose any particular density estimation method; it can be used with any density estimator that gives continuous first and second derivatives. Therefore, we first present our principal curve/surface definition without assuming any particular density estimation method. Afterwards, we develop practical algorithms for the commonly used kernel density estimation (KDE) and Gaussian mixture models (GMM). Results of these algorithms are presented in notional data sets as well as real applications with comparisons to other approaches in the principal curve literature. All in all, we present a novel theoretical understanding of principal curves and surfaces, practical algorithms as general purpose machine learning tools, and applications of these algorithms to several practical problems. Keywords: unsupervised learning, dimensionality reduction, principal curves, principal surfaces, subspace constrained mean-shift
Reference: text
sentIndex sentText sentNum sentScore
1 We redefine principal curves and surfaces in terms of the gradient and the Hessian of the probability density estimate. [sent-7, score-0.943]
2 This provides a geometric understanding of the principal curves and surfaces, as well as a unifying view for clustering, principal curve fitting and manifold learning by regarding those as principal manifolds of different intrinsic dimensionalities. [sent-8, score-2.114]
3 Results of these algorithms are presented in notional data sets as well as real applications with comparisons to other approaches in the principal curve literature. [sent-12, score-0.802]
4 Keywords: unsupervised learning, dimensionality reduction, principal curves, principal surfaces, subspace constrained mean-shift 1. [sent-14, score-1.17]
5 In fact, this forms the basic idea behind the original principal curve definition by Hastie (1984); Hastie and Stuetzle (1989). [sent-19, score-0.802]
6 At the time, Hastie and Stuetzle's proposition of self-consistent principal curves (Hastie, 1984; Hastie and Stuetzle, 1989) pointed out a different track for nonlinear dimensionality reduction. [sent-39, score-0.776]
7 They defined self-consistency over the local conditional data expectations, and generalized the self-consistency property of the principal line into nonlinear structures to introduce the concept of principal curves. [sent-40, score-1.168]
8 Hastie and Stuetzle define the principal curve as an infinitely differentiable finite length curve that passes through the middle of the data. [sent-41, score-1.057]
9 Hastie and Stuetzle's major theoretical contributions are the following: (i) they show that if a straight line is self-consistent, it is a principal component; (ii) based on the MSE criterion, self-consistent principal curves are saddle points of the distance function. [sent-43, score-1.254]
10 They use this second property to develop an algorithm that starts from the principal line and iteratively finds the principal curve by minimizing the average squared distance between the data points and the curve (Hastie, 1984; Hastie and Stuetzle, 1989). [sent-44, score-1.604]
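To make the projection and conditional-expectation steps concrete, here is a minimal Python sketch of a Hastie-Stuetzle-style loop. The knot grid, the Gaussian weighting used in place of the original scatterplot smoother, and all identifiers are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def hastie_stuetzle_sketch(X, n_iter=10, n_knots=30, span=0.1):
    """Toy projection/expectation loop in the spirit of Hastie-Stuetzle.
    X: (N, n) data matrix. Returns an (n_knots, n) polyline approximating the curve."""
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    t = Xc @ Vt[0]                                   # coordinates along the principal line
    curve = np.outer(np.linspace(t.min(), t.max(), n_knots), Vt[0]) + mean
    grid = np.linspace(0.0, 1.0, n_knots)
    for _ in range(n_iter):
        # Projection step: parameter of the nearest knot for each sample.
        d2 = ((X[:, None, :] - curve[None, :, :]) ** 2).sum(-1)
        lam = grid[d2.argmin(axis=1)]
        # Expectation step: replace each knot by a locally weighted average of the
        # samples whose parameters fall near it (a Gaussian weighting stands in for
        # the scatterplot smoothing used in the original algorithm).
        w = np.exp(-0.5 * ((lam[None, :] - grid[:, None]) / span) ** 2)  # (n_knots, N)
        curve = (w @ X) / w.sum(axis=1, keepdims=True)
    return curve
```

Replacing the Gaussian weighting with a proper scatterplot smoother over the projection index would bring this sketch closer to the published algorithm.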
11 Although they cannot prove the convergence of their algorithm, Hastie and Stuetzle claim that principal curves are by definition a fixed point of their algorithm, and if the projection step of their algorithm is replaced with least squares line fitting, the algorithm converges to the principal line. [sent-45, score-1.302]
12 Banfield and Raftery extend the Hastie-Stuetzle principal curve algorithm to closed curves and propose an algorithm that reduces the estimation bias (Banfield and Raftery, 1992). [sent-47, score-0.963]
13 Delicado’s method is based on the total variance and conditional means and finds the principal curve of oriented points of the data set. [sent-50, score-0.802]
14 The probabilistic principal curves approach, which uses a cubic spline over a mixture of Gaussians to estimate the principal curves/surfaces (Chang and Grosh, 2002), is known to be among the most successful methods for overcoming the common problem of bias introduced in regions of high curvature. [sent-52, score-1.276]
15 Verbeek and coworkers used local principal lines to construct principal curves (Verbeek et al. [sent-53, score-1.269]
16 Many principal curve approaches in the literature, including the original Hastie-Stuetzle algorithm, are based on the idea of minimizing mean square projection error. [sent-59, score-0.876]
17 Kegl and colleagues provide a regularized version of Hastie’s definition by bounding the total length of the principal curve to avoid overfitting (Kegl et al. [sent-61, score-0.802]
18 Sandilya and Kulkarni define the regularization in another way by constraining bounds on the turns of the principal curve (Sandilya and Kulkarni, 2002). [sent-63, score-0.802]
19 Similar to Kegl’s principal curve definition of bounded length, they also show that principal curves with bounded turn always exist if the data distribution has finite second moments. [sent-64, score-1.483]
20 At this point, note that the original Hastie-Stuetzle definition requires the principal curve not to intersect itself, which is quite restrictive, and perhaps, Kegl’s Principal Graph algorithm is the only approach in the principal curves literature that can handle self-intersecting data. [sent-66, score-1.483]
21 Overall, the original principal curve definition by Hastie and Stuetzle forms a strong basis for many, possibly all, principal curve algorithms. [sent-67, score-1.604]
22 Here we take a bold step by defining the principal curves with no explicit smoothness constraint at all; we assume that smoothness of principal curves/surfaces is inherent in the smoothness of the underlying probability density (estimate). [sent-70, score-1.311]
23 Providing the definition in terms of data probability density allows us to link open ended problems of principal curve fitting literature—like optimal regularization constraints and outlier robustness—to well established principles in density estimation literature. [sent-71, score-1.039]
24 , the same formulas apply to both kernel density and mixture model estimates with minor modifications), (iii) the convergence of the algorithm to a point on the principal surface with appropriate dimensionality is guaranteed for any initial point, since mean-shift is a convergent procedure. [sent-79, score-0.857]
25 Our principal curve definition essentially corresponds to the ridge of the probability density function. [sent-86, score-0.885]
26 Hastie’s self-consistency principle states that every point on the principal curve is the expected value of the points in the orthogonal subspace of the principal curve at that point—and this orthogonal space rotates along the curve. [sent-88, score-1.726]
27 In our view, every point on the principal surface is the local maximum, not the expected value, of the probability density in the local orthogonal subspace. [sent-89, score-0.819]
28 On the left, a comparison of the proposed principal curve projection, and the trajectories of the gradient of the pdf is presented. [sent-98, score-1.039]
29 On the right, we present the principal curve of a 7-component Gaussian mixture from two different points of view. [sent-102, score-0.85]
30 Inspired by differential geometry where principal lines of curvature are well-defined and understood, we define the principal curves and surfaces in terms of the first and second order derivatives of the assumed probability density function. [sent-105, score-1.525]
31 In practice, while principal surfaces might be useful in dimension reduction as in manifold learning, minor surfaces, valleys in the probability density function, can be useful in semi-supervised learning. [sent-163, score-0.891]
32 Consequently, at any point on the d-dimensional principal set (or critical or minor sets) in a small open ball around this point, the points in the principal set form a continuous surface. [sent-177, score-1.16]
33 In general, the order of continuous differentiability of the underlying pdf model is reflected in the emerging principal surfaces and projection trajectories accordingly. [sent-180, score-0.985]
34 Observe this inclusion property by revisiting Figure 1, as the major principal curve (shown in the figures on the right) passes through all local maxima of the Gaussian mixture density. [sent-184, score-0.891]
35 If one projects points in higher dimensional surfaces to lower dimensional principal surfaces following trajectories traced by the Hessian of f(p(x)), these projection trajectories will depend on f. [sent-199, score-1.023]
36 Consequently, the local Hessian eigendirections form linear trajectories and principal surfaces become hyperplanes spanned by the eigenvectors of the Gaussian’s covariance matrix. [sent-203, score-0.917]
37 5 Existence of Principal Curves and Surfaces Considering Hastie’s principal curve definition, the existence proof of principal curves is limited to some special cases, such as elliptical or spherical distributions concentrated around a smooth curve. [sent-221, score-1.483]
38 It should also be noted that this definition of the principal curve requires the principal curve not to intersect itself. [sent-222, score-1.604]
39 (2000) and Sandilya and Kulkarni (2002) are theoretically more appealing in this context, since by their definition, the principal curve always exists if the distribution has finite second moments. [sent-224, score-0.802]
40 According to our definition, the principal curve exists as long as the data probability density is twice differentiable, such that the Hessian is nonzero. [sent-225, score-0.885]
41 However, also note that by our definition the principal curve does not exist for uniform distributions. [sent-227, score-0.802]
42 Note that, since it coincides with PCA for Gaussian distributions, our principal curve definition also has the ambiguity that occurs in PCA; the principal surface of a spherically symmetric distribution is not well-defined. [sent-231, score-1.415]
43 Definitions based on conditional expectation or mean squared projection error have driven principal curves research, but in general, such definitions are limited to the nonlinear counterpart of the first principal component. [sent-232, score-1.335]
44 principal curve in the literature that we are aware of. [sent-234, score-0.802]
45 A point is on the one dimensional principal surface iff the local gradient is an eigenvector of the local Hessian—since the gradient has to be orthogonal to the other (n − 1) eigenvectors—and the corresponding (n − 1) eigenvalues are negative. [sent-249, score-0.907]
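This condition can be checked numerically once the gradient and Hessian of the density estimate are available. The sketch below does so for a Gaussian KDE with an isotropic bandwidth; the closed-form derivatives follow from differentiating the Gaussian kernel, while the function names, the tolerance, and the isotropy assumption are illustrative choices rather than part of the paper.

```python
import numpy as np

def kde_value_grad_hess(x, X, sigma):
    """Value, gradient and Hessian of an isotropic Gaussian KDE at a query point x."""
    N, n = X.shape
    diff = x[None, :] - X                                       # (N, n)
    k = np.exp(-0.5 * (diff ** 2).sum(axis=1) / sigma ** 2)     # kernel responsibilities
    Z = N * (2.0 * np.pi * sigma ** 2) ** (n / 2.0)             # normalizer
    p = k.sum() / Z
    g = -(k[:, None] * diff).sum(axis=0) / (sigma ** 2 * Z)
    H = ((k[:, None, None] * diff[:, :, None] * diff[:, None, :]).sum(axis=0) / (sigma ** 4 * Z)
         - (p / sigma ** 2) * np.eye(n))
    return p, g, H

def on_principal_curve(x, X, sigma, tol=1e-2):
    """Approximate test of the stated condition: the gradient lies along one Hessian
    eigenvector and the remaining n-1 eigenvalues are negative."""
    _, g, H = kde_value_grad_hess(x, X, sigma)
    evals, evecs = np.linalg.eigh(H)                            # ascending eigenvalues
    u = g / (np.linalg.norm(g) + 1e-300)
    coeffs = evecs.T @ u                                        # gradient in the eigenbasis
    k = np.argmax(np.abs(coeffs))                               # eigendirection most aligned with g
    aligned = 1.0 - np.abs(coeffs[k]) < tol                     # nearly parallel to that eigenvector?
    rest_negative = np.all(np.delete(evals, k) < 0.0)
    return bool(aligned and rest_negative)
```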
46 The iterations can be used to find the principal curve projection of any arbitrary point of interest in the feature space. [sent-281, score-0.876]
47 To find the principal curve projections of the data samples, a suitable way is to initialize the projection trajectories to the data samples themselves, as in mean-shift clustering. [sent-282, score-1.002]
48 The general version of the SCMS algorithm that converges to the d-dimensional principal manifold is presented in Table 1, and the SCMS principal curve algorithm can simply be obtained by setting d = 1. [sent-283, score-1.433]
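A compact sketch of the subspace constrained mean-shift idea for a Gaussian KDE follows. Table 1 of the paper remains the authoritative statement of the algorithm; the isotropic bandwidth, the step-norm stopping rule, the use of the log-density Hessian for the constraint subspace, and all identifiers here are simplifying assumptions.

```python
import numpy as np

def scms_project(x0, X, sigma, d=1, max_iter=500, tol=1e-7):
    """Move a point onto the (approximate) d-dimensional principal set of a
    Gaussian KDE by constraining each mean-shift step to the local orthogonal
    subspace spanned by the n-d smallest-eigenvalue eigenvectors of the
    log-density Hessian."""
    N, n = X.shape
    x = np.array(x0, dtype=float)
    for _ in range(max_iter):
        diff = x[None, :] - X                                   # (N, n)
        w = np.exp(-0.5 * (diff ** 2).sum(axis=1) / sigma ** 2) # kernel weights
        s = w.sum()
        m = (w[:, None] * X).sum(axis=0) / s                    # mean-shift target point
        # Gradient and Hessian of the (unnormalized) density; constants cancel below.
        g = -(w[:, None] * diff).sum(axis=0) / sigma ** 2
        H = ((w[:, None, None] * diff[:, :, None] * diff[:, None, :]).sum(axis=0) / sigma ** 4
             - (s / sigma ** 2) * np.eye(n))
        g_log = g / s
        H_log = H / s - np.outer(g_log, g_log)                  # Hessian of log density
        evals, evecs = np.linalg.eigh(H_log)                    # ascending eigenvalues
        V = evecs[:, : n - d]                                   # local orthogonal subspace
        step = V @ (V.T @ (m - x))                              # constrained mean-shift step
        x = x + step
        if np.linalg.norm(step) < tol:
            break
    return x

# Setting d = 1 traces a principal curve; d = 0 makes V the full basis, so the
# update reduces to ordinary mean-shift mode finding (clustering).
```

Running this from each data sample, for example projections = np.array([scms_project(xi, X, sigma) for xi in X]), follows the initialization suggested above and yields the principal curve projections of the samples.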
49 2 Statistical Consistency Our algorithmic approach has used the powerful kernel density estimation (KDE) technique to estimate principal surface structures that underlie data densities. [sent-320, score-0.776]
50 In principal surfaces, however, we rely on the accurate estimation of the first and second derivatives of the multivariate data density along with the density itself. [sent-322, score-0.765]
51 3 Outlier Robustness Outlier robustness is another key issue in the principal curve literature. [sent-333, score-0.829]
52 Such approaches are known to be sensitive to noise, and the presence of outlier data samples will, of course, bias the principal curve towards the outliers. [sent-335, score-0.846]
53 In fact, this data set is similar to the illustration that Stanford and Raftery use as they propose their noise robust principal curve approach (Stanford and Raftery, 2000). [sent-386, score-0.826]
54 Considering the data set and principal curves in Figure 4 (left), Kegl asks which of the curves is the right one. [sent-395, score-0.815]
55 In the first part, we provide comparisons with some earlier principal curve algorithms in the literature. [sent-417, score-0.802]
56 Like the principal line, principal curves—in our definition—extend to infinity. [sent-423, score-1.094]
57 Therefore, throughout this section, rather than populating samples on a curve that extends to infinity, we prefer representing the principal curve with the data samples projected onto the principal curve, so that the curves in the plots remain in the support of the data. [sent-425, score-1.796]
58 Figure 5: Zig-zag data set, and Hastie-Stuetzle principal curve. [sent-441, score-0.802]
59 1 Comparisons with Other Principal Curve Methods In this section we present comparisons with the original Hastie-Stuetzle principal curve method (Hastie, 1984; Hastie and Stuetzle, 1989) and the Polygonal Line Algorithm by Kegl et al. [sent-442, score-0.802]
60 1 Zig-Zag Data Set The Zig-Zag data set has been used in an earlier principal curve paper by Kegl et al. [sent-447, score-0.802]
61 Here—and also for the rest of the paper—we present the data projected onto the principal curve only, which depicts the portion of the principal curve in the support of the input data. [sent-470, score-0.802]
62 Figure 7: Zig-zag data set, and its principal curve projections obtained for KDE with isotropic constant bandwidth (top), KDE with anisotropic and data-dependent covariance (middle), and Gaussian mixture with 4 components (bottom). [sent-525, score-1.063]
63 Figure 8: Spiral data set, and Hastie-Stuetzle principal curve. Similar to the previous example, we start with the results of the Hastie-Stuetzle algorithm and Kegl's polygonal line algorithm. [sent-536, score-0.922]
64 Comparing Figure 9 and Figure 10, one can see that both the Polygonal Line algorithm—with suitable parameters—and our locally defined principal curve can achieve satisfactory results. [sent-546, score-0.802]
65 At each noise level, we find the principal curve with both methods on the same noisy data set, and afterwards we take another 200 samples from the same generating curve and add the same amount of radial noise to use as the test set. [sent-554, score-1.134]
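As a reference for this protocol, a hypothetical data generator in the same spirit is sketched below; the spiral parameterization, noise model, and sample sizes are guesses for illustration rather than the paper's exact settings.

```python
import numpy as np

def noisy_spiral(n_samples=200, noise_std=0.1, turns=2.0, seed=None):
    """Sample a 2-D spiral and perturb the points with radial noise."""
    rng = np.random.default_rng(seed)
    t = np.sort(rng.uniform(0.2, 1.0, n_samples))        # curve parameter
    theta = 2.0 * np.pi * turns * t
    clean = np.column_stack([t * np.cos(theta), t * np.sin(theta)])
    radial = clean / np.linalg.norm(clean, axis=1, keepdims=True)
    noisy = clean + radial * rng.normal(0.0, noise_std, size=(n_samples, 1))
    return clean, noisy

# Train/test protocol sketch: fit each method on one noisy draw, then evaluate the
# squared distance from a second, independently drawn noisy test set to the fitted curve.
train_clean, train_noisy = noisy_spiral(seed=0)
test_clean, test_noisy = noisy_spiral(seed=1)
```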
66 3 Loops, Self-Intersections, and Bifurcations Since they are specifically designed to fit smooth curves to the data, traditional principal curve fitting approaches in the literature have difficulties if there are loops, bifurcations and self-intersections in the data. [sent-564, score-1.001]
67 Figure 10: Spiral data set, and KDE-SCMS principal curve. [sent-576, score-0.802]
68 Figure 11: Noisy spiral data set, and KDE-SCMS principal curve. In the presence of such irregularities, our definition yields a principal graph—a collection of smooth curves. [sent-586, score-0.802]
69 4 Extending the Definition to Higher Dimensional Manifolds The generalization of principal curves to principal surfaces and higher order manifolds is naturally achieved with our definition. [sent-591, score-1.381]
70 Here, note that the covariance of the helix data around the principal curve is not symmetric, and the horizontal dimension has a higher variance (and this is why the principal surface is spanned along this dimension). [sent-594, score-1.491]
71 If the helix had been symmetric around the principal curve, the principal surface would have been ill-defined. [sent-595, score-1.187]
72 We proposed to use principal curve projections as a nonparametric denoising filter at the preprocessing stage of time warping algorithms, which in general are prone to noise (Ozertem and Erdogmus, 2009). [sent-657, score-0.9]
73 The trajectories that take a point x to its projection on the d-dimensional principal surface can be used to obtain these curvilinear coordinates that specify the point with respect to some reference critical point that can be assumed to be the origin. [sent-669, score-0.808]
74 Figure 15(b) shows the principal curve of this time-frequency surface obtained by KDE-SCMS. [sent-688, score-0.868]
75 At this point, one can improve the noise robustness of the clustering by exploiting the fact that the clusters are curves in the feature space and applying spectral clustering to the principal curve projections instead of the data samples. [sent-712, score-1.124]
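A minimal sketch of this idea with scikit-learn, assuming the principal curve projections have already been computed (for example by SCMS) and stored row-wise in projections; the affinity, neighbor count, and cluster count are placeholder choices, not settings from the paper.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_on_projections(projections, n_clusters):
    """Cluster the principal-curve projections rather than the raw noisy samples."""
    model = SpectralClustering(n_clusters=n_clusters, affinity="nearest_neighbors",
                               n_neighbors=10, assign_labels="kmeans", random_state=0)
    return model.fit_predict(np.asarray(projections))
```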
76 Figure 17 shows a result of KDE-SCMS for the same data set at a signal-to-noise ratio of 5 dB, along with the average normalized MSE (and ± one standard deviation) between the actual noise-free signal and the principal curve projection over 20 Monte Carlo runs. [sent-713, score-1.006]
77 The principal curve projection result can give a good estimate of the noise-free signal even at signal-to-noise ratios lower than 0 dB—where the noise power is greater than the signal power itself. [sent-714, score-1.083]
78 One significant problem with applying principal curve algorithms to the skeletonization of optical characters is that, by definition, the algorithms seek a smooth curve. [sent-719, score-0.859]
79 Figure 18 shows the binary images along with the principal curve projection of the pixels. [sent-725, score-0.876]
80 3 Limitations, Finite Sample Effects, and the Curse of Dimensionality Since our principal curve definition assumes the pdf to be given, it depends on the reliability of the preceding density estimation step, which in general may not be an easy task. [sent-728, score-1.075]
81 Therefore, before we move on to applications on real data, in this section we will present the performance of our principal curve fitting results for various density estimates with different numbers of samples and dimensions. [sent-730, score-0.914]
82 In all comparisons presented below, principal curve projections are obtained by the KDE-SCMS algorithm using the leave-one-out ML kernel bandwidth. [sent-733, score-0.904]
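For reference, a simple way to obtain such a bandwidth is a leave-one-out maximum-likelihood grid search over an isotropic Gaussian KDE, sketched below; the candidate grid, the isotropy assumption, and the function name are illustrative and not necessarily the exact procedure used in the experiments.

```python
import numpy as np

def loo_ml_bandwidth(X, candidate_sigmas):
    """Select the isotropic KDE bandwidth maximizing the leave-one-out log likelihood."""
    N, n = X.shape
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)   # pairwise squared distances
    best_sigma, best_ll = None, -np.inf
    for sigma in candidate_sigmas:
        K = np.exp(-0.5 * D2 / sigma ** 2) / (2.0 * np.pi * sigma ** 2) ** (n / 2.0)
        np.fill_diagonal(K, 0.0)                               # leave each sample out
        loo = K.sum(axis=1) / (N - 1)                          # leave-one-out density at x_i
        ll = np.log(loo + 1e-300).sum()
        if ll > best_ll:
            best_sigma, best_ll = sigma, ll
    return best_sigma
```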
83 Indeed, results previously presented in this section show that KDE-based principal curve estimation proves to be efficient in adapting to many real-life data distributions in a diverse set of applications. [sent-786, score-0.829]
84 Figure 17: Signal samples and their principal curve projections; normalized MSE vs. signal-to-noise ratio (in dB). [sent-821, score-0.932]
85 In this scenario, we will compare the principal line estimator based on PCA to the principal curve based on KDE, for different numbers of dimensions. [sent-824, score-1.349]
86 Consider the data set {x_i}_{i=1}^N, Gaussian distributed in d-dimensional space, where v denotes the true principal line of this distribution, and v∗ denotes the principal line obtained by sample PCA. [sent-825, score-1.094]
87 mean squared distance between the projection of the data samples onto the true eigenvector and the principal curve projection x̂, E{||v vT x − x̂||^2}. [sent-828, score-1.03]
88 Figure 19 presents the MSE of the principal line (dashed curve) and principal curve (solid curve) projections for 2, 3, 4, 5, 10, 20, 30, and 40 dimensions; the average log MSE over 100 Monte Carlo simulations is shown. (Figure 18: SCMS results in optical characters.) [sent-829, score-1.424]
89 Principal line projection always results in better accuracy, and the performance of principal curve projections drops exponentially with increasing dimensions. [sent-831, score-0.925]
90 As also implied in the previous section by the comparison against PCA on a Gaussian data set, with a parametric approach of suitable model order the algorithm would need fewer samples to achieve good principal curve estimates. [sent-838, score-0.831]
91 Here we will evaluate the stability of principal curve estimation with GMM-SCMS for improper model order selections in the GMM density estimation step, and compare the principal curve projection results for a Gaussian mixture with 3 components. [sent-839, score-1.863]
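A sketch of how such a model-order sweep could be set up with scikit-learn's GaussianMixture is given below; the order grid and the fitting options are assumptions, and the principal curve projections would then be computed from each fitted mixture (for example with a GMM variant of SCMS).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm_orders(X, orders=(1, 2, 3, 4, 6, 8), seed=0):
    """Fit Gaussian mixtures of several model orders to the same data set."""
    fitted = {}
    for k in orders:
        gmm = GaussianMixture(n_components=k, covariance_type="full",
                              n_init=5, random_state=seed)
        fitted[k] = gmm.fit(X)
    return fitted
```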
92 Since the true underlying density is known to have 3 components, we measure the performance of the principal curve projection results. [sent-840, score-1.009]
93 Figure 19: Mean projection error in log_e scale versus the number of samples (in log scale), for the principal line (dashed) and the principal curve (solid). [sent-880, score-1.514]
94 Figure 20 shows a realization of the Gaussian mixture, and Figure 21 presents the performance of the principal curve projections for different numbers of components in the Gaussian mixture estimation; the results of 50 Monte Carlo simulations are shown. [sent-884, score-0.899]
95 Note that for increasing model orders, if the GMM has more components than the true underlying distribution, the generalization performance of the principal curve does not change significantly. [sent-885, score-0.802]
96 Discussions We proposed a novel definition that characterizes the principal curves and surfaces in terms of the gradient and the Hessian of the density estimate. [sent-887, score-0.943]
97 , the principal curve is the solution to some optimization problem without an analytical expression or property). [sent-896, score-0.802]
98 An important property of the definition is that it yields a unified framework for clustering, principal curve fitting and manifold learning. [sent-900, score-0.886]
99 Therefore, similar to existing methods in the principal curves literature, the proposed method is not an alternative to proximity-graph-based manifold learning methods like Isomap, Laplacian eigenmaps, etc. [sent-907, score-0.765]
100 Nonlinear coordinate unfolding via principal curve projections with application to BSS. [sent-1061, score-0.877]
wordName wordTfidf (topN-words)
[('principal', 0.547), ('kegl', 0.273), ('curve', 0.255), ('kde', 0.252), ('scms', 0.218), ('pdf', 0.163), ('surfaces', 0.153), ('rdogmus', 0.148), ('zertem', 0.148), ('ocally', 0.14), ('rincipal', 0.14), ('urfaces', 0.14), ('curves', 0.134), ('efined', 0.12), ('polygonal', 0.12), ('urves', 0.12), ('bandwidth', 0.11), ('hessian', 0.091), ('manifold', 0.084), ('density', 0.083), ('mse', 0.082), ('stuetzle', 0.08), ('eigenvectors', 0.079), ('ozertem', 0.078), ('projection', 0.074), ('surface', 0.066), ('logscale', 0.062), ('mimo', 0.062), ('vaerenbergh', 0.062), ('pca', 0.062), ('spiral', 0.06), ('kernel', 0.053), ('signal', 0.053), ('eigenvector', 0.051), ('gmm', 0.051), ('projections', 0.049), ('trajectories', 0.048), ('mixture', 0.048), ('erdogmus', 0.048), ('comaniciu', 0.047), ('hastie', 0.046), ('outlier', 0.044), ('clustering', 0.044), ('critical', 0.042), ('shift', 0.042), ('local', 0.041), ('orthogonal', 0.041), ('raftery', 0.041), ('channel', 0.041), ('equalization', 0.04), ('subspace', 0.04), ('intersections', 0.039), ('kryzak', 0.039), ('sandilya', 0.039), ('verbeek', 0.039), ('dim', 0.038), ('dimensionality', 0.036), ('regular', 0.036), ('reassignment', 0.036), ('curvature', 0.036), ('eigenvalues', 0.035), ('stanford', 0.035), ('nonlinear', 0.033), ('iff', 0.033), ('curvilinear', 0.031), ('duchamp', 0.031), ('meer', 0.031), ('ozdemir', 0.031), ('skeletonization', 0.031), ('piecewise', 0.03), ('eigendecomposition', 0.03), ('anisotropic', 0.03), ('gaussian', 0.03), ('vt', 0.029), ('pl', 0.029), ('samples', 0.029), ('tting', 0.028), ('curse', 0.027), ('estimation', 0.027), ('robustness', 0.027), ('helix', 0.027), ('kulkarni', 0.027), ('self', 0.026), ('optical', 0.026), ('gradient', 0.026), ('unfolding', 0.026), ('tangent', 0.026), ('saddle', 0.026), ('denoising', 0.025), ('derivatives', 0.025), ('gt', 0.025), ('blind', 0.025), ('dimensions', 0.025), ('spanned', 0.025), ('minor', 0.024), ('noise', 0.024), ('covariance', 0.024), ('ban', 0.023), ('chacon', 0.023), ('delicado', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000014 60 jmlr-2011-Locally Defined Principal Curves and Surfaces
Author: Umut Ozertem, Deniz Erdogmus
Abstract: Principal curves are defined as self-consistent smooth curves passing through the middle of the data, and they have been used in many applications of machine learning as a generalization, dimensionality reduction and a feature extraction tool. We redefine principal curves and surfaces in terms of the gradient and the Hessian of the probability density estimate. This provides a geometric understanding of the principal curves and surfaces, as well as a unifying view for clustering, principal curve fitting and manifold learning by regarding those as principal manifolds of different intrinsic dimensionalities. The theory does not impose any particular density estimation method; it can be used with any density estimator that gives continuous first and second derivatives. Therefore, we first present our principal curve/surface definition without assuming any particular density estimation method. Afterwards, we develop practical algorithms for the commonly used kernel density estimation (KDE) and Gaussian mixture models (GMM). Results of these algorithms are presented in notional data sets as well as real applications with comparisons to other approaches in the principal curve literature. All in all, we present a novel theoretical understanding of principal curves and surfaces, practical algorithms as general purpose machine learning tools, and applications of these algorithms to several practical problems. Keywords: unsupervised learning, dimensionality reduction, principal curves, principal surfaces, subspace constrained mean-shift
2 0.12754221 3 jmlr-2011-A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis
Author: Trine Julie Abrahamsen, Lars Kai Hansen
Abstract: Small sample high-dimensional principal component analysis (PCA) suffers from variance inflation and lack of generalizability. It has earlier been pointed out that a simple leave-one-out variance renormalization scheme can cure the problem. In this paper we generalize the cure in two directions: First, we propose a computationally less intensive approximate leave-one-out estimator, secondly, we show that variance inflation is also present in kernel principal component analysis (kPCA) and we provide a non-parametric renormalization scheme which can quite efficiently restore generalizability in kPCA. As for PCA our analysis also suggests a simplified approximate expression. Keywords: PCA, kernel PCA, generalizability, variance renormalization
3 0.07517729 48 jmlr-2011-Kernel Analysis of Deep Networks
Author: Grégoire Montavon, Mikio L. Braun, Klaus-Robert Müller
Abstract: When training deep networks it is common knowledge that an efficient and well generalizing representation of the problem is formed. In this paper we aim to elucidate what makes the emerging representation successful. We analyze the layer-wise evolution of the representation in a deep network by building a sequence of deeper and deeper kernels that subsume the mapping performed by more and more layers of the deep network and measuring how these increasingly complex kernels fit the learning problem. We observe that deep networks create increasingly better representations of the learning problem and that the structure of the deep network controls how fast the representation of the task is formed layer after layer. Keywords: deep networks, kernel principal component analysis, representations
4 0.063851871 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
Author: Gilles Meyer, Silvère Bonnabel, Rodolphe Sepulchre
Abstract: The paper addresses the problem of learning a regression model parameterized by a fixed-rank positive semidefinite matrix. The focus is on the nonlinear nature of the search space and on scalability to high-dimensional problems. The mathematical developments rely on the theory of gradient descent algorithms adapted to the Riemannian geometry that underlies the set of fixedrank positive semidefinite matrices. In contrast with previous contributions in the literature, no restrictions are imposed on the range space of the learned matrix. The resulting algorithms maintain a linear complexity in the problem size and enjoy important invariance properties. We apply the proposed algorithms to the problem of learning a distance function parameterized by a positive semidefinite matrix. Good performance is observed on classical benchmarks. Keywords: linear regression, positive semidefinite matrices, low-rank approximation, Riemannian geometry, gradient-based learning
5 0.0595529 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors
Author: Kris De Brabanter, Jos De Brabanter, Johan A.K. Suykens, Bart De Moor
Abstract: It is a well-known problem that obtaining a correct bandwidth and/or smoothing parameter in nonparametric regression is difficult in the presence of correlated errors. There exist a wide variety of methods coping with this problem, but they all critically depend on a tuning procedure which requires accurate information about the correlation structure. We propose a bandwidth selection procedure based on bimodal kernels which successfully removes the correlation without requiring any prior knowledge about its structure and its parameters. Further, we show that the form of the kernel is very important when errors are correlated which is in contrast to the independent and identically distributed (i.i.d.) case. Finally, some extensions are proposed to use the proposed criterion in support vector machines and least squares support vector machines for regression. Keywords: nonparametric regression, correlated errors, bandwidth choice, cross-validation, shortrange dependence, bimodal kernel
6 0.053035606 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
7 0.049873047 10 jmlr-2011-Anechoic Blind Source Separation Using Wigner Marginals
8 0.049190026 55 jmlr-2011-Learning Multi-modal Similarity
9 0.043865021 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
10 0.040627494 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
11 0.036643691 35 jmlr-2011-Forest Density Estimation
12 0.032003161 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
13 0.031475089 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
14 0.031089099 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
15 0.029994052 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
16 0.025985301 105 jmlr-2011-lp-Norm Multiple Kernel Learning
17 0.02544518 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
18 0.025436843 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
19 0.02426696 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
20 0.024136527 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
topicId topicWeight
[(0, 0.163), (1, -0.027), (2, 0.016), (3, -0.046), (4, -0.024), (5, -0.031), (6, 0.0), (7, -0.099), (8, -0.119), (9, -0.034), (10, 0.047), (11, 0.084), (12, 0.128), (13, -0.019), (14, -0.175), (15, 0.012), (16, 0.262), (17, 0.059), (18, -0.075), (19, -0.272), (20, 0.059), (21, 0.184), (22, 0.238), (23, 0.046), (24, 0.19), (25, 0.163), (26, -0.083), (27, -0.032), (28, -0.056), (29, -0.068), (30, -0.073), (31, -0.011), (32, 0.09), (33, 0.163), (34, 0.05), (35, -0.1), (36, -0.059), (37, -0.05), (38, 0.107), (39, -0.015), (40, -0.001), (41, 0.072), (42, -0.077), (43, -0.139), (44, 0.05), (45, 0.071), (46, -0.11), (47, -0.025), (48, 0.005), (49, 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.9781372 60 jmlr-2011-Locally Defined Principal Curves and Surfaces
Author: Umut Ozertem, Deniz Erdogmus
Abstract: Principal curves are defined as self-consistent smooth curves passing through the middle of the data, and they have been used in many applications of machine learning as a generalization, dimensionality reduction and a feature extraction tool. We redefine principal curves and surfaces in terms of the gradient and the Hessian of the probability density estimate. This provides a geometric understanding of the principal curves and surfaces, as well as a unifying view for clustering, principal curve fitting and manifold learning by regarding those as principal manifolds of different intrinsic dimensionalities. The theory does not impose any particular density estimation method; it can be used with any density estimator that gives continuous first and second derivatives. Therefore, we first present our principal curve/surface definition without assuming any particular density estimation method. Afterwards, we develop practical algorithms for the commonly used kernel density estimation (KDE) and Gaussian mixture models (GMM). Results of these algorithms are presented in notional data sets as well as real applications with comparisons to other approaches in the principal curve literature. All in all, we present a novel theoretical understanding of principal curves and surfaces, practical algorithms as general purpose machine learning tools, and applications of these algorithms to several practical problems. Keywords: unsupervised learning, dimensionality reduction, principal curves, principal surfaces, subspace constrained mean-shift
2 0.77519 3 jmlr-2011-A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis
Author: Trine Julie Abrahamsen, Lars Kai Hansen
Abstract: Small sample high-dimensional principal component analysis (PCA) suffers from variance inflation and lack of generalizability. It has earlier been pointed out that a simple leave-one-out variance renormalization scheme can cure the problem. In this paper we generalize the cure in two directions: First, we propose a computationally less intensive approximate leave-one-out estimator, secondly, we show that variance inflation is also present in kernel principal component analysis (kPCA) and we provide a non-parametric renormalization scheme which can quite efficiently restore generalizability in kPCA. As for PCA our analysis also suggests a simplified approximate expression. Keywords: PCA, kernel PCA, generalizability, variance renormalization
3 0.40193778 48 jmlr-2011-Kernel Analysis of Deep Networks
Author: Grégoire Montavon, Mikio L. Braun, Klaus-Robert Müller
Abstract: When training deep networks it is common knowledge that an efficient and well generalizing representation of the problem is formed. In this paper we aim to elucidate what makes the emerging representation successful. We analyze the layer-wise evolution of the representation in a deep network by building a sequence of deeper and deeper kernels that subsume the mapping performed by more and more layers of the deep network and measuring how these increasingly complex kernels fit the learning problem. We observe that deep networks create increasingly better representations of the learning problem and that the structure of the deep network controls how fast the representation of the task is formed layer after layer. Keywords: deep networks, kernel principal component analysis, representations
4 0.36206368 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors
Author: Kris De Brabanter, Jos De Brabanter, Johan A.K. Suykens, Bart De Moor
Abstract: It is a well-known problem that obtaining a correct bandwidth and/or smoothing parameter in nonparametric regression is difficult in the presence of correlated errors. There exist a wide variety of methods coping with this problem, but they all critically depend on a tuning procedure which requires accurate information about the correlation structure. We propose a bandwidth selection procedure based on bimodal kernels which successfully removes the correlation without requiring any prior knowledge about its structure and its parameters. Further, we show that the form of the kernel is very important when errors are correlated which is in contrast to the independent and identically distributed (i.i.d.) case. Finally, some extensions are proposed to use the proposed criterion in support vector machines and least squares support vector machines for regression. Keywords: nonparametric regression, correlated errors, bandwidth choice, cross-validation, shortrange dependence, bimodal kernel
5 0.35952491 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
Author: Gilles Meyer, Silvère Bonnabel, Rodolphe Sepulchre
Abstract: The paper addresses the problem of learning a regression model parameterized by a fixed-rank positive semidefinite matrix. The focus is on the nonlinear nature of the search space and on scalability to high-dimensional problems. The mathematical developments rely on the theory of gradient descent algorithms adapted to the Riemannian geometry that underlies the set of fixedrank positive semidefinite matrices. In contrast with previous contributions in the literature, no restrictions are imposed on the range space of the learned matrix. The resulting algorithms maintain a linear complexity in the problem size and enjoy important invariance properties. We apply the proposed algorithms to the problem of learning a distance function parameterized by a positive semidefinite matrix. Good performance is observed on classical benchmarks. Keywords: linear regression, positive semidefinite matrices, low-rank approximation, Riemannian geometry, gradient-based learning
6 0.34187296 10 jmlr-2011-Anechoic Blind Source Separation Using Wigner Marginals
7 0.26846367 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
8 0.21507576 55 jmlr-2011-Learning Multi-modal Similarity
9 0.19405907 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
10 0.17793325 102 jmlr-2011-Waffles: A Machine Learning Toolkit
11 0.17288902 15 jmlr-2011-CARP: Software for Fishing Out Good Clustering Algorithms
12 0.16996829 35 jmlr-2011-Forest Density Estimation
13 0.16516049 27 jmlr-2011-Domain Decomposition Approach for Fast Gaussian Process Regression of Large Spatial Data Sets
14 0.16473468 56 jmlr-2011-Learning Transformation Models for Ranking and Survival Analysis
15 0.1644261 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
16 0.1625208 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
17 0.1594706 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing
18 0.1574088 76 jmlr-2011-Parameter Screening and Optimisation for ILP using Designed Experiments
19 0.15654187 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels
20 0.15040235 50 jmlr-2011-LPmade: Link Prediction Made Easy
topicId topicWeight
[(4, 0.039), (9, 0.029), (10, 0.028), (11, 0.013), (24, 0.028), (31, 0.09), (32, 0.029), (41, 0.02), (60, 0.017), (65, 0.015), (70, 0.021), (71, 0.01), (73, 0.042), (78, 0.055), (86, 0.447), (87, 0.018), (90, 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.75793582 60 jmlr-2011-Locally Defined Principal Curves and Surfaces
Author: Umut Ozertem, Deniz Erdogmus
Abstract: Principal curves are defined as self-consistent smooth curves passing through the middle of the data, and they have been used in many applications of machine learning as a generalization, dimensionality reduction and a feature extraction tool. We redefine principal curves and surfaces in terms of the gradient and the Hessian of the probability density estimate. This provides a geometric understanding of the principal curves and surfaces, as well as a unifying view for clustering, principal curve fitting and manifold learning by regarding those as principal manifolds of different intrinsic dimensionalities. The theory does not impose any particular density estimation method; it can be used with any density estimator that gives continuous first and second derivatives. Therefore, we first present our principal curve/surface definition without assuming any particular density estimation method. Afterwards, we develop practical algorithms for the commonly used kernel density estimation (KDE) and Gaussian mixture models (GMM). Results of these algorithms are presented in notional data sets as well as real applications with comparisons to other approaches in the principal curve literature. All in all, we present a novel theoretical understanding of principal curves and surfaces, practical algorithms as general purpose machine learning tools, and applications of these algorithms to several practical problems. Keywords: unsupervised learning, dimensionality reduction, principal curves, principal surfaces, subspace constrained mean-shift
2 0.63548821 66 jmlr-2011-Multiple Kernel Learning Algorithms
Author: Mehmet Gönen, Ethem Alpaydın
Abstract: In recent years, several methods have been proposed to combine multiple kernels instead of using a single one. These different kernels may correspond to using different notions of similarity or may be using information coming from multiple sources (different representations or different feature subsets). In trying to organize and highlight the similarities and differences between them, we give a taxonomy of and review several multiple kernel learning algorithms. We perform experiments on real data sets for better illustration and comparison of existing algorithms. We see that though there may not be large differences in terms of accuracy, there is difference between them in complexity as given by the number of stored support vectors, the sparsity of the solution as given by the number of used kernels, and training time complexity. We see that overall, using multiple kernels instead of a single one is useful and believe that combining kernels in a nonlinear or data-dependent way seems more promising than linear combination in fusing information provided by simple linear kernels, whereas linear methods are more reasonable when combining complex Gaussian kernels. Keywords: support vector machines, kernel machines, multiple kernel learning
3 0.30642295 48 jmlr-2011-Kernel Analysis of Deep Networks
Author: Grégoire Montavon, Mikio L. Braun, Klaus-Robert Müller
Abstract: When training deep networks it is common knowledge that an efficient and well generalizing representation of the problem is formed. In this paper we aim to elucidate what makes the emerging representation successful. We analyze the layer-wise evolution of the representation in a deep network by building a sequence of deeper and deeper kernels that subsume the mapping performed by more and more layers of the deep network and measuring how these increasingly complex kernels fit the learning problem. We observe that deep networks create increasingly better representations of the learning problem and that the structure of the deep network controls how fast the representation of the task is formed layer after layer. Keywords: deep networks, kernel principal component analysis, representations
4 0.30154711 3 jmlr-2011-A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis
Author: Trine Julie Abrahamsen, Lars Kai Hansen
Abstract: Small sample high-dimensional principal component analysis (PCA) suffers from variance inflation and lack of generalizability. It has earlier been pointed out that a simple leave-one-out variance renormalization scheme can cure the problem. In this paper we generalize the cure in two directions: First, we propose a computationally less intensive approximate leave-one-out estimator, secondly, we show that variance inflation is also present in kernel principal component analysis (kPCA) and we provide a non-parametric renormalization scheme which can quite efficiently restore generalizability in kPCA. As for PCA our analysis also suggests a simplified approximate expression. Keywords: PCA, kernel PCA, generalizability, variance renormalization
5 0.26875818 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
Author: Jinfeng Zhuang, Ivor W. Tsang, Steven C.H. Hoi
Abstract: Previous studies of Non-Parametric Kernel Learning (NPKL) usually formulate the learning task as a Semi-Definite Programming (SDP) problem that is often solved by some general purpose SDP solvers. However, for N data examples, the time complexity of NPKL using a standard interiorpoint SDP solver could be as high as O(N 6.5 ), which prohibits NPKL methods applicable to real applications, even for data sets of moderate size. In this paper, we present a family of efficient NPKL algorithms, termed “SimpleNPKL”, which can learn non-parametric kernels from a large set of pairwise constraints efficiently. In particular, we propose two efficient SimpleNPKL algorithms. One is SimpleNPKL algorithm with linear loss, which enjoys a closed-form solution that can be efficiently computed by the Lanczos sparse eigen decomposition technique. Another one is SimpleNPKL algorithm with other loss functions (including square hinge loss, hinge loss, square loss) that can be re-formulated as a saddle-point optimization problem, which can be further resolved by a fast iterative algorithm. In contrast to the previous NPKL approaches, our empirical results show that the proposed new technique, maintaining the same accuracy, is significantly more efficient and scalable. Finally, we also demonstrate that the proposed new technique is also applicable to speed up many kernel learning tasks, including colored maximum variance unfolding, minimum volume embedding, and structure preserving embedding. Keywords: non-parametric kernel learning, semi-definite programming, semi-supervised learning, side information, pairwise constraints
6 0.26825726 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
7 0.26686725 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
8 0.26609847 12 jmlr-2011-Bayesian Co-Training
9 0.26471901 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
10 0.26420814 91 jmlr-2011-The Sample Complexity of Dictionary Learning
11 0.26286229 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
12 0.26262036 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
13 0.26252356 35 jmlr-2011-Forest Density Estimation
14 0.26248682 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood
15 0.26243061 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
16 0.26160908 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
17 0.26016924 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
18 0.25960642 16 jmlr-2011-Clustering Algorithms for Chains
19 0.25861886 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
20 0.25834748 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models