jmlr jmlr2011 jmlr2011-60 knowledge-graph by maker-knowledge-mining

60 jmlr-2011-Locally Defined Principal Curves and Surfaces


Source: pdf

Author: Umut Ozertem, Deniz Erdogmus

Abstract: Principal curves are defined as self-consistent smooth curves passing through the middle of the data, and they have been used in many applications of machine learning as a generalization, dimensionality reduction and a feature extraction tool. We redefine principal curves and surfaces in terms of the gradient and the Hessian of the probability density estimate. This provides a geometric understanding of the principal curves and surfaces, as well as a unifying view for clustering, principal curve fitting and manifold learning by regarding those as principal manifolds of different intrinsic dimensionalities. The theory does not impose any particular density estimation method and can be used with any density estimator that gives continuous first and second derivatives. Therefore, we first present our principal curve/surface definition without assuming any particular density estimation method. Afterwards, we develop practical algorithms for the commonly used kernel density estimation (KDE) and Gaussian mixture models (GMM). Results of these algorithms are presented in notional data sets as well as real applications with comparisons to other approaches in the principal curve literature. All in all, we present a novel theoretical understanding of principal curves and surfaces, practical algorithms as general purpose machine learning tools, and applications of these algorithms to several practical problems. Keywords: unsupervised learning, dimensionality reduction, principal curves, principal surfaces, subspace constrained mean-shift

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We redefine principal curves and surfaces in terms of the gradient and the Hessian of the probability density estimate. [sent-7, score-0.943]

2 This provides a geometric understanding of the principal curves and surfaces, as well as a unifying view for clustering, principal curve fitting and manifold learning by regarding those as principal manifolds of different intrinsic dimensionalities. [sent-8, score-2.114]

3 Results of these algorithms are presented in notional data sets as well as real applications with comparisons to other approaches in the principal curve literature. [sent-12, score-0.802]

4 Keywords: unsupervised learning, dimensionality reduction, principal curves, principal surfaces, subspace constrained mean-shift. [sent-14, score-1.17]

5 In fact, this forms the basic idea behind the original principal curve definition by Hastie (1984); Hastie and Stuetzle (1989). [sent-19, score-0.802]

6 At the time, Hastie and Stuetzle’s proposition of self-consistent principal curves (Hastie, 1984; Hastie and Stuetzle, 1989) pointed out a different track for nonlinear dimensionality reduction. [sent-39, score-0.776]

7 They defined self-consistency over the local conditional data expectations, and generalized the self-consistency property of the principal line into nonlinear structures to introduce the concept of principal curves. [sent-40, score-1.168]

8 Hastie and Stuetzle define the principal curve as an infinitely differentiable finite length curve that passes through the middle of the data. [sent-41, score-1.057]

9 Hastie and Stuetzle’s major theoretical contributions are the following: (i) they show that if a straight line is self-consistent, it is a principal component; (ii) based on the MSE criterion, self-consistent principal curves are saddle points of the distance function. [sent-43, score-1.254]

10 They use this second property to develop an algorithm that starts from the principal line and iteratively finds the principal curve by minimizing the average squared distance of the data points and the curve (Hastie, 1984; Hastie and Stuetzle, 1989). [sent-44, score-1.604]

11 Although they cannot prove the convergence of their algorithm, Hastie and Stuetzle claim that principal curves are by definition a fixed point of their algorithm, and if the projection step of their algorithm is replaced with least squares line fitting, the algorithm converges to the principal line. [sent-45, score-1.302]

12 Banfield and Raftery extend the Hastie-Stuetzle principal curve algorithm to closed curves and propose an algorithm that reduces the estimation bias (Banfield and Raftery, 1992). [sent-47, score-0.963]

13 Delicado’s method is based on the total variance and conditional means and finds the principal curve of oriented points of the data set. [sent-50, score-0.802]

14 The probabilistic principal curves approach, which uses a cubic spline over a mixture of Gaussians to estimate the principal curves/surfaces (Chang and Ghosh, 2002), is known to be among the most successful methods to overcome the common problem of bias introduced in the regions of high curvature. [sent-52, score-1.276]

15 Verbeek and coworkers used local principal lines to construct principal curves (Verbeek et al. [sent-53, score-1.269]

16 Many principal curve approaches in the literature, including the original Hastie-Stuetzle algorithm, are based on the idea of minimizing mean square projection error. [sent-59, score-0.876]

17 Kegl and colleagues provide a regularized version of Hastie’s definition by bounding the total length of the principal curve to avoid overfitting (Kegl et al. [sent-61, score-0.802]

18 Sandilya and Kulkarni define the regularization in another way by constraining bounds on the turns of the principal curve (Sandilya and Kulkarni, 2002). [sent-63, score-0.802]

19 Similar to Kegl’s principal curve definition of bounded length, they also show that principal curves with bounded turn always exist if the data distribution has finite second moments. [sent-64, score-1.483]

20 At this point, note that the original Hastie-Stuetzle definition requires the principal curve not to intersect itself, which is quite restrictive, and perhaps, Kegl’s Principal Graph algorithm is the only approach in the principal curves literature that can handle self-intersecting data. [sent-66, score-1.483]

21 Overall, the original principal curve definition by Hastie and Stuetzle forms a strong basis for many, possibly all, principal curve algorithms. [sent-67, score-1.604]

22 Here we take a bold step by defining the principal curves with no explicit smoothness constraint at all; we assume that smoothness of principal curves/surfaces is inherent in the smoothness of the underlying probability density (estimate). [sent-70, score-1.311]

23 Providing the definition in terms of the data probability density allows us to link open-ended problems of the principal curve fitting literature—like optimal regularization constraints and outlier robustness—to well-established principles in the density estimation literature. [sent-71, score-1.039]

24 , the same formulas apply to both kernel density and mixture model estimates with minor modifications), (iii) the convergence of the algorithm to a point on the principal surface with appropriate dimensionality is guaranteed for any initial point, since mean-shift is a convergent procedure. [sent-79, score-0.857]
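
For context, the mean-shift fixed point mentioned here has a well-known closed form for a Gaussian KDE; the following is a standard restatement rather than a formula quoted from the paper, and the bandwidth \sigma is left unspecified:

m(x) = \frac{\sum_{i=1}^{N} x_i \exp\left(-\|x - x_i\|^2 / 2\sigma^2\right)}{\sum_{i=1}^{N} \exp\left(-\|x - x_i\|^2 / 2\sigma^2\right)}, \qquad x \leftarrow m(x),

and iterating this update converges to a local maximum (mode) of the density estimate.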

25 Our principal curve definition essentially corresponds to the ridge of the probability density function. [sent-86, score-0.885]

26 Hastie’s self-consistency principle states that every point on the principal curve is the expected value of the points in the orthogonal subspace of the principal curve at that point—and this orthogonal space rotates along the curve. [sent-88, score-1.726]
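
In symbols, Hastie's self-consistency condition is usually stated as follows (a textbook restatement, not a formula quoted from this paper): a smooth curve f is self-consistent for a random vector X if

f(t) = E\left[ X \mid \lambda_f(X) = t \right], \qquad \lambda_f(x) = \arg\min_t \|x - f(t)\|,

where \lambda_f(x) is the projection index, the parameter of the point on the curve closest to x.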

27 In our view, every point on the principal surface is the local maximum, not the expected value, of the probability density in the local orthogonal subspace. [sent-89, score-0.819]

28 On the left, a comparison of the proposed principal curve projection and the trajectories of the gradient of the pdf is presented. [sent-98, score-1.039]

29 On the right, we present the principal curve of a 7-component Gaussian mixture from two different points of view. [sent-102, score-0.85]

30 Inspired by differential geometry where principal lines of curvature are well-defined and understood, we define the principal curves and surfaces in terms of the first and second order derivatives of the assumed probability density function. [sent-105, score-1.525]

31 In practice, while principal surfaces might be useful in dimension reduction as in manifold learning, minor surfaces (valleys in the probability density function) can be useful in semi-supervised learning. [sent-163, score-0.891]

32 Consequently, at any point on the d-dimensional principal set (or critical or minor sets), the points of the principal set in a small open ball around this point form a continuous surface. [sent-177, score-1.16]

33 In general, the order of continuous differentiability of the underlying pdf model is reflected in the emerging principal surfaces and projection trajectories accordingly. [sent-180, score-0.985]

34 Observe this inclusion property by revisiting Figure 1, as the major principal curve (shown in the figures on the right) passes through all local maxima of the Gaussian mixture density. [sent-184, score-0.891]

35 If one projects points in higher dimensional surfaces to lower dimensional principal surfaces following trajectories traced by the Hessian of f(p(x)), these projection trajectories will depend on f. [sent-199, score-1.023]

36 Consequently, the local Hessian eigendirections form linear trajectories and principal surfaces become hyperplanes spanned by the eigenvectors of the Gaussian’s covariance matrix. [sent-203, score-0.917]

37 (Existence of Principal Curves and Surfaces) Considering Hastie’s principal curve definition, the existence proof of principal curves is limited to some special cases, such as elliptical or spherical distributions concentrated around a smooth curve. [sent-221, score-1.483]

38 It should also be noted that this definition of the principal curve requires the principal curve not to intersect itself. [sent-222, score-1.604]

39 (2000) and Sandilya and Kulkarni (2002) are theoretically more appealing in this context, since by their definition, the principal curve always exists if the distribution has finite second moments. [sent-224, score-0.802]

40 According to our definition, the principal curve exists as long as the data probability density is twice differentiable, such that the Hessian is nonzero. [sent-225, score-0.885]

41 However, also note that by our definition the principal curve does not exist for uniform distributions. [sent-227, score-0.802]

42 Note that, since it coincides with PCA for Gaussian distributions, our principal curve definition also has the ambiguity that occurs in PCA; the principal surface of a spherically symmetric distribution is not well-defined. [sent-231, score-1.415]

43 Conditional expectation or mean squared projection error based definitions have driven the principal curves research, but in general, the definition is limited to the nonlinear counterpart of the first principal component. [sent-232, score-1.335]

44 principal curve in the literature that we are aware of. [sent-234, score-0.802]

45 A point is on the one dimensional principal surface iff the local gradient is an eigenvector of the local Hessian—since the gradient has to be orthogonal to the other (n − 1) eigenvectors—and the corresponding (n − 1) eigenvalues are negative. [sent-249, score-0.907]
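
As a concrete illustration of this condition, below is a minimal numerical check (not taken from the paper) that a point lies on the one-dimensional principal surface, given the gradient and Hessian of some density model evaluated at that point; the tolerance is an arbitrary placeholder.

import numpy as np

def on_principal_curve(grad, hess, tol=1e-3):
    # Eigendecompose the local Hessian (eigenvalues returned in ascending order).
    eigvals, eigvecs = np.linalg.eigh(hess)
    g = grad / (np.linalg.norm(grad) + 1e-12)
    # Absolute cosine between the gradient and each Hessian eigenvector.
    align = np.abs(eigvecs.T @ g)
    k = int(np.argmax(align))                        # eigendirection the gradient follows
    rest = np.delete(np.arange(len(eigvals)), k)     # remaining n-1 eigendirections
    gradient_is_eigvec = np.all(align[rest] < tol)   # gradient orthogonal to the other eigenvectors
    local_max_orthogonal = np.all(eigvals[rest] < 0) # negative curvature in those directions
    return gradient_is_eigvec and local_max_orthogonal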

46 The iterations can be used to find the principal curve projection of any arbitrary point of interest in the feature space. [sent-281, score-0.876]

47 To find the principal curve projections of the data samples, a suitable way is to initialize the projection trajectories to the data samples themselves, as in mean-shift clustering. [sent-282, score-1.002]

48 The general version of the SCMS algorithm that converges to the d-dimensional principal manifold is presented in Table 1, and the SCMS principal curve algorithm can simply be obtained by setting d = 1. [sent-283, score-1.433]
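
Table 1 is not reproduced in this summary; the sketch below is a hedged reconstruction of a subspace constrained mean-shift iteration for a Gaussian KDE, following the commonly cited form of SCMS (the mean-shift step is projected onto the local eigendirections of the log-density Hessian that are orthogonal to the d-dimensional manifold). The bandwidth, stopping rule, and use of the unnormalized KDE are simplifications, not details taken from the paper.

import numpy as np

def scms_project(x, data, sigma, d=1, iters=200, tol=1e-6):
    # Project a single point x onto the d-dimensional principal set of a
    # Gaussian KDE built on `data` (an N x D array).
    N, D = data.shape
    x = np.asarray(x, dtype=float)
    for _ in range(iters):
        diff = data - x                                     # (N, D) differences x_i - x
        w = np.exp(-0.5 * np.sum(diff**2, axis=1) / sigma**2)
        p = w.sum()                                         # unnormalized KDE value
        grad = (w[:, None] * diff).sum(axis=0) / sigma**2   # gradient of the unnormalized KDE
        hess = (np.einsum('n,ni,nj->ij', w, diff, diff) / sigma**4
                - p * np.eye(D) / sigma**2)                 # Hessian of the unnormalized KDE
        hess_log = hess / p - np.outer(grad, grad) / p**2   # Hessian of log p
        m = (w[:, None] * data).sum(axis=0) / p             # mean-shift target
        eigvals, eigvecs = np.linalg.eigh(hess_log)         # ascending eigenvalues
        V = eigvecs[:, :D - d]                              # directions orthogonal to the manifold
        step = V @ (V.T @ (m - x))                          # constrain the shift to that subspace
        x = x + step
        if np.linalg.norm(step) < tol:
            break
    return x

Initializing x at each data sample, as suggested above, then yields the principal curve projections of the data set.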

49 (Statistical Consistency) Our algorithmic approach has used the powerful kernel density estimation (KDE) technique to estimate principal surface structures that underlie data densities. [sent-320, score-0.776]

50 In principal surfaces, however, we rely on the accurate estimation of the first and second derivatives of the multivariate data density along with the density itself. [sent-322, score-0.765]
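
For a Gaussian kernel with bandwidth \sigma, these derivatives have standard closed forms (a textbook restatement; the paper's anisotropic and variable-bandwidth variants modify these accordingly):

p(x) = \frac{1}{N}\sum_{i=1}^{N} K_\sigma(x - x_i), \qquad
\nabla p(x) = \frac{1}{N\sigma^2}\sum_{i=1}^{N} K_\sigma(x - x_i)\,(x_i - x),

\nabla^2 p(x) = \frac{1}{N\sigma^2}\sum_{i=1}^{N} K_\sigma(x - x_i)\left(\frac{(x_i - x)(x_i - x)^\top}{\sigma^2} - I\right),

with K_\sigma(u) \propto \exp(-\|u\|^2 / 2\sigma^2).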

51 (Outlier Robustness) Outlier robustness is another key issue in the principal curve literature. [sent-333, score-0.829]

52 Such approaches are known to be sensitive to noise, and the presence of outlier data samples will, of course, bias the principal curve towards the outliers. [sent-335, score-0.846]

53 In fact, this data set is similar to the illustration that Stanford and Raftery use as they propose their noise robust principal curve approach (Stanford and Raftery, 2000). [sent-386, score-0.826]

54 Considering the data set and principal curves in Figure 4 (left), Kegl asks which of the curves is the right one. [sent-395, score-0.815]

55 In the first part, we provide comparisons with some earlier principal curve algorithms in the literature. [sent-417, score-0.802]

56 Like the principal line, principal curves—in our definition—extend to infinity. [sent-423, score-1.094]

57 Therefore, throughout this section, rather than populating samples on a curve that extends to infinity, we prefer representing the principal curve with the data samples projected onto it, so that the curves in the plots remain in the support of the data. [sent-425, score-1.796]

58 Figure 5: Zig-zag data set, and Hastie-Stuetzle principal curve. [sent-441, score-0.802]

59 (Comparisons with Other Principal Curve Methods) In this section we present comparisons with the original Hastie-Stuetzle principal curve method (Hastie, 1984; Hastie and Stuetzle, 1989) and the Polygonal Line Algorithm by Kegl et al. [sent-442, score-0.802]

60 (Zig-Zag Data Set) The zig-zag data set has been used in an earlier principal curve paper by Kegl et al. [sent-447, score-0.802]

61 Here—and also for the rest of the paper—we present the data projected onto the principal curve only, which depicts the portion of the principal curve in the support of the input data. [sent-470, score-1.604]

62 Figure 7: Zig-zag data set, and its principal curve projections obtained for KDE with isotropic constant bandwidth (top), KDE with anisotropic and data-dependent covariance (middle), and Gaussian mixture with 4 components (bottom). [sent-525, score-1.063]

63 Figure 8: Spiral data set, and Hastie-Stuetzle principal curve. Similar to the previous example, we start with the results of the Hastie-Stuetzle algorithm and Kegl’s polygonal line algorithm. [sent-536, score-0.922]

64 Comparing Figure 9 and Figure 10, one can see that both the Polygonal Line algorithm—with suitable parameters—and our locally defined principal curve can achieve satisfactory results. [sent-546, score-0.802]

65 At each noise level, we find the principal curve using both methods on the same noisy data set, and afterwards we take another 200 samples from the same generating curve and add the same amount of radial noise to use as the test set. [sent-554, score-1.134]

66 (Loops, Self-Intersections, and Bifurcations) Since they are specifically designed to fit smooth curves to the data, traditional principal curve fitting approaches in the literature have difficulties if there are loops, bifurcations, and self-intersections in the data. [sent-564, score-1.001]

67 Figure 10: Spiral data set, and KDE-SCMS principal curve. [sent-576, score-0.802]

68 Figure 11: Noisy spiral data set, and KDE-SCMS principal curve. For such irregularities, our definition yields a principal graph—a collection of smooth curves. [sent-586, score-1.409]

69 (Extending the Definition to Higher Dimensional Manifolds) The generalization of principal curves to principal surfaces and higher-order manifolds is naturally achieved with our definition. [sent-591, score-1.381]

70 Here, note that the covariance of the helix data around the principal curve is not symmetric, and the horizontal dimension has a higher variance (and this is why the principal surface is spanned along this dimension). [sent-594, score-1.491]

71 If the helix had been symmetric around the principal curve, the principal surface would have been ill-defined. [sent-595, score-1.187]

72 We proposed to use principal curve projections as a nonparametric denoising filter at the preprocessing stage of time warping algorithms, which in general are prone to noise (Ozertem and Erdogmus, 2009). [sent-657, score-0.9]

73 The trajectories that take a point x to its projection on the d-dimensional principal surface can be used to obtain these curvilinear coordinates that specify the point with respect to some reference critical point that can be assumed to be the origin. [sent-669, score-0.808]

74 Figure 15(b) shows the principal curve of this time-frequency surface obtained by KDE-SCMS. [sent-688, score-0.868]

75 At this point, one can improve the noise robustness of the clustering by exploiting the fact that the clusters are curves in the feature space and applying spectral clustering to the principal curve projections instead of the data samples. [sent-712, score-1.124]

76 Figure 17 shows a result of KDE-SCMS for the same data set at a signal-to-noise ratio of 5 dB, along with the average normalized MSE (and ± one standard deviation) between the actual noise-free signal and the principal curve projection over 20 Monte Carlo runs. [sent-713, score-1.006]

77 The principal curve projection result can give a good estimate of the noise-free signal even at signal-to-noise ratios lower than 0 dB—where the noise power is greater than the signal power itself. [sent-714, score-1.083]

78 One significant problem with applying principal curve algorithms to the skeletonization of optical characters is that, by definition, the algorithms seek a smooth curve. [sent-719, score-0.859]

79 Figure 18 shows the binary images along with the principal curve projection of the pixels. [sent-725, score-0.876]

80 (Limitations, Finite Sample Effects, and the Curse of Dimensionality) Since our principal curve definition assumes the pdf to be given, it depends on the reliability of the preceding density estimation step, which in general may not be an easy task. [sent-728, score-1.075]

81 Therefore, before we move on to applications on real data, in this section we will present the performance of our principal curve fitting results for various density estimates with different numbers of samples and dimensions. [sent-730, score-0.914]

82 In all comparisons presented below, principal curve projections are obtained by the KDE-SCMS algorithm using the leave-one-out ML kernel bandwidth. [sent-733, score-0.904]
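
The leave-one-out ML kernel bandwidth referred to here can be chosen, for example, by maximizing the leave-one-out log-likelihood over a grid of candidates; the sketch below assumes an isotropic Gaussian kernel and a user-supplied candidate grid, which are simplifications rather than the paper's exact procedure.

import numpy as np

def loo_ml_bandwidth(data, sigmas):
    # Select the bandwidth maximizing the leave-one-out log-likelihood of an
    # isotropic Gaussian KDE. `data` is (N, D); `sigmas` is a candidate grid.
    N, D = data.shape
    sq_dists = np.sum((data[:, None, :] - data[None, :, :])**2, axis=-1)
    best_sigma, best_ll = None, -np.inf
    for sigma in sigmas:
        K = np.exp(-0.5 * sq_dists / sigma**2) / (2 * np.pi * sigma**2) ** (D / 2)
        np.fill_diagonal(K, 0.0)                 # leave each sample out of its own estimate
        loo_density = K.sum(axis=1) / (N - 1)
        ll = np.log(loo_density + 1e-300).sum()
        if ll > best_ll:
            best_sigma, best_ll = sigma, ll
    return best_sigma

For example, loo_ml_bandwidth(X, np.logspace(-2, 1, 30)) scans thirty log-spaced candidates; the grid bounds here are arbitrary placeholders.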

83 Indeed, results previously presented in this section show that KDE-based principal curve estimation proves to be efficient in adapting to many real-life data distributions across a diverse set of applications. [sent-786, score-0.829]

84 Figure 17: Signal samples and their principal curve projections; normalized MSE vs. signal-to-noise ratio (in dB). [sent-821, score-0.932]

85 In this scenario, we will compare the principal line estimator based on PCA to the principal curve based on KDE, for different numbers of dimensions. [sent-824, score-1.349]

86 Consider the data set {x_i}_{i=1}^N, Gaussian distributed in d-dimensional space, where v denotes the true principal line of this distribution, and v* denotes the principal line obtained by sample PCA. [sent-825, score-1.094]

87 mean squared distance between the projection of the data samples onto the true eigenvector and the principal curve projection x̂, E{ ||v v^T x − x̂||^2 }. [sent-828, score-1.03]

88 Figure 19 presents the MSE of the principal line (dashed curve) and principal curve (solid curve) projections for 2, 3, 4, 5, 10, 20, 30, and 40 dimensions; the average log MSE for 100 Monte Carlo simulations is shown. (Figure 18: SCMS results in optical characters.) [sent-829, score-1.424]

89 Principal line projection always results in better accuracy, and the performance of principal curve projections drops exponentially for increasing dimensions. [sent-831, score-0.925]

90 As also implied in the previous section by the comparison against PCA on a Gaussian data set, with a parametric approach of suitable model order the algorithm would need fewer samples to achieve good principal curve estimates. [sent-838, score-0.831]

91 Here we will evaluate the stability of principal curve estimation with GMM-SCMS for improper model order selections in the GMM density estimation step, and compare the principal curve projection results for a Gaussian mixture with 3 components. [sent-839, score-1.863]
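
For a Gaussian mixture p(x) = \sum_k \pi_k \, \mathcal{N}(x; \mu_k, \Sigma_k), the gradient and Hessian needed by GMM-SCMS have standard closed forms (again a textbook restatement rather than the paper's exact expressions):

\nabla p(x) = -\sum_k \pi_k\,\mathcal{N}(x;\mu_k,\Sigma_k)\,\Sigma_k^{-1}(x-\mu_k),

\nabla^2 p(x) = \sum_k \pi_k\,\mathcal{N}(x;\mu_k,\Sigma_k)\left(\Sigma_k^{-1}(x-\mu_k)(x-\mu_k)^\top\Sigma_k^{-1} - \Sigma_k^{-1}\right),

so the same subspace constrained mean-shift machinery applies with these quantities substituted for the KDE derivatives.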

92 Since the true underlying density is known to have 3 components, we measure the performance of the principal curve projection results. (Figure: log(MSE) panels for 2 and 3 dimensions.) [sent-840, score-1.009]

93 Figure 19: Mean projection error in log_e scale for principal line (dashed) and principal curve (solid); horizontal axis: number of samples in log scale. [sent-880, score-1.514]

94 Figure 20 shows a realization of the Gaussian mixture, and Figure 21 presents the performance of the principal curve projections for different numbers of components in the Gaussian mixture estimation; results of 50 Monte Carlo simulations are shown. [sent-884, score-0.899]

95 Note that for increasing model orders, if the GMM has more components than the true underlying distribution, the generalization performance of the principal curve does not change significantly. [sent-885, score-0.802]

96 (Discussions) We proposed a novel definition that characterizes the principal curves and surfaces in terms of the gradient and the Hessian of the density estimate. [sent-887, score-0.943]

97 , the principal curve is the solution to some optimization problem without an analytical expression or property). [sent-896, score-0.802]

98 An important property of the definition is that it yields a unified framework for clustering, principal curve fitting and manifold learning. [sent-900, score-0.886]

99 Therefore, similar to existing methods in the principal curves literature, the proposed method is not an alternative to proximity-graph-based manifold learning methods like Isomap, Laplacian eigenmaps, etc. [sent-907, score-0.765]

100 Nonlinear coordinate unfolding via principal curve projections with application to BSS. [sent-1061, score-0.877]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('principal', 0.547), ('kegl', 0.273), ('curve', 0.255), ('kde', 0.252), ('scms', 0.218), ('pdf', 0.163), ('surfaces', 0.153), ('rdogmus', 0.148), ('zertem', 0.148), ('ocally', 0.14), ('rincipal', 0.14), ('urfaces', 0.14), ('curves', 0.134), ('efined', 0.12), ('polygonal', 0.12), ('urves', 0.12), ('bandwidth', 0.11), ('hessian', 0.091), ('manifold', 0.084), ('density', 0.083), ('mse', 0.082), ('stuetzle', 0.08), ('eigenvectors', 0.079), ('ozertem', 0.078), ('projection', 0.074), ('surface', 0.066), ('logscale', 0.062), ('mimo', 0.062), ('vaerenbergh', 0.062), ('pca', 0.062), ('spiral', 0.06), ('kernel', 0.053), ('signal', 0.053), ('eigenvector', 0.051), ('gmm', 0.051), ('projections', 0.049), ('trajectories', 0.048), ('mixture', 0.048), ('erdogmus', 0.048), ('comaniciu', 0.047), ('hastie', 0.046), ('outlier', 0.044), ('clustering', 0.044), ('critical', 0.042), ('shift', 0.042), ('local', 0.041), ('orthogonal', 0.041), ('raftery', 0.041), ('channel', 0.041), ('equalization', 0.04), ('subspace', 0.04), ('intersections', 0.039), ('kryzak', 0.039), ('sandilya', 0.039), ('verbeek', 0.039), ('dim', 0.038), ('dimensionality', 0.036), ('regular', 0.036), ('reassignment', 0.036), ('curvature', 0.036), ('eigenvalues', 0.035), ('stanford', 0.035), ('nonlinear', 0.033), ('iff', 0.033), ('curvilinear', 0.031), ('duchamp', 0.031), ('meer', 0.031), ('ozdemir', 0.031), ('skeletonization', 0.031), ('piecewise', 0.03), ('eigendecomposition', 0.03), ('anisotropic', 0.03), ('gaussian', 0.03), ('vt', 0.029), ('pl', 0.029), ('samples', 0.029), ('tting', 0.028), ('curse', 0.027), ('estimation', 0.027), ('robustness', 0.027), ('helix', 0.027), ('kulkarni', 0.027), ('self', 0.026), ('optical', 0.026), ('gradient', 0.026), ('unfolding', 0.026), ('tangent', 0.026), ('saddle', 0.026), ('denoising', 0.025), ('derivatives', 0.025), ('gt', 0.025), ('blind', 0.025), ('dimensions', 0.025), ('spanned', 0.025), ('minor', 0.024), ('noise', 0.024), ('covariance', 0.024), ('ban', 0.023), ('chacon', 0.023), ('delicado', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000014 60 jmlr-2011-Locally Defined Principal Curves and Surfaces

Author: Umut Ozertem, Deniz Erdogmus

Abstract: Principal curves are defined as self-consistent smooth curves passing through the middle of the data, and they have been used in many applications of machine learning as a generalization, dimensionality reduction and a feature extraction tool. We redefine principal curves and surfaces in terms of the gradient and the Hessian of the probability density estimate. This provides a geometric understanding of the principal curves and surfaces, as well as a unifying view for clustering, principal curve fitting and manifold learning by regarding those as principal manifolds of different intrinsic dimensionalities. The theory does not impose any particular density estimation method and can be used with any density estimator that gives continuous first and second derivatives. Therefore, we first present our principal curve/surface definition without assuming any particular density estimation method. Afterwards, we develop practical algorithms for the commonly used kernel density estimation (KDE) and Gaussian mixture models (GMM). Results of these algorithms are presented in notional data sets as well as real applications with comparisons to other approaches in the principal curve literature. All in all, we present a novel theoretical understanding of principal curves and surfaces, practical algorithms as general purpose machine learning tools, and applications of these algorithms to several practical problems. Keywords: unsupervised learning, dimensionality reduction, principal curves, principal surfaces, subspace constrained mean-shift

2 0.12754221 3 jmlr-2011-A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis

Author: Trine Julie Abrahamsen, Lars Kai Hansen

Abstract: Small sample high-dimensional principal component analysis (PCA) suffers from variance inflation and lack of generalizability. It has earlier been pointed out that a simple leave-one-out variance renormalization scheme can cure the problem. In this paper we generalize the cure in two directions: First, we propose a computationally less intensive approximate leave-one-out estimator, secondly, we show that variance inflation is also present in kernel principal component analysis (kPCA) and we provide a non-parametric renormalization scheme which can quite efficiently restore generalizability in kPCA. As for PCA our analysis also suggests a simplified approximate expression. Keywords: PCA, kernel PCA, generalizability, variance renormalization

3 0.07517729 48 jmlr-2011-Kernel Analysis of Deep Networks

Author: Grégoire Montavon, Mikio L. Braun, Klaus-Robert Müller

Abstract: When training deep networks it is common knowledge that an efficient and well generalizing representation of the problem is formed. In this paper we aim to elucidate what makes the emerging representation successful. We analyze the layer-wise evolution of the representation in a deep network by building a sequence of deeper and deeper kernels that subsume the mapping performed by more and more layers of the deep network and measuring how these increasingly complex kernels fit the learning problem. We observe that deep networks create increasingly better representations of the learning problem and that the structure of the deep network controls how fast the representation of the task is formed layer after layer. Keywords: deep networks, kernel principal component analysis, representations

4 0.063851871 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach

Author: Gilles Meyer, Silvère Bonnabel, Rodolphe Sepulchre

Abstract: The paper addresses the problem of learning a regression model parameterized by a fixed-rank positive semidefinite matrix. The focus is on the nonlinear nature of the search space and on scalability to high-dimensional problems. The mathematical developments rely on the theory of gradient descent algorithms adapted to the Riemannian geometry that underlies the set of fixedrank positive semidefinite matrices. In contrast with previous contributions in the literature, no restrictions are imposed on the range space of the learned matrix. The resulting algorithms maintain a linear complexity in the problem size and enjoy important invariance properties. We apply the proposed algorithms to the problem of learning a distance function parameterized by a positive semidefinite matrix. Good performance is observed on classical benchmarks. Keywords: linear regression, positive semidefinite matrices, low-rank approximation, Riemannian geometry, gradient-based learning

5 0.0595529 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors

Author: Kris De Brabanter, Jos De Brabanter, Johan A.K. Suykens, Bart De Moor

Abstract: It is a well-known problem that obtaining a correct bandwidth and/or smoothing parameter in nonparametric regression is difficult in the presence of correlated errors. There exist a wide variety of methods coping with this problem, but they all critically depend on a tuning procedure which requires accurate information about the correlation structure. We propose a bandwidth selection procedure based on bimodal kernels which successfully removes the correlation without requiring any prior knowledge about its structure and its parameters. Further, we show that the form of the kernel is very important when errors are correlated which is in contrast to the independent and identically distributed (i.i.d.) case. Finally, some extensions are proposed to use the proposed criterion in support vector machines and least squares support vector machines for regression. Keywords: nonparametric regression, correlated errors, bandwidth choice, cross-validation, shortrange dependence, bimodal kernel

6 0.053035606 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

7 0.049873047 10 jmlr-2011-Anechoic Blind Source Separation Using Wigner Marginals

8 0.049190026 55 jmlr-2011-Learning Multi-modal Similarity

9 0.043865021 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices

10 0.040627494 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels

11 0.036643691 35 jmlr-2011-Forest Density Estimation

12 0.032003161 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

13 0.031475089 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review

14 0.031089099 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding

15 0.029994052 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models

16 0.025985301 105 jmlr-2011-lp-Norm Multiple Kernel Learning

17 0.02544518 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

18 0.025436843 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

19 0.02426696 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

20 0.024136527 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.163), (1, -0.027), (2, 0.016), (3, -0.046), (4, -0.024), (5, -0.031), (6, 0.0), (7, -0.099), (8, -0.119), (9, -0.034), (10, 0.047), (11, 0.084), (12, 0.128), (13, -0.019), (14, -0.175), (15, 0.012), (16, 0.262), (17, 0.059), (18, -0.075), (19, -0.272), (20, 0.059), (21, 0.184), (22, 0.238), (23, 0.046), (24, 0.19), (25, 0.163), (26, -0.083), (27, -0.032), (28, -0.056), (29, -0.068), (30, -0.073), (31, -0.011), (32, 0.09), (33, 0.163), (34, 0.05), (35, -0.1), (36, -0.059), (37, -0.05), (38, 0.107), (39, -0.015), (40, -0.001), (41, 0.072), (42, -0.077), (43, -0.139), (44, 0.05), (45, 0.071), (46, -0.11), (47, -0.025), (48, 0.005), (49, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9781372 60 jmlr-2011-Locally Defined Principal Curves and Surfaces

Author: Umut Ozertem, Deniz Erdogmus

Abstract: Principal curves are defined as self-consistent smooth curves passing through the middle of the data, and they have been used in many applications of machine learning as a generalization, dimensionality reduction and a feature extraction tool. We redefine principal curves and surfaces in terms of the gradient and the Hessian of the probability density estimate. This provides a geometric understanding of the principal curves and surfaces, as well as a unifying view for clustering, principal curve fitting and manifold learning by regarding those as principal manifolds of different intrinsic dimensionalities. The theory does not impose any particular density estimation method and can be used with any density estimator that gives continuous first and second derivatives. Therefore, we first present our principal curve/surface definition without assuming any particular density estimation method. Afterwards, we develop practical algorithms for the commonly used kernel density estimation (KDE) and Gaussian mixture models (GMM). Results of these algorithms are presented in notional data sets as well as real applications with comparisons to other approaches in the principal curve literature. All in all, we present a novel theoretical understanding of principal curves and surfaces, practical algorithms as general purpose machine learning tools, and applications of these algorithms to several practical problems. Keywords: unsupervised learning, dimensionality reduction, principal curves, principal surfaces, subspace constrained mean-shift

2 0.77519 3 jmlr-2011-A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis

Author: Trine Julie Abrahamsen, Lars Kai Hansen

Abstract: Small sample high-dimensional principal component analysis (PCA) suffers from variance inflation and lack of generalizability. It has earlier been pointed out that a simple leave-one-out variance renormalization scheme can cure the problem. In this paper we generalize the cure in two directions: First, we propose a computationally less intensive approximate leave-one-out estimator, secondly, we show that variance inflation is also present in kernel principal component analysis (kPCA) and we provide a non-parametric renormalization scheme which can quite efficiently restore generalizability in kPCA. As for PCA our analysis also suggests a simplified approximate expression. Keywords: PCA, kernel PCA, generalizability, variance renormalization

3 0.40193778 48 jmlr-2011-Kernel Analysis of Deep Networks

Author: Grégoire Montavon, Mikio L. Braun, Klaus-Robert Müller

Abstract: When training deep networks it is common knowledge that an efficient and well generalizing representation of the problem is formed. In this paper we aim to elucidate what makes the emerging representation successful. We analyze the layer-wise evolution of the representation in a deep network by building a sequence of deeper and deeper kernels that subsume the mapping performed by more and more layers of the deep network and measuring how these increasingly complex kernels fit the learning problem. We observe that deep networks create increasingly better representations of the learning problem and that the structure of the deep network controls how fast the representation of the task is formed layer after layer. Keywords: deep networks, kernel principal component analysis, representations

4 0.36206368 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors

Author: Kris De Brabanter, Jos De Brabanter, Johan A.K. Suykens, Bart De Moor

Abstract: It is a well-known problem that obtaining a correct bandwidth and/or smoothing parameter in nonparametric regression is difficult in the presence of correlated errors. There exist a wide variety of methods coping with this problem, but they all critically depend on a tuning procedure which requires accurate information about the correlation structure. We propose a bandwidth selection procedure based on bimodal kernels which successfully removes the correlation without requiring any prior knowledge about its structure and its parameters. Further, we show that the form of the kernel is very important when errors are correlated which is in contrast to the independent and identically distributed (i.i.d.) case. Finally, some extensions are proposed to use the proposed criterion in support vector machines and least squares support vector machines for regression. Keywords: nonparametric regression, correlated errors, bandwidth choice, cross-validation, shortrange dependence, bimodal kernel

5 0.35952491 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach

Author: Gilles Meyer, Silvère Bonnabel, Rodolphe Sepulchre

Abstract: The paper addresses the problem of learning a regression model parameterized by a fixed-rank positive semidefinite matrix. The focus is on the nonlinear nature of the search space and on scalability to high-dimensional problems. The mathematical developments rely on the theory of gradient descent algorithms adapted to the Riemannian geometry that underlies the set of fixedrank positive semidefinite matrices. In contrast with previous contributions in the literature, no restrictions are imposed on the range space of the learned matrix. The resulting algorithms maintain a linear complexity in the problem size and enjoy important invariance properties. We apply the proposed algorithms to the problem of learning a distance function parameterized by a positive semidefinite matrix. Good performance is observed on classical benchmarks. Keywords: linear regression, positive semidefinite matrices, low-rank approximation, Riemannian geometry, gradient-based learning

6 0.34187296 10 jmlr-2011-Anechoic Blind Source Separation Using Wigner Marginals

7 0.26846367 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

8 0.21507576 55 jmlr-2011-Learning Multi-modal Similarity

9 0.19405907 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels

10 0.17793325 102 jmlr-2011-Waffles: A Machine Learning Toolkit

11 0.17288902 15 jmlr-2011-CARP: Software for Fishing Out Good Clustering Algorithms

12 0.16996829 35 jmlr-2011-Forest Density Estimation

13 0.16516049 27 jmlr-2011-Domain Decomposition Approach for Fast Gaussian Process Regression of Large Spatial Data Sets

14 0.16473468 56 jmlr-2011-Learning Transformation Models for Ranking and Survival Analysis

15 0.1644261 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

16 0.1625208 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices

17 0.1594706 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing

18 0.1574088 76 jmlr-2011-Parameter Screening and Optimisation for ILP using Designed Experiments

19 0.15654187 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels

20 0.15040235 50 jmlr-2011-LPmade: Link Prediction Made Easy


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.039), (9, 0.029), (10, 0.028), (11, 0.013), (24, 0.028), (31, 0.09), (32, 0.029), (41, 0.02), (60, 0.017), (65, 0.015), (70, 0.021), (71, 0.01), (73, 0.042), (78, 0.055), (86, 0.447), (87, 0.018), (90, 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.75793582 60 jmlr-2011-Locally Defined Principal Curves and Surfaces

Author: Umut Ozertem, Deniz Erdogmus

Abstract: Principal curves are defined as self-consistent smooth curves passing through the middle of the data, and they have been used in many applications of machine learning as a generalization, dimensionality reduction and a feature extraction tool. We redefine principal curves and surfaces in terms of the gradient and the Hessian of the probability density estimate. This provides a geometric understanding of the principal curves and surfaces, as well as a unifying view for clustering, principal curve fitting and manifold learning by regarding those as principal manifolds of different intrinsic dimensionalities. The theory does not impose any particular density estimation method and can be used with any density estimator that gives continuous first and second derivatives. Therefore, we first present our principal curve/surface definition without assuming any particular density estimation method. Afterwards, we develop practical algorithms for the commonly used kernel density estimation (KDE) and Gaussian mixture models (GMM). Results of these algorithms are presented in notional data sets as well as real applications with comparisons to other approaches in the principal curve literature. All in all, we present a novel theoretical understanding of principal curves and surfaces, practical algorithms as general purpose machine learning tools, and applications of these algorithms to several practical problems. Keywords: unsupervised learning, dimensionality reduction, principal curves, principal surfaces, subspace constrained mean-shift

2 0.63548821 66 jmlr-2011-Multiple Kernel Learning Algorithms

Author: Mehmet Gönen, Ethem Alpaydın

Abstract: In recent years, several methods have been proposed to combine multiple kernels instead of using a single one. These different kernels may correspond to using different notions of similarity or may be using information coming from multiple sources (different representations or different feature subsets). In trying to organize and highlight the similarities and differences between them, we give a taxonomy of and review several multiple kernel learning algorithms. We perform experiments on real data sets for better illustration and comparison of existing algorithms. We see that though there may not be large differences in terms of accuracy, there is difference between them in complexity as given by the number of stored support vectors, the sparsity of the solution as given by the number of used kernels, and training time complexity. We see that overall, using multiple kernels instead of a single one is useful and believe that combining kernels in a nonlinear or data-dependent way seems more promising than linear combination in fusing information provided by simple linear kernels, whereas linear methods are more reasonable when combining complex Gaussian kernels. Keywords: support vector machines, kernel machines, multiple kernel learning

3 0.30642295 48 jmlr-2011-Kernel Analysis of Deep Networks

Author: Grégoire Montavon, Mikio L. Braun, Klaus-Robert Müller

Abstract: When training deep networks it is common knowledge that an efficient and well generalizing representation of the problem is formed. In this paper we aim to elucidate what makes the emerging representation successful. We analyze the layer-wise evolution of the representation in a deep network by building a sequence of deeper and deeper kernels that subsume the mapping performed by more and more layers of the deep network and measuring how these increasingly complex kernels fit the learning problem. We observe that deep networks create increasingly better representations of the learning problem and that the structure of the deep network controls how fast the representation of the task is formed layer after layer. Keywords: deep networks, kernel principal component analysis, representations

4 0.30154711 3 jmlr-2011-A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis

Author: Trine Julie Abrahamsen, Lars Kai Hansen

Abstract: Small sample high-dimensional principal component analysis (PCA) suffers from variance inflation and lack of generalizability. It has earlier been pointed out that a simple leave-one-out variance renormalization scheme can cure the problem. In this paper we generalize the cure in two directions: First, we propose a computationally less intensive approximate leave-one-out estimator, secondly, we show that variance inflation is also present in kernel principal component analysis (kPCA) and we provide a non-parametric renormalization scheme which can quite efficiently restore generalizability in kPCA. As for PCA our analysis also suggests a simplified approximate expression. Keywords: PCA, kernel PCA, generalizability, variance renormalization

5 0.26875818 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

Author: Jinfeng Zhuang, Ivor W. Tsang, Steven C.H. Hoi

Abstract: Previous studies of Non-Parametric Kernel Learning (NPKL) usually formulate the learning task as a Semi-Definite Programming (SDP) problem that is often solved by some general purpose SDP solvers. However, for N data examples, the time complexity of NPKL using a standard interiorpoint SDP solver could be as high as O(N 6.5 ), which prohibits NPKL methods applicable to real applications, even for data sets of moderate size. In this paper, we present a family of efficient NPKL algorithms, termed “SimpleNPKL”, which can learn non-parametric kernels from a large set of pairwise constraints efficiently. In particular, we propose two efficient SimpleNPKL algorithms. One is SimpleNPKL algorithm with linear loss, which enjoys a closed-form solution that can be efficiently computed by the Lanczos sparse eigen decomposition technique. Another one is SimpleNPKL algorithm with other loss functions (including square hinge loss, hinge loss, square loss) that can be re-formulated as a saddle-point optimization problem, which can be further resolved by a fast iterative algorithm. In contrast to the previous NPKL approaches, our empirical results show that the proposed new technique, maintaining the same accuracy, is significantly more efficient and scalable. Finally, we also demonstrate that the proposed new technique is also applicable to speed up many kernel learning tasks, including colored maximum variance unfolding, minimum volume embedding, and structure preserving embedding. Keywords: non-parametric kernel learning, semi-definite programming, semi-supervised learning, side information, pairwise constraints

6 0.26825726 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series

7 0.26686725 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

8 0.26609847 12 jmlr-2011-Bayesian Co-Training

9 0.26471901 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

10 0.26420814 91 jmlr-2011-The Sample Complexity of Dictionary Learning

11 0.26286229 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes

12 0.26262036 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

13 0.26252356 35 jmlr-2011-Forest Density Estimation

14 0.26248682 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood

15 0.26243061 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

16 0.26160908 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

17 0.26016924 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach

18 0.25960642 16 jmlr-2011-Clustering Algorithms for Chains

19 0.25861886 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

20 0.25834748 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models