jmlr jmlr2013 jmlr2013-96 knowledge-graph by maker-knowledge-mining

96 jmlr-2013-Regularization-Free Principal Curve Estimation


Source: pdf

Author: Samuel Gerber, Ross Whitaker

Abstract: Principal curves and manifolds provide a framework to formulate manifold learning within a statistical context. Principal curves define the notion of a curve passing through the middle of a distribution. While the intuition is clear, the formal definition leads to some technical and practical difficulties. In particular, principal curves are saddle points of the mean-squared projection distance, which poses severe challenges for estimation and model selection. This paper demonstrates that the difficulties in model selection associated with the saddle point property of principal curves are intrinsically tied to the minimization of the mean-squared projection distance. We introduce a new objective function, facilitated through a modification of the principal curve estimation approach, for which all critical points are principal curves and minima. Thus, the new formulation removes the fundamental issue for model selection in principal curve estimation. A gradient-descent-based estimator demonstrates the effectiveness of the new formulation for controlling model complexity on numerical experiments with synthetic and real data. Keywords: principal curve, manifold estimation, unsupervised learning, model complexity, model selection

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Scientific Computing and Imaging Institute, University of Utah, Salt Lake City, UT 84112. Editor: Sridhar Mahadevan. Abstract: Principal curves and manifolds provide a framework to formulate manifold learning within a statistical context. [sent-5, score-0.517]

2 Principal curves define the notion of a curve passing through the middle of a distribution. [sent-6, score-0.553]

3 In particular, principal curves are saddle points of the mean-squared projection distance, which poses severe challenges for estimation and model selection. [sent-8, score-0.962]

4 This paper demonstrates that the difficulties in model selection associated with the saddle point property of principal curves are intrinsically tied to the minimization of the mean-squared projection distance. [sent-9, score-0.958]

5 We introduce a new objective function, facilitated through a modification of the principal curve estimation approach, for which all critical points are principal curves and minima. [sent-10, score-1.664]

6 Thus, the new formulation removes the fundamental issue for model selection in principal curve estimation. [sent-11, score-0.809]

7 Keywords: principal curve, manifold estimation, unsupervised learning, model complexity, model selection 1. [sent-13, score-0.693]

8 , 2000), and locally linear embedding (Roweis and Saul, 2000) build directly on the manifold assumption and typically formulate the manifold estimation in terms of a graph embedding problem. [sent-19, score-0.458]

9 Hastie and Stuetzle (1989), motivated by the statistical properties of principal component analysis, introduced the notion of a principal curve passing through the middle of a distribution. [sent-21, score-1.179]

10 The principal curve formulation casts manifold learning as a statistical fitting problem. [sent-22, score-0.99]

11 , 2005), can be implicitly or explicitly related to the principal curve formulation. [sent-25, score-0.714]

12 The principal curve approach to manifold learning can be advantageous over alternatives that exclusively estimate descriptive parameters. [sent-26, score-0.93]

13 While several methods have been proposed to estimate principal curves, a systematic approach to model selection remains an open question. [sent-29, score-0.477]

14 To facilitate the discussion, recall the formal definition of principal curves. [sent-30, score-0.442]

15 The principal curves of Y are ˜ the set G of smooth functions g that fulfill the self consistency property E[Y |λg (Y ) = s] = g(s). [sent-33, score-0.786]

16 Hastie and Stuetzle (1989) showed that principal curves are critical points of the expected projection distance d(g,Y)^2 = E[ ||g(λ_g(Y)) − Y||^2 ], that is, for g a principal curve the (Fréchet) derivative of d(g,Y)^2 with respect to g is zero. [sent-34, score-1.776]
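As a hedged illustration (our own sketch, not code from the paper), the expected projection distance can be approximated on a finite sample by discretizing a candidate curve g and assigning each point the squared distance to its closest curve point; the parameter grid, the toy half-circle data, and all names below are assumptions of the sketch.

```python
import numpy as np

def projection_distance_sq(g, Y, s_grid):
    """Empirical mean-squared projection distance d(g, Y)^2 (a sketch).

    g      : callable mapping a 1-D array of parameters s to curve points in R^d
    Y      : (n, d) array of samples
    s_grid : candidate parameter values approximating the projection index
             lambda_g(y) = argmin_s ||g(s) - y||^2
    """
    G = g(s_grid)                                             # (m, d) discretized curve
    d2 = ((Y[:, None, :] - G[None, :, :]) ** 2).sum(axis=-1)  # (n, m) squared distances
    lam = s_grid[d2.argmin(axis=1)]                           # projection index per sample
    return d2.min(axis=1).mean(), lam

# toy usage: noisy samples around a half circle
rng = np.random.default_rng(0)
t = rng.uniform(0.0, np.pi, 500)
Y = np.c_[np.cos(t), np.sin(t)] + 0.05 * rng.normal(size=(500, 2))
curve = lambda s: np.c_[np.cos(s), np.sin(s)]
dist_sq, lam = projection_distance_sq(curve, Y, np.linspace(0.0, np.pi, 400))
```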

17 Estimators for principal curves typically minimize d(g,Y)^2 directly over a suitable representation of g (e. [sent-35, score-0.7]

18 However, Duchamp and Stuetzle (1996) showed, for principal curves in the plane, that all critical points are saddles. [sent-38, score-0.873]

19 In fact, without additional regularization, principal curve estimation results in space-filling curves, which have both lower training and test projection distance than curves that provide a more meaningful summary of the distribution. [sent-40, score-1.212]

20 , 2005) which, due to the saddle point property, will lead to estimates that lie on the constraint boundary and typically do not satisfy the conditions for principal curves. [sent-44, score-0.527]

21 In this paper we propose a solution to address the model complexity challenge in principal curve estimation. [sent-50, score-0.714]

22 This paper contends that the model selection problem is intrinsically tied to the mean-squared-error objective function of the principal curve formulation. [sent-51, score-0.8]

23 Figure 1: Two curves (solid lines) with a different trade-off between variation along (dashed arrows) and orthogonal to (solid arrows) the curve for two data points in identical positions. [sent-56, score-0.685]

24 A direct minimization of the mean-squared projection distance d(g,Y)^2 = E[ ||g(λ(Y)) − Y||^2 ] will always favor curves passing close to the data points and not necessarily move towards a principal curve. [sent-57, score-1.041]

25 Informally, this indicates why principal curves are not minima of the distance function; trading off variation orthogonal to the manifold with variation along the curve leads to curves with smaller projection distances. [sent-58, score-1.869]

26 This becomes evident in cross-validation, where a test example has no fixed assigned manifold coordinate (as compared to the fixed predictor variables in regression) but has manifold coordinates assigned based on the current fit. [sent-59, score-0.499]
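Concretely (a small sketch of our own, with a discretized fitted curve as an assumption), a held-out point only receives a coordinate by projecting it onto the currently fitted curve, so its contribution to the test error changes whenever the fit changes:

```python
import numpy as np

def assign_coordinate(y, curve_pts, s_grid):
    """Assign a manifold coordinate to a test point based on the current fit.

    curve_pts : (m, d) points of the fitted curve evaluated on s_grid
    Returns the coordinate of the closest curve point and the squared distance.
    """
    d2 = ((curve_pts - y) ** 2).sum(axis=1)
    k = int(d2.argmin())
    return s_grid[k], float(d2[k])
```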

27 This would not pose a significant problem if principal curves were minima of the mean-squared projection distance; one would rely on the local minimum provided by the optimization. [sent-61, score-0.951]

28 However, due to the saddle point property, the optimization moves towards a curve that passes through all the training data and cross-validation does not provide any reliable indication for early stopping. [sent-62, score-0.505]

29 Thus, principal curve estimation based on minimizing the projection distance is not a viable path for automatic model selection. [sent-63, score-0.985]

30 However, the fundamental problem persists; a regularized principal curve is estimated, which will always force the principal curve conditions to be violated. [sent-65, score-1.156]

31 The regression curve that minimizes the least squares objective function, the conditional expectation given the predictor, is the desired solution. [sent-69, score-0.557]

32 For principal curves, the regularization is required even if the full density is available; the regularization serves as a crutch to fix a shortcoming in the objective function. [sent-71, score-0.601]

33 This paper proposes an approach to principal curve estimation that fixes this shortcoming in the objective function. [sent-72, score-0.815]

34 The proposed alternative objective function, for which principal curves emerge naturally as minima (Figure 2: Illustration of the traditional approach compared to the proposed method for estimating principal curves). [sent-73, score-0.493]

35 In the traditional approach g is optimized to find a principal curve. [sent-76, score-0.442]

36 The new objective, for which principal curves emerge naturally as minima, alleviates the need for regularization or changes to the original principal curve definition. [sent-78, score-1.456]

37 The proposed formulation rests on the observation that minimizing d(g,Y)^2 allows principal curves g to move away from saddle points and improve their fit by violating the self consistency property. [sent-79, score-1.034]

38 Instead of having the projection index λg controlled by the curve g, we fix g ≡ E[Y |λ(Y )] to the conditional expectation given λ and let λ be an arbitrary function λ : Ω → R (with mild conditions, to be specified in Section 2). [sent-81, score-0.577]

39 To distinguish from Hastie and Stuetzle (1989), with λg fixed given g, we call λ the coordinate mapping and E[Y |λ(Y )] the conditional expectation curve of λ. [sent-82, score-0.548]
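A minimal numpy sketch of this construction (our own assumption, not the authors' implementation): given the coordinates λ(y_i) of the samples, the conditional expectation curve g(s) = E[Y | λ(Y) = s] can be approximated with a kernel (Nadaraya-Watson) smoother, in the spirit of the mollified estimate g_σ used later in the proofs.

```python
import numpy as np

def conditional_expectation_curve(lam_vals, Y, s, sigma):
    """Kernel estimate of g(s) = E[Y | lambda(Y) = s] (a sketch).

    lam_vals : (n,) coordinates lambda(y_i) of the samples
    Y        : (n, d) samples
    s        : (m,) parameter values at which to evaluate the curve
    sigma    : bandwidth of an assumed Gaussian kernel K_sigma
    """
    W = np.exp(-0.5 * ((s[:, None] - lam_vals[None, :]) / sigma) ** 2)  # (m, n) weights
    W /= W.sum(axis=1, keepdims=True)                                   # normalize per s
    return W @ Y                                                        # (m, d) curve points

# usage (lam_vals could come from any coordinate mapping, e.g., a PCA projection):
# g_pts = conditional_expectation_curve(lam_vals, Y,
#                                       np.linspace(lam_vals.min(), lam_vals.max(), 100), 0.1)
```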

40 Minimizing d(λ,Y)^2 = E[ ||g(λ(Y)) − Y||^2 ] with respect to λ leads to critical points that are, indeed, principal curves, but unfortunately they are still saddle points. [sent-85, score-0.7]

41 This inspires the design of an objective function, facilitated through this level-set based formulation, that penalizes non-orthogonality of the coordinate mapping λ: q(λ,Y)^2 = E[ ⟨ g(λ(Y)) − Y, d/ds g(s)|_{s=λ(Y)} ⟩^2 ]. [sent-87, score-0.535]

42 Hence, the conditional expectation curve at a minimum λ*, g ≡ E[Y|λ*(Y)], corresponds to a principal curve (see Section 2). [sent-90, score-1.302]
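The sketch below (our own illustration; the Gaussian kernel curve and the finite-difference tangent are assumptions) evaluates this orthogonality penalty empirically: for each sample it pairs the residual g(λ(y_i)) − y_i with the curve tangent d/ds g(s) at s = λ(y_i) and averages the squared inner products.

```python
import numpy as np

def q_hat_sq(lam_vals, Y, sigma, eps=1e-3):
    """Empirical orthogonality objective q(lambda, Y)^2 (a sketch)."""
    def g(s):  # kernel conditional-expectation curve g_sigma at parameters s
        W = np.exp(-0.5 * ((s[:, None] - lam_vals[None, :]) / sigma) ** 2)
        W /= W.sum(axis=1, keepdims=True)
        return W @ Y

    g_at = g(lam_vals)                                             # g(lambda(y_i))
    tangent = (g(lam_vals + eps) - g(lam_vals - eps)) / (2 * eps)  # d/ds g at lambda(y_i)
    residual = g_at - Y                                            # g(lambda(y_i)) - y_i
    return float(np.mean(np.sum(residual * tangent, axis=1) ** 2))
```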

43 As in the principal curve setting, the conditional expectation E[Y|λ(Y)] is defined over a set of measure zero and is not a priori well defined. [sent-98, score-0.914]

44 Proof: The derivative of g is d/ds g(s) = ( d/ds m(s) − g(s) d/ds n(s) ) / n(s). [sent-109, score-0.844]

45 Thus, by the implicit function theorem, v exists Λ-almost everywhere for both λ and λ|∂Ω and it follows that d/ds g(s) exists Λ-almost everywhere. [sent-116, score-0.459]

46 This formulation does, indeed, lead to conditional expectations at critical points that are principal curves and the self consistency cannot be violated by definition. [sent-121, score-1.074]

47 In this case, λ can move towards conditional expectation curves with lower objective by violating the orthogonal projection requirement (rather than violating the self consistency requirement, as happens in the original formulation). [sent-123, score-0.877]

48 q(λ,Y)^2 = E[ ⟨ g(λ(Y)) − Y, d/ds g(s)|_{s=λ(Y)} ⟩^2 ]   (3). For q(λ,Y) to be well defined, d/ds g(s) has to exist Ω-almost everywhere and thus requires the conditions of Corollary 3. [sent-125, score-0.867]

49 The next Section shows that all critical points of (3) are minima and that the corresponding conditional expectation curves are principal curves in the sense of Hastie and Stuetzle (1989). [sent-126, score-1.447]

50 Properties of Critical Points. The following definition introduces a slightly weaker version of principal curves. [sent-128, score-0.442]

51 For principal curves which have no ambiguity points, that is, all y ∈ Ω have a unique closest point on the curve, the definition is equivalent to principal curves. [sent-129, score-1.195]

52 The weak principal curves of Y are the set Gw of functions g that fulfill the self consistency property E[Y|λ(Y) = s] = g(s) with λ satisfying ⟨ y − g(λ(y)), d/ds g(s)|_{s=λ(y)} ⟩ = 0 for all y ∈ Ω. [sent-131, score-1.19]
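For intuition only (a sketch; the curve evaluator g and the finite-difference tangent are assumptions), the orthogonality part of this definition can be checked numerically on a sample:

```python
import numpy as np

def orthogonality_violation(lam_vals, Y, g, eps=1e-3, tol=1e-2):
    """Fraction of samples violating <y - g(lambda(y)), d/ds g(s)|_{s=lambda(y)}> ~ 0.

    g : callable mapping a (n,) array of parameters to (n, d) curve points
    """
    tangent = (g(lam_vals + eps) - g(lam_vals - eps)) / (2 * eps)
    residual = Y - g(lam_vals)
    inner = np.sum(residual * tangent, axis=1)
    scale = np.linalg.norm(tangent, axis=1) * np.linalg.norm(residual, axis=1) + 1e-12
    return float(np.mean(np.abs(inner / scale) > tol))
```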

53 Weak principal curves do include all principal curves but can potentially also include additional curves for λg with discontinuities, that is, curves with ambiguity points. [sent-132, score-1.969]

54 However, under the restriction to continuous λ, which excludes principal curves with ambiguity points, the sets of principal curves and weak principal curves are identical. [sent-133, score-2.182]

55 Furthermore, in practical applications principal curves with ambiguity points are typically not of interest. [sent-134, score-0.799]

56 Theorem 5 The conditional expectation curves of critical points of q(λ,Y)^2 are weak principal curves. [sent-135, score-1.102]

57 Thus, to show that critical points of q(λ,Y)^2 are principal curves requires d/ds g(s) to be orthogonal to g(s) − y almost everywhere for y in the level set λ^{-1}(s). [sent-137, score-1.146]

58 Taking the Gâteaux derivative of q(λ,Y)^2 with respect to τ yields d/dε q(λ + ετ,Y)^2 |_{ε=0} = ∫_Ω ⟨ g(λ(y)) − y, d/ds g(s)|_{s=λ(y)} ⟩ · d/dε ⟨ g(λ(y) + ετ(y)) − y, d/ds g(s)|_{s=λ(y+ετ(y))} ⟩ |_{ε=0} dµ. [sent-139, score-0.88]

59 It immediately follows that λ is critical if it is orthogonal to its conditional expectation curve Ω-almost everywhere. [sent-140, score-0.671]

60 To exclude other possible critical points requires that there is no λ for which ⟨ d/dε g(λ(y) + ετ(y))|_{ε=0}, d/ds g(s)|_{s=λ(y)} ⟩ + ⟨ g(λ(y)) − y, d/dε d/ds g(s)|_{s=λ(y+ετ(y))} |_{ε=0} ⟩ = 0 for all admissible τ. [sent-141, score-1.011]

61 For the variation of g(λ(y)) take the Gâteaux derivative of the mollified expectation: d/dε gσ(λ(y) + ετ(y)) |_{ε=0} = [ ∫_Ω K′σ(λ(y) − λ(ỹ)) (τ(y) − τ(ỹ)) (ỹ − gσ(λ(y))) dµ(ỹ) ] / [ ∫_Ω Kσ(λ(y) − λ(ỹ)) dµ(ỹ) ], with K′σ(s − λ(ỹ)) = d/ds Kσ(s − λ(ỹ)). [sent-142, score-0.621]

62 Terms of the form ∫_Ω K′σ(s − λ(ỹ)) f(ỹ) dỹ converge uniformly to ∫_Ω δ′(s − λ(ỹ)) f(ỹ) dỹ = d/ds ∫_{λ^{-1}(s)} f(ỹ) dỹ, which under the conditions of Lemma 2 exists. [sent-144, score-0.816]

63 Thus, d/ds gσ(s) converges uniformly and, with the uniform convergence of gσ → g, yields the limit lim_{σ→0} d/ds gσ(s) = d/ds g(s). [sent-145, score-1.224]

64 With α(s, τ) = lim_{σ→0} ασ(s, τ) and β(s, τ) = lim_{σ→0} βσ(s, τ), taking the limit as σ → 0 yields d/dε ⟨ g(λ(y) + ετ(y)) − y, d/ds g(s)|_{s=λ(y+ετ(y))} ⟩ |_{ε=0} = τ(y) ||d/ds g(s)|_{s=λ(y)}||^2 + ⟨ g(λ(y)) − y, τ(y) d²/ds² g(s)|_{s=λ(y)} ⟩ + ⟨ d/ds g(s)|_{s=λ(y)}, α(λ(y), τ) ⟩ + β(λ(y), τ). [sent-146, score-1.224]

65 Therefore, for a critical point at which ⟨ g(λ(y)) − y, d/ds g(s)|_{s=λ(y)} ⟩ = 0 does not hold Ω-almost everywhere, there must be a λ for which ||d/ds g(s)|_{s=λ(y)}||^2 + ⟨ g(λ(y)) − y, d²/ds² g(s)|_{s=λ(y)} ⟩ + β(λ(y), τ) = 0 Ω-almost everywhere. [sent-148, score-0.943]

66 Since β(λ(y), τ), d/ds g(s)|_{s=λ(y)} and d²/ds² g(s)|_{s=λ(y)} are constant along the level set of λ, this requires that ||d/ds g(s)|_{s=λ(y)}||^2 = 0 and either g(λ(y)) − y = 0 or ⟨ g(λ(y)) − y, d²/ds² g(s)|_{s=λ(y)} ⟩ + β(λ(y), τ) = 0, Ω-almost everywhere. [sent-149, score-0.816]

67 It follows that for all critical points ⟨ g(λ(y)) − y, d/ds g(s)|_{s=λ(y)} ⟩ = 0. [sent-150, score-0.581]

68 Corollary 7 The coordinate mapping λ is a minimum of q(λ,Y) if and only if E[Y|λ(Y)] is a weak principal curve. [sent-153, score-0.663]

69 Proof: If λ is minimal, it follows from Theorem 5 that E[Y|λ(Y)] is a weak principal curve. [sent-154, score-0.471]

70 If E[Y|λ(Y)] is a principal curve, λ has to be an orthogonal projection and thus q(λ,Y)^2 = 0 is minimal. [sent-155, score-0.891]

71 Remarks. The minima include pathological conditional expectation sets such as single points, curves with discontinuities and space-filling curves. [sent-157, score-0.599]

72 However, if Ω and p admit a minimal λ that satisfies the condition in Lemma 2, the conditional expectation is a well-behaved curve Λ-almost everywhere. [sent-158, score-0.472]

73 In principle, q(λ,Y)^2 can be minimized by shrinking the tangent d/ds g(s), that is, through a reparametrization of g. [sent-159, score-0.447]

74 At critical points (local minima), the length of the tangent d/ds g(s) has no effect on either the objective function or the solution. [sent-162, score-0.671]

75 Similar to principal components, choosing among the possible principal curves requires an alternate, possibly application-dependent, criterion. [sent-167, score-1.142]

76 The gradient of (4) is d/dz_k q̂(λ,Y)^2 = Σ_{i=1}^n ⟨ g(λ(y_i)) − y_i, d/ds g(s)|_{s=λ(y_i)} ⟩ ( ⟨ d/dz_k g(λ(y_i)), d/ds g(s)|_{s=λ(y_i)} ⟩ + ⟨ g(λ(y_i)) − y_i, d/dz_k d/ds g(s)|_{s=λ(y_i)} ⟩ ). [sent-180, score-1.477]
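A hedged sketch of the resulting gradient-descent estimator (not the analytic gradient above: for brevity it uses a crude finite-difference gradient over the per-sample coordinates Z = (λ(y_1), ..., λ(y_n)); the objective callable, step size, and iteration count are assumptions):

```python
import numpy as np

def fit_coordinates(objective, Y, Z0, lr=0.5, iters=200, fd=1e-4):
    """Descend an empirical objective over the sample coordinates Z (a sketch).

    objective : callable (Z, Y) -> scalar, such as the empirical q(lambda, Y)^2
    Z0        : (n,) initial coordinates, e.g., from a PCA projection
    """
    Z = np.asarray(Z0, dtype=float).copy()
    for _ in range(iters):
        base = objective(Z, Y)
        grad = np.zeros_like(Z)
        for k in range(Z.size):            # forward finite differences, O(n) evaluations
            Zk = Z.copy()
            Zk[k] += fd
            grad[k] = (objective(Zk, Y) - base) / fd
        Z -= lr * grad
    return Z
```

With the q_hat_sq sketch above, `fit_coordinates(lambda Z, Y: q_hat_sq(Z, Y, sigma=0.1), Y, Z0)` would run one such descent.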

77 The parameters Z can be initialized by principal components, a spectral manifold learning method (Tenenbaum et al. [sent-184, score-0.718]
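For instance, a principal-component initialization of Z (a minimal numpy version of one of the mentioned options, not the authors' code) could look like:

```python
import numpy as np

def pca_initialize(Y):
    """Initialize coordinates Z with scores along the first principal component."""
    Yc = Y - Y.mean(axis=0)
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    return Yc @ Vt[0]        # (n,) one-dimensional coordinates
```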

78 The minimization of d̂(λ,Y)^2 with bandwidth selection tends towards curves with increased curvature, possibly moving away from a close-by principal curve that is smoother than the initial curve (see Figures 3 and 5). [sent-199, score-1.532]

79 Even starting from smooth curves, cross-validation is not a reliable indicator for estimating a good, approximating principal curve (see Figure 4). [sent-200, score-1.033]

80 This nice property is not unexpected and hinges directly on the theoretical results for the new principal curve formulation. [sent-204, score-0.502]

81 However, the objective function q̂(λ,Y)^2 indicates no preference towards such a particular orthogonal λ on the training data and does not express a desire to increase the length/complexity of the curve in order to achieve a minimum. [sent-206, score-0.47]

82 These two observations provide an explanation of the remarkable property of the proposed estimation scheme that a regularization-free minimum on the training data is typically a desirable principal curve. [sent-209, score-0.524]

83 The mapping complexity of λ only depends on the complexity of the principal curve and can be independent of the number of data points. [sent-222, score-0.749]

84 (2009) report extensive results on the optimization of d̂(λ,Y)^2 with equal or improved performance compared to several principal curve algorithms in various settings. [sent-228, score-0.746]

85 Thus, the comparison is representative of a typical principal curve estimation scheme based on minimizing the squared projection distance. [sent-229, score-0.92]

86 The results demonstrate that automatic bandwidth selection leads to good results for q̂(λ,Y)^2, while minimizing d̂(λ,Y)^2 shows the expected behavior of tending towards curves with high curvature that pass close to the training data, even with early stopping based on held-out data. [sent-230, score-0.661]

87 15 but with Z = sin(φ) initialized close to a principal component. [sent-236, score-0.502]

88 1, the minimization of d̂(λ,Y) moves away from the principal component while q̂(λ,Y) approximately recovers the principal component. [sent-239, score-0.958]

89 Figure 3: Minimization of d̂(λ,Y)^2 and q̂(λ,Y)^2 starting from a curve close to a principal component with bandwidths initialized to σg = 0. [sent-241, score-0.885]

90 (left) Fitted curve with optimization path and (right) train and test error with points indicating minimal train and test error, respectively. [sent-262, score-0.46]

91 Figure 7 shows optimization results starting from a surface close to a principal plane. [sent-266, score-0.533]

92 As in the 1D case, q̂(λ,Y)^2 moves towards the principal plane, while d̂(λ,Y)^2 results in a highly curved fit that begins to fill the space. [sent-268, score-0.531]

93 (left) Fitted curve with optimization path and (right) train and test error with points indicating minimal train and test error, respectively. [sent-278, score-0.46]

94 Figure 8 shows the original and noisy images and the projection on the conditional expectation manifold for each optimization. [sent-281, score-0.521]

95 Conclusion. The conditional expectation curve formulation, together with the introduction of a new objective function, solves the model complexity problem in principal curve estimation. [sent-285, score-1.297]

96 Figure 7: Minimization of (b) q̂(λ,Y)^2 and (c) d̂(λ,Y)^2 starting from (a) an initial surface close to a principal plane. [sent-290, score-0.501]

97 problem of principal curve estimation into a well-posed, or at least stable, problem. [sent-291, score-0.74]

98 Test data (2nd row) with N(0, 20) noise added, projected onto the conditional expectation manifold after minimization of (3rd row) d̂(λ,Y)^2 and (4th row) q̂(λ,Y)^2 on training data also with N(0, 20) noise added. [sent-300, score-0.476]

99 Dimensionality reduction and principal surfaces via kernel map manifolds. [sent-346, score-0.502]

100 Automatic parameter selection for a k-segments algorithm for computing principal curves. [sent-428, score-0.477]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('principal', 0.442), ('ds', 0.408), ('curve', 0.272), ('curves', 0.258), ('manifold', 0.216), ('bandwidth', 0.172), ('erber', 0.161), ('gerber', 0.161), ('hitaker', 0.161), ('stuetzle', 0.161), ('urve', 0.143), ('critical', 0.127), ('rincipal', 0.122), ('minima', 0.116), ('expectation', 0.112), ('projection', 0.105), ('stimation', 0.102), ('egularization', 0.102), ('ree', 0.095), ('conditional', 0.088), ('saddle', 0.085), ('bandwidths', 0.083), ('orthogonal', 0.072), ('duchamp', 0.072), ('swissroll', 0.061), ('initialized', 0.06), ('formulation', 0.06), ('distance', 0.056), ('dzk', 0.054), ('meinicke', 0.054), ('molli', 0.054), ('ambiguity', 0.053), ('self', 0.053), ('everywhere', 0.051), ('objective', 0.051), ('towards', 0.048), ('hn', 0.047), ('points', 0.046), ('yp', 0.046), ('hastie', 0.046), ('minimizing', 0.045), ('violating', 0.045), ('manifolds', 0.043), ('regularization', 0.042), ('coordinate', 0.041), ('moves', 0.041), ('automatic', 0.039), ('tangent', 0.039), ('isomap', 0.039), ('descent', 0.038), ('curvature', 0.037), ('variation', 0.037), ('cems', 0.036), ('fitted', 0.036), ('teaux', 0.036), ('utah', 0.036), ('whitaker', 0.036), ('wire', 0.036), ('surfaces', 0.036), ('yd', 0.036), ('mapping', 0.035), ('selection', 0.035), ('radial', 0.034), ('neighbor', 0.034), ('regression', 0.034), ('smooth', 0.033), ('minimization', 0.033), ('lling', 0.032), ('optimization', 0.032), ('yi', 0.031), ('surface', 0.031), ('tenenbaum', 0.03), ('squared', 0.03), ('remarkable', 0.029), ('ambient', 0.029), ('weak', 0.029), ('gradient', 0.029), ('train', 0.029), ('starting', 0.028), ('derivative', 0.028), ('rdle', 0.028), ('duke', 0.028), ('training', 0.027), ('estimation', 0.026), ('test', 0.026), ('discontinuities', 0.025), ('mukherjee', 0.025), ('hausdorff', 0.025), ('roweis', 0.024), ('gl', 0.024), ('ross', 0.024), ('shortcoming', 0.024), ('biau', 0.024), ('kernel', 0.024), ('passing', 0.023), ('plane', 0.023), ('belkin', 0.023), ('niyogi', 0.023), ('admissible', 0.022), ('spatially', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 96 jmlr-2013-Regularization-Free Principal Curve Estimation

Author: Samuel Gerber, Ross Whitaker

Abstract: Principal curves and manifolds provide a framework to formulate manifold learning within a statistical context. Principal curves define the notion of a curve passing through the middle of a distribution. While the intuition is clear, the formal definition leads to some technical and practical difficulties. In particular, principal curves are saddle points of the mean-squared projection distance, which poses severe challenges for estimation and model selection. This paper demonstrates that the difficulties in model selection associated with the saddle point property of principal curves are intrinsically tied to the minimization of the mean-squared projection distance. We introduce a new objective function, facilitated through a modification of the principal curve estimation approach, for which all critical points are principal curves and minima. Thus, the new formulation removes the fundamental issue for model selection in principal curve estimation. A gradient-descent-based estimator demonstrates the effectiveness of the new formulation for controlling model complexity on numerical experiments with synthetic and real data. Keywords: principal curve, manifold estimation, unsupervised learning, model complexity, model selection

2 0.1747652 69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses

Author: Partha Niyogi

Abstract: Manifold regularization (Belkin et al., 2006) is a geometrically motivated framework for machine learning within which several semi-supervised algorithms have been constructed. Here we try to provide some theoretical understanding of this approach. Our main result is to expose the natural structure of a class of problems on which manifold regularization methods are helpful. We show that for such problems, no supervised learner can learn effectively. On the other hand, a manifold based learner (that knows the manifold or “learns” it from unlabeled examples) can learn with relatively few labeled examples. Our analysis follows a minimax style with an emphasis on finite sample results (in terms of n: the number of labeled examples). These results allow us to properly interpret manifold regularization and related spectral and geometric algorithms in terms of their potential use in semi-supervised learning. Keywords: semi-supervised learning, manifold regularization, graph Laplacian, minimax rates

3 0.12092671 34 jmlr-2013-Distance Preserving Embeddings for General n-Dimensional Manifolds

Author: Nakul Verma

Abstract: Low dimensional embeddings of manifold data have gained popularity in the last decade. However, a systematic finite sample analysis of manifold embedding algorithms largely eludes researchers. Here we present two algorithms that embed a general n-dimensional manifold into Rd (where d only depends on some key manifold properties such as its intrinsic dimension, volume and curvature) that guarantee to approximately preserve all interpoint geodesic distances. Keywords: manifold learning, isometric embeddings, non-linear dimensionality reduction, Nash’s embedding theorem

4 0.11368141 86 jmlr-2013-Parallel Vector Field Embedding

Author: Binbin Lin, Xiaofei He, Chiyuan Zhang, Ming Ji

Abstract: We propose a novel local isometry based dimensionality reduction method from the perspective of vector fields, which is called parallel vector field embedding (PFE). We first give a discussion on local isometry and global isometry to show the intrinsic connection between parallel vector fields and isometry. The problem of finding an isometry turns out to be equivalent to finding orthonormal parallel vector fields on the data manifold. Therefore, we first find orthonormal parallel vector fields by solving a variational problem on the manifold. Then each embedding function can be obtained by requiring its gradient field to be as close to the corresponding parallel vector field as possible. Theoretical results show that our method can precisely recover the manifold if it is isometric to a connected open subset of Euclidean space. Both synthetic and real data examples demonstrate the effectiveness of our method even if there is heavy noise and high curvature. Keywords: manifold learning, isometry, vector field, covariant derivative, out-of-sample extension

5 0.082261816 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting

Author: Kris De Brabanter, Jos De Brabanter, Bart De Moor, Irène Gijbels

Abstract: We present a fully automated framework to estimate derivatives nonparametrically without estimating the regression function. Derivative estimation plays an important role in the exploration of structures in curves (jump detection and discontinuities), comparison of regression curves, analysis of human growth data, etc. Hence, the study of estimating derivatives is equally important as regression estimation itself. Via empirical derivatives we approximate the qth order derivative and create a new data set which can be smoothed by any nonparametric regression estimator. We derive L1 and L2 rates and establish consistency of the estimator. The new data sets created by this technique are no longer independent and identically distributed (i.i.d.) random variables anymore. As a consequence, automated model selection criteria (data-driven procedures) break down. Therefore, we propose a simple factor method, based on bimodal kernels, to effectively deal with correlated data in the local polynomial regression framework. Keywords: nonparametric derivative estimation, model selection, empirical derivative, factor rule

6 0.082151063 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

7 0.081551298 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs

8 0.07012435 59 jmlr-2013-Large-scale SVD and Manifold Learning

9 0.064794421 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering

10 0.054488618 76 jmlr-2013-Nonparametric Sparsity and Regularization

11 0.053288102 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning

12 0.0511446 116 jmlr-2013-Truncated Power Method for Sparse Eigenvalue Problems

13 0.044021029 18 jmlr-2013-Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications

14 0.043623202 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut

15 0.042526618 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding

16 0.041745268 21 jmlr-2013-Classifier Selection using the Predicate Depth

17 0.039911371 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres

18 0.039701268 112 jmlr-2013-Tapkee: An Efficient Dimension Reduction Library

19 0.038048323 95 jmlr-2013-Ranking Forests

20 0.037168793 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.212), (1, 0.112), (2, 0.035), (3, -0.291), (4, -0.045), (5, 0.031), (6, -0.006), (7, -0.013), (8, 0.043), (9, -0.08), (10, -0.161), (11, 0.027), (12, -0.066), (13, -0.095), (14, 0.032), (15, 0.036), (16, -0.085), (17, -0.109), (18, -0.109), (19, 0.118), (20, -0.05), (21, -0.255), (22, 0.024), (23, -0.067), (24, 0.141), (25, -0.064), (26, 0.141), (27, 0.021), (28, 0.04), (29, 0.225), (30, -0.026), (31, 0.068), (32, -0.092), (33, -0.096), (34, 0.009), (35, -0.125), (36, -0.087), (37, 0.062), (38, -0.095), (39, -0.039), (40, 0.051), (41, 0.124), (42, 0.045), (43, 0.011), (44, 0.081), (45, 0.041), (46, 0.112), (47, -0.043), (48, 0.008), (49, -0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98175836 96 jmlr-2013-Regularization-Free Principal Curve Estimation

Author: Samuel Gerber, Ross Whitaker

Abstract: Principal curves and manifolds provide a framework to formulate manifold learning within a statistical context. Principal curves define the notion of a curve passing through the middle of a distribution. While the intuition is clear, the formal definition leads to some technical and practical difficulties. In particular, principal curves are saddle points of the mean-squared projection distance, which poses severe challenges for estimation and model selection. This paper demonstrates that the difficulties in model selection associated with the saddle point property of principal curves are intrinsically tied to the minimization of the mean-squared projection distance. We introduce a new objective function, facilitated through a modification of the principal curve estimation approach, for which all critical points are principal curves and minima. Thus, the new formulation removes the fundamental issue for model selection in principal curve estimation. A gradient-descent-based estimator demonstrates the effectiveness of the new formulation for controlling model complexity on numerical experiments with synthetic and real data. Keywords: principal curve, manifold estimation, unsupervised learning, model complexity, model selection

2 0.65176666 69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses

Author: Partha Niyogi

Abstract: Manifold regularization (Belkin et al., 2006) is a geometrically motivated framework for machine learning within which several semi-supervised algorithms have been constructed. Here we try to provide some theoretical understanding of this approach. Our main result is to expose the natural structure of a class of problems on which manifold regularization methods are helpful. We show that for such problems, no supervised learner can learn effectively. On the other hand, a manifold based learner (that knows the manifold or “learns” it from unlabeled examples) can learn with relatively few labeled examples. Our analysis follows a minimax style with an emphasis on finite sample results (in terms of n: the number of labeled examples). These results allow us to properly interpret manifold regularization and related spectral and geometric algorithms in terms of their potential use in semi-supervised learning. Keywords: semi-supervised learning, manifold regularization, graph Laplacian, minimax rates

3 0.48983234 34 jmlr-2013-Distance Preserving Embeddings for General n-Dimensional Manifolds

Author: Nakul Verma

Abstract: Low dimensional embeddings of manifold data have gained popularity in the last decade. However, a systematic finite sample analysis of manifold embedding algorithms largely eludes researchers. Here we present two algorithms that embed a general n-dimensional manifold into Rd (where d only depends on some key manifold properties such as its intrinsic dimension, volume and curvature) that guarantee to approximately preserve all interpoint geodesic distances. Keywords: manifold learning, isometric embeddings, non-linear dimensionality reduction, Nash’s embedding theorem

4 0.42999512 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting

Author: Kris De Brabanter, Jos De Brabanter, Bart De Moor, Irène Gijbels

Abstract: We present a fully automated framework to estimate derivatives nonparametrically without estimating the regression function. Derivative estimation plays an important role in the exploration of structures in curves (jump detection and discontinuities), comparison of regression curves, analysis of human growth data, etc. Hence, the study of estimating derivatives is equally important as regression estimation itself. Via empirical derivatives we approximate the qth order derivative and create a new data set which can be smoothed by any nonparametric regression estimator. We derive L1 and L2 rates and establish consistency of the estimator. The new data sets created by this technique are no longer independent and identically distributed (i.i.d.) random variables anymore. As a consequence, automated model selection criteria (data-driven procedures) break down. Therefore, we propose a simple factor method, based on bimodal kernels, to effectively deal with correlated data in the local polynomial regression framework. Keywords: nonparametric derivative estimation, model selection, empirical derivative, factor rule

5 0.39680433 86 jmlr-2013-Parallel Vector Field Embedding

Author: Binbin Lin, Xiaofei He, Chiyuan Zhang, Ming Ji

Abstract: We propose a novel local isometry based dimensionality reduction method from the perspective of vector fields, which is called parallel vector field embedding (PFE). We first give a discussion on local isometry and global isometry to show the intrinsic connection between parallel vector fields and isometry. The problem of finding an isometry turns out to be equivalent to finding orthonormal parallel vector fields on the data manifold. Therefore, we first find orthonormal parallel vector fields by solving a variational problem on the manifold. Then each embedding function can be obtained by requiring its gradient field to be as close to the corresponding parallel vector field as possible. Theoretical results show that our method can precisely recover the manifold if it is isometric to a connected open subset of Euclidean space. Both synthetic and real data examples demonstrate the effectiveness of our method even if there is heavy noise and high curvature. Keywords: manifold learning, isometry, vector field, covariant derivative, out-of-sample extension

6 0.36558288 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering

7 0.35328275 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs

8 0.34276554 116 jmlr-2013-Truncated Power Method for Sparse Eigenvalue Problems

9 0.32729751 21 jmlr-2013-Classifier Selection using the Predicate Depth

10 0.32276192 38 jmlr-2013-Dynamic Affine-Invariant Shape-Appearance Handshape Features and Classification in Sign Language Videos

11 0.31240478 76 jmlr-2013-Nonparametric Sparsity and Regularization

12 0.31073382 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

13 0.29366115 95 jmlr-2013-Ranking Forests

14 0.28815845 18 jmlr-2013-Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications

15 0.24963945 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators

16 0.24855897 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning

17 0.24333739 59 jmlr-2013-Large-scale SVD and Manifold Learning

18 0.22192508 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut

19 0.21351403 7 jmlr-2013-A Risk Comparison of Ordinary Least Squares vs Ridge Regression

20 0.19932318 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.016), (5, 0.081), (6, 0.022), (10, 0.657), (20, 0.01), (23, 0.018), (68, 0.016), (70, 0.012), (75, 0.036), (85, 0.013), (87, 0.034)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97195041 96 jmlr-2013-Regularization-Free Principal Curve Estimation

Author: Samuel Gerber, Ross Whitaker

Abstract: Principal curves and manifolds provide a framework to formulate manifold learning within a statistical context. Principal curves define the notion of a curve passing through the middle of a distribution. While the intuition is clear, the formal definition leads to some technical and practical difficulties. In particular, principal curves are saddle points of the mean-squared projection distance, which poses severe challenges for estimation and model selection. This paper demonstrates that the difficulties in model selection associated with the saddle point property of principal curves are intrinsically tied to the minimization of the mean-squared projection distance. We introduce a new objective function, facilitated through a modification of the principal curve estimation approach, for which all critical points are principal curves and minima. Thus, the new formulation removes the fundamental issue for model selection in principal curve estimation. A gradient-descent-based estimator demonstrates the effectiveness of the new formulation for controlling model complexity on numerical experiments with synthetic and real data. Keywords: principal curve, manifold estimation, unsupervised learning, model complexity, model selection

2 0.93650204 16 jmlr-2013-Bayesian Nonparametric Hidden Semi-Markov Models

Author: Matthew J. Johnson, Alan S. Willsky

Abstract: There is much interest in the Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM) as a natural Bayesian nonparametric extension of the ubiquitous Hidden Markov Model for learning from sequential and time-series data. However, in many settings the HDP-HMM’s strict Markovian constraints are undesirable, particularly if we wish to learn or encode non-geometric state durations. We can extend the HDP-HMM to capture such structure by drawing upon explicit-duration semi-Markov modeling, which has been developed mainly in the parametric non-Bayesian setting, to allow construction of highly interpretable models that admit natural prior information on state durations. In this paper we introduce the explicit-duration Hierarchical Dirichlet Process Hidden semiMarkov Model (HDP-HSMM) and develop sampling algorithms for efficient posterior inference. The methods we introduce also provide new methods for sampling inference in the finite Bayesian HSMM. Our modular Gibbs sampling methods can be embedded in samplers for larger hierarchical Bayesian models, adding semi-Markov chain modeling as another tool in the Bayesian inference toolbox. We demonstrate the utility of the HDP-HSMM and our inference methods on both synthetic and real experiments. Keywords: Bayesian nonparametrics, time series, semi-Markov, sampling algorithms, Hierarchical Dirichlet Process Hidden Markov Model

3 0.91521442 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach

Author: Gang Niu, Bo Dai, Lin Shang, Masashi Sugiyama

Abstract: The large volume principle proposed by Vladimir Vapnik, which advocates that hypotheses lying in an equivalence class with a larger volume are more preferable, is a useful alternative to the large margin principle. In this paper, we introduce a new discriminative clustering model based on the large volume principle called maximum volume clustering (MVC), and then propose two approximation schemes to solve this MVC model: a soft-label MVC method using sequential quadratic programming and a hard-label MVC method using semi-definite programming, respectively. The proposed MVC is theoretically advantageous for three reasons. The optimization involved in hard-label MVC is convex, and under mild conditions, the optimization involved in soft-label MVC is akin to a convex one in terms of the resulting clusters. Secondly, the soft-label MVC method possesses a clustering error bound. Thirdly, MVC includes the optimization problems of a spectral clustering, two relaxed k-means clustering and an information-maximization clustering as special limit cases when its regularization parameter goes to infinity. Experiments on several artificial and benchmark data sets demonstrate that the proposed MVC compares favorably with state-of-the-art clustering methods. Keywords: discriminative clustering, large volume principle, sequential quadratic programming, semi-definite programming, finite sample stability, clustering error. (A preliminary and shorter version appeared in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (Niu et al., 2011); the preliminary work was done when GN was studying at the Department of Computer Science and Technology, Nanjing University, and BD at the Institute of Automation, Chinese Academy of Sciences. A Matlab implementation of maximum volume clustering is available from http://sugiyama-www.cs.titech.ac.jp/~gang/software.html.)

4 0.80924946 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding

Author: Ery Arias-Castro, Bruno Pelletier

Abstract: Maximum Variance Unfolding is one of the main methods for (nonlinear) dimensionality reduction. We study its large sample limit, providing specific rates of convergence under standard assumptions. We find that it is consistent when the underlying submanifold is isometric to a convex subset, and we provide some simple examples where it fails to be consistent. Keywords: maximum variance unfolding, isometric embedding, U-processes, empirical processes, proximity graphs.

5 0.58933997 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning

Author: Wendelin Böhmer, Steffen Grünewälder, Yun Shen, Marek Musial, Klaus Obermayer

Abstract: Linear reinforcement learning (RL) algorithms like least-squares temporal difference learning (LSTD) require basis functions that span approximation spaces of potential value functions. This article investigates methods to construct these bases from samples. We hypothesize that an ideal approximation space should encode diffusion distances and that slow feature analysis (SFA) constructs such spaces. To validate our hypothesis we provide theoretical statements about the LSTD value approximation error and induced metric of approximation spaces constructed by SFA and the state-of-the-art methods Krylov bases and proto-value functions (PVF). In particular, we prove that SFA minimizes the average (over all tasks in the same environment) bound on the above approximation error. Compared to other methods, SFA is very sensitive to sampling and can sometimes fail to encode the whole state space. We derive a novel importance sampling modification to compensate for this effect. Finally, the LSTD and least squares policy iteration (LSPI) performance of approximation spaces constructed by Krylov bases, PVF, SFA and PCA is compared in benchmark tasks and a visual robot navigation experiment (both in a realistic simulation and with a robot). The results support our hypothesis and suggest that (i) SFA provides subspace-invariant features for MDPs with self-adjoint transition operators, which allows strong guarantees on the approximation error, (ii) the modified SFA algorithm is best suited for LSPI in both discrete and continuous state spaces and (iii) approximation spaces encoding diffusion distances facilitate LSPI performance. Keywords: reinforcement learning, diffusion distance, proto value functions, slow feature analysis, least-squares policy iteration, visual robot navigation

6 0.57953387 100 jmlr-2013-Similarity-based Clustering by Left-Stochastic Matrix Factorization

7 0.5741303 37 jmlr-2013-Divvy: Fast and Intuitive Exploratory Data Analysis

8 0.56794918 86 jmlr-2013-Parallel Vector Field Embedding

9 0.5675649 59 jmlr-2013-Large-scale SVD and Manifold Learning

10 0.56578261 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering

11 0.55667001 27 jmlr-2013-Consistent Selection of Tuning Parameters via Variable Selection Stability

12 0.5500502 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning

13 0.54845381 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs

14 0.54446495 3 jmlr-2013-A Framework for Evaluating Approximation Methods for Gaussian Process Regression

15 0.54326707 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components

16 0.54227155 69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses

17 0.54207093 42 jmlr-2013-Fast Generalized Subset Scan for Anomalous Pattern Detection

18 0.53640538 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut

19 0.53321445 2 jmlr-2013-A Binary-Classification-Based Metric between Time-Series Distributions and Its Use in Statistical and Learning Problems

20 0.53125465 34 jmlr-2013-Distance Preserving Embeddings for General n-Dimensional Manifolds